
Thursday, 23 January 2014

Hadoop Research Tips

A question I am commonly asked by people who are interested in working on Hadoop is: which Hadoop feature can I work on?

Here are some topics I have in mind that would be good for students to attempt if they want to work on Hadoop.
  • Ability to make the Hadoop scheduler resource-aware, especially for CPU, memory, and IO resources. The current implementation is based on statically configured slots.
  • Ability for a map-reduce job to accept new input splits even after the job has already started.
  • Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.
  • Ability to extend the map-reduce framework to process data that resides partly in memory. The current implementation assumes that the framework is used to scan data residing on disk. But memory on commodity machines is becoming larger and larger: a cluster of 3000 machines with 64 GB each can keep about 200 TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the tasktracker machines. Performance should increase dramatically, because it is costly to deserialize/decompress data from disk into memory for every query.
  • Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow without failing completely; these anomalies hurt job performance.
  • Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center, and a user has to partition the dataset across two Hadoop clusters in two different data centers.
  • High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
  • Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.
The first thing for a student who wants to attempt any of these projects is to download the source code for the HDFS and MAPREDUCE sub-projects, then create an account in the Apache JIRA bug-tracking system. Search for an existing JIRA issue that describes your project; if none exists, create a new one. Then write a design-document proposal and post it to the relevant JIRA issue so that the greater Apache Hadoop community can deliberate on the proposal.
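
For reference, a minimal sketch of one way to get the source and build it locally (assuming the Git mirror of the Apache Hadoop tree and a working JDK/Maven setup; the branch and build profile you need depend on the project you pick, so see BUILDING.txt in the source tree for the full prerequisites):
# clone the Apache Hadoop source tree (read-only mirror)
git clone https://github.com/apache/hadoop.git
cd hadoop
# the HDFS and MapReduce code live under these sub-directories
ls hadoop-hdfs-project hadoop-mapreduce-project
# build everything, skipping tests for a faster first build
mvn clean install -DskipTests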

If anybody else has new project ideas, please add them as comments to this blog post.

Hadoop Installation on a Single Machine

To download and install Hadoop, the prerequisites are:
1. A 64-bit Linux-based OS, such as
            Ubuntu
            CentOS
            Fedora ... etc.
I prefer to use Ubuntu 12.04 LTS, and later 14.04 LTS (the upcoming version).

2. Java 1.6 or 1.7 JDK

Go to the Downloads folder
> cd Downloads

Un-zip the Hadoop tar file
> sudo tar xzf hadoop-1.2.1.tar.gz

I created a folder in /home/hduser/
>mkdir Installations

Move the unzipped Hadoop folder into the Installations directory, renaming it to hadoop (run this command from inside the Installations directory)
> sudo mv /home/hduser/Downloads/hadoop-1.2.1 hadoop

Give ownership of the hadoop folder to the hduser user and the hadoop group
> sudo addgroup hadoop
> sudo chown -R hduser:hadoop hadoop
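
If the hduser account is not yet a member of the hadoop group, you may also want to add it (a small extra step that is not in the original post, assuming a Debian/Ubuntu system):
# add the existing user hduser to the existing hadoop group
sudo adduser hduser hadoop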

Add the Hadoop and Java environment variables to the hduser user's .bashrc file, then restart the terminal (or run "source ~/.bashrc") so the new settings take effect.
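
As a rough sketch of what those .bashrc entries usually look like (the exact paths are assumptions based on the directories used above; adjust them to your own layout):
# Java and Hadoop locations used in this walkthrough (adjust as needed)
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
export HADOOP_HOME=/home/hduser/Installations/hadoop
# put the Java and Hadoop binaries on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin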


Java Installation on Ubuntu


I have a 64-bit version of Ubuntu 12.04 LTS installed, so the instructions below apply only to this OS.
Download the Java JDK from the Oracle Java SE downloads page:

1. Click Accept License Agreement
2. Click jdk-6u45-linux-x64.bin
3. Log in to Oracle.com with your Oracle account
4. Download the JDK to your ~/Downloads directory
5. After downloading, open a terminal, then enter the following commands.

cd ~/Downloads
chmod +x jdk-6u45-linux-x64.bin
./jdk-6u45-linux-x64.bin
Note:
The jvm directory is used to organize all JDK/JVM versions in a single parent directory.
sudo mkdir /usr/lib/jvm
sudo mv jdk1.6.0_45 /usr/lib/jvm
The next 3 commands are each split across 2 lines due to width limits in the blog’s theme.
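
Those 3 commands do not appear above; the following is a plausible reconstruction, assuming the standard update-alternatives registration of the java, javac, and javaws binaries from the newly installed JDK (paths follow the /usr/lib/jvm/jdk1.6.0_45 location used above):
# register the new JDK's binaries with Ubuntu's alternatives system
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_45/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_45/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.6.0_45/bin/javaws" 1
# verify that the expected version is now the default
java -version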

Tuesday, 21 January 2014

Splunk Hadoop Connect 1.1 – Opening the door to MapR; now available on all Hadoop distributions

I am happy to announce that Splunk Hadoop Connect 1.1 is now available. This version of Hadoop Connect rounds out Splunk’s integration with the Hadoop distributions by becoming certified on MapR; users of the Cloudera, Hortonworks, and Apache Hadoop distributions also have the ability to benefit from the power of Splunk.
Splunk Hadoop Connect provides bi-directional integration to easily and reliably move data between Splunk and Hadoop. It gives Hadoop users real-time analysis, visualization, and role-based access control over streams of machine-generated data. It delivers three core capabilities: exporting data from Splunk to Hadoop, exploring Hadoop directories, and importing data from Hadoop into Splunk.
The most significant new feature added in version 1.1 is the ability to select whether you want to map to a remote HDFS cluster or to a mounted file system. The option to map to a mounted file system enables Splunk to integrate with the MapR Hadoop distribution, since MapR allows users to mount Hadoop via NFS using a feature called Direct Access NFS.
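
For context, mounting a MapR cluster over NFS typically looks something like the following rough sketch (the hostname mapr-node01 and the /mapr mount point are placeholders; the exact options come from MapR's documentation for your version):
# create a local mount point and mount the cluster via MapR's NFS gateway
sudo mkdir -p /mapr
sudo mount -o hard,nolock mapr-node01:/mapr /mapr
# the cluster's file system is now visible as ordinary local paths
ls /mapr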

Apache Tez - Beyond Hadoop (MR)

Thanks to Hadoop Tips
Apache Pig and Hive are higher-level abstractions on top of MR (MapReduce). Pig Latin scripts in Pig and HiveQL queries in Hive are converted into a DAG of MR jobs, with the first MR job reading from the input and the last MR job writing to the output. One problem with this approach is that the temporary data between consecutive MR jobs is written to HDFS by one job and read back from HDFS by the next, which is inefficient. On top of this, multiple MR jobs also incur initialization overhead.
For the clickstream analysis mentioned earlier on this blog, finding the "top 3 urls visited by users whose age is less than 16", Hive produces a DAG of 3 MR jobs, while Pig produces a DAG of 5 MR jobs.
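
As a rough illustration only (the table and column names below, clickstream and users, are hypothetical stand-ins for the schema used in that earlier post), you can ask Hive to show the chain of MR stages it plans for such a query with EXPLAIN:
# print Hive's execution plan; classic MR Hive breaks this query into several MR stages
hive -e "
EXPLAIN
SELECT c.url, COUNT(*) AS visits
FROM clickstream c
JOIN users u ON (c.user_id = u.user_id)
WHERE u.age < 16
GROUP BY c.url
ORDER BY visits DESC
LIMIT 3;"
The map-reduce stages listed in the EXPLAIN output correspond to the MR jobs in the DAG described above.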

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog. The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of ...