Showing posts from October, 2013

Jobs Page

Here you will get the job updates related to Hadoop Technology, in JOBS , also it is
recommended to follow in LinkedIn . I suggest you to update your jobs in hadoop by following this page hadoop jobs in facebook.

When to use HBase and when MapReduce?

Thanks to Hadoop Tips
Very often I do get a query on when to use HBase and when MapReduce. HBase provides an SQL like interface with Phoenix and MapReduce provides a similar SQL interface with Hive. Both can be used to get insights from the data.

I would like the analogy of HBase/MapReduce to an plane/train. A train can carry a lot of material at a slow pace, while a plane relatively can carry less material at a faster pace. Depending on the amount of the material to be transferred from one location to another and the urgency, either a plane or a train can me used to move material.

Similarly HBase (or in fact any database) provides relatively low latency (response time) at the cost of low throughput (data transferred/processed), while MapReduce provides high latency at the cost of high throughput. So, depending on the NFR (Non Functional Requirements) of the application either HBase or MapReduce can be picked. \

MapReduce Patterns

Developing algorithms to process data on top of Distributing computing because of it's inherent distributed nature is difficult. So there are models like PRAM, MapReduce, BSP (Bulk Synchronous Parallel), MPI (Message Passing Interface) and others. Apache Hadoop, Apache Hama, Open MPI and other frameworks/software make it easier to develop algorithms on top of the earlier mentioned distributed computing models.

Coming back to our favorite topic MapReduce, this model is not something new. Map and Reduce had been known as Map and Fold which are higher order functions and had been used in functional programming languages like Lisp and ML. The different between a regular function and a higher order function is that while a higher order function can take multiple functions as input and return a function as output, a regular function doesn't.

As mentioned earlier Map and Fold had been there for ages and you might have heard about them if you had your education in Computer Scienc…

Why does Hadoop uses KV (Key/Value) pairs?

Hadoop implements the MapReduce paradigm as mentioned in the original Google MapReduce Paper in terms of key/value pairs. The output types of the Map should match the input types of the Reduce as shown below.
(K1,V1) -> Map -> (K2,V2)
(K2, V2) -> Reduce -> (K3,V3)

The big question is why use key/value pairs?
MapReduce is derived from the concepts of functional programming. Here is a tutorial on the map/fold primitives in the functional programming which are used in the MapReduce paradigm and there is no mention of key/value pairs anywhere.
According to the Google MapReduce Paper

We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.

The only reason I could see why Hadoop used key/value pairs is that the Google Paper had …

BSP (Hama) vs MR (Hadoop)

Thanks to Hadoop Tips
BSP (Bulk Synchronous Parallel) is a distributed computing model similar to MR (MapReduce). For some class of problems BSP performs better than MR and other way. This paper compares MR and BSP.

While Hadoop implements MR, Hama implements BSP. The ecosystem/interest/tools/$$$/articles around Hadoop is much more when compared to Hama. Recently Hama has released 0.4.0 and includes examples for page rank, graph exploration, shortest path and others. Hama has many similarities (API, command line, design etc) with Hadoop and it shouldn't take much time to get started with Hama for those who are familiar with Hadoop.

As of now Hama is in incubator stage and is looking for contributors. Hama is still in the early stage and there is a lot of scope for improvement (performance, testing, documentation etc) . For someone who wants to start with Apache, Hama is a choice. Thomas and Edward had been actively blogging about Hama and are very responsive for any clari…

What should be the input/ouput key/value types be for the combiner?

When only a map and reducer class are defined for a job, the key/value pairs emitted by the mapper are consumed by the by the reducer. So, the output types for the mapper should be the same as the reducer.
(input) -> map -> -> reduce -> (output)
When a combiner class is defined for a job, the intermediate key value pairs are combined on the same node as the map task before sending to the reducer. Combiner reduces the network traffic between the mappers and the reducers.
Note that the combiner functionality is same as the reducer (to combine keys), but the combiner input/output key/value types should be of the same type, while for the reducer this is not a requirement.
(input) -> map -> -> combine* -> -> reduce -> (output)
In the scenario where the reducer class is also defined as a combiner class, the combiner/reducer input/ouput key/value types should be of the same type (k2/v2) as below. If not, due to type erasure the program compiles …

What is the difference between the old and the new MR API?

With the release of Hadoop 0.20, new MR API has been introduced (o.a.h.mapreduce package). There is not much of significant differences between the old MR API (o.a.h.mapred) and the new MR API (o.a.h.mapred) except that the new MR API allows to pull data from within the Map and Reduce tasks by calling the nextKeyValue() on the Context object passed to the Map function.

Also, some of the InputFormats have not been ported to the new MR API. So, to use the missing InputFormat either stop using the new MR API and go back to the old MR API or else extend the InputFormat as required.

Passing parameters to Mappers and Reducers

There might be a requirement to pass additional parameters to the mapper and reducers, besides the the inputs which they process. Lets say we are interested in Matrix multiplication and there are multiple ways/algorithms of doing it. We could send an input parameter to the mapper and reducers, based on which the appropriate way/algorithm is picked. There are multiple ways of doing this

Setting the parameter:

1. Use the -D command line option to set the parameter while running the job.

2. Before launching the job using the old MR API

? 1 2 JobConf job = (JobConf) getConf(); job.set("test", "123");
3. Before launching the job using the new MR API

? 1 2 3 Configuration conf = newConfiguration(); conf.set("test", "123"); Job job = newJob(conf);
Getting the parameter:

1. Using the old API in the Mapper and Reducer. The JobConfigurable#configure has to be implemented in the Mapper and Reducer class.

? 1 2 3 4 privatestaticLong N; publicvoidconfigure(JobConf job) { N =…

Why to explicitly specify the Map/Reduce output parameters using JobConf?

Some thing really simple ....
Q) Why to explicitly specify the Map/Reduce output parameters using JobConf#setMapOutputKeyClass, JobConf#setMapOutputValueClass, JobConf#setOutputKeyClass and JobfConf#setOutputValueClass methods, can't Hadoop framework deduce them from the Mapper/Reducer implementations using Reflection?
A) This is due to Type Erasure :
When a generic type is instantiated, the compiler translates those types by a technique called type erasure — a process where the compiler removes all information related to type parameters and type arguments within a class or method. Type erasure enables Java applications that use generics to maintain binary compatibility with Java libraries and applications that were created before generics.
Nice to learn new things !!!

Hadoop MapReduce challenges in the Enterprise

Platform Computing published a five part series (one, two, three, four, five) about the Hadoop MapReduce Challenges in the Enterprise. Some of the challenges mentioned in the Series are addressed by the NextGen MapReduce which will be available soon for download, but some of the allegations are not proper. Platform has got products around MapReduce and is about to be acquired by IBM, so not sure how they got them wrong.

Platform) On the performance measure, to be most useful in a robust enterprise environment a MapReduce job should take  sub-millisecond to start,  but the job startup time in the current open source MapReduce implementation is measured in seconds.

Praveen) MapReduce is supposed to be for batch processing and not for online transactions. The data from a MapReduce Job can be fed to a system for online processing. It's not to say that there is no scope for improvement in the MapReduce job performance.

Platform) The current Hadoop MapReduce implementation does…

Hadoop and MapReduce Algorithms Academic Papers

Hadoop in spite of starting in a web based company (Yahoo), has spawned to solve problems in many other disciplines. Amund has consolidated list of 'Hadoop and MapReduce Algorithms Academic Papers'. This will give an idea where all Hadoop and MapReduce can be used, some of them can be ideas for projects also. The Atbrox Blog is maintaining this list.

Hadoop and JavaScript

Microsoft has announced limited preview version of Hadoop on Azure. JavaScript can also be used to write MapReduce jobs on Hadoop. As of now, Streaming allows any scripting language which can read/write from STDIN/STDOUT to be used with Hadoop. But, what Microsoft is trying make is make JavaScript a first class citizen for Hadoop. There is a session on `Hadoop + JavaScript: what we learned' end of February, which is too long for the impatient. BTW, here is an interesting article on using JavaScript on Hadoop with Rhino.
There had been a lot of work on JavaScript in the browser area for the last few years to improve the performance (especially V8).
Can anyone share use-cases or their experience using JavaScript for HPC (High Performance Computing) in the comments and I will update the blog entry accordingly?

Is Java a prerequisite for getting started with Hadoop?

The query I get often from those who want to get started with Hadoop is if knowledge of Java is a prerequisite or not. The answer is both a Yes and a No, depends on the individual persons interest on what they would like to do with Hadoop.

Why No?

MapReduce provides Map and Reduce primitives which had been as Map and Fold primitives in the Functional Programming world in language like Lisp from quite some time. Hadoop provides interfaces to code in Java against those primitives. But, any language supporting read/write to STDIO like Perl, Python, PHP and others can also be used using Hadoop streaming feature.
Also, there are high level abstractions provided by Apache frameworks like Pig and Hive for which familiarity of Java is not required. Pig can be programmed in Pig Latin and Hive can be programmed using  HiveQL. Both of these programs will be automatically converted to MapReduce programs in Java.
According to the Cloudera blog entry `What Do Real-Life Hadoop Workloads…

Seamless access to the Java SE API documentation

API documentation for Java SE and Hadoop (and other frameworks) can be downloaded for offline access. But, the Hadoop API documentation is not aware of the offline copy of Java SE documentation.
For seamless interaction between the two API, reference to
in the Hadoop API should be replaced with
The below command will replace all such references in the API documentation (note that back slash has to be escaped)
find  ./ -type f -name *.html -exec sed -i 's/http:\/\/\/javase\/6\/docs\//file:\/\/\/home\/praveensripati\/Documents\/Java6API\//' {} \;
This enables for seamless offline access and better productivity.

Hadoop Jar Hell

It's just not possible to download the latest Hadoop related Projects from Apache and use them together because of the interoperability issues among the different Hadoop Projects and their release cycles.

That's the reason why BigTop an Apache Incubator project has evolved, to solve the interoperability issues around the different Hadoop projects by providing a test suite. Also, companies like Cloudera provide their own distribution with different Hadoop projects based on Apache distribution, with proper testing and support.

Now HortonWorks which has been spun from Yahoo joined the sameranks. Their initial manifesto was to make the Apache downloads a source where anyone can download the jars and use them without any issues. But, they have moved away from this with the recent announcement of the HortonWorks Data Platform which is again based on Apache distribution similar to what Cloudera has done with their CDH distributions. Although, HortonWorks and Cloudera ha…

Installation and configuration of Apache Oozie

Many a times there will be a requirement of running a group of dependent data processing jobs. Also, we might want to run some of them at regular intervals of time. This is where Apache Oozie fits the picture. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.
Apache Oozie has three components which are a work flow engine to run a DAG of actions, a coordinator (similar to a cron job or a scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie, here are the articles (1, 2) comparing both of them.
Installing and configuring Oozie is not straight forward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be got, the dependencies installed and then the binaries built. It's a bit tedious process, so this blog with an assumption that Hadoop has been already installed and configured. Here is the official documentation on how to build and in…

Introduction to Apache Hive and Pig

Apache Hive is a framework that sits on top of Hadoop for doing ad-hoc queries on data in Hadoop. Hive supports HiveQL which is similar to SQL, but doesn't support the complete constructs of SQL.

Hive coverts the HiveQL query into Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved using HiveQL and Java MapReduce, but using Java  MapReduce will required a lot of code to be written/debugged compared to HiveQL. So, it increases the developer productivity to use Hive.

To summarize, Hive through HiveQL language provides a higher level abstraction over Java MapReduce programming. As with any other high level abstraction, there is a bit of performance overhead using HiveQL when compared to Java MapReduce. But the Hive community is working to narrow down this gap for most of the commonly used scenarios.

Along the same line Pig provides a higher level abstraction over MapReduce. Pig supports PigLatin constructs, which is converted…

What is the relation between Hibernate and Swap Memory?

With the new HP 430 Notebook, hibernate was not working. Was getting a message that not enough swap for hibernate. Found from this Wiki that (swap memory >= RAM) for hibernate to work.
Since the HP 430 Netbook had enough RAM (4GB), I choose the swap to be 1GB at the time of Ubuntu installation and so hibernate was not working. Again the Wiki has instructions for increasing the size of the Swap.
So, it's better to choose enough swap at the time of the Ubuntu installation for hibernate to work.

HDFS Facts and Fiction

Sometimes even the popular blogs/sites don't get the facts straight, not sure if articles are reviewed by others or not. As I mentioned earlier, changes in Hadoop are happening at a very rapid pace and it's very difficult to keep updated with the latest.

Here is a snippet from ZDNet article

> The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its ability to write to a file only once, leaves the Enterprise wanting more. Some vendors are trying to answer the call.

HDFS supports appends and that's the core of HBase. Without HDFS append functionality HBase doesn't work. There had been some challenges to get appends work in HDFS, but they have been sorted out. Also, HBase supports random/real-time read/write access of the data on top of HDFS.

Agree that NameNode is a SPOF in HDFS. But, HDFS High Availability will include two NameNodes and for now switchover from the active to the standby NameNode is…

Getting started with HDFS client side mount table

With HDFS federation it's possible to have multiple NameNode in a HDFS cluster. While this is good from a NameNode scalability and isolation perspective, it's difficult to manage multiple name spaces from a client application perspective. HDFS client mount table makes multiple names spaces transparent to the client. ViewFs more details on how to use the HDFS client mount table.

Earlier blog entry detailed how to setup HDFS federation. Let's assume the two NameNodes have been setup successfully on namenode1 and namenode2.
Lets map ? 1 /NN1Home to hdfs://namenode1:9001/home and? 1 /NN2Home to hdfs://namenode2:9001/home Add the following to the core-site.xml? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 xmlversion="1.0"?> <configuration> <property> <name>hadoop.tmp.dir</name> <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value> </property> <property> <name></name> <value>viewfs:///</value>…

How to start CheckPoint NameNode in Hadoop 0.23 release?

Prior to 0.23 release the masters file in the  conf folder of Hadoop installation had the list of host names on which the CheckPoint NN has to be started. But, with the 0.23 release the masters file is not used anymore, the dfs.namenode.secondary.http-address key has to be set to ip:port in hdfs-site.xml. CheckPoint NN can be started using the sbin/ start secondarynamenode command. Run jps command to make sure that the CheckPoint NN is running and also check the corresponding log file also for any errors.
BTW, Secondary NN is being referred to as CheckPoint NN. But, the code is still using Secondary NN and people still refer it as Secondary NN.

HDFS explained as comics

Manish has done a nice job explaining HDFS as comics for those who are new to HDFS. Here is the link.

HDFS Name Node High Availability

NameNode is the heart of HDFS. It stores the namespace for the filesystem and also tracks the location of the blocks in the the cluster. The location of the blocks are not persisted in the NameNode, but the DataNodes report the blocks it has to the NameNode when the DataNode starts. If an instance of NameNode is not available, then HDFS is not accessible till it's back running.

Hadoop 0.23 release introduced HDFS federation where it is possible to have multiple independent NameNodes in a cluster, where in a particular DataNode can have blocks for more than one Name Node. Federation provides horizontal scalability, better performance and isolation.

HDFS NN HA (NameNode High Availability) is an area where active work is happening. Here are the JIRA, Presentation and Video for the same. HDFS NN HA was not cut into 0.23 release and will be part of later releases. Changes are going in the HDFS-1623 branch, if someone is interested in the code.

Edit (11th March, 2012) : Detail…

Getting started with HDFS Federation

HDFS Federation enables multiple NameNodes in a cluster for horizontal scalability of NameNode. All these NameNodes work independently and don't require any co-ordination. A DataNode can register with multiple NameNodes in the cluster and can store the data blocks for multiple NameNodes.
Here are the instructions to set HDFS Federation on a cluster.

1) Download the Hadoop 0.23 release from here.

2) Extract it to a folder (let's call it HADOOP_HOME) on all the NameNodes (namenode1, namenode2) and on all the slaves (slave1, slave2, slave3)

3) Add the association between the hostnames and the ip address for the NameNodes and the slaves on all the nodes in the /etc/hosts file. Make sure that the all the nodes in the cluster are able to ping to each other.

4) Make sure that the NameNodes is able to do a password-less ssh to all the slaves. Here are theinstructions for the same.

5) Add the following to .bashrc in the home folder for the NameNodes and all the slaves.? 1 2

Big Data Trendz