Very often I get a query on when to use HBase and when to use MapReduce. HBase provides an SQL-like interface with Phoenix, and MapReduce provides a similar SQL interface with Hive. Both can be used to get insights from the data.
I would like to compare HBase/MapReduce to a plane/train. A train can carry a lot of material at a slow pace, while a plane can carry relatively less material at a faster pace. Depending on the amount of material to be transferred from one location to another and the urgency, either a plane or a train can be used to move it.
Similarly, HBase (or in fact any database) provides relatively low latency (response time) at the cost of low throughput (data transferred/processed), while MapReduce provides high throughput at the cost of high latency. So, depending on the NFR (Non Functional Requirements) of the application, either HBase or MapReduce can be picked.
Coming back to our favorite topic MapReduce, this model is not something new. Map and Reduce have been known as Map and Fold, which are higher order functions and have been used in functional programming languages like Lisp and ML. The difference between a regular function and a higher order function is that a higher order function can take functions as input and return a function as output, while a regular function cannot. As mentioned earlier, Map and Fold have been there for ages and you might have heard about them if you had your education in Computer Science…
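Just to make map and fold concrete, here is a tiny illustrative sketch in plain Java; the interface names and methods below are mine, not from any library. map applies a function to every element, and fold collapses the elements into a single value starting from a seed.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class MapFold {

        interface Function<A, B> { B apply(A input); }
        interface Combiner<A, B> { B combine(B accumulator, A input); }

        // map: applies a function to every element (higher order: takes a function)
        static <A, B> List<B> map(List<A> input, Function<A, B> f) {
            List<B> output = new ArrayList<B>();
            for (A element : input) {
                output.add(f.apply(element));
            }
            return output;
        }

        // fold: combines all elements into a single value, starting from a seed
        static <A, B> B fold(List<A> input, B seed, Combiner<A, B> c) {
            B accumulator = seed;
            for (A element : input) {
                accumulator = c.combine(accumulator, element);
            }
            return accumulator;
        }

        public static void main(String[] args) {
            List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
            List<Integer> squares = map(numbers, new Function<Integer, Integer>() {
                public Integer apply(Integer n) { return n * n; }
            });
            Integer sumOfSquares = fold(squares, 0, new Combiner<Integer, Integer>() {
                public Integer combine(Integer acc, Integer n) { return acc + n; }
            });
            System.out.println(sumOfSquares); // prints 55
        }
    }

Both map and fold are higher order functions since they take another function as an argument, and MapReduce borrows exactly this idea.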
Hadoop implements the MapReduce paradigm as mentioned in the original Google MapReduce Paper in terms of key/value pairs. The output types of the Map should match the input types of the Reduce as shown below.
(K1,V1) -> Map -> (K2,V2)
(K2, V2) -> Reduce -> (K3,V3)
The big question is why use key/value pairs?
MapReduce is derived from the concepts of functional programming. Here is a tutorial on the map/fold primitives in functional programming which are used in the MapReduce paradigm, and there is no mention of key/value pairs in it.
According to the Google MapReduce Paper:

> We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.
The only reason I could see why Hadoop used key/value pairs is that the
Google Paper had …
While Hadoop implements MR, Hama implements BSP. The ecosystem/interest/tools/$$$/articles around Hadoop are much bigger when compared to Hama. Recently Hama released 0.4.0, which includes examples for page rank, graph exploration, shortest path and others. Hama has many similarities (API, command line, design etc.) with Hadoop, and it shouldn't take much time to get started with Hama for those who are familiar with Hadoop.
As of now Hama is in the incubator stage and is looking for contributors. Hama is still in an early stage and there is a lot of scope for improvement (performance, testing, documentation etc.). For someone who wants to start contributing to Apache, Hama is a good choice. Thomas and Edward have been actively blogging about Hama and are very responsive for any clari…
When only a map and a reduce class are defined for a job, the key/value pairs emitted by the mapper are consumed by the reducer. So, the output key/value types of the mapper should be the same as the input key/value types of the reducer.

(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)
When a combiner class is defined for a job, the intermediate key/value pairs are combined on the same node as the map task before being sent to the reducer. The combiner reduces the network traffic between the mappers and the reducers.
Note that the combiner functionality is the same as the reducer (to combine keys), but the combiner's input/output key/value types should be of the same type, while for the reducer this is not a requirement.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k3, v3> (output)
In the scenario where the reducer class is also defined as the combiner class, the combiner/reducer input/output key/value types should be of the same type (k2/v2) as below. If not, due to type erasure the program compiles …
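For completeness, here is a rough driver sketch (again illustrative, reusing the word-count mapper/reducer from the sketch above) where the reducer doubles as the combiner. This is safe only because that reducer's input and output types are both (Text, IntWritable).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountTypes.TokenMapper.class);
            // Reusing the reducer as the combiner is fine here because the
            // reducer's input and output types are both (Text, IntWritable).
            job.setCombinerClass(WordCountTypes.SumReducer.class);
            job.setReducerClass(WordCountTypes.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }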
With the release of Hadoop 0.20, a new MR API has been introduced (o.a.h.mapreduce package). There are not many significant differences between the old MR API (o.a.h.mapred) and the new MR API (o.a.h.mapreduce), except that the new MR API allows data to be pulled from within the Map and Reduce tasks by calling nextKeyValue() on the Context object passed to the map function.
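Here is a small illustrative sketch (not from the original post) of what that looks like: a mapper in the new API can override run() and pull the records itself.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // New API (o.a.h.mapreduce): overriding run() lets the task pull records itself
    public class PullMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            // Pull key/value pairs instead of having the framework push them to map()
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString()), new IntWritable(1));
        }
    }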
Also, some of the InputFormats have not been ported to the new MR API.
So, to use the missing InputFormat either stop using the new MR API and
go back to the old MR API or else extend the InputFormat as required.
There might be a requirement to pass additional parameters to the mappers and reducers, besides the inputs which they process. Let's say we are interested in matrix multiplication and there are multiple ways/algorithms of doing it. We could send an input parameter to the mappers and reducers, based on which the appropriate way/algorithm is picked. There are multiple ways of doing this.
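One simple way is to set the parameter on the job Configuration in the driver and read it back in the mapper's setup() method. Below is an illustrative sketch; the property name matrix.algorithm is made up for this example.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParameterPassing {

        public static class AlgorithmAwareMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private String algorithm;

            @Override
            protected void setup(Context context) {
                // Read the parameter back from the job configuration
                // ("matrix.algorithm" is a made-up property name)
                algorithm = context.getConfiguration().get("matrix.algorithm", "naive");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Pick the appropriate logic based on the parameter
                if ("naive".equals(algorithm)) {
                    // ... naive multiplication logic ...
                } else {
                    // ... alternate algorithm ...
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Set the parameter in the driver before submitting the job
            conf.set("matrix.algorithm", "naive");
            Job job = new Job(conf, "matrix-multiplication");
            job.setJarByClass(ParameterPassing.class);
            job.setMapperClass(AlgorithmAwareMapper.class);
            // ... input/output paths and the rest of the job setup ...
        }
    }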
Something really simple ....
Q) Why do we need to explicitly specify the Map/Reduce output parameters using the JobConf#setMapOutputKeyClass, JobConf#setMapOutputValueClass, JobConf#setOutputKeyClass and JobConf#setOutputValueClass methods? Can't the Hadoop framework deduce them from the Mapper/Reducer implementations using Reflection?
A) This is due to Type Erasure: When a generic type is instantiated, the compiler translates those types by a technique called type erasure — a process where the compiler removes all information related to type parameters and type arguments within a class or method. Type erasure enables Java applications that use generics to maintain binary compatibility with Java libraries and applications that were created before generics.
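In other words, the type arguments in a Mapper<LongWritable, Text, Text, IntWritable> declaration are gone by the time the job runs, so the driver has to state them explicitly. A small illustrative snippet (using the old API's JobConf, as in the question):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class ErasureExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ErasureExample.class);
            // At runtime a Mapper<LongWritable, Text, Text, IntWritable> declaration is
            // just a Mapper: the type arguments have been erased, so the framework
            // cannot discover them through reflection and they must be stated here.
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(IntWritable.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
        }
    }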
Nice to learn new things !!!
Platform Computing published a five part series (one, two, three, four, five) about the Hadoop MapReduce challenges in the enterprise. Some of the challenges mentioned in the series are addressed by the NextGen MapReduce, which will be available soon for download, but some of the claims are not accurate. Platform has products around MapReduce and is about to be acquired by IBM, so I am not sure how they got them wrong.
Platform) On the performance measure, to be most useful in a robust
enterprise environment a MapReduce job should take sub-millisecond to
start, but the job startup time in the current open source MapReduce
implementation is measured in seconds.
Praveen) MapReduce is supposed to be for batch processing and not for
online transactions. The data from a MapReduce Job can be fed to a
system for online processing. It's not to say that there is no scope for
improvement in the MapReduce job performance.
Platform) The current Hadoop MapReduce implementation does…
Hadoop, in spite of starting in a web-based company (Yahoo), has spread to solve problems in many other disciplines. Amund has consolidated a list of 'Hadoop and MapReduce Algorithms Academic Papers'. This will give an idea of where Hadoop and MapReduce can be used; some of them can be ideas for projects also. The Atbrox blog is maintaining this list.
Hadoop Streaming allows any scripting language which can read/write from STDIN/STDOUT to be used with Hadoop. But, what Microsoft is trying to make is … Please post comments and I will update the blog entry accordingly.
The query I often get from those who want to get started with Hadoop is whether knowledge of Java is a prerequisite or not. The answer is both a yes and a no; it depends on what the individual would like to do with Hadoop.
MapReduce provides Map and Reduce primitives, which have been known as Map and Fold primitives in the Functional Programming world in languages like Lisp for quite some time. Hadoop provides interfaces to code in Java against those primitives. But, any language supporting read/write to STDIO, like Perl, Python, PHP and others, can also be used through Hadoop Streaming.
Also, there are high level abstractions provided by Apache
frameworks like Pig and Hive
for which familiarity with Java is not
required. Pig can be programmed in Pig Latin and Hive can be programmed
using HiveQL. Both of these programs will be automatically converted to
MapReduce programs in Java.
According to the Cloudera blog entry `What Do Real-Life Hadoop Workloads…
API documentation for Java SE and Hadoop (and other frameworks) can be downloaded for offline access. But, the Hadoop API documentation is not aware of the offline copy of the Java SE documentation. For seamless interaction between the two, references to the online Java SE documentation in the Hadoop API documentation should be replaced with references to the offline copy.
The below command will replace all such references in the API documentation (note that back slash has to be escaped)
find ./ -type f -name *.html -exec sed -i
This enables for seamless offline access and better productivity.
It's just not possible to download the latest Hadoop related projects from Apache and use them together, because of the interoperability issues among the different Hadoop projects and their release cycles. That's the reason why BigTop, an Apache Incubator project, has evolved: to solve the interoperability issues around the different Hadoop projects by providing a test suite.
Also, companies like Cloudera provide their own distribution with different Hadoop projects based on Apache distribution, with proper testing and support.
Now HortonWorks, which has been spun off from Yahoo, has joined the same ranks. Their initial manifesto was to make the Apache downloads a source from where anyone can download the jars and use them without any issues. But, they have moved away from this with the recent announcement of the HortonWorks Data Platform, which is again based on the Apache distribution, similar to what Cloudera has done with their CDH distributions. Although HortonWorks and
Many a time there will be a requirement of running a group of dependent data processing jobs. Also, we might want to run some of them at regular intervals of time. This is where Apache Oozie fits the picture. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.
Apache Oozie has three components which are a work flow engine to run a DAG of actions, a coordinator (similar to a cron job or a scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie, here are the articles (1, 2) comparing both of them.
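Besides the command line, Oozie also ships a Java client API. Here is a rough, illustrative sketch of submitting a workflow to the workflow engine with it; the Oozie URL, HDFS paths and property values below are placeholders, not from this post.

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // The Oozie server URL and the workflow path in HDFS are placeholders
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/praveen/workflow");
            conf.setProperty("nameNode", "hdfs://localhost:9000");
            conf.setProperty("jobTracker", "localhost:9001");

            // Submit and start the workflow, then poll for its status
            String jobId = client.run(conf);
            while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow " + jobId + " finished: "
                    + client.getJobInfo(jobId).getStatus());
        }
    }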
Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be fetched, the dependencies installed and then the binaries built. It's a bit of a tedious process, so this blog entry assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and in…
Apache Hive is a framework that
sits on top of Hadoop for doing ad-hoc queries on data in
Hadoop. Hive supports HiveQL which is similar to SQL, but doesn't support the
complete constructs of SQL.
Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved using either HiveQL or Java MapReduce, but using Java MapReduce will require a lot more code to be written/debugged compared to HiveQL. So, using Hive increases developer productivity.
To summarize, Hive through HiveQL language provides a higher level
abstraction over Java MapReduce programming. As with any other high level
abstraction, there is a bit of performance overhead using HiveQL when compared to
Java MapReduce. But the Hive community is working to narrow down this gap
for most of the commonly used scenarios.
Along the same lines, Pig provides a higher level abstraction over MapReduce. Pig supports Pig Latin constructs, which are converted…
With the new HP 430 Notebook, hibernate was not working. I was getting a message that there was not enough swap for hibernate. Found from this Wiki that swap memory >= RAM is required for hibernate to work. Since the HP 430 Notebook had enough RAM (4GB), I chose the swap to be 1GB at the time of Ubuntu installation, and so hibernate was not working. Again, the Wiki has instructions for increasing the size of the swap. So, it's better to choose enough swap at the time of the Ubuntu installation for hibernate to work.
Sometimes even the popular blogs/sites don't get the facts straight; I am not sure whether the articles are reviewed by others or not. As I mentioned earlier, changes in Hadoop are happening at a very rapid pace and it's very difficult to keep updated with the latest.
> The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its ability to write to a file only once, leaves the Enterprise wanting more. Some vendors are trying to answer the call.
HDFS supports appends, and that's at the core of HBase. Without the HDFS append functionality HBase doesn't work. There had been some challenges to get appends working in HDFS, but they have been sorted out. Also, HBase supports random/real-time read/write access to the data on top of HDFS.

I agree that the NameNode is a SPOF in HDFS. But, HDFS High Availability will include two NameNodes, and for now the switchover from the active to the standby NameNode is…
With HDFS federation it's possible to have multiple NameNodes in an HDFS cluster. While this is good from a NameNode scalability and isolation perspective, it's difficult to manage multiple namespaces from a client application perspective. The HDFS client mount table makes the multiple namespaces transparent to the client. The ViewFs guide has more details on how to use the HDFS client mount table.
An earlier blog entry detailed how to set up HDFS federation. Let's assume the two NameNodes have been set up successfully on namenode1 and namenode2.
Let's map /NN1Home to hdfs://namenode1:9001/home and /NN2Home to hdfs://namenode2:9001/home. Add the following to core-site.xml:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>viewfs:///</value>
      …
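Once the mount table entries for /NN1Home and /NN2Home are in place (the configuration above is truncated before those entries), a client can simply work against the viewfs paths and let the mount table route each call to the right NameNode. A rough illustrative sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsClient {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name=viewfs:/// from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // /NN1Home and /NN2Home are the mount points assumed above; the client
            // does not need to know which NameNode actually serves each path.
            for (FileStatus status : fs.listStatus(new Path("/NN1Home"))) {
                System.out.println(status.getPath());
            }
            for (FileStatus status : fs.listStatus(new Path("/NN2Home"))) {
                System.out.println(status.getPath());
            }
        }
    }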
Prior to the 0.23 release, the masters file in the conf folder of the Hadoop installation had the list of host names on which the CheckPoint NN has to be started. But, with the 0.23 release the masters file is not used anymore; the dfs.namenode.secondary.http-address key has to be set to ip:port in hdfs-site.xml. The CheckPoint NN can be started using the sbin/hadoop-daemon.sh start secondarynamenode command. Run the jps command to make sure that the CheckPoint NN is running, and also check the corresponding log file for any errors.

BTW, the Secondary NN is now being referred to as the CheckPoint NN, but the code still uses Secondary NN and people still refer to it as the Secondary NN.
The NameNode is the heart of HDFS. It stores the namespace for the filesystem and also tracks the location of the blocks in the cluster. The locations of the blocks are not persisted in the NameNode; the DataNodes report the blocks they have to the NameNode when they start. If the NameNode instance is not available, then HDFS is not accessible till it's back running.
The Hadoop 0.23 release introduced HDFS federation, where it is possible to have multiple independent NameNodes in a cluster, and a particular DataNode can have blocks for more than one NameNode. Federation provides horizontal scalability, better performance and isolation.
HDFS NN HA (NameNode High Availability) is an area where active work is happening. Here are the JIRA, Presentation and Video for the same. HDFS NN HA was not cut into 0.23 release and will be part of later releases. Changes are going in the HDFS-1623 branch, if someone is interested in the code.
HDFS Federation enables multiple NameNodes in a cluster for horizontal scalability of the NameNode. All these NameNodes work independently and don't require any coordination. A DataNode can register with multiple NameNodes in the cluster and can store the data blocks for multiple NameNodes. Here are the instructions to set up HDFS Federation on a cluster.
2) Extract it to a folder (let's call it HADOOP_HOME) on all the NameNodes (namenode1, namenode2) and on all the slaves (slave1, slave2, slave3)
3) Add the association between the hostnames and the ip addresses for the NameNodes and the slaves in the /etc/hosts file on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
4) Make sure that the NameNodes are able to do a password-less ssh to all the slaves. Here are the instructions for the same.
5) Add the following to .bashrc in the home folder on the NameNodes and all the slaves. …