UPDATES

Tuesday, 22 October 2013

Jobs Page


Here you will get job updates related to Hadoop technology on the JOBS page. It is also recommended to follow it on LinkedIn. I also suggest keeping up with Hadoop openings by following the Hadoop jobs page on Facebook.



Wednesday, 16 October 2013

When to use HBase and when MapReduce?

Thanks to Hadoop Tips
Very often I get asked when to use HBase and when to use MapReduce. HBase provides an SQL-like interface through Phoenix, and MapReduce provides a similar SQL interface through Hive. Both can be used to get insights from the data.

                             
I like to compare HBase and MapReduce to a plane and a train. A train can carry a lot of material at a slow pace, while a plane can carry relatively less material at a faster pace. Depending on the amount of material to be moved from one location to another and the urgency, either a plane or a train can be used.


Similarly, HBase (or in fact any database) provides relatively low latency (response time) at the cost of lower throughput (data transferred/processed), while MapReduce provides high throughput at the cost of high latency. So, depending on the NFRs (Non Functional Requirements) of the application, either HBase or MapReduce can be picked.

MapReduce Patterns

Developing algorithms to process data on top of distributed computing is difficult because of its inherent distributed nature. So there are models like PRAM, MapReduce, BSP (Bulk Synchronous Parallel), MPI (Message Passing Interface) and others. Apache Hadoop, Apache Hama, Open MPI and other frameworks/software make it easier to develop algorithms on top of these distributed computing models.


Coming back to our favorite topic, MapReduce: this model is not something new. Map and Reduce have long been known as Map and Fold, higher order functions that have been used in functional programming languages like Lisp and ML. The difference between a regular function and a higher order function is that a higher order function can take functions as input and return a function as output, while a regular function doesn't.

As mentioned earlier, Map and Fold have been around for ages, and you might have come across them if you studied Computer Science or used a functional programming language. But Google, with its paper on MapReduce, made functional programming and higher order functions popular.
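As a small illustration outside Hadoop (a minimal sketch, assuming Java 8 streams; the class name is just for the example), map and fold/reduce are simply functions that take other functions as arguments:

import java.util.Arrays;
import java.util.List;

public class MapFoldExample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4);
        // map applies the given function to every element,
        // reduce (fold) combines the mapped values with another function
        int sumOfSquares = numbers.stream()
                                  .map(n -> n * n)
                                  .reduce(0, Integer::sum);
        System.out.println(sumOfSquares); // prints 30
    }
}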

Also, along the same lines, BSP had been around for quite some time, but Google made the BSP model popular with its 'Large-scale graph computing at Google' paper.

Some models suit certain types of algorithms well, while others don't. For instance, iterative algorithms fit better with the BSP model than with the MR model, but Mahout still chose the MR model over BSP because Apache Hadoop (the MR implementation) is much more mature and has a huge ecosystem compared to Apache Hama (the BSP implementation).

All the models discussed here need a different way of thinking to solve a particular problem. Here are resources containing the pseudo code/logic for implementing some algorithms on top of MapReduce:

- Hadoop - The Definitive Guide (Join in Page 283)

- MapReduce Patterns

- Data Intensive Text Processing with MapReduce

A follow-up post will cover patterns around the BSP model.

Why does Hadoop use KV (Key/Value) pairs?

Hadoop implements the MapReduce paradigm, as described in the original Google MapReduce paper, in terms of key/value pairs. The output types of the Map should match the input types of the Reduce, as shown below.
(K1,V1) -> Map -> (K2,V2)
(K2, V2) -> Reduce -> (K3,V3)
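A minimal sketch of how these constraints show up in the Java API (the concrete Writable types below are only illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (K1,V1) = (LongWritable, Text) -> Map -> (K2,V2) = (Text, IntWritable)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() body omitted
}

// (K2,V2) = (Text, IntWritable) -> Reduce -> (K3,V3) = (Text, IntWritable)
// The reducer's input types must match the mapper's output types.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() body omitted
}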


The big question is why use key/value pairs?
MapReduce is derived from concepts in functional programming. Here is a tutorial on the map/fold primitives in functional programming, which are used in the MapReduce paradigm, and there is no mention of key/value pairs anywhere in them.

According to the Google MapReduce Paper


We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.


The only reason I can see for Hadoop using key/value pairs is that the Google paper used key/value pairs to meet Google's requirements, and the same was implemented in Hadoop. And now everyone is trying to fit their problem space into key/value pairs.
Here is an interesting article on using tuples in MapReduce.

BSP (Hama) vs MR (Hadoop)

Thanks to Hadoop Tips
BSP (Bulk Synchronous Parallel) is a distributed computing model similar to MR (MapReduce). For some classes of problems BSP performs better than MR, and vice versa. This paper compares MR and BSP.

While Hadoop implements MR, Hama implements BSP. The ecosystem/interest/tools/$$$/articles around Hadoop are far greater than around Hama. Recently Hama released 0.4.0, which includes examples for page rank, graph exploration, shortest path and others. Hama has many similarities (API, command line, design etc.) with Hadoop, and it shouldn't take much time to get started with Hama for those who are familiar with Hadoop.

As of now Hama is in the incubator stage and is looking for contributors. Hama is still in an early stage and there is a lot of scope for improvement (performance, testing, documentation etc.). For someone who wants to start contributing to an Apache project, Hama is a good choice. Thomas and Edward have been actively blogging about Hama and are very responsive to any clarifications in the Hama community.

I am planning to spend some time on Hama (learning and contributing) and will keep posting about the progress on this blog.

What should the input/output key/value types be for the combiner?

When only a map and a reduce class are defined for a job, the key/value pairs emitted by the mapper are consumed by the reducer. So, the output types of the mapper should be the same as the input types of the reducer.
(input) <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> (output)
When a combiner class is defined for a job, the intermediate key/value pairs are combined on the same node as the map task before being sent to the reducer. The combiner reduces the network traffic between the mappers and the reducers.

Note that the combiner's functionality is the same as the reducer's (combining the values for a key), but the combiner's input and output key/value types must be the same, while for the reducer this is not a requirement.

(input) <k1,v1> -> map -> <k2,v2> -> combine* -> <k2,v2> -> reduce -> <k3,v3> (output)
In the scenario where the reducer class is also used as the combiner class, the combiner/reducer input/output key/value types should be of the same type (k2/v2), as below. If not, due to type erasure the program compiles properly but gives a run time error.

(input) <k1,v1> -> map -> <k2,v2> -> combine* -> <k2,v2> -> reduce -> <k3,v3> (output)
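As a minimal sketch (class name illustrative), a word-count style reducer whose input and output types are both (Text, IntWritable) can be registered as both the combiner and the reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input (k2,v2) and output key/value types are both (Text, IntWritable),
// so the same class works as the combiner and as the reducer.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the driver:
// job.setCombinerClass(SumReducer.class);
// job.setReducerClass(SumReducer.class);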

What is the difference between the old and the new MR API?

With the release of Hadoop 0.20, a new MR API was introduced (the o.a.h.mapreduce package). There are no hugely significant differences between the old MR API (o.a.h.mapred) and the new MR API (o.a.h.mapreduce), except that the new MR API allows data to be pulled from within the Map and Reduce tasks by calling nextKeyValue() on the Context object passed to the Map function.

Also, some of the InputFormats have not been ported to the new MR API. So, to use a missing InputFormat, either go back to the old MR API or extend the InputFormat as required for the new API.
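A hedged sketch of the 'pull' style mentioned above: in the new API the run() method of a Mapper can be overridden to step through the records explicitly via the Context (this mirrors the default run() implementation; the class and types are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PullStyleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        // Pull the records one at a time instead of having map() called by the framework
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }
}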

Passing parameters to Mappers and Reducers

There might be a requirement to pass additional parameters to the mappers and reducers, besides the inputs which they process. Let's say we are interested in matrix multiplication and there are multiple ways/algorithms of doing it. We could send an input parameter to the mappers and reducers, based on which the appropriate way/algorithm is picked. There are multiple ways of doing this.

Setting the parameter:

1. Use the -D command line option to set the parameter while running the job (the driver should be run through ToolRunner for this to work; see the sketch after this list).

2. Before launching the job using the old MR API

JobConf job = (JobConf) getConf();
job.set("test", "123");

3. Before launching the job using the new MR API

Configuration conf = new Configuration();
conf.set("test", "123");
Job job = new Job(conf);
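For option 1 to work, the driver is usually run through ToolRunner so that the generic -D options are parsed into the Configuration; a minimal sketch (MyDriver is an illustrative name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // -D test=123 passed on the command line is already present in getConf()
        Job job = new Job(getConf());
        // ... set the mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options (-D, -files, -libjars etc.)
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}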

Getting the parameter:

1. Using the old API in the Mapper and Reducer: the JobConfigurable#configure method has to be implemented in the Mapper and Reducer classes.

private static Long N;
public void configure(JobConf job) {
    N = Long.parseLong(job.get("test"));
}

The variable N can then be used with the map and reduce functions.

2. Using the new API in the Mapper and Reducer. The context is passed to the setup, map, reduce and cleanup functions.

Configuration conf = context.getConfiguration();
String param = conf.get("test");
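A minimal sketch (names illustrative) of reading the parameter once per task in a new-API Mapper's setup() method and using it inside map():

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private long n;

    @Override
    protected void setup(Context context) {
        // setup() is called once per task, before any map() calls
        Configuration conf = context.getConfiguration();
        n = conf.getLong("test", 0L);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // n is available here for every record
    }
}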

Why explicitly specify the Map/Reduce output types using JobConf?

Something really simple ....
Q) Why explicitly specify the Map/Reduce output types using the JobConf#setMapOutputKeyClass, JobConf#setMapOutputValueClass, JobConf#setOutputKeyClass and JobConf#setOutputValueClass methods? Can't the Hadoop framework deduce them from the Mapper/Reducer implementations using reflection?

A) This is due to Type Erasure :

When a generic type is instantiated, the compiler translates those types by a technique called type erasure — a process where the compiler removes all information related to type parameters and type arguments within a class or method. Type erasure enables Java applications that use generics to maintain binary compatibility with Java libraries and applications that were created before generics.
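Since the generic parameters are erased at compile time, the framework cannot recover them by reflection, so the driver has to state them explicitly; a minimal sketch using the old-API setters from the question above (Text/IntWritable are just illustrative types):

// Inside the driver, using the old MR API (org.apache.hadoop.mapred);
// Text and IntWritable come from org.apache.hadoop.io.
JobConf job = new JobConf();
// The Mapper/Reducer generics are erased at compile time, so the
// key/value classes are declared explicitly for the framework.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);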

Nice to learn new things !!!

Hadoop MapReduce challenges in the Enterprise

Platform Computing published a five part series (one, two, three, four, five) about Hadoop MapReduce challenges in the enterprise. Some of the challenges mentioned in the series are addressed by NextGen MapReduce, which will be available for download soon, but some of the claims are not accurate. Platform has products around MapReduce and is about to be acquired by IBM, so it's not clear how they got them wrong.

Platform) On the performance measure, to be most useful in a robust enterprise environment a MapReduce job should take  sub-millisecond to start,  but the job startup time in the current open source MapReduce implementation is measured in seconds.

Praveen) MapReduce is supposed to be for batch processing and not for online transactions. The data from a MapReduce Job can be fed to a system for online processing. It's not to say that there is no scope for improvement in the MapReduce job performance.

Platform) The current Hadoop MapReduce implementation does not provide such capabilities. As a result, for each MapReduce job, a customer has to assign a dedicated cluster to run that particular application, one at a time.

Platform) Each cluster is dedicated to a single MapReduce application so if a user has multiple applications, s/he has to run them in serial on that same resource or buy another cluster for the additional application.

Praveen) Apache Hadoop has a pluggable scheduler architecture with Capacity, Fair and FIFO Schedulers; the FIFO Scheduler is the default. Schedulers allow multiple applications and multiple users to share the cluster at the same time.

Platform) Current Hadoop MapReduce implementations derived from open source are not equipped to address the dynamic resource allocation required by various applications.

Platform) Customers also need support for workloads that may have different characteristics or  are  written in different programming languages. For instance, some of those applications could be data intensive such as MapReduce applications written in Java, some could be CPU intensive such as Monte Carlo simulations which are often written in C++ -- a runtime engine must be designed to support both simultaneously.

Praveen) NextGen MapReduce allows for dynamic allocation of resources. Currently there is only support for RAM-based requests, but the framework can be extended to other resources like CPU, disk etc. in the future.

Platform) As mentioned in part 2 of this blog series, the single job tracker in the current Hadoop implementation is not separated from the resource manager, so as a result, the job tracker does not provide sufficient resource management functionalities to allow dynamic lending and borrowing of available IT resources.

Praveen) NextGen MapReduce separates resource management and task scheduling into separate components.

To summarize, NextGen MapReduce addresses some of the concerns raised by Platform, but it will take some time for NextGen MapReduce to get stabilized and be production ready.

Hadoop and MapReduce Algorithms Academic Papers

Hadoop, in spite of starting at a web company (Yahoo), has spread to solve problems in many other disciplines. Amund has consolidated a list of 'Hadoop and MapReduce Algorithms Academic Papers'. This gives an idea of where Hadoop and MapReduce can be used, and some of the papers can also serve as ideas for projects. The Atbrox blog is maintaining this list.

Hadoop and JavaScript

Microsoft has announced a limited preview version of Hadoop on Azure, where JavaScript can also be used to write MapReduce jobs on Hadoop. As of now, Streaming allows any scripting language which can read/write from STDIN/STDOUT to be used with Hadoop, but what Microsoft is trying to do is make JavaScript a first class citizen for Hadoop. There is a session on `Hadoop + JavaScript: what we learned' at the end of February, which is too long a wait for the impatient. BTW, here is an interesting article on using JavaScript on Hadoop with Rhino.
There has been a lot of work on JavaScript in the browser area over the last few years to improve performance (especially V8).
If anyone can share use-cases or their experience using JavaScript for HPC (High Performance Computing) in the comments, I will update the blog entry accordingly.

Is Java a prerequisite for getting started with Hadoop?


The query I often get from those who want to get started with Hadoop is whether knowledge of Java is a prerequisite or not. The answer is both a Yes and a No, depending on what the individual would like to do with Hadoop.


Why No?

MapReduce provides the Map and Reduce primitives, which have existed as the Map and Fold primitives in the functional programming world, in languages like Lisp, for quite some time. Hadoop provides Java interfaces to code against those primitives. But any language that can read/write to STDIN/STDOUT, like Perl, Python, PHP and others, can also be used through the Hadoop Streaming feature.

Also, there are higher level abstractions provided by Apache frameworks like Pig and Hive, for which familiarity with Java is not required. Pig can be programmed in Pig Latin and Hive can be programmed using HiveQL, and both are automatically converted into Java MapReduce programs.

According to the Cloudera blog entry `What Do Real-Life Hadoop Workloads Look Like?`, Pig and Hive constitute a majority of the workloads in a Hadoop cluster; that blog entry includes a histogram showing the breakdown.


Why Yes?

Hadoop and its ecosystem can be easily extended with additional functionality, like custom InputFormats and OutputFormats, UDFs (User Defined Functions) and others. For customizing Hadoop, knowledge of Java is mandatory.

Also, many times it's necessary to dig into the Hadoop code to find out why something behaves a particular way or to learn more about the functionality of a particular module. Again, knowledge of Java comes in handy here.

Hadoop projects involve a lot of different roles, like Architect, Developer, Tester and Linux/Network/Hardware Administrator, some of which require explicit knowledge of Java and some of which don't. My suggestion is that if you are genuinely interested in Big Data and think that Big Data will make a difference, then dive deep into Big Data technologies irrespective of your knowledge of Java.
Happy Hadooping !!!!

Seamless access to the Java SE API documentation

API documentation for Java SE and Hadoop (and other frameworks) can be downloaded for offline access. But the Hadoop API documentation is not aware of the offline copy of the Java SE documentation.
For seamless interaction between the two sets of documentation, references to

http://java.sun.com/....../ConcurrentLinkedQueue.html

in the Hadoop API documentation should be replaced with

file:///home/praveensripati/....../ConcurrentLinkedQueue.html

The below command will replace all such references in the API documentation (note that the forward slashes have to be escaped in the sed expression):

find ./ -type f -name '*.html' -exec sed -i 's/http:\/\/java.sun.com\/javase\/6\/docs\//file:\/\/\/home\/praveensripati\/Documents\/Java6API\//' {} \;

This enables seamless offline access and better productivity.

Hadoop Jar Hell

It's just not possible to download the latest Hadoop-related projects from Apache and use them together, because of interoperability issues among the different Hadoop projects and their release cycles.

That's the reason BigTop, an Apache Incubator project, has evolved: to solve the interoperability issues around the different Hadoop projects by providing a test suite. Also, companies like Cloudera provide their own distribution of the different Hadoop projects, based on the Apache distribution, with proper testing and support.

Now HortonWorks, which was spun off from Yahoo, has joined the same ranks. Their initial manifesto was to make the Apache downloads a source from which anyone can download the jars and use them without any issues. But they have moved away from this with the recent announcement of the HortonWorks Data Platform, which is again based on the Apache distribution, similar to what Cloudera has done with their CDH distributions. Although HortonWorks and Cloudera have their own distributions, they will continue to actively contribute to the Apache Hadoop ecosystem.

As BigTop matures, it should become possible to download the different Hadoop-related jar files from Apache and use them directly, instead of depending on the distributions from HortonWorks and Cloudera.

As mentioned in the GigaOm article, such distributions make it easier for HortonWorks and Cloudera to support their customers, as they have to support a limited number of Hadoop versions and they know the potential issues with those versions.

Installation and configuration of Apache Oozie

Many a time there is a requirement to run a group of dependent data processing jobs, and some of them may need to run at regular intervals. This is where Apache Oozie fits the picture. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.
Apache Oozie has three components: a workflow engine to run a DAG of actions, a coordinator (similar to a cron job or a scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie; here are articles (1, 2) comparing the two.

Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be fetched, the dependencies installed and then the binaries built. It's a bit of a tedious process, so this blog assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and install Oozie.

So, here are the steps to install and configure Oozie:

- Make sure the requirements (Unix box (tested on Mac OS X and Linux), Java JDK 1.6+, Maven 3.0.1+, Hadoop 0.20.2+, Pig 0.7+) to build are met.

- Download a release containing the code from Apache Oozie site and extract the source code.
- Execute the below command to start the build. During the build process, the jars have to be downloaded, so it might take some time based on the network bandwidth. Make sure that there are no errors in the build process.
bin/mkdistro.sh -DskipTests
- Once the build is complete, the binary file oozie-4.0.0.tar.gz should be present in the folder where the Oozie code was extracted. Extract the tar.gz file; this will create a folder called oozie-4.0.0.
- Create a libext/ folder and copy the commons-configuration-*.jar, ext-2.2.zip, hadoop-client-*.jar and hadoop-core-*.jar files into it. The hadoop jars need to be copied from the Hadoop installation folder.

When Oozie is started without the commons-configuration jar, the below exception is seen in the catalina.out log file. This is the reason for including the commons-configuration-*.jar file in the libext/ folder.
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
        at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:217)
- Prepare a war file using the below command. The oozie.war file should then be present in the oozie-4.0.0/oozie-server/webapps folder.
bin/oozie-setup.sh prepare-war
- Create the Oozie database schema using the below command:
bin/ooziedb.sh create -sqlfile oozie.sql -run
- Now is the time to start the Oozie Service which runs in Tomcat.
bin/oozied.sh start
- Check the Oozie log file logs/oozie.log to ensure Oozie started properly. And, run the below command to check the status of Oozie or instead go to the Oozie console at http://localhost:11000/oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status
- Now, the Oozie client has to be installed by extracting the oozie-client-4.0.0.tar.gz. This will create a folder called oozie-client-4.0.0.
With the Oozie service running and the Oozie client installed, it's time to run some simple workflows in Oozie to make sure Oozie works fine. Oozie comes with a bunch of examples in oozie-examples.tar.gz. Here are the steps for the same.

- Extract oozie-examples.tar.gz and change the port number on which the NameNode listens (the Oozie default is 8020, while the Hadoop default is 9000) in all the job.properties files. Similarly, the JobTracker port number has to be modified (Oozie default 8021, Hadoop default 9001).

- In the Hadoop installation, add the below to the conf/core-site.xml file. Check the Oozie documentation for more information on what these parameters mean
     <property>
          <name>hadoop.proxyuser.training.hosts</name>
          <value>localhost</value>
     </property>
     <property>
          <name>hadoop.proxyuser.training.groups</name>
          <value>training</value>
     </property>
- Make sure that HDFS and MR are started and running properly.
- Copy the examples folder to HDFS using the below command:
bin/hadoop fs -put /home/training/tmp/examples/ examples/
- Now run the Oozie example as
oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/job.properties -run
- The status of the job can be checked using the below command:
oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu
In the upcoming blogs, we will see how to write some simple work flows and schedule tasks in Oozie.

Introduction to Apache Hive and Pig

Apache Hive is a framework that sits on top of Hadoop for running ad-hoc queries on data in Hadoop. Hive supports HiveQL, which is similar to SQL but doesn't support the complete constructs of SQL.

Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved with either HiveQL or Java MapReduce, but Java MapReduce requires a lot more code to be written and debugged compared to HiveQL. So, using Hive increases developer productivity.

To summarize, Hive, through the HiveQL language, provides a higher level abstraction over Java MapReduce programming. As with any other high level abstraction, there is a bit of performance overhead with HiveQL compared to Java MapReduce, but the Hive community is working to narrow this gap for the most commonly used scenarios.

Along the same lines, Pig provides a higher level abstraction over MapReduce. Pig supports PigLatin constructs, which are converted into a Java MapReduce program and then submitted to the Hadoop cluster.


While HiveQL is a declarative language like SQL, PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on.

Some time back, Cloudera published statistics about the workload characteristics of a typical Hadoop cluster, and it can easily be observed that Pig and Hive jobs make up a good part of the jobs in a Hadoop cluster. Because of the higher developer productivity, many companies are opting for higher level abstractions like Pig and Hive. So, we can expect a lot more job openings around Hive and Pig compared to plain MapReduce development.
Although the Programming Pig book was published some time back (October 2011), the Programming Hive book was published only recently (October 2012). For those who have experience with an RDBMS, getting started with Hive is a better option than getting started with Pig. Also, note that the PigLatin language is not very difficult to get started with.

For the underlying Hadoop cluster it's transparent whether a Java MapReduce job is submitted or a MapReduce job is submitted through Hive and Pig. Because of the batch oriented nature of MapReduce jobs, the jobs submitted through Hive and Pig are also batch oriented in nature.

For real time response requirements, Hive and Pig don't fit because of the earlier mentioned batch oriented nature of MapReduce jobs. Cloudera developed Impala, which is based on Dremel (a publication from Google), for interactive ad-hoc queries on top of Hadoop. Impala supports SQL-like queries and is compatible with HiveQL, so any applications built on top of Hive should work with minimal changes on Impala. The major difference between Hive and Impala is that while HiveQL is converted into Java MapReduce jobs, Impala doesn't convert the SQL query into Java MapReduce jobs.

Whether to go with Pig or Hive for a particular requirement is a topic for another blog.

What is the relation between Hibernate and Swap Memory?


With the new HP 430 Notebook, hibernate was not working; I was getting a message that there was not enough swap for hibernate. I found from this Wiki that swap memory >= RAM is required for hibernate to work.
Since the HP 430 Notebook had enough RAM (4GB), I chose the swap to be 1GB at the time of the Ubuntu installation, and so hibernate was not working. Again, the Wiki has instructions for increasing the size of the swap.

So, it's better to choose enough swap at the time of the Ubuntu installation for hibernate to work.

HDFS Facts and Fiction

Sometimes even the popular blogs/sites don't get the facts straight; it's not clear whether the articles are reviewed by others or not. As I mentioned earlier, changes in Hadoop are happening at a very rapid pace and it's very difficult to stay updated with the latest.

Here is a snippet from ZDNet article

> The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its ability to write to a file only once, leaves the Enterprise wanting more. Some vendors are trying to answer the call.

HDFS supports appends, and that's at the core of HBase; without the HDFS append functionality HBase doesn't work. There had been some challenges in getting appends to work in HDFS, but they have been sorted out. Also, HBase supports random/real-time read/write access to data on top of HDFS.

Agreed, the NameNode is a SPOF in HDFS. But HDFS High Availability will include two NameNodes; for now, switchover from the active to the standby NameNode is a manual task to be done by an operator, and work is going on to make it automatic. Also, there are mitigations around the SPOF in HDFS, like having a Secondary NameNode and writing the NameNode metadata to multiple locations.

One thing to ponder: if HDFS were so unreliable, we wouldn't have seen so many clusters using HDFS. Also, other Apache frameworks are being built on top of HDFS.

Getting started with HDFS client side mount table

With HDFS federation it's possible to have multiple NameNodes in an HDFS cluster. While this is good from a NameNode scalability and isolation perspective, it's difficult to manage multiple namespaces from a client application perspective. The HDFS client-side mount table makes the multiple namespaces transparent to the client. The ViewFs documentation has more details on how to use the HDFS client mount table.

An earlier blog entry detailed how to set up HDFS federation. Let's assume the two NameNodes have been set up successfully on namenode1 and namenode2.

Let's map
/NN1Home to hdfs://namenode1:9001/home
and
/NN2Home to hdfs://namenode2:9001/home
Add the following to the core-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>viewfs:///</value>
    </property>
    <property>
        <name>fs.viewfs.mounttable.default.link./NN1Home</name>
        <value>hdfs://namenode1:9001/home</value>
    </property>
    <property>
        <name>fs.viewfs.mounttable.default.link./NN2Home</name>
        <value>hdfs://namenode2:9001/home</value>
    </property>
</configuration>
Start the cluster with the `sbin/start-dfs.sh` command from the Hadoop Home and make sure the NameNodes and the DataNodes are working properly.
Run the following commands
bin/hadoop fs -put somefile.txt /NN1Home/input
Make sure that somefile.txt is in the hdfs://namenode1:9001/home/input folder from NameNode web console.
bin/hadoop fs -put somefile.txt /NN2Home/output
Make sure that somefile.txt is in the hdfs://namenode2:9001/home/output folder from NameNode web console.
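Client code itself does not change with the mount table; a minimal sketch (assuming the core-site.xml above is on the classpath; the class name is illustrative) of accessing both namespaces through the default viewfs file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the viewfs:/// default and the mount table
        FileSystem fs = FileSystem.get(conf);

        // /NN1Home and /NN2Home are transparently routed to namenode1 and namenode2
        for (FileStatus status : fs.listStatus(new Path("/NN1Home/input"))) {
            System.out.println(status.getPath());
        }
    }
}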

How to start CheckPoint NameNode in Hadoop 0.23 release?


Prior to the 0.23 release, the masters file in the conf folder of the Hadoop installation had the list of host names on which the CheckPoint NN has to be started. With the 0.23 release the masters file is not used anymore; instead, the dfs.namenode.secondary.http-address key has to be set to ip:port in hdfs-site.xml. The CheckPoint NN can then be started using the sbin/hadoop-daemon.sh start secondarynamenode command. Run the jps command to make sure that the CheckPoint NN is running, and also check the corresponding log file for any errors.
BTW, the Secondary NN is now being referred to as the CheckPoint NN, but the code still uses Secondary NN and people still refer to it as the Secondary NN.

HDFS explained as comics

Manish has done a nice job explaining HDFS as comics for those who are new to HDFS. Here is the link.

HDFS Name Node High Availability

The NameNode is the heart of HDFS. It stores the namespace for the filesystem and also tracks the locations of the blocks in the cluster. The block locations are not persisted in the NameNode; instead, the DataNodes report the blocks they have to the NameNode when they start. If the NameNode instance is not available, then HDFS is not accessible until it's back up and running.

The Hadoop 0.23 release introduced HDFS federation, where it is possible to have multiple independent NameNodes in a cluster, and a particular DataNode can hold blocks for more than one NameNode. Federation provides horizontal scalability, better performance and isolation.

HDFS NN HA (NameNode High Availability) is an area where active work is happening. Here are the JIRA, presentation and video for the same. HDFS NN HA was not cut into the 0.23 release and will be part of later releases. Changes are going into the HDFS-1623 branch, if someone is interested in the code.


Edit (11th March, 2012) : Detailed blog entry from Cloudera on HA.

Sunday, 6 October 2013

Getting started with HDFS Federation


HDFS Federation enables multiple NameNodes in a cluster for horizontal scalability of the NameNode. These NameNodes work independently and don't require any coordination. A DataNode can register with multiple NameNodes in the cluster and can store the data blocks for multiple NameNodes.
Here are the instructions to set up HDFS Federation on a cluster.

1) Download the Hadoop 0.23 release from here.

2) Extract it to a folder (let's call it HADOOP_HOME) on all the NameNodes (namenode1, namenode2) and on all the slaves (slave1, slave2, slave3)

3) Add the association between the hostnames and the IP addresses of the NameNodes and the slaves to the /etc/hosts file on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.

4) Make sure that the NameNodes are able to do a password-less ssh to all the slaves. Here are the instructions for the same.

5) Add the following to .bashrc in the home folder for the NameNodes and all the slaves.
export HADOOP_DEV_HOME=/home/praveensripati/Installations/hadoop-0.23.0
export HADOOP_COMMON_HOME=$HADOOP_DEV_HOME
export HADOOP_HDFS_HOME=$HADOOP_DEV_HOME
export HADOOP_CONF_DIR=$HADOOP_DEV_HOME/conf
6) For some reason the properties set in .bashrc are not being picked up by the Hadoop scripts and need to be explicitly set at the beginning of the scripts.

Add the below in libexec/hadoop-config.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_27
7) Create a tmp folder in HADOOP_HOME folder on the NameNodes and on all the slaves.

8) Create the following configuration files in the $HADOOP_HOME/conf folder on the NameNodes and on all the slaves.

core-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
</configuration>
 hdfs-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.federation.nameservices</name>
        <value>ns1,ns2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1:9001</value>
      </property>
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2:9001</value>
      </property>
</configuration>
9) Add the slave host names to the conf/slaves file on the NameNode on which the command in step 10 will be run.
slave1
slave2
slave3
10) Start the Hadoop daemons. This command will start the NameNodes on the nodes specified in the hdfs-site.xml and also start the DataNodes on the nodes specified in the conf/slaves file.
sbin/start-dfs.sh
11) The NameNode should be started on namenode1 and namenode2 and the DataNode on all the slaves.

12) Check the log files on all the NameNodes and on the slaves for any errors.

13) Check the home page of each NameNode. The number of DataNodes should be equal to the number of slaves.
http://namenode1:50070/dfshealth.jsp
http://namenode2:50070/dfshealth.jsp
14) Stop the Hadoop daemons.
sbin/stop-dfs.sh

Once multiple NameNodes are up and running, client-side mount tables can be used to give the application a unified view of the different namespaces. Getting started with mount tables coming soon :)

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...