
Thursday, 31 March 2016

Getting started with NextGen MapReduce (single node) in easy steps

Without a detailed explanation of what-is-what, which is due for another blog entry, here are simple steps to get started with MRv2 (next-generation MapReduce) on a single node. More details about MRv2 can be found here. So, here are the steps:

1) Download the Hadoop 2.x release here.

2) Extract it to a folder (let's call it $HADOOP_HOME).
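For example, assuming the tarball was downloaded to the current directory and using the target location referenced throughout this post:

mkdir -p /home/bigdata/Installations
tar -xzf hadoop-2.7.2.tar.gz -C /home/bigdata/Installations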

3) Add the following to .bashrc in the home folder.

export HADOOP_HOME=/home/bigdata/Installations/hadoop-2.7.2
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
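Then reload the shell configuration so the new variables take effect in the current session:

source ~/.bashrc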
4) Create the namenode and datanode folders in the $HADOOP_HOME folder.
mkdir -p $HADOOP_HOME/yarn/yarn_data/hdfs/namenode
mkdir -p $HADOOP_HOME/yarn/yarn_data/hdfs/datanode
5) Create the following configuration files in the $HADOOP_HOME/etc/hadoop folder.
etc/hadoop/yarn-site.xml

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>

etc/hadoop/core-site.xml
<configuration>
   <property>
       <name>fs.defaultFS</name>
       <value>hdfs://localhost:8020</value>
   </property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/bigdata/Installations/hadoop-2.7.2/yarn/yarn_data/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/bigdata/Installations/hadoop-2.7.2/yarn/yarn_data/hdfs/datanode</value>
   </property>
</configuration>
etc/hadoop/mapred-site.xml
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
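Note: the stock Hadoop 2.x tarball usually ships only a mapred-site.xml.template, so if mapred-site.xml does not exist yet, create it from the template first:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml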
6) Format the NameNode
hdfs namenode -format
7) Start the Hadoop daemons
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh start secondarynamenode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
mr-jobhistory-daemon.sh start historyserver
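Alternatively, Hadoop ships wrapper scripts in $HADOOP_HOME/sbin that start several daemons at once. Note that start-dfs.sh and start-yarn.sh use ssh, so passwordless ssh to localhost needs to be set up for them to work:

start-dfs.sh    # starts namenode, secondarynamenode, and datanode
start-yarn.sh   # starts resourcemanager and nodemanager
mr-jobhistory-daemon.sh start historyserver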
8) Time to check whether the installation was successful.

    a) Check the log files in the $HADOOP_HOME/logs folder for any errors.

    b) The following web consoles should come up:

http://localhost:50070/ for NameNode
http://localhost:8088/cluster for ResourceManager
http://localhost:19888/jobhistory for Job History Server
    c) Run the jps command to make sure that the daemons are running.
2234 Jps
1989 ResourceManager
2023 NodeManager
1856 DataNode
2060 JobHistoryServer
1793 NameNode
2049 SecondaryNameNode
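The web consoles from (b) can also be probed from the shell, a quick sanity check assuming curl is installed; each request should typically print 200:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/cluster
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:19888/jobhistory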
9) Create a file and copy it to HDFS
mkdir in
echo "Hadoop is fast" > in/file
echo "Hadoop is cool" >> in/file
hdfs dfs -copyFromLocal in/ /in
10) Run the example job.
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /in /out
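Once the job finishes, the output can also be inspected from the command line (part-r-00000 is the usual file name for a single-reducer job):

hdfs dfs -ls /out
hdfs dfs -cat /out/part-r-00000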
11) Verify through the NameNode web console (http://localhost:50070/dfshealth.jsp) that the /out folder has been created with the proper contents.

12) Stop the daemons once the job has completed successfully.

hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop secondarynamenode
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
mr-jobhistory-daemon.sh stop historyserver
Note: it's easier to put all of the Hadoop start/stop commands into a shell script and execute that script instead, as sketched below.
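For example, a minimal wrapper (hypothetical name hadoop-daemons.sh) that takes start or stop as its argument:

#!/bin/bash
# hadoop-daemons.sh (hypothetical name): start or stop all single-node daemons
ACTION=${1:?usage: $0 start|stop}
hadoop-daemon.sh $ACTION namenode
hadoop-daemon.sh $ACTION datanode
hadoop-daemon.sh $ACTION secondarynamenode
yarn-daemon.sh $ACTION resourcemanager
yarn-daemon.sh $ACTION nodemanager
mr-jobhistory-daemon.sh $ACTION historyserver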
