BIGDATATRENDZ



Monday, 31 March 2014

Downloading files from YouTube in Ubuntu

There are a lot of nice videos on YouTube, on topics ranging from kids' content to machine learning. Some of these videos are so interesting that you feel like watching them again and again. When you notice this pattern, it's better to download the videos. Not only does this allow offline viewing, it also saves bandwidth, which matters all the more if your connection has a bandwidth cap.

`youtube-dl` is a very useful command for downloading files from YouTube on Ubuntu. `youtube-dl` has a lot of nice options; here are some of the ones I use:

youtube-dl -c -t -f 5 --batch-file=files.txt

-c           -> resume a partially downloaded file
-t           -> use the title of the video in the file name of the downloaded video
-f           -> specify the video format (quality) in which to download the video
--batch-file -> specify the name of a file containing URLs of videos to download from YouTube in batch mode; the file must contain one URL per line
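Putting the options together, a minimal batch-download session might look like the sketch below. The URLs in files.txt are placeholders, and the `youtube-dl` invocation is guarded so the sketch is harmless on a machine where the tool is not installed:

```shell
# Create a batch file with one (placeholder) YouTube URL per line
printf '%s\n' \
    'https://www.youtube.com/watch?v=AAAAAAAAAAA' \
    'https://www.youtube.com/watch?v=BBBBBBBBBBB' > files.txt

# -c resume, -t title-based file name, -f 5 pick format 5,
# --batch-file read the URLs from files.txt
if command -v youtube-dl >/dev/null 2>&1; then
    youtube-dl -c -t -f 5 --batch-file=files.txt
else
    echo "youtube-dl not installed"
fi
```

Partially downloaded files can be resumed later by simply re-running the same command, thanks to `-c`.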

Caching Proxy - Installation and Configuration

Setting up a Hadoop cluster is easy enough with a bit of familiarity with system and network administration. It's all interesting; the only frustrating part is downloading the patches after the OS installation, and then the packages for the software on top of the OS. The downloads can run close to a GB, which can take anywhere from a couple of minutes to hours depending on the internet bandwidth.

This is where caching tools really help. They cache the downloaded packages on a designated local machine (let's call it the cache server), and the other machines point to the cache server to get the packages. This way the packages are downloaded from the internet only the first time; from then on, the local cache server serves them. This approach not only saves network bandwidth, but also makes the whole installation process faster.

For Debian-based systems, apt-cacher-ng is designed to cache packages and is really easy to install and configure. Here are the steps involved:
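In outline, the setup is a one-line install on the cache server plus a one-line APT configuration fragment on each client. This is a sketch: the hostname `cache-server` is an assumption (substitute your own), and 3142 is apt-cacher-ng's default port:

```shell
# On the cache server (Debian/Ubuntu; needs root):
sudo apt-get install apt-cacher-ng
# The apt-cacher-ng daemon listens on port 3142 by default.

# On each client, point APT at the cache server
# ('cache-server' is a placeholder hostname):
echo 'Acquire::http::Proxy "http://cache-server:3142";' | \
    sudo tee /etc/apt/apt.conf.d/01proxy
```

After that, every `apt-get install` on a client goes through the cache server, and a package is fetched from the internet only the first time any machine asks for it.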

Friday, 14 March 2014

HIVE Installation & Setup Guide

Prerequisites:
  1. Ubuntu / CentOS
  2. Hadoop 1.x / 2.x (I prefer to install with 2.x)
Step –> 1: Download and Install
Download Hive from the Apache Download Mirror; I place it in the /home/bigdata/Installations/ directory.
$ cd /home/bigdata/Installations
$ wget http://redrockdigimark.com/apachemirror/hive/stable/apache-hive-1.2.1-bin.tar.gz (I preferred to download hive-1.2.1, as it is the stable version)

$ sudo tar xzf apache-hive-1.2.1-bin.tar.gz

Step –> 2:

After downloading and extracting, edit the hive-env.sh file for configuration. Make sure the installation directory is owned by (or writable for) the bigdata user.
In $HIVE_HOME/conf/hive-env.sh
export JAVA_HOME=/opt/jdk1.8.0_10

Step –> 3:
Add the Hive path to .bashrc:
$ gedit ~/.bashrc
and add following lines to it
export HIVE_HOME=/home/bigdata/Installations/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Step –> 4:
Restart the terminal, start Hadoop, and then start Hive:
$ hive
hive> show tables;
(to test the Hive installation)
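To go a step beyond `show tables;`, a quick hand-run smoke test from the Hive prompt could look like the following (the table name `test_install` is arbitrary, chosen just for this check):

```sql
-- Create a throwaway table, confirm it shows up, then clean up
hive> CREATE TABLE test_install (id INT, name STRING);
hive> SHOW TABLES;
hive> DROP TABLE test_install;
```

If the CREATE and DROP both succeed, the Hive installation and its connection to Hadoop are working.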

Happy Hadooping !!

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to the blog contributors from Cloudera. Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...