
Wednesday, 30 April 2014

Experts Video - Hadoop World

Watch the experts Doug Cutting, Cloudera's Chief Architect, and Todd Papaioannou, Splunk's CTO, as they share their thoughts on big data in Asia on Bloomberg Television's Singapore Sessions.

Monday, 28 April 2014

Using Apache Hadoop and Impala with MySQL for Data Analysis

Source: Cloudera blog; thanks to Alexander Rubin of Percona.
Apache Hadoop is commonly used for data analysis: it is fast for data loads and it scales well. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (in a columnar format), and run reporting on top of it. For the examples below, I will use the “ontime flight performance” data from my previous post.
I’ve used Cloudera Manager to install Hadoop and Impala. For this test I’ve intentionally used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:
Purpose                                    | Server specs
-------------------------------------------|-------------
Namenode, Hive metastore, etc. + Datanodes | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives
Datanodes only                             | 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive

As you can see, those are pretty old servers; the only thing I’ve changed is adding a 4TB drive to be able to store more data. Hadoop provides redundancy at the server level (it writes 3 replicas of each block to different datanodes), so we do not need RAID on the datanodes (we do still need redundancy for the namenode, though).

Data Export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:
select * into outfile '/tmp/ontime.psv'
fields terminated by '|'
from ontime;

(You can use “|” or any other symbol as a delimiter.) Alternatively, you can download data directly from www.transtats.bts.gov using this simple script:
for y in {1988..2013}
do
        for i in {1..12}
        do
                # u should hold the download URL for year $y, month $i
                # (the full URL on www.transtats.bts.gov is not shown here)
                wget $u -o ontime.log
                unzip On_Time_On_Time_Performance_${y}_${i}.zip
        done
done
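Once the pipe-delimited files are on HDFS, the remaining steps of the workflow (loading into Impala in a columnar format and reporting) can be sketched as below. This is only a sketch: the column list is abbreviated, and the table names and HDFS path are assumptions, not the ones from the original post.

```sql
-- Sketch only: column list abbreviated; table names and HDFS path assumed.
-- Stage the pipe-delimited export as an external text table:
CREATE EXTERNAL TABLE ontime_csv (
  FlightDate STRING,
  Carrier    STRING,
  Origin     STRING,
  Dest       STRING,
  DepDelay   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/hive/warehouse/ontime_csv/';

-- Convert to the columnar Parquet format that Impala scans efficiently:
CREATE TABLE ontime_parquet STORED AS PARQUET
AS SELECT * FROM ontime_csv;

-- Example report: average departure delay per carrier
SELECT Carrier, AVG(DepDelay) AS avg_delay
FROM ontime_parquet
GROUP BY Carrier
ORDER BY avg_delay DESC;
```

The conversion step pays off at query time: a columnar format lets Impala read only the columns a report touches instead of scanning whole rows.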

Automating things using IFTTT

I am a big fan of automation and was recently looking for a way to tweet automatically when I publish a new blog post. I found ifttt.com, which allows you to create recipes like these and share them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new blog post and having it posted to Twitter.

The acronym IFTTT is a bit cryptic to remember; it expands to `IF This Then That`. The service has been running for almost 4 years, but it has been a bit flaky in my experience.

I was not able to create recipes on my first attempt, and I was also not able to add LinkedIn as a channel. It is also not possible to create multiple channels of the same type: for example, a new-blog event cannot be sent to multiple Twitter accounts, so I had to create multiple IFTTT accounts in order to add multiple Twitter channels. Creating multiple triggers for a single recipe is not possible for now either, and it would be nice to have more complex recipes, such as if-then-else.

The recipe has already been created to tweet any new blog post published here; now I need to wait and see whether this post gets tweeted or not.

Thursday, 10 April 2014

What is a Big Data cluster?

Very often I get the question `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple: `A cluster is a group or network of machines wired together, acting as a single entity to work on a task which, when run on a single machine, would take much longer.` The given task is split and processed by multiple machines in parallel, so that the task completes faster. Jesse Johnson explains in simple and clear terms what a cluster is all about, and how to design distributed algorithms, here.
[Image: IMG_9370 by NeoSpire, from Flickr, under CC]
In a Big Data cluster, the machines (or nodes) are neither as powerful as server-grade machines nor as failure-prone as desktop machines. Having multiple (as in thousands of) server-grade machines doesn't make sense from a cost perspective, while desktop-grade machines fail often, which has to be handled appropriately. Big Data clusters therefore use a collection of commodity machines, which fall in between server and desktop grade.
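The split-and-process idea above can be illustrated on a single machine with a toy example (a sketch of the pattern, not Hadoop itself): split a 100-line "task" into chunks, process the chunks with several parallel workers, and combine the partial results.

```shell
# Toy split/parallel-process/combine on one machine - the same pattern a
# cluster applies across machines instead of processes.
workdir=$(mktemp -d)
seq 1 100 > "$workdir/input.txt"                 # the "task": 100 lines to count
split -n l/4 "$workdir/input.txt" "$workdir/chunk_"   # split into 4 chunks (GNU split)
# count each chunk with 4 parallel workers, then sum the partial counts:
ls "$workdir"/chunk_* | xargs -P4 -n1 wc -l | awk '{total += $1} END {print total}'
# prints 100
```

The final sum matches a single-machine `wc -l` on the whole file; the win on a real cluster is that the per-chunk work runs on many machines at once.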

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog. The CDH software stack lets you use your tool of choice with the Parquet file format – offering the benefits of ...