Posts

Showing posts from April, 2014

Using Apache Hadoop and Impala with MySQL for Data Analysis

Source: Cloudera blog; thanks to Alexander Rubin of Percona.

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post. I’ve used Cloudera Manager to install Hadoop and Impala. For this test I’ve (intentionally) used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

Purpose | Server specs
Namenode, Hive metastore, etc + Datanodes | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives
Datanodes only | 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive

As you can see, those are pretty old se…

Automating things using IFTTT

I am a big fan of automation and was recently looking for a way to tweet automatically when I publish a new blog post. I found ifttt.com, which lets you create recipes like these and share them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new blog post and it being posted to Twitter.


The acronym IFTTT is a bit cryptic to remember; it expands to `IF This Then That`. The service has been running for almost 4 years, but I have found it a bit flaky to use.

I was not able to create recipes on the first attempt, and I was also not able to add LinkedIn as a channel. It's also not possible to create multiple channels of the same type. For example, the new blog post event cannot be sent to multiple Twitter accounts. I had to create multiple accounts with IFTTT in order to add multiple Twitter channels. Creating multiple triggers for a single recipe is not possible for now either. It would also be nice to have some complex recipes like if-then-else and others…

What is a Big Data cluster?

Very often I get the query `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple: `A cluster is a group or a network of machines wired together, acting as a single entity to work on a task which would take much longer when run on a single machine.` The given task is split and processed by multiple machines in parallel, so that it gets completed faster. Jesse Johnson explains in simple and clear terms what a cluster is all about and how to design distributed algorithms here. In a Big Data cluster, the machines (or nodes) are neither as powerful as a server grade machine nor as dumb as a desktop machine. Having many (as in thousands of) server grade machines doesn't make sense from a cost perspective, while a desktop grade machine fails often, and those failures have to be handled appropriately. Big Data clusters are built from a collection of commodity machines which fall in between a server grade and a desktop grade machine.
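
The split-and-process-in-parallel idea can be sketched on a single machine. The snippet below is only an illustration, not how Hadoop is implemented: worker threads stand in for cluster nodes, word counting stands in for the task, and the names `split`, `count_words` and `cluster_word_count` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Work done by one 'node': count the words in its share of the lines."""
    return sum(len(line.split()) for line in lines)


def cluster_word_count(lines, nodes=4):
    """Split the task, let each 'node' process its chunk, combine the results."""
    chunks = split(lines, nodes)
    # Assumption: threads play the role of machines in this toy example.
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)
```

Calling `cluster_word_count(["a b c", "d e", "f"], nodes=2)` gives each of the two workers part of the list and adds up their partial counts. A real cluster also has to move data between machines and cope with node failures, which this sketch leaves out.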