source: Cloudera blog; Thanks to Alexander Rubin of Percona

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format), and run reports on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.

I’ve used Cloudera Manager to install Hadoop and Impala. For this test I’ve intentionally used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

| Purpose | Server specs |
| --- | --- |
| Namenode, Hive metastore, etc + Datanodes | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives |
| Datanodes only | 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive |

As you can see, those are pretty old se…
I am a big fan of automation and was recently looking for a way to tweet automatically whenever I publish a new blog post. I found ifttt.com, which lets you create recipes like these and share them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new post and getting it posted to Twitter.
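To make the 15-minute polling behavior concrete, here is a minimal Python sketch of a trigger/action pair that is checked on a schedule. It is only an illustration of the idea, not IFTTT's actual API: the `Recipe` class, `fetch_items`, `action`, and `delay_until_next_poll` are hypothetical names, and the sketch assumes polls happen at minutes 0, 15, 30, and so on.

```python
POLL_INTERVAL_MINUTES = 15  # interval quoted in the post


class Recipe:
    """Hypothetical 'if this then that' pair: one trigger, one action."""

    def __init__(self, fetch_items, action):
        self.fetch_items = fetch_items  # returns the current list of item ids
        self.action = action            # called once for each new item
        self.seen = set()               # items already acted upon

    def poll(self):
        """Run one polling cycle and return the items acted upon."""
        acted = []
        for item in self.fetch_items():
            if item not in self.seen:
                self.seen.add(item)
                self.action(item)
                acted.append(item)
        return acted


def delay_until_next_poll(published_minute, interval=POLL_INTERVAL_MINUTES):
    """Minutes from publishing to the next poll (polls assumed at 0, 15, 30, ...)."""
    return (-published_minute) % interval
```

A post published one minute after a poll waits 14 minutes under these assumptions, while one published right at a poll is picked up immediately, which is why the delay stays below the polling interval.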
The acronym IFTTT is a bit cryptic to remember and expands to `If This Then That`. The service has been running for almost 4 years, but it has been a bit flaky in my use.
I was not able to create recipes on my first attempt, and I was also not able to add LinkedIn as a channel. It's also not possible to create multiple channels of the same type. For example, the new blog event cannot be sent to multiple Twitter accounts. I had to create multiple IFTTT accounts in order to add multiple Twitter channels. Creating multiple triggers for a single recipe is not possible for now either. It would also be nice to have more complex recipes like if-then-else and others…
Very often I get the question `What is a cluster?` when discussing
Hadoop and Big Data. To keep it simple, `A cluster is a group or a
network of machines wired together, acting as a single entity to work on
a task which, when run on a single machine, takes much longer.` The
given task is split and processed by multiple machines in parallel, so
that the task gets completed faster. Jesse Johnson puts in simple and
clear terms what a cluster is all about and how to design distributed
algorithms here.
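The split-and-process-in-parallel idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads on one machine stand in for the nodes of a cluster, word counting is an arbitrary example task, and `split`, `count_words`, `merge`, and `cluster_word_count` are hypothetical helper names. It shows the shape of the computation (split, process each part independently, combine), not the speedup a real cluster gives.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Work done by one 'node': count words in its own chunk only."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge(partials):
    """Combine the per-node partial counts into one result."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


def cluster_word_count(lines, nodes=4):
    """Split the task, process the chunks in parallel, then merge."""
    chunks = split(lines, nodes)
    # Threads stand in for machines here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge(partials)
```

Each worker sees only its own chunk, so no coordination is needed until the final merge step. That independence is what lets the work be spread across many machines.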
In a Big Data cluster, the machines (or nodes) are neither as powerful
as a server-grade machine nor as limited as a desktop machine. Having
thousands of server-grade machines doesn't make sense from a cost
perspective, while desktop-grade machines fail often, and those failures
have to be handled appropriately. Big Data clusters have a collection of
commodity machines which fall in between a server and a desktop grade