WELCOME TO BIGDATATRENDZ



Wednesday, 12 October 2016

Getting to Know the Apache Hadoop 3 Alpha

Source: Cloudera Blog
This article covers the Hadoop 3.x release and is taken from a Cloudera Blog post.

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.
Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process.
The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1.
Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

Monday, 10 October 2016


Thanks to the Hortonworks blog and its authors, Carter Shanklin and Nita Dembla.

Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory processing. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.
In this blog, we’ll update benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More.”


Tuesday, 4 October 2016

RDBMS vs NoSQL Data Flow Architecture

Here is a view of the difference between an RDBMS and a NoSQL database. There are many differences between the two, but this post looks only at how data flows through each system on a write.

In this simplified view of a traditional RDBMS, the data is first written to the database, then to memory. When memory reaches a certain threshold, the data is written to the logs. The log files are used for recovery after a server crash. Before an RDBMS returns success on an insert or update, the data has to be validated against the predefined schema, the indexes have to be updated, and other checks have to run, which makes the write somewhat slower than the NoSQL approach discussed below.

In a NoSQL database like HBase, the data is first written to the log (the write-ahead log, or WAL), then to memory. When memory reaches a certain threshold, the data is written to the database. Before a put call returns success, the data only has to be written to the log file. It does not need to be written to the database or validated against a schema first.
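The log-first flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase's actual implementation: the class name `ToyStore`, the file names `wal.log` and `data.db`, and the `flush_threshold` parameter are all hypothetical, and the flush runs inline here, whereas a real system would typically do it in the background.

```python
import os


class ToyStore:
    """Toy model of a log-first write path (hypothetical, not HBase code)."""

    def __init__(self, directory, flush_threshold=3):
        self.wal_path = os.path.join(directory, "wal.log")    # hypothetical file name
        self.data_path = os.path.join(directory, "data.db")   # hypothetical file name
        self.flush_threshold = flush_threshold  # number of buffered keys before a flush
        self.memstore = {}

    def put(self, key, value):
        # Step 1: append the edit to the end of the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(f"{key}\t{value}\n")
            wal.flush()
            os.fsync(wal.fileno())
        # Step 2: update the in-memory buffer.
        self.memstore[key] = value
        # Step 3: when the buffer reaches the threshold, write it to the data file.
        # Done inline only to keep the sketch short.
        if len(self.memstore) >= self.flush_threshold:
            self.flush()
        return True

    def flush(self):
        # Write the buffered edits to the data file in key order.
        with open(self.data_path, "a", encoding="utf-8") as data:
            for key in sorted(self.memstore):
                data.write(f"{key}\t{self.memstore[key]}\n")
        self.memstore.clear()
        # The flushed edits no longer need the log, so truncate it.
        open(self.wal_path, "w", encoding="utf-8").close()

    def recover(self):
        # After a crash, replay the log to rebuild the in-memory buffer.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                key, value = line.rstrip("\n").split("\t", 1)
                self.memstore[key] = value
```

Creating a `ToyStore` in an empty directory, calling `put` twice, and then building a second `ToyStore` on the same directory and calling `recover()` restores both edits from the log alone, even though nothing has reached the data file yet.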

Writing to a log file (the first step in NoSQL) is a simple append at the end of the file, which is much faster than writing to the database (the first step in RDBMS). As a result, the NoSQL data flow described above gives NoSQL databases a higher insert/update throughput than an RDBMS.

Happy Hadooping......!!

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera. Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...