WELCOME TO BIGDATATRENDZ      WELCOME TO CAMO      Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop      Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle     

Wednesday, 4 July 2018

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera

Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. This post explains how it works.
HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.
However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space.
Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel initiated and drove the HDFS-EC project under HDFS-7285 in collaboration with the broader Apache Hadoop community. HDFS-EC is currently targeted for release in Hadoop 3.0.
In this post, we will describe the design of HDFS erasure coding. Our design accounts for the unique challenges of retrofitting EC support into an existing distributed storage system like HDFS, and incorporates insights by analyzing workload data from some of Cloudera’s largest production customers. We will discuss in detail how we applied EC to HDFS, changes made to the NameNode, DataNode, and the client read and write paths, as well as optimizations using Intel ISA-L to accelerate the encoding and decoding calculations. Finally, we will discuss work to come in future development phases, including support for different data layouts and advanced EC algorithms.


When comparing different storage schemes, there are two important considerations: data durability (measured by the number of tolerated simultaneous failures) and storage efficiency (logical size divided by raw usage).
Replication (like RAID-1, or current HDFS) is a simple and effective way of tolerating disk failures, at the cost of storage overhead. N-way replication can tolerate up to n-1 simultaneous failures with a storage efficiency of 1/n. For example, the three-way replication scheme typically used in HDFS tolerates up to two failures with a storage efficiency of one-third (alternatively, 200% overhead).
Erasure coding (EC) is a branch of information theory which extends a message with redundant data for fault tolerance. An EC codec operates on units of uniformly-sized data termed cells. A codec can take as input a number of data cells and outputs a number of parity cells. This process is called encoding. Together, the data cells and parity cells are termed an erasure coding group. A lost cell can be reconstructed by computing over the remaining cells in the group; this process is called decoding.
Table 1. XOR (exclusive-or) operations

Monday, 25 September 2017


Thanks to Carter Shanklin & Nita Dembla from Hortonworks for valuable post.
One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.


Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.
Hive Interactive Query


Thanks to Hortonworks Team for the valuable post.
In this post, we deep dive into something that we are extremely excited about – Running a container cloud on YARN! We have been using this next-generation infrastructure for more than a year in running all of the Hortonworks internal CI / CD infrastructure.
With this, we can now run Hadoop on Hadoop to certify our releases! Let’s dive right in!


The introductory post on Engineering @ Hortonworks gave the readers an overview of the scale of challenges we see in delivering an Enterprise Ready Data platform.
Essentially, for every new release of a platform, we provision Hadoop clusters on demand, with specific configurations like authentication on/off, encryption on/off, DB combinations, and OS environments, run a bunch of tests to validate changes, and shut them down.
And we do this over and over, day in and day out, throughout the year.

Sunday, 24 September 2017


Thanks to Vinay Shukla(Leading Data Science Product Management at Hortonworks)
Our customers increasingly leverage Data Science, and Machine Learning to solve complex predictive analytics problem. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching.
While everyone wants to predict the future, truly leveraging Data Science for Predictive Analytics remains the domain of a select few. To expand the reach of Data Science, the Modern Data Architecture (MDA) needs to address the following 4 requirements:
  • Enable Apps to consume predictions and become smarter
  • Bring predictive analytics to the IOT Edge
  • Become easier, more accurate & faster to deploy and manage
  • Fully support data science life cycle
The below diagram represents where Data science fits in the MDA.


The end-users consumes data, analytics and the results of Data Science analytics via data centric applications (or apps). A vast majority of these applications today don’t leverage Data Science, Machine Learning or Predictive Analytics. A new generation of enterprise and consumer facing apps are being built to take advantage of Data Science/Predictive Analytics and provide context driven insights to nudge end-users to next set of actions. These apps are called Data Smart Applications.

Tuesday, 18 April 2017


Thanks to Hortonworks Blog and  Yanbo Liang
R is one of the primary programming languages for data science with more than 10,000 packages. R is an open source software that is widely taught in colleges and universities as part of statistics and computer science curriculum. R uses data frame as the API which makes data manipulation convenient. R has powerful visualization infrastructure, which lets data scientists interpret data efficiently.
However, data analysis using R is limited by the amount of memory available on a single machine and further as R is single threaded it is often impractical to use R on large datasets. To address R’s scalability issue, the Spark community developed SparkR package which is based on a distributed data frame that enables structured data processing with a syntax familiar to R users. Spark provides distributed processing engine, data source, off-memory data structures. R provides a dynamic environment, interactivity, packages, visualization. SparkR combines the advantages of both Spark and R.
In the following section, we will illustrate how to integrate SparkR with R to solve some typical data science problems from a traditional R users’ perspective.


SparkR’s architecture consists of two main components as shown in this figure: an R to JVM binding on the driver that allows R programs to submit jobs to a Spark cluster and support for running R on the Spark executors.
Operations executed on SparkR DataFrames get automatically distributed across all the nodes available on the Spark cluster.

Wednesday, 12 October 2016

Getting to Know the Apache Hadoop 3 Alpha

Source: Cloudera Blog
This is article about Hadoop 3.x version release from Cloudera Blog post

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.
Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process.
The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1.
Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

Monday, 10 October 2016


Thanks to Hortonworks blog resource -- Carter Shanklin && Nita Dembla

Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP(Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.
In this blog, we’ll update benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More.


Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...