WELCOME TO BIGDATATRENDZ      WELCOME TO CAMO      Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop      Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle     

Wednesday, 4 July 2018

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera

Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. This post explains how it works.
HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.
However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space.
Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel initiated and drove the HDFS-EC project under HDFS-7285 in collaboration with the broader Apache Hadoop community. HDFS-EC is currently targeted for release in Hadoop 3.0.
In this post, we will describe the design of HDFS erasure coding. Our design accounts for the unique challenges of retrofitting EC support into an existing distributed storage system like HDFS, and incorporates insights by analyzing workload data from some of Cloudera’s largest production customers. We will discuss in detail how we applied EC to HDFS, changes made to the NameNode, DataNode, and the client read and write paths, as well as optimizations using Intel ISA-L to accelerate the encoding and decoding calculations. Finally, we will discuss work to come in future development phases, including support for different data layouts and advanced EC algorithms.


When comparing different storage schemes, there are two important considerations: data durability (measured by the number of tolerated simultaneous failures) and storage efficiency (logical size divided by raw usage).
Replication (like RAID-1, or current HDFS) is a simple and effective way of tolerating disk failures, at the cost of storage overhead. N-way replication can tolerate up to n-1 simultaneous failures with a storage efficiency of 1/n. For example, the three-way replication scheme typically used in HDFS tolerates up to two failures with a storage efficiency of one-third (alternatively, 200% overhead).
Erasure coding (EC) is a branch of information theory which extends a message with redundant data for fault tolerance. An EC codec operates on units of uniformly-sized data termed cells. A codec can take as input a number of data cells and outputs a number of parity cells. This process is called encoding. Together, the data cells and parity cells are termed an erasure coding group. A lost cell can be reconstructed by computing over the remaining cells in the group; this process is called decoding.
Table 1. XOR (exclusive-or) operations

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog The CDH software stack lets you use your tool of choice with the Parquet file format – – offering the benefits of ...