Apache Tez - Beyond Hadoop (MR)

Thanks to Hadoop Tips
Apache Pig and Hive are higher level abstracts on top of MR (MapReduce). PigLatin scripts of Pig and HiveQL queries of Hive are converted into a DAG of MR jobs, with the first MR job (5) reading from the input the last MR job (2) writing to the output. One of the problem with this approach is that the temporary data between the MR jobs (as in the case of 11 and 9) is written to HDFS (by 11) and read from HDFS (by 9) which leads to inefficiency. Not only this, multiple MR jobs will also lead to initialization over head.
With the ClickStream analysis mentioned in the blog earlier, to find out the `top 3 urls visited by users whose age is less than 16` Hive results in a DAG of 3 MR jobs, while Pig results in a DAG of 5 MR jobs.

Apache Tez (Tez means Speed in Hindi) is a generalized data processing paradigm which tries to address some of the challenges mentioned above. Using Tez along with Hive/Pig, a single HiveQL or PigLatin will be converted into a single Tez Job and not as a DAG of MR jobs as is happening now. In Tez, data processing is represented as a graph with the vertices in the graph representing processing of data and edges representing movement of data between the processing.

With Tez the output of a MR job can be directly streamed to the next MR job  without writing to HDFS. If there are is any failure, the tasks from the last checkpoint will be executed. In the above DAG, (5) will read the input from the disk and (2) will write the output to the disk. The data between the remaining vertices can happen in-memory or can be streamed over the network or can be written to the disk for the sake of checkpointing.

According to the Tez developers in the below video, Tez is for frameworks like Pig, Hive and not for the end users to directly code to Tez.
HortonWorks had been aggressively promoting Tez as a replacement to Hadoop runtime and is also publishing a series of articles on the same (1, 2, 3, 4). A DAG of MR jobs would also run efficiently on a Tez runtime than on a Hadoop runtime.

Work is in progress to run Pig and Hive on the Tez runtime. Here are the JIRAs (Pig and Hive) for the same. Need to wait and see how far it is accepted by the community. In the future we should be able to run Pig and Hive on Hadoop and Tez runtime by just switching a flag in the configuration files.

Tez is a part of a bigger initiative Stinger, which I will cover in another blog entry.


Popular posts from this blog

Cloudera Data Hub: Where Agility Meets Control

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Big Data Trendz