3X FASTER INTERACTIVE QUERY WITH APACHE HIVE LLAP

Thanks to Carter Shanklin & Nita Dembla from Hortonworks for valuable post.
One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.

STARTING OFF: 3X PERFORMANCE GAINS IN HDP 2.6 WITH HIVE LLAP

Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.
Hive Interactive Query

A few key callouts:
  1. HDP 2.6 runs this set of queries in 5155.3 seconds, compared to 14983.4 seconds for HDP 2.5, almost a 3x improvement for HDP 2.6.
  2. HDP 2.6 is faster than HDP 2.5 on almost all queries.
  3. HDP 2.6 is more than 5x faster than HDP 2.5 on 8 queries.
Now let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.

HIVE WORKING SMARTER, NOT HARDER

Hive’s core execution engine and operators are highly optimized, and designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.
Hive - Key Performance Advancements - HDP 2.6

HIVE LLAP SUPPORTS THE FULL SUITE OF 99 TPC-DS QUERIES

Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.) For reference we’ve posted the full set of trivially-modified queries used in the the remainder of this blog post in the updated Hortonworks testbench repository.

COMPARING APACHE HIVE TO APACHE IMPALA

Apache Impala is a SQL-on-Hadoop engine built specifically for interactive query. For this benchmark we compared Hive to Impala using the identical set of 99 trivially-modified TPC-DS queries, which can be found in our GitHub repository.
To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.
Comaring Hive to Impala
This graph breaks down a full comparison of all runtimes, across the 60 queries which could be run in both engines.
Hive versus Impala
When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:
Impala SQL Gaps

These missing features are not obscure SQL features. Instead they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.

HIVE LLAP PERFORMANCE ACROSS ALL 99 QUERIES

For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.
Hive Query Timings

HARDWARE CONFIGURATION (USED IN ALL TEST CONFIGURATIONS)

9 WORKER NODES, EACH WITH:

  • 256 GB RAM
  • Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  • 2x HGST HUS726060AL4210 A7J0 disks for HDFS and YARN storage
  • Cisco VIC 1227 10 Gigabit Network Connection

HDP 2.6 SOFTWARE CONFIGURATION

  • Ambari-managed HDP 2.6.2 stack
  • Hive version = 2.1.0.2.6.2.4-4
  • Tez version = 0.8.4.2.6.2.4-4
  • LLAP container size: 180GB (per node)
  • LLAP Heap Size: 128GB (per daemon)
  • LLAP Cache Size: 32GB (per daemon)
  • OS Setting: net.core.somaxconn set to 16k, ntpd and nscd running
  • Data: ORCFile format, scale 10,000, with daily partitions on fact tables (load scripts available in the GitHub repository).
  • Note: This HDP build will also be available in Hortonworks Data Cloud for AWS as version 1.16.5 within 7 days from this blog publication.

CDH 5.12 SOFTWARE CONFIGURATION

  • CDH-managed CDH 5.12 stack
  • Impala version = 2.9
  • Impala was given all memory on the worker nodes, using all defaults provided by Cloudera Manager.
  • Data: Parquet format, scale 10,000, with daily partitions on fact tables.

Popular posts from this blog

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

INTEGRATE SPARKR AND R FOR BETTER DATA SCIENCE WORKFLOW

How-to: Ingest Email into Apache Hadoop in Real Time for Analysis