How Impala Scales for Business Intelligence: New Test Results

From the Cloudera Blog: Thanks to Yanpei Chen, Alan Choi, Dileep Kumar, David Rorke, Silvius Rus, and Devadutta Ghat
Impala, the open source MPP query engine designed for high-concurrency SQL over Apache Hadoop, has seen tremendous adoption across enterprises in industries such as financial services, telecom, healthcare, retail, gaming, government, and advertising. Impala has unlocked the ability to use business intelligence (BI) applications on Hadoop; these applications support critical business needs such as data discovery, operational dashboards, and reporting. For example, one customer has proven that Impala scales to 80 queries/second, supporting 1,000+ web dashboard end-users with sub-second response time. Clearly, BI applications represent a good fit for Impala, and customers can support more users simply by enlarging their clusters.
Cloudera’s previous testing already established that Impala is the clear winner among analytic SQL-on-Hadoop alternatives, and we will provide additional support for this claim soon. We also showed that Impala scales across cluster sizes for stand-alone queries. The future roadmap aims to deliver further significant performance improvements.
That said, there is scant public data about how Impala scales across a large range of concurrency and cluster sizes. The results described in this post aim to close that knowledge gap by demonstrating that Impala clusters of 5, 10, 20, 40, and 80 nodes will support an increasing number of users at interactive latency.
To summarize the findings:
  • Enlarging the cluster proportionally increases the number of users supported and the system throughput.
  • Once the cluster saturates, adding more users leads to a proportional increase in query latency.
  • Scaling is bound by CPU, memory capacity, and workload skew.
The following describes the technical details surrounding our results. We’ll also cover the relevant metrics, desired behavior, and configurations required for concurrency scalability for analytic SQL-on-Hadoop. As always, we strongly encourage you to do your own testing to confirm these results, and all the tools you need to do so have been provided.

Test Setup

Cluster
For this round of testing, we used a 5-rack, 80-node cluster. To investigate behavior across different cluster sizes, we divided these machines into clusters of 5, 10, 20, 40, and 80 nodes. Each node has hardware considered typical for Hadoop:

  • CPU: Dual-socket Intel Xeon E5-2630L at 2.00GHz (12 cores, 24 threads per node)
  • Disk: 12 Hewlett-Packard 2TB, 7,200 RPM SATA spinning disks
  • Memory: 64GB RAM (below what Cloudera recommends to demonstrate that Impala scales out well even with moderate resources per node)
  • Network: 10 Gbps Ethernet
The cluster runs on RHEL 6.4 with CDH 5.3.3 and Impala 2.1.3, which were the newest CDH and Impala versions available when this project began. Every worker node runs an Impala Daemon and an HDFS DataNode.
Workload
The workload is derived from TPC-DS. Although TPC-DS more closely mimics reporting use cases than self-service BI and analytics, it is publicly available and thus allows others to reproduce our results.
  • Data schema: Generated by TPC-DS data generator
  • Data size: 15TB (scale factor 15,000 in TPC-DS data generator)
  • Data format: Apache Parquet file, Snappy compression
  • Queries: The “Interactive” queries from our previous blog posts about Impala performance. (See our post from May 2013 for details and GitHub for the queries themselves.)
  • Load: Reproduces the spirit of the TPC-DS “throughput run”
    • A configurable number of concurrent, continuous query streams
    • Each stream submits queries one after another
    • Different streams run the set of all queries once in randomized order
Each query stream corresponds to a user. This workload mimics a scenario where many concurrent users continuously submit queries in random order, one after another. This concurrency model is actually more demanding than is typical in real life, where there is usually a gap between successive queries as users spend some time thinking about query output before writing a new query.
We intentionally constrained the query set to interactive queries for two reasons:
  • An analysis of customer workloads across industry verticals indicates that interactive queries make up more than 90% of queries for analytic SQL-on-Hadoop. Thus, concurrency scale will be driven by the concurrency level of interactive queries.
  • Each of these interactive queries touches 80GB of data on average after partition pruning. These queries are non-trivial.
We recorded the average of three repeated measurements for each concurrency setting.
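To make the load model concrete, the following is a minimal sketch of a driver for these concurrent query streams. It is an illustration rather than the harness we used: it assumes the impyla Python client (impala.dbapi), a hypothetical coordinator host name, and a placeholder list of the Interactive query strings.

    import random
    import threading
    import time
    from impala.dbapi import connect   # impyla client, assumed available

    IMPALAD_HOST = "impalad.example.com"           # hypothetical coordinator host
    INTERACTIVE_QUERIES = ["SELECT ...", "..."]    # placeholder for the Interactive query set

    def run_stream(stream_id, results):
        # One "user": run every query once, back to back, in randomized order.
        conn = connect(host=IMPALAD_HOST, port=21050)
        cursor = conn.cursor()
        for sql in random.sample(INTERACTIVE_QUERIES, len(INTERACTIVE_QUERIES)):
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()                      # drain results so the query completes
            results.append((stream_id, time.time() - start))
        conn.close()

    def run_concurrent_streams(num_users):
        # Launch num_users query streams and collect per-query latencies.
        results = []
        threads = [threading.Thread(target=run_stream, args=(i, results))
                   for i in range(num_users)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results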
Performance Goals
A system with good concurrency scalability should have the following characteristics:
  • Uses all available hardware resources: As more users access the cluster, query throughput maximizes at a saturation point where some cluster hardware resource is fully utilized.
  • High performance: At saturation point, query latency is low and throughput is high.
  • Scalable in the number of users: Adding users after saturation leads to proportionally increasing latency without compromising throughput.
  • Scalable in cluster size: Adding hardware to the system leads to proportionally increasing throughput and decreasing latency.

Results

Users Supported Scales with Cluster Size
For each cluster size, we continuously added users until latency rose beyond a set threshold of “interactive latency,” and we recorded the number of users at which latency crossed that threshold.
For a given threshold, increasing the cluster size increases the number of users supported, and the general shape of the curves is the same across thresholds. For a fixed cluster size, latency rises gradually as users are added, so the cluster supports many users before a given threshold is exceeded.
This result is impressive. It supports the contention that to maintain low query latency while adding more users, you would simply add more nodes to the cluster.
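Expressed in code, the sweep looks roughly like the sketch below. It reuses the hypothetical run_concurrent_streams driver from the Workload section and compares median latency against the threshold; our actual measurements used the average of three repeated runs per concurrency setting.

    import statistics

    def users_supported(latency_threshold_s, max_users=500, step=5):
        # Keep adding users until latency crosses the interactive threshold,
        # then report the last concurrency level that stayed within it.
        supported = 0
        for num_users in range(step, max_users + 1, step):
            latencies = [lat for _, lat in run_concurrent_streams(num_users)]
            if statistics.median(latencies) > latency_threshold_s:
                break
            supported = num_users
        return supported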
Saturation Throughput Scales with Cluster Size
Cluster saturation throughput is another important performance metric. When there is a large number of users (the cluster is saturated), you want larger clusters to run through the queries proportionally faster. The results indicate that this is indeed the case with Impala.
Again, the results indicate that increasing the cluster size will allow you to increase the cluster saturation throughput.
Cluster Behavior as More Users are Added
A different view of the data reveals distinct cluster operating regions as more users are added. The results below show query latency on the 80-node cluster as we progressively add more users.
There is initially a region where the cluster is under-utilized, and query latency remains constant even as we add more users. At the extreme right of the graph, there is a saturation region, where adding more users results in proportionally longer query latency. There is also a transition region in between.
The shape of the graph on the right side is important because it indicates gracefully degrading query latency. So although fluctuations in real-life workloads can often take the cluster beyond saturation, in those conditions, query latencies would not become pathologically high.
Performance Limited by CPU and Memory Capacity
The cluster is both CPU and memory bound. Thus, Impala is efficiently utilizing the available hardware resources.
The graph below on the left shows CPU utilization across the 40-node cluster while running 40 concurrent users (query streams), at which point the cluster is saturated. The cluster is CPU-bound on several nodes that show 90% or higher CPU utilization. The variation in CPU utilization is due to execution skew (more about that below).
The graph on the right shows memory utilization, which is more uniform across the cluster and caps at around 90% utilization.
Why Admission Control is Necessary
Previously you saw that when we add more users to a cluster, we get gracefully degrading query latency. We achieve this behavior by configuring Admission Control.
Impala’s Admission Control feature maintains the cluster in an efficient and not-overwhelmed state by limiting the number of queries that can run at the same time. When users submit queries beyond the limit, the cluster puts the additional queries in a waiting queue. More users means a longer queue and longer waiting time, but the actual “running time” is the same. That is how we achieve graceful latency degradation.
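As a toy illustration of why latency degrades gracefully rather than collapsing, assume the cluster admits at most a fixed number of queries at a time and every query runs for roughly the same time; the latency seen by the last user then grows in proportion to the number of queue “waves” ahead of it. The numbers below are made up purely for illustration.

    def expected_latency_s(num_users, admission_limit, run_time_s):
        # Toy queueing model: queries run in "waves" of at most admission_limit
        # at a time, so the last query admitted waits for earlier waves to finish.
        if num_users <= admission_limit:
            return run_time_s                          # under-utilized: no queueing
        waves = -(-num_users // admission_limit)       # ceiling division
        return waves * run_time_s

    # Illustrative numbers only: a limit of 20 concurrent queries, 5-second queries.
    for users in (10, 20, 40, 80, 160):
        print(users, "users ->", expected_latency_s(users, 20, 5.0), "s")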
The following is a conservative heuristic to set the admission control limit. It requires running typical queries in the workload in a stand-alone fashion, then finding the per-node peak memory use from the query profiles.
Set admission control limit =
    (RAM size - headroom for OS and other CDH services)
    / (max per-node peak memory use across all queries)
    * (safety factor < 1)
We used 5GB of headroom for the OS and other CDH services, and a safety factor of 0.7.
On larger clusters, the same queries result in lower values for the per-node peak memory use, and this heuristic will give higher limits for Admission Control. Again, this heuristic is conservative.
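As a worked example, here is a direct translation of the heuristic with the numbers above (64GB of RAM per node, 5GB of headroom, safety factor 0.7). The 8GB maximum per-node peak memory is a made-up value; in practice it should come from the query profiles of the stand-alone runs.

    def admission_control_limit(ram_gb, headroom_gb, max_per_node_peak_gb, safety=0.7):
        # Conservative heuristic: usable memory divided by the worst-case per-node
        # peak memory of any single query, scaled by a safety factor below 1.
        usable_gb = ram_gb - headroom_gb
        return int(usable_gb / max_per_node_peak_gb * safety)

    # 64GB RAM, 5GB headroom, and a 0.7 safety factor are from the text;
    # the 8GB per-node peak is hypothetical.
    print(admission_control_limit(ram_gb=64, headroom_gb=5, max_per_node_peak_gb=8))   # prints 5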
Scaling Overhead and Execution Skew
A query engine that scales simplifies cluster planning and operations: To add more users while maintaining query latency, just add more nodes.
However, the scaling is not perfectly linear. In our tests, increasing the cluster size by Nx resulted in a somewhat less than Nx increase in the number of users supported at the same latency.
One reason for this scaling overhead is skew in how the workload is executed. This skew is visible in the CPU utilization graph above, where some nodes are CPU-bound while others have spare CPU capacity.
The graph below shows what happens when we vary the cluster size. On small clusters, there is almost no gap between maximum and minimum CPU utilization across the cluster. On large clusters, there is a large gap. The overall performance is bottlenecked on the slowest node. The bigger the gap, the bigger the skew, and the bigger the scaling overhead.
The primary source of execution skew is the fact table scan HDFS operator, and it arises from uneven data placement across nodes.
The fact table is partitioned by date. Many of the queries filter by date ranges within a month, so all but roughly 30 partitions are pruned. Data in each partition is stored as Parquet files, each occupying a 256MB HDFS block. Most partitions have three blocks, and the average is 4.7 blocks per partition (see graph below).
The net effect is that after partition pruning, around 30 partitions remain and 90-150 Parquet blocks are scanned across the cluster. On an 80-node cluster, most nodes will scan one or two blocks, while some nodes end up scanning three or more blocks and others scan none. This is a source of heavy skew. Smaller clusters have more blocks per node on average, which statistically smooths out the node-to-node variation.
We verified the behavior by examining the query profiles. This skew impacts all queries, and propagates through the rest of the query execution after the fact table scan HDFS operator.
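The block arithmetic behind this skew is easy to sketch. Using the numbers quoted above (roughly 30 partitions surviving pruning, 3 to 4.7 blocks per partition), a rough back-of-the-envelope calculation shows how quickly the per-node block count shrinks as the cluster grows:

    # Back-of-the-envelope view of fact-table scan skew after partition pruning.
    PARTITIONS_AFTER_PRUNING = 30
    BLOCKS_PER_PARTITION_LOW, BLOCKS_PER_PARTITION_AVG = 3, 4.7

    for nodes in (5, 10, 20, 40, 80):
        low = PARTITIONS_AFTER_PRUNING * BLOCKS_PER_PARTITION_LOW / nodes
        high = PARTITIONS_AFTER_PRUNING * BLOCKS_PER_PARTITION_AVG / nodes
        print(f"{nodes:3d} nodes: {low:.1f} to {high:.1f} blocks per node on average")

At 5 nodes each node averages roughly 18 to 28 blocks, so node-to-node variation washes out; at 80 nodes the average drops to between one and two blocks, which is why random placement produces visible skew only on the larger clusters.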

Summary and Recommendations

For customers running BI applications, a good analytic SQL-on-Hadoop backend should have the following properties:
  • Scales better to large clusters. As datasets and the number of users grow, clusters will also grow. Solutions that can prove themselves at large cluster sizes are better.
  • Achieves fast, interactive latency. This enables human specialists to explore the data, discover new ideas, then validate those ideas, all without waiting and losing their train of thought.
  • Makes efficient use of hardware – CPU as well as memory. One can always buy bigger machines and build larger clusters. However, an efficient solution will support more users from the hardware available.
  • Simplifies planning. Adding more users should not require a complete redesign of the system or a wholesale migration to a different platform. Supporting more users should be a simple matter of adding more nodes to your cluster.
The results presented here show that Impala achieves the above goals. In these tests, we found that Impala shows good scaling behavior, with increasing cluster sizes being able to support increasing throughput and number of users. For this workload, Impala performance is limited by available CPU and memory capacity, and skew in how data is placed across the cluster. These results fall in line with the scaling behavior that most customers see.
We also have some general recommendations for analytic SQL-on-Hadoop software and hardware vendors:
  • Concurrency scale should get more attention. BI use cases highlight the need to design for a large number of users. The technical challenges at high concurrency, such as load balancing to address execution skew, are only starting to be discovered.
  • Admission Control is an important design point. Our heuristic is based on the number of queries, and would be less effective when some queries are much “bigger” than others and require far more processing.
  • Analytic SQL-on-Hadoop engines should prioritize CPU efficiency. Each user adds incremental CPU demands, so an engine should aim to execute each query with as little CPU work as possible.
  • Hardware needs memory as well as CPU capacity. CPU capacity is necessary as just discussed. Memory capacity is also needed, because data sizes and working sets will increase over time.
Overall, concurrency and cluster scalability involves an interplay between hardware properties, software configurations, and the workload to be serviced.
We have seen only the tip of the iceberg for this complex problem space. Look for our future posts for more information!
(An abbreviated version of these results was presented at Strata+ Hadoop World London 2015.)
Acknowledgements
The authors would like to thank Arun Singla for contributing to an earlier version of this work, and Ippokratis Pandis for in-depth comments during review.

Happy Hadooping!
