WELCOME TO BIGDATATRENDZ      WELCOME TO CAMO      Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop      Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle     

Video Bar


Sunday, 13 September 2015

Meet Cloudera’s Apache Spark Committers

From Cloudera Blog, thanks to    for valuable post in cloudera blog.
The super-active Apache Spark community is exerting a strong gravitational pull within the Apache Hadoop ecosystem. I recently had that opportunity to ask Cloudera’s Apache Spark committers (Sean Owen, Imran Rashid [PMC], Sandy Ryza, and Marcelo Vanzin) for their perspectives about how the Spark community has worked and is working together, and the work to be done via the One Platform initiative to make the Spark stack enterprise-ready.
Recently, Apache Spark has become the most currently active project in the Apache Hadoop ecosystem (measured by number of contributors/commits over time), if not the entire ASF. Why do you think that is?
Owen: Partly because of scope: Apache Spark has been many sub-projects under an umbrella from the start, some large and complex in their own right, and has tacked on several more in just the last six months. Culture is another reason for it; even small changes are tracked in JIRA as a form of coordination and documentation of the history of changes, and big tasks broken down into many commits. But I think Apache Spark has attracted so much contribution because the time has been right for a fresh take on processing and streaming in the orbit of Hadoop, and there is plenty of pent-up interest from those who work with Hadoop in participating in an important second-generation processing paradigm project in this space.
Rashid: It’s a combination of timing and culture. MapReduce had been around for a while, and there was growing frustration at both clunky API and the high overhead. Also, from the early days, Apache Spark has encouraged community involvement. I remember when I was using Spark 0.5, long before it was an Apache project, and in conversations Matei immediately suggested that I should submit patches. Before long, I had submitted a few bug fixes and new features. Of course, the project has gotten a lot bigger now, so its much harder to make changes, especially to the core. But I think there is still that sense that anybody can submit patches.

How Impala Scales for Business Intelligence: New Test Results

From Clodera Blog: Thanks to Yanpei Chen, Alan Choi, Dileep Kumar, David Rorke, Silvius Rus, and Devadutta Ghat
Impala, the open source MPP query engine designed for high-concurrency SQL over Apache Hadoop, has seen tremendous adoption across enterprises in industries such as financial services, telecom, healthcare, retail, gaming, government, and advertising. Impala has unlocked the ability to use business intelligence (BI) applications on Hadoop; these applications support critical business needs such as data discovery, operational dashboards, and reporting. For example, one customer has proven that Impala scales to 80 queries/second, supporting 1,000+ web dashboard end-users with sub-second response time. Clearly, BI applications represent a good fit for Impala, and customers can support more users simply by enlarging their clusters.
Cloudera’s previous testing already established that Impala is the clear winner among analytic SQL-on-Hadoop alternatives, and we will provide additional support for this claim soon. We also showed that Impala scales across cluster sizes for stand-alone queries. Future roadmap also aims to deliver significant performance improvements.
That said, there is scant public data about how Impala scales across a large range of concurrency and cluster sizes. The results described in this post aim to close that knowledge gap by demonstrating that Impala clusters of 5, 10, 20, 40, and 80 nodes will support an increasing number of users at interactive latency.
To summarize the findings:
  • Enlarging the cluster proportionally increases the users supported and the system throughput.
  • Once the cluster saturates, adding more users leads to proportional increase in query latency.
  • Scaling is bound by CPU, memory capacity, and workload skew.
The following describes the technical details surrounding our results. We’ll also cover the relevant metrics, desired behavior, and configurations required for concurrency scalability for analytic SQL-on-Hadoop. As always, we strongly encourage you to do your own testing to confirm these results, and all the tools you need to do so have been provided.

Test Setup

For this round of testing, we used a 5-rack, 80-node cluster. To investigate behavior across different cluster sizes, we divided these machines into clusters of 5, 10, 20, 40, and 80 nodes. Each node has hardware considered typical for Hadoop:

Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...