Meet Cloudera’s Apache Spark Committers

Reposted with thanks from the Cloudera Blog.
The super-active Apache Spark community is exerting a strong gravitational pull within the Apache Hadoop ecosystem. I recently had the opportunity to ask Cloudera’s Apache Spark committers (Sean Owen, Imran Rashid [PMC], Sandy Ryza, and Marcelo Vanzin) for their perspectives on how the Spark community has worked and is working together, and on the work to be done via the One Platform initiative to make the Spark stack enterprise-ready.
Recently, Apache Spark has become the most active project in the Apache Hadoop ecosystem (measured by number of contributors/commits over time), if not the entire ASF. Why do you think that is?
Owen: Partly because of scope: Apache Spark has been many sub-projects under an umbrella from the start, some large and complex in their own right, and has tacked on several more in just the last six months. Culture is another reason: even small changes are tracked in JIRA as a form of coordination and documentation of the history of changes, and big tasks are broken down into many commits. But I think Apache Spark has attracted so much contribution because the time has been right for a fresh take on processing and streaming in the orbit of Hadoop, and there is plenty of pent-up interest among those who work with Hadoop in participating in an important second-generation processing paradigm project in this space.
Rashid: It’s a combination of timing and culture. MapReduce had been around for a while, and there was growing frustration with both the clunky API and the high overhead. Also, from the early days, Apache Spark has encouraged community involvement. I remember when I was using Spark 0.5, long before it was an Apache project, and in conversations Matei immediately suggested that I should submit patches. Before long, I had submitted a few bug fixes and new features. Of course, the project has gotten a lot bigger now, so it’s much harder to make changes, especially to the core. But I think there is still that sense that anybody can submit patches.

Ryza: I think one missing consideration in analyses like these is that Spark is a young project compared to projects like core Apache Hadoop. While it would be a far stretch to say that HDFS, MapReduce, and YARN are “done,” most of their core functionality has been built, gone through multiple revisions, and ultimately stabilized. Core Spark is just now emerging from its protean stage, and many of Spark’s subprojects are still there. So, put simply, in Spark, there’s more work to be done.
This is of course not to minimize the fact that there’s enormous momentum around Spark. If I had to pin down a couple of reasons that it’s been so successful at attracting contributors, I’d say first that its users are developers, and that its developer focus is a strong point. Developers get excited to give back, and have more ability to do so than the users of tools like Apache Hive. On a more zoomed-out level, Spark has taken a really bold approach to satisfying a broad set of needs in the ecosystem, and I think people are excited about being part of something that’s more than just incremental improvement.
What are your personal impressions about the Apache Spark community and how it works together?
Rashid: I haven’t been involved in such a large project before. It’s great to see so many contributions from the community, but I also feel the pain of how long it can take to get pull requests reviewed and committed. (And I feel especially bad since I’m part of the problem as a committer!) But despite how much bigger the community has become, and how much more complex the code is now, Spark keeps pulling in more and more contributors, which is fantastic. It certainly wouldn’t be where it is today without the patches, bug reports, testing, and discussions from everyone.
Still, I worry sometimes that ordinary users have it rough: If you aren’t prepared to dive into Spark internals, it can be hard to understand why things don’t work the way you expect. I keep having conversations with customers where things start failing for them as they try to go beyond the simple demos. Often they feel like they must be really stupid, since they only hear positive things about Spark, and I explain that no, it’s not them—it’s easy to get stuck on the details of lazy evaluation, it is hard to understand why some jobs are slow and some are fast, it is hard to understand how accumulators work, and it is hard to configure memory.
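The lazy-evaluation surprise Rashid describes can be illustrated with a plain-Python analogy (this is not Spark code; Python generators merely behave like Spark transformations in that they defer all work until something forces it):

```python
# Analogy for Spark's lazy evaluation: generator pipelines, like RDD
# transformations, do no work when defined - only when consumed.

def parse(records):
    # "Transformation": builds a lazy pipeline; nothing runs yet.
    return (int(r) for r in records)

pipeline = parse(["1", "2", "oops", "4"])  # no error raised here!

# Only the "action" (consuming the pipeline) triggers the work, so the
# bad record fails far from where the pipeline was defined - which is
# why stack traces can point somewhere surprising.
results = []
try:
    for value in pipeline:
        results.append(value)
except ValueError:
    pass  # the failure surfaces here, not at parse() time

print(results)  # only the records before the bad one were processed
```

In Spark the same pattern plays out across a cluster: a typo in a `map` function is only reported when an action such as `count` or `collect` runs, which is one reason failures “beyond the simple demos” can be so disorienting.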
Owen: I find it active and vibrant, but often disorganized in a way you might expect from a big project. It feels like at least three to six sub-communities. There’s so much going on that, although lots of contributions from lots of people are going in, an equally record-breaking number of questions and JIRAs fall through the cracks. Spark’s community has deployed or built plenty of tools to help speed the process of contributing (docs, extensive tests, intelligent integration tests, committer scripts, and so on) yet still looks bottlenecked and overwhelmed at times. The scale of the community is both a blessing and an obstacle of a kind rarely observed, and it’s fascinating to watch it play out.
What are some examples of the most interesting things on the roadmap, and in what ways are you contributing to help make Spark enterprise-ready?
Vanzin: Apache Spark is reaching a state in which it provides a sufficiently powerful abstraction that Spark itself can start to focus inward. So instead of big new features, I’d expect a lot more work targeted at ease of use/debugging, scalability, security, and performance at scale. And that is a good thing; it means that users will get to work with a really stable interface, while the inner workings of the framework are constantly being fine-tuned and improved.
Ryza: Coming from a Hadoop background, my focus recently has been on the last few weak points where Apache Spark lags behind MapReduce, in the hope that someone porting their workload over to Spark will never see regressions or complications. To that end, some larger work I did earlier this year was to enable Spark’s shuffle to operate directly on serialized data (SPARK-4550), which makes it less profligate in its memory consumption and more stable overall. I’ve also been working and reviewing a lot around dynamic allocation (e.g. SPARK-4352 and SPARK-4316), which allows Spark to be fluid in its resource consumption in the way that MapReduce is.
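A sketch of the idea behind operating on serialized data (this is an illustration of the general technique, not Spark’s actual SPARK-4550 implementation): if keys are encoded so that byte-wise order matches logical order, a shuffle can sort serialized blobs directly, skipping the cost and garbage of deserializing every record.

```python
# Illustrative sketch: sorting serialized records without deserializing
# them, by using a key encoding whose byte order matches numeric order.
import struct

def encode(key, payload):
    # Big-endian unsigned int: lexicographic byte order == numeric order.
    return struct.pack(">I", key) + payload.encode()

records = [encode(k, p) for k, p in [(300, "c"), (5, "a"), (42, "b")]]

# Sort the raw bytes directly - no per-record deserialization, so less
# object churn and more predictable memory use.
records.sort()

decoded_keys = [struct.unpack(">I", r[:4])[0] for r in records]
print(decoded_keys)  # keys come out in numeric order: [5, 42, 300]
```

The payoff in a real shuffle is that records stay as compact byte arrays from map output to reduce input, which is where the memory-consumption and stability gains Ryza mentions come from.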
Rashid: I’m excited about improving the scalability of Spark. There is a different set of bottlenecks when you want to optimize working with 100GB on 10 nodes vs. working with 100TB on 1,000 nodes. Also, when you get to that scale, fault tolerance becomes far more important; there were a number of very serious bugs that were only recently discovered, with SPARK-8103 already merged and SPARK-5259, SPARK-5945, and SPARK-8029 in progress. A lot of committers feel that the scheduler code is really nasty and hard to change, which most likely means there are more lingering bugs in there. At Cloudera, our own testing has helped us discover some of these issues, but we know there is more to do.
I’m also really interested in features that make it easier for the end user to understand what Apache Spark is doing, and how to debug and optimize their jobs. I’ve got tons of ideas for this, but unfortunately I’m terrible at developing UIs. But here’s an area where I’m hoping to leverage that wonderful community of Spark developers (and I’ve already started with SPARK-9516)!
Owen: The less-obvious thing that actually intrigues me is, what comes after Apache Spark? Spark 1.x that is. So much has changed and been learned to date in 1.x, while needing to maintain a lot of compatibility, that it’s interesting to speculate about what a 2.x could do differently without this constraint.
Based on those things, do you foresee new use cases for Apache Spark on the horizon?
Owen: I suspect that Spark’s next phase will turn to streamlining, integration, and bug fixing rather than big-bang new sub-projects that would change its use cases significantly. That is, it will be more about ‘hardening’ support for existing use cases.
Vanzin: New use cases are a natural result of the framework’s extensibility and ease of use. Spark provides a really good building block for all sorts of applications, and there are few limits to what can be achieved. You can already see many external projects that build on the existing API, providing new data sources for Spark SQL, integration with all sorts of external systems, creative uses of Spark’s monitoring capabilities, among others. Improving Spark itself means that more people will see it as a great ecosystem on which to develop tools and applications, and that’s really the main goal.
Ryza: Apache Spark is slowly but surely coming into its own as a production-grade ETL tool. Since Cloudera brought Spark into the platform in early 2014, I’ve been telling customers, “Spark is tackling your exotic problems in complex analytics, but I’d be careful about porting over your daily job that cleans a petabyte of data.” Soon, the latter will be a reasonable thing to do.
You work closely with Intel’s Spark contributors/committers. How are joint contributions planned and executed?
Owen: This is carried out on the project mailing list and JIRA, aside from some informal side conversations. It’s not just a good idea to discuss in the open, but necessary to get early feedback and buy-in.
Vanzin: I want to emphasize Sean’s response: The most important thing for the community of an open source project is to keep the community involved, meaning that everybody should have the chance to provide input about the project’s direction.
Ryza: Agreed with Sean and Marcelo. I think what the partnership comes down to is that we get together and agree on some roadmap items that we’re interested in jointly pursuing. One recent example of this was incorporating locality preferences in the executor requests filed through dynamic allocation (SPARK-4352). Then we give attention to those JIRAs and help review and move things forward.
What is your advice for people who are interested in contributing to Spark, but are intimidated by the size/activity of the community?
Owen: Reading the wiki page about contributions is a must. Don’t think you must or even should start by opening a pull request; warm up by helping on the mailing list to understand the dynamics of the project. The good news is that “good” changes (impactful, simple, tested, clearly explained) are generally merged very quickly.
Rashid: I’d add that bug reports, especially with reproductions, are extremely useful. I know that is not as appealing; everyone feels much better about submitting actual changes to the core. But at least for me, if a contributor can clearly demonstrate the issue they are trying to fix, I’m happy to work with them to develop the eventual complete patch.
It’s also helpful to very clearly describe why the change is being made. I’ve reviewed a few patches, where at first I think they are trying to fix bug X, but without a test case. And only after some further rounds of review and my eventual prodding for a test case, does it become clear that they are actually trying to add a new feature Y, and it turns out that feature is actually undesirable because of how it interacts with the rest of Spark, so the contribution isn’t accepted. I always feel bad when that happens after a long process, instead of up front.
Finally, I want to emphasize that this is especially important for non-native English speakers. It’s great that Apache Spark has a worldwide community, but committer-ship is biased toward English speakers. A reproduction or a test case is especially helpful in getting past language barriers.
Vanzin: Feeling intimidated by the size and the amount of activity is not a privilege of those outside of the community. Even someone who’s spent a long time working on certain areas of Spark feels intimidated when they need to delve into a new part of the source base. So, the first advice is not to feel overwhelmed; try to find an area where you feel you can apply your skills and focus on it. And, as Imran mentions, bringing good engineering practices—good, reproducible issues, writing test cases, being responsive to feedback—goes a long way in building reputation.

Happy Hadooping!

