
Sunday, 31 May 2015

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

This post is from the Cloudera blog; thanks to Ilya Ganelin.
Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.
I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk.
Capital One’s original use case for Spark was to surface product recommendations for a set of 25 million users and 10 million products, one of the largest datasets available for this type of modeling. Moreover, we had the goal of considering all possible products, specifically to capture the “long tail” (AKA less frequently purchased items). That problem is even harder, since to generate all possible pairings of user and products, you’d have 250e12 combinations—which is more data than you can store in memory, or even on disk. Not only were we ingesting more data than Spark could readily handle, but we also had to step away from the standard use case for Spark (batch processing with RDDs) and actually decompose the process of generating recommendations into an iterative operation.
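To see why the full cross product is infeasible, here is a rough back-of-the-envelope sketch. The user and product counts are from the post; the 12-bytes-per-pair figure (two 4-byte IDs plus a 4-byte score) is an illustrative assumption, not something the post specifies:

```python
# Why all (user, product) pairs cannot be materialized:
# 25M users x 10M products, from the post above.
users = 25_000_000
products = 10_000_000

pairs = users * products  # 250e12 candidate pairs, matching the post

# Assumption: a minimal 12 bytes per pair (two 4-byte ints + 4-byte score).
bytes_needed = pairs * 12

print(f"{pairs:.2e} pairs")                      # → 2.50e+14 pairs
print(f"{bytes_needed / 1e15:.1f} PB minimum")   # → 3.0 PB minimum
```

Even under this optimistic per-pair size, the naive cross product needs petabytes, which is why the recommendation job had to be decomposed into an iterative operation rather than run as one batch.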
Learning how to properly configure Spark, and use its powerful API in a way that didn’t cause things to break, taught me a lot about its internals. I also learned that at the time, there really wasn’t a consolidated resource that explained how those pieces fit together. The end goal of Spark is to abstract away those internals so the end-user doesn’t have to worry about them, but at the time I was using it (and to some degree today), to write efficient and functional Spark code you need to know what’s going on under the hood. This blog post is intended to reveal just that: to teach the curious reader what’s happening, and to highlight some simple and tangible lessons for writing better Spark programs.
Note: this post is not intended as a ground-zero introduction. Rather, the reader should have some familiarity with Spark’s execution model and the basics of Spark’s API.

Security, Hive-on-Spark, and Other Improvements in Apache Hive 1.2.0

Apache Hive 1.2.0, although not a major release, contains significant improvements.
Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.
Shortly thereafter, Apache Hive 1.1.0 was released (renamed from Apache Hive 0.15.0), which included more significant features—including Hive-on-Spark.
Last week, the community released Apache Hive 1.2.0. Although a narrower release than Hive 1.1.0, it nevertheless contains improvements in the following areas:

New Functionality

  • Support for Apache Spark 1.3 (HIVE-9726), enabling dynamic executor allocation and impersonation
  • Support for integration of Hive-on-Spark with Apache HBase (HIVE-10073)
  • Support for numeric partition columns with literals (HIVE-10313, HIVE-10307)
  • Support for Union Distinct (HIVE-9039)
  • Support for specifying column list in insert statement (HIVE-9481)
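As an illustration of the Spark 1.3 support above, a minimal hive-site.xml sketch for enabling Hive-on-Spark with dynamic executor allocation might look like the following. The property names are standard Hive and Spark configuration keys; the memory value is a placeholder, not a recommendation from the release notes:

```xml
<!-- Sketch: run Hive queries on Spark and let Spark scale executors
     up and down (dynamic allocation requires the external shuffle service). -->
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.dynamicAllocation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.shuffle.service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>4g</value> <!-- placeholder; tune per cluster -->
</property>
```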

Performance and Optimizations


Usability and Stability

For a larger but still incomplete list of features, improvements, and bug fixes, see the release notes. (Most of the Hive-on-Spark JIRAs are missing from the list.)
The most important improvements and fixes above (such as those involving security, for example) are already available in CDH 5.4.x releases. As another example, CDH users have been testing the Hive-on-Spark public beta since its first release, as well as improvements made to that beta in CDH 5.4.0.
We’re looking forward to working with the rest of the Apache Hive community to drive the project continually forward in the areas of SQL functionality, performance, security, and stability!

Sunday, 24 May 2015

Apache Hadoop Infrastructure Considerations and Best Practices

Thanks to Lisa Sensmeier and Hortonworks.

Bit Refinery is a Hortonworks Technical Partner and recently certified with HDP. Bit Refinery is a VMware® Cloud Infrastructure-as-a-Service (IaaS) provider featuring virtualization technology hosted within their fully redundant virtual data centers. Bit Refinery offers a hosted Hortonworks Sandbox, providing an easy way to experience and learn Hadoop. All the tutorials available for the Hortonworks Sandbox work just as if you were running a local copy of the Sandbox.
Brandon Hieb, Managing Partner at Bit Refinery, is our guest blogger, and in this blog, he provides insight to virtualizing Hadoop infrastructures.
Here at Bit Refinery we provide infrastructure for companies large and small, including a variety of big data applications running on both bare-metal and VMware servers. With this technology constantly changing, it’s hard to keep up with the varying resource requirements, which can range from a traditional Hadoop node to an in-memory application such as Apache Ignite. In this blog post, we hope to provide some valuable insight on infrastructure guidelines based on our experience and current big data customers.

Do:

  • Embrace redundancy, use commodity servers
  • Start small and stay focused
  • Monitor, monitor and monitor
  • Create a data integration process
  • Use compression
  • Build multiple environments (Dev, Test, Prod)
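The "use compression" tip above can be made concrete with a mapred-site.xml fragment. The property names below are standard Hadoop 2.x settings; the choice of Snappy as the codec is an illustrative assumption, not a Bit Refinery recommendation:

```xml
<!-- Compress intermediate map output to cut shuffle I/O. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Compress final job output as well. -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
```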


Don’t:

  • Mix master nodes with data nodes
  • Virtualize data nodes
  • Overbuild
  • Panic – Google is your friend
  • Let the Wild Wild West take over!

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog. The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of ...