Posts

Showing posts from May, 2015

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

this post is from cloudera blog, thanks to IIya Ganelin
Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark. I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk. Capital One’s original use case for Spark was to surface product recommendations for a set of 25 million users and 10 million products, one of the largest datasets available for this type of modeling. Moreover, we had the goal …

Security, Hive-on-Spark, and Other Improvements in Apache Hive 1.2.0

Apache Hive 1.2.0, although not a major release, contains significant improvements. Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0. Shortly thereafter, Apache Hive 1.1.0 was released (renamed from Apache Hive 0.15.0), which included more significant features—including Hive-on-Spark. Last week, the community released Apache Hive 1.2.0. Although a more narrow release than Hive 1.1.0, it nevertheless contains improvements in the following areas: New FunctionalitySupport for Apache Spark 1.3 (HIVE-9726), enabling dynamic executor allocation and impersonationSupport for integration of Hive-on-Spark with Apache HBase (HIVE-10073)Support for numeric partition columns with literals (HIVE-10313HIVE-10307)Support for Union Distinct (HIVE-9039)Support for specifying column list in insert stateme…

Apache Hadoop Infrastructure Considerations and Best Practices

Image
Thanks to Lisa Sensmeier and hortonworks Link
Bit Refinery is a Hortonworks Technical Partner and recently certified with HDP. Bit Refinery is a VMware© Cloud Infrastructure-as-a-Service (IaaS) provider featuring virtualization technology hosted within their fully redundant virtual data centers. Bit Refinery offers a hosted Hortonworks Sandbox providing an easy way to experience and learn Hadoop with ease. All the tutorials available from the Hortonworks Sandbox work just as if you were running a localized version of the Sandbox. Brandon Hieb, Managing Partner at Bit Refinery, is our guest blogger, and in this blog, he provides insight to virtualizing Hadoop infrastructures. Here at Bit Refinery we provide infrastructure for companies large and small which includes a variety of big data applications running on both bare-metal and VMware servers. With this new technology constantly changing, it’s hard to keep up with the different required resources needed which could range from a tradit…