UPDATES


Monday, 25 September 2017

3X FASTER INTERACTIVE QUERY WITH APACHE HIVE LLAP

Thanks to Carter Shanklin and Nita Dembla of Hortonworks for this valuable post.
One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit, you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger, who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast, who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.

STARTING OFF: 3X PERFORMANCE GAINS IN HDP 2.6 WITH HIVE LLAP

Let’s start with a summary of runtimes for Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9-node cluster (details at the end of the post), using queries from the TPC-DS suite as in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries required rewrites of the source TPC-DS queries; those rewrites are no longer needed in HDP 2.6, so we refer to them as “legacy queries”.
Hive Interactive Query

YINCEPTION: A YARN BASED CONTAINER CLOUD AND HOW WE CERTIFY HADOOP ON HADOOP

Thanks to Hortonworks Team for the valuable post.
In this post, we deep dive into something that we are extremely excited about – running a container cloud on YARN! We have been using this next-generation infrastructure for more than a year to run all of Hortonworks’ internal CI/CD infrastructure.
With this, we can now run Hadoop on Hadoop to certify our releases! Let’s dive right in!

CERTIFYING HORTONWORKS PLATFORMS

The introductory post on Engineering @ Hortonworks gave readers an overview of the scale of challenges we face in delivering an enterprise-ready data platform.
Essentially, for every new release of a platform, we provision Hadoop clusters on demand, with specific configurations like authentication on/off, encryption on/off, DB combinations, and OS environments, run a bunch of tests to validate changes, and shut them down.
And we do this over and over, day in and day out, throughout the year.
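The provisioning loop described above can be sketched as a configuration matrix: every release is validated across the cross product of authentication, encryption, database, and OS settings. The following Python sketch is purely illustrative; the setting names and values are assumptions, not Hortonworks’ actual tooling.

```python
# Hypothetical sketch of a release-certification test matrix.
# Each combination corresponds to one on-demand Hadoop cluster
# that is provisioned, tested, and shut down.
from itertools import product

AUTH = ["kerberos", "none"]              # authentication on/off
ENCRYPTION = ["wire+at-rest", "off"]     # encryption on/off
DATABASES = ["postgres", "mysql", "oracle"]  # DB combinations
OS_IMAGES = ["centos7", "ubuntu16", "sles12"]  # OS environments

def build_matrix():
    """Enumerate every cluster configuration to provision and test."""
    return [
        {"auth": a, "encryption": e, "db": d, "os": o}
        for a, e, d, o in product(AUTH, ENCRYPTION, DATABASES, OS_IMAGES)
    ]

matrix = build_matrix()
print(len(matrix))  # 2 * 2 * 3 * 3 = 36 cluster configurations per release
```

Even this toy matrix yields 36 clusters per release, which is why the post emphasizes doing this "over and over, day in and day out".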

Sunday, 24 September 2017

DATA SCIENCE FOR THE MODERN DATA ARCHITECTURE

Thanks to Vinay Shukla (who leads Data Science Product Management at Hortonworks) for this post.
Our customers increasingly leverage Data Science and Machine Learning to solve complex predictive analytics problems. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching.
While everyone wants to predict the future, truly leveraging Data Science for Predictive Analytics remains the domain of a select few. To expand the reach of Data Science, the Modern Data Architecture (MDA) needs to address the following 4 requirements:
  • Enable Apps to consume predictions and become smarter
  • Bring predictive analytics to the IOT Edge
  • Become easier, more accurate & faster to deploy and manage
  • Fully support the data science life cycle
The diagram below shows where Data Science fits in the MDA.

DATA SMART APPLICATIONS

End users consume data, analytics, and the results of Data Science via data-centric applications (or apps). The vast majority of these applications today don’t leverage Data Science, Machine Learning, or Predictive Analytics. A new generation of enterprise and consumer-facing apps is being built to take advantage of Data Science and Predictive Analytics, providing context-driven insights that nudge end users toward their next set of actions. These apps are called Data Smart Applications.

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog. The CDH software stack lets you use your tool of choice with the Parquet file format – offering the benefits of ...