Posts

3X FASTER INTERACTIVE QUERY WITH APACHE HIVE LLAP

Thanks to Carter Shanklin & Nita Dembla from Hortonworks for this valuable post. One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit, you'll want to look at some of the great LLAP experiences our users shared, including Geisinger, who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast, who found Hive LLAP is faster than Presto for 75% of benchmark queries. These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP, resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We'll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.

STARTING OFF: 3X PERFORMANCE GAINS IN HDP 2.6 WITH HIVE LLAP

Let's start out with a s…

YINCEPTION: A YARN BASED CONTAINER CLOUD AND HOW WE CERTIFY HADOOP ON HADOOP

Thanks to the Hortonworks Team for the valuable post. In this post, we deep dive into something that we are extremely excited about: running a container cloud on YARN! We have been using this next-generation infrastructure for more than a year to run all of Hortonworks' internal CI/CD infrastructure. With this, we can now run Hadoop on Hadoop to certify our releases! Let's dive right in!

CERTIFYING HORTONWORKS PLATFORMS

The introductory post on Engineering @ Hortonworks gave readers an overview of the scale of challenges we see in delivering an enterprise-ready data platform. Essentially, for every new release of a platform, we provision Hadoop clusters on demand, with specific configurations like authentication on/off, encryption on/off, DB combinations, and OS environments, run a bunch of tests to validate changes, and shut them down. And we do this over and over, day in and day out, throughout the year.

DATA SCIENCE FOR THE MODERN DATA ARCHITECTURE

Thanks to Vinay Shukla (leading Data Science Product Management at Hortonworks). Our customers increasingly leverage Data Science and Machine Learning to solve complex predictive analytics problems. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching. While everyone wants to predict the future, truly leveraging Data Science for Predictive Analytics remains the domain of a select few. To expand the reach of Data Science, the Modern Data Architecture (MDA) needs to address the following 4 requirements:

1. Enable apps to consume predictions and become smarter
2. Bring predictive analytics to the IoT edge
3. Become easier, more accurate, and faster to deploy and manage
4. Fully support the data science life cycle

The diagram in the full post shows where Data Science fits in the MDA.

DATA SMART APPLICATIONS

End users consume data, analytics, and the results of Data Science via data-centric applications (or apps). A vast majority of these…

INTEGRATE SPARKR AND R FOR BETTER DATA SCIENCE WORKFLOW

Thanks to the Hortonworks Blog and Yanbo Liang. R is one of the primary programming languages for data science, with more than 10,000 packages. R is open source software that is widely taught in colleges and universities as part of the statistics and computer science curriculum. R uses the data frame as its API, which makes data manipulation convenient, and it has a powerful visualization infrastructure that lets data scientists interpret data efficiently. However, data analysis using R is limited by the amount of memory available on a single machine, and because R is single-threaded it is often impractical to use R on large datasets. To address R's scalability issue, the Spark community developed the SparkR package, which is based on a distributed data frame that enables structured data processing with a syntax familiar to R users. Spark provides the distributed processing engine, data sources, and off-memory data structures; R provides a dynamic environment, interactivity, packages, and visualization. SparkR combine…
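SparkR's core idea is the distributed data frame: the familiar data-frame-style manipulation, executed by Spark's engine across a cluster. SparkR itself is driven from R, but as a rough sketch of the same concept, here is a minimal equivalent in Python (PySpark); the input path and column names are hypothetical placeholders, not examples from the post.

```python
# Minimal PySpark sketch of the distributed data frame workflow that SparkR
# offers to R users. Assumes a local Spark installation; the CSV path and
# column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkr-style-demo").getOrCreate()

# Load structured data into a distributed data frame
# (analogous to SparkR's read.df).
df = spark.read.csv("/tmp/flights.csv", header=True, inferSchema=True)

# Familiar data-frame-style manipulation, executed in parallel on the cluster.
delays = (df.groupBy("carrier")
            .agg(F.avg("arr_delay").alias("avg_delay"))
            .orderBy(F.desc("avg_delay")))

delays.show(10)
spark.stop()
```

Because the data frame is partitioned across the cluster, the same groupBy/agg code scales past single-machine memory, which is exactly the plain-R limitation the post describes.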

Getting to Know the Apache Hadoop 3 Alpha

Source: Cloudera Blog
This article is about the Hadoop 3.x release, from a Cloudera Blog post.

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.
Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process. The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we'd like to drill into the major new changes that landed in 3.0.0-alpha1.

Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

WHERE IS APACHE HIVE GOING? TO IN-MEMORY COMPUTING

Thanks to the Hortonworks blog: Carter Shanklin & Nita Dembla.

Apache Hive™ is the most complete SQL-on-Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions, and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory processing. Hive 2 marks the beginning of Hive's journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources. In this blog, we'll update benchmark results from our earlier blog, "Announcing Apache Hive 2.1: 25x Faster Queries and Much More."
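As a concrete taste of what "interactive" means here, below is a minimal sketch of submitting a query to a HiveServer2 endpoint running in LLAP mode, using the PyHive client in Python. The host, port, username, and table name are hypothetical placeholders, and it assumes LLAP has already been enabled on the cluster (for example, via hive.execution.mode=llap); this is an illustration, not the benchmark setup from the post.

```python
# Minimal sketch: interactive query against HiveServer2 with LLAP enabled.
# Assumes the PyHive client (pip install pyhive) and a reachable HiveServer2;
# the host, port, username, and table name are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(host="hs2.example.com", port=10000, username="analyst")
cur = conn.cursor()

# With LLAP, repeated interactive queries are served by long-lived daemons
# with a shared in-memory cache, rather than spinning up containers per query.
cur.execute("SELECT store_id, SUM(sales) FROM store_sales GROUP BY store_id")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```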
NOT ALL MEMORY ARCHITECTURES ARE EQUAL

RDBMS vs NoSQL Data Flow Architecture

Here is a view of the differences between an RDBMS and a NoSQL database. There are many differences, but the data flow in the two systems is as described below.
In a traditional RDBMS, the data is first written to the database, then to memory. When memory reaches a certain threshold, it is written to the logs. The log files are used for recovery in case of a server crash. In an RDBMS, before a success is returned to the client for an insert or update, the data has to be validated against the predefined schema, indexes have to be updated, and other work done, which makes it a bit slow compared to the NoSQL approach discussed below, as the first sketch below illustrates.
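To make the write-path contrast concrete, here is a minimal Python sketch using sqlite3 as a convenient stand-in for a traditional RDBMS (the table and rows are hypothetical). The point it illustrates is that schema constraints are checked before the insert is acknowledged to the client.

```python
# Sketch of RDBMS-style write-path validation, using sqlite3 as a stand-in.
# The schema and rows are hypothetical; the point is that constraints are
# checked before the insert is acknowledged to the client.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Valid row: passes the schema checks and is acknowledged.
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.commit()

# Invalid row: rejected at write time for violating the NOT NULL constraint.
try:
    conn.execute("INSERT INTO users (id, email) VALUES (2, NULL)")
except sqlite3.IntegrityError as err:
    print("insert rejected:", err)

conn.close()
```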
In a NoSQL database like HBase, the data is first written to the log (the WAL, or write-ahead log), then to memory. When memory reaches a certain threshold, it is flushed to the database files. Before a success is returned for a put call, the data only has to be written to the log file; there is no need for the data to be written to the database files or validated against a schema.
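By contrast, here is a hedged sketch of HBase's schemaless write path using the happybase Python client. It assumes a running HBase Thrift server and an existing table 'events' with column family 'cf'; all of these names are hypothetical placeholders.

```python
# Sketch of HBase's schemaless write path using the happybase client
# (pip install happybase). Assumes a running HBase Thrift server and an
# existing table 'events' with column family 'cf'; all names are hypothetical.
import happybase

conn = happybase.Connection(host="hbase-thrift.example.com")
table = conn.table("events")

# The put succeeds once it lands in the WAL and memstore; columns under 'cf'
# can be invented per row, with no per-column schema to validate against.
table.put(b"row-001", {b"cf:user": b"alice", b"cf:action": b"login"})

print(table.row(b"row-001"))
conn.close()
```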
Log files (first s…