Thanks to Vinay Shukla (leading Data Science Product Management at Hortonworks)
Our customers increasingly leverage Data Science and Machine Learning to solve complex predictive analytics problems. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching.
While everyone wants to predict the future, truly leveraging Data Science for Predictive Analytics remains the domain of a select few. To expand the reach of Data Science, the Modern Data Architecture (MDA) needs to address the following 4 requirements:
  • Enable Apps to consume predictions and become smarter
  • Bring predictive analytics to the IoT Edge
  • Become easier, more accurate & faster to deploy and manage
  • Fully support the data science life cycle
The diagram below shows where Data Science fits in the MDA.


End users consume data, analytics, and the results of Data Science via data-centric applications (or apps). The vast majority of these applications today don’t leverage Data Science, Machine Learning, or Predictive Analytics. A new generation of enterprise and consumer-facing apps is being built to take advantage of Data Science and Predictive Analytics and to provide context-driven insights that nudge end users toward their next set of actions. These apps are called Data Smart Applications.

Writing Data Smart Apps is hard. The app developer needs to write not only the traditional app logic but also the logic to invoke predictive analytics. Data Smart Apps also face a set of common problems, such as entity disambiguation, data quality analysis, and anomaly detection. Since today’s data platforms don’t provide this functionality, app developers are responsible for solving these problems themselves.
We have seen this issue before: frameworks such as Java EE and the Spring Framework evolved to address common application concerns. Now we need a next-generation application framework to make writing Data Smart Applications easier. We are starting to see this evolution. Salesforce Einstein is helping applications in the Salesforce Cloud become smarter, but similar functionality is not yet available in open source.


The Internet of Things is expanding rapidly, and the market size estimates are huge. IDC estimates that global IT spending on IoT-related items will reach $1.29 trillion by 2020. Edge intelligence has the potential to deliver insights and predictions where they are needed most, at faster speed, without requiring a persistent network connection. Predictions need to be delivered at the edge, but the predictive models need not be created there. Today, model training at the edge is painfully slow, and we can create better models faster in the data center. What is needed is a way to deliver these models to the edge, where they can provide predictions even while disconnected from the data center. Models often degrade and drift over time. To address this, the edge needs to be able to report back on model performance and ask for new models when performance falls below a certain threshold.
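As a rough illustration of that feedback loop, here is a minimal Python sketch of an edge-side monitor that tracks rolling accuracy and flags when a fresh model should be requested. The class name, the accuracy metric, the window size, and the threshold are all hypothetical choices made for this example; a real deployment would pick its own metric and its own transport for the report.

```python
from collections import deque


class EdgeModelMonitor:
    """Hypothetical edge-side monitor: rolling accuracy over recent predictions."""

    def __init__(self, threshold=0.8, window=100):
        self.threshold = threshold            # assumed minimum acceptable accuracy
        self.outcomes = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, predicted, actual):
        # Store whether the prediction matched the label observed later.
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_new_model(self):
        # Decide only once the window is full, to avoid reacting to a few samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold

    def report(self):
        # Payload the edge could send to the data center when it is connected.
        return {
            "accuracy": self.accuracy(),
            "samples": len(self.outcomes),
            "request_new_model": self.needs_new_model(),
        }


monitor = EdgeModelMonitor(threshold=0.8, window=4)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.report())
```

Because the monitor only keeps a bounded window of outcomes, it can run on a constrained device and buffer its report until the connection to the data center is available again.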


Businesses are collecting ever bigger datasets and running more compute-intensive deep learning and machine learning algorithms across bigger compute clusters. This requires a mature and sophisticated Big Data and Big Compute platform. The platform needs to leverage hardware advances such as GPUs, FPGAs, and RDMA and make them transparently available to compute frameworks, big data analytics, and Data Smart Apps, with the right level of resource sharing and isolation semantics. YARN already supports GPUs through node labels, but this functionality is going to evolve to provide finer-grained control.
Many data science workloads leverage Python libraries and R packages. Managing these dependencies in a distributed cluster is a non-trivial issue. We have made advances with package management in SparkR and virtualenv support in PySpark, but much more is needed. The upcoming Hadoop 3 will provide Docker support, which will allow a developer-packaged environment to run as a YARN job and be easier to manage.
Tuning, debugging, and tracing a distributed system remains hard. As Data Science on big data goes mainstream, we need to make distributed systems easier to manage, debug, trace, and tune.


Data Science is a team sport. Data scientists collaborate, explore corporate datasets, wrestle with data, and deploy machine learning, all while keeping up with the onslaught of new machine learning techniques and libraries. A complete Data Science platform needs to support the full data science life cycle. It needs to give data scientists their choice of notebook, from Jupyter and Zeppelin to RStudio, and a wide choice of data science languages and frameworks. The platform should make collaboration easier and help data scientists align with modern software engineering practices such as code review, continuous integration, and continuous delivery.
Model deployment and management is a critical part of completing the data science loop. The framework needs to support model deployment, versioning, A/B testing, and Champion/Challenger comparisons, and provide standard ways to promote and use models.
Deep Learning (DL) is top of mind for many, but everything from selecting the right DL framework to selecting the right problems for DL remains an art form. The platform needs to provide guidance on and a choice of the right DL frameworks, along with better integration with hardware resources to improve training time and performance.
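To make the Champion/Challenger idea concrete, here is a small Python sketch under simple assumptions: models are plain callables, quality is measured as accuracy on a shared labeled holdout set, and the challenger is promoted only if it beats the champion by a minimum margin. The function names and the `min_gain` parameter are hypothetical and not part of any particular platform.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def pick_model_to_serve(champion, challenger, examples, min_gain=0.01):
    """Return 'challenger' if it beats the champion by at least min_gain, else 'champion'."""
    champion_acc = accuracy(champion, examples)
    challenger_acc = accuracy(challenger, examples)
    if challenger_acc - champion_acc >= min_gain:
        return "challenger"
    return "champion"


# Toy holdout set: the label is 1 for odd inputs and 0 for even inputs.
holdout = [(i, i % 2) for i in range(10)]


def always_zero(x):
    # Current production model in this toy setup: always predicts 0.
    return 0


def parity(x):
    # Candidate model: predicts the parity of the input.
    return x % 2


print(pick_model_to_serve(always_zero, parity, holdout))
```

Requiring a minimum gain keeps the serving model stable when the two candidates are effectively tied, which is one common reason to prefer a margin over a strict comparison.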


At Hortonworks we are evolving our platform to further democratize Data Science. We are working with IBM to deliver Data Science Experience on HDP, working with our HDF team to bring intelligence to the edge, and evolving YARN and HDP to provide faster, more accurate, and easier access to the insights that Data Smart Apps need. A lot is on our plate, and we are super excited to share with our customers the advances they have asked for. Stay tuned.

