Posts

Showing posts from June, 2015

Cascading: A Java Developer's Companion to the Hadoop World

Thanks to Dhruv Kumar, who introduces Cascading, an open source application development framework that lets Java developers build applications on top of Hadoop through its Java API. Dhruv Kumar currently works in partner solutions at Hortonworks.

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Thanks to Ted Malaska and Cloudera.
Evaluating which streaming architectural pattern best matches your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is tempting to bucket large-scale streaming use cases together, but in reality they tend to break down into a few distinct architectural patterns, with different components of the ecosystem better suited to different problems. In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are…

New in CDH 5.4: Sensitive Data Redaction

Thanks to Michael Yoder
The best data protection strategy is to remove sensitive information from every place it is not needed.


Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like select * from table where creditcard = '1234-5678-9012-3456', where is that query text ultimately stored?

This concern affects anyone managing a Hadoop cluster containing sensitive information. At Cloudera, we set out to address this problem through a new feature called Sensitive Data Redaction, available starting in Cloudera Manager 5.4.0 when operating on a CDH 5.4.0 cluster. Specifically, this feature addresses the “leakage” of sensitive information into channels unrelated to the flow of data, not the data stream itself. So, for example, Sensitive Data Redaction will get…
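The idea behind this kind of redaction is simple: rewrite each log or query line before it ever reaches disk, replacing anything that matches a sensitive-data rule. Here is a minimal sketch in plain Java of that rewrite step, with a single credit-card rule; the class name and the rule itself are illustrative assumptions, not Cloudera's actual implementation.

```java
import java.util.regex.Pattern;

// Illustrative sketch: redact dashed 16-digit card numbers from a log line
// before it is written anywhere. Real deployments would carry a list of
// configurable rules (regex plus replacement) rather than one constant.
public class LogRedactor {
    private static final Pattern CARD =
            Pattern.compile("\\d{4}-\\d{4}-\\d{4}-\\d{4}");

    public static String redact(String logLine) {
        // Replace every match in the line with a fixed mask.
        return CARD.matcher(logLine).replaceAll("XXXX-XXXX-XXXX-XXXX");
    }

    public static void main(String[] args) {
        System.out.println(redact(
                "select * from table where creditcard = '1234-5678-9012-3456'"));
        // prints: select * from table where creditcard = 'XXXX-XXXX-XXXX-XXXX'
    }
}
```

Because the rewrite happens on the logging path rather than on the stored data, the HDFS data stream itself is untouched, which matches the "channels unrelated to the flow of data" scope described above.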