WELCOME TO BIGDATATRENDZ      WELCOME TO CAMO      Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop      Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle     

Tuesday, 30 June 2015

Cascading: A Java Developer's Companion to the Hadoop World

Thanks to Dhruv Kumar, as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. Now, Currently Dhruv Kumar is at partner solution, Hortonworks .

Thursday, 4 June 2015

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Thanks to Ted Malaska  and Cloudera
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are:
  • Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
  • Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
  • NRT Event Partitioned Processing:  Similar to NRT event processing, but deriving benefits from partitioning the data—like storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
  • Complex Topology for Aggregations or ML: The holy grail of stream processing: gets real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultra-low latency to functionality and accuracy.
In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.

Streaming Ingestion

Traditionally, Flume has been the recommended system for streaming ingestion. Its large library of sources and sinks cover all the bases of what to consume and where to write. (For details about how to configure and manage Flume,Using Flume, the O’Reilly Media book by Cloudera Software Engineer/Flume PMC member Hari Shreedharan, is a great resource.)

New in CDH 5.4: Sensitive Data Redaction

Thanks to Michael Yoder
The best data protection strategy is to remove sensitive information from everyplace it’s not needed

Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like select * from table where creditcard = '1234-5678-9012-3456', where is that query information ultimately stored?
This concern affects anyone managing a Hadoop cluster containing sensitive information. At Cloudera, we set out to address this problem through a new feature called Sensitive Data Redaction, and it’s now available starting in Cloudera Manager 5.4.0 when operating on a CDH 5.4.0 cluster.
Specifically, this feature addresses the “leakage” of sensitive information into channels unrelated to the flow of data–not the data stream itself. So, for example, Sensitive Data Redaction will get credit-card numbers out of log files and SQL queries, but it won’t touch credit-card numbers from the actual data returned from an SQL query. nor modify the stored data itself.


Our first step was to study the problem: load up a cluster with sensitive information, run queries, and see if we could find the sensitive data outside of the expected locations. In the case of log files, We found that SQL queries themselves were written to several log files. Beyond the SQL queries, however, we did not observe any egregious offenders; developers seem to know that writing internal data to log files is a bad idea.
That’s the good news. The bad news is that the Hadoop ecosystem is really big, and there are doubtless many code paths and log messages that we didn’t exercise. Developers are also adding code to the system all the time, and future log messages might reveal sensitive data.
Looking more closely at how copies of SQL queries are distributed across the system was enlightening. Apache Hive writes a job configuration file that contains a copy of the query, and makes this configuration file available “pretty much everywhere.” Impala keeps queries and query plans around for debugging and record keeping and makes them available in the UI. Hue saves queries so they can be run again. This behavior makes perfect sense: users want to know what queries they have run, want to debug queries that went bad, and want information on currently running queries. When sensitive information is in the query itself, however, this helpfulness is suddenly much less helpful.
One way to tackle such “leakage” of sensitive data is to put log files in an encrypted filesystem such as that provided by Cloudera Navigator Encrypt. This strategy is reasonable and addresses compliance concerns, especially in the event that some users require the original queries.

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog The CDH software stack lets you use your tool of choice with the Parquet file format – – offering the benefits of ...