WELCOME TO BIGDATATRENDZ      WELCOME TO CAMO      Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop      Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle     

Wednesday, 12 October 2016

Getting to Know the Apache Hadoop 3 Alpha

Source: Cloudera Blog
This is article about Hadoop 3.x version release from Cloudera Blog post

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.
Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process.
The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1.
Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

Monday, 10 October 2016


Thanks to Hortonworks blog resource -- Carter Shanklin && Nita Dembla

Apache Hive(™) is the most complete SQL on Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP(Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.
In this blog, we’ll update benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More.


Tuesday, 4 October 2016

RDBMS vs NoSQL Data Flow Architecture

Here is a view for difference between RDBMS and a NoSQL Database. There are a lot of differences, but the data flow is as shown below in those systems.

In a traditional RDBMS, the data is first written to the database, then to the memory. When the memory reaches a certain threshold, it's written to the Logs. The Log files are used for recovering in case of server crash. In RDBMS before returning a success on an insert/update to the client, the data has to be validated against the predefined schema, indexes created and other things which makes it a bit slow compared to the NoSQL approach discussed below.

In case of a NoSQL database like HBase, the data is first written to the Log (WAL), then to the memory. When the memory reaches a certain threshold, it's written to the Database. Before returning a success for a put call, the data has to be just written to the Log file, there is no need for the data to be written to the Database and validated against the schema.

Log files (first step in NoSQL) are just appended at the end and is much faster than writing to the Database (first step in RDBMS). The NoSQL data flow discussed above gives a higher threshold/rate during data inserts/updates in case of NoSQL Databases when compared to RDBMS.

Happy Hadooping......!!

Wednesday, 17 August 2016

Resolving Lock Contention in Apache Solr: A Performance-Analysis Detective Story

This case study is an instructive example of how performance analysis is a multi-faceted process that often leads one in surprising directions. 
Apache Solr Near Real Time (NRT)  Search allows Solr users to search documents indexed just seconds ago. It’s a critical feature in many real-time analytics applications. As Solr indexes more and more documents in near real time, end-user expectations for performance get higher and higher.
However, recently the Cloudera Search team found that Solr NRT indexing throughput often hit a bottleneck even when there are plenty of CPU, disk, and network resources available. Latency was average, in the hundreds of milliseconds range. Considering that Solr NRT indexing is a mainly machine-to-machine operation, without a human waiting for indexing to complete, that latency range was actually fairly good.
Furthermore, some customers reported other issues under heavy Solr NRT indexing workloads, such as connection resets, that could be “cascading” performance effects as a result of the throughput bottleneck. This behavior changed little even with different indexing batch sizes, numbers of indexing threads, or numbers of shards and replicas involved.
In the remainder of this post, I’ll describe how we identified the cause of this bottleneck via scientific methodfor performance analysis and custom tools, designed a permanent fix in partnership with Solr committers, and the lessons learned for doing performance analysis overall.

Sunday, 7 August 2016

How-to: Ingest Email into Apache Hadoop in Real Time for Analysis

source: Cloudera blog
Apache Hadoop is a proven platform for long-term storage and archiving of structured and unstructured data. Related ecosystem tools, such as Apache Flume and Apache Sqoop, allow users to easily ingest structured and semi-structured data without requiring the creation of custom code. Unstructured data, however, is a more challenging subset of data that typically lends itself to batch-ingestion methods. Although such methods are suitable for many use cases, with the advent of technologies like Apache SparkApache Kafka, and Apache Impala (Incubating), Hadoop is also increasingly a real-time platform.
In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communication (email, IM, social media, proprietary communication tools, and so on) for a set period of time. Once data has grown past its retention period, it can then be permanently removed; in the meantime, such data is subject to e-discovery requests and legal holds. Even outside of compliance use cases, most large organizations that are subject to litigation have some form of archive in place for purposes of e-discovery.
Traditional solutions in this area comprise various moving parts and can be quite costly and complex to implement, maintain, and upgrade. By using the Hadoop stack to take advantage of cost-efficient distributed computing, companies can expect significant cost savings and performance benefits.
In this post, as a simple example of this use case, I’ll describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange.

Setting Up Apache James

Being the most common form of electronic communication, email is almost always the most important thing to archive. In this exercise, we will use Microsoft Exchange 2013 to send email via SMTP journaling to an Apache James server v2.3.2.1 located on an edge node in the Hadoop cluster. James is an open source SMTP server; it’s relatively easy to set up and use, and it’s perfect for accepting data in the form of a journal stream.

Saturday, 25 June 2016

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.
Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.
In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark, Apache Solr, and Apache HBase to do just that for a medical device information use case. Specifically, you will use a public dataset to convert narrative text into searchable fields.
Although this example concentrates on medical device information, it can be applied in many other scenarios where processing and persisting images is required. Insurance companies, for example, can make all their scanned documents in claims files searchable for better claim resolution. Similarly, the supply-chain department in a manufacturing facility could scan all the technical data sheets from parts suppliers and make them searchable by analysts.

Use Case: Medical Device Registration

Recent years have seen a flurry of changes in the field of electronic drug product registration. The IDMP (Identification of medical products) ISO standard is one such message format for registering products and the substances contained within them, with the Medicinal Product ID, Packaging ID and Batch ID being used to track the products in the cases of adverse experiences, illegal import, counterfeiting, and other issues ofpharmacovigilance. The standard asks that not only do new products need to be registered, but that the older/archived filing of every product to which the public could be exposed must also be provided in electronic form.
To comply with IDMP standards in different companies, companies must be able to pull and process data from multiple data sources, such as RDBMS as well as, in some cases, legacy product data sheets. While it is well known how to ingest data from RDBMS via technologies like Apache Sqoop, legacy document processing requires a little more work. For the most part, the documents need to be ingested, and relevant text needs to be programmatically extracted at scale using existing OCR technologies.
We will use a data set from the FDA that contains all of the 510(k) filings ever submitted by medical device manufacturers since 1976. Section 510(k) of the Food, Drug and Cosmetic Act requires device manufacturers who must register, to notify FDA of their intent to market a medical device at least 90 days in advance.
This dataset is useful for several reasons in this case:
  • The data is free and in the public domain.
  • The data fits right in with the European regulation, which activates in July 2016 (where manufacturers must comply with new data standards). FDA fillings have important information relevant to deriving a complete view of IDMP.
  • The format of the documents (PDF) allows us to demonstrate simple yet effective OCR techniques when dealing with documents of multiple formats.
To effectively index this data, we’ll need to extract some fields from the images. Below is a sample document, with the potential fields that can be extracted.

Friday, 17 June 2016

Analyse Tweets using Flume, Hadoop and Hive

Note : Also don't forget to do check another entry on how to get some interesting facts from Twitter using R here. And also this entry on how to use Oozie for automating the below workflow. Here is a new blog on how to do the same analytics with Pig (using elephant-bird).

It's not a hard rule, but almost 80% of the data is unstructured, while the remaining 20% is structured data. RDBMS helps to store/process the structured data (20%), while Hadoop solves the problem of storing/processing both types of data. The good thing about Hadoop, is that it scales incrementally with less CAPEX in terms of software and hardware.

With the ever increasing usage of smart devices and the high speeds internet, unstructured data had been growing at a very fast rate. It's common to Tweet from a smart phone, take a picture and share it in Facebook.

In this blog we will try to get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes the API (more here) to get the Tweets. The service is free, but requires the user to register for the service. Cloudera wrote a three part series (123) for Twitter Analysis using Hadoop, the code for the same is here. For the impatient, I will quickly summarize how to get data into HDFS using Flume and start doing some analytics using Hive.

Flume has the concepts of agents. The sources, sinks and the intermediate channels are the different types of agents. The sources can push/pull the data and send it to the different channels which in turn will send the data to the different sinks. Flume decouples the source (Twitter) and the sink (HDFS) in this case. Both the source and the sink can operate at different speeds, also it's much easier to add new sources and sinks. Flume comes with a set of sources, channels, sinks and new onces can be implemented by extending the Flume base classes.

Thursday, 31 March 2016

Apache Sentry is Now a Top-Level Project

Source: Cloudera
The following post was originally published by the Sentry community at apache.org. We re-publish it here for your convenience.
We are very excited to announce that Apache Sentry has graduated out of Incubator and is now an Apache Top-Level Project! Sentry, which provides centralized fine-grained access control on metadata and data stored in Apache Hadoop clusters, was introduced as an Apache Incubator project back in August 2013. In the past two and a half years, the development community grew significantly to a large number of contributors from various organizations. Upon graduation, there were more than 50 contributors, 31 of whom had become committers.sentry

What’s Sentry?

While Hadoop has strong security at the filesystem level, it lacked the granular support needed to adequately secure access to data by users and BI applications. This problem forces users to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop. Sentry provides the ability to enforce role-based access control to data and/or privileges on data for authenticated users in a fine-grained manner. For example, Sentry’s SQL permissions allow access control at the server, database, table, view and even column scope at different privilege levels including select, insert, etc for Apache Hive and Apache Impala (incubating). With role-based authorization, precise levels of access could be granted to the right users and applications.