Saturday, 25 June 2016

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.
Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.
In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark, Apache Solr, and Apache HBase to do just that for a medical device information use case. Specifically, you will use a public dataset to convert narrative text into searchable fields.
Although this example concentrates on medical device information, it can be applied in many other scenarios where processing and persisting images is required. Insurance companies, for example, can make all their scanned documents in claims files searchable for better claim resolution. Similarly, the supply-chain department in a manufacturing facility could scan all the technical data sheets from parts suppliers and make them searchable by analysts.
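To make the approach concrete before diving into the use case, here is a minimal sketch of the distributed OCR step using PySpark. The paths, the application name, and the helper libraries (pdf2image and pytesseract, which would need to be installed on every worker along with Poppler and Tesseract) are illustrative assumptions, not necessarily the exact tools behind the original 50-line implementation.

    # Sketch: OCR scanned PDFs in parallel with Spark (assumed helpers: pdf2image, pytesseract).
    from pyspark import SparkContext
    from pdf2image import convert_from_bytes
    import pytesseract

    def pdf_to_text(record):
        """Render every page of one PDF to an image and OCR it into plain text."""
        path, pdf_bytes = record
        pages = convert_from_bytes(bytes(pdf_bytes))
        text = "\n".join(pytesseract.image_to_string(page) for page in pages)
        return path, text

    sc = SparkContext(appName="ocr-510k-sketch")
    # binaryFiles yields (path, bytes) pairs, one per PDF, so each document is OCR'd on a worker.
    ocr_texts = sc.binaryFiles("hdfs:///data/fda/510k/*.pdf").map(pdf_to_text)
    # Persist the raw text; a later step would extract fields and index them into Solr/HBase.
    ocr_texts.map(lambda kv: kv[0] + "\t" + kv[1].replace("\n", " ")) \
             .saveAsTextFile("hdfs:///data/fda/510k-text")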

Use Case: Medical Device Registration

Recent years have seen a flurry of changes in the field of electronic drug product registration. The IDMP (Identification of Medicinal Products) ISO standard is one such message format for registering products and the substances contained within them, with the Medicinal Product ID, Packaging ID, and Batch ID used to track products in cases of adverse experiences, illegal import, counterfeiting, and other issues of pharmacovigilance. The standard requires not only that new products be registered, but also that the older/archived filings of every product to which the public could be exposed be provided in electronic form.
To comply with IDMP standards, companies must be able to pull and process data from multiple sources, such as RDBMSs and, in some cases, legacy product data sheets. While it is well known how to ingest data from an RDBMS via technologies like Apache Sqoop, legacy document processing requires a little more work: for the most part, the documents need to be ingested and the relevant text extracted programmatically, at scale, using existing OCR technologies.
Dataset
We will use a dataset from the FDA that contains all of the 510(k) filings submitted by medical device manufacturers since 1976. Section 510(k) of the Food, Drug and Cosmetic Act requires device manufacturers who must register to notify the FDA of their intent to market a medical device at least 90 days in advance.
This dataset is useful for several reasons in this case:
  • The data is free and in the public domain.
  • The data fits right in with the European regulation that comes into effect in July 2016, under which manufacturers must comply with new data standards. FDA filings contain important information relevant to deriving a complete view of IDMP.
  • The format of the documents (PDF) allows us to demonstrate simple yet effective OCR techniques when dealing with documents of multiple formats.
To effectively index this data, we’ll need to extract some fields from the images. Below is a sample document, with the potential fields that can be extracted.
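As a concrete (and hedged) illustration of the kind of field extraction involved, the sketch below pulls two hypothetical fields out of OCR'd text with regular expressions; the field names and patterns are assumptions for illustration, not the exact rules used against the real documents.

    # Sketch: turn OCR'd narrative text into searchable fields (illustrative patterns only).
    import re

    FIELD_PATTERNS = {
        # 510(k) clearance numbers look like "K" followed by six digits, e.g. K123456.
        "clearance_number": re.compile(r"\bK\d{6}\b"),
        # A loose date pattern such as "June 25, 2016".
        "decision_date": re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"),
    }

    def extract_fields(ocr_text):
        """Return the first match for each field, ready to be indexed (e.g. into Solr)."""
        fields = {}
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(ocr_text)
            if match:
                fields[name] = match.group(0)
        return fields

    print(extract_fields("510(k) Number: K123456 ... Decision Date: June 25, 2016"))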

Friday, 17 June 2016

Analyse Tweets using Flume, Hadoop and Hive

Note: Don't forget to check out another entry on how to get some interesting facts from Twitter using R here, and also this entry on how to use Oozie for automating the below workflow. Here is a new blog on how to do the same analytics with Pig (using elephant-bird).

It's not a hard rule, but roughly 80% of data is unstructured, while the remaining 20% is structured. An RDBMS helps to store/process the structured data (the 20%), while Hadoop solves the problem of storing/processing both types of data. The good thing about Hadoop is that it scales incrementally, with lower CAPEX in terms of software and hardware.

With the ever-increasing usage of smart devices and high-speed internet, unstructured data has been growing at a very fast rate. It's common to tweet from a smartphone, or to take a picture and share it on Facebook.

In this blog we will get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes an API (more here) to fetch the Tweets. The service is free, but requires the user to register. Cloudera wrote a three-part series (1, 2, 3) on Twitter analysis using Hadoop; the code for it is here. For the impatient, I will quickly summarize how to get data into HDFS using Flume and start doing some analytics using Hive.

Flume is built around the concept of agents. An agent wires together sources, channels, and sinks: a source pushes/pulls data and sends it to one or more channels, which in turn deliver the data to the sinks. Flume thereby decouples the source (Twitter) from the sink (HDFS) in this case; both can operate at different speeds, and it's also much easier to add new sources and sinks. Flume comes with a set of sources, channels, and sinks, and new ones can be implemented by extending the Flume base classes. A minimal agent configuration is sketched below.
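To show how an agent is wired together, here is a minimal Flume configuration sketch in the standard properties format. The agent/component names, the TwitterSource class (from Cloudera's cdh-twitter-example), the credentials, and the HDFS path are placeholders/assumptions to be adapted.

    # Sketch of a Flume agent: Twitter source -> memory channel -> HDFS sink (all values illustrative).
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS

    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
    TwitterAgent.sources.Twitter.accessToken = <access-token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>
    TwitterAgent.sources.Twitter.keywords = hadoop, flume, hive

    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100

    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

The agent would then be started with something like flume-ng agent -n TwitterAgent -c conf -f twitter.conf, where the config file name is again just a placeholder.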


Introduction to HDFS Erasure Coding in Apache Hadoop

Thanks to blog contributors from Cloudera. Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compar...