Showing posts from June, 2016

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale. Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time. In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark, Apache Solr, and Apache HBase to do just that for a medical device information use case. Specifically, you will use a public dataset to convert narrative text into searchable fields. Although this example concentrates on medical device information, it can be applied in many other scenarios where processing and persisting images is required. Insurance companies, for example, can make all their scanned documents in claims files searchable for better claim resolution. Similarly, the supp…

Analyse Tweets using Flume, Hadoop and Hive

Note : Also don't forget to do check another entry on how to get some interesting facts from Twitter using R here. And also this entry on how to use Oozie for automating the below workflow. Here is a new blog on how to do the same analytics with Pig (using elephant-bird).

It's not a hard rule, but almost 80% of the data is unstructured, while the remaining 20% is structured data. RDBMS helps to store/process the structured data (20%), while Hadoop solves the problem of storing/processing both types of data. The good thing about Hadoop, is that it scales incrementally with less CAPEX in terms of software and hardware.

With the ever increasing usage of smart devices and the high speeds internet, unstructured data had been growing at a very fast rate. It's common to Tweet from a smart phone, take a picture and share it in Facebook.

In this blog we will try to get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes the API (more here) to get the Tweets.…