Hortonworks is excited to announce that our first hands-on, performance-based certification exam is now available! The HDP Certified Developer (HDPCD) exam is designed for Hadoop developers working with frameworks like Pig, Hive, Sqoop, and Flume. This new approach to Hadoop certification gives individuals the opportunity to prove their Hadoop skills in a way the industry recognizes as meaningful and relevant to on-the-job performance.
Instead of multiple-choice questions, the exam consists of tasks executed on a live, three-node Hortonworks Data Platform cluster. Those tasks fall into three main categories.
Click here to view a detailed list of objectives for the HDPCD exam.
Thanks to Ted Malaska and Cloudera.

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment. The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together, but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited to different problems. In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.
Streaming Patterns

The four basic streaming patterns (often used in tandem) are…
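As a rough illustration of the simplest of these patterns, basic stream ingestion (landing raw events in HDFS with no per-event processing), here is a minimal PySpark sketch; the Kafka topic, broker address, and output path are illustrative assumptions, and the Spark 1.x streaming API is used:

```python
# Minimal stream-ingestion sketch: pull raw events from Kafka and land
# them in HDFS with no per-event processing (Spark 1.x streaming API).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="StreamIngestion")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct Kafka stream; yields (key, value) pairs per record.
# Topic name and broker address are illustrative assumptions.
events = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],
    kafkaParams={"metadata.broker.list": "broker1:9092"},
)

# Keep only the payload and write each batch out as text files in HDFS.
events.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/raw/events")

ssc.start()
ssc.awaitTermination()
```

The other patterns layer per-event logic, partitioned state, or more complex topologies on top of this same ingest-from-Kafka backbone.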
Thanks to the Hortonworks Blog and Yanbo Liang.

R is one of the primary programming languages for data science, with more than 10,000 packages. R is open-source software that is widely taught in colleges and universities as part of statistics and computer science curricula. R uses the data frame as its core API, which makes data manipulation convenient, and it has a powerful visualization infrastructure that lets data scientists interpret data efficiently. However, data analysis in R is limited by the amount of memory available on a single machine, and because R is single-threaded it is often impractical to use on large datasets. To address R's scalability issue, the Spark community developed the SparkR package, which is based on a distributed data frame that enables structured data processing with a syntax familiar to R users. Spark provides the distributed processing engine, data sources, and off-memory data structures; R provides a dynamic environment, interactivity, packages, and visualization. SparkR combines…
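SparkR itself is driven from R, but the distributed data frame it is built on is the same Spark DataFrame abstraction exposed from other languages. As a rough sketch of the kind of filter/group/aggregate workflow the post describes, here it is via Spark's Python API, kept in Python for consistency with the other sketches; the file path and column names are illustrative assumptions:

```python
# SparkR is used from R; this shows the same distributed DataFrame
# abstraction through Spark's Python API (Spark 1.x style).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameSketch")
sqlContext = SQLContext(sc)

# Load a JSON dataset into a distributed data frame.
# The path and column names ("age", "country") are assumptions.
people = sqlContext.read.json("hdfs:///data/people.json")

# R-data-frame-style manipulation, executed across the cluster:
# filter rows, group, and aggregate without collecting to one machine.
people.filter(people.age > 21).groupBy("country").count().show()
```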
Source: Cloudera blog

Apache Hadoop is a proven platform for long-term storage and archiving of structured and unstructured data. Related ecosystem tools, such as Apache Flume and Apache Sqoop, allow users to easily ingest structured and semi-structured data without requiring the creation of custom code. Unstructured data, however, is a more challenging subset of data that typically lends itself to batch-ingestion methods. Although such methods are suitable for many use cases, with the advent of technologies like Apache Spark, Apache Kafka, and Apache Impala (Incubating), Hadoop is also increasingly a real-time platform.
In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communicat…
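As a loose sketch of the real-time side of such an archiving pipeline, the snippet below publishes raw message payloads to Kafka so a downstream consumer (for example, a Spark job) can persist them durably in Hadoop; the kafka-python client, broker address, and topic name are assumptions for illustration, not details from the post:

```python
# Sketch: publish raw e-communication payloads to a Kafka topic so a
# downstream consumer can archive them durably in Hadoop.
# Requires the kafka-python package; broker and topic are assumptions.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1:9092")

def archive_message(raw_message: bytes) -> None:
    # Kafka retains the message until the archiving consumer
    # writes it into HDFS (e.g., via a Spark job).
    producer.send("ecomm-archive", raw_message)

archive_message(b"From: alice@example.com\r\nTo: bob@example.com\r\n...")
producer.flush()  # block until buffered messages are delivered
```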