Posts

Cloudera Data Hub: Where Agility Meets Control

Cloudera’s new Data Hub cloud service, powered by Cloudera Data Platform, enables users to migrate on-premises data management and analytics workloads to the cloud and to implement new cloud workloads in pursuit of a cloud-first data management strategy. On August 22nd, Cloudera demonstrated its Data Hub service during a webinar highlighting key business benefits, use cases, and product capabilities. Below is a brief overview of the topics covered and some of the most frequently asked questions from attendees. What is Cloudera Data Hub? Cloudera Data Hub is a cloud service on Cloudera Data Platform (CDP) that makes it easier, safer, and faster to build modern, mission-critical, data-driven applications with enterprise security, governance, scale, and control. The cloud-native service is powered by a suite of integrated open source technologies that delivers the widest range of analytical workloads, such as data marts and data engineering. The three distinguishing…

HBase Performance CDH5 (HBase1) vs CDH6 (HBase2)

Thanks to Cloudera Blog

HBase customers upgrading from CDH 5 to CDH 6 will also get an HBase upgrade, moving from HBase1 to HBase2. Performance is an important aspect customers consider. We measured the performance of CDH 5 HBase1 vs CDH 6 HBase2 using YCSB workloads to understand the performance implications of the upgrade for customers doing in-place upgrades (no changes to hardware). About YCSB: For our testing we used the Yahoo! Cloud Serving Benchmark (YCSB). YCSB is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare the relative performance of NoSQL database management systems. The original benchmark was developed by workers in the research division of Yahoo!, who released it in 2010 (more information: https://github.com/brianfrankcooper/YCSB). In our test environment, YCSB was run at a 1 TB data scale, and the workloads included the YCSB default workloads and customized workloads. YCSB test workloads us…

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce

Source: Cloudera Blog. The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing. An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations: Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table. Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files. Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented i…

Big Data Trendz