Apache Hadoop Infrastructure Considerations and Best Practices

Thanks to Lisa Sensmeier and Hortonworks.

Bit Refinery is a Hortonworks Technical Partner and recently certified with HDP. Bit Refinery is a VMware® Cloud Infrastructure-as-a-Service (IaaS) provider featuring virtualization technology hosted within their fully redundant virtual data centers. Bit Refinery offers a hosted Hortonworks Sandbox, providing an easy way to experience and learn Hadoop. All the tutorials available for the Hortonworks Sandbox work just as if you were running a localized version of the Sandbox.
Brandon Hieb, Managing Partner at Bit Refinery, is our guest blogger, and in this blog, he provides insight to virtualizing Hadoop infrastructures.
Here at Bit Refinery we provide infrastructure for companies large and small which includes a variety of big data applications running on both bare-metal and VMware servers. With this new technology constantly changing, it’s hard to keep up with the different required resources needed which could range from a traditional Hadoop node to an in-memory application such as Apache Ignite. In this blog post, we hope to provide some valuable insight on infrastructure guidelines based on our experiences and current big data customers.

Do:

  • Embrace redundancy, use commodity servers
  • Start small and stay focused
  • Monitor, monitor and monitor
  • Create a data integration process
  • Use compression
  • Build multiple environments (Dev, Test, Prod)


Don't:

  • Mix master nodes with data nodes
  • Virtualize data nodes
  • Overbuild
  • Panic – Google is your friend
  • Let the Wild Wild West take over!


Embrace redundancy, use commodity servers

We talk to many companies whose foray into Hadoop is driven by business users or an analytics group within the company. Oftentimes the infrastructure folks are brought in at a later date, and a majority have no training in or knowledge of how Hadoop works. This usually leads to an over-designed cluster that triples or even quadruples the budget. Hadoop was created largely because its founders wanted a low-cost, redundant data store that would allow deep analysis of the data. This can be achieved using low-cost servers with JBOD (just a bunch of disks) and single power supplies. Companies such as Thinkmate (a SuperMicro reseller) sell ideal Hadoop nodes in the $5k range.
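As a rough sketch of why commodity JBOD servers pencil out, usable HDFS capacity can be estimated from raw disk, the default replication factor of 3, and some headroom for the OS and intermediate data. The node size and overhead figures below are illustrative assumptions, not quotes:

```python
def usable_capacity_tb(nodes, disks_per_node, tb_per_disk,
                       replication=3, overhead=0.25):
    """Estimate usable HDFS capacity in TB.

    replication: HDFS default block replication factor (3).
    overhead: assumed fraction reserved for OS, logs, and temp/shuffle data.
    """
    raw = nodes * disks_per_node * tb_per_disk
    return raw * (1 - overhead) / replication

# A hypothetical 10-node cluster of ~$5k JBOD servers, 12 x 4 TB disks each:
raw_tb = 10 * 12 * 4                      # 480 TB raw
usable = usable_capacity_tb(10, 12, 4)    # 120 TB usable
```

Running the numbers this way early on helps keep the budget conversation grounded in usable, not raw, terabytes.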

Start small and stay focused

We’ve all seen the statistics on how many projects fail in companies due to complexity and expense. The beauty of Hadoop is that it allows you to start small and add nodes as you go. Choose a small project to get started; this allows both development and infrastructure staff to become familiar with the inner workings of this new technology. They will be hooked in no time!

Monitor, Monitor, Monitor

Although Hadoop offers redundancy at both the data and management levels, there are lots of moving parts to monitor. The Hortonworks platform ships with Nagios, the leading open-source monitoring package. By default it monitors all the nodes and services in a cluster, and with Nagios it’s easy to add additional checks such as disk health on each server.
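A disk-health check of the kind mentioned above can be a very small script, since Nagios plugins communicate mainly through exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) and a one-line status message. This is a minimal sketch, not a hardened plugin; the path and thresholds are illustrative assumptions:

```python
#!/usr/bin/env python3
"""Minimal Nagios-style disk usage check (sketch)."""
import shutil
import sys

def check_disk(path, warn_pct=80, crit_pct=90):
    """Return a (exit_code, message) pair following Nagios plugin conventions."""
    total, used, _free = shutil.disk_usage(path)
    pct = used * 100 / total
    if pct >= crit_pct:
        return 2, f"CRITICAL - {path} {pct:.0f}% full"
    if pct >= warn_pct:
        return 1, f"WARNING - {path} {pct:.0f}% full"
    return 0, f"OK - {path} {pct:.0f}% full"

if __name__ == "__main__":
    code, message = check_disk(sys.argv[1] if len(sys.argv) > 1 else "/")
    print(message)
    sys.exit(code)
```

Dropped into the Nagios plugins directory and wired up with a command definition, a check like this runs against every data node's disks on the same schedule as the built-in service checks.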

Create a data integration process

One of the best things about Hadoop is that it lets you load data first and define data structures later. Getting data in and out is pretty easy with tools such as Sqoop and Flume, but creating a data integration process up front is essential. This includes defining layers such as staging and base, as well as naming standards and locations. Creating a wiki is a great way to keep proper documentation of data sources and where they live within the cluster.
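Even a tiny helper that every ingestion job shares can enforce the layer and naming standards described above. The layout convention here (`/<layer>/<source>/<dataset>/dt=YYYY-MM-DD`) and the layer names are hypothetical examples, not a standard:

```python
from datetime import date

# Hypothetical HDFS layout convention: /<layer>/<source>/<dataset>/dt=YYYY-MM-DD
LAYERS = {"staging", "base"}

def hdfs_path(layer, source, dataset, ds=None):
    """Build the standardized HDFS directory for one daily load."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    ds = ds or date.today().isoformat()
    return f"/{layer}/{source}/{dataset}/dt={ds}"

# e.g. a Sqoop import from a CRM database could land in:
# hdfs_path("staging", "crm", "orders", "2016-01-15")
#   -> "/staging/crm/orders/dt=2016-01-15"
```

The point is less the code than the discipline: if every job builds its target path through one function, the wiki documentation and the cluster layout stay in agreement.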

Use compression

For years, enterprises have had a love-hate relationship with compression. Although it saves space, performance has always suffered on production systems. Hadoop, on the other hand, thrives on compression, which can reduce storage usage by up to 80%! Here is a blog post with more details: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
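The savings compound with replication, which is why the figure matters so much in practice. A quick sketch of the arithmetic, treating the "up to 80%" number as an optimistic, data-dependent assumption:

```python
def stored_tb(raw_tb, compression_savings=0.80, replication=3):
    """Physical TB consumed after compression and HDFS replication.

    compression_savings=0.80 models the 'up to 80%' figure cited for
    ORC-style columnar compression (optimistic; real ratios vary by data).
    """
    return raw_tb * (1 - compression_savings) * replication

# 100 TB of raw data, compressed then triple-replicated:
stored_tb(100)   # roughly 60 TB on disk, versus 300 TB uncompressed
```

In other words, at that ratio compression more than pays for HDFS's 3x replication overhead.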

Build multiple environments (Dev, Test, Prod)

Just like any other infrastructure project, we always advise our customers to build multiple environments. Not only is this a general best practice, it is also important because of the nature of Hadoop. Each project within the Apache ecosystem is constantly changing, and having a non-production environment to test upgrades and new functionality is vital.


Mix master nodes with data nodes

Master nodes and data nodes play two vastly different roles. Master nodes should be placed on servers with fully redundant features such as RAID and multiple power supplies. They also play well in a virtual environment due to the extra redundancy this provides. Data nodes, on the other hand, are the workhorses of the cluster and need to be dedicated solely to that function. Mixing these roles on the same server usually leads to unwanted results and issues.

Virtualize data nodes

We see companies out there touting “Hadoop in the Cloud” and providing clusters located on virtualized servers. Although placing master nodes on virtualized servers isn’t a bad idea, having them act as data nodes is a no-no. The concept of Hadoop is bringing the processing to the data, not the other way around. Having data nodes all share the same storage infrastructure largely nullifies the benefits that Hadoop provides.


Overbuild

It’s easy to get carried away building your first cluster. The costs of hardware and software are low, but it’s important to build only to your initial requirements. You may find the specifications of the servers you chose need to be altered based on the results and performance of your initial project. It’s easy to add, not as easy to take away.

Panic – Google is your friend

When things go wrong, don’t panic! Unlike commercial software, Hadoop is driven by the open source community, and there is a very good chance the problem you are having is just a quick Google search away. Luckily, within the last two years we’ve encountered fewer and fewer issues, and it’s amazing how mature the Hadoop ecosystem has become.

Let the Wild Wild West take over!

This is where the hype tramples on best practice. We’ve constantly heard that Hadoop is a great “data lake” where you can just put all of your data and deal with it later. This is very true, but just like any other data repository, you need to institute best practices, documentation, and rules, or by the time you turn around you will have an out-of-control tsunami of data!


Hadoop is a great data platform that continues to add new features and functionality faster than any commercial software vendor could. We have been doing this for almost two years now, and we are constantly learning new tips and tricks to get the most out of our customers’ Hadoop clusters. Creating a common knowledge base and set of best practices ensures both users and management can quickly gain confidence in this new and exciting technology.
Keep Calm and Hadoop on!

