Thanks to the Hortonworks team for the valuable post.
In this post, we take a deep dive into something that we are extremely excited about – running a container cloud on YARN! We have been using this next-generation infrastructure for more than a year, running all of Hortonworks' internal CI/CD infrastructure on it.
With this, we can now run Hadoop on Hadoop to certify our releases! Let’s dive right in!


The introductory post on Engineering @ Hortonworks gave the readers an overview of the scale of challenges we see in delivering an Enterprise Ready Data platform.
Essentially, for every new release of a platform, we provision Hadoop clusters on demand, with specific configurations like authentication on/off, encryption on/off, DB combinations, and OS environments, run a bunch of tests to validate changes, and shut them down.
And we do this over and over, day in and day out, throughout the year.


Certifying our platforms is thus an arduous task. The specific challenge relevant to this post is how resources get shared and utilized efficiently as we bring these Hadoop clusters up and down. We discuss these resource-management issues below.
  1. Scheduling needs: The existing infrastructure had very limited scheduling primitives to support hundreds of users running workloads at various points in the day. It was difficult to work with unpredictable resource sharing, minimal insight into historical usage, a lack of per-user/tenant limits, etc. These are exactly the things that YARN is built to do!
  2. Older infrastructure of VMs
    • The matrix of combinations discussed in the introductory post used to be tested on an older generation of infrastructure based on VMs, which had its share of pros and cons. The most significant con was that VMs got used in several places where we really didn’t need them: we didn’t need the very coarse isolation that VMs provide, either in resource isolation or in security sandboxing.
    • Using VMs in that manner didn’t help us in realizing the huge potential for resource utilization optimizations. We needed to do something to get more out of our hardware investments.
  3. Need for more internal long-lived clusters: From the viewpoint of delivering a rigorously tested platform, the number of internal instances where clusters would run forever – and go through operations and upgrades that a typical customer goes through – was just not enough.
  4. Scaling needs: The existing VM-based infrastructure was simply not scaling to our needs. Note that some of our needs are actually very unusual – we rapidly bring up and tear down VMs at large scale every day.
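To give a flavor of the scheduling primitives mentioned in item 1, here is a minimal sketch of a YARN Capacity Scheduler configuration with per-tenant limits. The queue names, shares, and limits are hypothetical, not our actual settings:

```xml
<!-- capacity-scheduler.xml: hypothetical two-queue setup with per-tenant limits -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>ci,longlived</value>
  </property>
  <property>
    <!-- 70% of cluster capacity for on-demand CI clusters -->
    <name>yarn.scheduler.capacity.root.ci.capacity</name>
    <value>70</value>
  </property>
  <property>
    <!-- 30% reserved for long-lived internal clusters -->
    <name>yarn.scheduler.capacity.root.longlived.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- cap any single user at the queue's configured share -->
    <name>yarn.scheduler.capacity.root.ci.user-limit-factor</name>
    <value>1</value>
  </property>
  <property>
    <!-- bound the number of concurrently active/pending apps in the queue -->
    <name>yarn.scheduler.capacity.root.ci.maximum-applications</name>
    <value>500</value>
  </property>
</configuration>
```

Primitives like these – guaranteed queue capacities, user limits, and application caps – are exactly what our old infrastructure lacked.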
There is a dogfood opportunity here that helps both us internally as well as our customers and users: Use what we ship and ship what we use.
We recently talked about how Apache Hadoop is moving towards a multi-colored YARN. The need for such a multi-colored YARN is coming from the demand for new use-cases and technology drivers discussed in that post.


Looking at the internal pains discussed above, we asked ourselves – “Our users and customers have a very powerful platform for scheduling and managing their applications. Why not use that same platform for our internal needs?” To this end, nearly 15 months ago, we started the migration of our workloads to a YARN-based container cloud.
What is this new infrastructure? It uses the well-known HDP components that you are already familiar with – Ambari, ZooKeeper, HDFS, and YARN, newly augmented with Docker support and Slider. It is also one of the few fully functioning production clusters running all the time inside Hortonworks.
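For a flavor of what launching Docker containers on YARN looks like, here is a minimal sketch of a service spec in the Yarnfile format from Hadoop 3’s YARN services (the evolution of the Slider approach). The image, counts, and sizes are hypothetical:

```json
{
  "name": "sleeper-docker",
  "components": [
    {
      "name": "worker",
      "number_of_containers": 3,
      "artifact": { "id": "centos:7", "type": "DOCKER" },
      "launch_command": "sleep 90000",
      "resource": { "cpus": 1, "memory": "1024" }
    }
  ]
}
```

A spec like this asks YARN to place three identical Docker containers on the cluster and keep them running, which is the basic building block for the VM-like containers described below.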
If you look carefully, you can spot a YARN cluster running inside (in containers on) another YARN cluster! That is, we have finally realized YInception – YARN on YARN!
YInception: Hadoop clusters running on Hadoop cluster
This is not a radically new idea, though. This is very similar to how source code compilers go through their own evolution. Through the bootstrapping process, developers use an older version of a compiler to build the newer version of the compiler! What we are doing here is to use an older version of Hadoop as a computing platform to test the newer versions of Hadoop. And Hive. And Spark. And everything!


Different types of apps run on this platform: (a) containerized apps and services – a key promise of YARN in Apache Hadoop 3.0 – and (b) legacy applications that run inside containers made to look like VMs, each with its own IP address, SSH-based access, etc.
HWX internal workloads running on YARN container cloud
Let’s look at one specific workload. The following is a depiction of a system-test cluster running on this container cloud. On the base YARN cluster, we dynamically bring up N containers, and then use Ambari to install HDP components inside those containers!
Running Hadoop system test clusters on Hadoop
Similarly, running all the unit tests of the Apache Hadoop project can take more than 6-7 hours. We have developed a parallel Master-Worker framework running on the YARN Container Cloud that brings the unit-test run time down to 10-15 minutes! We will talk more about this in an upcoming post.
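The core idea behind such a framework can be illustrated with a short sketch (not our actual implementation): a master shards test modules across a pool of workers and collects results. Here the workers are local threads and the module names are hypothetical; in practice each worker would be a YARN container invoking the real test runner:

```python
from concurrent.futures import ThreadPoolExecutor

def run_module(module):
    # Placeholder for invoking a real test runner (e.g. "mvn test -pl <module>")
    # inside a worker container; here every module trivially "passes".
    return (module, "PASSED")

def run_all(modules, workers=4):
    # The master farms out one test module per worker and gathers results.
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for module, status in pool.map(run_module, modules):
            results[module] = status
    return results

if __name__ == "__main__":
    modules = ["hadoop-common", "hadoop-hdfs", "hadoop-yarn-server"]
    print(run_all(modules))
```

With enough workers, total wall-clock time approaches the duration of the slowest single module rather than the sum of all of them – which is where the 6-7 hours to 10-15 minutes improvement comes from.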


We have been running hard and fast with this new architecture for more than a year! The following graphs show some key metrics of this cluster over the last 15 months.
YARN Container Cloud Memory growth
Growth in terms of containers’ memory resource over time. Yellow: used + reserved memory. Blue: total cluster memory.
YARN Container Cloud VCores Growth
Growth in terms of containers’ CPU resource over time. Yellow: used + reserved CPU. Blue: total cluster CPU.
As you can see, the cluster slowly got built over time. The workloads likewise slowly got migrated from the old platform to the new one.
Just looking at the numbers of apps and containers: we have allocated more than 2.4M containers and completed close to 400K applications so far!

