Getting to Know the Apache Hadoop 3 Alpha

Source: Cloudera Blog
This is article about Hadoop 3.x version release from Cloudera Blog post

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.
Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process.
The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1.
Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

HDFS Erasure Coding:


HDFS Erasure Coding is a major new feature, and one of the driving features for releasing Hadoop 3.0.0. As described in this Cloudera blog post, erasure coding can reduce storage costs by up to 50% compared to 3x replication while maintaining the same or better durability. Moreover, as shown in a joint Intel-Cloudera blog post, erasure coding adds minimal overhead when using Intel’s optimized ISA-L routines, and can actually improve read and write performance in some conditions.
Considering trends in data growth and datacenter hardware, we foresee HDFS erasure coding being an important feature in years to come.

YARN Timeline Service v.2


3.0.0-alpha1 includes an early preview of YARN Timeline Service v.2, including. This includes container events, metrics, and application-specific information like the number of map and reduce tasks.
The Timeline Service v2 is a new implementation that improves upon the scalability and reliability of TSv1, and enhances usability by introducing flows and aggregation.
More details are available in the YARN Timeline Service v.2 docs.

Shell Script Rewrite


The Hadoop shell scripts have been rewritten with an eye toward unifying behavior, addressing numerous long-standing bugs, improving documentation, as well as adding new functionality. This update affects users who use Hadoop environment variables or integrate with the shell commands.
See HADOOP-9902 for incompatible changes and other details, as well as the the Unix Shell Guide.

Support for Multiple Standby NameNodes


HDFS NameNode high availability with QuorumJournalManager uses a Paxos quorum to store the NameNode edit log. With a three-node quorum, this change means we can tolerate the loss of any one node and still continue operation.
However, business-critical deployments may wish to run with higher levels of fault-tolerance, e.g. a five-node quorum to be able to tolerate the loss of any two nodes.
QuorumJournalManager already supports an arbitrary number of nodes, but fault tolerance was limited since HDFS was only able to run a single active and single standby NameNode. Hadoop 3 eliminates this restriction by supporting running multiple standby NameNodes. This improves the fault tolerance of HDFS.
See the updated HDFS HA docs for information on configuring multiple standby NameNodes.

Java 8 Minimum Runtime Version


Another major motivation for a new major release was bumping the minimum supported Java version to Java 8. This is critically important as Java 7 was end-of-lifed in April 2015, meaning Oracle would no longer publicly support it with security fixes.
Thus, all Hadoop 3 JARs are compiled for a Java 8 language version, and require a Java 8 runtime version. This is also important because many libraries only support Java 8, so this bump allowing Hadoop to upgrade its dependencies to more modern versions.

New Default Ports for Several Services

To avoid bind errors on startup, the default ports of the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of the Linux ephemeral range (32768-61000). See the release notes for HDFS-9427 and HADOOP-12811 for a full list of port changes.
This update should improve the reliability of rolling restarts on large clusters.

Intra-DataNode Balancer


Intra-DataNode balancing functionality addresses the intra-node skew that can occur when disks are added or replaced. See the disk balancer section in the HDFS Commands Guide for more information.
Stay tuned for more information on this feature in a future blog post!

Miscellaneous


The 3.0.0-alpha1 release is also the first release that incorporates a number of changes to the release process.
  • We are now using Apache Yetus to automatically generate our release notes and changelog from JIRA, a vast improvement over our previous system of manually editing a text file.
  • Building the release artifacts has been Docker-ized and streamlined with a new create-release script, which improves correctness and enables us to make more frequent releases.
  • Hadoop’s versioning scheme has been adjusted, letting us more easily release from multiple concurrent branches in parallel (e.g. 2.6.x, 2.7.x, 3..0.x), as well as the addition of -alphaX/-betaX tags.

Conclusion


Apache Hadoop 3.0.0-alpha1 marks a major development milestone on the road to a 3.0.0 GA release. The features and improvements listed here will eventually be incorporated into CDH after our usual rigorous testing and integration process.
In the meantime, we encourage you to track the progress of these new features as new alphas and betas are released. Savvy users can get involved by testing and providing feedback by filing JIRAs on the Apache issue tracker.
Andrew Wang is a Software Engineer at Cloudera, Apache Hadoop committer and member of the Apache Hadoop PMC, and the release manager for Hadoop 3.

Popular posts from this blog

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Getting started with NextGen MapReduce (single node) in easy steps

Hadoop Installation on Single Machine