Apache Sentry is Now a Top-Level Project

Source: Cloudera
The following post was originally published by the Sentry community at apache.org. We re-publish it here for your convenience.
We are very excited to announce that Apache Sentry has graduated from the Apache Incubator and is now an Apache Top-Level Project! Sentry, which provides centralized, fine-grained access control on metadata and data stored in Apache Hadoop clusters, entered the Apache Incubator in August 2013. In the past two and a half years, the development community has grown significantly, drawing contributors from various organizations. At graduation, there were more than 50 contributors, 31 of whom had become committers.

What’s Sentry?

While Hadoop has strong security at the filesystem level, it has lacked the granular support needed to adequately secure access to data by users and BI applications. This gap forces a choice: either leave data unprotected or lock users out entirely. Most of the time the latter is chosen, which severely inhibits access to data in Hadoop. Sentry enforces fine-grained, role-based access control to data, and to privileges on data, for authenticated users. For example, Sentry’s SQL permissions allow access control at the server, database, table, view, and even column scope, at different privilege levels, including select and insert, for Apache Hive and Apache Impala (incubating). With role-based authorization, precise levels of access can be granted to the right users and applications.
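To make the scoping idea concrete, here is a minimal sketch of role-based, hierarchical privilege checks. It is a toy model written for illustration only, not Sentry's implementation or API: the `Privilege` class, the `is_authorized` function, and all role, group, and object names are hypothetical. It assumes that a privilege granted at a broader scope (for example, a database) also covers the narrower scopes beneath it (its tables and columns), and that roles are assigned to groups of users.

```python
from dataclasses import dataclass

# Hypothetical toy model for illustration; not Sentry's classes or API.
@dataclass(frozen=True)
class Privilege:
    action: str   # "select", "insert", or "all"
    scope: tuple  # path such as ("server1", "sales", "orders", "amount")

    def allows(self, action, resource):
        # A privilege covers a resource when its scope is a prefix of the
        # resource path, so a database-level grant covers tables and columns.
        action_ok = self.action in ("all", action)
        scope_ok = resource[:len(self.scope)] == self.scope
        return action_ok and scope_ok

# Roles hold privileges (all names here are made up).
role_privileges = {
    "analyst": [Privilege("select", ("server1", "sales", "orders", "amount"))],
    "etl": [Privilege("all", ("server1", "sales"))],
}

# Groups of authenticated users are mapped to roles.
group_roles = {"bi_users": ["analyst"], "etl_users": ["etl"]}

def is_authorized(groups, action, resource):
    """Return True if any role of any group grants the action on the resource."""
    for group in groups:
        for role in group_roles.get(group, []):
            for priv in role_privileges.get(role, []):
                if priv.allows(action, resource):
                    return True
    return False
```

In this sketch, a member of `bi_users` can select only the `amount` column of `sales.orders`, while a member of `etl_users` can perform any action anywhere in the `sales` database.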


What’s New

During incubation, Sentry had six releases and continued to grow toward unified authorization policy management across different Hadoop components. Highlights include:
  • Support for multiple permission models, with the same permission model enforced across multiple compute frameworks and data access paths
  • Support for Apache Solr (Search)
  • Synchronizing SQL table permissions with HDFS file permissions
  • Audit log support for data governance purposes
  • Sentry High Availability (HA)
  • Import/export tool for replicating permissions to other clusters
  • Support for Apache Kafka, Solr, and Apache Sqoop

Future Work

Graduation is a terrific milestone, but only the beginning for Sentry. We are looking forward to continuing to help grow the Sentry community and fostering a strong ecosystem around the project.
We are targeting significant enhancements across the areas of:
  • Ease of Sentry enablement and management of permissions
  • Feature parity with access control capabilities of mature relational database systems
  • Attribute-Based Access Control (ABAC), including permissions based on data sensitivity tags
  • Integration with additional Hadoop ecosystem frameworks so that existing permissions can be enforced across additional access paths

How to Get Involved

The Sentry community now includes new core committers, an active developer mailing list where future releases and patches are discussed, and increasing interest in running additional frameworks on Sentry. We strongly encourage new people to join Sentry and contribute by joining the discussions on the mailing list, filing bugs through Jira, reviewing others’ code, or even providing new patches.
