Simply Measured Engineering Blog

Integrating Hive and HBase

Why?

HBase makes an excellent storage engine, but its interfaces are very low-level. To use it effectively, you have to understand the intricacies of its cell-based data model and how that model interacts with your query patterns.

This model is powerful and enables very high-performance query and persistence systems, but it starts to break down when analysts want to query data in an ad-hoc fashion: even a simple aggregation query requires custom code against HBase scanners, or a map/reduce job.

In most shops, you end up with a small team of engineers dedicated to writing map/reduce jobs and maintaining your cluster. These engineers tend to operate as developers, DBAs, and operations engineers all at once. This structure is not sustainable as your organization grows.

Possible solutions

The Hadoop and big data world is starting to embrace SQL, or other query languages, as an interface to help with this. Essentially, what was old is new again.

With HBase, you have several potential options:

  • Apache Hive – a complete data warehouse for Hadoop, which includes an HBase storage handler.
  • Phoenix – a SQL layer for HBase, built by the engineers at Salesforce.
  • Apache Drill – an implementation of the Google Dremel paper, focused on maintaining a pluggable model for different storage engines and utilizing existing open source projects wherever possible.
  • Cloudera Impala – very similar to Drill, but with a slightly different approach. It’s written in C++, utilizing JNI when it needs to interact with Java, and takes a “brute-force” approach to interacting with HDFS, and with the HBase APIs when dealing with HBase.

Enter Hive, and how we’re using it

At Simply Measured, we’ve chosen Hive for a few reasons:

  1. Maturity

It’s hard to overstate how much this matters. Our reports are the “life-blood” of our company; if we can’t generate them, our customers have no reason to pay us. We needed a technology that was already proven in the market. Hive started as a project at Facebook, transitioned to the Apache Software Foundation, and has several years of production use and bug fixes under its belt.

  2. Support for nested data types (ARRAY, STRUCT, MAP)

Support for nested data types is important for us. A good example is Twitter “tweet” data: a single tweet may be tracked by several different customers, and with an ARRAY datatype we can tag that tweet to each of them in a form that is easy to query. This data is, for all intents and purposes, immutable after it’s written.

    Given a schema:

    ```sql
    CREATE EXTERNAL TABLE tweets (
      tweet_id_hash STRING,
      tweet_id STRING,
      ...
      metadata STRUCT<accounts: ARRAY<STRING>>
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,tweet:tweet_id,metadata:metadata',
      'hbase.table.default.storage.type' = 'binary'
    )
    TBLPROPERTIES (
      'hbase.table.name' = 'tweets'
    );
    ```

    This query is possible:

    ```sql
    -- accounts is an ARRAY<STRING>, so each exploded element is a plain
    -- string and can be compared directly.
    SELECT *
    FROM tweets t
    LATERAL VIEW explode(t.metadata.accounts) accounts_view AS account
    WHERE account = 'foo';
    ```

Our queries aren’t exactly like this in production, but this gives a high-level picture. It does come at the cost of a map/reduce job; however, if you can also constrain the query on the row key, Hive can perform “predicate pushdown” into HBase’s APIs, limiting the scope of the scan, as in the sketch below.
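
    A hypothetical illustration of that pushdown, assuming tweet_id_hash (mapped to the HBase row key above) is prefixed so that key ranges are meaningful; the 'a0'/'a1' bounds and the predicate shape are illustrative only:

    ```sql
    -- Because tweet_id_hash maps to :key, a range predicate on it can be
    -- pushed down to HBase as a scan over that key range rather than a
    -- full-table scan. The bounds here are purely illustrative.
    SELECT t.tweet_id, account
    FROM tweets t
    LATERAL VIEW explode(t.metadata.accounts) accounts_view AS account
    WHERE t.tweet_id_hash >= 'a0'
      AND t.tweet_id_hash < 'a1'
      AND account = 'foo';
    ```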

  3. HCatalog and the Hive Metastore

One challenge as our organization has grown is schema management and documentation. Reporting is split across a few different engineering teams: one deals with data collection, one with data storage and the reporting platform, and one builds the report queries and templates. Data Collection determines what fields are available from each data source (Twitter, Facebook, Tumblr, etc.). Platform (data storage) persists this data to the database. Analysts write reports and queries against it.

The Hive Metastore ties this all together. Data Collection works with Platform to create tables and organize the data, and the resulting schema is communicated to the Analyst team so they can build reports. For example, an analyst can inspect a table straight from the metastore, as shown below.
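
    A minimal example against the tweets table defined earlier:

    ```sql
    -- List the tables registered in the metastore, then inspect one.
    -- DESCRIBE FORMATTED shows the columns, their types, and the HBase
    -- storage handler properties configured for the table.
    SHOW TABLES;
    DESCRIBE FORMATTED tweets;
    ```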

HCatalog provides a great REST API for interacting with the Hive Metastore, and we are building internal tooling around it. It also simplifies our data storage and retrieval code. We’ve built a library called prognosticator that can read an HCatalog schema and, given a Map, persist that data to HBase in a form that Hive can read.

  4. Extensibility

Hive didn’t provide every function we needed, so we’ve started a project, sm-hive-udf, for the UDFs (user-defined functions) we were missing. Wiring a custom UDF into a session looks like the sketch below.
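
    A hypothetical example of registering and calling a custom UDF; the jar path, function name, and class are illustrative placeholders, not the actual contents of sm-hive-udf:

    ```sql
    -- Make the jar available to the session, then bind a SQL function name
    -- to its implementing class. Path and class here are hypothetical.
    ADD JAR /path/to/sm-hive-udf.jar;
    CREATE TEMPORARY FUNCTION normalize_handle AS 'com.example.hive.udf.NormalizeHandle';

    -- Once registered, it can be used like any built-in function.
    SELECT normalize_handle(tweet_id) FROM tweets LIMIT 10;
    ```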

Trade-offs

This does come with a trade-off: map/reduce jobs. A map/reduce job delivers excellent total throughput, at the cost of latency. Our reporting is essentially a batch operation to begin with, so it maps well onto map/reduce, and we are looking forward to initiatives like Hortonworks’ Stinger Initiative that aim to greatly improve Hive’s performance.

As with any project, Hive has had a few bugs that have impacted us, but since it’s open source, we can apply patches and contribute back to the project, not only improving our product, but helping others do the same.

The other projects

Each of the other SQL layers mentioned has pros and cons of its own.

Phoenix is the most tightly integrated with HBase. Its query parser and engine generate native HBase API calls, utilizing skip scans and coprocessors. We might have gone with this project if it supported complex datatypes, and we will evaluate it again as it continues to advance. Because it focuses solely on HBase, it doesn’t let you query across disparate datasources the way Hive, Drill, and Impala do.
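
For a sense of what skip scans buy you, consider a hypothetical Phoenix table keyed on (customer_id, tweet_id); the table and columns are illustrative, not our schema:

```sql
-- An IN predicate on the leading row-key column lets Phoenix use a skip
-- scan, jumping between the matching key ranges inside the HBase scan
-- rather than reading every row. Table and columns are hypothetical.
SELECT tweet_id, posted_at
FROM tweets
WHERE customer_id IN ('acme', 'globex');
```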

Cloudera Impala is a promising piece of technology. It avoids map/reduce entirely while still allowing you to query and join across disparate datasources (sequence files in HDFS, Parquet files in HDFS, HBase tables, etc.). It currently doesn’t support complex datatypes, and it has single points of failure in its query execution model. A join across storage engines looks like the hypothetical sketch below.
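
A hypothetical illustration of that cross-datasource capability; the tables and columns are made up for the example:

```sql
-- 'tweets' is backed by HBase and 'accounts' by files in HDFS; Impala
-- resolves both through the metastore and joins them without launching
-- a map/reduce job. Tables and columns here are hypothetical.
SELECT a.account_name, COUNT(*) AS tweet_count
FROM tweets t
JOIN accounts a ON t.account_id = a.account_id
GROUP BY a.account_name;
```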

Apache Drill is the newcomer of the bunch, and it’s difficult to say much about it considering it’s pre-alpha at the time of this post. The team appears to be taking an approach that is a good hybrid of Phoenix and Impala (while trying to integrate with existing technologies). It will be interesting to follow as it develops.

Summary

This has greatly simplified our query and persistence code. What used to be hundreds of lines of code for each new data source is now a schema defined in the Hive metastore, a RESTful service that looks up schemas and persists data according to them, and a few lines of HiveQL (practically SQL) to define each report query.