
Simply Measured Report Themes


The first project assigned to me at Simply Measured was simply called “themes” or “theming”. Themes would offer our customers the ability to change the colors of their reports. At the time, reports were offered in only one dark style – predominantly grays and blues on a black background. Users wanted the ability to customize reports with their brand colors or custom theme colors. They had also requested a lighter, brighter version of the standard theme.

Simply Measured Reports on the web

Some background information

When viewing a Simply Measured report on the web, a few things are happening behind the scenes.

(Image: Excel and online versions of the same report)

Excel File

Your Simply Measured report is an Excel .xlsx file. A .xlsx file is essentially many directories and XML files compressed into a ZIP archive – take a .xlsx file, change the extension to .zip, and you can browse its contents yourself. The structure of the uncompressed files is defined by the Office Open XML File Formats Standard (ECMA-376).
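As a quick illustration (not part of the Simply Measured codebase), here is a minimal sketch that opens a .xlsx as the ZIP archive it really is and lists the XML parts inside it, using only the standard java.util.zip API. The file name is hypothetical.

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxContents {
    public static void main(String[] args) throws Exception {
        // A .xlsx is just a ZIP archive, so the standard ZIP API can open it directly.
        ZipFile xlsx = new ZipFile("report.xlsx"); // hypothetical file name
        try {
            Enumeration<? extends ZipEntry> entries = xlsx.entries();
            while (entries.hasMoreElements()) {
                // Prints part names such as xl/workbook.xml or xl/theme/theme1.xml.
                System.out.println(entries.nextElement().getName());
            }
        } finally {
            xlsx.close();
        }
    }
}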

Magic

The Simply Measured application performs some magic to parse all of the XML files, read them according to the rules defined in ECMA-376, and extract the information needed to display our pretty charts in your browser. Docx4j is just one of the open-source libraries that helps us do this.

JSON

The data extracted from the XLSX is then serialized to JSON, a format that the front end of the application can understand. The result is pretty charts in your browser just as you would see them in Excel.
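The post doesn’t say which JSON library handles this step, so purely as a hedged sketch, here is what the serialization could look like using Jackson and an invented ChartSeries value class. Both the library choice and the field names are assumptions.

import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical value class representing one series extracted from a chart's XML.
class ChartSeries {
    public String name;
    public String colorRgb;     // e.g. "000000"
    public List<Double> values;
}

class ChartJsonWriter {
    private final ObjectMapper mapper = new ObjectMapper();

    // Serialize the extracted series to JSON for the front end to render.
    String toJson(List<ChartSeries> series) throws Exception {
        return mapper.writeValueAsString(series);
    }
}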

Themes – How does it all work?

Well, it’s complicated.

The ECMA-376 specification defines the theme of a document as “the color scheme, fonts, and effects, which can be referenced by parts of the document—such as text, drawings, charts and diagrams”.

Great! We can just swap out the theme files specified in the spec and there you have it: report themes! Not so fast. It turns out that Excel only supports a limited number of colors within a theme, and that limit is far less than the number of colors found throughout our reports. So if Excel only supports a limited number of colors, how is it supporting the range we’re using now? It turns out that colors are strewn through almost every XML file within the structure of an XLSX. They’re in themes, theme overrides, styles, charts, and drawings. This makes it impossible to simply swap out the theme and call the result a “themed report”.

In the end it was decided that a one-to-one color mapping was necessary to transform a report from the dark theme to the light theme (or any other theme). This mapping would then be applied to every XML file that contained RGB values. On top of that, some elements needed more than a straight color swap: a line width that works in a dark report may be too garish in a light report, and a gradient in a dark report may need a larger range of colors in a lighter theme. This definitely wasn’t going to be as easy as replacing the theme.
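To make the one-to-one mapping concrete, here is a minimal sketch of the kind of dark-to-light lookup involved. The hex values are placeholders, not Simply Measured’s actual palette; conceptually, each entry corresponds to a color-replacement rule like the XSLT template shown below.

import java.util.HashMap;
import java.util.Map;

public class ThemeColorMap {
    // Hypothetical one-to-one mapping from dark-theme RGB values to light-theme RGB values.
    public static Map<String, String> darkToLight() {
        Map<String, String> mapping = new HashMap<String, String>();
        mapping.put("000000", "FFFFFF"); // background: black -> white
        mapping.put("1F497D", "4F81BD"); // chart series: dark blue -> lighter blue
        mapping.put("808080", "404040"); // labels: light gray -> dark gray
        return mapping;
    }
}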

The technology to solve the problem

Programmatically re-theming an Excel spreadsheet wasn’t something you could Google or find a solution to on stackoverflow.com. Believe me, I tried. This was a new and interesting problem. We didn’t use anything I would consider “cool” or “cutting edge” to build this feature. We used the right tools for the job; we did cool things with “common” tools.

XSLT and XPath

XSLT (Extensible Stylesheet Language Transformations) is a transformation language that gives you the ability to transform an XML file into another format. The XSLT processor takes an XML source file, applies an XSLT stylesheet to it, and produces the requested output. XSLT let us transform the XML files, changing one RGB color (or any other attribute value) to another. XSLT stylesheets use XPath (XML Path Language) to do pattern matching and to define replacements for those patterns.

Example XSLT stylesheet:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" >

    <xsl:template match="@*|node()">
     <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="/c:chartSpace//a:srgbClr[@val=&apos;000000&apos;]">
     <xsl:copy>
         <xsl:attribute name="val">
            <xsl:value-of select="111111"/>
         </xsl:attribute>
     </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

The first xsl:template is the identity transform: it copies every node and attribute to the result unchanged. The second xsl:template uses an XPath expression to find all a:srgbClr elements that are descendants of the c:chartSpace element and have a val attribute equal to ‘000000’. Because that pattern is more specific, it takes priority over the identity template for those nodes, and its xsl:copy replaces black (000000) with the new value (111111).

For example:

<c:spPr>
  <a:ln w="9525">
    <a:solidFill>
      <a:srgbClr val="000000"/>
    </a:solidFill>
  </a:ln>
</c:spPr>

Will become:

<c:spPr>
  <a:ln w="9525">
    <a:solidFill>
      <a:srgbClr val="111111"/>
    </a:solidFill>
  </a:ln>
</c:spPr>

With Simply Measured report themes, once the stylesheet has been applied to the appropriate XML files, the entire structure is compressed back into an XLSX file that is then viewable in Excel or in our web-based report viewer.
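The post doesn’t show the transformation step itself, but applying a stylesheet like the one above can be done with the standard javax.xml.transform API. Here is a hedged, simplified sketch; the file names are hypothetical, and the real pipeline also has to unzip the XLSX first and zip it back up afterwards.

import java.io.File;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ChartRetheme {
    public static void main(String[] args) throws Exception {
        // Compile the theme stylesheet once; it can be reused for every chart XML part.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new File("dark-to-light.xsl")));

        // Transform a single chart part. In practice this runs over every XML file
        // that contains RGB values before the archive is compressed again.
        transformer.transform(
            new StreamSource(new File("xl/charts/chart1.xml")),
            new StreamResult(new File("xl/charts/chart1-themed.xml")));
    }
}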

Testing with XMLUnit

Our XSLT stylesheets were huge – sometimes thousands of lines long. How do we make sure all of the expected transformations were made to the resulting file? We needed a way to compare the results of each XPath expression on the original file and on the new themed file. We’re using about 30 distinct XPath expressions, which are applied hundreds of times across dynamic color combinations. This is where XMLUnit comes into our JUnit tests. XMLUnit allows you to evaluate XPath expressions on XML and provides a list of nodes that match each expression.

To test the previous example you would need to write a test class that extends org.custommonkey.xmlunit.XMLTestCase. Here is an example of what that could look like, testing the change from RGB 000000 to 111111.

import java.util.HashMap;

import junit.framework.Assert;

import org.custommonkey.xmlunit.NamespaceContext;
import org.custommonkey.xmlunit.SimpleNamespaceContext;
import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.custommonkey.xmlunit.XpathEngine;
import org.custommonkey.xmlunit.exceptions.XpathException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// The class name is illustrative; any class extending XMLTestCase will do.
public class ThemeTransformTest extends XMLTestCase {

    /**
     * Asserts that every a:srgbClr node whose value was beforeRGB in the original
     * document has the value expectedRGB in the transformed document.
     *
     * @param originalDoc original document
     * @param transformedDoc document transformed by XSLT
     * @param beforeRGB original RGB color
     * @param expectedRGB theme RGB color
     * @throws XpathException
     */
    public void testColorReplacement(Document originalDoc, Document transformedDoc,
        String beforeRGB, String expectedRGB) throws XpathException
    {
        XpathEngine engine = createXpathEngine();
        String xpathExpression = "/c:chartSpace//a:srgbClr";

        NodeList originalNodes = engine.getMatchingNodes(xpathExpression, originalDoc);
        NodeList transformedNodes = engine.getMatchingNodes(xpathExpression, transformedDoc);

        // The transformation only rewrites attribute values, so the node count must not change.
        Assert.assertEquals(originalNodes.getLength(), transformedNodes.getLength());

        for (int i = 0; i < originalNodes.getLength(); i++) {
            if (originalNodes.item(i).getAttributes().getNamedItem("val").getNodeValue().equals(beforeRGB)) {
                Assert.assertEquals("RGB color replacement failure", expectedRGB,
                    transformedNodes.item(i).getAttributes().getNamedItem("val").getNodeValue());
            }
        }
    }

    private XpathEngine createXpathEngine() {
        // Register the namespace prefixes used in the XPath expressions.
        HashMap<String, String> m = new HashMap<String, String>();
        m.put("xsl", "http://www.w3.org/1999/XSL/Transform");
        m.put("a", "http://schemas.openxmlformats.org/drawingml/2006/main");
        m.put("c", "http://schemas.openxmlformats.org/drawingml/2006/chart");

        NamespaceContext ctx = new SimpleNamespaceContext(m);
        XpathEngine engine = XMLUnit.newXpathEngine();
        engine.setNamespaceContext(ctx);

        return engine;
    }
}
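Assuming the method above lives in a test class (called ThemeTransformTest here for illustration), a usage sketch might look like the following. The file paths and the concrete test are hypothetical; the key detail is that the DOM parser must be namespace-aware so the c: and a: prefixes in the XPath expression resolve.

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class ChartThemeTest extends ThemeTransformTest {

    // A concrete JUnit test exercising the color-replacement assertion
    // against one original/transformed chart part.
    public void testBlackBecomes111111() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the c: and a: prefixes to match
        DocumentBuilder builder = factory.newDocumentBuilder();

        Document original = builder.parse(new File("chart1.xml"));          // hypothetical paths
        Document transformed = builder.parse(new File("chart1-themed.xml"));

        testColorReplacement(original, transformed, "000000", "111111");
    }
}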

Integrating Hive and HBase

Why?

HBase makes an excellent storage engine, but the interfaces for interacting with it are very low-level. You have to clearly understand the intricacies of how row keys, column families, and cells are laid out, and how your query patterns will behave against that layout.

This model is powerful and enables very high-performance query and persistence systems, but it starts to break down when analysts want to query data in an ad-hoc fashion, or when even a simple aggregation query requires custom code using HBase scanners or a map/reduce job.
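To give a feel for what custom code using HBase scanners means in practice, here is a hedged sketch of a trivial per-account count written against the HBase client API of the era. The column qualifier is hypothetical (the real column mapping appears in the Hive example later), and the same aggregation is a one-line GROUP BY in SQL.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TweetsPerAccount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tweets"); // table name from the Hive example below

        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("tweet"), Bytes.toBytes("account_id")); // hypothetical qualifier

        Map<String, Long> counts = new HashMap<String, Long>();
        ResultScanner scanner = table.getScanner(scan);
        try {
            // Client-side aggregation: every matching row is shipped to this process.
            for (Result row : scanner) {
                String account = Bytes.toString(
                    row.getValue(Bytes.toBytes("tweet"), Bytes.toBytes("account_id")));
                Long current = counts.get(account);
                counts.put(account, current == null ? 1L : current + 1L);
            }
        } finally {
            scanner.close();
            table.close();
        }
        System.out.println(counts);
    }
}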

In most shops, you end up with a small team of engineers dedicated to writing map/reduce jobs and maintaining your cluster. These engineers tend to operate as developers, DBAs and operations engineers. This structure is not sustainable as your organization grows.

Possible solutions

The Hadoop and big data world is starting to embrace SQL, or other query languages, as an interface to help with this. Essentially, what was old is new again.

In HBase, you have several potential options:

  • Apache Hive – Hive is a complete data warehouse for Hadoop, which includes an HBase storage adapter.
  • Phoenix – a SQL layer for HBase by the engineers at Salesforce.
  • Apache Drill – Drill is an implementation of the Google Dremel paper, focused on maintaining a pluggable model for different storage engines, and utilizing existing open source wherever possible.
  • Cloudera Impala – Impala is very similar to Drill, but takes a slightly different approach. It’s written in C++, utilizing JNI when it needs to interact with Java. It takes a “brute-force” approach to interacting with HDFS, and the HBase APIs when dealing with HBase.

Enter Hive, and how we’re using it

At Simply Measured, we’ve chosen Hive for a few reasons:

  1. Maturity

    It’s hard to overstate how important this is. Our reports are the “lifeblood” of our company. If we can’t generate them, our customers have no reason to pay us. It was important that we chose a technology already proven in the market. Hive started as a project at Facebook, transitioned to the Apache Software Foundation, and has several years of use and bug fixes under its belt.

  2. Support for nested data types (ARRAY, STRUCT, MAP)

    Support for nested data types is important for us. A good example is Twitter “tweet” data. A single tweet could be tracked by several different customers, and by using an ARRAY data type we can tag that tweet with multiple customers in a fashion that is still easily queryable. This data is, for most intents and purposes, immutable after it’s written.

    Given a schema:

    CREATE EXTERNAL TABLE tweets (
      tweet_id_hash STRING,
      tweet_id STRING,
      ...
      metadata STRUCT<accounts: ARRAY<STRING>>
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,tweet:tweet_id,metadata:metadata',
      'hbase.table.default.storage.type' = 'binary'
    )
    TBLPROPERTIES (
      'hbase.table.name' = 'tweets'
    );

    This query is possible:

    SELECT *
    FROM tweets t
    LATERAL VIEW explode(t.metadata.accounts) account_view AS account_id
    WHERE account_id = 'foo';

    Our queries aren’t exactly like this in production, but this gives you a high-level picture. It does come at the cost of a map/reduce job, but if you can also use your key to limit the query, Hive can perform “predicate pushdown” into HBase’s APIs, thereby limiting the scope of the map/reduce scan.

  3. HCatalog and the Hive Metastore

    One challenge as our organization has grown has been schema management and documentation. We are split into a few different engineering teams dealing with reports – one team handles data collection, one handles data storage and reporting, and one builds the report queries and templates. Data Collection determines what fields are available from the different data sources (Twitter, Facebook, Tumblr, etc.). Platform (data storage) gets this data persisted to the database. Analysts write reports and queries against this data.

    The Hive Metastore ties this all together. Data Collection works with Platform to create tables and to organize the data. This resulting schema can be communicated to the Analyst team so they can build reports.

    HCatalog provides a great REST API for interacting with the Hive Metastore, and we are building internal tooling around this. It also simplifies our data storage and retrieval methods. We’ve built a library called prognosticator that can read a HCatalog schema, and given a Map, will persist that data to HBase in a form that Hive can read.

  4. Extensibility

    Hive didn’t provide every function we would need, so we’ve started a project, sm-hive-udf, with the UDFs (user-defined functions) that we need.

Trade-offs

This does come with a trade-off: map/reduce jobs. The total throughput on a map/reduce job is fantastic, at the cost of latency. Our reporting is essentially a batch operation to begin with, so this works well with map/reduce. We are looking forward to initiatives like Hortonworks’ Stinger Initiative to greatly improve the performance of Hive.

As with any project, Hive has had a few bugs that have impacted us, but since it’s open source, we can apply patches and contribute back to the project, not only improving our product, but helping others do the same.

The other projects

Each of the other SQL layers mentioned has pros and cons of their own.

Phoenix is the most tightly integrated with HBase. Its query parser/engine generates native HBase API calls, utilizing skip scans and coprocessors. We might have gone with this project if it had supported complex data types, and we will evaluate it again as it continues to advance. Because it focuses solely on HBase, it doesn’t let you query across disparate data sources the way Hive, Drill, and Impala can.

Cloudera Impala is a promising piece of technology. It avoids map/reduce entirely while still allowing you to query and join across disparate data sources (sequence files in HDFS, Parquet files in HDFS, HBase tables, etc.). It currently doesn’t support complex data types, and it has single points of failure in its query execution model.

Apache Drill is the newcomer of the bunch, and it’s difficult to say too much about it considering it’s pre-alpha at the time of this post. The team appears to be taking an approach that is a good hybrid of Phoenix and Impala (and trying to integrate with existing technologies). It will be interesting to follow as it develops.

Summary

This has greatly simplified our query and persistence code: what used to be hundreds of lines of code for each new data source is now a schema defined in the Hive metastore, a RESTful service that looks up schemas and persists data according to them, and a few lines of HiveQL (practically SQL) to define the report query.