Testing Cascading applications

This post explores how we apply our test-driven-development philosophy to analytics problems. In particular, it shows how to use test-driven development with Cascading, which we’ve recently started using to drive analytics at Wealthfront.

Cascading lets us specify complicated analytics pipelines in Java. It works well for problems that would normally require multiple MapReduce jobs to get a single result of interest, since the developer isn’t forced to work out the best way to cast the problem into a set of interrelated MapReduce jobs. The abstraction layer it provides is similar to SQL, with the ability to easily define custom functions and aggregators. Behind the scenes, Cascading generates a directed acyclic graph of interrelated Hadoop steps.

When we first started experimenting with Cascading, we didn’t find any good examples of testing the functionality of a Cascading job. We found that common practice is to “test” a job by running it on a sampled data set and making sure nothing blows up. This is not the same as verifying the functional correctness of the underlying algorithm. See, for instance, the overview of testing in the Cascading documentation.

At Wealthfront, we require all code to have good test coverage. Code only ships when all tests pass. After all, people trust us to manage their money, and buggy code is unacceptable. We use analytics to drive business decisions, improve our product, and improve our operations. This led us to think hard about how we wanted to test our Cascading applications. This blog post explores the best practices we’ve come up with for functionally testing our analytics applications.

 

Modularity

Good testing requires a modular code base. Testing smaller modules in isolation keeps the scope bounded enough that it’s manageable to think through all the edge cases that require testing. In Cascading, the SubAssembly class provides a logical unit of modularity. As an example, consider this SubAssembly I wrote to filter out page views generated by bots:

In this code, we classify some subset of bot-generated page views based on IP address, remote host, and user agent using the IsNonBotPageView filter (not shown here for brevity). This SubAssembly is part of a larger analytics application that measures conversion rates for visitors to the site. We need to filter out bots so that they don’t pollute the numbers we get. Because we wrote this as a separate module, we can test bot filtering in isolation from the larger computation.
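To illustrate why an isolated module makes the edge cases easy to enumerate, here is a minimal sketch of a bot rule in plain Java with no Cascading dependency. The class name BotPageViewRules, its method, and every IP address, host suffix, and user-agent marker in it are hypothetical assumptions made for illustration; the actual rules inside IsNonBotPageView are not shown in this post.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the rule behind a bot filter.
// All values below are illustrative assumptions, not real production rules.
final class BotPageViewRules {
    // 192.0.2.0/24 is a range reserved for documentation examples.
    private static final Set<String> KNOWN_BOT_IPS =
            new HashSet<>(Arrays.asList("192.0.2.10", "192.0.2.11"));
    private static final String[] BOT_HOST_SUFFIXES = {".googlebot.com"};
    private static final String[] BOT_AGENT_MARKERS = {"bot", "crawler", "spider"};

    private BotPageViewRules() {}

    // Returns true when no rule flags the page view as bot-generated.
    static boolean isNonBotPageView(String ipAddress, String remoteHost, String userAgent) {
        if (ipAddress != null && KNOWN_BOT_IPS.contains(ipAddress)) {
            return false;
        }
        String host = normalize(remoteHost);
        for (String suffix : BOT_HOST_SUFFIXES) {
            if (host.endsWith(suffix)) {
                return false;
            }
        }
        String agent = normalize(userAgent);
        for (String marker : BOT_AGENT_MARKERS) {
            if (agent.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    // A missing field becomes the empty string, so it simply matches no rule.
    private static String normalize(String value) {
        return value == null ? "" : value.toLowerCase(Locale.ROOT);
    }
}
```

Because the rule takes three plain strings, each edge case (a blocked IP, a mixed-case host name, a missing user agent) can be checked with a one-line assertion before the rule is wired into a larger flow.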

In-memory tests

We like our tests to be fast. Slow tests can’t get run often enough to keep up with the fast-paced development cycle we have at Wealthfront. One key ingredient to making tests fast is eliminating IO (network and disk). All of our Cascading tests run entirely in memory.
Cascading provides the LocalFlowConnector class, which can execute a Cascading flow (job) in memory in a single process. This is great for us, but all of the Taps (data locations) provided by Cascading require IO. To get around this, we’ve created a small set of Tap classes for testing that source and sink data in memory.
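
The general pattern behind in-memory sources and sinks can be sketched in plain Java. This is not Cascading’s Tap API and not our actual tap classes: RecordSource, RecordSink, InMemorySource, InMemorySink, and DropEmptyLinesJob are all hypothetical names invented for this sketch, which only shows how code written against an abstract source and sink can be exercised without touching disk or network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the idea of a tap; these are NOT Cascading's Tap API.
interface RecordSource<T> {
    Iterator<T> open();
}

interface RecordSink<T> {
    void write(T record);
}

// Serves records from a list held in memory, so reading involves no IO.
final class InMemorySource<T> implements RecordSource<T> {
    private final List<T> records;

    InMemorySource(List<T> records) {
        this.records = new ArrayList<>(records); // defensive copy
    }

    @Override
    public Iterator<T> open() {
        return records.iterator();
    }
}

// Collects everything written to it so a test can assert on the output afterwards.
final class InMemorySink<T> implements RecordSink<T> {
    private final List<T> written = new ArrayList<>();

    @Override
    public void write(T record) {
        written.add(record);
    }

    List<T> contents() {
        return Collections.unmodifiableList(written);
    }
}

// Toy job (an assumption for illustration): copies non-empty lines from source to sink.
final class DropEmptyLinesJob {
    private DropEmptyLinesJob() {}

    static void run(RecordSource<String> source, RecordSink<String> sink) {
        Iterator<String> lines = source.open();
        while (lines.hasNext()) {
            String line = lines.next();
            if (!line.isEmpty()) {
                sink.write(line);
            }
        }
    }
}
```

A test builds an InMemorySource from a literal list, runs the job, and compares the sink’s contents() against the expected list, all in one process.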

Serialization

We use cascading.avro for the inputs and outputs of all Cascading jobs. We like Avro because files can be compressed with Snappy and still streamed to Hadoop, which helps us minimize our disk footprint. To simulate readable and writable Avro files in memory, we’ve created the AvroInMemoryInputTap and AvroInMemoryOutputTap classes, respectively.

Since Cascading stores intermediate values in Hadoop SequenceFiles and makes them visible via Tuple objects, we’ve also created similar TupleInputTap and TupleOutputTap classes. These are useful for intermediate input values for which we haven’t defined an Avro schema. To see how this all comes together, below is a test for the above SubAssembly: