Wednesday, September 5, 2012

Explore Your Multivariate Data with Crossfilter

In every business, a deep understanding of your customers is a necessity if you want to consistently exceed their expectations for your product or service. Every page a customer views likely generates metrics that will eventually be crunched and digested to help your company improve, even if just a little. At Wealthfront, where we practice TDD heavily, we've always had a saying that a feature isn't complete until it's tested, but we've also had a saying that a feature isn't complete until it generates metrics.

Now, when most people talk about metrics and analytics, they usually focus on one of two major areas: the aggregation and processing of your data, or its presentation and visualization. I think it's fair to leave the bulk of the processing to the big data guys out there, and as far as visualizations go, it's hard to have missed the work done by Mike Bostock (the D3 library powering some viz for the NYT and more) and the rest of his cronies.

For this post, I'm going to focus on Square's little JavaScript library Crossfilter (also bearing quite a few Mike Bostock fingerprints). It sits nicely between your data and your presentation, providing some handy transformations, making it easy to work with larger datasets, and letting you present them in D3 with interactivity across multiple variables. I'm sure I've hit my buzzword quota by this point, but it's important to stress the word "interactivity" here. Being able to isolate variables, interact with them, and see how they relate to other aspects of your data is a great way to tease out an understanding of how your data "fits" together.

Now let's get to some code. Let's say we have a list of US Presidents:

var presidents = [
  {
    "number":1,
    "president":"George Washington",
    "birth_year":1732,
    "death_year":1799,
    "took_office":"1789-04-30",
    "left_office":"1797-03-04",
    "party":"No Party"
  },{
   ...
  },{
    "number":44,
    "president":"Barack Obama",
    "birth_year":1961,
    "death_year":null,
    "took_office":"2009-01-20",
    "left_office":null,
    "party":"Democratic"
  }
];

We'll quickly convert our Strings to Date objects.

var ymdFormat = d3.time.format("%Y-%m-%d");
presidents.forEach(function(p) {
  p.took_office = ymdFormat.parse(p.took_office);
  // left_office is null for a sitting president, so only parse real strings.
  p.left_office = p.left_office ? ymdFormat.parse(p.left_office) : null;
});
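If D3 isn't on hand, the same conversion can be sketched in plain JavaScript. The helper name parseYmd is hypothetical; it assumes well-formed YYYY-MM-DD strings and passes null through, which matters for the left_office of a sitting president.

```javascript
// Hypothetical helper: parse "YYYY-MM-DD" into a local-time Date.
// null (or any value that doesn't match the pattern) comes back as null
// instead of throwing.
function parseYmd(s) {
  var m = typeof s === "string" && /^(\d{4})-(\d{2})-(\d{2})$/.exec(s);
  if (!m) return null;
  // The Date constructor takes a zero-based month, so subtract 1.
  return new Date(+m[1], +m[2] - 1, +m[3]);
}

var tookOffice = parseYmd("1789-04-30");
var leftOffice = parseYmd(null);

console.log(tookOffice.getFullYear(), tookOffice.getMonth() + 1, tookOffice.getDate());
console.log(leftOffice);
```

The null check is the part worth copying: whichever parser you use, a missing date should stay missing rather than blow up the loop.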

Introducing Crossfilter.

The beauty of Crossfilter is that you can take data that you have, define a few key dimensions for it, and then work on your dataset by those dimensions. While there are plenty of good libraries that provide good filtering, sorting, grouping, and reducing capabilities, the real value of Crossfilter is how the library manages the full dataset for you efficiently, as opposed to running your data through the cascading transforms for each representation you need.

In our example we're looking at presidents, so let's facet our data by their political party.

// Use the crossfilter force.
var cf = crossfilter(presidents);

// Create our dimension by political party.
var byParty = cf.dimension(function(p) { return p.party; });

We can apply filters and groupings to the data easily. We'll start by grouping our data by our party dimension. A group can be reduced in a variety of ways: counting the values under each key, totalling some property of the grouped values, or applying your own aggregation. Out of the box, Crossfilter does a count aggregation.

var groupByParty = byParty.group();
groupByParty.top(Infinity).forEach(function(p, i) {
  console.log(p.key + ": " + p.value);
});

We get the result:

Republican: 18
Democratic: 16
Whig: 4
Democratic-Republican: 4
Federalist: 1
No Party: 1
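
To make the reduction options concrete, here is a plain-JavaScript sketch of what a grouped count and a custom grouped aggregation compute. It is not how Crossfilter is implemented (the library maintains its groups incrementally); groupReduce, the took_office_year field, and the five-row sample are hypothetical and only for illustration. In Crossfilter itself, the corresponding calls are group.reduceCount(), group.reduceSum(fn), and group.reduce(add, remove, initial).

```javascript
// A small, hand-picked sample; the real dataset has 44 rows.
var sample = [
  { president: "George Washington",     party: "No Party",   took_office_year: 1789 },
  { president: "Abraham Lincoln",       party: "Republican", took_office_year: 1861 },
  { president: "Theodore Roosevelt",    party: "Republican", took_office_year: 1901 },
  { president: "Franklin D. Roosevelt", party: "Democratic", took_office_year: 1933 },
  { president: "Barack Obama",          party: "Democratic", took_office_year: 2009 }
];

// Hypothetical helper: bucket rows by key and fold each row into the
// bucket's running value with `add`, starting from `initial()`.
function groupReduce(rows, keyOf, add, initial) {
  var buckets = {};
  rows.forEach(function(row) {
    var key = keyOf(row);
    if (!Object.prototype.hasOwnProperty.call(buckets, key)) {
      buckets[key] = initial();
    }
    buckets[key] = add(buckets[key], row);
  });
  return Object.keys(buckets).map(function(key) {
    return { key: key, value: buckets[key] };
  });
}

function partyOf(p) { return p.party; }

// Count per party, the same kind of result a default group reports.
var counts = groupReduce(sample, partyOf,
  function(n) { return n + 1; },
  function() { return 0; });

// Custom aggregation: earliest year any member of the party took office.
var earliest = groupReduce(sample, partyOf,
  function(min, p) { return Math.min(min, p.took_office_year); },
  function() { return Infinity; });

console.log(counts);
console.log(earliest);
```

Note that Crossfilter's custom reduce also takes a remove function, since it has to undo a record's contribution when a filter excludes it; this sketch recomputes from scratch and so only needs add.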

Note that we get data out of a dimension or group with the top() function, passing the maximum number of values we want back. When we pass Infinity, we get the full list of values.

Let's filter our presidents down to those who belonged to the Whig party and display the list.

byParty.filterExact("Whig");
byParty.top(Infinity).forEach(function(p, i) {
  console.log(p.number + ". " + p.president);
});

Produces the list:

13. Millard Fillmore
9. William Henry Harrison
10. John Tyler
12. Zachary Taylor

Keep in mind that top() orders records by the dimension's value, which here is the political party. Since every record left after the filter has the same party, the order among them isn't meaningful and we shouldn't rely on it.
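
If the display order matters, one option is to sort the returned records explicitly rather than depend on how ties come back. A minimal sketch, using the four records listed above as a stand-in for the array returned by byParty.top(Infinity):

```javascript
// Stand-in for the array returned by byParty.top(Infinity) under the Whig filter.
var whigs = [
  { number: 13, president: "Millard Fillmore" },
  { number: 9,  president: "William Henry Harrison" },
  { number: 10, president: "John Tyler" },
  { number: 12, president: "Zachary Taylor" }
];

// Copy before sorting so the original array is left untouched,
// then order by presidency number, ascending.
var ordered = whigs.slice().sort(function(a, b) {
  return a.number - b.number;
});

ordered.forEach(function(p) {
  console.log(p.number + ". " + p.president);
});
```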

Let's clear our filter.

byParty.filterAll();

Adding a second dimension to manipulate our dataset in concert with the first is easy. We'll create a dimension by the date a president took office.

// Create a dimension by the date a president took office.
var byTookOffice = cf.dimension(function(p) { 
  return p.took_office; 
});
console.log("Total # of presidents: " 
  + byTookOffice.top(Infinity).length);

// Filter to presidents who took office on or after January 1, 1900.
// Date months are zero-based, so 0 is January.
byTookOffice.filter([new Date(1900, 0, 1), Infinity]);
console.log("# of presidents starting after 1900: " 
  + byTookOffice.top(Infinity).length);
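
One detail worth keeping in mind when building the lower bound of a date range: JavaScript's Date constructor takes a zero-based month. Crossfilter's range filter is also documented as including the lower bound and excluding the upper one. The sketch below checks both ideas on plain values; inRange is a hypothetical helper, not Crossfilter's code.

```javascript
// Months are zero-based in the Date constructor: 0 is January, 1 is February.
var jan1 = new Date(1900, 0, 1);
var feb1 = new Date(1900, 1, 1);
console.log(jan1.getMonth(), feb1.getMonth());

// Hypothetical helper: half-open range test, lo <= value < hi.
function inRange(value, lo, hi) {
  return value >= lo && value < hi;
}

var mckinley = new Date(1897, 2, 4);   // took office 1897-03-04
var roosevelt = new Date(1901, 8, 14); // took office 1901-09-14

console.log(inRange(mckinley, jan1, Infinity));
console.log(inRange(roosevelt, jan1, Infinity));
```

Dates compare by their numeric timestamps, which is why Infinity works as an open-ended upper bound.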

The results show that filtering to presidents who started after 1900 reduces the list from 44 to 19. Additionally, without any extra work on our part, the group on our byParty dimension was updated too, and we can list the count by party with the same code as before:

groupByParty.top(Infinity).forEach(function(p, i) {
  console.log(p.key + ": " + p.value);
});

This time it results in:

Republican: 11
Democratic: 8
Federalist: 0
No Party: 0
Whig: 0
Democratic-Republican: 0
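
Conceptually, each group reports on the records that pass the filters set on the other dimensions, while still listing every key (hence the zeros above). The plain-JavaScript sketch below mimics that for a hand-picked five-row sample; countByParty is a hypothetical helper, and the naive re-scan it performs is the kind of work Crossfilter avoids by updating its groups incrementally.

```javascript
// A small, hand-picked sample; the real dataset has 44 rows.
var sample = [
  { president: "George Washington",     party: "No Party",   took_office: new Date(1789, 3, 30) },
  { president: "Abraham Lincoln",       party: "Republican", took_office: new Date(1861, 2, 4) },
  { president: "Theodore Roosevelt",    party: "Republican", took_office: new Date(1901, 8, 14) },
  { president: "Franklin D. Roosevelt", party: "Democratic", took_office: new Date(1933, 2, 4) },
  { president: "Barack Obama",          party: "Democratic", took_office: new Date(2009, 0, 20) }
];

// Hypothetical helper: count rows per party, but only rows that pass the
// predicate. Every party still gets a key, so filtered-out parties show 0.
function countByParty(rows, passes) {
  var counts = {};
  rows.forEach(function(p) {
    if (!Object.prototype.hasOwnProperty.call(counts, p.party)) {
      counts[p.party] = 0;
    }
    if (passes(p)) counts[p.party] += 1;
  });
  return counts;
}

// No filter on the date dimension.
var all = countByParty(sample, function() { return true; });

// With the date filter applied, the party counts change on their own.
var cutoff = new Date(1900, 0, 1);
var after1900 = countByParty(sample, function(p) {
  return p.took_office >= cutoff;
});

console.log(all);
console.log(after1900);
```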

Pretty Pictures (Sorta)

Finally, I'll pull it together into a D3 visualization. There are plenty of good tutorials on that, and using Crossfilter with D3 is trivial to set up: just use the records you get from .top() as the data for your visualization. (You can click on the fiddle to see the code.)

And that's pretty much it. There are a few other neat little features, but in general, the library keeps its scope pretty tight and just does a great job for this tremendously useful use case. The code is really cool, so I encourage you to check out what's happening under the hood. The flights dataset and visualization on the Crossfilter page is a nice piece of eye candy, and you can immediately see its utility for your own kinds of problems.