Tuesday, November 24, 2015

3 Ways to Integrate Third-Party Metrics into Your Data Platform

From both a company and an engineering standpoint, Wealthfront values making decisions backed by quantitative methods. As noted in our previous blog post, Statistics is Eating the World, the key to enabling decisions backed by data is to build a platform that is automated and can scale across all aspects of our business. One of the most valuable characteristics of our data platform is its integration with third-party systems, which lets us include external data in our models and dashboards. In this post I'll go over why integration with third parties is valuable and outline 3 strategies we use to get data from them. I'll also cover common considerations that we make to ensure our data pipeline runs reliably.

Tracking Data Internally vs. Externally

Often, analysis can be done exclusively with data from our internal systems. For example, if Wealthfront were interested in what information our clients want to see for their accounts, we could monitor web usage and page views to start forming conclusions about visitor intent. There are other cases, however, where the data we need lies outside of our internal systems. For example, if we needed to gather information about a certain traded security's prices, we would need to obtain that data externally from an exchange or data vendor. Here are some general uses for importing data from external sources:
• We import data from systems that interact directly with our major systems to confirm that they are functioning correctly
• We import data from vendors that don't interact with our systems, but impact our business metrics
• We import data from providers that will enable additional analysis and data quality

For these cases, we need an automated pipeline that can access data from the third-party source and transform it into a format that can be stored in our data platform for use in models and dashboards. This process is commonly known in industry as Extract, Transform, Load, or ETL. Third-party systems come in all shapes and sizes, so no single strategy will suffice to extract data from every one of them. The 3 ways we commonly ETL data are as follows:
1. Making calls to a Third-Party REST or SOAP API
2. Web scraping
3. Secure File Transfer Protocol (SFTP)

REST and SOAP APIs

Often, a vendor will provide an Application Programming Interface, or API, for its users to obtain key metrics for the service it provides. When extracting data through an API, the ETL code makes an HTTP request to a given API endpoint for the desired metrics. Depending on the individual system, the API returns a response in a standardized format such as JSON, CSV, or XML. The ETL parses the response, converts the data to a normalized format (Avro in Wealthfront's case), and writes it to data storage. Here is an example ETL that obtains trading prices for Google, Apple, and Facebook stock from a fictional API:

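import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableList;
import java.io.IOException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// HttpsConnection, PriceList, and StockPriceAvro are project classes not shown here.
public class StockPricesETL {
  private static final String API_URL = "https://api.stockprices.com";
  private static final String STOCK_PRICES_ENDPOINT = "/prices/%s";
  private static final String ACCESS_TOKEN = "?access_token=someAccessToken";
  private static final String DATE_PARAM = "&date=%s";
  private static final List<String> STOCKS = ImmutableList.of("GOOG", "FB", "AAPL");
  // Pause between requests, in milliseconds.
  private static final long SLEEP_TIME = 2000;

  public List<StockPriceAvro> getStockPrices(LocalDate date)
      throws IOException, InterruptedException {
    final ObjectMapper mapper = new ObjectMapper();
    final List<StockPriceAvro> stockAvros = new ArrayList<>();

    for (String stock : STOCKS) {
      // Request the day's prices for one ticker and parse the JSON response.
      final String responseJSON = HttpsConnection.sendGet(formatEndpoint(date, stock));
      final PriceList parsedStocks = mapper.readValue(responseJSON, PriceList.class);
      // toAvros() is a hypothetical conversion from the parsed response to Avro records.
      stockAvros.addAll(parsedStocks.toAvros());
      Thread.sleep(SLEEP_TIME);
    }

    return stockAvros;
  }

  // Builds a URL of the form API_URL/prices/<stock>?access_token=<token>&date=<date>.
  private String formatEndpoint(LocalDate date, String stock) {
    return String.format("%s%s%s%s", API_URL,
        String.format(STOCK_PRICES_ENDPOINT, stock),
        ACCESS_TOKEN,
        String.format(DATE_PARAM, date));
  }
}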

In addition to calling the correct endpoint (api.stockprices.com), the request must include the proper authorization (?access_token=someAccessToken). In this example, the access token is included in the URL of the request as a query parameter. Depending on the API, the authorization may need to be sent another way, such as in a header of the HTTP request or in the body of the request (common for SOAP APIs). When using APIs you may also need to be mindful of how frequently you make requests, as many vendors enforce a rate limit on API requests.
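As a rough illustration of header-based authorization, here is a minimal sketch that builds the same request with the token in an Authorization header rather than in the query string. It uses the java.net.http client from Java 11; the class name HeaderAuthExample, the buildRequest helper, and the "Bearer" scheme are assumptions made for illustration, and the real header name and scheme depend on the vendor's API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class HeaderAuthExample {
  // Builds a GET request that carries the token in an Authorization header
  // instead of the query string. The "Bearer" scheme is an assumption; the
  // vendor's API documentation determines the real scheme and header name.
  static HttpRequest buildRequest(String stock, String date, String accessToken) {
    URI uri = URI.create("https://api.stockprices.com/prices/" + stock + "?date=" + date);
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .GET()
        .build();
  }

  public static void main(String[] args) {
    // Only builds and inspects the request; nothing is sent to the fictional API.
    HttpRequest request = buildRequest("GOOG", "2015-11-24", "someAccessToken");
    System.out.println(request.uri());
    System.out.println(request.headers().firstValue("Authorization").orElse("<missing>"));
  }
}
```

Keeping the token out of the URL also keeps it out of most access logs, which is one reason some APIs require this style.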

Web Scraping

Not all vendors provide an API to extract data from. In these cases it is necessary to scrape the web user interface or dashboard that users log into via their browsers. When codified into an ETL, this looks similar to the ETL for APIs: HTTP requests are made to the web server intended for internet browsers, and once proper authorization is provided, the ETL can access a URL that has the relevant data. The data will often be in a standardized format such as CSV, which needs to be parsed and converted into the appropriate format for data storage.

Implementing a web scraping strategy can be frustrating compared to writing ETLs that request directly from an API. Many websites are not built with the intention of being browsed by a script, so the engineer has to creatively devise the right sequence of HTTP requests and HTML cues to get to the right data. Additionally, there is no guarantee from the vendor that the website will stay the same, so a scheduled ETL can fail without forewarning. For these ETLs, we build monitoring to detect and alert when an ETL has failed due to potential updates made to the web server.
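One simple way to notice that a scraped page has changed is to validate the shape of the response before parsing it. The sketch below is an assumption about how such a check might look, not our production monitoring: the class name ScrapedCsvCheck and the column names in EXPECTED_HEADER are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ScrapedCsvCheck {
  // Hypothetical column layout that the ETL was written against.
  static final List<String> EXPECTED_HEADER = Arrays.asList("date", "clicks", "spend");

  // Throws if the first line of the scraped body is not the expected CSV header,
  // for example when the scraper receives a login or error page instead of data.
  static void validateHeader(String body) {
    String firstLine = body.split("\\R", 2)[0].trim();
    List<String> actual = Arrays.asList(firstLine.split("\\s*,\\s*"));
    if (!actual.equals(EXPECTED_HEADER)) {
      throw new IllegalStateException(
          "Unexpected CSV header, the vendor site may have changed: " + firstLine);
    }
  }
}
```

Failing loudly at this step turns a silent layout change into an exception that monitoring can alert on.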

Secure File Transfer Protocol (SFTP)

Some vendors have neither an API nor a web dashboard, or simply prefer transferring data over SFTP. Typically, the two parties agree on the file directory and file name, as well as a standardized format such as CSV. The ETL for this strategy consists of making an SFTP get request to read in the desired file. The file should ideally be in a standardized format, which can be parsed and converted into a type that can be written to data storage.

The implementation of the ETL using SFTP is fairly straightforward, but there is some added overhead in the maintenance of the SFTP server itself. The access for users performing put and get requests needs to be set up so that no user can read or change data beyond what is appropriate. In addition, file sizes should be monitored to make sure the disk doesn't fill up, and ETL data should be backed up in a separate location in case the SFTP server fails.

Qualities of Reliable Third-Party ETLs

An important consideration of third-party ETLs is that they are communicating with a system that we have no control over. Situations can and will occur when your scheduled ETL fails because it attempts to access a system that is down or has been modified. Here are a few examples of instrumentation that has helped us maintain reliable ETLs:
• Exception handling: we make sure that the system properly publishes exceptions if the ETL makes a bad request, or if it cannot convert a response into the normalized format (e.g. because of a missing or null value).
• Retrying requests: given inconsistencies in network connections, request queueing and other factors, a request may fail to get a proper response even though the third-party service is running fine. In these cases, implementing standard retry functionality will help ETLs succeed despite these issues.
• Data quality checks: sometimes the third party will return a successful response, but with no data or incorrect data. Checks on the ETL output, or downstream data quality checks on recent data, can help catch these incidents.
• Well-structured and maintainable code: it's common for a third-party ETL to need updates during its lifetime, either because of a version update in the API or because of a change in the web UI for a web-scraping ETL. Keeping the code readable and maintainable will minimize the time needed to update it.
• Real-time monitoring and reporting: as is true for any live system, it is necessary to create instrumentation that will check the status of scheduled ETLs. At the minimum, monitoring should check that ETLs run on schedule and execute successfully. Should the ETL fail for any reason, the system should notify on-call engineers that a production-level failure has occurred.
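The retry behavior described above can be sketched in a few lines. This is a minimal illustration rather than our production code: the class name Retry, the method withRetries, and the doubling backoff between attempts are all assumptions.

```java
import java.util.concurrent.Callable;

final class Retry {
  // Calls the task until it succeeds or maxAttempts is reached, doubling the
  // wait between attempts. Rethrows the last failure if every attempt fails.
  static <T> T withRetries(Callable<T> task, int maxAttempts, long initialDelayMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delayMs = initialDelayMs;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        last = e;
        // Wait before the next attempt, but not after the final one.
        if (attempt < maxAttempts) {
          Thread.sleep(delayMs);
          delayMs *= 2;
        }
      }
    }
    throw last;
  }
}
```

A request from the earlier example could then be wrapped as Retry.withRetries(() -> HttpsConnection.sendGet(url), 3, 1000). Because the last exception is rethrown, a persistent failure still surfaces to exception handling and monitoring.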

Thursday, October 29, 2015

How We Avoid A/B Testing Errors with Shrinkage

At Wealthfront, we use A/B tests to make our services more effective, informative, and convenient. Each A/B test consists of two or more variants of a feature or design, which we expose to visitors at random until we have enough data to infer which is better. In a previous post, we explained why we use Bayesian inference for evaluating A/B tests. Today I will share a technique we use to avoid inference errors on A/B tests with many variants: a concept called shrinkage. We'll start with a hypothetical use case that everyone can relate to: bananas.

Running tests at the online banana stand

Let's say you work for a popular online banana stand, and I ask you what proportion of banana stand visitors purchase a banana. Without any data, it would be reasonable to assume that the proportion $\theta_\text{control}$ is a value somewhere between 0% and 100% with equal probability. While the proportion is a fixed value, this probability distribution captures our uncertainties in its true value. The essence of Bayesian statistics is updating this distribution as we get more data.

Now what if you know the proportion of banana stand visitors who buy a banana, and I ask you what the proportion will be if you change your website's buttons from green to pink? You have no data about this scenario, so you’ll have to guess again, but it is likely that this new proportion, $\theta_\text{pink}$, is close to the original proportion, $\theta_\text{control}$. Shrinkage is the technique we use to incorporate this intuition into our Bayesian model. As we collect data, using shrinkage helps us carefully discern whether visitors prefer the pink buttons to the green ones. Before we explain shrinkage in detail, let’s cover how Bayesian models work.

Bayesian models

As mentioned above, a simple assumption is that $\theta_\text{control}$ is uniformly distributed between 0 and 1. Mathematically speaking, that means the probability density $P(\theta_\text{control})$ is constant between $\theta_\text{control} = 0$ and $\theta_\text{control} = 1$; $P(\theta_\text{control}) = 1$. We can easily update this distribution as we collect more data $D$. If I observe a random visitor and find out that they buy a banana, I have a data point $y_1=1$, and Bayes rule states that the probability of $\theta_\text{control}$ given this information $D = \{y_1 = 1\}$ is

$P(\theta_\text{control}|D) = \frac{P(\theta_\text{control})P(D|\theta_\text{control})}{P(D)} \qquad [1]$

We already decided that $P(\theta_\text{control}) = 1$. By definition, $\theta_\text{control}$ is the proportion of visitors that purchase bananas, so the probability $P(D|\theta_\text{control})$ of observing that this person bought a banana, given $\theta_\text{control}$, is just $\theta_\text{control}$. Plugging these two facts into equation [1] gives us that

$P(\theta_\text{control}|D) = \frac{\theta_\text{control}}{P(D)} \qquad [2]$

The only thing we don't know yet is $P(D)$, which we can calculate by enforcing that the total probability is 1:

$\int_0^1P(\theta_\text{control}|D)d\theta_\text{control} = 1$

Using equation [2],

$\int_0^1\frac{\theta_\text{control}}{P(D)}d\theta_\text{control} = 1$

Since $\int_0^1\theta_\text{control}d\theta_\text{control} = 0.5$ and $P(D)$ is independent of $\theta_\text{control}$, this becomes

$\frac{0.5}{P(D)} = 1$

This implies that $P(D) = 0.5$, so the probability $P(\theta_\text{control}|D) = 2\theta_\text{control}$.
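The same calculation extends to any number of observations. As a sketch, suppose we observe $n$ visitors, $s$ of whom buy a banana (the symbols $n$ and $s$ are introduced here for illustration and do not appear elsewhere in this post). Starting from the same uniform prior:

```latex
\begin{aligned}
P(D \mid \theta_\text{control}) &= \theta_\text{control}^{\,s}\,(1-\theta_\text{control})^{\,n-s} \\
P(D) &= \int_0^1 \theta^{\,s}\,(1-\theta)^{\,n-s}\,d\theta = \frac{s!\,(n-s)!}{(n+1)!} \\
P(\theta_\text{control} \mid D) &= \frac{(n+1)!}{s!\,(n-s)!}\,\theta_\text{control}^{\,s}\,(1-\theta_\text{control})^{\,n-s}
\end{aligned}
```

This is a beta distribution with parameters $s+1$ and $n-s+1$. Setting $n = s = 1$ recovers the result $P(\theta_\text{control}|D) = 2\theta_\text{control}$.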

The intuition behind this result is that, since we observed someone buy a banana, the proportion of visitors who buy bananas must be greater than 0. Based on this evidence, it is also more likely to be high than low, but we will update that claim as we collect more data.

In this way, we can use our observations $D$ to update our estimate of how likely each possible value of $\theta_\text{control}$ is. The model can be summarized by this diagram at right. This diagram can be read from bottom to top to say that our observations $y_j$ are randomly 0 or 1 with some likelihood $\theta_\text{control}$, and that we assume $\theta_\text{control}$ could be uniformly anywhere between 0 and 1, which we will update as we observe data. The top distribution shown is our initial assumption, and each variable is assumed to be drawn from the distribution(s) above it. If we collect enough data $D = \{y_1 = 1, y_2 = 0, y_3 = 0, \ldots\}$, we may end up with a probability distribution like this:

The more data we collect, the more precisely we can estimate $\theta_\text{control}$.

Multiple comparisons

Suppose we test three different button color changes (plus control) and collect this data:
Variant    Purchases    Visitors
control    4            210
pink       1            200
blue       4            190
yellow     6            220

The model we just described would give the following probability distributions:

According to this model, there is more than a 95% chance that yellow buttons are more attractive than pink buttons. But we are comparing 4 proportions, meaning that we are doing 6 pairwise comparisons, so we actually have much more than a 5% chance of making an error. As we compare more proportions simultaneously, the risk of making a random error dramatically increases. The frequentist method to deal with this is to require higher confidence for each comparison. This is called the Bonferroni correction. In our example, since we are doing 6 comparisons, we should require $1 - 0.05/6 = 99.2\%$ confidence in each of our comparisons. This is a high bar, and it means that we need to collect substantially more data.
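The figures in this paragraph can be checked with a short sketch. It is only an illustration under stated assumptions, not the code behind this post: the class and method names are hypothetical, the model is the one without shrinkage (independent uniform priors, so each posterior is a beta distribution), and the probability is estimated by Monte Carlo sampling rather than computed exactly.

```java
import java.util.Arrays;
import java.util.Random;

final class MultipleComparisons {
  // Number of distinct pairs among the given number of variants (n choose 2).
  static int pairwiseComparisons(int variants) {
    return variants * (variants - 1) / 2;
  }

  // Per-comparison confidence required by the Bonferroni correction.
  static double bonferroniConfidence(double alpha, int comparisons) {
    return 1.0 - alpha / comparisons;
  }

  // Draws from the posterior of a proportion under a uniform prior, which is
  // Beta(purchases + 1, visitors - purchases + 1). For integer parameters this
  // is the (purchases + 1)-th smallest of (visitors + 1) uniform draws.
  static double samplePosterior(int purchases, int visitors, Random rng) {
    double[] u = new double[visitors + 1];
    for (int i = 0; i < u.length; i++) {
      u[i] = rng.nextDouble();
    }
    Arrays.sort(u);
    return u[purchases];
  }

  // Monte Carlo estimate of P(theta_a > theta_b) with independent posteriors.
  static double probabilityGreater(int purchasesA, int visitorsA,
      int purchasesB, int visitorsB, int draws, long seed) {
    Random rng = new Random(seed);
    int wins = 0;
    for (int i = 0; i < draws; i++) {
      double a = samplePosterior(purchasesA, visitorsA, rng);
      double b = samplePosterior(purchasesB, visitorsB, rng);
      if (a > b) {
        wins++;
      }
    }
    return (double) wins / draws;
  }

  public static void main(String[] args) {
    int comparisons = pairwiseComparisons(4);
    System.out.println("comparisons: " + comparisons);
    System.out.println("Bonferroni confidence: " + bonferroniConfidence(0.05, comparisons));
    // Yellow (6 of 220) versus pink (1 of 200), from the table above.
    System.out.println("P(yellow > pink): " + probabilityGreater(6, 220, 1, 200, 20000, 42L));
  }
}
```

The order-statistic trick in samplePosterior keeps the sketch free of external libraries; a statistics library with a beta sampler would be the more usual choice.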

There’s no getting around the fact that we need more data, but fortunately we (usually) don't have to worry about multiple comparisons with Bayesian statistics. Rather than use the Bonferroni correction, we can simply use a different model with shrinkage. And rather than resolve to collect a huge number of data points, we can end the experiment whenever its result becomes significant, which happens faster if the difference between variants is more pronounced.

Shrinkage

Previously we started with the assumption that any proportion $\theta_i$ is equally likely before we collected any data. Since it is related to the other $\theta_i$ proportions, though, a better assumption is that the proportions $\theta_\text{control}$, $\theta_\text{pink}$, $\theta_\text{blue}$, $\theta_\text{yellow}$ all share a common distribution. This is a compromise between the frequentist null hypothesis that all $\theta_i$ are equal and the original Bayesian model, which treats all $\theta_i$ as totally unrelated quantities.

For this fictional example, we can see that the $\theta_i$ are clustered around roughly 0.02 (it is extremely unlikely that four uniformly random proportions would all be in such a small interval). To quantify this, we make another Bayesian model for the distribution of the means, and update it as we collect more data. Intuitively, since the proportions we want to measure are related, they are probably more similar than we observe. In the extreme case, if all $\theta_i$ truly are exactly equal, any deviations in the data would be random effects.

In the model for the parameter means, we assume all $\theta_i$ are drawn from a beta distribution with mean $\omega$ and spread $k$. In this way, the closer together the sample means, the tighter our predicted distribution for them becomes. For our observed means in this example, the estimated distribution over $\theta_i$ parameters looks roughly like:

We use this in place of the uniform prior distribution we had before. By coupling our original model with this new distribution for each $\theta_i$, we create the hierarchical Bayesian model at right.

This model assumes that each visitor observing variant $i$ has a chance $\theta_i$ to buy a banana. Each proportion comes from a common distribution of banana purchase rates, and we have prior assumptions about the mean $\omega$ and spread $k$ of that distribution. These distributions all work together to give us more accurate estimates of each $\theta_i$.
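Written out, the hierarchy has roughly the following shape. This is a sketch under one common mean-based parametrization of the beta distribution, in which a larger $k$ means a tighter distribution around $\omega$; the exact parametrization and the priors $p(\omega)$ and $p(k)$ are assumptions left unspecified here, and $y_{ij}$ denotes the 0 or 1 outcome of the $j$-th visitor who saw variant $i$.

```latex
\begin{aligned}
y_{ij} \mid \theta_i &\sim \text{Bernoulli}(\theta_i) \\
\theta_i \mid \omega, k &\sim \text{Beta}\big(\omega k,\ (1-\omega)\,k\big) \\
\omega &\sim p(\omega), \qquad k \sim p(k)
\end{aligned}
```

Under this parametrization the mean of each $\theta_i$ is $\omega$, so data from every variant informs $\omega$ and $k$, which in turn pull each individual $\theta_i$ toward the shared mean.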

This model "shrinks" our proportions toward their common mean, so that until we have enough data to prove otherwise, we make conservative estimates about the parameter differences. Here is the plot of our new predicted distributions for each color:

In our banana stand example, our estimate for how much greater $\theta_\text{yellow}$ is than $\theta_\text{pink}$ drops from 261% to 113%, and our confidence that the difference is positive drops from 95% to 90%.

More practical data

I generated random data to see how well each inference method worked. The random data comes from the following hypothetical scenario: suppose we have 4 variants with exact banana purchase proportions 2%, 2.2%, 2.4%, and 2.6%. We watch 300 visitors for each variant, then run our inference methods to estimate the distribution of each proportion.

I ran this simulation 1000 times and counted
• the number of trials where any type S errors (where $\theta_1$ is predicted to be greater than $\theta_2$ when actually $\theta_2 > \theta_1$) occurred
• the average mean squared error of both models:
Model                Trials with S errors    Mean squared error
with shrinkage       $24$                    $1.9\times10^{-3}$
without shrinkage    $42$                    $2.1\times10^{-3}$

With high confidence (using whichever frequentist or Bayesian inference method you want), we can say that including shrinkage prevents us from making some spurious type S errors and gives us more accurate results overall in this scenario.

It may seem counterintuitive that shrinking the parameters together gives a more accurate result, but it turns out that under a normal model with at least three parameters, shrinking the observed means toward any fixed point by a suitable data-dependent factor gives a more accurate result on average; the resulting estimate is called the James-Stein estimator.
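For reference, the classic form of this result can be stated as follows. The notation is introduced here for illustration: $m \ge 3$ observed means $y_i$, each normally distributed around its true value $\theta_i$ with known variance $\sigma^2$, shrunk toward the origin.

```latex
\begin{aligned}
y_i &\sim \mathcal{N}(\theta_i, \sigma^2), \qquad i = 1, \ldots, m, \quad m \ge 3 \\
\hat{\theta}^{\text{JS}}_i &= \left(1 - \frac{(m-2)\,\sigma^2}{\sum_{j=1}^{m} y_j^2}\right) y_i
\end{aligned}
```

The expected total squared error $\sum_i (\hat{\theta}^{\text{JS}}_i - \theta_i)^2$ is lower than that of the raw means $y_i$ for every value of the $\theta_i$. The hierarchical model in this post is a different technique, but it rests on the same intuition.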

Summary

You may still have doubts as to whether shrinkage is truly the best way to avoid inference errors. The truth is that any inference method makes a tradeoff between power (the chance of discerning which quantity is actually greater) and low error rates. Shrinkage is a technique that allows us to trade power for low error rates. Its advantage is that it is perfectly suited to our application: the measurement of similar quantities with multiple comparisons. It intelligently makes more conservative estimates in accordance with this similarity. At Wealthfront, it helps us sleep at night to know that our experiments are resolved with the precise, unbiased automation of a computer while incorporating the sane intuition of a human.

Tuesday, October 13, 2015

Testing JavaScript Modules

In our previous post we outlined benefits of writing JavaScript in modules. One benefit we didn't mention was how much more testable JavaScript modules are. In your test file you can require the module under test and let the require system import any dependencies.

The goal of this post is to outline some of the common testing patterns of JavaScript modules and highlight one of the libraries we have open-sourced to help make testing private state in JavaScript modules even easier.

Public Functions

Below is an example of a utilities file that simply exposes an object with some functions. This is the easiest example of a module to test because all functions are public.

We simply require the file and write tests that the add function works as expected.

Private Functions

Simple Objects

Building on our example from above, here is the same module, but with a private helper function.

When we require the utils module, we can't access Utils.add. We want to make it visible in the test environment in order to test that function, but not visible to the rest of our code base.

Here at Wealthfront we have built privatestate, a testing library to expose private state of JavaScript modules in our test environment. Below are some examples that outline its benefits. Check out its documentation for more detail!

privatestate lets us expose the private functions like this:

Allowing us to write tests against add.

Functions

Just as we exposed a private function on an object, we can expose helper functions on functions.

Stubbing

Sometimes we want to replace our private state and dependencies when we are in test to help make testing higher level functionality easier. We will use sinon in these examples.

Private Functions

We've seen how to properly expose a private function when in a testing environment, but what if we want to stub that function and replace it with something else?

Expose it as we did previously:

And stub it out in our test:

Dependencies

Stubbing dependencies is extremely important for testing JavaScript modules as modules often have multiple dependencies. We use Proxyquire (and its family of related packages).

Proxyquire lets you pass your require call an object of stubs to use instead of module dependencies.

If we are testing utils.js and we want to verify that getString returns helper1.msg, we want to stub out Helper so that we can specify msg. Using Proxyquire, we pass the require statement for Utils our stub for Helper.

Wrapping Up

JavaScript modules help improve the development and test lifecycles by letting us work on standalone modules and expose only the functionality we want depending on the execution environment. While making things accessible only within the module itself is beneficial, it can also make things harder to test. Thankfully privatestate and Proxyquire enable us to test all of those hard-to-reach places.

Go forth and test all of the things!