Statistics Is Eating the World

August 15, 2014

Wealthfront’s culture encourages engineers and researchers to be self-directed. This helps us learn faster and make better decisions informed by facts on the ground. While this practice comes naturally to startups with a handful of people, scaling it up to large teams is challenging.

We set out to scale our culture of decentralized learning and decision-making as we transitioned into hypergrowth early last year. This post introduces six principles we adopted for this transition.

In short, we decided to bet on statistics.

Metrics

How to maintain coherence in learning and decision-making as teams grow is the fundamental question of scaling. Subjective techniques don’t work because everyone brings different past experiences. Autocracy does not work either, because great engineers and researchers expect to be empowered. In other words, how can humans achieve efficient distributed consensus in the absence of centralized coordination?

We decided to adopt Bayesian reasoning on well-defined metrics to solve this. Metrics serve two roles. First, they quantify what happened in the past; as we onboard many new people, remembering our past learning becomes increasingly important. Second, metrics quantify future goals and help us avoid justifying past choices by cherry-picking data after the fact.

Bayesian reasoning is the iterative practice of formulating, testing, and updating a hypothesis about how an action will impact a metric. This approach is the basis of Bayesian machine learning, so we at least know it works for computers. When our hypotheses are quantified by metrics, it should work for people too.
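As a concrete, purely illustrative sketch of this loop (not a description of our actual tooling): suppose the hypothesis is that a product change lifts a conversion-style metric. With a Beta prior over the rate and binomial observations, the weekly update is a few lines of Python:

    # Illustrative only: Beta-Binomial update of a belief about a conversion metric.
    # The prior and the observed counts below are made-up numbers.

    def update_beta(alpha, beta, successes, failures):
        """Fold one week of binomial observations into a Beta(alpha, beta) prior."""
        return alpha + successes, beta + failures

    alpha, beta = 2.0, 48.0            # prior roughly centered on a 4% rate
    signups, non_signups = 130, 2870   # hypothetical metric data from last week

    alpha, beta = update_beta(alpha, beta, signups, non_signups)
    print("posterior mean rate: %.4f" % (alpha / (alpha + beta)))

The posterior from one week becomes the prior for the next, which is exactly the formulate-test-update loop described above.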

We apply this approach every day. For example, every Monday we get together in project teams, discuss metrics from the previous week, and update our plans for the upcoming week based on where we expect to have the biggest impact. Rather than staring at raw numbers, we seek to understand the intuition underlying them via graphical visualization and dashboards.

Thus, every technical role at Wealthfront requires fluency with data and statistics.

Democratized Data

Most teams make decisions increasingly slowly as they grow in size, often because consensus must be solicited from more and more people. We minimize this anti-pattern by applying metrics and Bayesian reasoning. Thus, our question is how everyone making decisions finds the data they need to build hypotheses and learn from what they observe.

We decided to solve this by distributing the analytics load across the team through a culture of data democratization: data and analytics are universally accessible and collectively owned, just like our source code (excluding specific data restricted in compliance with applicable regulation). This liberates data science to be a discipline practiced by everyone in their respective domains of expertise, rather than isolated in a silo.

Making data available is also consistent with our company culture of transparency.

Analytics Is Code

Having agreed to democratize metrics and data, we face the most challenging question: how do we ensure our analyses are always repeatable and consistently high quality when everyone is rapidly iterating in parallel? In other words, how do we prevent analytic anarchy?

Short answer: systematic peer review. We apply the same engineering discipline to data and analytics as we do for all production code. Doing so enables us to leverage all our proven high-productivity software engineering tools: source control, code review, automated testing, and continuous deployment. We outlawed the usual ad hoc mess of untestable Excel worksheets, brittle R scripts, and manually cobbled graphical dashboards.
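For a flavor of what this looks like in practice, here is a hypothetical example (not from our codebase) of an analytic function together with the kind of unit test that accompanies it through code review:

    # Hypothetical example: analytic code reviewed and tested like production code.

    def week_over_week_growth(previous, current):
        """Fractional week-over-week growth of a metric."""
        if previous <= 0:
            raise ValueError("previous value must be positive")
        return (current - previous) / float(previous)

    def test_week_over_week_growth():
        assert abs(week_over_week_growth(100.0, 110.0) - 0.10) < 1e-9
        assert week_over_week_growth(200.0, 150.0) == -0.25

    test_week_over_week_growth()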

Thus, our trust in decentralized learning and decision-making arises from confidence in our engineering craftsmanship.

We believe programming languages fall along a spectrum of productivity. As such, we believe in choosing the right tool for the job. There are no magic bullets. For example, SQL provides analytic reporting and relational exploratory analysis, but not visualization. R is fantastic for exploratory analysis and visualization, but not tabular reporting. Hadoop provides great data pipelines and large-scale map-reduce, but lacks ad hoc data exploration. Machine learning is implemented across the spectrum based on input data size.

Quality Matters

Our clients trust our tools and services to help make some of life’s most important financial decisions. We take this trust and responsibility extremely seriously. Correct data and analysis are not a “nice to have” for us.

Peer-reviewed automated testing is the verification methodology that best scales with a rapidly growing team. We combine peer review and automated testing via automated data quality (DQ) checks. DQ seeks to satisfy three types of constraints (see the sketch after this list):

  • Integrity: no data is missing or technically corrupt
  • Accuracy: data is semantically accurate; every value falls within its expected range, e.g. a standard deviation must be non-negative
  • Freshness: data is regularly updated within a pre-defined window (e.g. every 2 hours), as defined by an SLA per data set
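Here is what such constraints might look like in code. This is a minimal sketch; the field names, ranges, and the 2-hour window are hypothetical:

    import datetime

    # Minimal DQ sketch; field names, ranges, and the SLA below are illustrative.

    def check_integrity(rows, required_fields):
        """Integrity: no required field is missing or null."""
        return all(row.get(f) is not None for row in rows for f in required_fields)

    def check_accuracy(rows):
        """Accuracy: values lie within expected ranges, e.g. std deviation >= 0."""
        return all(row["std_dev"] >= 0 for row in rows)

    def check_freshness(last_updated, max_age=datetime.timedelta(hours=2)):
        """Freshness: the data set was refreshed within its SLA window."""
        return datetime.datetime.utcnow() - last_updated <= max_age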

While believing in DQ is easy, baking it into everyday life is hard. Too many people misplace confidence in spreadsheets or untested scripts.

We apply the same development methodology to data as we do to all other code. We implement DQ for all offline data stores and analytic code, codifying unit and integration DQ tests along every step of our pipelines from ingestion to reporting. Every hour, we sweep through resting data stores to evaluate DQ invariants. Critical DQ failures are auto-escalated to the on-call engineer, day or night.
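A sketch of what such an hourly sweep could look like, assuming hypothetical load_rows, checks, and page_oncall callables (none of these are real APIs of ours):

    # Hypothetical sweep driver; load_rows, checks, and page_oncall are placeholders.

    CRITICAL_DATASETS = {"trades", "transfers"}   # made-up names

    def sweep(datasets, load_rows, checks, page_oncall):
        """Evaluate DQ invariants over resting data stores; escalate critical failures."""
        for name in datasets:
            rows = load_rows(name)
            failures = [check.__name__ for check in checks if not check(rows)]
            if failures and name in CRITICAL_DATASETS:
                page_oncall("DQ failure in %s: %s" % (name, ", ".join(failures)))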

One Platform

How to implement a tech stack and tool chain that meets the above requirements is the next question. Many legacy analytics patterns do not fit, as we are optimizing for different criteria.

We are optimizing for a world where everyone codes, so tools that hide code are counterproductive. We prefer horizontal scalability, so we avoid stacks that scale vertically such as traditional OLAP. We require flexibility to meet unanticipated future needs, so prefer modular components rather than monolithic software trying to do everything. We need a small learning curve so everyone can become productive quickly. We are fortunate to not have legacy boat anchors to support.

We decided on a single, horizontally scalable, modular, pipelined data platform. Data is loaded from all origin sources: event ingestion, internal operational systems, external partner systems, SaaS APIs, and infrastructure service providers. We standardize on Avro and have a single Avro IDL repository serving as our global unified metadata repository.
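For illustration, a record in that repository might look like the following. The schema, namespace, and fields are hypothetical, and fastavro is just one convenient Python binding for Avro, not necessarily what we run:

    # Hypothetical event schema; in practice our schemas live in the shared Avro IDL repo.
    import io
    from fastavro import writer, reader

    SCHEMA = {
        "type": "record",
        "name": "AccountEvent",
        "namespace": "com.example.analytics",   # placeholder namespace
        "fields": [
            {"name": "account_id", "type": "long"},
            {"name": "event_type", "type": "string"},
            {"name": "timestamp_ms", "type": "long"},
        ],
    }

    buf = io.BytesIO()
    writer(buf, SCHEMA, [{"account_id": 42, "event_type": "deposit", "timestamp_ms": 1408060800000}])
    buf.seek(0)
    print(list(reader(buf)))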

We optimize for different query patterns and workloads by pipelining data into multiple independent stores for concurrent query by different use cases: HDFS for map-reduce, relationally indexed for SQL, and compressed CSV for R. We layer domain-specific code pipelines to support our primary derived analytic use cases: financial analytics, risk analytics, web analytics, business analytics, optimization, standardization, recommendation, and experimentation.
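A simplified sketch of that fan-out, with gzip CSV and SQLite standing in for the R-facing and SQL-facing stores (the HDFS leg is omitted, and the records are made up):

    # Simplified fan-out: the same records land in stores suited to different workloads.
    import csv, gzip, sqlite3

    records = [("2014-08-11", "signups", 1200), ("2014-08-12", "signups", 1350)]

    # Compressed CSV for exploratory analysis in R.
    with gzip.open("daily_metrics.csv.gz", "wt", newline="") as f:
        w = csv.writer(f)
        w.writerow(["date", "metric", "value"])
        w.writerows(records)

    # Relationally indexed copy for SQL reporting.
    conn = sqlite3.connect("analytics.db")
    conn.execute("CREATE TABLE IF NOT EXISTS daily_metrics (date TEXT, metric TEXT, value INTEGER)")
    conn.executemany("INSERT INTO daily_metrics VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()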

Continuous Elastic Deployment

Our analytic workload is highly dynamic. Our baseline static workload combines hourly and daily jobs. Our dynamic workload varies widely based on large-scale simulations and ad hoc jobs, fluctuating day to day by up to 50x.

Given our unpredictable workload, we decided to adopt horizontally scalable cloud services. We are extending our infrastructure to support both continuous and elastic deployment, federating across both cloud providers and our bare metal data centers.

To enable everyone to iterate fast, we are automating all our data and analytics devops. For example (sketched after the list below):

  • Multiple instances: concurrently operate multiple parallel instances of data stores and end-to-end pipelines, enabling optimization for different workloads and geographic redundancy
  • Automated provisioning: anyone can provision store instances or redeploy pipeline phases, enabling everyone to be self-sufficient
  • Automated snapshots and restoration: snapshot data multiple times per day and support tool-driven restores
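As a rough sketch of the snapshot and restore tooling described above; snapshot_store and restore_store are entirely hypothetical placeholders for the real provisioning APIs:

    # Entirely hypothetical sketch; snapshot_store and restore_store are placeholders.
    import datetime

    def snapshot_all(stores, snapshot_store):
        """Take a timestamped snapshot of every registered data store."""
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        return {name: snapshot_store(name, stamp) for name in stores}

    def restore(store_name, snapshot_id, restore_store):
        """Tool-driven restore of a single store from a named snapshot."""
        return restore_store(store_name, snapshot_id)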

These capabilities build on our automated operational infrastructure to eliminate manual ops work and empower everyone.

Together, our adoption of these six principles is helping us scale our decentralized learning and decision-making culture. Subsequent posts will drill into the technical choices behind codifying these principles and the lessons we are learning along the way.