Friday, August 22, 2014

Summer Spent Learning and Experimenting

Though my decision of where to spend my 2014 summer may not carry the same drama or headlines as LeBron's, my experiences this summer will undoubtedly influence my pursuits after college. My priority was to find an internship program with supportive mentors at a company with an engineering backbone. In both respects, Wealthfront has gone beyond my expectations.

Data Week
Wealthfront's Data Week coincided with the start of my internship. Data Week is dedicated to providing the Wealthfront team with the tools and skills necessary to be fluent with data and analytics. After brief introductions and setup, I found myself packed into a 20'x15' conference room with the whole engineering and product team. Daily workshops covered the data methodologies and technologies essential to Wealthfront's data platform, and people with any level of data expertise, including none, were encouraged to attend. Depth and difficulty ranged from an 'Intro to SQL' session on basic SELECT queries to using Cascading on EMR. For example, the MapReduce seminar consisted of a high-level overview of the MapReduce architecture and the Cascading framework, followed by exercises to implement a new Cascading job within the data platform.

After the workshops, we split into three teams, each covering a different domain of Wealthfront (i.e. brokerage operations, investment services, and consumer Internet data). Each team's mission was to create data sources and dashboards providing diverse metrics on the operations and services at Wealthfront. The dashboards vary from daily cash flow metrics to Wealthfront's Money Ballers softball team statistics. For example, the dashboard shown below is a time series of the number of pull requests per day. The data source is generated using multiple data inputs from our online systems. The inputs are processed using Cascading and then loaded into our data warehouse. From there, the derived data is queried and loaded into this dashboard. By analyzing our engineering workflow, we can continually better our development processes.

I am extremely impressed by Wealthfront's encouraging attitude toward experimentation and learning. In my first week alone, I had the opportunity to explore the data warehouse, hack on a Cascading job, and learn about different perspectives on data analytics and data quality.

Internship Project
Holding true to its core engineering values, the Wealthfront team continually searches for and prioritizes processes to automate. One such process was the manual approval procedure for new accounts and the retention of client application information. Wealthfront retained physical copies of clients' new account applications in compliance with Financial Industry Regulatory Authority (FINRA) regulations. This process does not scale well with Wealthfront's current client and AUM growth. We wanted to develop an automated procedure for storing electronic copies of client applications that would be compliant with SEC and FINRA regulations. When developing a solution, I adopted Wealthfront's design methodology: identify the goals and invariants, research the technologies and dependencies, and develop an implementation and migration plan. Of course, the solution had to be coupled with a generous serving of tests to help you sleep easy at night.

One project goal was to electronically sign and store each retained new account application with a third-party storage provider. In compliance with FINRA regulations, the client records would have to be stored in write once, read many (WORM) format. This essentially meant redundant data storage filers, restrictions on delete permissions, and full audit logs of storage activity. We would ensure the integrity and security of confidential client information by encrypting the records and properly storing all data.

Another equally important goal was to implement an automated procedure to process and store each new client's data. This automation substitutes the old print-and-sign procedure and runs in the background, requiring no interaction with our Client Services team. A queue implementation would perfectly satisfy these requirements. Essentially, the queue worker would poll a database table for the next unprocessed payload, in this case, the next client's data. The worker would then encrypt and upload the data.
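As a minimal sketch of that worker loop, here is what the poll-encrypt-upload cycle might look like in Java. Everything here is invented for illustration: the in-memory deque stands in for the database table of unprocessed payloads, and encrypt() and the upload are trivial stubs rather than real cryptography or a real WORM store.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the background storage worker described above.
// Class, method, and field names are hypothetical, not Wealthfront's code.
public class QueueWorkerSketch {
    // Stand-in for the database table of unprocessed payloads.
    static Deque<String> pending = new ArrayDeque<>();
    // Stand-in for the third-party WORM storage provider.
    static Deque<String> stored = new ArrayDeque<>();

    // Stand-in for real encryption; a real worker would use a vetted crypto library.
    static String encrypt(String payload) {
        return "enc(" + payload + ")";
    }

    // One poll cycle: take the next unprocessed payload, encrypt it, upload it.
    static boolean pollOnce() {
        String payload = pending.pollFirst();
        if (payload == null) {
            return false; // nothing to do; a real worker would sleep and retry
        }
        stored.addLast(encrypt(payload)); // stand-in for the WORM upload
        return true;
    }

    public static void main(String[] args) {
        pending.add("client-application-123");
        while (pollOnce()) { } // drain the queue
        System.out.println(stored);
    }
}
```

Because the worker only ever reads the next unprocessed row, hundreds of applications enqueued at once are processed with no human interaction.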

The next concern is at what point in the client application approval process to push the data onto the queue for storage. The most sensible answer is to inject this feature at the point where the Client Services team approves an applicant to open an account. We could then push hundreds of clients' data onto the queue rather than printing each application out one by one. The automated service saves hours of printing and signing documents every week. As new signups increase every week, we can ensure a minimal delay between signing up and funding the account, thus providing a better experience for our new clients.

Final Thoughts
What I value the most from this summer is the horizontal and vertical exposure to Wealthfront's many services. I have had the opportunity to explore and contribute to the user management service, the web service client, the mobile server, and the data platform. This is made possible by the mentors' willingness to offer advice and suggestions, while leaving ample room for personal struggles, failures, and successes. Another great aspect of this internship is witnessing the growth of Wealthfront. I continually meet new people from all walks of life every week, making this summer an incredible experience.

Thursday, August 21, 2014

Learning from Leaders

As a student looking for internships, two points that mattered most to me were finding a place with great mentors and finding a place where I could build strong relationships.

Two months into my internship at Wealthfront, I’ve looked back to evaluate whether I achieved my goals. To me, the most beneficial aspect of the intern program has been our discussions with the experienced executives at the company. These weekly Exec/Intern Q&A discussions have given interns the opportunity to gain some insight on topics ranging from finance and entrepreneurship to career choices and family. Here are some of the lessons I’ve learned from the past two months.

Importance of Learning

Learning on the job is a main focus of any successful internship. However, I learned this summer that learning shouldn’t stop there. In fact, learning is at the center of the decision-making process here and a central tenet for the whole company. When it comes to deciding the next big initiative, the most important question to answer is what can be learned from the project. This follows the principle of the scientific method where a hypothesis must be established. The hypothesis tells us what we want to learn from an experiment or initiative. According to Avery Moon, VP Research and Engineering, “the only failed project is the one where nothing was learned.”

In order to learn from a hypothesis, we need a way to measure the success of a project. Taking a Bayesian approach, if the hypothesis is the prior, then there must be a measurable observation that helps to form a stronger posterior. This approach makes tracking metrics and collecting data essential for successful learning. Every engineer at Wealthfront is encouraged to create dashboards and metrics to follow the impact of their projects on the overall growth and success of the company. Without keeping track of these metrics, it can be hard to know why something went wrong or why something went well.
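Spelled out as Bayes' rule, with the hypothesis $H$ as the prior and the observed metric data $D$ as the evidence:

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```

The stronger the evidence the metrics provide, the further the posterior $P(H \mid D)$ moves away from the prior, which is why tracking metrics is a precondition for learning anything at all.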

The executives pointed out that the same approach can be applied to personal learning as well. Maintaining a journal is an effective way to qualitatively evaluate ourselves. It forces us to remember the events and thoughts of the day. As a result, we can understand which areas of our lives have room for improvement and in which we are succeeding. Tracking quantitative metrics can be more difficult in our day-to-day lives, but there are certain situations where it makes a lot of sense. One simple example is logging workouts. Keeping track of a workout can show us where we are improving and encourage continuous progress. If we’re able to keep track of the results of our personal experiments (whether quantitatively or qualitatively), we can better understand both our mistakes and successes.

Wednesday, August 20, 2014

Lessons Learned when Scaling our Data Analytics Platform

Over the past year, we’ve gone from a single Java server running all analytics to a multi-node data pipeline. Along the way, we’ve refined our metrics for all parts of the business: from web and mobile analytics, to investment research, to trading and operations. We build metrics dashboards for everything we do, empowering us to make bottom-up, metrics-driven decisions.  In a recent post, we explained why we bet on statistics to scale our bottom-up culture. This post gives a brief overview of the technology we use, with some details of the key lessons we’ve learned along the way.

System Overview

Our data pipeline is, by design, fairly simple. We load data from our online systems every hour. We store all data in Avro files.  We use Avro because it integrates with Hadoop, and Avro schemas add type safety to our data.

We use Hadoop for offline batch processing, via Cascading. We prefer Cascading to raw Hadoop, because Cascading provides a rich library of common data operations, like join, group by, and filter, on top of Hadoop’s basic map and reduce capability.

We also use Cascading to load data from Avro files into our data warehouse. We run all ad-hoc queries against the data warehouse. Additionally, we support views in our data warehouse. When needed, we migrate view computations to Cascading to enhance query performance. We also export data from our data warehouse into dashboards, and do statistical evaluation of query results in R.

Finally, we load derived data stored in Avro files into our online systems. This enables us to incorporate offline data processing back into the product.

Consider an example of how data flows through our system: We load data about deposits, trades, and withdrawals from our production system into Avro files. A Cascading job rolls this information up into daily account metrics for each account, tracking things like account balance and daily rate of return. In our data warehouse, we further roll up per-account metrics into a company-wide assets-under-management view, which we export into a popular dashboard. We also load account metrics data into R, along with web and mobile data, where we use machine learning to analyze factors affecting financial attention.

Lessons Learned

While building this pipeline, we developed a set of best practices around data. These insights map to any data platform, irrespective of particular technology choices. In our opinion, these learnings have been central to scaling our data platform.

Lesson 1: Load all data

We load almost all data from our online systems into our offline systems. This includes more than just the data from our production databases. We also load data from databases we don’t directly access from our production code, like our build server and issue tracker. By having all data available in the data platform, we’re empowered to answer whatever questions we want, whether about the business at large, some particular product or feature, or operations.

In slight contradiction to what I just said, there are classes of data we specifically chose not to load into our data platform. The first is any data whose retention would violate applicable regulations. We regularly audit our data systems for regulatory compliance. The second is sensitive client data, like personally identifiable information. We don’t need this data for aggregate analytics, and we protect our clients' data first and foremost. We also take additional security steps, such as encrypting all offline resting data, to further protect this data.

Ideally, all data we load into our system would be immutable, that is, no records would ever change. This is the approach recommended in the lambda architecture because it lets you reconstruct the full history of changes when doing analysis. Most of our transactional systems log changes, so we can use this for extract-transform-load (ETL). Unfortunately some systems, particularly third-party systems, are only able to provide data in its present state at the time we run ETL. For these systems, we load a snapshot of the data into our system, augmented with the time of the snapshot. This lets us approximate the immutable history of data based on point-in-time snapshots. While we miss records that came and went between ETL steps, we are at least able to track trends over time.

Lesson 2: Design for quality

Data is only useful if it is correct and fresh. In a multi-system data pipeline, it is easy to violate either condition. Freshness is easy to check. We have monitoring systems at all stages of the pipeline to validate that data freshness is within our SLAs, and we get alerted when those SLAs are violated. Because all upstream components will eventually fail, we make our ETL and batch processing jobs automatically catch up from where they left off, rather than just process the most recent data. This allows the system to self-heal when an upstream component is brought back online.
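A minimal sketch of that catch-up behavior: persist a watermark of the last hour processed, and on each run enumerate every hourly batch owed since then rather than only the latest one. All names here are invented for illustration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of a self-healing batch step: instead of processing only the most
// recent hour, process every hour since the last recorded watermark.
public class CatchUpSketch {
    static List<Instant> hoursToProcess(Instant watermark, Instant now) {
        List<Instant> hours = new ArrayList<>();
        for (Instant t = watermark.plus(Duration.ofHours(1));
             !t.isAfter(now);
             t = t.plus(Duration.ofHours(1))) {
            hours.add(t); // each entry is one hourly batch still owed
        }
        return hours;
    }

    public static void main(String[] args) {
        Instant watermark = Instant.parse("2014-08-20T00:00:00Z");
        Instant now = Instant.parse("2014-08-20T05:00:00Z");
        // After a five-hour outage the worker owes five hourly batches, not one.
        System.out.println(hoursToProcess(watermark, now).size()); // prints 5
    }
}
```

Advancing the watermark only after a batch succeeds means a crash mid-run simply causes that hour to be reprocessed on the next cycle.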

Correctness is a harder problem. We put several checks in place to ensure data correctness. One common way to compute the wrong thing is to misunderstand what an input data source represents. To protect against this and other computational errors, we code review all ETL steps, Cascading jobs, data warehouse views, dashboards, and R analytics. This gives the reviewer a chance to point out misunderstandings before they affect our data systems.

To further ensure computational correctness, we write rigorous unit tests to validate both happy path and edge cases. In addition to helping us work through issues as we go, these tests help us ensure new changes don’t break existing functionality. To facilitate good testing practices, we’ve built framework classes to reduce the burden of testing. For an example, see our earlier post on Testing Cascading applications.

Finally, we run data quality checks on live data to make sure it satisfies expected invariants. We write two basic types of data quality checks. The first is Cascading jobs that run sanity checks on our raw and derived Avro files. These can detect issues in our data pipeline itself as well as issues in our underlying production systems. The second is sanity checks based on SQL statements that we run against our data warehouse. As a simple example, we check all tables for duplicate entries. This is a common failure mode, since our data warehouse does not have a way to enforce uniqueness when loading data. The code we write to test for duplicates looks like this:

assertNoDups("trades", "id");

This ultimately unrolls into checking that the counts returned by the following two SQL queries are equal:
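The queries themselves were not reproduced in the original; a pair consistent with the assertNoDups("trades", "id") call above would be:

```sql
-- Total number of rows in the table
SELECT COUNT(*) FROM trades;

-- Number of distinct key values
SELECT COUNT(DISTINCT id) FROM trades;
```

If the two counts differ, the table contains duplicate ids.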


Lesson 3: Make it easy

Our goal in building this data platform is to improve the services we offer our clients by using data to drive decisions. This only happens if the data is readily available to whatever person or system needs it. We’ve developed a layered approach to making data accessible, starting with our dashboards.

Dashboards are the first way we interact with data. We use dashboards to quickly see how we’re doing and look for improvement opportunities. We’ve built dashboards for each area of the business. We use dashboards to track things like new-feature adoption, the business at-large, and operational performance. As an example, this dashboard tracks service deployment times:

The next layer in our data stack is the data warehouse. Here, we run ad-hoc queries. When needed, we process the results of these queries in R. We combine views and offline-computed tables to make the data easy to understand and query. Our online production data is generally too low-level to be useful in analysis without this pre-processing step.

Finally, in both our ETL and Cascading workflows, we’ve added framework classes to simplify the work required to process data. As an example, we’ve built type-safe Java wrappers around the core Cascading classes to reduce errors from passing strings around and eliminate boilerplate code. In particular, for a simple Filter condition, we would write:

public boolean isRemove(Source source) {
  return !source.getIsFirstClick();
}

instead of the raw Cascading equivalent:

public boolean isRemove(FlowProcess flowProcess, FilterCall<Void> filterCall) {
  TupleEntry entry = filterCall.getArguments();
  boolean isFirstClick = entry.getBoolean("is_first_click");
  return !isFirstClick;
}

Why this matters

While these are not our only learnings about scaling our data platform, they have been important. They’ve enabled everyone in the company to make data-driven decisions.  Loading all data into our platform allows us to ask any question we want. Our rigorous quality checks enable us to trust the results of our analyses. Because we’ve made data easily accessible, each employee is empowered to use data when making decisions on how to best help our clients manage their investments.

Friday, August 15, 2014

Statistics Is Eating the World

Wealthfront’s culture encourages self-directed engineers and researchers. This lets us learn faster and make better decisions informed by facts on the ground. While this practice comes naturally to startups with a handful of people, scaling it up to large teams is challenging.

We set out to scale our culture of decentralized learning and decision-making as we transitioned into hypergrowth early last year. This post introduces six principles we adopted for this transition.

In short, we decided to bet on statistics.


How to maintain coherence in learning and decision-making as teams grow is the fundamental question for scaling teams. Subjective techniques don’t work because everyone has different past experiences. Autocracy does not work because great engineers and researchers expect to be empowered. In other words, how can humans achieve efficient distributed consensus in the absence of centralized coordination?

We decided to adopt Bayesian reasoning on well-defined metrics to solve this. Metrics serve two roles. First, they quantify what happened in the past. As we are onboarding many new people, remembering our past learning becomes increasingly important. Second, metrics quantify future goals and help us avoid justifying past choices by cherry picking data after the fact.

Bayesian reasoning is the iterative practice of formulating, testing, and updating a hypothesis about how an action will impact a metric. Because it is the basis of Bayesian machine learning, we at least know this approach works for computers. When quantified by metrics, it should also work for people.

We apply this approach every day. For example, every Monday we get together in project teams, discuss metrics from the previous week, and update what we will do the upcoming week based on where we expect to have the biggest impact. Rather than raw numbers, we seek to understand the intuition underlying our numbers via graphical visualization and dashboards.

Thus, every technical role at Wealthfront requires fluency with data and statistics.

Democratized Data

Most teams make decisions increasingly slowly as they grow in size, often because consensus must be solicited from more and more people. We minimize this anti-pattern by applying metrics and Bayesian reasoning. The question, then, is how everyone making decisions identifies and finds the data they need to build hypotheses and learn from observations.

We decided to solve this by distributing the analytics load across our team by adopting a culture of data democratization: data and analytics are universally accessible by anyone and collectively owned, just like our source code (excluding specific data in compliance with applicable regulation). Doing so liberates data science to be a discipline practiced by everyone in his or her respective domain of expertise, rather than isolated in a silo.

Making data available is also consistent with our company culture of transparency.

Analytics Is Code

Having agreed to democratize metrics and data, the next question is the most challenging: how do we ensure our analyses are always repeatable and consistently high quality, given everyone is rapidly iterating in parallel? In other words, how do we prevent analytic anarchy?

Short answer: systematic peer review. We apply the same engineering discipline to data and analytics as we do for all production code. Doing so enables us to leverage all our proven high-productivity software engineering tools: source control, code review, automated testing, and continuous deployment. We outlawed the usual ad hoc mess of untestable Excel worksheets, brittle R scripts, and manually cobbled graphical dashboards.

Thus, our trust in decentralized learning and decision-making arises from confidence in our engineering craftsmanship.

We believe programming languages have a spectrum of productivity. As such, we believe in choosing the right tool for the job. There are no magic bullets. For example, SQL provides analytic reporting and relational exploratory analysis, but not visualization. R is fantastic for exploratory analysis and visualization, but not tabular reporting. Hadoop provides great data pipelines and large-scale map-reduce, but lacks ad hoc data exploration. Machine learning is implemented across the spectrum based on input data size.

Quality Matters

Our clients trust our tools and services to help make some of life’s most important financial decisions. We take this trust and responsibility extremely seriously. Data and analysis being correct is not a “nice to have” for us.

Peer-reviewed automated testing is the best verification methodology that scales with a rapidly growing team. We combine peer review and automated testing via automated data quality (DQ). DQ seeks to satisfy three types of constraints:
  • Integrity: no data is missing or technically corrupt
  • Accuracy: data is semantically accurate; every value falls within its expected range, such as a standard deviation being non-negative
  • Freshness: data is regularly updated within a pre-defined window (e.g. every 2 hours), as defined by an SLA per data set
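As a sketch of how such constraints might be expressed as warehouse checks (all table and column names below are hypothetical):

```sql
-- Integrity: no trade references a missing account; expect 0 rows back
SELECT COUNT(*) FROM trades t
  LEFT JOIN accounts a ON t.account_id = a.id
 WHERE a.id IS NULL;

-- Accuracy: standard deviations must be non-negative; expect 0 rows back
SELECT COUNT(*) FROM daily_metrics WHERE return_stddev < 0;

-- Freshness: the newest load must fall within the 2-hour SLA
SELECT MAX(loaded_at) FROM daily_metrics;
```

Each query's result is compared against its expected value, and a mismatch raises an alert.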
While believing in DQ is easy, baking it into everyday life is hard. Too many people misplace confidence in spreadsheets or untested scripts.

We apply the same development methodology for data as we do for all other code. We implement DQ for all offline data stores and analytic code. We codify unit and integration DQ tests along every step of our pipelines from ingestion to reporting. We sweep through resting data stores to evaluate DQ invariants every hour. Critical DQ failures are auto-escalated to on-call day or night.

One Platform

How to implement a tech stack and tool chain that meets the above requirements is the next question. Many legacy analytics patterns do not fit, as we are optimizing for different criteria.

We are optimizing for a world where everyone codes, so tools that hide code are counterproductive. We prefer horizontal scalability, so we avoid stacks that scale vertically such as traditional OLAP. We require flexibility to meet unanticipated future needs, so prefer modular components rather than monolithic software trying to do everything. We need a small learning curve so everyone can become productive quickly. We are fortunate to not have legacy boat anchors to support.

We decided on a single, horizontally scalable, modular, pipelined data platform. Data is loaded from all origin sources: event ingestion, internal operational systems, external partner systems, SaaS APIs, and infrastructure service providers. We standardize on Avro and have a single Avro IDL repository serving as our global unified metadata repository.

We optimize for different query patterns and workloads by pipelining data into multiple independent stores for concurrent query by different use cases: HDFS for map-reduce, relationally indexed for SQL, and compressed CSV for R. We layer domain-specific code pipelines to support our primary derived analytic use cases: financial analytics, risk analytics, web analytics, business analytics, optimization, standardization, recommendation, and experimentation.

Continuous Elastic Deployment

Our analytic workload is highly dynamic. Our baseline static workload combines hourly and daily jobs. Our dynamic workload varies widely based on large-scale simulations and ad hoc jobs, varying day-to-day up to 50x.

Given our unpredictable workload, we decided to adopt horizontally scalable cloud services. We are extending our infrastructure to support both continuous and elastic deployment, federating across both cloud providers and our bare metal data centers.

To enable everyone to iterate fast, we are automating all our data and analytics devops. For example:
  • Multiple instances: concurrently operate multiple parallel instances of data stores and end-to-end pipelines, enabling optimization for different workloads and geographic redundancy
  • Automated provisioning: anyone can provision store instances or redeploy pipeline phases, enabling everyone to be self-sufficient
  • Automated snapshots and restoration: snapshot data multiple times per day and support tool-driven restores
These capabilities build on our automated operational infrastructure to eliminate manual ops work and empower everyone.

Together our adoption of these six principles is helping us scale our decentralized learning and decision-making culture. Subsequent posts will drill into technical choices around codifying these principles, and lessons we are learning along the way.

Tuesday, August 12, 2014

The Case for Object-Oriented JavaScript

JavaScript is a flexible and powerful programming language that drives most modern-day web sites and applications. The jQuery library, in particular, has a simple API for traversing and manipulating the DOM. Just choose your selector, add an event handler, and voilà! Instant UI magic. The ease of jQuery, though, comes with a price. Without much thought in organizing your code, it’s easy to end up with a hot mess of jQuery selector soup -- and that’s the last thing we want in a complex application like ours.

In this blog post, I will make the case for object-oriented design in a functional language like JavaScript, using examples from Wealthfront's codebase.

JavaScript the Easy Way

Most front-end developers learn JavaScript the “easy” way first. Add an event handler here. Another one there. Chain some methods. Nest anonymous functions a few levels deep. It’s a quick and dirty way of getting things done, but it eventually makes testing, debugging, and adding new features a pain. 

Let’s look at an example. Here’s an older piece of our code that leverages Dropbox’s zxcvbn library to assess the strength of a user’s password on each keystroke:
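The original snippet is not reproduced here, so what follows is a loose reconstruction of the shape of code the next paragraph describes, not Wealthfront's actual code. It assumes jQuery and Dropbox's zxcvbn library are loaded on the page; a trivial stand-in for zxcvbn is included so the block is self-contained:

```javascript
// Use the real zxcvbn when it's on the page; otherwise a crude stand-in
// (the real library is far smarter than a length heuristic).
var zxcvbn = typeof window !== 'undefined' && window.zxcvbn
  ? window.zxcvbn
  : function (password) {
      return { score: Math.min(4, Math.floor(password.length / 4)) };
    };

// Business logic and view updates are tangled together on purpose:
// this is the "selector soup" style the post goes on to criticize.
function attachAssessor($input) {
  var $notify  = $($input.data('notify'));   // message container
  var $confirm = $($input.data('confirm'));  // confirmation input

  $input.on('keyup', function assess() {
    var value = $input.val();
    if (!value) { $notify.text(''); return; }
    if (value.length < 8) { $notify.text('Too short'); return; }
    var score = zxcvbn(value).score;          // 0 (weak) .. 4 (strong)
    $notify.text(score >= 3 ? 'Strong password' : 'Weak password');
    if ($confirm.val() && $confirm.val() !== value) {
      $notify.text('Passwords do not match');
    }
  });
}

// Wire up every password input marked with the zxcvbn class.
if (typeof $ !== 'undefined') {
  $('.zxcvbn').each(function () { attachAssessor($(this)); });
}
```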

Essentially, every input element with the class zxcvbn will have the behavior defined above. The input elements should have the data attributes notify and confirm, which refer to the message container and confirmation input elements, respectively. On .assess(), we grab the password’s value, run some preliminary validations, then pass the value to the zxcvbn function to give us a password score.

While this chunk of code is fairly simple, it has a few weaknesses in design:

1. There is no separation of concerns. The .assess() function not only conducts the business logic of password evaluation, but also updates the view with the appropriate message. Ideally, this logic should be handled in two separate functions.

2. It’s hard to overwrite the default configuration. Notice the tight coupling between the DOM elements. The code expects the password input to have data attributes for the notify and confirm elements. We can’t specify the DOM elements independently of each other.

3. It doesn’t allow code reuse. If we wanted this logic on another page, we would have to copy and paste it over, or render this script like a partial. But if we did that, we would invoke a new anonymous function on each page load, rather than reuse the same function.

JavaScript the Namespaced Way

One of the ways to untangle spaghetti JavaScript code is to create separate functions for the individual units of logic and define them on a common namespace. This allows us to call the functions anywhere within our application.

Let’s see how this looks using a different part of our password code, defined on the wfApp.settings.password namespace. When a user changes their password, we validate their input and show the response from the server:

This logic is clearly separated out into discrete functions, which makes reading and testing the code easier. Notice, though, that we still have selectors such as $('#password-form') scattered throughout the code.

There are a few ways to remedy this:

1. Update the functions to refer to this.form instead. We already have the property this.form that refers to $('#password-form'), and using that prevents us from diving into the DOM more times than necessary. The value of this.form could also be specified as an argument in a separate function. However, this means we would need to invoke that function on every page that uses this component; otherwise, this.form could still point to the jQuery object from a different view.

2. Pass the selector or jQuery object as a param. This would allow us to apply the functions to any element we pass in. But this may be unnecessarily redundant, especially since we use the same set of functions to manipulate the same DOM element.

If you look carefully, you'll see that the code actually has the beginnings of object-oriented design. After all, wfApp.settings.password is an object and maintains state like this.form and this.submitUrl across its functions. However, it's not quite object-oriented because wfApp.settings.password is an object literal, which means there's only one instance of it -- and that's our problem.

With just a few tweaks, we can flip this code into a more object-oriented model to address our configurability issues.

JavaScript the Object-Oriented Way

While JavaScript is a popular language, its full object-oriented programming capabilities are often underutilized. JavaScript doesn’t have the concept of a class, but we can create objects with the same properties using constructor functions. The instances inherit the same methods defined on the prototype.

If we begin to think of our DOM elements and data more as concrete objects with set methods, then we can reorganize our code into something reusable and configurable.

Creating an Object Constructor

As most of our password code deals with view logic, let’s make a constructor for a new object called PasswordView:

This constructor should be the place where we create our jQuery objects. Since we want the selectors to be configurable, the function should consume a hash with the desired selectors and create the jQuery objects accordingly.

It should look something like this:
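A sketch of what that constructor might look like. The option names and default selectors are invented, and a tiny stand-in for jQuery keeps the block self-contained; in the real app, $ is jQuery itself:

```javascript
// Stand-in for jQuery so the sketch runs anywhere; on the page, $ is jQuery.
var $ = typeof jQuery !== 'undefined'
  ? jQuery
  : function (selector) { return { selector: selector }; };

var wfApp = typeof wfApp !== 'undefined' ? wfApp : {};
wfApp.settings = wfApp.settings || {};

wfApp.settings.PasswordView = function (options) {
  options = options || {};
  // Each selector can be overridden per page via the options hash.
  this.$el           = $(options.el           || '#password-form');
  this.$inputCurrent = $(options.inputCurrent || '#password-form input[name=current]');
  this.$inputNew     = $(options.inputNew     ||
      '#password-form input[name=new], #password-form input[name=verify]');
};

// Two views over two different forms -- something the single
// object literal could not express.
var defaultView = new wfApp.settings.PasswordView();
var customView  = new wfApp.settings.PasswordView({ el: '#admin-password-form' });
```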

Here, we have three elements that belong to PasswordView:

- $el, which refers to the form element

- $inputCurrent, which refers to the input for the current password

- $inputNew, which refers to the inputs for the new password and confirmation

Now we can create actual PasswordView objects by calling new wfApp.settings.PasswordView() and passing in a hash of options.

It should look like this:

Because it's a constructor function, we can create multiple PasswordView objects each with custom selectors.

Replacing Other jQuery Selectors

In the other functions, we should now replace the instances of $('#password-form') and the like with our new properties. For instance, we currently have this for the .submit() function:

Lines 6 and 7 refer to $('#password-form input[name=new], #password-form input[name=verify]'). Let's replace that with our new property this.$inputNew. Also, line 10 refers to $('#password-form input'). Let's replace that with this.$el.find('input').

With our changes, it should look like this:

Adding Methods to the Object's Prototype

Let's move all of the current functions to be defined on PasswordView’s prototype. This means that all the PasswordView instances will have methods that point to the same function on the prototype.

Our .submit() function should now be defined as follows:
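A sketch of the shape (the constructor is stubbed and the handler body elided; only the prototype mechanics are shown):

```javascript
// Self-contained sketch: the focus is on defining the method on the
// prototype, not on the original handler body.
var wfApp = { settings: {} };
wfApp.settings.PasswordView = function(options) {
  this.options = options; // jQuery lookups omitted in this sketch
};

wfApp.settings.PasswordView.prototype.submit = function(event) {
  if (event && event.preventDefault) event.preventDefault();
  // ...validate the values in this.$inputNew and POST the form, as before...
  return this;
};

var view = new wfApp.settings.PasswordView({});
console.log(typeof view.submit, view.submit() === view); // function true
```

Every `PasswordView` instance now resolves `submit` to the same shared function on the prototype.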

Identifying Other Objects

We now have a concrete PasswordView with easily configurable selectors, but are there any other objects that appear in our code? Since we are passing password-related information to the server, we can encapsulate that information into its own object, which we can conveniently just call Password. Let’s create its constructor function with properties that correspond to the form's input fields:
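A sketch of the model constructor (the property names, mirroring the current, new, and confirmation fields, are assumptions):

```javascript
// Sketch of the Password model: plain data properties that correspond
// to the form's input fields. Property names are illustrative.
var wfApp = { settings: {} };
wfApp.settings.Password = function(options) {
  this.current = options.current;
  this.newPassword = options.newPassword;
  this.verify = options.verify;
};

var password = new wfApp.settings.Password({
  current: 'old-secret',
  newPassword: 'new-secret',
  verify: 'new-secret'
});
console.log(password.newPassword === password.verify); // true
```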

What methods should Password have? One thing that comes to mind is the validation logic. Right now, PasswordView runs the validation, but that responsibility should belong on the model, since it’s a data-related check. So let’s move .validate() onto Password’s prototype:
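A sketch of validation on the model's prototype (the specific rules, a non-empty password that matches its confirmation, are illustrative assumptions, not the original logic):

```javascript
// Sketch: validation lives on the model, since it's a data-related check.
function Password(options) {
  this.newPassword = options.newPassword;
  this.verify = options.verify;
}

Password.prototype.validate = function() {
  if (!this.newPassword) {
    return { valid: false, message: 'Please enter a new password.' };
  }
  if (this.newPassword !== this.verify) {
    return { valid: false, message: 'Passwords do not match.' };
  }
  return { valid: true, message: '' };
};

var ok = new Password({ newPassword: 'abc', verify: 'abc' }).validate();
var bad = new Password({ newPassword: 'abc', verify: 'xyz' }).validate();
console.log(ok.valid, bad.valid); // true false
```

The view can then simply display `message` without knowing anything about the rules themselves.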

We want the view to update the text based on the model’s validation result. So let’s pass in the model as one of the view’s options and grab its validation result on submit.

The other advantage of creating a separate Password object is the ability for our PasswordView object to use different Password objects with different constraints. If we had a Password object with different validation logic, we could easily pull that in without affecting our view logic.

With our new objects and methods defined on the prototype, our code should look like this:

We can see the advantages of this approach:

1. There is a clear separation of concerns. Password handles the validation logic, while PasswordView simply shows the text of the validation result.

2. It’s easy to overwrite the default configuration. PasswordView doesn’t assume any default selectors, so we can pass in any form element of our choosing and just define it once.

3. The same code can be used across multiple views. All we have to do is create new Password and PasswordView objects on the pages that require these components. All we need is something like this at the bottom of our markup:
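In spirit, the wiring is just a few lines in a script tag on each page. A runnable stand-in (no DOM or jQuery; all names and selectors are assumptions) shows the per-page instantiation:

```javascript
// Minimal stand-ins for the two objects, to show the per-page wiring.
function Password(opts) {
  this.newPassword = opts.newPassword;
  this.verify = opts.verify;
}
Password.prototype.validate = function() {
  return this.newPassword === this.verify;
};

function PasswordView(opts) {
  this.selector = opts.el; // in the real code, a jQuery lookup
  this.model = opts.model;
}

// Each page creates its own instances with page-specific selectors:
var settingsView = new PasswordView({
  el: '#password-form',
  model: new Password({ newPassword: 'a', verify: 'a' })
});
var onboardingView = new PasswordView({
  el: '#signup-password-form',
  model: new Password({ newPassword: 'b', verify: 'c' })
});

console.log(settingsView.model.validate(), onboardingView.model.validate()); // true false
```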

Final Thoughts

As JavaScript is such a flexible language, there are many ways to write JavaScript code, ranging from functional to object-oriented programming to a hybrid of the two. Personally, I lean toward the object-oriented approach, as it makes it easier to track the DOM elements in play and to reuse logic across different views. If rolling your own object-oriented code seems ambitious, a JavaScript framework like Backbone, Ember, or Angular can help you set up an object-oriented web application. However, as you can see, you can also achieve the same result with just jQuery and vanilla JavaScript.

Tuesday, July 29, 2014

Writing JavaScript Tests for Asynchronous Events with Deferreds

A while back, one of our engineers at Wealthfront wrote an in-depth article about deferreds. If the functionality you build relies on deferred objects to trigger asynchronous events, it's important that the behavior is properly tested. In this post I will go through two simple examples of how to test functionality that uses deferred objects. The first example tests functionality where a deferred object triggers handlers after an Ajax call is made with jQuery's post() method. The second example tests functionality where a deferred object triggers handlers after a DOM element finishes a hide animation.

Which libraries do I need to write JavaScript Tests with deferreds?

The Deferred class comes from the jQuery library, so that will be necessary. Our engineering team, and the examples I will show, use the QUnit testing framework to write our JavaScript tests. We also use the Sinon library to create spies, stubs, and mocks for our tests. This GitHub project contains all of the JavaScript and tests used in this article and demonstrates how these libraries can be configured together.

An example: Trigger the handlers of an Ajax POST call made with jQuery's post() method

Recently, I updated how our front-end system handled users cancelling their applications for a Wealthfront account. The example below shows a simplified example of what that code looked like:  
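A sketch of that kind of handler (the element IDs, endpoint URL, and messages are assumptions; the wiring is wrapped in an init function so the shape is clear):

```javascript
// Sketch: clicking the cancel link sends an Ajax POST, and the
// #cancel-status span is updated via the .done()/.fail() handlers.
// Call once the DOM is ready, passing in jQuery.
function initCancelLink($) {
  $('#cancel-link').on('click', function(event) {
    event.preventDefault();
    $.post('/api/cancel-application') // hypothetical endpoint
      .done(function(response) {
        $('#cancel-status').text('Your application has been cancelled.');
      })
      .fail(function() {
        $('#cancel-status').text('Something went wrong. Please try again.');
      });
  });
}
```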

Notice that in this example we send an Ajax POST when the ‘#cancel-link’ link is clicked. We would like the ‘#cancel-status’ span to show whether the cancel-application request completed successfully or failed. We have created separate handlers to update the DOM with the appropriate message through the .done() and .fail() methods, which help us define what work should be done when the request to cancel the application is processed successfully or unsuccessfully.
If our request to cancel the application returns a successful response (status 200-299 or 304), the handler in the .done() method executes and the HTML view updates with the cancellation status.
In contrast, if the system does not respond, or responds with any other status, the deferred will trigger any functions defined in the .fail() method.

How can I use a deferred to test my .fail() and .done() handlers?

We would now like to test that the handlers properly update the DOM when the request returns successfully or unsuccessfully. It’s undesirable to send the HTTP request every time a test is run, because it slows the test down and exercises parts of the system that we are not interested in testing. Instead, we will mock the Ajax POST call to return our own local deferred object.

Note that at this point our local deferred object has the two handlers attached, one for a successful and one for a failed cancel transaction. You can see from the last assertion that the ‘cancel-status’ span text is still blank. To trigger the handler's callback, you update the deferred object with the .resolve() method, which fires the .done() callback as follows:

When we call the .resolve() method to trigger the callback, we have the option to pass in arguments that will be passed to the functions declared in the .done() method. In our example we pass in an object literal with a 'status' property of either 'success' or 'not-found', to simulate the type of object that would be passed through the response of an actual Ajax call. Our assertions validate that our handlers properly received the object and displayed the appropriate message for its status. It's not necessary to pass in an object when you are updating a deferred object with .resolve() or .reject(). The next example shows using the deferred object with the .reject() method with no arguments to trigger the .fail() callback:
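The pattern can be shown end to end with a hand-rolled stand-in for jQuery's Deferred (callbacks fire synchronously on resolve/reject, which is enough to illustrate the test flow without a browser; in the real test, Sinon stubs $.post to return the local deferred):

```javascript
// Minimal Deferred stand-in, just for illustration.
function Deferred() {
  this.doneCallbacks = [];
  this.failCallbacks = [];
}
Deferred.prototype.done = function(cb) { this.doneCallbacks.push(cb); return this; };
Deferred.prototype.fail = function(cb) { this.failCallbacks.push(cb); return this; };
Deferred.prototype.resolve = function(arg) {
  this.doneCallbacks.forEach(function(cb) { cb(arg); });
};
Deferred.prototype.reject = function() {
  this.failCallbacks.forEach(function(cb) { cb(); });
};

// Attach the handler; nothing fires until the deferred is resolved.
var doneStatus = '';
var d1 = new Deferred().done(function(response) { doneStatus = response.status; });
console.log(doneStatus === '');  // true -- handler attached, not fired yet
d1.resolve({ status: 'success' });
console.log(doneStatus);         // success

// .reject() with no arguments triggers the .fail() callback:
var failStatus = '';
var d2 = new Deferred().fail(function() { failStatus = 'failed'; });
d2.reject();
console.log(failStatus);         // failed
```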

Another example: Using deferred objects to trigger events from jQuery DOM animations

Deferred objects can also be useful in tests for controlling the completion of local asynchronous events. In the following JavaScript, we want a DOM element to appear after the link has finished fading from a .hide().  
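A sketch of that behavior (the element IDs, animation duration, and status messages are assumptions based on the surrounding description):

```javascript
// Sketch: clicking the link starts a hide animation; when the animation's
// promise resolves, the status span is revealed and updated.
function initHideLink($) {
  $('#the-link').on('click', function(event) {
    event.preventDefault();
    $('#hide-status').text('in-progress');
    $('#the-link').hide(400).promise().done(function() {
      $('#hide-status').text('finished').show();
    });
  });
}
```

jQuery's `.promise()` returns a promise that resolves once the element's queued animations complete, which is what lets the `.done()` handler wait for the fade.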

We want to test with an assertion that the ‘hide-status’ span appears after the link has been clicked. But if we take a straightforward approach, the test will fail. This is because the assertion is checked immediately after the hide is initiated, and still requires time before the hide completes and the span is updated.  

While it's reasonable to consider placing a timed delay before the assertion is checked, that approach can end up being inconsistent and hard to maintain. Here we can use a deferred object to have complete control over the completion of the hide. The next example shows two tests. The first test asserts that the 'hide-status' message returned is 'in-progress' as soon as the event is fired, but before the hide completes. The second test verifies that after the event has fired and the hide completes, the status message is changed to 'finished'.

By mocking the call to the hide method to return a local deferred object, we can now be certain that the .done() callback has executed by the time the assertion is checked.

Checking for asynchronous conflicts with deferred objects

The hide animation example we were using actually has a flaw. When the user clicks on the link a single time, the page works as intended and hides the link with an animation. But what if the user clicks twice or multiple times on the link before the animation finishes? Multiple .hide() animations are queued and potentially reset the state of the link to visible as soon as the initial hide animation finishes. Since this is unintended behavior, we want to create a test to expect what we truly want, which is that only one .hide() is called no matter how many times the user clicks on the link.  

The test above sets up the scenario where the hideLink method is called a second time before the animation completes. By using our own local deferred object in the test, we can prevent the .done() method from ever resetting the hideStatus value before the second call is made to hideLink. The test will fail in the current version of the code, but we can pass the test by adding a condition not to call .hide() again if the state of hideStatus is already 'in-progress'.  
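A runnable sketch of the guarded version (no jQuery; the animation is injected as a function returning a deferred-like object, exactly as the test would stub it, and all names besides hideLink and hideStatus are assumptions):

```javascript
var hideStatus = '';
var hideCalls = 0;

// Stand-in for $('#the-link').hide().promise(): counts calls and lets
// the test decide when the "animation" completes.
function fakeHide() {
  hideCalls += 1;
  var callbacks = [];
  return {
    done: function(cb) { callbacks.push(cb); return this; },
    resolve: function() { callbacks.forEach(function(cb) { cb(); }); }
  };
}

function hideLink(hide) {
  if (hideStatus === 'in-progress') return null; // guard against repeat clicks
  hideStatus = 'in-progress';
  var animation = hide();
  animation.done(function() { hideStatus = 'finished'; });
  return animation;
}

var first = hideLink(fakeHide);
hideLink(fakeHide);      // second click before the animation completes
console.log(hideCalls);  // 1 -- .hide() ran only once
first.resolve();
console.log(hideStatus); // finished
```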

The above version of hideLink now calls .hide() only once when the user clicks the link multiple times, and passes the test we just created. This example can be extended to many other cases that arise from asynchronous conflicts. For example, you may need to check that a signup or buy transaction is executed only once if the button is clicked multiple times. The deferred object allows you to recreate in tests situations that may not always be possible to isolate while running the code in real time to verify your data and execution of code.

Useful Links on using deferreds:

Monday, July 7, 2014

My Internship with Wealthfront

I’m wrapping up my first couple of weeks at Wealthfront, and I’m confident in saying that I made the right decision for my last internship before graduation. During my first week, I was told, “Now you know what it’s like to work at a hypergrowth startup company.” I couldn’t agree more with the sentiment; it has been an amazing time to be here, from preparing for our recent announcement of reaching $1 billion in AUM to digging into research projects that might just impact how we reach our next $1 billion. Let me start from the beginning and describe my start as a Wealthfront quantitative research intern.

Working environment

The anticipation of joining a vibrant start-up began a month before my term, with the introduction emails. Interns from across the nation shared their backgrounds and hobbies; I had never seen such an energetic cohort, with interests ranging from extreme hiking to music production to ballroom dancing. On my first day, I was pleasantly surprised by the sophistication and breadth of the data and technology, coupled with documentation that is crowd-sourced and well-maintained. In addition, every intern was assigned a mentor to assist and supervise. Thanks to all the help that Wealthfront provides, I spent little time setting up and immediately jumped into quantitative analysis using R. R provides a wide variety of statistical and graphical techniques and is a favorite programming tool among statisticians. I was delighted to find that the Wealthfront research team has adopted R as the primary programming language for data visualization, data analysis and data modeling.

While my doctoral dissertation relies heavily on analysis in R, the pace and style of development here is on an entirely different level. Wealthfront is dedicated to being a data-driven company: strategic decisions are made based on insights generated from data analysis and research. This is reflected in the company’s heavy emphasis on building an efficient and secure data platform, and on maintaining and improving data quality. On the analysis side, all code is organized into functions and then into R packages. Computationally intensive models are carefully designed and written using R's Reference classes. Rather than writing ad-hoc code, we write structured scripts to ensure our analysis is reproducible. And finally, everything is tested to ensure accuracy: all R functions and classes are coupled with tests.

First project 

My first task was to analyze the spread of the Exchange Traded Funds (ETFs) Wealthfront invests in. Our investment team is led by the renowned economist Dr. Burton Malkiel, who famously argued in his bestselling book A Random Walk Down Wall Street (soon to be released in its 11th edition) that over a long-term investment horizon, the optimal strategy is to invest in a portfolio diversified across relatively uncorrelated asset classes, customized for individual risk tolerance. Following his investment philosophy, we invest in a widely diversified portfolio consisting of 11 asset classes, each represented by a low-cost, passive ETF (investment whitepaper). Wealthfront simply charges a management fee of 0.25% for accounts over $10,000, and charges no trading commissions. My first project was to calculate and visualize our daily average buy and sell prices for our 2014 ETF trades. The difference between the buy and sell prices should reflect the ETFs’ spread and our execution quality. Wealthfront buys and sells ETFs at different points during the day, so what I analyzed was the “effective” spread that we paid or collected on average in our trades. As a start, I studied the spread for two major ETFs that we invest in: VTI, which represents the asset class of US stocks, and MUB, which represents the asset class of municipal bonds.

The first step is to load real-time trading data into R. Our data is stored in a unified cloud data warehouse (DWH), which has fast query performance and handles large-scale data. Perhaps not surprisingly (since we are a technology company), our day-to-day operations generate a large amount of data. Our data engineers do a very good job delivering and maintaining high-quality data; thanks to their work, there is a large set of well-formatted, well-cleaned data tables available for various research purposes. To make things even better, Wealthfront has developed its own R function to load data directly from the DWH given a simple SQL query string. For the spread analysis, all I had to do was the following:

where 8897 is the instrument ID of VTI. The output of the function ExecuteSqlQuery is a data frame. In this study, data is a data frame with columns date, time, quantity, price, fee and action for each trade we have made since the beginning of this year, ordered chronologically by the time the trades occurred.

With the data frame ready, for each ETF on each trading day, I calculated the quantity-weighted average sell price

$$\bar{P}_{sell} = \frac{\sum_{i=1}^{n_s} q_i \, p_i}{\sum_{i=1}^{n_s} q_i}$$

where $n_s$ represents the number of sales made during that day, and $q_i$ and $p_i$ are the quantity and price of the $i$-th sale. Similarly, for purchases I calculated

$$\bar{P}_{buy} = \frac{\sum_{j=1}^{n_b} q_j \, p_j}{\sum_{j=1}^{n_b} q_j}$$

where $n_b$ represents the number of purchases made on each specific day. And the corresponding daily spread for this particular ETF is calculated as

$$\mathrm{Spread} = \bar{P}_{buy} - \bar{P}_{sell}.$$

In R, this can be quickly done in just a few lines of code. Suppose we need to calculate the effective spread on a given day $t$:

I wrote a function to iterate through all the trading days for this year and the final result is a vector of spreads.

Lastly, all code comes with tests. To test my function, named CalcAvg, I wrote the following test:

Results for VTI and MUB are presented below. In the first two graphs, the upper plot shows the daily time series of the Average Sell Price (dashed line) and the Average Buy Price (solid line); the lower plot shows the daily time series of the Spread. The graphs are produced in R.

The third and last graph provides a quick comparison of the average spread of VTI and MUB since January 1, 2014. We can see that, in general, VTI has a much tighter spread than MUB, and the average spread for both ETFs is positive. This plot is also made in R.

Final thoughts 

I come from an academic environment, where our research may not always be as applicable to real-world situations. Before arriving, I listened to academic heavyweights such as Dr. Burton Malkiel vouching for Wealthfront’s philosophy and I was delighted in the first week to find articles from scientific journals as relevant readings. After I finished my first project, I moved on to use various statistical research methods involving stepwise regression, LASSO regression, random forests classification/regression and other machine learning methods to provide in-depth insights about Wealthfront’s clients. It’s exciting to bring the depth of fundamental research together with the rubber-meets-the-road approach favored by Silicon Valley firms such as Wealthfront. 

While it’s the work and research that most interest me, Wealthfront’s rapid-iteration approach has also expanded my capabilities. From participating in team softball games to long-term strategy meetings, in the short time I've been here I've already had a wide range of experiences that have enhanced my skill set and allowed me to view things from a different perspective. My first few weeks have been fantastic. I'm excited to take the latest statistical theory from academia and apply it to real-world problem-solving that will benefit Wealthfront's clients.