Wednesday, May 2, 2012

Belt and Suspenders: Safety and Trust in a Distributed System

Wealthfront is built on a traditional service-oriented architecture: requests from web browsers are dispatched to a cluster of Rails instances, which in turn query back-end services. As a single page load might trigger tens of internal remote calls, it quickly becomes difficult to understand why a back-end service was called, who initiated the call, and where the time was spent.

As described last year on this blog, we built a simple tracing system to address this issue. Because our services communicate over HTTP, we use HTTP headers as a side channel to request and collect tracing information. If X-WF-Trace: true is sent with the request when querying a service, tracing is enabled and a header of the form X-WF-Trace: (local/q=GetAccount,200,406,((local/q=GetPrices,200,254,()))) is returned with the response. In this example, we learn that the queried service called GetAccount, which returned a 200 OK status in 406 milliseconds. We also learn that GetAccount called GetPrices to help it accomplish that, and that this nested call returned a 200 OK status in 254 milliseconds.
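To make the header format concrete, here is a minimal sketch of a recursive-descent parser for the example value shown above. It is not our production parser. The class and field names are hypothetical, and the grammar is inferred from the single example: each entry is read as (call,status,millis,(children)), and the optional comma between sibling entries is an assumption, since the example only shows one child.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the value of an X-WF-Trace response header. */
final class TraceHeaderParser {
  /** One traced call: its name, HTTP status, duration, and nested calls. */
  static final class Span {
    final String call;
    final int status;
    final long millis;
    final List<Span> children;

    Span(String call, int status, long millis, List<Span> children) {
      this.call = call;
      this.status = status;
      this.millis = millis;
      this.children = children;
    }
  }

  private final String s;
  private int pos;

  private TraceHeaderParser(String s) {
    this.s = s;
  }

  /** Parses a header value such as "(local/q=GetAccount,200,406,())". */
  static Span parse(String headerValue) {
    TraceHeaderParser p = new TraceHeaderParser(headerValue.trim());
    Span root = p.entry();
    if (p.pos != p.s.length()) {
      throw new IllegalArgumentException("trailing input at " + p.pos);
    }
    return root;
  }

  // entry := '(' call ',' status ',' millis ',' '(' entry* ')' ')'
  private Span entry() {
    expect('(');
    String call = until(',');
    int status = Integer.parseInt(until(','));
    long millis = Long.parseLong(until(','));
    expect('(');
    List<Span> children = new ArrayList<>();
    while (peek() != ')') {
      children.add(entry());
      if (peek() == ',') {
        pos++; // assumed optional separator between siblings
      }
    }
    expect(')'); // closes the child list
    expect(')'); // closes this entry
    return new Span(call, status, millis, children);
  }

  private char peek() {
    if (pos >= s.length()) {
      throw new IllegalArgumentException("unexpected end of header");
    }
    return s.charAt(pos);
  }

  private void expect(char c) {
    if (peek() != c) {
      throw new IllegalArgumentException("expected '" + c + "' at " + pos);
    }
    pos++;
  }

  // Returns the text up to the next delimiter and skips the delimiter.
  private String until(char delimiter) {
    int end = s.indexOf(delimiter, pos);
    if (end < 0) {
      throw new IllegalArgumentException("missing '" + delimiter + "' after " + pos);
    }
    String token = s.substring(pos, end);
    pos = end + 1;
    return token;
  }
}
```

Calling TraceHeaderParser.parse on the example value yields a GetAccount span with a single GetPrices child, which is enough to render the call tree or sum the time spent per service.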

In this blog post, I'll describe how we built a similar system to verify that a request was initiated by a specific logged-in user. The goal of this system is twofold. First, we want to prevent information leakage, such as displaying a client's account to a different client because some identifiers were mixed up somewhere in the flow of queries. Second, we want to generate a detailed audit trail when someone acts on behalf of somebody else.

The key idea behind the system we built is what we call a request authorization token. The token is an encrypted and authenticated JSON array. The array contains the authorized user, the underlying user, and the expiration time. The underlying user is a free-form string identifying the person who requested the token. In most cases, it is the same as the authorized user.
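As an illustration only, a token of this shape could be sealed and opened as sketched below. The post does not say which cipher is used, so AES-GCM (authenticated encryption from the JDK) is an assumption here, as are the class name, the IV-prefixed base64url encoding, and the millisecond expiration. To stay dependency-free, the sketch also rejects user strings containing quotes or backslashes instead of using a real JSON library.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical sketch of a request authorization token. */
final class TokenSketch {
  private static final int IV_BYTES = 12;
  private static final int TAG_BITS = 128;
  private static final SecureRandom RANDOM = new SecureRandom();
  // Matches ["authorized","underlying",expiresAtMillis]
  private static final Pattern PAYLOAD =
      Pattern.compile("\\[\"([^\"\\\\]*)\",\"([^\"\\\\]*)\",(\\d{1,18})\\]");

  static final class Claims {
    final String authorizedUser;
    final String underlyingUser;
    final long expiresAtMillis;

    Claims(String authorizedUser, String underlyingUser, long expiresAtMillis) {
      this.authorizedUser = authorizedUser;
      this.underlyingUser = underlyingUser;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  /** Encrypts and authenticates the JSON array; returns base64url(iv || ciphertext). */
  static String seal(SecretKey key, Claims c) {
    String json = "[\"" + plain(c.authorizedUser) + "\",\"" + plain(c.underlyingUser)
        + "\"," + c.expiresAtMillis + "]";
    try {
      byte[] iv = new byte[IV_BYTES];
      RANDOM.nextBytes(iv); // fresh IV per token, required for GCM
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
      byte[] ct = cipher.doFinal(json.getBytes(StandardCharsets.UTF_8));
      byte[] out = new byte[iv.length + ct.length];
      System.arraycopy(iv, 0, out, 0, iv.length);
      System.arraycopy(ct, 0, out, iv.length, ct.length);
      return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("cannot seal token", e);
    }
  }

  /** Decrypts, checks the authentication tag, and rejects expired tokens. */
  static Claims open(SecretKey key, String token, long nowMillis) {
    try {
      byte[] in = Base64.getUrlDecoder().decode(token);
      if (in.length < IV_BYTES) {
        throw new SecurityException("token too short");
      }
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
      // doFinal fails with AEADBadTagException if the token was tampered with.
      byte[] pt = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
      Matcher m = PAYLOAD.matcher(new String(pt, StandardCharsets.UTF_8));
      if (!m.matches()) {
        throw new SecurityException("malformed token payload");
      }
      Claims c = new Claims(m.group(1), m.group(2), Long.parseLong(m.group(3)));
      if (c.expiresAtMillis <= nowMillis) {
        throw new SecurityException("token expired");
      }
      return c;
    } catch (GeneralSecurityException | IllegalArgumentException e) {
      throw new SecurityException("invalid token", e);
    }
  }

  // Simplification of this sketch: refuse characters that would need JSON escaping.
  private static String plain(String s) {
    if (s.indexOf('"') >= 0 || s.indexOf('\\') >= 0) {
      throw new IllegalArgumentException("unsupported character in user string");
    }
    return s;
  }
}
```

Because the payload is authenticated as well as encrypted, a back-end service holding the key can trust the authorized user, the underlying user, and the expiration time without calling back to the issuer.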

Our Rails instances get a fresh request authorization token from our user manager on each page load. The token is passed using the X-WF-Authorization header to our back-end services with all remote calls made while rendering the page. Back-end services then rely on the UserVerifier to verify that the query is properly authorized. Of course, the token is propagated if a back-end service calls another back-end service.
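The propagation step can be sketched as follows for a Java back-end service calling another service. This is a minimal illustration, not our actual client code: the class and method names are hypothetical, and it assumes the JDK's java.net.http client and a plain GET. Only the X-WF-Authorization header name comes from the description above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Optional;

/** Hypothetical helper that forwards the caller's token on outgoing calls. */
final class AuthorizationPropagation {
  static final String HEADER = "X-WF-Authorization";

  /**
   * Builds an outgoing GET request to another back-end service and copies the
   * token received with the incoming request, if there was one.
   */
  static HttpRequest forward(URI backend, Optional<String> incomingToken) {
    HttpRequest.Builder builder = HttpRequest.newBuilder(backend).GET();
    // Without a token the header is simply omitted; the callee decides what to do.
    incomingToken.ifPresent(token -> builder.header(HEADER, token));
    return builder.build();
  }
}
```

Centralizing this in one helper means individual queries never handle the token themselves, which keeps propagation from being forgotten on a nested call.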

The UserVerifier has two modes: enforcing and permissive (stolen from SELinux). In enforcing mode, if a request does not include a proper authorization token, a 401 Unauthorized status is returned. In permissive mode, the request is allowed to proceed, but the lack of authorization is logged to disk and published to an internal dashboard. The default mode is enforcing, but the permissive mode can be used when introducing verification to a new query.
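The difference between the two modes can be sketched as below. The real UserVerifier API is not shown in this post, so every name here is hypothetical, and an in-memory list stands in for the log file and the dashboard.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of enforcing versus permissive verification. */
final class VerifierModesSketch {
  enum Mode { ENFORCING, PERMISSIVE }

  static final int OK = 200;
  static final int UNAUTHORIZED = 401;

  private final Mode mode;
  // Stand-in for "logged to disk and published to an internal dashboard".
  private final List<String> violations = new ArrayList<>();

  VerifierModesSketch(Mode mode) {
    this.mode = mode;
  }

  /**
   * Returns the HTTP status for a query whose data belongs to expectedOwner.
   * authorizedUser is null when the request carried no valid token.
   */
  int check(String query, String authorizedUser, String expectedOwner) {
    boolean authorized = authorizedUser != null && authorizedUser.equals(expectedOwner);
    if (authorized) {
      return OK;
    }
    if (mode == Mode.ENFORCING) {
      return UNAUTHORIZED; // reject the request outright
    }
    // Permissive: record the violation but let the request through.
    violations.add(query + ": expected " + expectedOwner + ", got " + authorizedUser);
    return OK;
  }

  List<String> violations() {
    return violations;
  }
}
```

Running a new query in permissive mode for a while surfaces callers that do not yet send a token, so they can be fixed before the switch to enforcing mode starts returning 401s.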

Unfortunately, using the UserVerifier directly is cumbersome. We strongly believe that unless this system is trivial to use, no one will use it and its entire purpose will be defeated. Consequently, we came up with an alternative, declarative solution. Instead of using the UserVerifier directly, query arguments can be annotated with the @Owned annotation.
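To give a feel for the declarative style, here is a minimal sketch of how such an annotation might be declared and discovered. Only the @Owned name comes from our system. The annotation's definition, the assumption that query arguments are fields, and the GetAccountSketch and OwnedFields classes are all hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Marks a query argument whose owner must match the authorized user. */
@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection at run time
@Target(ElementType.FIELD)
@interface Owned {}

/** Hypothetical query: the account id is owned, the flag is not. */
final class GetAccountSketch {
  @Owned final long accountId;
  final boolean includeClosed;

  GetAccountSketch(long accountId, boolean includeClosed) {
    this.accountId = accountId;
    this.includeClosed = includeClosed;
  }
}

/** Hypothetical framework hook that collects the values to verify. */
final class OwnedFields {
  static List<Object> ownedValues(Object query) {
    List<Object> values = new ArrayList<>();
    for (Field field : query.getClass().getDeclaredFields()) {
      if (field.isAnnotationPresent(Owned.class)) {
        field.setAccessible(true);
        try {
          values.add(field.get(query));
        } catch (IllegalAccessException e) {
          throw new IllegalStateException("cannot read " + field.getName(), e);
        }
      }
    }
    return values;
  }
}
```

With this shape, the author of a query only adds one annotation per sensitive argument, and a framework can run the ownership check on every value returned by ownedValues before the query executes.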

Our framework takes care of the rest for you. Behind the scenes, we rely on what we call OwnerExtractors.
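One plausible shape for this mechanism is sketched below: an extractor maps a value to the user who owns it, and the framework compares that owner with the authorized user from the token. The interface, the registry, and the fail-closed behavior for unregistered types are assumptions made for illustration, not a description of our actual classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical extractor: resolves the user who owns a given value. */
interface OwnerExtractorSketch<T> {
  String ownerOf(T value);
}

/** Hypothetical registry used by the framework to verify owned arguments. */
final class OwnershipCheck {
  private final Map<Class<?>, OwnerExtractorSketch<?>> extractors = new HashMap<>();

  /** Registers the extractor responsible for one argument type. */
  <T> void register(Class<T> type, OwnerExtractorSketch<T> extractor) {
    extractors.put(type, extractor);
  }

  /** Returns true only if an extractor exists and names authorizedUser as owner. */
  <T> boolean isOwnedBy(Class<T> type, T value, String authorizedUser) {
    @SuppressWarnings("unchecked") // safe: register() pairs each type with its extractor
    OwnerExtractorSketch<T> extractor = (OwnerExtractorSketch<T>) extractors.get(type);
    // Fail closed: an unknown type or a missing user is never considered owned.
    return extractor != null
        && authorizedUser != null
        && authorizedUser.equals(extractor.ownerOf(value));
  }
}
```

In a real service the extractor would typically look the owner up in a database or another service. In a test, a method reference to an in-memory map, such as accounts::get, is enough.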

Implementing an OwnerExtractor is not simple. However, each extractor only has to be written once and is thus highly leveraged. Our entire system relies on only a dozen extractors.

Last but not least, we also improved our internal tools to deal with request authorization tokens. Our simple wrapper around curl is called ikq. If we try to invoke the GetAccount query using ikq without a token, it will fail with a 401.

To generate and send a request authorization token, we can use the -a flag. Before invoking GetAccount, ikq will first contact our user manager to obtain a new token. The user manager will record the UNIX user name of the engineer who invoked the ikq command as part of the audit trail.