Tools for Debugging Distributed Systems

June 27, 2011

We’ve previously written about the importance of internal tooling in creating a culture of empowered engineers and building a leveraged business. Our first example was adding bash completion to a curl wrapper script. Today I’d like to describe some of the internal tooling we use to make ourselves more productive in the distributed, service-oriented architecture we run in production. The three things I’ll cover are distributed tracing, profiling across a large group of machines, and building a REPL environment for working with our code on an ad hoc basis.

1. Tracing

Distributed tracing is a technique that receives little attention but can be hugely valuable. Google’s tracing system is called Dapper and is impressive in its scale and flexibility. Wealthfront’s infrastructure has a much more limited but still quite useful distributed tracing system.

We use an RPC over HTTP framework to initiate requests and HTTP headers as a side channel to request and collect tracing information. In most cases the tracing system uses a no-op implementation to avoid the overhead and additional network traffic involved in tracing. Our tracing code is driven by an interface called Tracer.

We have two implementations of this interface. The first, EmptyTracer, does nothing; most requests get this implementation, adding very little overhead. The second, RecordingTracer, builds a response header containing information about the request and the sub-requests invoked to produce the response.
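A minimal sketch of what such an interface might look like. The method names and signatures below are assumptions for illustration, not Wealthfront’s actual API:

```java
// Hypothetical sketch of the Tracer interface described above.
// Method names and signatures are assumptions, not the real API.
interface Tracer {
  // Record a completed sub-request: its name, HTTP status, and duration.
  void record(String request, int status, long durationMillis);

  // Create a child tracer for an internal sub-request.
  Tracer child(String request);
}

// The no-op implementation most requests receive: every call is free,
// and child() can safely return the same shared instance.
final class EmptyTracer implements Tracer {
  public void record(String request, int status, long durationMillis) {}
  public Tracer child(String request) { return this; }
}
```

The value of the no-op implementation is that tracing calls can be sprinkled liberally through the request path without a conditional at every call site.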

To signal that we’d like a request to be traced we pass “X-WF-Trace: true” as a header when initiating a request. This automatically configures a few things:

    • Our outgoing RPC requests propagate the header to enable tracing
    • Each external request is timed and recorded in the Tracer
    • Our internal execution framework propagates child tracers to
      internal sub-requests
    • Each blocking internal request is timed and recorded in the Tracer
    • Each non-blocking internal request is recorded, but not timed
      (because doing so would require blocking on completion)
    • When the response is formatted, a header containing all recorded
      requests and their transitive requests is compiled and included in
      the response
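The propagation rule in the first bullet can be sketched as a small helper. The class and method names here are illustrative assumptions, not the framework’s actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the header-propagation rule described above. The helper
// names are illustrative assumptions, not the actual framework code.
final class TracePropagation {
  static final String TRACE_HEADER = "X-WF-Trace";

  // A request is traced only when the caller explicitly asks for it.
  static boolean shouldTrace(Map<String, String> incomingHeaders) {
    return "true".equalsIgnoreCase(incomingHeaders.get(TRACE_HEADER));
  }

  // Copy the trace flag onto an outgoing RPC's headers so that tracing
  // follows the call graph through every downstream service.
  static Map<String, String> propagate(Map<String, String> incoming,
                                       Map<String, String> outgoing) {
    Map<String, String> result = new HashMap<>(outgoing);
    if (shouldTrace(incoming)) {
      result.put(TRACE_HEADER, "true");
    }
    return result;
  }
}
```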

Here’s a simple request (Foo) that calls one internal query (Bar):

$ curl -H "X-WF-Trace: true" -id "q=Foo" http://localhost:8080
HTTP/1.1 200 OK
Server: kawala
X-WF-Trace: (local/q=Bar,200,4067,((local/q=Baz,200,2541,())))

In this request, Bar returned a 200 OK status to Foo in 4.067 seconds and called Baz along the way.
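The nested header format can be read with a short recursive-descent parser. The grammar below is inferred from the examples in this post, so treat it as an illustration rather than the framework’s actual parser:

```java
import java.util.ArrayList;
import java.util.List;

// Recursive-descent parser for the nested X-WF-Trace format, assuming
// the grammar (inferred from the examples, not the real implementation):
//   node := '(' name ',' status ',' millis ',' '(' node* ')' ')'
final class TraceNode {
  final String name;            // request name, may be empty
  final int status;             // HTTP status code
  final long millis;            // duration in milliseconds
  final List<TraceNode> children;

  TraceNode(String name, int status, long millis, List<TraceNode> children) {
    this.name = name;
    this.status = status;
    this.millis = millis;
    this.children = children;
  }

  static TraceNode parse(String s) {
    return new Parser(s).node();
  }

  private static final class Parser {
    private final String s;
    private int pos;

    Parser(String s) { this.s = s; }

    TraceNode node() {
      expect('(');
      String name = until(',');
      int status = Integer.parseInt(until(','));
      long millis = Long.parseLong(until(','));
      expect('(');
      List<TraceNode> children = new ArrayList<>();
      while (s.charAt(pos) == '(') {  // zero or more child nodes
        children.add(node());
      }
      expect(')');                    // close the child list
      expect(')');                    // close this node
      return new TraceNode(name, status, millis, children);
    }

    private String until(char stop) {
      int start = pos;
      while (s.charAt(pos) != stop) pos++;
      return s.substring(start, pos++);  // consume the stop character
    }

    private void expect(char c) {
      if (s.charAt(pos++) != c) {
        throw new IllegalArgumentException("bad trace at " + (pos - 1));
      }
    }
  }
}
```

A parser like this is the first step toward the aggregation described later: once traces are structured data, deviations across many requests become easy to compute.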

Here’s a simple request that retrieves some information about my Wealthfront account:

$ curl -H "X-WF-Trace: true" -id "q=GetAccount&p0=338" http://localhost:8081
HTTP/1.1 200 OK
Server: kawala
X-WF-Trace: (,200,6,())

In this request GetAccount fetches some prices to value my account by calling GetPriceQuotesRealtime on a remote service and returns in 6 milliseconds.

There are lots of improvements to be made to our tracing system. I look forward to automatically enabling tracing on some percentage of requests and forwarding the results to a collection point that can aggregate tracing information and present it nicely to developers. Just as we look for statistical deviations in exceptions to automatically notify developers about problems, detecting deviations in traces will be a fun problem to experiment with.

2. Profiling

Large clusters of services can make profiling real-world execution time difficult. We recently introduced a profiling tool that aggregates timing information across diverse clusters. Each piece of timing information carries a scope that encompasses many events. The tool exposes several methods of collecting timing information:

public interface Chronograph {

  // Transparently time a block of code
  public <T> T time(String scope, String eventName, Callable<T> c);

  // Stopwatch-style timing similar to perf4j
  public RunningChronograph start(String scope, String eventName);

  // Convenience methods
  public <T> T time(Object scope, String eventName, Callable<T> c);
  public <T> T time(Class<?> scope, String eventName, Callable<T> c);
  public RunningChronograph start(Class<?> scope, String eventName);
  public RunningChronograph start(Object scope, String eventName);
}

The least intrusive method uses Guice AOP. Adding a @Timed annotation to a method will transparently time and record the method’s execution:

public class Foo {

  @Timed
  public Baz bar() {
    // ...
  }
}

This will record timing information with a scope of Foo and an event of bar.
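Guice implements this with bytecode-level method interception; a rough stand-in using a JDK dynamic proxy (which only works on interfaces) looks like the following. The Quoter interface and the recorded-string format are illustrative assumptions, not our production interceptor:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Marker annotation, analogous to the @Timed described above.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Timed {}

// Example interface for the demo; in Guice the interception applies
// to concrete classes, but JDK proxies require an interface.
interface Quoter {
  @Timed String quote();
}

final class TimingProxy {
  // Events recorded as "Scope.event:<nanos>ns" — a stand-in for the
  // Chronograph the real interceptor would call.
  static final List<String> recorded = new ArrayList<>();

  @SuppressWarnings("unchecked")
  static <T> T wrap(Class<T> iface, T target) {
    InvocationHandler handler = (proxy, method, args) -> {
      if (!method.isAnnotationPresent(Timed.class)) {
        return method.invoke(target, args);   // untimed pass-through
      }
      long start = System.nanoTime();
      try {
        return method.invoke(target, args);
      } finally {
        long nanos = System.nanoTime() - start;
        recorded.add(iface.getSimpleName() + "." + method.getName()
            + ":" + nanos + "ns");
      }
    };
    return (T) Proxy.newProxyInstance(
        iface.getClassLoader(), new Class<?>[] { iface }, handler);
  }
}
```

The appeal of the annotation-driven approach is that the timed method body stays completely free of instrumentation code; the scope and event names fall out of the class and method names.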

Internally, we use a custom MBeanCollector that pushes data to a collectd service, which records the information in rrd files. We can then view graphs of the data.

Here’s a graph showing a service that aggregates some pricing information from a real-time exchange feed. Each data point takes a few microseconds to process, and it’s easy to see the volume “smile,” with more volume at the beginning and end of the trading day.


3. REPL

Both of the previous tools operate on existing production code. When we need an environment to develop ad hoc code to answer a specific question, we can pull up a REPL that integrates with our production environment, with some safety restrictions to prevent side effects. Our current implementation uses Clojure, because it’s convenient for us to use a JVM language and I like writing Lisp. There’s a similar Scala REPL for engineers who prefer that environment.

Recently I wanted to gather all ADR instruments that trade over the counter so I could evaluate a modification to our trading system. I loaded up BNY Mellon’s ADR directory, entered my parameters, and exported a file. I threw the symbols into a buffer in Emacs and connected to the Clojure REPL.

A quick line of code later and I had all the OTC ADRs that are approved for trading in our trading system:

user> (map :symbol (filter :approved-long (filter (comp not nil?) (map
inst all-symbols))))
("AHEXY" "ADDYY" ...

In this code, all-symbols is the list of symbols, the inst function returns information about the stock from our securities master, and :approved-long is a map key containing a boolean that is true if the instrument is approved for trading. After applying this filter, I extracted each symbol. Armed with this information I was quickly on my way to the trading system.

All of these tools were developed quickly and improved over time as needed. The initial implementations took no more than a few hours to build and release, and they have paid for themselves many times over in improved productivity and visibility into production. A well-known Brian Kernighan quote is: “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you are as clever as you can be when you write it, how will you ever debug it?” One of the many things we use internal tools for is reducing the complexity of distributed debugging.