Building a Distributed Work Queue with MySQL and RabbitMQ

Many of the daily processes at Wealthfront follow a similar pattern: they perform some operation for each of a large list of items. For example, we might have a daily process that performs some optimization for each active account. This type of process usually operates through some form of work queue, and in the past we’ve been able to use in-memory queues with error reporting or persistent queues operating on a single machine. However, as our client base grows and the number of batch jobs we run increases, we’ve had to move to more sophisticated types of batch processing models, often involving multiple machines.

For this specific queue, we had a few goals in mind which informed our design. We wanted to be able to process queued items in parallel across multiple machines, record successes and failures in a database, monitor for failures, and guarantee that no item is processed more than once. We settled on a design that utilized both our MySQL database and a message queue (we used RabbitMQ).

The gist of the design is fairly straightforward. All queue workers inherit from a base class implementing logic around polling, execution, and exception handling. That logic is accessible through the method “handleQueuedItem.” The subclasses also supply their own specific handling logic. The producer service persists a row to the database representing a queued item. The row contains the item to evaluate and more columns to store additional metadata. The producer then adds a message to RabbitMQ containing the ID of the database row it just persisted and the specific handler subclass which should evaluate that row. We read that info in the consumer, which then deserializes the arguments and constructs an instance of intended subclass. The subclass definition looks something like this:

class EvaluateQueuedItem extends Handler<ItemClass, QueueRow> {

  private final Id<QueueRow> id;

  public EvaluateQueuedItem(Id<QueueRow> id) {
    this.id = id;
  }

  @Override
  public void process() {
    handleQueuedItem(id, this::handle, QueueRow.class);
  }

  private void handle(ItemClass item) {
    // Handle the item.
  }

}
Code language: Java (java)

The implementation of the Handler superclass is similar to the code below, where “U” is the type representing the database row for the queued item and “T” is the type of the item it provides.

public abstract class Handler<T, U extends QueuedItem<T>> {

  protected final void handleQueuedItem(
      Id<U> id,
      Function<T, Result> action, 
      Class<U> rowClass) {
    try {
      T item = inTransaction(dbConnection -> {
        U row = dbConnection.get(rowClass, id);
        if (row.getPolledAt() == null) {
          row.setPolledAt(clock.get());
          return item;
        }
        return null;
      });
      if (item != null) {
        Result result;
        try {
          result = action.apply(item)
        } catch (Exception exception) {
          logError(exception);
          result = new Failure(exception);
        }
        inTransaction(dbConnection -> {
          U row = dbConnection.get(rowClass, id);
          if (result.isSuccess()) {
            row.setSucceededAt(clock.get());
          } else {
            row.recordFailure(clock.get(), result);
          }
        })
      }
    } catch (Exception exception) {
      logError(exception);
    }
  }

}
Code language: Java (java)

We assume in this example (and this is true in our production implementation) that any code in the “inTransaction” block is guaranteed to be atomic and throws an exception if the changes cannot be committed.

As you can see, there’s no locking directly involved in this code. It simply does nothing if it finds that the item has already polled, and since the check and the update to the polled time are in the same transaction we know that at most one thread can actually proceed to execute the action. There is one possible failure case here, but it tends not to occur in practice due to concurrency control within RabbitMQ. When two threads collide trying to poll a row, exactly one will succeed because we use optimistic locking. This avoids a situation where no thread has polled an item. We further mitigate the risk of failures by adding retries to the polling step and monitoring for queued items which are not polled within a certain amount of time. We also have tooling in production to allow engineers to retry rows which have failed. If a row has not been polled, we can guarantee that the action has not yet been executed.

One of the trade-offs we’ve made in this design is that we’ve increased the queue overhead compared to simpler implementations. We have all the overhead of RabbitMQ plus the additional cost of making multiple database calls to poll items and record results. In our use case this was an easy decision because our computations took long enough to make the overhead insignificant as a proportion of total execution time, and the most significant factor for us was reliability. Optimizing for a larger number of very short-running actions might have resulted in a different queue design.

Disclosure

Nothing in this communication should be construed as an offer, recommendation, or solicitation to buy or sell any security. Wealthfront’s financial advisory and planning services, provided to investors who become clients pursuant to a written agreement, are designed to aid our clients in preparing for their financial futures and allow them to personalize their assumptions for their portfolios. Additionally, Wealthfront and its affiliates do not provide tax advice and investors are encouraged to consult with their personal tax advisors.

All investing involves risk, including the possible loss of money you invest, and past performance does not guarantee future performance. Wealthfront and its affiliates rely on information from various sources believed to be reliable, including clients and third parties, but cannot guarantee the accuracy and completeness of that information.

Engineering Blog – Wealthfront

Building a Distributed Work Queue with MySQL and RabbitMQ

Interested in what you see here?