When Garbage Collection is Failure

September 14, 2010

For systems that absolutely require a constant responsiveness Java is generally not an option because of the pauses associated with the garbage collector. Even the alternative collectors, such as parallel and concurrent-mark-sweep, have a pause associated with them for major and minor collections. This problem can be avoided, though, by making sure the garbage collector never runs, which means making sure memory usage is constant or at least growing slowly enough that collections can be put off to a convenient time.

Though we don’t do it here at Wealthfront, high frequency trading is an example of an application with these kinds of requirements. To compete on the very fastest levels responsiveness must be on the sub-millisecond level, so randomly occurring pauses of 10ms or more are clearly unacceptable.

With Java, the only way to keep the garbage collector from running is to not generate any garbage, so that means object pooling. Object pooling is often more trouble than it is worth because of the need to keep a reference count, which must be synchronized, in order to know when an object can be reclaimed. This combined with the necessity of building a data structure–the pool itself–to provide “new” objects generally means that overall program throughput is better using standard Java methods for allocation.

High frequency trading systems, however, have characteristics that make the use of object pools an efficient choice: they generally use a message passing architecture so the vast majority of the objects that would be garbage have a very definite lifecycle, and since they are real-time any thread that has become backed up has failed and should be killed. These properties allow object pools to be implemented for messages without synchronization, even when there are multiple threads consuming the messages.

This can be done by implementing the message pool as a circular queue, with each message containing a separate flag for each consuming thread indicating whether that thread has released the object. When a message is being allocated, all the flags are checked; any threads still outstanding are considered failed and should be terminated. An implementation might look something like this:

class CircularPool {

 private CircularPoolObject head;
 private final FailCallback failCallback;

 public CircularPool(FailCallback failCallback, int capacity) {
  this.failCallback = failCallback;
  head = new CircularPoolObject(failCallback);
  head.next = head;
  for(int i = 0; i < capacity; i++) {
   addNew();
  }
 }

 private void addNew() {
  CircularPoolObject p = new CircularPoolObject(failCallback);
  p.next = head.next;
  head.next = p;
  head = p;
 }

 public CircularPoolObject getNew() {
  head.allocate();
  CircularPoolObject newObject = head;
  head = head.next;
  return newObject;
 }

}

class CircularPoolObject {
 private final boolean[] doneFlags = new boolean[NUM_CONSUMERS];
 private final FailCallback failCallback;
 CircularPoolObject next = null;

 CircularPoolObject(FailCallback failCallback) { this.failCallback = failCallback; }

 void allocate() {
  for(int i = 0; i < NUM_CONSUMERS; i++) {
   if(!doneFlags[i]) {
    failCallback.fail(i);
   }
   doneFlags[i] = false;
  }
 }

 void dealloc(int consumerIndex) {
  doneFlags[consumerIndex] = true;
 }
}

interface FailCallback {
 void fail(int consumerIndex);  // kill the consumer that has failed
}

With Java it is usually very hard to make a system completely garbageless without using custom libraries. Standard solutions for things like logging and data structures all generate garbage, so the best one can usually achieve is to delay garbage collection until a convenient time–after market hours, for instance. Using the java flag -XX:NewRatio=1 will devote the entire heap to the “new” garbage collector generation, and giving the same value to -Xmx and -Xms will make sure the system doesn’t waste any time expanding the heap.

Still, if a system is generating garbage it will always be in danger of needing to collect at an inopportune time–because of extremely high market volume, for example. Systems that cannot afford to garbage collect then need to be architected so that impending collections are treated as system failures, activating backup and failover mechanisms. For a trading system, this could mean doing a live handoff of trading responsibilities to a failover machine before the collection occurs.

Obviously, even systems without such strict performance requirements can benefit from very robust failure mechanisms. Hopefully this post will get you thinking carefully about memory usage and failure modes in your own applications!