Baking availability SLA into the code

December 09, 2009

Availability and Partition Tolerance are essential for many distributed systems. A simple (though not comprehensive) way to measure both is to use response time SLAs between services, as implied by Jeff Darcy's observation:

Lynch (referring to the 2002 SIGACT paper) also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential component of these definitions (a point made even more explicitly in the Dynamo paper).

At kaChing we think that SLAs are something that should be baked into the code, so the developer has to think about them while creating the service contract. For that reason we created a @ResponseTime annotation for our internal services:

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import static java.util.concurrent.TimeUnit.MILLISECONDS;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import java.util.concurrent.TimeUnit;

@Retention(RUNTIME)
@Target({ TYPE })
public @interface ResponseTime {
  long warn();  // threshold above which the call is logged as slow
  long error(); // threshold above which the container may terminate the query
  TimeUnit unit() default MILLISECONDS;
}

A typical service query for online requests is annotated with

@ResponseTime(warn = 100, error = 200)

where the thresholds depend on the service's constraints, such as its access to resources. A ping query, for example, has

@ResponseTime(warn = 2, error = 4)

and an offline analytics call may take hours

@ResponseTime(warn = 10, error = 15, unit = HOURS)

Regardless of the actual numbers, every request should have an SLA, and the developer must think about it when writing the implementation.
Once we have this tool, the SLAs of the subsequent services (B, C and D) a service (A) needs to call can be statically computed to verify that no path in the tree of subsequent service calls exceeds the root service's (A) SLA. In other words, for service queries A, B, C and D: if A calls B, B calls C, and in parallel A calls D, then we should have SLA(A) > max(SLA(B) + SLA(C), SLA(D)).
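For illustration, a static check over such a call tree might look like the sketch below. The ServiceNode class and the way the call tree is represented are hypothetical (the post does not describe kaChing's actual representation); only the @ResponseTime annotation comes from above.

import java.util.List;

final class ServiceNode {
  final Class<?> service;          // service class carrying a @ResponseTime annotation
  final List<ServiceNode> callees; // services it calls; parallel calls are siblings

  ServiceNode(Class<?> service, List<ServiceNode> callees) {
    this.service = service;
    this.callees = callees;
  }

  long errorMillis() {
    ResponseTime rt = service.getAnnotation(ResponseTime.class);
    return rt.unit().toMillis(rt.error());
  }

  // Worst-case time of this subtree: its own SLA plus the longest callee path.
  long worstPathMillis() {
    long worstCallee = 0;
    for (ServiceNode callee : callees) {
      worstCallee = Math.max(worstCallee, callee.worstPathMillis());
    }
    return errorMillis() + worstCallee;
  }

  // Verify SLA(A) > max(SLA(B) + SLA(C), SLA(D)) at every level of the tree.
  void verify() {
    long worstCallee = 0;
    for (ServiceNode callee : callees) {
      callee.verify();
      worstCallee = Math.max(worstCallee, callee.worstPathMillis());
    }
    if (errorMillis() <= worstCallee) {
      throw new IllegalStateException(
          service.getName() + " SLA is tighter than its tree of subsequent calls");
    }
  }
}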
kaChing's service query container samples the time it takes for every call using perf4j and logs times and SLA violations to local disk. If the query's time exceeds the warning threshold it is logged accordingly, but if the error threshold is broken the container will try to terminate the query. Terminating the query is a bit harsh, but since our client timeout uses the SLA as well, chances are the client has already given up on the query and either retried (service queries are idempotent) or aborted. Another reason to shoot down a runaway query that exceeds the SLA error time is that it may be locking or consuming resources needed by other queries and slowing the whole system down.
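A minimal sketch of this enforcement, assuming the container runs each query on a worker thread, could look like the following. The SlaEnforcingContainer class, the Callable-based invocation, and the executor timeout are illustrative, not kaChing's actual container code; stopping the perf4j stop watch is what writes the timing message to the local log.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.perf4j.StopWatch;
import org.perf4j.log4j.Log4JStopWatch;

public final class SlaEnforcingContainer {

  private final ExecutorService workers = Executors.newCachedThreadPool();

  public Object invoke(Class<?> queryType, Callable<Object> body) throws Exception {
    ResponseTime sla = queryType.getAnnotation(ResponseTime.class);
    long warnMillis = sla.unit().toMillis(sla.warn());
    long errorMillis = sla.unit().toMillis(sla.error());

    StopWatch watch = new Log4JStopWatch(queryType.getSimpleName());
    Future<Object> result = workers.submit(body);
    try {
      // Wait at most until the error threshold, then give up on the query.
      Object value = result.get(errorMillis, TimeUnit.MILLISECONDS);
      long elapsed = watch.getElapsedTime();
      watch.stop(queryType.getSimpleName() + (elapsed > warnMillis ? ".warn" : ".ok"));
      return value;
    } catch (TimeoutException e) {
      result.cancel(true); // best effort: interrupt the runaway query
      watch.stop(queryType.getSimpleName() + ".error");
      throw e;
    }
  }
}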
The perf4j messages are also piped through a log4j socket appender (with async buffers on each side of the pipe) to a central hub. The hub computes statistics on the aggregated times, loads the SLAs of the queries, and checks that the cluster is not violating its SLA. The central hub can then send a daily report on SLAs as well as real-time alerts pinpointing a slower-than-expected service. Keeping these reports and comparing them to historical ones helps us see improvements or regressions in every part of the site.
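As a rough sketch of the sending side, the same wiring that is normally done in a log4j configuration file can be expressed programmatically as below. The hub host and port are placeholders, and the buffer sizes and logger level are assumptions; only the perf4j logger name and the appender classes come from log4j/perf4j themselves.

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.net.SocketAppender;

public final class PerfLogShipping {

  public static void shipTimingsToHub(String hubHost, int hubPort) {
    // perf4j writes its timing messages to this logger by default.
    Logger timingLogger = Logger.getLogger("org.perf4j.TimingLogger");
    timingLogger.setLevel(Level.INFO);

    // Buffer on the sending side so a slow hub never blocks query threads.
    AsyncAppender async = new AsyncAppender();
    async.addAppender(new SocketAppender(hubHost, hubPort));
    timingLogger.addAppender(async);
  }
}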

This monitoring technique is only a small portion of the automated system operations a small startup must have in order to stay flexible. Stay tuned for more posts on monitoring automation.