# Continuous Deployment for Data, Not Just Services

Last week, I made a mistake that caused some pages of our website to be unavailable for several minutes. I was developing a new feature that allows our customers to see how closely the money managers on our platform have stuck to their investing strategy in the past, which we consider a key ingredient in identifying outstanding money managers. These values are computed every night by an offline process and stored in one protocol buffer message per money manager, which is in turn persisted in a Voldemort store read by our front-end servers. This is an efficient way of compressing a large amount of data, with built-in support for forward compatibility: if a message contains fields that are unknown to the reader, they are ignored, allowing us to add new data to the message without worrying about which version of the message descriptor the readers are currently using.
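To illustrate this forward-compatibility property, here is a hypothetical schema evolution (the field names are illustrative, not our actual schema). A reader compiled against version 1 simply skips field 2 as an unknown field when decoding a message produced by the version 2 writer:

```protobuf
// Version 1, as compiled into the front-end readers.
message ManagerMessage {
  optional string name = 1;
}

// Version 2, used by the nightly writer. Old readers ignore
// field 2 entirely, so writers and readers can be upgraded
// independently.
message ManagerMessage {
  optional string name = 1;
  optional double strategy_adherence = 2;
}
```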

The mistake I made was in assuming that the same principle applies to protocol buffer enumerations (which is only partially correct, but I’ll skip the details). Because of this misunderstanding of the specification, the messages generated during the nightly process triggered a bug. It took several minutes until our monitoring hub, one of our immune system services which scrapes our web site, reported the problem. The quick fix was to replace the new messages with their previous versions.
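For readers curious about the pitfall: in proto2, an enum value that the reader does not recognize is not preserved in the field like an ordinary unknown field would be. A hedged sketch of the trap, with an illustrative enum:

```protobuf
enum Strategy {
  VALUE_INVESTING = 1;
  GROWTH = 2;
  // Adding a value here is NOT transparently forward compatible in
  // proto2: a reader compiled without value 3 treats the whole field
  // as unknown, so it sees the field as unset (or its default value)
  // instead of the new value.
  MOMENTUM = 3;
}
```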

At Wealthfront, when a production issue occurs, we organize a post-mortem: a meeting with 3 or 4 engineers to run a 5 Whys analysis and produce a list of tasks to improve our system. We strongly believe that understanding the root cause of a problem is required in order to decide how much time and effort we should invest to avoid making the same mistakes again. Although I will not detail the analysis here, you can read this post for an example. Since we’re applying continuous deployment, we need to assume that things can go wrong anywhere in the system, and therefore need to build an immune system that prevents our customers from being impacted by failures in our infrastructure.

Inspired by the Deployment Manager, which rolls back our services after a deployment if a problem is detected, we decided that the messages should likewise be rolled back automatically if they cause problems.

Even though the code is unit-tested and some sanity checks are applied to the messages before deploying them to Voldemort, the application needs to check that the system as a whole is happy with the new data. To do so, we leverage the monitoring hub to verify the availability of the affected web pages: it exposes an API that checks whether a page is available by performing multiple HTTP requests. The idea is to query the monitoring hub before and after the update in order to make sure that the page is still available.

```java
@Inject MonitoringHub monitoringHub;
@Inject VoldemortStore<Manager> store;

void conditionalUpdate(Id<Manager> managerId, ManagerMessage newMessage) {
  PageAvailability beforeAvailability =
      monitoringHub.getPageAvailability(managerId);
  Versioned<ManagerMessage> currentMessage = store.get(managerId);
  try {
    store.put(managerId, newMessage);
    PageAvailability afterAvailability =
        monitoringHub.getPageAvailability(managerId);
    if (afterAvailability.isWorseThan(beforeAvailability)) {
      // The new message degraded the page: restore the previous version.
      store.put(managerId, currentMessage.getValue());
    }
  } catch (RuntimeException e) {
    // Any failure during the update or the check also triggers a rollback.
    store.put(managerId, currentMessage.getValue());
    throw e;
  }
}
```
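To make the check-and-rollback pattern concrete, here is a self-contained toy version: an in-memory map stands in for the Voldemort store and a trivial string check stands in for the monitoring hub. All names here are illustrative, not our production API.

```java
import java.util.HashMap;
import java.util.Map;

public class RollbackDemo {
    // Stand-in for the Voldemort store: a simple in-memory map.
    static Map<String, String> store = new HashMap<>();

    // Stand-in for the monitoring hub: a page is "available" unless
    // the stored message is marked as corrupt.
    static boolean pageAvailable(String managerId) {
        String msg = store.get(managerId);
        return msg != null && !msg.contains("corrupt");
    }

    // Put the new message, re-check availability, and roll back
    // automatically if the page regressed.
    static void conditionalUpdate(String managerId, String newMessage) {
        boolean before = pageAvailable(managerId);
        String previous = store.get(managerId);
        store.put(managerId, newMessage);
        boolean after = pageAvailable(managerId);
        if (before && !after) {
            store.put(managerId, previous); // automatic rollback
        }
    }

    public static void main(String[] args) {
        store.put("m1", "v1-good");
        conditionalUpdate("m1", "v2-corrupt"); // breaks the page: rolled back
        System.out.println(store.get("m1"));   // prints "v1-good"
        conditionalUpdate("m1", "v2-good");    // healthy update: kept
        System.out.println(store.get("m1"));   // prints "v2-good"
    }
}
```

The key design choice is that the update is optimistic: we write first and verify afterwards, which is acceptable because a bad message is only live for the few seconds the availability check takes, instead of the minutes it took a human to react.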