Last week, I made a mistake that caused some pages of our website to be unavailable for several minutes. I was developing a new feature that allows our customers to see how well the money managers on our platform have stuck to their investing strategy in the past, which we consider a key ingredient in identifying outstanding money managers. These values are computed every night as an offline process and stored in one protocol buffer message per money manager, which is itself persisted into a Voldemort store read by our front-end servers. This is an efficient way of compressing a large amount of data, with built-in support for forward compatibility: if a message contains fields that are unknown to the reader, they are simply ignored, allowing us to add new data to the message without worrying about which version of the message descriptor the readers are currently using.
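As a rough illustration of that forward compatibility (the field name and accessor here are assumptions, not our actual schema), an older reader can safely parse bytes that were serialized with a newer message descriptor:

// Minimal sketch, assuming a generated protocol buffer class ManagerMessage
// with a hypothetical consistencyScore field.
import com.google.protobuf.InvalidProtocolBufferException;

public class ManagerMessageReader {
  public double readConsistencyScore(byte[] bytes) throws InvalidProtocolBufferException {
    // The bytes were serialized by the nightly job, possibly with a newer descriptor.
    ManagerMessage message = ManagerMessage.parseFrom(bytes);
    // Fields added to the schema after this reader was compiled are kept as
    // unknown fields and ignored here; the known fields remain readable.
    return message.getConsistencyScore();
  }
}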
The mistake I made was in thinking that the same principle would apply to protocol buffer enumerations (which is only partially correct, but I'll skip the details). Because of this misunderstanding of the specification, the messages generated during the nightly process triggered a bug. It took several minutes before our monitoring hub, one of our immune system services that scrapes our web site, reported the problem. The quick fix was to replace the new messages with their previous versions.
At Wealthfront, when a production issue occurs, we organize a post-mortem: a meeting with 3 or 4 engineers to do a 5 Whys analysis and produce a list of tasks to improve our system. We strongly believe that understanding the root cause of a problem is required in order to decide how much time and effort we should invest to avoid making the same mistake again. Although I will not detail the analysis here, you can read this post for an example. Since we practice continuous deployment, we have to assume that things can go wrong anywhere in the system, and therefore need to build an immune system that prevents our customers from being impacted by failures in our infrastructure.
Inspired by the Deployment Manager, which rolls back our services after a deployment if a problem is detected, we decided that the messages should also be rolled back automatically if they cause problems.
Even though the code is unit-tested and some sanity checks are applied to the messages before deploying them to Voldemort, the application needs to check that the system as a whole is happy with the new data. To do so, we leverage the monitoring hub to verify the availability of the affected web pages. The monitoring hub exposes an API that verifies the availability of a web page by performing multiple HTTP requests. The idea is to call the monitoring hub before and after the update to make sure that the page is still available.
@Inject MonitoringHub monitoringHub;
@Inject VoldemortStore<ManagerMessage> store;

void conditionalUpdate(Id<Manager> managerId, ManagerMessage newMessage) {
  // Capture the page availability and the current message before the update.
  PageAvailability beforeAvailability = monitoringHub.getPageAvailability(managerId);
  Versioned<ManagerMessage> currentMessage = store.get(managerId);
  try {
    store.put(managerId, newMessage);
    // Re-check the page with the new data in place.
    PageAvailability afterAvailability = monitoringHub.getPageAvailability(managerId);
    checkDegradation(beforeAvailability, afterAvailability);
  } catch (PageDegradationException e) {
    // Roll back to the previous message and let the exception propagate.
    store.put(managerId, currentMessage.getValue());
    throw e;
  }
}
The put and get methods are the Java Map-like abstraction over the Voldemort store. PageAvailability is the return type of the monitoring hub API, containing the HTTP response code, the time it took to load the page, the content of the page, and so on.
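As a rough sketch (the field names are assumptions based on the description above, not the actual class), PageAvailability might look like this:

// Hypothetical value object returned by the monitoring hub; the fields are
// assumptions based on the description above (HTTP status, load time, page body).
public class PageAvailability {
  private final int httpStatusCode;
  private final long loadTimeMillis;
  private final String content;

  public PageAvailability(int httpStatusCode, long loadTimeMillis, String content) {
    this.httpStatusCode = httpStatusCode;
    this.loadTimeMillis = loadTimeMillis;
    this.content = content;
  }

  public int getHttpStatusCode() { return httpStatusCode; }
  public long getLoadTimeMillis() { return loadTimeMillis; }
  public String getContent() { return content; }
}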
12 lines of code (and 100 lines of tests) are sufficient to ensure that if a degradation is detected, the system automatically rolls back the newly generated message. The most obvious degradation is the page no longer being rendered, but we also verify the presence of keywords in the response text, and we could compute statistics on the rendering times to detect anomalous situations. Since the exception is propagated, it will be collected by our monitoring system, which will force one of our engineers to acknowledge the issue.
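A minimal sketch of what checkDegradation could look like, assuming the hypothetical PageAvailability fields above; the actual checks, keywords, and thresholds are not shown in this post:

// Illustrative sketch only: the keyword and checks are made-up examples.
void checkDegradation(PageAvailability before, PageAvailability after) {
  // The page rendered fine before the update but no longer does.
  if (before.getHttpStatusCode() == 200 && after.getHttpStatusCode() != 200) {
    throw new PageDegradationException("Page stopped rendering after update");
  }
  // A keyword that was present before the update disappeared from the response.
  if (before.getContent().contains("Investing strategy")
      && !after.getContent().contains("Investing strategy")) {
    throw new PageDegradationException("Expected keyword missing after update");
  }
  // Rendering-time statistics could also be compared here to detect anomalies.
}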
The implementation of this conditional update shares some similarities with transactional updates, as typically applied to databases. The main difference is that we first commit and then roll back if necessary, which is not the ideal solution. We are also considering applying the conditional update to our staging environment first, and doing the "commit", i.e. the actual production update, only afterwards. But another important value at Wealthfront is "Embrace Pareto", and we have reached a reasonably safe solution with this simple implementation.
I hope this example helped you understand how Wealthfront applies Eric Ries' advice that "we're willing to make any mistake ONE time": nobody is fired or punished for it. The problem would be repeating the same mistake a second time because the organization didn't learn.