We haven’t said much about our continuous deployment system recently. Mostly that’s because there hasn’t been much to say. We invest in systems and infrastructure through a process called proportional investment: we spend time on the areas that cause us problems. Our deployment infrastructure has performed well, so it has required little incremental investment.
However, we recently made a significant improvement that is worth discussing: the addition of API compatibility checking to our instance deployment strategy. Our existing strategy was a fairly intricate process: unannounce the service from our ZooKeeper-based service discovery system, deploy the new package, run a self test to verify that the instance will work in production (checking database connectivity, Java Cryptography Extension installation and a number of other things that have caused us issues in the past), announce the service, and then monitor system and business metrics to gauge the production impact of the new code. While that monitoring frequently caught service compatibility issues, the fundamental tradeoff between speed of deployment and the comprehensiveness of post-deployment monitoring had occasionally caused us trouble.
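The ordering of those steps matters: a failed self test should leave the instance unannounced. The following is a minimal sketch of that sequence, not our actual deployment code. The step names and the `deploy_instance` function are hypothetical, and each step is assumed to be a callable that returns True on success.

```python
from typing import Callable, Dict, List

# Hypothetical step names, in the order described above.
STEP_ORDER = [
    "unannounce",
    "install_package",
    "self_test",
    "announce",
    "monitor_metrics",
]


def deploy_instance(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each step in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name in STEP_ORDER:
        if not steps[name]():
            break  # later steps, including "announce", never run
        completed.append(name)
    return completed
```

If `self_test` returns False, the result contains only `unannounce` and `install_package`, so the instance never takes production traffic.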
The most significant class of problems arose when an engineer moved too fast and deployed new code out of order, leaving a newly deployed service trying to invoke RPC calls that didn’t exist on an older service. This is a common problem for engineering teams managing service-oriented architectures with complex distributed systems. Our development and infrastructure methodologies make us somewhat resistant to these types of errors because our ability to effect change at scale quickly means that there is relatively little global system state that an engineer has to manage in their head at any given time. Significant system changes that can take other organizations a few months to execute can be performed safely in our system within an hour or two. That makes errors of this type less frequent, but doesn’t eradicate them.
When we were recently presented with a service compatibility issue caused by deployment ordering, we had a choice to make. We could add a human-managed process to the release cycle or we could attempt to manage it programmatically. We are averse to human processes because they are slower and more error-prone than well-written equivalent automation, so we embarked on automating a solution.
Our production backend services are based on a JSON-over-HTTP RPC framework. A URL endpoint exposes an arbitrary number of invokable calls with typed arguments. Some of these arguments are optional, with default values. We had a bit of internal debate over the scope of the solution. Should we type check RPC invocations or just check existence and method arity? Ultimately, we found that every issue from the last two years that a system like this could have prevented would have been caught by the simpler check of existence and arity. While type checking the RPC arguments would have been cool, it was also needlessly complex for the problem we were facing.
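To make the simpler option concrete, here is a minimal sketch of an existence-and-arity check. The function names and the representation (a call as a name plus an argument count, a service as a map from call name to accepted arities) are assumptions for illustration, not our framework’s API.

```python
from typing import Dict, FrozenSet, Tuple


def arities(required: int, optional: int) -> FrozenSet[int]:
    """Every argument count a call accepts, given its optional arguments."""
    return frozenset(range(required, required + optional + 1))


def is_supported(call: Tuple[str, int], exposed: Dict[str, FrozenSet[int]]) -> bool:
    """True if the named call exists and accepts that number of arguments.

    Argument types are deliberately ignored.
    """
    name, argc = call
    return argc in exposed.get(name, frozenset())
```

For example, a hypothetical `getBalance` call with one required and one optional argument accepts one or two arguments, so an invocation with three is rejected, as is any call whose name is not exposed at all.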
We began the implementation by standardizing on a way for services to expose the RPC calls they use. These can be declared statically at service build time, discovered through reflection either at build time or live on production, or recorded from the invocations our RPC client makes on production. Each of these approaches has tradeoffs, and it’s simplest to say that we use a mixture of them depending on the service and the certainty that we desire.
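Of the three approaches, recording invocations is the easiest to sketch. The wrapper below is a hypothetical illustration rather than our RPC client: `RecordingClient` and the `send` callable it wraps are assumed names, and a call is recorded as a (service, method, argument count) tuple.

```python
from typing import Any, Callable, Set, Tuple

Call = Tuple[str, str, int]  # (service, method, number of arguments)


class RecordingClient:
    """Wraps a transport function and remembers every call shape it sends."""

    def __init__(self, send: Callable[[str, str, tuple], Any]) -> None:
        self._send = send
        self.seen: Set[Call] = set()

    def invoke(self, service: str, method: str, *args: Any) -> Any:
        # Record the call's shape before delegating to the real transport.
        self.seen.add((service, method, len(args)))
        return self._send(service, method, args)
```

Recording only captures calls that have actually been made, which is one reason to combine it with static declaration or reflection.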
Next we needed a way to retrieve the RPC calls that a service supports. Luckily we already had a debugging RPC call named “Help” in production. We added support for an additional machine-readable output format. Optional arguments are handled by reporting every arity that the RPC call supports.
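As an illustration only, suppose the machine-readable output were a JSON object mapping each call name to the list of arities it supports. That format and the `parse_help` function are assumptions made for this sketch; the real output format is not described here.

```python
import json
from typing import Dict, Set


def parse_help(text: str) -> Dict[str, Set[int]]:
    """Turn a hypothetical Help response into a map of call name to arities."""
    raw = json.loads(text)
    return {name: set(counts) for name, counts in raw.items()}


# A call with one required and one optional argument reports both arities.
example = '{"getBalance": [1, 2], "closeAccount": [1]}'
supported = parse_help(example)
```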
Finally we added compatibility checking to our deployment system. When a new instance is released, but before it announces itself to take production traffic, the deployment system verifies that all RPC calls the new instance makes are present on production and that all RPC calls existing production services make to the service type of the new instance are still available. If there are problems with either check, the deployment is stopped, a rollback is initiated and developers are notified with specifics.
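The two checks can be sketched as follows. This is a simplified illustration under assumed data shapes, not our deployment system: every name below is hypothetical, used calls are (method, arity) pairs grouped by target service, and exposed calls map a method name to its supported arities.

```python
from typing import Dict, List, Set, Tuple

Exposed = Dict[str, Set[int]]            # method name -> supported arities
Used = Dict[str, Set[Tuple[str, int]]]   # target service -> {(method, arity)}


def missing_calls(used: Used, exposed_by_service: Dict[str, Exposed]) -> List[str]:
    """List every used call that its target service does not expose."""
    problems: List[str] = []
    for service in sorted(used):
        exposed = exposed_by_service.get(service, {})
        for method, argc in sorted(used[service]):
            if argc not in exposed.get(method, set()):
                problems.append(f"{service}.{method}/{argc}")
    return problems


def check_deployment(
    new_service: str,
    new_used: Used,
    new_exposed: Exposed,
    prod_used: Dict[str, Used],          # caller service -> calls it makes
    prod_exposed: Dict[str, Exposed],
) -> List[str]:
    """Return all incompatibilities; an empty list means it is safe to announce."""
    # Outbound: calls the new instance makes must exist on production.
    problems = missing_calls(new_used, prod_exposed)
    # Inbound: calls production makes to this service type must still exist.
    for caller in sorted(prod_used):
        calls = prod_used[caller].get(new_service, set())
        for p in missing_calls({new_service: calls}, {new_service: new_exposed}):
            problems.append(f"{caller} -> {p}")
    return problems
```

A non-empty result is what would stop the deployment, and its entries are the specifics sent to developers.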
A few months after releasing this system, we recently had our first significant automatic rollback caused by a service compatibility issue. There are a few points worth making in summary. First, this relatively small investment in infrastructure a few months ago was still paying off last week and will continue to be useful for a long time to come. Second, the relative infrequency of issues of this type (especially when compared to our current rate of 20-50 deployments per day) makes it incredibly unlikely that a human process would have shown enough continued vigilance to catch this. Third, if you’re interested in working with us to democratize access to sophisticated investment advice while working on state-of-the-art infrastructure, email us at email@example.com.