Web applications usually start out as a single codebase. In time, that little monolith grows. Before you know it, the application, team, and business have all grown to the point that the not-so-little monolith is a bottleneck. It’s time to start breaking it down into manageable, orthogonal services.
There are a lot of good reasons for moving toward a service-oriented architecture – less disruptive deployments, flexibility in choice of language and tooling, smaller codebase, better fault isolation – but I’m going to assume that you are already sold on the idea.
Switching over to a newly created service is a difficult problem. If that service stores data, you likely want to move the existing data over to it. That's easy if you can take down the relevant parts of the site for a few hours while you move the data, but chances are you don't want to do that. Likewise, you want to avoid downtime caused by bugs or performance issues in the new service that are only discovered once it is exposed to production traffic.
The goal of this plan is to avoid any scary “flip the switch and pray” moments. Every move you make should be small, low risk, and easy to undo.
Before writing the service, before thinking about the API, before even making changes to the monolith codebase, think about how ready the monolith is to have major changes made to it. Are there unit or functional tests for the flows you'll be working with? It's nice to feel assured that your changes aren't breaking anything. Besides, this stuff should have had tests all along.
Another question: are you capturing performance and error rate metrics? Do you have good dashboards set up for these metrics? Do you know how often each endpoint is being hit? These are always good to have, but they are especially important when making major changes to the architecture. It’s easy to accidentally make things slower or introduce an error case and not find out about it. Plus, if you make things faster, it feels good to show off the graph in Slack.
It’s also well worth your time to clean up unused features. Why waste time porting something just to turn around and delete it? Cleaning up old features is also a good exercise to re-familiarize yourself with everything that happens in that part of the code. You may find some surprises!
The Beauty and the Beast
The existing monolith might be a bit of a beast. In all likelihood, the logic we want to extract is spread around in several odd corners. We'll need to tame the beast before we can introduce it to our beautiful new service. We need to gather up all that logic and move it behind an interface. This interface will eventually become a template for the API exposed by the service, and will be directly implemented by the service's client library. The interface is also a good place to add an instrumentation decorator that captures metrics such as how often each method is called and how long calls take.
This interface needs to encapsulate the implementation well enough that it works just as well for the existing code as it will for the new service. If some piece of functionality is awkward to move behind the interface, it will likely also be awkward to integrate into the service. Changing the interface now costs far less than changing it after the service has been built. If the code is large and complicated, it may be worthwhile to move things behind the interface one piece at a time, so you can deploy smaller changes.
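As a concrete sketch, here is what such an interface and its instrumentation decorator might look like in Python. The `InvoiceStore` interface, its methods, and the `record` callback are all hypothetical names for illustration; in production, `record` would point at your metrics client (a statsd timing call, say).

```python
import time
from abc import ABC, abstractmethod

# Hypothetical interface for an "invoicing" feature being extracted.
class InvoiceStore(ABC):
    @abstractmethod
    def create_invoice(self, customer_id: int, amount_cents: int) -> int: ...

    @abstractmethod
    def get_invoice(self, invoice_id: int) -> dict: ...

# The existing monolith logic, gathered up and moved behind the interface.
class LocalInvoiceStore(InvoiceStore):
    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def create_invoice(self, customer_id, amount_cents):
        invoice_id = self._next_id
        self._next_id += 1
        self._rows[invoice_id] = {"customer_id": customer_id,
                                  "amount_cents": amount_cents}
        return invoice_id

    def get_invoice(self, invoice_id):
        return self._rows[invoice_id]

# Decorator that wraps any InvoiceStore and records call counts and latency.
class InstrumentedInvoiceStore(InvoiceStore):
    def __init__(self, inner: InvoiceStore, record):
        self._inner = inner
        self._record = record  # callback(name, seconds) -> metrics backend

    def _timed(self, name, fn, *args):
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            self._record(name, time.monotonic() - start)

    def create_invoice(self, customer_id, amount_cents):
        return self._timed("create_invoice",
                           self._inner.create_invoice, customer_id, amount_cents)

    def get_invoice(self, invoice_id):
        return self._timed("get_invoice",
                           self._inner.get_invoice, invoice_id)
```

Because the decorator implements the same interface, callers can't tell the difference, and later the service's client library can slot in behind the exact same decorator.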
Building that interface should clarify what the service needs to do. Armed with that knowledge, decide on the design of the service. This is a good point to write an RFC. Getting your design down on paper helps find bugs in your thinking. As someone once said, “weeks of coding can save you hours of planning.” Don’t know what to write? Pretend a three-year-old is questioning you, and answer “why?” about the decisions you made. Reconsider your plans when you don’t find your own arguments convincing.
Building a service is basically just building a smaller web app, so the process of actually writing it doesn't require much explanation. That said, there are a few things you should consider. What makes the connection between the monolith and the service secure and private? What metrics do you need to monitor on the new service? How are failures handled: what happens to the monolith if the service goes down, and what happens to the service if one of its dependencies goes down? What's the plan for restoring from backups? Rehearsing failure cases (actually restore from a backup) will often surface gaps in these processes. Remember, a backup that you can't restore from isn't a backup.
Bringing the service up
Before switching production to the shiny new service, you want to feel sure that it will handle the load and will be stable. To do that, we’ll test it with some traffic. To capture the full variety of possible values, there is nothing quite like production traffic, so that’s what we’ll use.
Using that interface you just created, make a proxy within the monolith that duplicates some traffic to the new service. The new service isn't in production "for real" yet, so the monolith still relies on the existing code's results. Errors returned by the new service should be logged but otherwise ignored. If the call to the new service is expected to take a meaningful amount of time, consider running it in parallel with the existing logic, or even asynchronously. Either way, enforce a timeout so a slow service can't drag down the whole application.
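A minimal sketch of such a shadowing proxy, reusing the hypothetical invoicing interface from earlier. The duplicate call runs on a thread pool so the request path doesn't block on it, and the bounded `future.result(timeout=...)` keeps a slow service from holding up the monolith; errors from the shadow call are logged and ignored.

```python
import concurrent.futures
import logging

log = logging.getLogger("shadow")
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

class ShadowingInvoiceStore:
    """Serves from the old implementation while mirroring calls to the new service."""

    def __init__(self, old, new, timeout_secs=0.2):
        self._old = old
        self._new = new
        self._timeout = timeout_secs

    def create_invoice(self, customer_id, amount_cents):
        # Fire the duplicate call in parallel with the existing logic.
        future = _pool.submit(self._new.create_invoice, customer_id, amount_cents)
        # The old implementation is still authoritative.
        result = self._old.create_invoice(customer_id, amount_cents)
        try:
            # Bounded wait: a slow or broken service only costs us the timeout.
            future.result(timeout=self._timeout)
        except Exception:
            log.warning("shadow create_invoice failed", exc_info=True)
        return result
```

The same pattern applies to every method on the interface; the class names and timeout value here are illustrative placeholders.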
Start with a small percentage of traffic going to the new service. Gradually ramp up the load so you can see and fix performance issues while they are still minor. Keep the factor by which the load grows at each step small and consistent by ramping geometrically (1, 2, 4, 8, 16%) rather than linearly (1, 5, 10, 15%, etc.): with the linear ramp, the first jump from 1% to 5% quintuples the load on the service. It doesn't hurt to have a feature flag system that lets you quickly cut traffic to the service without waiting for a deploy, just in case.
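One simple way to implement the percentage ramp is to bucket requests deterministically by hashing a stable identifier, so a given user or request stays consistently in (or out of) the shadowed slice as the percentage grows. The flag value and `request_id` parameter here are illustrative assumptions:

```python
import zlib

# Hypothetical flag value, read from your feature-flag system per request.
SHADOW_PERCENT = 4  # current step in the 1, 2, 4, 8, 16... ramp

def in_shadow_bucket(request_id: str, percent: int) -> bool:
    """Deterministically place a request into the shadowed percentage.

    Hashing the id (rather than rolling a random number) keeps a given
    user or request consistently in or out as the percentage ramps up.
    """
    if percent <= 0:
        return False
    return zlib.crc32(request_id.encode()) % 100 < percent
```

A nice property of the `< percent` comparison is that the ramp is monotonic: every request shadowed at 4% is still shadowed at 8%, so ramping up never flips anyone out of the experiment.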
In order to know it's safe to move the load up to the next step, you need metrics that tell you how the service is currently doing. Make sure you aren't seeing errors from the new service. Watch both the time the service takes to serve requests and the total time it takes to run both the new and the old code. If it makes sense for the type of service, also check how often the result returned from the service differs in a meaningful way from the existing code's result.
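That result comparison can be as simple as a field-by-field diff that skips fields expected to legitimately differ (ids, timestamps) and emits a match/mismatch metric. Everything here (the ignored field names, the `metric` callback) is an illustrative assumption:

```python
import logging

log = logging.getLogger("shadow.compare")

# Fields expected to legitimately differ between implementations (illustrative).
IGNORED_FIELDS = {"id", "created_at"}

def compare_results(old: dict, new: dict, metric) -> None:
    """Emit a match/mismatch metric and log which fields disagreed."""
    diffs = [
        key
        for key in old.keys() | new.keys()
        if key not in IGNORED_FIELDS and old.get(key) != new.get(key)
    ]
    if diffs:
        metric("shadow.mismatch", 1)
        log.warning("results differ on fields: %s", sorted(diffs))
    else:
        metric("shadow.match", 1)
```

Graphing the mismatch rate alongside latency and errors gives you the "is it safe to ramp up?" dashboard in one place.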
If this service does processing that has real-world side effects (such as sending email), send those to a sandbox account or dummy backend for now.
If the service stores data, queries against it will return the wrong results at first, since its datastore starts out empty. Once you are duplicating 100% of write operations you can start backfilling data to fix that (if you backfill before the service receives all writes, the data will just fall out of date again). Watch out for performance problems that only crop up once all the data is present: something like a missing index can go unnoticed until there is real volume. To avoid impacting the performance of either the existing or the new datastore, you may need to throttle the backfill process.
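A throttled backfill can be as simple as copying rows in small batches with a pause between them. The `bulk_insert` method, the batch size, and the pause length are placeholder assumptions; tune them against the actual headroom of your datastores.

```python
import time

def backfill(source_rows, new_service, batch_size=500, pause_secs=1.0,
             sleep=time.sleep):
    """Copy existing rows into the new service in small, throttled batches.

    Run this only after 100% of writes are being duplicated, so rows
    written while the backfill runs aren't missed.
    """
    batch = []
    copied = 0
    for row in source_rows:
        batch.append(row)
        if len(batch) >= batch_size:
            new_service.bulk_insert(batch)
            copied += len(batch)
            batch = []
            sleep(pause_secs)  # throttle to protect both datastores
    if batch:  # flush the final partial batch
        new_service.bulk_insert(batch)
        copied += len(batch)
    return copied
```

Injecting `sleep` as a parameter is a small convenience that makes the throttling testable without actually waiting.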
The not-so-big switch
Once the service has been running on 100% of traffic for a while without problems, it’s time to start using it for real. At this point both the old code and the new service are working on all requests, so we’ll simply switch the roles for some portion of requests. Again here, we’ll gradually ramp up the proportion of traffic we switch over.
To switch roles for a request, switch which implementation is sandboxed when interacting with the outside world and switch which results are used. Even once 100% of the operations are switched over to the new service, keep duplicating operations (at least, write operations) to the old implementation. That keeps the data in that implementation fresh so you can switch back to it if need be. This delays the point of no return until after the new service has been serving 100% of production load for long enough for you to be totally confident in it.
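The role switch can live in the same proxy that handled shadowing: pick which implementation is authoritative per request, and keep writing to the other so its data stays fresh. All names here are hypothetical, and `use_new` would come from the same percentage-ramp flag used during the shadowing phase:

```python
import logging

log = logging.getLogger("switch")

def create_invoice_switched(old, new, use_new, customer_id, amount_cents):
    """Duplicate the write to both implementations; use one result as authoritative."""
    primary, shadow = (new, old) if use_new else (old, new)
    result = primary.create_invoice(customer_id, amount_cents)
    try:
        # Keep the other implementation's data fresh so we can switch back.
        shadow.create_invoice(customer_id, amount_cents)
    except Exception:
        log.warning("shadow write failed", exc_info=True)
    return result
```

Because both sides receive every write either way, flipping `use_new` back to `False` remains a safe, instant rollback right up until you decide to stop dual-writing.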
The old implementation should end not with a bang but with a yawn. Crossing the point of no return should be the boring part. Stop writing to the old implementation, clean up the code, take a backup, drop the table. Share some celebratory cookies.
While waiting to shake out any bugs before killing off the old implementation, think back over how things went. What helped the process go well? What was unnecessary? What could have been better? Make a list of things to do differently next time. Share your insights. Order cookies.