Here at Sumo Logic, we run a log management service that ingests and indexes many terabytes of data a day; our customers then use our service to query and analyze all of this data. Powering this service are a dozen or more separate programs (which I will call assembly from now on), running in the cloud, communicating with one another. For instance the Receiver assembly is responsible for accepting log lines from collectors running on our customer host machines, while the Index assembly creates text indices for the massive amount of data pumping into our system constantly being fed by the Receivers.
We deploy to our production system multiple times each week, while our engineering teams are constantly building new features, fixing bugs, improving performance, and, last but not least, working on infrastructure improvements to help in the care and wellbeing of this complex big-data system. How do we do it? This blog post tries to explain our (semi)-continuous deployment system.
Running through hoops
In any continuous deployment system, you need multiple hoops that your software must pass through, before you deploy it for your users. At Sumo Logic, we have four well defined tiers with clear deployment criteria for each. A tier is an instance of the entire Sumo Logic service where all the assemblies are running in concert as well as all the monitoring infrastructure (health checks, internal administrative tools, auto-remediation scripts, etc) watching over it.
This is the first step in the sequence of steps that our software goes through. Originally intended as a nightly deploy, we now automatically deploy the latest clean builds of each assembly on our master branch several times every day. A clean build means that all the unit tests for the assemblies pass. In our complex system, however, it is the interaction between assemblies which can break functionality. To test these, we have a number of integration tests running against Night regularly. Any failures in these integration tests are an early warning that something is broken. We also have a dedicated person troubleshooting problems with Night whose responsibility it is, at the very least, to identify and file bugs for problems.
We cut a release branch once a week and use Stage to test this branch much as we use Night to keep master healthy. The same set of integration tests that run against Night also run against Stage and the goal is to stabilize the branch in readiness for a deployment to production. Our QA team does ad-hoc testing and runs their manual test suites against Stage.
Right before production is the Long tier. We consider this almost as important as our Production tier. The interaction between Long and Production is well described in this webinar given by our founders. Logs from Long are fed to Production and vice versa, so Long is used to monitor and trouble shoot problems with Production.
Deployments to Long are done manually a few days before a scheduled deployment to Production from a build that has passed all automated unit tests as well as integration tests on Stage. While the deployment is manually triggered, the actual process of upgrading and restarting the entire system is about as close to a one-button-click as you can get (or one command on the CLI)!
After Long has soaked for a few days, we manually deploy the software running on Long to Production, the last hoop our software has to jump through. We aim for a full deployment every week and often times will do smaller upgrades of our software between full deploys.
Being Production, this deployment is closely watched and there are a fair number of safeguards built into the process. Most notably, we have two dedicated engineers who manage this deployment, with one acting as an observer. We also have a tele-conference with screen sharing that anyone can join and observe the deploy process.
Closely associated with the software infrastructure are the social aspects of keeping this system running. These are:
We have well defined ownership of these tiers within engineering and devops which rotate weekly. An engineer is designated Primary and is responsible for Long and Production. Similarly we have a designated Jenkins Cop role, to keep our continuous integration system and Night and Stage healthy.
Group decision making and notifications
We have a short standup everyday before lunch, which everyone in engineering attends. The Primary and Jenkins Cop update the team on any problems or issues with these tiers for the previous day.
In addition to a physical meeting, we use Campfire, to discuss on-going problems and notifying others of changes to any of these tiers. If someone wants to change a configuration property on night to test a new feature, the person would update everyone else on campfire. Everyone (and not just the Primary or Jenkins Cop) is in the loop about these tiers and can jump in to troubleshoot problems.
Automate almost everything. A checklist for the rest.
There are certain things that are done or triggered manually. In cases where humans operate something (a deploy to Long or Production for instance), we have a checklist for engineers to follow. For more on checklists, I refer you to an excellent book, The Checklist Manifesto.
This system has been in place since Sumo Logic went live and has served us well. It bears mentioning that the key to all of this is automation, uniformity, and well-delineated responsibilities. For example, spinning up a complete system takes just a couple of commands in our deployment shell. Also, any deployment (even a personal one for development) comes up with everything pre-installed and running, including health checks, monitoring dashboards or auto-remediation scripts. Identifying and fixing a problem on Production is no different from that on Night. In almost every way (except for waking up the Jenkins Cop in the middle of the night and the sizing), these are identical tiers!
While automation is key, it doesn’t take away the fact that people who run and keep things healthy. A deployment to production can be stressful, more so for the Primary than anyone else and having a well defined checklist can take away some of the stress.
Any system like this needs constant improvements and since we are not sitting idle, there are dozens of features, big and small that need to be worked on. Two big ones are:
Red-Green deployments, where new releases are rolled out to a small set of instances and once we are confident they work, are pushed to the rest of the fleet.
More frequent deployments of smaller parts of the system. Smaller more frequent deployments are less risky.
In other words, there is a lot of work to do. Come join us at Sumo Logic!