How StepZen Balances Shipping Quickly and Running a Reliable Service
All companies running cloud services have to balance the twin goals of shipping software quickly and keeping the service running reliably. Disciplines like Site Reliability Engineering, or SRE, specialize in making this tradeoff effectively. However, what works well in a large company is hard to replicate in a small team like StepZen, even though the tradeoff matters even more for a startup.
This article is about how one startup (ours) makes this tradeoff.
It starts with two rules. These are guiding principles that we try to follow because they help align the twin goals of shipping fast and operating the service reliably.
Rule 1: Keep things simple
This is the first rule because it is the most important one, and it is not as obvious as it appears. Specifically, it is neither about how we write software nor about how our customers use our services. It's about how we operate and run our service.
A simple system is one that needs little knowledge to operate. For an engineer building a system, it's about all the things they do not need to know or remember. Because modern services are very sophisticated under the covers, it takes extra work to keep things simple to operate. It's this principle that drives the following behaviors.
- We automate things whenever possible.
- We do not expose parameters or knobs, but instead choose values thoughtfully and standardize them.
- We do not add additional code, additional languages, or additional build processes to the stack, even if they are better in some way.
- Everyone should know where and how to find information if and when they need to. Do not over-organize, do not under-organize.
- We use external services to do things outside our core function, even if they are not a perfect fit for what we need, because each one is one less thing that we need to know or understand deeply. Simplicity above perfection!
Rule 2: Make problems visible early
This rule is almost as important as the first rule, and it is not about testing and catching issues. It's about visibility. Obviously, we should invest in testing, and we should run all test code on every change. But we also need to make test failures visible, for example:
- Stop commits when even one test fails
- Keep logs of test runs
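As a minimal sketch of those two practices, the gate below runs a test command, writes its output to a log file, and reports whether the change may proceed. The function name `run_gate`, the log file naming, and the idea of calling it from a commit hook or CI job are assumptions for illustration, not a description of StepZen's actual tooling.

```python
import subprocess
import time
from pathlib import Path


def run_gate(test_cmd, log_dir):
    """Run the test command, keep its output as a log, and say whether it passed.

    Returns a (passed, log_path) tuple. `passed` is False if the command
    exits with a non-zero status, which is how most test runners signal
    that at least one test failed.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # Capture stdout and stderr so the whole run can be inspected later.
    result = subprocess.run(test_cmd, capture_output=True, text=True)

    # A nanosecond timestamp keeps every run in its own log file.
    log_path = log_dir / f"test-run-{time.time_ns()}.log"
    log_path.write_text(result.stdout + result.stderr)

    return result.returncode == 0, log_path
```

A commit hook or CI step could call `run_gate` with the project's test command and refuse the change whenever the first element of the result is `False`.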
We take this further by deploying the tip of our codebase to a production-like environment automatically. By deploying a possibly broken system, you make the problems with it visible to everyone in the company. This creates a shared context and enables collaborative efforts to fix issues. It creates a culture where keeping the codebase always release-worthy becomes a habit.
This principle implies that we can ship small incremental changes, and that we prefer a drumbeat of micro-feature releases to big dramatic releases.
Culture
The rules express our philosophy in a nutshell and guide a lot of our decision making. But the rules only work if we make an additional cultural shift. The visibility can create a challenging environment that undermines confidence. Therefore, the environment must be blameless and supportive. Ownership of problems has to be collective.
So how do we put the rules into practice?
- Deployment Context. We run a two-branch system within our core git repository. The branches are main and production. The tips of both branches are deployed using otherwise identical stacks. There are very few differences, the most important being that production is the live service and main is internal. But for now, the important thing is that the similarities are far more important than the differences.
Differences create more things that we need to know and remember. Therefore, they increase complexity. Eliminate the differences whenever possible. Keep things simple.
- Why not the traditional three branches? We have resisted running a staging branch, although it is not always possible to do so. A staging branch changes the culture, and the quality of main becomes less important. That has the knock-on effect of building up debt in the form of undone testing and quality validation, which in turn creates friction in the tradeoff between shipping quickly and running reliably.
A two-branch system is simpler. Keep things simple.
- Forced release on Tuesday. We force a release to production every Tuesday. This does not imply that we can't release on other days. We can, and we do. But the Tuesday release ensures that work does not build up on main. We view unreleased code as unserviced tech debt. The Tuesday release also calls attention to anything broken on Monday and aligns focus on getting those issues resolved. Over time, Tuesday releases have become non-events because main is almost always release-worthy. Nobody gets fussed about it anymore.
Releases make things visible.
- Post-deployment Tests. Post-deployment tests put our platform through some of our most involved use cases. Because these use cases are driven through known queries with known results, they guard against regressions. Following any deployment from the tip of main or from the tip of production, we run these tests, and if they fail, we roll back the deployment. The tests are triggered by the deployment and run within two minutes of a completed deployment.
Deployment tests run the things that real users do. They are not purpose-built test harnesses. And because we run these tests after every commit to main, we can usually pinpoint exactly which change caused which use case to fail.
Look for the problems that your users will see, and make them visible.
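The flow described above (known queries, known results, roll back on failure) can be sketched roughly as follows. The names `UseCase`, `failed_use_cases`, and `verify_deployment`, and the `run_query` and `rollback` callables, are hypothetical stand-ins; how queries are actually sent to the deployed service and how a rollback is actually triggered are not specified in this article.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class UseCase:
    name: str      # human-readable label, reported when the case fails
    query: str     # the known query sent to the freshly deployed service
    expected: Any  # the known result the query must return


def failed_use_cases(cases: Iterable[UseCase],
                     run_query: Callable[[str], Any]) -> List[str]:
    """Return the names of use cases whose live result differs from the known one."""
    failures = []
    for case in cases:
        try:
            actual = run_query(case.query)
        except Exception:
            # A query that errors out is a failure users would see too.
            failures.append(case.name)
            continue
        if actual != case.expected:
            failures.append(case.name)
    return failures


def verify_deployment(cases: Iterable[UseCase],
                      run_query: Callable[[str], Any],
                      rollback: Callable[[], None]) -> List[str]:
    """Run every use case after a deployment and roll back if any of them fail."""
    failures = failed_use_cases(cases, run_query)
    if failures:
        rollback()
    return failures
```

Returning the failing names, and not only a pass/fail flag, is what lets a failure be tied to a specific use case and, when the tests run after every commit, to a specific change.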
- Triage. We don't have an onerous development process. We keep tickets simple and the ticket count small. We do that in part by removing tickets that we don't intend to address in the near future.
By keeping the number of tickets under control, we stay focused on the right set of priorities. One of the reasons we are able to do this is that bug tickets are infrequent and get dealt with quickly. So most open tickets are really just feature requests, and therefore subject to prioritization.
Keeping things simple makes keeping things simple simpler.
- Production Bugs. If a rollback will fix a production problem, we roll back first and then talk about it. If a rollback will not fix the problem (i.e., the bug was introduced a while ago), we roll forward with a bug fix. If we have to do this, we try to be as surgical as possible. The goal is to get out of the critical situation and then do any deeper thinking.
Keeping main almost always release-worthy means that we can roll forward quickly and effectively without cherry-picks. It's just another release. And we can do releases safely and efficiently using an automated process. Also, because the bug stayed unnoticed for a while, it usually means that relatively few customers were impacted by it.
To me, the most interesting fact here is that we have had to do roll-forwards only once or twice in the last few years. It indicates that the rest of the processes are working well.
Keeping things simple makes it easier to make things resilient.
- Configurations. When we write code, we use configuration parameters rather than hard-coded parameter values. This is just good practice. However, when we do our deployments, we prefer to hard-code carefully chosen values over values that can be flexibly updated post-deployment. In a way, we treat configuration values like code. They are resolved when we build and deploy our containers. Changing a configuration value requires a release.
Configurations are one of the things that differ between main and production. For example, they determine where the logs go, which Kubernetes cluster to deploy to, which database to use, and so on.
Making configurations part of the code means that any of our releases can be recreated more faithfully. The configuration values come from the code and therefore are linked to the release tag. And changed configuration values can be seen by looking at git diff outputs.
This makes rollbacks and problem determination more reliable and much easier.
Make all changes visible, especially configuration changes.
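To illustrate the idea of configuration as code, here is a small sketch in which per-branch values live in source, are resolved once into a read-only mapping, and can be compared between two releases. Every key and value below (`log_sink`, `k8s_cluster`, `database`, and the names assigned to them) is a made-up placeholder, and `config_diff` only mimics in miniature what git diff already shows on the real files.

```python
from types import MappingProxyType

# Hypothetical per-branch values, checked into the repository with the code.
CONFIGS = {
    "main": {
        "log_sink": "logs-internal",
        "k8s_cluster": "cluster-internal",
        "database": "db-internal",
    },
    "production": {
        "log_sink": "logs-live",
        "k8s_cluster": "cluster-live",
        "database": "db-live",
    },
}


def resolve_config(branch):
    """Return a read-only copy of the configuration for a branch.

    The values are fixed at this point: changing one means changing the
    source above and cutting a new release. Unknown branches raise KeyError.
    """
    return MappingProxyType(dict(CONFIGS[branch]))


def config_diff(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Because the resolved mapping is read-only, any attempt to adjust a value after deployment fails loudly instead of silently drifting away from what the release tag says.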
- Secrets. Some configuration values can't be public: private keys and passwords, for example. Therefore, they can't be embedded in the code. We use a cloud service to store our secrets. We strongly favor configurations over secrets. If a value doesn't absolutely need to be secret, we do not make it one. This keeps the number of secrets small (fewer than a handful).
Don't have many secrets. They can't be made visible. They are hard to keep. Moreover, if they cause problems, those are hard to debug.
Summary
The ability to deliver continuously and get changes of all types (new features, configuration changes, bug fixes, and so on) into production and into the hands of our users safely and quickly is imperative for modern startups and established enterprises alike. At StepZen, we achieve this by ensuring our code is always in a production-ready state, even as our development team grows and makes changes on a daily basis. Our guiding principles of keeping things simple and making problems visible early serve us well and mean we have eliminated processes like "dev complete" or "code freeze." The major benefit is that by working on a weekly cadence we get features and product to market quickly, and we can get feedback from users throughout the delivery lifecycle based on working software.
Check us out at stepzen.com. We're always happy to hear your feedback on our Community Discord.