
One of our five foundations
Operations Foundation
What makes good Operations?
To answer that, first ask: what do the users of our services expect?
They expect the service to be there when they show up. If it’s not there, they expect that you already know, and you are working on it.
They expect features to appear regularly, and for those rollouts not to interrupt their usage (much). They expect bugs to get fixed in a reasonable time.
They expect things to work as expected, or at least as described. They expect some help. They expect their data to be secure. They expect you to tell them if that’s not true.
All this tells us that drama is bad, failures should be minimised, and predictability is king.
Achieving reliability, velocity and clarity
These things are choices, management choices, because they all cost money to achieve, and you will get as much of each as you pay for. But none of these things are just for sale. There’s no service called AWS Reliability Server. You have to design your development processes to achieve them.
Reliability is a choice. You decide how much time you put into flexible designs and then painstaking testing to achieve reliability. You’ll need a good test rig. Do you have one? You’ll need to spend time chasing down small, persistent bugs. Do you give your devs time for that?
Reliability needs a deep understanding of what your systems are doing. That means observability - instrumented code and dedicated effort to expose both technical and business metrics, in live and in test environments. It also needs good load simulators that exercise many code paths, at scale.
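To make "instrumented code" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recommended implementation: the in-memory `counters` and `timings` stores, the `instrumented` decorator and the `checkout` function are all hypothetical names invented for this example, and a real service would send these numbers to whatever monitoring system it already uses.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metric stores, for illustration only.
counters = defaultdict(int)
timings = defaultdict(list)


def instrumented(name):
    """Record call count, error count and duration for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            counters[f"{name}.calls"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                counters[f"{name}.errors"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is timed.
                timings[f"{name}.seconds"].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(basket_total_pence):
    """Hypothetical business operation."""
    if basket_total_pence <= 0:
        raise ValueError("empty basket")
    # Business metric (revenue taken), recorded next to the technical ones.
    counters["checkout.revenue_pence"] += basket_total_pence
    return "ok"
```

The point of the sketch is that technical metrics (calls, errors, latency) and business metrics (revenue) come out of the same code path, so the same instrumentation works in live and in test environments.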
Velocity is not gained by thrashing your devs to work longer hours and focus on features. That might get you through a single crunch, but the tech debt piling up will bleed your velocity before long. Do you have good, live-like test environments with good, live-like data in them? Hint: the second part is actually harder, and you may need to pass it as a requirement into your data design.
Velocity comes from finding bugs at the first possible opportunity, which means testing and CI/CD. It comes from limiting tech debt by just spending some time burning down the worst of it. It means decent documentation, because new developers will always need it.
Clarity is taking time to do the secondary things for your users. Do you write good user docs? Do you do videos showing how to do things? Do you get a UI specialist to figure out workflows and design the controls?
Clarity comes from the right thing to do being the easiest thing to do. It comes from contextualised help. It comes from clean design.
Behind the scenes in the Fantastic Website Corporation
A service that users love, rely on and build into their life is never that way by accident. You cannot leave release, support, problem management or testing to chance. Hope is not a strategy.
As a company moves from a small startup where yes, anything goes, up to a larger enterprise, the changes are not because The Money Men demanded enterpriseyness. The changes are because you need to be seen as dependable by your customers, or you won’t stay large.
All the effort on change management, incident control, release engineering, monitoring and alerting that big organisations engage in is not just stodgy ITIL thinking. These are the secondary activities that underpin good operations.
As a small company, you got where you are because you move fast, right? Well, the Wile E. Coyote stuff might be OK when you have 7 customers, but half a million customers will expect you not to run face-first into a painting of a tunnel. Even if the Shiny New Website Company somehow went through it.
So do you know what you need to do to improve release to the point where you are not scared of a rollback? How about data migrations? How about customer-side code?
Do you know how much observability is realistically enough? And how do you get it with 3rd-party software?
How do you design data models for good live-like test environments?
All of these questions, and more, are under our foundation of operations.
Drama belongs on TV
If your code releases are dramatic, they are under-engineered. If change management is dramatic, it’s probably not got all the right people doing all the right things. If incidents are dramatic, you… well, OK, some big incidents are dramatic, but small ones should be simple and easily learned from.
With our understanding of the abstract factors, we can create the right approach for you.
This is part of a series on all our foundations. Here are links to all the entries:
People
Development
Security
Operations (this post)
AI in a week
John Denholm OPERATIONS