Safety nets instead of guard rails - Dealing with errors in IT infrastructure platforms
There is a well-known saying: "To err is human". The message behind it is simple: as humans, we are not perfect and we make mistakes. Of course, this also applies to software and platform engineers like me. That is why our field has various methods to reduce the probability and impact of errors.
But let me define the term first. By errors I mean two different things in this context. First, what is generally referred to as a bug, i.e. code that does not do what it is supposed to (e.g. an incorrect calculation or a missing input check). Second, undesired actions, such as accidentally deleting entries in a database or rolling out a new code version to the production environment instead of a test environment. A distinction is also often made between errors and mistakes, the difference being that something counts as an error if I could not really have known the result beforehand. For my purposes, I use the two terms interchangeably.
Reducing the probability of errors
The best-known and most widely used method for reducing the probability of errors is testing. From unit testing to system testing, it is always about the same thing: checking units of a system, or the entire system, in an automated way to see whether they behave as expected. This allows us to catch bugs before an end user sees them or before they can cause damage. However, there are several problems here. Complete test coverage is difficult to achieve and, in most cases, impractically complex and expensive. In addition, tests are also written by humans and can therefore contain errors themselves.
Another common method for finding bugs at an early stage is the code review. In keeping with the motto "Four eyes see more than two", one of the purposes of a code review is to find possible bugs. Code reviews also help to validate the readability and comprehensibility of the code and to share knowledge.
Both testing and code reviews reduce the probability of errors, but cannot completely rule them out. And if we focus on platform engineering, writing tests is often far more complex than it is for software, because real infrastructure is harder to set up than a test or mock function. Testing is therefore less practical there and will uncover fewer errors.
So if we cannot lower the probability of errors enough, we have to at least minimize their impact.
Minimizing the impact of mistakes
In my opinion, the most important tool for minimizing the impact of mistakes when building infrastructure platforms is fast infrastructure-as-code (IaC) automation.
If we have described our infrastructure in a declarative way and can quickly bring it to a defined state automatically, we can easily roll back to a functioning state or roll out a fix. That is because we are not describing the path, but the destination. IaC tools such as Terraform or Kubernetes are then able to reach our desired target state from any (messed up) current state. It should not matter how we created the faulty state, be it because we added a bug to the declarative manifests or because we made a mistake during manual operations and configured something incorrectly or even deleted it.
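To illustrate "describing the destination, not the path", here is a minimal sketch of a reconciliation step. It is not how Terraform or Kubernetes are implemented. The resource model (a plain dictionary of names to configuration values) and the functions `plan` and `apply` are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical resource model: resource name -> configuration value.
State = Dict[str, str]


def plan(desired: State, current: State) -> List[Tuple[str, str]]:
    """Compute the actions needed to move `current` to `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: State, current: State) -> State:
    """Execute the plan on a copy of `current` and return the new state."""
    result = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del result[name]
        else:  # "create" and "update" both set the desired value
            result[name] = desired[name]
    return result


desired = {"firewall": "allow-443", "web": "v2"}
# A messed-up state: wrong firewall rule, missing web service, leftover VM.
messed_up = {"firewall": "allow-all", "debug-vm": "tmp"}

print(plan(desired, messed_up))
print(apply(desired, messed_up) == desired)
```

The point of the sketch is that `apply` never needs to know how the faulty state came about. It only compares the current state with the target and closes the gap.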
If we are dealing with stateless infrastructure such as containers, or configuration such as firewall rules, this rollback is quick and easy. However, other components such as databases have a state that cannot be rolled back or restored so easily. For me, therefore, a cleanly automated infrastructure also includes automated backups and, more importantly and often forgotten, an automated (and simple) way to restore these backups.
The tech content creator and YouTuber Theo Browne has described the principle that, for me, sits behind this thinking in his videos as "safety nets instead of guard rails", although he aimed it primarily at software development.
For me, the guard rail is the unit test and the code review. In many cases it prevents me from plunging into the depths. But should I slip and go over it, I will still fall. This is where the safety net comes in. It will not prevent me from slipping and falling, but it will catch me so that I don't fall too far and can climb back up without injury. For me, that is the automation and the backup. Neither will prevent me from making a mistake, but both minimize the impact, because they allow me to correct my mistake easily.
An example
To make this a bit more practical, I want to describe a situation that happened to me in a customer project a few years ago.
First, a bit of context: it was a larger IoT project with several thousand devices out in the field that sent millions of messages per day. The team I was part of was developing and operating a data processing and analysis pipeline that ultimately stored transformed and calculated data in several storage systems. One of them was Elasticsearch, so that the data could be used for graphical analysis with the integrated dashboarding tool Kibana.
As part of a refactoring, I was renaming indices (the Elasticsearch equivalent of tables) and transforming their data to a newer structure. Back then, that was only possible by copying the data into a new index with the intended name and then deleting the old one. That was when the mishap happened: instead of deleting an older, already transformed index, I deleted the index for the current day along with all its data.
Could a "guard rail" have prevented the mistake? I could have properly automated the procedure beforehand and tested it thoroughly. Then the mistaken deletion would likely have been caught. But doing that would have cost far more time and effort than was appropriate for a rather small change to non-critical data that was not even visible to end users.
But thanks to our "safety net", nothing much happened anyway. Of course we had automated external backups of the data in Elasticsearch. But for performance and cost reasons, these were only taken nightly, so they were of no help for the data of the current day. However, when designing the architecture we had assumed that errors would happen (be they technical or human), so we had built in protection mechanisms. All data coming into the platform was persisted in raw and unprocessed form, as quickly as possible, into object storage (AWS S3), with Kafka as a buffer in between for batching. This happened in parallel to the data going through our processing pipeline.
That enabled us to read old raw data from S3 and send it through our processing pipeline again. To that end we had extended the pipeline, which was a series of Apache Spark streams (connected with Kafka), with a reprocessing mode. Using this mode we could easily reprocess data from a given time range. The data could either be read from S3, or if it was still available there, directly from Kafka.
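The reprocessing idea can be sketched roughly as follows. This is an in-memory illustration, not our actual Spark and Kafka code. The names `RawMessage`, `select_source` and `reprocess` are hypothetical, and plain lists stand in for the Kafka topic and the S3 archive. The sketch also assumes that the archive is complete for the requested time range.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RawMessage:
    timestamp: int  # seconds since epoch (assumed unit)
    payload: str


def select_source(start: int, oldest_in_kafka: Optional[int]) -> str:
    """Prefer Kafka, but only if it still retains data from `start` on."""
    if oldest_in_kafka is not None and start >= oldest_in_kafka:
        return "kafka"
    return "s3"


def reprocess(
    start: int,
    end: int,
    kafka_log: List[RawMessage],
    s3_archive: List[RawMessage],
    transform: Callable[[RawMessage], str],
) -> List[str]:
    """Re-run `transform` on all raw messages with start <= timestamp < end."""
    oldest = min((m.timestamp for m in kafka_log), default=None)
    source = kafka_log if select_source(start, oldest) == "kafka" else s3_archive
    return [transform(m) for m in source if start <= m.timestamp < end]


# The archive holds everything, the Kafka buffer only the recent messages.
archive = [RawMessage(100, "a"), RawMessage(200, "b"), RawMessage(300, "c")]
kafka = [RawMessage(200, "b"), RawMessage(300, "c")]


def to_upper(message: RawMessage) -> str:
    return message.payload.upper()


print(reprocess(200, 400, kafka, archive, to_upper))  # served from Kafka
print(reprocess(100, 250, kafka, archive, to_upper))  # falls back to S3
```

The choice of source is deliberately conservative here: if the buffer might no longer cover the start of the range, the sketch reads everything from the archive instead of mixing both sources.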
Thanks to this mechanism I was able to quickly reprocess the data from the current day that I had accidentally deleted. Apart from a scare for me and some temporarily failed dashboards, nothing much happened in the end.
This situation illustrates nicely what I mean by safety nets: We assume that mistakes happen and plan and implement mechanisms to quickly and easily deal with such mistakes and correct them. In my situation these mechanisms were the raw data backups and the prepared tools to reprocess the data.
When the impact is expensive
The way of thinking I have described in the last few paragraphs rests on an implicit assumption: mistakes are not severe (in terms of the damage they cause) and can be easily corrected. In the Industrial IoT and smart factory projects I work in, this assumption mostly holds true. Use cases on these platforms often deal with optimizing production, or with digitizing and visualizing it in the first place. If we lose data or an application goes down, this does not have major financial consequences.
But there are software areas where the situation looks very different. A bug in a control unit in a vehicle (especially if it has autonomous functions) can in the worst case kill someone. The same goes for medical devices. Or to stay in the smart factory context, control units for production lines can quickly cause millions in damages if products are faulty, production has to be stopped or a machine is damaged and has to be repaired.
In all of these cases we need other approaches and we have to rule out and prevent errors from happening as far as humanly possible. Thorough unit tests are often not enough and we move into fields such as formal verification. That is very expensive and time-consuming, but still better and cheaper than the potential damage.
A mixed bag
The consequences of mistakes fall somewhere between these two extremes (no cost vs. death). Most projects will be on the side of minor consequences. In reality we will always have to strike a balance between errors we catch beforehand and mistakes we let happen and correct afterwards. It is a mixed bag.
All project stakeholders should (ideally before a project starts) have a clear understanding that mistakes happen. They should consider together what the impact of mistakes is and what the costs are. Based on that they can define the effort that is sensible for preventing mistakes in the first place and what is cheaper to correct afterwards. In the end it is a question of cost and benefit.
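As a back-of-the-envelope illustration of such a consideration, the following sketch compares the expected yearly cost of three options. All numbers are invented for the example and the function `expected_yearly_cost` is hypothetical. A real assessment would need estimates from the actual project.

```python
def expected_yearly_cost(
    measure_cost: float,
    incidents_per_year: float,
    cost_per_incident: float,
) -> float:
    """Cost of the measure plus the expected damage that remains."""
    return measure_cost + incidents_per_year * cost_per_incident


# Invented numbers: four incidents per year at 10,000 each if we do nothing.
do_nothing = expected_yearly_cost(0, 4, 10_000)

# A guard rail (e.g. more tests) costs 15,000 and prevents three incidents.
guard_rail = expected_yearly_cost(15_000, 1, 10_000)

# A safety net (e.g. fast restore) costs 8,000 and does not prevent any
# incident, but reduces the damage per incident to 1,000.
safety_net = expected_yearly_cost(8_000, 4, 1_000)

print(do_nothing, guard_rail, safety_net)
```

With these made-up figures the safety net comes out cheapest, but different estimates can easily tip the result towards the guard rail, which is exactly why the question should be discussed explicitly.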
In my experience these considerations are rarely made explicitly, but they still implicitly influence our decisions and approaches. It is a good idea to revisit this balance from time to time and to define where your project stands.
Summary
Adding "safety nets" to our infrastructure platforms is, in my opinion, a far more practical way to protect against mistakes than building ever higher "guard rails". Chasing that mythical complete test coverage only costs money and the nerves of developers. Trying to keep developers and operators from making mistakes on production systems by constraining them with tight and formal security processes will only prevent them from doing their jobs efficiently and will lead them to develop unsafe workarounds. And a peer review is a good idea and will catch many errors, but never all of them.
So in the end, mistakes will happen, regardless of our efforts. We can only strive to reduce their probability. And when mistakes happen, we will be glad to have a safety net that catches us.