Why I use GitOps

I've used the term GitOps in several of my previous articles, but have never really explained what it means to me. Just recently I again experienced the benefits a well-engineered GitOps setup can bring. For my employer MaibornWolff I have been designing and building a Smart Factory GitOps Kubernetes platform for a customer over the last several months. A few weeks ago we finished the initial production-ready release and, as the last step, wanted to upgrade our Kubernetes clusters to the latest version. Thanks to GitOps this was literally a matter of a two-line Git commit and some waiting time. No complex hand-holding, no manual steps. Just a Git commit and drinking tea (or coffee or whatever your beverage of choice is).

Inspired by this experience I want to take the chance to explain what I understand by GitOps, why I use it and what clear advantages I see.

What is GitOps

There are several slightly different definitions of GitOps floating around. For me, it is a philosophy and concept for managing infrastructure and software deployments. It makes version control repositories (and Git is the default version control system nowadays) the central and single place for infrastructure management and operations. A repository contains a complete and declarative description of a system (infrastructure and applications). Changes to the system can only be made through Git. In theory, I could completely delete my infrastructure and recreate it from the description stored in Git (excluding data stored in databases and other stateful services).

You might have already heard of Infrastructure as Code (IaC), which is very similar in that infrastructure is described in a declarative way using code or, more abstractly, in a formal language that can be understood and processed by an automated system. But GitOps takes IaC a step further. Not only must the state be described as code in a declarative manner, it must also be applied automatically by a GitOps operator. Such an operator is triggered whenever changes are made in Git, and its job is to make sure the real system conforms to the description in Git.
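To make the idea of a declarative description more concrete, here is a minimal Python sketch. It assumes the declared and the real state can each be written as a simple mapping from component name to version. The function name `plan_actions` and all component names and versions are made up for illustration and do not come from any specific tool.

```python
def plan_actions(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare the declared state with the real one and list the steps needed."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(f"install {name} {version}")
        elif actual[name] != version:
            actions.append(f"upgrade {name} {actual[name]} -> {version}")
    # Anything running that is no longer declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"remove {name}")
    return actions


# Declared in Git (hypothetical component names and versions).
desired = {"ingress": "1.9.0", "monitoring": "2.4.1"}
# What is currently running in the real system.
actual = {"ingress": "1.8.2", "legacy-tool": "0.3.0"}

for action in plan_actions(desired, actual):
    print(action)
```

The point of the sketch is that the description in Git only says what should exist. Working out the individual steps is left to the automation.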

GitOps also takes advantage of features provided by Git Forges (like GitHub or GitLab). Changes are made in feature branches and get merged via Pull/Merge Requests (PR/MR) which enforce code reviews and approvals.

In summary, the two core components of GitOps are: a declarative description of a system in Git, and an automated deployment and management of that system.

CI/CD with Pipelines

Depending on what definition of GitOps you read, a Continuous Integration and Deployment (CI/CD) system using pipelines (e.g. GitHub Actions or GitLab CI) satisfies the GitOps principles.

You can describe your infrastructure using, for example, OpenTofu (an open source fork of Terraform) and have a pipeline run on every push to the Git repository, every merge or every newly created Git tag. The pipeline can then apply the OpenTofu/Terraform code to ensure the real infrastructure matches what is described in Git. Using the plan feature of OpenTofu/Terraform, a pipeline can also help with code reviews of feature branches by showing what would change in the infrastructure.

This is an easy system to implement. Most major Git Forges already contain integrated pipeline features, so you don't need to deal with setting up a separate system. And tools like Terraform have official integrations or recipes for the pipeline systems, so they can be easily configured. A pipeline is also a very flexible and generic system, as you can run whatever tools or scripts you want. If Terraform does not support a feature you need, you can just write your own script to implement it. For example, in a project a few years ago we were configuring an Azure IoT Hub, but some configuration options regarding device management were not exposed as Terraform resources. So we wrote a little script that took care of configuring these. It was just a few lines of code and done within an hour.

But in my opinion, such pipeline setups are a diluted form of GitOps. To explain why, we first need to talk about GitOps using operators.

Using Operators

An operator is a piece of software that automates the deployment and management of a system. It reads an intended state for the system and tries to adapt the real system to that intended state. For GitOps, operators read the declarative definition of the system from Git. An operator runs either outside the system it manages or inside it (for example, Kubernetes operators normally run as pods in Kubernetes itself). It also works independently of events in Git, meaning it will do its management work even if there are no changes. This is called a continuous reconcile loop, as the operator continuously compares the intended state with the actual state and corrects any difference.
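The reconcile loop can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular operator is implemented: states are plain dictionaries, "fixing" the system just means copying the declared state, and the names `reconcile` and `run_operator` are invented for this example.

```python
import time


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> bool:
    """Bring `actual` in line with `desired`. Returns True if anything changed."""
    if actual == desired:
        return False
    actual.clear()
    actual.update(desired)
    return True


def run_operator(read_desired, actual, interval_seconds=0.0, max_rounds=3):
    """Reconcile on a timer, whether or not anything changed in Git.

    A real operator loops forever; `max_rounds` only keeps this sketch finite.
    """
    corrections = 0
    for _ in range(max_rounds):
        if reconcile(read_desired(), actual):
            corrections += 1
        time.sleep(interval_seconds)
    return corrections


declared = {"database": "running", "api": "running"}
live = {"database": "stopped", "api": "running"}  # simulated failure

print(run_operator(lambda: declared, live), live)
```

Note that nothing in `run_operator` waits for a Git event. The loop re-reads the declared state every round and acts only when the live state differs from it.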

Pipelines are not true GitOps

Taking only the first two sentences of the previous paragraph, a pipeline would also fit the bill. But in contrast to an operator, a pipeline only runs when something happens in Git: it is event-triggered. And for me, this is the big difference. An operator has the continuous reconcile loop, so it runs all the time (or at least very regularly), while a pipeline only runs on changes in Git.

Let's assume my system has a failure or other fault (for example a database or service going down). An operator running all the time will very quickly detect that problem and move to correct it by trying to get the system back into its intended state. The same applies if I make a change in my system manually, bypassing the GitOps process. The operator will detect and revert my change. This behavior of GitOps operators forces me to make changes via Git, otherwise they will be overwritten immediately.
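The difference can be shown with a tiny simulation. The sketch below is purely illustrative and rests on strong simplifications: time is split into ticks, the live system is either "in sync" or "drifted", and a deployment run always succeeds. The function `simulate` and the event names are made up for this example.

```python
def simulate(events: list[str], continuous: bool) -> list[str]:
    """Replay a timeline and report the live state after each tick.

    events: one entry per tick: "commit" (a change lands in Git),
            "drift" (someone changes the live system by hand) or "idle".
    continuous: True models an operator that reconciles on every tick,
                False models a pipeline that only runs on "commit".
    """
    live = "in sync"
    timeline = []
    for event in events:
        if event == "drift":
            live = "drifted"
        if continuous or event == "commit":
            live = "in sync"  # a deployment run restores the declared state
        timeline.append(live)
    return timeline


events = ["commit", "idle", "drift", "idle", "idle", "commit"]
print("pipeline:", simulate(events, continuous=False))
print("operator:", simulate(events, continuous=True))
```

In the pipeline variant the manual change survives until the next commit happens to trigger a run. In the operator variant it is corrected within the same tick.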

And this behavior of continuous reconciliation is why pipelines for me are only a diluted form of GitOps. It is better than nothing, but not true GitOps.

Advantages of GitOps

Using GitOps with operators instead of conventional deployment automation (often built on pipelines) has many advantages.

  • Continuous reconcile loop: As already described, operators compare the intended state to the real state all the time, so they can detect failures or manual changes and get the system back into its intended state. This removes many manual operations processes for recovering failed services, since recovery is just one of the actions an automated operator can take to get a system back to its declared state. Ideally, a human only has to intervene in the rare case that the automatic correction fails.
  • If you have many systems running (for example platforms and clusters in multiple regions or for multiple teams), operators normally sit close to or inside their system, with a separate operator instance for each installation. There is no longer a central pipeline that has to deal with all systems, but a fleet of independent operators that can all run in parallel, which makes deployments at scale much easier and faster.
  • Operators also employ a pull principle, meaning they pull in changes from Git instead of being triggered. This leads to a decoupling between the central version control system and the deployment system, making both easier to handle.
  • The pull principle and operators being close to the system they manage also make the life of security and firewall teams easier. Instead of every installation having to accept inbound connections from a central pipeline system, the operators in your systems only need outbound access to your Git Forge, making them easier to secure. And your central Git is normally already accessed by the entire company, so giving a few more systems access to it is normally no problem. This is especially true as modern Git tools like GitLab or GitHub not only allow access via the established and secure SSH protocol, but also provide integrated mechanisms to limit such access to specific repositories and to read-only access.
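The pull principle and the fleet of independent operators from the list above can be sketched as follows. This is a toy model under my own assumptions: `GitForge` and `PullOperator` are hypothetical classes, the forge is an in-memory list instead of a real Git server, and the revisions and versions are placeholders.

```python
class GitForge:
    """Stand-in for a Git server; it never contacts the clusters itself."""

    def __init__(self):
        self.commits = []  # list of (revision, declared_state)

    def push(self, revision: str, state: dict) -> None:
        self.commits.append((revision, state))

    def head(self):
        return self.commits[-1]


class PullOperator:
    """Runs next to its own system and polls the forge (outbound only)."""

    def __init__(self, name: str, forge: GitForge):
        self.name = name
        self.forge = forge
        self.applied_revision = None
        self.live_state = {}

    def poll(self) -> bool:
        revision, state = self.forge.head()
        if revision == self.applied_revision:
            return False  # nothing new to pull
        self.live_state = dict(state)
        self.applied_revision = revision
        return True


forge = GitForge()
forge.push("a1b2c3", {"kubernetes": "1.29"})
operators = [PullOperator(f"factory-{i}", forge) for i in range(3)]

forge.push("d4e5f6", {"kubernetes": "1.30"})
for op in operators:  # each operator pulls independently of the others
    print(op.name, op.poll(), op.live_state)
```

The forge holds no list of operators and never calls them. Adding another installation only means starting another operator that can reach the forge.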

Aside from using operators with GitOps, using Git itself has additional advantages:

  • Git acts as a single source of truth, meaning I only need to look at one central system to get a complete picture of my infrastructure, to know what is deployed and where. And since operators are taking care of deployments, I can be sure reality also matches what is declared in Git.
  • I get a complete history of my infrastructure and any changes just by looking at the Git history. There is no need for special logging or change detection, I can just look at all the commits.
  • Rollbacks are easy: I just revert the problematic commit or make a new commit with the older state/version I want to go back to. Since GitOps is all about matching an actual system state to a declared state, an operator can just use normal deployment mechanisms and does not need special rollback logic (with the exception of stateful systems like databases).
  • Git being the single source of truth also means Git contains the complete state of the system. In the event of a disaster, I can therefore recover the system using only the state in Git. Of course, stateful things like data in a database are an exception that requires additional effort with backups and restores, but all the stateless parts of my system are covered.
  • Git Forges nowadays have many integrated mechanisms for ensuring a standard of quality. They can enforce peer reviews (as a form of the four-eyes principle) on Pull/Merge Requests, and they can limit who can push or merge changes. They can also use pipelines to validate proposed changes. This can range from simple syntax checks through static analysis to automatic validation in ephemeral sandbox environments. And operators don't need to know or deal with any of this. They can just rely on the Git system to handle it and concentrate on their own specific management logic.
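The history and rollback points from the list above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each commit stores the full declared state rather than a diff, only the most recent commit can be reverted, and the names `Commit`, `revert_last` and `declared_state` as well as the application versions are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    message: str
    state: dict


def revert_last(history: list[Commit]) -> None:
    """Undo the latest commit by appending a new commit with the prior state.

    History is only ever appended to, so the faulty change stays visible.
    """
    if len(history) < 2:
        raise ValueError("nothing to revert to")
    bad, previous = history[-1], history[-2]
    history.append(Commit(f'Revert "{bad.message}"', dict(previous.state)))


def declared_state(history: list[Commit]) -> dict:
    """What an operator would reconcile towards: the state at the newest commit."""
    return history[-1].state


history = [
    Commit("Deploy api 1.4.0", {"api": "1.4.0"}),
    Commit("Upgrade api to 1.5.0", {"api": "1.5.0"}),
]
revert_last(history)

for commit in history:
    print(commit.message, commit.state)
```

From the operator's point of view the rollback is just another change of the declared state, so it needs no code path of its own.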

We get all of these advantages basically for free, or at least with very little effort, just by using Git. With other deployment mechanisms we would have to implement all of them on our own or rely on separate specialized tools.

GitOps in my projects

The previous paragraphs have all been rather abstract and theoretical. To make things more tangible, I want to describe the GitOps setup my colleagues and I currently use in our projects at MaibornWolff.

I primarily build Kubernetes-based platforms for customers with a focus on Industrial IoT and Smart Factory use cases. The platforms often have a multi-cluster architecture, with clusters in each factory and central clusters in the cloud. The on-premise factory clusters host machine connectivity solutions like Cybus Connectware or HiveMQ Edge, as well as any workloads that are production-critical or latency-sensitive and therefore would not run reliably over a potentially unstable cloud connection. Data produced by the machines is sent on to the central cloud cluster that hosts most of the workloads (data processing, visualization, functional services).

The basis for all our setups is FluxCD acting as the GitOps operator. We chose FluxCD because it is lightweight and flexible and very easy to install and use. But it also scales nicely for bigger and more complex scenarios.

FluxCD is built on a few primitives. First we have sources, which can be Git repositories, Helm repositories, S3-compatible buckets or OCI registries. Then we have two controllers: one for Helm, to install and manage Helm charts, and one for Kustomize, to manage plain Kubernetes manifests or manifests customized using Kustomize. And since we are on Kubernetes, all of them are of course defined as Kubernetes manifests, using Custom Resource Definitions provided by FluxCD.

In our projects we always have a central GitOps repository for the platform that defines the entire setup. The repository contains manifests for all the applications and tools that should be part of the platform, all written as Helm charts and FluxCD Kustomizations. For preexisting applications, like the kube-prometheus stack for monitoring, we just pull in the official Helm chart from their registry and add the specific configuration for our setup. Other tools are custom-built, and for those we write our own Helm charts or Kustomize templates.

Besides the shared application manifests, the repository contains definitions for each cluster that is part of the platform. These definitions reference the applications and enrich or template them with cluster-specific configuration (for example resource scaling or domain names). Each cluster definition describes the complete state of the cluster, meaning all applications that should be running on that cluster with their specific configuration.
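The relationship between shared application definitions and cluster definitions can be sketched in Python. To be clear, this is not how FluxCD, Helm or Kustomize work internally, and it is not the actual repository layout. It only models the idea of "shared defaults plus cluster-specific overrides", and all application names, cluster names and values are hypothetical.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values from `override` win over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared application definitions (hypothetical names and values).
applications = {
    "monitoring": {"replicas": 1, "ingress": {"enabled": True, "host": ""}},
    "connectivity": {"replicas": 1},
}

# Per-cluster definitions: which applications run there, plus overrides.
clusters = {
    "cloud-central": {
        "monitoring": {"replicas": 3, "ingress": {"host": "monitoring.example.com"}},
    },
    "factory-01": {
        "monitoring": {},
        "connectivity": {"replicas": 2},
    },
}


def render_cluster(name: str) -> dict:
    """Build the complete declared state for one cluster."""
    return {
        app: deep_merge(applications[app], overrides)
        for app, overrides in clusters[name].items()
    }


print(render_cluster("cloud-central"))
```

Each cluster entry lists every application that should run there, so the rendered result is the complete declared state of that cluster and not just a set of changes.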

At its core this is, in my opinion and once you have actually worked with it, a very simple construct that can be extended to support use cases like multiple teams per cluster.

Going into detail on the entire setup would go beyond the scope of this post, so I will leave that for a future article.

One of the advantages of this structure is that it can be applied to widely different setups. A customer is still getting started with the whole Smart Factory topic and only has a central cluster in the cloud? We can put together a simple variant of the setup. Another customer aims to have several hundred mini clusters in their factories? The setup can accommodate that. It also doesn't matter what kind of Kubernetes setup the customer has or wants to use: Azure AKS, AWS EKS, managed Kubernetes from Giant Swarm or a simple setup with K3s. All of them are possible, because the GitOps structure does not care about that.

The same structure also works if we manage not only the applications deployed into clusters but also the clusters themselves with GitOps. Solutions like Giant Swarm employ Cluster API to provide a declarative GitOps-style interface for deploying clusters on different cloud and on-premise infrastructure providers (from AWS and Azure to VMware vSphere and bare metal). With the setup we use, managing the clusters themselves is just another layer of the same structure.

Many people use ArgoCD instead of FluxCD for their GitOps setup. Both satisfy the GitOps principles and will allow you to build working platforms. But in my opinion, ArgoCD is more complex to use and moves away from the focus on doing everything via Git. It is still a great tool and has better UI capabilities than FluxCD, but FluxCD gives me the pure core GitOps approach.

Conclusions

I have been using forms of GitOps for years, and variants of the structure I described above in a number of projects. Conceptually, GitOps is more complex and requires more initial setup effort than writing some scripts and slapping a pipeline on them, but for me the advantages clearly outweigh that effort.

It forces me to use a declarative description of my setup. This makes changes easier to make and the entire setup more resilient to problems. I only need to think about the intended target state of my system, and can leave the complexity of how to get there to an automated GitOps operator. This also makes things like upgrading applications a no-brainer because, when done right, it is no more effort than changing a few version numbers in manifests and committing them in Git.

GitOps setups are also scalable, and if done right, improve security by reducing the potential attack surface. And using Git gives me many benefits we already covered, from an automatic history to approval workflows.

Every project I have done in recent years has used a GitOps component, and I would not want to do a project without it anymore.