FluxCD - Who watches your GitOps deployment
In my last blog post I described the GitOps setups I build with FluxCD in my job at MaibornWolff. The basic idea is simple: you make all your changes in Git, and FluxCD runs a continuous reconcile loop that watches for those changes and applies them. But how do you know the changes were actually applied and that there weren't any problems? The question is who watches the watcher, or more concretely, who watches FluxCD. In today's post I want to explain ways to keep an eye on your FluxCD setup.
Keep watch
There is always the manual option. After committing your changes, start your favorite Kubernetes UI (for example k9s) and check the status of the GitRepository, Kustomization and HelmRelease resources. If you are, like me, more of a command line purist, running watch kubectl get gitrepo,kustomization,helmrelease will also do the trick. Another option is Capacitor, a separate UI for FluxCD.
This approach is fine if you are actively testing things anyway (e.g. on a local or sandbox cluster), but in big multi-cluster setups you don't want to babysit every change for every cluster. FluxCD also reconciles the state continuously. Even if there are no changes in Git, it still compares the state in Git with the real state in the cluster and brings the cluster back in line if needed. This protects against manual (unintended) changes by humans or other tools. But it also means something can go wrong with the FluxCD loop at any time, even when you are not actively watching. So relying only on manual observation is not a good idea, and we need an automated option.
Status Notification
One of the components of FluxCD is the Notification Controller. It supports inbound and outbound events. Inbound events via Receivers can be used to actively trigger FluxCD to pull new changes, e.g. by having GitHub call a webhook on every Git push.
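As a rough illustration of such an inbound webhook, a Receiver for GitHub push events could look like the following minimal sketch. The resource names (github-receiver, webhook-token, flux-system) are placeholders chosen for this example, not taken from a real setup.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver          # hypothetical name
  namespace: flux-system
spec:
  type: github
  events:
    - "ping"
    - "push"
  secretRef:
    name: webhook-token          # assumed Secret with a "token" key, shared with the GitHub webhook
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: flux-system          # assumed name of the GitRepository to refresh
```

The controller writes the generated webhook path into the status of the Receiver. That path, reachable from GitHub through whatever ingress you put in front of the Notification Controller, is what you would register as the webhook URL.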
But the interesting part is the Provider API. With it we can tell FluxCD to send notifications to external systems whenever something happens during reconciliation.
The linked docs page has an example of how to have FluxCD send a message to Slack in case of an error with a Kustomization. The same is possible with Microsoft Teams and other chat apps (Discord, Matrix). Or you can pipe the event into an alerting system like PagerDuty or OpsGenie and handle it there.
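A minimal sketch of such a Slack setup might look like this. The channel name and the Secret name are assumptions made for the example, and the Secret is assumed to hold the Slack webhook URL under the key address.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: gitops-alerts         # hypothetical channel name
  secretRef:
    name: slack-webhook-url      # assumed Secret with an "address" key containing the webhook URL
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: kustomization-errors
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error           # only forward error events, not informational ones
  eventSources:
    - kind: Kustomization
      name: "*"                  # all Kustomizations in this namespace
```

Restricting eventSeverity to error keeps routine "reconciliation succeeded" messages out of the channel.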
Another option is to let FluxCD set the Git commit status in your forge. The docs have examples and explanations for GitHub, GitLab, and others. That way you get the little green or red marker for the commit in the forge UI, just as if you had run a CI/CD pipeline using GitHub Actions or GitLab CI. Depending on your notification settings you will even get an email with the status, but the status check is mostly useful as a quick visual confirmation that your commit was seen and applied by FluxCD.
As an example, with the following manifests you can have FluxCD set the commit status in GitLab:
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: gitlab-status
  namespace: default
spec:
  type: gitlab
  address: https://gitlab.com/1234 # URL of your project, use id of your repo instead of path
  secretRef:
    name: gitlab-notification-token # Secret that contains a PAT
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: gitlab-status
  namespace: default
spec:
  providerRef:
    name: gitlab-status
  eventSources:
    - kind: Kustomization
      name: "root" # Name of your entrypoint kustomization to not be flooded in notifications
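The Provider above references a Secret holding the access token. A minimal sketch of that Secret, assuming the token is stored under the key token and using a placeholder value, could look like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-notification-token
  namespace: default
stringData:
  token: "<your-gitlab-access-token>"   # placeholder; the token needs permission to set commit statuses
```

In a real GitOps repository you would not commit this in plain text but manage it with whatever secret handling you already use for the cluster.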
Using the Notification Controller to get status updates has a big drawback: if your GitOps setup is even slightly complex, you will have Kustomizations nested inside other Kustomizations alongside dependsOn relationships. This leads to reconcile loops temporarily running into timeouts or failing while waiting for other resources. In itself this is no problem: during the next reconcile cycle FluxCD will try again and most likely eventually succeed. But with notifications enabled, FluxCD sends an event the first time a resource times out or fails. This means you will get a lot of false positives, noise that has already cleared by the time you can check the cluster.
In my experience, the Git commit status is a nice feature, but you should not send alerts directly to chat unless you have verified that your setup is not prone to producing unneeded noise.
Prometheus Monitoring
The solution I use and recommend is metrics-based monitoring and alerting using Prometheus. For any production system, monitoring and alerting are a must anyway, so FluxCD can and should be just one more component to keep track of.
The various FluxCD controllers expose Prometheus-compatible metrics on port 8080 at the standard /metrics path. If you use the kube-prometheus-stack you will want to use a ServiceMonitor manifest to scrape FluxCD. Unfortunately the metrics port is not part of the Service manifests included in the standard FluxCD deployment, so you will either need to patch those Services or deploy separate ones. A separate Service should look like this (for the kustomize-controller):
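apiVersion: v1
kind: Service
metadata:
  name: kustomize-controller-metrics
  namespace: flux-system
  labels:
    app: kustomize-controller # Label on the Service itself, needed so a ServiceMonitor can select it
spec:
  ports:
    - name: http-metrics
      port: 8080
      targetPort: 8080
  selector:
    app: kustomize-controller # Selects the controller Pods
  type: ClusterIP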
With this you can create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kustomize-controller
  namespace: flux-system
spec:
  selector:
    matchLabels:
      app: kustomize-controller
  endpoints:
    - port: http-metrics
      interval: 30s
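If you would rather not create extra Services, a PodMonitor that scrapes the controller Pods directly is a possible alternative. The following is only a sketch: it assumes the metrics container port is named http-prom, as in the standard FluxCD manifests, and the resource name flux-controllers is made up for this example.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-controllers         # hypothetical name
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom            # assumed name of the metrics port on the controller containers
      interval: 30s
```

Depending on how your Prometheus instance is configured, the monitor may additionally need a label that matches its monitor selector.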
The FluxCD docs describe the available metrics. For a basic alerting setup I recommend the following alert expressions:
GitRepository not ready: gotk_reconcile_condition{kind="GitRepository",status="False"} > 0
Kustomization not successful: gotk_reconcile_condition{kind="Kustomization",status="False"} > 0
HelmRelease not successful: gotk_reconcile_condition{kind="HelmRelease",status="False"} > 0
The key to alerting on FluxCD reconcile status is a long enough pending time (the for duration of a Prometheus alerting rule). As mentioned above, in any complex setup with nested resources that also have dependencies, FluxCD will invariably run into temporary failures, or resources have to wait a few minutes for their dependencies to reconcile. This shows up in the metrics as unsuccessful resources, the very thing you want to alert on. So you need a pending time long enough that any normal deployment back-and-forth will have sorted itself out, but not so long that you are notified too late when there is a real problem. I have found a good pending time to be 10 to 15 minutes. If a deployment does not succeed in that time frame, there is very likely a problem you should investigate.
With just these three rules, you get a surprising amount of insight into your cluster. If an updated Pod crashes during startup or stays in Pending status due to missing resources or volumes, the associated HelmRelease or Kustomization will not become ready, and that will trigger an alert. Of course this does not cover a Pod crashing after the deployment has finished. For these situations you still need alerting separate from FluxCD resources.
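Put together, the expressions and the pending time could be packaged as a PrometheusRule for the kube-prometheus-stack. This is a sketch under assumptions: the alert names, the severity label, and the release label (which your Prometheus rule selector may or may not require) are illustrative and need to be adapted to your installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-reconcile-alerts            # hypothetical name
  namespace: flux-system
  labels:
    release: kube-prometheus-stack       # assumption: label your Prometheus uses to select rules
spec:
  groups:
    - name: flux
      rules:
        - alert: FluxGitRepositoryNotReady
          expr: gotk_reconcile_condition{kind="GitRepository",status="False"} > 0
          for: 15m                       # pending time, gives temporary failures time to clear
          labels:
            severity: warning
          annotations:
            summary: "GitRepository {{ $labels.name }} has not been ready for 15 minutes"
        - alert: FluxKustomizationNotSuccessful
          expr: gotk_reconcile_condition{kind="Kustomization",status="False"} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Kustomization {{ $labels.name }} has not been successful for 15 minutes"
        - alert: FluxHelmReleaseNotSuccessful
          expr: gotk_reconcile_condition{kind="HelmRelease",status="False"} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HelmRelease {{ $labels.name }} has not been successful for 15 minutes"
```

Because of the for clause, an alert only fires if the expression stays true for the whole 15 minutes, so a resource that fails once and succeeds on the next reconcile never pages anyone.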
Conclusion
Any system that is declared productive or production-ready must have proper automated monitoring and alerting. If your customer or end user needs to tell you that your system is down, you did something wrong.
If a GitOps approach is core to your system, then of course your GitOps tool needs to be covered by your monitoring so you get paged if it has problems. Active manual watching is good for testing, but not for large-scale automated rollouts. Also, don't assume that a change will roll out successfully in your production environment just because it was tested and rolled out successfully in your development cluster. Small differences in configuration or even external factors can be the difference between a successful and a stalled or failed rollout. If you have tens of clusters, manually monitoring a rollout is impractical, so as with everything GitOps, automation is key.