Adopting Kubernetes introduces new complications around verifying and validating the manifests that describe your application. There are several tools for checking the syntax of manifests, scanning them for security issues, enforcing policies, and so on.
But at the most basic level, one of the major challenges is to actually understand what each change means for your application (and optionally approve or reject the pull request that contains that change).
This challenge was already present even outside GitOps, but it has become even more important for teams that use GitOps tooling (such as Argo CD) for their Kubernetes deployments.
The problem
Any major Git platform has built-in support for showing diffs between the proposed change and the current code when a Pull Request is created. In theory, the presented diff should be enough for a human to understand what the changes contain and how they will affect the target environment.
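In practice, however, many teams have adopted a templating tool (such as Kustomize or Helm) that is responsible for rendering the actual Kubernetes manifests for a target cluster. The pull request diff then shows changes to the template sources, not to the manifests that will actually be applied.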
As a quick example let’s say that you need to review a Pull Request with the following changes:
This seems simple enough. You assume that this change will increase the number of replicas to 20 (let’s say it is Black Friday and you want to increase the capacity of your web store ASAP). You merge the pull request and … nothing happens.
What you didn’t know is that there is a downstream Kustomize overlay that also defines replicas on its own. So the proposed change has no effect at all. The problem is that the pull request shows only a fragment of a Kustomize source manifest and not a diff of the end result (the full rendered manifest).
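To make the failure mode concrete, here is a toy Python model of a base manifest with a downstream overlay. It is not how Kustomize is implemented, and the replica counts other than 20 are hypothetical values chosen for illustration. It only shows that a change in the base can disappear once the overlay is applied.

```python
import copy


def merge(base: dict, patch: dict) -> dict:
    """Recursively layer `patch` on top of `base` and return a new dict."""
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical base manifest before and after the pull request.
base_before = {"kind": "Deployment", "spec": {"replicas": 10}}
base_after = {"kind": "Deployment", "spec": {"replicas": 20}}

# Hypothetical downstream overlay that defines replicas on its own.
production_overlay = {"spec": {"replicas": 5}}

rendered_before = merge(base_before, production_overlay)
rendered_after = merge(base_after, production_overlay)

# The source diff shows 10 -> 20, but the rendered manifest is unchanged.
print(rendered_before == rendered_after)  # True
```

A diff of the two rendered results would be empty, which is exactly the information the reviewer was missing.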
The problem is even more apparent when your organization is using Helm. Let’s say that you need to approve a pull request with the following changes:
As a human, it is very difficult to understand what exactly is happening here. You need to run the templates in your head and decide if this change is correct or not. Wouldn’t it be nice if the diff had the actual manifest that is created from this chart?
Essentially the diff functionality found in your Git system is not enough when it comes to complex Kubernetes applications.
Using the built-in Diff functionality in the Argo CD GUI
One of the main benefits of using the Argo CD UI during a deployment is the built-in diff feature. When a resource is “out-of-sync” (i.e. it differs from what is in git) Argo CD will mark it with a special color/icon. In the following example, somebody has changed the service resource of an application:
You can then click on the service and see the diff.
The big advantage here is that Argo CD has already integrated support for Kustomize and Helm. The diff you will see is on the final rendered manifests which is exactly what you want as you can preview changes in their full context.
Unfortunately, this method also has several disadvantages.
The first one is that Argo CD shows diffs only for applications where auto-sync (and self-heal) is disabled. This means that you lose the main benefit of GitOps. The proper way to follow GitOps is to have auto-sync (and self-heal) enabled, because this guarantees the basic premise that the cluster and the Git repository contain the same thing, that is, that the desired state and the actual state have not diverged.
The second problem, from a Continuous Delivery perspective, is that the diff is shown only after the changes have been committed and pushed, which is too late for any serious review. Ideally you want to review changes as early as possible. A Pull Request allows you to add comments, discuss the changes with your team, and reject the change altogether without affecting a production system.
Using the built-in diff functionality in the Argo CD GUI is great for validating a change and doing a last sanity check just before production. But it should not be the main review milestone of a manifest change. And ideally you should set up all your applications to sync automatically, in which case this diff is not available in the first place.
Using the local Diff feature of the Argo CD CLI
We have seen that the built-in diff in the Argo CD UI appears very late in the delivery process. Can we use the same diff approach earlier in the life of a change?
It turns out that the Argo CD CLI also comes with a diff command. This diff command takes a “--local” parameter that allows you to compare what is happening in the cluster against ANY local files, which don’t have to be pushed (or even committed at all). It will also automatically run your favorite template tool as it is defined in the Argo CD application.
Here is how it looks
This approach is very promising as you could in theory use it inside a CI system with the following process:
- Open a pull request with the suggested changes
- Have your CI system checkout the pull request
- Run “argocd app diff --local” inside a CI pipeline against the cluster that the pull request targets (this again uses the built-in Kustomize/Helm support within Argo CD)
- Present the diff to the user so they can decide what to do with the pull request
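The CI step from this list could be sketched in Python roughly as follows. The application name and path are placeholders, the CLI is assumed to be installed and already logged in, and the exit code mapping reflects my understanding of the documented behavior of “argocd app diff” (0 for no diff, 1 for a diff, 2 for errors). Flag behavior may differ between Argo CD versions, so treat this as a starting point only.

```python
import subprocess


def build_diff_command(app_name: str, local_path: str) -> list[str]:
    """Build the CLI call that compares live state with local files."""
    return ["argocd", "app", "diff", app_name, "--local", local_path]


def interpret_exit_code(code: int) -> str:
    """Map the exit code of `argocd app diff` to a CI outcome."""
    if code == 0:
        return "no-diff"
    if code == 1:
        return "diff-found"
    return "error"


def run_diff(app_name: str, local_path: str) -> tuple[str, str]:
    """Run the diff. Needs network access and credentials for Argo CD."""
    completed = subprocess.run(
        build_diff_command(app_name, local_path),
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means "diff found", not a failure
    )
    return interpret_exit_code(completed.returncode), completed.stdout
```

Note that `run_diff` is the part that requires cluster-side credentials in the CI system.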
This sounds great in theory but in practice has several shortcomings.
The most obvious one is that you need to provide your CI pipeline with credentials that access the cluster where Argo CD is installed. This forfeits one of the main benefits of GitOps – the pull mechanism where the credentials stay within the cluster.
An even bigger concern however is what happens when you have multiple clusters. Which cluster should you pick to compare against? What if the chosen cluster has CRDs or other resources that are custom to it?
This process can also become very complex with remote or secure cluster instances. For example if you have an Argo CD cluster in Asia and your CI system is running in the US, connectivity between the two might be very slow or even impossible.
In summary, “argocd app diff --local” is great for local experimentation and quick ad hoc checks, but for a production deployment process there is a better way to achieve the same result (spoiler: it doesn’t involve the Argo CD CLI at all, nor does it need cluster access).
Pre-rendering manifests in a second Git repository
Let’s take a step back. We have been looking for ways to show an enhanced diff as part of a pull request and ignore the existing diff that is already provided by the Git provider (as we have seen that this doesn’t work with the final manifest).
There is a way however to enhance the built-in diff and make it work on the final manifests.
The solution is to use two Git repositories for each application/cluster. One Git repository has the manifests in their unprocessed form (e.g. as Kustomize overlays) as before. A second Git repository holds the final rendered manifests, and Argo CD is pointed at the latter.
Here is how it would look:
This process should be familiar to you if you have ever used a preprocessor or code generator. Essentially there is an automated process (can be the CI system or something else) that does the following:
- A human creates a pull request on “Source” Git repo with the suggested change
- A “copy” process takes the contents of the pull request and applies the respective template tool (i.e. Helm/Kustomize) to create the final rendered manifest
- A second pull request is opened automatically on the “Rendered” Git repo with the contents of the manifests
- A human sees the diff of the second pull request and this time the diff is between rendered manifests and not snippets/segments.
- If the Pull Request is approved it is merged on both Git repositories. Thus the second repository always has the rendered manifests
- Argo CD monitors the second repository and applies the changes (the integrated support for Helm and Kustomize within Argo CD itself is not used at all; Argo CD only syncs raw manifests)
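The “copy” process in the steps above could look roughly like the following Python sketch. The folder layout (one sub-folder per environment, one `manifests.yaml` per environment in the rendered repo) and the function names are assumptions made for illustration, and `kustomize` is assumed to be on the PATH. Committing and opening the second pull request are left out.

```python
import subprocess
from pathlib import Path
from typing import Callable


def kustomize_build(overlay_dir: Path) -> str:
    """Render one overlay with the `kustomize build` CLI."""
    return subprocess.run(
        ["kustomize", "build", str(overlay_dir)],
        capture_output=True,
        text=True,
        check=True,
    ).stdout


def render_environments(
    source_envs: Path,
    rendered_root: Path,
    render: Callable[[Path], str] = kustomize_build,
) -> list[Path]:
    """Render every environment folder under `source_envs`.

    Writes <rendered_root>/<env>/manifests.yaml and returns the files written.
    """
    written = []
    for env_dir in sorted(p for p in source_envs.iterdir() if p.is_dir()):
        target = rendered_root / env_dir.name / "manifests.yaml"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(env_dir))
        written.append(target)
    return written
```

The renderer is passed in as a parameter so that a Helm-based function could be swapped in without touching the copy logic.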
This is a very valid process and I have seen it used in several companies with success.
The big advantage of course is that the diff you get in the Git provider provides you with the full information about what will change in the application AFTER all manifests are processed:
Here is the same example with the Helm chart, but this time we are using a second Git repository that has the rendered manifest stored.
However, I am personally against this process, as it complicates things a lot and increases the number of moving parts.
- It doubles the number of repositories for any given application (or branches if multiple branches are used)
- It introduces another point of failure which is the copy process that converts source YAML to final manifests
- It completely bypasses the effort put in Argo CD to process manifests on its own.
- It might be confusing for people who now have 2 Git repositories to work with and opens the possibility of mistakes in both ends (either committing stuff on the “source” repo that never makes it to “rendered” repo or vice versa).
In general I think that this is an overkill solution for a problem that can be solved more elegantly as we will see later in the article. Still, if you follow this approach and it works for you, make sure that you have safeguards and monitoring in place (especially for the copy/commit automated process).
Intermission: Preview Terraform plans
You might think that previewing the full manifests for a pull request is a new problem that Kubernetes introduced. It isn’t. There have been several tools before Kubernetes that had to deal with the exact same issue and it would make sense to look at what they do.
The most obvious candidate to examine is Terraform. If you are not familiar with Terraform, it is a declarative tool that allows you to define your infrastructure in an HCL file and then “apply” your changes.
Terraform users have a very similar problem. A pull request that contains Terraform changes (especially in big projects) is not immediately clear for a human to understand (unless you are an expert at running Terraform in your head).
To solve this issue, Terraform has a “plan” command which reads the changes, decides what it will do and prints a nice summary of all the proposed changes without actually doing anything.
The plan functionality is crucial to Terraform teams as it removes the guesswork about what Terraform will do when the changes are applied.
With this summary at hand, the next step is obvious. We can simply attach the output of the plan command to the pull request. Humans can now look at both the diff of the HCL files and the plan summary and decide if the change is valid or not.
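As a small illustration, a CI job could call the plan command along these lines. The plan file name is a placeholder and the helper names are made up for this sketch. The exit codes are those of the `-detailed-exitcode` flag of `terraform plan`.

```python
def build_plan_command(plan_file: str = "tfplan") -> list[str]:
    """Build a non-interactive `terraform plan` call suitable for CI."""
    return [
        "terraform",
        "plan",
        "-input=false",
        "-no-color",
        "-detailed-exitcode",
        f"-out={plan_file}",
    ]


def describe_plan_result(exit_code: int) -> str:
    """Translate the exit code produced with -detailed-exitcode."""
    # 0 = succeeded with no changes, 1 = error, 2 = succeeded with changes
    outcomes = {0: "no changes", 1: "error", 2: "changes present"}
    return outcomes.get(exit_code, "unknown")
```

The captured standard output of that command is the summary that would be attached to the pull request.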
This workflow is so common that an open source project, Atlantis (https://www.runatlantis.io/), does exactly that:
- You make your changes on the Terraform files
- You create a pull request
- Atlantis runs the “plan” command and attaches the result to the PR
- You can then approve/comment the PR as a human
- Atlantis then runs the “apply” command to actually modify your infrastructure
- Atlantis locks the workspace until the PR is merged, preventing a second PR from overriding the changes before the first PR is merged.
This workflow is very effective for end-users but has several security drawbacks, which are similar to those of the Argo CD CLI diff approach:
- You need bidirectional communication between your Git provider and the server that runs Atlantis
- The server that runs Atlantis will run Terraform on its own and thus it needs all credentials that Terraform has in your organization.
- The server that runs Atlantis also needs to have access to your remote Terraform state. Essentially Atlantis has the keys to your kingdom.
Another downside of Atlantis is that it’s Pull Request based, whereas proprietary Terraform CD tools on the market feature a dashboard where every project / workspace can be browsed (similar to how you have a dashboard in Argo CD where you can see your Applications and whether or not they are synced).
In theory we could create a similar system for “argocd app diff --local”, but given the security implications, there is a much better approach that is helped by the GitOps principles.
Still, the basic idea of attaching a diff in a pull request for greater context offers several advantages for the end user that cannot be overstated.
Render Kubernetes manifests on the fly
One of the most important principles of GitOps is that at any given time the cluster state is the same as what is described in Git. We really like how Atlantis works for Terraform, but there is a way to improve the workflow and make it more secure and more robust by taking advantage of the Argo CD guarantee for GitOps.
The Terraform CLI that runs on the Atlantis instance needs credentials to your infrastructure because it must both read the Terraform state and also create the actual infrastructure once changes are “applied”.
With Argo CD we can completely bypass this limitation because we already have the cluster state right there. It is stored in the target branch of the Pull Request!
This means that we don’t need any credentials either to the cluster or to Argo CD. We can simply run a diff between the files of the Pull Request and the branch it is targeted at. The extra addition here is that we will also run a preprocessing step for the template solution (Helm or Kustomize) in order to get the full manifest.
So the full process is as follows:
- Somebody opens a Pull request to the manifest repo
- We check out the code of the pull request and run Kustomize, Helm or another templating tool in order to get the final rendered manifests of what is changed
- We check out the code of the branch that is targeted by the Pull Request (e.g. main), find the same environment and again run the same templating tool to get the final manifest
- We run a diff between the final manifests from the two previous steps
- We show the diff to the human operator that will decide if the pull request will be merged or not
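The steps above can be sketched in Python as shown below. The sketch assumes that the CI system has already checked out the pull request and its target branch into two separate folders, that the environment lives in the same relative path in both, and that `kustomize` is on the PATH. The function names and the "target/" and "pull-request/" labels are made up for illustration. A Helm-based repository would call `helm template` in `render` instead.

```python
import difflib
import subprocess
from pathlib import Path


def render(checkout: Path, env_path: str) -> str:
    """Render one environment of a checkout with `kustomize build`."""
    return subprocess.run(
        ["kustomize", "build", str(checkout / env_path)],
        capture_output=True,
        text=True,
        check=True,
    ).stdout


def manifest_diff(base_yaml: str, pr_yaml: str, env_path: str) -> str:
    """Return a unified diff of two rendered manifests ('' if identical)."""
    lines = difflib.unified_diff(
        base_yaml.splitlines(keepends=True),
        pr_yaml.splitlines(keepends=True),
        fromfile=f"target/{env_path}",
        tofile=f"pull-request/{env_path}",
    )
    return "".join(lines)


def preview(base_checkout: Path, pr_checkout: Path, env_path: str) -> str:
    """Render the same environment from both checkouts and diff the results."""
    return manifest_diff(
        render(base_checkout, env_path),
        render(pr_checkout, env_path),
        env_path,
    )
```

Nothing in this sketch talks to the cluster or to Argo CD. The only inputs are two Git checkouts.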
The beauty of this approach is that unlike Atlantis, we never access the Argo CD cluster. All information is coming from Git (an advantage of using GitOps). This means that your Argo CD cluster could be in China with a very slow connection (or even an isolated connection) and your CI server doesn’t need to know anything about it. In fact the location and security access of your Argo CD server is now irrelevant as we don’t interact with it in any way.
Notice in the diagram above that unlike Atlantis, our CI server has a direct connection only to the Git repository. The Argo CD instance still cares only about the Git repository it monitors.
This makes our approach much more secure as the credentials of the cluster still stay within the cluster itself and we only interact with the Git repository.
One thing to notice here is that, unlike Atlantis or the “argocd app diff” command, we are not comparing the desired state in Git to the actual state (acquired from the cloud provider’s API or the Kubernetes API). We are comparing two versions of the desired state stored in different branches of a Git repository. While this is a good enough approximation, it is not 100% equivalent to the “argocd app diff” approach.
A corner case is Helm Capabilities: built-in variables that Helm populates by querying the Kubernetes cluster for its API versions and available resources. Some Helm templates use this information to render the correct resource versions for a specific cluster’s version and available CRDs. This information has to be supplied manually to the “helm template” command to achieve parity with “argocd app diff”.
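One way to supply that information is through the `--kube-version` and `--api-versions` flags of `helm template`. The Python helper below only assembles the command line. The release name, chart path, values file and version strings you pass in are your own, and the helper name is made up for this sketch.

```python
def build_helm_template_command(
    release: str,
    chart_dir: str,
    values_file: str,
    kube_version: str,
    api_versions: list[str],
) -> list[str]:
    """Assemble a `helm template` call that pins cluster capabilities."""
    cmd = [
        "helm", "template", release, chart_dir,
        "--values", values_file,
        "--kube-version", kube_version,
    ]
    # One flag per extra API version that the target cluster offers (e.g. CRDs).
    for api_version in api_versions:
        cmd += ["--api-versions", api_version]
    return cmd


print(" ".join(build_helm_template_command(
    "my-app", "./chart", "values-prod.yaml", "1.29.0",
    ["monitoring.coreos.com/v1"],
)))
```

The version values have to be kept in sync with the real cluster by hand, which is the price of not querying it.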
Attaching the full manifest diff to a pull request
The icing on the cake is that we will also attach the full manifest diff to the Pull Request (as Atlantis does). This is how it would look:
We now need to explain to users that the automatic diff of the pull request is not what they should look at anymore because it only has part of the story (see the problem description at the beginning of this article). Instead, they should look at our attached diff and get the full picture of what has changed and make decisions accordingly.
The attached diff is especially important for people who use Helm, as you can see a diff between plain YAML instead of trying to run Go templates in your head.
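How the diff gets attached depends on your Git provider, so the Python sketch below covers only the provider-independent part: turning one diff per environment into a comment body. It assumes the provider renders Markdown in comments, and the function name and heading text are made up for illustration. Posting the body through the provider's API is left to your CI system.

```python
FENCE = "`" * 3  # built indirectly so the literal fence never appears in this source


def format_diff_comment(diffs: dict[str, str]) -> str:
    """Build a Markdown comment body with one section per environment."""
    parts = ["### Rendered manifest diff"]
    for env, diff in sorted(diffs.items()):
        if not diff:
            parts.append(f"**{env}**: no changes in rendered manifests")
            continue
        parts.append(f"**{env}**\n\n{FENCE}diff\n{diff.rstrip()}\n{FENCE}")
    return "\n\n".join(parts)


print(format_diff_comment({
    "qa": "",
    "prod": "-  replicas: 10\n+  replicas: 20\n",
}))
```

Listing environments with an empty diff explicitly is a deliberate choice: reviewers can see at a glance which environments are untouched.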
Enforcing changes during environment promotion
If you follow this diff approach where the full manifests changes are shown in the Pull Request it will be much easier for you and your team to collaborate on GitOps changes as everybody will have the full context of each incoming change.
A secondary benefit of this diff approach, however, is knowing what has NOT changed.
In my previous article about promotions between GitOps environments using folders, a lot of people asked how you can guarantee that extracting a common setting from downstream Kustomize overlays and promoting it to your base overlay can be safely executed in a single step.
I was really puzzled by this query, until I realized that most people that were asking this question looked at the simpler diff of the pull request and thus lacked the full context of the change.
Let’s take an example. You have two environments qa and staging with the following settings:
UI_THEME=light
CACHE_SIZE=2048kb
SORTING=ascending
N_BUCKETS=42
You want to add a new setting called PAGE_LIMIT=25 and promote it gradually, first to qa and then to staging.
You modify/commit the qa environment:
UI_THEME=light
CACHE_SIZE=2048kb
PAGE_LIMIT=25
SORTING=ascending
N_BUCKETS=42
The deployment goes well and you make the same change to the staging environment. It works fine there as well.
Now you decide that this new setting should be the same across both environments, so you move it to the parent overlay (which is common to all non-prod environments).
So the actions you take are:
- Delete the setting from the QA environment
- Delete the setting from the Staging environment
- Add the setting into the parent overlay that both environments depend on
- Commit/push all the above in a single step
A lot of people were concerned about this process and asked how you can enforce that the whole process will work without affecting the existing environments.
We can finally answer this question by simply looking at the enhanced diff of the above commit:
That’s right. All diffs are completely empty. Even though there are changes in the individual Kustomize files, the end result (the rendered manifests) is EXACTLY the same.
This means that if you approve this Pull Request, Argo CD will do absolutely nothing and you are certain that all environments will be oblivious to this refactoring.
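The same reasoning can be checked with a toy Python model of the refactoring. Layering dictionaries is only a stand-in for real Kustomize rendering, and the contents of the parent overlay before the change are assumed to be empty for simplicity.

```python
def effective_settings(parent: dict, overlay: dict) -> dict:
    """Toy stand-in for rendering: overlay settings are layered on the parent."""
    return {**parent, **overlay}


common = {
    "UI_THEME": "light",
    "CACHE_SIZE": "2048kb",
    "SORTING": "ascending",
    "N_BUCKETS": "42",
}

# Before the refactoring: PAGE_LIMIT is defined in each environment.
before = {
    "parent": {},
    "qa": {**common, "PAGE_LIMIT": "25"},
    "staging": {**common, "PAGE_LIMIT": "25"},
}

# After the refactoring: PAGE_LIMIT lives only in the parent overlay.
after = {
    "parent": {"PAGE_LIMIT": "25"},
    "qa": dict(common),
    "staging": dict(common),
}


def rendered(layout: dict) -> dict:
    """Compute the effective settings of every environment in a layout."""
    return {
        env: effective_settings(layout["parent"], layout[env])
        for env in ("qa", "staging")
    }


changed = [env for env in ("qa", "staging") if rendered(before)[env] != rendered(after)[env]]
print(changed)  # []
```

Three source files changed, yet the list of environments whose rendered output differs is empty.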
Of course the basic diff of the pull request is not that smart and shows the textual changes in the individual files.
So in this scenario we have the extreme case where the built-in diff of the pull request doesn’t have the full context of what is going on because it doesn’t understand the full manifests.
Conclusion
Previewing changes before applying them is a pillar of modern software automation, but in the case of Kubernetes applications this is not always straightforward because the manifests are templated.
In this article we have seen several ways of previewing the changes in Argo CD applications:
- Basic diff of the Git platform (not recommended)
- Native diff of the Argo CD UI
- Diff local files with the Argo CD CLI
- Pre-rendering manifests in a second Git repository
- Rendering manifests on the fly for each Pull request (recommended)
We hope that this process is helpful for you and your team, especially when it is combined with static analysis, syntax validation, security scans and other sanity checks that run against your Kubernetes manifests.
Happy diffing!
Photo by Arno Senoner on Unsplash