There are many articles and videos about practicing Continuous Delivery (CD) with applications, but not nearly as many for infrastructure. The same can be said for GitOps applied to infrastructure. That is a bit strange given that applications and infrastructure are almost the same today. Both are defined as code, and everyone stores code in Git repositories. Hence, GitOps is just as good of a fit for infrastructure as for anything else. Since, today, infrastructure is defined as code, there is no reason not to use “good” coding practices. So, making pull requests, validating suggested changes, and applying continuous delivery processes makes just as much sense for infrastructure as for applications.
Combining CD with GitOps
With that in mind, we’ll explore how to combine Continuous Delivery (CD) with GitOps and apply the processes on infrastructure. We’ll use Terraform as a way to define and apply infrastructure resources and codefresh.io to run CD pipelines. We’ll use Azure as our playground. Specifically, we’ll create, modify, and destroy an Azure Kubernetes Service (AKS) cluster.
If you prefer a different cloud provider, please visit the AWS or Google Cloud versions of the article.
If you are new to Codefresh, please Create A Free Account. You’ll get unlimited builds for life for free.
That wasn’t much of an intro. I am probably supposed to explain the benefits of GitOps and Continuous Delivery. I should probably say at least a few words why I chose Terraform, Codefresh, and Azure. But I will not do that, at least not here. I will assume that you have at least a high-level understanding of those concepts, processes, and tools. Instead, we’ll dive straight into practical examples on setting up a Continuous Delivery (CD) process backed by GitOps, and focused on managing your infrastructure.
If you’re interested in GitOps, I encourage you to watch What Is GitOps And Why Do We Want It? on YouTube. As for continuous delivery, I already wrote a lot on that subject, but I’m too lazy to dig through past posts in TechnologyConversations.com, so I’ll leave you to search through it. Similarly, I will not dive into the reasons why I chose Terraform. I already did that in The DevOps Toolkit: Catalog, Patterns, And Blueprints. Get that course or purchase the book. It’ll be just enough to buy a Red Bull for yet another sleepless night, during which I’ll continue adding more material to it. I’ll be using codefresh.io simply because it’s awesome. Finally, I tend to use AWS, Azure, and Google Cloud equally, so I’ll probably cover the other two later. This is focused on Azure.
How about that? That was probably the shortest introduction into something that aims at combining two processes and, at least, three different tools. I should probably get an award for skipping the theory and jumping straight into the how-to part.
Let’s see whether we can set up and validate everything in 30 minutes or less.
All the commands are available in the 01-03-terraform-aks-cf.sh Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
Getting The Code
I already created a Git repository with Terraform and Codefresh pipeline definitions to skip the tedious part of writing .tf
and .yaml
files. All we have to do as a start is open the repo in a browser.
If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Installing Windows Subsystem For Linux (WSL) YouTube video. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
open https://github.com/vfarcic/cf-terraform-aks
If the open
command does not work on your operating system, replace it with echo
, copy the output, and paste it in your favorite browser. In this particular case, it might be easier to just copy the address, but, as you will see later, some of the other examples will generate the addresses dynamically. So, get used to echo
, at least when going through my examples.
Next, you will need to fork the repo. We’ll soon make some changes to the code, and you wouldn’t be able to push them to my repo. So, it needs to be yours.
If you do not know how to fork a GitHub repo, the only thing I can say is “shame on you”. Google how to do that. I will not spend time explaining that.
Next, we’ll clone the newly forked repository.
Please replace [...]
with your GitHub organization in the command that follows. If you forked the repo into your personal account, then the organization is your GitHub username.
# Replace `[...]` with the GitHub organization
export GH_ORG=[...]
git clone https://github.com/$GH_ORG/cf-terraform-aks
cd cf-terraform-aks
Next, we need to copy the Terraform files and Codefresh pipeline from the orig
directory into the repository root. That might sound strange, but there is a good reason behind it. I might be experimenting with that repo. The files in the root might be configured with my info. To avoid any potential issues, I stored the “golden” version of the files inside that directory.
cp orig/*.tf .
cp orig/codefresh.yml .
Now that we have all the definitions, we should switch our focus towards setting up Azure pre-requisites.
Setting Up An Azure Resource Group
Before we proceed, we need to prepare our Azure account. To be more specific, we’ll create a resource group and a service principal that will allow Terraform to authenticate in Azure.
As you hopefully know, almost everything in Azure is organized inside resource groups. So, we’ll need to create one. To do that, first, we need to ensure that you are logged in. We’ll need az
CLI for that. If you do not already have it, go to the https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest section of the documentation and follow the instructions for your operating system.
While we are in the “installing” mood, please set up jq as well. We’ll use it to parse JSON outputs.
Remember that if you are using Windows Subsystem For Linux (WSL), you should follow the Linux instructions for installing CLIs.
The first thing we’ll do is authenticate.
az login
Next, we’ll create a resource group called devops-catalog-aks
.
az group create \
    --name devops-catalog-aks \
    --location eastus
Finally, we’ll create a service principal. To do that, we need to retrieve your subscription ID. It is available through the account list
command.
az account list
You might have one or more accounts. Search for the id
field in the account you want to use, copy the value, and paste it instead of [...]
in the command that follows.
export ARM_SUBSCRIPTION_ID=[...]
To be on the safe side, we’ll set the selected subscription to be the currently active one.
az account set -s $ARM_SUBSCRIPTION_ID
Now we are finally ready to create a service principal that will be used by Terraform to authenticate.
export SERVICE_PRINCIPAL=$(
    az ad sp create-for-rbac \
    --role="Contributor" \
    --scopes="/subscriptions/$ARM_SUBSCRIPTION_ID")
We created the service principal and stored the output inside the SERVICE_PRINCIPAL
variable. Let’s take a look at what that output is.
echo $SERVICE_PRINCIPAL
The output is as follows.
{
"appId": "fd4426b0-3304-440d-aa7f-5497bdf79376",
"displayName": "azure-cli-2020-09-09-19-57-55",
"name": "http://azure-cli-2020-09-09-19-57-55",
"password": "V1g6vcVfMajPRGNolpR0mMt-kbt8KfRA9-",
"tenant": "e82964cf-adc2-4086-b7c2-bfbc1601d491"
}
For Terraform to use that service principal, it needs to have those fields stored in environment variables with specific names. That’s where jq
comes in. We’ll use it to extract tenant
, appId
, and password
.
export ARM_TENANT_ID=$(
echo $SERVICE_PRINCIPAL |
jq ".tenant")
export ARM_CLIENT_ID=$(
echo $SERVICE_PRINCIPAL |
jq ".appId")
export ARM_CLIENT_SECRET=$(
echo $SERVICE_PRINCIPAL |
jq ".password")
Finally, we’ll store all those variables in a source file so that you do not need to go through the same process ever again.
echo "export ARM_SUBSCRIPTION_ID=$ARM_SUBSCRIPTION_ID
export ARM_TENANT_ID=$ARM_TENANT_ID
export ARM_CLIENT_ID=$ARM_CLIENT_ID
export ARM_CLIENT_SECRET=$ARM_CLIENT_SECRET
" | tee creds
The next time you need those variables, all you’ll have to do is execute source creds
.
source creds
Do not worry about the possibility of accidentally pushing that file to the Git repo. That could indeed be disastrous. However, I already added /creds
to .gitignore
, so you can feel safe.
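Before running Terraform in a fresh terminal session, it can help to confirm that all four variables are actually set. The following is a minimal sketch of such a check; `missing_arm_vars` is a hypothetical helper name, not part of the Azure or Terraform CLIs.

```shell
# Sketch: list which of the variables Terraform expects are still unset or empty.
# `missing_arm_vars` is a hypothetical helper, not part of any CLI.
missing_arm_vars() {
  missing=""
  for name in ARM_SUBSCRIPTION_ID ARM_TENANT_ID ARM_CLIENT_ID ARM_CLIENT_SECRET; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Print the missing names separated by spaces, or an empty line if none are missing.
  printf '%s\n' "${missing# }"
}
```

If `missing_arm_vars` prints any names, running `source creds` should fix that, assuming the file was created as described above.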
Everything we did so far is what everyone would do in Azure, no matter the specific needs. You probably knew all that already.
Now we can move to the part that is specific to Terraform.
Preparing Terraform Definitions
Now that we set up the pre-requisite resources in Azure, we can turn our attention towards Terraform.
Please note that I will not go into detail on how Terraform works since this is focused on converting Terraform definitions into CD pipelines. So, we’ll focus only on the relevant parts within the context of Continuous Delivery. If you’re interested in more details about Terraform, please consult The DevOps Toolkit: Catalog, Patterns, And Blueprints.
The most critical and often overlooked part of Terraform is state storage. When used by a single person, and executed manually from a laptop, state storage is not that important. Or, to be more precise, it is not critical to store it anywhere other than the local file system, potentially with a backup somewhere else. But, when working in a team, and, especially, when automating Terraform through pipelines, the location of storage becomes much more important.
I just realized that I “jumped the gun”, so let me backtrack a bit.
Terraform must store state about your managed infrastructure and configuration. That state is used by Terraform to map real-world resources to your configuration, to keep track of metadata, and to improve performance for large infrastructures. By default, the state is stored in a local file named terraform.tfstate
.
In other words, without knowing the state of the resources, Terraform cannot do its job. Every time we execute terraform apply
, it would think it is the first time we’re doing that. Many people do not even realize that right away. As long as Terraform is executed from the same directory, the state is there, and its purpose might not be obvious. But, what happens if we run it from a pipeline?
When executing Terraform commands from a pipeline, we cannot make the assumption that the state from previous executions will be present in subsequent builds. When running pipelines from VMs, we cannot be sure that it will always be the same machine. It might, or it might not. It’s like a lottery. But, when running pipelines from containers, the state will inevitably be gone forever as soon as the container running a pipeline is shut down. We could fix that by mounting persistent storage to each container running pipelines, but that would create other problems like, for example, “race conditions”. Fortunately, Terraform has a solution baked in, and we’ll explore it soon.
The critical thing to understand is that we will need to have a storage bucket. We’ll see, later on, how to use it. For now, let’s focus on creating the bucket. In Azure, that means the creation of a storage account and storage container.
az storage account create \
    --name devopscatalog \
    --resource-group devops-catalog-aks \
    --location eastus \
    --sku Standard_LRS
az storage container create \
    --name devopscatalog \
    --resource-group devops-catalog-aks \
    --account-name devopscatalog \
    --public-access blob
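One thing to keep in mind is that storage account names are shared across all of Azure, so devopscatalog might already be taken by someone else. If you pick a different name, it must be 3 to 24 characters long and consist only of lowercase letters and digits, and the storage_account_name in main.tf needs to match it. The sketch below checks only the format; `valid_storage_name` is a hypothetical helper, and it cannot tell whether the name is still available (the `az storage account check-name` command should be able to answer that).

```shell
# Sketch: check a candidate name against the storage account naming rules
# (3 to 24 characters, lowercase letters and digits only).
# `valid_storage_name` is a hypothetical helper; it does NOT check availability.
valid_storage_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Example usage with the name from this article.
if valid_storage_name "devopscatalog"; then
  echo "devopscatalog has a valid format"
fi
```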
Now that we have the bucket, let’s take a look at the Terraform definitions I prepared.
We’ll start with the variables.
cat variables.tf
The output is as follows.
variable "region" {
type = string
default = "eastus"
}
variable "resource_group" {
type = string
default = "devops-catalog-aks"
}
variable "cluster_name" {
type = string
default = "docatalog"
}
variable "dns_prefix" {
type = string
default = "docatalog"
}
variable "k8s_version" {
type = string
default = "CHANGE_VERSION"
}
variable "min_node_count" {
type = number
default = 3
}
variable "max_node_count" {
type = number
default = 9
}
variable "machine_type" {
type = string
default = "Standard_D2_v2"
}
variable "destroy" {
type = bool
default = false
}
You can probably guess what each of those variables means from their names. What matters, for now, is that the default value of k8s_version
is set to CHANGE_VERSION
. I could not define the AKS version in advance since they are changing all the time, and whatever is the valid one at the time of this writing might not be correct when you’re working on the exercises. I could have skipped defining the explicit version altogether, but that would be against one of the most essential principles. We always need to be explicit with the versions we’re using. So, the only alternative left was to force us to change that value to the valid AKS version.
So, let’s discover the AKS version we will use. We can do that easily through the az
CLI.
az aks get-versions --location eastus
The output, limited to the last entries, is as follows.
{
"id": "/subscriptions/7f9f9b08-7d00-43c9-9d30-f10bb79e9a61/providers/Microsoft.ContainerService/locations/eastus/orchestrators",
"name": "default",
"orchestrators": [
...
{
"default": null,
"isPreview": null,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.17.9",
"upgrades": [
{
"isPreview": true,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.18.4"
},
{
"isPreview": true,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.18.6"
}
]
},
{
"default": null,
"isPreview": true,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.18.4",
"upgrades": [
{
"isPreview": true,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.18.6"
}
]
},
{
"default": null,
"isPreview": true,
"orchestratorType": "Kubernetes",
"orchestratorVersion": "1.18.6",
"upgrades": null
}
],
"type": "Microsoft.ContainerService/locations/orchestrators"
}
Please choose any of the orchestratorVersion
(e.g., 1.17.9
). I prefer using the latest stable. That would be the newest that does not have isPreview
set to true
. Copy the selected version, and paste it instead of [...]
in the command that follows.
# Replace `[...]` with any of the `orchestratorVersion`
export VERSION=[...]
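If you would rather not pick the version by eye, the rule described above (the newest version that is not a preview) can be sketched in a few lines of shell. The input format is an assumption made for illustration: one line per version, with the version followed by its isPreview value, which you would need to extract from the JSON yourself (for example, with jq). `latest_stable` is a hypothetical helper name.

```shell
# Sketch: pick the newest non-preview version.
# Assumed input (hypothetical format): one "VERSION IS_PREVIEW" pair per line on stdin,
# e.g. "1.17.9 null" or "1.18.4 true".
latest_stable() {
  # Keep only the versions whose second column is not "true"...
  awk '$2 != "true" { print $1 }' |
    # ...sort them numerically by major, minor, and patch...
    sort -t. -k1,1n -k2,2n -k3,3n |
    # ...and keep the last (highest) one.
    tail -n 1
}
```

With the versions from the output above, `printf '1.17.9 null\n1.18.4 true\n1.18.6 true\n' | latest_stable` prints 1.17.9.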
Next, we’ll use sed
magic to replace the aforementioned placeholder with the value stored in the environment variable.
cat variables.tf \
    | sed -e "s@CHANGE_VERSION@$VERSION@g" \
    | tee variables.tf
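The command above reads and writes variables.tf in the same pipeline. That tends to work for small files, but tee can, in principle, truncate the file before cat has finished reading it. If you end up with an empty variables.tf, restore it from the orig directory and consider a variation that writes to a temporary file first. What follows is only a sketch of that idea; `replace_in_file` is a hypothetical helper name.

```shell
# Sketch: replace a placeholder in a file without reading and writing it in the same pipeline.
# `replace_in_file` is a hypothetical helper name.
replace_in_file() {
  file=$1
  placeholder=$2
  value=$3
  tmp="$file.tmp"
  # Write the result to a temporary file first, then move it over the original.
  sed -e "s@$placeholder@$value@g" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

It could be used as `replace_in_file variables.tf CHANGE_VERSION "$VERSION"`.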
Now we can move into main.tf
that contains the definitions of the resources we’ll create.
cat main.tf
The output is as follows.
provider "azurerm" {
  features {}
}
terraform {
  backend "azurerm" {
    resource_group_name = "devops-catalog-aks"
    storage_account_name = "devopscatalog"
    container_name = "devopscatalog"
    key = "terraform.tfstate"
  }
}
resource "azurerm_kubernetes_cluster" "primary" {
  count = var.destroy == true ? 0 : 1
  name = var.cluster_name
  location = var.region
  resource_group_name = var.resource_group
  dns_prefix = var.dns_prefix
  default_node_pool {
    name = var.cluster_name
    vm_size = var.machine_type
    enable_auto_scaling = true
    max_count = var.max_node_count
    min_count = var.min_node_count
  }
  identity {
    type = "SystemAssigned"
  }
}
If you are familiar with Terraform, that definition should be straightforward. If you’re not, this is not the place where we’ll go into details, so I’ll just mention what each means.
Through provider "azurerm"
, we are telling Terraform that we want to create and manage Azure resources. We could have stored credentials in that block, but, since we will use environment variables, that is not necessary.
We also have the backend
set to azurerm
. That is the signal to Terraform to keep the state in Azure storage instead of storing it locally. As a result, we’ll be able to manage the resources from any location that has access to that storage and without being constrained to a single machine or some kind of mounted network drives.
Further on, azurerm_kubernetes_cluster
defines both the control plane and the node pool the cluster will use.
One important thing to note about the azurerm_kubernetes_cluster
resource is that it has count
set to var.destroy == true ? 0 : 1
. The reason behind that is simple. If we apply GitOps principles, everything needs to be defined in Git, and, preferably, in a declarative format. We cannot rely on ad-hoc commands. That is not an issue when creating or updating resources. But, if we’d like to destroy what we created, we cannot simply remove all the files. Terraform would not allow us to execute terraform apply
against an empty directory. Even if it would, it might be a good idea to keep the definitions, even if we want to destroy everything. That allows us to be able to change our minds easily without restoring previous commits. Also, there could be quite a few cases when we’d like to destroy a cluster temporarily and recreate it later. For example, we might have a development cluster that should exist only while we are working.
In any case, I believe that it is easier and more transparent to destroy resources by changing the value of the destroy
variable than by deleting all the files. So, if we go back to count = var.destroy == true ? 0 : 1
, we can translate it to “create that resource if destroy
is set to false
, or destroy it if it’s set to true
.” We’ll see that variable in action later.
Finally, the last Terraform file we have is output.tf
.
cat output.tf
The output is as follows.
output "cluster_name" {
value = var.cluster_name
}
output "region" {
value = var.region
}
output "resource_group" {
value = var.resource_group
}
Those are the outputs that we’ll see every time we apply
the definitions. Those three might come in handy if anyone wants to know the name of the cluster, the region, or the resource group. You’ll see their usefulness later when we get to the part of validating the cluster.
We are almost finished with Terraform. The only thing left is to push the changes to Git.
git add .
git commit -m "Initial commit"
git push
Now comes the “real deal”. We finally reached the main subject. Everything we did so far was the preparation for the “big moment”.
Defining A Continuous Delivery Pipeline
We want to have a pipeline that will run every time we push a change to that repository. So, we need to pick a CD tool. It could be any, but not today. Right now, we are going to use codefresh.io. I won’t take a deep dive into it but will focus on the parts that matter when managing infrastructure with Terraform.
Codefresh comes in handy for the task we are about to perform because it is a Software as a Service (SaaS) solution. It could be a self-managed solution, but not inside the same cluster since that would pose a “chicken and egg” type of problem. If we need a cluster to manage a cluster, then it cannot be the same one. We’d need a cluster with a CD platform to manage a different cluster. On top of that, we could not run a pipeline inside the cluster if that pipeline might need to upgrade or even destroy that same cluster. Using a SaaS solution is very handy, especially when dealing with infrastructure.
So, what is the bare minimum for a pipeline that will manage infrastructure using Terraform?
Before answering such a question, I will need to make at least two assumptions.
I will assume that you are creating pull requests (PRs) with proposed changes. Those PRs can be reviewed and, potentially, tested. That means that, as a minimum, we need to be able to see the proposed changes. We need to know how that PR will affect the current state. That means that we probably want to execute terraform plan
that outputs what will be removed, what will be added, and what will be modified if we apply
the changes. PRs should NOT apply the changes, at least not inside the same project. In other words, we need to know what the proposed changes are, but not to apply
them.
The second assumption is that you want to apply
changes only after merging to the master branch. Any other branch should be ignored.
You might have a different way of working. If that’s the case, pretend that it is not, and go with the flow. Later on, you should be able to modify the examples to fit whichever process you prefer to use.
With all that in mind, let’s try to define which steps we might need to have. If we are successful at that, you should have no problems extending them to whatever else you might need.
We can split the needs into two stages: prepare
and apply
. Those are arbitrary, and you are free to have steps organized differently, but not today.
In the prepare
stage, we have to clone the specific revision from the repository.
In the apply
phase, we have to init
the project so that the Terraform plugins used in our definitions are downloaded, and the state is retrieved from the bucket. Further on, we already discussed that we need to output the plan
. We could do that only when working with PRs, but there is no harm in outputting it always. Finally, we need to apply
the definitions so that the actual state is converged into the desired state. But, as we already discussed, we should apply
only when making changes to the master branch, and not when working with other branches or with PRs.
Now that we know, more or less, what we want to do, let’s take a look at one possible implementation of those goals.
Codefresh pipelines are defined as YAML and, by default, are expected to be in the codefresh.yml
file. I already prepared one, so let’s take a look.
cat codefresh.yml
The output is as follows.
version: "1.0"
stages:
  - prepare
  - apply
steps:
  main_clone:
    title: Cloning repository
    type: git-clone
    repo: "${{CF_REPO_OWNER}}/${{CF_REPO_NAME}}"
    revision: "${{CF_BRANCH}}"
    stage: prepare
  init:
    image: hashicorp/terraform:0.13.0
    title: Initializing Terraform
    stage: apply
    commands:
      - terraform init
  plan:
    image: hashicorp/terraform:0.13.0
    title: Outputting Terraform plan
    stage: apply
    commands:
      - terraform plan
  apply:
    image: hashicorp/terraform:0.13.0
    title: Applying Terraform
    stage: apply
    commands:
      - terraform apply -auto-approve
    when:
      branch:
        only:
          - master
That’s a very simple pipeline. It is split into two stages (prepare
and apply
).
Inside the prepare
stage, we are cloning the revision
that initiated the build (main_clone
). There’s not much more to it, so let’s move to the steps in the apply
stage. That’s where the “real” action is happening.
The steps in the second stage (apply
) should be self-explanatory if you are familiar with Terraform. We are initializing the project through terraform init
so that the required plugins are downloaded and, more importantly, the state stored in the Azure storage container is retrieved. Further on, we are outputting the plan (plan
) and applying the definitions (apply
). Since, by default, terraform apply
asks for an input confirmation, we are circumventing that through the -auto-approve
argument.
However, we are not going to execute all those steps always. The flow of events differs depending on whether we are working with a pull request or the master
branch. To be more precise, when pushing to a pull request, we want to get the information about the changes that will be applied, without changing the actual state. We want the information that we can review and make the decision whether to apply it or not. That’s why we have the when
conditional in the apply
step. It will be executed only
if the branch
is master
.
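The branch logic of that pipeline can be summarized with a few lines of shell. This is only an illustration of the decision the when conditional makes, not something Codefresh executes; `steps_for_branch` is a hypothetical function name.

```shell
# Sketch: which Terraform steps run for a given branch, mirroring the pipeline's `when` conditional.
# `steps_for_branch` is a hypothetical name used only for illustration.
steps_for_branch() {
  branch=$1
  # init and plan run for every build, including those triggered by pull requests.
  steps="init plan"
  # apply runs only when the build was triggered by the master branch.
  if [ "$branch" = "master" ]; then
    steps="$steps apply"
  fi
  printf '%s\n' "$steps"
}
```

Calling `steps_for_branch master` prints all three steps, while any other branch name results in init and plan only.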
This process is, in a way, equivalent to executing terraform apply
manually without -auto-approve
. If we did that, Terraform would show us all the changes and ask us to type yes
if we are satisfied with the outcome. However, running terraform apply
manually prevents us from involving the rest of the team, not to mention that it does not adhere to GitOps principles. By having a pipeline that outputs the plan
when working with a pull request and apply
after merging to the master, we can involve the whole team to participate in the review of the changes. We can suggest modifications through comments, approve the changes, and perform all the other steps we usually associate with pull requests. We are, effectively, adopting the same practices as those we might be using when working on an application.
Now, before we continue, there is one crucial thing to note.
Each of those steps is executed in a separate container based on potentially different images. For now, all we need is Terraform, so all the steps, besides main_clone
, are using hashicorp/terraform
.
The pipeline should be, more or less, self-explanatory, so let’s skip further explanations and make it work inside Codefresh.
Creating And Configuring Codefresh Pipeline
Let’s start by opening Codefresh UI in a browser.
open https://codefresh.io/
Please log in if you are already a Codefresh user, or register if you’re not. Everything we need can be accomplished through the free plan, so your wallet will not be affected in any form or way, at least not by Codefresh.
Once you log in, you will be presented with the Projects screen.
Please create a new project by clicking the CREATE PROJECT button if you are a first-time user, or + NEW PROJECT if you already have others.
Type devops-catalog as the PROJECT NAME, use any PROJECT TAGS you like, and select any ICON.
Click the CREATE button.
Next, we’ll need to create a new pipeline. As you can surely guess, we can do that by clicking the CREATE PIPELINE button.
Type cf-terraform-aks as the pipeline name, and select cf-terraform-aks as the repository. You can use the search field to narrow the list of the repos if you have too many.
Click the CREATE button.
Codefresh allows you to specify pipeline definition inside the UI, but, in the majority of cases, it should be stored in a Git repo, just as any other code or config related to the project. We already have codefresh.yml, so there is no good reason not to use it.
Please change Inline YAML to Use YAML from Repository.
Click the DONE button.
As we already discussed, Terraform definitions assume that there are environment variables ARM_SUBSCRIPTION_ID
, ARM_TENANT_ID
, ARM_CLIENT_ID
, and ARM_CLIENT_SECRET
. We created them locally, in a terminal session. But Codefresh does not know about them. We need to define them there as well.
Please go back to the terminal to output the contents of those variables.
echo $ARM_SUBSCRIPTION_ID
echo $ARM_TENANT_ID
echo $ARM_CLIENT_ID
echo $ARM_CLIENT_SECRET
Copy the output of the first (ARM_SUBSCRIPTION_ID
) and return to Codefresh in your browser.
Select the VARIABLES tab, and click the ADD VARIABLE button.
Type ARM_SUBSCRIPTION_ID as the Key and paste the content you copied earlier into the Value field.
Finally, we’ll encrypt the value of that variable by clicking the Encrypt button, followed with OK to confirm the desire to ENCRYPT VALUE.
Now, repeat those same steps three more times to add variables ARM_TENANT_ID
, ARM_CLIENT_ID
, and ARM_CLIENT_SECRET
.
Click the SAVE button to persist the newly created variables.
We’re done. We have the pipeline, and the only thing left is to give it a spin.
Applying Infrastructure Definitions
Everything is set up, except for our infrastructure. We defined everything we need as a combination of Terraform definitions and a pipeline YAML. From now on, we will not be clicking any buttons to make the “magic” happen. Whichever changes we push to the master branch will be applied directly to our infrastructure. Git will make sure to notify Codefresh about the changes of the desired state, and the pipeline will make sure that the actual state is converged to the desired one. Later on, we’ll see that might not be enough, but, for now, we’ll focus on the master alone.
However, since we already have the definition stored in Git, we will break the rule and click the RUN button ourselves. Hopefully, this will be the first and the last time we’ll initiate a pipeline from the UI. Think of the action we are about to perform as me showing you what NOT to do.
Please click the RUN button, twice.
You will be redirected to the build screen from where you can observe the progress. Feel free to click any of the steps that were already executed or to follow the one that is currently running. You’ll see the output of the logs.
When the build reaches the apply
stage steps, it will init
Terraform by downloading the plugins and restoring the state from the bucket. Further on, it will show the plan
of the changes that will be performed, and continue to apply
them.
A few minutes later, the cluster should have been created, and we can confirm that from a laptop by, let’s say, executing kubectl get nodes
. But, to do that, we need to create KubeConfig locally. We can do that through the az aks get-credentials
command, but, for it to work, we need to find out the name of the cluster and the resource group it belongs to. We can get that information in two ways.
One option would be to copy and paste the information we need from the output of the apply
step. Go ahead. Open the logs of that step (the last one), and you’ll see the information we need. However, there is a better way which, at the same time, demonstrates the benefits of storing Terraform state in a storage bucket.
Instead of going through the Codefresh UI and digging through build logs, we can restore the Terraform state locally and retrieve all the information we need or, to be more precise, the information defined as output
values. To do that, we need to init
the project first.
terraform init
That command downloaded the plugins we would need to apply
the changes. But we are not going to apply
anything. From now on, applying changes is done through GitOps. We should push changes to Git and let the system converge the actual into the desired state. Besides, that was not the reason why we initialized the project. That command also configured our local project to use the Azure Storage bucket that contains the state. That was the “real” reason behind terraform init
.
Next, we need to synchronize the local state with the one stored in the bucket. We can do that through refresh
.
terraform refresh
The output, limited to the outputs
, is as follows.
...
Outputs:
cluster_name = docatalog
region = eastus
resource_group = devops-catalog-aks
We can see the information we need. Now we could copy it and paste it as arguments to the get-credentials
command. But we will not do that. The only place worth copying and pasting is Stack Overflow. For everything else, it is much better to have self-executable commands that we can easily convert into scripts if needed.
Fortunately, we can use terraform output
commands to retrieve the information we need. So, without further ado, the commands that will generate the KubeConfig file, which will allow us to interact with the newly created cluster, are as follows.
export KUBECONFIG=$PWD/kubeconfig
az aks get-credentials \
    --name \
    $(terraform output cluster_name) \
    --resource-group \
    $(terraform output resource_group) \
    --file \
    $KUBECONFIG
We defined the environment variable KUBECONFIG
with the path where we want to store the configuration. That wasn’t necessary. Without it, the config would be stored in the default location, together with other clusters we might be using. However, that tends to get messy when working with many clusters, so I prefer to have a separate file for each and store it in the same directory where the project with the cluster definitions is.
Do not worry about the danger of accidentally pushing kubeconfig
to the Git repo. It is listed in the .gitignore
file.
Further on, we executed the get-credentials
command. Instead of copying and pasting the name of the cluster and the resource group, we used terraform output
commands to retrieve that info.
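A word of caution if your local Terraform is newer than the 0.13.0 used in the pipeline. To the best of my knowledge, Terraform 0.14 and later print string outputs wrapped in double quotes, and those quotes would end up in the arguments passed to az. Those versions offer terraform output -raw for that purpose. Alternatively, the quotes can be stripped in the shell, as in the sketch below; `strip_quotes` and `tf_output` are hypothetical helper names.

```shell
# Sketch: read a Terraform output value without surrounding double quotes.
# `strip_quotes` and `tf_output` are hypothetical helper names.
strip_quotes() {
  value=$1
  value=${value#\"}   # drop one leading double quote, if present
  value=${value%\"}   # drop one trailing double quote, if present
  printf '%s\n' "$value"
}

# Wraps `terraform output NAME`; requires Terraform and an initialized project.
tf_output() {
  strip_quotes "$(terraform output "$1")"
}
```

With those helpers, `$(tf_output cluster_name)` could replace `$(terraform output cluster_name)` in the command above, and it should behave the same no matter whether the output is quoted.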
All that’s left, before we move on, is to confirm that the cluster is indeed operational.
kubectl get nodes
You should see the output of the nodes that constitute the cluster. Hurray!
Nevertheless, we are not yet finished. We still need to figure out how to make the infrastructure changes safe or, at least, less likely to cause damage.
Incorporating Pull Requests Into Infrastructure Management
Applying changes to infrastructure directly, without any type of review or testing, is irresponsible. That might prove to be catastrophic. We could make a mistake that would hurt our users (external or internal). Yet, that’s precisely what we did. So, let me correct the previous statement. Updating existing infra without any review or validation is terrible. Creating infra for the first time is usually OK since no one is yet using it, and we are not updating or destroying existing resources. That’s why our previous actions were acceptable. We created a new cluster that isn’t affecting anyone since no one could have been using something that does not exist.
Now, let’s imagine that we deployed some applications in that cluster and that they are being used by others. How can we make changes to that cluster safely? The answer to that question is in pull requests. That’s the widely accepted way to propose changes that can be reviewed and tested.
Testing changes proposed through a pull request usually means deployment of an application and some of its dependencies. But, right now, we are not dealing with applications, but with infrastructure. Still, the logic is, more or less, the same. We need to deploy at least some applications so that we can test whether they behave correctly in changed infrastructure.
When compared with PRs related to applications, the additional complication is that we need to create that infrastructure in parallel with the one that we are planning to change, then apply the changes, followed by testing. But all that is too big of a subject and deserves a separate article. So, we’ll focus on PRs themselves and reviews and leave the creation of temporary infrastructure that can be used for testing purposes for some other time.
A review of any pull request usually consists of observing the differences in code, communication between team members, and a clear understanding of what would happen if applied to production. We’ll focus only on the latter, assuming that you already know how to review code changes and communicate with your team through comments or other means.
The good news is that we already set up almost everything we need. The plan step inside the apply stage executes terraform plan, which shows the changes that will be performed if we execute terraform apply. We can use that output to evaluate whether we should proceed with the proposed changes.
On the other hand, we do NOT want to execute terraform apply as a result of creating a pull request. That would defeat the purpose of proposing changes.
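If you ever want a machine-readable answer to "would this change anything?", terraform plan has a -detailed-exitcode flag that exits with 0 when there is nothing to do, 1 on errors, and 2 when changes are pending. The following is a minimal sketch of how that could be used. It is not what the pipeline in codefresh.yml does, and both the function name plan_summary and the plan.log file name are assumptions made for the example.

```shell
# Sketch only: summarize the result of `terraform plan` for a reviewer.
# plan_summary is a hypothetical helper; the default log path is an assumption.
# Do not combine it with `set -e`, since exit code 2 is an expected outcome.
plan_summary() {
  local log="${1:-plan.log}"
  # Store the full plan in the log so that the reviewer can read it later.
  terraform plan -detailed-exitcode -input=false -no-color > "$log"
  case $? in
    0) echo "no changes" ;;
    2) echo "changes pending, review $log" ;;
    *) echo "plan failed, see $log"; return 1 ;;
  esac
}
```

Nothing in that function applies anything. It only plans, which is the behavior we want for pull requests.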
If you take another look at the apply step in codefresh.yml, you’ll see the following declaration.
...
  apply:
    ...
    when:
      branch:
        only:
          - master
That is a conditional statement, and you can probably guess that it means that the step should be executed only if the branch is master.
In other words, if we make a change to the master branch, all the steps will be executed, including the one that applies the changes. But, in all other cases (including PRs), the apply step will be skipped.
So, we are all set, except for one tiny detail. We need to modify Codefresh triggers so that pipelines are executed only if a change is made to the master branch, or when creating or making changes to PRs. That means that we need to do two things. We need to restrict Codefresh to trigger pipelines only when changes are made to the master branch and ignore others. On top of that, we need to add an additional trigger that will run builds whenever we create a new pull request or make changes to an existing one.
Please go back to the Codefresh UI in your browser.
You should see the breadcrumbs near the top of the screen. Click the cf-terraform-aks link.
Next, we want to change the triggers, so click the TRIGGERS tab. You’ll see a single trigger created by default when we created the pipeline. Click the edit button next to it.
For now, we’ll focus on making sure that only the master branch triggers pipeline builds.
Please change the BRANCH (REGEX EXPRESSION) to /master/gi. That will ensure that only the changes to the master branch will trigger pipelines.
Click the UPDATE button.
Next, we need to add an additional trigger that will handle creation and changes to pull requests.
Click ADD TRIGGER, followed by the + ADD TRIGGER button. Select GIT as the type. Click the NEXT button.
Change the TRIGGER NAME to pr, or whatever you would like to call it. Select the cf-terraform-aks repository.
For this trigger, we are not interested in the push commits coming from branches since we already have that set up in the other trigger. So, unselect the TRIGGER BY option Push commits.
Select the TRIGGER BY options Pull request opened and Pull request synchronized. The names should be self-explanatory.
Now, if we leave it as-is, the trigger would run pipelines on any pull request. But we might not want that. We might want to limit them only to PRs created against the master branch.
I don’t think I ever created a PR that is not against the master branch, but one can never know how far the creativity of others goes.
So, we’ll change the PULL REQUEST TARGET BRANCH (REGEX EXPRESSION) to /master/gi.
Click NEXT, followed by the DONE button.
Close the dialog by pressing the X icon in the top-left corner of the Pipeline triggers dialog.
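A word of caution about the expression we used in both triggers. /master/gi is not anchored, so it matches any branch whose name contains master, not only master itself. If that matters in your repository, an anchored expression like /^master$/gi is stricter. The sketch below illustrates the difference with grep. It only approximates how Codefresh evaluates the expression, and branch_matches is a name I made up for the example.

```shell
# Sketch only: approximate a /pattern/gi check with grep.
# branch_matches is a hypothetical helper. grep -Ei stands in for the
# case-insensitive regular expression, which is an approximation.
branch_matches() {
  local pattern="$1" branch="$2"
  printf '%s\n' "$branch" | grep -Eiq -- "$pattern"
}

# Usage (not executed here):
#   branch_matches 'master'   'master'          # matches
#   branch_matches 'master'   'not-master-yet'  # matches as well
#   branch_matches '^master$' 'not-master-yet'  # does not match
```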
Now we’re ready to give pull requests a spin.
Using Pull Requests To Preview Changes To Infrastructure
Let’s create some changes to our infrastructure and see how we can benefit from pull requests and triggers to Codefresh pipelines.
The first few steps are typical for creating any pull request. It starts with the creation of a new branch.
git checkout -b destroy
We could make some changes to the definition of the infrastructure. We could, for example, upgrade the Kubernetes version. Or, we could add another node pool. But, since we are getting close to the end of the story, we might just as well destroy the whole cluster. I don’t want you to blame me for the high cost of using my examples, so I always end tutorials with the destruction of everything. This will not be an exception, except that we’ll proceed with the destruction before reaching the end.
The typical way to destroy the resources created through Terraform is to execute terraform destroy. But that command is not very “friendly” to CD processes. It makes much more sense to keep using terraform apply, no matter whether we want to create, update, or destroy resources.
However, there is no intuitive way to do something like that in Terraform. If, for example, we delete all the files, terraform apply would think that there is something fishy about it and prevent us from proceeding. On top of that, deleting all the files from Git might complicate tracking and reviews. Similarly, sometimes we might want to delete resources but still keep the option to recreate them later. That is especially true when dealing with those that are temporary like, for example, a cluster used as a development environment. One might create it at the beginning of the workday and shut it down when finished working.
We can use the count meta-argument available in all Terraform resources. It can be set to 1 by default and changed to 0 if we want to destroy that resource. To make it a bit more user-friendly, we can have a boolean variable to control that behavior.
That’s why the current definition has the variable destroy of type bool set, by default, to false. Further on, we have count = var.destroy == true ? 0 : 1 set on all the relevant resources. We saw those when we explored variables.tf and main.tf. Feel free to take another look at those files if your memory does not serve you, or if you did not pay attention.
Now that the logic behind destructive tendencies is a bit clearer, let’s create a PR that will propose the destruction of the whole cluster.
Please open variables.tf in your favorite editor and change the value of the destroy variable to true. Make sure to save the changes.
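If you’re curious about what that change will do before anyone else sees it, you could run a plan locally. The -var flag overrides the default value set in variables.tf, so the sketch below should work no matter whether you already edited the file. It assumes that terraform init was executed in the project directory, as we did earlier, and preview_destroy is a made-up name. It only plans. Applying stays the job of the pipeline.

```shell
# Sketch only: preview locally what flipping `destroy` would do.
# preview_destroy is a hypothetical helper. It never runs `terraform apply`.
preview_destroy() {
  local value="${1:-true}"
  # -var takes precedence over the default defined in variables.tf.
  terraform plan -input=false -var "destroy=$value"
}
```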
Now that we modified the Terraform definition, we can push the changes to the newly created branch.
git add .
git commit -m "Destroying everything"
git push --set-upstream origin destroy
Feel free to go back to Codefresh. If you do, you’ll see that a pipeline build was not triggered by that push. That was expected since we modified the triggers to work only with the master branch and pull requests.
Next, we’ll pretend that we are finished working with that branch and that we are ready to propose that those changes be applied to production. So, as you already know, it is time to create a pull request.
open https://github.com/$GH_ORG/cf-terraform-aks
Please create a pull request. I’m sure you already know how to do that, so I will not give you step-by-step instructions. If you don’t know how to do it, the first step is to feel ashamed for a few minutes and then Google it.
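If you’d rather stay in the terminal, the GitHub CLI can create the pull request as well. The following is a sketch that assumes gh is installed and authenticated against your account. The function name open_destroy_pr and the wording of the PR body are mine, so adapt them as you see fit.

```shell
# Sketch only: open the pull request from the terminal with the GitHub CLI (gh).
# Assumes gh is installed and authenticated. open_destroy_pr is a hypothetical name.
open_destroy_pr() {
  gh pr create \
    --base master \
    --head destroy \
    --title "Destroying everything" \
    --body "Sets the destroy variable to true. Review the terraform plan output before merging."
}
```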
Now, go back to Codefresh UI opened in your browser, and observe that a new pipeline build was triggered. The last two steps are what matters in the context of pull requests.
The logs from the second to last step (the one named Outputting Terraform plan) should show what will happen if we apply the changes. In this case, you will see that some of the resources will be deleted.
The second important observation is that the Applying Terraform step was skipped. We did not apply the changes, but only executed terraform plan, which shows us what would happen if we did.
Further on, we would probably review the proposed code changes, discuss them with our peers, ask for someone to approve, and do all the other things we usually do with PRs. We’ll imagine that we did all that.
All that is left is to confirm that we indeed want to apply the changes that will destroy the cluster. Since we are enforcing the GitOps principles, the way to proceed is to merge the PR and let the machines do the rest. So, please go back to the pull request and hit the Merge pull request button.
We’re finished with that branch, so let’s go back to master.
git checkout master
Go back to Codefresh. You’ll see that a new pipeline build was triggered. It will complete soon. When it’s done, your cluster will be gone. If you do not believe me, open the Azure Portal and confirm that it is no more.
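If you prefer the terminal over the Azure Portal, a check along the following lines might do. It is a sketch that assumes you are logged in with the Azure CLI and that the cluster is named docatalog, as shown in the Terraform outputs. cluster_gone is a name invented for the example.

```shell
# Sketch only: check from the terminal whether the AKS cluster still exists.
# cluster_gone is a hypothetical helper. docatalog is the cluster name from
# the Terraform outputs shown earlier.
# Returns 0 if no cluster with that name is listed, 1 if it still exists,
# and 2 if the az command itself failed.
cluster_gone() {
  local name="${1:-docatalog}" found
  found="$(az aks list --query "[?name=='$name'].name" --output tsv)" || return 2
  [ -z "$found" ]
}

# Usage (not executed here):
#   cluster_gone docatalog && echo "The cluster is no more"
```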
From now on, you can create or destroy the cluster by changing the value of the variable destroy. Of course, that does not mean those are the only operations you can do. You can add new resources, change the properties of the existing ones, or do (almost) anything else related to your infrastructure.
What matters is that your job is to write or modify Terraform definitions and push them to Git. The machines will do the rest.
What Are We Missing?
We could have done many other things, but due to constraints of time and space, we didn’t. We could create a temporary parallel infrastructure for testing purposes. We could add tests that would validate changes. We could send the output of terraform plan to the PR to have all the information there. We could also employ ChatOps to further streamline the whole process.
Time and space are limited, and this article is already much longer than I initially thought it would be. So, we’ll end here, for now.