Provision an on-premises Kubernetes Cluster with Rancher, Terraform and Ansible
Automate the provisioning of on-premises Rancher RKE Clusters and the registration of their nodes via Terraform and Ansible
What is Rancher
Rancher is one of the most exciting pieces of software I have come across in the last decade. In a nutshell, Rancher empowers you to deliver managed Kubernetes-as-a-Service anywhere: on premises, in the cloud, or on edge devices.
Citing what the company itself states on their website:
“Rancher is a complete software stack for teams adopting containers. It addresses the operational and security challenges of managing multiple Kubernetes clusters, while providing DevOps teams with integrated tools for running containerized workloads.”
In simple words: manage multiple clusters and workloads throughout their lifecycle, painlessly. Rancher has created its own Kubernetes distributions, RKE and K3s, and back in 2020 the company was acquired by SUSE. RKE is the CNCF-certified K8s distribution that runs on any host already prepared with the necessary Docker engine. K3s, on the other hand, is the lightweight alternative that consolidates everything Kubernetes needs into a single small binary with a footprint of no more than 40MB, addressing edge and IoT scenarios.
What are we trying to achieve
In this article we are going to talk about the former option, RKE, and how we can provision new clusters and register their nodes in an automated fashion, without any manual effort or resorting to ClickOps (although the web interface of Rancher will cover most of your needs, that is not the case we want to investigate here).
One of the biggest pain points for developers and development teams is setting up, and just as quickly discarding, Kubernetes clusters. Setting up a K8s cluster is a tedious and time-consuming task; developers mainly just want to spin up a cluster and start testing and collaborating without diving into the nuances and intricacies that Kubernetes brings (if you want a step-by-step guide on how to set up a Kubernetes cluster from scratch, without Rancher, have a look at another article of mine). Additionally, most development teams don't want to, and cannot afford to, work in isolated local environments like minikube, which without a doubt is an excellent choice for local development but has its limitations and cannot serve a whole team or emulate a 100% realistic, full-fledged production environment.
And this is how Rancher comes into the picture. Here’s what we are going to need for this lab:
- a Rancher server. I presume that you already have a workstation prepped with Docker where you can spin up Rancher as a container to speed things up; I will show how later.
- 3 Virtual Machines (1 vCPU and 2GB RAM will do the trick, but I would advise you to dial both numbers up by a factor of 2), already prepped with Docker. How to provision these boxes is out of the scope of this article and mainly up to your personal taste; I would strongly propose you go with Vagrant. For this lab, I personally used pre-baked CentOS 8 images that I downloaded from Linux VM Images (those VMs are throw-aways, so don't bother too much with an elaborate provisioning process). Let your DHCP server assign an IPv4 address but make it stick; we need static addresses for our nodes.
- A workstation with Terraform and Ansible. The optimal thing to do is to combine all the necessary tools (Docker Engine, Terraform, Ansible, the Rancher container) in one machine, preferably the one we already mentioned in bullet #1.
As the installation of these various components varies slightly per distribution, and I cannot cover every single one of them, I will solely focus on the preparation of the boxes assuming they are all running CentOS 8.
Create SSH Keys
The first step is to create a key pair on the development workstation:
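The original snippet is not embedded here; assuming the default key path and an empty passphrase (fine for a throw-away lab), the command could be:

```shell
# Create ~/.ssh if missing, then generate a 2048-bit RSA key pair
# (empty passphrase for convenience in this lab)
mkdir -p ~/.ssh
ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -N "" -q
```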
As simple as it gets: this will create a 2048-bit RSA key pair, which is secure enough for our experiment.
Next you have to copy your public key to every single one of the Virtual Machines that are destined to serve as cluster nodes:
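For example, with three hypothetical node IPs and a centos account (both placeholders):

```shell
# Example static node IPs and remote account -- replace with your own
NODES="192.168.1.101 192.168.1.102 192.168.1.103"
for node in $NODES; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub "centos@${node}"
done
```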
Replace the IPv4 addresses and the account as fits your environment.
CentOS 8 unfortunately reached its end of life last year (don't get me started) and consequently no longer receives updates from the CentOS project. In order to update it after the 31st of December 2021, you have to point the repos to the CentOS Vault archive. Get elevated privileges and let's start (apply this paragraph to all 4 boxes):
cd /etc/yum.repos.d/
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
Update the system:
yum update -y
Let’s now install Docker:
yum install -y yum-utils
Set up the stable repository:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Install Docker engine itself:
yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin
start the service:
systemctl enable --now docker
and finally add your user to the docker group:
usermod -aG docker $USER
All of the boxes have the required Docker engine installed and ready to host containers.
Connect to your first box (your development workstation, not any of the boxes that you plan to use as cluster nodes). First step (getting sudo permissions before any action is implied throughout this guide), add the repo:
yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
yum -y install terraform
and finally confirm the installation was successful:
terraform version
If the installed Terraform version is printed, you are good to go.
Continue working on your development box. Ansible for CentOS 8 can be found in the EPEL repository, and this is how we are going to enable it:
dnf install epel-release
Installing Ansible now is as simple as that:
dnf install ansible
and as we did for Terraform, let's make sure that all went well:
ansible --version
If the Ansible version and its configuration details are printed, then Ansible is successfully installed.
Installing Rancher is an integral part of preparing our environment. Let’s install it as a container in our development machine:
docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged --name=rancher-v2.6.3 rancher/rancher:v2.6.3
The latest image of Rancher seems to have an issue that I couldn't overcome, and a lot of people are complaining about the same thing (check the GitHub issues of the project): the server eventually does not start. I experienced this persistent problem, with no resolution, while trying the image out on various machines, so I eventually chose the image tagged 2.6.3, which I already knew worked without problems from previous installations. Arrange the exposed port mappings as you wish and as best fits your environment.
Wait until the container is running and then open the defined protocol://ip:port URI in your browser of choice and follow the post-installation configuration steps. Set a password and the exposed URL address of your Rancher (make sure this IP address or FQDN can be resolved from the other boxes destined to serve as cluster nodes). Then you can access Rancher from its web interface.
If you are a fan of the old design, like me, you can still access it at protocol://ip:port/g/clusters. No guarantees, though, for how long this will remain available before it is removed completely.
What we need to do next is to create an API key (we are going to need it later in order to access Rancher from Terraform). Click the avatar at the top right-hand corner of the screen and choose Account & API Keys.
Finally fill in the description with a meaningful name, keep scope as “No Scope” and click Create:
On the next screen you will be presented with the generated Access Key and Secret Key. That's a one-time chance: save those values somewhere safe, because it will not be possible to retrieve them afterwards. Press Done and exit.
Dissecting the configuration files
Go back to your development box and clone the following GitHub repo:
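The repo is the one linked at the end of this article; cloning it over HTTPS:

```shell
# Clone the lab's Terraform/Ansible repo and enter the working directory
git clone https://github.com/akyriako/terraform-rancher-cluster.git
cd terraform-rancher-cluster
```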
Inspect the files and you will find a file named .terraformrcsample. Create a copy of it, call it .terraformrc, and edit its contents as follows:
export OS_RANCHER_SECRETKEY="..."
export TF_VAR_RANCHER_URL=$OS_RANCHER_URL
Assign to OS_RANCHER_URL the value of Rancher's URL you defined during Rancher's post-installation steps (e.g. protocol://ip:port), adding /v3 to the path.
You can find this URI in the Cluster Explorer area, where you created the Access and Secret Keys, and copy it safely from there. The rest of the variables are fairly self-explanatory.
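Pieced together, a complete .terraformrc might look like the sketch below; every variable name not shown in the snippet above (the URL and access-key entries) is an assumption, so cross-check them against the sample file in the repo:

```shell
# Hypothetical complete .terraformrc -- the *ACCESSKEY variable names
# are assumptions; the placeholder values must be replaced with yours
export OS_RANCHER_URL="https://<rancher-ip-or-fqdn>:<port>/v3"
export OS_RANCHER_ACCESSKEY="<your-access-key>"
export OS_RANCHER_SECRETKEY="<your-secret-key>"

# Mirror them into TF_VAR_* so Terraform picks them up automatically
export TF_VAR_RANCHER_URL=$OS_RANCHER_URL
export TF_VAR_RANCHER_ACCESSKEY=$OS_RANCHER_ACCESSKEY
export TF_VAR_RANCHER_SECRETKEY=$OS_RANCHER_SECRETKEY
```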
This is quite a simple Terraform experiment, consisting of three modules that follow the standard module structure, plus an Ansible playbook spiced with a plot twist. Let's go through it step by step:
First we have the Root module and its supporting files:
- provider.tf: this is where we configure the Terraform provider for Rancher. It automatically inherits the environment variables we exported previously with .terraformrc
- variables.tf: This file holds the variables of the entry-point module of our Terraform script. All variables prefixed with RANCHER_ are automatically loaded from the TF_VAR_ environment variables we created in the previous step. The variable cluster_nodes_ips is a list containing the static IPv4 addresses of our soon-to-become cluster nodes. The remaining two variables are necessary for Ansible: remote_sudoer is the account name of a designated sudoer that can connect passwordless to the target boxes (you should provide one beforehand if none is available), and private_key_file is the path of the SSH key we generated before, which Ansible will use to connect to those machines.
- root.tf: Typical minimal structure: definition of required providers and their minimum baseline versions. The module creates one random_id resource, called stack, to use as an identifier for the cluster and various other resources (this one is insignificant; you can omit it, as it doesn't add real value to the overall solution). Then it calls two modules: cluster and nodes.
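As a hedged illustration only (the module paths and the random_id byte length are assumptions; the variable names follow the description above, not the repo verbatim), the root module might look roughly like:

```hcl
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

# Random identifier used to name the cluster and other resources
resource "random_id" "stack" {
  byte_length = 4
}

module "cluster" {
  source = "./modules/cluster"
  stack  = random_id.stack.hex
}

module "nodes" {
  source               = "./modules/node"
  stack                = random_id.stack.hex
  cluster_nodes_ips    = var.cluster_nodes_ips
  remote_sudoer        = var.remote_sudoer
  private_key_file     = var.private_key_file
  cluster_node_command = module.cluster.cluster_node_command
  wait_for             = [module.cluster.cluster_name]
}
```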
Let's now jump into the cluster module; then we will come back to the root, because the node module has dependencies on the outputs of the cluster one.
The cluster module follows the standard module structure. It consists of a main, variables and output file:
- variables.tf: This holds the input variables required for the cluster module. Nothing fancy, just the stack variable we mentioned in the root module.
- main.tf: We create a single resource, a rancher2_cluster, with the simplest configuration possible. This will create a new RKE1 cluster in our Rancher, but just the cluster placeholder and metadata, no nodes yet.
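The original gist is not reproduced here, but a minimal rancher2_cluster for RKE1 might look like the following sketch (the resource name, cluster name and the canal network plugin are assumptions):

```hcl
resource "rancher2_cluster" "cluster" {
  name        = "rke-${var.stack}"
  description = "RKE cluster provisioned by Terraform"

  # Presence of rke_config makes this an RKE1 cluster;
  # canal is the default CNI plugin
  rke_config {
    network {
      plugin = "canal"
    }
  }
}
```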
If you want more information about the rancher2 Terraform provider, visit the terraform registry page of the provider, as the purpose of this article is not to present the specific provider.
- outputs.tf: This is the most interesting part of this module. We expect three output variables from the cluster module: cluster_id and cluster_name, which are fairly self-explanatory, and last but not least cluster_node_command, which is practically the very purpose of this article. The value of cluster_node_command is the registration command that is required to run on every node in order for that node to be added to the cluster and configured with all the appropriate K8s pods and resources.
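A sketch of such an outputs.tf, assuming the cluster resource is named as in the previous snippet; the rancher2 provider exposes the registration command through the computed cluster_registration_token attribute:

```hcl
output "cluster_id" {
  value = rancher2_cluster.cluster.id
}

output "cluster_name" {
  value = rancher2_cluster.cluster.name
}

# The docker run command every node must execute to join the cluster
output "cluster_node_command" {
  value = rancher2_cluster.cluster.cluster_registration_token[0].node_command
}
```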
Let’s go back now to the Root module and see what is going on with the node module:
The node module requires a bunch of variables as input parameters:
- stack, already mentioned above — not mandatory for the lab per se
- cluster_ip_nodes, the list of IPv4 addresses of nodes as provided from the root module
- remote_sudoer, the account that Ansible will use to execute its tasks, derived as well from the root module variables
- private_key_file, the private key that Ansible will use to SSH into the nodes, provided from the root module variables
- cluster_node_command, the registration command that has to run in every given node, taken as output variable from the cluster module
- wait_for, a list of resources or attributes that the node module will depend on. We chose the cluster_name output variable of the cluster module to make sure that our Ansible tasks run after the cluster is provisioned.
Now, let's dig into the last, but definitely not least, module of this experiment: the node module.
Same pattern here, the module complies to the standard module structure and consists of the following files:
- variables.tf: This holds the input variables required by the node module. Their naming and purpose were discussed right above.
- output.tf: This is empty for the time being, as we don't need to export any values or results from this module.
- main.tf: Here lies the core of our experiment and soon our “plot-twist” will unfold.
Ansible requires basically 3 files:
- ansible.cfg: that is the configuration file that will govern the behavior of all interactions performed by this playbook.
- inventory: the Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate.
- and at least one playbook.yml (or whatever you want to call it): Ansible playbooks are lists of tasks that automatically execute against hosts.
Well, guess what (and that's the plot twist, if you had the patience to read this far): we are not going to provide any of them. Instead, we have three template files in place (ansible.tpl, inventory.tpl, playbook.tpl), and we are going to let Terraform feed those templates with dynamic values taken from variables.tf and generate the required Ansible files at runtime, during the application of the Terraform plan.
The node module looks like this:
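The actual module lives in the repo; a hedged sketch of its main.tf, with template paths, variable names and the exact ansible-playbook invocation all assumed, could be:

```hcl
# Render the inventory from a template listing the node IPs
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/inventory.tpl", {
    nodes = var.cluster_nodes_ips
  })
  filename = "${path.module}/inventory"
}

# Render ansible.cfg with the remote account and private key path
resource "local_file" "ansible_config" {
  content = templatefile("${path.module}/ansible.tpl", {
    remote_user      = var.remote_sudoer
    private_key_file = var.private_key_file
  })
  filename = "${path.module}/ansible.cfg"
}

# Render the playbook with the cluster registration command, then run it
resource "local_file" "ansible_playbook" {
  content = templatefile("${path.module}/playbook.tpl", {
    node_command = var.cluster_node_command
  })
  filename = "${path.module}/playbook.yml"

  provisioner "local-exec" {
    working_dir = path.module
    command     = "ansible-playbook -i inventory playbook.yml"
  }

  depends_on = [
    local_file.ansible_inventory,
    local_file.ansible_config,
  ]
}
```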
It creates three resources of type local_file:
- ansible_inventory, that feeds the inventory.tpl with the required cluster nodes IPs list and generates the inventory file.
- ansible_config, that feeds the ansible.tpl with the required account name, and certificate path and generates the ansible.cfg file.
- and the ansible_playbook resource, that feeds the playbook.tpl with the registration command that was generated after the creation of the cluster in the cluster module and builds the playbook.yml file. As a last step, it executes the generated playbook tasks upon the hosts that the newly created inventory file contains.
Let's see what the generated playbook.yml looks like and what tasks it is going to execute:
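A hedged reconstruction of the generated playbook (task names and module choices are assumptions based on the description that follows; the real registration command is rendered in by Terraform from playbook.tpl):

```yaml
- name: Prepare Nodes
  hosts: all
  tasks:
    - name: Copy the hard-reset script to every node
      copy:
        src: reset_cluster.sh
        dest: /tmp/reset_cluster.sh
        mode: "0755"

- name: Register Nodes
  hosts: all
  tasks:
    - name: Execute the cluster registration command
      shell: "sudo <node registration command rendered from cluster_node_command>"
```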
The playbook consists of two plays:
- Prepare Nodes: It copies a bash script in every node. This is our Plan B in case something goes wrong during terraform destroy or artifacts persist in the nodes even after destruction of the plan.
- Register Nodes: This is the play that will actually consolidate all our preparation efforts. It executes in every node the registration command that adds the host in the newly generated RKE1 cluster.
Can this playbook be improved? Definitely! Another play could be added to prepare the nodes and install the Docker engine on each one of them, eliminating even more of the manual preparation steps. Additionally, we could use the flag "become: yes" to instruct Ansible to execute the tasks with elevated privileges instead of running the register command with sudo (Ansible nags a lot about this one). But I leave those to you…
Let’s take it for a spin
First things first: let's go to our development workstation and export the variables from .terraformrc into our environment:
source .terraformrc
Then we have to initialize our working directory containing the Terraform configuration files:
terraform init
After that, let's create an execution plan and preview the changes:
terraform plan
Last step would be to apply this execution plan on our infrastructure:
terraform apply -auto-approve
That will take a significant amount of time; not so much the creation of the cluster itself (~5min) as the execution of the Ansible tasks. Calculate around 15–20min until all nodes are registered in the cluster and all necessary containers are created successfully. The Terraform plan will execute and exit, but inside the nodes the whole registration process will kick off, and this includes bringing up many components as containers, downloading the right images, setting up certificates, etc.
During the process, Terraform cannot provide us with any more information. You can go to the Cluster Explorer, find your cluster in the list, and then observe live the provisioning log of the nodes, aggregated in the corresponding tab.
When the registration process completes, you can open the tab named Machines and inspect your newly added nodes and their current state. As long as all of them are marked green and Active, we have a winner: your cluster is ready to go.
If you want to dispose of your cluster, the reverse process is fairly easy. Just destroy the Terraform plan:
terraform destroy -auto-approve
There is a slight chance that in this process, due to whatever technical hiccups, the cluster is removed successfully from Rancher but a bunch of containers remain active and intact inside the nodes. Without completely purging the existing RKE containers, config files and binaries from the nodes, those nodes unfortunately become unusable and you cannot ask them to join another RKE cluster. On this occasion (although it is rare, because Terraform will usually get the job done), the hard-reset script we copied to each node with the first play of our Ansible playbook comes into the picture. Execute the script remotely on every node like this:
ssh <user>@<node-ip> 'sudo /tmp/reset_cluster.sh'
This will blast the node and purge all remaining RKE artifacts. Your node is ready to be used again, and you avoided throwing away the whole VM.
These three simple Terraform modules can save you a lot of time and nerves when you need to quickly provision a new Kubernetes cluster as a development environment for you or your team. With some hardening and fine-grained configuration of the rancher2_cluster resource, you could even provision production clusters on your premises.
Our cluster is ready, but bare-metal clusters come with an important downside. When you create a LoadBalancer service in a public cloud (e.g. AWS, Azure, GCP, Open Telekom Cloud), the required glue is in place to spin up a new network load balancer and assign its IP address as the external IP of the LoadBalancer service. In a private cloud or on premises, however, these network load balancer implementations are missing, and this is the void that MetalLB comes to fill. Jump to the following article to solve this issue once and for all in less than 5 minutes:
Load Balancing with MetalLB in bare metal Kubernetes
Set up a MetalLB Load Balancer for a bare metal Kubernetes Cluster
All the files of this lab can be found in the following repo:
GitHub - akyriako/terraform-rancher-cluster
This article by no means presents best practices for Ansible or Terraform. If you want to read more about Ansible and its amazing capabilities, and how you could work more effectively with playbooks, I would definitely recommend you have a look at this article:
Working with Ansible Playbooks - Tips & Tricks with Examples
In this article, we are exploring Ansible Playbooks, which are basically blueprints for automation actions. Playbooks…
Hope you liked this article. Stay tuned: more material, about K3s this time, is on the way.