Provision an on-premises Kubernetes Cluster with Rancher, Terraform and Ansible

Akriotis Kyriakos
May 13, 2022

Automate the provisioning of on-premises Rancher RKE Clusters and the registration of their nodes via Terraform and Ansible

What is Rancher

Rancher is one of the most exciting pieces of software I have come across in the last decade. In a nutshell, we could say that Rancher empowers you to deliver managed Kubernetes-as-a-Service anywhere: on premises, in the cloud or on edge devices.

Citing what the company itself states on its website:

“Rancher is a complete software stack for teams adopting containers. It addresses the operational and security challenges of managing multiple Kubernetes clusters, while providing DevOps teams with integrated tools for running containerized workloads.”

In simple words: manage multiple clusters and their workloads throughout their lifecycle, painlessly. Rancher has created its own Kubernetes distributions, RKE and K3S, and back in 2020 the company was acquired by SUSE. RKE is a CNCF-certified K8S distribution that runs on any host that is already prepared with the necessary Docker engine. K3S, on the other hand, is the lightweight alternative that consolidates everything Kubernetes needs into a single small binary with a footprint of no more than 40MB, addressing edge and IoT scenarios.

What are we trying to achieve

In this article we are going to talk about the former option, RKE, and how we can provision new clusters and register their nodes in an automated fashion, without any manual effort or resorting to ClickOps (although the web interface of Rancher will cover most of your needs, that is not what we want to investigate here).

One of the biggest pain points for developers and development teams is being able to set up, and discard equally fast, Kubernetes clusters. Setting up a K8S cluster is a tedious and time-consuming task, and developers mostly just want to spin up a cluster and start testing and collaborating without diving into the nuances and intricacies that Kubernetes brings along (if you want a step-by-step guide on how to set up a Kubernetes cluster from scratch, without Rancher, have a look at another article of mine). Additionally, most development teams don't want to, and cannot afford to, work in isolated local environments such as minikube, which without a doubt is an excellent choice for local development but has its limitations and cannot serve a whole team or emulate a 100% realistic, full-fledged production environment.

And this is where Rancher comes into the picture. Here’s what we are going to need for this lab:

  1. a Rancher server. I presume that you already have a workstation prepped with Docker, where you can spin up Rancher as a container to speed things up. I will show how later.
  2. 3 virtual machines (1 vCPU and 2GB RAM will do the trick, but I would advise you to dial up both numbers by a factor of 2), already prepped with Docker. How to provision these boxes is out of the scope of this article and mainly up to your personal taste; I would strongly propose you go with Vagrant. For this lab, I personally used pre-baked CentOS 8 images that I downloaded from Linux VM Images (those VMs are throw-aways, so don't bother too much with an elaborate provisioning process). Let your DHCP server assign an IPv4 address but make it stick (e.g. via a reservation); we need static addresses for our nodes.
  3. a workstation with Terraform and Ansible. The most practical thing to do is to combine all the necessary tools (Docker Engine, Terraform, Ansible, the Rancher container) on one machine, preferably the one we already mentioned in bullet #1.

As the installation of those various components varies slightly per distribution, and I cannot cover every single one of them, I will focus solely on preparing the boxes assuming they are all running CentOS 8.

Create SSH Keys

The first step is to create a key pair on the development workstation:

ssh-keygen

As simple as it gets: this will create a 2048-bit RSA key pair, which is secure enough for our experiment.

Next you have to copy your public key to every single one of the virtual machines that are destined to serve as cluster nodes:

ssh-copy-id centos@192.168.1.30

Replace the IPv4 address and the account to fit your environment.
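
If you prefer to script this step, a minimal sketch could look like the following; the centos account and the IP addresses are just the example values used throughout this article, so adjust them to your own nodes:

# Generate a key pair once (skipped if one already exists), then copy the
# public key to every box that is destined to become a cluster node.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -N ""
for node in 192.168.1.30 192.168.1.31 192.168.1.91; do
  ssh-copy-id "centos@${node}"
done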

Install Docker

Docker is an open source containerization platform that has revolutionized the computing industry since 2013, enabling developers and DevOps engineers to package, distribute and deploy applications as containers.

CentOS 8 unfortunately reached its end of life last year (don't get me started) and consequently no longer receives any updates from the CentOS project. In order to keep updating it after the 31st of December 2021, you have to point your repos to the CentOS vault. Get elevated privileges and let's start (apply this paragraph to all four boxes, the three nodes and the workstation):

cd /etc/yum.repos.d/
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*

Update the system:

yum update -y

Let's now start installing Docker:

yum install -y yum-utils

Setup the stable repository:

yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo

Install Docker engine itself:

yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin

start the service:

systemctl enable --now docker

and finally add your user to the docker group:

usermod -aG docker $USER

All of the boxes now have the required Docker engine installed and are ready to host containers.
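
A quick sanity check on each box never hurts; keep in mind that the group membership added with usermod only takes effect after you log out and back in (or start a new session with newgrp docker):

# Verify the engine responds and that containers can run without sudo.
docker version
docker run --rm hello-world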

Install Terraform

Terraform is an Infrastructure as Code tool that helps you define cloud and/or on-prem resources in configuration files. It provides a consistent workflow to deploy and manage all of your infrastructure. It was introduced by HashiCorp in 2014.

Connect to your first box (your development workstation, not any of the boxes you plan to use as cluster nodes). First step (getting elevated permissions with sudo before any action is implied throughout this guide), add the repo:

yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo

install Terraform:

yum -y install terraform

and finally confirm the installation is successful:

terraform version

If you are able to see something like that, then you are good to go:

Check Terraform installation.

Install Ansible

Ansible is an agentless open source software provisioning, configuration management, and application-deployment tool enabling Infrastructure as Code. Ansible was first released in 2012 by Michael DeHaan and is now part of IBM/Red Hat.

Continue working on your development box. Ansible for CentOS 8 can be found in the EPEL repository, and this is how we are going to enable it:

dnf install epel-release

Installing now Ansible is simple as that:

dnf install ansible

and as we did for Terraform, let’s make sure that all went well:

ansible --version

If you are able to see the following output, then Ansible is successfully installed:

Check Ansible installation.

Install Rancher

Rancher Labs released Rancher back in 2014. Since late 2020 they have been part of SUSE.

Installing Rancher is an integral part of preparing our environment. Let’s install it as a container in our development machine:

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged --name=rancher-v2.6.3 rancher/rancher:v2.6.3

At the time of writing, the latest image of Rancher seems to have an issue that I couldn't overcome: a lot of people are complaining that the server eventually does not start (check the GitHub issues for more details on that matter). I experienced this persistent problem, without resolution, while trying the image on various machines, so I eventually chose the image tagged 2.6.3, which I already knew worked without a problem from previous installations. Arrange the exposed port mappings as you wish and as best fits your environment.

Wait until the container is running, then open the defined protocol://ip:port combination URI in your browser of choice and follow the post-installation configuration steps. Set a password and the exposed URL of your Rancher installation (make sure this IP address or FQDN can be resolved from the other boxes destined to serve as cluster nodes). Then you can access Rancher from its web interface.
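
If the setup wizard asks you for a bootstrap password first (Rancher 2.6 generates one on its initial start), you should be able to fish it out of the container logs; the container name matches the one used in the docker run command above:

# Print the auto-generated bootstrap password from the Rancher container logs.
docker logs rancher-v2.6.3 2>&1 | grep "Bootstrap Password:"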

Rancher’s Cluster Management Explorer

If you are a fan of the old design, like me, you can still access it from protocol://ip:port/g/clusters. No guarantees, though, on how long this is going to be available or when it will eventually be removed completely.

Rancher’s old Cluster Management Dashboard

What we need to do next is to create an API key (we are going to need it later in order to access Rancher from Terraform). Click the avatar at the top right-hand corner of the screen and choose Account & API Keys.

Click Account & API Keys to proceed creating the required token.
Click Create API Key

Finally fill in the description with a meaningful name, keep scope as “No Scope” and click Create:

Create the API Key.

In the next screen you will be presented with the generated Access and Secret Keys. That's a one-time chance; save those values somewhere safe, because it will not be possible to retrieve them afterwards. Press Done and exit.

Note down the AK & SK values.

Dissecting the configuration files

Go back to your development box, and clone the following github repo:

git clone https://github.com/akyriako/terraform-rancher-cluster.git

Inspect the files and you will find a file named .terraformrcsample; create a copy of it, call it .terraformrc, and edit its contents as follows:

export OS_RANCHER_URL="protocol://ip:port/v3"
export OS_RANCHER_ACCESSKEY="..."
export OS_RANCHER_SECRETKEY="..."
export TF_VAR_RANCHER_URL=$OS_RANCHER_URL
export TF_VAR_RANCHER_ACCESSKEY=$OS_RANCHER_ACCESSKEY
export TF_VAR_RANCHER_SECRETKEY=$OS_RANCHER_SECRETKEY

Assign to OS_RANCHER_URL the Rancher URL you defined during the post-installation steps (e.g. protocol://ip:port), suffixing the path with /v3.

You can also find this URI in the screen where you created the Access and Secret Keys and copy it safely from there. The rest of the variables are pretty much self-explanatory.
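
Before moving on, a quick sanity check of the keys doesn't hurt. Assuming you have already sourced .terraformrc, Rancher's v3 API should accept the Access and Secret Key as HTTP basic auth credentials; if you get JSON back, Terraform will be able to authenticate too (-k is only there because the lab uses Rancher's self-signed certificate):

# Hit the Rancher v3 API with the freshly created API key.
curl -sk -u "${OS_RANCHER_ACCESSKEY}:${OS_RANCHER_SECRETKEY}" "${OS_RANCHER_URL}" | head -c 300; echo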

This is a quite simple Terraform experiment, consisting of three modules that follow the standard module structure and an Ansible playbook spiced with a "plot-twist". Let's go through it step by step:

First we have the Root module and its supporting files:

  • provider.tf: this is where we configure the Terraform provider for Rancher. It automatically inherits the environment variables we exported previously from .terraformrc:
provider "rancher2" {
api_url = "${var.RANCHER_URL}"
access_key = "${var.RANCHER_ACCESSKEY}"
secret_key = "${var.RANCHER_SECRETKEY}"
}
  • variables.tf: This file holds the variables of the entry point module of our Terraform script. All variables prefixed with RANCHER_ are automatically loaded from the TF_VAR_ environment variables we created in the previous step. The variable cluster_nodes_ips is a list containing the static IPv4 addresses of our soon-to-become cluster nodes. The remaining two variables are needed by Ansible: remote_sudoer is the account name of a designated sudoer that can connect passwordless to the target boxes (you should create one beforehand if none is available), and private_key_file is the path to the SSH key we generated earlier, which Ansible will use to connect to those machines.
variable "RANCHER_URL" {}

variable "RANCHER_ACCESSKEY" {}

variable "RANCHER_SECRETKEY" {}

variable cluster_nodes_ips {
type = list
default = ["192.168.1.30", "192.168.1.31", "192.168.1.91"]
}

variable remote_sudoer {
default = "centos"
}

variable private_key_file {
default = "../../../secrets/id_rsa"
}
  • root.tf: Typical minimal structure: definition of the required providers and their minimum baseline versions. The module creates one random_id resource, called stack, in order to use it as an identifier for the cluster and various other resources (this one is insignificant; you can omit it, as it doesn't add real value to the overall solution). Then it calls two modules: cluster and nodes.
terraform {
  required_version = ">= 0.12.0"
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.23.0"
    }
  }
}

resource "random_id" "stack" {
  byte_length = 4
}

module "cluster" {
  source = "./modules/cluster"
  providers = {
    rancher2 = rancher2
  }
  stack = "${random_id.stack.hex}"
}

module "nodes" {
  source = "./modules/nodes"
  providers = {
    rancher2 = rancher2
  }
  stack                = "${random_id.stack.hex}"
  cluster_nodes_ips    = var.cluster_nodes_ips
  cluster_node_command = module.cluster.cluster_node_command
  remote_sudoer        = var.remote_sudoer
  private_key_file     = var.private_key_file
  wait_for             = [ module.cluster.cluster_name ]
}

Let's now jump into the cluster module and then come back to the root module, because the nodes module has dependencies on the outputs of the cluster.

The cluster module follows the standard module structure. It consists of a main, variables and output file:

  • variables.tf: This holds the input variables required for the cluster module. Nothing fancy, just the stack variable we mentioned in the root module.
variable "stack" {
description = "Stack unique ID"
}
  • main.tf: We create a single resource, a rancher2_cluster, with the simplest configuration possible, as you can see in the gist below. This will create a new RKE1 cluster in our Rancher installation, but just the cluster placeholder and its metadata, no nodes yet.

If you want more information about the rancher2 Terraform provider, visit its Terraform Registry page, as the purpose of this article is not to present the specific provider.

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.23.0"
    }
  }
}

resource "rancher2_cluster" "cluster_rke" {
  name        = "rke-${var.stack}"
  description = "rke-${var.stack}"
  rke_config {
    ignore_docker_version = false
    network {
      plugin = "canal"
    }
  }
}
  • outputs.tf: This is the most interesting part of this module. We expect three output variables from the cluster module: cluster_id and cluster_name, which are fairly self-explanatory, and last but not least cluster_node_command, which is practically the very purpose of this article. The value of cluster_node_command is the registration command that has to run on every node in order for that node to be added to the cluster and configured with all the appropriate Kubernetes artifacts.
cluster_node_command will contain the Registration Command that you can alternatively find in the Cluster Explorer
output "cluster_id" {
value = rancher2_cluster.cluster_rke.cluster_registration_token.*.cluster_id
}

output "cluster_name" {
value = rancher2_cluster.cluster_rke.name
}

output "cluster_node_command" {
value = rancher2_cluster.cluster_rke.cluster_registration_token.*.node_command
}
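
If you are curious what the captured registration command looks like once the cluster exists, one way to peek at it, assuming terraform init and terraform apply have already been run from the repository root, is terraform console; note that the splat expression above makes this output a list, hence the [0]:

# Evaluate the cluster module's output against the current state.
echo 'module.cluster.cluster_node_command[0]' | terraform console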

Let’s go back now to the root module and see what is going on with the node module:

module "nodes" {
source = "./modules/nodes"
providers = {
rancher2 = rancher2
}
stack = "${random_id.stack.hex}"
cluster_nodes_ips = var.cluster_nodes_ips
cluster_node_command = module.cluster.cluster_node_command
remote_sudoer = var.remote_sudoer
private_key_file = var.private_key_file
wait_for = [ module.cluster.cluster_name ]
}

The node module requires a bunch of variables as input parameters:

  • stack, already mentioned above (not mandatory for the lab per se)
  • cluster_nodes_ips, the list of IPv4 addresses of the nodes, as provided by the root module
  • remote_sudoer, the account that Ansible will use to execute its tasks, also derived from the root module variables
  • private_key_file, the private SSH key that Ansible will use to connect to the nodes, provided by the root module variables
  • cluster_node_command, the registration command that has to run on every node, taken from the output variables of the cluster module
  • wait_for, a list of resources or attributes that the nodes module will depend on. We chose the cluster_name output variable of the cluster module in order to make sure that our Ansible tasks run only after the cluster is provisioned.

Now, let's dig into the last, but definitely not least, module of this experiment: the nodes module.

Same pattern here: the module follows the standard module structure and consists of the following files:

  • variables.tf: This holds the input variables required by the nodes module. Their naming and purpose were discussed right above.
variable "cluster_nodes_ips" {
type = list
}

variable "stack" {
description = "Stack unique ID"
}

variable "cluster_node_command" {
description = "Cluster Registration Command"
}

variable remote_sudoer {
}

variable private_key_file {
}


variable "wait_for" {
type = any
default = []
}
  • output.tf: That's empty for the time being, as we don't need to export any values or results from this module.
  • main.tf: Here lies the core of our experiment and soon our “plot-twist” will unfold.

Ansible basically requires three files:

  • ansible.cfg: that is the configuration file that will govern the behavior of all interactions performed by this playbook.
  • inventory: the Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate.
  • and at least one playbook.yml (or whatever you want to call it): Ansible playbooks are lists of tasks that automatically execute against hosts.

Well, guess what: we are not going to provide any of them. That's the plot-twist, if you had the patience to read this article up to this point. Instead, we have in place three template files (ansible.tpl, inventory.tpl, playbook.tpl) and we are going to let Terraform feed those templates with dynamic values taken from variables.tf and generate the required Ansible files at runtime, while the Terraform plan is being applied.
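
You can even dry-run the rendering without applying anything: terraform console exposes the same templatefile() function, so a rough check from the repository root (assuming terraform init has been run; the path and sample IPs are this lab's) could look like this:

# Render the inventory template locally to preview what Terraform will generate.
echo 'templatefile("modules/nodes/ansible/templates/inventory.tpl", { cluster_nodes_ips = ["192.168.1.30", "192.168.1.31"] })' | terraform console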

ansible.tpl:

[defaults]
inventory = ./inventory
remote_user = ${remote_sudoer}
host_key_checking = False
remote_tmp = /tmp/ansible
display_ok_hosts = no
private_key_file = ${private_key_file}

[ssh_connection]
ssh_args = -o ServerAliveInterval=200

inventory.tpl:

[cluster_nodes]
%{ for cluster_nodes_ip in cluster_nodes_ips ~}
${cluster_nodes_ip}
%{ endfor ~}

playbook.tpl:

- name: Prepare Nodes
  hosts: all
  tasks:
    - name: Copy hard reset script to nodes.
      copy:
        src: ../../../scripts/reset_nodes.sh
        dest: /tmp/reset_nodes.sh
        follow: yes
        mode: u=rwx,g=rx,o=r

- name: Register Nodes
  hosts: all
  tasks:
    - name: Register Nodes to the Cluster.
      command: ${cluster_node_command} --etcd --controlplane --worker

The main.tf of the nodes module looks like this:

terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.23.0"
    }
  }
}

resource "local_file" "ansible_inventory" {
  depends_on = [
    var.wait_for
  ]

  content = templatefile("${path.module}/ansible/templates/inventory.tpl",
    {
      cluster_nodes_ips = var.cluster_nodes_ips
    })
  filename = "${path.module}/ansible/inventory"
}

resource "local_file" "ansible_config" {

  content = templatefile("${path.module}/ansible/templates/ansible.tpl",
    {
      remote_sudoer    = var.remote_sudoer
      private_key_file = var.private_key_file
    })
  filename = "${path.module}/ansible/ansible.cfg"
}

resource "local_file" "ansible_playbook" {

  content = templatefile("${path.module}/ansible/templates/playbook.tpl",
    {
      cluster_node_command = var.cluster_node_command[0]
    })
  filename = "${path.module}/ansible/playbook.yml"

  provisioner "local-exec" {
    working_dir = "${path.module}/ansible"
    command     = "ansible-playbook -i inventory playbook.yml"
  }

}

It creates three resources of type local_file:

  • ansible_inventory, that feeds the inventory.tpl with the required cluster nodes IPs list and generates the inventory file.
  • ansible_config, that feeds the ansible.tpl with the required account name, and certificate path and generates the ansible.cfg file.
  • and the ansible_playbook resource, that feeds the playbook.tpl with the registration command that was generated after the creation of the cluster in the cluster module and builds the playbook.yml file. As a last step, it executes the generated playbook tasks upon the hosts that the newly created inventory file contains.

Let's see what the generated playbook.yml looks like and what tasks it is going to execute:

- name: Prepare Nodes
  hosts: all
  tasks:
    - name: Copy hard reset script to nodes.
      copy:
        src: ../../../scripts/reset_nodes.sh
        dest: /tmp/reset_nodes.sh
        follow: yes
        mode: u=rwx,g=rx,o=r

- name: Register Nodes
  hosts: all
  tasks:
    - name: Register Nodes to the Cluster.
      command: sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.6.3 --server https://192.168.1.106:27443 --token lg7ncpvcm2zvq52n5kcg922b8b5d85crsccm8fbdxm6l29c4mn76q9 --ca-checksum d4fc3e21b795bda1ead361d9d045d2ab79d45ef75ecf05cfa149a0c74b0c42c8 --etcd --controlplane --worker

The playbook consists of two plays:

  1. Prepare Nodes: It copies a bash script to every node. This is our Plan B in case something goes wrong during terraform destroy or artifacts persist on the nodes even after the plan has been destroyed.
  2. Register Nodes: This is the play that actually consolidates all our preparation efforts. It executes on every node the registration command that adds the host to the newly generated RKE1 cluster.

Can this playbook be improved? Definitely! Another play could be added that prepares the nodes and installs the Docker engine on each one of them, eliminating even more of the manual preparation steps. Additionally, we could use the become: yes flag to instruct Ansible to execute the tasks with elevated privileges instead of running the registration command with sudo (Ansible nags a lot about this). But I leave those changes to you…
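
For reference, the rendered files stay on disk next to the module after an apply, so if a node's registration hiccups you can always re-run the generated playbook by hand; the path below assumes this lab's repository layout, and --become is the CLI counterpart of become: yes:

# Re-run the generated registration playbook manually against the generated inventory.
cd modules/nodes/ansible
ansible-playbook -i inventory playbook.yml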

Let’s take it for a spin

First things first: let's go to our development workstation and export the variables from .terraformrc into our environment:

source .terraformrc

Then we have to initialize our working directory containing Terraform configuration files:

terraform init
That looks like a successful initialization

After that, let’s create an execution plan and preview the changes:

terraform plan
That looks like a nice execution plan, all went well so far.

Last step would be to apply this execution plan on our infrastructure:

terraform apply -auto-approve
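
If -auto-approve feels too adventurous, a small variation is to save the plan first and apply exactly what was reviewed:

# Persist the reviewed plan and apply precisely that file.
terraform plan -out=rke.tfplan
terraform apply rke.tfplan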

This will take a significant amount of time: not so much the creation of the cluster itself (~5 min), but the execution of the Ansible tasks. Expect around 15-20 minutes until all nodes are registered in the cluster and all necessary containers are created successfully. The Terraform plan will execute and exit, but inside the nodes the whole registration process kicks off; this includes bringing up many components as containers, downloading the right images, setting up certificates and so on.

During this process Terraform cannot provide you with any more information. You can go to the Cluster Explorer, find your cluster in the list and then observe live the provisioning log of the nodes, which is aggregated in this tab.

Check the provisioning logs of your nodes.
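
If you prefer the terminal over the dashboard, you can also peek directly at what the agent is spinning up on a node; the account and IP are again the example values used throughout this lab:

# List the containers the Rancher agent brings up on a node during registration.
ssh centos@192.168.1.30 'docker ps --format "table {{.Names}}\t{{.Status}}"'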

When the registration process completes, you can open the previous tab, named Machines, and inspect your newly added nodes and their current state. If all of them are marked as green and active, then we have a winner: your cluster is ready to go.

Inspecting the cluster nodes.

If you want to dispose of your cluster, the reverse process is fairly easy. Just destroy the Terraform plan:

terraform destroy -auto-approve

There is a slight chance that, due to some technical hiccup during that process, the cluster is removed successfully from Rancher but a bunch of containers remain active and intact inside the nodes. Without completely purging the existing RKE containers, config files and binaries from the nodes, they become unusable and cannot be asked to join another RKE cluster. This is where the hard reset script we copied to each node with the first play of our Ansible playbook comes into the picture, although this occasion is rare because Terraform will usually get the job done. Execute the script remotely on every node like this:

ssh centos@192.168.1.30 'sudo /tmp/reset_nodes.sh'

This will blast the node and purge all remaining RKE artifacts. Your node is ready to be used again, and you avoided throwing away the whole VM.
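
And if you need to wipe all of the nodes in one go, a small loop over the same example IPs will do:

# Purge leftover RKE artifacts from every node after a problematic destroy.
for node in 192.168.1.30 192.168.1.31 192.168.1.91; do
  ssh "centos@${node}" 'sudo /tmp/reset_nodes.sh'
done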

Summary

These three simple Terraform modules can save you a lot of time and nerves when you need to quickly provision a new Kubernetes cluster as a development environment for yourself or your team. With some hardening and fine-grained configuration of the rancher2_cluster resource, you could even provision production clusters on your premises.

Our cluster is ready, but bare-metal clusters come with an important downside. When you create a LoadBalancer service in a public cloud (e.g. AWS, Azure, GCP, Open Telekom Cloud etc.), there is the required glue in place to spin up a new network load balancer and assign its IP address as the external IP of the LoadBalancer service. However, on a private cloud or on-premises, these network load balancer implementations are missing, and this is the void that MetalLB comes to fill. Jump to the following article to solve this issue once and for all in less than 5 minutes.

All the files of this lab can be found in the following repo:

Hope you found this article useful. On the other hand, if you want something smaller and way more compact to start with as a developer environment, make sure to have a look at one of my articles describing how to provision a highly available containerized Kubernetes cluster with K3S and K3D:

Have fun!
