Provision an on-premises Kubernetes Cluster with Rancher, Terraform and Ansible
Automate the provisioning of on-premises Rancher RKE Clusters and the registration of their nodes via Terraform and Ansible
What is Rancher
Rancher is one of the most exciting pieces of software I have come across in the last decade. In a nutshell, Rancher empowers you to deliver managed Kubernetes-as-a-Service anywhere: on premises, in the cloud, or on edge devices.
Citing what the company itself states on its website:
“Rancher is a complete software stack for teams adopting containers. It addresses the operational and security challenges of managing multiple Kubernetes clusters, while providing DevOps teams with integrated tools for running containerized workloads.”
In simple words: manage multiple clusters and their workloads throughout their lifecycle, painlessly. Rancher has created its own Kubernetes distributions, RKE and K3s, and in 2020 the company was acquired by SUSE. RKE is a CNCF-certified Kubernetes distribution that runs on any host already prepared with the necessary Docker engine. K3s, on the other hand, is the lightweight alternative that consolidates everything Kubernetes needs into a single binary with a footprint of no more than 40MB, addressing edge and IoT scenarios.
What are we trying to achieve
In this article we are going to talk about the former option, RKE, and how we can provision new clusters and register their nodes in an automated fashion, without any manual effort or resorting to ClickOps (the web interface of Rancher will cover most of your needs, but that is not what we want to investigate here).
One of the biggest pain points for developers and development teams is setting up, and discarding equally fast, Kubernetes clusters. Setting up a K8S cluster is a tedious and time-consuming task; most developers just want to spin up a cluster and start testing and collaborating without diving into the nuances and intricacies that Kubernetes brings along (if you want a step-by-step guide on how to set up a Kubernetes cluster from scratch, without Rancher, have a look at another article of mine). Additionally, most development teams don't want, and cannot afford, to work in isolated local environments such as minikube, which is without a doubt an excellent choice for local development but has its limitations and cannot serve a whole team or emulate a fully realistic, full-fledged production environment.
And this is how Rancher comes into the picture. Here’s what we are going to need for this lab:
- a Rancher server. I presume that you already have a workstation prepped with Docker, where you can spin up Rancher as a container to speed things up. I will show how later on.
- 3 Virtual Machines (1 vCPU and 2GB RAM will do the trick, but I would advise you to dial both numbers up by a factor of 2), already prepped with Docker. How to provision these boxes is out of the scope of this article and is mainly up to your personal taste; I would strongly propose you go with Vagrant. For this lab, I personally used pre-baked CentOS 8 images that I downloaded from Linux VM Images (those VMs are throw-aways, so don't bother too much with an elaborate provisioning process). Let your DHCP server assign an IPv4 address but make it stick; we need static addresses for our nodes.
- a workstation with Terraform and Ansible. The most practical approach is to combine all the necessary tools (Docker Engine, Terraform, Ansible, the Rancher container) on one machine, preferably the one we already mentioned in bullet #1.
As the installation of these various components varies slightly per distribution, and I cannot cover every single one of them, I will focus solely on preparing the boxes assuming they are all running CentOS 8.
Create SSH Keys
The first step is to create a key pair on the development workstation:
ssh-keygen
as simple as it gets, this will create an RSA key pair (2048-bit or larger, depending on your OpenSSH defaults), which is secure enough for our experiment.
Next you have to copy your public key to every single one of the virtual machines that are destined to serve as cluster nodes:
ssh-copy-id centos@192.168.1.30
replace the IPv4 address and the account as it fits your environment.
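If you want to save a few keystrokes, the key can be pushed to all nodes in one go. A small sketch, assuming the centos account and the sample addresses used later in this lab:
# copy the public key to every prospective cluster node
for ip in 192.168.1.30 192.168.1.31 192.168.1.91; do
  ssh-copy-id "centos@${ip}"
done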
Install Docker
CentOS 8 unfortunately reached its end of life last year (don't get me started, trigger warning) and consequently doesn't receive any further updates from the CentOS project. In order to update it after the 31st of December 2021, you have to point the repository definitions to the CentOS vault. Get elevated privileges and let's start (apply this paragraph to all 4 boxes):
cd /etc/yum.repos.d/
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
Update the system:
yum update -y
Let’s start now installing Docker:
yum install -y yum-utils
Setup the stable repository:
yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
Install Docker engine itself:
yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin
start the service:
systemctl enable --now docker
and finally add your (non-root) user to the docker group (log out and back in for the membership to take effect):
usermod -aG docker $USER
All of the boxes now have the required Docker engine installed and are ready to host containers.
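Before moving on, it doesn't hurt to verify the engine on every box. A quick check, assuming the same sample addresses and that the centos user has been added to the docker group:
# ask the Docker daemon on each node for its version
for ip in 192.168.1.30 192.168.1.31 192.168.1.91; do
  ssh "centos@${ip}" 'docker version --format "{{.Server.Version}}"'
done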
Install Terraform
Connect to your first box (your development workstation, not any of the boxes you plan to use as cluster nodes). As a first step (note that getting elevated permissions with sudo before any action is implied throughout this guide), add the repo:
yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
install Terraform:
yum -y install terraform
and finally confirm the installation was successful:
terraform version
If the command prints the installed Terraform version, you are good to go.
Install Ansible
Continue working on your development box. Ansible for CentOS 8 can be found in the EPEL repository, and this is how we are going to enable it:
dnf install epel-release
Installing Ansible now is as simple as that:
dnf install ansible
and as we did for Terraform, let’s make sure that all went well:
ansible --version
If the command prints the Ansible version and its configuration details, then Ansible is successfully installed.
Install Rancher
Installing Rancher is an integral part of preparing our environment. Let’s install it as a container in our development machine:
docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged --name=rancher-v2.6.3 rancher/rancher:v2.6.3
The latest image of Rancher seems to have an issue that I couldn't overcome: a lot of people are complaining that the server eventually does not start (check the GitHub issues for more details on that matter). I experienced this persistent problem, without a resolution, while trying the image on various machines, so I eventually chose the image tagged with version 2.6.3, which I already knew worked without a problem from previous installations. Arrange the exposed port mappings as you wish and as best fits your environment.
Wait till the container is running and then open the defined protocol://ip:port URI in your browser of choice and follow the post-installation configuration steps. Set a password and the exposed URL of your Rancher instance (make sure this IP address or FQDN can be resolved from the other boxes destined to serve as cluster nodes). Then you can access Rancher from its web interface.
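While waiting, the container logs are the best place to keep an eye on the startup. A small sketch, assuming the container name used above; on Rancher v2.6.x the initial bootstrap password is also printed there:
# follow the startup logs of the Rancher container
docker logs -f rancher-v2.6.3
# on Rancher v2.6.x, the initial bootstrap password shows up in the logs
docker logs rancher-v2.6.3 2>&1 | grep "Bootstrap Password:"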
If, like me, you are a fan of the old design, you can still access it at protocol://ip:port/g/clusters. No guarantees, though, on how long this is going to be available and when it will eventually be removed completely.
What we need to do next is to create an API key (we are going to need it later in order to access Rancher from Terraform). Click the avatar at the top right-hand corner of the screen and choose Account & API Keys.
Finally fill in the description with a meaningful name, keep scope as “No Scope” and click Create:
In the next screen you will be presented with the generated Access and Secret Keys. That's a one-time chance; save those values somewhere safe because you will not be able to retrieve them afterwards. Press Done and exit.
Dissecting the configuration files
Go back to your development box, and clone the following github repo:
git clone https://github.com/akyriako/terraform-rancher-cluster.git
Inspect the files and you will find a file named .terraformrcsample; create a copy of it, call it .terraformrc, and edit its contents as follows:
export OS_RANCHER_URL="protocol://ip:port/v3"
export OS_RANCHER_ACCESSKEY="..."
export OS_RANCHER_SECRETKEY="..."
export TF_VAR_RANCHER_URL=$OS_RANCHER_URL
export TF_VAR_RANCHER_ACCESSKEY=$OS_RANCHER_ACCESSKEY
export TF_VAR_RANCHER_SECRETKEY=$OS_RANCHER_SECRETKEY
Assign to OS_RANCHER_URL the value of the Rancher URL you defined during the post-installation steps (e.g. protocol://ip:port), suffixing the path with /v3. You can also find this URI in the Cluster Explorer area, where you created the new Access and Secret Keys, and copy it safely from there. The rest of the variables are self-explanatory.
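Before handing things over to Terraform, you can quickly sanity-check the URL and the key pair with a plain API call. A hedged sketch, assuming Rancher accepts the access/secret key pair as basic-auth credentials and that -k is only needed because of a self-signed certificate:
# after sourcing .terraformrc, poke the /v3 endpoint with the freshly created API key
source .terraformrc
curl -sk -u "${OS_RANCHER_ACCESSKEY}:${OS_RANCHER_SECRETKEY}" "${OS_RANCHER_URL}" | head -c 300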
This is a quite simple Terraform experiment, consisting of three modules that follow the standard module structure, plus an Ansible playbook spiced with a "plot-twist". Let's go through it step by step:
First we have the Root module and its supporting files:
- provider.tf: this is where we configure the Terraform provider for Rancher. It automatically picks up the environment variables we exported previously with .terraformrc:
provider "rancher2" {
api_url = "${var.RANCHER_URL}"
access_key = "${var.RANCHER_ACCESSKEY}"
secret_key = "${var.RANCHER_SECRETKEY}"
}
- variables.tf: This file holds the variables of the entry-point module of our Terraform script. All variables prefixed with RANCHER_ are automatically loaded from the TF_VAR_ environment variables we created in the previous step. The variable cluster_nodes_ips is a list containing the static IPv4 addresses of our soon-to-become cluster nodes. The remaining two variables are necessary for Ansible: remote_sudoer is the account name of a designated sudoer that can connect passwordless to the target boxes (you should provide one beforehand if none is available), and private_key_file is the path of the SSH key that we generated before and that Ansible will use to connect to those machines.
variable "RANCHER_URL" {}
variable "RANCHER_ACCESSKEY" {}
variable "RANCHER_SECRETKEY" {}
variable cluster_nodes_ips {
type = list
default = ["192.168.1.30", "192.168.1.31", "192.168.1.91"]
}
variable remote_sudoer {
default = "centos"
}
variable private_key_file {
default = "../../../secrets/id_rsa"
}
- root.tf: Typical minimal structure: the definition of the required providers and their minimum baseline versions. The module creates one random_id resource, called stack, in order to use it as an identifier for the cluster and various other resources (this one is insignificant; you can omit it, as it doesn't add real value to the overall solution). Then it calls two modules: cluster and nodes.
terraform {
required_version = ">= 0.12.0"
required_providers {
rancher2 = {
source = "rancher/rancher2"
version = "1.23.0"
}
}
}
resource "random_id" "stack" {
byte_length = 4
}
module "cluster" {
source = "./modules/cluster"
providers = {
rancher2 = rancher2
}
stack = "${random_id.stack.hex}"
}
module "nodes" {
source = "./modules/nodes"
providers = {
rancher2 = rancher2
}
stack = "${random_id.stack.hex}"
cluster_nodes_ips = var.cluster_nodes_ips
cluster_node_command = module.cluster.cluster_node_command
remote_sudoer = var.remote_sudoer
private_key_file = var.private_key_file
wait_for = [ module.cluster.cluster_name ]
}
Let's now jump into the cluster module and then come back to the root, because the nodes module has dependencies on the outputs of the cluster.
The cluster module follows the standard module structure. It consists of a main, a variables, and an outputs file:
- variables.tf: This holds the input variables required for the cluster module. Nothing fancy, just the stack variable we mentioned in the root module.
variable "stack" {
description = "Stack unique ID"
}
- main.tf: We create a single resource, a rancher2_cluster, with the simplest configuration possible, as you can see in the gist below. This will create a new RKE1 cluster in our Rancher installation, but just the cluster placeholder and metadata; no nodes yet.
If you want more information about the rancher2 Terraform provider, visit its page on the Terraform Registry, as the purpose of this article is not to present the specific provider.
terraform {
required_providers {
rancher2 = {
source = "rancher/rancher2"
version = "1.23.0"
}
}
}
resource "rancher2_cluster" "cluster_rke" {
name = "rke-${var.stack}"
description = "rke-${var.stack}"
rke_config {
ignore_docker_version = false
network {
plugin = "canal"
}
}
}
- outputs.tf: This is the most interesting part of this module. We expect three output variables from the cluster module: cluster_id and cluster_name, which are fairly self-explanatory, and last but not least cluster_node_command, which is practically the very purpose of this article. The value of cluster_node_command is the registration command that is required to run on every node in order for that node to be added to the cluster and configured with all the appropriate Kubernetes artifacts.
output "cluster_id" {
value = rancher2_cluster.cluster_rke.cluster_registration_token.*.cluster_id
}
output "cluster_name" {
value = rancher2_cluster.cluster_rke.name
}
output "cluster_node_command" {
value = rancher2_cluster.cluster_rke.cluster_registration_token.*.node_command
}
Let's go back now to the root module and see what is going on with the nodes module:
module "nodes" {
source = "./modules/nodes"
providers = {
rancher2 = rancher2
}
stack = "${random_id.stack.hex}"
cluster_nodes_ips = var.cluster_nodes_ips
cluster_node_command = module.cluster.cluster_node_command
remote_sudoer = var.remote_sudoer
private_key_file = var.private_key_file
wait_for = [ module.cluster.cluster_name ]
}
The nodes module requires a bunch of variables as input parameters:
- stack, already mentioned above; not mandatory for the lab per se
- cluster_nodes_ips, the list of IPv4 addresses of the nodes, as provided by the root module
- remote_sudoer, the account that Ansible will use to execute its tasks, derived as well from the root module variables
- private_key_file, the SSH private key that Ansible will use to connect to each node, provided from the root module variables
- cluster_node_command, the registration command that has to run on every given node, taken as an output variable from the cluster module
- wait_for, a list of resources or attributes that the nodes module will depend on. We chose the cluster_name output variable of the cluster module in order to make sure that our Ansible tasks will run after the cluster is provisioned.
Now, let's dig around in the last, but definitely not least, module of this experiment: the nodes module.
Same pattern here: the module complies with the standard module structure and consists of the following files:
- variables.tf: This holds the input variables required by the nodes module. Their naming and purpose were discussed right above.
variable "cluster_nodes_ips" {
type = list
}
variable "stack" {
description = "Stack unique ID"
}
variable "cluster_node_command" {
description = "Cluster Registration Command"
}
variable remote_sudoer {
}
variable private_key_file {
}
variable "wait_for" {
type = any
default = []
}
- output.tf: That's empty for the time being, as we don't need to export any values or results from this module
- main.tf: Here lies the core of our experiment and soon our “plot-twist” will unfold.
Ansible requires basically 3 files:
- ansible.cfg: that is the configuration file that will govern the behavior of all interactions performed by this playbook.
- inventory: the Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate.
- and at least one playbook.yml (or whatever you want to call it): Ansible playbooks are lists of tasks that automatically execute against hosts.
Well, guess what (and that's the plot-twist, if you had the patience to read this far): we are not going to provide any of them. Instead, we have in place 3 template files (ansible.tpl, inventory.tpl, playbook.tpl) and we are going to let Terraform feed those templates with dynamic values taken from variables.tf and generate the required Ansible files at runtime, while the Terraform plan is being applied.
ansible.tpl:
[defaults]
inventory = ./inventory
remote_user = ${remote_sudoer}
host_key_checking = False
remote_tmp = /tmp/ansible
display_ok_hosts = no
private_key_file = ${private_key_file}
[ssh_connection]
ssh_args = -o ServerAliveInterval=200
inventory.tpl:
[cluster_nodes]
%{ for cluster_nodes_ip in cluster_nodes_ips ~}
${cluster_nodes_ip}
%{ endfor ~}
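With the default cluster_nodes_ips from variables.tf, the rendered inventory file would come out like this:
[cluster_nodes]
192.168.1.30
192.168.1.31
192.168.1.91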
playbook.tpl:
- name: Prepare Nodes
hosts: all
tasks:
- name: Copy hard reset script to nodes.
copy:
src: ../../../scripts/reset_nodes.sh
dest: /tmp/reset_nodes.sh
follow: yes
mode: u=rwx,g=rx,o=r
- name: Register Nodes
hosts: all
tasks:
- name: Register Nodes to the Cluster.
command: ${cluster_node_command} --etcd --controlplane --worker
The main.tf of the nodes module looks like this:
terraform {
required_providers {
rancher2 = {
source = "rancher/rancher2"
version = "1.23.0"
}
}
}
resource "local_file" "ansible_inventory" {
depends_on = [
var.wait_for
]
content = templatefile("${path.module}/ansible/templates/inventory.tpl",
{
cluster_nodes_ips = var.cluster_nodes_ips
})
filename = "${path.module}/ansible/inventory"
}
resource "local_file" "ansible_config" {
content = templatefile("${path.module}/ansible/templates/ansible.tpl",
{
remote_sudoer = var.remote_sudoer
private_key_file = var.private_key_file
})
filename = "${path.module}/ansible/ansible.cfg"
}
resource "local_file" "ansible_playbook" {
content = templatefile("${path.module}/ansible/templates/playbook.tpl",
{
cluster_node_command = var.cluster_node_command[0]
})
filename = "${path.module}/ansible/playbook.yml"
provisioner "local-exec" {
working_dir = "${path.module}/ansible"
command = "ansible-playbook -i inventory playbook.yml"
}
}
It creates three resources of type local_file:
- ansible_inventory, which feeds inventory.tpl with the required list of cluster node IPs and generates the inventory file.
- ansible_config, which feeds ansible.tpl with the required account name and private key path and generates the ansible.cfg file.
- and the ansible_playbook resource, which feeds playbook.tpl with the registration command that was generated after the creation of the cluster in the cluster module and builds the playbook.yml file. As a last step, it executes the generated playbook tasks against the hosts contained in the newly created inventory file.
Let's see what the generated playbook.yml looks like and what tasks it is going to execute:
- name: Prepare Nodes
hosts: all
tasks:
- name: Copy hard reset script to nodes.
copy:
src: ../../../scripts/reset_nodes.sh
dest: /tmp/reset_nodes.sh
follow: yes
mode: u=rwx,g=rx,o=r
- name: Register Nodes
hosts: all
tasks:
- name: Register Nodes to the Cluster.
command: sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.6.3 --server https://192.168.1.106:27443 --token lg7ncpvcm2zvq52n5kcg922b8b5d85crsccm8fbdxm6l29c4mn76q9 --ca-checksum d4fc3e21b795bda1ead361d9d045d2ab79d45ef75ecf05cfa149a0c74b0c42c8 --etcd --controlplane --worker
The playbook consists of two plays:
- Prepare Nodes: It copies a bash script to every node. This is our Plan B in case something goes wrong during terraform destroy or artifacts persist on the nodes even after the plan is destroyed.
- Register Nodes: This is the play that will actually consolidate all our preparation efforts. It executes on every node the registration command that adds the host to the newly generated RKE1 cluster.
Can this playbook be improved? Definitely! Another play could be added that would prepare the nodes and install the Docker engine on each one of them, so we eliminate even more of the manual preparation steps. Additionally, we could use the flag become: yes to instruct Ansible to execute the tasks with elevated privileges instead of running the registration command with sudo (Ansible nags a lot about this). But I leave those changes to you…
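For instance, the second play of playbook.tpl could be reshaped roughly like this; a sketch only, not part of the repo, assuming the remote account is allowed to become root:
- name: Register Nodes
  hosts: all
  become: yes
  tasks:
    - name: Register Nodes to the Cluster.
      # with become: yes, the leading "sudo" that Rancher bakes into the
      # generated command becomes redundant and could be stripped in the template
      command: ${cluster_node_command} --etcd --controlplane --worker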
Let’s take it for a spin
First things first: let's go to our development workstation and export the variables from .terraformrc into our environment:
source .terraformrc
Then we have to initialize our working directory containing Terraform configuration files:
terraform init
After that, let’s create an execution plan and preview the changes:
terraform plan
The last step is to apply this execution plan to our infrastructure:
terraform apply -auto-approve
That will take a significant amount of time: not so much the creation of the cluster itself (~5 min), but the execution of the Ansible tasks. Calculate around 15-20 min until all nodes are registered in the cluster and all necessary containers are created successfully. The Terraform plan will execute and exit, but inside the nodes the whole registration process will kick off, and this includes bringing up many components as containers, downloading the right images, setting up certificates, and so on.
During this phase Terraform cannot provide you with any more information. You can go to Cluster Explorer, find your cluster in the list, and then observe live the provisioning log of the nodes, which is aggregated in this tab.
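If you prefer the terminal over the web UI, you can also peek at any node directly and watch the agent containers come up; assuming the sample address and account:
# list the containers the registration process spins up on a node
ssh centos@192.168.1.30 'sudo docker ps --format "{{.Names}}: {{.Status}}"'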
When the registration process completes, you can open the previous tab named Machines and inspect your newly added nodes and their current state. As long as all are marked as green and active, then we have a winner. Your cluster is ready to go.
If you want to dispose of your cluster, the reverse process is fairly easy. Just destroy the Terraform plan:
terraform destroy -auto-approve
There is a slight chance that during this process, due to some technical hiccup, the cluster is removed successfully from Rancher but a bunch of containers remain active and intact inside the nodes. Without completely purging the existing RKE containers, config files and binaries from the nodes, they become unusable and you cannot request them to join another RKE cluster. For this occasion, although it is rare because Terraform will usually get the job done, the hard reset script we copied to each node with the first play of our Ansible playbook comes into the picture. Execute the script remotely on every node like this:
ssh centos@192.168.1.30 'sudo /tmp/reset_nodes.sh'
This will blast the node and purge all remaining RKE artifacts. Your node is ready to be used again and you avoided throwing away the whole VM.
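If more than one node is affected, the same command can be looped over the whole set; again assuming the sample addresses:
# purge leftover RKE artifacts from every node
for ip in 192.168.1.30 192.168.1.31 192.168.1.91; do
  ssh "centos@${ip}" 'sudo /tmp/reset_nodes.sh'
done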
Summary
These 3 simple Terraform modules can save you a lot of time and nerves when you need to quickly provision a new Kubernetes cluster as a development environment for yourself or your team. With some hardening and fine-grained configuration of the rancher2_cluster resource, you could even provision production clusters on your premises.
Our cluster is ready, but bare-metal clusters come with an important downside. When you create a LoadBalancer service in a public cloud (e.g. AWS, Azure, GCP, Open Telekom Cloud, etc.) there is the required glue in place to spin up a new network load balancer and assign its IP address as the external IP of the LoadBalancer service. However, on a private cloud or on-premises, these network load balancer implementations are missing, and this is the void that MetalLB comes to fill. Jump to the following article to solve this issue once and for all in less than 5 minutes.
All the files of this lab can be found in the repo we cloned earlier: https://github.com/akyriako/terraform-rancher-cluster.git
Hope you found this article useful. On the other hand, if you want something smaller and far more compact to start with as a developer environment, make sure to have a look at one of my articles describing how to provision a highly available containerized Kubernetes cluster with K3S and K3D:
Have fun!