Automating Kubernetes Disaster Recovery with Terraform

Building a One-Click Enterprise Kubernetes Infrastructure for Disaster Recovery

Welcome! Today, I will be sharing the behind-the-scenes of our “One-Click Infrastructure & App Deployment” project—a project that didn’t just remain a theoretical concept, but one where we truly got our hands dirty in the terminal, wrestled with errors, and ultimately forged a flawless automation pipeline.

Before diving into the depths of this article, you can watch our project video below, which serves as the living proof of everything I am about to explain. It is an approximately 5-minute recording; I hope you enjoy watching it without getting bored! In the video, you will briefly witness the following process: First, our “magic” command terraform apply is triggered, and the system creates 3 Debian virtual machines out of thin air on Hyper-V and connects them to the network. Then, it automatically connects to these machines over SSH and installs the Kubernetes (K3s) cluster. Switching to the console, we type kubectl get nodes to verify that all 3 machines are in the “Ready” state, immediately followed by kubectl get pods -o wide to confirm that our Nginx application is intelligently distributed across the machines in 3 replicas. Finally, we find our externally exposed port using the kubectl get svc nginx-web command and conclude this epic process by successfully loading the “Welcome to nginx!” page in the Chrome browser.

Now, let’s take a closer step-by-step look at the engineering marvel, architectural decisions, and the challenges we faced behind that 5-minute show you just watched.

The Heart of the System: What is Terraform and Why Use It?

The main engine that enabled these machines to boot up on their own, acquire IP addresses, share passwords with one another, and transform into a giant cluster right before our eyes was Terraform. So, what exactly is Terraform?

In traditional System Administration (SysAdmin), setting up an infrastructure takes hours, sometimes even days. You log into a Hyper-V or VMware interface, click the “New Virtual Machine” button, attach an ISO file, and manually allocate RAM and CPU. After the operating system is installed, you connect to each machine one by one to install the necessary packages. This process is slow, highly prone to human error, and worst of all, it is not easily repeatable. If your system crashes (a Disaster), you have to spend all those hours again, from scratch.

Developed by HashiCorp, Terraform is the shining star of the “Infrastructure as Code” (IaC) concept. The core philosophy of Terraform is this: “Don’t click around in an interface for me; write down what you want in a text file, and I will turn it into reality.”

The primary reasons we chose Terraform for this project are:

  1. Declarative Approach: We don’t tell Terraform “Follow these exact steps” (imperative). We say, “I want 1 Master and 2 Worker machines, and these should be their names” (declarative). It calculates the difference between the current state and the desired state, and does whatever is necessary to achieve it.
  2. State Management: Terraform keeps an updated map (tfstate) of the system it builds. On the next run, it only updates the parts that have changed.
  3. Provisioning Capability: As seen in this project, Terraform doesn’t just create virtual machines (empty boxes); with provisioners such as local-exec (which runs commands on the machine executing Terraform) and remote-exec (which sends commands into the target machines over SSH), it can also install software inside them.
  4. Disaster Recovery: When a machine is deleted or the entire system collapses, we can restore the company’s entire infrastructure to its original state in mere seconds, all thanks to that single main.tf file.
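To make the declarative idea concrete, here is a minimal, hypothetical sketch of how such a setup can look when no Hyper-V provider is available and PowerShell does the heavy lifting. The resource names, script path, and parameters are illustrative assumptions, not the project’s actual files:

```hcl
# Illustrative sketch only: the script path and names are assumptions.
variable "nodes" {
  type    = list(string)
  default = ["DR-K8s-Master", "DR-K8s-Worker-1", "DR-K8s-Worker-2"]
}

resource "null_resource" "vm" {
  count = length(var.nodes)

  # local-exec runs on the machine executing Terraform (here, the Hyper-V host),
  # so it can call a PowerShell script that clones the golden image and boots the VM.
  provisioner "local-exec" {
    interpreter = ["PowerShell", "-Command"]
    command     = "./scripts/New-ClusterVm.ps1 -Name ${var.nodes[count.index]} -MemoryMB 2048"
  }
}
```

Note that the file only declares the desired end state (three named machines); Terraform works out what actually needs to be created or changed on each run.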

In short, Terraform took on the role of the “Master Builder” and “System Architect” of this project. Now, let’s look at how we planned the project designed by this architect.

Anatomy of the Project: What Did We Plan and How?

We aimed to simulate a corporate data center infrastructure at real-world standards. We weren’t just doing an ordinary student assignment; we had to build a fault-tolerant, highly available, and modular architecture.

We divided our project into 3 main phases:

Phase 1: Physical Infrastructure Preparation with Hyper-V (The Empty Boxes)

First, we had to handle the hardware layer (Virtual Machines). Certain restrictions brought by the school environment (such as the WinRM protocol being disabled) pushed us toward a creative solution. Instead of using traditional Hyper-V providers, we built our own custom automation bridge using Terraform’s null_resource and PowerShell integration.

  • Golden Image: Installing Debian from scratch every single time would be a waste of time. We prepared a debian-template.vhdx disk image with the basic configurations already applied.
  • Cloning and Booting: Terraform copied this image for each machine (DR-K8s-Master, DR-K8s-Worker-1, DR-K8s-Worker-2) in seconds, allocated 2048 MB of RAM to each, and booted them up.
  • Dynamic IP Detection: One of the most critical challenges in the system was that we could not know the DHCP-assigned IP addresses of the machines in advance. To solve this, we embedded a polling loop inside Terraform: it queried Hyper-V every 5 seconds until the machine booted and received a valid IPv4 address. The moment it caught the IP address, it saved this data to proceed to the next stages.
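The IP-detection step described above could be sketched roughly as follows. The VM name and output file are placeholders for illustration, not the project’s real names:

```hcl
# Hypothetical sketch of the DHCP polling step.
resource "null_resource" "master_ip" {
  provisioner "local-exec" {
    interpreter = ["PowerShell", "-Command"]
    command     = <<-EOT
      # Poll Hyper-V every 5 seconds until DHCP hands the VM a real IPv4 address.
      $ip = $null
      while (-not $ip) {
        Start-Sleep -Seconds 5
        $ip = (Get-VMNetworkAdapter -VMName "DR-K8s-Master").IPAddresses |
              Where-Object { $_ -match '^(\d{1,3}\.){3}\d{1,3}$' } |
              Select-Object -First 1
      }
      # Persist the IP so later provisioning steps can read it.
      Set-Content -Path "master_ip.txt" -Value $ip
    EOT
  }
}
```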

Phase 2: Kubernetes Orchestration with K3s (The Brain and the Workers)

The boxes (Virtual machines) were ready, and the operating systems were running. However, these were just isolated computers unaware of each other. We needed “orchestration” software to bring them together and make them operate as a single organism. Our choice was K3s (Lightweight Kubernetes), which performs miracles, especially in resource-constrained environments.

  • The Birth of the Master Node: After finding the first IP (the Master IP), Terraform connected to the machine via SSH without a password prompt. It installed the curl package and then initiated the K3s Master (Control Plane) installation.
  • Secure Token Transfer: To prevent random external machines from joining the Kubernetes cluster, the Master generates a “Node Token”. As soon as the installation was finished, Terraform pulled this token from the Master machine and wrote it to a secure text file on the desktop.
  • The Enrollment of Worker Nodes: As the Worker-1 and Worker-2 machines, which were booted in parallel, received their IPs, Terraform connected to them as well. It initiated the K3s Agent (Worker) installation by telling them, “Your manager is located at this Master IP, and this is your entry password.” Thus, the 3 machines established an encrypted and secure communication network among themselves, becoming a giant Cluster. I no longer had “my computers”; I had “my System”.
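The master-and-worker choreography above maps naturally onto remote-exec provisioners. The following is a hedged sketch under assumed variable names (var.master_ip, var.worker_ips, var.node_token, the “debian” user), not the project’s actual code:

```hcl
# Sketch only: variable names and SSH credentials are placeholders.
resource "null_resource" "k3s_master" {
  connection {
    type     = "ssh"
    host     = var.master_ip
    user     = "debian"
    password = var.ssh_password
  }

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get install -y curl",
      # The official K3s installer sets up a server (Control Plane) by default.
      "curl -sfL https://get.k3s.io | sh -",
    ]
  }
}

resource "null_resource" "k3s_workers" {
  count      = 2
  depends_on = [null_resource.k3s_master]

  connection {
    type     = "ssh"
    host     = var.worker_ips[count.index]
    user     = "debian"
    password = var.ssh_password
  }

  provisioner "remote-exec" {
    inline = [
      # K3S_URL switches the installer into agent (worker) mode;
      # K3S_TOKEN is the "entry password" pulled from the master.
      "curl -sfL https://get.k3s.io | K3S_URL=https://${var.master_ip}:6443 K3S_TOKEN=${var.node_token} sh -",
    ]
  }
}
```

The depends_on link is what guarantees the workers only attempt to join after the master (and its token) actually exist.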

Phase 3: Enterprise Stack Deployment (The Company’s Digital Ecosystem)

If we had just installed Kubernetes and left it there, it would be like building a massive factory and producing nothing inside it. There are “Fixture Services” that must exist in the technology departments of corporate companies. In the final phase of the project, as soon as Kubernetes was ready, we automatically injected this corporate architecture (Enterprise Stack) into the system.

The components of this architecture were:

  1. Nginx (Frontend & Load Balancing): Our web server that welcomes the users. Following the High Availability (HA) principle, we set this to 3 replicas. Kubernetes intelligently distributed these three copies across the Master, Worker-1, and Worker-2. Even if one of the machines burned down completely, the other two would continue serving the users. Zero downtime!
  2. PostgreSQL (Relational Database): The backend memory where the company’s data (customers, orders, logs, etc.) is kept permanently and securely. To keep the password out of the container image and the code, we injected the database password (POSTGRES_PASSWORD) into the system via an environment variable.
  3. Redis (In-Memory Cache): A caching service installed to increase the speed and performance of the system. In the real world, frequently read data (e.g., the homepage of an e-commerce site) is kept on Redis so as not to exhaust the main database.
  4. Prometheus & Grafana (Monitoring & Observability): Based on the principle “You cannot manage what you cannot measure,” we added tools to monitor the pulse of our system. Prometheus is a detective that collects CPU, RAM, and network traffic metrics from the machines second by second. Grafana is the visual feast component that takes this raw data and turns it into beautiful, readable dashboards. Thanks to this, we gained an infrastructure capable of instantly detecting via graphs when one of the servers is struggling or has crashed.
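A deployment step for this stack might look roughly like the remote-exec sketch below. Deployment names and variables are illustrative assumptions; PostgreSQL, Prometheus, and Grafana would follow the same pattern with their own images:

```hcl
# Illustrative sketch; names and variables are assumptions.
resource "null_resource" "enterprise_stack" {
  connection {
    type     = "ssh"
    host     = var.master_ip
    user     = "debian"
    password = var.ssh_password
  }

  provisioner "remote-exec" {
    inline = [
      # Nginx frontend: 3 replicas for HA, exposed on a NodePort.
      "sudo kubectl create deployment nginx-web --image=nginx --replicas=3",
      "sudo kubectl expose deployment nginx-web --port=80 --type=NodePort",
      # Redis in-memory cache.
      "sudo kubectl create deployment redis-cache --image=redis",
    ]
  }
}
```

The NodePort service is what makes the kubectl get svc nginx-web step from the video possible: it reveals the externally reachable port for the browser test.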

Watching all five of these different services land on Kubernetes with a final Terraform command and transition to a “Running” state within seconds was the fascinating pinnacle of our automation.

Challenges Faced and “Sysadmin” Reflexes

Of course, when building an architecture of this complexity, not everything worked perfectly the first time. Code that looked great in theory stumbled over various obstacles in the practical world due to the nature of operating systems. This is exactly where real engineering began.

1. A Crash Brought by Speed: “Race Condition”

One of the most interesting errors we encountered while developing the project was the ssh: connect to host ... port 22: Connection refused error. Terraform was such a fast tool that the moment the machine booted in Hyper-V and got its IP address, Terraform tried to dive in via SSH (Port 22) within milliseconds. However, the SSH service inside the Debian operating system needed a few more seconds to wake up and start accepting connections. Because the door wasn’t open yet, Terraform threw a “Connection refused” error.

  • The Solution: We added a “human margin” (Breathing time) to the code. After finding the IP, we gave the system a 20-second wake-up period using the Start-Sleep -Seconds 20 command. This allowed the SSH service to initialize, and the connection occurred flawlessly.
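A sketch of that fix, with var.master_ip as a placeholder; the fixed sleep matches the solution above, and an active probe of port 22 is shown as a possible hardening on top of it:

```hcl
# Hypothetical fragment; in practice this would live inside a null_resource.
provisioner "local-exec" {
  interpreter = ["PowerShell", "-Command"]
  command     = <<-EOT
    # Give the guest a fixed grace period so sshd can start listening...
    Start-Sleep -Seconds 20
    # ...or, more robustly, keep probing port 22 until it actually answers.
    while (-not (Test-NetConnection -ComputerName "${var.master_ip}" -Port 22).TcpTestSucceeded) {
      Start-Sleep -Seconds 5
    }
  EOT
}
```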

2. Logs Mixing with the Password: The “Syntax Error” Crisis

While the Master machine was installing K3s, the operating system was printing logs to the screen like “Downloading packages… 5%… 10%…”. PowerShell captured all these logs and stuffed them right into our master_token.txt file, which was supposed to contain only our password! When the Worker machine read this file and tried to connect to Kubernetes, the system freaked out because it saw download percentages and parentheses instead of a clean token.

  • The Solution: We completely separated the installation phase and the password retrieval phase. First, we performed a “Silent Install” (without printing logs to the screen), and then with a clean command (cat /var/lib/rancher/k3s/server/node-token), we extracted only the password, cleaned up the leading and trailing spaces using .Trim(), and wrote it to the file.
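The clean retrieval step could look like the following sketch; the SSH user and var.master_ip are assumptions, while the token path is the standard K3s location mentioned above:

```hcl
# Hypothetical fragment illustrating the separated token-retrieval step.
provisioner "local-exec" {
  interpreter = ["PowerShell", "-Command"]
  command     = <<-EOT
    # Read ONLY the token file over SSH, so no installer logs can leak in,
    # then strip stray whitespace/newlines before persisting it.
    $token = ssh debian@${var.master_ip} "sudo cat /var/lib/rancher/k3s/server/node-token"
    Set-Content -Path "master_token.txt" -Value $token.Trim()
  EOT
}
```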

3. Kubernetes Syntax Evolution: PostgreSQL Failing to Install

While Nginx and Redis were running perfectly in the Enterprise Stack section, we noticed that PostgreSQL was missing from the list. The issue stemmed from a version difference in the Kubernetes CLI (kubectl): newer versions stopped accepting the --env (password definition) parameter directly when creating a Deployment.

  • The Solution: As soon as we identified the problem, we intervened in the system with manual commands and brought Postgres online. Immediately after, to stay true to our main goal of “Full Automation,” we updated our Terraform code. By splitting the process into two steps (create deployment && set env), we made our main.tf file flawless and error-free.
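The two-step shape of that fix can be sketched like this; the deployment name and var.db_password are illustrative placeholders:

```hcl
# Hypothetical fragment showing the create-then-set-env split.
provisioner "remote-exec" {
  inline = [
    # Step 1: create the Deployment without any --env flag...
    "sudo kubectl create deployment postgres-db --image=postgres",
    # Step 2: ...then inject the password with kubectl set env, which newer
    # kubectl versions support as a separate command.
    "sudo kubectl set env deployment/postgres-db POSTGRES_PASSWORD=${var.db_password}",
  ]
}
```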

Detecting these errors, analyzing their causes, and solving them proved to us wonderfully that this project is not just about writing code; it is an analytical process that requires understanding the behavior of systems.

Conclusion and Future Vision

At the end of the day, what do we have? A folder sitting on the desktop and a few configuration files inside it… But when the commands inside that folder are executed, a power is unleashed: one that virtualizes physical servers, resolves network configurations, connects to operating systems and assigns them tasks, boots up a giant cloud orchestrator like Kubernetes, and deploys all of a company’s needs (Web, Database, Cache, Monitoring) onto it in a matter of seconds.

This project is proof of how companies can save themselves in moments of disaster (Disaster Recovery), where even seconds matter. It is the product of a vision that transforms the sentence “The servers burned down, everything is lost” into the comfort of “No problem, we are re-running the automation script; we will be live in 5 minutes.”

Thank you for reading and accompanying me on this technical journey. May we continue to experience the indescribable thrill of seeing those green (“Ready” and “Running”) texts in the terminal!