A complete walkthrough: take 4 virtual machines with GPUs, build a Kubernetes cluster with Kubespray, install the GPU stack, deploy NVIDIA Dynamo, and serve a 70B model. With analogies, diagrams, and every command.
You have 4 virtual machines, each with an NVIDIA GPU. Your goal: deploy a Llama-3-70B model and serve it via an OpenAI-compatible API. Here's the setup:
| VM | Hostname | IP | GPU | RAM | Role |
|---|---|---|---|---|---|
| VM 1 | master | 10.0.0.10 | H100 80GB | 256 GB | Control plane + etcd |
| VM 2 | gpu-worker-1 | 10.0.0.11 | H100 80GB | 256 GB | Worker (prefill) |
| VM 3 | gpu-worker-2 | 10.0.0.12 | H100 80GB | 256 GB | Worker (prefill/decode) |
| VM 4 | gpu-worker-3 | 10.0.0.13 | H100 80GB | 256 GB | Worker (decode) |
Plus a 5th machine (your laptop or a jump box) that runs Kubespray to orchestrate everything. It doesn't need a GPU.
Your Network Layout:

  Your Network (10.0.0.0/24)

  ┌────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │   master   │  │ gpu-worker-1 │  │ gpu-worker-2 │  │ gpu-worker-3 │
  │ 10.0.0.10  │  │  10.0.0.11   │  │  10.0.0.12   │  │  10.0.0.13   │
  │   [H100]   │  │    [H100]    │  │    [H100]    │  │    [H100]    │
  │    K8s     │  │     K8s      │  │     K8s      │  │     K8s      │
  │  Control   │  │    Worker    │  │    Worker    │  │    Worker    │
  │   Plane    │  │     Node     │  │     Node     │  │     Node     │
  └─────▲──────┘  └──────▲───────┘  └──────▲───────┘  └──────▲───────┘
        │                │       SSH       │                 │
        └────────────────┴────────┬────────┴─────────────────┘
                                  │
                         ┌────────┴─────┐
                         │   Deployer   │  Your laptop / jump box
                         │  (Kubespray) │  Runs Ansible playbooks
                         └──────────────┘
Think of this like building a factory from scratch. You have 4 empty warehouses (VMs). You need to: install electrical wiring (Kubernetes), set up the assembly line (GPU drivers + NVIDIA operator), bring in the machinery (NVIDIA Dynamo), and start production (serve the model). Kubespray is like a general contractor — you hand it the blueprints (inventory file) and it wires up everything automatically.
Kubespray is an open-source tool that uses Ansible playbooks to deploy production-ready Kubernetes clusters. It works on bare metal, VMs, or cloud. You define your nodes in an inventory file, run one command, and Kubespray configures everything: container runtime, etcd, control plane, networking, DNS, and worker nodes.
If kubeadm is like building IKEA furniture step-by-step (you run each command manually), then Kubespray is like hiring a professional assembler. You say "I want 1 control plane and 3 workers" and it handles the 500+ configuration steps automatically — in the right order, with retries, idempotently.
| Tool | Approach | Best for |
|---|---|---|
| kubeadm | Manual step-by-step CLI | Learning, single-node |
| Kubespray | Ansible automation (declarative) | Production bare-metal/VM clusters |
| k3s | Lightweight single binary | Edge, IoT, dev environments |
| EKS/GKE/AKS | Managed cloud service | Cloud-native, no infra management |
For GPU inference clusters on bare-metal VMs, Kubespray is the standard choice. It supports Calico/Cilium networking, HA control planes, and custom configurations needed for GPU workloads.
Before Kubespray can do its thing, each VM needs basic setup. Think of this as pouring the concrete foundation before building the factory.
Install Ubuntu 24.04 on all VMs. Ensure SSH is running and you can reach each one from your deployer machine.
# On each VM: ensure SSH is running
sudo systemctl enable ssh
sudo systemctl start ssh

# On your deployer machine: set up /etc/hosts for convenience
sudo tee -a /etc/hosts << 'EOF'
10.0.0.10 master
10.0.0.11 gpu-worker-1
10.0.0.12 gpu-worker-2
10.0.0.13 gpu-worker-3
EOF
Kubespray (Ansible) needs passwordless SSH access to every node.
# On deployer: generate key if you don't have one
ssh-keygen -t ed25519 -N ""

# Copy to all nodes
ssh-copy-id master
ssh-copy-id gpu-worker-1
ssh-copy-id gpu-worker-2
ssh-copy-id gpu-worker-3

# Test: should log in without a password
ssh gpu-worker-1 "hostname"
# → gpu-worker-1
Kubespray needs these on every node:
# On ALL 4 VMs:
sudo apt update && sudo apt upgrade -y

# Enable IPv4 forwarding (required for K8s pod networking)
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Disable swap (Kubernetes requirement)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Disable firewall (or configure rules yourself; Kubespray doesn't manage it)
sudo ufw disable
GPU pre-check: verify the GPU is visible on each worker with lspci | grep -i nvidia. You do NOT need to install NVIDIA drivers manually; the GPU Operator (Phase 3) handles that inside Kubernetes.
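If you want one last sanity check before handing things to Kubespray, a short loop from the deployer can confirm each worker sees its GPU and that swap and IP forwarding are in the expected state. This is a minimal sketch; the hostnames are the ones added to /etc/hosts earlier.

# Run from the deployer
for host in gpu-worker-1 gpu-worker-2 gpu-worker-3; do
  echo "== ${host} =="
  ssh "${host}" 'lspci | grep -i nvidia'                                                # GPU visible on the PCI bus?
  ssh "${host}" 'swapon --show | grep -q . && echo "swap STILL ON" || echo "swap off"'  # should print "swap off"
  ssh "${host}" 'sysctl net.ipv4.ip_forward'                                            # should print "= 1"
done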
This is where the magic happens. One command, ~20 minutes, and you have a production K8s cluster.
Kubespray is like sending a team of electricians, plumbers, and carpenters into your 4 warehouses simultaneously. They install wiring (container runtime), plumbing (pod networking), structure (control plane), and connect everything. You just handed them the blueprint (inventory.ini).
# On deployer machine
mkdir ~/k8s-deploy && cd ~/k8s-deploy
git clone https://github.com/kubernetes-sigs/kubespray.git --branch release-2.26
cd kubespray
This file tells Kubespray which VMs are control planes, which are workers, and where etcd runs.
# Copy the sample inventory
cp -rf inventory/sample inventory/gpu-cluster

# Edit the inventory file
cat > inventory/gpu-cluster/inventory.ini << 'EOF'
[all]
master ansible_host=10.0.0.10 ip=10.0.0.10
gpu-worker-1 ansible_host=10.0.0.11 ip=10.0.0.11
gpu-worker-2 ansible_host=10.0.0.12 ip=10.0.0.12
gpu-worker-3 ansible_host=10.0.0.13 ip=10.0.0.13

[kube_control_plane]
master

[etcd]
master

[kube_node]
gpu-worker-1
gpu-worker-2
gpu-worker-3

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
EOF
What this means:
- [kube_control_plane] → master runs the K8s API server, scheduler, and controller manager
- [etcd] → master runs the cluster database
- [kube_node] → the 3 GPU VMs are worker nodes (where pods actually run)

  Kubernetes Cluster

          ┌────────────────────┐
          │       master       │
          │  - API Server      │  ← brain of the cluster
          │  - Scheduler       │
          │  - etcd            │  ← cluster database
          │  - Controller      │
          └─────────┬──────────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌─────────┐  ┌─────────┐  ┌─────────┐
  │  gpu-1  │  │  gpu-2  │  │  gpu-3  │   ← workers
  │ [H100]  │  │ [H100]  │  │ [H100]  │   ← GPUs here
  │ kubelet │  │ kubelet │  │ kubelet │   ← runs pods
  └─────────┘  └─────────┘  └─────────┘
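Before running the full playbook, it is worth confirming that Ansible can actually reach every host in this inventory. A minimal connectivity check, assuming Ansible and Kubespray's Python requirements are installed on the deployer (Option B below); inside the Kubespray container (Option A) the inventory is mounted at /inventory instead.

# From the kubespray directory on the deployer
ansible -i inventory/gpu-cluster/inventory.ini all -m ping
# Expected: all four hosts reply with "ping": "pong"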
# Enable containerd and configure NVIDIA runtime support
cat >> inventory/gpu-cluster/group_vars/k8s_cluster/k8s-cluster.yml << 'EOF'

# Use containerd as container runtime (required for GPU operator)
container_manager: containerd

# Enable Helm (we'll need it for GPU operator + Dynamo)
helm_enabled: true
EOF
# Option A: Run via Docker (recommended: all deps included)
docker run --rm -it \
--mount type=bind,source="$(pwd)"/inventory/gpu-cluster,dst=/inventory \
--mount type=bind,source="${HOME}"/.ssh/id_ed25519,dst=/root/.ssh/id_rsa \
quay.io/kubespray/kubespray:v2.26.0 bash
# Inside container:
ansible-playbook -i /inventory/inventory.ini \
--private-key /root/.ssh/id_rsa \
--become \
cluster.yml
# Option B: Run directly (need Ansible + deps installed)
pip install -r requirements.txt
ansible-playbook -i inventory/gpu-cluster/inventory.ini \
--become \
cluster.yml
# ~20 minutes later...
PLAY RECAP *************************************************************
master        : ok=568  changed=126  unreachable=0  failed=0
gpu-worker-1  : ok=365  changed=80   unreachable=0  failed=0
gpu-worker-2  : ok=365  changed=80   unreachable=0  failed=0
gpu-worker-3  : ok=365  changed=80   unreachable=0  failed=0

# SSH to master and verify:
ssh master "kubectl get nodes -o wide"

NAME           STATUS   ROLES           AGE   VERSION   INTERNAL-IP
master         Ready    control-plane   18m   v1.30.2   10.0.0.10
gpu-worker-1   Ready    <none>          16m   v1.30.2   10.0.0.11
gpu-worker-2   Ready    <none>          16m   v1.30.2   10.0.0.12
gpu-worker-3   Ready    <none>          16m   v1.30.2   10.0.0.13
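If you prefer to drive kubectl from the deployer instead of SSH-ing to master every time, you can copy the admin kubeconfig that Kubespray (via kubeadm) leaves on the control plane node. A sketch, assuming kubectl is installed on the deployer and passwordless sudo works on master:

# On the deployer: pull the admin kubeconfig from master
mkdir -p ~/.kube
ssh master "sudo cat /etc/kubernetes/admin.conf" > ~/.kube/config
# Only needed if the file points at a localhost endpoint:
sed -i 's|https://127.0.0.1:6443|https://10.0.0.10:6443|' ~/.kube/config
kubectl get nodes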
Your Kubernetes cluster is running. But the GPUs aren't visible to Kubernetes yet — we need the NVIDIA GPU Operator.
The GPU Operator installs NVIDIA drivers, container toolkit, and device plugins inside Kubernetes so that pods can request GPU resources.
The cluster is wired (K8s), but the warehouses don't have power outlets yet. The GPU Operator is like installing industrial power connections in each warehouse. After this, any machine (pod) you roll in can plug into a GPU and use it.
# On master node:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

# Wait for all pods to be ready (~5 minutes)
kubectl -n gpu-operator get pods -w

# Verify GPUs are visible:
kubectl get nodes -o json | \
  jq '.items[].status.allocatable["nvidia.com/gpu"]'
# → "1" (for each worker node)
After GPU Operator:
kubectl describe node gpu-worker-1 | grep nvidia
Allocatable:
  nvidia.com/gpu: 1        ← Kubernetes now sees the H100!
Pods can now request GPUs:
resources:
  limits:
    nvidia.com/gpu: "1"    ← "give me 1 GPU"
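A quick end-to-end check: launch a throwaway pod that requests one GPU and runs nvidia-smi. This is a minimal sketch; any image that ships nvidia-smi works, and the CUDA base image tag below is just one example.

# Throwaway pod that requests 1 GPU and prints nvidia-smi output
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF

kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=180s
kubectl logs gpu-smoke-test      # should print the H100 details
kubectl delete pod gpu-smoke-test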
Now we install Dynamo's Kubernetes platform and deploy a model across our 3 GPU workers with disaggregated serving.
The factories have power (GPUs accessible). Now we need to install the production line (Dynamo). Dynamo's components are: the order intake counter (Frontend), the dispatcher (Router), the prep kitchen (Prefill workers), and the assembly line (Decode workers). Each runs in its own pod on the GPU workers.
# Install Dynamo CRDs
export VERSION=0.9.0
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
# Install Dynamo Platform (operator, controllers)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n dynamo-system --create-namespace
# Create HuggingFace secret for model downloads
kubectl create secret generic hf-secret \
-n dynamo-system \
--from-literal=HF_TOKEN=hf_your_token_here
# Create the deployment manifest
cat > llama70b-disagg.yaml << 'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
  namespace: dynamo-system
spec:
  services:
    Frontend:
      replicas: 1                  # Runs on master or any node
    PrefillWorker:
      replicas: 1                  # gpu-worker-1
      resources:
        limits:
          nvidia.com/gpu: "1"      # 1x H100 for prefill
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
        - name: TENSOR_PARALLEL_SIZE
          value: "1"
    DecodeWorker:
      replicas: 2                  # gpu-worker-2 + gpu-worker-3
      resources:
        limits:
          nvidia.com/gpu: "1"      # 1x H100 each for decode
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
EOF
kubectl apply -f llama70b-disagg.yaml
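Once applied, you can watch the custom resource and its pods come up; the first start is dominated by the model download. The plural resource name below is the conventional lowercase form of the CRD kind; check kubectl api-resources if it differs in your version.

# Watch the graph deployment and its pods
kubectl -n dynamo-system get dynamographdeployments
kubectl -n dynamo-system describe dynamographdeployment llama-70b-disagg   # events show scheduling/download progress
kubectl -n dynamo-system get pods -w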
What this deploys across your cluster:
master (10.0.0.10):
┌───────────────────────────────────────┐
│ K8s Control Plane                     │
│ + Frontend Pod (OpenAI API :8000)     │
│ + Router Pod (KV-aware routing)       │
└──────────────────┬────────────────────┘
                   │
                   │ routes requests based on KV cache overlap
                   │
        ┌──────────┴──────────────────────────────┐
        │                                         │
gpu-worker-1 (10.0.0.11):               gpu-worker-2 + 3:
┌────────────────────────┐              ┌────────────────────────┐
│ Prefill Worker Pod     │──NIXL───────▶│ Decode Worker Pods     │
│ [H100 → compute]       │  KV cache    │ [H100 → bandwidth]     │
│                        │  transfer    │ [H100 → bandwidth]     │
│ Processes prompts      │              │ Generates tokens       │
│ at max FLOPS           │              │ at max batch size      │
└────────────────────────┘              └────────────────────────┘

1 prefill GPU + 2 decode GPUs = disaggregated serving
Prefill blasts through prompts → NIXL transfers KV cache → Decode generates tokens
# Check all pods
kubectl -n dynamo-system get pods -o wide
NAME READY STATUS NODE
llama-70b-frontend-xxxxx 1/1 Running master
llama-70b-prefill-worker-xxxxx 1/1 Running gpu-worker-1
llama-70b-decode-worker-0-xxxxx 1/1 Running gpu-worker-2
llama-70b-decode-worker-1-xxxxx 1/1 Running gpu-worker-3
# Check GPU allocation
kubectl -n dynamo-system describe pod llama-70b-prefill-worker-xxxxx | grep gpu
nvidia.com/gpu: 1
# Test the API
curl http://10.0.0.10:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-70B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"max_tokens": 100
}'
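For scripting, a non-streamed request is easier to parse. This sketch extracts just the assistant's reply with jq (installed on the machine making the call), relying only on the standard OpenAI-compatible response shape:

# Non-streaming request, parsed with jq
curl -s http://10.0.0.10:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-70B",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    "stream": false,
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'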
You: curl http://10.0.0.10:8000/v1/chat/completions -d '{"messages":[...]}'
  │
  ▼
┌────────────────────────────────────────────────────────────────────┐
│ master (10.0.0.10)                                                 │
│                                                                    │
│ [Frontend Pod]  ← receives HTTP request, tokenizes, validates      │
│       │                                                            │
│       ▼                                                            │
│ [Router Pod]    ← checks KV cache radix tree:                      │
│                   "gpu-worker-2 has 78% prefix cached,             │
│                    gpu-worker-3 has 12% → route decode to worker-2"│
│       │                                                            │
│       │ routes to prefill worker                                   │
└───────┼────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ gpu-worker-1 (10.0.0.11): PREFILL               │
│                                                 │
│ [Prefill Worker Pod]                            │
│  - Receives full prompt tokens                  │
│  - Runs forward pass through all 80 layers      │
│  - Generates KV cache (~1.3 GB)                 │
│  - Produces first token "Imagine"               │
│  - GPU compute: 85% utilized                    │
│  - Time: ~45ms (with TP=1)                      │
│                                                 │
│  KV cache ready → trigger NIXL transfer         │
└───────────────────┬─────────────────────────────┘
                    │
                    │ NIXL KV transfer
                    │ (~1.3 GB via NVLink or InfiniBand)
                    │ Time: ~1-30ms depending on interconnect
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ gpu-worker-2 (10.0.0.12): DECODE                │
│                                                 │
│ [Decode Worker Pod]                             │
│  - Receives KV cache from prefill               │
│  - Generates tokens one by one                  │
│  - Reads 131 GB/token (weights + KV cache)      │
│  - GPU bandwidth: 88% utilized                  │
│  - ~40ms per token, 25 tok/s per user           │
│  - Batches multiple users for efficiency        │
│                                                 │
│  Streams tokens back via Frontend               │
└─────────────────────────────────────────────────┘
        │
        │ tokens stream back
        ▼
You see: "Imagine a tiny factory inside your computer..."
Dynamo's Planner watches TTFT (time to first token) and ITL (inter-token latency). When SLAs are breached, it rebalances:
Low traffic (2 users):
  gpu-worker-1: [PREFILL]   ← 1 GPU prefilling
  gpu-worker-2: [DECODE]    ← 1 GPU decoding
  gpu-worker-3: [DECODE]    ← 1 GPU decoding

Traffic spike (20 users):
  Planner detects: prefill queue depth = 15 (too high!)
  gpu-worker-1: [PREFILL]   ← 1 GPU prefilling
  gpu-worker-2: [PREFILL]   ← REASSIGNED from decode to prefill!
  gpu-worker-3: [DECODE]    ← 1 GPU decoding (batches all 20 users)

  Planner log: "Scaled prefill 1→2, decode 2→1. Queue depth: 15→3. SLA met."

Traffic back to normal:
  Planner scales back: 1 prefill, 2 decode.
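You can watch a rebalance happen from the outside just by observing the pods while you generate load. The commands below are plain kubectl; the planner pod's exact name depends on the release, so find it first and substitute it.

# Watch worker pods and their node placement change under load
kubectl -n dynamo-system get pods -o wide -w

# Find the planner pod, then tail its logs to see scaling decisions
kubectl -n dynamo-system get pods | grep -i planner
kubectl -n dynamo-system logs -f <planner-pod-name>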
Why not just add more VMs? You absolutely can. Kubespray makes it easy to add worker nodes — just add them to inventory.ini and re-run the playbook. Dynamo's Planner will automatically discover new GPU workers and start scheduling on them.
# Adding a 5th GPU VM:

# 1. Add to inventory.ini under [kube_node]:
#    gpu-worker-4 ansible_host=10.0.0.14 ip=10.0.0.14

# 2. Run the Kubespray scale playbook:
ansible-playbook -i inventory/gpu-cluster/inventory.ini \
  --become scale.yml

# 3. Install GPU Operator on the new node (automatic: the operator deploys via DaemonSets)

# 4. Update the DynamoGraphDeployment to use the new worker
#    → Dynamo Planner auto-discovers it
| Phase | What | Tool | Time |
|---|---|---|---|
| 0 | 4 bare Ubuntu VMs with GPUs | Manual / Terraform | varies |
| 1 | SSH setup, system prep, disable swap | Manual / Ansible | ~10 min |
| 2 | Deploy K8s cluster (control plane + 3 workers) | Kubespray | ~20 min |
| 3 | Install GPU Operator (drivers + device plugin) | Helm | ~5 min |
| 4 | Install Dynamo Platform (CRDs + operator) | Helm | ~3 min |
| 5 | Deploy Llama-3-70B disaggregated | kubectl apply | ~10 min (model download) |
| 6 | Serve inference via OpenAI API | curl / SDK | immediate |
| | Total: bare VMs → serving LLM inference | | ~50 minutes |
The Stack (bottom to top):

┌───────────────────────────────────────────┐
│ Your Application / OpenAI SDK             │  ← You send prompts here
├───────────────────────────────────────────┤
│ Dynamo Frontend (OpenAI API)              │  ← HTTP server
│ Dynamo Router (KV-aware)                  │  ← Smart routing
│ Dynamo Workers (prefill + decode)         │  ← LLM inference
│ Dynamo NIXL (KV transfer)                 │  ← Data movement
├───────────────────────────────────────────┤
│ NVIDIA GPU Operator                       │  ← Drivers, toolkit, device plugin
├───────────────────────────────────────────┤
│ Kubernetes (via Kubespray)                │  ← Container orchestration
│  - containerd, calico, coredns, etcd      │
├───────────────────────────────────────────┤
│ Ubuntu 24.04                              │  ← Operating system
├───────────────────────────────────────────┤
│ 4× VMs with H100 GPUs                     │  ← Hardware
└───────────────────────────────────────────┘
The final analogy: You started with 4 empty warehouses (VMs). Kubespray wired them into a connected industrial park (K8s cluster). The GPU Operator installed heavy-duty power (GPU access). NVIDIA Dynamo set up the production line — a receiving dock (Frontend), a smart dispatcher (Router), a rapid prep kitchen (Prefill GPU), and an assembly line (Decode GPUs) connected by conveyor belts (NIXL). Now you're shipping products (tokens) to customers (API users) at scale.
Built as an educational resource.
Kubespray · NVIDIA Dynamo