A complete walkthrough: take 4 virtual machines with GPUs, build a Kubernetes cluster with Kubespray, install the GPU stack, deploy NVIDIA Dynamo, and serve a 70B model. With analogies, diagrams, and every command.
You have 4 virtual machines, each with an NVIDIA GPU. Your goal: deploy a Llama-3-70B model and serve it via an OpenAI-compatible API. Here's the setup:
| VM | Hostname | IP | GPU | RAM | Role |
|---|---|---|---|---|---|
| VM 1 | master | 10.0.0.10 | H100 80GB | 256 GB | Control plane + etcd |
| VM 2 | gpu-worker-1 | 10.0.0.11 | H100 80GB | 256 GB | Worker (prefill) |
| VM 3 | gpu-worker-2 | 10.0.0.12 | H100 80GB | 256 GB | Worker (prefill/decode) |
| VM 4 | gpu-worker-3 | 10.0.0.13 | H100 80GB | 256 GB | Worker (decode) |
Plus a 5th machine (your laptop or a jump box) that runs Kubespray to orchestrate everything. It doesn't need a GPU.
Your Network Layout:

  Your Network (10.0.0.0/24)

  ┌────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │   master   │  │ gpu-worker-1 │  │ gpu-worker-2 │  │ gpu-worker-3 │
  │ 10.0.0.10  │  │  10.0.0.11   │  │  10.0.0.12   │  │  10.0.0.13   │
  │   [H100]   │  │    [H100]    │  │    [H100]    │  │    [H100]    │
  │    K8s     │  │     K8s      │  │     K8s      │  │     K8s      │
  │  Control   │  │    Worker    │  │    Worker    │  │    Worker    │
  │   Plane    │  │     Node     │  │     Node     │  │     Node     │
  └─────▲──────┘  └──────▲───────┘  └──────▲───────┘  └──────▲───────┘
        │                │       SSH       │                 │
        └────────────────┴────────┬────────┴─────────────────┘
                                  │
                         ┌────────┴─────┐
                         │   Deployer   │  Your laptop / jump box
                         │  (Kubespray) │  Runs Ansible playbooks
                         └──────────────┘
Think of this like building a factory from scratch. You have 4 empty warehouses (VMs). You need to: install electrical wiring (Kubernetes), set up the assembly line (GPU drivers + NVIDIA operator), bring in the machinery (NVIDIA Dynamo), and start production (serve the model). Kubespray is like a general contractor — you hand it the blueprints (inventory file) and it wires up everything automatically.
Kubespray is an open-source tool that uses Ansible playbooks to deploy production-ready Kubernetes clusters. It works on bare metal, VMs, or cloud. You define your nodes in an inventory file, run one command, and Kubespray configures everything: container runtime, etcd, control plane, networking, DNS, and worker nodes.
If kubeadm is like building IKEA furniture step-by-step (you run each command manually), then Kubespray is like hiring a professional assembler. You say "I want 1 control plane and 3 workers" and it handles the 500+ configuration steps automatically — in the right order, with retries, idempotently.
| Tool | Approach | Best for |
|---|---|---|
| kubeadm | Manual step-by-step CLI | Learning, single-node |
| Kubespray | Ansible automation (declarative) | Production bare-metal/VM clusters |
| k3s | Lightweight single binary | Edge, IoT, dev environments |
| EKS/GKE/AKS | Managed cloud service | Cloud-native, no infra management |
For GPU inference clusters on bare-metal VMs, Kubespray is the standard choice. It supports Calico/Cilium networking, HA control planes, and custom configurations needed for GPU workloads.
Before Kubespray can do its thing, each VM needs basic setup. Think of this as pouring the concrete foundation before building the factory.
Install Ubuntu 24.04 on all VMs. Ensure SSH is running and you can reach each one from your deployer machine.
# On each VM: ensure SSH is running
sudo systemctl enable ssh
sudo systemctl start ssh

# On your deployer machine: set up /etc/hosts for convenience
sudo tee -a /etc/hosts << 'EOF'
10.0.0.10 master
10.0.0.11 gpu-worker-1
10.0.0.12 gpu-worker-2
10.0.0.13 gpu-worker-3
EOF
Kubespray (Ansible) needs passwordless SSH access to every node.
# On deployer: generate key if you don't have one
ssh-keygen -t ed25519 -N ""

# Copy to all nodes
ssh-copy-id master
ssh-copy-id gpu-worker-1
ssh-copy-id gpu-worker-2
ssh-copy-id gpu-worker-3

# Test: should log in without a password
ssh gpu-worker-1 "hostname"
# → gpu-worker-1
Kubespray needs these on every node:
# On ALL 4 VMs:
sudo apt update && sudo apt upgrade -y

# Enable IPv4 forwarding (required for K8s pod networking)
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Disable swap (Kubernetes requirement)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Disable firewall (or configure rules yourself; Kubespray doesn't manage it)
sudo ufw disable
GPU pre-check: verify the GPU is visible on each worker with lspci | grep -i nvidia. You do NOT need to install NVIDIA drivers manually; the GPU Operator (Phase 3) handles that inside Kubernetes.
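If you want one last sanity check before handing things to Kubespray, a short loop from the deployer can confirm each worker sees its GPU and that swap and IP forwarding are in the expected state. This is a minimal sketch; the hostnames are the ones added to /etc/hosts earlier.

# Run from the deployer
for host in gpu-worker-1 gpu-worker-2 gpu-worker-3; do
  echo "== ${host} =="
  ssh "${host}" 'lspci | grep -i nvidia'                                                # GPU visible on the PCI bus?
  ssh "${host}" 'swapon --show | grep -q . && echo "swap STILL ON" || echo "swap off"'  # should print "swap off"
  ssh "${host}" 'sysctl net.ipv4.ip_forward'                                            # should print "= 1"
done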
This is where the magic happens. One command, ~20 minutes, and you have a production K8s cluster.
Kubespray is like sending a team of electricians, plumbers, and carpenters into your 4 warehouses simultaneously. They install wiring (container runtime), plumbing (pod networking), structure (control plane), and connect everything. You just handed them the blueprint (inventory.ini).
# On deployer machine
mkdir ~/k8s-deploy && cd ~/k8s-deploy
git clone https://github.com/kubernetes-sigs/kubespray.git --branch release-2.26
cd kubespray
This file tells Kubespray which VMs are control planes, which are workers, and where etcd runs.
# Copy the sample inventory
cp -rf inventory/sample inventory/gpu-cluster

# Edit the inventory file
cat > inventory/gpu-cluster/inventory.ini << 'EOF'
[all]
master ansible_host=10.0.0.10 ip=10.0.0.10
gpu-worker-1 ansible_host=10.0.0.11 ip=10.0.0.11
gpu-worker-2 ansible_host=10.0.0.12 ip=10.0.0.12
gpu-worker-3 ansible_host=10.0.0.13 ip=10.0.0.13

[kube_control_plane]
master

[etcd]
master

[kube_node]
gpu-worker-1
gpu-worker-2
gpu-worker-3

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
EOF
What this means:
- [kube_control_plane] → master runs the K8s API server, scheduler, and controller manager
- [etcd] → master runs the cluster database
- [kube_node] → the 3 GPU VMs are worker nodes (where pods actually run)

  Kubernetes Cluster

          ┌────────────────────┐
          │       master       │
          │  - API Server      │  ← brain of the cluster
          │  - Scheduler       │
          │  - etcd            │  ← cluster database
          │  - Controller      │
          └─────────┬──────────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌─────────┐  ┌─────────┐  ┌─────────┐
  │  gpu-1  │  │  gpu-2  │  │  gpu-3  │   ← workers
  │ [H100]  │  │ [H100]  │  │ [H100]  │   ← GPUs here
  │ kubelet │  │ kubelet │  │ kubelet │   ← runs pods
  └─────────┘  └─────────┘  └─────────┘
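Before running the full playbook, it is worth confirming that Ansible can actually reach every host in this inventory. A minimal connectivity check, assuming Ansible and Kubespray's Python requirements are installed on the deployer (Option B below); inside the Kubespray container (Option A) the inventory is mounted at /inventory instead.

# From the kubespray directory on the deployer
ansible -i inventory/gpu-cluster/inventory.ini all -m ping
# Expected: all four hosts reply with "ping": "pong"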
# Enable containerd and configure NVIDIA runtime support
cat >> inventory/gpu-cluster/group_vars/k8s_cluster/k8s-cluster.yml << 'EOF'

# Use containerd as container runtime (required for GPU operator)
container_manager: containerd

# Enable Helm (we'll need it for GPU operator + Dynamo)
helm_enabled: true
EOF
# Option A: Run via Docker (recommended: all deps included)
docker run --rm -it \
--mount type=bind,source="$(pwd)"/inventory/gpu-cluster,dst=/inventory \
--mount type=bind,source="${HOME}"/.ssh/id_ed25519,dst=/root/.ssh/id_rsa \
quay.io/kubespray/kubespray:v2.26.0 bash
# Inside container:
ansible-playbook -i /inventory/inventory.ini \
--private-key /root/.ssh/id_rsa \
--become \
cluster.yml
# Option B: Run directly (need Ansible + deps installed)
pip install -r requirements.txt
ansible-playbook -i inventory/gpu-cluster/inventory.ini \
--become \
cluster.yml
# ~20 minutes later...
PLAY RECAP *************************************************************
master        : ok=568  changed=126  unreachable=0  failed=0
gpu-worker-1  : ok=365  changed=80   unreachable=0  failed=0
gpu-worker-2  : ok=365  changed=80   unreachable=0  failed=0
gpu-worker-3  : ok=365  changed=80   unreachable=0  failed=0

# SSH to master and verify:
ssh master "kubectl get nodes -o wide"

NAME           STATUS   ROLES           AGE   VERSION   INTERNAL-IP
master         Ready    control-plane   18m   v1.30.2   10.0.0.10
gpu-worker-1   Ready    <none>          16m   v1.30.2   10.0.0.11
gpu-worker-2   Ready    <none>          16m   v1.30.2   10.0.0.12
gpu-worker-3   Ready    <none>          16m   v1.30.2   10.0.0.13
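If you prefer to drive kubectl from the deployer instead of SSH-ing to master every time, you can copy the admin kubeconfig that Kubespray (via kubeadm) leaves on the control plane node. A sketch, assuming kubectl is installed on the deployer and passwordless sudo works on master:

# On the deployer: pull the admin kubeconfig from master
mkdir -p ~/.kube
ssh master "sudo cat /etc/kubernetes/admin.conf" > ~/.kube/config
# Only needed if the file points at a localhost endpoint:
sed -i 's|https://127.0.0.1:6443|https://10.0.0.10:6443|' ~/.kube/config
kubectl get nodes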
Your Kubernetes cluster is running. But the GPUs aren't visible to Kubernetes yet — we need the NVIDIA GPU Operator.
The GPU Operator installs NVIDIA drivers, container toolkit, and device plugins inside Kubernetes so that pods can request GPU resources.
The cluster is wired (K8s), but the warehouses don't have power outlets yet. The GPU Operator is like installing industrial power connections in each warehouse. After this, any machine (pod) you roll in can plug into a GPU and use it.
# On master node:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

# Wait for all pods to be ready (~5 minutes)
kubectl -n gpu-operator get pods -w

# Verify GPUs are visible:
kubectl get nodes -o json | \
  jq '.items[].status.allocatable["nvidia.com/gpu"]'
# → "1" (for each worker node)
After GPU Operator:
kubectl describe node gpu-worker-1 | grep nvidia
Allocatable:
  nvidia.com/gpu: 1        ← Kubernetes now sees the H100!
Pods can now request GPUs:
resources:
  limits:
    nvidia.com/gpu: "1"    ← "give me 1 GPU"
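A quick end-to-end check: launch a throwaway pod that requests one GPU and runs nvidia-smi. This is a minimal sketch; any image that ships nvidia-smi works, and the CUDA base image tag below is just one example.

# Throwaway pod that requests 1 GPU and prints nvidia-smi output
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF

kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=180s
kubectl logs gpu-smoke-test      # should print the H100 details
kubectl delete pod gpu-smoke-test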
Now we install Dynamo's Kubernetes platform and deploy a model across our 3 GPU workers with disaggregated serving.
The factories have power (GPUs accessible). Now we need to install the production line (Dynamo). Dynamo's components are: the order intake counter (Frontend), the dispatcher (Router), the prep kitchen (Prefill workers), and the assembly line (Decode workers). Each runs in its own pod on the GPU workers.
# Install Dynamo CRDs
export VERSION=0.9.0
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
# Install Dynamo Platform (operator, controllers)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n dynamo-system --create-namespace
# Create HuggingFace secret for model downloads
kubectl create secret generic hf-secret \
-n dynamo-system \
--from-literal=HF_TOKEN=hf_your_token_here
# Create the deployment manifest
cat > llama70b-disagg.yaml << 'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
  namespace: dynamo-system
spec:
  services:
    Frontend:
      replicas: 1                  # Runs on master or any node
    PrefillWorker:
      replicas: 1                  # gpu-worker-1
      resources:
        limits:
          nvidia.com/gpu: "1"      # 1x H100 for prefill
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
        - name: TENSOR_PARALLEL_SIZE
          value: "1"
    DecodeWorker:
      replicas: 2                  # gpu-worker-2 + gpu-worker-3
      resources:
        limits:
          nvidia.com/gpu: "1"      # 1x H100 each for decode
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
EOF
kubectl apply -f llama70b-disagg.yaml
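Once applied, you can watch the custom resource and its pods come up; the first start is dominated by the model download. The plural resource name below is the conventional lowercase form of the CRD kind; check kubectl api-resources if it differs in your version.

# Watch the graph deployment and its pods
kubectl -n dynamo-system get dynamographdeployments
kubectl -n dynamo-system describe dynamographdeployment llama-70b-disagg   # events show scheduling/download progress
kubectl -n dynamo-system get pods -w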
What this deploys across your cluster:
master (10.0.0.10):
┌───────────────────────────────────────┐
│ K8s Control Plane                     │
│ + Frontend Pod (OpenAI API :8000)     │
│ + Router Pod (KV-aware routing)       │
└──────────────────┬────────────────────┘
                   │
                   │ routes requests based on KV cache overlap
                   │
        ┌──────────┴──────────────────────────────┐
        │                                         │
gpu-worker-1 (10.0.0.11):               gpu-worker-2 + 3:
┌────────────────────────┐              ┌────────────────────────┐
│ Prefill Worker Pod     │──NIXL───────▶│ Decode Worker Pods     │
│ [H100 → compute]       │  KV cache    │ [H100 → bandwidth]     │
│                        │  transfer    │ [H100 → bandwidth]     │
│ Processes prompts      │              │ Generates tokens       │
│ at max FLOPS           │              │ at max batch size      │
└────────────────────────┘              └────────────────────────┘

1 prefill GPU + 2 decode GPUs = disaggregated serving
Prefill blasts through prompts → NIXL transfers KV cache → Decode generates tokens
# Check all pods
kubectl -n dynamo-system get pods -o wide
NAME READY STATUS NODE
llama-70b-frontend-xxxxx 1/1 Running master
llama-70b-prefill-worker-xxxxx 1/1 Running gpu-worker-1
llama-70b-decode-worker-0-xxxxx 1/1 Running gpu-worker-2
llama-70b-decode-worker-1-xxxxx 1/1 Running gpu-worker-3
# Check GPU allocation
kubectl -n dynamo-system describe pod llama-70b-prefill-worker-xxxxx | grep gpu
nvidia.com/gpu: 1
# Test the API
curl http://10.0.0.10:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-70B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"max_tokens": 100
}'
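For scripting, a non-streamed request is easier to parse. This sketch extracts just the assistant's reply with jq (installed on the machine making the call), relying only on the standard OpenAI-compatible response shape:

# Non-streaming request, parsed with jq
curl -s http://10.0.0.10:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-70B",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    "stream": false,
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'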
You: curl http://10.0.0.10:8000/v1/chat/completions -d '{"messages":[...]}'
  │
  ▼
┌────────────────────────────────────────────────────────────────────┐
│ master (10.0.0.10)                                                 │
│                                                                    │
│ [Frontend Pod]  ← receives HTTP request, tokenizes, validates      │
│       │                                                            │
│       ▼                                                            │
│ [Router Pod]    ← checks KV cache radix tree:                      │
│                   "gpu-worker-2 has 78% prefix cached,             │
│                    gpu-worker-3 has 12% → route decode to worker-2"│
│       │                                                            │
│       │ routes to prefill worker                                   │
└───────┼────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│ gpu-worker-1 (10.0.0.11): PREFILL               │
│                                                 │
│ [Prefill Worker Pod]                            │
│  - Receives full prompt tokens                  │
│  - Runs forward pass through all 80 layers      │
│  - Generates KV cache (~1.3 GB)                 │
│  - Produces first token "Imagine"               │
│  - GPU compute: 85% utilized                    │
│  - Time: ~45ms (with TP=1)                      │
│                                                 │
│  KV cache ready → trigger NIXL transfer         │
└───────────────────┬─────────────────────────────┘
                    │
                    │ NIXL KV transfer
                    │ (~1.3 GB via NVLink or InfiniBand)
                    │ Time: ~1-30ms depending on interconnect
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ gpu-worker-2 (10.0.0.12): DECODE                │
│                                                 │
│ [Decode Worker Pod]                             │
│  - Receives KV cache from prefill               │
│  - Generates tokens one by one                  │
│  - Reads 131 GB/token (weights + KV cache)      │
│  - GPU bandwidth: 88% utilized                  │
│  - ~40ms per token, 25 tok/s per user           │
│  - Batches multiple users for efficiency        │
│                                                 │
│  Streams tokens back via Frontend               │
└─────────────────────────────────────────────────┘
        │
        │ tokens stream back
        ▼
You see: "Imagine a tiny factory inside your computer..."
Dynamo's Planner watches TTFT (time to first token) and ITL (inter-token latency). When SLAs are breached, it rebalances:
Low traffic (2 users):
  gpu-worker-1: [PREFILL]   ← 1 GPU prefilling
  gpu-worker-2: [DECODE]    ← 1 GPU decoding
  gpu-worker-3: [DECODE]    ← 1 GPU decoding

Traffic spike (20 users):
  Planner detects: prefill queue depth = 15 (too high!)
  gpu-worker-1: [PREFILL]   ← 1 GPU prefilling
  gpu-worker-2: [PREFILL]   ← REASSIGNED from decode to prefill!
  gpu-worker-3: [DECODE]    ← 1 GPU decoding (batches all 20 users)

  Planner log: "Scaled prefill 1→2, decode 2→1. Queue depth: 15→3. SLA met."

Traffic back to normal:
  Planner scales back: 1 prefill, 2 decode.
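You can watch a rebalance happen from the outside just by observing the pods while you generate load. The commands below are plain kubectl; the planner pod's exact name depends on the release, so find it first and substitute it.

# Watch worker pods and their node placement change under load
kubectl -n dynamo-system get pods -o wide -w

# Find the planner pod, then tail its logs to see scaling decisions
kubectl -n dynamo-system get pods | grep -i planner
kubectl -n dynamo-system logs -f <planner-pod-name>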
Why not just add more VMs? You absolutely can. Kubespray makes it easy to add worker nodes — just add them to inventory.ini and re-run the playbook. Dynamo's Planner will automatically discover new GPU workers and start scheduling on them.
# Adding a 5th GPU VM:

# 1. Add to inventory.ini under [kube_node]:
#    gpu-worker-4 ansible_host=10.0.0.14 ip=10.0.0.14

# 2. Run the Kubespray scale playbook:
ansible-playbook -i inventory/gpu-cluster/inventory.ini \
  --become scale.yml

# 3. Install GPU Operator on the new node (automatic: the operator deploys via DaemonSets)

# 4. Update the DynamoGraphDeployment to use the new worker
#    → Dynamo Planner auto-discovers it
| Phase | What | Tool | Time |
|---|---|---|---|
| 0 | 4 bare Ubuntu VMs with GPUs | Manual / Terraform | varies |
| 1 | SSH setup, system prep, disable swap | Manual / Ansible | ~10 min |
| 2 | Deploy K8s cluster (control plane + 3 workers) | Kubespray | ~20 min |
| 3 | Install GPU Operator (drivers + device plugin) | Helm | ~5 min |
| 4 | Install Dynamo Platform (CRDs + operator) | Helm | ~3 min |
| 5 | Deploy Llama-3-70B disaggregated | kubectl apply | ~10 min (model download) |
| 6 | Serve inference via OpenAI API | curl / SDK | immediate |
| | Total: bare VMs → serving LLM inference | | ~50 minutes |
The Stack (bottom to top):

┌───────────────────────────────────────────┐
│ Your Application / OpenAI SDK             │  ← You send prompts here
├───────────────────────────────────────────┤
│ Dynamo Frontend (OpenAI API)              │  ← HTTP server
│ Dynamo Router (KV-aware)                  │  ← Smart routing
│ Dynamo Workers (prefill + decode)         │  ← LLM inference
│ Dynamo NIXL (KV transfer)                 │  ← Data movement
├───────────────────────────────────────────┤
│ NVIDIA GPU Operator                       │  ← Drivers, toolkit, device plugin
├───────────────────────────────────────────┤
│ Kubernetes (via Kubespray)                │  ← Container orchestration
│  - containerd, calico, coredns, etcd      │
├───────────────────────────────────────────┤
│ Ubuntu 24.04                              │  ← Operating system
├───────────────────────────────────────────┤
│ 4× VMs with H100 GPUs                     │  ← Hardware
└───────────────────────────────────────────┘
The final analogy: You started with 4 empty warehouses (VMs). Kubespray wired them into a connected industrial park (K8s cluster). The GPU Operator installed heavy-duty power (GPU access). NVIDIA Dynamo set up the production line — a receiving dock (Frontend), a smart dispatcher (Router), a rapid prep kitchen (Prefill GPU), and an assembly line (Decode GPUs) connected by conveyor belts (NIXL). Now you're shipping products (tokens) to customers (API users) at scale.
Built as an educational resource.
Kubespray · NVIDIA Dynamo