Executive Summary
Faced with 10‑year‑old Ivy Bridge servers, fragmented GPUs, and mounting energy bills, the lab launched a six‑month modernization from January to June 2025. The goals were to:
- Cut idle power and HVAC costs
- Consolidate scattered compute and GPU resources
- Standardize on IaC for repeatable, vendor‑agnostic deployments
- Prepare for I/O‑intensive AI and CI/CD workloads
Key results:
| Key metric | Early 2025 | June 2025 | Δ |
|---|---|---|---|
| Idle rack draw | ~1 700 W | ~1 000 W | ‑700 W (‑41 %) |
| Annual energy cost¹ | $2 637 | $1 551 | ‑$1 086/yr |
¹At the $0.17/kWh Minnesota residential rate.
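For context, the dollar figures map directly onto average draw via cost = kW × 8 760 h × rate. A minimal Python check of that arithmetic; note the table's costs imply average draws slightly above the rounded wattages shown, presumably because they were computed from unrounded measurements:

```python
# Back-of-the-envelope check of the table above, at the cited $0.17/kWh rate.
RATE_USD_PER_KWH = 0.17
HOURS_PER_YEAR = 8760

def annual_cost_usd(avg_watts: float) -> float:
    """Annual cost of a continuous load: W -> kWh/yr -> USD/yr."""
    return avg_watts / 1000 * HOURS_PER_YEAR * RATE_USD_PER_KWH

def implied_avg_watts(annual_cost: float) -> float:
    """Invert the formula: what average draw does a given annual bill imply?"""
    return annual_cost / RATE_USD_PER_KWH / HOURS_PER_YEAR * 1000

print(f"{annual_cost_usd(700):,.0f} USD/yr")   # ≈ 1,042 USD/yr for the ~700 W cut
print(f"{implied_avg_watts(2637):,.0f} W")     # ≈ 1,771 W average behind the $2,637 figure
print(f"{implied_avg_watts(1551):,.0f} W")     # ≈ 1,041 W average behind the $1,551 figure
```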
Baseline Challenges
| Pain point | Symptom | Business impact |
|---|---|---|
| Excess idle power & heat | 1 700 W draw, noisy fans | +$2.6 k/yr OpEx; thermal stress |
| Fragmented GPU pool | Five hosts, five kernels | Under‑utilized GPUs; manual pinning |
| Aging Ceph disks | SATA + DRAM‑less NVMe failures | Rebuild cycles, sluggish pipelines |
| Manual VM provisioning | Proxmox templates only | Hours to rebuild; config drift |
| Limited IaC | Ad‑hoc scripts | Inconsistent state; hard to audit |
Intervention Strategy
- Compute consolidation
- Retired 4 dual‑socket Ivy Bridge nodes and 1 GPU node
- Added AMD EPYC 7551P (32C, SMT‑off, 512 GB DDR4), redeploying 3 × Tesla P4
- Storage upgrade
- Both HL15 pods → Xeon Gold 6154 (18C, HT‑off) + 256 GB DDR4
- Installed 5 × NVMe OSDs each (PCIe bifurcation)
- Workstation separation
- Added Tesla T40 (24 GB) for local ML; kept RTX 3070 Ti for media
- IaC refactor
- Terraform + Ansible (Kubespray) deploy a 6‑node k8s cluster (3 masters / 3 GPU workers); see the bring‑up sketch after this list
- Back‑end network moved to bonded 10 Gb SFP+; SD‑WAN (Tailscale) for private access
- Ceph modernization
- Replication 3 → 4; CephFS volumes 8 → 2 to cut IOPS amplification
- DRAM‑less NVMe drives replaced with cache‑backed units
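A minimal sketch of the Kubespray side of the bring‑up referenced above: render an Ansible inventory for the 6‑node cluster and run Kubespray's `cluster.yml` playbook. Hostnames, IPs, and paths are placeholders, group names follow recent Kubespray releases, and the Terraform layer that provisions the nodes themselves is omitted.

```python
#!/usr/bin/env python3
"""Render a Kubespray inventory for the 6-node cluster (3 control-plane /
3 GPU workers) and run the cluster playbook. Names and IPs are placeholders."""
import subprocess
from pathlib import Path

CONTROL_PLANE = {"cp1": "10.0.10.11", "cp2": "10.0.10.12", "cp3": "10.0.10.13"}
GPU_WORKERS   = {"gpu1": "10.0.10.21", "gpu2": "10.0.10.22", "gpu3": "10.0.10.23"}

def render_inventory() -> str:
    lines = ["[all]"]
    for name, ip in {**CONTROL_PLANE, **GPU_WORKERS}.items():
        lines.append(f"{name} ansible_host={ip}")
    lines += ["", "[kube_control_plane]", *CONTROL_PLANE]
    lines += ["", "[etcd]", *CONTROL_PLANE]            # etcd co-located on the masters
    lines += ["", "[kube_node]", *GPU_WORKERS]
    lines += ["", "[k8s_cluster:children]", "kube_control_plane", "kube_node"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    inv = Path("inventory/lab/inventory.ini")
    inv.parent.mkdir(parents=True, exist_ok=True)
    inv.write_text(render_inventory())
    # Kubespray's main playbook; run from a Kubespray checkout, -b escalates to root.
    subprocess.run(["ansible-playbook", "-i", str(inv), "cluster.yml", "-b"], check=True)
```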
Impact Narrative
🧊 Power & Cooling
Retiring Ivy Bridge iron cut 700 W and trimmed rack inlet temps by 10 °F. Fans slowed, reducing studio noise and deferring HVAC upgrades.
⚙️ Compute & GPU Efficiency
A single EPYC node now provides the cluster's 32 physical cores and hosts all three P4s. Kubernetes‑level GPU time‑slicing tripled available accelerator‑hours with no manual affinity pinning.
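A hedged sketch of how that time‑slicing can be configured, assuming the NVIDIA k8s‑device‑plugin / GPU Operator config format; the replica count, ConfigMap name, and namespace are illustrative rather than the lab's exact values:

```python
#!/usr/bin/env python3
"""Advertise each Tesla P4 as several schedulable nvidia.com/gpu replicas via
the device plugin's time-slicing config. Names and namespace are placeholders."""
import subprocess, textwrap

REPLICAS_PER_GPU = 3   # 3 slices x 3 P4s roughly matches the "tripled accelerator-hours" above

manifest = textwrap.dedent(f"""\
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: {REPLICAS_PER_GPU}
    """)

# Pipe the manifest straight into kubectl; the device plugin picks it up
# once it is pointed at this ConfigMap.
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```

Pods then request `nvidia.com/gpu: 1` as usual and land on one of the time‑sliced replicas; the slices still share each card's 8 GB of memory, which is the limitation called out in the lessons‑learned table below.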
📦 Storage Resilience
NVMe‑backed OSDs plus replication‑4 slashed write tail‑latency, and the cluster now tolerates two simultaneous OSD‑host failures. CI image‑push times fell from 2 min to 55 s.
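The two‑host fault tolerance follows from the pool settings; a minimal sketch, with hypothetical pool names standing in for the lab's actual CephFS pools:

```python
#!/usr/bin/env python3
"""Bump the CephFS pools to 4 replicas while keeping them writable with two
OSD hosts down (size=4, min_size=2). Pool names are placeholders."""
import subprocess

POOLS = ["cephfs_data", "cephfs_metadata"]   # hypothetical pool names

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

for pool in POOLS:
    ceph("osd", "pool", "set", pool, "size", "4")      # keep 4 copies of every object
    ceph("osd", "pool", "set", pool, "min_size", "2")  # still serve I/O with 2 copies left
```

Setting `min_size` to 2 is what keeps the pools writable through two simultaneous host failures; the extra copy trades raw capacity for the resilience described above.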
💡 Automation & Ops
All infra lives in Git. Bare‑metal bring‑up takes 90 min, and drift is detected in minutes. SD‑WAN (Tailscale) plus a Cloudflare WAF isolate services from the public Internet, enforcing Zero‑Trust.
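One way the "drift detected in minutes" claim can be realized is a scheduled `terraform plan -detailed-exitcode` run; a sketch, with the alerting left as a placeholder:

```python
#!/usr/bin/env python3
"""Scheduled drift check: exit code 2 from `terraform plan -detailed-exitcode`
means the live infrastructure no longer matches what is committed to Git."""
import subprocess, sys

def drift_detected(workdir: str = ".") -> bool:
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if result.returncode == 1:                       # 1 = the plan itself failed
        sys.exit(f"terraform plan failed:\n{result.stderr}")
    return result.returncode == 2                    # 0 = clean, 2 = pending changes

if __name__ == "__main__":
    if drift_detected():
        print("Drift detected: live state differs from Git")   # stand-in for a real alert
```

Wired into a cron job or CI schedule, exit code 2 becomes the alert.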
👩‍💻 Developer Experience
The T40 provides 24 GB VRAM for local inference; the k8s GPU pool handles production jobs. Windows media tasks stay on the RTX 3070 Ti — no contention.
Lessons Learned
| What worked | To refine |
|---|---|
| Workload profiling before buying iron | Plan EPYC PCIe lanes for future NICs |
| IaC‑first enabled safe, iterative redeploys | P4 time‑slicing still RAM‑limited (8 GB VRAM) |
| Early NVMe OSD tests validated Ceph configs | Add CephFS + NFS‑Ganesha for legacy VMs |
Next Steps (H2 2025)
- Add three NVIDIA T4s to raise GPU VRAM and efficiency
- Enable Ceph cache‑tiering with spare NVMe for video ingest
- Adopt Cluster‑API for declarative bare‑metal scaling
- Target 800 W idle by phasing out remaining DDR3 hardware
- Set up real‑time Syncthing backups across two remote edge nodes for DR compliance
Bottom line: Modernization cut energy burn by 40 %, doubled storage performance, and turned a fragile, hand‑tuned cluster into a reproducible, IaC‑driven platform ready for AI, automation, and future growth.