
Executive Summary

Faced with 10‑year‑old Ivy Bridge servers, fragmented GPUs, and mounting energy bills, the lab launched a six‑month modernization from January to June 2025. The goals were to:

  • Cut idle power and HVAC costs
  • Consolidate scattered compute and GPU resources
  • Standardize on IaC for repeatable, vendor‑agnostic deployments
  • Prepare for I/O‑intensive AI and CI/CD workloads

Key results:

| Key metric | Early 2025 | June 2025 | Δ |
|---|---|---|---|
| Idle rack draw | ~1 700 W | ~1 000 W | ‑700 W (‑41 %) |
| Annual energy cost¹ | $2 637 | $1 551 | ‑$1 086 / yr |

¹ Assumes the $0.17/kWh Minnesota residential electricity rate.
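For reference, the cost line is straightforward arithmetic on the idle draw. A quick back‑of‑envelope sketch (wattages and rate from the table above; the table's own figures were presumably computed from unrounded meter readings):

```python
# Back-of-envelope annual energy cost from a continuous idle draw.
RATE_USD_PER_KWH = 0.17      # Minnesota residential rate used above
HOURS_PER_YEAR = 24 * 365    # assume the rack idles around the clock

def annual_cost(idle_watts: float) -> float:
    """Annual electricity cost (USD) for a constant draw in watts."""
    kwh_per_year = idle_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * RATE_USD_PER_KWH

before, after = annual_cost(1700), annual_cost(1000)
print(f"before: ${before:,.0f}/yr, after: ${after:,.0f}/yr, "
      f"saved: ${before - after:,.0f}/yr")
# Lands close to (not exactly on) the ~$2.6k -> ~$1.5k figures in the table,
# which were evidently derived from unrounded wattage readings.
```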


Baseline Challenges

| Pain point | Symptom | Business impact |
|---|---|---|
| Excess idle power & heat | 1 700 W draw, noisy fans | +$2.6 k/yr Opex; thermal stress |
| Fragmented GPU pool | Five hosts, five kernels | Under‑utilised GPUs; manual pinning |
| Aging Ceph disks | SATA + DRAM‑less NVMe failures | Rebuild cycles, sluggish pipelines |
| Manual VM provisioning | Proxmox templates only | Hours to rebuild; config drift |
| Limited IaC | Ad‑hoc scripts | Inconsistent state; hard to audit |

Intervention Strategy

  1. Compute consolidation
    • Retired 4 dual‑socket Ivy Bridge nodes and 1 GPU node
    • Added an AMD EPYC 7551P node (32C, SMT‑off, 512 GB DDR4), redeploying 3 × Tesla P4 cards into it
  2. Storage upgrade
    • Both HL15 pods → Xeon Gold 6154 (18C, HT‑off) + 256 GB DDR4
    • Installed 5 × NVMe OSDs each (PCIe bifurcation)
  3. Workstation separation
    • Added Tesla T40 (24 GB) for local ML; kept RTX 3070 Ti for media
  4. IaC refactor
    • Terraform + Ansible (Kubespray) deploy a 6‑node k8s cluster (3 masters / 3 GPU workers)
    • Back‑end network moved to bonded 10 Gb SFP+; SD‑WAN (Tailscale) for private access
  5. Ceph modernization
    • Replication 3 → 4; CephFS volumes consolidated 8 → 2 to cut IOPS amplification (see the sketch after this list)
    • DRAM‑less NVMe drives replaced with cache‑backed units
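The replication bump itself is a small, scriptable change. Below is a minimal sketch of driving it from Python through the standard `ceph osd pool set` command; the pool names are hypothetical placeholders and `min_size 2` is an assumed companion setting, not something specified above:

```python
import subprocess

# Hypothetical pool names -- substitute the cluster's real CephFS data/metadata pools.
POOLS = ["cephfs_data", "cephfs_metadata"]

def set_replication(pool: str, size: int, min_size: int) -> None:
    """Raise a pool's replica count via the ceph CLI (ceph osd pool set ...)."""
    for key, value in (("size", size), ("min_size", min_size)):
        subprocess.run(
            ["ceph", "osd", "pool", "set", pool, key, str(value)],
            check=True,
        )

for pool in POOLS:
    # size=4 keeps four copies; min_size=2 (assumed) lets I/O continue even with
    # two OSD hosts down, matching the failure tolerance noted later.
    set_replication(pool, size=4, min_size=2)
```

Raising `size` kicks off backfill immediately, so in practice the change would be staggered per pool while watching `ceph -s`.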

Impact Narrative

🧊 Power & Cooling

Retiring Ivy Bridge iron cut 700 W and trimmed rack inlet temps by 10 °F. Fans slowed, reducing studio noise and deferring HVAC upgrades.

⚙️ Compute & GPU Efficiency

A single EPYC node now hosts the cluster's 32 physical cores and all three P4s. Kubernetes‑level GPU time‑slicing tripled available accelerator‑hours with no manual affinity pinning.
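The time‑slicing itself is configured once at the device‑plugin level rather than per workload. The sketch below renders the sort of config the NVIDIA k8s-device-plugin reads; a replica count of 3 would line up with the "tripled accelerator‑hours" figure, but the actual factor used here is an assumption:

```python
import yaml  # PyYAML

# Sketch of the time-slicing section of an NVIDIA k8s-device-plugin config.
# replicas=3 would advertise each physical P4 as three schedulable
# nvidia.com/gpu resources; the exact oversubscription factor is an assumption.
config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 3},
            ],
        },
    },
}

# This YAML typically lands in a ConfigMap that the device plugin is pointed at.
print(yaml.safe_dump(config, sort_keys=False))
```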

📦 Storage Resilience

NVMe‑backed OSDs plus 4× replication slashed write tail latency, and the cluster now tolerates two simultaneous OSD‑host failures. CI image‑push times fell from 2 min to 55 s.

💡 Automation & Ops

All infra lives in Git. Bare‑metal bring‑up: 90 min. Drift detected in minutes. SD‑WAN + Cloudflare WAF isolate services from the public Internet, enforcing Zero‑Trust.
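Drift detection can be as simple as a scheduled `terraform plan` against the Git‑tracked code. A minimal sketch, assuming the standard `-detailed-exitcode` behaviour (0 = clean, 2 = pending changes, anything else = error) and a placeholder repo path:

```python
import subprocess
import sys

# Placeholder path to the Git-tracked Terraform root (assumes terraform init has run).
TF_DIR = "infra/terraform"

def check_drift(tf_dir: str) -> bool:
    """Return True if live infrastructure has drifted from the committed code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=tf_dir,
    )
    if result.returncode == 0:
        return False          # no changes: live state matches code
    if result.returncode == 2:
        return True           # changes pending: drift (or un-applied commits)
    # Any other exit code is a real error (auth, syntax, provider failure, ...).
    raise RuntimeError(f"terraform plan failed with exit code {result.returncode}")

if __name__ == "__main__":
    sys.exit(1 if check_drift(TF_DIR) else 0)
```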

👩‍💻 Developer Experience

The T40 provides 24 GB VRAM for local inference; the k8s GPU pool handles production jobs. Windows media tasks stay on the RTX 3070 Ti — no contention.


Lessons Learned

| What worked | To refine |
|---|---|
| Workload profiling before buying iron | Plan EPYC PCIe lanes for future NICs |
| IaC‑first enabled safe, iterative redeploys | P4 time‑slicing still RAM‑limited (8 GB VRAM) |
| Early NVMe OSD tests validated Ceph configs | Add CephFS + NFS‑Ganesha for legacy VMs |

Next Steps (H2 2025)

  1. Add three NVIDIA T4s to raise GPU VRAM and efficiency
  2. Enable Ceph cache‑tiering with spare NVMe for video ingest
  3. Adopt Cluster‑API for declarative bare‑metal scaling
  4. Target 800 W idle by phasing out remaining DDR3 hardware
  5. Replicate backups in real time with Syncthing across two remote edge nodes for DR compliance

Bottom line: Modernization cut energy burn by 40 %, doubled storage performance, and turned a fragile, hand‑tuned cluster into a reproducible, IaC‑driven platform ready for AI, automation, and future growth.