Executive Summary
Faced with 10‑year‑old Ivy Bridge servers, fragmented GPUs, and mounting energy bills, the lab launched a six‑month modernization from January to June 2025. The goals were to:
- Cut idle power and HVAC costs
- Consolidate scattered compute and GPU resources
- Standardize on IaC for repeatable, vendor‑agnostic deployments
- Prepare for I/O‑intensive AI and CI/CD workloads
Key results:
| Key metric | Early 2025 | June 2025 | Δ |
|---|---|---|---|
| Idle rack draw | ~1 700 W | ~1 000 W | ‑700 W (‑41 %) |
| Annual energy cost¹ | $2 637 | $1 551 | ‑$1 086/yr |
¹At the $0.17/kWh Minnesota residential rate.
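For context, the dollar figures map directly onto average draw via cost = kW × 8 760 h × rate. A minimal Python check of that arithmetic; note the table's costs imply average draws slightly above the rounded wattages shown, presumably because they were computed from unrounded measurements:

```python
# Back-of-the-envelope check of the table above, at the cited $0.17/kWh rate.
RATE_USD_PER_KWH = 0.17
HOURS_PER_YEAR = 8760

def annual_cost_usd(avg_watts: float) -> float:
    """Annual cost of a continuous load: W -> kWh/yr -> USD/yr."""
    return avg_watts / 1000 * HOURS_PER_YEAR * RATE_USD_PER_KWH

def implied_avg_watts(annual_cost: float) -> float:
    """Invert the formula: what average draw does a given annual bill imply?"""
    return annual_cost / RATE_USD_PER_KWH / HOURS_PER_YEAR * 1000

print(f"{annual_cost_usd(700):,.0f} USD/yr")   # ≈ 1,042 USD/yr for the ~700 W cut
print(f"{implied_avg_watts(2637):,.0f} W")     # ≈ 1,771 W average behind the $2,637 figure
print(f"{implied_avg_watts(1551):,.0f} W")     # ≈ 1,041 W average behind the $1,551 figure
```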
Baseline Challenges
| Pain point | Symptom | Business impact |
|---|---|---|
| Excess idle power & heat | 1 700 W draw, noisy fans | +$2.6 k/yr OpEx; thermal stress |
| Fragmented GPU pool | Five hosts, five kernels | Under‑utilized GPUs; manual pinning |
| Aging Ceph disks | SATA + DRAM‑less NVMe failures | Rebuild cycles, sluggish pipelines |
| Manual VM provisioning | Proxmox templates only | Hours to rebuild; config drift |
| Limited IaC | Ad‑hoc scripts | Inconsistent state; hard to audit |
Intervention Strategy
- Compute consolidation
- Retired 4 dual‑socket Ivy Bridge nodes and 1 GPU node
- Added AMD EPYC 7551P (32C, SMT‑off, 512 GB DDR4), redeploying 3 × Tesla P4
- Storage upgrade
- Both HL15 pods → Xeon Gold 6154 (18C, HT‑off) + 256 GB DDR4
- Installed 5 × NVMe OSDs each (PCIe bifurcation)
- Workstation separation
- Added Tesla T40 (24 GB) for local ML; kept RTX 3070 Ti for media
- IaC refactor
- Terraform + Ansible (Kubespray) deploy a 6‑node k8s cluster (3 masters / 3 GPU workers); see the bring‑up sketch after this list
- Back‑end network moved to bonded 10 Gb SFP+; SD‑WAN (Tailscale) for private access
- Ceph modernization
- Replication 3 → 4; CephFS volumes 8 → 2 to cut IOPS amplification
- DRAM‑less NVMe drives replaced with cache‑backed units
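A minimal sketch of the Kubespray side of the bring‑up referenced above: render an Ansible inventory for the 6‑node cluster and run Kubespray's `cluster.yml` playbook. Hostnames, IPs, and paths are placeholders, group names follow recent Kubespray releases, and the Terraform layer that provisions the nodes themselves is omitted.

```python
#!/usr/bin/env python3
"""Render a Kubespray inventory for the 6-node cluster (3 control-plane /
3 GPU workers) and run the cluster playbook. Names and IPs are placeholders."""
import subprocess
from pathlib import Path

CONTROL_PLANE = {"cp1": "10.0.10.11", "cp2": "10.0.10.12", "cp3": "10.0.10.13"}
GPU_WORKERS   = {"gpu1": "10.0.10.21", "gpu2": "10.0.10.22", "gpu3": "10.0.10.23"}

def render_inventory() -> str:
    lines = ["[all]"]
    for name, ip in {**CONTROL_PLANE, **GPU_WORKERS}.items():
        lines.append(f"{name} ansible_host={ip}")
    lines += ["", "[kube_control_plane]", *CONTROL_PLANE]
    lines += ["", "[etcd]", *CONTROL_PLANE]            # etcd co-located on the masters
    lines += ["", "[kube_node]", *GPU_WORKERS]
    lines += ["", "[k8s_cluster:children]", "kube_control_plane", "kube_node"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    inv = Path("inventory/lab/inventory.ini")
    inv.parent.mkdir(parents=True, exist_ok=True)
    inv.write_text(render_inventory())
    # Kubespray's main playbook; run from a Kubespray checkout, -b escalates to root.
    subprocess.run(["ansible-playbook", "-i", str(inv), "cluster.yml", "-b"], check=True)
```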
Impact Narrative
🧊 Power & Cooling
Retiring Ivy Bridge iron cut 700 W and trimmed rack inlet temps by 10 °F. Fans slowed, reducing studio noise and deferring HVAC upgrades.
⚙️ Compute & GPU Efficiency
A single EPYC node now provides the cluster's 32 physical cores and hosts all three P4s. Kubernetes‑level GPU time‑slicing tripled available accelerator‑hours with no manual affinity pinning.
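A hedged sketch of how that time‑slicing can be configured, assuming the NVIDIA k8s‑device‑plugin / GPU Operator config format; the replica count, ConfigMap name, and namespace are illustrative rather than the lab's exact values:

```python
#!/usr/bin/env python3
"""Advertise each Tesla P4 as several schedulable nvidia.com/gpu replicas via
the device plugin's time-slicing config. Names and namespace are placeholders."""
import subprocess, textwrap

REPLICAS_PER_GPU = 3   # 3 slices x 3 P4s roughly matches the "tripled accelerator-hours" above

manifest = textwrap.dedent(f"""\
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: {REPLICAS_PER_GPU}
    """)

# Pipe the manifest straight into kubectl; the device plugin picks it up
# once it is pointed at this ConfigMap.
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```

Pods then request `nvidia.com/gpu: 1` as usual and land on one of the time‑sliced replicas; the slices still share each card's 8 GB of memory, which is the limitation called out in the lessons‑learned table below.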
📦 Storage Resilience
NVMe‑backed OSDs plus replication‑4 slashed write tail‑latency, and the cluster now tolerates two simultaneous OSD‑host failures. CI image‑push times fell from 2 min to 55 s.
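The two‑host fault tolerance follows from the pool settings; a minimal sketch, with hypothetical pool names standing in for the lab's actual CephFS pools:

```python
#!/usr/bin/env python3
"""Bump the CephFS pools to 4 replicas while keeping them writable with two
OSD hosts down (size=4, min_size=2). Pool names are placeholders."""
import subprocess

POOLS = ["cephfs_data", "cephfs_metadata"]   # hypothetical pool names

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

for pool in POOLS:
    ceph("osd", "pool", "set", pool, "size", "4")      # keep 4 copies of every object
    ceph("osd", "pool", "set", pool, "min_size", "2")  # still serve I/O with 2 copies left
```

Setting `min_size` to 2 is what keeps the pools writable through two simultaneous host failures; the extra copy trades raw capacity for the resilience described above.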
💡 Automation & Ops
All infra lives in Git. Bare‑metal bring‑up takes 90 min, and drift is detected in minutes. SD‑WAN (Tailscale) plus a Cloudflare WAF isolate services from the public Internet, enforcing Zero‑Trust.
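One way the "drift detected in minutes" claim can be realized is a scheduled `terraform plan -detailed-exitcode` run; a sketch, with the alerting left as a placeholder:

```python
#!/usr/bin/env python3
"""Scheduled drift check: exit code 2 from `terraform plan -detailed-exitcode`
means the live infrastructure no longer matches what is committed to Git."""
import subprocess, sys

def drift_detected(workdir: str = ".") -> bool:
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if result.returncode == 1:                       # 1 = the plan itself failed
        sys.exit(f"terraform plan failed:\n{result.stderr}")
    return result.returncode == 2                    # 0 = clean, 2 = pending changes

if __name__ == "__main__":
    if drift_detected():
        print("Drift detected: live state differs from Git")   # stand-in for a real alert
```

Wired into a cron job or CI schedule, exit code 2 becomes the alert.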
👩‍💻 Developer Experience
The T40 provides 24 GB VRAM for local inference; the k8s GPU pool handles production jobs. Windows media tasks stay on the RTX 3070 Ti — no contention.
Lessons Learned
| What worked | To refine |
|---|---|
| Workload profiling before buying iron | Plan EPYC PCIe lanes for future NICs |
| IaC‑first enabled safe, iterative redeploys | P4 time‑slicing still RAM‑limited (8 GB VRAM) |
| Early NVMe OSD tests validated Ceph configs | Add CephFS + NFS‑Ganesha for legacy VMs |
Next Steps (H2 2025)
- Add three NVIDIA T4s to raise GPU VRAM and efficiency
- Enable Ceph cache‑tiering with spare NVMe for video ingest
- Adopt Cluster‑API for declarative bare‑metal scaling
- Target 800 W idle by phasing out remaining DDR3 hardware
- Set up real‑time Syncthing backups across two remote edge nodes for DR compliance
Bottom line: Modernization cut energy burn by 40 %, doubled storage performance, and turned a fragile, hand‑tuned cluster into a reproducible, IaC‑driven platform ready for AI, automation, and future growth.