The AI Home Lab: Virtualizing and Protecting Heavy GPU Workloads
VMware vSphere - GPU passthrough - AMD ROCm - local AI - Veeam backup
What this covers
How to architect a virtualized home server that runs AMD MI50 GPUs for local LLM inference and image generation, with real configuration details on DirectPath I/O passthrough, ROCm driver constraints, NUMA topology, and vCPU pinning. Then the backup strategy: what AI data actually matters, how to size it, and exactly how to protect it with Veeam.
Contents
- Why run local AI on a virtualized host
- The AMD MI50: what you get and what you give up
- The ROCm support reality
- ESXi DirectPath I/O: the passthrough configuration
- NUMA topology and vCPU pinning
- Known driver and passthrough quirks
- What the full VM stack looks like
- What AI data you are actually protecting
- The Veeam backup strategy
- Repository sizing for AI workloads
🧠 Why run local AI on a virtualized host
The question comes up every time: why not just run the AI stack on bare metal? For a dedicated AI workstation, bare metal makes sense. But if your home lab or small-team infrastructure runs on vSphere, and yours probably does if you're reading this site, the case for keeping AI workloads inside the hypervisor is the same case for keeping everything else there: consolidation, consistent operational tooling, and unified backup coverage. You already know how to manage VMs. You already have Veeam watching them. Spinning up a bare-metal Linux box alongside your vSphere cluster means a second management surface and a second backup strategy.
GPU passthrough via VMware DirectPath I/O makes this viable. The GPU is presented directly to the VM guest OS with near-native performance. The hypervisor layer is bypassed for the device itself. You give up vMotion, Fault Tolerance, suspend/resume, and VMware snapshots while the passthrough device is active. That is the trade. For a GPU-intensive inference VM that functions as a dedicated appliance and does not move between hosts, the trade is acceptable.
For small teams building local AI development environments, running Ollama, vLLM, or custom fine-tuning pipelines, the virtualized approach means each project can have its own VM with isolated GPU access, managed by the same infrastructure that handles everything else. You do not need a separate GPU server per project. You need a host with enough physical GPUs and a clear assignment per VM.
📊 The AMD MI50: what you get and what you give up
The Instinct MI50 is a data center compute GPU, not a consumer card. No display outputs, no gaming driver, no consumer software ecosystem. What it has is a significant amount of HBM2 memory at a price point that only makes sense once you look at the used market. Decommissioned from hyperscaler and university clusters, MI50 cards regularly appear at prices that make the per-GB VRAM economics compelling for home lab and small-team AI work.
The passive cooling design is an operational consideration that catches people off guard. The MI50 depends entirely on chassis airflow. There is no onboard fan. No onboard fan means no onboard fan controller. The card temperature is determined entirely by how much air your chassis pushes through the slot. You need at least 60 CFM of forced airflow. In a proper rack server or workstation with front-to-back airflow this is fine. In an open-frame or small-form-factor home lab chassis, deliberate fan placement is required. Under sustained load the card will thermal throttle if airflow is insufficient, and it will do so without much warning. Plan the cooling before the card arrives.
The 32GB variant is the compelling one for LLM work. Four 32GB MI50 cards give you 128GB of GPU VRAM, enough to run 70B parameter models at Q4 quantization with headroom for KV cache during inference. The community has documented clusters of four MI50s running Llama-70B at useful token rates for total hardware cost well under $1,000. That VRAM density is not available from any other used GPU platform at comparable pricing.
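A quick back-of-envelope check on that claim. A Q4-quantized weight costs roughly 4.5 bits once quantization block overhead is included; this is an approximation across GGUF Q4 variants, not an exact figure for any one of them, and the 20% KV cache headroom is likewise a rough planning number.

```shell
# Rough size of a 70B model at Q4: params * bits-per-weight / 8 bits-per-byte.
# 4.5 bits/weight approximates Q4_K-style quantization with block overhead.
model_gb=$(awk 'BEGIN { printf "%.0f", 70e9 * 4.5 / 8 / 1e9 }')
# Add ~20% headroom for KV cache and activations during inference.
with_cache_gb=$(awk -v m="$model_gb" 'BEGIN { printf "%.0f", m * 1.2 }')
echo "model: ~${model_gb} GB, with KV cache headroom: ~${with_cache_gb} GB"
```

That lands comfortably inside the 128GB a four-card MI50 build provides, which is where the headroom claim above comes from.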
⚠️ The ROCm support reality
This section matters. Understand it before committing to the MI50 platform. AMD's ROCm software stack is the equivalent of CUDA for AMD compute GPUs. Without ROCm, you have no PyTorch, no TensorFlow, no vLLM, no Ollama GPU acceleration. The MI50's relationship with ROCm has a defined lifecycle and it has expired.
AMD placed the MI50 (gfx906 architecture) into maintenance mode starting with ROCm 5.7 in Q3 2023. New features and performance optimizations stopped at that release. Bug fixes and security patches continued until Q2 2024, at which point official maintenance ended entirely. The MI50 is now outside AMD's official support matrix for ROCm 6.x and 7.x. ROCm 7.0 actively removed MI50 and MI60 profiler support.
The practical impact depends on your workload. For LLM inference via llama.cpp or Ollama, the ROCm version dependency is less critical. These tools have their own GPU abstraction and the gfx906 target continues to function. For cutting-edge PyTorch features, newer attention kernels, or any tooling explicitly targeting MI200/MI300-class hardware, the MI50 is increasingly left behind. ROCm being open source means the community can and does maintain support beyond AMD's official lifecycle, but you are carrying maintenance risk that compounds over time.
The right mental model: the MI50 is a budget VRAM strategy for workloads that tolerate a pinned software stack. It is not a platform for staying current with the bleeding edge of the ROCm ecosystem.
🔧 ESXi DirectPath I/O: the passthrough configuration
GPU passthrough in vSphere uses VMDirectPath I/O, which maps the PCIe device directly into the guest VM's address space via the IOMMU, bypassing the ESXi hypervisor layer for that device. The GPU appears to the guest OS as if it were physically attached. Driver installation in the guest proceeds as on bare metal. This is the appropriate approach for the MI50 on vSphere: the card does not support SR-IOV partitioning the way AMD's enterprise MxGPU lineup does, so full-device VMDirectPath passthrough is the right path here.
Prerequisites
AMD-Vi (IOMMU) or Intel VT-d must be enabled in BIOS/UEFI. Without IOMMU, passthrough is not possible. Verify first. In vCenter or the ESXi Host Client, check Host > Configure > Hardware > PCI Devices. The MI50 should appear as a Display Controller (Vega 20).
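The same check can be done from an SSH session on the host before touching the UI. This is a host administration fragment, not something to script blindly; exact device naming in the output varies by ESXi build.

```shell
# From an SSH session on the ESXi host, confirm the card enumerates on the
# PCI bus. The MI50's silicon reports as a Vega 20 display controller.
esxcli hardware pci list | grep -i 'Vega'
```

If nothing comes back, fix the BIOS/UEFI IOMMU setting and physical seating before going any further.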
Enabling passthrough and configuring the VM
ESXi Host Client: Manage > Hardware > PCI Devices. Select the MI50 device and click Toggle Passthrough. An orange icon indicates a host reboot is required.
Reboot the ESXi host. After reboot, the device should show Active (green). If it shows Enabled/Needs Reboot after the reboot, toggle it off and back on. This is a known ESXi 7.x bug for AMD GPU devices. No additional reboot required.
Add the PCI device to the VM: Edit Settings > Add Other Device > PCI Device. ESXi automatically creates a full memory reservation equal to the VM's configured RAM. This is required. The VM will not power on without the reservation in place when a passthrough device is attached.
Set the VM firmware to UEFI: VM Options > Boot Options > Firmware: EFI. UEFI is required for correct GPU initialization with the MI50.
Add the configuration parameter: VM Options > Advanced > Edit Configuration. Add: hypervisor.cpuid.v0 = FALSE. Required for GPU driver initialization under a hypervisor.
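The resulting `.vmx` entries look like the fragment below. The `hypervisor.cpuid.v0` key is the one described above; the two `pciPassthru` MMIO keys are documented VMware advanced settings commonly required for GPUs with more than 16GB of VRAM so the card's large BAR can be mapped, and the 64GB size shown is an illustrative value for a 32GB card (it must be a power of two larger than the total BAR space).

```
hypervisor.cpuid.v0 = "FALSE"
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
```

If the guest driver fails to initialize the card or the VM refuses to power on with a memory mapping error, these MMIO settings are the first thing to check.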
🧱 NUMA topology and vCPU pinning
This is where most home lab GPU passthrough builds leave performance on the table. The MI50 is physically connected to a specific NUMA node on the host CPU. PCIe lanes are not equally distant from all cores. They route through a specific socket on dual-socket systems or a specific NUMA domain on multi-die platforms. If the VM's vCPUs and memory are assigned to a different NUMA node than the one the GPU connects to, every DMA transfer between the GPU and system memory crosses a NUMA interconnect. The latency penalty is measurable and the bandwidth ceiling is lower.
Check which NUMA node the MI50 is attached to. On a Linux guest or host, the PCI device's sysfs entry reports its NUMA node directly.
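A minimal sketch of that check. The `lspci -D` line here is a sample captured into a variable so the parsing is visible; on the real machine, pipe the live command output instead. The PCI address shown is illustrative.

```shell
# Sample `lspci -D` line from an MI50 host; the MI50's silicon is Vega 20.
sample='0000:03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Instinct MI50 32GB]'
# Extract the domain:bus:device.function address of the GPU.
addr=$(printf '%s\n' "$sample" | awk '/Vega 20/ { print $1; exit }')
echo "$addr"
# On the host, sysfs reports the device's NUMA node directly:
#   cat /sys/bus/pci/devices/$addr/numa_node
# A value of -1 means the platform reports no NUMA affinity for the slot.
```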
vSphere is NUMA-aware for passthrough scheduling and will attempt to schedule the VM on the NUMA node local to the device. One known conflict: CPU Hot Add is incompatible with vNUMA configuration on VMs with passthrough devices. Disable CPU Hot Add on the GPU VM. This issue was resolved in vSphere 8.0 U1, but if you are running earlier builds you will hit it.
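If measurement shows the scheduler placing the VM off-node anyway, affinity can be pinned explicitly in the VM's advanced configuration. `numa.nodeAffinity` and `sched.cpu.affinity` are documented vSphere advanced settings; the node and core numbers below are illustrative for a host whose MI50 sits on node 0, and hard affinity should be a last resort since it constrains the scheduler for every other VM on the host.

```
numa.nodeAffinity = "0"
sched.cpu.affinity = "0,1,2,3,4,5,6,7"
```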
vCPU count for inference workloads
The MI50 is memory-bandwidth-bound for most LLM inference workloads. The GPU is doing the heavy lifting. The VM's vCPUs handle tokenization, API serving overhead, and data movement. For a dedicated inference VM running Ollama or vLLM, 8 vCPUs is typically adequate. Allocating more only helps if you have high concurrency or heavy CPU-side pre- and post-processing. Over-allocating vCPUs on an inference VM only wastes scheduler slots needed by other VMs on the host.
🐛 Known driver and passthrough quirks
ROCm PCIe atomics
ROCm requires PCIe atomics support for certain operations. In a passthrough VM, availability depends on the host platform and how the hypervisor exposes the PCIe topology to the guest. If you see ROCm initialization warnings about PCIe atomics being unavailable, that is the cause. For inference workloads this typically does not affect functionality. It mainly affects profiling and some debug tooling.
Passthrough state persistence after host reboot
On ESXi 7.x, passthrough settings for AMD GPUs may not persist correctly after a host reboot. After rebooting, the passthrough status may show Enabled/Needs Reboot rather than Active. Toggle the setting off and on again. No additional reboot required. This behavior was resolved in later 7.x point releases and ESXi 8.x.
ROCm version inside the guest
Ubuntu 22.04 LTS with ROCm 5.7 pinned is the most reliable configuration for MI50 passthrough VMs. ROCm 5.7 packages for Ubuntu 22.04 are available from AMD's package repositories and install cleanly. Attempting ROCm 6.x official builds on an MI50 will result in the GPU not being recognized or showing as unsupported in the rocminfo output. Stay on 5.7 unless you are building from source or using a community-maintained gfx906 package.
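A sketch of the pinned install via AMD's `amdgpu-install` package. The `.deb` filename and version string below match the ROCm 5.7 release naming at the time of writing; confirm the current path at repo.radeon.com before relying on it.

```shell
# Install the ROCm 5.7 installer package for Ubuntu 22.04 (jammy).
wget https://repo.radeon.com/amdgpu-install/5.7/ubuntu/jammy/amdgpu-install_5.7.50700-1_all.deb
sudo apt install ./amdgpu-install_5.7.50700-1_all.deb
sudo amdgpu-install --usecase=rocm
# Group membership required for non-root GPU access:
sudo usermod -aG render,video "$USER"
# After a reboot, the MI50 should report its gfx906 ISA:
rocminfo | grep -i gfx906
```

Hold these packages against unattended upgrades. An accidental jump to a 6.x repository is exactly the failure mode described above.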
Cooling and fan control
The MI50's passive heatsink means the GPU temperature is entirely dependent on chassis fan speed. The card has no onboard PWM fan that ROCm or the OS can control. If your host chassis runs fans at low speed during idle periods, the MI50 can sit at elevated temperatures even when not under load. Monitor GPU temperature via rocm-smi during initial deployment. Sustained inference should keep die temperature below 85C. Above that you will see clock speed reductions. PCIe riser cables, common in dense home lab builds, restrict airflow. Avoid them for MI50 slots if possible.
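During that initial deployment window, a simple poll is enough to characterize the chassis airflow. `rocm-smi` ships with the ROCm stack; check `rocm-smi --help` on your build for the exact temperature flags it supports.

```shell
# Poll the GPU temperature every 5 seconds while running a sustained
# inference load. Readings creeping toward 85C under load mean the
# chassis airflow is not keeping up and throttling is imminent.
watch -n 5 rocm-smi --showtemp
```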
🗺️ What the full VM stack looks like
A production-grade local AI lab on vSphere with MI50 passthrough breaks into three VM categories. The GPU inference VM holds the passthrough device and runs the model serving layer: Ollama, vLLM, or a custom wrapper. This VM has the memory reservation and no snapshot capability. The management VM handles API gateway, job dispatch, and log aggregation. No passthrough device, fully snapshot-capable, standard Veeam job. The storage VM or NAS provides the share where model weights, training datasets, and fine-tuning checkpoints live. This is the largest and most critical data asset to protect.
| Component | Role | Backup method | Snapshot capable |
|---|---|---|---|
| GPU inference VM | Model serving (Ollama, vLLM) | Veeam Agent for Linux (in-guest) | No - passthrough device |
| Management VM | API gateway, logging, job dispatch | Standard Veeam VM backup | Yes |
| NAS / storage VM | Model weights, datasets, checkpoints | Veeam NAS Backup or File Share backup | Yes |
📁 What AI data you are actually protecting
AI workloads produce several distinct data categories with very different protection requirements. Getting this wrong in either direction wastes repository space or loses work that cannot be recovered.
Model weights downloaded from public sources are not irreplaceable. Hugging Face model files, GGUF quantized models, Stable Diffusion checkpoints: all of these can be re-downloaded. Backing them up on a daily schedule is a waste of repository space. The only exception is if your internet connection is so constrained that re-downloading 40GB is operationally painful.
Fine-tuned model weights are the opposite. A fine-tuning run that cost 40 hours of GPU compute using your proprietary dataset, producing a LoRA adapter or a full checkpoint that does exactly what you need: that is irreplaceable. The GPU time alone represents real cost. The training data may be proprietary. Treat that checkpoint like you would a production database.
Training and evaluation datasets you collected, cleaned, or annotated are similarly irreplaceable. Raw datasets from public sources are not. The distinction drives your backup scope and repository sizing.
Application configuration, including Ollama model manifests, vLLM serving configs, prompt templates, RAG pipeline configurations, and environment files, is the smallest category but the one that makes everything else usable. Manage it as code in Git. Veeam is the safety net for the VM those files live on.
🛡️ The Veeam backup strategy
The GPU inference VM: why standard VM backup fails
Standard Veeam VM-level backup uses VMware snapshot quiescing via the VADP libraries to get a consistent point-in-time image. A VM with an active DirectPath I/O passthrough device cannot be snapshotted while powered on. Attempting to run a Veeam job against the GPU inference VM using the standard vSphere backup path will fail at the snapshot creation step. This is a VMware platform limitation confirmed by Broadcom documentation.
The solution is Veeam Agent for Linux deployed inside the guest OS. The agent operates independently of the VMware snapshot mechanism. It uses Linux-native filesystem freeze and thaw capabilities for consistency, not VMware VADP. The GPU passthrough device is invisible to the agent. You are protecting the VM's storage: the OS, the ROCm stack, the model serving software, and the service configuration. This is not large data. A 50-100GB backup for the inference VM OS volume is typical. Run daily with 7-day retention. Active full weekly, forward incremental daily.
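In-guest, the agent is driven by the `veeamconfig` CLI. The sketch below shows the shape of the workflow, not exact syntax: subcommand options vary by agent version, and the job and repository names are placeholders. Verify the real option spelling with `veeamconfig job create --help` on your build before scripting anything.

```shell
# Confirm the agent can see its target repository.
veeamconfig repository list
# Create an entire-machine backup job for the inference VM's OS volume
# (option names are illustrative; check --help on your agent version).
veeamconfig job create --name gpu-vm-os --repoName main-repo --backupAllSystem
# Kick off the first active full.
veeamconfig job start --name gpu-vm-os
```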
Model weights and datasets: NAS Backup or File Share jobs
If your model weights and datasets live on a NAS or storage VM, Veeam's NAS Backup and File Share Backup features handle large, relatively static file workloads far better than VM-level backup would. The key advantage is change tracking. Veeam detects which files changed since the last backup and processes only those. A 500GB dataset directory where 5GB changed since yesterday produces a small incremental, not a full re-scan of 500GB.
Configure backup only for the directories you identified as irreplaceable. Exclude paths containing downloaded public model weights. Focus coverage on your fine-tuned checkpoints and proprietary datasets.
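The scoping decision can be expressed as a simple filter. This is plain shell over a toy directory layout, not Veeam's own include/exclude syntax, and the directory names are illustrative assumptions.

```shell
# Toy layout: one cache of re-downloadable public weights, two irreplaceable
# data directories. Names are illustrative, not a required convention.
root=$(mktemp -d)
mkdir -p "$root/hf-cache" "$root/finetuned/checkpoints" "$root/datasets/proprietary"
# Only the irreplaceable top-level paths enter the backup scope; the cache
# of public model weights is excluded.
backup_scope=$(find "$root" -mindepth 1 -maxdepth 1 -type d ! -name 'hf-cache' | sort)
echo "$backup_scope"
rm -rf "$root"
```

However the exclusion is expressed in the Veeam job, the principle is the same: the scope is defined by replaceability, not by directory size.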
Mid-training checkpoint protection
Fine-tuning runs produce checkpoint files at intervals during training. Configure Veeam NAS Backup for the checkpoint output directory with a tight schedule: every few hours during active training if operationally viable, or at minimum daily with the last 3 to 5 versions retained. This gives you a recovery point if a training run corrupts, the VM crashes mid-run, or storage fails during a long job.
📏 Repository sizing for AI workloads
AI workload data sizing is dominated by model files, which are dense binary blobs that compress at approximately 1:1. A 70B parameter model at Q4 quantization is roughly 38-40GB. These files have near-random content at the byte level. Veeam's deduplication and compression have minimal impact on safetensors and GGUF files. Size your repository for model data at 1:1. Do not expect storage savings from compression on model weight files.
Training datasets are more variable. Text corpora compress well: a 50GB text dataset might store at 15-20GB with Veeam's inline compression. Image and video datasets for diffusion model training do not compress significantly. Plan accordingly.
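Putting those ratios together, repository sizing is one line of arithmetic. The protected-set sizes below are a hypothetical example, and the ratios are the planning figures from this section: model checkpoints at 1:1, text datasets at roughly 4:1, OS volumes at a modest 1.5:1.

```shell
# Hypothetical protected set: 150 GB of fine-tuned checkpoints (~1:1),
# 60 GB of text datasets (assume 4:1), 80 GB of OS/stack volumes (~1.5:1).
repo_gb=$(awk 'BEGIN { printf "%.0f", 150/1.0 + 60/4.0 + 80/1.5 }')
echo "~${repo_gb} GB before retention multiples"
```

Multiply the result by your retention chain (full plus incrementals, number of restore points) to get the actual repository footprint.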
| Data type | Typical size | Compression ratio | Backup priority |
|---|---|---|---|
| Downloaded public model weights (GGUF, safetensors) | 4-70GB per model | ~1:1 | Low: re-downloadable |
| Fine-tuned LoRA adapters | 10MB to 2GB | Moderate | High: irreplaceable |
| Full fine-tuned model checkpoints | 7-70GB | ~1:1 | High: irreplaceable |
| Proprietary training datasets (text) | 10MB to 100GB | 3:1 to 5:1 | High: irreplaceable |
| Image/video training datasets | 50GB to several TB | ~1:1 | Depends on source |
| Inference VM OS plus ROCm stack | 50-100GB | Moderate | Medium: hours to rebuild |
The combination works. vSphere handles VM consolidation and resource management. DirectPath I/O gives the MI50 near-native GPU access inside a guest VM. ROCm 5.7 pinned inside Ubuntu 22.04 runs the inference stack. Veeam Agent for Linux covers the GPU VM that cannot use standard snapshot-based backup. NAS Backup covers the model files and datasets that actually matter. Nothing here requires magic configuration. It is a set of known constraints with documented workarounds, on hardware that makes the VRAM economics of local AI work accessible without a cloud bill attached to every inference call.