Break Glass #05: Proxy Failure Mid-Backup - Clearing Orphaned Snapshots and Recovering from VMware Snapshot Stun
Why This Happens
A Veeam VMware backup proxy crashes or loses connectivity mid-job while using HotAdd (Virtual Appliance) transport mode. At the moment of failure, the proxy had one or more VM disks attached to it via VMDK hotadd, and it had issued a snapshot creation request to vCenter but had not yet completed snapshot removal. The result is a VM running on a snapshot it cannot remove and a proxy that has orphaned VMDKs attached to it.
The impact is snapshot stun. When vSphere commits and removes a snapshot, it must briefly pause the VM to consolidate the delta disk. Under normal conditions this stun is measured in seconds. When the proxy has a HotAdd disk attached from a host other than the one running the VM (cross-host HotAdd), the stun can extend to minutes. When the proxy is dead and the VMDK cannot be detached before the snapshot removal request fires, the stun can become a full lockout.
NFS datastores with HotAdd transport mode have a specific known issue. When the proxy and the VM are on different hosts and the datastore is NFS, the cross-host mount required for HotAdd creates a stun during snapshot removal that VMware documented and partially addressed in ESXi 8.0 Update 2b. Environments on older ESXi builds still hit this regularly.
The job failure alone is recoverable. The dangerous part is a VM stuck on an orphaned snapshot that is growing by the minute. Every write to that VM is going into the delta disk. If the delta disk fills the datastore, the VM crashes. This is the scenario where a failed backup job causes a production outage.
Triage
1. Stop all running VBR jobs immediately. In the VBR console, right-click each running job and select Stop. Do not let any job continue running against VMs that may have snapshot issues. Additional backup operations against a VM with a stuck snapshot can create snapshot chains that are much harder to clean up.
2. In vCenter, check all VMs for consolidation warnings. Select all VMs and look for the yellow consolidation warning icon. Alternatively, run a PowerCLI query: Get-VM | Where-Object {$_.Extensiondata.Runtime.ConsolidationNeeded -eq $true}. Note every VM that shows as needing consolidation.
3. Check the backup proxy VM for attached disks. In vSphere Client, open the hardware configuration of the proxy VM. Look for any VMDK entries that do not belong to the proxy itself. These are orphaned HotAdd disks from VMs that were being backed up when the proxy failed.
4. Check which VMs have snapshots named "VEEAM BACKUP TEMPORARY SNAPSHOT". In vCenter, right-click each VM flagged for consolidation, select Manage Snapshots, and verify whether a Veeam snapshot is still present. If the snapshot is present but Veeam has no active job running against that VM, it is an orphaned snapshot.
5. Assess VM health. For each VM with a stuck snapshot, confirm: is the VM running, is it reachable on the network, and how large is the snapshot delta? In vCenter, check the snapshot delta file size on the datastore. A small delta (under 1 GB) means the VM has not been running long on the snapshot. A large or rapidly growing delta means the datastore fill risk is real and that VM is the priority.
6. Check datastore free space. If any affected VM's datastore is under 20% free, treat that VM as critical priority. A growing snapshot delta on a nearly-full datastore will crash the VM when space runs out.
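The consolidation, orphaned-disk, snapshot, and datastore checks above can be collected into a single PowerCLI pass. This is a sketch, assuming the VMware PowerCLI module is installed and you already have an active Connect-VIServer session; the proxy name veeam-proxy-01 is a placeholder for your environment:

```powershell
# Triage sketch. Assumes VMware PowerCLI and an active Connect-VIServer
# session. "veeam-proxy-01" is a placeholder proxy VM name.

# 1. VMs flagged for disk consolidation.
$needsConsolidation = Get-VM | Where-Object {
    $_.Extensiondata.Runtime.ConsolidationNeeded -eq $true
}
$needsConsolidation | Select-Object Name, PowerState

# 2. Disks attached to the proxy whose VMDK path does not live under the
#    proxy's own folder - a heuristic for orphaned HotAdd disks, which
#    keep the source VM's datastore path.
$proxy = Get-VM -Name "veeam-proxy-01"
Get-HardDisk -VM $proxy | Where-Object {
    $_.Filename -notmatch [regex]::Escape($proxy.Name)
} | Select-Object Name, Filename, CapacityGB

# 3. Leftover Veeam snapshots on the flagged VMs.
$needsConsolidation | Get-Snapshot | Where-Object {
    $_.Name -like "VEEAM BACKUP TEMPORARY SNAPSHOT*"
} | Select-Object VM, Name, Created, SizeGB

# 4. Datastores backing the affected VMs that are under 20% free.
$needsConsolidation | Get-Datastore | Sort-Object -Unique Name |
    Where-Object { ($_.FreeSpaceGB / $_.CapacityGB) -lt 0.2 } |
    Select-Object Name, FreeSpaceGB, CapacityGB
```

The filename check in step 2 is only a heuristic; confirm each candidate disk against the vSphere Client hardware view before touching it.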
The Recovery Path
1. Remove orphaned HotAdd disks from the proxy VM first. In vSphere Client, open the proxy VM settings. For each disk that does not belong to the proxy, select it and click Remove. Choose Remove from virtual machine only; do not select Delete files from datastore, because those are production VM disks. Confirm the removal. Do this for all orphaned disks before attempting any snapshot operations.
2. Verify the disks are detached. After removing each disk from the proxy, refresh the proxy VM hardware in vSphere Client and confirm the disk is no longer listed. On a Windows proxy, you can also check Device Manager for any unexpected disk devices.
3. Re-run the backup job for the affected VMs. At the start of each job session, VBR checks for registered Veeam snapshots left behind by previous failed sessions and attempts to remove them before proceeding with the new backup. This pre-job orphaned snapshot cleanup handles most cases where a proxy failure left a "VEEAM BACKUP TEMPORARY SNAPSHOT" still registered in vCenter. Monitor the job session to confirm the orphaned snapshot is removed. Note: this is distinct from Snapshot Hunter, which targets phantom snapshot files that remain on the datastore after vSphere reports a successful removal but consolidation actually failed.
4. If the pre-job cleanup does not resolve the issue, attempt snapshot consolidation from vCenter. Right-click the VM, select Snapshots, then Consolidate. vCenter will attempt to commit the delta disk and remove the snapshot. Monitor the consolidation task in the vCenter task panel. This can take minutes to hours depending on delta size.
5. If consolidation fails with "another task is already in progress," restart the ESXi management agents on the host running the VM. From the ESXi host console (F2 to enter the customization menu), navigate to Troubleshooting Options, then Restart Management Agents. This clears stale task locks without affecting running VMs. After the management agents restart, retry the consolidation.
6. If consolidation still fails, or the delta is growing faster than the datastore can absorb, power off the VM. In vSphere Client, right-click the VM and select Power Off (not Suspend). Coordinate with application owners. A controlled power-off for snapshot removal is recoverable. A datastore-full crash is not.
7. After the VM is powered off, select Snapshots, then Delete All Snapshots in vCenter. With the VM off, snapshot removal completes quickly and without stun. Wait for the consolidation task to complete in the vCenter task panel before proceeding.
8. Power the VM back on. Verify it comes up cleanly and applications are functioning before continuing.
9. For any VM where you had to set the parameter snapshot.asyncConsolidate.forceSync = TRUE during troubleshooting, remove that parameter after the snapshot is cleared. Leave it in place permanently only if recommended by VMware support for your specific hardware configuration.
10. After all VMs are snapshot-free and running cleanly, restart the failed backup proxy. Check its service logs for the cause of the failure before returning it to service. In VBR, go to Backup Infrastructure, then Backup Proxies, right-click the proxy, and run a connection test. If the proxy is recovered and passes the test, re-enable it.
11. Run the backup job manually for the affected VMs. Since the previous run failed mid-job, those VMs need a successful backup before you have coverage again. Monitor the run and confirm all VMs complete without snapshot issues.
12. Address the transport mode configuration to prevent recurrence. See the Prevention Checklist below.
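Steps 1 and 4 above can also be driven from PowerCLI when clicking through the vSphere Client disk by disk is too slow. This is a sketch, assuming an active Connect-VIServer session; the VM names are placeholders, and the filename filter is the same heuristic used in triage, so verify the disk list before removing anything:

```powershell
# Recovery sketch. Assumes PowerCLI and an active Connect-VIServer session.
# "veeam-proxy-01" and "app-vm-01" are placeholder names.

# Step 1: detach orphaned HotAdd disks from the proxy. Remove-HardDisk
# WITHOUT -DeletePermanently detaches the VMDK but leaves the file on the
# datastore. Never pass -DeletePermanently here - these are production
# VM disks.
$proxy = Get-VM -Name "veeam-proxy-01"
$orphans = Get-HardDisk -VM $proxy | Where-Object {
    $_.Filename -notmatch [regex]::Escape($proxy.Name)
}
$orphans | Select-Object Name, Filename   # review this list first
$orphans | Remove-HardDisk -Confirm:$false

# Step 4: trigger disk consolidation on an affected VM via the vSphere
# API. The synchronous call blocks until consolidation completes, which
# can take a long time on a large delta.
$vm = Get-VM -Name "app-vm-01"
$vm.ExtensionData.ConsolidateVMDisks()
```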
Gotchas
- Stop all VBR jobs immediately. Do not let any job run against VMs with stuck snapshots.
- Check every VM for consolidation warnings in vCenter before touching snapshots.
- Remove orphaned HotAdd disks from the proxy VM BEFORE attempting any snapshot consolidation.
- Never revert a Veeam snapshot. Always delete or consolidate.
- Re-run the job to trigger VBR's pre-job orphaned snapshot cleanup before escalating to manual consolidation.
- Consolidation still failing: restart the ESXi management agents to clear task locks.
- Datastore filling from delta growth: power off the VM, delete all snapshots, power back on.
- Windows proxies require automount disabled or HotAdd disks get stuck (KB1882).
- Cross-host HotAdd causes long stun. Use per-host proxies or Direct NFS instead.
- NFS + HotAdd stun fixed in ESXi 8.0 U2b. Patch or change transport mode on older builds.
Prevention Checklist
- Deploy one proxy VM per ESXi host in clusters where HotAdd transport mode is required. Per-host proxies eliminate cross-host VMDK mounts and prevent the cross-host stun entirely.
- Alternatively, switch to Direct NFS transport mode for NFS-backed datastores. Direct NFS reads data directly from the NFS share without VMDK attachment, eliminating HotAdd stun risk entirely.
- Verify automount is disabled on all Windows-based proxy VMs. Open an elevated command prompt, run diskpart, then type automount to check status. If it reports enabled, type automount disable. VBR v10 and later auto-disables this before tasks run, but verifying the OS-level setting prevents edge cases where mount entries persist.
- In proxy transport mode settings, set the fallback behavior: configure proxies so that if the local proxy fails, Veeam falls back to NBD (Network Mode) rather than HotAdd on a remote host. This trades performance for safety in the failover case.
- Do not include proxy VMs in any backup job scope. Exclude them explicitly or put them in a separate job with NBD-only transport and no concurrent tasks.
- Upgrade ESXi to 8.0 Update 2b or later to get VMware's fix for the NFS HotAdd snapshot stun bug if your environment uses NFS datastores.
- Monitor datastore free space and alert at 80% full. A growing snapshot delta on a nearly-full datastore is the path to a production outage from a failed backup job.
- Configure proxy failover behavior in job settings. Set proxy selection to Automatic to allow Veeam to select an available proxy if the preferred one is down. For NFS environments, restrict transport mode to Direct NFS or NBD to avoid falling back to cross-host HotAdd on a remote proxy.
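To put a number on the datastore-fill risk, a quick back-of-the-envelope helper can turn an observed delta growth rate into hours-until-outage. Pure PowerShell, no vCenter connection required; the sample figures are illustrative only:

```powershell
# Estimate hours until a datastore fills, given current free space and an
# observed snapshot delta growth rate. Sample numbers are illustrative.
function Get-DatastoreFillEta {
    param(
        [double]$FreeSpaceGB,           # current datastore free space
        [double]$DeltaGrowthGBPerHour   # observed delta growth per hour
    )
    if ($DeltaGrowthGBPerHour -le 0) { return [double]::PositiveInfinity }
    [math]::Round($FreeSpaceGB / $DeltaGrowthGBPerHour, 1)
}

# Example: 120 GB free, delta growing at 8 GB/hour -> roughly 15 hours
# before the datastore fills and the VM crashes.
Get-DatastoreFillEta -FreeSpaceGB 120 -DeltaGrowthGBPerHour 8
```

Measure the growth rate by sampling the delta file size twice (triage step 5) a few minutes apart; if the ETA is shorter than the expected consolidation time, go straight to the controlled power-off path.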