Break Glass #05: Proxy Failure Mid-Backup - Clearing Orphaned Snapshots and Recovering from VMware Snapshot Stun

Break Glass // Scenario 05
The backup proxy crashed mid-job. Now several VMs are sitting on orphaned snapshots and nobody can reach one of them over the network. Another VM has a VMDK still attached to the dead proxy. vCenter is showing consolidation warnings. Your production environment is degraded and it is not recovering on its own.
Break Glass VBR v13 VMware Snapshot Recovery

Why This Happens

A Veeam VMware backup proxy crashes or loses connectivity mid-job while using HotAdd (Virtual Appliance) transport mode. At the moment of failure, the proxy had one or more VM disks attached to it via VMDK hotadd, and it had issued a snapshot creation request to vCenter but had not yet completed snapshot removal. The result is a VM running on a snapshot it cannot remove and a proxy that has orphaned VMDKs attached to it.

The impact is snapshot stun. When vSphere commits and removes a snapshot, it must briefly pause the VM to consolidate the delta disk. Under normal conditions this stun is measured in seconds. When the proxy had a HotAdd disk attached from a different host than the VM (cross-host HotAdd), the stun can extend to minutes. And when the proxy is dead and the VMDK cannot be detached before the snapshot removal request fires, the stun can become a full lockout.

NFS datastores with HotAdd transport mode have a specific known issue. When the proxy and the VM are on different hosts and the datastore is NFS, the cross-host mount required for HotAdd creates a stun during snapshot removal that VMware documented and partially addressed in ESXi 8.0 Update 2b. Environments on older ESXi builds still hit this regularly.

The job failure alone is recoverable. The dangerous part is a VM stuck on an orphaned snapshot that is growing by the minute. Every write to that VM is going into the delta disk. If the delta disk fills the datastore, the VM crashes. This is the scenario where a failed backup job causes a production outage.

Triage

  1. Stop all running VBR jobs immediately. In the VBR console, right-click each running job and select Stop. Do not let any job continue running against VMs that may have snapshot issues. Additional backup operations against a VM with a stuck snapshot can create snapshot chains that are much harder to clean up.
  2. In vCenter, check all VMs for consolidation warnings. Select the VMs view and look for the yellow warning icon. Alternatively, run a PowerCLI query:
    Get-VM | Where-Object {$_.ExtensionData.Runtime.ConsolidationNeeded -eq $true}
    Note every VM that shows as needing consolidation.
  3. Check the backup proxy VM for attached disks. In vSphere Client, open the hardware configuration of the proxy VM. Look for any VMDK entries that do not belong to the proxy itself. These are orphaned HotAdd disks from VMs that were being backed up when the proxy failed.
  4. Check which VMs had snapshots named "VEEAM BACKUP TEMPORARY SNAPSHOT." In vCenter, right-click each VM flagged for consolidation, select Manage Snapshots, and verify whether a Veeam snapshot is still present. If the snapshot is present but Veeam has no active job running against that VM, it is an orphaned snapshot.
  5. Assess VM health. For each VM with a stuck snapshot, confirm: is the VM running, is it reachable on the network, and how large is the snapshot delta? Check the snapshot delta file size on the datastore. A small delta (under 1 GB) means the VM has not been running long on the snapshot. A large or rapidly growing delta means the datastore-fill risk is real and this VM is the priority.
  6. Check datastore free space. If any affected VM's datastore is under 20% free, treat that VM as critical priority. A growing snapshot delta on a nearly full datastore will crash the VM when space runs out.
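The datastore-fill risk in triage steps 5 and 6 reduces to simple arithmetic: how fast is the delta growing, and how much free space is left? A minimal sketch of that calculation, with illustrative function names and thresholds (all figures come from reading vCenter manually; nothing here talks to an API):

```python
# Rough triage arithmetic for a VM stuck on a growing snapshot delta.
# Sample the delta file size twice, a few minutes apart, to get the
# growth rate; read free space from the datastore summary.

def hours_until_datastore_full(free_gb: float, delta_growth_gb_per_hour: float) -> float:
    """Estimate hours of headroom before the delta fills the datastore."""
    if delta_growth_gb_per_hour <= 0:
        return float("inf")  # delta is not growing; no immediate fill risk
    return free_gb / delta_growth_gb_per_hour

def triage_priority(free_pct: float, free_gb: float, growth_gb_per_hour: float) -> str:
    """Classify a VM per the triage steps: under 20% datastore free,
    or only a few hours of headroom, makes it critical."""
    if free_pct < 20 or hours_until_datastore_full(free_gb, growth_gb_per_hour) < 4:
        return "critical"
    return "normal"

# Example: 150 GB free, delta growing ~10 GB/hour -> 15 hours of headroom
print(hours_until_datastore_full(150, 10))                               # 15.0
print(triage_priority(free_pct=35, free_gb=150, growth_gb_per_hour=10))  # normal
print(triage_priority(free_pct=15, free_gb=60, growth_gb_per_hour=10))   # critical
```

The 4-hour cutoff is an arbitrary placeholder; pick a margin that covers how long consolidation typically takes in your environment.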

The Recovery Path

  1. Remove orphaned HotAdd disks from the proxy VM first. In vSphere Client, open the proxy VM settings. For each disk that does not belong to the proxy, select it and click Remove. Choose Remove from virtual machine only; do not select Delete files from datastore, because those are production VM disks. Confirm the removal. Do this for all orphaned disks before attempting any snapshot operations.
  2. Verify the disks are detached. After removing each disk from the proxy, refresh the proxy VM hardware in vSphere Client and confirm the disk is no longer listed. On a Windows proxy, you can also check Device Manager for any unexpected disk devices.
  3. Re-run the backup job for the affected VMs. At the start of each job session, VBR checks for registered Veeam snapshots left behind by previous failed sessions and attempts to remove them before proceeding with the new backup. This pre-job orphaned snapshot cleanup handles most cases where a proxy failure left a "VEEAM BACKUP TEMPORARY SNAPSHOT" still registered in vCenter. Monitor the job session to confirm the orphaned snapshot is removed. Note: this is distinct from Snapshot Hunter, which targets phantom snapshot files that remain on the datastore after vSphere reports a successful removal but consolidation actually failed.
  4. If the pre-job cleanup does not resolve the issue, attempt snapshot consolidation from vCenter. Right-click the VM in vCenter and select Snapshots, Consolidate. vCenter will attempt to commit the delta disk and remove the snapshot. Monitor the consolidation task in the vCenter task panel. This can take minutes to hours depending on delta size.
  5. If consolidation fails with "another task is already in progress," restart the ESXi management agents on the host running the VM. From the ESXi host console (press F2 to enter the customization menu), navigate to Troubleshooting Options, then Restart Management Agents. This clears stale task locks without affecting running VMs. After the management agents restart, retry the consolidation.
Decision Point
If consolidation succeeds: verify the VM is fully operational and snapshot-free, then continue to step 9. If consolidation continues to fail and the datastore is filling: the VM may need to be powered off to force snapshot commit. This causes a brief outage but prevents a datastore-full crash. Coordinate with application owners before powering off any database or critical application VM. Continue to step 6.
  6. Power off the VM. In vSphere Client, right-click the VM and select Power Off (not Suspend). Coordinate with application owners. A controlled power-off for snapshot removal is recoverable. A datastore-full crash is not.
  7. After the VM is powered off, go to Snapshots, Delete All Snapshots in vCenter. With the VM off, snapshot removal completes quickly and without stun. Wait for the consolidation task to complete in the vCenter task panel before proceeding.
  8. Power the VM back on. Verify it comes up cleanly and applications are functioning before continuing.
  9. For any VM where you had to set the parameter snapshot.asyncConsolidate.forceSync = TRUE during troubleshooting, remove that parameter after the snapshot is cleared. Leave it in place permanently only if recommended by VMware support for your specific hardware configuration.
  10. After all VMs are snapshot-free and running cleanly, restart the failed backup proxy. Check its service logs for the cause of the failure before returning it to service. In VBR, go to Backup Infrastructure, Backup Proxies, right-click the proxy, and run a connection test. If the proxy is recovered and passes the test, re-enable it.
  11. Run the backup job manually for the affected VMs. Since the previous run failed mid-job, those VMs need a successful backup before you have coverage again. Monitor the run and confirm all VMs complete without snapshot issues.
  12. Address the transport mode configuration to prevent recurrence. See the Prevention Checklist below.
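The decision point above comes down to comparing two clocks: how long online consolidation is likely to take versus how long until the datastore fills. A hedged sketch of that comparison; both estimates are yours to supply, and the function name and margin are illustrative:

```python
# Sketch of the decision point: keep retrying online consolidation, or
# take the controlled outage (power off, delete all snapshots)?
# Compare estimated consolidation time against datastore headroom.

def recovery_action(hours_until_full: float,
                    est_consolidation_hours: float,
                    safety_margin_hours: float = 2.0) -> str:
    """Return 'consolidate-online' if there is comfortable headroom,
    else 'power-off-and-delete' (the controlled-outage path)."""
    if hours_until_full - est_consolidation_hours >= safety_margin_hours:
        return "consolidate-online"
    return "power-off-and-delete"

print(recovery_action(hours_until_full=15, est_consolidation_hours=2))  # consolidate-online
print(recovery_action(hours_until_full=3, est_consolidation_hours=2))   # power-off-and-delete
```

The point of the margin is that a datastore-full crash is strictly worse than a planned power-off, so when in doubt the math should push you toward the controlled outage.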

Gotchas

Never Revert a Veeam Snapshot
The "VEEAM BACKUP TEMPORARY SNAPSHOT" captures the VM state at backup start. Reverting to it rolls back every write the VM made during and after the backup window. You will lose production data. Always delete or consolidate the Veeam snapshot, never revert to it. The vCenter UI allows revert. It is the wrong option. The correct options are Delete and Consolidate.
Cross-Host HotAdd Is the Stun Source
HotAdd only avoids stun when the proxy VM is on the same ESXi host as the VM being backed up. When Veeam selects a proxy on a different host, the disk attach operation creates a cross-host mount. Snapshot removal while that cross-host mount is active triggers the long stun. This is a VMware architectural issue, not a Veeam bug. The fix is per-host proxies (one proxy VM per host) or switching to Direct NFS or NBD transport mode.
Windows Proxy automount Must Be Disabled
On Windows-based backup proxies, automount must be disabled at the OS level. Since VBR v10, Veeam automatically disables automount on Windows proxies before running a data protection task (KB1882), but if automount was enabled when disks were previously attached, existing mount entries may persist. With automount enabled, HotAdded VMDKs can be assigned drive letters by Windows; when Veeam tries to detach the disk, Windows holds a handle on it, the disk cannot be cleanly removed, and you get exactly the stuck-VMDK scenario described in this article. Verify automount is disabled on every Windows proxy: run diskpart, then automount to check the status, and automount disable if it reports enabled.
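If you audit a fleet of proxies, the diskpart status line can be checked mechanically. A minimal parsing sketch, assuming the English-locale wording of diskpart's automount output ("Automatic mounting of new volumes enabled/disabled."); adjust the match for other locales:

```python
# Parse the status line that diskpart's "automount" command prints so
# a fleet audit script can flag proxies that still have automount on.
# Gathering the output (e.g. via remoting) is left out; this is only
# the check itself.

def automount_enabled(diskpart_output: str) -> bool:
    """True if the English-locale diskpart output reports automount enabled."""
    return "enabled" in diskpart_output.lower()

print(automount_enabled("Automatic mounting of new volumes disabled."))  # False
print(automount_enabled("Automatic mounting of new volumes enabled."))   # True
```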
Consolidation Needed Warning Stays After Snapshot Removal
vCenter sometimes shows "Virtual Machine Disks Consolidation Needed" even after the snapshot has been successfully removed and all delta files have been committed. This is a vCenter display lag. Run Snapshots, Consolidate again. If the task completes immediately with no work performed, the consolidation is already done. The warning may persist for up to 15 minutes after the actual consolidation completes before vCenter refreshes the status.
NFS + HotAdd Known Issue on ESXi Before 8.0 U2b
VMware (now Broadcom) documented a specific issue with HotAdd transport mode on NFS datastores (Broadcom KB323118) that causes multi-minute stun during snapshot removal. The fix is ESXi 8.0 Update 2b or later. But many production environments are not there yet. If your datastores are NFS and you are on an older ESXi build, HotAdd is not the appropriate transport mode for those VMs. Use Direct NFS (Direct Storage Access) instead. Veeam's automatic transport mode selection will attempt HotAdd first if the proxy can reach the host. You must explicitly restrict this via proxy transport mode settings or use the registry key documented in Veeam KB1681 to force same-host proxy selection.
Do Not Back Up Proxy VMs With Backup Jobs
Veeam explicitly recommends not including backup proxy VMs in VBR backup jobs. A proxy VM being backed up while it is simultaneously backing up other VMs creates a recursive snapshot situation: when the proxy VM's own snapshot removal fires, any VMDKs currently hotadded to that proxy are also covered by the snapshot. This produces exactly the orphaned-VMDK and failed-consolidation scenario described in this article, caused not by a failure but by the normal operation of two overlapping jobs.
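This overlap is easy to detect mechanically: intersect each job's scope with the proxy inventory. A sketch with placeholder names (in practice you would export both lists from VBR; the check itself is just a set intersection):

```python
# Flag backup jobs whose scope includes a backup proxy VM.
# Inputs are plain name lists; job_scopes maps job name -> VMs in scope.

def proxies_in_scope(job_scopes: dict[str, list[str]],
                     proxy_vms: set[str]) -> dict[str, set[str]]:
    """Map job name -> proxy VMs wrongly included in its scope."""
    offenders = {}
    for job, vms in job_scopes.items():
        overlap = set(vms) & proxy_vms
        if overlap:
            offenders[job] = overlap
    return offenders

jobs = {
    "Prod-SQL": ["sql01", "sql02"],
    "Infra": ["dc01", "veeam-proxy-02"],  # proxy wrongly in scope
}
print(proxies_in_scope(jobs, {"veeam-proxy-01", "veeam-proxy-02"}))
# {'Infra': {'veeam-proxy-02'}}
```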

Prevention Checklist

  • Deploy one proxy VM per ESXi host in clusters where HotAdd transport mode is required. Per-host proxies eliminate cross-host VMDK mounts and prevent the cross-host stun entirely.
  • Alternatively, switch to Direct NFS transport mode for NFS-backed datastores. Direct NFS reads data directly from the NFS share without VMDK attachment, eliminating HotAdd stun risk entirely.
  • Verify automount is disabled on all Windows-based proxy VMs. Open an elevated command prompt, run diskpart, then type automount to check status. If it reports enabled, type automount disable. VBR v10 and later auto-disables this before tasks run, but verifying the OS-level setting prevents edge cases where mount entries persist.
  • In proxy transport mode settings, set the fallback behavior: configure proxies so that if the local proxy fails, Veeam falls back to NBD (Network Mode) rather than HotAdd on a remote host. This trades performance for safety in the failover case.
  • Do not include proxy VMs in any backup job scope. Exclude them explicitly or put them in a separate job with NBD-only transport and no concurrent tasks.
  • Upgrade ESXi to 8.0 Update 2b or later to get VMware's fix for the NFS HotAdd snapshot stun bug if your environment uses NFS datastores.
  • Monitor datastore free space and alert at 80% full. A growing snapshot delta on a nearly-full datastore is the path to a production outage from a failed backup job.
  • Configure proxy failover behavior in job settings. Under the Virtual Proxy selection, set the proxy to Automatic to allow Veeam to select an available proxy if the preferred one is down. For NFS environments, restrict transport mode to Direct NFS or NBD to avoid falling back to cross-host HotAdd on a remote proxy.
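The per-host proxy requirement in the first checklist item can be audited with a one-off script: list the cluster's hosts, note which host each proxy VM runs on, and flag hosts with no local proxy (any VM on those hosts risks cross-host HotAdd). Names are illustrative; the placement data would come from vCenter:

```python
# Find ESXi hosts that have no local HotAdd proxy. proxy_host_map maps
# proxy VM name -> the ESXi host it currently runs on.

def hosts_without_local_proxy(hosts: set[str],
                              proxy_host_map: dict[str, str]) -> set[str]:
    """Return hosts where a backed-up VM would need a cross-host HotAdd."""
    covered = set(proxy_host_map.values())
    return hosts - covered

cluster = {"esx01", "esx02", "esx03"}
proxies = {"veeam-proxy-01": "esx01", "veeam-proxy-02": "esx02"}
print(hosts_without_local_proxy(cluster, proxies))  # {'esx03'}
```

Note that DRS can move proxy VMs between hosts, so this check is only as good as its last run; pinning proxies with VM-host affinity rules keeps the coverage stable.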
Break Glass Recap
  • Stop all VBR jobs immediately. Do not let any job run against VMs with stuck snapshots
  • Check every VM for consolidation warnings in vCenter before touching snapshots
  • Remove orphaned HotAdd disks from proxy VM BEFORE attempting any snapshot consolidation
  • Never revert a Veeam snapshot. Always delete or consolidate
  • Re-run the job to trigger VBR's pre-job orphaned snapshot cleanup before escalating to manual consolidation
  • Consolidation still failing: restart ESXi management agent to clear task locks
  • Datastore filling from delta growth: power off VM, delete all snapshots, power back on
  • Windows proxies require automount disabled or HotAdd disks get stuck (KB1882)
  • Cross-host HotAdd causes long stun. Use per-host proxies or Direct NFS instead
  • NFS + HotAdd stun fixed in ESXi 8.0 U2b. Patch or change transport mode on older builds
