SOBR Extent Failure - Recovering a Scale-Out Backup Repository After an Extent Goes Offline
Why This Happens
A Scale-Out Backup Repository (SOBR) aggregates multiple simple repositories (extents) into a single logical target. Each extent is an independent storage system: a Windows or Linux server with local or DAS storage, a NAS share, or an object storage repository. When any one of those underlying storage systems fails, the extent goes offline and the SOBR is degraded.
Storage hardware fails. RAID controllers corrupt volumes. NAS shares become unreachable due to network events. An extent host runs out of disk space and the underlying filesystem goes read-only. A Windows update reboots an extent host mid-job and the host does not come back up cleanly. All of these are real triggers that happen in production environments.
The blast radius depends on two factors: the SOBR placement policy and which backup chains had data on the failed extent. With Data Locality policy, all files in a backup chain (full and incrementals) are on the same extent. A failed extent takes out the chains stored on it, but chains on other extents are unaffected. With Performance policy, the full and incrementals for a chain may be on different extents. A single failed extent can break chains whose incrementals are on the failed extent even if the full is elsewhere, or vice versa.
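The difference in failure domains between the two policies can be sketched with a small model. This is purely illustrative; the extent and chain names are made up, not Veeam API objects:

```python
# Illustrative model of SOBR placement-policy failure domains.
# Extent names and chain layouts are hypothetical examples.

# Data Locality: every file in a chain sits on one extent.
data_locality = {
    "vm-a": {"full": "extent1", "incrementals": "extent1"},
    "vm-b": {"full": "extent2", "incrementals": "extent2"},
}

# Performance policy: fulls and incrementals may land on different extents.
performance = {
    "vm-a": {"full": "extent1", "incrementals": "extent2"},
    "vm-b": {"full": "extent2", "incrementals": "extent1"},
}

def broken_chains(layout, failed_extent):
    """A chain is unusable if any of its files are on the failed extent."""
    return sorted(
        vm for vm, files in layout.items()
        if failed_extent in files.values()
    )

# One failed extent under Data Locality breaks only the chains stored there;
# under Performance policy it can also break chains whose fulls live elsewhere.
print(broken_chains(data_locality, "extent1"))  # ['vm-a']
print(broken_chains(performance, "extent1"))    # ['vm-a', 'vm-b']
```

The model makes the trade-off concrete: under Performance policy, losing one of two extents can break every chain, because each chain has files on both.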
Backup copy jobs are also affected. If a backup copy job's chain spans the failed extent, the next copy run will fail. New incremental copies cannot be created for chains that are broken.
Triage
1. Identify the failed extent. In the VBR console, go to Backup Infrastructure, then Scale-Out Repositories. Click the affected SOBR. In the right pane, the extent list shows each extent with its status. An offline or unavailable extent shows a red or yellow indicator. Note the extent name and host.
2. Diagnose the extent failure. Can you reach the extent host? Ping it. RDP or SSH to it. Check whether the storage system is online (RAID status, disk health, filesystem mount). Determine whether this is a transient failure (network blip, host rebooting) or a permanent failure (storage hardware dead).
3. Determine which backup chains have data on the failed extent. Right-click the failed extent in VBR and select Properties. Review the backup files listed as stored on this extent, and note the job names and VM names. These are the restore points that are currently unavailable.
4. Check the SOBR's placement policy. In the SOBR properties, look at the Placement Policy setting. With Data Locality, only the chains stored on the failed extent are affected. With Performance, chains may be split, with the full on one extent and incrementals on another, so a single extent failure can affect chains whose files span multiple extents.
5. Check whether the SOBR has a capacity tier (object storage). If a capacity tier is configured and offloading was running, older restore points that were offloaded to object storage may still be accessible. In VBR, check whether restore points appear under the SOBR even with the extent offline.
6. Determine whether an active restore session was running against the failed extent when it went offline. Check for active restore sessions in the VBR console Home view. An Instant VM Recovery running from a restore point on the failed extent must be addressed immediately.
Recovery Path A. Transient Failure, Extent Can Be Recovered
1. Put the failed extent into Maintenance Mode in VBR before doing anything else. Right-click the extent in Backup Infrastructure, Scale-Out Repositories, and select Maintenance Mode. This tells VBR to stop trying to write to this extent and prevents backup jobs from failing repeatedly against an unavailable target.
2. Repair the underlying storage issue. Bring the extent host back online, resolve the RAID event, fix the network path, or restore the filesystem. Take the time to fully confirm the storage is healthy before bringing the extent back into service. An extent that returns with silent data corruption is worse than a clean failure.
3. After the extent host is healthy, verify the backup files are intact. SSH or RDP to the extent host and check the backup directory. Confirm .vbk and .vib files are present and the filesystem is mounted correctly. Run a filesystem check if there is any doubt (chkdsk for NTFS/ReFS, fsck for ext4/XFS on Linux).
4. Remove the extent from Maintenance Mode in VBR. Right-click the extent and deselect Maintenance Mode. VBR will reconnect to the extent and rescan it. Backup chains that were stored on this extent will become accessible again.
5. Run a manual rescan of the SOBR. Right-click the SOBR in VBR and select Rescan. This reconciles VBR's database records with the actual files on all extents.
6. Re-enable jobs that were targeting this SOBR and run them manually. Monitor the first run for any errors related to the previously failed extent.
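The file-presence part of step 3 can be scripted. A minimal sketch, assuming you can run Python on (or against a mount of) the extent host; the repository path is hypothetical and should be replaced with your extent's actual backup folder:

```python
from pathlib import Path

def inventory_backup_files(backup_dir):
    """Count Veeam backup file types under a repository folder: full
    backups (.vbk), incrementals (.vib), and chain metadata (.vbm).
    A quick sanity check after storage repair, before leaving
    Maintenance Mode; it proves files are visible, not that they are
    uncorrupted."""
    counts = {".vbk": 0, ".vib": 0, ".vbm": 0}
    for path in Path(backup_dir).rglob("*"):
        if path.suffix.lower() in counts:
            counts[path.suffix.lower()] += 1
    return counts

# Example call (hypothetical path):
# inventory_backup_files("/mnt/extent01/Backups")
```

Compare the counts against what the extent's Properties dialog listed in triage; a shortfall means files are missing and the chains involving them will not come back cleanly.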
Recovery Path B. Permanent Failure, Extent Is Gone
1. Put the failed extent into Maintenance Mode immediately if it is not already. This is required before any SOBR service operations.
2. Assess what was lost. The backup chains stored on the failed extent are gone. Document which VMs have lost restore points and what the date range of the loss is. Check whether a backup copy job has copies of these chains on a separate repository.
3. Enable the "Perform full backup when required extent is offline" setting on the SOBR if it is not already set. In the SOBR advanced settings, this option forces VBR to create a new active full backup for any chain whose required extent is unavailable, rather than failing the job. This allows backup jobs to continue running on the remaining healthy extents. Note: this requires enough free space on the remaining extents to accommodate the new full backups.
4. Add a replacement extent to the SOBR. Provision a new storage system, add it as a simple repository in VBR Backup Infrastructure, then add it to the SOBR. Right-click the SOBR, select Properties, and add the new extent. The SOBR immediately becomes aware of the new storage.
5. Remove the failed extent from the SOBR. In the SOBR properties, select the failed extent and click Remove. VBR will warn you that backup chains on the removed extent will no longer be accessible. Confirm the removal.
6. Run backup jobs for all affected VMs. With the "Perform full backup when required extent is offline" setting active, VBR will create new active fulls for chains that were on the failed extent. These new chains land on the remaining healthy extents and the new extent. Backup coverage is restored after these jobs complete.
7. After the new active fulls complete, run a SOBR rescan to update VBR's view of all chains and extents.
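The space caveat in step 3 is simple arithmetic, and it is worth checking before the jobs fire. A rough sketch (the sizes are invented, and the greedy placement is an illustration of the headroom question, not Veeam's actual placement logic):

```python
TB = 1024 ** 4  # bytes in a tebibyte

def headroom_ok(free_bytes_per_extent, new_full_sizes):
    """Rough feasibility check: can the remaining extents absorb the new
    active fulls? Greedily places the largest full on the extent with the
    most free space. Illustrative only; real placement also depends on
    policy and per-extent load."""
    free = list(free_bytes_per_extent)
    for size in sorted(new_full_sizes, reverse=True):
        free.sort(reverse=True)
        if not free or free[0] < size:
            return False
        free[0] -= size
    return True

# Two healthy extents left, three fulls to recreate (invented numbers).
print(headroom_ok([6 * TB, 4 * TB], [3 * TB, 2 * TB, 2 * TB]))  # True
print(headroom_ok([2 * TB, 2 * TB], [3 * TB]))                  # False
```

Note the second case: 4 TB of total free space is not enough for a single 3 TB full if no one extent has 3 TB free, because a full backup file cannot be split across extents.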
Recovery Path C. Urgent Restore Needed from the Failed Extent
1. Check the capacity tier first. If the SOBR has object storage offloading enabled and the restore point you need is old enough to have been offloaded, it may be accessible via the capacity tier even with the performance extent offline. In VBR Home, look for the restore point under the SOBR. Capacity tier restore points show even when the performance extent is down.
2. Check backup copy jobs. If a backup copy job was running against this SOBR and writing to a separate repository, that repository has an independent copy of the backup chain. Use the backup copy as the restore source. In VBR Home, expand Backups and look under the backup copy job's target repository.
3. If the extent failure is a network issue rather than a physical storage failure, attempt to restore the network path to the extent host before attempting any other recovery. A working network path to intact storage is better than any workaround.
4. If the storage hardware is physically dead but the disks may be intact, engage your storage hardware vendor's support for disk recovery options. In a RAID failure scenario, the individual disks may contain recoverable data even if the RAID controller has failed. This is outside Veeam's scope but is worth pursuing in parallel with starting new backups from scratch.
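The precedence in this path (capacity tier, then backup copy, then hardware recovery) can be written down as a simple decision check. Illustrative only; the availability flags would come from your own triage findings, not from any Veeam API:

```python
def restore_source(capacity_tier_has_point, backup_copy_has_point,
                   disks_maybe_intact):
    """Pick the fastest viable restore source, in the order this
    runbook recommends. Inputs are triage findings, supplied by hand."""
    if capacity_tier_has_point:
        return "capacity tier (object storage)"
    if backup_copy_has_point:
        return "backup copy repository"
    if disks_maybe_intact:
        return "engage storage vendor for disk-level recovery"
    return "no restore source; start new backup chains"

# Example: capacity tier misses the point, but a backup copy exists.
print(restore_source(False, True, True))  # backup copy repository
```

The ordering matters because it ranks sources by restore speed and certainty: object storage and a copy repository are immediately usable, while vendor disk recovery is slow and uncertain.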
Gotchas
- Put the failed extent into Maintenance Mode first, before any other SOBR operation.
- Performance policy SOBR: one extent failure can break chains spread across two extents.
- Data Locality policy: the failure domain is a single extent, which makes it the safer default.
- Transient failure: Maintenance Mode, fix storage, remove from Maintenance Mode, rescan.
- Permanent failure: add a replacement extent, enable "Perform full backup when required extent is offline", remove the failed extent, run jobs.
- Urgent restore: check the capacity tier first, then backup copy jobs, then hardware recovery options.
- Evacuation (the Evacuate Backups action, which migrates an extent's backup files to the other extents) is all-or-nothing and I/O intensive. Only evacuate healthy extents you are retiring; it cannot read data off a dead extent.
- The active full option needs space headroom on the remaining extents. Verify free space before relying on it.
- Configuration backup cannot target a SOBR. Keep a standalone simple repository for the config backup.
- A backup copy job to a non-SOBR target is what saves you when extent data is unrecoverable.
Prevention Checklist
- Use the Data Locality placement policy unless you have a specific, documented performance reason for the Performance policy. Data Locality limits the blast radius of an extent failure to the chains stored on that extent.
- Enable the "Perform full backup when required extent is offline" setting on every SOBR. This keeps jobs running when an extent goes down rather than failing all jobs until the extent is restored.
- Size the extents so the remaining ones can absorb a full backup of every VM that could be on any single extent. When an extent fails and the active full option fires, you need the space headroom.
- Run a backup copy job for all critical VMs. The copy job writes to a separate repository outside the SOBR, giving you a recovery path independent of SOBR extent health.
- Enable object storage as a capacity tier on the SOBR for long-term retention. Offloaded restore points survive performance extent failures.
- Maintain at least one standalone simple repository outside the SOBR as the VBR configuration backup target. A SOBR cannot be the config backup target.
- Monitor extent health from an external system. Veeam ONE can alert on extent offline events. Do not rely on manually checking the VBR console to discover extent failures.
- Test extent failure scenarios in a lab annually. Put an extent into Maintenance Mode, confirm jobs continue with active fulls on the remaining extents, and confirm restores work from the remaining extents and the capacity tier.