SOBR Extent Failure - Recovering a Scale-Out Backup Repository After an Extent Goes Offline
Why This Happens
A Scale-Out Backup Repository (SOBR) aggregates multiple simple repositories (extents) into a single logical target. Each extent is an independent storage system: a Windows or Linux server with local or DAS storage, a NAS share, or an object storage repository. When any one of those underlying storage systems fails, the extent goes offline and the SOBR is degraded.
Storage hardware fails. RAID controllers corrupt volumes. NAS shares become unreachable due to network events. An extent host runs out of disk space and the underlying filesystem goes read-only. A Windows update reboots an extent host mid-job and the host does not come back up cleanly. All of these are real triggers that happen in production environments.
The blast radius depends on two factors: the SOBR placement policy and which backup chains had data on the failed extent. With Data Locality policy, all files in a backup chain (full and incrementals) are on the same extent. A failed extent takes out the chains stored on it, but chains on other extents are unaffected. With Performance policy, the full and incrementals for a chain may be on different extents. A single failed extent can break chains whose incrementals are on the failed extent even if the full is elsewhere, or vice versa.
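The difference in failure domains between the two policies can be sketched with a small model. This is purely illustrative; the extent and chain names are made up, not Veeam API objects:

```python
# Illustrative model of SOBR placement-policy failure domains.
# Extent names and chain layouts are hypothetical examples.

# Data Locality: every file in a chain sits on one extent.
data_locality = {
    "vm-a": {"full": "extent1", "incrementals": "extent1"},
    "vm-b": {"full": "extent2", "incrementals": "extent2"},
}

# Performance policy: fulls and incrementals may land on different extents.
performance = {
    "vm-a": {"full": "extent1", "incrementals": "extent2"},
    "vm-b": {"full": "extent2", "incrementals": "extent1"},
}

def broken_chains(layout, failed_extent):
    """A chain is unusable if any of its files are on the failed extent."""
    return sorted(
        vm for vm, files in layout.items()
        if failed_extent in files.values()
    )

# One failed extent under Data Locality breaks only the chains stored there;
# under Performance policy it can also break chains whose fulls live elsewhere.
print(broken_chains(data_locality, "extent1"))  # ['vm-a']
print(broken_chains(performance, "extent1"))    # ['vm-a', 'vm-b']
```

The model makes the trade-off concrete: under Performance policy, losing one of two extents can break every chain, because each chain has files on both.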
Backup copy jobs are also affected. If a backup copy job's chain spans the failed extent, the next copy run will fail. New incremental copies cannot be created for chains that are broken.
Triage
1. Identify the failed extent. In the VBR console, go to Backup Infrastructure, then Scale-Out Repositories. Click the affected SOBR. In the right pane, the extent list shows each extent with its status. An offline or unavailable extent shows a red or yellow indicator. Note the extent name and host.
2. Diagnose the extent failure. Can you reach the extent host? Ping it. RDP or SSH to it. Check whether the storage system is online (RAID status, disk health, filesystem mount). Determine whether this is a transient failure (network blip, host rebooting) or a permanent failure (storage hardware dead).
3. Determine which backup chains have data on the failed extent. Right-click the failed extent in VBR and select Properties. Review the backup files listed as stored on this extent, and note the job names and VM names. These are the restore points that are currently unavailable.
4. Check the SOBR's placement policy. In the SOBR properties, look at the Placement Policy setting. With Data Locality, only the chains stored on the failed extent are affected. With Performance, chains may be split, with the full on one extent and incrementals on another, so a single extent failure can affect chains whose files span multiple extents.
5. Check whether the SOBR has a capacity tier (object storage). If a capacity tier is configured and offloading was running, older restore points that were offloaded to object storage may still be accessible. In VBR, check whether restore points appear under the SOBR even with the extent offline.
6. Determine whether an active restore session was running against the failed extent when it went offline. Check for active restore sessions in the VBR console Home view. An Instant VM Recovery running from a restore point on the failed extent must be addressed immediately.
Recovery Path A. Transient Failure, Extent Can Be Recovered
1. Put the failed extent into Maintenance Mode in VBR before doing anything else. Right-click the extent in Backup Infrastructure, Scale-Out Repositories, and select Maintenance Mode. This tells VBR to stop trying to write to this extent and prevents backup jobs from failing repeatedly against an unavailable target.
2. Repair the underlying storage issue. Bring the extent host back online, resolve the RAID event, fix the network path, or restore the filesystem. Take the time to fully confirm the storage is healthy before bringing the extent back into service. An extent that returns with silent data corruption is worse than a clean failure.
3. After the extent host is healthy, verify the backup files are intact. SSH or RDP to the extent host and check the backup directory. Confirm .vbk and .vib files are present and the filesystem is mounted correctly. Run a filesystem check if there is any doubt (chkdsk for NTFS/ReFS, fsck for ext4/XFS on Linux).
4. Remove the extent from Maintenance Mode in VBR. Right-click the extent and deselect Maintenance Mode. VBR will reconnect to the extent and rescan it. Backup chains that were stored on this extent will become accessible again.
5. Run a manual rescan of the SOBR. Right-click the SOBR in VBR and select Rescan. This reconciles VBR's database records with the actual files on all extents.
6. Re-enable jobs that were targeting this SOBR and run them manually. Monitor the first run for any errors related to the previously failed extent.
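The file-presence part of step 3 can be scripted. A minimal sketch, assuming you can run Python on (or against a mount of) the extent host; the repository path is hypothetical and should be replaced with your extent's actual backup folder:

```python
from pathlib import Path

def inventory_backup_files(backup_dir):
    """Count Veeam backup file types under a repository folder: full
    backups (.vbk), incrementals (.vib), and chain metadata (.vbm).
    A quick sanity check after storage repair, before leaving
    Maintenance Mode; it proves files are visible, not that they are
    uncorrupted."""
    counts = {".vbk": 0, ".vib": 0, ".vbm": 0}
    for path in Path(backup_dir).rglob("*"):
        if path.suffix.lower() in counts:
            counts[path.suffix.lower()] += 1
    return counts

# Example call (hypothetical path):
# inventory_backup_files("/mnt/extent01/Backups")
```

Compare the counts against what the extent's Properties dialog listed in triage; a shortfall means files are missing and the chains involving them will not come back cleanly.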
Recovery Path B. Permanent Failure, Extent Is Gone
1. Put the failed extent into Maintenance Mode immediately if it is not already. This is required before any SOBR service operations.
2. Assess what was lost. The backup chains stored on the failed extent are gone. Document which VMs have lost restore points and what the date range of the loss is. Check whether a backup copy job has copies of these chains on a separate repository.
3. Enable the "Perform full backup when required extent is offline" setting on the SOBR if it is not already set. In the SOBR advanced settings, this option forces VBR to create a new active full backup for any chain whose required extent is unavailable, rather than failing the job. This allows backup jobs to continue running on the remaining healthy extents. Note: this requires enough free space on the remaining extents to accommodate the new full backups.
4. Add a replacement extent to the SOBR. Provision a new storage system, add it as a simple repository in VBR Backup Infrastructure, then add it to the SOBR. Right-click the SOBR, select Properties, and add the new extent. The SOBR immediately becomes aware of the new storage.
5. Remove the failed extent from the SOBR. In the SOBR properties, select the failed extent and click Remove. VBR will warn you that backup chains on the removed extent will no longer be accessible. Confirm the removal.
6. Run backup jobs for all affected VMs. With the "Perform full backup when required extent is offline" setting active, VBR will create new active fulls for chains that were on the failed extent. These new chains land on the remaining healthy extents and the new extent. Backup coverage is restored after these jobs complete.
7. After the new active fulls complete, run a SOBR rescan to update VBR's view of all chains and extents.
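The space caveat in step 3 is simple arithmetic, and it is worth checking before the jobs fire. A rough sketch (the sizes are invented, and the greedy placement is an illustration of the headroom question, not Veeam's actual placement logic):

```python
TB = 1024 ** 4  # bytes in a tebibyte

def headroom_ok(free_bytes_per_extent, new_full_sizes):
    """Rough feasibility check: can the remaining extents absorb the new
    active fulls? Greedily places the largest full on the extent with the
    most free space. Illustrative only; real placement also depends on
    policy and per-extent load."""
    free = list(free_bytes_per_extent)
    for size in sorted(new_full_sizes, reverse=True):
        free.sort(reverse=True)
        if not free or free[0] < size:
            return False
        free[0] -= size
    return True

# Two healthy extents left, three fulls to recreate (invented numbers).
print(headroom_ok([6 * TB, 4 * TB], [3 * TB, 2 * TB, 2 * TB]))  # True
print(headroom_ok([2 * TB, 2 * TB], [3 * TB]))                  # False
```

Note the second case: 4 TB of total free space is not enough for a single 3 TB full if no one extent has 3 TB free, because a full backup file cannot be split across extents.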
Recovery Path C. Urgent Restore Needed from the Failed Extent
1. Check the capacity tier first. If the SOBR has object storage offloading enabled and the restore point you need is old enough to have been offloaded, it may be accessible via the capacity tier even with the performance extent offline. In VBR Home, look for the restore point under the SOBR. Capacity tier restore points show even when the performance extent is down.
2. Check backup copy jobs. If a backup copy job was running against this SOBR and writing to a separate repository, that repository has an independent copy of the backup chain. Use the backup copy as the restore source. In VBR Home, expand Backups and look under the backup copy job's target repository.
3. If the extent failure is a network issue rather than a physical storage failure, attempt to restore the network path to the extent host before attempting any other recovery. A working network path to intact storage is better than any workaround.
4. If the storage hardware is physically dead but the disks may be intact, engage your storage hardware vendor's support for disk recovery options. In a RAID failure scenario, the individual disks may contain recoverable data even if the RAID controller has failed. This is outside Veeam's scope but is worth pursuing in parallel with starting new backups from scratch.
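The precedence in this path (capacity tier, then backup copy, then hardware recovery) can be written down as a simple decision check. Illustrative only; the availability flags would come from your own triage findings, not from any Veeam API:

```python
def restore_source(capacity_tier_has_point, backup_copy_has_point,
                   disks_maybe_intact):
    """Pick the fastest viable restore source, in the order this
    runbook recommends. Inputs are triage findings, supplied by hand."""
    if capacity_tier_has_point:
        return "capacity tier (object storage)"
    if backup_copy_has_point:
        return "backup copy repository"
    if disks_maybe_intact:
        return "engage storage vendor for disk-level recovery"
    return "no restore source; start new backup chains"

# Example: capacity tier misses the point, but a backup copy exists.
print(restore_source(False, True, True))  # backup copy repository
```

The ordering matters because it ranks sources by restore speed and certainty: object storage and a copy repository are immediately usable, while vendor disk recovery is slow and uncertain.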
Gotchas
- Put the failed extent into Maintenance Mode first, before any other SOBR operation.
- Performance policy SOBR: one extent failure can break chains spread across two extents.
- Data Locality policy: the failure domain is a single extent, which makes it the safer default.
- Transient failure: Maintenance Mode, fix storage, remove from Maintenance Mode, rescan.
- Permanent failure: add a replacement extent, enable "Perform full backup when required extent is offline", remove the failed extent, run jobs.
- Urgent restore: check the capacity tier first, then backup copy jobs, then hardware recovery options.
- Evacuation (the Evacuate Backups action, which migrates an extent's backup files to the other extents) is all-or-nothing and I/O intensive. Only evacuate healthy extents you are retiring; it cannot read data off a dead extent.
- The active full option needs space headroom on the remaining extents. Verify free space before relying on it.
- Configuration backup cannot target a SOBR. Keep a standalone simple repository for the config backup.
- A backup copy job to a non-SOBR target is what saves you when extent data is unrecoverable.
Prevention Checklist
- Use the Data Locality placement policy unless you have a specific, documented performance reason for the Performance policy. Data Locality limits the blast radius of an extent failure to the chains stored on that extent.
- Enable the "Perform full backup when required extent is offline" setting on every SOBR. This keeps jobs running when an extent goes down rather than failing all jobs until the extent is restored.
- Size the extents so the remaining ones can absorb a full backup of every VM that could be on any single extent. When an extent fails and the active full option fires, you need the space headroom.
- Run a backup copy job for all critical VMs. The copy job writes to a separate repository outside the SOBR, giving you a recovery path independent of SOBR extent health.
- Enable object storage as a capacity tier on the SOBR for long-term retention. Offloaded restore points survive performance extent failures.
- Maintain at least one standalone simple repository outside the SOBR as the VBR configuration backup target. A SOBR cannot be the config backup target.
- Monitor extent health from an external system. Veeam ONE can alert on extent offline events. Do not rely on manually checking the VBR console to discover extent failures.
- Test extent failure scenarios in a lab annually. Put an extent into Maintenance Mode, confirm jobs continue with active fulls on the remaining extents, and confirm restores work from the remaining extents and the capacity tier.