Veeam v13: SureBackup Verification for vSphere

Veeam v13 Series | Component: VBR v13, vSphere | Audience: Hands-on Sysadmins, Enterprise Architects

SureBackup is Veeam's answer to the question that every backup administrator eventually gets asked: how do you know your backups actually work? The answer "because the job completed successfully" isn't enough. A backup job completing without errors means the data was written to the repository. It doesn't mean the VM boots, the applications can start, or the data is consistent. SureBackup tests all of that. It starts VMs directly from compressed, deduplicated backup files in an isolated virtual environment, runs real tests against live applications, and produces a verified result. Not an assumption. An actual test result.

This article covers SureBackup for vSphere from the ground up: how it works under the hood, the two verification modes and when to use each, virtual lab design and networking, application group configuration with real examples for AD, SQL Server, and web applications, custom test scripts, running SureBackup jobs at scale with PowerShell, and the failure patterns you'll encounter most often and how to fix them.


1. How SureBackup Works

SureBackup uses the Veeam vPower NFS Service to mount backup files as a datastore directly on an ESXi host. VMs are registered and started from those mounted backup files without copying or extracting them. The VM boots and runs from compressed, deduplicated backup data in place. All writes during verification go to redo log files on a separate datastore, not back into the backup file. When verification ends, Veeam deletes the redo logs and the backup files are untouched.

The virtual lab is the isolated network environment where this happens. It's a proxy appliance VM (the "masquerade" gateway) that VBR deploys automatically, plus isolated network segments mapped to your production networks. VMs in the virtual lab get the same IPs they have in production, but they can't reach production because the network is isolated. The proxy appliance acts as the routing bridge between VBR and the isolated VMs, allowing test scripts to communicate with application services running inside the lab without those services being able to reach production.

The Two Verification Modes

  • Full recoverability testing: VMs boot from backup in the virtual lab and Veeam runs heartbeat tests, ping tests, application tests, and custom scripts. This proves the VM can boot and the application can start from backup. Requires a virtual lab and application group. This is the mode that actually verifies recoverability.
  • Backup verification and content scan only: Veeam performs a CRC check on the backup file and optionally runs a malware scan and YARA rule check against the backup contents. No virtual lab required. No VM is booted. This is fast and works on any backup type, including Veeam Agent backups. It proves the backup file is intact and clean, not that the VM can recover from it.

Don't confuse these two modes. Content scan only is useful and worth running, but it's not recoverability testing. If your SureBackup job doesn't boot VMs, you haven't proven your backups are recoverable.


2. Virtual Lab Design

The virtual lab is where VMs run during verification. Get the design right. A poorly configured lab produces false failures, or worse, allows test traffic to leak into the production network.

Basic vs Advanced Single-Host Labs

  • Basic single-host: VBR creates isolated networks automatically using VLAN offsets. Production VLAN 100 becomes isolated VLAN 200 (or whatever offset you configure). Simple to set up. Works well when every VM you verify, plus its application group, sits on the same production network and network isolation via VLAN offset is sufficient.
  • Advanced single-host: You specify the isolated networks manually. Required when VMs span multiple VLANs with complex routing between them, when you need precise control over which networks are isolated, or when VLAN offsets would conflict with existing VLANs in your environment. This is the right choice for most production environments with any network complexity.

Configuring the Virtual Lab

  1. In VBR, go to Backup Infrastructure, then SureBackup, then Virtual Labs, and click Add Virtual Lab.
  2. Select the ESXi host where the lab will run. Choose a host with enough resources to boot the VMs you plan to verify simultaneously. The lab doesn't provision resources in advance; VMs consume host resources only when the SureBackup job runs.
  3. Select the datastore where redo logs will be written during verification. Use a fast local datastore or an SSD-backed LUN. Redo logs can grow significantly during extended tests on write-heavy VMs.
  4. Configure the proxy appliance settings. The proxy appliance needs an IP address on a production network that the VBR server can reach, typically the management network. All traffic between VBR and the isolated VMs passes through that address.
  5. Configure isolated networks. For each production network your verified VMs use, create a corresponding isolated network. Map the isolated port group to a vSwitch or distributed port group without physical uplinks. Without uplinks, the isolated network truly has no path to production.
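  6. Configure IP masquerading. For each isolated network, assign a masquerade network: an unused address range that stands in for the production network, for example 172.18.1.0/24 for production 172.16.1.0/24. VBR adds a route to the masquerade network through the proxy appliance, and the appliance translates masquerade addresses to the production IPs the VMs keep inside the isolated network. That's how Veeam's test scripts reach VMs that have production IPs in the isolated network without touching production.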
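The masquerade mapping is easiest to see with numbers. The sketch below is a hypothetical helper (ConvertTo-MasqueradeIp is not a Veeam cmdlet) that assumes the lab keeps the host part of each address and swaps in the masquerade network part, with both networks using the same prefix length. Under that assumption, a VM at 172.16.1.25 on production 172.16.1.0/24 is reached from the VBR server at 172.18.1.25 when the masquerade network is 172.18.1.0/24.

```powershell
# Hypothetical helper, not a Veeam cmdlet. Assumes the virtual lab keeps the
# host part of a VM's production IP and swaps in the masquerade network part,
# with both networks using the same prefix length.
function ConvertTo-MasqueradeIp {
    param(
        [Parameter(Mandatory)] [string]$IpAddress,     # e.g. 172.16.1.25
        [Parameter(Mandatory)]
        [ValidatePattern('^\d{1,3}(\.\d{1,3}){3}/\d{1,2}$')]
        [string]$ProductionNetwork,                    # e.g. 172.16.1.0/24
        [Parameter(Mandatory)]
        [ValidatePattern('^\d{1,3}(\.\d{1,3}){3}/\d{1,2}$')]
        [string]$MasqueradeNetwork                     # e.g. 172.18.1.0/24
    )

    # Dotted-quad string -> 32-bit address value held in a double (exact at this size)
    function Get-IpValue([string]$Ip) {
        $b = [System.Net.IPAddress]::Parse($Ip).GetAddressBytes()
        return ($b[0] * 16777216.0) + ($b[1] * 65536.0) + ($b[2] * 256.0) + $b[3]
    }

    $prodBase, $prodPrefix = $ProductionNetwork.Split('/')
    $masqBase, $masqPrefix = $MasqueradeNetwork.Split('/')
    if ($prodPrefix -ne $masqPrefix) {
        throw "Production and masquerade networks must use the same prefix length"
    }

    $blockSize = [math]::Pow(2, 32 - [int]$prodPrefix)   # addresses per network
    $prodStart = Get-IpValue $prodBase
    $prodStart -= $prodStart % $blockSize                # normalize to the network address
    $masqStart = Get-IpValue $masqBase
    $masqStart -= $masqStart % $blockSize

    $ipValue = Get-IpValue $IpAddress
    if ($ipValue -lt $prodStart -or $ipValue -ge ($prodStart + $blockSize)) {
        throw "$IpAddress is not inside $ProductionNetwork"
    }

    # Same offset inside the masquerade network as inside the production network
    $result = $masqStart + ($ipValue - $prodStart)

    # Back to dotted-quad notation, most significant octet first
    $octets = foreach ($weight in 16777216.0, 65536.0, 256.0, 1.0) {
        [math]::Floor($result / $weight) % 256
    }
    return ($octets -join '.')
}
```

If a test script targets an address that doesn't match this calculation, the masquerade settings in the virtual lab are the first thing to recheck.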
The isolated network port group must have no physical uplinks assigned. That's the entire isolation mechanism. If you accidentally assign a physical uplink, the VMs in the virtual lab will be on the production network with their production IPs and could conflict with running production VMs. VBR doesn't validate that the isolated port group has no uplinks. You have to verify this manually in vCenter after creating the lab.
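Because VBR won't check the uplinks for you, a small script can. The sketch below is a hypothetical helper that flags any virtual switch still holding physical NICs. It assumes standard vSwitches, where VMware PowerCLI's Get-VirtualSwitch output carries the uplinks in its Nic property; distributed switches manage uplinks differently and aren't covered. Server, host, and switch names in the commented example are placeholders.

```powershell
# Hypothetical helper: return every virtual switch that still has physical NICs
# (uplinks) attached. Each input object needs Name and Nic properties; PowerCLI's
# Get-VirtualSwitch output for standard vSwitches provides both.
function Find-UplinkedLabSwitch {
    param(
        [Parameter(Mandatory)] [object[]]$VirtualSwitch
    )
    # Any non-empty entry in .Nic is a path from the "isolated" network to production
    $VirtualSwitch | Where-Object { @($_.Nic | Where-Object { $_ }).Count -gt 0 }
}

# Example with VMware PowerCLI (server, host, and switch names are placeholders):
#   Connect-VIServer -Server "vcenter.domain.local"
#   $lab   = Get-VirtualSwitch -VMHost "esx01.domain.local" -Standard -Name "<isolated lab vSwitch>"
#   $leaks = Find-UplinkedLabSwitch -VirtualSwitch $lab
#   if ($leaks) { Write-Warning "Uplinks found on: $($leaks.Name -join ', ')" }
```

Running a check like this after every virtual lab change costs seconds and turns a manual vCenter inspection into something you can schedule.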

3. Application Groups

The application group is the ordered list of dependency VMs that must start before your verified VMs. If you're verifying a web application server, the application group contains the domain controller and the SQL Server it depends on. Veeam starts them in order, waits for each to reach a stabilized state, then starts the VMs under verification.

If Veeam can't find a valid restore point for any VM in the application group, the SureBackup job fails entirely. Plan for this up front: application group VMs need to be covered by backup jobs with recent, healthy restore points. A stale restore point for the SQL Server in the application group blocks verification of every VM that depends on it.
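A pre-flight check on application group restore points can catch this before the SureBackup job runs. The sketch below is a hypothetical helper: it only needs objects with Name and CreationTime properties, which Veeam's Get-VBRRestorePoint output should provide (verify against your VBR version), and the 26-hour default is an assumption based on daily backups plus some slack. VM names in the commented example are placeholders.

```powershell
# Hypothetical pre-flight check: report application group VMs whose newest
# restore point is missing or older than MaxAgeHours. Input objects need Name
# (VM name) and CreationTime properties.
function Find-StaleAppGroupVm {
    param(
        [Parameter(Mandatory)] [string[]]$VmName,
        [object[]]$RestorePoint = @(),
        [int]$MaxAgeHours = 26,          # assumption: daily backups plus slack
        [datetime]$Now = (Get-Date)
    )
    $cutoff = $Now.AddHours(-$MaxAgeHours)

    foreach ($vm in $VmName) {
        # Newest restore point for this VM, if any
        $latest = $RestorePoint |
            Where-Object { $_.Name -eq $vm } |
            Sort-Object CreationTime -Descending |
            Select-Object -First 1

        if (-not $latest) {
            [pscustomobject]@{ VmName = $vm; LatestPoint = $null; Reason = 'No restore point' }
        } elseif ($latest.CreationTime -lt $cutoff) {
            [pscustomobject]@{ VmName = $vm; LatestPoint = $latest.CreationTime; Reason = "Older than $MaxAgeHours hours" }
        }
    }
}

# Example against VBR (VM names are placeholders):
#   Connect-VBRServer -Server "vbr-server.domain.local"
#   $points = Get-VBRRestorePoint -Name "DC01", "SQL01"
#   Find-StaleAppGroupVm -VmName "DC01", "SQL01" -RestorePoint $points
```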

VM Roles and Startup Sequence

Each VM in the application group gets a role assignment that controls its startup behavior and what tests run against it. Roles are defined in XML files in %ProgramFiles%\Veeam\Backup and Replication\Backup\SbRoles\ on the VBR server. You can edit existing roles or create new ones by adding XML files to that directory.

| Built-in role | What it tests | Startup behavior |
| --- | --- | --- |
| Domain Controller | LDAP port 389 response. AD replication readiness. | Starts in non-authoritative mode by default. Use authoritative mode only if verifying whether AD data itself is recoverable. |
| Global Catalog | Global Catalog port 3268 response. | Same as Domain Controller role. |
| Mail Server (Exchange) | SMTP port 25 and OWA port 443 response. | Waits for Exchange services to initialize fully before marking VM ready. |
| SQL Server | SQL port 1433 response. Optional query execution. | Waits for SQL Server service to start and accept connections. |
| Web Server | HTTP port 80 or HTTPS port 443 response code check. | Waits for web server port to respond before marking ready. |

Application Initialization Timeout

Each VM in the application group has an Application initialization timeout setting (default 120 seconds). This is how long Veeam waits after the VM boots for the application service to start and respond to tests. If the application doesn't respond within this window, the test fails with a timeout error. This is one of the most common SureBackup failure patterns: the timeout is shorter than the application's real startup time. SQL Server on a VM with large databases or complex startup routines can easily take 3 to 5 minutes. Increase this timeout before concluding the application doesn't start from backup.
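Rather than guessing, you can measure how long a service takes to accept connections when its VM starts from backup. The hypothetical helper below polls a TCP port until it answers and reports the elapsed seconds, or $null if the deadline passes. Start it when the VM powers on in the lab and aim it at the VM's masquerade address (for example port 1433 for SQL Server). A port answering approximates application readiness; it doesn't prove it.

```powershell
# Hypothetical helper: poll a TCP port until it accepts a connection and return
# the elapsed seconds, or $null if TimeoutSeconds passes first.
function Measure-PortReadyTime {
    param(
        [Parameter(Mandatory)] [string]$ComputerName,
        [Parameter(Mandatory)] [int]$Port,
        [int]$TimeoutSeconds = 600,
        [int]$PollSeconds = 5
    )
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

    while ($stopwatch.Elapsed.TotalSeconds -lt $TimeoutSeconds) {
        $client = New-Object System.Net.Sockets.TcpClient
        try {
            # Bound each attempt to 3 seconds so an unreachable host can't stall the loop
            if ($client.ConnectAsync($ComputerName, $Port).Wait(3000)) {
                return [math]::Round($stopwatch.Elapsed.TotalSeconds, 1)
            }
        } catch {
            # Connection refused or host unreachable: the service isn't up yet
        } finally {
            $client.Dispose()
        }
        Start-Sleep -Seconds $PollSeconds
    }
    return $null
}
```

Across a few lab runs, set the Application initialization timeout comfortably above the slowest result.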


4. Custom Test Scripts

Built-in role tests confirm that a port is open. Custom scripts let you test that the application is actually doing something. For a SQL Server, you can run a query against a critical database and verify it returns expected results. For an AD domain controller, you can run an LDAP search and verify the correct OU structure. For a web application, you can fetch a page and verify a specific string in the response. These are the tests that actually prove your application data is intact, not just that the service is running.

Test scripts run on the VBR server and communicate with VMs in the virtual lab through the proxy appliance. To tell a script which VM to test, put the %vm_ip% or %vm_fqdn% variable in the script's Arguments field; VBR substitutes the VM's address or FQDN when it runs the test. If the VM has no VMware Tools installed or has no mapped network, these variables are empty and any test that references them is skipped with a warning in the job log.

PowerShell: Custom SQL Server test script for SureBackup
# Save this as a .ps1 file and reference it in the SureBackup job VM settings.
# Set the script's Arguments field to %vm_ip%: VBR substitutes the VM's address
# on the command line, where it binds to $vmIP as the first positional argument.
# The $env:vm_ip default is only a fallback for running the script by hand.

param([string]$vmIP = $env:vm_ip)

if (-not $vmIP) {
    Write-Host "VM IP not provided - check VMware Tools and network mapping"
    exit 1
}

# Test SQL Server connectivity and run a validation query.
# Connect by address and port: the default instance (MSSQLSERVER) listens on
# 1433 and must not be named as "\MSSQLSERVER" in the connection string.
$sqlServer = $vmIP
$database  = "ProductionDB"
$query     = "SELECT COUNT(*) FROM Orders WHERE OrderDate > DATEADD(day, -30, GETDATE())"

# Placeholder credentials: use a dedicated low-privilege SQL login and avoid
# keeping a real password in plain text in the script.
$connString = "Server=$sqlServer,1433;Database=$database;Integrated Security=False;" +
              "User Id=veeam_test;Password=TestPassword123;Connect Timeout=30;"

$conn = New-Object System.Data.SqlClient.SqlConnection($connString)
try {
    $conn.Open()

    $cmd = $conn.CreateCommand()
    $cmd.CommandText    = $query
    $cmd.CommandTimeout = 60
    $result = $cmd.ExecuteScalar()

    if ($result -gt 0) {
        Write-Host "SQL test PASSED: $result recent orders found in $database"
        exit 0
    } else {
        Write-Host "SQL test FAILED: query returned 0 rows - data may be stale"
        exit 1
    }
} catch {
    Write-Host "SQL test FAILED: $($_.Exception.Message)"
    exit 1
} finally {
    # Release the connection whether the test passed or failed
    $conn.Dispose()
}

PowerShell: Custom web application test script for SureBackup
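# Pass %vm_ip% in the script's Arguments field, as with the SQL script
param([string]$vmIP = $env:vm_ip)

if (-not $vmIP) {
    Write-Host "VM IP not provided - check VMware Tools and network mapping"
    exit 1
}

# Test that the web application returns a 200 and contains expected content.
# /health and "status: healthy" are placeholders for your application's own check.
$testUrl  = "http://$vmIP/health"
$expected = "status: healthy"

try {
    # 4xx/5xx responses throw and land in the catch block
    $response = Invoke-WebRequest -Uri $testUrl -TimeoutSec 30 -UseBasicParsing

    # Escape the expected text so -match treats it literally, not as a regex
    $contentOk = $response.Content -match [regex]::Escape($expected)

    if ($response.StatusCode -eq 200 -and $contentOk) {
        Write-Host "Web test PASSED: app returned 200 with expected health check"
        exit 0
    } else {
        Write-Host "Web test FAILED: StatusCode=$($response.StatusCode), content match: $contentOk"
        exit 1
    }
} catch {
    Write-Host "Web test FAILED: $($_.Exception.Message)"
    exit 1
}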
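The AD check mentioned at the start of this section, an LDAP search that confirms a known OU exists, can follow the same pattern. The sketch below is hypothetical: the function name, base DN, and OU name are placeholders, and it relies on System.DirectoryServices, so it needs PowerShell on Windows (where VBR runs test scripts anyway). It returns $true or $false so the calling script can turn the result into an exit code.

```powershell
# Hypothetical SureBackup check for a domain controller (names are illustrative).
# It looks up one expected OU over LDAP, which shows that directory data came
# back from the backup, not only that port 389 answers.
# Requires System.DirectoryServices, i.e. PowerShell on Windows.
function Test-AdOuExists {
    param(
        [string]$VmIp,                                  # pass %vm_ip% here
        [Parameter(Mandatory)] [string]$BaseDn,         # placeholder, e.g. DC=corp,DC=example,DC=com
        [Parameter(Mandatory)] [string]$OuName,         # placeholder, e.g. Servers
        [System.Management.Automation.PSCredential]$Credential
    )

    if (-not $VmIp) {
        Write-Host "VM IP not provided - check VMware Tools and network mapping"
        return $false
    }

    $path = "LDAP://$VmIp/$BaseDn"
    try {
        if ($Credential) {
            $entry = New-Object System.DirectoryServices.DirectoryEntry(
                $path, $Credential.UserName, $Credential.GetNetworkCredential().Password)
        } else {
            # Binds as the account the script runs under
            $entry = New-Object System.DirectoryServices.DirectoryEntry($path)
        }

        $searcher = New-Object System.DirectoryServices.DirectorySearcher($entry)
        $searcher.Filter = "(&(objectClass=organizationalUnit)(ou=$OuName))"
        $hit = $searcher.FindOne()

        if ($hit) {
            Write-Host "AD test PASSED: found OU=$OuName under $BaseDn"
            return $true
        }
        Write-Host "AD test FAILED: OU=$OuName not found under $BaseDn"
        return $false
    } catch {
        Write-Host "AD test FAILED: $($_.Exception.Message)"
        return $false
    }
}
```

A wrapper script would end with something like if (Test-AdOuExists -VmIp $args[0] -BaseDn 'DC=corp,DC=example,DC=com' -OuName 'Servers') { exit 0 } else { exit 1 }, with %vm_ip% set in the script's Arguments field.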

5. SureBackup Job Design

Linked Jobs vs Specific VMs

A SureBackup job can verify VMs from a linked backup job (all VMs in a job) or from a specific list of VMs you select manually. Linking to a backup job is operationally simpler: when VMs are added to the backup job, they're automatically included in SureBackup verification. Selecting VMs manually gives you more control over which VMs get the full recoverability test versus just the content scan.

For most environments, the right design is to link SureBackup to your most critical backup jobs and run full recoverability testing on the VMs in those jobs. Run content scan only on less critical jobs where the overhead of booting VMs isn't justified by the criticality of the workload.

Simultaneous VM Verification

A SureBackup job can start several VMs from linked jobs at the same time, up to the job's simultaneous processing limit (default 3 VMs). Raising the limit reduces total verification time but puts more load on the ESXi host running the virtual lab. Watch host CPU and memory during SureBackup runs and tune the concurrent limit based on what the host can absorb without squeezing production VMs.
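A rough starting value for the concurrency limit can come from simple arithmetic on the lab host's free capacity. The helper below is a hypothetical sizing sketch, not a Veeam formula: per-VM memory and CPU figures are inputs you'd take from vCenter, the application group's footprint is subtracted first because those VMs stay running for the whole job, and the 25% reserve is an arbitrary safety margin.

```powershell
# Hypothetical sizing sketch, not a Veeam formula. Capacity inputs come from the
# lab host in vCenter; the application group's footprint is subtracted first
# because those VMs stay running for the whole SureBackup job.
function Get-MaxConcurrentVm {
    param(
        [Parameter(Mandatory)] [double]$HostFreeMemoryGB,
        [Parameter(Mandatory)] [double]$PerVmMemoryGB,
        [Parameter(Mandatory)] [double]$HostFreeCpuMhz,
        [Parameter(Mandatory)] [double]$PerVmCpuMhz,
        [double]$AppGroupMemoryGB = 0,
        [double]$AppGroupCpuMhz = 0,
        [double]$ReserveFraction = 0.25   # arbitrary safety margin left untouched
    )
    $usableMemory = ($HostFreeMemoryGB - $AppGroupMemoryGB) * (1 - $ReserveFraction)
    $usableCpu    = ($HostFreeCpuMhz - $AppGroupCpuMhz) * (1 - $ReserveFraction)

    # Whichever resource runs out first sets the limit; never return a negative count
    $byMemory = [math]::Floor($usableMemory / $PerVmMemoryGB)
    $byCpu    = [math]::Floor($usableCpu / $PerVmCpuMhz)
    return [int][math]::Max(0, [math]::Min($byMemory, $byCpu))
}
```

For example, 128 GB and 40,000 MHz free, 16 GB and 8,000 MHz for the application group, and 8 GB and 4,000 MHz per verified VM gives a limit of 6, with CPU as the binding constraint. Treat the result as a starting point and confirm it against real host metrics during a run.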

Schedule and Job Overlap

The SureBackup job tries to use the most recent restore point available when it runs. If the backup job and the SureBackup job overlap in schedule, the backup file may be locked and SureBackup will wait for the backup job to finish before starting. The simplest approach is to chain the jobs: configure SureBackup to run after the linked backup job completes rather than on a fixed schedule. This guarantees SureBackup always has access to the freshest restore point and eliminates the overlap problem.


6. PowerShell Automation

PowerShell: Create a SureBackup job linked to a backup job with application group
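Connect-VBRServer -Server "vbr-server.domain.local"

# SureBackup cmdlet names and parameters have changed between VBR releases;
# check the Veeam PowerShell reference for your version before relying on these.

# Get the virtual lab, application group, and the backup job to link
$vlab      = Get-VSBVirtualLab -Name "Virtual-Lab-Production"
$appGrp    = Get-VSBApplicationGroup -Name "AppGroup-AD-SQL"
$sourceJob = Get-VBRJob -Name "Backup - Production VMs"

# Stop early if any lookup returned nothing
if (-not $vlab -or -not $appGrp -or -not $sourceJob) {
    Disconnect-VBRServer
    throw "Virtual lab, application group, or backup job not found"
}

# Create the SureBackup job
$sbJob = Add-VSBJob `
    -Name        "SureBackup - Production VMs" `
    -VirtualLab  $vlab `
    -AppGroup    $appGrp `
    -LinkedJob   $sourceJob `
    -Description "Weekly recoverability verification for production VMs"

Write-Host "SureBackup job created: $($sbJob.Name)"

Disconnect-VBRServer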
PowerShell: Report on SureBackup job results across all verification jobs
Connect-VBRServer -Server "vbr-server.domain.local"

$cutoff = (Get-Date).AddDays(-30)

Get-VSBJob | ForEach-Object {
    $job = $_
    Write-Host "`n=== $($job.Name) ==="

    Get-VSBSession -Job $job | Where-Object { $_.CreationTime -gt $cutoff } |
    Sort-Object CreationTime -Descending | Select-Object -First 5 |
    ForEach-Object {
        $session = $_
        $tasks   = Get-VSBTaskSession -Session $session

        $passed  = ($tasks | Where-Object { $_.Status -eq 'Success' }).Count
        $failed  = ($tasks | Where-Object { $_.Status -eq 'Failed'  }).Count
        $total   = $tasks.Count

        Write-Host "  $($session.CreationTime.ToString('yyyy-MM-dd')) - $($session.Result) | VMs: $total | Pass: $passed | Fail: $failed"

        # Show failed VMs
        $tasks | Where-Object { $_.Status -eq 'Failed' } | ForEach-Object {
            Write-Host "    FAILED: $($_.Name) - $($_.Info)"
        }
    }
}

Disconnect-VBRServer

7. Common Failure Patterns

Timeout Errors on Application Initialization

The VM boots but the application doesn't respond within the initialization timeout. The fix is almost always increasing the Application initialization timeout, not investigating the application. A VM started from backup through vPower NFS boots more slowly than the same VM in production: every disk read goes through the NFS mount to the backup repository, which is usually slower storage than production datastores and has to serve data out of compressed, deduplicated backup files. Count on boot times being noticeably longer than production, especially for SQL Server or VMs with large disks. Start with 300 seconds for SQL Server and 180 seconds for web servers, then tune down if needed.
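Once you have a few measured startup times, turning them into a timeout value is simple arithmetic. The helper below is a hypothetical rule of thumb, not Veeam guidance: take the slowest observed startup, multiply by a safety margin (1.5 here), round up to a 30-second step, and never go below the 120-second default.

```powershell
# Hypothetical rule of thumb, not Veeam guidance: slowest observed startup,
# times a safety margin, rounded up to a whole step, never below the default.
function Get-RecommendedInitTimeout {
    param(
        [Parameter(Mandatory)] [double[]]$ObservedStartupSeconds,
        [double]$Margin = 1.5,
        [int]$StepSeconds = 30,
        [int]$MinimumSeconds = 120   # the default Application initialization timeout
    )
    $slowest = ($ObservedStartupSeconds | Measure-Object -Maximum).Maximum
    $padded  = [math]::Ceiling(($slowest * $Margin) / $StepSeconds) * $StepSeconds
    return [int][math]::Max($MinimumSeconds, $padded)
}
```

Observed startups of 140, 190, and 175 seconds give 190 × 1.5 = 285, which rounds up to 300 seconds, matching the SQL Server starting point above.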

Test Scripts Completing with Exit Code 1

VBR interprets any non-zero exit code from a test script as a failure. When a script can't reach the VM at all, check two things. First, the script must target the address VBR passes in %vm_ip%, not a hard-coded production IP: from the VBR server, a production address either goes nowhere or, worse, reaches the running production VM and tests the wrong system. Second, check the IP masquerading configuration in the virtual lab: each isolated network needs a masquerade network mapped to its production range, and the VBR server needs its route to that masquerade network through the proxy appliance. Those two problems are the most common reasons scripts work outside the lab but fail inside SureBackup.

Application Group VM Fails to Find Restore Point

The SureBackup job fails immediately with "unable to find valid restore point for [VM name]" for a VM in the application group. The VM's backup job is failing or running infrequently enough that no restore point exists within the window SureBackup checks. Fix the upstream backup job first. Then verify the application group VM is actually included in an active backup job with a recent successful restore point before relying on it as a dependency.


Key Takeaways

  • SureBackup Full recoverability testing actually boots VMs from backup and tests live applications. Content scan only checks backup file integrity and runs malware scans but never boots a VM. You haven't verified recoverability unless VMs boot.
  • The isolated network port group must have zero physical uplinks. This is the only thing preventing verified VMs from reaching production with their production IPs. VBR doesn't validate this. You have to confirm it in vCenter manually after creating the virtual lab.
  • VMs in the application group boot in order before verified VMs start. If any application group VM has no valid restore point, the entire SureBackup job fails. Keep application group VMs covered by healthy backup jobs.
  • Application initialization timeout defaults to 120 seconds. VMs started from backup in a vPower NFS environment boot significantly slower than production. SQL Server needs at least 300 seconds as a starting point. Increase the timeout before concluding an application doesn't start from backup.
  • Pass the VM's address to custom test scripts with the %vm_ip% and %vm_fqdn% variables in the script's Arguments field. Any non-zero exit code is a failure. Use these scripts to run actual application queries and content checks, not just port connectivity tests.
  • Chain SureBackup after its linked backup job instead of scheduling them independently. This prevents schedule overlap, ensures SureBackup always has the freshest restore point, and eliminates backup file locking failures.
  • You can increase simultaneous VM verification beyond the default of 3, but watch the ESXi host running the virtual lab. Each additional concurrent VM consumes host CPU and memory during the verification window.
