Veeam v13: Disaster Recovery Runbooks and Documentation

Veeam v13 Series | Component: VBR v13, Veeam ONE v13 | Audience: MSP Engineers, Enterprise Architects, Security and Compliance Teams

Most Veeam environments are technically solid. The backup jobs run. The hardened repo is configured. Immutability is on. But when someone asks "can you walk me through exactly what happens if the backup server goes down at 2am on a Saturday," the answer is usually a pause followed by "well, we would figure it out."

That gap is what this article is about. A DR runbook is not a Veeam config export. It is an operational document that a trained engineer who has never touched your environment can pick up and execute under pressure. This article covers how to build one, what it needs to contain, how to test it, and how to produce the kind of audit ready documentation that satisfies a compliance reviewer or a customer asking for proof of recoverability.


1. What a DR Runbook Actually Is

A runbook is a step by step operational procedure for a specific failure scenario. It is not architecture documentation. It is not a Veeam best practices guide. It is a document that answers one question: given this specific failure, what do I do, in what order, and how do I know it worked.

For Veeam environments, you need at minimum one runbook per critical failure scenario. The scenarios that matter most in production:

Scenario | Scope | RTO Target
VBR server total loss | Rebuild or restore the backup server itself | 4 to 8 hours
Backup repository corruption or loss | Recover from offsite copy or immutable backup | 2 to 4 hours to restore operations
Ransomware event | Isolate, assess, restore from clean immutable restore point | Scenario dependent, document the decision tree
Primary site loss (DR failover) | Failover replicas or restore to DR site | Per SLA, typically 1 to 4 hours
Single VM or workload recovery | Restore specific VM or files from backup | 15 to 60 minutes
Cloud Connect tenant data recovery | MSP specific: restore tenant workloads from cloud repository | Per tenant SLA

A runbook you have never tested is a hypothesis, not a procedure. Every runbook in this article needs a test date and a test result before it goes into production use.

2. VBR Server Recovery Runbook

The VBR server is the most critical single point of failure in most Veeam environments. Losing it does not lose your backup data, but it does lose your ability to restore until it is rebuilt. This runbook covers a full rebuild from the Veeam configuration backup.

Prerequisites Before You Need This Runbook

  • Veeam configuration backup is scheduled and running to a location outside the VBR server (network share, object storage, or separate repo)
  • VBR installer media is accessible offline (ISO or downloaded installer stored separately)
  • License file or license portal credentials are documented and stored in your password manager
  • PostgreSQL credentials for the VBR configuration database are documented (v13 uses PostgreSQL for the configuration database; the PostgreSQL superuser password set during install is required for restore)
  • Service account credentials for VBR (the account VBR services run under) are documented
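
The first prerequisite is easy to assume and hard to notice when it quietly stops being true. A minimal sketch of a freshness check follows, assuming the configuration backups (.bco files) land in a folder the script can read. The folder path and the 24 hour threshold are assumptions for illustration, not Veeam defaults.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def newest_config_backup(folder: Path) -> Optional[Path]:
    """Return the most recently modified .bco file in folder, or None."""
    candidates = list(folder.glob("*.bco"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def is_fresh(backup: Path, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if the file was modified within max_age of now."""
    now = now or datetime.now()
    modified = datetime.fromtimestamp(backup.stat().st_mtime)
    return now - modified <= max_age


# Hypothetical usage: the share path is an assumption.
# latest = newest_config_backup(Path(r"\\fileserver\veeam-config"))
# if latest is None or not is_fresh(latest, timedelta(hours=24)):
#     print("Config backup missing or stale: fix before you need it")
```

Running a check like this from a machine other than the VBR server keeps the check alive when the thing it protects is down.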

Recovery Steps

  1. Provision replacement server. Match or exceed original hardware/VM specs. Install Windows Server (same version as original). Join to domain if applicable. Apply current patches.
  2. Install Veeam VBR v13. Run the installer. Select the same installation path as the original. When prompted for PostgreSQL, use the same PostgreSQL superuser password as the original installation. Do not configure any infrastructure during setup.
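  3. Quiesce VBR before importing config. Confirm no jobs are running or scheduled to start on the new server. The configuration restore wizard stops the Veeam Backup Service and any running jobs itself. Do not stop Veeam services by hand first, because the console used in step 4 cannot connect without them.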
  4. Run configuration restore. Open VBR console. Go to Home tab, click the VBR menu (top left), select Configuration Backup, then Restore. Point to the most recent configuration backup file. Enter the encryption password if the config backup was encrypted (it should be).
  5. Verify infrastructure reconnection. After restore completes, open Backup Infrastructure. Verify all managed servers show as connected. Re-enter credentials for any server that shows as disconnected. This is common for VSA connected servers where the Analytics Service needs to re-register.
  6. Verify repository access. Open Backup Repositories. Confirm all repositories are accessible and backup chains are visible under each repo. If using a hardened Linux repo, re-enter the single use credentials to re-establish the connection.
  7. Run a test restore. Select a non critical VM. Run an Instant Recovery to verify the full restore path is working before declaring recovery complete.
  8. Re-register with Veeam ONE. If Veeam ONE is in use, re-add the rebuilt VBR server in Veeam ONE configuration. The Analytics Service will reinstall automatically.

The configuration backup encryption password is the single most common recovery blocker. If this password is lost, the config backup cannot be restored. Store it in your password manager and in a sealed physical document in a secure location. Not in the same system the config backup protects.

3. Ransomware Response Runbook

Ransomware runbooks are different from other DR runbooks because the first phase is not restoration. It is isolation and assessment. Restoring before you know what was hit and whether the infection vector is closed is how you restore infected data and extend the incident.

Phase 1: Isolate

  1. Isolate the VBR server from the network immediately if there is any indication it was reached. Veeam ONE malware detection alarms are the first signal in most environments.
  2. Do not shut down affected systems. Memory forensics may be needed. Isolate at the network switch or firewall level first.
  3. Verify the hardened repository is intact. SSH to the hardened repo server. Confirm the Veeam service user account has not been modified and immutability flags are set on backup files.
  4. Pull the active alarm list from Veeam ONE. Document every alarm that fired in the 72 hours before detection. This establishes the timeline.
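
Step 4 amounts to a time window filter over the alarm list. As a rough sketch, assuming the alarms have been exported into a list of dictionaries with a "time" and a "name" key (a hypothetical shape, not a Veeam ONE export format), the 72 hour timeline could be assembled like this:

```python
from datetime import datetime, timedelta


def alarms_before_detection(alarms, detected_at, window_hours=72):
    """Return alarms fired in the window before detection, oldest first.

    alarms: iterable of dicts with a datetime under "time" (assumed shape).
    """
    start = detected_at - timedelta(hours=window_hours)
    in_window = [a for a in alarms if start <= a["time"] <= detected_at]
    return sorted(in_window, key=lambda a: a["time"])


detected = datetime(2025, 3, 10, 2, 0)
sample = [
    {"time": datetime(2025, 3, 9, 23, 0), "name": "A"},
    {"time": datetime(2025, 3, 6, 1, 0), "name": "B"},   # older than 72 hours
    {"time": datetime(2025, 3, 8, 12, 0), "name": "C"},
    {"time": datetime(2025, 3, 10, 5, 0), "name": "D"},  # after detection
]
timeline = alarms_before_detection(sample, detected)
```

Sorting oldest first matters: the earliest alarm in the window is the candidate for the first indicator of compromise.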

Phase 2: Assess

  1. Identify the last clean restore point for each affected workload using the VBR console. Look for restore points that predate the earliest indicators of compromise.
  2. Use Veeam ONE Alarm History to identify the first backup job that may have backed up encrypted data. Work backward from that point: restore points taken before that job are your candidates.
  3. Check SureBackup results history. The most recent successful SureBackup run gives you a verified clean restore point baseline.
  4. Engage your incident response process. DR runbook execution is parallel to, not a replacement for, security incident response.
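
The selection rule in step 1 can be stated precisely: the last clean restore point is the newest one that strictly predates the earliest indicator of compromise. A minimal sketch, assuming restore point timestamps have already been collected per workload (the function name is illustrative):

```python
from datetime import datetime
from typing import Iterable, Optional


def last_clean_restore_point(restore_points: Iterable[datetime],
                             earliest_ioc: datetime) -> Optional[datetime]:
    """Newest restore point taken strictly before the earliest IOC."""
    clean = [rp for rp in restore_points if rp < earliest_ioc]
    return max(clean) if clean else None


points = [datetime(2025, 3, 1), datetime(2025, 3, 5), datetime(2025, 3, 9)]
ioc = datetime(2025, 3, 7)
chosen = last_clean_restore_point(points, ioc)
```

A result of None is a finding in its own right: retention does not reach back past the compromise, and the decision has to go to the incident response team.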

Phase 3: Restore

  1. Restore to an isolated network segment first. Do not restore directly to production until the infection vector is confirmed closed.
  2. Use Secure Restore for all workloads if antivirus scanning is configured. This scans the restore point with antivirus before the data reaches the target environment.
  3. Restore in priority order per your workload priority matrix (covered in Section 6 of this article).
  4. Document every restore action with timestamp, operator, restore point date, and target. This is your incident record.
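
The record in step 4 is easier to keep complete if it is written at the moment of each action rather than reconstructed afterwards. One possible shape, sketched as an append-only JSON lines file (the file name and field names are assumptions for illustration):

```python
import json
from datetime import datetime, timezone


def log_restore_action(log_path, operator, workload, restore_point,
                       target, now=None):
    """Append one restore action to a JSON lines file and return the entry."""
    entry = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "operator": operator,
        "workload": workload,
        "restore_point": restore_point,
        "target": target,
    }
    # Append mode: earlier entries are never rewritten.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


# Hypothetical usage:
# log_restore_action("incident-restores.jsonl", "jsmith", "SQL01",
#                    "2025-03-05T22:00", "isolated-vlan")
```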

4. Replica Failover Runbook

This runbook applies to environments using Veeam replication to a DR site. It covers planned failover (maintenance or migration) and unplanned failover (primary site loss).

Planned Failover

  1. In VBR console, go to Home, then Replicas, then Ready. Identify the VMs to fail over.
  2. In VBR, right click the replica and select Planned Failover. This synchronizes one final delta before switching, minimizing data loss.
  3. Confirm the replica powers on at the DR site. Verify network connectivity and application health before proceeding.
  4. Update DNS or load balancer entries to point to the DR site IPs.
  5. Notify stakeholders that failover is complete and applications are running from DR.
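
The connectivity check in step 3 can be made repeatable with a simple TCP probe. The sketch below only confirms that a port accepts connections, which is weaker than a real application health check, and the host names and ports in the usage comment are placeholders:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def failed_checks(checks, probe=port_is_open):
    """Return the (host, port) pairs that did not answer."""
    return [(host, port) for host, port in checks if not probe(host, port)]


# Hypothetical usage before touching DNS:
# failures = failed_checks([("dr-sql01", 1433), ("dr-web01", 443)])
# if failures:
#     print("Do not update DNS yet:", failures)
```

The probe is passed in as a parameter so the same list of checks can be exercised in a tabletop test without touching the network.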

Unplanned Failover

  1. In VBR console, go to Home, then Replicas, then Ready. Identify affected VMs.
  2. In VBR, right click the replica and select Failover Now. Select the most recent restore point or a specific point in time if the most recent point may be suspect.
  3. Verify replica is running. Test application connectivity before updating DNS.
  4. Document the restore point used and the timestamp of failover.
  5. Begin failback planning immediately. Unplanned failover means your DR site is now your primary. This is a temporary state.

Failover without a documented failback plan is an incomplete runbook. Every failover procedure needs a corresponding failback procedure or you will be making it up under pressure at the worst possible time.

5. Testing Your Runbooks

A runbook that has never been executed is not a runbook. It is a draft. Every runbook needs a test protocol and a documented test history.

Test Types and Cadence

Test Type | What It Validates | Recommended Cadence
Tabletop exercise | Team knows the runbook, roles are clear, decision points are understood | Quarterly
SureBackup automated verification | Backup data is restorable, application starts correctly | Weekly per job
Instant Recovery test | Full VM restore path works end to end | Monthly, rotating workloads
Full runbook execution in isolated environment | Entire procedure works as documented | Annually minimum, twice a year for critical workloads
Replica failover test | Replicas are current and failover completes within RTO | Twice a year
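
A cadence only helps if someone notices when it slips. As an illustration, the cadences above can be turned into day counts and compared against the last run date of each test type. The day counts are approximations (quarterly as 92 days, twice a year as 183) and the key names are made up for this sketch:

```python
from datetime import date

# Approximate maximum days between runs, derived from the cadence table.
CADENCE_DAYS = {
    "tabletop": 92,            # quarterly
    "surebackup": 7,           # weekly per job
    "instant_recovery": 31,    # monthly
    "full_runbook": 365,       # annually minimum
    "replica_failover": 183,   # twice a year
}


def overdue_tests(last_run, today):
    """Return test types never run or last run longer ago than the cadence."""
    overdue = []
    for test_type, max_days in CADENCE_DAYS.items():
        last = last_run.get(test_type)
        if last is None or (today - last).days > max_days:
            overdue.append(test_type)
    return overdue


history = {
    "tabletop": date(2025, 4, 1),
    "surebackup": date(2025, 5, 20),
    "instant_recovery": date(2025, 5, 15),
    "full_runbook": date(2024, 1, 1),
}
late = overdue_tests(history, today=date(2025, 6, 1))
```

A test type with no recorded run is treated as overdue, which matches the article's position that an untested runbook is a draft.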

Documenting Test Results

Every test needs a record. Minimum fields for each test record:

  • Date and time of test
  • Operator who performed the test
  • Runbook version tested
  • Workloads included in the test
  • RTO achieved vs RTO target
  • Steps that failed or deviated from the runbook
  • Runbook updates made as a result
  • Sign off by team lead or manager

This test record is what you hand to an auditor. It is also what tells you whether your RTO targets are realistic before you discover they are not during an actual incident.
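
The minimum fields map naturally onto a small record type. The following is one possible sketch, not a prescribed schema. The class and field names are invented here, and the two properties encode the comparisons an auditor will make anyway:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunbookTestRecord:
    performed_at: datetime
    operator: str
    runbook_version: str
    workloads: List[str]
    rto_target: timedelta
    rto_achieved: timedelta
    deviations: List[str] = field(default_factory=list)
    runbook_updates: List[str] = field(default_factory=list)
    signed_off_by: str = ""

    @property
    def rto_met(self) -> bool:
        """True if the measured recovery time is within the target."""
        return self.rto_achieved <= self.rto_target

    @property
    def signed_off(self) -> bool:
        """True once a team lead or manager has signed the record."""
        return bool(self.signed_off_by.strip())


record = RunbookTestRecord(
    performed_at=datetime(2025, 5, 3, 9, 0),
    operator="jsmith",
    runbook_version="1.2",
    workloads=["DC01", "SQL01"],
    rto_target=timedelta(hours=4),
    rto_achieved=timedelta(hours=5, minutes=10),
    deviations=["Step 6: repo credentials not in password manager"],
)
```

In this example the test missed its target by 70 minutes and has no sign off yet, which is exactly the kind of result worth finding in a test rather than an incident.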


6. Workload Priority Matrix

Not all workloads recover in the same order. A priority matrix documents which systems come back first, who owns the decision, and what the dependency chain looks like. Without this, recovery devolves into whoever is loudest on the phone gets their system first.

Tier | Description | Examples | RTO Target
Tier 1 | Infrastructure dependencies. Nothing else recovers without these. | Domain controllers, DNS, core networking, authentication | Under 1 hour
Tier 2 | Business critical applications. Revenue or operations stop without these. | ERP, core databases, primary file servers, email | 1 to 4 hours
Tier 3 | Important but not immediately blocking. | Secondary applications, reporting systems, dev environments used in production | 4 to 8 hours
Tier 4 | Can wait. Recovery can be deferred until Tier 1 to 3 are stable. | Dev, test, sandbox, non production workloads | Next business day

The priority matrix should be reviewed and signed off by business stakeholders, not just IT. The business owns the priority decision. IT implements it.
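
Once the matrix is agreed, the restore order falls out mechanically. A small sketch, assuming the matrix is kept as a mapping of workload name to tier number (the workload names are examples only):

```python
from collections import defaultdict


def recovery_waves(workloads):
    """Group workloads by tier and return the waves in recovery order.

    workloads: dict mapping workload name to tier number (1 is first).
    """
    waves = defaultdict(list)
    for name, tier in workloads.items():
        waves[tier].append(name)
    return [(tier, sorted(waves[tier])) for tier in sorted(waves)]


matrix = {"ERP01": 2, "DC01": 1, "DEV-SBX": 4, "DNS01": 1, "RPT01": 3}
plan = recovery_waves(matrix)
```

This handles ordering between tiers only. Dependencies inside a tier, such as a domain controller that must be up before another Tier 1 service, still need to be written into the runbook steps.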


7. Compliance and Audit Documentation

For MSPs, enterprise environments under compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS), and any organization that has made contractual SLA commitments, runbooks are not optional. They are evidence. Here is what auditors and compliance frameworks actually ask for:

Document | What It Proves | Framework Relevance
DR runbook with version history | Documented recovery procedures exist and are maintained | SOC 2 CC9 and A1, ISO 27001 A.17, HIPAA 164.308(a)(7)
Test records with RTO results | Recovery procedures have been validated | SOC 2 CC9.1, PCI DSS 12.10.2
Workload priority matrix with stakeholder sign off | Recovery prioritization is defined and approved | ISO 27001 A.17.1.2
Backup job success logs (30 to 90 days) | Backups are running and completing successfully | SOC 2 A1, HIPAA 164.310(d)(2)
SureBackup verification reports | Backup data is verified restorable, not just present | SOC 2 A1.2, PCI DSS 12.10
Encryption key management documentation | Backup encryption keys are stored and accessible | SOC 2 CC6, HIPAA 164.312(a)(2)(iv)

Veeam ONE scheduled reports cover most of the operational evidence automatically. The Protected VMs report, Failed Job History, and SureBackup results can all be scheduled to email to a compliance inbox on a regular cadence, building an evidence trail without manual effort.
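
Before an audit it helps to know which of the documents in the table are still missing. A simple gap check along these lines is one way to do it. The list mirrors the table above, and how the collected items are gathered (a folder listing, a ticket export) is left open as an assumption:

```python
REQUIRED_EVIDENCE = [
    "DR runbook with version history",
    "Test records with RTO results",
    "Workload priority matrix with stakeholder sign off",
    "Backup job success logs",
    "SureBackup verification reports",
    "Encryption key management documentation",
]


def evidence_gaps(collected):
    """Return required documents that are not in the collected list."""
    have = {item.strip().lower() for item in collected}
    return [doc for doc in REQUIRED_EVIDENCE if doc.lower() not in have]


gaps = evidence_gaps([
    "DR runbook with version history",
    "backup job success logs",
    "SureBackup verification reports",
])
```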


8. Runbook Template Structure

Every runbook in your library should follow a consistent structure so any engineer can pick it up without having to understand a new format under pressure.

Standard Runbook Template Structure
RUNBOOK: [Scenario Name]
Version: [x.x] | Last Tested: [Date] | Owner: [Team/Name]
Last Updated: [Date] | Classification: [Internal/Confidential]

SCENARIO
Brief description of the failure condition this runbook addresses.

SCOPE
Which systems, workloads, and sites are covered.

PREREQUISITES
What must be in place before executing this runbook.
Include: access requirements, tool locations, credential sources.

RTO TARGET
The recovery time objective this runbook is designed to meet.

DECISION CRITERIA
Under what conditions should this runbook be invoked?
Who has authority to invoke it?

STEPS
1. [Action] -- [Expected outcome] -- [Verification]
2. ...

ESCALATION
If step X fails or RTO is exceeded, contact: [Name, Role, Contact]

ROLLBACK
If recovery cannot proceed, what is the fallback state?

TEST HISTORY
Date | Operator | Result | RTO Achieved | Notes
[Date] | [Name] | Pass/Fail | [Time] | [Notes]
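
A consistent structure can be enforced rather than hoped for. As a sketch, assuming runbooks are stored as plain text with the section headings on their own lines exactly as in the template above, a check for missing sections might look like this:

```python
REQUIRED_SECTIONS = [
    "SCENARIO", "SCOPE", "PREREQUISITES", "RTO TARGET",
    "DECISION CRITERIA", "STEPS", "ESCALATION", "ROLLBACK",
    "TEST HISTORY",
]


def missing_sections(runbook_text):
    """Return template sections whose heading line is absent."""
    headings = {line.strip() for line in runbook_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in headings]


draft = """RUNBOOK: VBR Server Total Loss
SCENARIO
The backup server is unrecoverable.
SCOPE
VBR01 at the primary site.
STEPS
1. Provision replacement server
"""
gaps_in_draft = missing_sections(draft)
```

Running a check like this when a runbook is saved or reviewed catches the sections that tend to be skipped, typically ROLLBACK and TEST HISTORY.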

Key Takeaways

  • A runbook is an operational procedure for a specific failure scenario, not architecture documentation. It must be executable by a trained engineer under pressure.
  • Cover at minimum: VBR server loss, ransomware response, replica failover, and single workload recovery.
  • The ransomware runbook starts with isolation and assessment, not restoration. Restoring before the vector is closed extends the incident.
  • Every runbook needs a test record. RTO targets that have never been validated are guesses.
  • A workload priority matrix owned by the business, not IT, prevents recovery chaos when multiple systems are down simultaneously.
  • Veeam ONE scheduled reports (Protected VMs, Failed Job History, SureBackup results) build your compliance evidence trail automatically.
  • The config backup encryption password is the single most common recovery blocker. Store it outside the system it protects.
