Veeam v13: Disaster Recovery Runbooks and Documentation

Eric Black

13 Mar 2026 — 10 min read

Veeam v13 Series | Component: VBR v13, Veeam ONE v13 | Audience: MSP Engineers, Enterprise Architects, Security and Compliance Teams

Most Veeam environments are technically solid. The backup jobs run. The hardened repo is configured. Immutability is on. But when someone asks "can you walk me through exactly what happens if the backup server goes down at 2am on a Saturday," the answer is usually a pause followed by "well, we would figure it out."

That gap is what this article is about. A DR runbook is not a Veeam config export. It is an operational document that a trained engineer who has never touched your environment can pick up and execute under pressure. This article covers how to build one, what it needs to contain, how to test it, and how to produce the kind of audit ready documentation that satisfies a compliance reviewer or a customer asking for proof of recoverability.

1. What a DR Runbook Actually Is

A runbook is a step by step operational procedure for a specific failure scenario. It is not architecture documentation. It is not a Veeam best practices guide. It is a document that answers one question: given this specific failure, what do I do, in what order, and how do I know it worked.

For Veeam environments, you need at minimum one runbook per critical failure scenario. The scenarios that matter most in production:

Scenario	Scope	RTO Target
VBR server total loss	Rebuild or restore the backup server itself	4 to 8 hours
Backup repository corruption or loss	Recover from offsite copy or immutable backup	2 to 4 hours to restore operations
Ransomware event	Isolate, assess, restore from clean immutable restore point	Scenario dependent, document the decision tree
Primary site loss (DR failover)	Failover replicas or restore to DR site	Per SLA, typically 1 to 4 hours
Single VM or workload recovery	Restore specific VM or files from backup	15 to 60 minutes
Cloud Connect tenant data recovery	MSP specific: restore tenant workloads from cloud repository	Per tenant SLA

A runbook you have never tested is a hypothesis, not a procedure. Every runbook in this article needs a test date and a test result before it goes into production use.

2. VBR Server Recovery Runbook

The VBR server is the most critical single point of failure in most Veeam environments. Losing it does not lose your backup data, but it does lose your ability to restore until it is rebuilt. This runbook covers a full rebuild from the Veeam configuration backup.

Prerequisites Before You Need This Runbook

Veeam configuration backup is scheduled and running to a location outside the VBR server (network share, object storage, or separate repo)
VBR installer media is accessible offline (ISO or downloaded installer stored separately)
License file or license portal credentials are documented and stored in your password manager
PostgreSQL credentials for the VBR configuration database are documented (v13 migrated to PostgreSQL; the PostgreSQL superuser password set during install is required for restore)
Service account credentials for VBR (the account VBR services run under) are documented
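
The first prerequisite above is the one most worth checking automatically, because a configuration backup that silently stopped running looks fine until the day you need it. A minimal sketch in Python, assuming configuration backups land as `.bco` files somewhere under a directory that is readable from a machine other than the VBR server. The function names and the 24 hour threshold are illustrative assumptions, not Veeam defaults or Veeam APIs.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def newest_config_backup(backup_dir: Path) -> Optional[Path]:
    """Return the most recently modified .bco file under backup_dir, or None."""
    candidates = list(backup_dir.rglob("*.bco"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def config_backup_is_fresh(
    backup_dir: Path,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold, match your schedule
    now: Optional[datetime] = None,            # must be timezone-aware if supplied
) -> bool:
    """True if the newest configuration backup is no older than max_age."""
    newest = newest_config_backup(backup_dir)
    if newest is None:
        return False
    current = now or datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(newest.stat().st_mtime, tz=timezone.utc)
    return current - modified <= max_age
```

Run it on a schedule from a system that is not the VBR server and alert when it returns False. It only proves that a recent file exists, not that the file restores, so it complements a real restore test rather than replacing one.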

Recovery Steps

Provision replacement server. Match or exceed original hardware/VM specs. Install Windows Server (same version as original). Join to domain if applicable. Apply current patches.
Install VBR v13. Run the installer. Select the same installation path as the original. When prompted for PostgreSQL, use the same PostgreSQL superuser password as the original installation. Do not configure any infrastructure during setup.
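Stop running jobs before importing config. Confirm that no jobs or restore sessions are active on the rebuilt server before starting the configuration restore. Leave the Veeam services running: the restore is launched from the console, and the restore wizard stops and restarts the services itself while it works.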
Run configuration restore. Open VBR console. Go to Home tab, click the VBR menu (top left), select Configuration Backup, then Restore. Point to the most recent configuration backup file. Enter the encryption password if the config backup was encrypted (it should be).
Verify infrastructure reconnection. After restore completes, open Backup Infrastructure. Verify all managed servers show as connected. Re-enter credentials for any server that shows as disconnected. This is common for VSA connected servers where the Analytics Service needs to re-register.
Verify repository access. Open Backup Repositories. Confirm all repositories are accessible and backup chains are visible under each repo. If using a hardened Linux repo, re-enter the single use credentials to re-establish the connection.
Run a test restore. Select a non critical VM. Run an Instant Recovery to verify the full restore path is working before declaring recovery complete.
Re-register with Veeam ONE. If Veeam ONE is in use, re-add the rebuilt VBR server in Veeam ONE configuration. The Analytics Service will reinstall automatically.

Configuration backup encryption password is the single most common recovery blocker. If this password is lost, the config backup cannot be restored. Store it in your password manager and in a sealed physical document in a secure location. Not in the same system the config backup protects.

3. Ransomware Response Runbook

Ransomware runbooks are different from other DR runbooks because the first phase is not restoration. It is isolation and assessment. Restoring before you know what was hit and whether the infection vector is closed is how you restore infected data and extend the incident.

Phase 1: Isolate

Isolate the VBR server from the network immediately if there is any indication it was reached. Veeam ONE malware detection alarms are the first signal in most environments.
Do not shut down affected systems. Memory forensics may be needed. Isolate at the network switch or firewall level first.
Verify the hardened repository is intact. SSH to the hardened repo server. Confirm the Veeam service user account has not been modified and immutability flags are set on backup files.
Pull the active alarm list from Veeam ONE. Document every alarm that fired in the 72 hours before detection. This establishes the timeline.

Phase 2: Assess

Identify the last clean restore point for each affected workload using the VBR console. Look for restore points that predate the earliest indicators of compromise.
Use Veeam ONE Alarm History to identify the first backup job that may have backed up encrypted data. Work backward from that point: only restore points created before that job ran are candidates for a clean restore.
Check SureBackup results history. The most recent successful SureBackup run gives you a verified clean restore point baseline.
Engage your incident response process. DR runbook execution is parallel to, not a replacement for, security incident response.
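
The restore point selection in Phase 2 comes down to one rule: take the newest restore point that is older than the earliest indicator of compromise. A minimal sketch of that rule in Python, assuming you have already pulled the restore point timestamps for a workload out of VBR by whatever means you use. The function name and the optional safety margin are illustrative assumptions, and nothing here calls a Veeam API.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def last_clean_restore_point(
    restore_points: Iterable[datetime],
    earliest_ioc: datetime,
    safety_margin: timedelta = timedelta(0),  # assumed extra buffer before the IOC
) -> Optional[datetime]:
    """Return the newest restore point strictly older than the cutoff.

    The cutoff is the earliest indicator of compromise minus the margin.
    Returns None when every restore point falls on or after the cutoff.
    """
    cutoff = earliest_ioc - safety_margin
    candidates = [rp for rp in restore_points if rp < cutoff]
    return max(candidates) if candidates else None
```

A None result is a decision point, not an error: it means no restore point in the list predates the compromise and you need to look at offsite or archive copies. A restore point that passes this check is a candidate, not a guarantee, so it still goes through scanning and an isolated restore.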

Phase 3: Restore

Restore to an isolated network segment first. Do not restore directly to production until the infection vector is confirmed closed.
Use Secure Restore for all workloads if antivirus scanning is configured. Secure Restore scans the disks of the restore point for malware before the workload is restored into the target environment.
Restore in priority order per your workload priority matrix (covered in Section 6 of this article).
Document every restore action with timestamp, operator, restore point date, and target. This is your incident record.
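
The incident record in the last step is easiest to keep honest when it is append only and written at the moment of the action. A minimal sketch in Python that appends one JSON line per restore action with the four fields listed above plus the workload name. The class, the function, and the log file layout are illustrative assumptions, not a Veeam feature.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class RestoreAction:
    timestamp: str           # when the restore was started, UTC, ISO 8601
    operator: str            # who ran it
    workload: str            # what was restored
    restore_point_date: str  # which restore point was used
    target: str              # where it was restored to


def log_restore_action(
    log_path: Path, operator: str, workload: str, restore_point_date: str, target: str
) -> RestoreAction:
    """Append one restore action to a JSON Lines file and return the record."""
    action = RestoreAction(
        timestamp=datetime.now(timezone.utc).isoformat(),
        operator=operator,
        workload=workload,
        restore_point_date=restore_point_date,
        target=target,
    )
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(action)) + "\n")
    return action
```

Keep the log file somewhere outside the environment being recovered, for the same reason the configuration backup lives outside the VBR server.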

4. Replica Failover Runbook

This runbook applies to environments using Veeam replication to a DR site. It covers planned failover (maintenance or migration) and unplanned failover (primary site loss).

Planned Failover

In VBR console, go to Home, then Replicas, then Ready. Identify the VMs to fail over.
In VBR, right click the replica and select Planned Failover. This synchronizes one final delta before switching, minimizing data loss.
Confirm the replica powers on at the DR site. Verify network connectivity and application health before proceeding.
Update DNS or load balancer entries to point to the DR site IPs.
Notify stakeholders that failover is complete and applications are running from DR.

Unplanned Failover

In VBR console, go to Home, then Replicas, then Ready. Identify affected VMs.
In VBR, right click the replica and select Failover Now. Select the most recent restore point or a specific point in time if the most recent point may be suspect.
Verify replica is running. Test application connectivity before updating DNS.
Document the restore point used and the timestamp of failover.
Begin failback planning immediately. Unplanned failover means your DR site is now your primary. This is a temporary state.
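
Both procedures tell you to test connectivity before touching DNS. A minimal sketch of the most basic version of that check in Python: confirm that each expected service port on the replica accepts a TCP connection. The function names, and the hosts and ports you would pass in, are illustrative assumptions.

```python
import socket
from typing import Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (host or IP at the DR site, TCP port)


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def failed_checks(endpoints: Iterable[Endpoint], timeout: float = 3.0) -> List[Endpoint]:
    """Return the endpoints that did not accept a connection."""
    return [(h, p) for h, p in endpoints if not tcp_port_open(h, p, timeout)]
```

An empty list from `failed_checks` means every port answered, which is the minimum bar for updating DNS. It does not prove the application is healthy, so the runbook step should still name an application level check such as a login or a test transaction.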

Failover without a documented failback plan is an incomplete runbook. Every failover procedure needs a corresponding failback procedure or you will be making it up under pressure at the worst possible time.

5. Testing Your Runbooks

A runbook that has never been executed is not a runbook. It is a draft. Every runbook needs a test protocol and a documented test history.

Test Types and Cadence

Test Type	What It Validates	Recommended Cadence
Tabletop exercise	Team knows the runbook, roles are clear, decision points are understood	Quarterly
SureBackup automated verification	Backup data is restorable, application starts correctly	Weekly per job
Instant Recovery test	Full VM restore path works end to end	Monthly, rotating workloads
Full runbook execution in isolated environment	Entire procedure works as documented	Annually minimum, twice a year for critical workloads
Replica failover test	Replicas are current and failover completes within RTO	Twice a year
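
A cadence table only helps if something tells you when a test is overdue. A minimal sketch in Python that turns the table above into day counts and flags any test type whose last run is too old or missing. The day values are my approximations of "quarterly", "monthly", and so on, and the dictionary keys and function name are illustrative assumptions.

```python
from datetime import date
from typing import Dict, List, Optional

# Maximum days between runs, approximating the cadence table above.
CADENCE_DAYS: Dict[str, int] = {
    "tabletop": 91,           # quarterly
    "surebackup": 7,          # weekly per job
    "instant_recovery": 31,   # monthly
    "full_runbook": 365,      # annually minimum
    "replica_failover": 183,  # twice a year
}


def overdue_tests(
    last_run: Dict[str, Optional[date]],
    today: date,
    cadence_days: Dict[str, int] = CADENCE_DAYS,
) -> List[str]:
    """Return test types that were never run or last ran too long ago."""
    overdue = []
    for test_type, max_days in cadence_days.items():
        last = last_run.get(test_type)
        if last is None or (today - last).days > max_days:
            overdue.append(test_type)
    return overdue
```

A test type that is absent from `last_run` counts as overdue on purpose: a test with no record is treated the same as a test that never happened.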

Documenting Test Results

Every test needs a record. Minimum fields for each test record:

Date and time of test
Operator who performed the test
Runbook version tested
Workloads included in the test
RTO achieved vs RTO target
Steps that failed or deviated from the runbook
Runbook updates made as a result
Sign off by team lead or manager

This test record is what you hand to an auditor. It is also what tells you whether your RTO targets are realistic before you discover they are not during an actual incident.
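
The minimum fields above map directly onto a small record type, which makes the RTO comparison something you compute rather than eyeball. A minimal sketch in Python. The class name, field names, and the rule that "Pass" means the RTO target was met are illustrative assumptions; the summary line follows the Date, Operator, Result, RTO Achieved, Notes layout used in the template in Section 8.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunbookTestRecord:
    started_at: datetime
    operator: str
    runbook_version: str
    workloads: List[str]
    rto_target: timedelta
    rto_achieved: timedelta
    deviations: List[str] = field(default_factory=list)       # steps that failed or deviated
    runbook_updates: List[str] = field(default_factory=list)  # changes made as a result
    signed_off_by: str = ""                                   # empty until signed off

    @property
    def met_rto(self) -> bool:
        """True when the achieved recovery time is within the target."""
        return self.rto_achieved <= self.rto_target

    def history_line(self) -> str:
        """One line for the TEST HISTORY table of the runbook."""
        result = "Pass" if self.met_rto else "Fail"
        notes = "; ".join(self.deviations) or "none"
        return (
            f"{self.started_at:%Y-%m-%d} | {self.operator} | {result} | "
            f"{self.rto_achieved} | {notes}"
        )
```

Storing the target next to the achieved time in every record is the point of the design: a year of these records shows whether the target was ever realistic.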

6. Workload Priority Matrix

Not all workloads recover in the same order. A priority matrix documents which systems come back first, who owns the decision, and what the dependency chain looks like. Without this, recovery devolves into whoever is loudest on the phone gets their system first.

Tier	Description	Examples	RTO Target
Tier 1	Infrastructure dependencies. Nothing else recovers without these.	Domain controllers, DNS, core networking, authentication	Under 1 hour
Tier 2	Business critical applications. Revenue or operations stop without these.	ERP, core databases, primary file servers, email	1 to 4 hours
Tier 3	Important but not immediately blocking.	Secondary applications, reporting systems, dev environments used in production	4 to 8 hours
Tier 4	Can wait. Recovery can be deferred until Tier 1 to 3 are stable.	Dev, test, sandbox, non production workloads	Next business day

The priority matrix should be reviewed and signed off by business stakeholders, not just IT. The business owns the priority decision. IT implements it.
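
Tiers and dependencies together define a restore order, and it is worth generating that order ahead of time instead of working it out during the incident. A minimal sketch in Python that places every workload after the workloads it depends on and, among workloads that are ready, picks the lowest tier first. The function name and the workload names in the usage note are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple


def restore_order(tiers: Dict[str, int], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order workloads so dependencies come first, lowest tier first among ready ones.

    tiers maps workload name to tier number (1 is most critical).
    depends_on maps workload name to the workloads that must be restored before it.
    """
    remaining = {w: set(depends_on.get(w, ())) for w in tiers}
    for w, deps in remaining.items():
        unknown = deps - set(tiers)
        if unknown:
            raise ValueError(f"{w} depends on unknown workloads: {sorted(unknown)}")

    # Reverse index: for each workload, which workloads are waiting on it.
    dependents: Dict[str, List[str]] = {w: [] for w in tiers}
    for w, deps in remaining.items():
        for d in deps:
            dependents[d].append(w)

    # Min-heap of workloads with no unmet dependencies, keyed by (tier, name).
    ready: List[Tuple[int, str]] = [(tiers[w], w) for w, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        _, w = heapq.heappop(ready)
        order.append(w)
        for child in dependents[w]:
            remaining[child].discard(w)
            if not remaining[child]:
                heapq.heappush(ready, (tiers[child], child))

    if len(order) != len(tiers):
        raise ValueError("dependency cycle detected")
    return order
```

For example, with hypothetical workloads `dc01` (Tier 1), `sql01` and `erp` (Tier 2, where `erp` depends on both of the others and `sql01` depends on `dc01`), the function places `dc01` first, then `sql01`, then `erp`. A dependency cycle raises an error, which is something you want to find in a review meeting and not at 2am.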

7. Compliance and Audit Documentation

For MSPs, enterprise environments under compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS), and any organization that has made contractual SLA commitments, runbooks are not optional. They are evidence. Here is what auditors and compliance frameworks actually ask for:

Document	What It Proves	Framework Relevance
DR runbook with version history	Documented recovery procedures exist and are maintained	SOC 2 CC9 and A1, ISO 27001:2013 A.17, HIPAA 164.308(a)(7)
Test records with RTO results	Recovery procedures have been validated	SOC 2 CC9.1, PCI DSS 12.10.2
Workload priority matrix with stakeholder sign off	Recovery prioritization is defined and approved	ISO 27001:2013 A.17.1.2
Backup job success logs (30 to 90 days)	Backups are running and completing successfully	SOC 2 A1, HIPAA 164.310(d)(2)
SureBackup verification reports	Backup data is verified restorable, not just present	SOC 2 A1.2, PCI DSS 12.10
Encryption key management documentation	Backup encryption keys are stored and accessible	SOC 2 CC6, HIPAA 164.312(a)(2)(iv)

Veeam ONE scheduled reports cover most of the operational evidence automatically. The Protected VMs report, Failed Job History, and SureBackup results can all be scheduled to email to a compliance inbox on a regular cadence, building an evidence trail without manual effort.

8. Runbook Template Structure

Every runbook in your library should follow a consistent structure so any engineer can pick it up without having to understand a new format under pressure.

Standard Runbook Template Structure

RUNBOOK: [Scenario Name]
Version: [x.x] | Last Tested: [Date] | Owner: [Team/Name]
Last Updated: [Date] | Classification: [Internal/Confidential]

SCENARIO
Brief description of the failure condition this runbook addresses.

SCOPE
Which systems, workloads, and sites are covered.

PREREQUISITES
What must be in place before executing this runbook.
Include: access requirements, tool locations, credential sources.

RTO TARGET
The recovery time objective this runbook is designed to meet.

DECISION CRITERIA
Under what conditions should this runbook be invoked?
Who has authority to invoke it?

STEPS
1. [Action] -- [Expected outcome] -- [Verification]
2. ...

ESCALATION
If step X fails or RTO is exceeded, contact: [Name, Role, Contact]

ROLLBACK
If recovery cannot proceed, what is the fallback state?

TEST HISTORY
Date | Operator | Result | RTO Achieved | Notes
[Date] | [Name] | Pass/Fail | [Time] | [Notes]
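
A consistent structure is easy to enforce if the runbooks live as text files. A minimal sketch in Python that reports which of the template's section headings are missing from a runbook, assuming each heading sits on its own line in capitals exactly as in the template above. The function name is an illustrative assumption.

```python
from typing import List

# Section headings taken from the template above, in template order.
REQUIRED_SECTIONS = (
    "SCENARIO",
    "SCOPE",
    "PREREQUISITES",
    "RTO TARGET",
    "DECISION CRITERIA",
    "STEPS",
    "ESCALATION",
    "ROLLBACK",
    "TEST HISTORY",
)


def missing_sections(runbook_text: str) -> List[str]:
    """Return required headings that do not appear on a line of their own."""
    present = {line.strip() for line in runbook_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in present]
```

Run it against every file in the runbook library as part of the review cycle. An empty list means the structure is complete; it says nothing about whether the content is any good, which is what the tests in Section 5 are for.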

Key Takeaways

A runbook is an operational procedure for a specific failure scenario, not architecture documentation. It must be executable by a trained engineer under pressure.
Cover at minimum: VBR server loss, ransomware response, replica failover, and single workload recovery.
The ransomware runbook starts with isolation and assessment, not restoration. Restoring before the vector is closed extends the incident.
Every runbook needs a test record. RTO targets that have never been validated are guesses.
A workload priority matrix owned by the business, not IT, prevents recovery chaos when multiple systems are down simultaneously.
Veeam ONE scheduled reports (Protected VMs, Failed Job History, SureBackup results) build your compliance evidence trail automatically.
The config backup encryption password is the single most common recovery blocker. Store it outside the system it protects.