Break Glass #03: Configuration Database Corruption - Diagnosing and Recovering a Corrupt VBR PostgreSQL

Break Glass // Scenario 03
The VBR console opens to an error. Jobs are not running. The config backup job failed overnight. The PostgreSQL service started but cannot serve queries. The VBR service log is full of database errors. The configuration database is corrupt. Here is how you get back.

Why This Happens

Database corruption in a Veeam v13 environment happens in a handful of specific ways. The most common is an ungraceful shutdown -- someone yanks power from the VBR server, the host crashes under load, or Windows forces a hard restart during a patch cycle while PostgreSQL has active write transactions. PostgreSQL's write-ahead log (WAL) is designed to handle this, but when the data directory is on a storage system that lies about write completion (a NAS presenting cached writes as committed, for example), the WAL cannot protect you.

The second class of corruption is logical rather than physical. A corrupted credential row, a malformed SSH passphrase stored in the database, or a character encoding issue in a recently added object causes the config backup to fail with "Configuration database contains corrupted data." The VBR services stay up and jobs may keep running, but the config backup fails every night and nobody notices until the window for clean recovery has closed.

Antivirus is a third cause nobody wants to admit to. An AV scanner that opens and scans PostgreSQL data files while they are being written can corrupt pages. Veeam documents the required exclusions. Environments that deploy a new AV product without reviewing those exclusions hit this within weeks.

V13 uses PostgreSQL by default in many deployment paths, but Microsoft SQL Server remains a supported option for the configuration database in applicable installations. PostgreSQL behaves differently from SQL Server under corruption. SQL Server has DBCC CHECKDB and well-documented repair paths that DBAs know. PostgreSQL recovery requires pg_resetwal and pg_dump/pg_restore workflows that most Windows-centric VBR admins have never touched. That gap is what this article covers.

Triage

  1. Check whether VBR services are running. Open Services (services.msc) and look for Veeam Backup Service (VeeamBackupSvc). If it is stopped and will not start, open Event Viewer and look under Windows Logs, Application for errors from Veeam or PostgreSQL. Database connection failures appear here immediately.
  2. Check PostgreSQL service status. Look for the service named postgresql-x64-17 (or the version bundled with your VBR install). If PostgreSQL is stopped, check the PostgreSQL log files at:
    C:\Program Files\PostgreSQL\17\data\log\
    Look for FATAL or PANIC entries. A PANIC with "could not locate a valid checkpoint record" indicates WAL corruption. A FATAL with "invalid page header" or "page verification failed" indicates data page corruption.
  3. Check VBR service logs for database errors. VBR logs are at:
    C:\ProgramData\Veeam\Backup\
    Open the most recent Svc.VeeamBackup log. Search for "database" or "PostgreSQL". Errors referencing specific tables (Credentials, Ssh_creds, Jobs) indicate logical corruption of specific rows rather than a physical database failure. This distinction drives the recovery path.
  4. Check when the last successful config backup ran. Go to the VBR console if it opens, or look at the config backup file timestamps in the target repository. If the last good .bco file is recent (within 24 hours), you have a clean restore path. If the last successful config backup was days or weeks ago, the restore will miss recent job additions and infrastructure changes.
  5. Determine whether backup jobs are still running despite the database issue. Jobs that were running before the corruption hit may have continued on their own schedule. Check the repository for recent backup files. If new VBK/VIB files exist with recent timestamps, jobs ran. The backup data is intact even if VBR cannot display it correctly.
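Triage steps 4 and 5 both come down to reading file timestamps on the repository. A minimal cross-platform sketch of that check (not a Veeam tool; the repository path and extensions in the example are assumptions you would adjust for your environment):

```python
# Sketch: report the newest modification time per backup-related file
# extension under a repository path. Recent .vbk/.vib files mean jobs
# kept running; the newest .bco timestamp is your last config backup.
from datetime import datetime, timezone
from pathlib import Path

def newest_by_ext(repo: Path, extensions: tuple) -> dict:
    """Return the newest mtime (UTC) per extension, or None if absent."""
    newest = {ext: None for ext in extensions}
    if not repo.exists():
        return newest
    for f in repo.rglob("*"):
        suffix = f.suffix.lower()
        if f.is_file() and suffix in newest:
            mtime = datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc)
            if newest[suffix] is None or mtime > newest[suffix]:
                newest[suffix] = mtime
    return newest

if __name__ == "__main__":
    # Illustrative path -- substitute your actual repository root.
    report = newest_by_ext(Path(r"D:\Backups"), (".vbk", ".vib", ".bco"))
    for ext, when in report.items():
        print(ext, when.isoformat() if when else "not found")
```

Run it against the repository root from an admin shell; an empty result for .bco combined with fresh .vbk/.vib timestamps matches the "jobs running, config backup failing" pattern from Recovery Path A.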
Decision Point
Logical corruption (specific table/row errors, config backup fails but services run): go to Recovery Path A. Physical corruption (PostgreSQL will not start, PANIC in logs, WAL errors): go to Recovery Path B. No config backup available: go to Recovery Path C.

Recovery Path A -- Logical Corruption (Services Running, Config Backup Failing)

The most common presentation: VBR runs fine, jobs complete, but the config backup fails every night with "Configuration database contains corrupted data" and an error naming a specific table and column.

  1. Open the VBR service log and identify the exact table and column in the error. The error format is:
    Configuration database contains corrupted data (table: Ssh_creds, column: passphrase)
    or:
    Failed to save database row (table Credentials, field password)
    Note the table name.
  2. If the corrupted table is Credentials or Ssh_creds: open the VBR console. Go to the main menu and select Manage Credentials. Look for recently added credentials that may have special characters, long passphrases, or non-ASCII characters in the password field. These are the most common culprits.
  3. Delete the suspicious credential entry. If you are not sure which one it is, sort by modification date and look at the most recently changed entries. Delete the most recent additions first and run a manual config backup after each deletion to test whether the corruption clears.
  4. Re-create the credential with a clean entry. If the original passphrase contained special characters, use a simpler one and verify the config backup succeeds with the new entry in place. If the config backup succeeds, the corruption is resolved.
  5. Run a manual config backup immediately and confirm it succeeds. Then verify that the next scheduled run also succeeds. A single successful manual run does not confirm the issue is fully resolved -- the nightly run may hit a different corrupted object.
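The encoding problems described in step 2 can be screened for before a credential is ever saved. A small local sketch (not a Veeam-supplied check) that flags the character classes most often implicated in corrupted credential rows:

```python
# Sketch: flag passphrase characters likely to cause encoding trouble in
# a stored credential row -- non-ASCII characters (often pasted from a
# document) and control characters. A screening aid only.
def suspicious_chars(passphrase: str) -> list:
    """Return a description of each character likely to cause trouble."""
    findings = []
    for i, ch in enumerate(passphrase):
        if ord(ch) > 127:
            findings.append(f"position {i}: non-ASCII character U+{ord(ch):04X}")
        elif ord(ch) < 32:
            findings.append(f"position {i}: control character U+{ord(ch):04X}")
    return findings

# Example: an en dash pasted from a document instead of a plain hyphen
print(suspicious_chars("backup\u2013admin"))
```

A common real-world case is exactly the example above: a password copied out of a word processor that silently replaced a hyphen with an en dash.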

Recovery Path B -- Physical Corruption (PostgreSQL Will Not Start)

  1. Stop all Veeam services immediately:
    Get-Service Veeam* | Stop-Service -Force
    Do not attempt to force PostgreSQL to start repeatedly. Each failed start attempt can worsen WAL corruption.
  2. Make a full copy of the PostgreSQL data directory before touching anything:
    xcopy "C:\Program Files\PostgreSQL\17\data" "C:\pg-backup-corrupt\" /E /I /H
    This is your safety net. If recovery attempts make things worse, you can restore this copy and try again.
  3. Attempt a manual start of PostgreSQL to assess the damage. Open a Command Prompt as Administrator:
    cd "C:\Program Files\PostgreSQL\17\bin"
    pg_ctl start -D "C:\Program Files\PostgreSQL\17\data" -l C:\pg-start.log
    Review C:\pg-start.log for the specific error. If it starts, proceed to step 4. If it does not start, proceed to step 5.
  4. If PostgreSQL starts but the VeeamBackup database is corrupted, attempt to dump what you can:
    pg_dump -U postgres -d VeeamBackup -f C:\veeam-dump.sql
    If pg_dump completes without errors, the database content is recoverable. Proceed to restore from config backup -- the dump is a last resort only, as Veeam does not support direct database restores from pg_dump output.
  5. If PostgreSQL will not start due to WAL corruption, you must restore from your configuration backup. Locate your most recent .bco file. This is the supported recovery path for physical database corruption -- attempting pg_resetwal to force a WAL reset is a last resort that can cause data loss and is not supported by Veeam. Go to step 6.
  6. Reinstall VBR using the exact same build as the current installation. During the installer's database setup step, choose PostgreSQL and use the same database superuser password as before. Allow the installer to create a fresh, empty VeeamBackup database.
  7. Stop all Veeam services after the reinstall completes:
    Get-Service Veeam* | Stop-Service -Force
  8. Open the VBR console. Go to the main menu, select Configuration Backup, then Restore. Select Restore mode. Browse to the most recent .bco file. Enter the encryption password. Click Analyze, then Restore. Wait for the wizard to complete.
  9. After restore, verify Backup Infrastructure connectivity and run a test restore before re-enabling jobs. Follow the same post-restore steps from Break Glass #01.
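Before proceeding past step 2, it is worth confirming the safety-net copy actually captured everything. A cheap sanity check is comparing file count and total byte size between source and copy -- not a cryptographic verification, just a guard against a partially completed xcopy. A sketch:

```python
# Sketch: verify a safety-net copy of the PostgreSQL data directory
# matches the source in file count and total size. Cheap sanity check
# only -- it will not catch a flipped bit, but it will catch a copy
# that stopped partway through.
from pathlib import Path

def dir_summary(root: Path) -> tuple:
    """Return (file_count, total_bytes) for everything under root."""
    files = [f for f in root.rglob("*") if f.is_file()]
    return len(files), sum(f.stat().st_size for f in files)

def copies_match(source: Path, copy: Path) -> bool:
    """True when both trees have the same file count and total size."""
    return dir_summary(source) == dir_summary(copy)
```

If `copies_match` returns False, re-run the copy before attempting any recovery step -- the safety net is the one thing you cannot recreate later.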

Recovery Path C -- No Valid Config Backup

PostgreSQL is physically corrupt and the last config backup predates the current environment or does not exist. This is the hard path.

  1. Attempt pg_dump against the corrupt database to extract whatever is readable. Even a partial dump gives you job names, repository paths, and infrastructure hostnames that you can use to rebuild manually.
    pg_dump -U postgres -d VeeamBackup -f C:\veeam-partial.sql --no-password 2>&1
    Errors will appear for corrupted objects. The readable sections still export.
  2. Open the partial SQL dump in a text editor. Search for repository paths, managed server hostnames, job names, and credential references. This is your reconstruction map.
  3. Reinstall VBR fresh. Re-add all managed servers, proxies, and repositories manually using the data from the dump. For each repository, run a Rescan after adding it to rediscover existing backup chains. The backup data is still on disk -- VBR just needs to find it again.
  4. Recreate all backup jobs. For jobs targeting repositories with existing chains, Veeam will detect the existing chain on the next run and continue incrementally. You do not lose existing restore points by recreating the job.
  5. Run a manual config backup immediately after reconstruction and verify it succeeds. Store it off-server with encryption enabled.
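The manual search in step 2 can be jump-started with a heuristic scan. This sketch pulls strings that look like Windows paths or hostnames out of the partial dump -- the regexes are rough heuristics for building the reconstruction map, not a parser for Veeam's schema:

```python
# Sketch: extract likely repository paths and server hostnames from a
# partial pg_dump output. Heuristic only -- expect some noise; every
# hit still needs to be verified by a human before re-adding it to VBR.
import re

def reconstruction_hints(dump_text: str) -> dict:
    return {
        # Windows-style paths: drive letter (D:\...) or UNC share (\\...)
        "paths": set(re.findall(r"(?:[A-Za-z]:\\|\\\\)[^\s'\",;)]+", dump_text)),
        # Dotted lowercase names that look like hostnames or FQDNs
        "hosts": set(re.findall(r"\b[a-z][a-z0-9-]*(?:\.[a-z0-9-]+)+\b", dump_text)),
    }
```

Feed it the contents of veeam-partial.sql and skim the two sets; a handful of minutes here usually recovers most of the repository and managed-server inventory.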

Gotchas

Do Not Use pg_dump/pg_restore as the Primary Recovery Method
Veeam explicitly documents that the only supported method for protecting and restoring the VBR configuration is the native Configuration Backup tool. Direct PostgreSQL database backup via pg_dump, native SQL backups, or third-party tools are not supported restore paths. A pg_restore into a running VBR installation will produce an inconsistent state that looks like it worked and fails in subtle ways later. Use pg_dump only as a forensic tool to extract readable data when no config backup exists.
Config Backup Failure Is Silent Unless You Check
The config backup job sends no email by default. If it fails, it shows a warning in the VBR console -- but only if you open the console and look at the Home dashboard. In environments where nobody checks the console daily, a config backup that has been failing for three weeks goes unnoticed. By the time you need it, the most recent .bco file is ancient. Set up email notifications for config backup results, or have a monitoring alert on the config backup file modification timestamp in the target repository.
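The external monitoring suggested above is little more than a file-age check. A minimal sketch of such a monitor, using the 25-hour threshold from the prevention checklist below (the target path and alerting mechanism are left as assumptions for your environment):

```python
# Sketch: alert when the newest .bco file under the config backup target
# is older than a threshold -- or when no .bco exists at all. Intended to
# run from a system OTHER than the VBR server, on a schedule.
import time
from pathlib import Path

MAX_AGE_SECONDS = 25 * 3600  # 25 hours, per the prevention checklist

def bco_is_stale(target: Path, max_age: float = MAX_AGE_SECONDS, now=None) -> bool:
    """True when no .bco file exists or the newest one is older than max_age."""
    now = time.time() if now is None else now
    bco_files = list(target.rglob("*.bco")) if target.exists() else []
    if not bco_files:
        return True  # nothing to restore from is itself an alert condition
    newest = max(f.stat().st_mtime for f in bco_files)
    return (now - newest) > max_age

if __name__ == "__main__":
    # Illustrative path -- substitute your config backup repository.
    if bco_is_stale(Path(r"\\repo01\configbackup")):
        print("ALERT: config backup is stale or missing")
```

Wire the print into whatever alerting you already have (email, Teams webhook, monitoring agent); the important part is that the check runs off-box, so a dead VBR server cannot silence it.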
Antivirus Must Exclude PostgreSQL Data Directories
Veeam documents specific AV exclusion paths for PostgreSQL. The data directory (default C:\Program Files\PostgreSQL\17\data\), the binary directory, and the WAL directory must all be excluded from real-time scanning. An AV scan that opens a PostgreSQL WAL segment during a write will corrupt it. This is documented in Veeam's deployment guides and is responsible for a significant percentage of production corruption cases.
pg_resetwal Is a Last Resort That Causes Data Loss
pg_resetwal forces PostgreSQL to discard the WAL and start fresh. It will get PostgreSQL running again. It will also discard any committed transactions that had not been checkpointed to the data files. In a VBR context this means job session history, recently added credentials, recently added infrastructure objects, and possibly job definitions. Only use pg_resetwal if you have confirmed that no config backup is available and you have exhausted all other options. Document what you are doing before you do it.
Logical Corruption Does Not Always Block Jobs
A corrupted credential row in the Credentials table does not necessarily stop the jobs that use that credential. VBR caches credential data in memory at service startup. Jobs may keep running with the cached values while the on-disk row is corrupt. This creates a window where everything looks fine but the config backup is failing nightly. The first sign of trouble is usually a config backup warning, not a job failure.
Storage That Lies About Write Completion
NAS devices with write caching enabled and no UPS present a completed write to the OS before the data reaches stable storage. When power fails, PostgreSQL's WAL is corrupt because writes it believed were committed never actually landed. If your VBR server's PostgreSQL data directory lives on a NAS, confirm that the NAS has write-back caching disabled or is UPS protected with a graceful shutdown configured. This is the most common cause of post-power-failure database corruption in v13.

Prevention Checklist

  • Configure email notifications or monitoring for config backup job results. Treat a config backup failure as a P2 incident -- copy the last valid .bco off-server while you investigate.
  • Add AV exclusions for PostgreSQL data and binary directories on the day VBR is installed. Check that these exclusions survive AV policy updates.
  • Store the PostgreSQL data directory on storage backed by a UPS with graceful shutdown configured. Do not store it on a NAS with write-back caching and no power protection.
  • Verify write-back caching is disabled on the storage hosting PostgreSQL data, or that the storage has battery-backed write cache (BBWC).
  • Run a manual config backup after every significant infrastructure change: adding a new repo, proxy, managed server, or job. Do not wait for the nightly schedule.
  • Keep at least 5 config backup retention points. A single-point retention means one bad backup overwrites your last good one.
  • Monitor config backup file age from an external system. If the .bco file on the target repository has not been updated in 25 hours, fire an alert.
Break Glass Recap
  • Two corruption types: logical (specific row/table, services run) and physical (PostgreSQL will not start)
  • Logical: identify the corrupt table in VBR logs, delete and re-create the offending credential
  • Physical: copy the data directory first, then attempt pg_dump for forensics, then restore from config backup
  • pg_dump/pg_restore is not a supported Veeam recovery path -- use it only as a forensic tool
  • pg_resetwal is a last resort that discards committed transactions not yet checkpointed to the data files
  • AV must exclude PostgreSQL data and binary directories -- this is the leading cause of new-install corruption
  • NAS with write-back caching and no UPS causes WAL corruption on power loss
  • Config backup failure is silent without explicit monitoring -- set up alerts on the .bco file age
  • Jobs may keep running with cached credentials even while the database row is corrupt
  • Backup data on repositories is not affected by VBR database corruption
