NTP Architecture for Backup Infrastructure: Why It Breaks Things and How to Build It Right

Time drift is one of those infrastructure problems that fails quietly. You don't get a clear error saying "your clocks are 7 minutes apart." What you get is a Kerberos authentication failure that looks like a permissions problem, a TLS handshake rejection that looks like a certificate issue, a replication job that can't establish a session and looks like a network problem, and a log correlation nightmare that makes all of it nearly impossible to trace. In Veeam environments specifically -- which span Windows servers, Linux proxies, storage systems, hypervisor hosts, and cloud appliances -- every one of those components needs to agree on the time, and they almost never use the same source by default. This article is about fixing that before it causes an incident.

Why Time Drift Breaks Veeam Environments Specifically

Most infrastructure can tolerate a few seconds of drift before anything visibly breaks. Veeam environments are less forgiving because the backup infrastructure crosses multiple protocol boundaries that each carry their own time tolerance.

Kerberos Failures
Kerberos tickets are time-bound. The default clock skew tolerance is 5 minutes. Drift beyond that and authentication fails for SMB connections to repositories, storage system management sessions, and guest interaction for application-aware processing.
TLS Certificate Rejection
TLS validates that the current time falls within a certificate's validity window. A clock that's significantly fast or slow makes a valid certificate appear expired or not yet valid. Cloud Connect, VSPC, and StoreOnce Catalyst connections are all TLS-protected and all fail this way.
Job Scheduling Drift
A backup job scheduled at midnight on a server with a 10-minute fast clock starts at 11:50 PM relative to the rest of the environment. Over weeks this compounds. Windows and Linux components may desync from each other as each drifts against a different upstream source.
Log Correlation Failures
When VBR logs, proxy logs, storage logs, and hypervisor logs disagree on timestamps, diagnosing a failed job becomes manual archaeology. Even 30 seconds of drift between components makes it nearly impossible to reconstruct a precise event sequence.
Replication Session Failures
Veeam replication jobs use session tokens with expiry windows. Time drift between the source VBR and replication target can cause sessions to be rejected mid-job, producing "session expired" errors that have nothing to do with network connectivity.
Cloud Connector Rejection
Cloud Connect service providers, cloud backup appliances, and VSPC all validate timestamps during API handshakes. A tenant whose VBR clock is off will fail to establish cloud sessions even with correct credentials.
The Invisible Failure Mode

Time drift rarely produces a clear "clock skew" error in Veeam logs. What you typically see is a downstream symptom: authentication failure, TLS handshake error, or session refused. The actual root cause requires checking chronyc tracking or w32tm /query /status on the affected component -- which most people skip during a failed backup investigation.

The Stratum Model: What You're Actually Building

NTP describes time sources in terms of stratum -- how many hops a clock is from a physical reference. You don't need to understand every layer, but you do need to understand where your internal servers sit and what that means for client accuracy.

Stratum 0 -- Physical reference clock: atomic clock, GPS receiver, radio time signal. Not a network device and never queried directly by infrastructure. Examples: GPS antenna, cesium oscillator, WWVB radio receiver.
Stratum 1 -- Servers directly connected to a Stratum 0 device. The top of the public NTP hierarchy; your internal servers query these. Examples: time.cloudflare.com, time.google.com, the pool.ntp.org tier.
Stratum 2 -- Your internal NTP servers. They query multiple Stratum 1 sources upstream and serve time internally. This is where your two internal NTP servers live and what your backup infrastructure queries.
Stratum 3 -- Your infrastructure clients, which query your Stratum 2 servers: VBR server, backup proxies, Linux repositories, ESXi hosts, Windows servers, storage systems, NAS appliances.

The design principle: two internal NTP servers at Stratum 2 querying multiple public Stratum 1 sources, with every piece of backup infrastructure pointing at those internal servers and nothing else. Public internet NTP servers should never be queried directly by production infrastructure -- that adds variable latency, external dependency, and removes control over synchronization behavior for isolated networks.

Two Servers, Not One

Always deploy two internal NTP servers. NTP's falseticker detection algorithm -- which discards sources reporting inconsistent time -- needs at least three sources to work correctly. With two internal servers plus a couple of external fallback sources on the clients, the algorithm has enough inputs to identify a bad source. With one internal server and nothing else, a clock fault goes undetected because there's no peer for comparison.
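As a sketch, a client's source list under this design might look like the following (the 10.x.x.* names are placeholders, and the external fallback line applies only where the client is permitted outbound UDP 123):

```text
# /etc/chrony.conf fragment -- client-side source list (illustrative)
# Two internal Stratum 2 servers (placeholder addresses)
server 10.x.x.ntp1 iburst
server 10.x.x.ntp2 iburst
# Optional external fallback so falseticker detection has >= 3 inputs
server time.cloudflare.com iburst
```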

chrony vs. ntpd vs. systemd-timesyncd

Three NTP implementations see common use across Linux distributions. For backup infrastructure NTP servers, the choice matters -- not every implementation can serve time to other hosts.

Feature | chrony | ntpd (ntp/ntpsec) | systemd-timesyncd
Can serve time to clients | Yes | Yes | No -- client only
Handles intermittent connectivity | Excellent | Moderate | Moderate
Virtual machine performance | Excellent | Moderate | Basic
Initial sync speed | Fast (iburst) | Slower | Moderate
Isolated network (local stratum) | Yes | Yes | No
NTP authentication (symmetric keys) | Yes | Yes | No
Default on RHEL / Rocky / Alma | Yes (RHEL 8+) | No | No
Default on Ubuntu / Debian | Available | Available | Yes (default)
Suitable as NTP server | Yes -- recommended | Yes | No

Use chrony for your NTP servers. It's the default on every RHEL derivative, handles virtualized environments better than ntpd, reaches synchronization faster after a restart, and manages large initial offsets more gracefully with makestep. On Ubuntu and Debian hosts that are NTP clients only, systemd-timesyncd is fine -- it just can't serve time to other hosts.

Configuring Your Internal NTP Servers

This configuration targets Linux hosts you're standing up as internal NTP servers -- Rocky, Alma, RHEL, Ubuntu, Debian, or SLES. Install chrony first, then replace the default config.

Install chrony

RHEL / Rocky / Alma / CentOS Stream
dnf install -y chrony
systemctl enable --now chronyd
Ubuntu / Debian
apt install -y chrony
systemctl disable --now systemd-timesyncd   # disable client-only daemon first
systemctl enable --now chronyd
SLES / openSUSE
zypper install -y chrony
systemctl enable --now chronyd

chrony.conf -- NTP Server

Both internal NTP servers get the same upstream sources and the same allow directives. The peer line makes them aware of each other, which adds a layer of cross-checking between your two servers. Adjust upstream server names to your preferred public Stratum 1 providers.

/etc/chrony.conf -- Internal NTP Server (both servers)
# Upstream Stratum 1 sources
# Use at least 3-4 for falseticker detection to work correctly
server time.cloudflare.com iburst
server time.google.com iburst
server time1.google.com iburst
server time.apple.com iburst

# Peer with the other internal NTP server
# Replace with the actual IP of your second NTP server
peer 10.x.x.y iburst

# Persist clock drift estimates across restarts
driftfile /var/lib/chrony/drift

# Step the clock on startup if offset exceeds 1 second
# Slewing a large offset takes hours; stepping does it immediately
# After 3 clock updates, switch to slewing only
makestep 1.0 3

# Sync hardware clock to system clock periodically
rtcsync

# Serve time to internal subnets
# Scope to your actual infrastructure subnets in production
allow 10.0.0.0/8
allow 172.16.0.0/12
allow 192.168.0.0/16

# If upstream sources are unreachable, keep serving at Stratum 10
# Clients see degraded accuracy but don't lose their time source entirely
local stratum 10

# Log clock changes for audit and troubleshooting
logdir /var/log/chrony
log measurements statistics tracking

local stratum 10 Explained

local stratum 10 tells chrony: if you lose all upstream sources, keep serving time to clients but advertise yourself at Stratum 10 so they know accuracy is uncertain. Without this, a temporary upstream outage causes your NTP server to stop responding entirely -- and your entire infrastructure loses its time source simultaneously. Stratum 10 signals degraded accuracy without creating a complete blackout.

Open the Firewall

firewalld (RHEL / Rocky / Alma)
firewall-cmd --add-service=ntp --permanent
firewall-cmd --reload
ufw (Ubuntu / Debian)
ufw allow 123/udp

Verify the Server is Synced and Serving

Check sync status and upstream sources
chronyc tracking

Reference ID    : A29FC701 (time.cloudflare.com)
Stratum         : 2
Ref time (UTC)  : Sun Mar 29 10:14:22 2026
System time     : 0.000012483 seconds fast of NTP time
Last offset     : +0.000012108 seconds
RMS offset      : 0.000018432 seconds
Frequency       : 4.211 ppm slow
Residual freq   : +0.002 ppm
Skew            : 0.088 ppm
Root delay      : 0.012834217 seconds
Root dispersion : 0.001204837 seconds
Update interval : 64.4 seconds
Leap status     : Normal

chronyc sources -v    # shows all upstream sources and Reach values
chronyc clients       # shows which infrastructure hosts are querying this server

The Reach column in chronyc sources should show 377 (octal) for all upstream sources after a few minutes -- meaning all 8 of the last 8 polls returned a valid response. Anything less indicates packet loss or an unreachable source. A value of 0 means the source is completely unreachable.
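The Reach value is an octal bitmask of the last eight poll results, newest in the rightmost bit. A quick sketch of decoding it -- here against a hard-coded sample value rather than live chronyc output:

```shell
# Decode a chrony Reach value (octal bitmask of the last 8 polls).
# reach=377 is a sample value; in practice take it from `chronyc sources`.
reach=377
val=$(printf '%d' "0$reach")   # interpret the octal string numerically (377 -> 255)
bits=""
for i in 7 6 5 4 3 2 1 0; do
  bits="${bits}$(( (val >> i) & 1 ))"   # 1 = poll answered, 0 = poll missed
done
echo "$bits"                   # newest poll result is the rightmost bit
```

A fully reachable source (377) decodes to 11111111; something like 17 decodes to 00001111, a source that only started answering four polls ago.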

Configuring Backup Infrastructure as NTP Clients

Once your internal servers are up and serving, every backup infrastructure component needs to point at them. The mechanism varies by OS and component type.

Linux Clients (VBR on Linux, Proxies, Linux Repositories)

For any Linux host that's a client only, the chrony.conf is simpler. Drop the allow and local stratum directives and point only at your internal servers.

/etc/chrony.conf -- Linux NTP Client
# Point only at internal NTP servers
server 10.x.x.ntp1 iburst prefer
server 10.x.x.ntp2 iburst

driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony

Windows Clients (VBR Server, Windows Proxies, Mount Servers)

Windows uses the W32Time service. For domain-joined machines, time flows from the PDC emulator down through the domain hierarchy. Configure the PDC emulator to use your internal NTP servers -- every domain-joined Windows host in the environment then gets time through the domain chain without individual host configuration.

PowerShell -- Configure PDC emulator (run on the PDC emulator DC)
w32tm /config /manualpeerlist:"10.x.x.ntp1,0x8 10.x.x.ntp2,0x8" /syncfromflags:manual /reliable:yes /update
Restart-Service w32time
w32tm /resync /force

# Verify sync on any Windows host
w32tm /query /status
w32tm /query /peers

The 0x8 flag marks each source as a client-mode association. Don't omit it. For workgroup Windows servers that aren't domain-joined -- a standalone VBR deployment, for instance -- configure W32Time directly on each host using the same command without the /reliable:yes flag.
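For that standalone/workgroup case, the per-host commands are the same minus /reliable:yes -- sketched here with the same placeholder server names used above:

```text
# Workgroup (non-domain) Windows host -- run on each host; placeholder addresses
w32tm /config /manualpeerlist:"10.x.x.ntp1,0x8 10.x.x.ntp2,0x8" /syncfromflags:manual /update
Restart-Service w32time
w32tm /resync /force
```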

VMware ESXi Hosts

ESXi has its own NTP client independent of any guest VMs. Configure it via esxcli or the vSphere Client under Host > Configure > Time Configuration.

ESXi Shell
esxcli system ntp set --server=10.x.x.ntp1 --server=10.x.x.ntp2
esxcli system ntp set --enabled=true
/etc/init.d/ntpd restart
esxcli system ntp get    # verify

ESXi Host Time vs. Guest VM Time

ESXi host time and guest VM time are independent. Guest VMs should sync to your internal NTP servers directly -- not rely on VMware Tools time synchronization from the hypervisor. VMware Tools time sync is useful for correcting large jumps after suspend/resume, but it shouldn't be the primary time source for production VMs. Disable VMware Tools periodic time sync and let the guest's NTP daemon handle routine synchronization.
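One way to turn off periodic VMware Tools time sync is through the VM's advanced configuration. The settings below are a sketch based on commonly documented VMX parameters -- verify the exact names against current VMware documentation for your vSphere version before applying:

```text
# .vmx advanced settings -- disable VMware Tools time synchronization
tools.syncTime = "0"                   # turn off periodic time sync
# Optionally also suppress the one-shot syncs on power operations,
# so the guest NTP daemon is the only time authority:
time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"
```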

Quick Distro Reference

Daemon, config path, and key commands differ slightly across distributions. Here's the reference for the three most common Linux families in backup infrastructure environments.

RHEL / Rocky / Alma / CentOS Stream
Default daemon: chronyd
Config: /etc/chrony.conf
Install: dnf install chrony
Enable: systemctl enable --now chronyd
Status: chronyc tracking
Firewall: firewall-cmd --add-service=ntp
Ubuntu / Debian
Default daemon: systemd-timesyncd (client only)
Config: /etc/chrony/chrony.conf
Install: apt install chrony
Disable timesyncd: systemctl disable --now systemd-timesyncd
Enable chrony: systemctl enable --now chronyd
Firewall: ufw allow 123/udp
SLES / openSUSE Leap / Tumbleweed
Default daemon: chronyd
Config: /etc/chrony.conf
Install: zypper install chrony
Enable: systemctl enable --now chronyd
Status: chronyc tracking
Firewall: firewall-cmd --add-service=ntp
Ubuntu Config Path Difference

On Ubuntu and Debian, chrony's config lives at /etc/chrony/chrony.conf -- note the extra directory level. On RHEL-based distros and SLES it's at /etc/chrony.conf directly. This trips people up when adapting configs between distributions.

Isolated Backup VLANs Without Internet Access

Dedicated backup VLANs often have no outbound internet access by design. Your infrastructure on those segments still needs a time source.

The right pattern: deploy your internal NTP servers on a management VLAN that does have internet access. Those servers sync to public Stratum 1 sources and then serve time to clients on the isolated backup VLAN via your inter-VLAN routing. Backup infrastructure queries the internal NTP servers' management IP -- it never needs internet access itself. UDP 123 is the only thing that needs to cross the VLAN boundary from backup clients to NTP servers.

If your NTP servers are themselves on a segment with zero external access, you have two workable options. Use your vCenter or hypervisor management host as the upstream reference -- it likely has external NTP access already -- and configure your internal NTP servers to sync from it. Alternatively, accept local stratum 10 operation with no external source. This degrades accuracy over time but keeps your environment synchronized relative to itself, which resolves the Kerberos and TLS failure modes even if wall-clock accuracy drifts.
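The first option amounts to a one-line upstream change on each internal NTP server. The hostname below is hypothetical -- substitute whatever management host on your network already has a valid external time source:

```text
# /etc/chrony.conf on an isolated internal NTP server
# vcenter.mgmt.example.internal is a placeholder for a management host
# that already syncs to external NTP
server vcenter.mgmt.example.internal iburst
# Keep the local fallback so clients stay synced even if that upstream dies
local stratum 10
```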

Monitoring Time Sync Health

Time sync problems accumulate silently. Build these checks into your monitoring stack so drift surfaces before it causes job failures.

Monitor chronyc tracking on NTP servers. Alert if the System time offset exceeds 50ms; treat anything over 250ms as critical. The Kerberos window is 5 minutes, but catch drift long before it reaches that threshold.
Check chronyc sources Reach values on NTP servers. A source below 377 for more than a few polls indicates intermittent reachability. A source stuck at 0 is completely unreachable -- investigate immediately.
On Windows hosts, monitor w32tm /query /status. The Source field should show your internal NTP server -- not CMOS, LOCAL, or time.windows.com. The latter two mean W32Time isn't syncing to your infrastructure at all.
Alert if the chronyd service is not running on any NTP server. A stopped chronyd is silent -- clients keep using their last drift correction and drift undetected until the offset becomes large enough to break things.
Monitor chronyc clients on NTP servers to verify backup infrastructure components are actively querying. A component that disappears from the client list has lost NTP connectivity -- that host is now free-running on local clock drift.
Correlate time sync alerts with Veeam job failures in your monitoring platform. If jobs start failing at the same time chrony offset alerts fire, treat time drift as the primary suspect even if the Veeam error message says something different.
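The first check above can be sketched as a small parsing step. Here it runs against a captured sample of `chronyc tracking` output rather than the live command, with the 0.050 s threshold matching the alert level described:

```shell
# Extract the system-time offset from `chronyc tracking` output and compare
# it to a 50 ms warning threshold. `sample` is a captured line; a real check
# would pipe `chronyc tracking` output into the same awk stages.
sample='System time     : 0.000012483 seconds fast of NTP time'
offset=$(printf '%s\n' "$sample" | awk '/^System time/ {print $4}')
status=$(awk -v o="$offset" 'BEGIN { if (o + 0 < 0.050) print "OK"; else print "WARN" }')
echo "offset=${offset}s status=${status}"
```

The same pattern extends to the critical threshold, or to feeding the result into whatever monitoring agent you already run.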

Quick Troubleshooting Reference

1. Job fails with authentication, Kerberos, or TLS error: Check time first. Run chronyc tracking on the VBR server and any proxy involved. Run w32tm /query /status on Windows components. If offset exceeds a few seconds, fix the time source before investigating anything else.
2. chronyc sources shows Reach = 0 for all upstream sources: UDP 123 is blocked. Check the firewall on the NTP server host and between the NTP server and its upstream sources. Confirm DNS resolves the upstream server names. Run chronyc -n sources to display sources as numeric IPs instead of reverse-resolved names.
3. Client shows the correct NTP server in sources but a large offset: The makestep correction may not have triggered if the service was already running when you made the config change. Run chronyc makestep to force an immediate step correction, or restart chronyd so makestep fires on startup.
4. NTP server advertises Stratum 10 instead of Stratum 2: The local stratum 10 fallback is active because chrony has lost all upstream sources. Check internet/network connectivity from the NTP server and verify upstream DNS resolution. This is correct fallback behavior -- the server is still serving time, just at degraded accuracy.
5. Windows domain member shows time.windows.com or CMOS as source: W32Time on this host isn't receiving time through the domain hierarchy. Verify the PDC emulator is configured with your internal NTP servers, check that the domain member can reach the PDC emulator, and run w32tm /resync /force on the affected host.

Key Takeaways
Time drift in Veeam environments produces downstream symptoms -- Kerberos failures, TLS rejections, replication errors -- that mask the root cause. Always verify clock sync before debugging certificate or credential issues in backup infrastructure.
Deploy two internal NTP servers at Stratum 2 querying at least three or four public Stratum 1 upstream sources. Every backup infrastructure component points at these internal servers -- never at public internet NTP servers directly.
Use chrony on every Linux NTP server. It's the default on RHEL-derivative distros, handles virtual environments better than ntpd, reaches synchronization faster, and is the only suitable choice among the three common Linux NTP implementations for serving time to other hosts.
On Ubuntu and Debian, install chrony and disable systemd-timesyncd before deploying an NTP server role. systemd-timesyncd is a client-only daemon and cannot serve time to other hosts.
Configure local stratum 10 on internal NTP servers. This keeps them serving time to clients at degraded-accuracy stratum if upstream sources go unreachable, rather than going silent and dropping all clients from time sync simultaneously.
For Windows domain environments, configure the PDC emulator to use your internal NTP servers. Every domain-joined Windows host -- VBR, proxies, mount servers -- inherits time through the domain hierarchy from there without individual host configuration.
Monitor chronyc tracking offset on NTP servers, Reach values on all upstream sources, and W32tm source on Windows hosts. Alert at 50ms offset. Treat time sync alerts and Veeam job failures as potentially related until proven otherwise.
