NTP Architecture for Backup Infrastructure: Why It Breaks Things and How to Build It Right
Time drift is one of those infrastructure problems that fails quietly. You don't get a clear error saying "your clocks are 7 minutes apart." What you get is a Kerberos authentication failure that looks like a permissions problem, a TLS handshake rejection that looks like a certificate issue, a replication job that can't establish a session that looks like a network problem, and a log correlation nightmare that makes all of it nearly impossible to trace. In Veeam environments specifically -- which span Windows servers, Linux proxies, storage systems, hypervisor hosts, and cloud appliances -- every one of those components needs to agree on the time, and they almost never use the same source by default. This article is about fixing that before it causes an incident.
Why Time Drift Breaks Veeam Environments Specifically
Most infrastructure can tolerate a few seconds of drift before anything visibly breaks. Veeam environments are less forgiving because the backup infrastructure crosses multiple protocol boundaries that each carry their own time tolerance.
Time drift rarely produces a clear "clock skew" error in Veeam logs. What you typically see is a downstream symptom: authentication failure, TLS handshake error, or session refused. The actual root cause requires checking chronyc tracking or w32tm /query /status on the affected component -- which most people skip during a failed backup investigation.
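Before digging into credentials or certificates, a quick first check is to compare the epoch clocks of the two hosts involved in the failed job. The sketch below is a minimal illustration, assuming you can collect `date +%s` from both hosts (over SSH, for example). The function names are hypothetical, and the 300-second limit is the default Kerberos tolerance of 5 minutes, which your domain policy may override.

```shell
#!/bin/sh
# clock_skew_seconds A B: print the absolute difference between two epoch
# timestamps (seconds), e.g. values collected with `date +%s` on two hosts.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( 0 - $diff ))
    echo "$diff"
}

# within_kerberos_window A B: succeed if the two clocks are within 300 s.
# 300 s is the default Kerberos tolerance (an assumption; policy may differ).
within_kerberos_window() {
    [ "$(clock_skew_seconds "$1" "$2")" -le 300 ]
}
```

A typical call would look like `within_kerberos_window "$(date +%s)" "$(ssh proxy01 date +%s)" || echo "check NTP first"`, where `proxy01` stands in for whichever component the job touched. This only gives whole-second resolution, so it is a triage step, not a replacement for chronyc tracking.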
The Stratum Model: What You're Actually Building
NTP describes time sources in terms of stratum -- how many hops a clock is from a physical reference. You don't need to understand every layer, but you do need to understand where your internal servers sit and what that means for client accuracy.
The design principle: two internal NTP servers at Stratum 2 querying multiple public Stratum 1 sources, with every piece of backup infrastructure pointing at those internal servers and nothing else. Production infrastructure should never query public internet NTP servers directly -- that adds variable latency and an external dependency, and it takes away control over synchronization behavior, which matters most on isolated networks.
Always deploy two internal NTP servers. NTP's falseticker detection algorithm -- which discards sources reporting inconsistent time -- needs at least three sources to work correctly. Each internal server meets that bar through its upstream sources plus its peer. Clients that point only at the two internal servers get redundancy, but with just two sources they cannot outvote a bad one, so monitor both servers closely. With one internal server and nothing else, a clock fault goes undetected because there's no peer for comparison.
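To see why the source count matters, here is a deliberately simplified illustration. It is not chrony's actual selection algorithm; it just flags any source whose offset sits too far from the median. The function name `flag_outliers`, the source names, and the 0.1-second threshold are all hypothetical.

```shell
#!/bin/sh
# flag_outliers THRESHOLD: read "name offset_seconds" lines on stdin and
# print the names whose offset differs from the median by more than THRESHOLD.
# Simplified illustration only; real NTP selection is interval-based.
flag_outliers() {
    sort -k2,2n | awk -v t="$1" '
        { name[NR] = $1; off[NR] = $2 }
        END {
            if (NR == 0) exit
            # Median of the sorted offsets
            if (NR % 2) med = off[(NR + 1) / 2]
            else        med = (off[NR / 2] + off[NR / 2 + 1]) / 2
            for (i = 1; i <= NR; i++) {
                d = off[i] - med
                if (d < 0) d = -d
                if (d > t) print name[i]
            }
        }'
}
```

With three sources at 0.001, 0.002, and 0.400 seconds, only the third is flagged. With two sources at 0.001 and 0.400, both sit equally far from the midpoint, so both are flagged and there is no way to tell which one is wrong.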
chrony vs. ntpd vs. systemd-timesyncd
Three NTP implementations see common use across Linux distributions. For backup infrastructure NTP servers, the choice matters -- not every implementation can serve time to other hosts.
| Feature | chrony | ntpd (ntp/ntpsec) | systemd-timesyncd |
|---|---|---|---|
| Can serve time to clients | Yes | Yes | No -- client only |
| Handles intermittent connectivity | Excellent | Moderate | Moderate |
| Virtual machine performance | Excellent | Moderate | Basic |
| Initial sync speed | Fast (iburst) | Slower | Moderate |
| Isolated network (local stratum) | Yes | Yes | No |
| NTP authentication (symmetric keys) | Yes | Yes | No |
| Default on RHEL / Rocky / Alma | Yes (v8+) | No | No |
| Default on Ubuntu / Debian | Available | Available | Yes (default) |
| Suitable as NTP server | Yes -- recommended | Yes | No |
Use chrony for your NTP servers. It's the default on every RHEL-derivative, handles virtualized environments better than ntpd, reaches synchronization faster after a restart, and manages large initial offsets more gracefully with makestep. On Ubuntu and Debian hosts that are NTP clients only, systemd-timesyncd is fine -- it just can't serve time to other hosts.
Configuring Your Internal NTP Servers
This configuration targets Linux hosts you're standing up as internal NTP servers -- Rocky, Alma, RHEL, Ubuntu, Debian, or SLES. Install chrony first, then replace the default config.
Install chrony
    # RHEL / Rocky / Alma
    dnf install -y chrony
    systemctl enable --now chronyd

    # Ubuntu / Debian
    systemctl disable --now systemd-timesyncd   # disable client-only daemon first
    apt install -y chrony
    systemctl enable --now chrony               # the unit is named chrony on Debian-based systems

    # SLES
    zypper install -y chrony
    systemctl enable --now chronyd
chrony.conf -- NTP Server
Both internal NTP servers get the same upstream sources and the same allow directives. The peer line makes them aware of each other, which adds a layer of cross-checking between your two servers. Adjust upstream server names to your preferred public Stratum 1 providers.
    # Upstream Stratum 1 sources
    # Use at least 3-4 for falseticker detection to work correctly
    server time.cloudflare.com iburst
    server time.google.com iburst
    server time1.google.com iburst
    server time.apple.com iburst

    # Peer with the other internal NTP server
    # Replace with the actual IP of your second NTP server
    peer 10.x.x.y iburst

    # Persist clock drift estimates across restarts
    driftfile /var/lib/chrony/drift

    # Step the clock on startup if offset exceeds 1 second
    # Slewing a large offset takes hours; stepping does it immediately
    # After 3 clock updates, switch to slewing only
    makestep 1.0 3

    # Sync hardware clock to system clock periodically
    rtcsync

    # Serve time to internal subnets
    # Scope to your actual infrastructure subnets in production
    allow 10.0.0.0/8
    allow 172.16.0.0/12
    allow 192.168.0.0/16

    # If upstream sources are unreachable, keep serving at Stratum 10
    # Clients see degraded accuracy but don't lose their time source entirely
    local stratum 10

    # Log clock changes for audit and troubleshooting
    logdir /var/log/chrony
    log measurements statistics tracking
local stratum 10 tells chrony: if you lose all upstream sources, keep serving time to clients but advertise yourself at Stratum 10 so they know accuracy is uncertain. Without this, a server that has lost its upstream sources eventually stops being a usable time source: once chrony considers itself unsynchronized, clients reject its replies -- and your entire infrastructure loses its time source simultaneously. Stratum 10 signals degraded accuracy without creating a complete blackout.
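Because the fallback is silent by design, it helps to have something that notices when a server has dropped into it. The following is a minimal sketch that reads `chronyc tracking` output on stdin and compares the Stratum line to the fallback value. The function names and the `FALLBACK_STRATUM` variable are hypothetical; the value 10 simply mirrors the `local stratum 10` directive.

```shell
#!/bin/sh
# chrony_stratum: read `chronyc tracking` output on stdin, print the stratum.
chrony_stratum() {
    awk -F':' '/^Stratum/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# in_local_fallback: succeed if the reported stratum equals the fallback
# value. FALLBACK_STRATUM must match the `local stratum` line in chrony.conf.
FALLBACK_STRATUM=10
in_local_fallback() {
    [ "$(chrony_stratum)" = "$FALLBACK_STRATUM" ]
}
```

On an internal NTP server this could run as `chronyc tracking | in_local_fallback && echo "upstream sources lost"`. It is meant for the servers themselves; a client syncing from a fallback server reports one stratum higher.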
Open the Firewall
    # firewalld
    firewall-cmd --add-service=ntp --permanent
    firewall-cmd --reload

    # ufw
    ufw allow 123/udp
Verify the Server is Synced and Serving
    chronyc tracking
    Reference ID    : A29FC701 (time.cloudflare.com)
    Stratum         : 2
    Ref time (UTC)  : Sun Mar 29 10:14:22 2026
    System time     : 0.000012483 seconds fast of NTP time
    Last offset     : +0.000012108 seconds
    RMS offset      : 0.000018432 seconds
    Frequency       : 4.211 ppm slow
    Residual freq   : +0.002 ppm
    Skew            : 0.088 ppm
    Root delay      : 0.012834217 seconds
    Root dispersion : 0.001204837 seconds
    Update interval : 64.4 seconds
    Leap status     : Normal

    chronyc sources -v   # shows all upstream sources and Reach values
    chronyc clients      # shows which infrastructure hosts are querying this server
The Reach column in chronyc sources should show 377 (octal) for all upstream sources after a few minutes -- meaning all 8 of the last 8 polls returned a valid response. Anything less indicates packet loss or an unreachable source. A value of 0 means the source is completely unreachable.
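Reach is an 8-bit shift register printed in octal, so intermediate values are easier to read once converted to "how many of the last 8 polls succeeded". A small sketch of that conversion, with a hypothetical function name:

```shell
#!/bin/sh
# reach_polls_ok REACH: REACH is the octal value from the Reach column of
# `chronyc sources`. Prints how many of the last 8 polls got a valid reply.
reach_polls_ok() {
    case $1 in
        ''|*[!0-7]*) echo "not an octal value: $1" >&2; return 1 ;;
    esac
    n=$(( 0$1 ))      # a leading 0 makes the shell read the value as octal
    count=0
    # Count the set bits; each bit is one successful poll
    while [ "$n" -gt 0 ]; do
        count=$(( $count + ($n & 1) ))
        n=$(( $n >> 1 ))
    done
    echo "$count"
}
```

For example, `reach_polls_ok 377` prints 8, `reach_polls_ok 17` prints 4 (only the four most recent polls succeeded), and `reach_polls_ok 0` prints 0.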
Configuring Backup Infrastructure as NTP Clients
Once your internal servers are up and serving, every backup infrastructure component needs to point at them. The mechanism varies by OS and component type.
Linux Clients (VBR on Linux, Proxies, Linux Repositories)
For any Linux host that's a client only, the chrony.conf is simpler. Drop the allow and local stratum directives and point only at your internal servers.
    # Point only at internal NTP servers
    server 10.x.x.ntp1 iburst prefer
    server 10.x.x.ntp2 iburst
    driftfile /var/lib/chrony/drift
    makestep 1.0 3
    rtcsync
    logdir /var/log/chrony
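If you have more than a handful of Linux clients, generating the file is less error-prone than editing it by hand on each host. The sketch below prints the same client configuration for two server addresses passed as arguments. The function name `render_client_conf` and the example addresses are hypothetical.

```shell
#!/bin/sh
# render_client_conf NTP1 NTP2: print a client-only chrony.conf on stdout.
# NTP1 is marked `prefer`, matching the client configuration in the text.
render_client_conf() {
    if [ "$#" -ne 2 ]; then
        echo "usage: render_client_conf NTP1 NTP2" >&2
        return 1
    fi
    cat <<EOF
# Point only at internal NTP servers
server $1 iburst prefer
server $2 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony
EOF
}
```

A call such as `render_client_conf 10.0.10.11 10.0.10.12 > /etc/chrony.conf` writes the file on a RHEL-family host; on Ubuntu or Debian the target would be /etc/chrony/chrony.conf. Restart the chrony service afterwards so the new sources take effect.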
Windows Clients (VBR Server, Windows Proxies, Mount Servers)
Windows uses the W32Time service. For domain-joined machines, time flows from the PDC emulator down through the domain hierarchy. Configure the PDC emulator to use your internal NTP servers -- every domain-joined Windows host in the environment then gets time through the domain chain without individual host configuration.
    w32tm /config /manualpeerlist:"10.x.x.ntp1,0x8 10.x.x.ntp2,0x8" /syncfromflags:manual /reliable:yes /update
    Restart-Service w32time
    w32tm /resync /force

    # Verify sync on any Windows host
    w32tm /query /status
    w32tm /query /peers
The 0x8 flag marks each source as a client-mode association. Don't omit it. For workgroup Windows servers that aren't domain-joined -- a standalone VBR deployment, for instance -- configure W32Time directly on each host using the same command without the /reliable:yes flag.
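If a Linux monitoring host collects `w32tm /query /status` output from Windows components, the Source line can be checked automatically. This is a rough sketch under a few assumptions: the output is from an English-language Windows install, it arrives on stdin (possibly with CRLF line endings), and the function names are hypothetical.

```shell
#!/bin/sh
# w32tm_source: read `w32tm /query /status` output on stdin, print the Source.
w32tm_source() {
    tr -d '\r' | sed -n 's/^Source:[[:space:]]*//p' | head -n 1
}

# w32tm_source_ok: succeed unless the source is missing, a local clock,
# or the Windows default internet server.
w32tm_source_ok() {
    src=$(w32tm_source)
    case $src in
        ''|*CMOS*|*Free-running*|*time.windows.com*) return 1 ;;
        *) return 0 ;;
    esac
}
```

The check only rules out the known-bad sources. It does not confirm that the source is one of your internal servers; on domain members the Source is normally a domain controller, so matching against an exact address would produce false alarms.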
VMware ESXi Hosts
ESXi has its own NTP client independent of any guest VMs. Configure it via esxcli or the vSphere Client under Host > Configure > Time Configuration.
    esxcli system ntp set --server=10.x.x.ntp1 --server=10.x.x.ntp2
    esxcli system ntp set --enabled=true
    /etc/init.d/ntpd restart
    esxcli system ntp get   # verify
ESXi host time and guest VM time are independent. Guest VMs should sync to your internal NTP servers directly -- not rely on VMware Tools time synchronization from the hypervisor. VMware Tools time sync is useful for correcting large jumps after suspend/resume, but it shouldn't be the primary time source for production VMs. Disable VMware Tools periodic time sync and let the guest's NTP daemon handle routine synchronization.
Quick Distro Reference
Daemon, config path, and key commands differ slightly across distributions. Here's the reference for the three most common Linux families in backup infrastructure environments.
On Ubuntu and Debian, chrony's config lives at /etc/chrony/chrony.conf -- note the extra directory level. On RHEL-based distros and SLES it's at /etc/chrony.conf directly. This trips people up when adapting configs between distributions.
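Scripts that push configs to mixed fleets can pick the right path from the standard os-release file instead of hard-coding it. A minimal sketch, assuming the `ID` and `ID_LIKE` fields are present; the function name `chrony_conf_path` is hypothetical.

```shell
#!/bin/sh
# chrony_conf_path [OS_RELEASE_FILE]: print the chrony.conf location for the
# distribution described by an os-release file (default: /etc/os-release).
chrony_conf_path() {
    file=${1:-/etc/os-release}
    # Collect the ID and ID_LIKE values, with surrounding quotes stripped
    ids=$(sed -n 's/^ID\(_LIKE\)\{0,1\}=//p' "$file" | tr -d '"' | tr '\n' ' ')
    case " $ids " in
        *" debian "*|*" ubuntu "*) echo /etc/chrony/chrony.conf ;;
        *)                         echo /etc/chrony.conf ;;
    esac
}
```

Anything that does not identify as Debian or Ubuntu falls through to /etc/chrony.conf, which covers the RHEL family and SLES as described above.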
Isolated Backup VLANs Without Internet Access
Dedicated backup VLANs often have no outbound internet access by design. Your infrastructure on those segments still needs a time source.
The right pattern: deploy your internal NTP servers on a management VLAN that does have internet access. Those servers sync to public Stratum 1 sources and then serve time to clients on the isolated backup VLAN via your inter-VLAN routing. Backup infrastructure queries the internal NTP servers' management IP -- it never needs internet access itself. UDP 123 is the only thing that needs to cross the VLAN boundary from backup clients to NTP servers.
If your NTP servers are themselves on a segment with zero external access, you have two workable options. Use your vCenter or hypervisor management host as the upstream reference -- it likely has external NTP access already -- and configure your internal NTP servers to sync from it. Alternatively, accept local stratum 10 operation with no external source. This degrades accuracy over time but keeps your environment synchronized relative to itself, which resolves the Kerberos and TLS failure modes even if wall-clock accuracy drifts.
Monitoring Time Sync Health
Time sync problems accumulate silently. Build these checks into your monitoring stack so drift surfaces before it causes job failures.
chronyc tracking on NTP servers. Alert if the System time offset exceeds 50 ms; treat anything over 250 ms as critical. The Kerberos tolerance is 5 minutes by default, but you want to catch drift long before it gets anywhere near that threshold.
chronyc sources Reach values on NTP servers. A source below 377 for more than a few polls indicates intermittent reachability. A source stuck at 0 is completely unreachable -- investigate immediately.
w32tm /query /status. The Source field should show your internal NTP server (or, on domain members, a domain controller) -- not Local CMOS Clock, Free-running System Clock, or time.windows.com. Any of those means W32Time isn't syncing to your infrastructure at all.
chronyc clients on NTP servers to verify backup infrastructure components are actively querying. A component that disappears from the client list has lost NTP connectivity -- that host is now free-running on local clock drift.
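The offset thresholds above map directly onto a monitoring check. The sketch below reads `chronyc tracking` output on stdin and prints a status word; the function name `offset_status` and the status strings are hypothetical and would need adapting to whatever your monitoring system expects.

```shell
#!/bin/sh
# offset_status: read `chronyc tracking` output on stdin and print
# OK, WARNING, or CRITICAL based on the absolute "System time" offset.
# Thresholds follow the text: 50 ms warning, 250 ms critical.
offset_status() {
    awk -v warn=0.050 -v crit=0.250 '
        /^System time/ {
            # Line format: "System time     : 0.000012483 seconds fast of NTP time"
            split($0, parts, ":")
            split(parts[2], f, " ")
            off = f[1] + 0
            if (off < 0) off = -off
            if (off > crit)      print "CRITICAL"
            else if (off > warn) print "WARNING"
            else                 print "OK"
            found = 1
            exit
        }
        END { if (!found) print "UNKNOWN" }'
}
```

Run it as `chronyc tracking | offset_status`. An UNKNOWN result means the System time line was missing, which usually points at chronyd not running rather than at drift.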
Quick Troubleshooting Reference
1. Job fails with authentication, Kerberos, or TLS error: Check time first. Run chronyc tracking on the VBR server and any proxy involved. Run w32tm /query /status on Windows components. If offset exceeds a few seconds, fix the time source before investigating anything else.
2. chronyc sources shows Reach = 0 for all upstream sources: UDP 123 is most likely blocked. Check the firewall on the NTP server host and between the NTP server and its upstream sources. Confirm DNS resolves upstream server names. Run chronyc -n sources to list the sources by numeric IP, which skips reverse DNS lookups in the output.
3. Client shows correct NTP server in sources but large offset: The makestep correction may not have triggered if the service was already running when you made the config change. Run chronyc makestep to force an immediate step correction, or restart chronyd to let makestep fire on startup.
4. NTP server advertises Stratum 10 instead of Stratum 2: The local stratum 10 fallback is active because chrony has lost all upstream sources. Check internet/network connectivity from the NTP server and verify upstream DNS resolution. This is correct fallback behavior -- the server is still serving time, just at degraded accuracy.
5. Windows domain member shows time.windows.com or CMOS as source: W32Time on this host isn't receiving time through the domain hierarchy. Verify the PDC emulator is configured with your internal NTP servers, check that the domain member can reach the PDC emulator, and run w32tm /resync /force on the affected host.
Finally, keep local stratum 10 configured on both internal NTP servers. If upstream sources become unreachable, the servers keep serving time at a degraded-accuracy stratum instead of dropping every client from time sync at once.