Monitoring and Observability on Proxmox VE 9.1

Article 9 in the Proxmox VE Cluster Design Fundamentals for v9.1 series. PVCD-08 covered backup with Proxmox Backup Server. This article covers monitoring and observability: pvestatd as the metrics collector, the built in external metric server backends (Graphite, InfluxDB, and OpenTelemetry), Prometheus via the prometheus-pve-exporter, the new notification system with Gotify, SMTP, and Webhook targets, Ceph health monitoring, and the journalctl commands that turn an incident into a post mortem instead of a guess.

Monitoring is the part of cluster design that gets done last and matters first when something breaks. Proxmox does not ship a built in dashboard the way vCenter does. The web UI shows live metrics for the current view, but production retention, cluster wide visualization, and alerting need an external metrics stack. The good news is Proxmox makes the integration straightforward: pvestatd already collects the common node, guest, and storage statistics, the external metric server feature ships the data to Graphite, InfluxDB, or OpenTelemetry out of the box, and the community maintained prometheus-pve-exporter handles the Prometheus side.

This article walks the metrics pipeline, the notification system that landed in PVE 8.1 and matured in 9.x, the Ceph health story for hyperconverged clusters, and the journalctl knowledge that pays off the first time something breaks at 3 AM. Series target version is Proxmox VE 9.1. Every claim below is sourced to the External Metric Server wiki, the Notifications chapter of the Administration Guide, the prometheus-pve-exporter GitHub repository, the Ceph documentation, or linked Proxmox forum threads with Proxmox staff responses.

pvestatd: the collector that already exists

Per the External Metric Server wiki and the pvestatd source, pvestatd is the per node statistics daemon. It runs on every Proxmox VE node and gathers data about the local system, running QEMU VMs, LXC containers, storage, and the local cluster view. It is what populates the live charts in the web UI.

What pvestatd collects per the wiki and the pvestatd Perl source:

Node level: CPU usage, memory usage, load average, network throughput per interface, root filesystem usage
VM and container level: CPU, memory, disk I/O, network I/O, status
Storage level: capacity, used space, status
Sampling interval: roughly every 10 seconds, hardcoded in the pvestatd source and consistent with the cadence seen in pvestatd logs

Gaps to cover outside the built in metric server output:

Cluster quorum status (whether the cluster is quorate, vote counts). The data is available via pvecm status or pvesh get /cluster/status but is not currently pushed to the external metric servers.
Per OSD Ceph health detail. Ceph specific monitoring requires the Ceph manager modules (covered later in this article).
Disk SMART data and wearout. Available in the GUI and via smartctl but not pushed to external metric servers.

The gap between what pvestatd collects and what gets shipped to external metric servers is real. For cluster status and Ceph detail you need either the Proxmox API (via Telegraf exec input or similar) or the prometheus-pve-exporter, both covered later.
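As a rough illustration of closing the quorum gap through the API, the sketch below summarizes a JSON document shaped like the output of pvesh get /cluster/status --output-format json. The sample data, the cluster and node names, and the quorum_summary helper are assumptions made for illustration; check the field names against what your own nodes return before wiring this into Telegraf or a cron job.

```python
import json


def quorum_summary(cluster_status):
    """Summarize quorum and node reachability from /cluster/status-style entries."""
    cluster = next((e for e in cluster_status if e.get("type") == "cluster"), None)
    nodes = [e for e in cluster_status if e.get("type") == "node"]
    return {
        # No "cluster" entry at all (for example a standalone node) is reported as not quorate.
        "quorate": bool(cluster and cluster.get("quorate")),
        "online": sorted(n["name"] for n in nodes if n.get("online")),
        "offline": sorted(n["name"] for n in nodes if not n.get("online")),
    }


# Hypothetical sample; a real response carries more fields per entry.
SAMPLE = json.loads("""[
  {"type": "cluster", "name": "lab", "nodes": 3, "quorate": 1},
  {"type": "node", "name": "pve1", "online": 1},
  {"type": "node", "name": "pve2", "online": 1},
  {"type": "node", "name": "pve3", "online": 0}
]""")

print(quorum_summary(SAMPLE))
```

A collector script could emit the quorate value as 0 or 1 and the offline list length as a gauge, which is enough to alert on.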

Built in external metric servers

Per the External Metric Server wiki and the Proxmox VE 9.0 Roadmap, Proxmox VE has built in support for external metric targets including Graphite, InfluxDB, and OpenTelemetry. Per the Roadmap, the 9.0 release added an OpenTelemetry plugin that pushes metrics via the OTLP/HTTP protocol to an external OpenTelemetry collector. This article focuses on Graphite and InfluxDB because they are the common PVE metric server paths, then covers Prometheus through prometheus-pve-exporter. The configuration lives in /etc/pve/status.cfg and can be edited via the web UI under Datacenter, Metric Server, or directly in the file.

Graphite backend

Per the wiki: the default port is 2003 and the default Graphite path is proxmox. By default Proxmox VE sends data over UDP, so the Graphite server has to be configured to accept UDP. The MTU can be configured for environments not using the standard 1500 MTU. The plugin can also be configured to use TCP instead.

Per the wiki: in order not to block the important pvestatd statistic collection daemon, a timeout is required to cope with network problems. This matters because pvestatd is on the status collection path; a hung metric server should not block local status updates.
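Before pointing pvestatd at a Graphite server, it can help to confirm the UDP listener actually accepts data. The sketch below builds one line in the Graphite plaintext format (path, value, Unix timestamp) and offers a helper to send it as a UDP datagram. The metric path proxmox.test.probe and both helper names are made up for this example, and a successful send does not prove receipt because UDP is fire and forget; check the Graphite side for the test metric.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in the Graphite plaintext protocol: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_udp(line, host, port=2003):
    """Send a single plaintext metric line as one UDP datagram (no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Fixed timestamp so the printed line is reproducible.
print(graphite_line("proxmox.test.probe", 1, 1700000000), end="")
# To actually send it: send_udp(graphite_line("proxmox.test.probe", 1), "192.168.10.50")
```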

InfluxDB backend

Per the wiki, the InfluxDB plugin supports both UDP (for InfluxDB 1.x) and HTTP/HTTPS (for InfluxDB 2.x and the 1.8.x compatibility API). For UDP, the InfluxDB server has to be configured for it. Example InfluxDB 1.x configuration verbatim from the wiki:

[[udp]]
   enabled = true
   bind-address = "0.0.0.0:8089"
   database = "proxmox"
   batch-size = 1000
   batch-timeout = "1s"

For InfluxDB 2.x via the HTTP API, per the wiki: since InfluxDB's v2 API is only available with authentication, a token has to be generated that can write into the correct bucket. By default Proxmox VE uses the organization proxmox and the bucket proxmox (configurable via the organization and bucket settings). In the v2 compatible API of 1.8.x, user:password can be used as the token if required, and the organization can be omitted because it has no meaning in InfluxDB 1.x.

Per the wiki: the HTTP timeout defaults to 1 second (configurable via the timeout setting) and the maximum batch size defaults to 25000000 bytes (configurable via max-body-size, corresponding to the InfluxDB setting of the same name).
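A quick way to check that a token can write into the bucket before configuring the metric server is to send one test point to the InfluxDB v2 write endpoint yourself. The sketch below only builds the request; the host, the token placeholder, the probe measurement, and the build_write_request helper are assumptions for illustration. Sending it is left as a comment so nothing leaves the machine by accident.

```python
import urllib.parse
import urllib.request


def build_write_request(host, token, org="proxmox", bucket="proxmox",
                        port=8086, scheme="https",
                        line="probe,source=manual value=1i"):
    """Build a POST to the InfluxDB v2 write API carrying one line-protocol point."""
    query = urllib.parse.urlencode({"org": org, "bucket": bucket})
    url = f"{scheme}://{host}:{port}/api/v2/write?{query}"
    return urllib.request.Request(
        url,
        data=line.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


req = build_write_request("192.168.10.51", "your-write-token-here")
print(req.get_method(), req.full_url)
# To send: urllib.request.urlopen(req, timeout=1)
# A write that InfluxDB v2 accepts is answered with HTTP 204 and an empty body.
```

For the 1.8.x compatibility API described above, the same request shape applies with user:password passed as the token.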

CLI configuration

Per Proxmox forum guidance, the metric server configuration in /etc/pve/status.cfg can also be edited directly. A minimal Graphite configuration:

graphite: GraphiteServer1
        server 192.168.10.50
        port 2003

A minimal InfluxDB v2 configuration:

influxdb: InfluxV2
        server 192.168.10.51
        port 8086
        influxdbproto https
        organization proxmox
        bucket proxmox
        token your-write-token-here
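When the file is edited by hand, a small sanity check helps catch a misplaced line. The sketch below reads the stanza layout used above: a "type: id" header line followed by indented "key value" lines. It is a simplified reader written for this article, not the parser Proxmox itself uses, and parse_status_cfg is a hypothetical helper name.

```python
def parse_status_cfg(text):
    """Parse 'type: id' stanzas with indented 'key value' lines into a dict keyed by id."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace() and ":" in line:
            # Unindented line with a colon starts a new stanza.
            plugin, name = (part.strip() for part in line.split(":", 1))
            current = {"type": plugin}
            sections[name] = current
        elif current is not None:
            # Indented line: first word is the key, the rest is the value (kept as a string).
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return sections


SAMPLE = """graphite: GraphiteServer1
        server 192.168.10.50
        port 2003

influxdb: InfluxV2
        server 192.168.10.51
        port 8086
        influxdbproto https
"""

for name, options in parse_status_cfg(SAMPLE).items():
    print(name, options)
```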

Prometheus via prometheus-pve-exporter

The Prometheus path is community maintained. The canonical project is prometheus-pve-exporter. It is a Python exporter that talks to the Proxmox VE REST API, queries metrics, and exposes them in Prometheus format on a scrape endpoint (default port 9221).

Per the prometheus-pve-exporter README, the exporter requires Python 3.9 or better and can be deployed either directly on a Proxmox VE node or on a separate machine. For security, the README recommends adding a user with read only access (PVEAuditor role) for metrics collection.

User and token setup

Per the prometheus-pve-exporter README and consistent Proxmox forum threads, the recommended setup uses an API token rather than a password. The pveum commands:

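pveum user add prometheus@pve --password "placeholder"
pveum acl modify / -user prometheus@pve -role PVEAuditor
pveum user token add prometheus@pve exporter -privsep 1
pveum acl modify / -token 'prometheus@pve!exporter' -role PVEAuditor

Separated privileges are the safer default. With privilege separation enabled, the token's effective permissions are the intersection of the user's permissions and the token's own ACLs, so both the user and the token need the PVEAuditor role. Do not disable privilege separation unless the integration genuinely needs the full permissions of the associated user.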

Exporter configuration

The exporter configuration is YAML at a path you choose (commonly /etc/prometheus/pve.yml). Per the project README:

default:
    user: prometheus@pve
    token_name: exporter
    token_value: your-token-secret-here
    verify_ssl: true

Per the README: when operating PVE with self signed certificates, set verify_ssl: false in the config or import the certificate into the local trust store. The README notes that PVE supports Let's Encrypt out of the box and that setting up trusted certificates is the better option in production.

Prometheus scrape config

Per the project README, the Prometheus scrape configuration uses the multi target scrape configuration:

scrape_configs:
  - job_name: 'pve'
    static_configs:
      - targets:
        - 192.168.1.2  # Proxmox VE node
        - 192.168.1.3  # Proxmox VE node
    metrics_path: /pve
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9221  # PVE exporter

This configuration lets Prometheus scrape multiple Proxmox nodes through a single exporter instance.
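The relabeling can be hard to read the first time. In effect, each listed node address becomes a target query parameter, the instance label is set to that address, and the request itself goes to the exporter. The sketch below reproduces the resulting URL for each node so it can be tested with curl by hand. The scrape_url helper is illustrative only, and the parameter order Prometheus actually sends may differ.

```python
from urllib.parse import urlencode


def scrape_url(target, exporter="127.0.0.1:9221", metrics_path="/pve", module="default"):
    """Return the URL that the relabeled scrape effectively requests for one PVE node."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}{metrics_path}?{query}"


for node in ("192.168.1.2", "192.168.1.3"):
    # The node address ends up as ?target=... while the HTTP request goes to the exporter.
    print(scrape_url(node))
```

Fetching one of these URLs manually is a quick way to tell an exporter authentication problem apart from a Prometheus configuration problem.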

Collectors

Per the prometheus-pve-exporter README, the exporter exposes multiple collector groups, controllable via the --collector.X and --no-collector.X flags:

status. Node, VM, container status (online, running)
version. PVE version info
node. PVE node info
cluster. PVE cluster info
resources. PVE resources info
backup-info. Information about guests not covered by any backup job
config. PVE onboot status (per node)
replication. PVE replication info (per node)
subscription. PVE subscription info (per node)

Per the README, the config collector results in one API call per guest VM or container. It is recommended to disable this collector with --no-collector.config on big deployments.

Per the project releases, the exporter added pve_ha_state and pve_lock_state metrics in v3.5.0 (released January 2025). For HA configured clusters this fills part of the gap that the built in metric server leaves.
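To see what a collector actually returns before building dashboards on it, the exporter output can be inspected directly. The sketch below parses a small piece of Prometheus text exposition format and lists resources whose up gauge is 0. The sample text is hypothetical, and the pve_up metric name and id label are assumptions about the status collector's output; confirm them against your own exporter's /pve response. The parser is deliberately simplified and ignores escaped quotes inside label values.

```python
import re

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse labeled samples from Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


def resources_down(text, metric="pve_up"):
    """Return the sorted ids of resources whose status gauge reads 0."""
    return sorted(
        labels["id"]
        for name, labels, value in parse_samples(text)
        if name == metric and value == 0 and "id" in labels
    )


# Hypothetical exporter output for illustration.
SAMPLE = """\
# TYPE pve_up gauge
pve_up{id="node/pve1"} 1.0
pve_up{id="qemu/100"} 1.0
pve_up{id="lxc/101"} 0.0
"""

print(resources_down(SAMPLE))
```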

Ceph health monitoring

Per the Ceph documentation and Proxmox Ceph wiki, Ceph cluster health is exposed via the standard Ceph CLI tools, which are available on every Proxmox node that participates in a Ceph cluster. The command set is the same as upstream Ceph.

The CLI commands that matter

ceph -s              # Cluster status summary
ceph health          # HEALTH_OK / HEALTH_WARN / HEALTH_ERR
ceph health detail   # Detailed warnings with explanations
ceph -w              # Live event stream (Ctrl+C to stop)
ceph osd tree        # OSD topology and status
ceph osd df tree     # OSD usage by host and rack
ceph df              # Cluster and per pool capacity
ceph pg stat         # PG state summary
ceph quorum_status   # MON quorum detail (JSON capable)

Per the Ceph documentation, the health states:

HEALTH_OK. All OSDs up and in, all PGs active+clean, MON quorum stable, no warnings.
HEALTH_WARN. Capacity warnings (nearfull, backfillfull), slow OSDs, rebalancing in progress, recoverable degraded states. Investigate, but the cluster is serving I/O.
HEALTH_ERR. A condition that prevents some I/O or risks data loss. PG inactive, OSD full, MON quorum lost. Stop and address before doing anything else.
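For a scripted check, the three states map naturally onto the exit code convention many monitoring tools use (0 OK, 1 warning, 2 critical, 3 unknown). The sketch below works on a JSON document shaped like the output of ceph health detail --format json. The sample report and the summarize_health helper are assumptions for illustration; compare the structure with what your Ceph release prints.

```python
import json

EXIT_CODES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def summarize_health(report):
    """Return (exit_code, lines) for a Ceph health report, skipping muted checks."""
    status = report.get("status", "UNKNOWN")
    lines = []
    for name, check in sorted(report.get("checks", {}).items()):
        if check.get("muted"):
            continue
        message = check.get("summary", {}).get("message", "")
        lines.append(f"{check.get('severity', '?')} {name}: {message}")
    # Anything that is not a known health state is reported as 3 (unknown).
    return EXIT_CODES.get(status, 3), lines


# Hypothetical report for illustration.
SAMPLE = json.loads("""{
  "status": "HEALTH_WARN",
  "checks": {
    "OSD_NEARFULL": {
      "severity": "HEALTH_WARN",
      "summary": {"message": "1 nearfull osd(s)", "count": 1},
      "muted": false
    }
  },
  "mutes": []
}""")

code, lines = summarize_health(SAMPLE)
print(code)
for line in lines:
    print(line)
```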

Ceph in the Proxmox UI

Per the Proxmox Ceph wiki, the Datacenter, Ceph panel in the web UI provides a live view of cluster health, OSD status, MON status, and pool usage. It is sufficient for ad hoc visibility but does not retain history. For trend analysis (OSD usage over weeks, PG count drift, recovery duration) a metrics pipeline is needed.

Ceph metrics into Prometheus

Per the Ceph documentation, the Ceph manager (ceph-mgr) ships with a Prometheus module that can be enabled to export Ceph metrics directly. On a Proxmox HCI cluster:

ceph mgr module enable prometheus
ceph mgr module ls   # verify "prometheus" is in enabled_modules

The Ceph manager Prometheus module exposes Ceph metrics on port 9283, including health check metrics and cluster level data. Some detailed metrics need explicit configuration: RBD per image I/O statistics require setting mgr/prometheus/rbd_stats_pools, and in newer Ceph releases per daemon performance counters are served by the separate ceph-exporter daemon rather than by the manager module by default. Verify which Ceph metrics your Proxmox build exposes before building dashboards that depend on per OSD latency or per image I/O.

Decide deliberately how Prometheus reaches the managers: scrape every manager endpoint directly, or put service discovery or a proxy in front of them. Only one manager is active at a time, and how standby managers respond to scrapes is configurable. Listing all manager nodes as scrape targets means metrics keep flowing across manager failovers without manual reconfiguration.

Maintenance flags

Per the Ceph documentation, during planned maintenance (host reboot for kernel update, hardware swap, etc), set the noout flag to prevent unnecessary recovery while OSDs are temporarily down:

ceph osd set noout
# do the maintenance, reboot the node, etc
ceph osd unset noout

Per the Ceph docs, additional flags for sensitive interventions: norecover and norebalance pause data movement entirely. Always unset these flags after maintenance; leaving them set degrades cluster resilience.
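Forgetting to unset a flag is the usual failure mode. One way to make the unset hard to skip in a maintenance script is a context manager that always clears the flag on exit, even when the maintenance step raises. The sketch below is an illustration with hypothetical helper names (run_ceph, osd_flag); it assumes the script runs on an admin host or a node that is not the one being rebooted, since a killed process never reaches its cleanup.

```python
import subprocess
from contextlib import contextmanager


def run_ceph(*args):
    """Run a ceph CLI command and raise if it exits non-zero."""
    subprocess.run(["ceph", *args], check=True)


@contextmanager
def osd_flag(flag, run=run_ceph):
    """Set an OSD flag for the duration of the block and always unset it afterwards."""
    run("osd", "set", flag)
    try:
        yield
    finally:
        run("osd", "unset", flag)


# Dry run with a recording stand-in for the ceph CLI, so nothing touches a real cluster.
calls = []
with osd_flag("noout", run=lambda *args: calls.append(args)):
    calls.append(("maintenance",))
print(calls)
```

In real use the default runner applies: with osd_flag("noout"): followed by the maintenance steps.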

The notification system

Per the Proxmox VE Notifications chapter of the Administration Guide, Proxmox VE 8.1 introduced a unified notification system that supersedes the old root mail behavior. The system has matured through 8.x and continues in 9.x.

Architecture

Per the Notifications chapter, the system has three components:

Notification events. Backup job completion, fencing, package updates available, Ceph health changes, replication errors, ZFS scrub results, etc. Each event has metadata fields (severity, type, hostname, etc).
Notification matchers. Rules that match on event metadata (severity, type, time, hostname) and route events to one or more targets.
Notification targets. The destinations: sendmail, SMTP, Gotify, and Webhook (introduced in PVE 8.3).

Per the Administration Guide: the configuration is stored in /etc/pve/notifications.cfg and /etc/pve/priv/notifications.cfg. The latter contains sensitive options like passwords and authentication tokens and can only be read by root.

Target types

Per the Notifications chapter:

Sendmail. Uses the system's sendmail binary to email a configured list of users or addresses. The default target in fresh installs is mail-to-root, which sends to the email address configured for root@pam. Reasonable for single host labs, useless for production unless that address is a monitored mailbox and the node can actually relay mail.
SMTP. Authenticated SMTP relay with SSL/TLS. The right answer for production: configure an external SMTP relay (your existing mail infrastructure or a transactional service) so notifications actually reach humans.
Gotify. Push notifications via a Gotify server. Tokens are generated in the Gotify web interface. Good fit if your team already runs Gotify for push.
Webhook. HTTP request to a configurable URL with a templated body. Introduced in PVE 8.3 per the Administration Guide. Routes notifications to chat platforms (Slack, Discord, Mattermost, Teams) or to your own incident management ingest.

Webhook templating

Per the Administration Guide, Webhook bodies and headers use Handlebars templates. A minimal webhook body using the templated title and message fields:

{ "title": "{{ escape title }}", "message": "{{ escape message }}" }

Per the docs, the escape helper handles JSON escaping for special characters (newlines, quotes, control characters) so the payload stays valid JSON. Without escape, multi line backup notifications break the JSON body and the webhook target rejects them with HTTP 400. For chat platforms with character limits (Discord caps webhook content at 2000 characters per the Discord API), oversized notifications also get rejected. Per Proxmox forum threads, the workaround for Discord specifically is to use a custom Handlebars template that omits the logs section, since vzdump backup notifications can exceed the limit on multi VM jobs.
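The failure mode is easy to reproduce outside Proxmox. The sketch below is a Python analogy, not the Handlebars helper itself: it fills the same two-field body once by plain string substitution and once with JSON escaping, then checks whether each result parses. The sample notification text and helper names are made up for the example.

```python
import json


def naive_body(title, message):
    """Substitute raw text into the template, as a template without escaping would."""
    return '{ "title": "%s", "message": "%s" }' % (title, message)


def escaped_body(title, message):
    """Substitute JSON-escaped text; json.dumps supplies the quotes and the escaping."""
    return '{ "title": %s, "message": %s }' % (json.dumps(title), json.dumps(message))


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Multi line text with embedded quotes, similar in shape to a backup job summary.
message = 'vzdump: VM 100 "web" OK\nvzdump: VM 101 failed'

print(is_valid_json(naive_body("Backup", message)))    # False
print(is_valid_json(escaped_body("Backup", message)))  # True
```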

Matchers

Per the Administration Guide, matchers route events to targets based on rules. A matcher can match on:

Severity (info, notice, warning, error, unknown). Per the docs, system-mail events from local daemons like smartd come through with severity unknown.
Type (fencing, replication, package-updates, vzdump, system-mail, etc)
Time of day and day of week
Specific metadata fields with regex

The classic operational design: warning and error severities go to an on call channel (Webhook to Slack or PagerDuty), info notices go to a logging endpoint (or are dropped), and the daily backup notifications go to a separate Webhook target that posts to a backup status channel without paging anyone.

The legacy sendmail mode

Per the Administration Guide, backup jobs still support a legacy notification mode that sends mail directly via the sendmail binary using only the email address configured in the backup job, bypassing the global notification matchers. This mode exists for compatibility with pre 8.1 behavior and may be removed in a future release. For new deployments, use the notification system mode (the default in PVE 9.x) so matchers and targets actually apply.

journalctl commands for incident triage

Proxmox VE is a Debian system; everything goes through systemd journald. journalctl is the universal entry point. The commands that matter operationally:

Service specific logs

journalctl -u pve-cluster -f             # pmxcfs, the cluster filesystem
journalctl -u pveproxy -f                # API and web UI
journalctl -u pvedaemon -f               # Local pve management daemon
journalctl -u pvestatd -f                # Statistics collection
journalctl -u pve-ha-lrm -f              # HA local resource manager
journalctl -u pve-ha-crm -f              # HA cluster resource manager
journalctl -u pve-firewall -f            # Firewall rule application
journalctl -u corosync -f                # Cluster communication
journalctl -u ceph-mon@$(hostname) -f    # Ceph monitor on this host
journalctl -u ceph-osd@<id> -f           # Specific Ceph OSD
journalctl -u ceph-mgr@$(hostname) -f    # Ceph manager
journalctl -u zfs-zed -f                 # ZFS event daemon (the unit is zfs-zed)

Time bounded queries

journalctl --since "2026-05-09 14:00" --until "2026-05-09 16:00"
journalctl --since "1 hour ago"
journalctl -u corosync --since today
journalctl -u pve-ha-crm --since "yesterday"

Boot scoped queries

journalctl --list-boots           # List boot IDs
journalctl -b                     # Current boot only
journalctl -b -1                  # Previous boot
journalctl -b -1 -p err           # Errors from previous boot
journalctl -b -1 -u pve-cluster   # Cluster service logs from previous boot

The -b -1 command is essential after a node has been rebooted (planned or unplanned). It scopes journalctl to the previous boot's logs so you can see what happened before the reboot without filtering through the current boot.

Priority filtering

journalctl -p err -b                         # Errors and worse, current boot
journalctl -p warning --since "1 hour ago"   # Warnings and worse, last hour
journalctl -p crit                           # Critical and worse, all time

Per systemd journal documentation, the priority levels (lowest to highest): debug, info, notice, warning, err, crit, alert, emerg. -p err matches err and higher (crit, alert, emerg).
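The "this level and everything more severe" rule is easier to remember with the syslog numbering behind it, where emerg is 0 and debug is 7 and a single -p value selects everything from 0 up to that number. A minimal sketch of that rule follows; matched_levels is a helper written for this article.

```python
# Syslog priorities in numeric order: index 0 (emerg) is the most severe, 7 (debug) the least.
PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def matched_levels(threshold):
    """Return the levels selected by a single journalctl -p <threshold> value."""
    return PRIORITIES[: PRIORITIES.index(threshold) + 1]


print(matched_levels("err"))      # ['emerg', 'alert', 'crit', 'err']
print(matched_levels("warning"))  # ['emerg', 'alert', 'crit', 'err', 'warning']
```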

Kernel messages and hardware events

journalctl -k                                # Kernel messages (implies the current boot)
journalctl -k -p err -b                      # Kernel errors, current boot
journalctl _TRANSPORT=kernel --since today   # Field match form, not limited to the current boot

For SCSI errors, NVMe issues, network interface up/down events, kernel messages are the primary source. Combine with dmesg --human --kernel --level=err,crit,alert,emerg for a focused view.

Centralized logging

Per Debian and systemd documentation, setting ForwardToSyslog=yes in /etc/systemd/journald.conf makes journald hand messages to the local syslog daemon (rsyslog, for example), which can then forward them to a central server. For production clusters, central log aggregation (rsyslog forwarding to a Loki instance, a SIEM, or a Splunk indexer) is the right answer. Local journald retention is bounded; central retention is the post mortem source of truth.

Putting it together: what a production stack looks like

The reference monitoring stack for a production Proxmox 9.1 cluster:

Metrics pipeline. Either Prometheus (prometheus-pve-exporter plus Ceph mgr Prometheus module on HCI clusters, plus node_exporter on each node for OS level metrics) or InfluxDB plus Telegraf (with the Proxmox built in InfluxDB target plus Telegraf for any gaps).
Visualization. Grafana with cluster, node, and per VM dashboards. The community maintains Grafana dashboard 10347 for prometheus-pve-exporter as a starting point; expect to customize.
Alerting. Alertmanager (with Prometheus) or Grafana alerting (with InfluxDB or Prometheus) for capacity, host down, Ceph health, replication failure, backup failure. Webhook to Slack, Teams, or PagerDuty for routing.
Notification system. Configure SMTP and Webhook targets in PVE Notifications. Route fencing and Ceph errors to on call. Route backup notifications to a backup channel. Disable legacy sendmail backup mode in favor of the notification system.
Log aggregation. Forward journald to a central log store. Loki plus Grafana for self hosted; commercial SIEM if you have one. Set central log retention to match your incident review, compliance, and support requirements. Thirty days is a practical starting point, not a universal minimum.
Synthetic checks. External tools (Uptime Kuma, Smokeping, or commercial monitoring) hitting the web UI and SSH on each node from outside the cluster. Confirms the cluster is reachable from the network the operators actually use.

The monitoring design mistakes that bite you later

No external metrics retention

The Proxmox web UI graphs are useful for local spot checks, but they are not a monitoring system. They use RRD data, lose detail as data ages, and do not provide alerting, cluster wide dashboards, or long term investigation workflows. Use an external metrics stack for production retention and alerting. Even a small external metrics store with useful retention can make incident review much easier.

Cluster status not monitored

Current built in metric server output does not cover every operational signal. A node can lose quorum while the metric pipeline still shows healthy node metrics. Use prometheus-pve-exporter (which exposes cluster info) or a Telegraf exec input running pvecm status periodically. Verify quorum is in the alerting layer.

Ceph manager Prometheus module not enabled

The PVE built in metric server does not include detailed Ceph metrics. On HCI clusters this leaves the storage tier opaque. Enable the Ceph manager Prometheus module, scrape all manager nodes, and dashboard OSD usage, PG counts, scrub progress, and recovery throughput. The first time a Ceph PG goes inactive in production is not when you want to discover the dashboard does not exist.

No alerting on capacity

ZFS pools, Ceph pools, and PBS datastores all have a point of no return where free space falls below the operational threshold. Alert at 70% used (warning) and 80% used (critical) on every storage pool. The lead time matters: discovering at 95% used means a fire drill, alerting at 70% means a planned hardware order.
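The two thresholds translate into a very small classification rule. The sketch below is one way to express it for a custom check or a unit test around alert rules. The function name and defaults are chosen for this article, and the thresholds are the 70% and 80% figures above rather than values taken from any Proxmox or Ceph default.

```python
def capacity_state(used_bytes, total_bytes, warn=0.70, crit=0.80):
    """Classify pool usage as 'ok', 'warning', or 'critical' by used fraction."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


TIB = 1024 ** 4
print(capacity_state(6 * TIB, 10 * TIB))     # ok
print(capacity_state(7.5 * TIB, 10 * TIB))   # warning
print(capacity_state(8.5 * TIB, 10 * TIB))   # critical
```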

Backup notifications going only to root mail

Default Proxmox installs send backup notifications to root@hostname. Nobody reads root mail. Configure the SMTP target in Notifications to deliver to a real distribution list, and add a Webhook target to a chat channel. Backup failures are the kind of notification that should be loud.

No alerting on certificate expiration

Let's Encrypt renews automatically until it does not (DNS challenge misconfiguration, ACME endpoint changes, account expiration). Monitor cert expiration as a metric (Prometheus blackbox_exporter or a synthetic check). Catching this 30 days out is trivial; finding out the morning the cert expires means every browser session and every WebAuthn login is already broken.
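Where blackbox_exporter is not available, a synthetic check can compute the remaining lifetime itself. The sketch below uses only the Python standard library. The helper names are hypothetical, and fetch_not_after assumes the certificate on port 8006 validates against the local trust store, so it raises for an untrusted self signed certificate.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left until a certificate's notAfter time, given in the 'Mon DD HH:MM:SS YYYY GMT' form."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=8006, timeout=5.0):
    """Fetch notAfter from a live TLS endpoint; fails if the certificate does not validate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with fixed dates so the result is reproducible.
reference = ssl.cert_time_to_seconds("May  1 00:00:00 2026 GMT")
print(days_until_expiry("May 31 00:00:00 2026 GMT", now=reference))  # 30.0
```

An alert threshold of 30 days on this value matches the lead time suggested above.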

Logs only on the local nodes

journald retention defaults are bounded by disk size and configurable retention. After a serious incident on a node that loses its boot disk, the local logs are gone. Forward to a central log store. Retention there is the post mortem source of truth, not the node itself.

Synthetic checks running on the cluster they are checking

An Uptime Kuma instance running as a VM on the cluster cannot tell you the cluster is down because it is also down. External synthetic checks (a small VM somewhere else, a SaaS monitoring service, the second cluster you operate) need to be the source of truth for "is the cluster reachable from the outside."

Alert fatigue from too many low priority notifications

Every package update, every successful backup, every routine ZFS scrub goes to the same channel as fencing events and Ceph errors. The team learns to ignore the channel. Use notification matchers to route info severity to a logging endpoint or a separate channel; route warning and error severities to the channel humans actually read. Alerts should mean something.

No runbook tied to alerts

An alert that fires at 3 AM and tells the on call engineer "Ceph HEALTH_WARN" but provides no context on what to check, what is normal, or who to escalate to is half useful. Every alert should link to a runbook entry: what the alert means, what to verify first, what known causes exist, what action to take, who to call if the action does not resolve. The runbook is the second half of the alert.

Key Takeaways

pvestatd is the existing collector. Per the External Metric Server wiki and the pvestatd source, it samples node, VM, container, and storage metrics roughly every 10 seconds, an interval hardcoded in the source and consistent with the cadence seen in pvestatd logs. That data populates the web UI graphs and is what gets shipped to external metric servers.
Built in Graphite, InfluxDB, and OpenTelemetry. Graphite and InfluxDB per the External Metric Server wiki; OpenTelemetry plugin per the PVE 9.0 Roadmap. Configure under Datacenter, Metric Server or in /etc/pve/status.cfg. Default Graphite port 2003 over UDP; InfluxDB v1 over UDP on 8089 or v2 via HTTP API with bucket and token.
Cluster status, Ceph detail, and SMART are gaps. Current built in metric server output does not cover every operational signal. For quorum, Ceph detail, and SMART or wear data, use the Proxmox API, prometheus-pve-exporter, Ceph Prometheus metrics, node exporter, smartctl, or Telegraf depending on the signal.
prometheus-pve-exporter for Prometheus. Community maintained at github.com/prometheus-pve/prometheus-pve-exporter. Use a PVEAuditor user with an API token. Multi target Prometheus scrape configuration covers multiple PVE nodes via a single exporter. Disable the config collector on big deployments.
Ceph CLI plus Ceph manager Prometheus module. Per the Ceph documentation, ceph -s, ceph health detail, ceph osd df tree, ceph -w for live triage. Enable the manager Prometheus module on port 9283 and scrape all manager nodes; verify which Ceph metrics your build exposes before depending on per OSD latency or per image I/O.
Notification system landed in PVE 8.1. Per the Administration Guide. Sendmail, SMTP, Gotify, Webhook (Webhook added in 8.3) targets. Matchers route by severity, type, time, and metadata regex. Configuration in /etc/pve/notifications.cfg and /etc/pve/priv/notifications.cfg.
Webhook templating uses Handlebars. Per the docs. The escape helper handles JSON escaping for special characters; without it multi line notifications break the JSON payload.
journalctl commands matter operationally. -u <service> for per service logs, -b -1 for previous boot, -p err for priority filter, -k for kernel. Forward to a central log store for post mortems.
The reference stack. Prometheus or InfluxDB for metrics; Grafana for visualization; Alertmanager or Grafana alerting; PVE Notifications with SMTP and Webhook; central log store; external synthetic checks running off the cluster they monitor.
Capacity alerting at 70% and 80% used. ZFS pools, Ceph pools, PBS datastores. The 70% warning gives lead time for hardware orders; 80% critical means the change window is now.