Proxmox VE 8: Complete Three-Node Cluster Setup End to End
Standalone Infrastructure | Component: Proxmox VE 8.x, PBS | Audience: Enterprise Architects, Senior Sysadmins
Proxmox VE is one of the more approachable hypervisors to get running, and one of the easier ones to get running badly. The installer takes 10 minutes. Building a three node production cluster with Ceph, proper network separation, and high availability configured correctly takes significantly longer and requires decisions that are hard to reverse after the fact. Cluster name, storage pool design, network topology, the subscription repository versus no-subscription: these choices lock in early and affect everything downstream.
This article is end to end on a three node Proxmox VE 8 cluster: hardware and pre-install decisions, installation and initial configuration, cluster formation with Corosync, network design with bridges, bonds, and VLANs, storage architecture with a decision guide between ZFS, Ceph, and external storage, Ceph setup from scratch, VM and LXC container creation with templates and cloud-init, HA manager configuration, and Proxmox Backup Server integration. The baseline assumption is three identical nodes capable of running Ceph.
1. Hardware Requirements and Pre-Install Decisions
Minimum Specifications
The official Proxmox VE minimum requirements from the Proxmox documentation are a 64-bit Intel or AMD CPU with hardware virtualization support (Intel VT or AMD-V), 2 GB RAM for the OS and PVE services plus additional RAM for guests, and at least 8 GB disk for the OS. Those are lab minimums. For production with Ceph, the real floor is higher.
| Component | Lab Minimum | Production Minimum (with Ceph) | Notes |
|---|---|---|---|
| CPU | 4 cores with VT/AMD-V | 10+ cores per node | Reserve one core per Ceph service (MON, MGR, each OSD). Plan for 8+ cores dedicated to Ceph on a node running 6 OSDs. |
| RAM | 8 GB | 64 GB per node | Proxmox: 2 GB. Each OSD: 8 GB recommended per Ceph docs. ZFS ARC: ~1 GB per TB of ZFS storage. Guests: size accordingly. |
| OS disk | 32 GB SSD | Two SSDs in ZFS mirror | Never put the OS on a single disk in production. ZFS mirror adds no CPU overhead and saves you from an OS disk failure taking down a node. |
| Ceph OSD disks | One per node | Three to six per node, same model and size | One OSD per physical disk. No RAID controller between Ceph and the disks. Use SSDs with Power Loss Protection for OSD disks. |
| Network | 1 GbE single NIC | Two 10 GbE NICs per node minimum | Ceph cluster network should be dedicated. Corosync must be on a separate, low latency network. 25 GbE is preferred for NVMe based Ceph. |
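To make the RAM row concrete, here is a rough per-node budget for a hypothetical node running six OSDs, 2 TB of local ZFS, and 40 GB of guest allocations (illustrative numbers, not a recommendation):
# Rough per-node RAM budget (illustrative)
#   PVE base services:                2 GB
#   Ceph OSDs:        6 x 8 GB    =  48 GB
#   ZFS ARC:          2 TB x 1 GB =   2 GB
#   Guest allocations:               40 GB
#   Total:                           92 GB  -> a 128 GB node leaves comfortable headroom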
The RAID Controller Warning
Both ZFS and Ceph require direct access to raw disks. Neither works correctly behind a hardware RAID controller. RAID controllers cache writes, hide disk errors, and manage disks in ways that interfere with ZFS's checksumming and Ceph's OSD management. If your server has a hardware RAID controller, either flash it to HBA mode (IT mode on LSI controllers) or use a pass through HBA instead. This is documented explicitly in both the Proxmox and Ceph official documentation.
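A quick sanity check that the operating system sees raw disks rather than a RAID virtual volume, sketched with standard tools (device names will differ on your hardware):
# Disks should appear individually with their real model and serial numbers,
# not as a single virtual device (names like "PERC", "MegaRAID" or "Virtual Disk" are a red flag)
lsblk -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE
# SMART data should be readable directly; behind a RAID controller this typically
# fails or needs a vendor-specific passthrough flag
smartctl -i /dev/sda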
Subscription Repository vs No-Subscription
Proxmox VE is free and open source. The enterprise repository requires a paid subscription. The no-subscription repository is free and available to everyone. For production environments, an enterprise subscription is worth the cost: you get the tested stable package feed, a higher quality update cadence, and access to Proxmox support. For labs and test environments, the no-subscription repository is fine. You'll want to configure one or the other immediately after installation, because the default installer points at the enterprise repository and will generate errors on every apt update if you don't have a subscription key.
# Disable the enterprise repo (remove subscription nag)
echo "# disabled - no subscription" > /etc/apt/sources.list.d/pve-enterprise.list
# Disable Ceph enterprise repo if you'll use Ceph without subscription
echo "# disabled - no subscription" > /etc/apt/sources.list.d/ceph.list
# Add the no-subscription PVE repository
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-no-sub.list
# Update and upgrade
apt update && apt dist-upgrade -y
2. Installation and Initial Host Configuration
Install Proxmox VE from the ISO on each node independently before forming the cluster. Don't form the cluster first and then configure the OS. The installer handles disk formatting (including ZFS mirror for the OS if you select it), sets the hostname, configures the management IP, and sets the root password.
A few installer decisions that matter:
- Hostname: Set a proper FQDN during installation (pve1.yourdomain.local, not just pve1). Changing the hostname after cluster formation is painful. Pick names you'll live with; a quick verification sketch follows this list.
- OS disk filesystem: Choose ZFS (RAID1) during installation if you have two OS disks. The installer will mirror them automatically. Choose ext4 or XFS if you only have one OS disk and plan to use the other for Ceph OSDs.
- Management IP: Set a static IP during installation. Using DHCP for the management interface causes cluster communication failures when a lease renews to a different address.
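Once the installer finishes, a quick check that the hostname and management IP decisions above landed correctly (a sketch; substitute your own names and addresses):
# FQDN and short name should both be set
hostname --fqdn        # expect pve1.yourdomain.local
hostname               # expect pve1
# The hostname must resolve to the static management IP, not to 127.0.1.1
getent hosts pve1.yourdomain.local
grep pve1 /etc/hosts
# The management address should be static and live on vmbr0
ip -4 addr show vmbr0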
Post-Install Configuration on Each Node
# Update /etc/hosts on every node with all cluster member entries
# This is required for Corosync and cluster communication
# Do this before forming the cluster
cat >> /etc/hosts << 'EOF'
10.0.100.11 pve1.yourdomain.local pve1
10.0.100.12 pve2.yourdomain.local pve2
10.0.100.13 pve3.yourdomain.local pve3
EOF
# Verify hostnames resolve correctly
ping -c 2 pve2
ping -c 2 pve3
# Configure NTP (chrony is default on PVE 8 / Debian 12 Bookworm)
# Corosync is sensitive to time drift - nodes must be synchronized
systemctl status chrony
chronyc tracking
# If chrony isn't running or not synced:
apt install chrony -y
systemctl enable --now chrony
# Disable the enterprise subscription nag in the web UI
# (optional, only affects the browser UI warning)
sed -i.bak "s/if (res === null || res === undefined || !res || res/if (false || res/g" \
    /usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js
systemctl restart pveproxy
3. Cluster Networking: Bridges, Bonds, VLANs, and OVS
Proxmox VE uses Linux native networking by default. Every interface configuration is stored in /etc/network/interfaces and applied with ifreload -a (from the ifupdown2 package, which PVE installs by default). You don't need a reboot to apply network changes, but you do need to get the configuration right before forming the cluster, because changing the Corosync network after cluster formation requires touching the corosync.conf on a running cluster.
Traffic Types and Network Design
| Traffic Type | Interface | Notes |
|---|---|---|
| Management / Web UI | vmbr0 (bridged to bond0 or single NIC) | This is where the PVE web interface listens on port 8006. Keep on management VLAN. |
| Corosync cluster | Dedicated NIC or VLAN, NOT bridged | Corosync is latency sensitive. Dedicated physical NIC is strongly preferred. Must be under 5ms latency between nodes. |
| Ceph public network | Dedicated NIC or bond | VM-to-Ceph client traffic. 10 GbE minimum. Separate from Ceph cluster network. |
| Ceph cluster network | Dedicated NIC or bond | OSD-to-OSD replication and heartbeat traffic. Keep isolated from all other traffic. 10 GbE minimum, 25 GbE for NVMe OSDs. |
| VM / guest | vmbr1 (trunk or access ports) | VM traffic. Bridge with VLAN awareness enabled for multi-tenant environments. |
| Live Migration | Shares management or Ceph public depending on config | PVE uses the management network for migrations by default. Can be redirected to a dedicated interface via Datacenter options. |
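The migration row above notes that traffic can be moved off the management network. That is a one-line change in /etc/pve/datacenter.cfg; a sketch, assuming you want migrations on the Ceph public network defined below:
# /etc/pve/datacenter.cfg
# Send live migration traffic over 10.0.110.0/24 and keep it encrypted
migration: secure,network=10.0.110.0/24
With the traffic plan settled, the listing below is an example /etc/network/interfaces for the first node (pve1); the other nodes use the same layout with their own addresses.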
auto lo
iface lo inet loopback
# Physical NICs used by bond0 - do not assign IPs directly
# (eno3 and eno4 are configured with static addresses below for Corosync and Ceph)
auto eno1
iface eno1 inet manual
auto eno2
iface eno2 inet manual
# Bond for management + VM traffic (LACP)
auto bond0
iface bond0 inet manual
bond-slaves eno1 eno2
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
bond-lacp-rate fast
# Management bridge (vmbr0) - hosts PVE web UI and management IP
auto vmbr0
iface vmbr0 inet static
address 10.0.100.11/24
gateway 10.0.100.1
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
dns-nameservers 10.0.100.1
# VM traffic bridge (guest traffic carried on VLAN 400 of the bond)
auto vmbr1
iface vmbr1 inet manual
bridge-ports bond0.400
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
# Corosync cluster network - dedicated NIC, no bridge
auto eno3
iface eno3 inet static
address 10.0.200.11/24
# Ceph public network
auto eno4
iface eno4 inet static
address 10.0.110.11/24
# Ceph cluster network (if you have a 5th NIC; otherwise share with Ceph public)
# auto eno5
# iface eno5 inet static
# address 10.0.120.11/24
Linux Bridge vs OVS
Proxmox supports both Linux native bridges and Open vSwitch (OVS). Linux bridges are the default and the right choice for most deployments. They're simpler, more stable, and don't require additional packages. OVS adds support for OpenFlow, software defined networking features, and more complex VLAN trunk configurations. Use OVS only if you have a specific need for it, such as integration with an SDN controller or VXLAN tunneling between sites. For a standard three node cluster, Linux bridges with VLAN awareness enabled on the bridge cover every production networking scenario.
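As an illustration of why a VLAN-aware Linux bridge covers most needs, putting a guest on a specific VLAN is just a tag on its virtual NIC (the VM ID and VLAN number here are examples):
# Attach VM 101's first NIC to the VLAN-aware bridge and tag its traffic with VLAN 400
qm set 101 --net0 virtio,bridge=vmbr0,tag=400
# Containers take the same tag parameter on their network definition
pct set 200 --net0 name=eth0,bridge=vmbr0,ip=dhcp,tag=400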
Applying Network Changes Without a Reboot
# Validate the config before applying (will report syntax errors)
ifup --no-act -a
# Apply changes without reboot (ifupdown2 - installed by default on PVE 8)
ifreload -a
# Verify interfaces are up and have correct addresses
ip addr show
ip route show
4. Cluster Formation with Corosync
The cluster is created on one node; every other node then joins it. You can't merge two existing clusters, and a node joining an existing cluster must not hold any guests: VM IDs have to be unique across the cluster, and pvecm refuses the join if the new node already has VMs or containers on it. Migrate or remove any guests from a node before joining it.
# On pve1 only: create the cluster
# --link0 specifies the Corosync network IP on this node
pvecm create prod-cluster --link0 10.0.200.11
# Verify cluster created successfully
pvecm status
# On pve2: join the cluster
# IP is pve1's Corosync IP (link0 address), not the management IP
# --link0 is this node's own Corosync IP
pvecm add 10.0.200.11 --link0 10.0.200.12
# You'll be prompted for pve1's root password
# On pve3: join the cluster
pvecm add 10.0.200.11 --link0 10.0.200.13
# Verify all nodes are present and quorate
pvecm status
pvecm nodes
Corosync Redundant Links
Corosync supports two redundant network links (link0 and link1) for cluster communication. When both links are configured, Corosync uses knet transport which handles failover between links automatically. Configure both links during cluster creation, not after. Adding a second link to an existing cluster requires editing corosync.conf on a live cluster, which is possible but carries risk.
# Create cluster with two Corosync network links
pvecm create prod-cluster --link0 10.0.200.11 --link1 10.0.201.11
# Join with two links
pvecm add 10.0.200.11 --link0 10.0.200.12 --link1 10.0.201.12
# Verify both links are active
pvecm status | grep -A 20 "Membership"
5. Storage Architecture Decision Guide
| Storage Type | Best For | Requirements | Key Limitations |
|---|---|---|---|
| Local ZFS | OS boot disks, fast local scratch, single node VM storage that doesn't need live migration | Direct attached SSDs, no hardware RAID. 1 GB RAM per TB of ZFS storage. | Not shared: VMs on local ZFS can't live migrate. Offline migration only. |
| Ceph RBD | Shared VM disks with live migration, hyperconverged HA storage | 3+ nodes, dedicated OSDs, 10 GbE Ceph network, 8 GB RAM per OSD | Minimum 3 nodes for replication. Performance degrades during node failure and rebalancing. |
| NFS | Shared storage from existing NAS, ISO storage, template storage | NFS server on dedicated storage, 1+ GbE network | Single point of failure unless NFS server is itself HA. Latency sensitive for database workloads. |
| iSCSI | Connecting to existing SAN, shared storage without Ceph | iSCSI initiator (open-iscsi), dedicated storage network recommended | Requires separate SAN management. LVM clustering required for shared disk access. |
| ZFS over iSCSI | Sharing ZFS volumes from a dedicated storage node | Storage node with ZFS pool, target portal configured | Additional complexity. The storage node becomes a single point of failure unless replicated. |
For a three node cluster where you want live migration and HA, Ceph is the right choice. It's the only storage type in the table that's truly shared, distributed, and survives a node failure without manual intervention. NFS and iSCSI work but introduce a storage server as a dependency that requires its own HA strategy.
6. Ceph Setup End to End
Ceph must be installed on every node that will participate in the cluster. On a three node cluster, you'll run a Ceph Monitor (MON), Ceph Manager (MGR), and Ceph OSD daemons on each node. One CPU core per Ceph service is the minimum per the official documentation. Plan for at least 8 dedicated CPU cores per node if you're running a MON, a MGR, and 6 OSDs.
The 3-Node Ceph Replication Reality
With a three node cluster and default size=3, min_size=2 pool settings, one complete node failure leaves Ceph degraded but still serving I/O. VMs continue running. However, Ceph can't fully recover to a healthy state until the failed node is restored, because there's no fourth node to receive the missing replica. This is the honest trade off of three node Ceph: it handles failure gracefully but doesn't self-heal. For environments where self-healing redundancy is required, five nodes is the recommended production minimum.
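Once the pool exists (it's created later in this section), you can confirm these replication settings directly and watch the degraded-but-serving state during an outage:
# Confirm replication settings on the pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size
# During a node failure, expect HEALTH_WARN with undersized/degraded PGs while I/O continues
ceph -s
ceph health detail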
# Run on each node
# Choose the Ceph version matching the current Proxmox release
# PVE 8.x ships with Ceph Quincy (17) or Reef (18) depending on minor version
# Check the Proxmox release notes for the correct version for your PVE build
# Install Ceph packages (run on each node)
pveceph install --repository no-subscription
# Initialize Ceph on the first node (pve1)
# --network sets the Ceph public network
# --cluster-network sets the OSD replication network (optional but strongly recommended)
pveceph init \
    --network 10.0.110.0/24 \
    --cluster-network 10.0.120.0/24
# Create a Monitor on each node (run this on pve1, pve2, and pve3 in turn;
# pveceph mon create always acts on the local node)
pveceph mon create
# Create a Manager on each node (again, run locally on each node)
pveceph mgr create
# Verify MONs and MGRs are running
ceph -s
Creating OSDs
# List available disks - OSD candidates must not be in use or partitioned
lsblk -o NAME,SIZE,MODEL,TYPE
# If a disk was previously used for ZFS or another OSD, wipe it first
# This destroys all data on the disk - confirm the disk name before running
wipefs -a /dev/sdb
sgdisk -Z /dev/sdb
# Create an OSD on each dedicated disk
# Bluestore is the default (and correct) OSD type since Ceph Luminous
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd
# Optionally specify a separate NVMe for the Bluestore DB/WAL
# This significantly improves performance on spinning disk OSDs
# WAL lives with DB by default when only -db_dev is specified
# Do NOT point -db_dev and -wal_dev at the same device - it causes an error
pveceph osd create /dev/sdb -db_dev /dev/nvme0n1
# After creating OSDs on all nodes, verify cluster health
ceph -s
ceph osd tree
Creating a Ceph Pool and Adding Storage to Proxmox
# Create a pool for VM disks
# size=3: three replicas (one per node)
# min_size=2: minimum replicas required to serve I/O
# pg_autoscale_mode=on: let Ceph manage placement group count automatically
# add_storages=1: automatically adds the pool to the PVE storage config
pveceph pool create vm-pool \
    --size 3 \
    --min_size 2 \
    --pg_autoscale_mode on \
    --add_storages 1
# Verify pool created and storage is available in PVE
ceph osd pool ls
pvesm status
7. VM and LXC Container Creation with Templates and Cloud-Init
Creating a VM Template with Cloud-Init
Building VMs from a cloud-init template is faster and more consistent than installing from ISO every time. The workflow is: download a cloud image, import it as a VM, attach a cloud-init drive, convert to a template, then clone from the template whenever you need a new VM.
# Download a cloud image (Debian 12 Bookworm generic cloud image)
wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2
# Create a VM shell with ID 9000 (use a high ID for templates to avoid conflicts)
qm create 9000 \
    --name "debian-12-template" \
    --memory 2048 \
    --cores 2 \
    --net0 virtio,bridge=vmbr0 \
    --serial0 socket \
    --vga serial0 \
    --ostype l26
# Import the cloud image as the VM's primary disk into your Ceph pool
qm importdisk 9000 debian-12-genericcloud-amd64.qcow2 vm-pool
# Attach the imported disk as scsi0 (VirtIO SCSI is recommended)
qm set 9000 --scsihw virtio-scsi-pci --scsi0 vm-pool:vm-9000-disk-0
# Add a cloud-init drive
qm set 9000 --ide2 vm-pool:cloudinit
# Set boot order to the primary disk
qm set 9000 --boot c --bootdisk scsi0
# Configure cloud-init defaults (these apply to all clones)
qm set 9000 \
    --ciuser admin \
    --sshkeys ~/.ssh/id_rsa.pub \
    --ipconfig0 ip=dhcp
# Convert to template (this is irreversible)
qm template 9000
# Clone the template (full clone stores independent copy in Ceph pool)
qm clone 9000 101 \
    --name "web-server-01" \
    --full \
    --storage vm-pool
# Customize this instance's cloud-init settings
qm set 101 \
    --ipconfig0 ip=10.0.100.101/24,gw=10.0.100.1 \
    --nameserver 10.0.100.1 \
    --searchdomain yourdomain.local \
    --memory 4096 \
    --cores 4
# Resize the disk if needed
qm resize 101 scsi0 +20G
# Start the VM
qm start 101
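To see exactly what cloud-init will feed the clone on first boot, you can dump the generated configuration before starting it:
# Show the generated cloud-init user and network data for VM 101
qm cloudinit dump 101 user
qm cloudinit dump 101 network
# Review the cloud-init related settings on the VM
qm config 101 | grep -E 'ciuser|ipconfig|sshkeys|nameserver'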
LXC Containers
LXC containers in Proxmox share the host kernel. They start in seconds rather than minutes, use far less RAM than a full VM, and are the right choice for services that don't need kernel level isolation: web servers, DNS, monitoring agents, databases that run on Linux. The trade off is that LXC containers run the host's kernel version and can't run a different kernel or kernel modules not loaded on the host.
# Download an LXC template from Proxmox's template repository
# List available templates
pveam available | grep debian
# Download the template to local storage
pveam download local debian-12-standard_12.7-1_amd64.tar.zst
# Create a container
# --unprivileged 1 is strongly recommended for security
# --password requires a value: set a real root password here
pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
    --hostname dns-server \
    --storage vm-pool \
    --rootfs vm-pool:8 \
    --memory 512 \
    --swap 512 \
    --cores 1 \
    --net0 name=eth0,bridge=vmbr0,ip=10.0.100.200/24,gw=10.0.100.1 \
    --nameserver 10.0.100.1 \
    --password 'change-this-root-password' \
    --unprivileged 1
# Start the container
pct start 200
# Access container console
pct console 200
8. High Availability Manager Configuration
Proxmox HA requires at least three nodes with quorum. With two nodes, HA can't safely determine which node failed versus which node lost network connectivity, so it refuses to act. Three nodes provide the majority vote needed to make that determination safely.
HA in Proxmox works through two mechanisms: the HA Manager daemon (pve-ha-lrm on each node, pve-ha-crm cluster wide) and fencing. Fencing is how HA ensures a potentially failed node actually stops running VMs before restarting them elsewhere. Without reliable fencing, you risk two instances of the same VM running simultaneously on different nodes with the same IP and disk, which corrupts data. Don't enable HA without confirming your fencing mechanism works.
Fencing Options
- IPMI/iDRAC/iLO watchdog: The correct production approach. Configure the hardware management interface on each node. If a node loses quorum and doesn't fence itself within the watchdog timeout, the surviving nodes trigger a remote power off via IPMI. Reliable, hardware enforced.
- Self-fencing via hardware watchdog: Proxmox's HA manager activates a hardware watchdog on each node. If the HA daemon stops feeding the watchdog (because the node lost quorum), the watchdog triggers a hardware reset. Works without IPMI but requires a hardware watchdog device, which most server class hardware provides. A quick check of the watchdog plumbing is sketched after this list.
- No fencing (not recommended): Proxmox will warn you and limit HA functionality. In a three node cluster where one node fails uncleanly, HA won't restart VMs without confirmed fencing. The safest behavior, but VMs stay down until you manually confirm the failed node is off.
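A minimal check that the fencing plumbing is in place, assuming the hardware watchdog route (module names and defaults vary by hardware):
# The HA stack feeds the watchdog through watchdog-mux
systemctl status watchdog-mux
# Which watchdog module is configured (softdog is the fallback default)
cat /etc/default/pve-ha-manager    # e.g. WATCHDOG_MODULE=ipmi_watchdog
# Confirm the module is loaded and a watchdog device exists
lsmod | grep -E 'ipmi_watchdog|iTCO_wdt|softdog'
ls -l /dev/watchdog*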
# Create an HA group defining which nodes can run HA resources
# Nodes are listed with optional priority (higher = preferred)
ha-manager groupadd production \
    --nodes "pve1:3,pve2:2,pve3:1" \
    --restricted 0 \
    --nofailback 0
# Add a VM to HA management
ha-manager add vm:101 \
    --group production \
    --state started \
    --max_restart 3 \
    --max_relocate 3
# Add an LXC container to HA management
ha-manager add ct:200 \
    --group production \
    --state started
# Check HA status
ha-manager status
ha-manager status --verbose
Note that newer Proxmox releases (PVE 9 and later) introduce HA rules as the successor to HA groups. The node-affinity rule equivalent to the group above is:
ha-manager rules add node-affinity prod-rule --resources vm:101,ct:200 --nodes "pve1:3,pve2:2,pve3:1"
The groupadd approach remains functional for backward compatibility but won't receive new features.
9. Proxmox Backup Server Integration
Proxmox Backup Server (PBS) is a separate product, also free and open source, designed specifically for backing up Proxmox VMs, containers, and host configurations. It performs incremental backups (only data changed since the last backup is transferred), supports client-side encryption, and stores everything in a deduplicating chunk store that dramatically reduces storage consumption compared to repeated full VM backups. It integrates natively with Proxmox VE: adding PBS as a storage target is the only configuration required.
PBS can run on a dedicated physical server, a VM on a separate Proxmox node, or a VM on the same cluster you're backing up (not ideal but workable for smaller environments). Don't run PBS as a VM on the same Ceph pool you're backing up: a Ceph failure takes down both the VMs and the backups simultaneously.
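The storage definition below assumes a datastore named vm-backups and a dedicated backup user already exist on the PBS side. A sketch of that preparation, run on the PBS host (the path, user name, and role here are example choices):
# Create the datastore (PBS initializes the chunk store under the given path)
proxmox-backup-manager datastore create vm-backups /mnt/datastore/vm-backups
# Create a dedicated user for PVE to authenticate with
proxmox-backup-manager user create backup@pbs --password 'your-pbs-password'
# Grant that user backup rights on this datastore only
proxmox-backup-manager acl update /datastore/vm-backups DatastoreBackup --auth-id backup@pbs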
Adding PBS to Proxmox VE
# On the PBS server: get the PBS server fingerprint for authentication
proxmox-backup-manager cert info | grep Fingerprint
# In PVE (via CLI or web UI):
# Add PBS as a storage target
pvesm add pbs pbs-store \
    --server pbs.yourdomain.local \
    --datastore vm-backups \
    --username backup@pbs \
    --password "your-pbs-password" \
    --fingerprint "AB:CD:...:EF"   # paste fingerprint from PBS cert info
# Verify the storage is accessible
pvesm status --storage pbs-store
Creating Backup Jobs
# Create a backup job for all VMs in the cluster
# Schedule: daily at 02:00, keep 7 daily, 4 weekly, 3 monthly backups
pvesh create /cluster/backup \
--id daily-backup \
--storage pbs-store \
--schedule "02:00" \
--all 1 \
--mode snapshot \
--compress zstd \
--notes-template "{{guestname}} on {{node}}" \
--prune-backups "keep-daily=7,keep-weekly=4,keep-monthly=3"
# List configured backup jobs
pvesh get /cluster/backup
# Run a backup job immediately (useful for testing)
vzdump --storage pbs-store --all 1 --mode snapshot --compress zstd
PBS Garbage Collection and Verification
PBS uses a chunk based deduplication store. Deleting a backup doesn't free space immediately because chunks may be shared across multiple backups. Run garbage collection on the PBS datastore periodically to reclaim space from deleted backups. By default, PBS schedules GC automatically, but in environments with aggressive retention policies you may want to run it more frequently. PBS also supports scheduled verification jobs that read back every chunk of every backup and confirm its integrity. Run verification weekly at minimum. A backup that can't be verified is a backup you can't restore from.
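Both operations can also be triggered manually from the PBS shell, which is useful after deleting a large batch of old backups or before relying on a restore (datastore name taken from the earlier example):
# Reclaim space from deleted backups - chunks are only freed by garbage collection
proxmox-backup-manager garbage-collection start vm-backups
proxmox-backup-manager garbage-collection status vm-backups
# Verify every backup in the datastore (reads back and checks all chunks)
proxmox-backup-manager verify vm-backups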
Key Takeaways
- Never put ZFS or Ceph behind a hardware RAID controller. Both require raw disk access. Flash the controller to HBA (IT) mode or use a pass through HBA. This is documented explicitly in both the Proxmox and Ceph official docs.
- Configure the no-subscription or enterprise repository immediately after install. The default installer points at the enterprise repo and generates apt errors on every update without a subscription key. Pick one and don't mix them.
- Set hostnames and static IPs before forming the cluster, and populate /etc/hosts on every node with all cluster members. Corosync uses hostname resolution, and a missing or incorrect entry breaks cluster communication silently.
- Corosync requires under 5ms network latency between all nodes. Give it a dedicated physical NIC on an isolated network. Never put Corosync traffic on a network that carries large data transfers, even on a separate VLAN.
- Three node Ceph with size=3, min_size=2 survives a single node failure but can't self-heal until the failed node is restored, because there's no fourth node to receive the missing replica. For true self-healing redundancy, five nodes is the production minimum.
- One OSD per physical disk. Ceph manages disk level redundancy itself. Multiple OSDs per disk share the disk's throughput and failure domain, which defeats the purpose. Recommended: 8 GB RAM per OSD, one CPU core per OSD daemon.
- Don't enable HA without a working fencing mechanism. Without fencing, Proxmox won't restart VMs after an unclean node failure because it can't confirm the failed node has stopped running them. IPMI fencing is the production standard.
- Run PBS on a dedicated server or a VM on a separate physical host from the cluster you're backing up. A PBS instance running on the same Ceph pool it's backing up provides no protection against storage failures.