Proxmox High Availability (HA) automatically restarts virtual machines (VMs) and containers (CTs) on another node in a cluster when their original node fails. It relies on Corosync for cluster communication and quorum, and requires at least 3 nodes for reliable operation. HA is enabled per VM or container and depends on shared or replicated storage. It does not offer live failover with memory-state transfer; instead, a failed guest is restarted from disk on a healthy node, giving automated recovery with minimal downtime for critical workloads in a Proxmox cluster.
Why 3 Nodes Are Required for HA in Proxmox
1. Quorum
- Proxmox VE uses Corosync for cluster communication and quorum-based decision making.
- Quorum requires a majority of nodes to agree on the current state of the cluster.
- In a 2-node setup, losing just one node costs you quorum: the survivor cannot form a majority, so HA services stop (the majority rule exists precisely to prevent split-brain); a quick check is shown below.
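To check quorum on a running cluster, run `pvecm status` on any node. The excerpt below is illustrative only; your cluster name and vote counts will differ:

```
# Check cluster membership and quorum state (run on any node)
pvecm status

# Illustrative excerpt of the votequorum section on a healthy 3-node cluster:
#   Expected votes:   3
#   Total votes:      3
#   Quorum:           2
#   Flags:            Quorate
```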
2. Failover Logic
- Proxmox HA relies on the pve-ha-crm (Cluster Resource Manager) and pve-ha-lrm (Local Resource Manager) to detect failure and move services.
- With 3 nodes, if one fails:
- The other two still maintain quorum
- HA logic can decide to restart the failed VMs/CTs on healthy nodes, as the status sketch below shows
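You can watch this machinery with `ha-manager status`, which shows the current CRM master, the LRM state on each node, and every managed service. The output below is a rough sketch with example node names, not verbatim output:

```
# Show HA manager state: quorum, CRM master, per-node LRMs, managed services
ha-manager status

# Rough sketch of the output on a healthy 3-node cluster:
#   quorum OK
#   master pve1 (active, ...)
#   lrm pve1 (active, ...)
#   lrm pve2 (active, ...)
#   lrm pve3 (idle, ...)
#   service vm:100 (pve1, started)
```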
2-Node HA is NOT Recommended
- While a 2-node cluster is technically possible in Proxmox VE, HA on it is neither reliable nor recommended.
- You can add a QDevice (an external quorum vote) to supply the third vote, but that adds complexity and still has limitations; a setup sketch follows below.
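If you do take the QDevice route, the setup is roughly the following; the external host's address is an example, and it can be any small Linux machine outside the cluster:

```
# On the external quorum host (any small Debian box outside the cluster):
apt install corosync-qnetd

# On every cluster node:
apt install corosync-qdevice

# From one cluster node, register the QDevice (10.0.10.50 is an example IP):
pvecm qdevice setup 10.0.10.50
```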
Recommended Minimum for HA
Nodes | HA Capable | Notes |
---|---|---|
1 | No | Single node = no cluster or HA |
2 | No* | Possible with QDevice, but fragile |
3+ | Yes | Full support for HA and stable quorum logic |
Best practice: Start with 3 nodes and scale in odd numbers (e.g., 3, 5, 7) for quorum stability.
Cluster Hardware Overview
Component | Node 1 | Node 2 | Node 3 |
---|---|---|---|
CPU | Xeon / Ryzen (8+ cores) | Same | Same |
RAM | 64–128 GB ECC | Same | Same |
Boot Drive | 256–512 GB SSD (ZFS mirror recommended) | Same | Same |
VM Storage | Ceph OSD SSDs or NFS-backed drives | Same | Same |
Network NICs | 2x 1G + 2x 10G NICs | Same | Same |
Network Design
Network Role | Description | NIC Type | VLAN or Physical NIC |
---|---|---|---|
Management | Web GUI, SSH, API | 1G NIC | VLAN 10 or Physical |
Corosync | Cluster heartbeat traffic | 1G or 10G NIC | VLAN 20 or Physical |
VM/Storage LAN | Ceph, NFS, iSCSI, VM traffic | 10G NIC | VLAN 30 or Physical |
Backup LAN | Proxmox Backup Server, replication | Optional | VLAN 40 |
Best Practice: Use dedicated or VLAN-isolated networks for Corosync and Ceph/Storage traffic to avoid congestion and latency.
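As a concrete sketch, a node's `/etc/network/interfaces` might separate management and Corosync traffic like this; interface names, VLAN IDs, and addresses are assumptions to adapt to your hardware:

```
# /etc/network/interfaces (excerpt) -- all values are examples
auto vmbr0
iface vmbr0 inet static
    address 10.0.10.11/24        # management network (VLAN 10)
    gateway 10.0.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto eno2.20
iface eno2.20 inet static
    address 10.0.20.11/24        # dedicated Corosync VLAN 20
```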
Storage Design
Option 1: Ceph (Recommended for full HA)
- 3-node Ceph storage with 3 OSDs per node
- Use enterprise SSDs or NVMe (min. 2 TB per node)
- Replication: 3x (pool size = 3, min_size = 2)
- WAL/DB: with BlueStore, place the WAL/DB on separate fast SSD/NVMe partitions (dedicated journal SSDs apply only to the legacy FileStore backend); a CLI sketch follows below
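Bootstrapping such a pool from the CLI might look like this (the same steps are available in the GUI); the storage network and device paths are placeholders:

```
# On each node: install Ceph packages; initialize the cluster network once
pveceph install
pveceph init --network 10.0.30.0/24     # example storage VLAN

# Create a monitor on each of the three nodes
pveceph mon create

# Create OSDs on each node (device paths are examples)
pveceph osd create /dev/nvme0n1
pveceph osd create /dev/nvme1n1
pveceph osd create /dev/nvme2n1

# Create a 3x-replicated pool for VM disks
pveceph pool create vm-pool --size 3 --min_size 2
```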
Option 2: Shared NFS/iSCSI
- NFS or iSCSI from TrueNAS or similar high-availability NAS
- Accessible to all 3 nodes
- VMs stored on shared volume
- No local-only storage for HA VMs (see the pvesm example below)
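Attaching such a share takes a single `pvesm` call; because storage configuration is cluster-wide, running it once makes the share visible on every node. Storage ID, server address, and export path below are examples:

```
# Add an NFS share as cluster-wide storage (all values are examples)
pvesm add nfs truenas-vmstore \
    --server 10.0.30.5 \
    --export /mnt/tank/proxmox \
    --content images,rootdir
```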
Option 3: ZFS with Replication (Low-cost HA)
- Each node has local ZFS mirror
- Use ZFS replication, either manual or scheduled via Proxmox's built-in storage replication (pvesr)
- Enables semi-HA: failover works, but the restarted VM reverts to the last replicated snapshot, so writes since the last sync are lost (see the sketch below)
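Proxmox's built-in storage replication (`pvesr`) does the scheduling. The sketch below replicates VM 100 to a node named pve2 every 15 minutes; the VM ID, target node, and schedule are examples:

```
# Replicate VM 100's disks to node pve2 every 15 minutes (example values)
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# List configured replication jobs and their last run status
pvesr status
```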
Fencing and Quorum
Feature | Description |
---|---|
Quorum | Needs 2 of 3 nodes online |
Corosync Rings | 2 (ring0 and ring1 for redundancy) |
Fencing | Watchdog-based self-fencing via the Proxmox HA stack (hardware watchdog or softdog) |
No STONITH | No external STONITH devices needed; a node that loses quorum fences itself |
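In `/etc/pve/corosync.conf`, the two rings appear as two addresses per node. The excerpt below is a sketch with example node names and IPs:

```
# /etc/pve/corosync.conf (excerpt) -- example node with redundant links
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.20.11   # primary Corosync link (VLAN 20)
    ring1_addr: 10.0.21.11   # second, independent link
  }
  # ...entries for pve2 and pve3 follow the same pattern
}
```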
VM HA Configuration
- Enable HA per VM in Datacenter > HA
- Group critical VMs into HA Groups with node preferences
- Avoid overcommitting all nodes with HA VMs
- Enable the nofailback option on the HA group if you don’t want VMs to migrate back once the failed node recovers (see the CLI sketch below)
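The same setup from the CLI might look like this; the group name, node priorities, and VM ID are examples:

```
# Create an HA group that prefers pve1, with pve2 as fallback (example values)
ha-manager groupadd critical --nodes "pve1:2,pve2:1" --nofailback 1

# Put VM 100 under HA management in that group
ha-manager add vm:100 --group critical --state started
```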
Backup Strategy
- Deploy Proxmox Backup Server on a separate node (physical, or a VM with external storage)
- Run daily incremental backups of HA-enabled VMs
- Backups stored on a ZFS dataset or external NAS (see the storage sketch below)
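Once the Backup Server is reachable, it attaches like any other storage and can be targeted with `vzdump`; the server address, datastore name, and storage ID below are placeholders (grab the certificate fingerprint from the PBS dashboard):

```
# Attach a Proxmox Backup Server datastore cluster-wide (example values;
# a password or API token must also be supplied, e.g. via --password)
pvesm add pbs pbs-backup \
    --server 10.0.40.5 \
    --datastore main \
    --username backup@pbs \
    --fingerprint <PBS-CERT-FINGERPRINT>

# Back up VM 100 manually (scheduled jobs live under Datacenter > Backup)
vzdump 100 --storage pbs-backup --mode snapshot
```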
Maintenance & Monitoring
Task | Frequency | Tools/Notes |
---|---|---|
Corosync link test | Monthly | `corosync-cfgtool -s`, `ping`, `traceroute` |
Disk health check | Weekly | `smartctl`, Ceph dashboard |
Backup restore test | Monthly | Restore to non-production node |
Resource usage monitor | Daily | Proxmox GUI, Nagios |
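The recurring checks in the table map to a handful of commands; device paths are examples:

```
# Corosync link health (run on each node)
corosync-cfgtool -s

# SMART health for a boot or OSD disk (device path is an example)
smartctl -a /dev/nvme0n1

# Overall Ceph health, if you run Ceph
ceph -s
```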
Configuration Checklist
- Proxmox VE installed and up to date on all 3 nodes
- Cluster created using `pvecm create` and `pvecm add` (see the sketch below)
- Corosync dual-ring configured
- Shared or replicated storage accessible on all nodes
- VMs created on shared storage (not local)
- HA groups defined for critical workloads
- Proxmox Backup Server connected and tested
- Monitoring and alerts configured
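For reference, creating the cluster with redundant Corosync links might look like this; the cluster name, node addresses, and link IPs are examples:

```
# On the first node: create the cluster with two Corosync links
pvecm create prod-cluster --link0 10.0.20.11 --link1 10.0.21.11

# On each additional node: join via the first node's address
pvecm add 10.0.20.11 --link0 10.0.20.12 --link1 10.0.21.12
```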
Get in touch with Saturn ME today for a free Proxmox consulting session—no strings attached.