Understanding the Issue: ZFS Failed but VM Didn’t Migrate

You might encounter this situation in a Proxmox VE cluster using local ZFS storage on each node:

  • A ZFS pool (for example tank or rpool) fails or goes offline.

  • The node itself stays powered on and reachable.

  • But your VMs remain stuck, and Proxmox HA doesn’t migrate them to another node.

It’s a confusing scenario — especially if you’ve configured ZFS replication between nodes and expected automatic failover. So why doesn’t Proxmox move those VMs automatically?


Proxmox HA Works at Node Level, Not Storage Level

The key to understanding this lies in how Proxmox HA Manager operates.

Proxmox HA monitors node health, not the storage layer.


This means:

  • If a node goes offline, HA migrates or restarts VMs on another node.

  • If the ZFS storage pool fails but the node is still running, HA sees the node as healthy and takes no action.

In short:

Proxmox HA cannot detect or act on local ZFS storage failure.

So even if the storage under a VM becomes unreadable, HA won’t migrate it — because the host node didn’t actually “fail.”
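
A quick way to see this boundary is to ask the HA stack what it actually tracks. The command below is standard Proxmox tooling; its output covers quorum, the current master, the per-node LRM state, and the managed services, but nothing about pool health:

ha-manager status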


Local ZFS Is Not Shared Storage

Most Proxmox setups using ZFS are configured with local pools per node, e.g.:

tank/vm-100-disk-0
tank/vm-102-disk-0

Even with ZFS replication enabled, these are still separate local datasets. Replication only sends periodic, snapshot-based copies to the other node; it does not keep a live, active disk there that a VM could simply keep running from.

So when a ZFS pool on Node A fails:

  • The replicated copy on Node B exists, but

  • It’s a point-in-time replica, not a live disk that HA will attach and boot from automatically.
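
You can confirm this on the standby node: the replicated dataset and its replication snapshots are present, but the VM configuration still lives under the failed node. A quick check, assuming the pool name tank and the VM ID 102 used elsewhere in this article:

zfs list -t all -r tank | grep vm-102
ls /etc/pve/nodes/pve2/qemu-server/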


Why ZFS Replication Doesn’t Trigger Automatic Failover

Proxmox ZFS replication (pve-zsync or built-in replication jobs) copies datasets between nodes at scheduled intervals.


However, the replicated dataset:

  • Remains in snapshot form.

  • Is not automatically promoted or activated as a live volume.

Therefore, after a failure you must manually move the VM's configuration to the standby node and start it there. Until that happens, Proxmox HA cannot restart the VM from the replica on its own.


Recovery Steps After ZFS Failure

If your ZFS pool failed but replication was enabled, here is how to bring a VM back up from the latest replica on another node. Assume the ZFS storage failed on node PVE2 while PVE1 is up and has a recent successful replication. Run the following commands on PVE1 to bring up VM 102.

  1. Move VM config 102.conf to PVE1:

    mv /etc/pve/nodes/pve2/qemu-server/102.conf /etc/pve/nodes/pve1/qemu-server/102.conf


  2. Start the VM manually:

    qm start 102

Now the VM will boot from the replicated ZFS dataset on the healthy node.

You can then wipe the ZFS pool and disks on PVE2, create a new ZFS pool using the same name as before, and the existing replication configuration will automatically resume normal operation.
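
A rough sketch of that cleanup on PVE2 (the pool name tank, the mirror layout, and the device names below are assumptions; adjust them to your hardware):

zpool destroy -f tank                           # only if the failed pool is still visible to ZFS
wipefs -a /dev/sdb /dev/sdc                     # wipe old partition tables and labels on the replaced or reused disks
zpool create -f tank mirror /dev/sdb /dev/sdc   # same pool name as before

Reusing the original pool name matters: the storage definition in /etc/pve/storage.cfg and the replication jobs reference it, so nothing else needs to be reconfigured.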


How to Fix It — and Build True HA for ZFS in Proxmox VE

Here are your options to ensure automatic recovery or faster failover.


1. Use Shared Storage for True HA

The most reliable solution is to store all VM disks on shared storage accessible by all cluster nodes.

Recommended options include:

  • Ceph RBD (native to Proxmox VE)

  • NFS or iSCSI SAN

  • ZFS over iSCSI (TrueNAS, StarWind VSAN, etc.)

With shared storage, all nodes can access the same disk image. If one node fails, HA instantly restarts the VM on another host — no replication or promotion required.
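
As an example, an NFS export can be attached as cluster-wide VM storage with a single command; the storage ID, server address, and export path below are placeholders:

pvesm add nfs shared-vmstore --server 192.168.1.50 --export /export/vmstore --content images

Because the storage is defined at the cluster level and reachable from every node, HA can restart a VM on any surviving host without copying disk data first.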


2. Automate Failover with ZFS Replication

If you prefer using local ZFS storage per node, you can still achieve partial HA with automation.

  1. Set up Proxmox replication jobs (every 5–15 minutes; see the pvesr example after this list).

  2. Use a failover script (like ha-replication-manager) that:

    • Detects ZFS pool failure.

    • Promotes the replicated dataset on the standby node.

    • Registers the VM config.

    • Starts the VM automatically.
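
For the replication jobs in step 1, the built-in pvesr tool can create them from the command line. The job ID (VM ID plus a job number), target node, and schedule below are examples:

pvesr create-local-job 102-0 pve1 --schedule "*/15" --comment "failover copy of VM 102"
pvesr status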

This approach provides storage-aware failover — a practical compromise between full Ceph and manual recovery.
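
For the failover script in step 2, here is a minimal sketch of the idea, assuming VM 102 replicated from pve2 to pve1, a pool named tank, and that it runs on the standby node pve1 while the cluster is still quorate:

#!/bin/bash
# Hypothetical failover sketch, not a production tool: the VM ID, node names,
# and pool name are assumptions; add locking, logging, and fencing before real use.
VMID=102
FAILED_NODE=pve2
LOCAL_NODE=pve1
POOL=tank

# Treat the storage as failed if the node is unreachable or zpool reports a problem
if ! ssh -o ConnectTimeout=5 "root@${FAILED_NODE}" "zpool status -x ${POOL}" 2>/dev/null \
      | grep -q "is healthy"; then
    # /etc/pve is the cluster-wide config filesystem, so the standby node can
    # claim the VM by moving its config file into its own node directory
    mv "/etc/pve/nodes/${FAILED_NODE}/qemu-server/${VMID}.conf" \
       "/etc/pve/nodes/${LOCAL_NODE}/qemu-server/${VMID}.conf"
    # Boot the VM from the locally replicated ZFS dataset
    qm start "${VMID}"
fi

Before trusting something like this, make sure the VM cannot still be running on the failed node; starting it twice against diverging replicas is worse than a short outage.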


3. Enable ZFS Health Monitoring & Alerts

Proxmox integrates ZFS monitoring tools that can detect pool degradation early.

Useful commands:

zpool status
zpool events -v

To automate notifications, enable ZFS ZED (ZFS Event Daemon):

apt install zfs-zed
systemctl enable zfs-zed --now
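
ZED only emails you if a destination address is configured and the node can deliver mail (Proxmox VE ships with postfix). The address below is a placeholder; the settings live in ZED's configuration file:

# /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="admin@example.com"   # placeholder; set your real address
ZED_NOTIFY_INTERVAL_SECS=3600        # rate-limit repeated notifications
ZED_NOTIFY_VERBOSE=1                 # also report non-fatal events such as finished scrubs

systemctl restart zfs-zed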

You’ll receive email alerts for:

  • Disk failures

  • Pool degradation

  • Resilvering or corruption events

This gives you time to replace disks before the pool collapses.


Summary: Why Proxmox Didn’t Migrate and How to Prevent It

Cause                   Why Migration Didn’t Happen     Solution
Local ZFS pool failed   HA only detects node failure    Use shared storage or automation
ZFS replication used    Replicated data not live        Promote snapshot and start manually
Node still reachable    HA assumes node healthy         Add storage-level monitoring
No alerts configured    Missed early warnings           Enable ZED, SMART, and email alerts


Recommended High Availability Setup for ZFS in Proxmox VE 9

Component   Recommended Setup
Cluster     2 or 3 nodes with QDevice
Storage     ZFS per node + replication
Backup      Proxmox Backup Server (PBS)
Failover    Custom replication promotion script
Alerts      ZFS ZED + SMART monitoring


Final Thoughts

ZFS provides rock-solid storage reliability in Proxmox VE 9, but automatic HA migration requires shared storage or additional automation.


If your cluster uses local ZFS pools, HA won’t detect storage failure by default — it only reacts to node-level outages.

To achieve true resilience:

  • Use Ceph or shared ZFS over iSCSI for seamless migration.

  • Or enhance ZFS replication with automatic failover scripts.

  • And always keep Proxmox Backup Server running for last-resort recovery.

With these adjustments, your Proxmox ZFS cluster can survive hardware failures, storage faults, and even full node loss — without manual intervention.