As data demands continue to grow, many organizations are turning to Proxmox VE (Virtual Environment) with Ceph Storage to build robust, high-performance Hyperconverged Infrastructure (HCI) solutions. The combination offers a flexible and open-source alternative to VMware vSAN or Nutanix, providing scalable compute and storage from the same cluster.
However, even the best-engineered Proxmox + Ceph setups can encounter performance or reliability challenges. Understanding how to troubleshoot Ceph clusters within Proxmox is crucial to maintaining uptime, data safety, and system performance.
In this guide, we’ll dive deep into a structured, step-by-step troubleshooting process tailored for Proxmox HCI deployments—covering everything from OSD issues to network bottlenecks.
Step 1: Checking Cluster Health in Proxmox VE
In a Proxmox-Ceph environment, both the Proxmox dashboard and the Ceph CLI provide detailed insights into cluster health.
From the Proxmox Web GUI:
Navigate to:
Datacenter → Ceph → Status
You’ll see a summary showing:
- Cluster health (OK / Warning / Error)
- Number of MONs, OSDs, and MGRs
- PGs and recovery status
From the command line:
ceph -s
or
ceph health detail
Typical health states:
- HEALTH_OK – All systems operational.
- HEALTH_WARN – Non-critical issues (e.g., recovering PGs, low space).
- HEALTH_ERR – Critical problems threatening redundancy or data safety.
When HEALTH_WARN or HEALTH_ERR appears, the message will often name the exact subsystem at fault: OSDs, MONs, PGs, or network communication.
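While you troubleshoot, it helps to keep the summary refreshing in a shell session. A minimal sketch (the jq filter is optional and assumes jq is installed on the node):

```
# Refresh the cluster summary every 5 seconds
watch -n 5 ceph -s

# Or stream health and recovery events live
ceph -w

# Extract just the health status field (assumes jq is installed)
ceph status --format json | jq -r '.health.status'
```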
Step 2: Investigating Placement Groups (PGs)
In Proxmox Ceph, Placement Groups (PGs) determine how data is distributed and replicated across OSDs.
If PGs are degraded, undersized, or stuck, performance and redundancy can suffer.
Check PG stats:
ceph pg stat
Common PG issues in Proxmox:
| PG State | Meaning | Typical Fix |
|---|---|---|
| degraded | One or more replicas missing | Restart affected OSDs or check node connectivity |
| undersized | Not enough OSDs to maintain replication | Verify OSDs are "in" and "up" |
| stuck inactive | PGs not recovering | Check monitor quorum and network health |
If PGs are stuck, verify that all nodes in the Proxmox cluster have proper Ceph communication on the cluster network.
Restart any affected OSDs:
systemctl restart ceph-osd@<id>
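Before restarting anything, it is worth listing exactly which PGs are affected. A short sketch using standard pg subcommands (the PG ID 2.1a is only an example):

```
# List PGs currently degraded or undersized
ceph pg ls degraded
ceph pg ls undersized

# Show PGs that have been stuck in the inactive state
ceph pg dump_stuck inactive

# Query one PG for detailed peering and recovery information (example ID)
ceph pg 2.1a query
```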
Step 3: Diagnosing OSD Issues
Each OSD (Object Storage Daemon) represents one Ceph disk. In Proxmox HCI, OSDs are distributed across cluster nodes for fault tolerance.
View the OSD tree:
ceph osd tree
Look for:
- down OSDs: the service is stopped or unreachable.
- out OSDs: Ceph has automatically excluded the OSD from data placement.
- Uneven data distribution: some OSDs overloaded while others sit idle.
Common OSD fixes:
- Restart a failed OSD:
systemctl restart ceph-osd@<id>
- Re-add an OSD that was marked out:
ceph osd in <id>
- If an OSD repeatedly fails, inspect logs:
journalctl -u ceph-osd@<id> --since today
Look for disk I/O errors or hardware failures.
In Proxmox, faulty disks can also be identified from the Node → Disks → Ceph OSDs tab, which shows SMART status and usage metrics.
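To spot imbalance or a failing drive from the shell, a quick sketch (the device path is an example; smartctl comes from the smartmontools package):

```
# Per-OSD utilization, weights, and PG counts laid over the CRUSH tree
ceph osd df tree

# Recent daemon crashes recorded by the crash module
ceph crash ls

# SMART health of a suspect disk (example device path)
smartctl -a /dev/sdb
```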
Step 4: Network Health and Latency Checks
Ceph performance heavily depends on network latency and bandwidth.
In Proxmox HCI, you typically use two network interfaces per node:
- Public (Proxmox management + client I/O)
- Cluster (Ceph replication and heartbeat traffic)
Verify network communication:
ping <node-ip>
ceph ping mon.<id>
Tips for optimal Ceph networking:
- Use 10 GbE or faster interfaces for the Ceph cluster network.
- Ensure MTU (e.g., jumbo frames 9000) is consistent across switches and NICs.
- Separate Ceph traffic from VM migration or backup traffic.
If you notice slow ops or degraded performance in Ceph, network packet loss or high latency is often the culprit.
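A practical check is to confirm that jumbo frames actually pass between nodes end to end, since a mismatched MTU often shows up as stuck peering or slow ops. A sketch assuming a 9000-byte MTU, with the interface name and addresses as placeholders:

```
# Confirm the configured MTU on the Ceph cluster interface (example name)
ip link show ens19

# Send a full-size frame without fragmentation (8972 = 9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 5 <peer-cluster-ip>

# Rough bandwidth test between two nodes (iperf3 must be installed on both)
iperf3 -s                  # on the receiving node
iperf3 -c <receiving-node> # on the sending node
```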
Step 5: Checking Monitor (MON) and Manager (MGR) Daemons
Ceph monitors (MONs) maintain cluster maps and manage quorum; managers (MGRs) provide additional monitoring and management services.
View MON status:
ceph quorum_status --format json-pretty
You should see all MON nodes listed and “in quorum.”
If quorum is lost, the monitors can no longer update the cluster maps and client I/O typically stalls until quorum is restored.
Restart a failed MON:
systemctl restart ceph-mon@<id>
Check MGR status:
ceph mgr stat
At least one MGR should always be active; others remain standby.
In Proxmox, you can verify these daemons under Ceph → Services.
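Clock skew is a frequent reason MONs fall out of quorum, so check time synchronization alongside the daemon status. A short sketch assuming chrony, which current Proxmox releases use by default:

```
# Compact overview: which MONs exist and which are in quorum
ceph mon stat

# Verify time sync on each node
chronyc tracking
chronyc sources -v
```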
Step 6: Review Logs and Metrics
When the root cause isn’t immediately clear, logs hold the answers.
Where to look:
- Ceph logs: /var/log/ceph/
- System logs: journalctl -u 'ceph*'
- Proxmox logs: /var/log/syslog and /var/log/pve/tasks/
You can also view the most recent entries of the central cluster log:
ceph log last 50
Enable the Ceph Dashboard module (on Proxmox, the ceph-mgr-dashboard package usually needs to be installed first):
ceph mgr module enable dashboard
This gives you visual insights into performance metrics, pool utilization, and warning trends.
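The enable command above usually needs a few companion steps on a Proxmox node; a sketch of the typical setup (the admin user name and password file are placeholders):

```
# Install the module package on the node(s) running ceph-mgr, then re-run the enable command above
apt install ceph-mgr-dashboard

# Generate a self-signed certificate and create an administrator account
ceph dashboard create-self-signed-cert
echo 'ChangeMe123!' > /root/dashboard-pass.txt
ceph dashboard ac-user-create admin -i /root/dashboard-pass.txt administrator

# Show the URL the active MGR is serving the dashboard on
ceph mgr services
```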
Step 7: Common Proxmox-Ceph Issues and Fixes
| Issue | Cause | Recommended Fix |
|---|---|---|
| Cluster in HEALTH_WARN with clock skew | NTP drift between nodes | Sync NTP on all nodes |
| PGs stuck inactive | Missing OSDs or quorum loss | Bring OSDs online, verify MON quorum |
| Slow ops or degraded I/O | Network congestion | Verify MTU, separate Ceph network |
| MON_DISK_LOW | Monitor node disk nearly full | Expand disk or clean logs |
| Uneven PG distribution | Misconfigured pool PG count | Rebalance or add more OSDs |
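For the uneven-distribution case, Ceph's balancer module can even out placement automatically. A sketch (upmap mode requires that all clients are at least Luminous):

```
# Inspect current data balance and the balancer state
ceph osd df
ceph balancer status

# Enable automatic balancing in upmap mode
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
```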
Step 8: Monitoring and Automation for Proxmox + Ceph
For long-term stability, continuous monitoring is essential.
Integrate Ceph metrics into:
- Prometheus + Grafana dashboards
- Proxmox Metrics Server
- Zabbix or Nagios Core for alerting
Track key metrics such as:
- OSD latency
- PG state changes
- Cluster IOPS
- Recovery/backfill operations
Automated alerts can catch issues before they escalate into downtime.
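Ceph ships a Prometheus exporter as an MGR module, which makes the Prometheus + Grafana option straightforward. A sketch (port 9283 is the module's default; the node names in the scrape job are placeholders):

```
# Enable the built-in exporter; it listens on the active MGR, port 9283 by default
ceph mgr module enable prometheus

# Example Prometheus scrape job (add to prometheus.yml; node names are placeholders):
#   - job_name: 'ceph'
#     static_configs:
#       - targets: ['pve1:9283', 'pve2:9283', 'pve3:9283']
```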
Step 9: Preventive Maintenance Tips
To avoid recurring issues:
- Keep Proxmox and Ceph packages up to date (apt update && apt dist-upgrade).
- Schedule periodic health checks with ceph health (see the sketch after this list).
- Test recovery procedures on non-critical pools.
- Ensure all nodes have synchronized time.
- Use enterprise-grade SSDs/NVMe drives for BlueStore DB/WAL devices (journals apply only to legacy FileStore OSDs).
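One way to automate the periodic health check is a small cron script that alerts only when the cluster is not healthy. A minimal sketch (the mail command and recipient are placeholders for whatever alerting you already use):

```
#!/bin/bash
# /usr/local/bin/ceph-health-check.sh - run from cron, e.g. */15 * * * *
STATUS=$(ceph health 2>&1)
if [ "$STATUS" != "HEALTH_OK" ]; then
    echo "$(hostname): $STATUS" | mail -s "Ceph health alert" admin@example.com
fi
```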
Conclusion
Troubleshooting Ceph in Proxmox HCI environments is about methodical analysis rather than guesswork.
By following a structured workflow—from health checks to logs and network validation—you can isolate problems quickly and restore cluster performance.
A well-designed Proxmox + Ceph setup, with:
- Redundant networking,
- Consistent monitoring, and
- Regular maintenance,
can deliver enterprise-level reliability at a fraction of the cost of proprietary hyperconverged platforms.
With the right knowledge and proactive monitoring, your Proxmox HCI cluster can remain stable, scalable, and performant for years to come.