As data demands continue to grow, many organizations are turning to Proxmox VE (Virtual Environment) with Ceph Storage to build robust, high-performance Hyperconverged Infrastructure (HCI) solutions. The combination offers a flexible and open-source alternative to VMware vSAN or Nutanix, providing scalable compute and storage from the same cluster.

However, even the best-engineered Proxmox + Ceph setups can encounter performance or reliability challenges. Understanding how to troubleshoot Ceph clusters within Proxmox is crucial to maintaining uptime, data safety, and system performance.

In this guide, we’ll dive deep into a structured, step-by-step troubleshooting process tailored for Proxmox HCI deployments—covering everything from OSD issues to network bottlenecks.

Step 1: Checking Cluster Health in Proxmox VE

In a Proxmox-Ceph environment, both the Proxmox dashboard and the Ceph CLI provide detailed insights into cluster health.

From the Proxmox Web GUI:

Navigate to:

Datacenter → Ceph → Status

You’ll see a summary showing:

  • Cluster health (OK / Warning / Error)
  • Number of MONs, OSDs, and MGRs
  • PGs and recovery status

From the command line:

ceph -s

or

ceph health detail

Typical health states:

  • HEALTH_OK – All systems operational.
  • HEALTH_WARN – Non-critical issues (e.g., recovering PGs, low space).
  • HEALTH_ERR – Critical problems threatening redundancy or data safety.

When HEALTH_WARN or HEALTH_ERR appears, the message will often reference the exact subsystem at fault—OSDs, MONs, PGs, or network communication.
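
While you work through a problem, it helps to keep a live view of cluster state. A minimal sketch, assuming you run it on a Proxmox node that has the Ceph admin keyring in place:

# Refresh the status summary every 5 seconds
watch -n 5 ceph -s

# Or stream cluster log events as they are generated
ceph -w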

Step 2: Investigating Placement Groups (PGs)

In Proxmox Ceph, Placement Groups (PGs) determine how data is distributed and replicated across OSDs.
If PGs are degraded, undersized, or stuck, performance and redundancy can suffer.

Check PG stats:

ceph pg stat

Common PG issues in Proxmox:

PG State         Meaning                                     Typical Fix
degraded         One or more replicas missing                Restart affected OSDs or check node connectivity
undersized       Not enough OSDs to maintain replication     Verify OSDs are “in” and “up”
stuck inactive   PGs not recovering                          Check monitor quorum and network health
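
To pinpoint exactly which PGs are affected before applying a fix, the commands below are a useful sketch; <pgid> is a placeholder for a PG ID taken from ceph pg stat or ceph health detail:

# List PGs stuck in a problematic state
ceph pg dump_stuck inactive
ceph pg dump_stuck degraded

# Query one PG for its detailed state and the OSDs it maps to
ceph pg <pgid> query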

If PGs are stuck, verify that all nodes in the Proxmox cluster have proper Ceph communication on the cluster network.
Restart any affected OSDs:

systemctl restart ceph-osd@<id>

Step 3: Diagnosing OSD Issues

Each OSD (Object Storage Daemon) represents one Ceph disk. In Proxmox HCI, OSDs are distributed across cluster nodes for fault tolerance.

View the OSD tree:

ceph osd tree

Look for:

  • down OSDs — The service is stopped or unreachable.
  • out OSDs — Ceph has automatically excluded the OSD.
  • Uneven data distribution — Some OSDs overloaded while others idle.

Common OSD fixes:

  • Restart a failed OSD:
    systemctl restart ceph-osd@<id>
    
  • Re-add an OSD that was marked out:
    ceph osd in <id>
    
  • If an OSD repeatedly fails, inspect logs:
    journalctl -u ceph-osd@<id> --since today
    Look for disk I/O errors or hardware failures.

In Proxmox, faulty disks can also be identified from the GUI: the Node → Disks panel shows SMART status for each physical disk, while Node → Ceph → OSD shows per-OSD status and usage.
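
If you suspect imbalance or a slow disk, two quick checks from any node can narrow it down. A sketch, relying only on the standard Ceph CLI tools already present on Proxmox Ceph nodes:

# Per-OSD utilization, weight, and PG count, grouped by host
ceph osd df tree

# Per-OSD commit/apply latency figures for spotting a slow disk
ceph osd perf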

Step 4: Network Health and Latency Checks

Ceph performance heavily depends on network latency and bandwidth.
In Proxmox HCI, you typically use two network interfaces per node:

  • Public (Proxmox management + client I/O)
  • Cluster (Ceph replication and heartbeat traffic)

Verify network communication:

ping <node-ip>
ceph ping mon.<id>

Tips for optimal Ceph networking:

  • Use 10 GbE or faster interfaces for the Ceph cluster network.
  • Ensure MTU (e.g., jumbo frames 9000) is consistent across switches and NICs.
  • Separate Ceph traffic from VM migration or backup traffic.

If you notice slow ops or degraded performance in Ceph, network packet loss or high latency is often the culprit.
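
To confirm whether the network really is the bottleneck, a quick throughput and MTU test between two nodes helps. A sketch, assuming iperf3 is installed on both nodes and jumbo frames are intended:

# On node A: start an iperf3 server
iperf3 -s

# On node B: measure throughput over the Ceph cluster network
iperf3 -c <node-a-cluster-ip>

# Verify jumbo frames end to end (8972 = 9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 <node-a-cluster-ip>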

Step 5: Checking Monitor (MON) and Manager (MGR) Daemons

Ceph monitors (MONs) maintain cluster maps and manage quorum; managers (MGRs) provide additional monitoring and management services.

View MON status:

ceph quorum_status --format json-pretty

You should see all MON nodes listed and “in quorum.”
If quorum is lost, the monitors cannot commit new cluster maps and client I/O will stall until quorum is restored.

Restart a failed MON:

systemctl restart ceph-mon@<id>

Check MGR status:

ceph mgr stat

At least one MGR should always be active; others remain standby.
In Proxmox, you can verify these daemons under Ceph → Services.
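
Two quick supplementary checks, as a sketch (the systemd targets referenced are the standard ones created by the Ceph packages):

# Compact view of monitor membership and quorum
ceph mon stat

# State of the MON and MGR services on the local node
systemctl status ceph-mon.target ceph-mgr.target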

Step 6: Review Logs and Metrics

When the root cause isn’t immediately clear, logs hold the answers.

Where to look:

  • Ceph logs: /var/log/ceph/
  • System logs: journalctl -u ceph*
  • Proxmox logs: /var/log/syslog and /var/log/pve/tasks/

You can also use:

ceph log last 50

to view the most recent cluster log entries reported to the monitors.
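
If you are chasing a specific daemon, tailing its log while reproducing the problem is often faster. A sketch, assuming the default log location and daemon ID:

# Follow the log of one OSD in real time
tail -f /var/log/ceph/ceph-osd.<id>.log

# Search the last hour of journal entries from all Ceph units for likely culprits
journalctl -u 'ceph*' --since "1 hour ago" | grep -iE 'error|fail|slow'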

Enable the Ceph Dashboard, an optional ceph-mgr module that complements Proxmox’s built-in Ceph panels:

ceph mgr module enable dashboard

This gives you visual insights into performance metrics, pool utilization, and warning trends.
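
If the module is missing, it ships as a separate package on Debian-based systems such as Proxmox. A minimal setup sketch; the account name and password file below are placeholders to adapt:

# Install the dashboard module if the enable command above reported it missing
apt install ceph-mgr-dashboard

# Generate a self-signed certificate for the dashboard's web endpoint
ceph dashboard create-self-signed-cert

# Create an administrator account (the password is read from a file)
echo 'ChangeMe123' > /root/dashboard-pass.txt
ceph dashboard ac-user-create admin -i /root/dashboard-pass.txt administrator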

Step 7: Common Proxmox-Ceph Issues and Fixes

Issue                                    Cause                           Recommended Fix
Cluster in HEALTH_WARN with clock skew   NTP drift between nodes         Sync NTP on all nodes
PGs stuck inactive                       Missing OSDs or quorum loss     Bring OSDs online, verify MON quorum
Slow ops or degraded I/O                 Network congestion              Verify MTU, separate Ceph network
MON_DISK_LOW                             Monitor node disk nearly full   Expand disk or clean logs
Uneven PG distribution                   Misconfigured pool PG count     Rebalance or add more OSDs
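
For the first and last entries, a sketch of the underlying checks (recent Proxmox VE releases use chrony by default, an assumption worth confirming on your nodes):

# Confirm time synchronization status on each node to rule out clock skew
chronyc tracking

# Review whether the PG autoscaler would adjust pool PG counts
ceph osd pool autoscale-status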

Step 8: Monitoring and Automation for Proxmox + Ceph

For long-term stability, continuous monitoring is essential.
Integrate Ceph metrics into:

  • Prometheus + Grafana dashboards
  • Proxmox Metrics Server
  • Zabbix or Nagios Core for alerting

Track key metrics such as:

  • OSD latency
  • PG state changes
  • Cluster IOPS
  • Recovery/backfill operations

Automated alerts can catch issues before they escalate into downtime.
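
The quickest way to feed Ceph metrics into Prometheus is the manager's built-in exporter module; a sketch, with the default listening port noted as an assumption to confirm for your Ceph version:

# Expose cluster metrics for Prometheus (default endpoint is http://<node>:9283/metrics)
ceph mgr module enable prometheus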

Step 9: Preventive Maintenance Tips

To avoid recurring issues:

  1. Keep Proxmox and Ceph packages up to date (apt update && apt dist-upgrade).
  2. Schedule periodic health checks:
    ceph health
    
  3. Test recovery procedures on non-critical pools.
  4. Ensure all nodes have synchronized time.
  5. Use enterprise-grade SSDs/NVMe with power-loss protection for BlueStore DB/WAL devices.
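
Before planned maintenance such as a node reboot, preventing unnecessary rebalancing keeps the cluster calm; a common pattern, sketched here:

# Stop Ceph from marking OSDs out while the node is down
ceph osd set noout

# ...perform the maintenance, then re-enable automatic rebalancing
ceph osd unset noout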

Conclusion

Troubleshooting Ceph in Proxmox HCI environments is about methodical analysis rather than guesswork.
By following a structured workflow—from health checks to logs and network validation—you can isolate problems quickly and restore cluster performance.

A well-designed Proxmox + Ceph setup, with:

  • Redundant networking,
  • Consistent monitoring, and
  • Regular maintenance,

can deliver enterprise-level reliability at a fraction of the cost of proprietary hyperconverged platforms.

With the right knowledge and proactive monitoring, your Proxmox HCI cluster can remain stable, scalable, and performant for years to come.