[node2] Post-failure maintenance
Incident Report for h8lio
Postmortem

A second OSD disk on node2 failed this morning (France) and was removed from the Ceph cluster at 08:30 CET.

Data rebalancing completed at 09:15 CET, and the Ceph cluster's health is back to normal.

We are still actively monitoring node2 and the Ceph cluster.

Posted Jul 25, 2024 - 09:25 CEST

Resolved
This incident has been resolved.
Posted Jul 25, 2024 - 07:31 CEST
Monitoring
The high-latency OSD (osd.1, a disk) has been removed from the Ceph cluster.
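For reference, removing a failed or slow OSD from a Ceph cluster typically follows the sequence below. This is a sketch using standard `ceph` CLI commands, not necessarily the exact steps run here; the OSD id `1` matches this incident, and systemd-managed OSD daemons are an assumption:

```shell
# Mark the OSD out so Ceph starts migrating its placement groups to other OSDs
ceph osd out osd.1

# Watch cluster status until recovery/backfill finishes
ceph -s

# Stop the OSD daemon on its host (assumes systemd-managed OSDs)
systemctl stop ceph-osd@1

# Remove the OSD from the CRUSH map, delete its auth key, and remove the OSD entry
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm osd.1
```

Marking the OSD `out` before stopping it lets the cluster rebalance while the disk is still readable, which is why the subsequent rebalancing window (here, until 09:15) is expected.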
Posted Jul 24, 2024 - 21:29 CEST
Update
Full cluster check results:
- Kubernetes: OK (minor network issues fixed)
- Ceph: WARN (we are investigating the slow heartbeats on node2's OSDs)
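Slow OSD heartbeats surface in Ceph's own health reporting; a minimal sketch of the standard `ceph` CLI checks one might use to narrow down a laggy OSD (assuming only a reachable cluster with admin credentials):

```shell
# Detailed health warnings, including slow heartbeat ping times between OSDs
ceph health detail

# Per-OSD commit/apply latency, to spot a disk lagging behind its peers
ceph osd perf

# CRUSH tree, to map suspect OSD ids back to their host (e.g. node2)
ceph osd tree
```

An OSD whose latency in `ceph osd perf` is consistently far above its peers is the usual candidate for removal.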
Posted Jul 24, 2024 - 20:41 CEST
Investigating
We are going to run several node checks during the night, with potential interventions on the nodes.
We will keep you posted on this thread.
Posted Jul 24, 2024 - 19:11 CEST
This incident affected: Cloud (kube.h8l.io, monitoring.h8l.io) and Storage (storage.h8l.io).