[node2] Post-failure maintenance
Incident Report for h8lio
Postmortem

A second OSD disk on node2 failed this morning (France) and was removed from the Ceph cluster at 08:30 CET.

Data rebalancing completed at 09:15 CET, and the Ceph cluster's health is back to normal.

We are still actively monitoring node2 and the Ceph cluster.

Posted Jul 25, 2024 - 09:25 CEST

Resolved
This incident has been resolved.
Posted Jul 25, 2024 - 07:31 CEST
Monitoring
The high-latency OSD (osd.1, a disk) has been removed from the Ceph cluster.
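For reference, removing a failed or slow OSD from a Ceph cluster typically follows the sequence below. This is a sketch using standard `ceph` CLI commands, not necessarily the exact steps run here; the OSD id `1` matches this incident, and systemd-managed OSD daemons are an assumption:

```shell
# Mark the OSD out so Ceph starts migrating its placement groups to other OSDs
ceph osd out osd.1

# Watch cluster status until recovery/backfill finishes
ceph -s

# Stop the OSD daemon on its host (assumes systemd-managed OSDs)
systemctl stop ceph-osd@1

# Remove the OSD from the CRUSH map, delete its auth key, and remove the OSD entry
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm osd.1
```

Marking the OSD `out` before stopping it lets the cluster rebalance while the disk is still readable, which is why the subsequent rebalancing window (here, until 09:15) is expected.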
Posted Jul 24, 2024 - 21:29 CEST
Update
Full cluster check results:
- Kubernetes: OK (minor network issues fixed)
- Ceph: WARN (we are investigating the slow heartbeats on node2's OSDs)
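Slow OSD heartbeats surface in Ceph's own health reporting; a minimal sketch of the standard `ceph` CLI checks one might use to narrow down a laggy OSD (assuming only a reachable cluster with admin credentials):

```shell
# Detailed health warnings, including slow heartbeat ping times between OSDs
ceph health detail

# Per-OSD commit/apply latency, to spot a disk lagging behind its peers
ceph osd perf

# CRUSH tree, to map suspect OSD ids back to their host (e.g. node2)
ceph osd tree
```

An OSD whose latency in `ceph osd perf` is consistently far above its peers is the usual candidate for removal.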
Posted Jul 24, 2024 - 20:41 CEST
Investigating
We are going to run several node checks during the night, with potential interventions on the nodes.
We will keep you posted on this thread.
Posted Jul 24, 2024 - 19:11 CEST
This incident affected: Cloud (kube.h8l.io, monitoring.h8l.io) and Storage (storage.h8l.io).