Up-time Monitoring

No Change
adopt
First Added: October 1, 2024 Updated: June 12, 2026


Up-time monitoring measures whether a system is reachable and behaving acceptably from a user’s perspective. Raw uptime percentage is a lagging summary, so we adopt the practice of measuring the underlying SLIs (latency, errors, saturation) via OpenTelemetry and Monitoring tools, then deriving uptime from those SLIs for SLAs and Incident Management.

Blurb

Uptime is a measure of system reliability, expressed as the percentage of time a service is available.
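
As a rough illustration of deriving uptime from an SLI instead of measuring it directly, the sketch below computes availability from request counts and converts an SLO target into the downtime it allows. The function names, the request-count SLI, and the 30-day window are assumptions made for this example, not something this entry prescribes.

```python
def uptime_percent(good_requests: int, total_requests: int) -> float:
    """Availability as a percentage, derived from a request-success SLI.

    Assumption: "good" means a request that met the SLI criteria
    (for example, non-5xx and within the latency threshold).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * good_requests / total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage an SLO target tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    print(f"{uptime_percent(999_000, 1_000_000):.2f}% available")
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes of budget per 30 days")
```

A 99.9% target over 30 days leaves roughly 43 minutes of total unavailability, which is why a single short blip rarely justifies a page on its own.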

Summary

What it is: Synthetic probes, RUM, and SLO-based alerts that answer “can customers use the product?” rather than “is the host up?”

When to use: Public services with SLAs; public status pages (Upptime); synthetic checks (Grafana k6, Kuberhealthy) that complement metrics.

When to skip: Batch-only internal jobs with no user-facing window (monitor job success instead).

Practices: Define SLOs on user journeys; avoid host-only ping alerts in cloud-native estates; page on burn rate, not single blips.
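
To make "page on burn rate, not single blips" concrete, here is a minimal sketch of a two-window burn-rate check. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the 14.4 threshold is an assumption borrowed from the commonly cited multi-window pattern (2% of a 30-day budget consumed in one hour); tune both to your own SLOs.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget would be exactly exhausted at the end of the
    SLO window; higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold, not taken from this entry
) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (e.g. 1 hour) shows the problem is sustained; the
    short window (e.g. 5 minutes) shows it is still happening now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Sustained 2% errors against a 99.9% SLO: both windows burn at about 20x.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # A brief spike: the short window is hot, the long window is not.
    print(should_page(0.0005, 0.05, slo_target=0.999))
```

Requiring both windows to exceed the threshold is what filters out single blips: a short spike raises the short-window ratio but barely moves the long-window one.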

Details

Approach | Garden items
Public status | Upptime on GitHub Actions
Load / synthetics | Grafana k6
In-cluster checks | Kuberhealthy
Metrics + SLOs | Prometheus, Grafana, OpenTelemetry
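
The synthetic and in-cluster checks listed above are handled by dedicated tools. Purely to illustrate the question such a check asks, the sketch below is a generic probe built on the Python standard library; it is not how Grafana k6 or Kuberhealthy work internally. The `ProbeResult` type, the `probe` function, and the one-second latency threshold are all hypothetical names and values chosen for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    status: Optional[int]  # HTTP status code, or None if no response arrived
    latency_s: float       # wall-clock time spent on the attempt

    def healthy(self, max_latency_s: float = 1.0) -> bool:
        """User-facing health: a response arrived, it was not an error,
        and it was fast enough. The 1.0 s threshold is an assumption."""
        return (
            self.status is not None
            and 200 <= self.status < 400
            and self.latency_s <= max_latency_s
        )


def probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """Fetch the URL once and record the status code and latency."""
    start = time.monotonic()
    status: Optional[int]
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # include body transfer in the measured latency
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx
    except OSError:
        status = None  # DNS failure, refused connection, timeout, etc.
    return ProbeResult(status=status, latency_s=time.monotonic() - start)
```

Calling `probe(...)` against a health or user-journey endpoint and feeding `healthy()` results into the request-success SLI keeps the check aligned with "can customers use the product?" rather than "is the host up?".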