Up-time Monitoring

No Change

adopt

First Added: October 1, 2024 Updated: July 2, 2026

Up-time monitoring measures whether a system is reachable and behaving acceptably from a user’s perspective. A raw uptime percentage is a lagging summary, so we adopt measuring the underlying SLIs (latency, errors, saturation) via OpenTelemetry and Monitoring tools, and then deriving uptime from those SLIs for SLAs and Incident Management.
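As a rough illustration of deriving uptime from an SLI rather than from host pings, the sketch below computes availability as the share of "good" events and the fraction of error budget left against an SLO target. The function names, the event counts, and the 99.9% target are assumptions chosen for the example and do not come from any particular tool.

```python
def availability(good_events: int, total_events: int) -> float:
    """Return the fraction of events that met the SLI threshold.

    Assumption: a window with no traffic counts as fully available.
    """
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Return the unspent share of the error budget (1.0 = untouched, <= 0 = exhausted)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 of them slow or failed.
    print(f"availability: {availability(999_600, 1_000_000):.4%}")
    print(f"budget left:  {error_budget_remaining(999_600, 1_000_000, 0.999):.0%}")
```

In practice the good and total counts would come from whatever the metrics backend records per user journey; the arithmetic stays the same.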

Summary

What it is: Synthetic probes, RUM, and SLO-based alerts that answer “can customers use the product?” rather than “is the host up?”

When to use: Public services with SLAs; status pages (Upptime); synthetic checks (Grafana k6, Kuberhealthy) that complement metrics.

When to skip: Batch-only internal jobs with no user-facing window (monitor job success instead).

Practices: Define SLOs on user journeys; avoid host-only ping alerts in cloud-native estates; page on error-budget burn rate, not on a single failed check.
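The burn-rate practice can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows, so a value of 1 spends the budget exactly over the SLO period. The paging rule here requires both a long and a short window to exceed a threshold, which is one common way to ignore single blips. The window pairing, the default threshold of 14.4, and all names are assumptions for illustration and should be tuned per service.

```python
from typing import Tuple

Window = Tuple[int, int]  # (bad_events, total_events) observed in a time window


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the problem is sustained; the short window shows
    it is still happening now.
    """
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold


if __name__ == "__main__":
    # Hypothetical counts: a sustained 2% error rate versus a brief spike.
    print(should_page((200, 10_000), (30, 1_000), slo_target=0.999))  # sustained
    print(should_page((10, 10_000), (50, 1_000), slo_target=0.999))   # single blip
```

With a 99.9% target, the sustained case burns at roughly 20x and 30x and pages, while the blip burns fast only in the short window and does not.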

Details

Approach	Garden items
Public status	Upptime on GitHub Actions
Load / synthetics	Grafana k6
In-cluster checks	Kuberhealthy
Metrics + SLOs	Prometheus, Grafana, OpenTelemetry