Uptime monitoring is the practice of checking, on a schedule, whether a service is responding correctly, and alerting someone when it is not. The goal is not to collect a number. It is to learn about a real problem before your customers do, without crying wolf so often that people stop listening.
The kinds of check
- HTTP: request a URL and judge the response by status code and timing. The workhorse for websites and APIs.
- Keyword: an HTTP check that also asserts a word is present (or absent) in the body, so a 200 that renders an error page still counts as down.
- Heartbeat: the inverse. Your cron job or worker pings the monitor on each run; if a ping does not arrive within the expected window, the job is treated as down. The right tool for things that have no public URL.
- TCP: open a connection to a host and port, for databases, mail servers and other non-HTTP services.
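As a rough illustration of the four kinds of check, the sketch below judges a result for each one. Everything named here is hypothetical and chosen for the example: the `HttpResult` type, the function names, the five-second timing budget, the grace period on the heartbeat window, and the choice to count only 2xx status codes as healthy. A real monitor may draw those lines differently.

```python
import socket
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HttpResult:
    """Hypothetical container for one already-fetched HTTP response."""
    status: int       # HTTP status code returned by the service
    elapsed_s: float  # time taken to receive the response, in seconds
    body: str         # response body as text


def http_check_passes(
    result: HttpResult,
    max_elapsed_s: float = 5.0,              # assumed timing budget
    must_contain: Optional[str] = None,      # keyword that must be present
    must_not_contain: Optional[str] = None,  # keyword that must be absent
) -> bool:
    """Judge a response by status code, timing and optional keywords."""
    if not 200 <= result.status < 300:  # assumption: only 2xx is healthy
        return False
    if result.elapsed_s > max_elapsed_s:
        return False
    if must_contain is not None and must_contain not in result.body:
        return False
    if must_not_contain is not None and must_not_contain in result.body:
        return False
    return True


def heartbeat_is_down(last_ping: datetime, now: datetime,
                      expected_every: timedelta, grace: timedelta) -> bool:
    """Down when the last ping is older than the expected window plus grace."""
    return now - last_ping > expected_every + grace


def tcp_check_passes(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Up if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, DNS failure or timeout
        return False
```

With these definitions, `http_check_passes(HttpResult(200, 0.2, "Something went wrong"), must_not_contain="went wrong")` returns `False`, which is the keyword case from the list: a 200 that renders an error page still counts as down.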
Interval and what down really means
The interval is how often you check, commonly every minute or two. A single failed check is rarely proof of an outage: networks blip, a server hiccups, a deploy restarts a process. Treating one blip as down is the fastest way to a noisy, ignored pager.
Confirmation and flap thresholds
The fix is to require a run of consecutive failures before opening an incident, and a run of consecutive successes before resolving it. That confirmation window absorbs transient blips while still catching real, sustained failures quickly. For example, with a one-minute interval and a threshold of three consecutive failures, a sustained outage is confirmed within about three minutes of starting. A brief recovery in the middle of an outage should not split it into two incidents, and a confirmed recovery followed by a fresh failure should open a new one. Getting this state machine right is most of what separates a calm monitor from a noisy one.
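The state machine described above can be sketched in a few lines. This is a minimal illustration and not any particular product's implementation: the `FlapFilter` name, its fields, and the default thresholds of three failures and two successes are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FlapFilter:
    """Hypothetical confirmation filter for a single monitor."""
    fail_threshold: int = 3     # consecutive failures to open (assumed value)
    recover_threshold: int = 2  # consecutive successes to resolve (assumed value)
    down: bool = False          # the confirmed state
    streak: int = 0             # consecutive results contradicting that state
    incidents_opened: int = 0

    def observe(self, check_ok: bool) -> str:
        """Feed one check result; return 'opened', 'resolved' or 'unchanged'."""
        contradicts = check_ok if self.down else not check_ok
        if not contradicts:
            # The result agrees with the confirmed state, so any partial
            # run towards a state change is discarded.
            self.streak = 0
            return "unchanged"
        self.streak += 1
        threshold = self.recover_threshold if self.down else self.fail_threshold
        if self.streak < threshold:
            return "unchanged"
        # The run is long enough: flip the confirmed state.
        self.down = not self.down
        self.streak = 0
        if self.down:
            self.incidents_opened += 1
            return "opened"
        return "resolved"
```

Because any result that agrees with the confirmed state resets the streak, a single success in the middle of an outage never reaches the recovery threshold and the incident stays open. Only after a confirmed recovery can a new run of failures open a second incident.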
False positives and where they come from
Beyond blips, the other big source of false alarms is the vantage point. A check runs from somewhere, and a network problem between that location and your service looks identical to your service being down. Confirmation thresholds catch the transient version of this, but a sustained regional path problem can still read as an outage from one location. The strongest defence is to check from several regions and only open an incident when more than one agrees. Sentivel runs checks from a single region today and leans on flap thresholds to filter blips; multi-region consensus is the next step on that axis, and we say plainly that this is where a single-vantage monitor is weakest.
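To make the consensus rule concrete, here is a minimal sketch of "only open an incident when more than one region agrees". It illustrates the general idea only and does not describe how Sentivel works today. The function name, the region labels and the default quorum of two are all hypothetical.

```python
from typing import Mapping


def consensus_down(region_says_down: Mapping[str, bool],
                   min_agreeing: int = 2) -> bool:
    """True when at least `min_agreeing` regions report the service as down.

    `region_says_down` maps a region label to that region's own verdict.
    The default quorum of 2 is an assumed value.
    """
    return sum(region_says_down.values()) >= min_agreeing


# One region sees a failure, the others do not: likely a path problem
# near that region, so no incident is opened.
one_region = {"region-a": True, "region-b": False, "region-c": False}

# Two regions agree: treat the failure as real.
two_regions = {"region-a": True, "region-b": True, "region-c": False}
```

Here `consensus_down(one_region)` is `False` and `consensus_down(two_regions)` is `True`. With a single region, the only possible quorum is one, which is why a single-vantage monitor cannot tell a regional path problem from an outage.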
Tie monitoring to your status page
A monitor is most useful when its confirmed state drives the matching component on your status page, so customers see an honest status without anyone toggling it by hand. A brand-new monitor should read as warming up until it has proven itself, never as a premature green.
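One way to picture that mapping is a small function from the monitor's confirmed state to a status page label. This is a sketch under stated assumptions: the `ComponentStatus` names, the idea that a monitor "proves itself" by accumulating successful checks, and the warm-up count of five are all hypothetical choices for the example.

```python
from enum import Enum


class ComponentStatus(Enum):
    """Hypothetical labels a status page component might show."""
    WARMING_UP = "warming up"
    OPERATIONAL = "operational"
    OUTAGE = "outage"


def component_status(confirmed_down: bool,
                     successes_seen: int,
                     warmup_checks: int = 5) -> ComponentStatus:
    """Derive the status page label from the monitor's confirmed state.

    Assumption: a monitor has proven itself once it has recorded
    `warmup_checks` successful checks. Until then it is never shown as green.
    """
    if confirmed_down:
        return ComponentStatus.OUTAGE
    if successes_seen < warmup_checks:
        return ComponentStatus.WARMING_UP
    return ComponentStatus.OPERATIONAL
```

The input is the confirmed state, not the latest raw check result, so a single blip does not flicker the public page.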