What is uptime monitoring?

Uptime monitoring means checking a service on a schedule (by HTTP request, keyword match, heartbeat ping or TCP connection) and alerting when it fails, so you find out about a real problem before your customers do.

What is the difference between HTTP and heartbeat monitoring?

HTTP monitoring reaches out to a URL and judges the response. Heartbeat monitoring is the inverse: your job pings the monitor on each run, and the absence of an expected ping is the signal. Heartbeats suit cron jobs and workers that have no public URL.

How do I avoid false-positive alerts?

Require several consecutive failed checks before opening an incident and several consecutive successes before resolving it, so a transient blip does not page anyone. Confirmation alone cannot filter out a sustained, region-specific network issue; for that, check from multiple locations and alert only on consensus.

How often should checks run?

Every one to two minutes is typical. More frequent checks detect faster but add load and cost; pair the interval with confirmation thresholds so frequency does not translate into noise.

Uptime monitoring, explained

Uptime monitoring is the practice of checking, on a schedule, whether a service is responding correctly, and alerting someone when it is not. The goal is not to collect a number. It is to learn about a real problem before your customers do, without crying wolf so often that people stop listening.

The kinds of check

HTTP: request a URL and judge the response by status code and timing. The workhorse for websites and APIs.
Keyword: an HTTP check that also asserts a word is present (or absent) in the body, so a 200 that renders an error page still counts as down.
Heartbeat: the inverse. Your cron job or worker pings the monitor on each run; if a ping does not arrive in the expected window, it is down. The right tool for things that have no public URL.
TCP: open a connection to a host and port, for databases, mail servers and other non-HTTP services.
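As a rough illustration of the four kinds of check above, here is a minimal Python sketch. The function names, the 2xx-only success rule, the default timeouts and the grace period are all assumptions made for the example, not Sentivel's implementation. The HTTP and keyword logic is kept as a pure function over an already-fetched response so it can be read without any particular HTTP client in mind.

```python
import socket
from datetime import datetime, timedelta
from typing import Optional


def judge_http(status: int, body: str, elapsed_s: float,
               max_elapsed_s: float = 5.0,
               must_contain: Optional[str] = None,
               must_not_contain: Optional[str] = None) -> bool:
    """HTTP and keyword check: return True if the response counts as up.

    Assumption: only 2xx statuses count as healthy and redirects were
    already followed by whatever client fetched the response.
    """
    if not 200 <= status < 300:
        return False
    if elapsed_s > max_elapsed_s:
        return False
    # Keyword assertions: a 200 that renders an error page still fails.
    if must_contain is not None and must_contain not in body:
        return False
    if must_not_contain is not None and must_not_contain in body:
        return False
    return True


def heartbeat_is_up(last_ping: datetime, now: datetime,
                    period: timedelta, grace: timedelta) -> bool:
    """Heartbeat check: up if the last ping arrived within the expected window.

    `grace` is a hypothetical allowance for jobs that run a little late.
    """
    return now - last_ping <= period + grace


def tcp_is_up(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """TCP check: up if a connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For example, `judge_http(200, "Something went wrong", 0.2, must_not_contain="went wrong")` returns `False`, which is the case the keyword check exists to catch.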

Interval and what down really means

The interval is how often you check, commonly every minute or two. A single failed check is rarely proof of an outage: networks blip, a server hiccups, a deploy restarts a process. Treating one blip as down is the fastest way to a noisy, ignored pager.

Confirmation and flap thresholds

The fix is to require a run of consecutive failures before opening an incident, and a run of consecutive successes before resolving it. That confirmation window absorbs transient blips while still catching real, sustained failures quickly. A brief recovery in the middle of an outage should not split it into two incidents, while a confirmed recovery followed by a fresh failure should open a new one. Getting this state machine right is most of what separates a calm monitor from a noisy one.
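The state machine described above can be sketched in a few lines of Python. This is one possible shape under stated assumptions, not a description of how any particular monitor implements it: the class name `FlapFilter`, the default thresholds of 3 failures and 2 successes, and the event strings are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlapFilter:
    """Turns raw check results into confirmed incident open/resolve events."""
    fail_threshold: int = 3      # consecutive failures needed to open (assumed)
    recover_threshold: int = 2   # consecutive successes needed to resolve (assumed)
    down: bool = False           # the confirmed state
    streak: int = 0              # consecutive results contradicting the confirmed state
    incidents_opened: int = 0

    def observe(self, ok: bool) -> str:
        """Feed one check result; return 'opened', 'resolved' or 'unchanged'."""
        contradicts = ok if self.down else not ok
        if not contradicts:
            # The result agrees with the confirmed state, so any partial
            # run toward flipping it is discarded.
            self.streak = 0
            return "unchanged"
        self.streak += 1
        threshold = self.recover_threshold if self.down else self.fail_threshold
        if self.streak < threshold:
            return "unchanged"
        # The run is long enough: flip the confirmed state.
        self.down = not self.down
        self.streak = 0
        if self.down:
            self.incidents_opened += 1
            return "opened"
        return "resolved"
```

Because a single success during an outage only starts a recovery streak and a following failure resets it, a brief recovery does not split the incident. Once the recovery threshold is met the incident is resolved, and a later run of failures opens a new one.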

False positives and where they come from

Beyond blips, the other big source of false alarms is the vantage point. A check runs from somewhere, and a network problem between that location and your service looks identical to your service being down. Confirmation thresholds catch the transient version of this, but a sustained regional path problem can still read as an outage from one location. The strongest defence is to check from several regions and only open an incident when more than one agrees. Sentivel runs checks from a single region today and leans on flap thresholds to filter blips; multi-region consensus is the next step on that axis, and we say so plainly: sustained regional path problems are where a single-vantage monitor is weakest.
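To make the consensus idea concrete, here is a small hypothetical sketch in Python. It is not something Sentivel does today, since checks currently run from a single region; the function name, the region labels and the default quorum of two are assumptions chosen to match "more than one agrees".

```python
from typing import Mapping


def consensus_down(results: Mapping[str, bool], quorum: int = 2) -> bool:
    """Return True only when at least `quorum` regions report a failure.

    `results` maps a region label to whether its latest confirmed check
    succeeded. The region labels are illustrative, not real vantage points.
    """
    failing = [region for region, ok in results.items() if not ok]
    return len(failing) >= quorum


# One region failing alone looks like a path problem, not an outage.
print(consensus_down({"eu-west": False, "us-east": True, "ap-south": True}))
# Two regions agreeing is treated as a real failure.
print(consensus_down({"eu-west": False, "us-east": False, "ap-south": True}))
```

In practice each region's input would itself be a confirmed state rather than a single raw check, so that consensus and confirmation thresholds filter different kinds of noise.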

Tie monitoring to your status page

A monitor is most useful when its confirmed state drives the matching component on your status page, so customers see an honest status without anyone toggling it by hand. A brand-new monitor should read as warming up until it has proven itself, never as a premature green.
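A sketch of that mapping might look like the following. How a monitor "proves itself" is not specified here, so the example assumes a minimum count of successful checks; the enum values, the function name and the default of five are likewise hypothetical.

```python
from enum import Enum


class ComponentStatus(Enum):
    WARMING_UP = "warming up"
    OPERATIONAL = "operational"
    OUTAGE = "outage"


def component_status(confirmed_down: bool, successes_seen: int,
                     min_successes: int = 5) -> ComponentStatus:
    """Derive the status page component state from the monitor's confirmed state.

    Assumption: a monitor has "proven itself" once it has recorded
    `min_successes` successful checks.
    """
    if confirmed_down:
        # A confirmed failure is shown even during warm-up.
        return ComponentStatus.OUTAGE
    if successes_seen < min_successes:
        # Not enough history yet: never show a premature green.
        return ComponentStatus.WARMING_UP
    return ComponentStatus.OPERATIONAL
```

The point of the design is that the status page reads only the confirmed state, so a raw failed check that has not yet crossed the confirmation threshold never reaches customers.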

