Questions
Short, direct answers to the questions people ask about running a reliable service. For the longer reads, see the guides.
How do I monitor a cron job?
Use a heartbeat (also called a dead-man's switch). Instead of an outside check reaching in to your job, the job pings a unique URL each time it finishes successfully. If a ping does not arrive within the window you expect, the monitor marks the job as down and alerts you. That is the right tool for cron jobs, workers and backups, which have no public URL for an outside check to hit.
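As a rough illustration, a job can wrap its work so the ping is sent only after the work succeeds. This is a minimal sketch, not any particular monitor's client: HEARTBEAT_URL, send_ping and run_with_heartbeat are hypothetical names, and the URL is a placeholder for whatever unique URL your monitor issues.

```python
import urllib.request

# Hypothetical placeholder; a real monitor issues its own unique URL per job.
HEARTBEAT_URL = "https://monitor.example.com/ping/your-job-id"


def send_ping(url):
    """Send a plain GET to the heartbeat URL and discard the response."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_heartbeat(job, url=HEARTBEAT_URL, ping=send_ping):
    """Run the job, then ping the monitor only if the job did not raise."""
    job()  # an exception here propagates, so no ping is sent on failure
    try:
        ping(url)
    except OSError:
        # A network error while pinging should not fail the job itself.
        # The monitor will see the missing ping and alert on its own.
        pass
```

The design choice is that silence means failure: a crash, a hang or a host that never ran the job all look the same to the monitor, which is exactly what you want.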
What does 99.9% uptime mean (and what is a good SLA)?
99.9% uptime (three nines) allows about 8 hours 46 minutes of downtime per year, or roughly 43 minutes per 30-day month. Each additional nine cuts that allowance by a factor of ten, so 99.99% allows about 53 minutes per year. What counts as good depends on what you are running: 99.9% is a common baseline for most SaaS, while infrastructure that other services depend on heavily often targets 99.95% or higher.
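The arithmetic is simple enough to check yourself. A minimal sketch, assuming a 365-day year and a 30-day month (the function name is illustrative):

```python
def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted in a period at a given uptime target."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        per_year = allowed_downtime_minutes(target, 365)
        per_month = allowed_downtime_minutes(target, 30)
        print(f"{target}%: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

For 99.9% this works out to about 525.6 minutes per year, which is the 8 hours 46 minutes quoted above.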
How often should uptime checks run?
Every one to two minutes is the common choice and a good default. More frequent checks detect an outage sooner but add load and cost; less frequent checks save resources but widen the window before you learn of an outage. Whatever interval you pick, pair it with a confirmation threshold so a single failed check never pages anyone.
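A confirmation threshold can be sketched as a counter of consecutive failures. This is one possible shape, not a specific product's behavior; the class name and the default of three are assumptions.

```python
class ConfirmationThreshold:
    """Alert only after N consecutive failed checks (N is an assumed default)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, check_ok):
        """Record one check result. Return True only when an alert should fire."""
        if check_ok:
            # Any success resets the streak and re-arms the alert.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alerted = True  # fire once per outage, not on every failed check
            return True
        return False
```

The trade-off is detection time: with a one-minute interval and a threshold of three, an outage is confirmed roughly two to three minutes after it starts.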
What is the difference between a status page and uptime monitoring?
Uptime monitoring is how you detect a problem: scheduled checks that tell you when a service is failing. A status page is how you communicate it: the public page where customers see current health and incident updates. They work best together, where the monitoring drives the status shown on the page so it stays honest without anyone toggling it by hand.
How do I tell customers a third-party outage is affecting me?
Map the affected part of your product to the provider it depends on, then post an advisory that attributes the cause to the provider and scopes the impact. For example: "We are aware of an issue with a provider we rely on for payments. Some checkouts may be affected until it resolves." Name the dependency, say what is affected, commit to a next update, and avoid promises you cannot verify.
What is an escalation policy?
An escalation policy is the ordered ladder an alert climbs until someone acknowledges it: page the on-call person, wait, then page them again or page a secondary, then widen to a group or a manager. Each step has a timeout, so a missed alert escalates automatically rather than leaving an incident unanswered. Acknowledging stops the ladder from climbing further.
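The ladder can be modeled as an ordered list of steps, each with its own timeout. The sketch below is illustrative only: the step names, the timeouts and the function targets_paged are assumptions, and it simulates who would be paged rather than sending anything.

```python
from dataclasses import dataclass


@dataclass
class Step:
    target: str           # who gets paged at this step
    timeout_minutes: int  # how long to wait for an acknowledgement


# Hypothetical policy; real targets and timeouts depend on your team.
POLICY = [
    Step("primary on-call", 5),
    Step("secondary on-call", 10),
    Step("engineering manager", 15),
]


def targets_paged(policy, acked_after_minutes=None):
    """Return who is paged before the alert is acknowledged.

    acked_after_minutes is the time from the first page to the
    acknowledgement, or None if nobody ever acknowledges.
    """
    paged = []
    elapsed = 0
    for step in policy:
        if acked_after_minutes is not None and acked_after_minutes <= elapsed:
            break  # acknowledged: the ladder stops climbing
        paged.append(step.target)
        elapsed += step.timeout_minutes
    return paged
```

With this policy, an acknowledgement at 3 minutes pages only the primary, while one at 7 minutes has already reached the secondary.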
What is the difference between HTTP and heartbeat monitoring?
HTTP monitoring reaches out: it requests a URL on a schedule and judges the response by status code, timing and optionally a keyword. Heartbeat monitoring is the inverse: your job pings the monitor on each run, and the absence of an expected ping is the signal that something is wrong. Use HTTP for anything with a public URL, and heartbeats for cron jobs and workers that have none.
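The HTTP side can be sketched with the standard library alone. This is a simplified illustration, assuming a 2xx status counts as healthy and a five-second limit; judge and check_url are hypothetical names, and a real checker would also run from several locations.

```python
import http.client
import time
import urllib.request


def judge(status, elapsed_seconds, body, max_seconds=5.0, keyword=None):
    """Decide whether one response counts as up."""
    if not 200 <= status < 300:
        return False  # wrong status code
    if elapsed_seconds > max_seconds:
        return False  # too slow
    if keyword is not None and keyword not in body:
        return False  # expected text missing from the page
    return True


def check_url(url, max_seconds=5.0, keyword=None):
    """Request the URL once and judge the response."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts and the HTTPError that
        # urlopen raises for 4xx and 5xx responses.
        return False
    elapsed = time.monotonic() - started
    return judge(status, elapsed, body, max_seconds, keyword)
```

The keyword check matters because a server can return 200 with an error page; matching on text that only appears when the page renders correctly catches that case.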
Should my status page be public or private?
Public is the norm for customer-facing products: it answers the "are you down?" question for everyone and builds trust with a visible track record. A private page (gated by password or login) suits internal tools or pre-launch products that are not ready to publish uptime. You can start private and switch to public later.