Job Monitoring Alerts Without Noise
Engineering Note: This post discusses design decisions behind Pakyas’ job monitoring alerting model.
Every alerting system faces the same paradox: send too few alerts and you miss real problems. Send too many and your team starts ignoring them.
This is alert fatigue. It kills trust. When your phone buzzes at 3am, you should know it matters. If most alerts are noise, the real ones get lost.
Most systems default to noise because it feels safer. Better to over-alert than miss something, right? But when engineers start silencing channels or building mental filters, you have already lost.
Pakyas takes a different approach.
What alerting usually gets wrong
Server-oriented monitoring models treat job execution as binary. But scheduled jobs live in a gray zone:
- Late is not the same as failed
- A job that runs 5 minutes late might be fine
- A job that misses entirely is a problem
- Flapping between states is not three incidents
Most tools blur these distinctions. They show “Last run: OK” and call it a day. That green dot might be lying to you.
In job monitoring, alerts should be driven by execution signals. A heartbeat is a fact. Alerting logic should reason from that fact, not infer health from absence or polling.
How Pakyas thinks about alerts
Pakyas distinguishes between states:
- On Schedule — Job pinged on time
- Late — Job ran, but outside the expected window
- Missing — Job missed its expected ping entirely
- Returned to On Schedule — Job transitioned back to a healthy state after being Late or Missing
Some job types may also enter an Overrunning state when execution exceeds its expected duration, but the alerting principles remain the same.
Each state transition can trigger an alert. But not every transition should. The goal is signal, not volume.
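To make that concrete, here is a minimal sketch of the state model in Python. The enum values and function names are illustrative, not Pakyas' actual types:
```python
from enum import Enum

class JobState(Enum):
    ON_SCHEDULE = "on_schedule"
    LATE = "late"
    MISSING = "missing"
    OVERRUNNING = "overrunning"

def transition_is_notable(previous: JobState, current: JobState) -> bool:
    # Staying in the same state is never an alert candidate;
    # every genuine transition is, subject to the noise gates described below.
    return previous != current

print(transition_is_notable(JobState.ON_SCHEDULE, JobState.LATE))  # True
print(transition_is_notable(JobState.MISSING, JobState.MISSING))   # False
```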
When a problem becomes critical
States tell you what is happening. Critical tells you when it needs attention now.
A job can be Missing for two minutes (probably recovering) or Missing for three hours (definitely broken). Pakyas tracks both, but only the second warrants interrupting someone.
A job enters critical state when:
- It has been Missing beyond a configured timeout
- Failures have repeated past a tolerance threshold
- Execution has overrun far longer than expected
- The job itself explicitly signals critical
Critical is edge-triggered: Pakyas notifies once when a job enters critical, not continuously while it stays there. This is a deliberate noise reduction. If you got the first alert, you do not need 47 reminders.
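A rough sketch of edge-triggered critical detection, with made-up field names and thresholds:
```python
from dataclasses import dataclass

@dataclass
class CheckRecord:
    missing_seconds: float = 0.0
    consecutive_failures: int = 0
    is_critical: bool = False   # remembered so we only fire on the rising edge

def entered_critical(check: CheckRecord,
                     missing_timeout: float = 3 * 3600,
                     failure_tolerance: int = 3) -> bool:
    """True only at the moment the check crosses into critical."""
    now_critical = (check.missing_seconds >= missing_timeout
                    or check.consecutive_failures >= failure_tolerance)
    fired = now_critical and not check.is_critical
    check.is_critical = now_critical
    return fired

check = CheckRecord(missing_seconds=4 * 3600)
print(entered_critical(check))  # True: first crossing into critical
print(entered_critical(check))  # False: still critical, no repeat alert
```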
This also means you can configure different notification channels for different severity levels. Routine state changes can go to Slack. Critical situations can interrupt via SMS. The escalation path is explicit, not implicit.
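Routing by severity can be as simple as a lookup table. The channel identifiers below are placeholders:
```python
# Routine transitions go to chat; critical interrupts via SMS.
SEVERITY_CHANNELS = {
    "state_change": ["slack:#cron-monitoring"],
    "critical": ["slack:#cron-monitoring", "sms:+15550000000"],
}

def channels_for(severity: str) -> list[str]:
    return SEVERITY_CHANNELS.get(severity, [])
```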
The alert pipeline
When a ping arrives or a deadline passes, Pakyas runs it through several checks before sending a notification.
The alerting pipeline works in three stages: state evaluation, noise reduction, and delivery.
```mermaid
flowchart LR
  subgraph Evaluate
    E1[State change?] --> E2{Maintenance?}
    E2 -->|No| E3{Threshold met?}
  end
  subgraph Reduce
    E3 -->|Yes| R1{Flapping?}
    R1 -->|No| R2{Throttled?}
    R2 -->|No| SEND[Send alert]
  end
  E2 -->|Yes| SKIP[Suppress]
  E3 -->|No| SKIP
  R1 -->|Yes| SKIP
  R2 -->|Yes| SKIP
```
In short:
- State must actually change (no duplicate alerts for the same status)
- Maintenance windows suppress alerts during planned downtime
- Thresholds prevent alerting on single glitches
- Flap dampening handles rapid state transitions
- Throttling controls reminder frequency
Each gate removes noise while preserving signal.
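Sketched as code, the pipeline is just a chain of gates: each gate either passes the event along or returns a suppression reason. Names here are illustrative:
```python
def run_pipeline(event: dict, gates: list) -> str | None:
    """Return the first suppression reason, or None if the alert should send."""
    for gate in gates:
        reason = gate(event)
        if reason is not None:
            return reason
    return None

def state_unchanged(event):
    return "duplicate state" if event["old"] == event["new"] else None

def in_maintenance(event):
    return "maintenance window" if event.get("maintenance") else None

def below_threshold(event):
    return "threshold not met" if event["failures"] < event["threshold"] else None

event = {"old": "on_schedule", "new": "missing",
         "maintenance": False, "failures": 2, "threshold": 2}
gates = [state_unchanged, in_maintenance, below_threshold]
print(run_pipeline(event, gates) or "send alert")   # "send alert"
```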
Noise reduction in practice
Thresholds
A single missed ping might be a network blip. Two in a row is a pattern.
A configurable failure threshold controls how many consecutive failures must occur before an alert fires. Set it to 1 for critical jobs, 3 for flaky ones.
This prevents 3am pages for one-off timeouts.
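A consecutive-failure counter is enough to express this. The threshold of 3 below is an example, not a default:
```python
def should_alert(history: list[bool], threshold: int = 3) -> bool:
    """history is newest-last; True means the ping failed or was missed."""
    streak = 0
    for failed in reversed(history):
        if not failed:
            break
        streak += 1
    return streak >= threshold

print(should_alert([False, True, True]))               # False: only 2 in a row
print(should_alert([True, False, True, True, True]))   # True: 3 in a row
```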
Flap dampening
Some jobs oscillate. They fail, recover, fail again. Without dampening, you get three alerts in five minutes.
Pakyas tracks state history. If a check flips too quickly, alerts are delayed until the state stabilizes. You still see the real-time status in the UI, but notifications wait for clarity.
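One way to express flap dampening, assuming UTC-aware transition timestamps and illustrative limits:
```python
from datetime import datetime, timedelta, timezone

def is_flapping(transition_times: list[datetime],
                window: timedelta = timedelta(minutes=10),
                max_transitions: int = 3) -> bool:
    # If the check changed state this many times in the recent window,
    # hold notifications until it settles.
    cutoff = datetime.now(timezone.utc) - window
    recent = [t for t in transition_times if t >= cutoff]
    return len(recent) >= max_transitions
```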
Throttling
Different events deserve different treatment:
| Event | Behavior |
|---|---|
| Late | Alert once per incident |
| Missing | Remind every 6 hours |
| Overrunning | Remind every 15 min (max 10) |
| Returned to On Schedule | Always notify |
A check that stays down for 24 hours sends 4 reminders, not 1,440.
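The same policy expressed as data, with intervals and caps standing in for whatever you configure:
```python
from datetime import timedelta

THROTTLE_POLICY = {
    "late":        {"interval": None,                  "max_reminders": 1},
    "missing":     {"interval": timedelta(hours=6),    "max_reminders": None},
    "overrunning": {"interval": timedelta(minutes=15), "max_reminders": 10},
    "recovered":   {"interval": None,                  "max_reminders": 1},
}

def may_remind(event_type: str, sent_so_far: int, since_last: timedelta) -> bool:
    policy = THROTTLE_POLICY[event_type]
    cap = policy["max_reminders"]
    if cap is not None and sent_so_far >= cap:
        return False
    if policy["interval"] is None:
        return sent_so_far == 0                     # one-shot events
    return sent_so_far == 0 or since_last >= policy["interval"]

print(may_remind("missing", sent_so_far=2, since_last=timedelta(hours=7)))  # True
print(may_remind("late", sent_so_far=1, since_last=timedelta(hours=7)))     # False
```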
Maintenance windows
Deployments happen. Planned downtime is not an incident.
Maintenance windows suppress alerts during scheduled periods. One-time windows for specific deployments, recurring windows for regular maintenance.
When maintenance ends, Pakyas recalculates expectations without marking jobs as immediately late.
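A sketch of window matching, assuming two window shapes (one-time and daily recurring) for illustration:
```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class OneTimeWindow:
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end

@dataclass
class DailyWindow:
    start: time     # e.g. 02:00
    end: time       # e.g. 03:00

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment.time() < self.end

def in_maintenance(moment: datetime, windows: list) -> bool:
    return any(w.covers(moment) for w in windows)

windows = [DailyWindow(start=time(2, 0), end=time(3, 0))]
print(in_maintenance(datetime(2024, 6, 1, 2, 30), windows))  # True: suppressed
```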
Smart delivery
Once an alert passes all filters, it needs to reach the right people reliably.
Recipient inheritance
Alert channels cascade from organization to project to check:
- Organization: Default channels for all projects
- Project: Override or add channels for specific projects
- Check: Override or add channels for specific checks
You can see exactly where alerts will go before they fire by previewing the effective recipients.
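The resolution logic is essentially a fold over the hierarchy. The "override"/"add" structure below is an assumption about shape, not the actual config format:
```python
def effective_channels(org: list[str],
                       project: dict | None = None,
                       check: dict | None = None) -> list[str]:
    channels = list(org)
    for level in (project, check):
        if not level:
            continue
        if level.get("override"):
            channels = list(level["override"])   # replace inherited channels
        channels += level.get("add", [])         # or extend them
    return channels

print(effective_channels(
    org=["email:oncall@example.com"],
    project={"add": ["slack:#payments"]},
    check={"override": ["sms:+15550000000"]},
))  # ['sms:+15550000000']
```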
Webhook retry logic
Not all failures deserve retries.
```mermaid
flowchart LR
  SEND[Send] --> R{Response}
  R -->|2xx| OK[Delivered]
  R -->|401/404| FAIL[Mark failed]
  R -->|429| WAIT[Respect Retry-After]
  R -->|5xx| BACK[Exponential backoff]
  WAIT --> SEND
  BACK --> SEND
```
Permanent failures (401, 404) stop immediately. Rate limits (429) respect the server’s backoff request. Server errors (5xx) and timeouts retry with exponential backoff.
This prevents wasting resources on endpoints that will never accept the webhook.
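The decision table above translates to a few lines of code. Backoff constants here are illustrative:
```python
import random

def next_action(status: int, attempt: int, retry_after: float | None = None):
    """Return (outcome, delay_seconds) for a webhook delivery attempt."""
    if 200 <= status < 300:
        return ("delivered", None)
    if status in (401, 404):
        return ("failed", None)                    # permanent, never retry
    if status == 429:
        return ("retry", retry_after or 60.0)      # honor the server's Retry-After
    # 5xx and timeouts: exponential backoff with jitter, capped at 5 minutes
    delay = min(2 ** attempt, 300) + random.uniform(0, 1)
    return ("retry", delay)
```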
Audit trail
Every alert decision is logged. Not just the alerts that fired, but the ones that were suppressed.
“Why didn’t I get alerted?” is a valid question. The audit trail answers it:
- Was the check in maintenance?
- Was the threshold not yet met?
- Was it throttled as a duplicate?
- Was flap dampening active?
This helps debug unexpected silence and tune settings without guessing.
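A sketch of what each audit record might carry; the schema is illustrative:
```python
import json
from datetime import datetime, timezone

def log_decision(check_id: str, event: str, sent: bool, reason: str | None) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "check_id": check_id,
        "event": event,
        "sent": sent,
        "suppression_reason": reason,   # e.g. "maintenance", "threshold", None
    }
    print(json.dumps(record))           # in practice: append to a durable store

log_decision("backup-db", "missing", sent=False, reason="maintenance")
```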
Technical notes
For engineers building similar systems, here are a few implementation details.
Deterministic event IDs
Each alert needs a stable ID for deduplication. Pakyas uses a composite key of the check ID, timestamp, and event type to deduplicate incidents. This ensures:
- Multiple processors can evaluate the same event without duplicates
- Reminders attach to the original incident
- Recovery events close the correct incident
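A hash of that composite key gives a stable ID that any worker can recompute. The exact key layout below is an assumption:
```python
import hashlib

def incident_id(check_id: str, started_at_iso: str, event_type: str) -> str:
    # Same check, incident start, and event type always hash to the same ID,
    # so concurrent processors deduplicate naturally.
    key = f"{check_id}:{started_at_iso}:{event_type}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

a = incident_id("backup-db", "2024-06-01T03:00:00Z", "missing")
b = incident_id("backup-db", "2024-06-01T03:00:00Z", "missing")
assert a == b   # two workers evaluating the same event agree on the ID
```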
Async delivery
Alerts are enqueued, not sent inline. A separate notifier service handles delivery with:
- Parallel delivery to multiple channels
- Independent retry queues per destination
- No blocking the main processing pipeline
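A toy version using Python's standard-library queue shows the shape of the decoupling; the deliver function is a stand-in for real channel integrations:
```python
import queue
import threading

alert_queue: queue.Queue = queue.Queue()

def enqueue_alert(payload: dict) -> None:
    alert_queue.put(payload)            # returns immediately, never blocks on I/O

def deliver(payload: dict) -> None:
    print("delivering", payload)        # stand-in for Slack/SMS/webhook calls

def notifier_worker() -> None:
    while True:
        payload = alert_queue.get()
        try:
            deliver(payload)
        finally:
            alert_queue.task_done()

threading.Thread(target=notifier_worker, daemon=True).start()
enqueue_alert({"check": "backup-db", "event": "missing"})
alert_queue.join()                      # wait for delivery in this toy example
```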
Status changed timestamp
A state change timestamp tracks when the current status began. This solves the “late → on schedule → late within one minute” problem where naive bucketing suppresses distinct incidents.
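A sketch of how that timestamp travels with the status, so back-to-back incidents within the same minute stay distinct:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CheckStatus:
    state: str
    status_changed_at: datetime   # when the *current* status began

def apply_transition(current: CheckStatus, new_state: str, now: datetime) -> CheckStatus:
    if new_state == current.state:
        return current                               # same incident continues
    return CheckStatus(state=new_state, status_changed_at=now)
```
Combined with the deterministic IDs above, each "late" episode gets its own start timestamp and therefore its own incident, even if two episodes land in the same minute.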
The result
Fewer alerts. Higher signal. Less noise.
When your team sees a Pakyas notification, it means something changed and stayed changed long enough to matter. That trust is worth more than any individual alert.
Alert fatigue is a design problem, not an inevitable cost.