Airflow DAG Monitoring — Catch Silent DAG Failures | Pakyas

Start monitoring your Airflow DAGs free — up to 10 checks, no card required.

Start monitoring free

The problem

Airflow is honest about the runs it executes, but it is silent about the runs it never starts. When the scheduler stalls, a worker dies, a DAG is accidentally paused, or a `start_date`/`schedule` change quietly drops a daily run, there is no failed task to alert on. The Airflow UI shows green for yesterday and simply has no row for today, so nothing pages you. Even within a run, a task that hangs indefinitely, or an SLA miss that depends on `sla_miss_callback` (a notoriously unreliable corner of Airflow), can pass without a clear signal. The result is the worst kind of failure: a daily ETL, backup, or sync that everyone assumes ran, discovered days later when the downstream data is stale. Treating a DAG like a server you poll for "up/down" misses the point. What you actually need to know is "did this DAG run, on time, finish in its expected window, and report success?"

How Pakyas helps

Pakyas monitors Airflow DAGs by execution signal, not by polling. Each DAG run proves it actually executed by sending a signal to a unique ping URL, and Pakyas compares those signals against the schedule you expect. If a DAG never signals within its period plus grace, Pakyas marks it Missing and alerts you. This is the case Airflow itself can't catch, because nothing failed; the run simply never happened. A `/start` ping at the beginning of the run plus a terminal success or `/fail` ping lets Pakyas distinguish four distinct states that Airflow's green/red collapses together: Missing (the run never arrived), Late (it ran but outside its window), Overrunning (it started but is taking far longer than normal, as with a hung task), and Error (the DAG explicitly reported a failure). The signals are sent by Airflow's own success/failure callbacks, but the check itself lives in Pakyas, outside your Airflow deployment. So you still get paged when the scheduler, the metadata database, or the whole cluster is unhealthy and the Airflow UI is unreachable: callbacks that never fire simply show up as a missing signal. Each alert names a specific, actionable state, not just "red."
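To make the four states concrete, here is a minimal sketch of how received signals could map onto them. It is an illustration only, not Pakyas's actual logic: `RunSignals`, `classify`, the `max_runtime` threshold, and the extra `OK`, `Running`, and `Pending` labels are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RunSignals:
    """Signals received for one expected run (hypothetical structure)."""
    started_at: Optional[datetime] = None   # time of the /start ping, if any
    finished_at: Optional[datetime] = None  # time of the terminal ping, if any
    failed: bool = False                    # True if the terminal ping was /fail


def classify(expected_at: datetime, grace: timedelta, max_runtime: timedelta,
             signals: RunSignals, now: datetime) -> str:
    """Map the signals for one expected run to a state name."""
    deadline = expected_at + grace
    if signals.finished_at is not None:
        if signals.failed:
            return "Error"  # the DAG explicitly reported a failure
        # A success that arrives after the deadline ran outside its window.
        return "Late" if signals.finished_at > deadline else "OK"
    if signals.started_at is not None:
        # Started, but no terminal ping yet: is it taking too long?
        if now - signals.started_at > max_runtime:
            return "Overrunning"
        return "Running"
    # No signal at all: Missing once the deadline has passed.
    return "Missing" if now > deadline else "Pending"
```

Note that only the Overrunning branch depends on `started_at`; the other states can be derived from the terminal ping and the schedule alone.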

Set it up

Create a check and copy its ping URL

# In Pakyas, create a check matching your DAG's schedule, e.g.
#   schedule: 0 2 * * *   (matches an Airflow DAG with schedule="0 2 * * *")
# You'll get a unique ping URL:
#   https://ping.pakyas.com/{public_id}
#
# Signal endpoints (no /ping/ prefix):
#   start:   https://ping.pakyas.com/{public_id}/start
#   success: https://ping.pakyas.com/{public_id}
#   fail:    https://ping.pakyas.com/{public_id}/fail

Set the check's schedule to the same cron/interval as your DAG so Pakyas knows when to expect a run and can flag a Missing run.
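As a rough illustration of "period plus grace", the sketch below computes when a daily 02:00 check would be considered Missing. It assumes a fixed daily run time evaluated in UTC (Airflow interprets cron schedules in UTC unless the DAG sets a timezone); the function names are hypothetical and the real evaluation happens inside Pakyas.

```python
from datetime import datetime, time, timedelta, timezone


def next_expected_run(after: datetime, run_time: time) -> datetime:
    """First daily occurrence of run_time strictly after `after`."""
    candidate = datetime.combine(after.date(), run_time, tzinfo=after.tzinfo)
    if candidate <= after:
        candidate += timedelta(days=1)
    return candidate


def missing_deadline(last_signal: datetime, run_time: time,
                     grace: timedelta) -> datetime:
    """Latest moment the next signal may arrive before the run counts as Missing."""
    return next_expected_run(last_signal, run_time) + grace


# Example: last success ping at 02:07 UTC on Jan 1, 30 minutes of grace.
last_ping = datetime(2024, 1, 1, 2, 7, tzinfo=timezone.utc)
deadline = missing_deadline(last_ping, time(2, 0), timedelta(minutes=30))
print(deadline.isoformat())
```

If the DAG and the check disagree on schedule or timezone, the deadline lands at the wrong time and you get false Missing alerts, which is why the two should match.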

Add success and failure callbacks to your DAG

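import logging

import requests
from airflow import DAG
from datetime import datetime

log = logging.getLogger(__name__)

# The success ping; /start and /fail are appended for the other signals.
PAKYAS_URL = "https://ping.pakyas.com/{public_id}"  # replace {public_id}

def _ping(suffix=""):
    # timeout keeps a slow ping from blocking the callback; errors are logged,
    # not raised, so a monitoring hiccup never obscures the real DAG outcome
    try:
        requests.get(f"{PAKYAS_URL}{suffix}", timeout=10).raise_for_status()
    except requests.RequestException as exc:
        log.warning("Pakyas ping failed: %s", exc)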

def on_success(context):
    _ping()           # terminal success ping

def on_failure(context):
    _ping("/fail")    # explicit Error signal

with DAG(
    dag_id="daily_backup",
    schedule="0 2 * * *",            # Airflow 2.4+ uses 'schedule'; use 'schedule_interval' on Airflow < 2.4
    start_date=datetime(2024, 1, 1),
    catchup=False,
    on_success_callback=on_success,  # fires when the whole DAG run succeeds
    on_failure_callback=on_failure,  # fires when the DAG run fails
) as dag:
    ...  # your tasks

DAG-level `on_success_callback` / `on_failure_callback` fire once per DAG run, when the run reaches its terminal state. `requests` ships in most Airflow images; if yours lacks it, use the HTTP provider (`airflow.providers.http`) or the standard library's `urllib.request` instead.
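If you go the standard-library route, a ping helper might look like the sketch below. The helper names (`signal_url`, `safe_ping`), the retry count, and the pause are assumptions made for illustration; only the three endpoint suffixes come from the setup step above. The `opener` parameter exists purely so the function can be exercised without network access.

```python
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger(__name__)

PAKYAS_URL = "https://ping.pakyas.com/{public_id}"  # replace {public_id}


def signal_url(base_url: str, signal: str = "success") -> str:
    """Build the URL for a 'success', 'start', or 'fail' signal."""
    suffixes = {"success": "", "start": "/start", "fail": "/fail"}
    return base_url.rstrip("/") + suffixes[signal]


def safe_ping(url: str, timeout: float = 10, attempts: int = 3,
              pause: float = 1.0, opener=urllib.request.urlopen) -> bool:
    """Send a GET ping; return True on success, False if every attempt fails.

    Errors are logged rather than raised, so a monitoring outage cannot
    raise inside an Airflow callback.
    """
    for attempt in range(1, attempts + 1):
        try:
            # urlopen raises HTTPError (a URLError) for 4xx/5xx responses.
            with opener(url, timeout=timeout):
                return True
        except (urllib.error.URLError, OSError) as exc:
            log.warning("ping attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(pause)
    return False
```

Inside a callback this would be called as `safe_ping(signal_url(PAKYAS_URL, "fail"))`. Keeping the retry count and timeout small matters, because the callback should finish quickly even when the ping endpoint is unreachable.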

(Optional) Send a /start ping when the run begins

# Add a first task that fires the /start signal so Pakyas can also
# detect an Overrunning DAG (started but never finished in its window).
from airflow.operators.bash import BashOperator  # Airflow 2.x; on Airflow 3 use airflow.providers.standard.operators.bash

start_ping = BashOperator(
    task_id="pakyas_start",
    # -m 10 caps the request at 10 seconds and --retry 3 rides out brief network blips
    bash_command="curl -fsS -m 10 --retry 3 https://ping.pakyas.com/{public_id}/start",  # replace {public_id}
    dag=dag,
)

# Make it the first task so it runs before real work:
start_ping >> your_first_real_task

Pairing a /start with the terminal success/fail signal lets Pakyas measure run duration and flag Overrunning (a hung task). Without /start you still get Missing, Late, and Error detection.

A worked example

Suppose you run a daily Postgres-to-S3 backup DAG (`daily_backup`, schedule `0 2 * * *`). Create a Pakyas check with the same `0 2 * * *` schedule and a grace period that covers a normal run plus headroom. Wire the DAG's `on_success_callback` to ping `https://ping.pakyas.com/{public_id}` and `on_failure_callback` to ping `.../fail`, and add a first `pakyas_start` task that hits `.../start`. Now: if the backup task throws (bad credentials, full disk), Airflow runs `on_failure_callback`, Pakyas receives the `/fail` ping and marks the check Error with the run linked. If the scheduler is wedged and the DAG never runs at all, no signal arrives by 02:00 + grace and Pakyas marks it Missing — the exact failure the green Airflow UI would never have surfaced. And if a slow upload causes the run to drag well past its usual window, the dangling `/start` with no terminal ping shows up as Overrunning. One stale daily backup caught the same morning instead of three days later.
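One way to pick "a normal run plus headroom" is to derive the grace period from recent run durations. The sketch below is a heuristic under stated assumptions, not a Pakyas feature: the `suggest_grace` name, the 1.5x headroom factor, and the 5-minute floor are all illustrative choices.

```python
from datetime import timedelta
from typing import Sequence


def suggest_grace(durations: Sequence[timedelta], headroom: float = 1.5,
                  floor: timedelta = timedelta(minutes=5)) -> timedelta:
    """Suggest a grace period: slowest recent run times headroom, never below floor."""
    if not durations:
        return floor
    return max(max(durations) * headroom, floor)


# Example: the last three backups took 12, 20, and 15 minutes.
recent = [timedelta(minutes=12), timedelta(minutes=20), timedelta(minutes=15)]
print(suggest_grace(recent))  # slowest run (20 min) * 1.5
```

Basing the value on the slowest recent run rather than the average keeps an ordinary slow night from triggering an alert, while a genuinely stuck run still gets flagged.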

Pricing

Pakyas has four tiers: Free ($0, up to 10 checks), Developer ($9/mo), Pro ($29/mo), and Business ($99/mo). Most teams can monitor their core Airflow DAGs on the free tier and upgrade only as the number of monitored DAGs grows.

See the full breakdown on the pricing page.

New to the terminology? See the cron monitoring glossary for plain-language definitions of every job state, or explore everything Pakyas tracks on the features page.


Execution-signal precision: know when a job is Missing, Late, Overrunning, or reports an Error — not just up or down.
