# Monitoring ETL Pipelines
ETL (Extract, Transform, Load) pipelines move data between systems on a schedule. These jobs often run for hours and can fail at any stage—connection timeouts, data validation errors, or disk space issues.
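A failure at any stage should surface as a non-zero exit code so a monitor can catch it. Here is a minimal sketch of such a job script; the stage scripts (`extract.py`, `transform.py`, `load.py`) and paths are hypothetical placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical ETL job: with `set -euo pipefail`, the first
# failing stage aborts the script with a non-zero exit code.
set -euo pipefail

# Extract: pull raw data from the source system (placeholder)
python extract.py --out /tmp/raw.csv

# Transform: validate and reshape; should exit non-zero on bad data
python transform.py --in /tmp/raw.csv --out /tmp/clean.csv

# Load: write the cleaned data to the target system
python load.py --in /tmp/clean.csv
```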
## Configuration

Set your API key as an environment variable. For cron jobs, add it to your crontab:
```bash
# Edit crontab
crontab -e

# Add at top of crontab
PAKYAS_API_KEY=pk_live_xxxxx

# Then your ETL job
0 0 * * * pakyas monitor etl-pipeline -- python /jobs/data_pipeline.py
```

Or source from a file:

```bash
0 0 * * * . ~/.pakyas_env && pakyas monitor etl-pipeline -- python /jobs/data_pipeline.py
```

See Environment Variables for all options.
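If you go the env-file route, one way to create `~/.pakyas_env` (the `chmod 600` is a suggested precaution for a file holding a live key, not a Pakyas requirement):

```bash
# Create the env file; export so child processes (pakyas) inherit the key
cat > ~/.pakyas_env <<'EOF'
export PAKYAS_API_KEY=pk_live_xxxxx
EOF

# Restrict the file to your user, since it holds a live API key
chmod 600 ~/.pakyas_env
```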
## When to use this

- Pipelines run on a schedule
- Jobs are long-running (minutes to hours)
- Silent failures cause stale or missing data
## Basic example

```bash
pakyas monitor etl-pipeline -- python pipeline.py
```

Pakyas wraps your pipeline, tracks duration, and alerts you if it fails or runs longer than expected.
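If you want to exercise the failure path without waiting for a real run, you can wrap a command that exits non-zero. This assumes `pakyas monitor` accepts any command after `--`, as the examples here do; `false` and `true` are standard shell utilities standing in for a broken and a healthy pipeline:

```bash
# Simulate a failing pipeline run (exits 1)
pakyas monitor etl-pipeline -- false

# Simulate a successful run (exits 0)
pakyas monitor etl-pipeline -- true
```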
## Scheduler setup

```bash
# crontab example - runs every night at midnight
0 0 * * * pakyas monitor etl-pipeline -- python /jobs/data_pipeline.py
```
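One practical note: cron runs jobs with a minimal environment, so if `pakyas` or `python` live outside cron's default PATH, set it explicitly in the crontab. A sketch (the paths are illustrative):

```bash
# Cron's default PATH is minimal; set it so pakyas and python resolve
PATH=/usr/local/bin:/usr/bin:/bin
PAKYAS_API_KEY=pk_live_xxxxx

# Nightly at midnight
0 0 * * * pakyas monitor etl-pipeline -- python /jobs/data_pipeline.py
```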
## What Pakyas detects

- Pipeline exits non-zero
- Pipeline runs longer than expected
- Pipeline never starts (missed schedule)