Why Are My Airflow Jobs Running “One Day Late”?

[ datascience engineering production til ] · 3 min read

“Why are our Airflow jobs always one day behind?”, a teammate asked. I popped open the Airflow dashboard; everything looked fine, and I didn’t see any delay.

As he explained the delay he was seeing, I realized there are (at least) two ways to think about scheduling. The second way is so obvious (to me) that I forgot others may not be aware of it (i.e., an expert blind spot).

Cron

For most people, cron is the first type of scheduling they come across and work with. Cron runs jobs at fixed times; you specify jobs in a crontab file, like below.

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                                   7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>

If something is scheduled for 2020-06-14 midnight, it starts at 2020-06-14 midnight. Straightforward.

Airflow (and ETL jobs)

Airflow works a bit differently. First, let’s see what happens. When you schedule a job for 2020-06-14 (Run in the image below), it starts at 2020-06-15 (Started in the image below).

An Airflow Job Seemingly One Day Late


Why is there a day’s delay? In Airflow, the job for 2020-06-14 can only trigger after 2020-06-14 2359hrs. In other words, the job only starts after the scheduled period (i.e., the day of 2020-06-14) has ended. This is so important that the Airflow docs state the following:

Let’s Repeat That. The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period. - Airflow Docs

This is not a bug in Airflow or your DAGs.

Why does it work like that?

I find that it’s helpful to explain it in terms of ETL (extract-transform-load). When you schedule a job for 2020-06-14, you want to process the data for that day; thus, it can only start when the day of 2020-06-14 ends, at 2020-06-15 0000hrs.
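To make the ETL framing concrete, here is a minimal sketch in plain Python (no Airflow imports). The `records` list and the `rows_for_day` helper are hypothetical, made up purely for illustration; the only point is that the full set of rows for 2020-06-14 does not exist until that day is over.

```python
from datetime import datetime, timedelta

# Hypothetical event log of (timestamp, value) pairs; not real data.
records = [
    (datetime(2020, 6, 14, 0, 5), "a"),
    (datetime(2020, 6, 14, 12, 30), "b"),
    (datetime(2020, 6, 14, 23, 59), "c"),
    (datetime(2020, 6, 15, 0, 1), "d"),
]


def rows_for_day(rows, day_start, now):
    """Return values in [day_start, day_start + 1 day) that exist as of `now`."""
    day_end = day_start + timedelta(days=1)
    return [value for ts, value in rows if day_start <= ts < day_end and ts <= now]


day = datetime(2020, 6, 14)

# Running at the start of the day sees nothing: the data has not arrived yet.
at_start = rows_for_day(records, day, now=day)

# Running once the day has ended sees every row for 2020-06-14.
at_end = rows_for_day(records, day, now=day + timedelta(days=1))

print(at_start)  # []
print(at_end)    # ['a', 'b', 'c']
```

A job that processes “the data for 2020-06-14” therefore has nothing complete to work on until 2020-06-15 0000hrs.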

Thus, unlike cron jobs (which start at the scheduled time), Airflow jobs only start after the period of the scheduled time ends. If the period is an hour, it’ll start an hour after. If the period is a day, it’ll start a day after. And so on.
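The rule can be sketched in a few lines of plain Python. The function names `cron_start` and `airflow_start` are hypothetical, used here only to contrast the two mental models; they are not part of cron or Airflow.

```python
from datetime import datetime, timedelta


def cron_start(scheduled: datetime) -> datetime:
    """Cron model: the job starts at the scheduled time itself."""
    return scheduled


def airflow_start(scheduled: datetime, period: timedelta) -> datetime:
    """Airflow model: the job starts once the scheduled period has ended."""
    return scheduled + period


scheduled = datetime(2020, 6, 14)

print(cron_start(scheduled))                         # 2020-06-14 00:00:00
print(airflow_start(scheduled, timedelta(hours=1)))  # 2020-06-14 01:00:00
print(airflow_start(scheduled, timedelta(days=1)))   # 2020-06-15 00:00:00
```

The same scheduled time yields different start times purely because of where in the period each model triggers.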

The Difference between Cron and Airflow

For a daily job, cron jobs run at the start of the day; Airflow jobs run at the end of the day.

If you stumbled on this looking for an answer, I hope this cleared things up. Else, comment below and I’ll help.

