Deferrable Operators get stuck as "scheduled" during backfill #25653

@Gollum999

Apache Airflow version

2.3.3

What happened

If you try to backfill a DAG that uses any deferrable operators, those tasks get stuck indefinitely in a "scheduled" state.

If I watch the Grid View, I can see the task state change: "scheduled" (or sometimes "queued") -> "deferred" -> "scheduled". I've tried leaving it in this state for over an hour, but there are no further state changes.
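
For anyone trying to confirm the same symptom, the pattern above can be checked mechanically once you have a record of the task's state changes. The following is only a minimal sketch in plain Python: the helper name `find_stuck_since`, the shape of the `history` list, and the sample timestamps are all hypothetical and not part of Airflow. It assumes the history contains one entry per state change, in chronological order.

```python
import datetime as dt


def find_stuck_since(history, now, threshold=dt.timedelta(hours=1)):
    """Return the time the task became stuck, or None if it is not stuck.

    `history` is a hypothetical, chronologically ordered list of
    (timestamp, state) pairs with one entry per state change.
    A task counts as stuck when its latest transition was
    "deferred" -> "scheduled" and no change followed for `threshold`.
    """
    if len(history) < 2:
        return None
    (_, prev_state), (last_time, last_state) = history[-2], history[-1]
    came_back_from_deferral = prev_state == "deferred" and last_state == "scheduled"
    if came_back_from_deferral and now - last_time >= threshold:
        return last_time
    return None


# Made-up observations mirroring the transitions described in this report.
t0 = dt.datetime(2022, 8, 10, 12, 0)
observed = [
    (t0, "scheduled"),
    (t0 + dt.timedelta(seconds=5), "deferred"),
    (t0 + dt.timedelta(seconds=10), "scheduled"),
]
stuck_since = find_stuck_since(observed, now=t0 + dt.timedelta(hours=2))
print(stuck_since)
```

The one-hour default threshold simply matches how long the task was left in this state; a shorter value would flag the problem sooner at the cost of false positives while the scheduler is merely busy.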

When the task is stuck like this, the log appears empty in the web UI. The corresponding log file does exist on the worker, but it does not contain any errors or warnings that might point to the source of the problem.

Pressing Ctrl-C on the backfill at this point seems to hang on "Shutting down LocalExecutor; waiting for running tasks to finish." Force-killing and restarting the backfill will "unstick" the stuck tasks. However, any deferrable operators downstream of the first will get stuck in the same way, so it takes multiple restarts to get everything to complete successfully.

What you think should happen instead

Deferrable operators should work as normal when backfilling.

How to reproduce

#!/usr/bin/env python3
import datetime
import logging

import pendulum
from airflow.decorators import dag, task
from airflow.sensors.time_sensor import TimeSensorAsync


logger = logging.getLogger(__name__)


@dag(
    schedule_interval='@daily',
    start_date=datetime.datetime(2022, 8, 10),
)
def test_backfill():
    time_sensor = TimeSensorAsync(
        task_id='time_sensor',
        target_time=datetime.time(0).replace(tzinfo=pendulum.UTC),  # midnight - should succeed immediately when the trigger first runs
    )

    @task
    def some_task():
        logger.info('hello')

    time_sensor >> some_task()


dag = test_backfill()


if __name__ == '__main__':
    dag.cli()

airflow dags backfill test_backfill -s 2022-08-01 -e 2022-08-04

Operating System

CentOS Stream 8

Versions of Apache Airflow Providers

None

Deployment

Other

Deployment details

Self-hosted/standalone

Anything else

I was able to reproduce this with the following configurations:

  • standalone mode + SQLite backend + SequentialExecutor
  • standalone mode + Postgres backend + LocalExecutor
  • Production deployment (self-hosted) + Postgres backend + CeleryExecutor

I have not yet found anything telling in any of the backend logs.

Possibly related:

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Assignees

No one assigned

Labels

  • affected_version:2.3 (Issues Reported for 2.3)
  • area:Scheduler (including HA (high availability) scheduler)
  • area:async-operators (AIP-40: Deferrable ("Async") Operators)
  • area:backfill (Specifically for backfill related)
  • kind:bug (This is a clearly a bug)
