Airflow Memory Leak in Dag-Processor, Scheduler, Worker, Triggerer #58509

@tfagan25

Apache Airflow version

3.1.3

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

Memory grows steadily in all of the Airflow components named in the title. The scheduler, for example, appears to grow by about 4.5MB/hour in a base deployment.
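For anyone who wants to check a growth figure like this against their own pods, here is a minimal sketch of how the rate and the time left before hitting a limit could be estimated. It assumes memory readings have already been collected as (hours, MB) pairs; the sample values and both function names are hypothetical and only chosen to illustrate a 4.5MB/hour slope against the scheduler's 2Gi limit from the values under "How to reproduce".

```python
def growth_rate_mb_per_hour(samples):
    """Least-squares slope of (hours_since_start, memory_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


def hours_until_limit(current_mb, limit_mb, rate_mb_per_hour):
    """Hours until the limit is reached, assuming growth stays linear."""
    if rate_mb_per_hour <= 0:
        return float("inf")  # flat or shrinking memory never reaches the limit
    return max(limit_mb - current_mb, 0.0) / rate_mb_per_hour


# Hypothetical readings, made up for illustration (not measured data).
samples = [(0, 500.0), (1, 504.5), (2, 509.0), (3, 513.5)]
rate = growth_rate_mb_per_hour(samples)
remaining = hours_until_limit(samples[-1][1], 2048.0, rate)  # 2Gi = 2048 MiB
print(f"growth: {rate:.1f} MB/hour, limit reached in about {remaining:.0f} hours")
```

With real data, the slope is only meaningful if the samples cover a window without pod restarts, since a restart resets memory and would drag the fitted rate down.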

Evidence of minimal deployment (only the base helm chart + required changes for internal gatekeeper policies - no Dags added):

Dag Processor:
[Image: memory utilization graph]

Scheduler:
[Image: memory utilization graph]

The same holds for the other components mentioned. The graphs above cover only a few hours and show memory utilization as a percentage of the limits set in the values below.

Evidence of memory growth in my full deployment (gitSync, <10 Dags, other changes):

[Image: memory utilization graph]

The large gaps represent restarts of these pods. Memory has been observed to grow nonstop over multiple days until it ultimately reaches the limits. My current mitigation is a cronjob that restarts the deployments.

What you think should happen instead?

Memory should remain generally consistent over time.

How to reproduce

Deploy the Airflow Helm chart version 1.18.0 with Airflow version 3.0.2 or 3.1.3 (the growth was observed in both). Use only the standard values, other than the following:

ingress:
  enabled: false

apiServer:
  resources:
    limits:
      cpu: "4000m"
      memory: "4Gi"
    requests:
      cpu: "200m"
      memory: "500Mi"

createUserJob:
  resources:
    limits:
      cpu: "1000m"
      memory: "1Gi"
    requests:
      cpu: "50m"
      memory: "128Mi"

dagProcessor:
  logGroomerSidecar:
    resources:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "50m"
        memory: "128Mi"
  resources:
    limits:
      cpu: "1000m"
      memory: "2Gi"
    requests:
      cpu: "200m"
      memory: "512Mi"

flower:
  resources:
    limits:
      cpu: "1000m"
      memory: "2Gi"
    requests:
      cpu: "200m"
      memory: "256Mi"

postgresql:
  image:
    repository: bitnami/postgresql
    tag: 16.1.0-debian-11-r15
  primary:
    persistence:
      storageClass: "px-csi-replicated"
      size: 50Gi
    resources:
      limits:
        cpu: "4000m"
        memory: "4Gi"
      requests:
        cpu: "200m"
        memory: "1Gi"

migrateDatabaseJob:
  resources:
    limits:
      cpu: "1000m"
      memory: "1Gi"
    requests:
      cpu: "50m"
      memory: "128Mi"

redis:
  persistence:
    storageClassName: "px-csi-replicated"
  resources:
    limits:
      cpu: "1000m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "128Mi"

registry:
  secretName: regcred

scheduler:
  replicas: 2
  logGroomerSidecar:
    resources:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "50m"
        memory: "128Mi"
  resources:
    limits:
      cpu: "2000m"
      memory: "2Gi"
    requests:
      cpu: "200m"
      memory: "512Mi"

triggerer:
  persistence:
    enabled: false
    storageClassName: "px-csi-replicated"
  logGroomerSidecar:
    enabled: false
    resources:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "50m"
        memory: "128Mi"
  resources:
    limits:
      cpu: "2000m"
      memory: "2Gi"
    requests:
      cpu: "200m"
      memory: "512Mi"

workers:
  persistence:
    storageClassName: "px-csi-replicated"
    size: "20Gi"
  logGroomerSidecar:
    resources:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "50m"
        memory: "128Mi"
  resources:
    limits:
      cpu: "3000m"
      memory: "5Gi"
    requests:
      cpu: "100m"
      memory: "1Gi"

webserver:
  resources:
    limits:
      cpu: "3000m"
      memory: "5Gi"
    requests:
      cpu: "100m"
      memory: "1Gi"

Operating System

Ubuntu 22.04.2 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.8.0
apache-airflow-providers-celery==3.11.0
apache-airflow-providers-cncf-kubernetes==10.5.0
apache-airflow-providers-common-compat==1.7.0
apache-airflow-providers-common-io==1.6.0
apache-airflow-providers-common-messaging==1.0.3
apache-airflow-providers-common-sql==1.27.1
apache-airflow-providers-docker==4.4.0
apache-airflow-providers-elasticsearch==6.3.0
apache-airflow-providers-fab==2.2.1
apache-airflow-providers-ftp==3.13.0
apache-airflow-providers-git==0.0.2
apache-airflow-providers-google==15.1.0
apache-airflow-providers-grpc==3.8.0
apache-airflow-providers-hashicorp==4.2.0
apache-airflow-providers-http==5.3.0
apache-airflow-providers-microsoft-azure==12.4.0
apache-airflow-providers-mysql==6.3.0
apache-airflow-providers-odbc==4.10.0
apache-airflow-providers-openlineage==2.3.0
apache-airflow-providers-opsgenie==5.9.2
apache-airflow-providers-postgres==6.2.0
apache-airflow-providers-redis==4.1.0
apache-airflow-providers-sendgrid==4.1.0
apache-airflow-providers-sftp==5.3.0
apache-airflow-providers-slack==9.1.0
apache-airflow-providers-smtp==2.1.0
apache-airflow-providers-snowflake==6.3.1
apache-airflow-providers-ssh==4.1.0
apache-airflow-providers-standard==1.2.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Details are listed above. To be specific, this deployment uses the CeleryExecutor.

Anything else?

Happy to provide any other details that may help with troubleshooting. I would be interested to hear whether anyone else is experiencing the same issue.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Assignees

No one assigned

    Labels

    affected_version:3.1 (Issues Reported for 3.1)
    area:DAG-processing
    area:core
    kind:bug (This is a clearly a bug)
    needs-triage (label for new issues that we didn't triage yet)
    priority:high (High priority bug that should be patched quickly but does not require immediate new release)
