Airflow ElasticSearch provider issue #25177

@PatrykKlimowicz

Apache Airflow version

2.3.3 (latest released)

What happened

While using Airflow v2.1.3 in my project, this issue appeared and was solved by adding Offset_Key to the Fluent Bit configuration. Offset_Key appends an offset field to each log record, so the logs can be retrieved in the correct order. We set AIRFLOW__ELASTICSEARCH__OFFSET_FIELD="custom_offset", and the logs were retrieved correctly based on custom_offset and then displayed in the Airflow UI.

I have now upgraded to v2.3.3 and this setup no longer works. I tested the following combinations:

  • AIRFLOW__ELASTICSEARCH__OFFSET_FIELD and Offset_Key have the same value: no offset key is created in the logs, and the logs cannot be obtained from ElasticSearch.
  • AIRFLOW__ELASTICSEARCH__OFFSET_FIELD and Offset_Key have different values: both offset keys are added to the logs and I can see the logs in the UI (the logs are ordered by AIRFLOW__ELASTICSEARCH__OFFSET_FIELD, not by the custom one).
    For backward compatibility I need a configuration in which custom_offset takes precedence over the offset that Airflow inserts.
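
To make the role of the offset field concrete, here is a minimal Python sketch of ordering log documents by a configurable field. It is an illustration only and not Airflow's actual retrieval code: the function name `order_log_lines`, the sample documents, and the choice to raise on a missing field are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Stands in for AIRFLOW__ELASTICSEARCH__OFFSET_FIELD (value taken from this report).
OFFSET_FIELD = "custom_offset"


def order_log_lines(hits: List[Dict[str, Any]], offset_field: str = OFFSET_FIELD) -> List[str]:
    """Return the log messages sorted by `offset_field` (hypothetical helper).

    Raises KeyError when a document lacks the field, which models the
    "no offset key is created, logs cannot be obtained" case.
    """
    missing = [hit for hit in hits if offset_field not in hit]
    if missing:
        raise KeyError(f"{len(missing)} document(s) lack the {offset_field!r} field")
    return [hit["message"] for hit in sorted(hits, key=lambda hit: int(hit[offset_field]))]


# Made-up documents carrying both the third-party offset and an Airflow-style offset.
hits = [
    {"message": "third", "custom_offset": 310, "offset": 3},
    {"message": "first", "custom_offset": 0, "offset": 1},
    {"message": "second", "custom_offset": 120, "offset": 2},
]

print(order_log_lines(hits))
```

In this sketch, a document that does not carry the configured field cannot be ordered at all, which is the situation described in the first combination above.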

As suggested here, I tried lowering the elasticsearch provider version to see which one works for this scenario.

It turned out that the version we used with Airflow v2.1.3, apache-airflow-providers-elasticsearch==2.0.2, works.
I think this change broke our use case, as version 2.0.3 is the first one that does not work for us (see the changelog). With version 2.0.2 I can see that both custom_offset and Airflow's offset are added to the logs, but thanks to AIRFLOW__ELASTICSEARCH__OFFSET_FIELD="custom_offset" the logs are displayed in the correct order.

What you think should happen instead

The offset from Airflow should not conflict with the offset added by a third-party tool, since Airflow does not support sending logs to ElasticSearch, only reading from it.

The problem most likely comes from the flow of the logs, which currently looks like this:

Airflow -> LogFile <- Fluent Bit -> ElasticSearch <- Airflow

so Airflow does not know about the Fluent Bit config (in this specific case) or the name of its offset field.

It would be nice to make the change in version 2.0.3 that I linked above optional, so that we can tell Airflow whether it should create an offset field named by AIRFLOW__ELASTICSEARCH__OFFSET_FIELD or only use that name to retrieve logs. (I do not know the whole logic behind Airflow's log retrieval, so I am not sure this is a good idea.) I think a boolean flag such as AIRFLOW__ELASTICSEARCH__ADD_OFFSET_FIELD could control whether Airflow creates its own offset field, while AIRFLOW__ELASTICSEARCH__OFFSET_FIELD would name the field used either to create and retrieve logs, or only to retrieve them.
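
A rough Python sketch of the proposed behavior follows. Nothing here exists in Airflow: the `add_offset_field` parameter models the suggested AIRFLOW__ELASTICSEARCH__ADD_OFFSET_FIELD flag, and `build_log_document` is a hypothetical helper that stands in for whatever code writes the log record.

```python
from typing import Any, Dict


def build_log_document(
    message: str,
    airflow_offset: int,
    offset_field: str = "offset",
    add_offset_field: bool = True,
) -> Dict[str, Any]:
    """Build one log record (hypothetical helper, not an Airflow API).

    When `add_offset_field` is False, the record is written without the
    offset field, leaving that name free for a third-party shipper
    (for example Fluent Bit's Offset_Key) to fill in.
    """
    document: Dict[str, Any] = {"message": message}
    if add_offset_field:
        document[offset_field] = airflow_offset
    return document


# Flag on: Airflow-style offset is written under the configured name.
with_offset = build_log_document("task started", 42, offset_field="custom_offset")

# Flag off: the name is only used later for retrieval, not for writing.
without_offset = build_log_document(
    "task started", 42, offset_field="custom_offset", add_offset_field=False
)

print(with_offset)
print(without_offset)
```

With the flag off, the configured name would be used only on the read path, so the value written by the third-party tool would be the one that determines the order.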

How to reproduce

Use Airflow v2.3.3.
Use Fluent Bit v1.9.6 and add Offset_Key to its INPUT config.
Use ElasticSearch to store the logs and read them from ElasticSearch in the Airflow UI.

Operating System

AKS

Versions of Apache Airflow Providers

Working case (Airflow 2.1.3):

  • apache-airflow-providers-amazon==2.1.0
  • apache-airflow-providers-celery==2.0.0
  • apache-airflow-providers-cncf-kubernetes==2.0.2
  • apache-airflow-providers-docker==2.1.0
  • apache-airflow-providers-elasticsearch==2.0.2
  • apache-airflow-providers-ftp==2.0.0
  • apache-airflow-providers-google==5.0.0
  • apache-airflow-providers-grpc==2.0.0
  • apache-airflow-providers-hashicorp==2.0.0
  • apache-airflow-providers-http==2.0.0
  • apache-airflow-providers-imap==2.0.0
  • apache-airflow-providers-microsoft-azure==3.1.0
  • apache-airflow-providers-mysql==2.1.0
  • apache-airflow-providers-odbc==2.0.0
  • apache-airflow-providers-postgres==2.0.0
  • apache-airflow-providers-redis==2.0.0
  • apache-airflow-providers-sendgrid==2.0.0
  • apache-airflow-providers-sftp==2.1.0
  • apache-airflow-providers-slack==4.0.0
  • apache-airflow-providers-sqlite==2.0.0
  • apache-airflow-providers-ssh==2.1.0

Not working case (Airflow v2.3.3):

  • apache-airflow-providers-amazon==4.0.0
  • apache-airflow-providers-celery==3.0.0
  • apache-airflow-providers-cncf-kubernetes==4.1.0
  • apache-airflow-providers-docker==3.0.0
  • apache-airflow-providers-elasticsearch==4.0.0
  • apache-airflow-providers-ftp==3.0.0
  • apache-airflow-providers-google==8.1.0
  • apache-airflow-providers-grpc==3.0.0
  • apache-airflow-providers-hashicorp==3.0.0
  • apache-airflow-providers-http==3.0.0
  • apache-airflow-providers-imap==3.0.0
  • apache-airflow-providers-microsoft-azure==4.0.0
  • apache-airflow-providers-mysql==3.0.0
  • apache-airflow-providers-odbc==3.0.0
  • apache-airflow-providers-postgres==5.0.0
  • apache-airflow-providers-redis==3.0.0
  • apache-airflow-providers-sendgrid==3.0.0
  • apache-airflow-providers-sftp==3.0.0
  • apache-airflow-providers-slack==5.0.0
  • apache-airflow-providers-sqlite==3.0.0
  • apache-airflow-providers-ssh==3.0.0

Airflow v2.3.3 is working with apache-airflow-providers-elasticsearch==2.0.2

Deployment

Other 3rd-party Helm chart

Deployment details

We are using Airflow Community Helm chart + Azure Kubernetes Service

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR! (If a fix would otherwise be far off, I can work on the PR to get it sooner.)
