
Airflow scheduler misses success events of worker pods on kubernetes #41436

@nclaeys

Description

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

We had several sensors that failed to be rescheduled because the scheduler still thought their worker tasks were running.

The root cause was that the scheduler missed an update event from a worker task because the Kubernetes node on which the Airflow worker pod was running got deleted soon after the worker finished successfully. This breaks the assumption in the code that a deletion of a worker pod is only ever issued by Airflow itself. The offending code is in kubernetes_executor_utils.py:

  if (
      event["type"] == "DELETED"
      or POD_EXECUTOR_DONE_KEY in pod.metadata.labels
      or pod.metadata.deletion_timestamp
  ):
      self.log.info(
          "Skipping event for Succeeded pod %s - event for this pod already sent to executor",
          pod_name,
      )
      return

In the logs we only see "skipping event" messages for the worker pods, without a preceding event that was actually processed.
The comment above the if check says the following, which is not necessarily true:

# We get multiple events once the pod hits a terminal state, and we only want to
# send it along to the scheduler once.
# If our event type is DELETED, we have the POD_EXECUTOR_DONE_KEY, or the pod has
# a deletion timestamp, we've already seen the initial Succeeded event and sent it
# along to the scheduler.
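
The failure sequence is easy to see from the raw watch stream. Below is a minimal watch loop (a sketch only, assuming the official kubernetes Python client; the namespace and label selector are placeholders, not taken from the executor code): when the node disappears right after the worker succeeds, the first terminal event the watcher receives for that pod can already be of type DELETED with phase Succeeded, exactly the combination the guard above throws away.

  from kubernetes import client, config, watch

  config.load_kube_config()  # or load_incluster_config() inside the cluster
  v1 = client.CoreV1Api()

  # Stream pod events the way the executor's watcher does. If the node is
  # deleted right after the worker finishes, the only terminal event we may
  # ever observe for that pod is type DELETED with phase Succeeded.
  for event in watch.Watch().stream(
      v1.list_namespaced_pod,
      namespace="airflow",                # placeholder namespace
      label_selector="component=worker",  # placeholder selector
  ):
      pod = event["object"]
      print(event["type"], pod.metadata.name, pod.status.phase)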

Our sensor failed in a way that required it to be rescheduled, and the task was requeued.
From then on the scheduler never scheduled the task, as it thought an instance was still running, and logged the following:

{base_executor.py:284} INFO - queued but still running;

The only way to fix it was to restart the scheduler, which brought its internal state back in sync with Kubernetes.

What you think should happen instead?

The fundamental problem is that the watcher for Kubernetes pod events skipped the success event of the worker.

In order to make sure we process all events, there are 2 options:

  • remove the skipping of events, as it is wrong in certain edge cases. I investigated when it was added, and as far as I can tell it was not added for a specific reason, just as a behind-the-scenes optimisation: Fix KubernetesExecutor sending state to scheduler #30872
  • if you want to handle this reliably in Kubernetes, you should use finalizers for your worker pods. This way you can guarantee that you do not miss any events, and you can use the finalizer to make sure you only process success events once (see the sketch after this list).
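
As a rough illustration of the finalizer option (a sketch only, assuming the official kubernetes Python client; the finalizer name and helper functions are made up for this example and are not Airflow identifiers):

  from kubernetes import client, config

  # Hypothetical finalizer name for illustration; not an Airflow identifier.
  WORKER_FINALIZER = "example.com/airflow-worker-event-processed"

  config.load_incluster_config()
  v1 = client.CoreV1Api()

  def _set_finalizers(name: str, namespace: str, finalizers: list[str]) -> None:
      # JSON-patch "add" replaces the member if it already exists.
      patch = [{"op": "add", "path": "/metadata/finalizers", "value": finalizers}]
      v1.patch_namespaced_pod(name, namespace, patch)

  def add_finalizer(name: str, namespace: str) -> None:
      # With a finalizer present, deleting the node (or the pod) only sets
      # deletion_timestamp; the pod object remains visible to the watcher
      # until the finalizer is removed, so the terminal state cannot be missed.
      pod = v1.read_namespaced_pod(name, namespace)
      finalizers = pod.metadata.finalizers or []
      if WORKER_FINALIZER not in finalizers:
          _set_finalizers(name, namespace, finalizers + [WORKER_FINALIZER])

  def release_after_success_sent(name: str, namespace: str) -> None:
      # Called once the success event has been forwarded to the scheduler;
      # removing the finalizer lets Kubernetes actually delete the pod.
      pod = v1.read_namespaced_pod(name, namespace)
      remaining = [f for f in (pod.metadata.finalizers or []) if f != WORKER_FINALIZER]
      _set_finalizers(name, namespace, remaining)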

How to reproduce

The issue is difficult to reproduce reliably; we notice it on our large production deployment from time to time.
It is, however, easy to see that the code is wrong in certain edge cases. A sketch of one way to approximate the race out-of-band follows.
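
One way to approximate the race without deleting a node (an untested sketch, assuming the official kubernetes Python client; the pod name and namespace are placeholders) is to delete a worker pod out-of-band immediately after it reaches phase Succeeded, before the watcher has processed the success event. Whether the bug triggers depends on timing:

  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()

  # Deleting a just-succeeded worker pod from outside Airflow mimics the node
  # deletion: the watcher then sees a DELETED event for a pod whose success
  # was never forwarded to the scheduler, and the guard above skips it.
  v1.delete_namespaced_pod("worker-pod-name", "airflow")  # placeholders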

Operating System

kubernetes: apache/airflow:slim-2.9.3-python3.11

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.0.1
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.2
apache-airflow-providers-fab==1.2.2
apache-airflow-providers-ftp==3.10.0
apache-airflow-providers-http==4.12.0
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-opsgenie==4.0.0
apache-airflow-providers-postgres==5.11.2
apache-airflow-providers-slack==7.3.2
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1

Deployment

Other 3rd-party Helm chart

Deployment details

Kubernetes deployment

Anything else?

/

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
