Description
Search before asking
- I had searched in the issues and found no similar issues.
Java Version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
Scala Version
2.12.x
StreamPark Version
2.1.3
Flink Version
1.15.3
deploy mode
kubernetes-application
What happened
The k8s environment was unstable, which caused the Flink pod to restart automatically. After the restart the Flink job recovered and ran normally, but the StreamPark page showed the job status as FAIL.
Looking at the code and logs, the root cause is this: while the watch process was polling the Flink job status, the request to the Flink web endpoint threw an exception, and the follow-up request to the k8s API server to check whether the deployment exists also threw an exception (the etcd server leader had changed). The logic in KubernetesRetriever.isDeploymentExists is flawed here: when an exception occurs it returns false, i.e. it concludes that the deployment does not exist, even though at that moment the deployment actually still exists. Because the deployment was wrongly identified as missing, the job status was updated to FAIL and the watcher for the job was terminated, so even after k8s recovered, the job status in StreamPark never recovered.
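A minimal sketch of the failure mode and one possible fix (hypothetical names, not the actual StreamPark code): instead of a boolean that collapses "the lookup failed" into "the deployment is absent", return a tri-state result so the watcher can keep watching and retry on transient k8s API errors.

```java
import java.util.function.Supplier;

public class DeploymentProbe {

    /** Tri-state result: present, absent, or unknown (the API call failed). */
    public enum Existence { EXISTS, NOT_FOUND, UNKNOWN }

    // Buggy pattern: any exception (e.g. an etcd leader change) is
    // silently treated as "deployment does not exist".
    public static boolean isDeploymentExistsBuggy(Supplier<Boolean> k8sLookup) {
        try {
            return k8sLookup.get();
        } catch (Exception e) {
            return false; // WRONG: a transient API error is reported as "absent"
        }
    }

    // Safer pattern: surface the failure so the caller can retry
    // instead of marking the job FAIL and terminating the watcher.
    public static Existence isDeploymentExists(Supplier<Boolean> k8sLookup) {
        try {
            return k8sLookup.get() ? Existence.EXISTS : Existence.NOT_FOUND;
        } catch (Exception e) {
            return Existence.UNKNOWN; // transient k8s API failure: conclude nothing
        }
    }

    public static void main(String[] args) {
        // Simulate the incident: the k8s API server is unreachable.
        Supplier<Boolean> failingLookup = () -> {
            throw new RuntimeException("etcd leader changed");
        };
        System.out.println(isDeploymentExistsBuggy(failingLookup)); // false -> job wrongly marked FAIL
        System.out.println(isDeploymentExists(failingLookup));      // UNKNOWN -> watcher can retry
    }
}
```

With the tri-state return, only a confirmed NOT_FOUND should transition the job to FAIL; UNKNOWN should leave the current status untouched and keep the watcher alive.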
This problem can easily be reproduced by simulating network packet loss with the chaosblade tool.
chaosblade docs: https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn/blade-create-network-loss
In my environment, the k8s API server address is https://172.18.64.242:8443 and the Flink web endpoint is https://172.18.64.242:443
So the command I used to inject packet loss is: ./blade create network loss --percent 100 --interface eth0 --remote-port 443,8443 --destination-ip 172.18.64.242 --timeout 200
In brief: this sets the packet loss rate to 100% for requests to IP 172.18.64.242 on ports 443 and 8443, for a duration of 200 seconds.
After starting a Flink job, inject the network fault with the command above and then check the StreamPark page: the job status changes to FAIL, and even after the network fault clears, the status never returns to RUNNING (while the Flink job itself keeps running normally the whole time).
Error Exception
No response
Screenshots
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct

