Description
Search before asking
- I had searched in the issues and found no similar issues.
Java Version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
Scala Version
2.12.x
StreamPark Version
2.1.3
Flink Version
1.15.3
deploy mode
kubernetes-application
What happened
The k8s environment was unstable, which caused the Flink pod to restart automatically. After the restart the Flink job recovered and ran normally, but the StreamPark page showed the job status as FAIL.
Looking at the code and logs, the root cause is this: while the watch process was polling the Flink job status, the request to the Flink web endpoint threw an exception, and the follow-up request to the k8s API server to check whether the deployment exists also threw an exception (the etcd server leader had changed). The logic in KubernetesRetriever.isDeploymentExists is flawed here: when an exception occurs it returns false, i.e. it concludes that the deployment does not exist, even though at that moment the deployment actually still exists. Because the deployment was wrongly identified as missing, the job status was updated to FAIL and the watcher for the job was terminated, so even after k8s recovered, the job status in StreamPark never recovered.
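A minimal sketch of the failure mode and one possible fix (hypothetical names, not the actual StreamPark code): instead of a boolean that collapses "the lookup failed" into "the deployment is absent", return a tri-state result so the watcher can keep watching and retry on transient k8s API errors.

```java
import java.util.function.Supplier;

public class DeploymentProbe {

    /** Tri-state result: present, absent, or unknown (the API call failed). */
    public enum Existence { EXISTS, NOT_FOUND, UNKNOWN }

    // Buggy pattern: any exception (e.g. an etcd leader change) is
    // silently treated as "deployment does not exist".
    public static boolean isDeploymentExistsBuggy(Supplier<Boolean> k8sLookup) {
        try {
            return k8sLookup.get();
        } catch (Exception e) {
            return false; // WRONG: a transient API error is reported as "absent"
        }
    }

    // Safer pattern: surface the failure so the caller can retry
    // instead of marking the job FAIL and terminating the watcher.
    public static Existence isDeploymentExists(Supplier<Boolean> k8sLookup) {
        try {
            return k8sLookup.get() ? Existence.EXISTS : Existence.NOT_FOUND;
        } catch (Exception e) {
            return Existence.UNKNOWN; // transient k8s API failure: conclude nothing
        }
    }

    public static void main(String[] args) {
        // Simulate the incident: the k8s API server is unreachable.
        Supplier<Boolean> failingLookup = () -> {
            throw new RuntimeException("etcd leader changed");
        };
        System.out.println(isDeploymentExistsBuggy(failingLookup)); // false -> job wrongly marked FAIL
        System.out.println(isDeploymentExists(failingLookup));      // UNKNOWN -> watcher can retry
    }
}
```

With the tri-state return, only a confirmed NOT_FOUND should transition the job to FAIL; UNKNOWN should leave the current status untouched and keep the watcher alive.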
This problem can easily be reproduced by simulating network packet loss with the chaosblade tool.
chaosblade docs: https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn/blade-create-network-loss
In my environment, the k8s API server address is https://172.18.64.242:8443 and the Flink web endpoint is https://172.18.64.242:443
So the command I used to inject packet loss is: ./blade create network loss --percent 100 --interface eth0 --remote-port 443,8443 --destination-ip 172.18.64.242 --timeout 200
In brief: this sets the packet loss rate to 100% for requests to IP 172.18.64.242 on ports 443 and 8443, for a duration of 200 seconds.
After starting a Flink job, inject the network fault with the command above and then check the StreamPark page: the job status changes to FAIL, and even after the network fault clears, the status never returns to RUNNING (while the Flink job itself keeps running normally the whole time).
Error Exception
No response
Screenshots
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct

