[Bug] [AbstractCommandHandler] Workflow's host may be incorrect after recovery, causing API operations to fail. #17109

@reele

Description

Search before asking

  • I have searched the issues and found no similar issues.

What happened

When a workflow is recovered or failed over from the running, failed, stopped, or paused state in a multi-master cluster, the workflow instance's host is not set to the new master's address, so later API operations on that instance may fail.

If the old master no longer exists, the server reports Connection refused. If the old master still exists, the server reports Cannot find the WorkflowExecuteRunnable.
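The two failure modes above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not DolphinScheduler code: the classes `WorkflowInstance` and `Master`, the methods `recover` and `stop`, and the new master's address `10.0.6.24:15678` are all hypothetical names made up for illustration. Only the old master's address and the two error messages come from the report.

```java
import java.util.HashMap;
import java.util.Map;

public class HostRecoverySketch {

    // Hypothetical stand-in for a persisted workflow instance record.
    static class WorkflowInstance {
        final String name;
        String host; // address of the master believed to own the instance

        WorkflowInstance(String name, String host) {
            this.name = name;
            this.host = host;
        }
    }

    // Hypothetical master node that tracks the workflows it is running.
    static class Master {
        final String address;
        final Map<String, WorkflowInstance> running = new HashMap<>();

        Master(String address) {
            this.address = address;
        }

        // Takes over an instance; updateHost=false models the reported bug.
        void recover(WorkflowInstance instance, boolean updateHost) {
            if (updateHost) {
                instance.host = address;
            }
            running.put(instance.name, instance);
        }
    }

    // Models the API server routing a stop request to instance.host.
    static String stop(WorkflowInstance instance, Map<String, Master> aliveMasters) {
        Master target = aliveMasters.get(instance.host);
        if (target == null) {
            return "Connection refused";
        }
        if (target.running.remove(instance.name) == null) {
            return "Cannot find the WorkflowExecuteRunnable";
        }
        return "stopped";
    }

    // Runs one failover scenario and returns the outcome of the stop request.
    static String scenario(boolean oldMasterRestarted, boolean updateHost) {
        String oldAddress = "10.0.6.23:15678"; // address from the log in this report
        String newAddress = "10.0.6.24:15678"; // hypothetical second master

        WorkflowInstance instance = new WorkflowInstance("sleep", oldAddress);
        Master newMaster = new Master(newAddress);
        newMaster.recover(instance, updateHost);

        Map<String, Master> alive = new HashMap<>();
        alive.put(newAddress, newMaster);
        if (oldMasterRestarted) {
            // A restarted master comes back with no in-memory workflows.
            alive.put(oldAddress, new Master(oldAddress));
        }
        return stop(instance, alive);
    }

    public static void main(String[] args) {
        System.out.println(scenario(false, false));
        System.out.println(scenario(true, false));
        System.out.println(scenario(false, true));
    }
}
```

In this model, the stop request only succeeds when the recovering master writes its own address into the instance's host during recovery, which is the behavior the report suggests is missing.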

2025-03-22 18:51:19.676 ERROR [qtp742969054-35] o.a.d.a.e.w.StopWorkflowInstanceExecutorDelegate:[98] - WorkflowInstance: sleep-20250321085059987 stop failed
org.apache.dolphinscheduler.extract.base.exception.RemoteException: Call method to Host(ip=10.0.6.23, port=15678) failed
	at org.apache.dolphinscheduler.extract.base.client.NettyRemotingClient.sendSync(NettyRemotingClient.java:147)
	at org.apache.dolphinscheduler.extract.base.client.SyncClientMethodInvoker.invoke(SyncClientMethodInvoker.java:51)
	at org.apache.dolphinscheduler.extract.base.client.ClientInvocationHandler.invoke(ClientInvocationHandler.java:56)
	at com.sun.proxy.$Proxy830.stopWorkflowInstance(Unknown Source)
	at org.apache.dolphinscheduler.api.executor.workflow.StopWorkflowInstanceExecutorDelegate.stopInMaster(StopWorkflowInstanceExecutorDelegate.java:87)
	at org.apache.dolphinscheduler.api.executor.workflow.StopWorkflowInstanceExecutorDelegate.execute(StopWorkflowInstanceExecutorDelegate.java:52)
	at org.apache.dolphinscheduler.api.executor.workflow.StopWorkflowInstanceExecutorDelegate$StopWorkflowInstanceOperation.execute(StopWorkflowInstanceExecutorDelegate.java:127)

...

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.6.23:15678
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:750)

What you expected to happen

.

How to reproduce

In a multi-master cluster, run a workflow, stop (and then restart) the master that is running the workflow, then stop the workflow from the web UI.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
