
New node containers retry registration but fail until distributor is restarted #1337

@nextrevision

Description


🐛 Bug Report

When deployed on Kubernetes with distributed components (router, event bus, etc.), new nodes often fail to register until the distributor is restarted (container killed and recreated). The distributor error below is logged only once, not repeatedly, which suggests that the node keeps emitting registration events but the distributor attempts the connection once and never retries.

Distributor error:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.12.136.44:5555
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

Chrome node:

2021-07-15 23:39:29,454 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2021-07-15 23:39:29,456 INFO supervisord started with pid 9
2021-07-15 23:39:30,459 INFO spawned: 'xvfb' with pid 11
2021-07-15 23:39:30,461 INFO spawned: 'vnc' with pid 12
2021-07-15 23:39:30,463 INFO spawned: 'novnc' with pid 13
2021-07-15 23:39:30,475 INFO spawned: 'selenium-node' with pid 14
2021-07-15 23:39:30,555 INFO success: xvfb entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-15 23:39:30,555 INFO success: vnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-15 23:39:30,555 INFO success: novnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-07-15 23:39:30,555 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Appending Selenium options: --register-cycle 30 --register-period 300
Configuring server...
Setting up SE_NODE_HOST...
Setting up SE_NODE_PORT...
Setting up SE_NODE_GRID_URL...
Selenium Grid Node configuration: 
[events]
publish = "tcp://selenium-event-bus:4442"
subscribe = "tcp://selenium-event-bus:4443"

[node]
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
max-sessions = 1

[[node.driver-configuration]]
name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "91.0", "platformName": "Linux"}'
max-sessions = 1

Starting Selenium Grid Node...
23:39:31.848 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
23:39:31.853 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
23:39:32.067 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-event-bus:4442 and tcp://selenium-event-bus:4443
23:39:32.250 INFO [UnboundZmqEventBus.<init>] - Sockets created
23:39:33.279 INFO [UnboundZmqEventBus.<init>] - Event bus ready
23:39:33.525 INFO [NodeServer.createHandlers] - Reporting self as: http://10.12.136.44:5555
23:39:33.581 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
23:39:33.676 INFO [NodeOptions.report] - Adding chrome for {"browserVersion": "91.0","browserName": "chrome","platformName": "Linux","se:vncEnabled": true} 1 times
23:39:33.687 INFO [Node.<init>] - Binding additional locator mechanisms: id, name
23:39:34.154 INFO [NodeServer$1.start] - Starting registration process for node id 2531c040-42c5-47a6-a291-ad5e154e3d7b
23:39:34.159 INFO [NodeServer.execute] - Started Selenium node 4.0.0-rc-1 (revision 74745cf081): http://10.12.136.44:5555
23:39:34.171 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:40:04.178 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:40:34.180 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:41:04.182 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:41:34.184 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:42:04.187 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:42:34.188 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:43:04.190 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
23:43:34.191 INFO [NodeServer$1.lambda$start$1] - Sending registration event...

I tested different combinations of timeouts (e.g. --register-cycle 30 --register-period 300), but the issue is the same. I first noticed this issue, however, using the default node container settings.
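For reference, these flags reach the node process via the container entrypoint, which appends them to the server command line (the `SE_OPTS` variable name below is an assumption inferred from the "Appending Selenium options" log line; the jar path is illustrative):

```shell
# Sketch of how extra node flags are appended at startup.
# SE_OPTS and the jar path are assumptions, not verified against the image.
SE_OPTS="--register-cycle 30 --register-period 300"
CMD="java -jar /opt/selenium/selenium-server.jar node"
echo "Appending Selenium options: ${SE_OPTS}"
CMD="${CMD} ${SE_OPTS}"
echo "${CMD}"
```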

To Reproduce

Steps to reproduce the behavior (including the command to start the containers):

  • Deploy
kubectl create ns selenium-grid
kubectl apply -n selenium-grid -f https://raw.githubusercontent.com/SeleniumHQ/docker-selenium/trunk/k8s-deployment-full-grid.yaml
kubectl rollout status --watch -n selenium-grid deployment/selenium-chrome-node-deployment
kubectl rollout status --watch -n selenium-grid deployment/selenium-distributor-deployment
  • Scale chrome nodes
kubectl scale --replicas=3 -n selenium-grid deployment/selenium-chrome-node-deployment
kubectl rollout status --watch -n selenium-grid deployment/selenium-chrome-node-deployment
  • Check distributor logs for the "Connection refused" error.
kubectl logs -n selenium-grid deployment/selenium-distributor-deployment
  • Restart (delete) distributor container and watch nodes register.
kubectl rollout restart -n selenium-grid deployment/selenium-distributor-deployment
kubectl rollout status --watch -n selenium-grid deployment/selenium-distributor-deployment
kubectl logs -n selenium-grid deployment/selenium-distributor-deployment -f
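The key observation in step 3 can be checked mechanically: the distributor logs the connection failure exactly once per node even though the node keeps re-sending registration events every cycle. A self-contained sketch of that check against a captured log (the sample lines below stand in for real `kubectl logs` output):

```shell
# Count how many times the distributor logged a failed connection attempt.
# Against a live cluster, pipe in the real log instead of the sample:
#   kubectl logs -n selenium-grid deployment/selenium-distributor-deployment
sample_log='Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.12.136.44:5555
Caused by: java.net.ConnectException: Connection refused'
printf '%s\n' "$sample_log" | grep -c 'AnnotatedConnectException'
```

If the count stays at 1 while the node's "Sending registration event..." lines keep accumulating, the distributor is not retrying.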

Expected behavior

The distributor retries connecting to nodes that continue to send registration events, so new nodes eventually register without restarting the distributor.

Test script reproducing this issue (when applicable)

See "To Reproduce".

Environment

OS: k8s (EKS)
Docker-Selenium image version: 4.0.0-rc-1-prerelease-20210713
Docker version: k8s v1.18
Exact Docker command to start the containers: see https://raw.githubusercontent.com/SeleniumHQ/docker-selenium/trunk/k8s-deployment-full-grid.yaml
