Selenium 4 - Node taking 3 minutes to be registered as UP #9847

@gtaujeky

🐛 Bug Report

Hi there,

I'm running Selenium Grid 4 using the fully distributed mode on K8s running on GKE.

However, I have an issue with nodes taking a long time to be marked as UP.

As Dynamic Grid (https://github.com/SeleniumHQ/docker-selenium#dynamic-grid-) is not currently supported in Kubernetes, I wrote a cronjob that automatically scales the number of nodes up and down depending on the number of sessions running and the number of sessions in the queue.
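
For illustration, the scaling decision such a cronjob makes can be sketched as below. This is not the actual script: the function name, the `sessions_per_node` parameter, and the minimum and maximum bounds are all assumptions chosen for the example.

```python
import math


def desired_replicas(active_sessions: int, queued_sessions: int,
                     sessions_per_node: int = 1,
                     min_nodes: int = 1, max_nodes: int = 30) -> int:
    """Return a target node count for the current load (hypothetical helper).

    Assumes each node can serve `sessions_per_node` sessions and that the
    deployment should stay between `min_nodes` and `max_nodes` replicas.
    """
    # Total demand is what is running now plus what is waiting in the queue.
    demand = active_sessions + queued_sessions
    # Round up so a partially used node still counts as a whole node.
    needed = math.ceil(demand / sessions_per_node)
    # Clamp the result to the allowed range.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    print(desired_replicas(active_sessions=5, queued_sessions=10))
```

The returned value would then be passed to something like kubectl scale --replicas=<value> deployment, which is the command used in the reproduction steps.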

The script itself works fine (it scales up and down depending on the number of tests), but our tests take a very long time to run because some nodes take a long time to become ready.

To Reproduce

I'm running the fully distributed mode using the following YAML file: https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/k8s-deployment-full-grid.yaml

It is the same as the provided YAML file (https://github.com/SeleniumHQ/docker-selenium/blob/trunk/k8s-deployment-full-grid.yaml), except that I added a namespace to make it easier to manage on my side.

I managed to reproduce this issue by manually scaling the number of nodes up and down (using kubectl scale deployment).

To track this issue, I wrote a small bash script that shows how many nodes are registered and how many of them are UP or DOWN:
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check.sh
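
As a rough sketch of the same idea (not a port of check.sh), the counting step could look like this. It assumes the status payload has the shape of the Grid 4 /status response, with a `value.nodes` list whose entries carry an `availability` field; the sample payload and the function name are made up for the example.

```python
from collections import Counter


def count_nodes_by_availability(status: dict) -> Counter:
    """Count registered nodes per availability value (hypothetical helper).

    Assumes `status` is the parsed JSON of a Grid status response with the
    nodes listed under status["value"]["nodes"].
    """
    nodes = status.get("value", {}).get("nodes", [])
    # Nodes without the field are grouped under "UNKNOWN" instead of failing.
    return Counter(node.get("availability", "UNKNOWN") for node in nodes)


if __name__ == "__main__":
    # Made-up payload for illustration only.
    sample = {
        "value": {
            "ready": True,
            "nodes": [
                {"id": "node-a", "availability": "UP"},
                {"id": "node-b", "availability": "DOWN"},
                {"id": "node-c", "availability": "UP"},
            ],
        }
    }
    counts = count_nodes_by_availability(sample)
    print(f"registered={sum(counts.values())} up={counts['UP']} down={counts['DOWN']}")
```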

You can see all logs here
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check-30.log

Please note that I started from a fresh deployment of Selenium on my k8s cluster using kubectl apply -f k8s-deployment-full-grid.yaml

Events (extract from https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check-30.log)

  • Mon Sep 20 11:41:25 AEST 2021: Started deployment of Selenium Grid to K8s. No node is ready yet.
  • Mon Sep 20 11:41:48 AEST 2021: Node is ready and marked as UP
  • Mon Sep 20 11:44:51 AEST 2021: Scaled up the number of replicas to 30 (kubectl scale --replicas=30 deployment)
  • Mon Sep 20 11:44:57 AEST 2021: After a few seconds, all 30 are marked as UP
  • Mon Sep 20 11:47:04 AEST 2021: Scaled down the number of replicas to 1.
  • Mon Sep 20 11:50:54 AEST 2021: Still waiting for the grid to remove 2 nodes marked as DOWN
  • Mon Sep 20 11:50:59 AEST 2021: Scaled back the number of replicas to 30.
  • Mon Sep 20 11:51:04 AEST 2021: All 30 new nodes are added to the grid, but marked as DOWN
  • Mon Sep 20 11:53:53 AEST 2021: Only about 3 minutes later, all 30 nodes are marked as UP
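
The gaps between the events above can be checked with a small calculation. This sketch assumes all events fall on the same day and uses only the clock times from the list; the function name is illustrative.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS clock times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # First scale-up to 30: replicas requested -> all nodes UP.
    print(elapsed_seconds("11:44:51", "11:44:57"))  # 6
    # Second scale-up to 30: nodes added as DOWN -> all nodes UP.
    print(elapsed_seconds("11:51:04", "11:53:53"))  # 169
```

So the first scale-up reached UP in 6 seconds, while the second took 169 seconds (2 minutes 49 seconds) after the nodes had already been added.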

I captured logs from a few nodes that you can see here:
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-1.log
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-2.log
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-3.log

As you can see, the nodes were added to the grid at

01:51:01.874 INFO [NodeServer.lambda$createHandlers$2] - Node has been added

but were only marked as UP about 3 minutes later.

You can find the full log of the distributor here: https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/distributor.log

I'm not sure if my issue is related to SeleniumHQ/docker-selenium#1337, as I have this issue with the latest version (4.0.0-rc-2-prerelease-20210916).
I also tried to run the suggested version 4.0.0-rc-1-prerelease-20210618, but have the same issue.

Expected behavior

It should not take that long for a node to be marked as UP and ready to run tests.

Test script or set of commands reproducing this issue

Please see above

Environment

Selenium Grid version: 4.0.0-rc-2-prerelease-20210916
