🐛 Bug Report
Hi there,
I'm running Selenium Grid 4 using the fully distributed mode on K8s running on GKE.
However, I have an issue with nodes taking a long time to be marked as UP.
As Dynamic Grid (https://github.com/SeleniumHQ/docker-selenium#dynamic-grid-) is not currently supported in Kubernetes, I wrote a cronjob that automatically scales the number of nodes up and down depending on the number of running sessions and sessions in the queue.
While the script works fine (it scales up and down with the number of tests), our tests take a very long time to run because some nodes take a long time to become ready.
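A minimal sketch of the kind of scaling rule such a cronjob might apply is shown below. It is not the actual script: it assumes one slot per session, and the function name, `sessions_per_node`, and the `min_nodes`/`max_nodes` bounds are hypothetical. The returned value would then be passed as the `--replicas` value to `kubectl scale`.

```python
import math


def desired_replicas(running_sessions: int, queued_sessions: int,
                     sessions_per_node: int = 1,
                     min_nodes: int = 1, max_nodes: int = 30) -> int:
    """Return how many node replicas the current load needs (hypothetical rule).

    Every running or queued session is assumed to need one slot, and each
    node is assumed to offer `sessions_per_node` slots. The result is
    clamped to the range [min_nodes, max_nodes].
    """
    demand = running_sessions + queued_sessions
    needed = math.ceil(demand / sessions_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Example: 5 running + 20 queued sessions, one slot per node.
print(desired_replicas(running_sessions=5, queued_sessions=20))  # 25
```

Clamping to `max_nodes` keeps a sudden queue spike from requesting more nodes than the cluster can schedule, and `min_nodes` keeps at least one warm node available.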
To Reproduce
I'm running the fully distributed mode using the following YAML file: https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/k8s-deployment-full-grid.yaml
It is the same as the provided YAML file (https://github.com/SeleniumHQ/docker-selenium/blob/trunk/k8s-deployment-full-grid.yaml); I only added a namespace to make it easier to manage on my side.
I managed to reproduce this issue by manually scaling the number of nodes up and down (using kubectl scale deployment).
To track this issue, I wrote a small bash script that shows how many nodes are registered and how many are UP or DOWN:
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check.sh
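A rough Python equivalent of such a check is sketched below. It assumes the Grid 4 `/status` endpoint returns the registered nodes under `value.nodes`, each with an `availability` field (UP, DOWN or DRAINING); the function names and the router address are hypothetical and may need adjusting for a given deployment.

```python
import json
from collections import Counter
from urllib.request import urlopen


def count_node_availability(status: dict) -> Counter:
    """Count nodes per availability state (e.g. UP, DOWN, DRAINING).

    Assumes the /status payload shape {"value": {"nodes": [...]}}, where
    each node entry carries an "availability" field.
    """
    nodes = status.get("value", {}).get("nodes", [])
    return Counter(node.get("availability", "UNKNOWN") for node in nodes)


def fetch_status(grid_url: str) -> dict:
    """Fetch and decode the Grid /status JSON from a (hypothetical) router URL."""
    with urlopen(f"{grid_url.rstrip('/')}/status", timeout=10) as resp:
        return json.load(resp)


def summarize(grid_url: str) -> str:
    """One-line summary similar to check.sh: registered, UP and DOWN counts."""
    counts = count_node_availability(fetch_status(grid_url))
    return (f"registered={sum(counts.values())} "
            f"up={counts['UP']} down={counts['DOWN']}")
```

Calling `summarize("http://localhost:4444")` in a loop (a hypothetical address, e.g. the router exposed with `kubectl port-forward`) gives roughly the same registered/UP/DOWN view over time.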
You can see all logs here:
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check-30.log
Please note that I started from a fresh deployment of Selenium to my K8s cluster using kubectl apply -f k8s-deployment-full-grid.yaml
Events (extracted from https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/check-30.log):
Mon Sep 20 11:41:25 AEST 2021: Started deployment of Selenium Grid to K8s. No node is ready yet.
Mon Sep 20 11:41:48 AEST 2021: Node is ready and marked as UP
Mon Sep 20 11:44:51 AEST 2021: Scaled up the number of replicas to 30 (kubectl scale --replicas=30 deployment)
Mon Sep 20 11:44:57 AEST 2021: After a few seconds, all 30 nodes are marked as UP
Mon Sep 20 11:47:04 AEST 2021: Scaled down the number of replicas to 1.
Mon Sep 20 11:50:54 AEST 2021: Still waiting for the grid to remove 2 nodes marked as DOWN
Mon Sep 20 11:50:59 AEST 2021: Scaled back the number of replicas to 30.
Mon Sep 20 11:51:04 AEST 2021: All 30 new nodes are added to the grid, but marked as DOWN
Mon Sep 20 11:53:53 AEST 2021: Only about 3 minutes later are all 30 nodes marked as UP
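To make the contrast concrete, the delays can be computed directly from the timestamps in the event list. This is only a small worked calculation; the helper name is illustrative.

```python
from datetime import datetime

# Same-day wall-clock timestamps (AEST, Sep 20 2021) taken from the event list.
FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Elapsed whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


first_scale_up = seconds_between("11:44:51", "11:44:57")   # fresh grid
second_scale_up = seconds_between("11:50:59", "11:53:53")  # after scale-down
added_to_up = seconds_between("11:51:04", "11:53:53")      # registered -> UP

print(first_scale_up, second_scale_up, added_to_up)  # 6 174 169
```

So the first scale-up to 30 nodes completes in about 6 seconds, while the second one, right after scaling down, takes almost 3 minutes.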
I captured logs from a few nodes that you can see here:
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-1.log
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-2.log
https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/chrome-node-3.log
As you can see, the nodes were added to the grid at
01:51:01.874 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
but were only marked as UP about 3 minutes later.
You can find the full log of the distributor here: https://github.com/gtaujeky/selenium-autoscaling-issue/blob/master/distributor.log
I'm not sure whether my issue is related to SeleniumHQ/docker-selenium#1337, since I have this issue with the latest version (4.0.0-rc-2-prerelease-20210916).
I also tried running the suggested version 4.0.0-rc-1-prerelease-20210618, but I get the same issue.
Expected behavior
It should not take that long for a node to be marked as UP and ready to run tests.
Test script or set of commands reproducing this issue
Please see above
Environment
Selenium Grid version: 4.0.0-rc-2-prerelease-20210916