
[do not merge] enable retry-loop on newer versions of Windows as well #43145

Draft
thaJeztah wants to merge 1 commit into moby:master from thaJeztah:windows_retry_remove

@thaJeztah
Member

I stumbled upon this comment when looking for code that could be removed now that
Windows RS1 reaches EOL:

    // We have a retry loop for ErrVmcomputeOperationInvalidState and
    // ErrVmcomputeOperationAccessIsDenied as there is a race condition
    // in RS1 and RS2 building during enumeration when a silo is going away
    // for example under it, in HCS. AccessIsDenied added to fix 30278.
    //
    // TODO: For RS3, we can remove the retries. Also consider
    // using platform APIs (if available) to get this more succinctly. Also
    // consider enhancing the Remove() interface to have context of why
    // the remove is being called - that could improve efficiency by not
    // enumerating compute systems during a remove of a container as it's
    // not required.

We've recently been seeing various failures in CI since we updated / rebuilt the
Jenkins agents:

    === FAIL: github.com/docker/docker/integration-cli TestDockerSuite/TestContainerAPIChunkedEncoding (0.12s)
    check_test.go:170: assertion failed: error is not nil: Error response from daemon: container 7819ec9f095a635eeb94806db9cbc767eef82837785085a202548c3e809b93c0: driver "windowsfilter" failed to remove root filesystem: failed to detach VHD: failed to detach virtual disk: The device is not ready.: rename D:\CI\PR-43138\2\daemon\windowsfilter\7819ec9f095a635eeb94806db9cbc767eef82837785085a202548c3e809b93c0 D:\CI\PR-43138\2\daemon\windowsfilter\7819ec9f095a635eeb94806db9cbc767eef82837785085a202548c3e809b93c0-removing: Access is denied.: failed to remove 7819ec9f095a635eeb94806db9cbc767eef82837785085a202548c3e809b93c0
    --- FAIL: TestDockerSuite/TestContainerAPIChunkedEncoding (0.12s)

I'm wondering whether a regression in Windows has reintroduced this problem
(#30278), so let's see whether enabling this retry loop on current versions of
Windows resolves those CI failures.


@thaJeztah
Member Author

thaJeztah commented Jan 19, 2022

Comparing RS5 machines (old/new) in Jenkins (sanitised / removed some unrelated diffs)

rs5-before.txt
rs5-after.txt

diff --git a/rs5-before.txt b/rs5-after.txt
index cd6de6ff0d..80581c0d75 100644
--- a/rs5-before.txt
+++ b/rs5-after.txt
@@ -1,7 +1,7 @@
-Old machine (RS5)
+New machine (RS5):
 
- INFO: executeCI.ps1 starting at Thu Dec 23 16:19:49 CUT 2021
+ INFO: executeCI.ps1 starting at Wed Jan 12 17:42:40 CUT 2022
 
  INFO: Script version 05-Feb-2019 09:03 PDT
  INFO: Running git version 2.24.1.windows.2
@@ -52,12 +52,14 @@ Old machine (RS5)
  FQDN                           azwin-2-XXXXX.westus.cloudapp.azure.com
  GIT_BRANCH                     PR-43103
  GIT_COMMIT                     XXXXX
+ GIT_PREVIOUS_COMMIT            XXXXX
+ GIT_PREVIOUS_SUCCESSFUL_COMMIT XXXXX
  GIT_URL                        https://github.com/moby/moby.git
  HUDSON_COOKIE                  XXXXXX
  HUDSON_HOME                    /var/cloudbees-jenkins-distribution
  HUDSON_SERVER_COOKIE           aaf6decb76ababb5
  HUDSON_URL                     https://ci-next.docker.com/public/
- JAVA_HOME                      C:\java-1.8.0-openjdk-1.8.0.302-1.b08.ojdkbuild.windows.x86_64
+ JAVA_HOME                      C:\java-1.8.0-openjdk-1.8.0.312-1.b07.ojdkbuild.windows.x86_64
  JENKINS_HOME                   /var/cloudbees-jenkins-distribution
  JENKINS_NODE_COOKIE            XXXXXX
  JENKINS_SERVER_COOKIE          durable-2f56e31ca5b2498536d5dc93c29eccaf
@@ -126,27 +128,29 @@ Old machine (RS5)
  INFO: Test run under d:\CI\...
  INFO: Running in D:\gopath\src\github.com\docker\docker
  INFO: docker/docker repository was found
- INFO: Image microsoft/windowsservercore:latest is already loaded in the control daemon
- INFO: Version of microsoft/windowsservercore:latest is '10.0.17763.2366'
+ INFO: Pulling mcr.microsoft.com/windows/servercore:ltsc2019 from docker hub. This may take some time...
+ INFO: docker pull of mcr.microsoft.com/windows/servercore:ltsc2019 completed successfully
+ INFO: Tagging mcr.microsoft.com/windows/servercore:ltsc2019 as microsoft/windowsservercore
+ INFO: Version of microsoft/windowsservercore:latest is '10.0.17763.2452'
  INFO: Docker version of control daemon
 
  Client:
-  Version:           20.10.9
+  Version:           20.10.12
   API version:       1.41
-  Go version:        go1.16.8
-  Git commit:        c2ea9bc
-  Built:             Mon Oct  4 16:11:10 2021
+  Go version:        go1.16.12
+  Git commit:        e91ed57
+  Built:             Mon Dec 13 11:44:07 2021
   OS/Arch:           windows/amd64
   Context:           default
   Experimental:      true
 
  Server: Docker Engine - Community
   Engine:
-   Version:          20.10.9
+   Version:          20.10.12
    API version:      1.41 (minimum version 1.24)
-   Go version:       go1.16.8
-   Git commit:       79ea9d3
-   Built:            Mon Oct  4 16:06:39 2021
+   Go version:       go1.16.12
+   Git commit:       459d0df
+   Built:            Mon Dec 13 11:42:13 2021
    OS/Arch:          windows/amd64
    Experimental:     true
 
@@ -166,7 +170,7 @@ Old machine (RS5)
    Paused: 0
    Stopped: 0
   Images: 1
-  Server Version: 20.10.9
+  Server Version: 20.10.12
   Storage Driver: lcow (linux) windowsfilter (windows)
    LCOW:
    Windows:
@@ -178,7 +182,7 @@ Old machine (RS5)
   Swarm: inactive
   Default Isolation: process
   Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
-  Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.2183)
+  Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.2366)
   OSType: windows
   Architecture: x86_64
   CPUs: 4

@thaJeztah
Member Author

Same for Win 2022

win-2022-before.txt
win-2022-after.txt

diff --git a/win-2022-before.txt b/win-2022-after.txt
index cb925c9297..fcc7639d7c 100644
--- a/win-2022-before.txt
+++ b/win-2022-after.txt
@@ -1,6 +1,6 @@
-Old machine (Win 2022):
+New machine (Win 2022):
 
- INFO: executeCI.ps1 starting at Thu Dec 23 16:18:33 CUT 2021
+ INFO: executeCI.ps1 starting at Wed Jan 12 20:16:30 CUT 2022
 
  INFO: Script version 05-Feb-2019 09:03 PDT
  INFO: Running git version 2.24.1.windows.2
@@ -51,12 +51,14 @@ Old machine (Win 2022):
  FQDN                           azwin-2-XXXXX.westus.cloudapp.azure.com
  GIT_BRANCH                     PR-43103
  GIT_COMMIT                     XXXXX
+ GIT_PREVIOUS_COMMIT            XXXXX
+ GIT_PREVIOUS_SUCCESSFUL_COMMIT XXXXX
  GIT_URL                        https://github.com/moby/moby.git
  HUDSON_COOKIE                  XXXXXX
  HUDSON_HOME                    /var/cloudbees-jenkins-distribution
  HUDSON_SERVER_COOKIE           aaf6decb76ababb5
  HUDSON_URL                     https://ci-next.docker.com/public/
- JAVA_HOME                      C:\java-1.8.0-openjdk-1.8.0.302-1.b08.ojdkbuild.windows.x86_64
+ JAVA_HOME                      C:\java-1.8.0-openjdk-1.8.0.312-1.b07.ojdkbuild.windows.x86_64
  JENKINS_HOME                   /var/cloudbees-jenkins-distribution
  JENKINS_NODE_COOKIE            XXXXXX
  JENKINS_SERVER_COOKIE          durable-2f56e31ca5b2498536d5dc93c29eccaf
@@ -75,9 +77,9 @@ Old machine (Win 2022):
  PATHEXT                        .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.CPL
  ppc64le                        false
  PROCESSOR_ARCHITECTURE         AMD64
- PROCESSOR_IDENTIFIER           Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
+ PROCESSOR_IDENTIFIER           Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
  PROCESSOR_LEVEL                6
- PROCESSOR_REVISION             4f01
+ PROCESSOR_REVISION             5504
  ProgramData                    C:\ProgramData
  ProgramFiles                   C:\Program Files
  ProgramFiles(x86)              C:\Program Files (x86)
@@ -125,27 +127,29 @@ Old machine (Win 2022):
  INFO: Test run under d:\CI\...
  INFO: Running in D:\gopath\src\github.com\docker\docker
  INFO: docker/docker repository was found
- INFO: Image microsoft/windowsservercore:latest is already loaded in the control daemon
- INFO: Version of microsoft/windowsservercore:latest is '10.0.20348.405'
+ INFO: Pulling mcr.microsoft.com/windows/servercore:ltsc2022 from docker hub. This may take some time...
+ INFO: docker pull of mcr.microsoft.com/windows/servercore:ltsc2022 completed successfully
+ INFO: Tagging mcr.microsoft.com/windows/servercore:ltsc2022 as microsoft/windowsservercore
+ INFO: Version of microsoft/windowsservercore:latest is '10.0.20348.469'
  INFO: Docker version of control daemon
 
  Client:
-  Version:           20.10.9
+  Version:           20.10.12
   API version:       1.41
-  Go version:        go1.16.8
-  Git commit:        c2ea9bc
-  Built:             Mon Oct  4 16:11:10 2021
+  Go version:        go1.16.12
+  Git commit:        e91ed57
+  Built:             Mon Dec 13 11:44:07 2021
   OS/Arch:           windows/amd64
   Context:           default
   Experimental:      true
 
  Server: Docker Engine - Community
   Engine:
-   Version:          20.10.9
+   Version:          20.10.12
    API version:      1.41 (minimum version 1.24)
-   Go version:       go1.16.8
-   Git commit:       79ea9d3
-   Built:            Mon Oct  4 16:06:39 2021
+   Go version:       go1.16.12
+   Git commit:       459d0df
+   Built:            Mon Dec 13 11:42:13 2021
    OS/Arch:          windows/amd64
    Experimental:     true
 
@@ -161,7 +165,7 @@ Old machine (Win 2022):
    Paused: 0
    Stopped: 0
   Images: 1
-  Server Version: 20.10.9
+  Server Version: 20.10.12
   Storage Driver: lcow (linux) windowsfilter (windows)
    LCOW:
    Windows:
@@ -173,7 +177,7 @@ Old machine (Win 2022):
   Swarm: inactive
   Default Isolation: process
   Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
-  Operating System: Windows Server 2022 Datacenter Version 2009 (OS Build 20348.230)
+  Operating System: Windows Server 2022 Datacenter Version 2009 (OS Build 20348.405)
   OSType: windows
   Architecture: x86_64
   CPUs: 4

Signed-off-by: Sebastiaan van Stijn <[email protected]>
@thaJeztah thaJeztah force-pushed the windows_retry_remove branch from b2421e2 to 4ce4c83 Compare June 29, 2022 11:05