test: Add tests for logging #43669
Conversation
Some failures on Windows.
corhere
left a comment
Thanks for adding the concurrent race test! The PR looks good to me, though I have some nits and suggestions.
corhere
left a comment
Sorry; just thought of something: I would like to see both the json-file and local drivers covered by the integration tests. (Maybe one day journald could join that list, but that's a whole 'nother PR.)
Thanks @corhere. I applied your suggestion.

I also ran into some trouble with log rotation. The test first logged the messages and then checked whether the messages read from the LogReader matched the ones that were logged. This made the TestTailLogsWithRotation test tree fail, due to the low value set here: moby/daemon/logger/loggertest/logreader.go, lines 173 to 177 in 3e32104.

So I changed the test to use Follow instead, and it worked at first, but when I ran the test again a few times it eventually failed. So it seems like the LogFile is not able to guarantee that no messages are dropped when rotations happen: moby/daemon/logger/jsonfilelog/read_test.go, lines 162 to 170 in 3e32104.

So I increased it to make the test not exceed one rotation. This makes the other tests from the LogReader suite a bit pointless under the WithRotation logger, because they won't even have a chance for any rotation to occur, due to the low number of messages that they log. TestConcurrent logs a lot of messages (10000) to make the race condition more likely to be visible when not running the test with -race.

Wondering what's the best decision here. Some options:
Ideally we would want the LogFile not to drop messages when rotations happen quickly, but it's probably not feasible. Option 3 sounds reasonable to me, but maybe there are other options. What do you think?
e6a03af to 8a4dcf5 (Compare)
@vvoland I'd be happy with option 3. What I most care about is having a unit test which catches the regressions when the race detector is enabled. By the very nature of data races, it's always going to be probabilistic whether the race results in incorrect behavior being observed so I'm not too worried about the test falsely passing when the race detector is not enabled. Another option would be to go back to splitting TestConcurrent out separate from TestFollow and TestTail so that the concurrent-logging tests could use a different log-driver configuration which is more amenable to not dropping logs without altering the config used in TestFollow and TestTail.
It's not feasible to guarantee that log entries could never be dropped by followers, because the logger would need to be stalled until all followers caught up enough, blocking the container process when it next writes to STDOUT or STDERR. That being said, there is still room for improvement: the current follower implementation skips log entries if it falls behind by more than one rotation, even in cases where the skipped log entries are still present on disk in a rotated (read: renamed) file. This is actually an improvement over the v20.10 behavior! Before my rewrite, followers would drop any unread log entries the instant the logger performed a rotation. The follower implementation could be improved further so that it only skips log entries if the log file containing those entries was deleted off disk, by opening the appropriate rotated file.
Yes, this is reproducible on 8a4dcf5 if you change … The f is nil after this open; the err is …; the return after … So, I was writing this comment and debugging in the meantime. After some time I can no longer reproduce the above error (on the same code from the above revision), but now I get a different one. This seems to be much rarer than the previous one, but it still looks like the same situation, just in another place?
8a4dcf5 to 6c37a7a (Compare)
Anyway, I applied your suggestions to the concurrent test. Thanks!
Windows tests fail for the tty case; thanks to the DeepEqual it is now clearly visible that the output also contains ANSI escape codes (which is probably expected in the TTY case).
4ef0807 to c6319dc (Compare)
c6319dc to 22dd712 (Compare)
vvoland
left a comment
Ok now I know why in this case \x1b[Ht still doesn't break the expected output.
dd81fcc to d3bc85f (Compare)
    ShowStdout: true,
    ShowStderr: true,
},
expectedOut: "this is fineaccidents happen",
Still some failures on Windows 😞 😢 It looks like something is stripping the trailing e from fine?
=== FAIL: github.com/docker/docker/integration/container TestLogs/driver_json-file/tty/only_stdout (2.41s)
logs_test.go:166: Stripping: "\x1b[2J\x1b[?25l\x1b[m\x1b[Hthis is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a\x1b[?25h\x1b[?25laccidents happen \b\x1b[?25h"
logs_test.go:156: assertion failed:
--- stdoutStr
+++ tC.expectedOut
string(
- "this is finaccidents happen",
+ "this is fineaccidents happen",
)
--- FAIL: TestLogs/driver_json-file/tty/only_stdout (2.41s)
It's not really stripping, it's the backspace: this is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a
I thought that maybe the \b is somehow cancelled out by the use of the set title + \a but I just checked it on my Windows 10 machine and this really rendered with the backspace.
However, running the test command on my Windows returns the correct output. But I'm not sure how my Windows version relates to those in our CI that have this issue.
I thought that maybe the \b is somehow cancelled out by the use of the set title + \a
The \a character is one of twoish ways to terminate an OSC sequence such as set title. (The other 7-bit way is the two-byte "ST" sequence 0x1b 0x5c, a.k.a. ESC \.) So technically, the \a is part of the set title command.
It looks like the weird backspacing only happens on Windows Server 2019. There are two distinct weird things in the logged string this is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a: the backspace command before the OSC Set Window Title sequence, and the OSC command's argument is null-terminated when it shouldn't be. I booted up a Windows Server 2019 VM and played around a bit to characterize the issue. Note that I'm using Windows Powershell 5.1.
PS> docker build -t busybox --build-arg WINDOWS_BASE_IMAGE=mcr.microsoft.com/windows/servercore `
--build-arg WINDOWS_BASE_IMAGE_TAG=ltsc2019 `
.\contrib\busybox\
PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | Format-Hex
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 1B 5B 32 4A 1B 5B 3F 32 35 6C 1B 5B 6D 1B 5B 48 .[2J.[?25l.[m.[H
00000010 74 68 69 73 20 69 73 20 66 69 6E 65 08 1B 5D 30 this is fine..]0
00000020 3B 43 3A 5C 62 69 6E 5C 73 68 2E 65 78 65 00 07 ;C:\bin\sh.exe..
00000030 1B 5B 3F 32 35 68 1B 5B 3F 32 35 6C 61 63 63 69 .[?25h.[?25lacci
00000040 64 65 6E 74 73 20 68 61 70 70 65 6E 20 08 1B 5B dents happen ..[
00000050 3F 32 35 68 ?25h
PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | Write-Output
this is finaccidents happen
PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2"
this is fineaccidents happen
AND IT GETS WEIRDER!
PS> docker run --name logweird1 -t busybox sh -c "echo -n this is fine; echo -n accidents happen >&2"
this is fineaccidents happen
PS> docker logs logweird1
this is fineaccidents happen
PS> get-content (docker inspect logweird1 --format '{{.LogPath}}')
{"log":"\u001b[2J\u001b[?25l\u001b[m\u001b[H\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8882043Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8882043Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\u001b[H\u001b]0;C:\\bin\\sh.exe\u0000\u0007\u001b[?25h\u001b[?25lthis is fineaccidents happen \u0008\u001b[?25h\u001b[Ht\u001b[1;29H","stream":"stdout","time":"2022-06-07T19:01:12.5417731Z"}
PS> docker run --name logweird2 -t busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | write-output
this is finaccidents happen
PS> get-content (docker inspect logweird2 --format '{{.LogPath}}')
{"log":"\u001b[2J\u001b[?25l\u001b[m\u001b[Hthis is fine\u0008\u001b]0;C:\\bin\\sh.exe\u0000\u0007\u001b[?25h\u001b[?25laccidents happen \u0008\u001b[?25h","stream":"stdout","time":"2022-06-07T19:04:06.1768156Z"}
Redirecting the docker run command's output affects what gets logged by the daemon! (docker run -d yields the same results as redirecting the output: erroneous backspace.) WHAT? HOW?? WHY??! 🤯 Since container logs always pass through the logging driver and the Engine API, there is no association between the CLI session and the container's pseudo-terminal which would be visible to the OS. The most likely explanation for this inconsistent behavior is that there is something different about the API calls the CLI makes when stdout is and is not a console which is necessary for the behavior to change. Curiously, I have found that writing a newline character to stdout first is sufficient to make redirected/detached output not backspace over the "e" in "fine."
The console-attached logs might hold clues to what's going wrong. The sequence of VT commands preceding the first printed characters decodes as follows:
- ED(2) -- erase entire display
- DECTCEM(l) -- hide cursor
- SGR(0) -- set all text formatting attributes to defaults
- CUP(0;0) -- move cursor to top-left
- CR LF * 49 -- move cursor down by 49 lines
- CUP(0;0) -- move cursor to top-left
- OSC(0;"C:\\bin\\sh.exe\x00") -- set window title to C:\bin\sh.exe
- DECTCEM(h) -- show cursor
- DECTCEM(l) -- hide cursor
My terminal window was sized to be 50 lines tall when I ran the command, so I don't think the 49 CRLFs are a coincidence. That sequence of commands looks like how a somewhat naive renderer would "repaint" a VT terminal to the state of a cleared text-mode screen buffer. This should come as no surprise, as rendering a Win32 Console API screen buffer to a stream of VT sequences is one of the jobs of conhost.exe.

My theory is that perhaps the Windows Server 2019 conhost.exe (or the copy inside the base image?) VT renderer starts out in some bad state which results in backspace characters getting output in the wrong places and other rendering oddities, and that setting the ConPTY's terminal dimensions has a side effect of putting the renderer into a good state. The Docker CLI keeps the container PTY dimensions synced with the attached terminal dimensions when stdout is a console, which would explain the differences in behavior between attached and detached/redirected. I have occasionally seen the "e" in "fine" overtyped when the output is not redirected, which could be explained by my theory as a race between the container outputting to the PTY and the Docker CLI setting the terminal dimensions.

I have witnessed several variations of output by repeatedly running the same container multiple times, so there's quite a bit of non-determinism in the PTY rendering, which either uninitialized memory in conhost.exe or a race condition could potentially explain. Sleeping for half a second seems to result in deterministic output when outputting to an attached console, which is consistent with my set-terminal-dimensions race-condition theory. But then again, I've also seen nondeterministic output on detached container runs that I couldn't reproduce when I added a sleep, so maybe there's more than one race in play?
In conclusion, the integration test found a bug with Windows containers, and conhost.exe might be involved.
Thanks, that's a really impressive investigation :)
It's a bit strange that the resize actually does happen, since on Windows the container has the desired ConsoleSize already set on the creation. So there shouldn't be a resize at the start, unless you change your terminal size after docker run. But maybe under the hood there is still a delay between creating the container and setting the console size.
Anyway, I think we already spent a lot of time on those tests, so maybe for now we can skip this test on WS2019, report this to Microsoft or at least create an issue on our side, and revisit this once/if they handle it in a follow up?
It's a bit strange that the resize actually does happen, since on Windows the container has the desired ConsoleSize already set on the creation. So there shouldn't be a resize at the start, unless you change your terminal size after
docker run.
Ah, so an initial console size is set on creation. I should have checked that before hypothesizing. So maybe a resize isn't necessary, but it does look like in the docker run | write-output case the console size configuration will be set to 0, 0 so perhaps that's a sufficient condition to get the buggy behavior? (That wouldn't explain the occasional buggy behavior I observed when the console size is set to positive values, though.) You know, I don't see any container.WithConsoleSize options in your new logging integration tests. I wonder if adding it would change the test results.
Anyway, I think we already spent a lot of time on those tests, so maybe for now we can skip this test on WS2019, report this to Microsoft or at least create an issue on our side, and revisit this once/if they handle it in a follow up?
What do we report to Microsoft? "Docker container --tty output is weird on an old version of Windows Server and by the way we don't have a minimal reproducer, plz fix it for us kthxbye" isn't likely to get much traction. I think there needs to be significantly more investigation on our end to isolate and characterize the issue before we will be at a point where we could loop Microsoft in and expect a reply. We don't yet know if HCS is necessary to reproduce the issue or if the issue can be reproduced outside a container, which is necessary to determine whether we need to open an issue on https://github.com/microsoft/hcsshim or https://github.com/microsoft/terminal. Or maybe it truly is a bug on our end and newer versions of HCS or conhost.exe happen to be better at clamping nonsensical configuration. Creating an issue on our side sounds like a good idea for coordinating our investigation.
What do we report to Microsoft?
Let me try some people from their side; @jstarks (I think you wrote most of the original go-ansiterm bits), or otherwise @dcantah and @kevpar may have some ideas.
FWIW, I'm ok with t.Skip()-ing some of these on Windows for now (with a link to this conversation or a tracking ticket), or perhaps we can do a limited check for Windows: string contains "is fine" and/or string contains "accidents" (i.e., we know we received things on the right stream).
but it does look like in the docker run | write-output case the console size configuration will be set to 0, 0
Yeah you are right, the size is 0,0 in this case, I missed the GetTtySize, so the resize does happen.
What do we report to Microsoft?
I might have worded myself better - by reporting I mean "We encountered this issue, this doesn't happen on a newer Windows Server, so perhaps that means you fixed it and already know something more about it" rather than "plz fix it for us kthxbye".
I don't see any container.WithConsoleSize options in your new logging integration tests. I wonder if adding it would change the test results.
Just tested, adding a non zero ConsoleSize does not change the result: https://github.com/moby/moby/runs/6812754016?check_suite_focus=true
BTW, I did test the Windows Server 2019 with the conhost.exe replaced (by modifying the image... yeah, that's really bad and evil, but it works). The conhost.exe from my WS2019 installation is the same as the one in the ltsc2019 container (10.0.17763.2746). I replaced it with a conhost.exe from my PC's Windows 10 (10.0.18362.1) and... it works correctly! The backspace is no longer there, and redirecting the stdout of docker run produces the same logfile without the additional CR LF (so the same as docker run ... | Write-Output in your case).
I also tried the conhost.exe from Windows Server 2022, which is much newer (10.0.20348.558), and... the terminal didn't start at all, probably due to other dependencies.
So, to conclude:
- conhost.exe is involved: it's sufficient to replace it to fix the issue
- The relevant changes were made between 10.0.17763.2746 and 10.0.18362.1
- This could still be "conhost.exe happen[ing] to be better at clamping nonsensical configuration", though
e0e48c7 to 5efc565 (Compare)
5efc565 to baffa6b (Compare)
// which breaks the output: https://github.com/moby/moby/pull/43669#discussion_r891737623
if strings.Contains(testEnv.DaemonInfo.OperatingSystem, "Windows Server Version 1809") {
    if tC.logOps.ShowStdout {
        assert.Check(t, strings.Contains(stdout.String(), "this is fine"))
The string returned from stripEscapeCodes when the bug manifests is "this is finaccidents happen". This check will return false and therefore the test will flake when that next happens, same as if it was not special-cased at all.
This compares the stdout.String() which is the raw unparsed string.
🤦 I read the code to look for exactly that and still misread it. Oops, sorry about that.
Weren't there times when the initial t in this sometimes gets printed out of order? Would that cause the test to flake or was it just a redundant overtype which wouldn't confuse the test?
This was just a redundant overtype inserted at the end, but the correct string was still there:
\x1b[2J\x1b[?25l\x1b[m\x1b[H
this is fine
\b\x1b]0;C:\bin\sh.exe\x00\a\x1b[?25h\x1b[?25laccidents happen \b\x1b[?25h
\x1b[Ht
\x1b[1;29H
Ok, so finally CI is green with the workaround for Windows Server 2019. Extracted the comments regarding the Windows issue to #43710; let's do further investigation there.
1. Add integration tests for the ContainerLogs API call

   Each test handles a distinct case of ContainerLogs output.
   - Muxed stream, when container is started without tty
   - Single stream, when container is started with tty

2. Add unit test for LogReader suite that tests concurrent logging

   It checks that there are no race conditions when logging concurrently
   from multiple goroutines.

Co-authored-by: Cory Snider <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
Signed-off-by: Paweł Gronowski <[email protected]>
baffa6b to 2ec3e14 (Compare)
Supersedes:
(extended with: #43647 (comment))
Add integration tests for the ContainerLogs API call
Each test handles a distinct case of ContainerLogs output.
Add unit test for LogReader suite that tests concurrent logging
It checks that there are no race conditions when logging concurrently
from multiple goroutines.
The tests pass on master.
With #43650 reverted, they fail as expected.
In that case, TestReadLog/Concurrent should also fail even when not running with -race, due to checking whether the read messages match those that were logged.
- What I did
Write unit test for race condition when logging from multiple goroutines at once
- How to verify it
or (is there any way to pass -race to the test-unit?)
- A picture of a cute animal (not mandatory but encouraged)