
test: Add tests for logging #43669

Merged: thaJeztah merged 1 commit into moby:master from vvoland:test-container-logs2 on Jun 16, 2022

Conversation

@vvoland (Contributor) commented May 31, 2022:

Supersedes:

(extended with: #43647 (comment))

Add integration tests for the ContainerLogs API call
Each test handles a distinct case of ContainerLogs output.

  • Muxed stream, when container is started without tty
  • Single stream, when container is started with tty

Add unit test for LogReader suite that tests concurrent logging
It checks that there are no race conditions when logging concurrently
from multiple goroutines.

The tests pass on master.

With #43650 reverted, they fail as expected. In that case, TestReadLog/Concurrent should also fail even when not running with -race, because it checks that the messages read back match those that were logged.
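
For illustration, a minimal sketch of the shape of such a concurrent-logging test (Go; newTestLogger and readAllMessages are hypothetical placeholders, not the actual loggertest helpers; assumes the daemon/logger Message type plus the fmt, sync, time, and testing packages):

// Sketch only; not the actual loggertest code.
func TestConcurrentSketch(t *testing.T) {
	l := newTestLogger(t) // hypothetical constructor for the driver under test

	const writers, perWriter = 4, 100
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				err := l.Log(&logger.Message{
					Line:      []byte(fmt.Sprintf("writer %d msg %d", w, i)),
					Timestamp: time.Now(),
					Source:    "stdout",
				})
				if err != nil {
					t.Error(err)
				}
			}
		}(w)
	}
	wg.Wait()

	// Under -race the detector flags unsynchronized access directly; without
	// it, the read-back comparison can still catch lost or garbled messages.
	msgs := readAllMessages(t, l) // hypothetical reader helper
	if len(msgs) != writers*perWriter {
		t.Fatalf("logged %d messages, read back %d", writers*perWriter, len(msgs))
	}
}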

- What I did
Write a unit test for the race condition when logging from multiple goroutines at once.

- How to verify it

  1. Run unit tests:
make TESTDIRS='github.com/docker/docker/daemon/logger/local'  BIND_DIR=. test-unit
make TESTDIRS='github.com/docker/docker/daemon/logger/jsonfilelog'  BIND_DIR=. test-unit

or (is there any way to pass -race to the test-unit?)

cd daemon/logger/local
go test -race .
cd -
cd daemon/logger/jsonfilelog
go test -race .
  2. Run integration tests:
make BIND_DIR=. TEST_FILTER='TestLogs' test-integration

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah (Member) commented:

Some failures on Windows:

=== Failed
=== FAIL: github.com/docker/docker/integration/container TestLogsNotMuxed/stdout_and_stderr (2.23s)
    logs_test.go:166: assertion failed: this is fine25haccidents happen (string) != this is fineaccidents happen (tC.expectedOutput string)
    --- FAIL: TestLogsNotMuxed/stdout_and_stderr (2.23s)

=== FAIL: github.com/docker/docker/integration/container TestLogsNotMuxed/only_stdout (2.36s)
    logs_test.go:166: assertion failed: this is fine25haccidents happen (string) != this is fineaccidents happen (tC.expectedOutput string)
    --- FAIL: TestLogsNotMuxed/only_stdout (2.36s)

@corhere (Contributor) left a comment:

Thanks for adding the concurrent race test! The PR looks good to me, though I have some nits and suggestions.

@corhere (Contributor) left a comment:

Sorry; just thought of something: I would like to see both the json-file and local drivers covered by the integration tests. (Maybe one day journald could join that list, but that's a whole 'nother PR.)

@vvoland (Contributor, Author) commented Jun 1, 2022:

Thanks @corhere, I applied your suggestion. I also ran into some trouble with log rotation.

The test first logged the messages and then checked whether the messages read from the LogReader matched the ones that were logged. This made the TestTailLogsWithRotation test tree fail: because of the low max-size, the log can't store all the messages, so only the last 10 could be read.
Considering the nature of Tail, it doesn't seem reasonable to expect all messages that were ever logged to be retained, so it would probably be better for the test to use Follow, because:

// TestFollow tests the LogReader's follow implementation.
//
// The LogReader is expected to be able to follow an arbitrary number of
// messages at a high rate with no dropped messages.
func (tr Reader) TestFollow(t *testing.T) {

So I changed the test to use Follow instead. It worked at first, but after running it a few more times it eventually failed:

time="2022-06-01T16:16:37+02:00" level=warning msg="file rotations were missed while following logs; some log messages have been skipped over" file="/var/folders/rv/7kmyxm9n6hd6xcpsh8tx3g300000gn/T/TestFollowLogsWithRotationcompress=falseConcurrent3106684161/001/logconcurrent0.log" missed-rotations=2 module=logger
time="2022-06-01T16:16:37+02:00" level=warning msg="error closing current log file" error="invalid argument" file="/var/folders/rv/7kmyxm9n6hd6xcpsh8tx3g300000gn/T/TestFollowLogsWithRotationcompress=falseConcurrent3106684161/001/logconcurrent0.log" module=logger
--- FAIL: TestFollowLogsWithRotation (0.00s)
    --- FAIL: TestFollowLogsWithRotation/compress=false (0.00s)
        --- FAIL: TestFollowLogsWithRotation/compress=false/Concurrent (0.02s)
            logreader.go:446: assertion failed: stdout msg 167 (string) != stdout msg 112 (string)
            read_test.go:174: TestFollowLogsWithRotation/compress=false/Concurrent:

                logconcurrent0.log      -rw-r-----      375B    2022-06-01 16:16:37.628204038 +0200 CEST
                logconcurrent0.log.1    -rw-r-----      4125B   2022-06-01 16:16:37.627307401 +0200 CEST
                logconcurrent0.log.2    -rw-r-----      4158B   2022-06-01 16:16:37.623293432 +0200 CEST
FAIL
FAIL    github.com/docker/docker/daemon/logger/jsonfilelog      1.207s
FAIL

So it seems the LogFile can't guarantee that no messages are dropped when rotations happen.
This can be solved by adjusting the test logger configuration to make it perform at most one rotation:

Factory: func(t *testing.T, info logger.Info) func(*testing.T) logger.Logger {
	// The log follower can fall behind and drop logs if there are too many
	// rotations in a short time. If that was to happen, loggertest would fail the
	// test. Configure the logger so that there will be only one rotation with the
	// set of logs that loggertest writes.
	info.Config = map[string]string{
		"compress": strconv.FormatBool(cmprs),
		"max-size": "4096b",
		"max-file": "3",
So I increased the max-size so that the test doesn't exceed one rotation. This makes the other tests from the LogReader suite a bit pointless under the WithRotation logger, because with the small number of messages they log, no rotation will ever occur. TestConcurrent logs a lot of messages (10000) to make the race condition more likely to manifest when the test is not run with -race.

Wondering what's the best decision here. Some options:

  1. Don't execute TestConcurrent for TestTailLogsWithRotation.
  2. Only log the messages concurrently, checking for the race condition, without reading back and verifying the messages. (Would it still be a LogReader test though? 💭)
  3. Make TestConcurrent use the same messages as the other tests (<10 messages). It would still catch the race condition under go test -race, but may fail to catch it otherwise. (Our CI doesn't run tests with -race yet.)

Ideally we would want the LogFile not to drop messages when rotations happen quickly, but that's probably not feasible.

Option 3 sounds reasonable to me, but maybe there are other options. What do you think?

@vvoland force-pushed the test-container-logs2 branch from e6a03af to 8a4dcf5 on June 1, 2022, 17:43
@corhere (Contributor) commented Jun 1, 2022:

@vvoland I'd be happy with option 3. What I most care about is having a unit test which catches regressions when the race detector is enabled. By the very nature of data races, it's always going to be probabilistic whether the race results in incorrect behavior being observed, so I'm not too worried about the test falsely passing when the race detector is not enabled.

Another option would be to go back to splitting TestConcurrent out from TestFollow and TestTail, so that the concurrent-logging tests could use a log-driver configuration more amenable to not dropping logs, without altering the config used in TestFollow and TestTail.

Ideally we would want the LogFile not to drop messages when rotations happen quickly, but that's probably not feasible.

It's not feasible to guarantee that log entries can never be dropped by followers, because the logger would need to stall until all followers caught up, blocking the container process the next time it writes to STDOUT or STDERR.

That being said, there is still room for improvement: the current follower implementation skips log entries if it falls behind by more than one rotation, even in cases where the skipped entries are still present on disk in a rotated (read: renamed) file. This is actually an improvement over the v20.10 behavior! Before my rewrite, followers would drop any unread log entries the instant the logger performed a rotation. If someone cared to put in the work, the follower could be improved further to skip log entries only when the file containing them has been deleted from disk, by opening the appropriate rotated (.1, .2, etc.) log files instead of jumping straight to the active log file.

It is unlikely to be an issue in practice, as the default file-size threshold for triggering a rotation is large enough that logs don't get rotated fast enough to cause problems for most users. And if it is a problem for anyone, they can work around it by changing the log driver config to reduce the rate of rotations, switching to a log driver more appropriate for their use case, or eschewing Docker's logging infrastructure and doing their own thing with e.g. a log shipper embedded into their container, such as fluent-bit or logstash.
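
If someone did put in that work, a rough sketch of the shape it might take (hypothetical code, not the current follow.go implementation; assumes the errors, fmt, and os packages):

// Hypothetical: when the follower misses rotations, open the oldest
// surviving rotated file (.missed, then .missed-1, ...) instead of
// jumping straight to the active log file.
func openOldestSurviving(base string, missed int) (*os.File, error) {
	for i := missed; i >= 1; i-- {
		f, err := os.Open(fmt.Sprintf("%s.%d", base, i))
		if err == nil {
			return f, nil // entries in this file are still recoverable
		}
		if !errors.Is(err, os.ErrNotExist) {
			return nil, err
		}
		// This rotation was already deleted; those entries are truly lost.
	}
	return os.Open(base) // all rotated files gone; resume at the active file
}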

Side note: that msg="error closing current log file" error="invalid argument" log message you saw when trying to test with follow doesn't look good. "invalid argument" is os.ErrInvalid, which is returned if you call ((*os.File)(nil)).Close(). That shouldn't happen. You might've found another bug! Is the current code in the PR branch sufficient to reproduce the failure?

@vvoland (Contributor, Author) commented Jun 2, 2022:

Side note: that msg="error closing current log file" error="invalid argument" log message you saw when trying to test with follow doesn't look good. "invalid argument" is os.ErrInvalid, which is returned if you call ((*os.File)(nil)).Close(). That shouldn't happen. You might've found another bug! Is the current code in the PR branch sufficient to reproduce the failure?

Yes, this is reproducible on 8a4dcf5 if you change max-size to small values. It never occurs without the missed-rotations warning (from what I saw, it was always more than one missed rotation).

The f is nil after this open:

https://github.com/vvoland/moby/blob/b34123ff3abcf763c4f4c5c4524a7f8969ad1c72/daemon/logger/loggerutils/follow.go#L74

err is:
open /var/folders/rv/7kmyxm9n6hd6xcpsh8tx3g300000gn/T/TestFollowLogsWithRotationcompress=falseConcurrent2959452999/001/logconcurrent0.log: no such file or directory

The return after if err != nil causes the defer block to run, and that's where we get the ((*os.File)(nil)).Close().
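
A minimal standalone reproduction of that failure mode (a hypothetical function, not the actual follow.go code): when os.Open fails because the file was just rotated away, the deferred Close runs on a nil *os.File and returns os.ErrInvalid ("invalid argument").

package main

import (
	"fmt"
	"os"
)

// followOnce is a simplified, hypothetical stand-in for the pattern in
// loggerutils/follow.go: Close is deferred before the open error is checked.
func followOnce(path string) error {
	f, err := os.Open(path)
	defer func() {
		// If os.Open failed, f is nil and Close returns os.ErrInvalid,
		// producing the "error closing current log file" warning.
		if cerr := f.Close(); cerr != nil {
			fmt.Println("error closing current log file:", cerr)
		}
	}()
	if err != nil {
		return err // the deferred nil Close fires on the way out
	}
	// ... read from f ...
	return nil
}

func main() {
	_ = followOnce("/nonexistent/rotated-away.log") // prints: ... invalid argument
}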

So, I was writing this comment and debugging in the meantime. After some time I could no longer reproduce the above error (on the same code from the above revision), but now I get:
time="2022-06-02T14:43:03+02:00" level=error msg="Error compressing log file after rotation" error="failed to open or create gzip log file: open /var/folders/rv/7kmyxm9n6hd6xcpsh8tx3g300000gn/T/TestFollowLogsWithRotationcompress=trueConcurrent4179564514/001/logconcurrent0.log.1.gz: no such file or directory"

This one seems much rarer than the previous error, but it looks like the same situation, just in another place?

@vvoland force-pushed the test-container-logs2 branch from 8a4dcf5 to 6c37a7a on June 2, 2022, 12:53
@vvoland (Contributor, Author) commented Jun 2, 2022:

Anyway, I applied your suggestions to the concurrent test. Thanks!

@vvoland (Contributor, Author) commented Jun 2, 2022:

Windows tests fail for the tty case; thanks to the DeepEqual, it is now clearly visible that the output also contains ANSI escape codes (which is probably expected in the TTY case).
I guess it is reasonable to strip those before comparing with the expected output.
Might be a bit tricky, but I'll try a simple removal of known sequences and see how that works.
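
For illustration, a rough sketch of that "simple removal of known sequences" approach (the regex patterns here are assumptions, not the actual stripEscapeCodes helper from the PR); note it deliberately leaves control characters such as \b alone:

package main

import (
	"fmt"
	"regexp"
)

// Sketch only: strips CSI sequences (ESC [ ... final byte) and OSC
// sequences (ESC ] ... BEL or ESC \), which covers the sequences seen
// in the test output above.
var (
	csiRe = regexp.MustCompile(`\x1b\[[0-9;?]*[a-zA-Z]`)
	oscRe = regexp.MustCompile(`\x1b\][^\x07\x1b]*(\x07|\x1b\\)`)
)

func stripEscapes(s string) string {
	return csiRe.ReplaceAllString(oscRe.ReplaceAllString(s, ""), "")
}

func main() {
	raw := "\x1b[2J\x1b[?25l\x1b[m\x1b[Hthis is fine\x1b[?25haccidents happen"
	fmt.Printf("%q\n", stripEscapes(raw)) // "this is fineaccidents happen"
}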

@vvoland force-pushed the test-container-logs2 branch from 4ef0807 to c6319dc on June 3, 2022, 13:17
@vvoland requested a review from tianon as a code owner on June 3, 2022, 13:17
@vvoland force-pushed the test-container-logs2 branch from c6319dc to 22dd712 on June 3, 2022, 13:22
@vvoland (Contributor, Author) left a comment:

Ok now I know why in this case \x1b[Ht still doesn't break the expected output.

@vvoland force-pushed the test-container-logs2 branch 3 times, most recently from dd81fcc to d3bc85f, on June 6, 2022, 07:31
	ShowStdout: true,
	ShowStderr: true,
},
expectedOut: "this is fineaccidents happen",
Member commented:

Still some failures on Windows 😞 😢 Looks like something is stripping the trailing e from fine?

=== FAIL: github.com/docker/docker/integration/container TestLogs/driver_json-file/tty/only_stdout (2.41s)
    logs_test.go:166: Stripping: "\x1b[2J\x1b[?25l\x1b[m\x1b[Hthis is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a\x1b[?25h\x1b[?25laccidents happen \b\x1b[?25h"
    logs_test.go:156: assertion failed: 
        --- stdoutStr
        +++ tC.expectedOut
          string(
        - 	"this is finaccidents happen",
        + 	"this is fineaccidents happen",
          )
        --- FAIL: TestLogs/driver_json-file/tty/only_stdout (2.41s)

Contributor (Author) commented:

It's not really stripping, it's the backspace: this is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a
I thought that maybe the \b is somehow cancelled out by the use of the set title + \a, but I just checked on my Windows 10 machine and it really is rendered with the backspace applied.
However, running the test command on my Windows returns the correct output. I'm not sure how my Windows version relates to the ones in our CI that have this issue.

Contributor commented:

I thought that maybe the \b is somehow cancelled out by the use of the set title + \a

The \a character is one of twoish ways to terminate an OSC sequence such as set title. (The other 7-bit way is the two-byte "ST" sequence 0x1b 0x5c, a.k.a. ESC \.) So technically, the \a is part of the set title command.
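
A tiny Go demo of the two terminator forms, for illustration only:

package main

import "fmt"

func main() {
	// Both lines set the window title to "demo"; BEL (\a) and the
	// two-byte ST (ESC \) are interchangeable OSC terminators.
	fmt.Print("\x1b]0;demo\a")
	fmt.Print("\x1b]0;demo\x1b\\")
}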

It looks like the weird backspacing only happens on Windows Server 2019. There are two distinct weird things in the logged string this is fine\b\x1b]0;C:\\bin\\sh.exe\x00\a: the backspace command before the OSC Set Window Title sequence, and the fact that the OSC command's argument is null-terminated when it shouldn't be. I booted up a Windows Server 2019 VM and played around a bit to characterize the issue. Note that I'm using Windows PowerShell 5.1.

PS> docker build -t busybox --build-arg WINDOWS_BASE_IMAGE=mcr.microsoft.com/windows/servercore `
        --build-arg WINDOWS_BASE_IMAGE_TAG=ltsc2019 `
        .\contrib\busybox\
PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | Format-Hex


           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00000000   1B 5B 32 4A 1B 5B 3F 32 35 6C 1B 5B 6D 1B 5B 48  .[2J.[?25l.[m.[H
00000010   74 68 69 73 20 69 73 20 66 69 6E 65 08 1B 5D 30  this is fine..]0
00000020   3B 43 3A 5C 62 69 6E 5C 73 68 2E 65 78 65 00 07  ;C:\bin\sh.exe..
00000030   1B 5B 3F 32 35 68 1B 5B 3F 32 35 6C 61 63 63 69  .[?25h.[?25lacci
00000040   64 65 6E 74 73 20 68 61 70 70 65 6E 20 08 1B 5B  dents happen ..[
00000050   3F 32 35 68                                      ?25h

PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | Write-Output
this is finaccidents happen
PS> docker run --rm -it busybox sh -c "echo -n this is fine; echo -n accidents happen >&2"
this is fineaccidents happen

AND IT GETS WEIRDER!

PS> docker run --name logweird1 -t busybox sh -c "echo -n this is fine; echo -n accidents happen >&2"
this is fineaccidents happen
PS> docker logs logweird1
this is fineaccidents happen
PS> get-content (docker inspect logweird1 --format '{{.LogPath}}')
{"log":"\u001b[2J\u001b[?25l\u001b[m\u001b[H\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8876173Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8882043Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8882043Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8907976Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\r\n","stream":"stdout","time":"2022-06-07T19:01:11.8915006Z"}
{"log":"\u001b[H\u001b]0;C:\\bin\\sh.exe\u0000\u0007\u001b[?25h\u001b[?25lthis is fineaccidents happen \u0008\u001b[?25h\u001b[Ht\u001b[1;29H","stream":"stdout","time":"2022-06-07T19:01:12.5417731Z"}
PS> docker run --name logweird2 -t busybox sh -c "echo -n this is fine; echo -n accidents happen >&2" | write-output
this is finaccidents happen
PS> get-content (docker inspect logweird2 --format '{{.LogPath}}')
{"log":"\u001b[2J\u001b[?25l\u001b[m\u001b[Hthis is fine\u0008\u001b]0;C:\\bin\\sh.exe\u0000\u0007\u001b[?25h\u001b[?25laccidents happen \u0008\u001b[?25h","stream":"stdout","time":"2022-06-07T19:04:06.1768156Z"}

Redirecting the docker run command's output affects what gets logged by the daemon! (docker run -d yields the same results as redirecting the output: erroneous backspace.) WHAT? HOW?? WHY??! 🤯

Since container logs always pass through the logging driver and the Engine API, there is no association between the CLI session and the container's pseudo-terminal which would be visible to the OS. The most likely explanation for this inconsistent behavior is that the CLI makes different API calls depending on whether stdout is a console, and that difference is what changes the behavior. Curiously, I have found that writing a newline character to stdout first is sufficient to make redirected/detached output not backspace over the "e" in "fine."

The console-attached logs might hold clues to what's going wrong. The sequence of VT commands preceding the first printed characters decodes as follows:

  • ED(2) -- erase entire display
  • DECTCEM(l) -- hide cursor
  • SGR(0) -- set all text formatting attributes to defaults
  • CUP(0;0) -- move cursor to top-left
  • CR LF * 49 -- move cursor down by 49 lines
  • CUP(0;0) -- move cursor to top-left
  • OSC(0;"C:\\bin\\sh.exe\x00") -- set window title to C:\bin\sh.exe
  • DECTCEM(h) -- show cursor
  • DECTCEM(l) -- hide cursor

My terminal window was sized to be 50 lines tall when I ran the command, so I don't think the 49 CRLFs are a coincidence. That sequence of commands looks like how a somewhat naive renderer would "repaint" a VT terminal to the state of a cleared text-mode screen buffer. This should come as no surprise, as rendering a Win32 Console API screen buffer to a stream of VT sequences is one of the jobs of conhost.exe.

My theory is that perhaps the Windows Server 2019 conhost.exe (or the copy inside the base image?) VT renderer starts out in some bad state which results in backspace characters getting output in the wrong places and other rendering oddities, and that setting the ConPTY's terminal dimensions has a side effect of putting the renderer into a good state. The Docker CLI keeps the container PTY dimensions synced with the attached terminal dimensions when stdout is a console, which would explain the differences in behavior between attached and detached/redirected. I have occasionally seen the "e" in "fine" overtyped when the output is not redirected, which my theory could explain as a race between the container writing to the PTY and the Docker CLI setting the terminal dimensions.

I have witnessed several variations of output when repeatedly running the same container, so there's quite a bit of non-determinism in the PTY rendering, which either uninitialized memory in conhost.exe or a race condition could explain. Sleeping for half a second seems to result in deterministic output when outputting to an attached console, which is consistent with my set-terminal-dimensions race-condition theory. But then again, I've also seen nondeterministic output on detached container runs that I couldn't reproduce when I added a sleep, so maybe there's more than one race in play?

In conclusion, the integration test found a bug with Windows containers, and conhost.exe might be involved.

Contributor (Author) commented:

Thanks, that's a really impressive investigation :)
It's a bit strange that the resize actually happens, since on Windows the container already has the desired ConsoleSize set at creation. So there shouldn't be a resize at the start, unless you change your terminal size after docker run. But maybe under the hood there is still a delay between creating the container and setting the console size.

Anyway, I think we've already spent a lot of time on these tests, so maybe for now we can skip this test on WS2019, report it to Microsoft or at least create an issue on our side, and revisit this once/if they handle it, in a follow-up?

Contributor commented:

It's a bit strange that the resize actually happens, since on Windows the container already has the desired ConsoleSize set at creation. So there shouldn't be a resize at the start, unless you change your terminal size after docker run.

Ah, so an initial console size is set on creation. I should have checked that before hypothesizing. So maybe a resize isn't necessary, but it does look like in the docker run | write-output case the console size configuration will be set to 0, 0, so perhaps that's a sufficient condition to trigger the buggy behavior? (That wouldn't explain the occasional buggy behavior I observed when the console size is set to positive values, though.) You know, I don't see any container.WithConsoleSize options in your new logging integration tests. I wonder if adding it would change the test results.
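
For reference, roughly what that suggestion might look like in the test; container.WithConsoleSize is referenced above, but its exact signature and argument order are assumptions here:

// Hypothetical sketch using moby's integration/internal/container helpers;
// the WithConsoleSize signature and (width, height) order are assumed.
cID := container.Run(ctx, t, apiClient,
	container.WithTty(true),
	container.WithConsoleSize(80, 25),
	container.WithCmd("sh", "-c", "echo -n this is fine; echo -n accidents happen >&2"),
)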

Anyway, I think we've already spent a lot of time on these tests, so maybe for now we can skip this test on WS2019, report it to Microsoft or at least create an issue on our side, and revisit this once/if they handle it, in a follow-up?

What do we report to Microsoft? "Docker container --tty output is weird on an old version of Windows Server and by the way we don't have a minimal reproducer, plz fix it for us kthxbye" isn't likely to get much traction. I think there needs to be significantly more investigation on our end to isolate and characterize the issue before we will be at a point where we could loop Microsoft in and expect a reply. We don't yet know if HCS is necessary to reproduce the issue or if the issue can be reproduced outside a container, which is necessary to determine whether we need to open an issue on https://github.com/microsoft/hcsshim or https://github.com/microsoft/terminal. Or maybe it truly is a bug on our end and newer versions of HCS or conhost.exe happen to be better at clamping nonsensical configuration. Creating an issue on our side sounds like a good idea for coordinating our investigation.

@thaJeztah (Member) commented Jun 9, 2022:

What do we report to Microsoft?

Let me try some people from their side; @jstarks (I think you wrote most of the original go-ansiterm bits), or otherwise @dcantah and @kevpar may have some ideas.

FWIW, I'm ok with t.Skip()-ing some of these on Windows for now (with a link to this conversation or a tracking ticket), or perhaps we can do a limited check for Windows: string contains "is fine" and/or string contains "accidents" (i.e., we know we received things on the right stream).

Contributor (Author) commented:

but it does look like in the docker run | write-output case the console size configuration will be set to 0, 0

Yeah, you are right, the size is 0,0 in this case. I missed the GetTtySize, so the resize does happen.

What do we report to Microsoft?

I could have worded that better. By "reporting" I meant "We encountered this issue; it doesn't happen on a newer Windows Server, so perhaps that means you fixed it and already know something more about it", rather than "plz fix it for us kthxbye".

I don't see any container.WithConsoleSize options in your new logging integration tests. I wonder if adding it would change the test results.

Just tested: adding a non-zero ConsoleSize does not change the result: https://github.com/moby/moby/runs/6812754016?check_suite_focus=true

BTW, I did test Windows Server 2019 with conhost.exe replaced (by modifying the image... yeah, that's really bad and evil, but it works...). The conhost.exe from my WS2019 installation is the same as the one in the ltsc2019 container (10.0.17763.2746). I replaced it with the conhost.exe from my PC's Windows 10 (10.0.18362.1) and... it works correctly! The backspace is no longer there, and redirecting the stdout of docker run produces the same logfile without the additional CR LFs (so, the same as docker run ... | Write-Output in your case).
I also tried the conhost.exe from Windows Server 2022, which is much newer (10.0.20348.558), and... the terminal didn't start at all - probably due to other dependencies.

So, to conclude:

  • conhost.exe is involved - replacing it is sufficient to fix the issue
  • The relevant changes were made between 10.0.17763.2746 and 10.0.18362.1
  • This could still just be a newer conhost.exe being better at clamping nonsensical configuration, though

@vvoland force-pushed the test-container-logs2 branch 3 times, most recently from e0e48c7 to 5efc565, on June 9, 2022, 14:19
// which breaks the output: https://github.com/moby/moby/pull/43669#discussion_r891737623
if strings.Contains(testEnv.DaemonInfo.OperatingSystem, "Windows Server Version 1809") {
	if tC.logOps.ShowStdout {
		assert.Check(t, strings.Contains(stdout.String(), "this is fine"))
Contributor commented:

The string returned from stripEscapeCodes when the bug manifests is "this is finaccidents happen". This check will return false and therefore the test will flake when that next happens, same as if it was not special-cased at all.

Contributor (Author) commented:

This compares stdout.String(), which is the raw, unparsed string.

Contributor commented:

🤦 I read the code to look for exactly that and still misread it. Oops, sorry about that.

Weren't there times when the initial t in this sometimes got printed out of order? Would that cause the test to flake, or was it just a redundant overtype which wouldn't confuse the test?

@vvoland (Contributor, Author) commented Jun 10, 2022:

This was just a redundant overtype inserted at the end, but the correct string was still there:

\x1b[2J\x1b[?25l\x1b[m\x1b[H
this is fine
\b\x1b]0;C:\bin\sh.exe\x00\a\x1b[?25h\x1b[?25laccidents happen \b\x1b[?25h
\x1b[Ht
\x1b[1;29H

@vvoland (Contributor, Author) commented Jun 9, 2022:

Ok, so CI is finally green with the workaround for Windows Server 2019.
Also, sorry for the commit mess; I messed up the branches when squashing 🙈

Extracted the comments regarding the Windows issue to #43710 - let's continue the investigation there.

1. Add integration tests for the ContainerLogs API call
Each test handles a distinct case of ContainerLogs output.
- Muxed stream, when container is started without tty
- Single stream, when container is started with tty

2. Add unit test for LogReader suite that tests concurrent logging
It checks that there are no race conditions when logging concurrently
from multiple goroutines.

Co-authored-by: Cory Snider <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
Signed-off-by: Paweł Gronowski <[email protected]>
@vvoland force-pushed the test-container-logs2 branch from baffa6b to 2ec3e14 on June 10, 2022, 07:30
@thaJeztah (Member) left a comment:

LGTM, thanks!
