lxc/file_utils: ensure complete data write in lxc_write_nointr#4597
lxc/file_utils: ensure complete data write in lxc_write_nointr#4597DreamConnected wants to merge 1 commit intolxc:mainfrom
Conversation
2e8bdd6 to
8003f8f
Compare
|
@stgraber Can anyone review it? @mihalicyn seems too busy to handle. |
src/lxc/file_utils.c
Outdated
| if (errno == EINTR) | ||
| continue; | ||
|
|
||
| if (errno == EWOULDBLOCK) { |
There was a problem hiding this comment.
Please can you explain how you figured that there is a problem.
Do you have a pointer to a specific lxc_write_nointr() call site where this error occurs?
AFAIK, EWOULDBLOCK can only happen with sockets.
From commit message:
Previously, lxc_write_nointr could return without writing all data
when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions.
I guess this is about sockets, right? We have only a few cases when we call this function on a socket fd and buffer is tiny. Also, in case of EAGAIN/EWOULDBLOCK error, function returns an error, i.e. there is no silent data corruption.
There was a problem hiding this comment.
I'm sorry to reply now, because I didn't bring the test machine with me on business trip.
To get back to the point, the call to EAGAIN/EWOULDBLOCK that reported an error was located at lxc_terminal_ptx_io. When lxc was attached:
- Terminal fd is changed to non-blocking. lxc_terminal_mainloop_add
- the amount of data was large, and the output exceeded the size of the terminal buffer (1024 bytes). Defined: #define LXC_TERMINAL_BUFFER_SIZE 1024
- Continuous output: the command continuously generated data, not completed at one time.
- Real time: real-time display or continuous monitoring is required.
For example, dd if=/dev/random bs=1000000 count=1 status=none | base64 mentioned in issue. Most easily triggered with pipe |.
There was a problem hiding this comment.
Amazing! I think we should try to make a test for this scenario. Let's think about this a bit.
There was a problem hiding this comment.
Hi @DreamConnected,
I've spend more time with this issue and the more I look, the more I feel that I'm that bad guy who introduced this regression in this commit 715fb4e.
Back then, there was an issue with io_uring and there were several attempts to properly handle this stuff. I believe, that clean and correct solution is 7fd671d, while I claimed in the comment that this is a workaround, in fact, using oneshot-polling mode is the only option there. If we use oneshot mode with io_uring, then making FD-s non-blocking seems unnecessary.
Another point is that the logic you've introduced in lxc_write_nointr basically implements blocking write on a non-blocking FD, right? But why we can't just make FD blocking if we really need this blocking semantic anyways?
My proposal is to take 715fb4e and revert it, then check if this helps with the issue we are trying to fix and at the same time doesn't break io_uring scenario (when LXC is built with -Dio-uring-event-loop=true). It would be awesome if you try this, cause you've already spend a lot of time researching this issue and preparing a fix, and I don't want to take this initiative from you. If this revert helps and doesn't break anything else, then you can submit this revert as your fix. ;)
Feel free to disagree or correct me if I'm wrong ;)
There was a problem hiding this comment.
@mihalicyn,
Right. Revert the commit 715fb4e is a good solution, which may fundamentally solve this issue by causing the terminal fd to return to blocking. But I need some time to make sure this revert won't affect #4304
There was a problem hiding this comment.
But I need some time to make sure this revert won't affect #4304
No worries about that part. I gonna re-introduce CI tests for io-uring case (see #4631), so we will be able to validate this part. I wonder if this revert will help with data corruption issue (it should theoretically, but Software Engineering as Physics are experimental sciences ;-) ).
There was a problem hiding this comment.
@DreamConnected you can rebase this on top of recent main because we've merged #4631 this should give us some confidence that revert doesn't regress anything.
There was a problem hiding this comment.
src/lxc/file_utils.c
Outdated
| .revents = 0 | ||
| }; | ||
|
|
||
| return poll(&pfd, 1, timeout); |
There was a problem hiding this comment.
I guess we need a proper error handling on poll() syscall too. It returns negative value in case of error (including EINTR btw.). In current approach, we don't distinguish between error and "normal" scenarios.
There was a problem hiding this comment.
I guess we need a proper error handling on syscall too. It returns negative value in case of error (including EINTR btw.). In current approach, we don't distinguish between error and "normal" scenarios.
poll()
The processing of this part is really too simple. I will submit new changes after testing.
|
Hi @DreamConnected, thanks for working on this. AFAIU, this is an attempt to fix #4546. (I'm very sorry for a super long delay with review.) Hi @laanwj, thanks for reporting an issue. As far as I see, you expressed interest to test this change here: Please, can you clone this branch, compile LXC and test it to confirm if this change helps to solve the problem for you. Kind regards, |
8003f8f to
50d13a2
Compare
Previously, lxc_write_nointr could return without writing all data when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions. This change: - Implements a loop to continue writing until all data is sent - Handles EINTR, EAGAIN, and EWOULDBLOCK errors appropriately - Uses poll() to wait for fd to become ready when blocked - Maintains backward compatibility while fixing partial write issues Signed-off-by: DreamConnected <[email protected]>
50d13a2 to
1e1ed6e
Compare
|
Sorry for the late response. i've tested this PR (commit 1e1ed6e) and can confirm it fixes the issue i reported in #4546. i've checked in two ways, inside a
Thank you! |
|
Hey @laanwj, thanks for testing this! I was busy with this code and also have prepared an automated test that can recreate a data corruption scenario on CI. |
|
Replaced by #4633 |
OK, there seems to be a problem with the GitHub API. My login token is banned every few minutes (401). |
Previously, lxc_write_nointr could return without writing all data when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions.
This change:
Fixed #4546