lxc/file_utils: ensure complete data write in lxc_write_nointr by DreamConnected · Pull Request #4597 · lxc/lxc

DreamConnected · 2025-10-26T05:31:39Z

Previously, lxc_write_nointr could return without writing all data when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions.

This change:

Implements a loop to continue writing until all data is sent
Handles EINTR, EAGAIN, and EWOULDBLOCK errors appropriately
Uses poll() to wait for fd to become ready when blocked

Fixed #4546

src/lxc/file_utils.c

DreamConnected · 2026-01-06T02:01:10Z

@stgraber Can anyone review it? @mihalicyn seems too busy to handle.

mihalicyn · 2026-01-06T14:39:06Z

src/lxc/file_utils.c

+		if (errno == EINTR)
+			continue;
+
+		if (errno == EWOULDBLOCK) {


Please can you explain how you figured that there is a problem.
Do you have a pointer to a specific lxc_write_nointr() call site where this error occurs?
AFAIK, EWOULDBLOCK can only happen with sockets.

From commit message:

Previously, lxc_write_nointr could return without writing all data
when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions.

I guess this is about sockets, right? We have only a few cases when we call this function on a socket fd and buffer is tiny. Also, in case of EAGAIN/EWOULDBLOCK error, function returns an error, i.e. there is no silent data corruption.

I'm sorry to reply now, because I didn't bring the test machine with me on business trip.

To get back to the point, the call to EAGAIN/EWOULDBLOCK that reported an error was located at lxc_terminal_ptx_io. When lxc was attached:

Terminal fd is changed to non-blocking. lxc_terminal_mainloop_add

the amount of data was large, and the output exceeded the size of the terminal buffer (1024 bytes). Defined: #define LXC_TERMINAL_BUFFER_SIZE 1024

Continuous output: the command continuously generated data, not completed at one time.

Real time: real-time display or continuous monitoring is required.

For example, dd if=/dev/random bs=1000000 count=1 status=none | base64 mentioned in issue. Most easily triggered with pipe |.

Amazing! I think we should try to make a test for this scenario. Let's think about this a bit.

Hi @DreamConnected,

I've spend more time with this issue and the more I look, the more I feel that I'm that bad guy who introduced this regression in this commit 715fb4e.

Back then, there was an issue with io_uring and there were several attempts to properly handle this stuff. I believe, that clean and correct solution is 7fd671d, while I claimed in the comment that this is a workaround, in fact, using oneshot-polling mode is the only option there. If we use oneshot mode with io_uring, then making FD-s non-blocking seems unnecessary.

Another point is that the logic you've introduced in lxc_write_nointr basically implements blocking write on a non-blocking FD, right? But why we can't just make FD blocking if we really need this blocking semantic anyways?

My proposal is to take 715fb4e and revert it, then check if this helps with the issue we are trying to fix and at the same time doesn't break io_uring scenario (when LXC is built with -Dio-uring-event-loop=true). It would be awesome if you try this, cause you've already spend a lot of time researching this issue and preparing a fix, and I don't want to take this initiative from you. If this revert helps and doesn't break anything else, then you can submit this revert as your fix. ;)

Feel free to disagree or correct me if I'm wrong ;)

@mihalicyn,
Right. Revert the commit 715fb4e is a good solution, which may fundamentally solve this issue by causing the terminal fd to return to blocking. But I need some time to make sure this revert won't affect #4304

But I need some time to make sure this revert won't affect #4304

No worries about that part. I gonna re-introduce CI tests for io-uring case (see #4631), so we will be able to validate this part. I wonder if this revert will help with data corruption issue (it should theoretically, but Software Engineering as Physics are experimental sciences ;-) ).

@DreamConnected you can rebase this on top of recent main because we've merged #4631 this should give us some confidence that revert doesn't regress anything.

https://discuss.linuxcontainers.org/t/fixing-forgejo-runners-lxc-logging/25918/4

mihalicyn · 2026-01-06T14:41:56Z

src/lxc/file_utils.c

+		.revents = 0
+	};
+
+	return poll(&pfd, 1, timeout);


I guess we need a proper error handling on poll() syscall too. It returns negative value in case of error (including EINTR btw.). In current approach, we don't distinguish between error and "normal" scenarios.

I guess we need a proper error handling on syscall too. It returns negative value in case of error (including EINTR btw.). In current approach, we don't distinguish between error and "normal" scenarios.poll()

The processing of this part is really too simple. I will submit new changes after testing.

mihalicyn · 2026-01-06T14:45:46Z

Hi @DreamConnected,

thanks for working on this. AFAIU, this is an attempt to fix #4546.
Can you confirm that you was able to reproduce this specific issue from #4546?
Can you confirm that with this patches applied, issue is not reproducible anymore?

(I'm very sorry for a super long delay with review.)

Hi @laanwj,

thanks for reporting an issue. As far as I see, you expressed interest to test this change here:
#4546 (comment)

Please, can you clone this branch, compile LXC and test it to confirm if this change helps to solve the problem for you.

Kind regards,
Alex

Previously, lxc_write_nointr could return without writing all data when write() returned EAGAIN/EWOULDBLOCK due to buffer full conditions. This change: - Implements a loop to continue writing until all data is sent - Handles EINTR, EAGAIN, and EWOULDBLOCK errors appropriately - Uses poll() to wait for fd to become ready when blocked - Maintains backward compatibility while fixing partial write issues Signed-off-by: DreamConnected <[email protected]>

laanwj · 2026-01-20T20:52:37Z

Sorry for the late response.

i've tested this PR (commit 1e1ed6e) and can confirm it fixes the issue i reported in #4546.

i've checked in two ways, inside a lxc-attach session:

Open neovim on various large text files. The problem i reported that the output gets broken off halfway does no longer happen.
Check using dd if=/dev/urandom bs=1 count=5000000 status=none | base64. No lines are cut off anymore.

Thank you!

mihalicyn · 2026-01-21T19:20:07Z

Hey @laanwj,

thanks for testing this!

I was busy with this code and also have prepared an automated test that can recreate a data corruption scenario on CI.
What I've found is that current PR fixes data corruption, but there is something else that makes test to stuck sometimes. It looks like another one and very close bug. I'm digging further.

mihalicyn · 2026-01-22T14:05:07Z

Replaced by #4633

DreamConnected · 2026-01-22T14:08:48Z

Replaced by #4633

OK, there seems to be a problem with the GitHub API. My login token is banned every few minutes (401).

DreamConnected force-pushed the features branch 2 times, most recently from 2e8bdd6 to 8003f8f Compare October 26, 2025 05:57

mihalicyn assigned mihalicyn and unassigned mihalicyn Oct 26, 2025

mihalicyn self-requested a review October 26, 2025 20:30

laanwj reviewed Oct 27, 2025

View reviewed changes

src/lxc/file_utils.c Show resolved Hide resolved

mihalicyn reviewed Jan 6, 2026

View reviewed changes

DreamConnected force-pushed the features branch from 8003f8f to 50d13a2 Compare January 9, 2026 10:18

DreamConnected force-pushed the features branch from 50d13a2 to 1e1ed6e Compare January 9, 2026 11:20

DreamConnected requested a review from mihalicyn January 14, 2026 08:05

mihalicyn self-assigned this Jan 20, 2026

laanwj approved these changes Jan 20, 2026

View reviewed changes

mihalicyn mentioned this pull request Jan 22, 2026

lxc-attach: fix data corruption during heavy IO on PTS #4633

Merged

mihalicyn closed this Jan 22, 2026

Conversation

DreamConnected commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

DreamConnected commented Jan 6, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mihalicyn Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mihalicyn commented Jan 6, 2026

Uh oh!

laanwj commented Jan 20, 2026

Uh oh!

mihalicyn commented Jan 21, 2026

Uh oh!

mihalicyn commented Jan 22, 2026

Uh oh!

DreamConnected commented Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

DreamConnected commented Oct 26, 2025 •

edited

Loading

mihalicyn Jan 19, 2026 •

edited

Loading