Fix infinite callback loop when time is not moving forward#10151
bdraco merged 5 commits into aio-libs:master
Conversation
If the keepalive handler is called too soon, it reschedules itself. The test used `now <= close_time`, which means that an exactly on-time notification is treated as "too soon", causing an automatic rescheduling. For real systems the time will eventually advance and break the loop, but with async-solipsism, time doesn't advance until there is some reason to sleep and the loop is infinite. Closes aio-libs#10149.
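A minimal, self-contained sketch of the comparison bug (illustrative names, not the actual aiohttp source): with `now <= close_time`, a callback that fires exactly at the deadline counts as "too soon" and gets rescheduled, and on a frozen clock that repeats forever.

```python
def run_keepalive(time_fn, close_time, strict=True, max_reschedules=5):
    """Simulate the keep-alive handler loop.

    Returns the number of reschedules before the connection is closed,
    or None if the reschedule limit is hit (i.e. it would spin forever
    on a frozen clock, which is what async-solipsism exposes).
    """
    reschedules = 0
    while True:
        now = time_fn()
        # Fixed comparison: only strictly-before-the-deadline is "too soon".
        # Buggy comparison: exactly on time also counts as "too soon".
        too_soon = now < close_time if strict else now <= close_time
        if not too_soon:
            return reschedules  # deadline reached: close the connection
        reschedules += 1
        if reschedules >= max_reschedules:
            return None


# With a frozen clock sitting exactly on the deadline:
frozen = lambda: 10.0
buggy = run_keepalive(frozen, 10.0, strict=False)  # loops forever -> None
fixed = run_keepalive(frozen, 10.0, strict=True)   # closes immediately -> 0
```

On a real clock the buggy variant eventually escapes once `time_fn()` advances past the deadline, which is why this only hangs under a mocked, non-advancing clock.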
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #10151    +/-   ##
=======================================
  Coverage   98.75%   98.75%
=======================================
  Files         122      122
  Lines       36954    36997      +43
  Branches     4411     4413       +2
=======================================
+ Hits        36494    36538      +44
  Misses        313      313
+ Partials      147      146       -1

View full report in Codecov by Sentry.
CodSpeed Performance Report: Merging #10151 will not alter performance.
We need to make it very clear this is not a production problem, so we don't end up with unexpected questions or discussions. Additionally, we haven't committed to supporting async-solipsism at this time, so I changed the title to better reflect that.
I think we should change the fragment to packaging instead of bugfix, since it's for testing and doesn't affect production.
Is there any chance of a regression test without adding a new dependency?
I think a test could be added using
Sure, I'll do that. But are you sure aiohttp isn't used in any environment in which the timer precision might be low enough that it causes the event to be re-scheduled unnecessarily (which won't be an infinite loop, but will cause more work than necessary)? I'm thinking of environments that deliberately degrade timer precision to prevent Spectre-type side-channel attacks.
For what it's worth, I've never tested async-solipsism on non-Linux OSes, so it might not be something you want the test suite to depend on at this point.
That seems unlikely as we haven't had any issue reports, and even
We don't have any maintainers familiar with it either, so it's not something we could effectively troubleshoot, and we wouldn't want to depend on it right now.
At the request of @bdraco
I think we only need to patch
I'm not so sure: it doesn't touch any packaging metadata, mechanisms, or downstream expectations. That would be misleading, in my opinion. I suppose we could go for

But yeah, a regression test would be most welcome here...
Ok, renamed to
I added a test to ensure the keep-alive expires on time. I don't love it, but since
Backport to 3.11: 💚 backport PR created ✅ Backported as #10173 🤖 @patchback
Co-authored-by: J. Nick Koston <[email protected]> (cherry picked from commit 7c12b1a)
Backport to 3.12: 💚 backport PR created ✅ Backported as #10174 🤖 @patchback
Co-authored-by: J. Nick Koston <[email protected]> (cherry picked from commit 7c12b1a)
…ime is not moving forward (#10173) Co-authored-by: Bruce Merry <[email protected]> Fixes #10149.
…ime is not moving forward (#10174) Co-authored-by: Bruce Merry <[email protected]> Fixes #10149.
Hi, I have a single machine (x86_64-linux) where

Any ideas what could cause this? Given that everything timing-related should be mocked, and the build is done via a Nix derivation which produces a consistent environment with the same precise dependency versions across multiple machines, I'm a bit at a loss to explain the failure. Not sure what to add without some ideas. As said, I can't really reproduce the failure elsewhere myself so far…
No idea, you'd have to do some debugging.
I haven't touched aiohttp in quite a while, but does this test use real TCP sockets? If so, timing isn't going to be deterministic: even with localhost connections, there is a delay between userspace sending data into one end of the socket and the kernel signalling that it's ready on the other end.
The code waits for the response to be read, so the socket should already be idle. It's then just yielding to the loop twice in order to trigger the timeout code, which should reset it. So I wouldn't expect there to be a timing dependency, at a glance.
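For readers unfamiliar with the yield-twice pattern mentioned above, `asyncio.sleep(0)` suspends the coroutine for one loop iteration without advancing time, giving already-due callbacks a chance to run. A generic illustration (not the aiohttp test itself):

```python
import asyncio

async def main():
    ran = []
    loop = asyncio.get_running_loop()
    # Schedule a callback for the next loop iteration.
    loop.call_soon(ran.append, "callback")
    # Yield to the event loop twice: once so the queued callback runs,
    # and once more for anything that callback schedules in turn.
    await asyncio.sleep(0)
    await asyncio.sleep(0)
    return ran

result = asyncio.run(main())
```

No wall-clock time needs to pass for this to work, which is why the test should be deterministic in principle.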
I'm trying to build the aiohttp package, and I'm having exactly the same issue as @sternenseemann. I tried debugging and found out that

If I modify the test and shift the patched loop.time return value even a microsecond further:

I can't understand why this is happening. From what I've seen in the code,

The only possible reason I could think of is that when the test was written, asyncio performed the timer handler calls at exactly the specified time, but later Python changed this behavior, and the timer is now called only when the time exceeds the scheduled value, not when it's exactly equal. But I don't know how to verify this hypothesis. Can you suggest any ideas I could try? I'm using Python 3.12.9 on a RHEL10-derived OS.
I think this is fine if that gets the test working for everyone.
That's clearly not possible, as we are running every commit on Python 3.10-3.14 on Windows, macOS and Ubuntu, and the test is still passing on all of them (plus PyPy as well). Maybe RHEL has some patches to Python that affect the behaviour (the previous poster did not mention which distro they used).
But that would be completely against the original idea of the test: checking how the library behaves when the time is exactly identical. Would it not? That was what the pull request changed:
Hm... Thanks for the info, that's really puzzling, but useful to know. I actually wanted to ask what the target python version was, but completely forgot. |
Oh yes, you're right, that is exactly what it's trying to test. You said the function wasn't getting called at all, which is down to the asyncio loop implementation. I guess it's really an implementation detail whether the loop calls the function at the recorded time or immediately after... So, looking through the implementation, it breaks and doesn't run the callback if the time is
Or possibly such a small number that it gets lost due to floating-point precision during the addition (for reference, my machine returns

This suggests the original issue reported with async-solipsism isn't actually resolved for some machines. I'd suggest trying to get a similar change into asyncio to remove the
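The comparison in question (a simplified sketch paraphrasing the shape of CPython's `BaseEventLoop._run_once`, not the real code) only fires a scheduled handle if its deadline is strictly below `time() + clock_resolution`, and that float addition is exactly where the resolution can vanish:

```python
def due_handles(scheduled_whens, now, clock_resolution=1e-9):
    """Return the deadlines the loop would fire this iteration.

    Mirrors the shape of CPython's check:
        end_time = self.time() + self._clock_resolution
        if handle._when >= end_time: break
    """
    end_time = now + clock_resolution  # float add: may round back to `now`
    return [when for when in sorted(scheduled_whens) if when < end_time]


# Small clock value: a handle scheduled exactly "now" is due.
assert due_handles([10.0], now=10.0) == [10.0]

# Large clock value (hundreds of days of loop time): the same
# exactly-on-time handle is NOT due, because now + 1e-9 == now.
big = 2.0 ** 24  # seconds
assert due_handles([big], now=big) == []
```

So whether an exactly-on-time callback runs in a given iteration can depend on the magnitude of `loop.time()` itself.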
My two machines report
I haven't followed the entire context of the discussion, but on my machine it looks like
No, like for the others who posted above, it returns 1e-09 on the machine where I perform the build.
It's only using the resolution value and then adding to loop.time(). So unless someone is taking 194 days to execute the tests...
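The 194-day figure checks out with a quick float-precision calculation (assuming a clock resolution of 1e-09, as reported by the machines above): once `loop.time()` reaches 2**24 seconds, the spacing between adjacent floats exceeds twice the resolution, so adding 1e-9 rounds back to the same value.

```python
import math

RESOLUTION = 1e-9  # clock resolution reported above

# Threshold: the smallest power of two where the gap between adjacent
# floats (the ulp) is big enough that adding 1e-9 rounds away entirely.
threshold = 2.0 ** 24  # seconds

print(threshold / 86400)  # roughly 194 days of "loop time"

assert threshold + RESOLUTION == threshold            # the shift is lost
assert (threshold - 1) + RESOLUTION > threshold - 1   # just below: survives
assert math.ulp(threshold) / 2 > RESOLUTION           # why it's lost
```

2**24 seconds is about 194.2 days, matching the uptime observations later in the thread.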
Hmm, then I have no idea. Try hacking those asyncio lines and check what the values are, and whether removing the
OK, that is interesting. I don't know why the event loop timer has this value, and what time it's expected to be, but when I run the test,

Now the big question is why the loop timer is using the machine's uptime, and whether that is intentional or not.

P. S. I can see in the code that
So, basically, the test depends on how long the machine was running since last reboot... |
Oh, for some reason I thought loop.time() counted from when the loop started running. My suggestion to try and get the
That would fit. It is succeeding on the machine with 12 days of uptime and failing on the machine with 467 days of uptime.
To be honest, I have no idea why they even implemented the comparison like this. I found the commit where it was done, and originally they did have just

Here is the commit, made back in 2014 (Python 3.4):

In fact, after I looked through git blame, that part has been done and redone multiple times. Initially there was no extra shift, then it was added, then removed, then added in a slightly different way, then removed again… And unfortunately, the discussions about the reasons for these changes are a bit above my level of knowledge. But since it was discussed and redesigned multiple times, I think they knew what they were doing, and there probably are serious reasons to keep the asyncio behavior the way it's implemented now.

However, I concocted a dirty fix for this aiohttp test to overcome the floating-point resolution issue:

Can you think of any possible issues or side effects of this approach? Actually, maybe asyncio itself could adopt a similar fix, but I'm not ready to go to the Python developers with such a suggestion yet…
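One robust variant of such a workaround (an illustrative sketch, not the actual patch discussed above): instead of adding the clock resolution to the mocked `loop.time()` value, which can round away, step to the next representable float, which by construction always advances.

```python
import math

def just_past(deadline: float) -> float:
    """Smallest float strictly greater than `deadline` (math.nextafter, 3.9+)."""
    return math.nextafter(deadline, math.inf)


t = 2.0 ** 24            # ~194 days of loop time, where the rounding bites
assert t + 1e-9 == t     # naive shift is lost to rounding
assert just_past(t) > t  # nextafter always yields a strictly larger value
```

The caveat, as noted earlier in the thread, is that the original test deliberately exercised the exactly-equal case, and any forward shift changes what is being tested.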
P. S. I found the relevant suggestion for an asyncio fix in Python's GitHub:
They don't appear to have done the maths in that issue. They are saying that it'd likely become an issue at some point if it was epoch time. They don't seem to realise that it's about 6 months before it breaks... |
I guess it's worth trying to create another issue for them, where the exact details will be explained. The only thing is: can you suggest any real-life scenarios that are affected by this asyncio issue? Right now I have the impression that the only effect is that with high uptime the scheduled timers will be called just a little bit later than they could have been (one extra "tick" will be skipped). But timers never guaranteed the exact time of the calls; I don't think asyncio is designed for RTOS applications. We only have the failure because the test verifies very specific conditions. I'm not sure that the Python developers would consider changing the standard library just because "one third-party module in combination with another third-party module used to fail, and now one of the tests relies on something it should have never relied upon, and fails, so please modify asyncio behavior that has been there for 8 years and did not bother anybody"...
Yeah, it's probably only a minor efficiency loss on long-running machines. But I don't see any advantage to the alternative of not running tasks immediately when they are literally scheduled to run now.
I mean, the reason to add the clock resolution is surely to help callbacks get run before their scheduled time passes, so it makes no sense to have some things not run at the scheduled time when you're literally trying to make them run slightly early in most cases.
Good news! I've sent the fix suggestion to the Python development team, and they have accepted it. I backported the patch to our version of Python 3.12, and with it applied, the aiohttp tests all pass successfully. For reference, here is the link to the accepted pull request: python/cpython#145706
What do these changes do?
If the keepalive handler is called too soon, it reschedules itself. The test used now <= close_time, which means that an exactly on-time notification is treated as "too soon", causing an automatic rescheduling. For real systems the time will eventually advance and break the loop, but with async-solipsism, time doesn't advance until there is some reason to sleep and the loop is infinite.

Are there changes in behavior for the user?
This will fix infinite loops when using async-solipsism.
Is it a substantial burden for the maintainers to support this?
No. This does not increase the amount of code at all.
Related issue number
Fixes #10149.
Checklist
- CONTRIBUTORS.txt (already there)
- Add a news fragment into the CHANGES/ folder; name it <issue_or_pr_num>.<type>.rst (e.g. 588.bugfix.rst); if you don't have an issue number, change it to the pull request number after creating the PR
  - .bugfix: A bug fix for something the maintainers deemed an improper undesired behavior that got corrected to match pre-agreed expectations.
  - .feature: A new behavior, public APIs. That sort of stuff.
  - .deprecation: A declaration of future API removals and breaking changes in behavior.
  - .breaking: When something public is removed in a breaking way. Could be deprecated in an earlier release.
  - .doc: Notable updates to the documentation structure or build process.
  - .packaging: Notes for downstreams about unobvious side effects and tooling. Changes in the test invocation considerations and runtime assumptions.
  - .contrib: Stuff that affects the contributor experience. e.g. Running tests, building the docs, setting up the development environment.
  - .misc: Changes that are hard to assign to any of the above categories.
Make sure to use full sentences with correct case and punctuation, for example:

Use the past tense or the present tense in a non-imperative mood, referring to what's changed compared to the last released version of this project.