Skip to content

Fix infinite callback loop when time is not moving forward#10151

Merged
bdraco merged 5 commits intoaio-libs:masterfrom
bmerry:fix-repeated-keepalive
Dec 17, 2024
Merged

Fix infinite callback loop when time is not moving forward#10151
bdraco merged 5 commits intoaio-libs:masterfrom
bmerry:fix-repeated-keepalive

Conversation

@bmerry
Copy link
Copy Markdown
Contributor

@bmerry bmerry commented Dec 10, 2024

What do these changes do?

If the keepalive handler is called too soon, it reschedules itself. The test used now <= close_time, which means that an exactly on-time notification is treated as "too soon", causing an automatic rescheduling. For real systems the time will eventually advance and break the loop, but with async-solipsism, time doesn't advance until there is some reason to sleep and the loop is infinite.

Are there changes in behavior for the user?

This will fix infinite loops when using async-solipsism.

Is it a substantial burden for the maintainers to support this?

No. This does not increase the amount of code at all.

Related issue number

Fixes #10149.

Checklist

  • I think the code is well written
  • Unit tests for the changes exist (I am assuming that the keepalive functionality has existing test coverage)
  • Documentation reflects the changes (no documentation change needed)
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt (already there)
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES/ folder
    • name it <issue_or_pr_num>.<type>.rst (e.g. 588.bugfix.rst)

    • if you don't have an issue number, change it to the pull request
      number after creating the PR

      • .bugfix: A bug fix for something the maintainers deemed an
        improper undesired behavior that got corrected to match
        pre-agreed expectations.
      • .feature: A new behavior, public APIs. That sort of stuff.
      • .deprecation: A declaration of future API removals and breaking
        changes in behavior.
      • .breaking: When something public is removed in a breaking way.
        Could be deprecated in an earlier release.
      • .doc: Notable updates to the documentation structure or build
        process.
      • .packaging: Notes for downstreams about unobvious side effects
        and tooling. Changes in the test invocation considerations and
        runtime assumptions.
      • .contrib: Stuff that affects the contributor experience. e.g.
        Running tests, building the docs, setting up the development
        environment.
      • .misc: Changes that are hard to assign to any of the above
        categories.
    • Make sure to use full sentences with correct case and punctuation,
      for example:

      Fixed issue with non-ascii contents in doctest text files
      -- by :user:`contributor-gh-handle`.

      Use the past tense or the present tense a non-imperative mood,
      referring to what's changed compared to the last released version
      of this project.

If the keepalive handler is called too soon, it reschedules itself. The
test used `now <= close_time`, which means that an exactly on-time
notification is treated as "too soon", causing an automatic
rescheduling. For real systems the time will eventually advance and
break the loop, but with async-solipsism, time doesn't advance until
there is some reason to sleep and the loop is infinite.

Closes aio-libs#10149.
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Dec 10, 2024
@codecov
Copy link
Copy Markdown

codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.75%. Comparing base (6200513) to head (03f7a3c).
Report is 365 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #10151   +/-   ##
=======================================
  Coverage   98.75%   98.75%           
=======================================
  Files         122      122           
  Lines       36954    36997   +43     
  Branches     4411     4413    +2     
=======================================
+ Hits        36494    36538   +44     
  Misses        313      313           
+ Partials      147      146    -1     
Flag Coverage Δ
CI-GHA 98.64% <100.00%> (+<0.01%) ⬆️
OS-Linux 98.33% <95.45%> (+<0.01%) ⬆️
OS-Windows 96.18% <100.00%> (+<0.01%) ⬆️
OS-macOS 97.44% <95.45%> (+<0.01%) ⬆️
Py-3.10.11 97.28% <100.00%> (+<0.01%) ⬆️
Py-3.10.15 97.82% <95.45%> (-0.05%) ⬇️
Py-3.11.10 ?
Py-3.11.11 97.85% <95.45%> (?)
Py-3.11.9 97.33% <100.00%> (-0.01%) ⬇️
Py-3.12.7 ?
Py-3.12.8 98.38% <100.00%> (?)
Py-3.13.0 ?
Py-3.13.1 98.38% <95.45%> (+0.02%) ⬆️
Py-3.9.13 97.20% <100.00%> (+<0.01%) ⬆️
Py-3.9.20 97.78% <95.45%> (+0.04%) ⬆️
Py-pypy7.3.16 97.35% <95.45%> (+<0.01%) ⬆️
VM-macos 97.44% <95.45%> (+<0.01%) ⬆️
VM-ubuntu 98.33% <95.45%> (+<0.01%) ⬆️
VM-windows 96.18% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Dec 10, 2024

CodSpeed Performance Report

Merging #10151 will not alter performance

Comparing bmerry:fix-repeated-keepalive (03f7a3c) with master (0070045)

Summary

✅ 47 untouched benchmarks

@bdraco bdraco changed the title Fix infinite callback loop when used with async-solipsism Fix infinite callback loop when time is not moving forward Dec 10, 2024
@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

We need to make this very clear it's not a production problem so we don't end up with unexpected questions or discussions.

Additionally we haven't committed to supporting async-solipsism at this time so I changed the title to better reflect that.

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

I think we should change the fragment to packaging instead of bugfix since it's for testing and doesn't affect production

@Dreamsorcerer
Copy link
Copy Markdown
Member

Is there any chance of a regression test without adding a new dependency?

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

I think a test could be added using freezegun

@bmerry
Copy link
Copy Markdown
Contributor Author

bmerry commented Dec 10, 2024

I think we should change the fragment to packaging instead of bugfix since it's for testing and doesn't affect production

Sure, I'll do that. But are you sure aiohttp isn't used in any environment in which the timer precision might be low enough that it causes the event to be re-scheduled unnecessarily (which won't be an infinite loop, but will cause more work than necessary)? I'm thinking of environments that deliberately degrade timer precision to prevent Spectra-type side-channel attacks.

I think a test could be added using freezegun

From what I've seen, freezegun only mocks queries of time, but doesn't provide any mechanisms to fake time passing in select/epoll and similar functions. I don't know how easy it'll be to mock that reliably without falling down a rabbit-hole and re-inventing async-solipsism. It's certainly not something I'll have time to do before I go on leave until January.

I've never tested async-solipsism on non-Linux OSes, so it might not be something you want the test suite to depend on at this point.

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

I think we should change the fragment to packaging instead of bugfix since it's for testing and doesn't affect production

Sure, I'll do that. But are you sure aiohttp isn't used in any environment in which the timer precision might be low enough that it causes the event to be re-scheduled unnecessarily (which won't be an infinite loop, but will cause more work than necessary)? I'm thinking of environments that deliberately degrade timer precision to prevent Spectra-type side-channel attacks.

That seems unlikely as we haven't had any issue reports, and even CLOCK_MONOTONIC_COARSE should tick forward enough that it would not be a problem.

I think a test could be added using freezegun

From what I've seen, freezegun only mocks queries of time, but doesn't provide any mechanisms to fake time passing in select/epoll and similar functions. I don't know how easy it'll be to mock that reliably without falling down a rabbit-hole and re-inventing async-solipsism. It's certainly not something I'll have time to do before I go on leave until January.

I've never tested async-solipsism on non-Linux OSes, so it might not be something you want the test suite to depend on at this point.

We don't have any maintainers familiar with it either so its not something we could effectively troubleshoot so we wouldn't want to depend on it right now.

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

From what I've seen, freezegun only mocks queries of time, but doesn't provide any mechanisms to fake time passing in select/epoll and similar functions. I don't know how easy it'll be to mock that reliably without falling down a rabbit-hole and re-inventing async-solipsism. It's certainly not something I'll have time to do before I go on leave until January.

I think we only need to patch loop.time for this case

@webknjaz
Copy link
Copy Markdown
Member

I think we should change the fragment to packaging

I'm not so sure — it doesn't touch any packaging metadata / mechanisms / downstream expectations. This would be misleading, in my opinion. I suppose, we could go for misc instead. If not, then it might be contrib.

But yeah, a regression test would be most welcome here...

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 10, 2024

misc works for me

@bmerry
Copy link
Copy Markdown
Contributor Author

bmerry commented Dec 11, 2024

Ok, renamed to misc. I might know how to write a unit test using only freezegun, but I'll probably only have time to experiment with it next year.

@bdraco
Copy link
Copy Markdown
Member

bdraco commented Dec 14, 2024

I added a test to ensure the keep alive expires on time. I don't love it, but since RequestHandler isn't patchable, I couldn't come up with something that didn't access the internals.

@bdraco bdraco merged commit 7c12b1a into aio-libs:master Dec 17, 2024
@patchback
Copy link
Copy Markdown
Contributor

patchback bot commented Dec 17, 2024

Backport to 3.11: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.11/7c12b1a9c8b2a9e33fb559229a4c4695de39f08c/pr-10151

Backported as #10173

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback bot pushed a commit that referenced this pull request Dec 17, 2024
@patchback
Copy link
Copy Markdown
Contributor

patchback bot commented Dec 17, 2024

Backport to 3.12: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.12/7c12b1a9c8b2a9e33fb559229a4c4695de39f08c/pr-10151

Backported as #10174

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback bot pushed a commit that referenced this pull request Dec 17, 2024
bdraco pushed a commit that referenced this pull request Dec 17, 2024
…ime is not moving forward (#10173)

Co-authored-by: Bruce Merry <[email protected]>
Fixes #123'). -->
Fixes #10149.
bdraco pushed a commit that referenced this pull request Dec 17, 2024
…ime is not moving forward (#10174)

Co-authored-by: Bruce Merry <[email protected]>
Fixes #123'). -->
Fixes #10149.
@sternenseemann
Copy link
Copy Markdown

sternenseemann commented Jul 29, 2025

Hi, I have a single machine (x86_64-linux) where test_keepalive_expires_on_time which has been added here consistently fails. All other machines I have tried so far passed the test successfully.

Any ideas what could cause this? Given that everything timing related should be mocked and the build is done via a Nix derivation which produces a consistent environment with the same precise dependency versions across multiple machines, I'm a bit at a loss to explain the failure.

=================================== FAILURES ===================================
____________________ test_keepalive_expires_on_time[pyloop] ____________________
[gw1] linux -- Python 3.13.5 /nix/store/djck7mx6jad1w0yy6zings96dyxanls6-python3-3.13.5/bin/python3.13

aiohttp_client = <function aiohttp_client.<locals>.go at 0x7fffc05bd300>

    async def test_keepalive_expires_on_time(aiohttp_client: AiohttpClient) -> None:
        """Test that the keepalive handle expires on time."""
    
        async def handler(request: web.Request) -> web.Response:
            body = await request.read()
            assert b"" == body
            return web.Response(body=b"OK")
    
        app = web.Application()
        app.router.add_route("GET", "/", handler)
    
        connector = aiohttp.TCPConnector(limit=1)
        client = await aiohttp_client(app, connector=connector)
    
        loop = asyncio.get_running_loop()
        now = loop.time()
    
        # Patch loop time so we can control when the keepalive timeout is processed
        with mock.patch.object(loop, "time") as loop_time_mock:
            loop_time_mock.return_value = now
            resp1 = await client.get("/")
            await resp1.read()
            request_handler = client.server.handler.connections[0]
    
            # Ensure the keep alive handle is set
            assert request_handler._keepalive_handle is not None
    
            # Set the loop time to exactly the keepalive timeout
            loop_time_mock.return_value = request_handler._next_keepalive_close_time
    
            # sleep twice to ensure the keep alive timeout is processed
            await asyncio.sleep(0)
            await asyncio.sleep(0)
    
            # Ensure the keep alive handle expires
>           assert request_handler._keepalive_handle is None
E           assert <TimerHandle when=22137692.994616266 RequestHandler._process_keepalive()> is None
E            +  where <TimerHandle when=22137692.994616266 RequestHandler._process_keepalive()> = <RequestHandler connected>._keepalive_handle

aiohttp_client = <function aiohttp_client.<locals>.go at 0x7fffc05bd300>
app        = <Application 0x7fffee213e30>
client     = <aiohttp.test_utils.TestClient object at 0x7fffc0562d50>
connector  = <aiohttp.connector.TCPConnector object at 0x7fffede76d50>
handler    = <function test_keepalive_expires_on_time.<locals>.handler at 0x7fffc05bc180>
loop       = <_UnixSelectorEventLoop running=False closed=False debug=False>
loop_time_mock = <MagicMock name='time' id='140737188871344'>
now        = 22134062.994616266
request_handler = <RequestHandler connected>
resp1      = <ClientResponse(http://127.0.0.1:41051/) [200 OK]>
<CIMultiDictProxy('Content-Length': '2', 'Content-Type': 'application/octet-stream', 'Date': 'Tue, 29 Jul 2025 16:48:28 GMT', 'Server': 'Python/3.13 aiohttp/3.12.13')>


tests/test_web_functional.py:2378: AssertionError

Not sure what to add without some ideas. As said, I can't really reproduce the failure elsewhere myself so far…

@Dreamsorcerer
Copy link
Copy Markdown
Member

No idea, you'd have to do some debugging.

@bmerry
Copy link
Copy Markdown
Contributor Author

bmerry commented Jul 29, 2025

Hi, I have a single machine (x86_64-linux) where test_keepalive_expires_on_time which has been added here consistently fails. All other machines I have tried so far passed the test successfully.

Any ideas what could cause this? Given that everything timing related should be mocked

I haven't touched aiohttp in quite a while, but does this test use real TCP sockets? If so, timing isn't going to be deterministic: even with localhost connections, there is a delay between userspace sending data into one end of the socket and the kernel signalling that it's ready on the other end.

@Dreamsorcerer
Copy link
Copy Markdown
Member

The code waits for the response to be read, so the socket should already be idle. It's then just yielding to the loop twice in order to trigger the timeout code which should reset it. So, I wouldn't expect there to be a timing dependency at a glance..

@CaptainFlint
Copy link
Copy Markdown

I'm trying to build the aiohttp package, and having exactly the same issue as @sternenseemann . I tried debugging and found out that RequestHandler._process_keepalive() is not called at all when running this test.

If I modify the test and shift the patched loop.time return value even a microsecond further:
loop_time_mock.return_value = request_handler._next_keepalive_close_time + 0.001
then _process_keepalive() is called, verifies that the (fake) current time has passed beyond the keepalive timeout, and the keepalive handler is nullified. So the test passes successfully. But when the patched return value is exactly _next_keepalive_close_time, the _process_keepalive() is simply not called at all, and therefore the keepalive handler remains active, which causes the test to fail.

I can't understand why this is happening. From what I've seen in the code, _keepalive_handle is initialized at the end of the RequestHandler.start() method, and is set up as a timer to run _process_keepalive() at the time _next_keepalive_close_time = loop.time() + keepalive_timeout. And this is where I got lost, since I have no experience with asyncio, let alone its internal clockwork. The test clearly expects the _process_keepalive() to be called during one of the asyncio.sleep(0) calls, but I can't tell why it doesn't happen.

The only possible reason I could think of, is that when the test was written, asyncio performed the timer handler calls at exactly the specified time. But later python changed this behavior, and the timer is now called only when the timer exceeds the specified time, but not when it's exactly equal. But I don't know how to verify this hypothesis. Can you suggest any ideas I could try?

I'm using python 3.12.9 on a RHEL10-derived OS.

@Dreamsorcerer
Copy link
Copy Markdown
Member

loop_time_mock.return_value = request_handler._next_keepalive_close_time + 0.001

I think this is fine if that gets the test working for everyone.

But later python changed this behavior

That's clearly not possible as we are running every commit on Python 3.10-3.14 on Window, Mac and Ubuntu, and the test is still passing on all of them (plus PyPy as well). Maybe RHEL has some patches to Python that affect the behaviour (previous poster did not mention which distro they used).

@CaptainFlint
Copy link
Copy Markdown

I think this is fine if that gets the test working for everyone.

But that would be completely against the original idea of the test: checking how the library behaves when the time is exactly identical. Would it not? That was what the pull request changed: now <= close_time to now < close_time.

we are running every commit on Python 3.10-3.14 on Window, Mac and Ubuntu, and the test is still passing on all of them

Hm... Thanks for the info, that's really puzzling, but useful to know. I actually wanted to ask what the target python version was, but completely forgot.
OK, I can try building vanilla python and/or trying another distro, then.

@Dreamsorcerer
Copy link
Copy Markdown
Member

Dreamsorcerer commented Feb 25, 2026

But that would be completely against the original idea of the test: checking how the library behaves when the time is exactly identical. Would it not? That was what the pull request changed: now <= close_time to now < close_time.

Oh yes, you're right, that is exactly what it's trying to test.

You said the function wasn't getting called at all, which is down to the asyncio loop implementation. I guess it's really an implementation detail as to whether the loop calls the function at the recorded time, or immediately after...

So, actually looking through the implementation, it actually breaks and doesn't run the callback if the time is == end_time:
https://github.com/python/cpython/blob/80b2b8833830194fd18cf4e111c5d6d8a3e27b30/Lib/asyncio/base_events.py#L2017
That has clock_resolution added to the current time, so I guess your problem is that time.get_clock_info('monotonic').resolution returns 0 on your machine?
https://github.com/python/cpython/blob/80b2b8833830194fd18cf4e111c5d6d8a3e27b30/Lib/asyncio/base_events.py#L430

@Dreamsorcerer
Copy link
Copy Markdown
Member

Dreamsorcerer commented Feb 25, 2026

That has clock_resolution added to the current time, so I guess your problem is that time.get_clock_info('monotonic').resolution returns 0 on your machine?

Or possibly such a small number that it gets lost due to floating point precision during the addition (for reference, my machine returns 1e-09).

This suggests the original issue reported with async-solipsism isn't actually resolved for some machines. I'd suggest trying to get a similar change into asyncio to remove the = in that check. It doesn't make sense to me that it doesn't run a callback immediately when it's already the time it's been asked to run it at..

@sternenseemann
Copy link
Copy Markdown

My two machines report 1e-09 as well from Python 3.13.12 (from Nixpkgs). On one of them the test suite succeeds, on one it fails (with the same dependencies, aiohttp version etc.). So doesn't seem to be the resolution.

@bmerry
Copy link
Copy Markdown
Contributor Author

bmerry commented Feb 25, 2026

My two machines report 1e-09 as well from Python 3.13.12 (from Nixpkgs). On one of them the test suite succeeds, on one it fails (with the same dependencies, aiohttp version etc.). So doesn't seem to be the resolution.

I haven't followed the entire context of the discussion, but on my machine it looks like time.monotonic() counts from boot. If the uptime is more than 194 days (2^24 seconds), then adding 1e-9 to a time.monotonic() value will not change the value at all due to float rounding.

@CaptainFlint
Copy link
Copy Markdown

That has clock_resolution added to the current time, so I guess your problem is that time.get_clock_info('monotonic').resolution returns 0 on your machine?

No, just like for the others who just posted above, it returns 1e-09 on the machine where I perform the build.

@Dreamsorcerer
Copy link
Copy Markdown
Member

If the uptime is more than 194 days (2^24 seconds), then adding 1e-9 to a time.monotonic() value will not change the value at all due to float rounding.

It's only using the resolution value and then adding to loop.time(). So unless someone is taking 194 days to execute the tests...

No, just like for the others who just posted above, it returns 1e-09 on the machine where I perform the build.

Hmm, then I have no idea. Try hacking those asyncio lines and check what the values are, and whether removing the = works or not.

@CaptainFlint
Copy link
Copy Markdown

CaptainFlint commented Feb 25, 2026

OK, that is interesting. I don't know why the event loop timer has this value, and what time it's expected to be, but when I run the test, loop.time() returns the current uptime of my machine. Right now this is over 260 days, so the returned value from the last try was 22538602.881994113. And when I added the clock tick to the current value, it didn't change. So, yes, it is indeed the floating point resolution issue.

Now the big question is, why the loop timer is using the machine's uptime, whether it is interntional or not.

P. S. I can see in the code that loop.time() just returns time.monotonic(), but I can't see in the python documentation any restrictions on the possible values that this particular timer should return, except that, obviously, it's expected to be monotonic.

@bmerry
Copy link
Copy Markdown
Contributor Author

bmerry commented Feb 25, 2026

Now the big question is, why the loop timer is using the machine's uptime, whether it is interntional or not.

loop.time() ultimately uses the clock_gettime system call with CLOCK_MONOTONIC (or something equivalent), for which the man page says

A nonsettable system-wide clock that represents monotonic time
since—as described by POSIX—"some unspecified point in the
past". On Linux, that point corresponds to the number of seconds
that the system has been running since it was booted.

@CaptainFlint
Copy link
Copy Markdown

CaptainFlint commented Feb 25, 2026

So, basically, the test depends on how long the machine was running since last reboot...

@Dreamsorcerer
Copy link
Copy Markdown
Member

Oh, for some reason I thought loop.time() counted from when the loop started running.

My suggestion to try and get the = removed from asyncio is still valid then and seems like the best approach.

@sternenseemann
Copy link
Copy Markdown

I haven't followed the entire context of the discussion, but on my machine it looks like time.monotonic() counts from boot. If the uptime is more than 194 days (2^24 seconds), then adding 1e-9 to a time.monotonic() value will not change the value at all due to float rounding.

That would fit. It is succeeding on the machine with 12 days of uptime and failing on the machine with 467 days of uptime.

@CaptainFlint
Copy link
Copy Markdown

To be honest, I have no idea why they even implemented the comparison like this. I found a commit when it was done, and originally they did have just handle._when > self.time(), but then replaced it with a "greater-or-equal" comparison with the "next tick".

Here is the commit, made back in 2014 (Python 3.4):
python/cpython@ed1654f
The issue link is broken since it refers to the old pre-GitHub system, the actual link is this: https://bugs.python.org/issue20505

In fact, after I looked through git blame, that part has been done and redone multiple times. Initially there was no extra shift, then it was added, then removed, then added in a slightly different way, then removed again… And unfortunately, the discussions about the reasons of these changes are a bit above my level of knowledge. But since it was discussed and redesigned multiple times, I think they know what they were doing, and there probably are serious reasons to keep the asyncio behavior the way it's implemented now.

However, I concocted a dirty fix for this aiohttp test to overcome the floating point resolution issue:

diff --git a/tests/test_web_functional.py b/tests/test_web_functional.py
index e49798513..bbb0d8b7c 100644
--- a/tests/test_web_functional.py
+++ b/tests/test_web_functional.py
@@ -2353,6 +2353,11 @@ async def test_keepalive_expires_on_time(aiohttp_client: AiohttpClient) -> None:
         # Ensure the keep alive handle is set
         assert request_handler._keepalive_handle is not None
 
+        # Adjust clock resolution to make sure asyncio loop triggers the event
+        old_clock_resolution = request_handler._loop._clock_resolution
+        while request_handler._next_keepalive_close_time + request_handler._loop._clock_resolution == request_handler._next_keepalive_close_time:
+            request_handler._loop._clock_resolution *= 2
+
         # Set the loop time to exactly the keepalive timeout
         loop_time_mock.return_value = request_handler._next_keepalive_close_time
 
@@ -2360,5 +2365,8 @@ async def test_keepalive_expires_on_time(aiohttp_client: AiohttpClient) -> None:
         await asyncio.sleep(0)
         await asyncio.sleep(0)
 
+        # Restore the original resolution
+        request_handler._loop._clock_resolution = old_clock_resolution
+
         # Ensure the keep alive handle expires
         assert request_handler._keepalive_handle is None
 

Can you think of any possible issues or side effects of this approach?

Actually, maybe asyncio itself could adopt some similar fix, but I'm not ready to go to the Python developers with such a suggestion yet…

@CaptainFlint
Copy link
Copy Markdown

CaptainFlint commented Feb 26, 2026

P. S. I found the relevant suggestion for asyncio fix in Python's GitHub:
python/cpython#116342
But it has been declined as unlikely to occur in real life scenarios. Apparently, high-uptime servers are not considered realistic scenarios anymore. There goes that idea...

@Dreamsorcerer
Copy link
Copy Markdown
Member

P. S. I found the relevant suggestion for asyncio fix in Python's GitHub: python/cpython#116342 But it has been declined as unlikely to occur in real life scenarios. Apparently, high-uptime servers are not considered realistic scenarios anymore. There goes that idea...

They don't appear to have done the maths in that issue. They are saying that it'd likely become an issue at some point if it was epoch time. They don't seem to realise that it's about 6 months before it breaks...

@CaptainFlint
Copy link
Copy Markdown

I guess it's worth trying to create another issue for them, where the exact details will be explained.

The only thing is: can you suggest any real life scenarios that are affected by this asyncio's issue? Right now I have the impression that the only effect is, that with high uptime the scheduled timers will be called just a little bit later than they could have been (one extra "tick" will be skipped). But timers never guaranteed exact time of the calls. I don't think asyncio is designed for RTOS applications.

We only have the failure because the test verifies very specific conditions. I'm not sure that the Python developers would consider changing the standard library just because "one third-party module in combination with another third-party module used to fail, and now one of the tests rely on something it should have never relied upon, and fails, so please modify asyncio's behavior that has been there for 8 years and did not bother anybody"...

@Dreamsorcerer
Copy link
Copy Markdown
Member

Yeah, it's probably only a minor efficiency loss on long running machines. But, I don't see any advantage to the alternative of not running tasks immediately when they are literally scheduled to run now.

@Dreamsorcerer
Copy link
Copy Markdown
Member

I mean the reason to add the clock resolution is surely to help callbacks get run before their scheduled time passes, so it really makes no sense to have some things not run at the scheduled time, when you're literally trying to make them run slightly early in most cases.

@CaptainFlint
Copy link
Copy Markdown

CaptainFlint commented Mar 14, 2026

Good news! I've sent the fix suggestion to the Python development team, and they have accepted it. I backported the patch to our version of Python 3.12, and with it applied the aiohttp tests all pass successfully.

For reference, here is the link to the accepted pull request: python/cpython#145706

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bot:chronographer:provided There is a change note present in this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Infinite loop when used with async_solipsism

6 participants