Have server ack app_heartbeat messages received from client#13810
Have server ack app_heartbeat messages received from client#13810
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
✅ PR preview is ready!
|
📈 Frontend coverage change detectedThe frontend unit test (vitest) coverage has increased by 0.1800%
🎉 Great job on improving test coverage! |
There was a problem hiding this comment.
Pull request overview
This PR implements a server-side acknowledgment mechanism for app heartbeat messages to enable faster detection of connection issues in hosted Streamlit environments (SiS and Community Cloud). When the client sends an app heartbeat, the server now responds with a heartbeat_ack message. If the client doesn't receive this acknowledgment within a timeout (29 seconds), it attempts to reconnect, which either surfaces the connection error dialog immediately or successfully reconnects with minimal disruption.
Changes:
- Added
heartbeat_ackfield to theForwardMsgprotobuf definition for server-to-client acknowledgment - Modified backend
_handle_app_heartbeat_requestto send acknowledgment ForwardMsg when heartbeat is received - Implemented heartbeat timeout tracking in
ConnectionManagerwith automatic reconnection on timeout - Added
reconnect()method toWebsocketConnectionto trigger reconnection without permanent disconnection
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| proto/streamlit/proto/ForwardMsg.proto | Added heartbeat_ack boolean field (26) to ForwardMsg for server acknowledgments |
| lib/streamlit/runtime/app_session.py | Modified _handle_app_heartbeat_request to send heartbeat_ack ForwardMsg response |
| lib/tests/streamlit/runtime/app_session_test.py | Added test verifying that heartbeat handler sends proper ack message |
| frontend/connection/src/constants.ts | Added HEARTBEAT_ACK_TIMEOUT_MS constant (29 seconds) for ack timeout |
| frontend/connection/src/WebsocketConnection.tsx | Added reconnect() method to close connection and transition to PINGING_SERVER state |
| frontend/connection/src/WebsocketConnection.test.tsx | Added tests for reconnect behavior in different connection states |
| frontend/connection/src/ConnectionManager.ts | Implemented heartbeat ack timeout tracking and reconnection logic |
| frontend/connection/src/ConnectionManager.test.ts | Added comprehensive tests for heartbeat timeout, ack handling, and cleanup |
| frontend/app/src/App.tsx | Integrated heartbeat ack handling by calling ConnectionManager methods |
| frontend/app/src/App.test.tsx | Added tests verifying onHeartbeatSent and onHeartbeatAckReceived are called correctly |
| e2e_playwright/test_assets/hostframe.html | Added "Send Heartbeat" button for manual testing |
SummaryThis PR extends the existing Main changes:
Code QualityThe code is well-structured and follows existing patterns in the codebase. Strengths:
Minor observations:
Test CoverageTest coverage is comprehensive and follows best practices. Python tests (
Frontend tests:
E2E test infrastructure:
Backwards CompatibilityFully backwards compatible:
Security & RiskNo security concerns identified:
Risk assessment:
RecommendationsNo blocking issues found. The implementation is clean, well-tested, and follows established patterns. Optional consideration (non-blocking):
VerdictAPPROVED: This is a well-implemented feature with comprehensive test coverage, proper cleanup handling, and full backwards compatibility. The code follows existing patterns and best practices, and the PR description clearly explains the motivation and testing approach. This is an automated AI review using |
frontend/connection/src/constants.ts
Outdated
| * Timeout for heartbeat acknowledgment in milliseconds. | ||
| * If we send a heartbeat and don't receive an ack within this time, | ||
| * we consider the connection unhealthy. This time is set to be 1 second | ||
| * shorter than the heartbeat interval of 30s that we expect to be commonly |
There was a problem hiding this comment.
question: Is there any possibility of race conditions with this value?
There was a problem hiding this comment.
I don't think so. We could have the ack be received and timeout processed nearly simultaneously, but since browser JS is single-threaded and non-preemptive, we should have one of the two clearly happening first.
I think the only thing we risk with having some sort of race is potentially taking a bit longer to finally show the network error dialog, so the consequences aren't too dire if there is something we're missing.
To err on the side of caution, though, I think we should change this to 59s since SiS currently sends the heartbeat request every 60s. I initially wanted to see if we could go drop it on the SiS side to 30s so that we'd be more responsive to network issues, but thinking back, we have been burned previously by timeouts that I thought were very generous being problematic on some user's internet connections, so better safe than sorry I guess.
There was a problem hiding this comment.
You bring up a good point about timing constraints. Would it make sense to turn this into a config option so that SiS could tweak this value without also having to make a code change here? Or do you feel more confident with a hardcoded value?
There was a problem hiding this comment.
I think coordinating having parameters to control whether we turn on the ack behavior as well as setting a config option for the timeout might start to get messy (not sure if we can easily get parameter values plumbed to config options if we wanted to stage that rollout, so that itself might end up being involved if we wanted to control this via config option).
We might be able to change the expectAck field to be something like ackTimeout, and control both whether the ack is expected and what its timeout should be when truthy. Then we'll just have one param to worry about rollout for on the SiS side.
Does this seem reasonable?
There was a problem hiding this comment.
From the OSS side, that seems reasonable to me. If you think that would be beneficial for SiS, then please feel free to implement it. If you think it's over-engineering also feel free to push back.
There was a problem hiding this comment.
This seems like something we'd want to do on the SiS side. I'm not sure if we'll look into pinging more often / with shorter timeouts soon, but if we eventually choose to do so it'll be good to have this configurable so that we don't have to worry about the timeout value on older Streamlit versions. I ended up naming this new message field ackTimeoutMilliseconds.
cf9e264 to
b520036
Compare
|
@sfc-gh-bnisco thanks for the review / sorry for the slow turnaround on my side! I've addressed your PR feedback + also had a realization that we'll want some way to control the rollout from the SiS side, so I made another change adding a new |
501d6a7 to
098ab1e
Compare
SummaryThis PR extends the existing Key changes across the stack:
Code QualityThe code is well-structured and follows existing patterns throughout the codebase. Strengths:
Minor observations:
Test CoverageTest coverage is thorough and follows best practices. Python tests (
Frontend - ConnectionManager.test.ts (new file, 241 lines):
Frontend - App.test.tsx:
Frontend - HostCommunicationManager.test.tsx:
Frontend - WebsocketConnection.test.tsx:
E2E:
Overall, the test coverage is excellent for this feature. Backwards CompatibilityThis change is fully backwards compatible:
Security & Risk
AccessibilityThis PR has no user-facing UI changes. The only frontend-visible change is a new "Send Heartbeat" button in the test-only RecommendationsNo blocking issues found. Minor suggestions for consideration:
VerdictAPPROVED: Well-designed, backwards-compatible extension of the heartbeat mechanism with comprehensive test coverage across all layers. The opt-in approach with host-configurable timeout is a clean design. No blocking issues identified. This is an automated AI review using |
sfc-gh-bnisco
left a comment
There was a problem hiding this comment.
Thanks for the revisions, one possible suggestion inline, but LGTM!
|
|
||
| if (message.type === "SEND_APP_HEARTBEAT") { | ||
| this.props.sendAppHeartbeat() | ||
| this.props.sendAppHeartbeat(message.ackTimeoutMilliseconds ?? 0) |
There was a problem hiding this comment.
suggestion (non-blocking): Might be good to do some lightweight validation on this value:
typeof message.ackTimeoutMilliseconds === 'number' &&
Number.isFinite(message.ackTimeoutMilliseconds) &&
message.ackTimeoutMilliseconds > 0
Describe your changes
In hosted Streamlit app products like SiS and Community Cloud, we're currently very slow to detect when
a user's internet connection is running into problems. This is because we don't have a mechanism to quickly
detect when our websocket connection runs into trouble, and it's possible for minutes to pass after our
websocket connection actually drops before we finally see the network error dialog.
To make this process faster, we extend the existing
SEND_APP_HEARTBEATmechanism, which existsto allow the host of a Streamlit app to request that the OSS client send a heartbeat to the server for the sole
purpose of keeping its websocket connection alive in environments where it'll otherwise time out. With these
new changes, we also have the server send an ack that it received this message back to the client. If no
ack is received, we assume that we may have a connection error and attempt to reconnect our websocket
connection, which will cause us to either
In the normal OSS world, we don't receive
SEND_APP_HEARTBEATmessages, so none of this new codeis ever invoked.
Testing Plan
hostframe.htmlto add a new button to send an app heartbeatapp_session.pyto simulate a network issue