Collect websocket metrics and allow filtering by endpoint family#13324
Collect websocket metrics and allow filtering by endpoint family#13324
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
✅ PR preview is ready!
|
b6f1ed0 to
f7dd6e3
Compare
📉 Python coverage change detectedThe Python unit test coverage has decreased by 0.0493%
💡 Consider adding more unit tests to maintain or improve coverage. Coverage by files
|
f7dd6e3 to
16a014e
Compare
📉 Frontend coverage change detectedThe frontend unit test (vitest) coverage has decreased by 0.0000%
✅ Coverage change is within normal range. |
There was a problem hiding this comment.
Pull request overview
This PR extends the metrics collection system to support filtering metrics by family name and adds new websocket session metrics (connections, reconnections, disconnections, and active sessions count). The filtering capability allows callers to request only specific metrics families, avoiding the performance cost of computing expensive cache statistics when only lightweight metrics are needed.
Key Changes
- Modified
StatsProviderprotocol andStatsManagerto support optional filtering by metric family names - Added
CounterStatandGaugeStatclasses for tracking session events and active session counts - Implemented metrics collection in
WebsocketSessionManagerto track connect, reconnect, and disconnect events - Updated the metrics endpoint to accept
familiesquery parameters for filtering
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
lib/streamlit/runtime/stats.py |
Added CounterStat and GaugeStat classes; modified StatsProvider protocol and StatsManager to support family filtering; added family name constants |
lib/streamlit/runtime/websocket_session_manager.py |
Implemented StatsProvider protocol with event counters; added metrics tracking for session lifecycle events |
lib/streamlit/web/server/stats_request_handler.py |
Added query parameter parsing for family filtering; updated to pass filtered families to stats manager |
lib/streamlit/runtime/runtime.py |
Updated provider registration to use new API; registered session manager for session metrics families |
lib/streamlit/web/server/server.py |
Updated media file storage provider registration to use new API |
lib/streamlit/runtime/state/session_state.py |
Updated SessionState and SessionStateStatProvider to support family filtering |
lib/streamlit/runtime/memory_uploaded_file_manager.py |
Updated get_stats to support family filtering and return dict format |
lib/streamlit/runtime/memory_media_file_storage.py |
Updated get_stats to support family filtering and return dict format |
lib/streamlit/runtime/caching/cache_data_api.py |
Updated cache providers to support family filtering in get_stats method |
lib/streamlit/runtime/caching/cache_resource_api.py |
Updated cache providers to support family filtering in get_stats method |
lib/streamlit/runtime/caching/storage/in_memory_cache_storage_wrapper.py |
Updated storage wrapper to support family filtering |
lib/tests/streamlit/runtime/stats_test.py |
Added comprehensive tests for filtering, CounterStat, GaugeStat, and updated mock provider |
lib/tests/streamlit/runtime/websocket_session_manager_test.py |
Added test suite for websocket session metrics tracking |
lib/tests/streamlit/web/server/stats_handler_test.py |
Added tests for HTTP endpoint filtering and new stat formats |
lib/tests/streamlit/runtime/state/session_state_test.py |
Updated tests to work with new dict-based return format |
lib/tests/streamlit/runtime/memory_media_file_storage_test.py |
Updated tests to work with new dict-based return format |
lib/tests/streamlit/runtime/uploaded_file_manager_test.py |
Updated tests to work with new dict-based return format |
lib/tests/streamlit/runtime/caching/cache_data_api_test.py |
Updated tests to work with new dict-based return format |
lib/tests/streamlit/runtime/caching/cache_resource_api_test.py |
Updated tests to work with new dict-based return format |
79b0028 to
f8ca4f2
Compare
|
@cursor review |
| """ | ||
| result: dict[str, list[Stat]] = {} | ||
|
|
||
| if family_names is None or SESSION_EVENTS_FAMILY in family_names: |
There was a problem hiding this comment.
Can you add a metric for "user time spent in session" or similar? Looking for some way to count the number of seconds that users have had an active session in an app.
I realize this is better suited to an event bus rather than reported as a metric, but this is a quick and dirty way to get this info. Keeping it as a monotonic counter incremented when a session ends would be amazing.
There was a problem hiding this comment.
Can do -- I'll probably defer this until ~Wednesday or so to get this PR merged first so that it doesn't continue to grow, but adding another counter for this should be pretty straightforward.
There was a problem hiding this comment.
Would a single aggregate counter tracking the total number of seconds that all users have spent in aggregate in the app be sufficient here, or are we looking for more granularity?
There was a problem hiding this comment.
Made a quick followup to add a counter metric for this: #13414
SummaryThis PR extends Streamlit's metrics collection capabilities by:
This is a continuation of work from #13265 to make the metrics system more general and performant, as cache stats are expensive to compute and new metrics may need frequent polling. Code QualityThe code is well-structured and follows existing patterns in the codebase. Key observations: Positives:
Minor observations:
@property
def family_name(self) -> str:
... # More conventional for protocols
if family_names is None or ACTIVE_SESSIONS_FAMILY in family_names:
result[ACTIVE_SESSIONS_FAMILY] = [
GaugeStat(
family_name=ACTIVE_SESSIONS_FAMILY,
value=len(self._active_session_info_by_id), # No lock held here
help="Current number of active sessions.",
),
]
Test CoverageExcellent test coverage overall:
All tests follow pytest best practices with descriptive docstrings. Backwards Compatibility
The def get_stats(self) -> list[CacheStat]to: def get_stats(self, family_names: Sequence[str] | None = None) -> Mapping[str, Sequence[Stat]]This affects:
Mitigating factors:
Security & RiskNo security concerns identified:
Low regression risk:
Recommendations
# Reading dict length is atomic in CPython; no lock needed
value=len(self._active_session_info_by_id),
VerdictAPPROVED: This is a well-implemented feature that extends Streamlit's metrics collection with proper test coverage. The breaking change to This is an automated AI review. Please verify the feedback and use your judgment. |
| if family_names is not None and CACHE_MEMORY_FAMILY not in family_names: | ||
| return {} |
There was a problem hiding this comment.
question: Why do we need this extra check in every get_stats. Since the cache provider is already registered with a family, wouldn't it be possible to simply not call get_stats of the provider?
There was a problem hiding this comment.
Hm, this must be left over from when I was still going through some ideas of how to best express that we may serve various stat families from the same StatsProvider.
Let me replace it with a comment saying that we don't need to bother doing any other checks on which stats to serve if this get_stats (and those that are essentially identical) since we only serve one stats family from it.
lib/streamlit/runtime/runtime.py
Outdated
| self._stats_mgr = StatsManager() | ||
| self._stats_mgr.register_provider( | ||
| "cache_memory_bytes", get_data_cache_stats_provider() | ||
| get_data_cache_stats_provider(), [CACHE_MEMORY_FAMILY] |
There was a problem hiding this comment.
suggestion: Would it be possible to just add a property/method to the provider that returns the supported families? I assume a provider always returns the same families, and this would allow self._stats_mgr.register_provider(get_data_cache_stats_provider()) and potentially make it slightly more maintainable.
There was a problem hiding this comment.
Ah yeah, this should be possible. Let me tweak the StatsProvider interface to include this info.
lukasmasuch
left a comment
There was a problem hiding this comment.
Overall, LGMT 👍 However, the context of the Starlette migration, I'm a bit nervous that some aspects will get lost/break once we make Starlette the default without e2e coverage. I think it might be possible to add some basic e2e tests to ensure that, e.g. filtering is supported and some counting happens on websocket disconnects / reconnects.
Related tests: web_server -> Contains some very basic metric endpoint e2e tests already. websocket_reconnects, websocket_disconnect
f8ca4f2 to
89a9876
Compare
89a9876 to
72aa2d6
Compare
I added some e2e tests to verify that filtering works. Didn't add tests to also consider websocket connect/disconnect increments the metric counts, though, since that code is already pretty well covered by tests of the websocket session manager, which the starlette migration uses without any changes. I'll probably still look to add some e2e test coverage here in a followup (I have a few that I'll get around to late this week) but don't want to increase the size of this PR any further. |
In #13324, we added e2e tests to validate that session event metrics were being served but didn't add tests on the metrics behavior itself because: * there's already decent test coverage around this in websocket session manager unit tests. * that PR was already getting much longer than initially expected. This PR goes back to add Playwright tests to validate that session event counter metrics increase when we expect them to.
While this isn't exactly the best way of doing this, it'd be useful for us to have a new metric to track the total length of time users spend in active sessions (see #13324 (comment)). This PR adds this new metric by keeping track of session connect/reconnect times and incrementing a simple accumulator for aggregate session length each time a session is either disconnected or closed. Note that one caveat of doing this is that long-lived active sessions won't be counted until disconnected or closed, but we're not too worried our data being slow to update due to this.
While this isn't exactly the best way of doing this, it'd be useful for us to have a new metric to track the total length of time users spend in active sessions (see streamlit#13324 (comment)). This PR adds this new metric by keeping track of session connect/reconnect times and incrementing a simple accumulator for aggregate session length each time a session is either disconnected or closed. Note that one caveat of doing this is that long-lived active sessions won't be counted until disconnected or closed, but we're not too worried our data being slow to update due to this.
Describe your changes
In #13265, we began the process of extending our metrics collection to be more general than
just being able to collect cache stats.
In this PR, we do a few things to continue the work done in the previous PR:
We need to do this because cache stats are naturally slow to compute (we should try to optimize
this, but naturally it'll take time to serialize an object to compute its size), and some of the new
metrics we'll be adding will be polled for frequently
of active user sessions.
metrics families.
Testing Plan
Python unit tests added
Validated the new endpoint on a local dev environment via curl