Skip to content

Add layer of abstraction to generalize the CacheStat type#13265

Merged
vdonato merged 2 commits intodevelopfrom
vdonato/generalize-metrics-endpoint
Dec 9, 2025
Merged

Add layer of abstraction to generalize the CacheStat type#13265
vdonato merged 2 commits intodevelopfrom
vdonato/generalize-metrics-endpoint

Conversation

@vdonato
Copy link
Copy Markdown
Collaborator

@vdonato vdonato commented Dec 8, 2025

Describe your changes

Currently, the _stcore/metrics endpoint only returns data related to caches, including st.session_state,
@st.cache_data, and @st.cache_resource. In the SiS vNext world, there are various use cases that where
it would be useful to be able to expose more metrics from this endpoint.

We'll be both adding additional metrics and making some changes to the endpoint itself in subsequent
PRs, but as a first step, we'll just add another layer of abstraction to generalize the existing CacheStatsProvider
class.

In this PR, we

  • add a new Stat protocol with required properties that give us enough information to describe an
    OpenMetrics metric family
  • have the existing CacheStat class implement the new protocol
  • change the name of CacheStatsProvider to StatsProvider, and edit the signatures of StatsManager
    methods to now return a dictionary mapping from metric family name to lists of metrics rather than only
    returning lists of CacheStats.
  • make corresponding downstream changes throughout the codebase so that things play nicely with our
    new StatsManager.register_provider() and StatsManager.get_stats() signatures.

We make one relatively small behavioral change to not include metrics comments if we don't have any
metrics to report, but aside from this, we don't make any behavioral changes in this PR.

Testing Plan

  • Unit Tests
    Updated Python unit tests
  • Any manual testing needed?
    curled the metrics endpoint locally to verify that the output looks the same as it did previously.

@snyk-io
Copy link
Copy Markdown
Contributor

snyk-io bot commented Dec 8, 2025

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 8, 2025

✅ PR preview is ready!

Name Link
📦 Wheel file https://core-previews.s3-us-west-2.amazonaws.com/pr-13265/streamlit-1.52.1-py3-none-any.whl
📦 @streamlit/component-v2-lib Download from artifacts
🕹️ Preview app pr-13265.streamlit.app (☁️ Deploy here if not accessible)

@vdonato vdonato added security-assessment-completed impact:internal PR changes only affect internal code change:feature PR contains new feature or enhancement implementation labels Dec 8, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 8, 2025

📉 Frontend coverage change detected

The frontend unit test (vitest) coverage has decreased by 0.0000%

  • Current PR: 85.9300% (12445 lines, 1751 missed)
  • Latest develop: 85.9300% (12445 lines, 1751 missed)

✅ Coverage change is within normal range.

📊 View detailed coverage comparison

@vdonato vdonato force-pushed the vdonato/generalize-metrics-endpoint branch from ae7ccff to 7fa6479 Compare December 8, 2025 23:44
@vdonato vdonato changed the title Add layer of indirection in front of StatsProvider/StatsManager types Add layer of abstraction in front of StatsProvider/StatsManager types Dec 9, 2025
@vdonato vdonato changed the title Add layer of abstraction in front of StatsProvider/StatsManager types Add layer of abstraction to generalize the CacheStat type Dec 9, 2025
@vdonato vdonato marked this pull request as ready for review December 9, 2025 01:12
@vdonato vdonato requested a review from a team as a code owner December 9, 2025 01:12
Copilot AI review requested due to automatic review settings December 9, 2025 01:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a layer of abstraction to generalize the metrics system beyond cache-specific statistics. The Stat protocol is added as a general interface for metrics that can be serialized to OpenMetrics format, with CacheStat implementing this protocol. CacheStatsProvider is renamed to StatsProvider, and StatsManager is updated to support multiple metric families by changing its return type from list[CacheStat] to dict[str, Sequence[Stat]].

Key changes:

  • New Stat protocol defines the interface for metrics with properties for family name, type, unit, and help text
  • StatsManager.register_provider() now requires a family name parameter, allowing providers to be grouped by metric family
  • Serialization logic updated to handle multiple metric families and skip empty families

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
lib/streamlit/runtime/stats.py Introduces Stat protocol, renames CacheStatsProvider to StatsProvider, adds metric_type_string_to_proto() helper, updates StatsManager to support family-based registration
lib/streamlit/runtime/state/session_state.py Updates SessionStateStatProvider to inherit from StatsProvider, renames group_stats to group_cache_stats
lib/streamlit/runtime/runtime.py Updates all register_provider() calls to include "cache_memory_bytes" family name parameter
lib/streamlit/runtime/uploaded_file_manager.py Updates protocol inheritance from CacheStatsProvider to StatsProvider
lib/streamlit/runtime/memory_uploaded_file_manager.py Renames group_stats to group_cache_stats
lib/streamlit/runtime/memory_media_file_storage.py Updates class inheritance and renames function call
lib/streamlit/runtime/caching/cache_data_api.py Updates DataCaches and helper function to use StatsProvider and group_cache_stats
lib/streamlit/runtime/caching/cache_resource_api.py Updates ResourceCaches to inherit from StatsProvider and renames function call
lib/streamlit/web/server/server.py Updates register_provider() call to include family name parameter
lib/streamlit/web/server/stats_request_handler.py Updates serialization methods to handle dict of stats by family, with logic to skip empty families
lib/tests/streamlit/runtime/stats_test.py Updates tests for new API, adds tests for multiple families and protocol implementation
lib/tests/streamlit/web/server/stats_handler_test.py Updates test expectations to use dict format, changes empty stats behavior expectation
lib/tests/streamlit/runtime/uploaded_file_manager_test.py Updates comment to reference StatsProvider
lib/tests/streamlit/runtime/memory_media_file_storage_test.py Updates comment to reference StatsProvider

Comment on lines 67 to 85
@staticmethod
def _stats_to_text(stats: list[CacheStat]) -> str:
metric_type = "# TYPE cache_memory_bytes gauge"
metric_unit = "# UNIT cache_memory_bytes bytes"
metric_help = "# HELP Total memory consumed by a cache."
openmetrics_eof = "# EOF\n"

# Format: header, stats, EOF
result = [metric_type, metric_unit, metric_help]
result.extend(stat.to_metric_str() for stat in stats)
result.append(openmetrics_eof)

def _stats_to_text(stats_by_family: dict[str, Sequence[Stat]]) -> str:
result: list[str] = []

for stats in stats_by_family.values():
if not stats:
continue

# All of the stats in a family will have the same family_name, type,
# unit, and help text, so we can just use the first one to construct
# our OpenMetrics comments.
first_stat = stats[0]
result.append(f"# TYPE {first_stat.family_name} {first_stat.type}")
result.append(f"# UNIT {first_stat.family_name} {first_stat.unit}")
result.append(f"# HELP {first_stat.help}")
result.extend(stat.to_metric_str() for stat in stats)

result.append("# EOF\n")
return "\n".join(result)
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StatsManager.get_stats() method returns a dictionary where keys are the registration family names, but the actual serialization in _stats_to_text and _stats_to_proto uses first_stat.family_name from the stats themselves. This creates a potential inconsistency: if a provider's stats have a different family_name than the key they're registered under, the output will use the stat's family_name, not the registration key.

Consider either:

  1. Adding validation in register_provider() or get_stats() to ensure all stats from a provider match their registered family name
  2. Using the dictionary key (registration family name) instead of first_stat.family_name when serializing
  3. Documenting this behavior clearly if it's intentional

Currently, the dictionary key from get_stats() is ignored during serialization, which seems like a code smell.

Copilot uses AI. Check for mistakes.
@lukasmasuch
Copy link
Copy Markdown
Collaborator

@cursor review

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no bugs!


Comment on lines +233 to +236
self._stats_mgr.register_provider("cache_memory_bytes", self._uploaded_file_mgr)
self._stats_mgr.register_provider(
"cache_memory_bytes", SessionStateStatProvider(self._session_mgr)
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: The stats from the upload file manager and session manager aren't really cache_memory_bytes or it could be a bit misleading to call it that way since it's just measuring the storage that the session state and memory files take on. Is it already named that way in the current version?

Copy link
Copy Markdown
Collaborator

@lukasmasuch lukasmasuch Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we might want to consider adding the media file storage here, as well as a follow-up

Edit: Looks like this gets registered in server.py. Do you know the reason for this? Maybe it would be good to add a comment here explaining that the media file manager gets registered in a different location (and why)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names are kept the same as in the current implementation, but we can look into changing them in some of the followups.

I think the only reason why the media file manager gets registered in server.py is because the manager gets created there. It's probably ok to leave it as-is because I think that we'll have more metrics registrations added in various other places as we continue to track additional metrics. I'll add a comment mentioning there may be other registrations that happen outside of runtime.py in a followup PR, though.

Comment on lines +64 to +66
def marshall_metric_proto(self, metric: MetricProto) -> None:
"""Fill an OpenMetrics `Metric` protobuf object."""
pass
Copy link
Copy Markdown
Collaborator

@lukasmasuch lukasmasuch Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Do you know why we support protobuf here as well? Is this a common thing to do with OpenMetric endpoints? Do we actually want to support this with the "new" metrics endpoint?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally use gRPC to communicate between internal services in the SiS world. I can look into whether it might be worth not doing so if the space savings are totally negligible, but I'd imagine we'd prefer to have the option of doing so for consistency

Copy link
Copy Markdown
Collaborator

@lukasmasuch lukasmasuch Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is the _stcore/metrics endpoint usable via gRPC in its current state? Isn't it just an HTTP REST endpoint returning protobuf bytes in its current form?

Copy link
Copy Markdown
Collaborator

@lukasmasuch lukasmasuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 but check the copilot feedback for valid/relevant aspects.

Two aspects that might be nice to re-validate when redesigning/updating this endpoint (in follow ups):

  • Do we want/need to support returning protobuf in the endpoint? Is this a common aspect with OpenMetrics, do we want to use this in SiS? I suspect that the space efficiency of protobuf doesn't make a big difference in this case.
  • The current cache_memory_stats metric name seems a bit incorrect given that we also track media file storage, file upload storage, session state. Should we change to a more generic name, or use more specific names for each type of memory?

@lukasmasuch lukasmasuch added the ai-review If applied to PR or issue will run AI review workflow label Dec 9, 2025
@github-actions github-actions bot removed the ai-review If applied to PR or issue will run AI review workflow label Dec 9, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 9, 2025

Summary

This PR introduces a layer of abstraction to generalize the metrics/stats endpoint infrastructure. The main changes are:

  1. New Stat protocol: A new protocol that defines the interface for stats that can be serialized to OpenMetrics format, with properties for family_name, type, unit, help, and methods for serialization.

  2. Renamed classes and functions:

    • CacheStatsProviderStatsProvider
    • group_statsgroup_cache_stats
  3. Enhanced StatsManager:

    • register_provider() now requires a family_name parameter to group providers by metric family
    • get_stats() returns a dictionary mapping family names to sequences of stats
  4. Updated CacheStat: Now implements the Stat protocol with hardcoded values for cache metrics (family_name="cache_memory_bytes", type="gauge", etc.)

  5. Minor behavioral change: When there are no stats, the endpoint now returns only # EOF\n instead of the full header with TYPE, UNIT, and HELP comments.

Code Quality

Strengths

  • Clean abstraction: The new Stat protocol provides a clear interface for future metric types, enabling extensibility without modifying existing code.
  • Consistent refactoring: All usages of the renamed classes/functions have been updated throughout the codebase.
  • Good separation of concerns: The CacheStat class now self-describes its OpenMetrics metadata through protocol properties.
  • Type safety: Appropriate use of cast() and type annotations throughout.

Minor Observations

  1. Protocol body style (lines 40-66 in lib/streamlit/runtime/stats.py): The Stat protocol uses pass as the method body. While functionally correct, using ... (ellipsis) is more idiomatic for Protocol definitions. This is a minor stylistic point and doesn't affect functionality.

  2. StatsProvider uses @AbstractMethod (line 173 in lib/streamlit/runtime/stats.py): The StatsProvider protocol uses @abstractmethod with raise NotImplementedError. For Protocol classes, @abstractmethod is typically unnecessary. However, this follows the existing code style from the original CacheStatsProvider, so it's consistent.

  3. Type annotation in cache_data_api.py (line 636): The cast cast("Sequence[CacheStat]", self.storage.get_stats()) is needed because StatsProvider.get_stats() returns Sequence[Stat], but the DataCache.get_stats() method returns Sequence[CacheStat]. This is correct but worth noting for future maintenance.

Test Coverage

Unit Tests Updated

lib/tests/streamlit/runtime/stats_test.py:

  • Updated MockStatsProvider to use new StatsProvider interface
  • Updated test_get_stats for new dictionary return format
  • Added test_get_stats_multiple_families to test multiple metric families
  • Renamed test_group_statstest_group_cache_stats
  • Added CacheStatProtocolTest with tests for Stat protocol implementation

lib/tests/streamlit/web/server/stats_handler_test.py:

  • Updated mock data structure to use dictionary format
  • Updated test_no_stats to expect only # EOF\n (behavioral change)
  • Updated test_has_stats and test_protobuf_stats with new data structure

lib/tests/streamlit/runtime/memory_media_file_storage_test.py:

  • Updated docstring reference from CacheStatsProvider to StatsProvider

lib/tests/streamlit/runtime/uploaded_file_manager_test.py:

  • Updated docstring reference from CacheStatsProvider to StatsProvider

Test Coverage Assessment

  • Good coverage: The test_protobuf_stats test indirectly exercises marshall_metric_proto and metric_type_string_to_proto by verifying the complete protobuf output.
  • Missing direct tests: The metric_type_string_to_proto function isn't directly unit tested for all type mappings (counter, histogram, etc.) and the UNKNOWN fallback case. However, the current usage only requires "gauge", which is covered.
  • Edge case covered: The test_no_stats test verifies the new behavior when there are no stats.

Backwards Compatibility

Internal API Changes (Low Risk)

The renamed symbols (CacheStatsProvider, group_stats) are internal APIs not exposed in the public st namespace. The risk of breaking external users is minimal because:

  1. These are implementation details in streamlit.runtime.stats
  2. No backward compatibility aliases are provided, but external usage is not expected
  3. All internal usages have been updated

Behavioral Change (Very Low Risk)

The metrics endpoint output format changes when there are no stats:

  • Before: Returns header comments (# TYPE, # UNIT, # HELP) even with empty stats
  • After: Returns only # EOF\n when there are no stats

This is a minor improvement and is unlikely to break any metrics consumers since:

  1. The output is still valid OpenMetrics format
  2. Most metrics consumers handle empty metric sets gracefully

Security & Risk

No security concerns identified:

  • No new endpoints or authentication changes
  • No changes to input validation or data handling
  • No exposure of new information in the metrics endpoint

Low regression risk:

  • Changes are well-scoped to the stats/metrics infrastructure
  • Existing functionality is preserved with proper test coverage
  • No changes to protobuf definitions or frontend code

Recommendations

The PR is well-implemented. Minor suggestions for consideration:

  1. Optional: Add direct unit tests for metric_type_string_to_proto - While currently covered indirectly, explicit tests for all type mappings (especially the UNKNOWN fallback) would improve test robustness for future metric types.

  2. Optional: Consider using ... in Protocol bodies - This is more idiomatic Python for Protocol definitions, though pass works correctly.

  3. Documentation note: The PR description mentions this is preparatory work for "SiS vNext". Consider adding a brief code comment in StatsManager or the module docstring explaining the intended extensibility pattern for future contributors.

Verdict

APPROVED: This PR introduces a clean abstraction layer that generalizes the metrics infrastructure without breaking existing functionality. The code is well-structured, properly tested, and ready for merge. The changes set up a solid foundation for adding additional metric types in future PRs.


This is an automated AI review. Please verify the feedback and use your judgment.

Copy link
Copy Markdown
Collaborator

@sfc-gh-lmasuch sfc-gh-lmasuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@vdonato vdonato merged commit c8aca1a into develop Dec 9, 2025
42 checks passed
@vdonato vdonato deleted the vdonato/generalize-metrics-endpoint branch December 9, 2025 22:53
vdonato added a commit that referenced this pull request Dec 16, 2025
)

In #13265, we began the process of extending our metrics collection to be more general than
just being able to collect cache stats.

In this PR, we do a few things to continue the work done in the previous PR:
* continue to iterate on the interface to support requesting only specific metrics families.
   We need to do this because cache stats are naturally slow to compute (we should try to optimize
   this, but naturally it'll take time to serialize an object to compute its size), and some of the new
   metrics we'll be adding will be polled for frequently
* Add new metrics to track websocket connects, disconnects, reconnects, and the number
   of active user sessions.
* Make some changes to the metrics endpoint itself to allow the caller to request specific
   metrics families.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

change:feature PR contains new feature or enhancement implementation impact:internal PR changes only affect internal code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants