Add a tool to view percentage of mmaps cache by generall · Pull Request #6310 · qdrant/qdrant

generall · 2025-04-02T17:00:50Z

A tool that shows how much of the mmapped files is held in the disk cache of a running qdrant process.

Usage examples:
  # Basic usage - show aggregated cache percentages for all vector index files
  python measure.py $(pidof qdrant) '.*/vector_index/.*'

  # Verbose mode - show individual files larger than 1MB
  python measure.py $(pidof qdrant) '.*/vector_storage/.*' -v

Example output:
    Cache percentages for pattern '.vector_storage.*':
    r--s: 0.00%
    rw-s: 0.06%

coderabbitai · 2025-04-02T17:03:39Z

📝 Walkthrough

A new Python script, smaps-view.py, has been introduced to analyze memory maps for a specified process ID. The script reads the /proc/{pid}/smaps file to extract memory mapping statistics, including size and resident set size (RSS), for files that match a user-defined regular expression pattern. It includes a MapStats data class to encapsulate the relevant statistics and features two main functions: parse_smaps, which processes the content of the smaps file and accumulates statistics, and calculate_cache_percentage, which coordinates reading the smaps file, invoking parse_smaps, and computing cache percentages based on accumulated data. The script also provides a command-line interface for users to specify the PID, regex pattern, and an optional verbose flag for detailed output on files larger than 1MB. All changes are contained within the updated file.
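The flow described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from tools/smaps-view.py: the names aggregate_by_permissions, cache_percentage, HEADER, and SAMPLE are hypothetical, the sample smaps text is made up, and the sketch groups by permission string only (mirroring the example output in the PR description) rather than by (file_path, permissions).

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MapStats:
    size: int = 0  # total mapped size, in kB
    rss: int = 0   # resident portion, in kB


# A mapping header starts with an address range such as "7f00...-7f00... "
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


def aggregate_by_permissions(smaps_text: str, pattern: str) -> dict:
    """Sum Size and Rss per permission string for mappings whose path matches."""
    regex = re.compile(pattern)
    totals = defaultdict(MapStats)
    current = None  # bucket for the mapping being read, or None if it is skipped
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            parts = line.split()
            path = parts[-1] if len(parts) > 5 else None
            current = totals[parts[1]] if path and regex.search(path) else None
        elif current is not None and line.startswith("Size:"):
            current.size += int(line.split()[1])
        elif current is not None and line.startswith("Rss:"):
            current.rss += int(line.split()[1])
    return dict(totals)


def cache_percentage(stats: MapStats) -> float:
    """Resident share of the mapped size, guarding against empty mappings."""
    return stats.rss / stats.size * 100 if stats.size > 0 else 0.0


# Made-up smaps excerpt, for illustration only
SAMPLE = """\
7f0000000000-7f0000400000 r--s 00000000 08:01 42 /data/vector_storage/a.dat
Size:               4096 kB
Rss:                1024 kB
7f0000400000-7f0000800000 rw-s 00000000 08:01 43 /data/vector_storage/b.dat
Size:               4096 kB
Rss:                4096 kB
7f0000800000-7f0000900000 r-xp 00000000 08:01 44 /usr/lib/libc.so.6
Size:               1024 kB
Rss:                 512 kB
"""

for perms, stats in aggregate_by_permissions(SAMPLE, r".*/vector_storage/.*").items():
    print(f"{perms}: {cache_percentage(stats):.2f}%")
```

Because the sketch adds Size and Rss to the bucket as soon as each line is read, it needs no separate step for the last entry in the file.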

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tools/smaps-view.py (3)

1-13: Update script name in examples to match actual filename

The usage examples in the docstring refer to the script as measure.py, but the actual filename is smaps-view.py. This could confuse users trying to follow the examples.

-  python measure.py $(pidof qdrant) '.*/vector_index/.*'
+  python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'

-  python measure.py $(pidof qdrant) '.*/vector_storage/.*' -v
+  python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v

18-19: Remove unused import

The List import from typing is not used anywhere in the code.

-from typing import Dict, Tuple, List
+from typing import Dict, Tuple

🧰 Tools

🪛 Ruff (0.8.2)

18-18: typing.List imported but unused

Remove unused import: typing.List

(F401)

60-136: Refactor duplicated code to improve maintainability

The code block that handles printing statistics for a matching file is duplicated at lines 86-102 and 117-133. This violates the DRY principle and makes the code harder to maintain.

Consider refactoring the duplicated code into a helper function:

def parse_smaps(smaps_content: str, pattern: str, verbose: bool = False) -> Dict[Tuple[str, str], MapStats]:
    """
    Parse smaps content and calculate size and RSS for files matching the pattern.
    Optionally logs individual file percentages for files larger than 1MB.
    
    Args:
        smaps_content: Content of /proc/{pid}/smaps file
        pattern: Regular expression pattern to match against file paths
        verbose: Whether to print per-file information
        
    Returns:
        Dictionary mapping (file_path, permissions) tuples to their stats
    """
    file_stats: Dict[Tuple[str, str], MapStats] = {}
    current_file = None
    current_size = 0
    current_rss = 0
    current_permissions = ""
    
    # Compile the regex pattern
    regex = re.compile(pattern)
    
+   def update_stats_and_print_verbose():
+       """Helper function to update file stats and print verbose information if needed"""
+       if not (current_file and regex.search(current_file)):
+           return
+           
+       key = (current_file, current_permissions)
+       if key not in file_stats:
+           file_stats[key] = MapStats()
+       
+       stats = file_stats[key]
+       stats.size += current_size
+       stats.rss += current_rss
+       
+       # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
+       if verbose and current_size > 1024:
+           percentage = (current_rss / current_size) * 100 if current_size > 0 else 0.0
+           print(f"File: {current_file} ({current_permissions})")
+           print(f"  Size: {current_size} kB")
+           print(f"  RSS: {current_rss} kB")
+           print(f"  Cache percentage: {percentage:.2f}%")
+           print()
+
    for line in smaps_content.splitlines():
        # Check if line starts with an address range (new memory map entry)
        if re.match(r'^[0-9a-f]+-[0-9a-f]+', line):
            # If we have a previous file that matched, add its stats
-           if current_file and regex.search(current_file):
-               key = (current_file, current_permissions)
-               if key not in file_stats:
-                   file_stats[key] = MapStats()
-               
-               stats = file_stats[key]
-               stats.size += current_size
-               stats.rss += current_rss
-               
-               # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
-               if verbose and current_size > 1024:
-                   percentage = (current_rss / current_size) * 100
-                   print(f"File: {current_file} ({current_permissions})")
-                   print(f"  Size: {current_size} kB")
-                   print(f"  RSS: {current_rss} kB")
-                   print(f"  Cache percentage: {percentage:.2f}%")
-                   print()
+           update_stats_and_print_verbose()
            
            # Extract file path and permissions from the line
            parts = line.split()
            current_file = parts[-1] if len(parts) > 5 else None
            current_size = 0
            current_rss = 0
            # Store full permissions string
            current_permissions = parts[1] if len(parts) > 1 else ""
        elif line.startswith('Size:'):
            current_size = int(line.split()[1])
        elif line.startswith('Rss:'):
            current_rss = int(line.split()[1])
    
    # Don't forget to add the last entry if it matches
-   if current_file and regex.search(current_file):
-       key = (current_file, current_permissions)
-       if key not in file_stats:
-           file_stats[key] = MapStats()
-       
-       stats = file_stats[key]
-       stats.size += current_size
-       stats.rss += current_rss
-       
-       # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
-       if verbose and current_size > 1024:
-           percentage = (current_rss / current_size) * 100
-           print(f"File: {current_file} ({current_permissions})")
-           print(f"  Size: {current_size} kB")
-           print(f"  RSS: {current_rss} kB")
-           print(f"  Cache percentage: {percentage:.2f}%")
-           print()
+   update_stats_and_print_verbose()
    
    return file_stats

Additionally, the calculation of the percentage should include a check to avoid division by zero:

-percentage = (current_rss / current_size) * 100
+percentage = (current_rss / current_size) * 100 if current_size > 0 else 0.0

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec544ff and 8b9516e.

📒 Files selected for processing (1)

tools/smaps-view.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: test
GitHub Check: test (windows-latest)
GitHub Check: test
GitHub Check: test-consensus
GitHub Check: test (ubuntu-latest)
GitHub Check: test

🔇 Additional comments (2)

tools/smaps-view.py (2)

137-182: LGTM! Well-implemented error handling

The function has good error handling for file not found and other exceptions. It also properly checks for division by zero when calculating percentages.

183-195: LGTM! Clean command-line interface implementation

The command-line argument parsing is well-implemented using argparse, with clear descriptions for each parameter.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

tools/smaps-view.py (5)

15-19: Remove unused import List.

The static analysis tool correctly identified that typing.List is imported but never used in the code.

- from typing import Dict, Tuple, List
+ from typing import Dict, Tuple

🧰 Tools

🪛 Ruff (0.8.2)

18-18: typing.List imported but unused

Remove unused import: typing.List

(F401)

150-153: Add memory usage protection for large processes.

Reading the entire smaps file into memory at once could cause memory issues for large processes. Consider processing the file line by line instead.

- with open(f'/proc/{pid}/smaps', 'r') as f:
-     content = f.read()
- 
- file_stats = parse_smaps(content, pattern, verbose)
+ file_stats: Dict[Tuple[str, str], MapStats] = {}
+ with open(f'/proc/{pid}/smaps', 'r') as f:
+     # Note: f.read() still loads the whole file into memory; true line-by-line
+     # processing requires parse_smaps to accept the file object instead
+     file_stats = parse_smaps(f.read(), pattern, verbose)

Alternatively, refactor parse_smaps to accept a file object instead of a string to process lines one at a time:

def parse_smaps(smaps_file, pattern: str, verbose: bool = False) -> Dict[Tuple[str, str], MapStats]:
    """
    Parse smaps file and calculate size and RSS for files matching the pattern.
    
    Args:
        smaps_file: File object for /proc/{pid}/smaps
        pattern: Regular expression pattern to match against file paths
        verbose: Whether to print per-file information
        
    Returns:
        Dictionary mapping (file_path, permissions) tuples to their stats
    """
    # Then modify calculate_cache_percentage to pass the file object directly
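A minimal sketch of the file-object variant, assuming the parser only needs an iterable of lines. The name parse_smaps_lines and the plain (size, rss) tuples are hypothetical simplifications and do not come from the PR; the verbose output is omitted.

```python
import io
import re
from typing import Dict, Iterable, Tuple

# A mapping header starts with an address range such as "7f00...-7f00... "
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


def parse_smaps_lines(
    lines: Iterable[str], pattern: str
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Return {(path, permissions): (size_kb, rss_kb)} for matching mappings."""
    regex = re.compile(pattern)
    totals: Dict[Tuple[str, str], Tuple[int, int]] = {}
    key = None  # key of the mapping being read, or None if it is skipped
    for line in lines:  # a file object yields one line at a time
        if HEADER.match(line):
            parts = line.split()
            path = parts[-1] if len(parts) > 5 else None
            key = (path, parts[1]) if path and regex.search(path) else None
        elif key is not None and line.startswith(("Size:", "Rss:")):
            size, rss = totals.get(key, (0, 0))
            value = int(line.split()[1])
            if line.startswith("Size:"):
                size += value
            else:
                rss += value
            totals[key] = (size, rss)
    return totals


# io.StringIO stands in for open(f"/proc/{pid}/smaps") in this demo
fake = io.StringIO(
    "7f0000000000-7f0000400000 rw-s 00000000 08:01 42 /data/vector_index/graph.bin\n"
    "Size:               4096 kB\n"
    "Rss:                2048 kB\n"
)
print(parse_smaps_lines(fake, r".*/vector_index/.*"))
```

With a real process, the caller would pass the open file directly, for example with open(f"/proc/{pid}/smaps") as f: parse_smaps_lines(f, pattern), so the content is never held in memory as one string.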

105-111: Add error handling for malformed smaps entries.

The parsing logic makes assumptions about the format of smaps entries without handling potential malformed entries, which could lead to index errors.

- parts = line.split()
- current_file = parts[-1] if len(parts) > 5 else None
- current_size = 0
- current_rss = 0
- # Store full permissions string
- current_permissions = parts[1] if len(parts) > 1 else ""
+ try:
+     parts = line.split()
+     # Only consider valid entries with file paths
+     # A valid smaps entry should have at least 6 parts:
+     # address permissions offset dev inode pathname
+     if len(parts) >= 6:
+         current_file = parts[-1]
+         current_permissions = parts[1]
+     else:
+         current_file = None
+     current_size = 0
+     current_rss = 0
+ except Exception as e:
+     print(f"Warning: Error parsing line '{line}': {e}", file=sys.stderr)
+     current_file = None
+     current_size = 0
+     current_rss = 0
+     current_permissions = ""
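One way to make the header parsing explicit without a broad try/except is sketched below. The names MapHeader and parse_header are hypothetical. The sketch assumes the usual smaps header layout (address range, permissions, offset, device, inode, optional pathname) and uses split(maxsplit=5) so that a pathname containing spaces stays in one piece.

```python
import re
from typing import NamedTuple, Optional

# A mapping header starts with an address range such as "7f00...-7f00... "
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


class MapHeader(NamedTuple):
    permissions: str
    path: Optional[str]  # None for anonymous mappings


def parse_header(line: str) -> Optional[MapHeader]:
    """Parse one smaps header line; return None for any other kind of line."""
    if not HEADER.match(line):
        return None
    # At most 5 splits, so everything after the inode is kept as one field
    fields = line.rstrip("\n").split(maxsplit=5)
    if len(fields) < 5:
        return None  # malformed header: skip instead of raising
    path = fields[5].rstrip() if len(fields) == 6 else None
    return MapHeader(permissions=fields[1], path=path)


print(parse_header("7f00-7f10 rw-s 00000000 08:01 42 /data/my dir/a.dat\n"))
print(parse_header("Size:               4096 kB\n"))
```

Returning None for non-header and malformed lines lets the caller reset its state in one place instead of repeating the reset in an exception handler.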

183-195: Add help text and an example in the argparse description.

The command-line interface could benefit from a more detailed description and examples, incorporating the information already provided in the docstring.

- parser = argparse.ArgumentParser(description='Analyze memory maps and calculate cache percentages')
+ parser = argparse.ArgumentParser(
+     description='Analyze memory maps and calculate cache percentages for files in a process',
+     epilog='''
+ Examples:
+   # Show aggregated cache percentages for all vector index files
+   python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'
+
+   # Verbose mode - show individual files larger than 1MB
+   python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+ '''
+ )
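One detail worth noting about this suggestion: argparse re-wraps the epilog with its default formatter, so the example lines would be merged into a single paragraph in the help output. A sketch that keeps the layout by using argparse.RawDescriptionHelpFormatter follows. The function name build_parser, the --verbose long option, and the help strings are assumptions for illustration and may not match the actual script.

```python
import argparse

EPILOG = """\
Examples:
  # Show aggregated cache percentages for all vector index files
  python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'

  # Verbose mode - show individual files larger than 1MB
  python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
"""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze memory maps and calculate cache percentages for files in a process",
        epilog=EPILOG,
        # Keep the line breaks and indentation of description and epilog as written
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("pid", type=int, help="process ID to inspect")
    parser.add_argument("pattern", help="regular expression matched against mapped file paths")
    parser.add_argument("-v", "--verbose", action="store_true", help="print per-file statistics")
    return parser


# Parse an explicit argument list so the demo does not depend on sys.argv
args = build_parser().parse_args(["1234", ".*/vector_index/.*", "-v"])
print(args.pid, args.pattern, args.verbose)
```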

116-135: Consider refactoring duplicated verbose output code.

The verbose output code for displaying individual file statistics is duplicated between the main loop and the final entry processing.

+ def print_file_stats(file: str, permissions: str, size: int, rss: int):
+     """Helper function to print detailed file statistics in verbose mode."""
+     percentage = (rss / size) * 100 if size > 0 else 0
+     print(f"File: {file} ({permissions})")
+     print(f"  Size: {size} kB")
+     print(f"  RSS: {rss} kB")
+     print(f"  Cache percentage: {percentage:.2f}%")
+     print()

...

- # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
- if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100
-     print(f"File: {current_file} ({current_permissions})")
-     print(f"  Size: {current_size} kB")
-     print(f"  RSS: {current_rss} kB")
-     print(f"  Cache percentage: {percentage:.2f}%")
-     print()
+ # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
+ if verbose and current_size > 1024:
+     print_file_stats(current_file, current_permissions, current_size, current_rss)

...

- # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
- if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100 if current_size > 0 else 0
-     print(f"File: {current_file} ({current_permissions})")
-     print(f"  Size: {current_size} kB")
-     print(f"  RSS: {current_rss} kB")
-     print(f"  Cache percentage: {percentage:.2f}%")
-     print()
+ # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
+ if verbose and current_size > 1024:
+     print_file_stats(current_file, current_permissions, current_size, current_rss)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b9516e and 25dc5c9.

📒 Files selected for processing (1)

tools/smaps-view.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: test (windows-latest)
GitHub Check: test-consensus
GitHub Check: test
GitHub Check: test
GitHub Check: test (ubuntu-latest)
GitHub Check: test

🔇 Additional comments (6)

tools/smaps-view.py (6)

1-13: LGTM! The docstring provides clear examples and expected output.

The docstring effectively communicates the tool's purpose with practical usage examples and expected output format.

55-59: LGTM! The dataclass design is clean and appropriate.

The MapStats dataclass provides a concise way to track size and RSS statistics with appropriate default values.

60-95: LGTM! Well-documented function with clear purpose and structure.

The parse_smaps function is well-documented with clear arguments, return value, and purpose. The initial setup and regex compilation are appropriately handled.

112-115: LGTM! Simple and effective extraction of size and RSS values.

The code for extracting size and RSS values from the smaps entries is straightforward and effective.

156-166: LGTM! Good error handling for empty results.

The function correctly handles the case when no files match the provided pattern, returning an empty dictionary.

167-182: LGTM! Comprehensive error handling with user-friendly messages.

The function includes appropriate error handling for file not found and other exceptions, with clear error messages directed to stderr.

coderabbitai · 2025-04-02T18:41:31Z

+                if verbose and current_size > 1024:
+                    percentage = (current_rss / current_size) * 100
+                    print(f"File: {current_file} ({current_permissions})")
+                    print(f"  Size: {current_size} kB")
+                    print(f"  RSS: {current_rss} kB")
+                    print(f"  Cache percentage: {percentage:.2f}%")
+                    print()


🛠️ Refactor suggestion

Add division by zero protection when calculating percentage.

When calculating the cache percentage for individual files in verbose mode, there's no check to ensure that current_size is not zero, which could lead to a division by zero error.

- percentage = (current_rss / current_size) * 100
+ percentage = (current_rss / current_size) * 100 if current_size > 0 else 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100
+     percentage = (current_rss / current_size) * 100 if current_size > 0 else 0
      print(f"File: {current_file} ({current_permissions})")
      print(f"  Size: {current_size} kB")
      print(f"  RSS: {current_rss} kB")
      print(f"  Cache percentage: {percentage:.2f}%")
      print()

KShivendu · 2025-04-03T07:35:12Z

+  # Verbose mode - show individual files larger than 1MB
+  python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+


Suggested change

-  # Verbose mode - show individual files larger than 1MB
-  python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+  # Verbose mode - show vector storage files larger than 1MB
+  python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+
+  # Verbose mode - show all files larger than 1MB
+  python smaps-view.py $(pidof qdrant) '.*' -v

I tested this on a small collection and needed python smaps-view.py $(pidof qdrant) '.*' -v to see any file stats. It might be worth highlighting that in the docs.

What is the point? This "match_anything" pattern will include a lot of garbage.
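The trade-off discussed in this thread can be illustrated with a small sketch. The paths below are made up for illustration, and matching is a hypothetical helper; it uses regex.search, which is how the code quoted in the review matches paths.

```python
import re

# Hypothetical mapped paths, similar in shape to what smaps reports
paths = [
    "/qdrant/storage/collections/demo/0/segments/abc/vector_storage/vectors.dat",
    "/usr/lib/x86_64-linux-gnu/libc.so.6",
    "[heap]",
]


def matching(pattern: str, candidates: list) -> list:
    """Return the candidates that the pattern matches anywhere in the path."""
    regex = re.compile(pattern)
    return [p for p in candidates if regex.search(p)]


print(matching(r".*", paths))                    # every mapping, including libraries and the heap
print(matching(r".*/vector_storage/.*", paths))  # only the vector storage file
```

A catch-all pattern therefore mixes shared libraries and special regions such as [heap] into the totals, while a narrower pattern reports only the files of interest.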

* Add a tool to view percentage of mmaps cache
* review fixes

Add a tool to view percentage of mmaps cache

8b9516e

coderabbitai Bot reviewed Apr 2, 2025

review fixes

25dc5c9

coderabbitai Bot reviewed Apr 2, 2025

github-actions Bot mentioned this pull request Apr 2, 2025

Flaky test index::tests::hw_counter_test::test_hw_counter_for_plain_sparse_search #6231

Closed

KShivendu approved these changes Apr 3, 2025

generall merged commit 75e5635 into dev Apr 3, 2025
17 checks passed

generall deleted the view-smaps-cache branch April 3, 2025 08:50

pull Bot pushed a commit to kp-forks/qdrant that referenced this pull request Apr 21, 2025

Add a tool to view percentage of mmaps cache (qdrant#6310)

feac5a4

* Add a tool to view percentage of mmaps cache
* review fixes
