Add a tool to view percentage of mmaps cache #6310
Actionable comments posted: 0
🧹 Nitpick comments (3)
tools/smaps-view.py (3)
1-13: Update script name in examples to match actual filename
The usage examples in the docstring refer to the script as measure.py, but the actual filename is smaps-view.py. This could confuse users trying to follow the examples.
- python measure.py $(pidof qdrant) '.*/vector_index/.*'
+ python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'
- python measure.py $(pidof qdrant) '.*/vector_storage/.*' -v
+ python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
18-19: Remove unused import
The List import from typing is not used anywhere in the code.
-from typing import Dict, Tuple, List
+from typing import Dict, Tuple
🧰 Tools
🪛 Ruff (0.8.2)
18-18: typing.List imported but unused
Remove unused import: typing.List
(F401)
60-136: Refactor duplicated code to improve maintainability
The code block that handles printing statistics for a matching file is duplicated at lines 86-102 and 117-133. This violates the DRY principle and makes the code harder to maintain.
Consider refactoring the duplicated code into a helper function:
 def parse_smaps(smaps_content: str, pattern: str, verbose: bool = False) -> Dict[Tuple[str, str], MapStats]:
     """
     Parse smaps content and calculate size and RSS for files matching the pattern.
     Optionally logs individual file percentages for files larger than 1MB.

     Args:
         smaps_content: Content of /proc/{pid}/smaps file
         pattern: Regular expression pattern to match against file paths
         verbose: Whether to print per-file information

     Returns:
         Dictionary mapping (file_path, permissions) tuples to their stats
     """
     file_stats: Dict[Tuple[str, str], MapStats] = {}
     current_file = None
     current_size = 0
     current_rss = 0
     current_permissions = ""

     # Compile the regex pattern
     regex = re.compile(pattern)

+    def update_stats_and_print_verbose():
+        """Helper function to update file stats and print verbose information if needed"""
+        if not (current_file and regex.search(current_file)):
+            return
+
+        key = (current_file, current_permissions)
+        if key not in file_stats:
+            file_stats[key] = MapStats()
+
+        stats = file_stats[key]
+        stats.size += current_size
+        stats.rss += current_rss
+
+        # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
+        if verbose and current_size > 1024:
+            percentage = (current_rss / current_size) * 100 if current_size > 0 else 0.0
+            print(f"File: {current_file} ({current_permissions})")
+            print(f" Size: {current_size} kB")
+            print(f" RSS: {current_rss} kB")
+            print(f" Cache percentage: {percentage:.2f}%")
+            print()
+
     for line in smaps_content.splitlines():
         # Check if line starts with an address range (new memory map entry)
         if re.match(r'^[0-9a-f]+-[0-9a-f]+', line):
             # If we have a previous file that matched, add its stats
-            if current_file and regex.search(current_file):
-                key = (current_file, current_permissions)
-                if key not in file_stats:
-                    file_stats[key] = MapStats()
-
-                stats = file_stats[key]
-                stats.size += current_size
-                stats.rss += current_rss
-
-                # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
-                if verbose and current_size > 1024:
-                    percentage = (current_rss / current_size) * 100
-                    print(f"File: {current_file} ({current_permissions})")
-                    print(f" Size: {current_size} kB")
-                    print(f" RSS: {current_rss} kB")
-                    print(f" Cache percentage: {percentage:.2f}%")
-                    print()
+            update_stats_and_print_verbose()

             # Extract file path and permissions from the line
             parts = line.split()
             current_file = parts[-1] if len(parts) > 5 else None
             current_size = 0
             current_rss = 0
             # Store full permissions string
             current_permissions = parts[1] if len(parts) > 1 else ""
         elif line.startswith('Size:'):
             current_size = int(line.split()[1])
         elif line.startswith('Rss:'):
             current_rss = int(line.split()[1])

     # Don't forget to add the last entry if it matches
-    if current_file and regex.search(current_file):
-        key = (current_file, current_permissions)
-        if key not in file_stats:
-            file_stats[key] = MapStats()
-
-        stats = file_stats[key]
-        stats.size += current_size
-        stats.rss += current_rss
-
-        # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
-        if verbose and current_size > 1024:
-            percentage = (current_rss / current_size) * 100
-            print(f"File: {current_file} ({current_permissions})")
-            print(f" Size: {current_size} kB")
-            print(f" RSS: {current_rss} kB")
-            print(f" Cache percentage: {percentage:.2f}%")
-            print()
+    update_stats_and_print_verbose()

     return file_stats
Additionally, the calculation of the percentage should include a check to avoid division by zero:
-percentage = (current_rss / current_size) * 100
+percentage = (current_rss / current_size) * 100 if current_size > 0 else 0.0
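The guarded division suggested above can also live in one small function, so that every call site shares the same zero check. The sketch below is only an illustration: the name cache_percentage and its parameters are hypothetical and do not come from the PR. Note that inside the verbose branch the `current_size > 1024` condition already rules out a zero size, so the guard mainly protects callers outside that branch.

```python
def cache_percentage(rss_kb: int, size_kb: int) -> float:
    """Return the share of a mapping that is resident in memory, in percent.

    Hypothetical helper: a zero (or negative) size yields 0.0 instead of
    raising ZeroDivisionError.
    """
    if size_kb <= 0:
        return 0.0
    return rss_kb / size_kb * 100


# A 2048 kB mapping with 512 kB resident is one quarter cached.
print(f"{cache_percentage(512, 2048):.2f}%")
# An empty mapping is reported as 0% rather than crashing.
print(f"{cache_percentage(0, 0):.2f}%")
```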
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/smaps-view.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tools/smaps-view.py
18-18: typing.List imported but unused
Remove unused import: typing.List
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test-consensus
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
🔇 Additional comments (2)
tools/smaps-view.py (2)
137-182: LGTM! Well-implemented error handling
The function has good error handling for file not found and other exceptions. It also properly checks for division by zero when calculating percentages.
183-195: LGTM! Clean command-line interface implementation
The command-line argument parsing is well-implemented using argparse, with clear descriptions for each parameter.
Actionable comments posted: 1
🧹 Nitpick comments (5)
tools/smaps-view.py (5)
15-19: Remove unused import List.
The static analysis tool correctly identified that typing.List is imported but never used in the code.
- from typing import Dict, Tuple, List
+ from typing import Dict, Tuple
🧰 Tools
🪛 Ruff (0.8.2)
18-18: typing.List imported but unused
Remove unused import: typing.List
(F401)
150-153: Add memory usage protection for large processes.
Reading the entire smaps file into memory at once could cause memory issues for large processes. Consider processing the file line by line instead.
-        with open(f'/proc/{pid}/smaps', 'r') as f:
-            content = f.read()
-
-        file_stats = parse_smaps(content, pattern, verbose)
+        file_stats: Dict[Tuple[str, str], MapStats] = {}
+        with open(f'/proc/{pid}/smaps', 'r') as f:
+            # Process the file line by line instead of loading everything into memory
+            file_stats = parse_smaps(f.read(), pattern, verbose)
Alternatively, refactor parse_smaps to accept a file object instead of a string to process lines one at a time:
def parse_smaps(smaps_file, pattern: str, verbose: bool = False) -> Dict[Tuple[str, str], MapStats]:
    """
    Parse smaps file and calculate size and RSS for files matching the pattern.

    Args:
        smaps_file: File object for /proc/{pid}/smaps
        pattern: Regular expression pattern to match against file paths
        verbose: Whether to print per-file information

    Returns:
        Dictionary mapping (file_path, permissions) tuples to their stats
    """
    # Then modify calculate_cache_percentage to pass the file object directly
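The first diff above still calls f.read(), so it does not actually stream the file. A minimal sketch of the file-object variant follows, assuming the same MapStats fields and (file_path, permissions) keys as the PR. The function name parse_smaps_lines, the sample data and the file path in it are hypothetical, and verbose output is left out for brevity.

```python
import io
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class MapStats:
    size: int = 0  # total mapped size, kB
    rss: int = 0   # resident part, kB


# A new mapping starts with an address range such as "7f00...-7f00... ".
HEADER_RE = re.compile(r'^[0-9a-f]+-[0-9a-f]+\s')


def parse_smaps_lines(lines: Iterable[str], pattern: str) -> Dict[Tuple[str, str], MapStats]:
    """Accumulate Size/Rss per (path, permissions), one line at a time."""
    regex = re.compile(pattern)
    stats: Dict[Tuple[str, str], MapStats] = {}
    current = None  # stats of the mapping being read, or None if it is skipped
    for line in lines:
        if HEADER_RE.match(line):
            parts = line.split()
            path = parts[-1] if len(parts) > 5 else None
            if path and regex.search(path):
                current = stats.setdefault((path, parts[1]), MapStats())
            else:
                current = None
        elif current is not None:
            if line.startswith('Size:'):
                current.size += int(line.split()[1])
            elif line.startswith('Rss:'):
                current.rss += int(line.split()[1])
    return stats


# Hypothetical sample in the smaps layout; only the first mapping matches.
SAMPLE = """\
7f0000000000-7f0000200000 r--s 00000000 08:01 1234 /data/vector_index/graph.bin
Size:               2048 kB
Rss:                 512 kB
7f0000200000-7f0000300000 rw-p 00000000 00:00 0 [heap]
Size:               1024 kB
Rss:                1024 kB
"""
result = parse_smaps_lines(io.StringIO(SAMPLE), r'.*/vector_index/.*')
print(result)
```

Because the function only iterates, an open file works the same way as the StringIO above: passing the handle returned by open(f'/proc/{pid}/smaps') keeps a single line in memory at a time.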
105-111: Add error handling for malformed smaps entries.
The parsing logic makes assumptions about the format of smaps entries without handling potential malformed entries, which could lead to index errors.
-            parts = line.split()
-            current_file = parts[-1] if len(parts) > 5 else None
-            current_size = 0
-            current_rss = 0
-            # Store full permissions string
-            current_permissions = parts[1] if len(parts) > 1 else ""
+            try:
+                parts = line.split()
+                # Only consider valid entries with file paths
+                # A valid smaps entry should have at least 6 parts:
+                # address permissions offset dev inode pathname
+                if len(parts) >= 6:
+                    current_file = parts[-1]
+                    current_permissions = parts[1]
+                else:
+                    current_file = None
+                current_size = 0
+                current_rss = 0
+            except Exception as e:
+                print(f"Warning: Error parsing line '{line}': {e}", file=sys.stderr)
+                current_file = None
+                current_size = 0
+                current_rss = 0
+                current_permissions = ""
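Both versions above take parts[-1] as the path, which keeps only the last word if a mapped path contains spaces. A possible alternative, sketched here under the assumption that a mapping header has the six-field layout listed in the comment (address, permissions, offset, dev, inode, pathname), is to cap the number of splits so the pathname stays in one piece. MapHeader and parse_map_header are hypothetical names, and whether such paths occur in a real Qdrant storage directory is not established by the PR.

```python
from typing import NamedTuple, Optional


class MapHeader(NamedTuple):
    permissions: str
    path: Optional[str]  # None for anonymous mappings


def parse_map_header(line: str) -> Optional[MapHeader]:
    """Parse 'address perms offset dev inode [pathname]'.

    Returns None when fewer than the five mandatory fields are present,
    so a malformed line is skipped instead of raising IndexError.
    """
    fields = line.split(None, 5)  # at most 6 fields; the path keeps its spaces
    if len(fields) < 5:
        return None
    path = fields[5].strip() if len(fields) == 6 else None
    return MapHeader(fields[1], path or None)


# Hypothetical header lines for illustration.
print(parse_map_header(
    "7f0000000000-7f0000200000 r--s 00000000 08:01 1234 /data/my collection/graph.bin\n"))
print(parse_map_header("7f0000200000-7f0000300000 rw-p 00000000 00:00 0 \n"))
print(parse_map_header("garbage"))
```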
183-195: Add help text and an example in the argparse description.
The command-line interface could benefit from a more detailed description and examples, incorporating the information already provided in the docstring.
-    parser = argparse.ArgumentParser(description='Analyze memory maps and calculate cache percentages')
+    parser = argparse.ArgumentParser(
+        description='Analyze memory maps and calculate cache percentages for files in a process',
+        epilog='''
+        Examples:
+          # Show aggregated cache percentages for all vector index files
+          python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'
+
+          # Verbose mode - show individual files larger than 1MB
+          python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+        '''
+    )
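One caveat with the epilog suggestion: argparse's default formatter re-wraps description and epilog text, so the example commands would be merged into a single paragraph. A sketch that keeps the line breaks by using argparse.RawDescriptionHelpFormatter is shown below. The positional names pid and pattern and the long option --verbose are assumptions inferred from the usage examples, not taken from the script itself.

```python
import argparse
import textwrap


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog='smaps-view.py',
        description='Analyze memory maps and calculate cache percentages',
        # Keep the epilog exactly as written instead of re-wrapping it.
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=textwrap.dedent('''\
            Examples:
              python smaps-view.py $(pidof qdrant) '.*/vector_index/.*'
              python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
        '''),
    )
    # Argument names below are assumed for illustration.
    parser.add_argument('pid', type=int, help='Process ID to inspect')
    parser.add_argument('pattern', help='Regular expression matched against mapped file paths')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Print statistics for individual files')
    return parser


args = build_parser().parse_args(['1234', '.*/vector_index/.*', '-v'])
print(args.pid, args.pattern, args.verbose)
```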
116-135: Consider refactoring duplicated verbose output code.
The verbose output code for displaying individual file statistics is duplicated between the main loop and the final entry processing.
+ def print_file_stats(file: str, permissions: str, size: int, rss: int):
+     """Helper function to print detailed file statistics in verbose mode."""
+     percentage = (rss / size) * 100 if size > 0 else 0
+     print(f"File: {file} ({permissions})")
+     print(f" Size: {size} kB")
+     print(f" RSS: {rss} kB")
+     print(f" Cache percentage: {percentage:.2f}%")
+     print()
  ...
- # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
- if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100
-     print(f"File: {current_file} ({current_permissions})")
-     print(f" Size: {current_size} kB")
-     print(f" RSS: {current_rss} kB")
-     print(f" Cache percentage: {percentage:.2f}%")
-     print()
+ # Log individual file percentage if verbose mode is enabled and file is larger than 1MB
+ if verbose and current_size > 1024:
+     print_file_stats(current_file, current_permissions, current_size, current_rss)
  ...
- # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
- if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100 if current_size > 0 else 0
-     print(f"File: {current_file} ({current_permissions})")
-     print(f" Size: {current_size} kB")
-     print(f" RSS: {current_rss} kB")
-     print(f" Cache percentage: {percentage:.2f}%")
-     print()
+ # Log individual file percentage for the last entry if verbose mode is enabled and file is larger than 1MB
+ if verbose and current_size > 1024:
+     print_file_stats(current_file, current_permissions, current_size, current_rss)
🔇 Additional comments (6)
tools/smaps-view.py (6)
1-13: LGTM! The docstring provides clear examples and expected output.
The docstring effectively communicates the tool's purpose with practical usage examples and expected output format.
55-59: LGTM! The dataclass design is clean and appropriate.
The MapStats dataclass provides a concise way to track size and RSS statistics with appropriate default values.
60-95: LGTM! Well-documented function with clear purpose and structure.
The parse_smaps function is well-documented with clear arguments, return value, and purpose. The initial setup and regex compilation are appropriately handled.
112-115: LGTM! Simple and effective extraction of size and RSS values.
The code for extracting size and RSS values from the smaps entries is straightforward and effective.
156-166: LGTM! Good error handling for empty results.
The function correctly handles the case when no files match the provided pattern, returning an empty dictionary.
167-182: LGTM! Comprehensive error handling with user-friendly messages.
The function includes appropriate error handling for file not found and other exceptions, with clear error messages directed to stderr.
if verbose and current_size > 1024:
    percentage = (current_rss / current_size) * 100
    print(f"File: {current_file} ({current_permissions})")
    print(f" Size: {current_size} kB")
    print(f" RSS: {current_rss} kB")
    print(f" Cache percentage: {percentage:.2f}%")
    print()
🛠️ Refactor suggestion
Add division by zero protection when calculating percentage.
When calculating the cache percentage for individual files in verbose mode, there's no check to ensure that current_size is not zero, which could lead to a division by zero error.
  if verbose and current_size > 1024:
-     percentage = (current_rss / current_size) * 100
+     percentage = (current_rss / current_size) * 100 if current_size > 0 else 0
      print(f"File: {current_file} ({current_permissions})")
      print(f" Size: {current_size} kB")
      print(f" RSS: {current_rss} kB")
      print(f" Cache percentage: {percentage:.2f}%")
      print()
# Verbose mode - show individual files larger than 1MB
python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
- # Verbose mode - show individual files larger than 1MB
- python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+ # Verbose mode - show vector storage files larger than 1MB
+ python smaps-view.py $(pidof qdrant) '.*/vector_storage/.*' -v
+ # Verbose mode - show all files larger than 1MB
+ python smaps-view.py $(pidof qdrant) '.*' -v
I tested this on a small collection and needed python smaps-view.py $(pidof qdrant) '.*' -v to see any file stats. It might be worth highlighting that in the docs.
What is the point? This "match_anything" pattern will include a lot of garbage.
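To make the trade-off in this exchange concrete, the sketch below applies both patterns to a handful of mapping names using re.search, the same call the script uses. The paths are hypothetical examples of what a smaps listing can contain, not output captured from a real process.

```python
import re
from typing import List

# Hypothetical mapping names: one storage file plus unrelated entries.
paths = [
    '/qdrant/storage/collections/demo/0/segments/a1/vector_storage/vectors/chunk_0.mmap',
    '/usr/lib/x86_64-linux-gnu/libc.so.6',
    '[heap]',
    '[stack]',
]


def matching(pattern: str) -> List[str]:
    """Return the paths that the pattern selects, in their original order."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.search(p)]


# The catch-all pattern selects every mapping, including libraries and the heap.
print(len(matching('.*')))
# The narrower pattern keeps only the storage file.
print(matching('.*/vector_storage/.*'))
```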
* Add a tool to view percentage of mmaps cache
* review fixes
A tool that can be used to inspect the mmap disk cache inside a running qdrant process.