
Conversation

@hengfeiyang (Contributor) commented Sep 1, 2025

User description

We should also check whether cache latest files is enabled when we try to cache a local file to the disk cache.
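For reference, here is a minimal sketch of the guard this PR adds, abbreviated from the diffs quoted later in this thread (the Tantivy path checks cache_index instead of cache_parquet):

```rust
// The master switch now gates the disk-cache write alongside the existing
// per-feature flags (abbreviated from the PR diffs shown below).
let cfg = get_config();
if cfg.cache_latest_files.enabled               // master switch added by this PR
    && cfg.cache_latest_files.cache_parquet     // per-feature flag
    && cfg.cache_latest_files.download_from_node
{
    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
}
```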


PR Type

Bug fix


Description

  • Guard caching with the cache_latest_files.enabled flag

  • Apply the check to the parquet merge paths

  • Apply the check to Tantivy index caching

  • Prevent unintended disk cache writes


Diagram Walkthrough

```mermaid
flowchart LR
  CFG["Config: cache_latest_files.enabled"]
  P1["parquet.rs merge_files"]
  M1["compact/merge.rs merge_files (path 1)"]
  M2["compact/merge.rs merge_files (path 2)"]
  T1["tantivy create_tantivy_index"]
  DISK["disk cache set(...)"]

  CFG -- "enabled && cache_* && download_from_node" --> P1
  CFG -- "enabled && cache_* && download_from_node" --> M1
  CFG -- "enabled && cache_* && download_from_node" --> M2
  CFG -- "enabled && cache_* && download_from_node" --> T1

  P1 -- "conditional write" --> DISK
  M1 -- "conditional write" --> DISK
  M2 -- "conditional write" --> DISK
  T1 -- "conditional write" --> DISK
```

File Walkthrough

Relevant files (Bug fix):

parquet.rs — src/job/files/parquet.rs (+4/-1)
Guard parquet caching with global enable flag

  • Add cfg.cache_latest_files.enabled to condition
  • Gate parquet disk caching on enabled flag

merge.rs — src/service/compact/merge.rs (+7/-2)
Enforce enabled flag in merge caching paths

  • Add enabled check to first merge upload path
  • Add enabled check to secondary merge upload path
  • Prevent cache write when caching disabled

mod.rs — src/service/tantivy/mod.rs (+4/-2)
Respect enabled flag for Tantivy index caching

  • Store get_config() in cfg
  • Add enabled check to index caching condition


github-actions bot commented Sep 1, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Redundant Clone

The code converts puffin_bytes to Bytes using Bytes::from(puffin_bytes.clone()). If to_puffin_bytes() already returns a Vec<u8>, the clone may be unnecessary. Consider moving or reusing the buffer without cloning to avoid extra allocation.

```rust
infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes.clone()))
    .await?;
log::info!("file: {idx_file_name} file_data::disk::set success");
```
Config Retrieval

get_config() is called once and stored in cfg here, but other parts of the module may still call get_config() repeatedly. Ensure consistency across the codebase to avoid mixed patterns and potential stale config reads if config can change at runtime.

```rust
let cfg = get_config();
if cfg.cache_latest_files.enabled
    && cfg.cache_latest_files.cache_index
    && cfg.cache_latest_files.download_from_node
{
```
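To illustrate the reviewer's concern about mixed patterns: if some call sites snapshot the config while others re-read it, the two views can disagree after a runtime config reload (illustrative sketch only):

```rust
// Illustrative only: two ways of reading config that can drift apart.
let cfg = get_config();                          // snapshot taken once
let snapshot_enabled = cfg.cache_latest_files.enabled;
// ... elsewhere in the same module, a fresh read:
let fresh_enabled = get_config().cache_latest_files.enabled;
// If the config was reloaded in between, snapshot_enabled != fresh_enabled.
```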


github-actions bot commented Sep 1, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Remove unnecessary bytes clone

Avoid cloning puffin_bytes when converting to Bytes to prevent an unnecessary
allocation and copy, which is significant for large indexes. Convert the owned
vector directly into Bytes using Bytes::from(puffin_bytes) and adjust the variable
to be mutable if needed.

src/service/tantivy/mod.rs [70-77]

```diff
 let cfg = get_config();
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_index
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes.clone()))
+    infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes))
         .await?;
     log::info!("file: {idx_file_name} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies an avoidable clone of puffin_bytes and proposes moving it into Bytes::from, reducing allocations for large data; it's accurate and improves efficiency without altering logic.

Impact: Medium
Avoid cloning large buffer

Eliminate the extra buf.clone() before caching to avoid duplicating potentially
large buffers. Since buf is created right before and not reused later in this scope,
move it into set directly.

src/job/files/parquet.rs [799-805]

```diff
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_parquet
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
+    infra::cache::file_data::disk::set(&new_file_key, buf).await?;
     log::debug!("merge_files {new_file_key} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 6


Why: Replacing buf.clone() with moving buf into set saves memory and CPU; impact is moderate but beneficial if buf is not needed afterward, and matches the new hunk code exactly.

Impact: Low
Remove redundant buffer clone

Prevent unnecessary buffer cloning when writing to cache to save memory and CPU.
Pass buf by value into set if it isn't needed afterward in the scope.

src/service/compact/merge.rs [934-940]

```diff
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_parquet
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
+    infra::cache::file_data::disk::set(&new_file_key, buf).await?;
     log::debug!("merge_files {new_file_key} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 6


Why: Eliminating buf.clone() in this caching path avoids unnecessary duplication; it's a correct micro-optimization aligned with the surrounding code and conditions.

Impact: Low

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR fixes a configuration logic bug in the cache latest files system by adding missing checks for the master enabled flag across three critical files. The OpenObserve codebase has a hierarchical caching configuration system where cache_latest_files.enabled serves as the master switch that should gate all caching operations, with specific sub-flags like cache_parquet, cache_index, and download_from_node controlling individual caching behaviors.

The affected files were inconsistently checking the configuration flags:

  • src/job/files/parquet.rs: Missing enabled check in the ingester's parquet file caching logic
  • src/service/compact/merge.rs: Missing enabled check in two locations within the compactor's merge operations
  • src/service/tantivy/mod.rs: Missing enabled check in the Tantivy search index caching logic

The fix ensures that all three components now follow the same pattern: first check if cache_latest_files.enabled is true, then verify the specific feature flags (cache_parquet or cache_index), and finally confirm download_from_node is enabled. This creates consistent behavior where setting ZO_CACHE_LATEST_FILES_ENABLED=false will properly disable all local file caching operations across the ingester, compactor, and search indexing components.
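To make the described hierarchy concrete, here is a hypothetical helper (no such function exists in the codebase; Config stands for whatever type get_config() returns) that encodes the shared check:

```rust
// Hypothetical helper, for illustration only: the hierarchical check that all
// three components now share. `feature_flag` stands in for cache_parquet
// (ingester, compactor) or cache_index (Tantivy).
fn should_cache_locally(cfg: &Config, feature_flag: bool) -> bool {
    cfg.cache_latest_files.enabled              // master switch (ZO_CACHE_LATEST_FILES_ENABLED)
        && feature_flag
        && cfg.cache_latest_files.download_from_node
}
```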

Additionally, the Tantivy module change includes a minor optimization by caching the config reference in a local variable instead of calling get_config() multiple times.

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it only adds missing configuration checks without changing core logic
  • Score reflects simple, consistent changes that fix a clear configuration bug with no breaking changes to existing functionality
  • No files require special attention as all changes follow the same straightforward pattern

3 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@hengfeiyang hengfeiyang merged commit f49a4a3 into main Sep 1, 2025
28 of 29 checks passed
@hengfeiyang hengfeiyang deleted the fix/compactor-cache branch September 1, 2025 05:08