
Conversation

@hengfeiyang (Contributor) commented Sep 1, 2025

User description

We should also check whether cache latest files is enabled when we try to cache a local file to the disk cache.
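For reference, here is a minimal sketch of the guard this PR adds, abbreviated from the diffs quoted later in this thread (the Tantivy path checks cache_index instead of cache_parquet):

```rust
// The master switch now gates the disk-cache write alongside the existing
// per-feature flags (abbreviated from the PR diffs shown below).
let cfg = get_config();
if cfg.cache_latest_files.enabled               // master switch added by this PR
    && cfg.cache_latest_files.cache_parquet     // per-feature flag
    && cfg.cache_latest_files.download_from_node
{
    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
}
```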


PR Type

Bug fix


Description

  • Guard caching with the cache_latest_files.enabled flag

  • Apply the check to the parquet merge paths

  • Apply the check to Tantivy index caching

  • Prevent unintended disk cache writes


Diagram Walkthrough

```mermaid
flowchart LR
  CFG["Config: cache_latest_files.enabled"]
  P1["parquet.rs merge_files"]
  M1["compact/merge.rs merge_files (path 1)"]
  M2["compact/merge.rs merge_files (path 2)"]
  T1["tantivy create_tantivy_index"]
  DISK["disk cache set(...)"]

  CFG -- "enabled && cache_* && download_from_node" --> P1
  CFG -- "enabled && cache_* && download_from_node" --> M1
  CFG -- "enabled && cache_* && download_from_node" --> M2
  CFG -- "enabled && cache_* && download_from_node" --> T1

  P1 -- "conditional write" --> DISK
  M1 -- "conditional write" --> DISK
  M2 -- "conditional write" --> DISK
  T1 -- "conditional write" --> DISK
```

File Walkthrough

Relevant files (Bug fix):

parquet.rs — src/job/files/parquet.rs (+4/-1)
Guard parquet caching with global enable flag

  • Add cfg.cache_latest_files.enabled to condition
  • Gate parquet disk caching on enabled flag

merge.rs — src/service/compact/merge.rs (+7/-2)
Enforce enabled flag in merge caching paths

  • Add enabled check to first merge upload path
  • Add enabled check to secondary merge upload path
  • Prevent cache write when caching disabled

mod.rs — src/service/tantivy/mod.rs (+4/-2)
Respect enabled flag for Tantivy index caching

  • Store get_config() in cfg
  • Add enabled check to index caching condition


github-actions bot commented Sep 1, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Redundant Clone

The code converts puffin_bytes to Bytes using Bytes::from(puffin_bytes.clone()). If to_puffin_bytes() already returns a Vec<u8>, the clone may be unnecessary. Consider moving or reusing the buffer without cloning to avoid extra allocation.

```rust
infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes.clone()))
    .await?;
log::info!("file: {idx_file_name} file_data::disk::set success");
```
Config Retrieval

get_config() is called once and stored in cfg here, but other parts of the module may still call get_config() repeatedly. Ensure consistency across the codebase to avoid mixed patterns and potential stale config reads if config can change at runtime.

```rust
let cfg = get_config();
if cfg.cache_latest_files.enabled
    && cfg.cache_latest_files.cache_index
    && cfg.cache_latest_files.download_from_node
{
```
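To illustrate the reviewer's concern about mixed patterns: if some call sites snapshot the config while others re-read it, the two views can disagree after a runtime config reload (illustrative sketch only):

```rust
// Illustrative only: two ways of reading config that can drift apart.
let cfg = get_config();                          // snapshot taken once
let snapshot_enabled = cfg.cache_latest_files.enabled;
// ... elsewhere in the same module, a fresh read:
let fresh_enabled = get_config().cache_latest_files.enabled;
// If the config was reloaded in between, snapshot_enabled != fresh_enabled.
```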


github-actions bot commented Sep 1, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Remove unnecessary bytes clone

Avoid cloning puffin_bytes when converting to Bytes to prevent an unnecessary
allocation and copy, which is significant for large indexes. Convert the owned
vector directly into Bytes using Bytes::from(puffin_bytes) and adjust the variable
to be mutable if needed.

src/service/tantivy/mod.rs [70-77]

```diff
 let cfg = get_config();
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_index
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes.clone()))
+    infra::cache::file_data::disk::set(&idx_file_name, Bytes::from(puffin_bytes))
         .await?;
     log::info!("file: {idx_file_name} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies an avoidable clone of puffin_bytes and proposes moving it into Bytes::from, reducing allocations for large data; it's accurate and improves efficiency without altering logic.

Impact: Medium
Avoid cloning large buffer

Eliminate the extra buf.clone() before caching to avoid duplicating potentially
large buffers. Since buf is created right before and not reused later in this scope,
move it into set directly.

src/job/files/parquet.rs [799-805]

```diff
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_parquet
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
+    infra::cache::file_data::disk::set(&new_file_key, buf).await?;
     log::debug!("merge_files {new_file_key} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 6


Why: Replacing buf.clone() with moving buf into set saves memory and CPU; impact is moderate but beneficial if buf is not needed afterward, and matches the new hunk code exactly.

Impact: Low
Remove redundant buffer clone

Prevent unnecessary buffer cloning when writing to cache to save memory and CPU.
Pass buf by value into set if it isn't needed afterward in the scope.

src/service/compact/merge.rs [934-940]

```diff
 if cfg.cache_latest_files.enabled
     && cfg.cache_latest_files.cache_parquet
     && cfg.cache_latest_files.download_from_node
 {
-    infra::cache::file_data::disk::set(&new_file_key, buf.clone()).await?;
+    infra::cache::file_data::disk::set(&new_file_key, buf).await?;
     log::debug!("merge_files {new_file_key} file_data::disk::set success");
 }
```
Suggestion importance[1-10]: 6


Why: Eliminating buf.clone() in this caching path avoids unnecessary duplication; it's a correct micro-optimization aligned with the surrounding code and conditions.

Impact: Low

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR fixes a configuration logic bug in the cache latest files system by adding missing checks for the master enabled flag across three critical files. The OpenObserve codebase has a hierarchical caching configuration system where cache_latest_files.enabled serves as the master switch that should gate all caching operations, with specific sub-flags like cache_parquet, cache_index, and download_from_node controlling individual caching behaviors.

The affected files were inconsistently checking the configuration flags:

  • src/job/files/parquet.rs: Missing enabled check in the ingester's parquet file caching logic
  • src/service/compact/merge.rs: Missing enabled check in two locations within the compactor's merge operations
  • src/service/tantivy/mod.rs: Missing enabled check in the Tantivy search index caching logic

The fix ensures that all three components now follow the same pattern: first check if cache_latest_files.enabled is true, then verify the specific feature flags (cache_parquet or cache_index), and finally confirm download_from_node is enabled. This creates consistent behavior where setting ZO_CACHE_LATEST_FILES_ENABLED=false will properly disable all local file caching operations across the ingester, compactor, and search indexing components.
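To make the described hierarchy concrete, here is a hypothetical helper (no such function exists in the codebase; Config stands for whatever type get_config() returns) that encodes the shared check:

```rust
// Hypothetical helper, for illustration only: the hierarchical check that all
// three components now share. `feature_flag` stands in for cache_parquet
// (ingester, compactor) or cache_index (Tantivy).
fn should_cache_locally(cfg: &Config, feature_flag: bool) -> bool {
    cfg.cache_latest_files.enabled              // master switch (ZO_CACHE_LATEST_FILES_ENABLED)
        && feature_flag
        && cfg.cache_latest_files.download_from_node
}
```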

Additionally, the Tantivy module change includes a minor optimization by caching the config reference in a local variable instead of calling get_config() multiple times.

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it only adds missing configuration checks without changing core logic
  • Score reflects simple, consistent changes that fix a clear configuration bug with no breaking changes to existing functionality
  • No files require special attention as all changes follow the same straightforward pattern

3 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@hengfeiyang hengfeiyang merged commit f49a4a3 into main Sep 1, 2025
28 of 29 checks passed
@hengfeiyang hengfeiyang deleted the fix/compactor-cache branch September 1, 2025 05:08