Skip to content

[GLUTEN-8846][CH] [Part 3] Add benchmark for Icerberg Delete#9192

Merged
baibaichen merged 4 commits intoapache:mainfrom
baibaichen:feature/iceberg-mor
Apr 1, 2025
Merged

[GLUTEN-8846][CH] [Part 3] Add benchmark for Icerberg Delete#9192
baibaichen merged 4 commits intoapache:mainfrom
baibaichen:feature/iceberg-mor

Conversation

@baibaichen
Copy link
Copy Markdown
Contributor

@baibaichen baibaichen commented Apr 1, 2025

What changes were proposed in this pull request?

(Fixes: #8846)

Summary of Changes

This commit primarily:

  1. Fixes directory name capitalization from "iceberg" to "Iceberg" across the codebase
  2. Reorganizes and improves include statements in several files
  3. Adds new benchmark tests for Iceberg with delete operations
  4. Makes minor code improvements and bug fixes

Key Changes

Directory Structure and Naming

  • Renamed iceberg directory to Iceberg with proper capitalization
  • Updated all include paths to reflect this change
  • Updated CMakeLists.txt to use the new capitalized directory name

Code Improvements

  • Fixed typo: begionbegin in BlockTypeUtils.h
  • Added assertion to verify input_format is not null in NormalFileReader constructor
  • Improved readability of DeltaReader::create by splitting a long line
  • Simplified row counting in ParquetFormatFile.cpp using std::ranges::fold_left
  • Added null check before using input in IcebergReader::create

New Functionality: Added code for benchmarking Iceberg read operations with equality deletes and position deletes

These two benchmark functions (BM_IcebergReadWithPositionDeletes and BM_IcebergReadWithEqualityDeletes) are designed to evaluate the performance of ClickHouse when integrated with Apache Iceberg delete operations. Supports parameterized testing with different deletion ratios (0% to 100%)

Run on (24 X 1285.69 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1280 KiB (x12)
  L3 Unified 30720 KiB (x1)
Load Average: 2.34, 1.91, 2.30
----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_IcebergReadWithPositionDeletes/0/iterations:10         39.3 ms         39.3 ms           10
BM_IcebergReadWithPositionDeletes/1/iterations:10          156 ms          155 ms           10
BM_IcebergReadWithPositionDeletes/10/iterations:10         148 ms          145 ms           10
BM_IcebergReadWithPositionDeletes/50/iterations:10         273 ms          267 ms           10
BM_IcebergReadWithPositionDeletes/100/iterations:10        347 ms          335 ms           10
BM_IcebergReadWithEqualityDeletes/0/iterations:10         36.8 ms         36.8 ms           10
BM_IcebergReadWithEqualityDeletes/1/iterations:10          192 ms          191 ms           10
BM_IcebergReadWithEqualityDeletes/5/iterations:10          617 ms          616 ms           10
BM_IcebergReadWithEqualityDeletes/10/iterations:10        1821 ms         1817 ms           10
BM_IcebergReadWithEqualityDeletes/50/iterations:10        9716 ms         9703 ms           10

Position Delete Performance (BM_IcebergReadWithPositionDeletes)

Position delete tests show performance decreases as deletion ratio increases, but the growth is relatively slow:

  • Only 39.3ms required with no deletions
  • Performance drops significantly to 156ms when deleting 1%, profiler shows hot pot is in bitmap contain test
  • Remains relatively stable (148ms) when deleting 10% of data
  • Increases to 273ms when deleting 50%
  • Reaches 347ms for complete deletion (100%), profiler shows 2 hot pots: one is in reading file column of position file, and another is in creating bitmap.

This indicates that for position deletes, performance degradation has a near-linear relationship with deletion ratio, with processing time increasing approximately 8.8 times from 0 to 100% deletion ratio.

Equality Delete Performance (BM_IcebergReadWithEqualityDeletes)

Equality delete tests demonstrate a dramatic performance decline:

  • 36.8ms with no deletions (similar to position deletes)
  • Increases to 192ms when deleting 1%
  • Reaches 617ms when deleting 5%
  • Spikes to 1821ms when deleting 10%
  • Reaches 9716ms when deleting 50%

Equality delete performance degradation shows exponential growth, with processing time increasing approximately 264 times from 0 to 50% deletion ratio.

Analysis

There is only one hotpot, i.e. DB::Set::executeImplCase which is part of the logic for checking if elements in a set of columns belong to a predefined set, and supporting both positive and negative set operations.

How was this patch tested?

Existed UTs

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2025

#8846

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2025

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen requested a review from Copilot April 1, 2025 08:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 11 out of 25 changed files in this pull request and generated no comments.

Files not reviewed (14)
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickHouseTPCHAbstractSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/tpch/GlutenClickHouseHDFSSuite.scala: Language not supported
  • cpp-ch/local-engine/Common/BlockTypeUtils.cpp: Language not supported
  • cpp-ch/local-engine/Common/BlockTypeUtils.h: Language not supported
  • cpp-ch/local-engine/Common/PlanUtil.cpp: Language not supported
  • cpp-ch/local-engine/Parser/FunctionExecutor.cpp: Language not supported
  • cpp-ch/local-engine/Parser/RelParsers/ReadRelParser.cpp: Language not supported
  • cpp-ch/local-engine/Parser/RelParsers/WriteRelParser.cpp: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/CMakeLists.txt: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/FileReader.cpp: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/Iceberg/EqualityDeleteFileReader.cpp: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/Iceberg/IcebergReader.cpp: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/Iceberg/PositionalDeleteFileReader.cpp: Language not supported
  • cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp: Language not supported

@baibaichen
Copy link
Copy Markdown
Contributor Author

@CodiumAI-Agent /review

@QodoAI-Agent
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

8846 - Partially compliant

Compliant requirements:

  • Renamed directory and updated include paths
  • Added benchmark tests for Iceberg delete operations
  • Improved assertions and code structure (e.g., null check for input_format)

Non-compliant requirements:

  • The improvement regarding continuous integer deletion filtering remains unaddressed

Requires further human verification:

  • Review benchmark accuracy and performance consistency
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Absolute Path Calculation

Verify that the new absoluteParquetPath is correctly computed from rootPath and used in file operations.

final protected lazy val absoluteParquetPath = rootPath + parquetTableDataPath

protected val tablesPath: String
Type Conversion Update

Ensure that replacing blockToNameAndTypeList with blockToRowType preserves all expected type conversion semantics.

DB::Block toSampleBlock(const RowType & type);
RowType blockToRowType(const DB::Block & header);
DB::DataTypePtr wrapNullableType(bool nullable, DB::DataTypePtr nested_type);
Input Format Validation

Confirm that the null-check and handling of the input from input_format_callback are robust in all scenarios.

    ? nullptr
    : EqualityDeleteFileReader::createDeleteExpr(context, file_->getFileSchema(), delete_files, it_equal->second, new_header);

auto input = input_format_callback(new_header);
if (!input)
    return nullptr;
return std::make_unique<IcebergReader>(

@baibaichen
Copy link
Copy Markdown
Contributor Author

@CodiumAI-Agent /improve

@QodoAI-Agent
Copy link
Copy Markdown

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Add error handling for dynamic_cast

Replace the assert with explicit error handling to avoid unexpected behavior in
release builds if the cast fails.

cpp-ch/local-engine/Storages/SubstraitSource/Iceberg/SimpleParquetReader.cpp [71-72]

 auto * seekable = dynamic_cast<SeekableReadBuffer *>(read_buffer_reader_.get());
-assert(seekable != nullptr);
+if (!seekable)
+{
+    // Handle error gracefully, e.g., return an error status or throw a controlled exception
+}
Suggestion importance[1-10]: 7

__

Why: The suggestion replaces an assert with explicit error handling to improve runtime safety in release builds, though it is an error handling improvement that does not fundamentally alter functionality.

Medium
General
Add logging for null input

Log a diagnostic message before returning nullptr to aid debugging when the input
format creation fails.

cpp-ch/local-engine/Storages/SubstraitSource/Iceberg/IcebergReader.cpp [63-65]

 auto input = input_format_callback(new_header);
 if (!input)
+{
+    // Log an error or warning about null input_format result.
     return nullptr;
+}
Suggestion importance[1-10]: 3

__

Why: The change offers a minor improvement by adding a diagnostic comment for debugging, but it only slightly enhances code clarity and does not fix a critical issue.

Low

@baibaichen baibaichen marked this pull request as ready for review April 1, 2025 08:35
Copy link
Copy Markdown
Contributor

@zzcclp zzcclp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@baibaichen baibaichen merged commit 2f312ea into apache:main Apr 1, 2025
9 checks passed
@baibaichen baibaichen deleted the feature/iceberg-mor branch April 1, 2025 09:59
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Apr 5, 2025
baibaichen pushed a commit that referenced this pull request Apr 5, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250405)

* Fix Build due to ClickHouse/ClickHouse#78515

* remove temp code due to #9192

* Fix test fail due to ClickHouse/ClickHouse#78022

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH][UMBRELLA] Support Iceberg MOR

4 participants