Skip to content

feat(sql): upgrades to querying external parquet files#6369

Draft
nwoolmer wants to merge 43 commits intomasterfrom
feat-external-parquet-upgrades
Draft

feat(sql): upgrades to querying external parquet files#6369
nwoolmer wants to merge 43 commits intomasterfrom
feat-external-parquet-upgrades

Conversation

@nwoolmer
Copy link
Copy Markdown
Contributor

@nwoolmer nwoolmer commented Nov 10, 2025

WIP for parquet usability roadmap. Not ready for review, lots of refactoring required.

Closes #5280

Projection

Lack of projection causes high memory usage and slow queries when working with large parquet files. This PR aims to address this, by pushing down projections from higher query models directly to the record and page frame cursors.

Q1 Clickbench and hits.parquet

SELECT count(*) FROM read_parquet('hits.parquet') WHERE AdvEngineId <> 0; 
Single-threaded Multithreaded
Master 106s DNF (OOM)
PR 670ms 269ms

Note: This PR does change some of the permissiveness around Parquet metadata validation. Previously, it was required that the file was always fully decoded and the metadata matched exactly. Now only the projected columns must be readable from the file. Therefore, if you change the schema by changing an underlying column type, it will throw an exception. But if you add an extra column to the schema, it will simply be ignored.

Glob/hive-partitioned reads

Generally, large parquet datasets will come in a partitioned form. Querying this is unergonomic, so usually a means to query the files using a wildcard pattern is provided. For example:

SELECT count() FROM read_parquet('hits/*.parquet') WHERE AdvEngineId <> 0; -- 1.26s cold, 1.06s hot
SELECT count() FROM read_parquet('hits.parquet') WHERE AdvEngineId <> 0; -- 546ms cold, 270ms hot

Min/max statistics inc. timestamp intrinsics

Changelist

  • Projection
    • Single-threaded read_parquet
    • Multi-threaded read_parquet
    • Tests for unions, joins etc.
  • Glob/hive-partitioned reads
    • Querying multiple files at once
    • Fast calculateSize
    • Fast skipRows
    • Virtual column addition for key=v syntax
      • With a path like `day=2025-01-01/symbol=BTC-USD/data.parquet
      • We should automatically add the day and symbol columns to the projection (this is industry standard).
      • For paths that don't follow this pattern, we do not do this.
    • Tests etc.

Out of scope - minmax stats etc. separate holistic PR required.

@nwoolmer nwoolmer added SQL Issues or changes relating to SQL execution Core Related to storage, data type, etc. labels Nov 10, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 10, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat-external-parquet-upgrades

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nwoolmer nwoolmer added the Performance Performance improvements label Nov 10, 2025
nwoolmer and others added 12 commits November 11, 2025 10:00
# Conflicts:
#	core/src/main/java/io/questdb/griffin/engine/functions/catalogue/FilesFunctionFactory.java
#	core/src/main/java/io/questdb/griffin/engine/functions/regex/GlobStrFunctionFactory.java
#	core/src/main/java/io/questdb/griffin/engine/functions/table/GlobFilesFunctionFactory.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/table/GlobFilesFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/table/GlobFilesIntegrationTest.java
# Conflicts:
#	core/src/main/java/io/questdb/griffin/SqlOptimiser.java
fromParquetColumnIndexes.setAll(metadataIndex, -1);
for (int i = 0, n = addressCache.getColumnCount(); i < n; i++) {
final int columnIndex = addressCache.getColumnIndexes().getQuick(i);
final int parquetColumnIndex = toParquetColumnIndexes.getQuick(columnIndex);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table reader column indexes and parquet column indexes may not match. That's why we have this additional indirection via toParquetColumnIndexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Core Related to storage, data type, etc. Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add projection for virtual table SQL functions like read_parquet()

3 participants