feat(sql): upgrades to querying external parquet files by nwoolmer · Pull Request #6369 · questdb/questdb

nwoolmer · 2025-11-10T10:49:19Z

WIP for parquet usability roadmap. Not ready for review, lots of refactoring required.

Projection

Without projection pushdown, queries over large Parquet files use a lot of memory and run slowly. This PR addresses that by pushing projections down from higher query models directly to the record and page frame cursors.
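As a rough illustration of the idea (not the PR's actual code), projection pushdown boils down to working out which file columns a query references and handing only those positions to the reader. The class and method names below are hypothetical, and plain JDK collections stand in for QuestDB's own types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not part of QuestDB.
final class ProjectionSketch {
    // Returns the positions (in file order) of the columns the query references.
    // Only these columns would need to be decoded by the reader.
    static int[] projectedIndexes(List<String> fileColumns, Set<String> referenced) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < fileColumns.size(); i++) {
            if (referenced.contains(fileColumns.get(i))) {
                out.add(i);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For a query that only filters on AdvEngineId, the sketch selects a single column index, however many columns the file contains.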

Q1 Clickbench and hits.parquet

SELECT count(*) FROM read_parquet('hits.parquet') WHERE AdvEngineId <> 0;

Branch	Single-threaded	Multithreaded
Master	106s	DNF (OOM)
PR	670ms	269ms

Note: This PR relaxes Parquet metadata validation. Previously, the file always had to be fully decoded and its metadata had to match exactly. Now only the projected columns must be readable from the file. Therefore, if the schema changes because an underlying column's type changes, the query throws an exception. But if an extra column is added to the schema, it is simply ignored.
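A minimal sketch of the relaxed check described above, assuming column types are compared by name and only for projected columns. `ProjectedSchemaCheck` and its string type names are hypothetical and do not reflect QuestDB's real metadata classes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: validates only the columns a query projects.
final class ProjectedSchemaCheck {
    // expectedTypes: column name -> type the query was compiled against
    // fileTypes:     column name -> type currently found in the Parquet file
    static void validate(Map<String, String> expectedTypes,
                         Map<String, String> fileTypes,
                         List<String> projected) {
        for (String column : projected) {
            String expected = expectedTypes.get(column);
            String actual = fileTypes.get(column);
            if (actual == null) {
                throw new IllegalStateException("column missing from file: " + column);
            }
            if (!actual.equals(expected)) {
                throw new IllegalStateException(
                        "type changed for " + column + ": expected " + expected + ", found " + actual);
            }
        }
        // Columns present in fileTypes but not projected are never inspected,
        // so an extra column in the file is ignored.
    }
}
```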

Glob/hive-partitioned reads

Large Parquet datasets generally come in partitioned form. Querying the files individually is unergonomic, so a way to query them with a wildcard pattern is usually provided. For example:

SELECT count() FROM read_parquet('hits/*.parquet') WHERE AdvEngineId <> 0; -- 1.26s cold, 1.06s hot
SELECT count() FROM read_parquet('hits.parquet') WHERE AdvEngineId <> 0; -- 546ms cold, 270ms hot
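To show what a pattern such as hits/*.parquet selects, here is a small sketch using the JDK's built-in glob `PathMatcher`. This is an assumption for illustration only: the PR ships its own glob implementation, and its matching rules may differ from the JDK's. `GlobSketch` is a hypothetical name.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: filters candidate paths with the JDK glob matcher.
final class GlobSketch {
    static List<String> filter(String glob, List<String> paths) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String p : paths) {
            // In JDK glob syntax, '*' does not cross directory boundaries.
            if (matcher.matches(Paths.get(p))) {
                matched.add(p);
            }
        }
        return matched;
    }
}
```

With the JDK matcher, files in nested directories under hits/ are not matched by a single `*`; a pattern such as hits/**.parquet would be needed to cross directory boundaries.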

Min/max statistics, including timestamp intrinsics

Changelist

Out of scope: min/max statistics etc. A separate, holistic PR is required.

coderabbitai · 2025-11-10T10:50:02Z

Important

Review skipped

Auto reviews are disabled on this repository.

# Conflicts:
#	core/src/main/java/io/questdb/griffin/engine/functions/catalogue/FilesFunctionFactory.java
#	core/src/main/java/io/questdb/griffin/engine/functions/regex/GlobStrFunctionFactory.java
#	core/src/main/java/io/questdb/griffin/engine/functions/table/GlobFilesFunctionFactory.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/table/GlobFilesFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/table/GlobFilesIntegrationTest.java

# Conflicts: # core/src/main/java/io/questdb/griffin/SqlOptimiser.java

puzpuzpuz · 2025-12-16T08:54:44Z

core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java

+        fromParquetColumnIndexes.setAll(metadataIndex, -1);
        for (int i = 0, n = addressCache.getColumnCount(); i < n; i++) {
            final int columnIndex = addressCache.getColumnIndexes().getQuick(i);
-            final int parquetColumnIndex = toParquetColumnIndexes.getQuick(columnIndex);


Table reader column indexes and parquet column indexes may not match. That's why we have this additional indirection via toParquetColumnIndexes.
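The indirection mentioned in this comment can be pictured with plain arrays. The sketch below is hypothetical: it uses `int[]` instead of QuestDB's list types and only illustrates building a reverse mapping that is pre-filled with -1, which the `setAll(metadataIndex, -1)` line in the diff suggests. It does not claim to reproduce the PR's logic.

```java
import java.util.Arrays;

// Hypothetical sketch: table reader column index <-> Parquet column index.
final class ColumnIndexMapping {
    // toParquetColumnIndexes[readerIndex] = Parquet column index, or -1 if absent.
    // Returns fromParquet[parquetIndex] = reader column index, or -1 if unmapped.
    static int[] invert(int[] toParquetColumnIndexes, int parquetColumnCount) {
        int[] fromParquet = new int[parquetColumnCount];
        Arrays.fill(fromParquet, -1);
        for (int readerIndex = 0; readerIndex < toParquetColumnIndexes.length; readerIndex++) {
            int parquetIndex = toParquetColumnIndexes[readerIndex];
            if (parquetIndex >= 0) {
                fromParquet[parquetIndex] = readerIndex;
            }
        }
        return fromParquet;
    }
}
```

The point of the example is that the two index spaces cannot be used interchangeably: reader column 0 may live at Parquet position 2, and some Parquet columns may have no reader counterpart at all.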

nwoolmer added 26 commits October 18, 2025 21:03

projection single threaded

9495a58

test fix

32aa93d

Merge branch 'refs/heads/master' into nw_parquet_projection

ec61a57

absolute mess of bruteforcing to try to get query to not crash.

0bef6a1

single threaded

129e514

Revert "absolute mess of bruteforcing to try to get query to not crash."

447fbf4

This reverts commit 0bef6a1.

Merge branch 'refs/heads/master' into nw_parquet_projection

457b574

reworking

d2f197a

reworking

e5f51e3

speed up parallel execution

40a6969

Merge branch 'master' into nw_parquet_projection

61337f4

cleanup

e9bf986

Merge remote-tracking branch 'origin/nw_parquet_projection' into nw_p…

3624fb0

…arquet_projection

Merge branch 'master' into nw_parquet_projection

f175f11

cleanup

b57694a

cleanup

fa176b2

fix tests and address comments

e4134e0

fix test

2cd6d0f

Merge branch 'master' into nw_parquet_projection

c9e0224

safetysafety

1be748c

glob syntax sugar

de82f6a

iterate

e91dc5e

integration test

27e6b34

safety

374a00a

hive reads, safety commit, speeding up skipRows next

0d5ecfd

iterating

85ef1ef

nwoolmer added the SQL (Issues or changes relating to SQL execution) and Core (Related to storage, data type, etc.) labels Nov 10, 2025

nwoolmer mentioned this pull request Nov 10, 2025

perf(sql): pushdown projections to parquet file reads #6287

Closed

3 tasks

nwoolmer added the Performance (Performance improvements) label Nov 10, 2025

nwoolmer and others added 12 commits November 11, 2025 10:00

iterating

7007119

cleanup

5ddc043

robo tests

80b915e

fix fds

52d8834

iteration

2ff1419

iteration

9d5e82d

iteration

0260c5c

Merge branch 'refs/heads/master' into feat-external-parquet-upgrades

95307ac

# Conflicts: # core/src/main/java/io/questdb/griffin/SqlOptimiser.java

todos

734f4a0

merge fallout

d7cc38d

pushdown topdoen cols for MutableMetadataRecordCursorFactory

edba49d

puzpuzpuz reviewed Dec 16, 2025

kafka1991 added 4 commits December 16, 2025 19:58

refactor glob function

0872e88

add GlobFilesFunctionFactory tests

f316180

glob add fuzz tests

4c039d4

revert unnecessary MutableMetadataRecordCursorFactory

6111ad4

This was referenced Dec 18, 2025

feat(sql): add column projection pushdown for read_parquet() #6551

Merged

feat(sql): align glob() function with DuckDB glob syntax #6552

Merged

buf fix

d42cd1a

nwoolmer mentioned this pull request Mar 18, 2026

fix(sql): fix read_parquet on Parquet files with stale QDB metadata #6885

Merged

nwoolmer wants to merge 43 commits into master from feat-external-parquet-upgrades

3 participants
