chore(sql): queryable parquet partitions #4918

Merged
bluestreak01 merged 233 commits into master from puzpuzpuz_query_parquet
Dec 5, 2024

@puzpuzpuz puzpuzpuz commented Sep 3, 2024

Tasks

Dev

  • [must-have] Time interval intrinsics - @puzpuzpuz
  • [must-have] alter table <table_name> convert partition to native - @eugenels
  • [must-have] Implement TableReader columns/parquet fd refresh on parquet->native conversion (add a stress test for that) - @eugenels
  • [must-have] Include metadata offset into TableReader's Parquet partition metadata snapshots (currently it only has fds) and use the offset when reading from Parquet - @amunra
  • [must-have] Integrate time interval intrinsics with the metadata offset patch and move per-row group fields from PartitionFrame to PageFrame - @puzpuzpuz
  • [must-have] Consistent column index handling when querying/writing parquet partitions (e.g. dedup assumes matching column indexes between table writer and parquet files)
  • [must-have] Dedup support - @eugenels
  • [must-have] DDLs and coltop support for parquet partitions, e.g. when a coltop column is created, insert the column into older Parquet partitions - @eugenels
  • [must-have] Write symbol index files for parquet partitions (a lot of factories assume that we have the files) - @puzpuzpuz
  • [must-have] UPDATE on parquet partitions
  • [must-have] Rebuild index on UPDATE on parquet partitions
  • [must-have] Compaction background job for Parquet partitions
  • [must-have] Alter table column type support for parquet partitions
  • Index repair CLI tool needs to support parquet partitions
  • Extract QdbMeta to separate Java API
  • CI for our modified parquet2 lib
  • When encoding symbols, only walk through the source column once. We should review the rest of the columns too for this. - @eugenels
  • Refactoring and code dedup around symbol column (todos left in code).
  • Export KV metadata for ARROW:schema and (possibly) pandas so that symbols are read as categorical data.
  • Track Rust memory alloc - @amunra
  • Raise CairoException from Rust, not RuntimeError.
  • Don't collect parquet stats for varchar/string/symbol/binary when encoding partitions (to save disk space)
  • Display minTimestamp/maxTimestamp for parquet partitions in SHOW PARTITIONS
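
The task list does not spell out how the time interval intrinsics work. As a rough illustration only (not the code in this PR), one common way to apply a timestamp interval to Parquet data is to skip row groups whose min/max timestamp statistics do not overlap the queried interval. `RowGroupStats` and `row_groups_in_interval` are hypothetical names, and inclusive interval bounds are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group timestamp statistics (inclusive bounds)."""
    index: int
    min_ts: int
    max_ts: int


def row_groups_in_interval(stats, lo, hi):
    """Return indexes of row groups whose [min_ts, max_ts] overlaps [lo, hi]."""
    return [s.index for s in stats if s.max_ts >= lo and s.min_ts <= hi]


# Three row groups covering consecutive, non-overlapping timestamp ranges.
stats = [
    RowGroupStats(0, 0, 99),
    RowGroupStats(1, 100, 199),
    RowGroupStats(2, 200, 299),
]

# Only row groups 1 and 2 can contain rows in [150, 250].
print(row_groups_in_interval(stats, 150, 250))
```

Row groups that survive this check still need a row-level search for the exact interval boundaries; the sketch only covers the pruning step.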

Profiling

... long list here :-)

Fuzz Testing

  • Stress multiple row groups
  • Stress adding and deleting columns
  • Stress selecting individual columns
  • Test IN and BETWEEN queries
  • [must-have] Add parquet partition conversion into existing fuzz tests (to/from)

Plain old testing

  • Concurrent read-write test to make sure that parquet offsets are correctly read and used
  • Dropping partitions
  • Detach + attach partitions
  • Renaming columns
  • Changing column types
  • Column tops
  • Test read_parquet function against a parquet partition we've generated that contains symbols (it will probably break since read_parquet expects string column results, not i32 indices).
  • Parallel algo queries (threading)
    • Latest on
    • Filter
    • Group by
    • Sample by
  • Cover TableReader metadata change handling when converting to/from parquet/regular partitions
  • Parameterize existing O3 tests to work with parquet
  • Write integration tests to ensure our parquet files can be consumed by:
    • Iceberg
    • Pandas
    • Polars
    • PyArrow
    • DuckDB
    • Spark
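
The read_parquet item above notes that symbol columns are stored as i32 indices while read_parquet expects strings. A minimal sketch of the mapping a reader would need, assuming a symbol table held as a plain list and `-1` as the null marker (both are assumptions for illustration, and `decode_symbols` is a hypothetical helper, not an API from this PR):

```python
def decode_symbols(indices, symbol_table, null_index=-1):
    """Map dictionary indices to symbol strings.

    null_index marks a null value; the -1 default is an assumed convention.
    """
    decoded = []
    for i in indices:
        if i == null_index:
            decoded.append(None)
        else:
            decoded.append(symbol_table[i])
    return decoded


# Indices as they might be stored in the partition, plus the symbol table.
symbol_table = ["AAPL", "MSFT", "GOOG"]
indices = [0, 2, -1, 1, 0]

print(decode_symbols(indices, symbol_table))
```

A test along these lines could compare the decoded strings against the values returned by a query on the native partition.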

@puzpuzpuz puzpuzpuz added the "SQL" (Issues or changes relating to SQL execution) and "DO NOT MERGE" (These changes should not be merged to main branch) labels Sep 3, 2024
@puzpuzpuz puzpuzpuz self-assigned this Sep 3, 2024
@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_query_parquet branch 4 times, most recently from 1629bc6 to fc2892f September 3, 2024 16:20
@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_query_parquet branch from fc2892f to 6425628 September 3, 2024 17:03

CLAassistant commented Sep 4, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ puzpuzpuz
✅ bluestreak01
✅ eugenels
❌ GitHub Actions - Rebuild Native Libraries


GitHub Actions - Rebuild Native Libraries seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

puzpuzpuz and others added 5 commits November 26, 2024 16:34
…quet

# Conflicts:
#	core/src/main/java/io/questdb/cairo/CairoConfigurationWrapper.java
#	core/src/main/java/io/questdb/cairo/ColumnVersionReader.java
#	core/src/main/java/io/questdb/cairo/TableWriter.java
#	core/src/main/java/io/questdb/cairo/TxReader.java
#	core/src/main/java/io/questdb/cairo/sql/BindVariableService.java
#	core/src/main/java/io/questdb/griffin/SqlCompilerImpl.java
#	core/src/main/java/io/questdb/griffin/engine/ops/AlterOperation.java
#	core/src/main/java/io/questdb/griffin/engine/ops/AlterOperationBuilder.java
#	core/src/main/java/io/questdb/std/ObjectPool.java
#	core/src/test/java/io/questdb/test/TestServerMain.java
#	core/src/test/java/io/questdb/test/cairo/CairoTestConfiguration.java
#	core/src/test/java/io/questdb/test/cairo/fuzz/WalWriterFuzzTest.java
#	core/src/test/java/io/questdb/test/cairo/o3/O3MaxLagFuzzTest.java
#	core/src/test/java/io/questdb/test/cairo/wal/TableSequencerImplTest.java
#	core/src/test/java/io/questdb/test/cutlass/http/ImportIODispatcherTest.java
#	core/src/test/java/io/questdb/test/cutlass/pgwire/BasePGTest.java
#	core/src/test/java/io/questdb/test/cutlass/pgwire/PGJobContextTest.java
#	core/src/test/java/io/questdb/test/griffin/AlterTableConvertPartitionTest.java
#	core/src/test/java/io/questdb/test/griffin/KeywordAsTableNameTest.java
#	core/src/test/java/io/questdb/test/griffin/ParallelFilterTest.java
#	core/src/test/java/io/questdb/test/griffin/ParallelGroupByFuzzTest.java
#	core/src/test/java/io/questdb/test/griffin/ParallelLatestByTest.java
#	core/src/test/java/io/questdb/test/griffin/VarcharConversionTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/eq/EqTimestampCursorFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/CovarPopGroupByFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/CovarSampleGroupByFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/json/JsonExtractVarcharFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/str/LPadFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/str/LPadStrFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/str/RPadFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/functions/str/RPadStrFunctionFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/table/PageFrameRecordCursorImplFactoryTest.java
#	core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java
@puzpuzpuz puzpuzpuz changed the title from "feat(sql): queryable parquet partitions" to "chore(sql): queryable parquet partitions" Dec 5, 2024
@glasstiger

[PR Coverage check]

😍 pass : 4200 / 4669 (89.96%)

file detail

path, covered lines, new lines, coverage
🔵 io/questdb/cairo/wal/seq/MetadataServiceStub.java 0 2 00.00%
🔵 parquet_write/binary.rs 0 4 00.00%
🔵 io/questdb/cairo/wal/seq/SequencerMetadataService.java 0 1 00.00%
🔵 parquet_write/string.rs 0 4 00.00%
🔵 io/questdb/griffin/engine/functions/long256/LongsToLong256FunctionFactory.java 0 2 00.00%
🔵 io/questdb/network/HeartBeatException.java 0 1 00.00%
🔵 io/questdb/cairo/DebugUtils.java 0 6 00.00%
🔵 io/questdb/std/str/FileNameExtractorUtf8Sequence.java 0 1 00.00%
🔵 io/questdb/cairo/vm/NullMapWriter.java 0 4 00.00%
🔵 parquet_write/primitive.rs 0 10 00.00%
🔵 parquet_read/slicer/dict_decoder.rs 0 16 00.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctStringGroupByFunction.java 1 2 50.00%
🔵 io/questdb/log/NullLogRecord.java 1 2 50.00%
🔵 io/questdb/griffin/engine/ops/AlterOperation.java 7 11 63.64%
🔵 parquet_write/varchar.rs 21 32 65.62%
🔵 parquet_write/jni.rs 61 89 68.54%
🔵 parquet/error.rs 92 130 70.77%
🔵 parquet_read/slicer/mod.rs 24 33 72.73%
🔵 parquet_write/update.rs 13 18 72.22%
🔵 io/questdb/cutlass/http/WaitProcessor.java 14 19 73.68%
🔵 io/questdb/cairo/TableUtils.java 7 9 77.78%
🔵 parquet_read/column_sink/fixed.rs 56 71 78.87%
🔵 parquet_read/column_sink/var.rs 47 60 78.33%
🔵 cairo.rs 57 70 81.43%
🔵 io/questdb/cairo/O3PartitionJob.java 145 175 82.86%
🔵 io/questdb/network/IOContext.java 5 6 83.33%
🔵 io/questdb/cairo/TableWriter.java 369 436 84.63%
🔵 parquet_read/slicer/rle.rs 134 154 87.01%
🔵 io/questdb/cairo/ParquetTimestampFinder.java 72 81 88.89%
🔵 parquet_read/decode.rs 703 780 90.13%
🔵 parquet/qdb_metadata.rs 99 110 90.00%
🔵 io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java 53 58 91.38%
🔵 io/questdb/griffin/engine/functions/table/ReadParquetFunctionFactory.java 21 23 91.30%
🔵 parquet_write/symbol.rs 76 83 91.57%
🔵 io/questdb/cairo/sql/PageFrameMemoryPool.java 153 164 93.29%
🔵 io/questdb/griffin/engine/table/BwdTableReaderPageFrameCursor.java 64 68 94.12%
🔵 io/questdb/cairo/TableReader.java 172 180 95.56%
🔵 parquet_read/meta.rs 95 100 95.00%
🔵 parquet/col_type.rs 89 93 95.70%
🔵 io/questdb/std/Unsafe.java 55 57 96.49%
🔵 parquet_read/jni.rs 192 200 96.00%
🔵 allocator.rs 425 427 99.53%
🔵 io/questdb/griffin/engine/table/SymbolIndexFilteredRowCursor.java 1 1 100.00%
🔵 io/questdb/cairo/IndexBuilder.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/parquet/RowGroupBuffers.java 32 32 100.00%
🔵 io/questdb/griffin/engine/table/LatestByValueListRecordCursor.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/parquet/PartitionDecoder.java 39 39 100.00%
🔵 io/questdb/cairo/RecoverVarIndex.java 1 1 100.00%
🔵 io/questdb/cairo/wal/WalTxnDetails.java 1 1 100.00%
🔵 io/questdb/cairo/TxWriter.java 7 7 100.00%
🔵 io/questdb/cairo/FullBwdPartitionFrameCursorFactory.java 1 1 100.00%
🔵 io/questdb/mp/MCSequence.java 7 7 100.00%
🔵 io/questdb/griffin/engine/table/LatestByAllIndexedRecordCursor.java 9 9 100.00%
🔵 io/questdb/cairo/IntervalBwdPartitionFrameCursor.java 17 17 100.00%
🔵 io/questdb/MessageBusImpl.java 4 4 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilteredNegativeLimitRecordCursor.java 4 4 100.00%
🔵 io/questdb/griffin/engine/ops/AlterOperationBuilder.java 5 5 100.00%
🔵 io/questdb/PropServerConfiguration.java 7 7 100.00%
🔵 io/questdb/griffin/engine/table/parquet/OwnedMemoryPartitionDescriptor.java 14 14 100.00%
🔵 io/questdb/cairo/sql/async/PageFrameSequence.java 5 5 100.00%
🔵 io/questdb/cutlass/http/HttpRequestProcessor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilteredRecordCursor.java 5 5 100.00%
🔵 io/questdb/griffin/engine/table/LatestBySubQueryRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/GroupByNotKeyedVectorRecordCursorFactory.java 2 2 100.00%
🔵 io/questdb/jit/CompiledFilterIRSerializer.java 3 3 100.00%
🔵 io/questdb/mp/SCSequence.java 11 11 100.00%
🔵 io/questdb/std/json/SimdJsonResult.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/FwdTableReaderPageFrameCursor.java 69 69 100.00%
🔵 io/questdb/griffin/engine/functions/eq/EqTimestampCursorFunctionFactory.java 1 1 100.00%
🔵 io/questdb/cairo/mig/Mig620.java 1 1 100.00%
🔵 io/questdb/griffin/engine/ops/CreateTableOperationBuilderImpl.java 6 6 100.00%
🔵 io/questdb/std/datetime/DateLocale.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/LatestByValuesIndexedFilteredRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cairo/sql/PageFrameMemoryRecord.java 11 11 100.00%
🔵 parquet_write/util.rs 1 1 100.00%
🔵 io/questdb/cutlass/line/udp/DefaultLineUdpReceiverConfiguration.java 2 2 100.00%
🔵 io/questdb/cairo/IntervalFwdPartitionFrameCursor.java 17 17 100.00%
🔵 io/questdb/griffin/engine/functions/bool/InTimestampTimestampFunctionFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/functions/bool/InDoubleFunctionFactory.java 2 2 100.00%
🔵 io/questdb/cairo/sql/async/PageFrameReduceJob.java 1 1 100.00%
🔵 io/questdb/cairo/SymbolColumnIndexer.java 3 3 100.00%
🔵 io/questdb/griffin/engine/table/parquet/PartitionDescriptor.java 30 30 100.00%
🔵 io/questdb/cairo/NativeTimestampFinder.java 14 14 100.00%
🔵 io/questdb/griffin/PurgingOperator.java 2 2 100.00%
🔵 io/questdb/cairo/TxReader.java 7 7 100.00%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/LatestByDeferredListValuesFilteredRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cairo/O3Basket.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/parquet/PartitionUpdater.java 13 13 100.00%
🔵 io/questdb/cairo/O3OpenColumnJob.java 2 2 100.00%
🔵 io/questdb/cairo/ColumnVersionReader.java 2 2 100.00%
🔵 io/questdb/cairo/CairoEngine.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/SelectedRecordCursorFactory.java 8 8 100.00%
🔵 io/questdb/griffin/engine/table/AbstractDeferredValueRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cairo/mig/Mig607.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/LatestByValueIndexedFilteredRecordCursorFactory.java 2 2 100.00%
🔵 io/questdb/tasks/LatestByTask.java 6 6 100.00%
🔵 io/questdb/std/Rnd.java 4 4 100.00%
🔵 io/questdb/griffin/engine/table/TimeFrameRecordCursorImpl.java 6 6 100.00%
🔵 io/questdb/griffin/engine/table/ShowPartitionsRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/FillRangeRecordCursorFactory.java 3 3 100.00%
🔵 parquet_write/file.rs 6 6 100.00%
🔵 io/questdb/std/IOURingImpl.java 8 8 100.00%
🔵 io/questdb/cairo/SymbolMapWriter.java 2 2 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFirstLastRecordCursorFactory.java 6 6 100.00%
🔵 io/questdb/cairo/O3PartitionPurgeJob.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/parquet/MappedMemoryPartitionDescriptor.java 6 6 100.00%
🔵 io/questdb/std/datetime/AbstractTimeZoneRules.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/parquet/RowGroupStatBuffers.java 38 38 100.00%
🔵 io/questdb/PropertyKey.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/VectorAggregateEntry.java 1 1 100.00%
🔵 io/questdb/griffin/UpdateOperatorImpl.java 1 1 100.00%
🔵 io/questdb/cairo/sql/async/PageFrameReduceTask.java 12 12 100.00%
🔵 io/questdb/std/Vect.java 1 1 100.00%
🔵 parquet_write/mod.rs 5 5 100.00%
🔵 io/questdb/cairo/IntervalBwdPartitionFrameCursorFactory.java 1 1 100.00%
🔵 io/questdb/std/MemoryTag.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/table/ReadParquetRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/table/AsyncJitFilteredRecordCursorFactory.java 3 3 100.00%
🔵 io/questdb/griffin/engine/table/parquet/PartitionEncoder.java 1 1 100.00%
🔵 io/questdb/cairo/AbstractIndexReader.java 18 18 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/GroupByRecordCursorFactory.java 3 3 100.00%
🔵 io/questdb/std/datetime/microtime/TimestampFormatFactory.java 1 1 100.00%
🔵 io/questdb/std/LongList.java 4 4 100.00%
🔵 io/questdb/griffin/engine/table/LatestByAllIndexedRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cairo/BinaryAlterSerializer.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/PageFrameRecordCursorFactory.java 3 3 100.00%
🔵 io/questdb/griffin/engine/groupby/StableAwareStringHolder.java 2 2 100.00%
🔵 io/questdb/griffin/SqlCompilerImpl.java 13 13 100.00%
🔵 io/questdb/network/NetworkFacadeImpl.java 5 5 100.00%
🔵 io/questdb/std/LongLongHashSet.java 4 4 100.00%
🔵 io/questdb/griffin/ConvertOperatorImpl.java 1 1 100.00%
🔵 io/questdb/cairo/FullFwdPartitionFrameCursorFactory.java 2 2 100.00%
🔵 io/questdb/cairo/sql/TableReferenceOutOfDateException.java 5 5 100.00%
🔵 io/questdb/cairo/sql/PageFrameAddressCache.java 34 34 100.00%
🔵 io/questdb/cairo/FullBwdPartitionFrameCursor.java 16 16 100.00%
🔵 io/questdb/cairo/CairoConfigurationWrapper.java 2 2 100.00%
🔵 io/questdb/cairo/mig/Mig506.java 1 1 100.00%
🔵 io/questdb/cairo/ColumnVersionWriter.java 8 8 100.00%
🔵 io/questdb/cairo/wal/seq/TableTransactionLog.java 27 27 100.00%
🔵 io/questdb/cairo/AbstractFullPartitionFrameCursor.java 3 3 100.00%
🔵 io/questdb/griffin/engine/table/AbstractPageFrameRecordCursor.java 5 5 100.00%
🔵 io/questdb/cutlass/pgwire/PGConnectionContext.java 1 1 100.00%
🔵 io/questdb/griffin/DropIndexOperator.java 14 14 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilteredRecordCursorFactory.java 3 3 100.00%
🔵 io/questdb/griffin/engine/functions/bool/InLongFunctionFactory.java 2 2 100.00%
🔵 io/questdb/cairo/FullFwdPartitionFrameCursor.java 17 17 100.00%
🔵 io/questdb/griffin/SqlKeywords.java 7 7 100.00%
🔵 io/questdb/cairo/ColumnPurgeOperator.java 2 2 100.00%
🔵 io/questdb/griffin/engine/table/LatestByValueFilteredRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cutlass/text/CopyTask.java 1 1 100.00%
🔵 parquet_write/schema.rs 37 37 100.00%
🔵 io/questdb/cairo/AbstractIntervalPartitionFrameCursor.java 22 22 100.00%
🔵 parquet_read/mod.rs 75 75 100.00%
🔵 io/questdb/cairo/IntervalFwdPartitionFrameCursorFactory.java 1 1 100.00%
🔵 io/questdb/metrics/WorkerMetrics.java 2 2 100.00%
🔵 io/questdb/std/DoubleList.java 4 4 100.00%
🔵 io/questdb/cairo/CairoException.java 7 7 100.00%
🔵 parquet/io.rs 10 10 100.00%
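
For reference, each percentage in the report is consistent with covered lines divided by new lines. A small sketch of that arithmetic, using figures from the report above (`coverage_percent` is a hypothetical helper, not part of the coverage tool):

```python
def coverage_percent(covered, new_lines):
    """Return covered / new_lines as a percentage."""
    if new_lines <= 0:
        raise ValueError("new_lines must be positive")
    return 100.0 * covered / new_lines


# Overall figure from the report: 4200 covered out of 4669 new lines.
print(f"{coverage_percent(4200, 4669):.2f}%")

# A single row: io/questdb/cairo/TableWriter.java, 369 covered out of 436.
print(f"{coverage_percent(369, 436):.2f}%")
```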

@bluestreak01 bluestreak01 merged commit 2ced117 into master Dec 5, 2024
@bluestreak01 bluestreak01 deleted the puzpuzpuz_query_parquet branch December 5, 2024 13:27
@puzpuzpuz puzpuzpuz mentioned this pull request Feb 6, 2025
75 tasks

Labels

SQL Issues or changes relating to SQL execution

Projects

None yet

7 participants