Skip to content

Streaming queries model with cursors#63312

Open
Michicosun wants to merge 236 commits intoClickHouse:masterfrom
Michicosun:issues/42990/cursors
Open

Streaming queries model with cursors#63312
Michicosun wants to merge 236 commits intoClickHouse:masterfrom
Michicosun:issues/42990/cursors

Conversation

@Michicosun
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This pull request is a combination of

with implementation of cursors on top of that.

Implemented:

  • Streaming queries model with cursors
  • MergeTree (queue_mode=1)
    • Strict FIFO order of Stream consumption (order of block numbers) inside each partition
    • Fast Cursor Lookup (finding offset inside table using PK)
    • Parallel consumption of different partitions
  • ReplicatedMergeTree (queue_mode=1) same as for regular MergeTree but
    • Strict FIFO order across all data from all replicas
    • Exactly-Once in insert-select queries with specified cursor (cursor update will be commited in the same transaction as part)
  • Distributed
    • Supports Streaming if dst table supports
    • Wrap/Unwrap cursors for shard, for example this query: SELECT * FROM dist STREAM CURSOR {'0.{dst_cursor}'} will be rewritten to query: SELECT * FROM local STREAM CURSOR dst_cursor
  • Other tables
    • Streaming via default subscription: in-memory queue of blocks.
  • Persistent Cursors via saving in Keeper (implementation differs with type of query and storage type if insert-select query)

extension for sql:

SELECT ... FROM ... STREAM [TAIL | (CURSOR 'keeper-key' {cursor-map})]

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)
Modify your CI run

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • All with Azure
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4
Details

@Avogar
Copy link
Copy Markdown
Member

Avogar commented Jun 17, 2024

Test 03036_dynamic_read_subcolumns was failing in master. It's already fixed so I updated the branch with recent changes to avoid its failing in the CI

@Jhors2
Copy link
Copy Markdown

Jhors2 commented Nov 14, 2024

I'm quite interested in this implementation, is this still being driven in any way?

@alexey-milovidov alexey-milovidov mentioned this pull request Dec 31, 2024
76 tasks
@zhanglistar
Copy link
Copy Markdown
Contributor

zhanglistar commented Mar 27, 2025

@Michicosun Really great work! Will you implement dataflow model https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101, like time window, watermark, triggers, etc. ?

@jovezhong
Copy link
Copy Markdown

Hi @zhanglistar , if you are interested in watermark, out of order processing, etc, you may check our stream processing engine built with ClickHouse codebase https://github.com/timeplus-io/proton We are covered in another O'Reilly book https://www.oreilly.com/library/view/streaming-databases/9781098154820/ It's not exactly dataflow but can do many things similar to Apache Flink, but in C++ speed and low footprint. Happy to work with ClickHouse community to work on stream processing together.

@zhanglistar
Copy link
Copy Markdown
Contributor

Hi @zhanglistar , if you are interested in watermark, out of order processing, etc, you may check our stream processing engine built with ClickHouse codebase https://github.com/timeplus-io/proton We are covered in another O'Reilly book https://www.oreilly.com/library/view/streaming-databases/9781098154820/ It's not exactly dataflow but can do many things similar to Apache Flink, but in C++ speed and low footprint. Happy to work with ClickHouse community to work on stream processing together.

@jovezhong Thanks for your reply. I have seen proton, great job! And there is another streamming processing framework like arroyo, risingwave writen in Rust. I am studying these things, and also, we may impletement native flink task like https://developer.aliyun.com/article/1634363 in gluten project https://github.com/apache/incubator-gluten, and we already have a POC, but there still lots of work to do, like stateful functions.

@EmeraldShift
Copy link
Copy Markdown
Contributor

Could this hypothetically interact with or build on top of #79417 and #79471? Those PRs combined seem to support a stable low-level notion of a cursor (_part_starting_offset + _part_offset) and a stable table snapshot to refer to across multiple (sub-)queries, though maybe this can be extended to work across requests for more rows from a stream.

Just wondering, because I have no real knowledge here but the above PRs seem to be useful for this feature.

@jdarpinian
Copy link
Copy Markdown

Is this still planned for 2025?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants