
Streaming support for DataFusion #1544

@hntd187

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Consume a streaming source from Kafka (to start). DataFusion already has a batch-oriented processing framework that works pretty well. I want to extend this to also consume from streaming sources such as a Kafka topic.

Describe the solution you'd like
So I think it makes sense to start with what Spark Streaming did for its streaming implementation, which is the idea of micro-batching.

(Diagram: micro-batch window over a Kafka topic)
Generally, the idea pictured above is that DataFusion will listen to a topic for some period of time (defined at start-up), then execute operations on that collected batch window of events. In the case of Kafka, these are normally either JSON or Avro, both of which already have encoders in DataFusion.
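To make the micro-batching idea concrete, here is a minimal sketch of one window of the loop. Everything here is hypothetical: `poll` is a stub standing in for a real Kafka consumer, and the `Event` struct stands in for a deserialized JSON/Avro record; a real implementation would hand the collected batch to DataFusion as a `RecordBatch`.

```rust
use std::time::{Duration, Instant};

// Hypothetical event type; a real source would deserialize JSON/Avro here.
#[derive(Debug)]
struct Event {
    payload: String,
}

// Stub standing in for a Kafka consumer poll call.
fn poll(i: usize) -> Event {
    Event { payload: format!("msg-{i}") }
}

// Collect events until the window elapses, then return the batch.
fn run_one_window(window: Duration) -> Vec<Event> {
    let start = Instant::now();
    let mut batch = Vec::new();
    let mut i = 0;
    while start.elapsed() < window {
        batch.push(poll(i));
        i += 1;
        if i >= 3 {
            break; // bounded so the example terminates quickly
        }
    }
    batch
}

fn main() {
    let batch = run_one_window(Duration::from_millis(10));
    // In DataFusion this batch would become a RecordBatch fed to an ExecutionPlan.
    println!("processed {} events", batch.len());
}
```

The key design point is that everything downstream of the window boundary is just the existing batch engine; the streaming layer only decides when a window closes.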

I spent some time looking at the types in the data source module and came to the conclusion that it would probably be possible to implement this on top of the current API, but frankly, it would suck. The source traits all have a naming convention centered around tables and files, and a Kafka topic is technically neither. Basically, an implementation here would be highly confusing to anyone trying to understand why things are named what they are.

I propose we add a set of additional traits specifically for streaming sources. These would map to an execution plan like the other data sources, but should have the ability to manage stream information such as committing offsets, checkpointing, and watermarking. Those are probably secondary things to come a bit after a "get the thing to work" implementation, but I wanted to put it out there that these traits would initially look rather bare and not differ much from the other data sources. They would, though, quickly diverge from those contracts into ones that support managing these stream operations.
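To sketch what such a contract might look like, here is a toy version of the proposed trait with the stream-management hooks mentioned above. Nothing like this exists in DataFusion today; the trait name, method names, and the `MockTopic` implementation are all hypothetical, purely to show the shape of the API.

```rust
use std::collections::HashMap;

/// Hypothetical streaming-source contract. A real version would map to an
/// ExecutionPlan like the other data sources, plus these stream hooks.
trait StreamingSource {
    /// Pull the next micro-batch of messages collected during the window.
    fn next_batch(&mut self) -> Option<Vec<String>>;
    /// Commit consumed offsets back to the broker (e.g. Kafka).
    fn commit_offsets(&mut self);
    /// Persist enough state (partition -> offset) to resume after failure.
    fn checkpoint(&self) -> HashMap<i32, u64>;
    /// Current event-time watermark, if the source tracks one.
    fn watermark(&self) -> Option<u64>;
}

// Toy in-memory "topic" showing how an implementor would satisfy the trait.
struct MockTopic {
    messages: Vec<String>,
    offset: u64,
}

impl StreamingSource for MockTopic {
    fn next_batch(&mut self) -> Option<Vec<String>> {
        if self.offset as usize >= self.messages.len() {
            return None;
        }
        let batch = self.messages[self.offset as usize..].to_vec();
        self.offset = self.messages.len() as u64;
        Some(batch)
    }
    fn commit_offsets(&mut self) {
        // A Kafka-backed source would commit `self.offset` to the broker here.
    }
    fn checkpoint(&self) -> HashMap<i32, u64> {
        HashMap::from([(0, self.offset)]) // single partition 0
    }
    fn watermark(&self) -> Option<u64> {
        None // no event-time tracking in this toy
    }
}

fn main() {
    let mut topic = MockTopic { messages: vec!["a".into(), "b".into()], offset: 0 };
    let batch = topic.next_batch().unwrap();
    topic.commit_offsets();
    println!("got {} messages, checkpoint = {:?}", batch.len(), topic.checkpoint());
}
```

The point is that the bare version really is just "give me a batch"; the offset/checkpoint/watermark methods are where the contract would diverge from the table- and file-oriented traits over time.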

What I am not sure about is that while these types should likely live in DataFusion, the actual implementation probably should not. It could at least start as a contrib module and maybe be promoted into the main repo eventually if that makes sense.

Does this make sense? I can start a draft PR soon to get the ball rolling and turn discussion into actual code.

TODO:

  • Basic hookup (I can run a df.show())
  • Start to develop API contract
  • Delivery Guarantees
  • Watermarking
  • Research on map_partitions and SortMergeJoin

Describe alternatives you've considered
I don't know, I could play more video games or something, but I'd rather do this. Joking aside, Flock has streaming capabilities, but it's mostly built around cloud offerings rather than a long-running process like a Spark streaming job.

Additional context
@alamb any additional ideas? How do you wanna get started?

Metadata

Labels: enhancement (New feature or request)