
Streaming support for DataFusion #1544

@hntd187

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Consume a streaming source from Kafka (to start). DataFusion already has a batch-oriented processing framework that works pretty well. I want to extend this to also consume from streaming sources such as a Kafka topic.

Describe the solution you'd like
So I think it makes sense to start with what Spark Streaming did for its streaming implementation, which is the idea of micro-batching.

(Diagram: micro-batch window over a Kafka topic)
Generally, the idea pictured above is that DataFusion will listen to a topic for some period of time (defined at start-up), then execute operations on that collected batch window of events. In the case of Kafka, these are normally either JSON or Avro, both of which already have encoders in DataFusion.
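To make the micro-batching idea concrete, here is a minimal sketch of one window of the loop. Everything here is hypothetical: `poll` is a stub standing in for a real Kafka consumer, and the `Event` struct stands in for a deserialized JSON/Avro record; a real implementation would hand the collected batch to DataFusion as a `RecordBatch`.

```rust
use std::time::{Duration, Instant};

// Hypothetical event type; a real source would deserialize JSON/Avro here.
#[derive(Debug)]
struct Event {
    payload: String,
}

// Stub standing in for a Kafka consumer poll call.
fn poll(i: usize) -> Event {
    Event { payload: format!("msg-{i}") }
}

// Collect events until the window elapses, then return the batch.
fn run_one_window(window: Duration) -> Vec<Event> {
    let start = Instant::now();
    let mut batch = Vec::new();
    let mut i = 0;
    while start.elapsed() < window {
        batch.push(poll(i));
        i += 1;
        if i >= 3 {
            break; // bounded so the example terminates quickly
        }
    }
    batch
}

fn main() {
    let batch = run_one_window(Duration::from_millis(10));
    // In DataFusion this batch would become a RecordBatch fed to an ExecutionPlan.
    println!("processed {} events", batch.len());
}
```

The key design point is that everything downstream of the window boundary is just the existing batch engine; the streaming layer only decides when a window closes.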

I spent some time looking at the types in the data source module and came to the conclusion that it would probably be possible to implement this on top of the current API, but frankly, it would suck. The source traits all have a naming convention centered around tables and files, and a Kafka topic is technically neither. Basically, an implementation here would be highly confusing to anyone trying to understand why things are named what they are.

I propose we add a set of additional traits specifically for streaming sources. These would map to an execution plan like the other data sources, but should have the ability to manage stream information such as committing offsets, checkpointing, and watermarking. Those are probably secondary things to come a bit after a "get the thing to work" implementation, but I wanted to put it out there that these traits would initially look rather bare and not differ much from the other data sources. They would, though, quickly diverge from those contracts into ones that support managing these stream operations.
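To sketch what such a contract might look like, here is a toy version of the proposed trait with the stream-management hooks mentioned above. Nothing like this exists in DataFusion today; the trait name, method names, and the `MockTopic` implementation are all hypothetical, purely to show the shape of the API.

```rust
use std::collections::HashMap;

/// Hypothetical streaming-source contract. A real version would map to an
/// ExecutionPlan like the other data sources, plus these stream hooks.
trait StreamingSource {
    /// Pull the next micro-batch of messages collected during the window.
    fn next_batch(&mut self) -> Option<Vec<String>>;
    /// Commit consumed offsets back to the broker (e.g. Kafka).
    fn commit_offsets(&mut self);
    /// Persist enough state (partition -> offset) to resume after failure.
    fn checkpoint(&self) -> HashMap<i32, u64>;
    /// Current event-time watermark, if the source tracks one.
    fn watermark(&self) -> Option<u64>;
}

// Toy in-memory "topic" showing how an implementor would satisfy the trait.
struct MockTopic {
    messages: Vec<String>,
    offset: u64,
}

impl StreamingSource for MockTopic {
    fn next_batch(&mut self) -> Option<Vec<String>> {
        if self.offset as usize >= self.messages.len() {
            return None;
        }
        let batch = self.messages[self.offset as usize..].to_vec();
        self.offset = self.messages.len() as u64;
        Some(batch)
    }
    fn commit_offsets(&mut self) {
        // A Kafka-backed source would commit `self.offset` to the broker here.
    }
    fn checkpoint(&self) -> HashMap<i32, u64> {
        HashMap::from([(0, self.offset)]) // single partition 0
    }
    fn watermark(&self) -> Option<u64> {
        None // no event-time tracking in this toy
    }
}

fn main() {
    let mut topic = MockTopic { messages: vec!["a".into(), "b".into()], offset: 0 };
    let batch = topic.next_batch().unwrap();
    topic.commit_offsets();
    println!("got {} messages, checkpoint = {:?}", batch.len(), topic.checkpoint());
}
```

The point is that the bare version really is just "give me a batch"; the offset/checkpoint/watermark methods are where the contract would diverge from the table- and file-oriented traits over time.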

What I am not sure about is that while these types should likely live in DataFusion, the actual implementation probably should not. It could at least start as a contrib module and maybe be promoted into the main repo eventually if that makes sense.

Does this make sense? I can start a draft PR soon to get the ball rolling and turn discussion into actual code.

TODO:

  • Basic hookup (I can run a df.show())
  • Start to develop API contract
  • Delivery Guarantees
  • Watermarking
  • Research on map_partitions and SortMergeJoin

Describe alternatives you've considered
I don't know, I could play more video games or something, but I'd rather do this. Joking aside, Flock has streaming capabilities, but it's mostly built around cloud offerings rather than a long-running process like a Spark streaming job.

Additional context
@alamb any additional ideas? How do you wanna get started?

Metadata

Labels: enhancement (New feature or request)