refine: refine writer interface#741
Conversation
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @ZENOTME for this pr, left some question about this change.
| type R: FileWriter<O>; | ||
| /// Build file writer. | ||
| fn build(self) -> impl Future<Output = Result<Self::R>> + Send; | ||
| fn build(self, schema: SchemaRef) -> impl Future<Output = Result<Self::R>> + Send; |
There was a problem hiding this comment.
I don't understand, why this can't be passed as constructor parameter of the FileWriterBuilder?
There was a problem hiding this comment.
Let's use original process of creating equality delete writer as example:
let equality_ids = vec![0_i32, 8];
let equality_config =
EqualityDeleteWriterConfig::new(equality_ids, Arc::new(schema), None).unwrap();
let delete_schema =
arrow_schema_to_schema(equality_config.projected_arrow_schema_ref()).unwrap();
let projector = equality_config.projector.clone();
// prepare writer
let pb = ParquetWriterBuilder::new(
WriterProperties::builder().build(),
Arc::new(delete_schema),
file_io.clone(),
location_gen,
file_name_gen,
);
let mut equality_delete_writer = EqualityDeleteFileWriterBuilder::new(pb)
.build(equality_config)
.await?;We need to get the projected schema from EqualityDeleteWriterConfig. Otherwise, the user have to project the schema by themselves. After remove writer config, we need to other thing help us to project the schema and let user get from it.
One pattern I realized for now is that the schema of file writer is always rely on the iceberg writer which use it. So I think move it to the build function can make thing looks more simple.🤔
There was a problem hiding this comment.
Sorry, but I'm not convinced that this simplified things a lot. I think the construction of writers will not be called directly by user, since they are low level apis. Schema is only part of the arguments.
There was a problem hiding this comment.
I'm a beginner rustacean, but you could still pass the equality_config to the constructor, right?
And I agree with @liurenjie1024, I think in general we should think more about the high-level API. For example, constructing the writers by yourself can be pretty error-prone. Much rather I would see a Table object returning a fully configured writer based on the current-schema.
There was a problem hiding this comment.
Make sense. Let's remove the change.
crates/iceberg/src/writer/mod.rs
Outdated
| /// Get the current file written size. | ||
| fn current_written_size(&self) -> usize; | ||
| /// Get the schema of the current file. | ||
| fn current_schema(&self) -> SchemaRef; |
There was a problem hiding this comment.
When this will be useful?
There was a problem hiding this comment.
This function will be useful in delta writer. But it's fine to remove for now, let's add it until we find it's required.
738a4a2 to
db06454
Compare
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @ZENOTME for this pr, LGTM!
crates/iceberg/src/writer/mod.rs
Outdated
| /// The current file status of iceberg writer. It implement for the writer which write a single | ||
| /// file. | ||
| pub trait CurrentFileStatus { | ||
| pub trait CurrentWriterStatus { |
There was a problem hiding this comment.
It would be better to split to another pr.
|
Let's wait for a while to see other reviewer's point of view. |
db06454 to
d0a2897
Compare
c-thiel
left a comment
There was a problem hiding this comment.
LGTM, thanks @ZENOTME.
I think we should derive Debug for most structs in public interfaces, I just commented the first, but there are a few touched by this PR. At least for EqualityDeleteWriterConfig it would be helpful. For EqualityDeleteWriterConfig, or configs in general, PartialEq would also be good.
| @@ -29,38 +29,27 @@ use crate::Result; | |||
| #[derive(Clone)] | |||
| pub struct DataFileWriterBuilder<B: FileWriterBuilder> { | |||
There was a problem hiding this comment.
Should we maybe also derive Debug?
Hi, I'm working on writer recently and find some weaknesses of our original writer interface design:
And I realized that this custom config param cause we can't combine the write flexibility. E.g. In partition writer
So avoid this problem, we should pass the custom param when create the builder and the build interface should looks like
fn build() -> SelfIn our original design, user should pass the schema to file writer builder when create them like
However, sometimes the schema is hard to determine when we create them. E.g. equality delete writer, we only know what the schema looks like util we pass the equality id and create the equality delete writer. To avoid the problem, we change
fn build() -> Selfof FileWriterBuilder tofn build(schema:SchemaRef) -> Self. By this way, the schema of FileWriter is determined by base writer.I send this discussion as a PR to make it easier to express my idea. BTW, I applied this change and complete partition writer, delta writer in https://github.com/ZENOTME/iceberg-rust/tree/partition_writer and it looks well.
Feel free for any suggestion. cc @liurenjie1024 @Xuanwo @Fokko @c-thiel