fix: respect DataFrameWriteOptions::with_single_file_output for paths without extensions#19931
Conversation
… without extensions
| insert_op, | ||
| keep_partition_by_columns: conf.keep_partition_by_columns, | ||
| file_extension: conf.file_extension.clone(), | ||
| // For deserialized plans, use extension heuristic (backward compatible) |
There was a problem hiding this comment.
For how long this backward compatible approach will be used ? At some point it should start using the new config property.
IMO the proper way would be to add a new optional field to the Protobuf format, then try to read it and use None only if it is missing, but if it is not missing then use its value.
There was a problem hiding this comment.
Thanks for the feedback. You were right, I made some changes based on the input. Let me know if this looks better.
alamb
left a comment
There was a problem hiding this comment.
Thank you @kumarUjjawal -- this looks good to me
I had a few minor suggestions -- let me know what you think
| // Create a temporary file location for the encrypted parquet file | ||
| let tmp_source = TempDir::new()?; | ||
| let tempfile = tmp_source.path().join("cars_encrypted"); | ||
| let tempfile = tmp_source.path().join("cars_encrypted.parquet"); |
There was a problem hiding this comment.
does it need to be called .parquet even though the dataframe explicitly says write_parquet? Or is this just to clean up the code?
There was a problem hiding this comment.
If there is no file extension and there is no configuration (Some(true)) then the heuristics decide that this is a folder and creates a Parquet partition file.
If there is no config and the path has file extension then it writes all the content in this file (no partitions)
There was a problem hiding this comment.
Actually why is this change needed for the example? Is it fixing some behaviour made by this change or just clarifying the example?
There was a problem hiding this comment.
Before this PR, with_single_file_output(true) was silently ignored for paths without extensions it would create a directory cars_encrypted/ with parquet files inside. The read_parquet() call then found these files in the directory. After this PR, with_single_file_output(true) is correctly respected, creating a single file at the exact path cars_encrypted . However, read_parquet() with default options expects a .parquet extension and fails validation. Adding .parquet to the filename fixes this and makes the example explicit about the expected file format.
datafusion/core/src/dataframe/mod.rs
Outdated
| }; | ||
|
|
||
| // Build copy options, including single_file_output if explicitly set | ||
| let mut copy_options: HashMap<String, String> = HashMap::new(); |
There was a problem hiding this comment.
since this seems to be a common piece of functionality, what do you think about creating a common function that takes a DataFrameWriteOptions and returns a CopyOptions?
There was a problem hiding this comment.
Thank you. I will take a lool at that and see how it shapes.
|
Thank you @martin-g , as always, for the review |
|
I took the liberty of adding a small note to the upgrade guide too |
| /// Override for single file output behavior. | ||
| /// - `None`: use extension heuristic (path with extension = single file) | ||
| /// - `Some(true)`: force single file output at exact path | ||
| /// - `Some(false)`: force directory output with generated filenames | ||
| pub single_file_output: Option<bool>, |
There was a problem hiding this comment.
I think it would be better to make this an enum, something like
enum FileOutputMode {
Automatic,
SingleFile,
Directory,
}It makes it clearer at a glance vs needing to check the comments to know what None, Some(true) and Some(false) mean
Jefffrey
left a comment
There was a problem hiding this comment.
Could you also ensure the PR body is up to date with latest changes? Currently still refers to Option<bool>
| // Create a temporary file location for the encrypted parquet file | ||
| let tmp_source = TempDir::new()?; | ||
| let tempfile = tmp_source.path().join("cars_encrypted"); | ||
| let tempfile = tmp_source.path().join("cars_encrypted.parquet"); |
There was a problem hiding this comment.
Actually why is this change needed for the example? Is it fixing some behaviour made by this change or just clarifying the example?
datafusion/core/src/dataframe/mod.rs
Outdated
| self | ||
| } | ||
|
|
||
| fn copy_to_options(&self) -> HashMap<String, String> { |
There was a problem hiding this comment.
This is a bit confusingly named for what it does
| /// This verifies the fix for the regression where extension heuristics | ||
| /// ignored the explicit with_single_file_output(true) setting. | ||
| #[tokio::test] | ||
| async fn test_single_file_output_without_extension() -> Result<()> { |
There was a problem hiding this comment.
Do we have tests for all the modes? automatic, single, directory
| }; | ||
|
|
||
| // Parse single_file_output option if explicitly set | ||
| let file_output_mode = match source_option_tuples |
There was a problem hiding this comment.
I wonder if it would be better to push this enum into single_file_output itself as well instead of having this parse from an Option<bool> logic 🤔
There was a problem hiding this comment.
I kept with_single_file_output(bool) in DataFrameWriteOptions for backward compatibility, it's the existing public API. Internally, I changed the field to Option so we can distinguish "not set" from "explicitly set to false". The conversion to FileOutputMode happens in the physical planner. Adding with_file_output_mode(FileOutputMode) would expand the API surface; are you okay with that?
| // Filter out sink-related options that are not format options | ||
| let format_options: HashMap<String, String> = source_option_tuples |
There was a problem hiding this comment.
What's the need for filtering the option here?
There was a problem hiding this comment.
The single_file_output option is a sink-level configuration that controls file output behavior, not a format-specific option . The file_type_to_format().create(session_state, &format_options) call validates options against the format's config schema and would fail with "unrecognized option" if we pass single_file_output. So we extract it first for FileSinkConfig, then filter it out before passing the remaining options to the format factory.
| // When not set, uses extension heuristic (path with extension = single file). | ||
| // When set to true, forces single file output at exact path. | ||
| // When set to false, forces directory output with generated filenames. | ||
| optional bool single_file_output = 12; |
There was a problem hiding this comment.
Might be better to represent this as an actual enum to more align with the source 🤔
| insert_op: InsertOp::Overwrite, | ||
| keep_partition_by_columns: true, | ||
| file_extension: "json".into(), | ||
| file_output_mode: FileOutputMode::Automatic, |
There was a problem hiding this comment.
Can we use different modes for each of the tests for coverage
|
|
||
| ### `FileSinkConfig` adds `single_file_output` | ||
|
|
||
| `FileSinkConfig` now includes a `single_file_output: Option<bool>` field to override the |
| *to the main branch and are awaiting release in this version. See [#19692] for | ||
| \*more details. | ||
|
|
||
| [#19692]: https://github.com/apache/datafusion/issues/19692 |
|
Thanks again @kumarUjjawal and @Jefffrey and @martin-g |
Which issue does this PR close?
DataFrameWriteOptions::with_single_file_outputproduces a directory #13323.Rationale for this change
When using
DataFrameWriteOptions::with_single_file_output(true), the setting was being ignored if the output path didn't have a file extension. For example:Would create a directory /path/to/output/ with files inside instead of a single file at /path/to/output.
This happened because the demuxer used a heuristic based solely on file extension, ignoring the explicit user setting.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Breaking for direct FileSinkConfig construction: The struct now requires file_output_mode: FileOutputMode field. Use FileOutputMode::Automatic to preserve existing behavior.