fix: respect DataFrameWriteOptions::with_single_file_output for paths without extensions #19931

Merged
alamb merged 10 commits into apache:main from kumarUjjawal:fix/regression_datagrame_single_file
Jan 28, 2026
Conversation

@kumarUjjawal
Contributor

@kumarUjjawal kumarUjjawal commented Jan 21, 2026

Which issue does this PR close?

Rationale for this change

When using DataFrameWriteOptions::with_single_file_output(true), the setting was being ignored if the output path didn't have a file extension. For example:

df.write_parquet("/path/to/output", 
    DataFrameWriteOptions::new().with_single_file_output(true), 
    None).await?;

This would create a directory /path/to/output/ with files inside, instead of a single file at /path/to/output.

This happened because the demuxer used a heuristic based solely on file extension, ignoring the explicit user setting.

What changes are included in this PR?

  • New FileOutputMode enum: Uses explicit modes (Automatic, SingleFile, Directory) in FileSinkConfig for clearer output path handling.
  • The demuxer now uses the user's explicit setting instead of always relying on extension-based heuristics.
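The decision the new enum replaces can be sketched as follows. The `writes_single_file` helper and the exact demuxer wiring are hypothetical (the real DataFusion types differ in detail), but the branching mirrors the modes described above:

```rust
use std::path::Path;

// Hypothetical names modeled on the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileOutputMode {
    /// Fall back to the extension heuristic.
    Automatic,
    /// Always write a single file at the exact path.
    SingleFile,
    /// Always write a directory of generated files.
    Directory,
}

/// Decide whether the sink should produce one file, consulting the
/// explicit mode before the extension heuristic.
fn writes_single_file(mode: FileOutputMode, path: &str) -> bool {
    match mode {
        FileOutputMode::SingleFile => true,
        FileOutputMode::Directory => false,
        // Only Automatic falls back to "has an extension => single file".
        FileOutputMode::Automatic => Path::new(path).extension().is_some(),
    }
}

fn main() {
    // The bug: under the old heuristic, an extension-less path became a directory.
    assert!(!writes_single_file(FileOutputMode::Automatic, "/path/to/output"));
    // The fix: an explicit SingleFile wins regardless of the extension.
    assert!(writes_single_file(FileOutputMode::SingleFile, "/path/to/output"));
    // Directory likewise wins even when the path has an extension.
    assert!(!writes_single_file(FileOutputMode::Directory, "/path/to/output.parquet"));
    assert!(writes_single_file(FileOutputMode::Automatic, "/path/to/output.parquet"));
}
```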

Are these changes tested?

  • New unit test test_single_file_output_without_extension covers the fixed behavior
  • All sqllogictests pass

Are there any user-facing changes?

Breaking for direct FileSinkConfig construction: The struct now requires file_output_mode: FileOutputMode field. Use FileOutputMode::Automatic to preserve existing behavior.
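For crates that construct the struct directly, the migration is to pass FileOutputMode::Automatic explicitly. A minimal sketch with hypothetical, stripped-down stand-ins (the real FileSinkConfig has many more fields):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileOutputMode {
    Automatic,
    SingleFile,
    Directory,
}

// Stand-in for the real struct; only the fields relevant here are shown.
struct FileSinkConfig {
    file_extension: String,
    // New required field introduced by this PR.
    file_output_mode: FileOutputMode,
}

fn main() {
    // Existing call sites migrate by passing Automatic, which keeps the
    // old extension-based heuristic.
    let config = FileSinkConfig {
        file_extension: "parquet".to_string(),
        file_output_mode: FileOutputMode::Automatic,
    };
    assert_eq!(config.file_output_mode, FileOutputMode::Automatic);
    assert_eq!(config.file_extension, "parquet");
}
```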

@github-actions github-actions bot added the core (Core DataFusion crate), catalog (Related to the catalog crate), proto (Related to proto crate), and datasource (Changes to the datasource crate) labels Jan 21, 2026
insert_op,
keep_partition_by_columns: conf.keep_partition_by_columns,
file_extension: conf.file_extension.clone(),
// For deserialized plans, use extension heuristic (backward compatible)
Member

How long will this backward-compatible approach be used? At some point it should start using the new config property.

IMO the proper way would be to add a new optional field to the Protobuf format and then try to read it: use None only if it is missing, but if it is present then use its value.
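The suggested fallback can be sketched on the Rust side, where a proto `optional` field surfaces as `Option<T>`. All names here are hypothetical stand-ins for the generated protobuf types:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileOutputMode {
    Automatic,
    SingleFile,
    Directory,
}

/// Stand-in for the generated protobuf struct: an `optional bool`
/// deserializes as `Option<bool>`, so absence is distinguishable.
struct FileSinkConfigProto {
    single_file_output: Option<bool>,
}

fn mode_from_proto(proto: &FileSinkConfigProto) -> FileOutputMode {
    match proto.single_file_output {
        // Field missing: an old plan, keep the backward-compatible heuristic.
        None => FileOutputMode::Automatic,
        Some(true) => FileOutputMode::SingleFile,
        Some(false) => FileOutputMode::Directory,
    }
}

fn main() {
    let old_plan = FileSinkConfigProto { single_file_output: None };
    assert_eq!(mode_from_proto(&old_plan), FileOutputMode::Automatic);
    let new_plan = FileSinkConfigProto { single_file_output: Some(true) };
    assert_eq!(mode_from_proto(&new_plan), FileOutputMode::SingleFile);
}
```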

Contributor Author

Thanks for the feedback. You were right; I made some changes based on your input. Let me know if this looks better.

Member

Lovely!

Contributor

@alamb alamb left a comment

Thank you @kumarUjjawal -- this looks good to me

I had a few minor suggestions -- let me know what you think

// Create a temporary file location for the encrypted parquet file
let tmp_source = TempDir::new()?;
-let tempfile = tmp_source.path().join("cars_encrypted");
+let tempfile = tmp_source.path().join("cars_encrypted.parquet");
Contributor

does it need to be called .parquet even though the dataframe explicitly says write_parquet? Or is this just to clean up the code?

Member

@martin-g martin-g Jan 23, 2026

If there is no file extension and no configuration (i.e. no Some(true)), the heuristic decides that this is a folder and creates a Parquet partition file. If there is no config but the path has a file extension, then all the content is written into that file (no partitions).

Contributor

Actually why is this change needed for the example? Is it fixing some behaviour made by this change or just clarifying the example?

Contributor Author

Before this PR, with_single_file_output(true) was silently ignored for paths without extensions: it would create a directory cars_encrypted/ with parquet files inside, and the read_parquet() call then found those files in the directory. After this PR, with_single_file_output(true) is correctly respected, creating a single file at the exact path cars_encrypted. However, read_parquet() with default options expects a .parquet extension and fails validation. Adding .parquet to the filename fixes this and makes the example explicit about the expected file format.

};

// Build copy options, including single_file_output if explicitly set
let mut copy_options: HashMap<String, String> = HashMap::new();
Contributor

since this seems to be a common piece of functionality, what do you think about creating a common function that takes a DataFrameWriteOptions and returns a CopyOptions?

Contributor Author

Thank you. I will take a look at that and see how it shapes up.

@alamb
Contributor

alamb commented Jan 23, 2026

Thank you @martin-g , as always, for the review

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Jan 23, 2026
@alamb
Contributor

alamb commented Jan 23, 2026

I took the liberty of adding a small note to the upgrade guide too

Comment on lines 115 to 119
/// Override for single file output behavior.
/// - `None`: use extension heuristic (path with extension = single file)
/// - `Some(true)`: force single file output at exact path
/// - `Some(false)`: force directory output with generated filenames
pub single_file_output: Option<bool>,
Contributor

I think it would be better to make this an enum, something like

enum FileOutputMode {
    Automatic,
    SingleFile,
    Directory,
}

It makes it clearer at a glance vs needing to check the comments to know what None, Some(true) and Some(false) mean

Contributor Author

Great! Thanks @Jefffrey

Contributor

@Jefffrey Jefffrey left a comment

Could you also ensure the PR body is up to date with latest changes? Currently still refers to Option<bool>


self
}

fn copy_to_options(&self) -> HashMap<String, String> {
Contributor

This is a bit confusingly named for what it does

/// This verifies the fix for the regression where extension heuristics
/// ignored the explicit with_single_file_output(true) setting.
#[tokio::test]
async fn test_single_file_output_without_extension() -> Result<()> {
Contributor

Do we have tests for all the modes? automatic, single, directory

};

// Parse single_file_output option if explicitly set
let file_output_mode = match source_option_tuples
Contributor

I wonder if it would be better to push this enum into single_file_output itself as well instead of having this parse from an Option<bool> logic 🤔

Contributor Author

I kept with_single_file_output(bool) in DataFrameWriteOptions for backward compatibility, it's the existing public API. Internally, I changed the field to Option so we can distinguish "not set" from "explicitly set to false". The conversion to FileOutputMode happens in the physical planner. Adding with_file_output_mode(FileOutputMode) would expand the API surface; are you okay with that?

Comment on lines +567 to +568
// Filter out sink-related options that are not format options
let format_options: HashMap<String, String> = source_option_tuples
Contributor

What's the need for filtering the option here?

Contributor Author

The single_file_output option is a sink-level configuration that controls file output behavior, not a format-specific option. The file_type_to_format().create(session_state, &format_options) call validates options against the format's config schema and would fail with "unrecognized option" if we passed single_file_output. So we extract it first for FileSinkConfig, then filter it out before passing the remaining options to the format factory.
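The extract-then-filter step described above can be sketched as follows, assuming the options arrive as string key/value pairs; `split_options` and `SINK_OPTIONS` are hypothetical names, not the PR's actual identifiers:

```rust
use std::collections::HashMap;

// Keys consumed by the sink itself; they must not reach the format
// factory's option validation.
const SINK_OPTIONS: &[&str] = &["single_file_output"];

fn split_options(
    source_options: &HashMap<String, String>,
) -> (Option<bool>, HashMap<String, String>) {
    // Extract the sink-level flag first...
    let single_file = source_options
        .get("single_file_output")
        .map(|v| v.as_str() == "true");
    // ...then forward only format-level options to the format factory.
    let format_options = source_options
        .iter()
        .filter(|(k, _)| !SINK_OPTIONS.contains(&k.as_str()))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    (single_file, format_options)
}

fn main() {
    let mut opts = HashMap::new();
    opts.insert("single_file_output".to_string(), "true".to_string());
    opts.insert("compression".to_string(), "zstd".to_string());
    let (single_file, format_opts) = split_options(&opts);
    // The sink sees the flag; the format factory never does.
    assert_eq!(single_file, Some(true));
    assert!(!format_opts.contains_key("single_file_output"));
    assert_eq!(format_opts.get("compression").map(String::as_str), Some("zstd"));
}
```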

// When not set, uses extension heuristic (path with extension = single file).
// When set to true, forces single file output at exact path.
// When set to false, forces directory output with generated filenames.
optional bool single_file_output = 12;
Contributor

Might be better to represent this as an actual enum to more align with the source 🤔

insert_op: InsertOp::Overwrite,
keep_partition_by_columns: true,
file_extension: "json".into(),
file_output_mode: FileOutputMode::Automatic,
Contributor

Can we use different modes for each of the tests for coverage


### `FileSinkConfig` adds `single_file_output`

`FileSinkConfig` now includes a `single_file_output: Option<bool>` field to override the
Contributor

Reminder to update this

to the main branch and are awaiting release in this version. See [#19692] for
more details.

[#19692]: https://github.com/apache/datafusion/issues/19692
Contributor

👍

@alamb alamb added this pull request to the merge queue Jan 28, 2026
@alamb
Contributor

alamb commented Jan 28, 2026

Thanks again @kumarUjjawal and @Jefffrey and @martin-g

Merged via the queue into apache:main with commit 36c0cda Jan 28, 2026
29 checks passed


Development

Successfully merging this pull request may close these issues.

Regression: DataFrameWriteOptions::with_single_file_output produces a directory
