Skip to content

Conversation

@xanderbailey
Copy link
Contributor

@xanderbailey xanderbailey commented Dec 5, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

Spark's csv writer can do this and it would be nice to have feature parity in projects like datafusion.
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

Adds two new options for ignore_leading_whitespace and ignore_trailing_whitespace to csv writer.
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.

Are these changes tested?

Yes
We typically require tests for all PRs in order to:

  1. Prevent the code from being accidentally broken by subsequent changes
  2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 5, 2025
Comment on lines 170 to 175
if self.ignore_leading_whitespace {
trimmed = trimmed.trim_start();
}
if self.ignore_trailing_whitespace {
trimmed = trimmed.trim_end();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very open to suggestions if this is the right way to go about this. It looks like the other options are passed into FormatOptions but that's used much more widely than just the CSV reader...

@alamb
Copy link
Contributor

alamb commented Dec 11, 2025

run benchmark csv_writer

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing xb/add_ignore_leading_whitespace (60e3f5a) to a67cd19 diff
BENCH_NAME=csv_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench csv_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=xb_add_ignore_leading_whitespace
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                    main                                   xb_add_ignore_leading_whitespace
-----                    ----                                   --------------------------------
record_batches_to_csv    1.03     17.8±0.15µs        ? ?/sec    1.00     17.3±0.16µs        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xanderbailey -- I think this is a nice change. CI needs to be fixed and I have a suggestion for expanded type support and improved documentation, but I don't think those are necessary before merging this PR

I also started some benchmarks to make sure this doesn't inadvertently slow things down

self.null_value.as_deref().unwrap_or(DEFAULT_NULL_VALUE)
}

/// Set whether to ignore leading whitespace in string values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have personally suggested using "strip leading whitespace" but I see this is consistent with Spark

Perhaps you could add an example to this documentation like

/// 
/// For example, a string values such as "  foo" will be written as "foo"

self.ignore_leading_whitespace
}

/// Set whether to ignore trailing whitespace in string values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to above, can we add an example?

/// 
/// For example, a string values such as "foo. " will be written as "foo"

use arrow_schema::*;
use std::sync::Arc;

fn main() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this example. However, I think this might be hard to find

Would you be willing to move this example into a doc example in https://github.com/apache/arrow-rs/blob/7c6a883302551dde7e89bfed1779c74dac677a0a/arrow-csv/src/writer.rs#L20-L19

Perhaps as a way to show how to use options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to. Would you remove this one? Or keep this too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have kept this and added an example to the docs inline

byte_record.push_field(buffer.as_bytes());

// Apply whitespace trimming if options are enabled and the column is a string type
let field_bytes = if should_trim && batch.column(col_idx).data_type() == &DataType::Utf8 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other string types as well, such as LargeUtf8 and Utf8View. Could you please add support (and a test) for them as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes of course! I'm still used to spark types!

@alamb alamb merged commit e3a670e into apache:main Dec 15, 2025
23 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 15, 2025

Thanks @xanderbailey

alamb pushed a commit that referenced this pull request Dec 17, 2025
# Which issue does this PR close?
Following on from #8960, we are
now exposing the quote style as a part of the csv writer options which
allows users to quote columns similar to Spark's `quoteAll` setting.
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #[9003](#9003).

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
Expose `QuoteStyle` in the `WriterBuilder` 
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
Yes with examples and unit tests.
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
alamb added a commit that referenced this pull request Dec 19, 2025
# Which issue does this PR close?


# Rationale for this change

while reviewing @xanderbailey's PR in
#8960, I found that there are
examples for arrow-csv and they are hard to find. Also each example add
extra binaries and thus slows down CI and tests. For example the
`whitespace_handling` example makes a new 2.9MB binary:

```shell
cargo run -p arrow-csv --example whitespace_handling
...
du -s -h target/debug/examples/whitespace_handling
2.9M	target/debug/examples/whitespace_handling
```

Let's consolidate the examples to make them easier to find


# What changes are included in this PR?

1. Consolidate the examples
2. Improver other csv docs

# Are these changes tested?

We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
3. Serve as another way to document the expected behavior of the code


# Are there any user-facing changes?

Docs only, no functional changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add spark like ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options to the csv writer

3 participants