-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add ignore leading and trailing white space to csv parser #8960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ignore leading and trailing white space to csv parser #8960
Conversation
arrow-csv/src/writer.rs
Outdated
| if self.ignore_leading_whitespace { | ||
| trimmed = trimmed.trim_start(); | ||
| } | ||
| if self.ignore_trailing_whitespace { | ||
| trimmed = trimmed.trim_end(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very open to suggestions if this is the right way to go about this. It looks like the other options are passed into FormatOptions but that's used much more widely than just the CSV reader...
|
run benchmark csv_writer |
|
🤖 |
|
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xanderbailey -- I think this is a nice change. CI needs to be fixed and I have a suggestion for expanded type support and improved documentation, but I don't think those are necessary before merging this PR
I also started some benchmarks to make sure this doesn't inadvertently slow things down
| self.null_value.as_deref().unwrap_or(DEFAULT_NULL_VALUE) | ||
| } | ||
|
|
||
| /// Set whether to ignore leading whitespace in string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have personally suggested using "strip leading whitespace" but I see this is consistent with Spark
Perhaps you could add an example to this documentation like
///
/// For example, a string values such as " foo" will be written as "foo"
| self.ignore_leading_whitespace | ||
| } | ||
|
|
||
| /// Set whether to ignore trailing whitespace in string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to above, can we add an example?
///
/// For example, a string values such as "foo. " will be written as "foo"
| use arrow_schema::*; | ||
| use std::sync::Arc; | ||
|
|
||
| fn main() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for this example. However, I think this might be hard to find
Would you be willing to move this example into a doc example in https://github.com/apache/arrow-rs/blob/7c6a883302551dde7e89bfed1779c74dac677a0a/arrow-csv/src/writer.rs#L20-L19
Perhaps as a way to show how to use options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to. Would you remove this one? Or keep this too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have kept this and added an example to the docs inline
arrow-csv/src/writer.rs
Outdated
| byte_record.push_field(buffer.as_bytes()); | ||
|
|
||
| // Apply whitespace trimming if options are enabled and the column is a string type | ||
| let field_bytes = if should_trim && batch.column(col_idx).data_type() == &DataType::Utf8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other string types as well, such as LargeUtf8 and Utf8View. Could you please add support (and a test) for them as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes of course! I'm still used to spark types!
|
Thanks @xanderbailey |
# Which issue does this PR close? Following on from #8960, we are now exposing the quote style as a part of the csv writer options which allows users to quote columns similar to Spark's `quoteAll` setting. <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes #[9003](#9003). # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> # What changes are included in this PR? Expose `QuoteStyle` in the `WriterBuilder` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> # Are these changes tested? Yes with examples and unit tests. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
# Which issue does this PR close? # Rationale for this change while reviewing @xanderbailey's PR in #8960, I found that there are examples for arrow-csv and they are hard to find. Also each example add extra binaries and thus slows down CI and tests. For example the `whitespace_handling` example makes a new 2.9MB binary: ```shell cargo run -p arrow-csv --example whitespace_handling ... du -s -h target/debug/examples/whitespace_handling 2.9M target/debug/examples/whitespace_handling ``` Let's consolidate the examples to make them easier to find # What changes are included in this PR? 1. Consolidate the examples 2. Improver other csv docs # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 3. Serve as another way to document the expected behavior of the code # Are there any user-facing changes? Docs only, no functional changes
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
ignoreLeadingWhiteSpaceandignoreTrailingWhiteSpaceoptions to the csv writer #8961Rationale for this change
Spark's csv writer can do this and it would be nice to have feature parity in projects like datafusion.
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
What changes are included in this PR?
Adds two new options for
ignore_leading_whitespaceandignore_trailing_whitespaceto csv writer.There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are these changes tested?
Yes
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.