Allow specifying comment character for CSV reader#5759

Merged
tustvold merged 1 commit into apache:master from bbannier:t/comment on May 13, 2024

@bbannier
Member

This patch adds support to the CSV reader for a comment character. While comments, like almost everything else about the CSV format, are not truly standardized, a common convention supported by many CSV readers[1][2] is to ignore full lines starting with a comment character (often `#`); inline or end-of-line comments are not supported.

Example:

    # This is a comment in a CSV file without header.
    1,2
    # Comment inside the data block.
    11,22
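
To make the full-line rule concrete, here is a minimal sketch using only the Rust standard library. It is not the code from this patch and not how `csv_core` works internally; `parse_rows` is a hypothetical helper that ignores quoting and escaping, and it only illustrates that a line is skipped when its first byte is the comment character, while a comment character later in a line is kept as data.

```rust
/// Split `input` into rows of string fields, skipping full-line comments.
///
/// Hypothetical helper for illustration only: it ignores quoting and
/// escaping, which a real CSV parser such as `csv_core` handles.
fn parse_rows(input: &str, comment: Option<u8>) -> Vec<Vec<String>> {
    input
        .lines()
        // A line counts as a comment only if its very first byte is the
        // comment character; with `None`, no line is treated as a comment.
        .filter(|line| match comment {
            Some(c) => line.as_bytes().first() != Some(&c),
            None => true,
        })
        // Skip empty lines so a trailing newline yields no empty row.
        .filter(|line| !line.is_empty())
        // Naive field split on commas; fields are kept verbatim.
        .map(|line| line.split(',').map(str::to_string).collect())
        .collect()
}
```

Called on the example above with `Some(b'#')`, this sketch returns only the two data rows. A line such as `1,2 # note` is not shortened: the second field stays `2 # note`, which matches the "inline comments are not supported" behaviour described here.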

The implementation of this for Arrow is straightforward: all we need to do is expose the existing `comment` option of `csv_core`, which is used to read CSV files.

Closes #5758.

Footnotes

  1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

  2. https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 12, 2024
@bbannier bbannier marked this pull request as ready for review May 12, 2024 08:40
@bbannier
Member Author

The CI failure for `integration / Archery test With other arrows (pull_request)` seems to be preexisting; it also fails on current master, e.g., https://github.com/apache/arrow-rs/actions/runs/9043234896/job/24850651652.

bbannier added a commit to bbannier/datafusion that referenced this pull request May 12, 2024
This commit switches the version of arrow-rs in use to that of
apache/arrow-rs#5759, which introduces support for comments in CSV input
files.
Contributor

@tustvold tustvold left a comment

Looks good to me, thank you.

The integration test failure is unrelated.

@tustvold tustvold merged commit 6ab67df into apache:master May 13, 2024
@bbannier bbannier deleted the t/comment branch June 3, 2024 10:22
Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Support skipping comments in CSV files

2 participants