ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python by lidavidm · Pull Request #9725 · apache/arrow

lidavidm · 2021-03-16T15:27:13Z

This adds ReadOptions to CsvFileFormat and exposes ReadOptions, ConvertOptions, and CsvFragmentScanOptions to Python.

ReadOptions was added to CsvFileFormat as its options can affect the discovered schema. For the block size, which does not need to be global, a field was added to CsvFragmentScanOptions.
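To illustrate why read options can affect the discovered schema, here is a minimal stdlib-only sketch. It is not the Arrow implementation: `discover_column_names` and its parameters are hypothetical names chosen for illustration. The same headerless file yields different column names depending on whether names are declared up front.

```python
import csv
import io


def discover_column_names(text, column_names=None, skip_rows=0):
    """Hypothetical helper: return the column names a CSV reader would settle on."""
    rows = csv.reader(io.StringIO(text))
    # Skip leading rows before looking for a header.
    for _ in range(skip_rows):
        next(rows)
    # Declared names win; otherwise the first remaining row is taken as the header.
    if column_names is not None:
        return list(column_names)
    return next(rows)


data = "1,alice\n2,bob\n"  # a file with no header row
inferred = discover_column_names(data)
declared = discover_column_names(data, column_names=["id", "name"])
print(inferred)
print(declared)
```

Because the resulting names differ, an option like this has to be known when the schema is inspected, not only when rows are scanned.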

github-actions · 2021-03-16T21:10:54Z

https://issues.apache.org/jira/browse/ARROW-8631

cpp/src/arrow/dataset/dataset_internal.h

cpp/src/arrow/dataset/file_csv.h

lidavidm

This also needs to support R.

jorisvandenbossche · 2021-03-18T14:07:13Z

cpp/src/arrow/dataset/scanner.h

Just ScanOptions instead of FragmentScanOptions might be more descriptive? (I find the "fragment" in it a bit confusing) Because it's not that this can be set for each fragment. It's the same for all fragments for one scan.

Ah, but we already have a ScanOptions of course ;)
Then maybe FormatScanOptions? Since these are format-specific options?

That's fair, but that does collide with ScanOptions itself, unless you mean just the naming of the builder method?

Ah, I see…hmm, but if we had a hypothetical FlightFragment, we'd still want to have scan options specific to that fragment, right?

That would be "FlightScanOptions" then?

Yeah, I think the slight niggle I have there is that there wouldn't be a corresponding 'Flight(File)Format'. Maybe 'PerScanOptions'? But it's not a big deal and FormatScanOptions is OK with me too.

Ah, OK, I see now that the "Fragment" in the name was meant to mean the general "fragment type", while I interpreted it as the "single fragment".
Anyway, this is more a nitpicky remark, not too important ;)

jorisvandenbossche · 2021-03-18T14:11:50Z

This will be a really nice improvement for reading CSV files with the datasets API. I just answered (maybe a bit prematurely) an SO question that needs this (about CSV files without header rows, https://stackoverflow.com/questions/66585648/reading-partitioned-datasets-stored-as-csv-with-pyarrow-dataset/66655815)

cpp/src/arrow/dataset/file_csv.h

python/pyarrow/_dataset.pyx

lidavidm · 2021-03-19T18:06:43Z

Looks like the JNI/datasets bindings build once it's kicked again.

nealrichardson · 2021-03-19T18:48:32Z

The R user experience here is not good; I'm happy to improve it in a followup, but I'm not sure how feasible that will be. I'm not sure I understand why FragmentScanOptions is separate from CsvFileFormat--it seems that all of the options I provide there are csv-specific. The issue is that I want to declare those (null_values, etc.) up front, along with all of the other parsing instructions for the files (column_names, etc.). If we decide that those are two objects, ok, but that means that open_dataset() needs to assemble the Dataset (via DatasetFactory, creating a CsvFileFormat along the way) and FragmentScanOptions and then somehow attach the FragmentScanOptions to the R Dataset object and carry it around until a scan is initialized on the dataset. (R users never make a ScannerBuilder themselves, it's all wrapped in higher-level functions.) Maybe that's not a problem but that's not something we've had to do elsewhere.

lidavidm · 2021-03-19T18:56:42Z

The motivation was to support more advanced users who might want to scan the same files repeatedly with different options. But that is a niche use case and the common case is a bit confusing. Logically, the separation is roughly between 'things that would change the schema or format', e.g. the separator, or rows to skip, and 'everything else', e.g. the set of null values - but this isn't obvious to a user who probably just wants to specify all their options together.

Maybe the respective scan options could be inlined or embedded into the file format to provide defaults? Which could then be overridden if a user wants to do something more complex. That would be some boilerplate, but would make things easier.
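The idea of format-level defaults that a scan can override might be sketched as follows. This is a plain-Python model, not Arrow code: `ScanOptions`, `CsvFormat`, and `effective_scan_options` are hypothetical names, and the assumption that scan-time options replace the defaults wholesale is a choice made for this sketch.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ScanOptions:
    """Hypothetical per-scan settings that do not change the schema."""
    null_values: tuple = ("", "NA")
    block_size: int = 1 << 20


@dataclass(frozen=True)
class CsvFormat:
    """Hypothetical file format carrying default scan options."""
    delimiter: str = ","
    default_scan_options: ScanOptions = field(default_factory=ScanOptions)


def effective_scan_options(fmt, override=None):
    # Scan-time options, when given, take precedence over the format defaults.
    return override if override is not None else fmt.default_scan_options


fmt = CsvFormat(default_scan_options=ScanOptions(null_values=("NA", "null", ".")))

# Common case: everything was declared once, on the format.
common = effective_scan_options(fmt)

# Advanced case: rescan the same files with a different block size.
tuned = effective_scan_options(fmt, replace(fmt.default_scan_options, block_size=4096))
```

In this model a typical user only ever touches `CsvFormat`, while a user who wants to rescan with different settings passes an explicit override.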

nealrichardson · 2021-03-19T19:12:24Z

> Maybe the respective scan options could be inlined or embedded into the file format to provide defaults?

Yeah I think that would be nice. I don't understand well the use case of scanning the same files with different parsing options unless I'm trying to figure out what the "right" options are. To me, things like null_values are not scan-time preferences, they're properties that describe what's in the files, so I want to declare them up front and don't need to adjust them later.

Is there a reason one would need to scan the same dataset with different parsing options, rather than create a new dataset with the options specified up front? I wonder whether the extra complexity in accepting them also at scan time is worth it if there's a simple solution like that.

lidavidm · 2021-03-19T19:15:02Z

@bkietz Is it okay to share your doc, or quote from it? I think it has the necessary context.

lidavidm · 2021-03-19T21:02:20Z

@nealrichardson let me know if the API in the latest commit is closer to what you'd expect; if there aren't objections I can clean it up and implement the missing bits (and maybe see if I can make csv_file_format_parse_options handle keywords for convert and parse options).

(test-dataset.R to save you the trouble of scrolling)

nealrichardson · 2021-03-19T21:52:19Z

Ultimately what I'm looking for is open_dataset(path, format = "csv", column_names = c(..), null_values = c("NA", "null", "."), strings_can_be_null = TRUE) and have all of those args go through to the right place. The average R user shouldn't have to call FileFormat$create() directly. I think the extra args are already passed to FileFormat$create() inside the dataset factory constructor, so you're right, csv_file_format_parse_options() probably just needs to be reworked. I'm happy to take that on (or rope @jonkeane or @ianmcook into it) if you want to make a followup JIRA, or feel free to push it ahead yourself, you've done a great job getting it to here!
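Routing a flat set of keyword arguments "to the right place" could look roughly like the sketch below. It is written in Python rather than R and is not the actual `csv_file_format_parse_options()` logic: `split_csv_options` and the three field sets are hypothetical, listing only a few of the fields found on Arrow's CSV read, parse, and convert option structs.

```python
# Hypothetical, deliberately incomplete lists of option names per group.
READ_FIELDS = {"column_names", "skip_rows", "block_size"}
PARSE_FIELDS = {"delimiter", "quote_char", "escape_char"}
CONVERT_FIELDS = {"null_values", "strings_can_be_null", "column_types"}


def split_csv_options(**kwargs):
    """Hypothetical helper: sort flat keyword arguments into option groups."""
    groups = {"read": {}, "parse": {}, "convert": {}}
    for name, value in kwargs.items():
        if name in READ_FIELDS:
            groups["read"][name] = value
        elif name in PARSE_FIELDS:
            groups["parse"][name] = value
        elif name in CONVERT_FIELDS:
            groups["convert"][name] = value
        else:
            # Fail loudly rather than silently dropping a misspelled option.
            raise TypeError(f"unknown CSV option: {name!r}")
    return groups


groups = split_csv_options(
    column_names=["id", "name"],
    null_values=["NA", "null", "."],
    strings_can_be_null=True,
)
```

Here `groups["read"]` would feed the format, and `groups["convert"]` would become its default scan options, so the caller never has to know which struct owns which field.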

lidavidm · 2021-03-21T00:12:29Z

To share the context for this refactor, Ben has this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing

This is something we need to do; dask and other advanced parquet consumers need ridiculously sophisticated hooks for scanning (let alone writing). For example: whether to populate statistics (for reading into a single table with no filter there is no point in converting statistics to expressions), whether they should be accumulated or cached (cudf folks wanted to copy the unparsed metadata buffers to the GPU), conversion details (dict_columns might be interactively decided when a string column is discovered to have few distinct values), I/O minutiae (stream block size/buffering/chunking/... might be decided after a scan starts taking too long), ...

Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0.

cpp/src/arrow/dataset/file_csv.h

nealrichardson

Thanks for working on this! Just a couple of final notes.

r/tests/testthat/test-dataset.R

nealrichardson · 2021-03-22T15:22:02Z

r/tests/testthat/test-dataset.R

I think we'll want to accept scan_options in collect() since that's the way that users typically do a scan, but that can be done in a followup.

Filed ARROW-12059.

lidavidm · 2021-03-23T11:58:35Z

This should be ready again (minus what seems to be a Travis queue buildup)

bkietz

LGTM

lidavidm added Component: C++ Component: Python labels Mar 16, 2021

lidavidm requested a review from bkietz March 16, 2021 15:27

bkietz requested changes Mar 17, 2021

cpp/src/arrow/dataset/dataset_internal.h

cpp/src/arrow/dataset/file_csv.h

lidavidm force-pushed the arrow-8631 branch from 2f189fb to 90b6471 Compare March 17, 2021 17:49

lidavidm commented Mar 17, 2021

lidavidm force-pushed the arrow-8631 branch from 90b6471 to 8c73344 Compare March 18, 2021 13:48

jorisvandenbossche reviewed Mar 18, 2021

bkietz requested changes Mar 18, 2021

cpp/src/arrow/dataset/file_csv.h

python/pyarrow/_dataset.pyx

github-actions bot added the Component: R label Mar 18, 2021

lidavidm force-pushed the arrow-8631 branch from 7635e7b to 227ab49 Compare March 19, 2021 21:01

lidavidm force-pushed the arrow-8631 branch from 227ab49 to 35be1b9 Compare March 19, 2021 21:04

lidavidm force-pushed the arrow-8631 branch 2 times, most recently from d0bd5f0 to c595a40 Compare March 19, 2021 22:52

pitrou reviewed Mar 22, 2021

cpp/src/arrow/dataset/file_csv.h

nealrichardson reviewed Mar 22, 2021

lidavidm added 4 commits March 22, 2021 13:57

ARROW-8631: [C++][Dataset] Expose csv::ReadOptions

25b0817

ARROW-8631: [Python][Dataset] Expose CsvFragmentScanOptions

ac53ac9

ARROW-8631: [C++][Dataset] Don't expose csv::ReadOptions wholesale

8df714f

ARROW-8631: [R][Dataset] Expose csv format/scan options

4a895a6

lidavidm and others added 3 commits March 22, 2021 13:57

ARROW-8631: [C++][Python][R] let user specify default scan options on…

c2d6428

… the file format

Update r/tests/testthat/test-dataset.R

e372f59

Co-authored-by: Neal Richardson <[email protected]>

ARROW-8631: [C++][Python][R] don't inline options structs

40f1b83

lidavidm force-pushed the arrow-8631 branch from 35b7d24 to 40f1b83 Compare March 22, 2021 19:29

bkietz approved these changes Mar 23, 2021

bkietz closed this in 7d233cb Mar 23, 2021

asfimport mentioned this pull request Mar 23, 2021

[C++][Dataset] Add ConvertOptions and ReadOptions to CsvFileFormat #24793

Closed

5 participants
