ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python#9725
ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python#9725lidavidm wants to merge 7 commits intoapache:masterfrom
Conversation
lidavidm
left a comment
There was a problem hiding this comment.
This also needs to support R.
cpp/src/arrow/dataset/scanner.h
Outdated
There was a problem hiding this comment.
Just ScanOptions instead of FragmentScanOptions might be more descriptive? (I find the "fragment" in it a bit confusing) Because it's not that this can be set for each fragment. It's the same for all fragments for one scan.
There was a problem hiding this comment.
Ah, but we already have a ScanOptions of course ;)
Then maybe FormatScanOptions? Since it are format-specific options?
There was a problem hiding this comment.
That's fair, but that does collide with ScanOptions itself, unless you mean just the naming of the builder method?
There was a problem hiding this comment.
Ah, I see…hmm, but if we had a hypothetical FlightFragment, we'd still want to have scan options specific to that fragment, right?
There was a problem hiding this comment.
That would be "FlightScanOptios" then?
There was a problem hiding this comment.
Yeah, I think the slight niggle I have there is that there wouldn't be a corresponding 'Flight(File)Format'. Maybe 'PerScanOptions'? But it's not a big deal and FormatScanOptions is OK with me too.
There was a problem hiding this comment.
Ah, OK, I see now that the "Fragment" in the name was meant to mean the general "fragment type", while I interpreted it as the "single fragment".
Anyway, this is more a nitpicky remark, not too important ;)
|
This will be a really nice improvement for reading CSV files with the datasets API. I just answered (maybe a bit prematurely) an SO question that needs this (about CSV files without header rows, https://stackoverflow.com/questions/66585648/reading-partitioned-datasets-stored-as-csv-with-pyarrow-dataset/66655815) |
|
Looks like the JNI/datasets bindings build once it's kicked again. |
|
The R user experience here is not good; I'm happy to improve it in a followup, but I'm not sure how feasible that will be. I'm not sure I understand why |
|
The motivation was to support more advanced users who might want to scan the same files repeatedly with different options. But that is a niche use case and the common case is a bit confusing. Logically, the separation is roughly between 'things that would change the schema or format', e.g. the separator, or rows to skip, and 'everything else', e.g. the set of null values - but this isn't obvious to a user who probably just wants to specify all their options together. Maybe the respective scan options could be inlined or embedded into the file format to provide defaults? Which could then be overridden if a user wants to do something more complex. That would be some boilerplate, but would make things easier. |
Yeah I think that would be nice. I don't understand well the use case of scanning the same files with different parsing options unless I'm trying to figure out what the "right" options are. To me, things like Is there a reason one would need to scan the same dataset with different parsing options, rather than create a new dataset with the options specified up front? I wonder whether the extra complexity in accepting them also at scan time is worth it if there's a simple solution like that. |
|
@bkietz Is it okay to share your doc, or quote from it? I think it has the necessary context. |
|
@nealrichardson let me know if the API in the latest commit is closer to what you'd expect; if there aren't objections I can clean it up and implement the missing bits (and maybe see if I can make (test-dataset.R to save you the trouble of scrolling) |
|
Ultimately what I'm looking for is |
d0bd5f0 to
c595a40
Compare
|
To share the context for this refactor, Ben has this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing
Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0. |
nealrichardson
left a comment
There was a problem hiding this comment.
Thanks for working on this! Just a couple of final notes.
r/tests/testthat/test-dataset.R
Outdated
There was a problem hiding this comment.
I think we'll want to accept scan_options in collect() since that's the way that users typically do a scan, but that can be done in a followup.
Co-authored-by: Neal Richardson <[email protected]>
|
This should be ready again (minus what seems to be a Travis queue buildup) |
This adds ReadOptions to CsvFileFormat and exposes ReadOptions, ConvertOptions, and CsvFragmentScanOptions to Python.
ReadOptions was added to CsvFileFormat as its options can affect the discovered schema. For the block size, which does not need to be global, a field was added to CsvFragmentScanOptions.