ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code#10191
ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code#10191nealrichardson wants to merge 17 commits intoapache:masterfrom
Conversation
3bbe342 to
9640cff
Compare
|
The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct? |
Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜 . But the most immediate effect for us is that we get to simplify our code (less |
eb6848f to
4f854b8
Compare
27c62f5 to
795e1f9
Compare
|
I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough. |
|
Yeah I may have transposed the numbers, will confirm |
jonkeane
left a comment
There was a problem hiding this comment.
This looks good, a few (small) comments
|
@github-actions crossbow submit -g r |
|
Revision: bc6e3563c91b45e15432804b75ba8f0b3a1b0575 Submitted crossbow builds: ursacomputing/crossbow @ actions-407 |
|
This looks great. The increased dependency on Dataset means we will have to skip many more tests to make |
Co-authored-by: Jonathan Keane <[email protected]>
bc6e356 to
fa5731e
Compare
|
@github-actions crossbow submit test-r-minimal-build |
|
Revision: fa5731e Submitted crossbow builds: ursacomputing/crossbow @ actions-410
|
Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:
Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}(ARROW-12632). I will remove theas.factormethod and leave a TODO to restore it after that JIRA is resolved.mutate(); this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:
array_expression