ARROW-16607: [R] Improve KeyValueMetadata handling by nealrichardson · Pull Request #13210 · apache/arrow

nealrichardson · 2022-05-21T14:21:57Z

Pushes KVM handling into ExecPlan so that Run() preserves the R metadata we want.
Also pushes special handling for a kind of collapsed query from collect() into Build().
Better encapsulate KVM for the the $metadata and $r_metadata so that as a user/developer, you never have to touch the serialize/deserialize functions, you just have a list to work with. This is a slight API change, most noticeable if you were to print(tab$metadata); better is to print(str(tab$metdata)).
Factor out a common utility in r/src for taking cpp11::strings (named character vector) and producing arrow::KeyValueMetadata

The upshot of all of this is that we can push the ExecPlan evaluation into as_record_batch_reader(), and all that collect() does on top is read the RBR into a Table/data.frame. This means that we can plug dplyr queries into anything else that expects a RecordBatchReader, and it will be (to the maximum extent possible, given the limitations of ExecPlan) streaming, not requiring you to compute() and materialize things first.

github-actions · 2022-05-21T14:22:16Z

https://issues.apache.org/jira/browse/ARROW-16607

github-actions · 2022-05-21T14:22:18Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

wjones127

This is a very nice cleanup. Looks like backwards compatibility is well tested in r/tests/testthat/test-backwards-compatibility.R, which was my one concern.

nealrichardson · 2022-05-24T11:48:19Z

@github-actions crossbow submit test-r-arrow-backwards-compatibility

github-actions · 2022-05-24T11:49:25Z

Revision: bace9494a2454e417d935babfb162fdd0ed7c0cb

Submitted crossbow builds: ursacomputing/crossbow @ actions-f099564b37

Task	Status
test-r-arrow-backwards-compatibility

… collect to ExecPlan

paleolimbot

My thoughts about this PR are mostly high-level. Notably, there seems to be a lot of code dedicated to the preservation of attributes...are we sure that preserving them is worth it? Classes are now preserved via vctrs_extension_type() (e.g., how POSIXlt is now handled, which doesn't require R metadata), so it might not be important now as it once was.

I personally find the automatic restoration of attributes a pain to program around...the current example I have to find a workaround for is this one:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)

open_dataset(tf) %>% 
  select(NAME, FIPS) %>% 
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?

Because there are some common situations where the metadata will get dropped anyway (e.g., by renaming or dropping a column), I would personally prefer to see a dplyr query drop all the metadata (I'm aware that I'm missing a lot of history as to why that metadata is preserved!).

nealrichardson · 2022-05-26T12:35:30Z

are we sure that preserving them is worth it?

The short answer for why we do this is that people expect to be able to save a data.frame to a Parquet file and get the same thing back when they load it. Pandas took a similar approach (attaching metadata to the schema).

The probably longer answer is that I'm not aware of vctrs_extension_type and what it does or does not do. Maybe some of this can be dropped (though for backwards compatibility, not entirely dropped yet). That said, I suspect that doesn't cover everything because I get failing tests without this code.

Would you like to make a followup JIRA where you can explore deleting some/all of this special metadata handling?

paleolimbot · 2022-05-26T16:48:21Z

Done! (ARROW-16670).

ursabot · 2022-05-26T22:51:25Z

Benchmark runs are scheduled for baseline = 156dc72 and contender = a6025f1. a6025f1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.5% ⬆️0.04%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.2% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a6025f15 ec2-t3-xlarge-us-east-2
[Finished] a6025f15 test-mac-arm
[Finished] a6025f15 ursa-i9-9960x
[Finished] a6025f15 ursa-thinkcentre-m75q
[Finished] 156dc72c ec2-t3-xlarge-us-east-2
[Finished] 156dc72c test-mac-arm
[Finished] 156dc72c ursa-i9-9960x
[Finished] 156dc72c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: R label May 21, 2022

wjones127 approved these changes May 23, 2022

View reviewed changes

nealrichardson added 8 commits May 24, 2022 07:51

Sketch out adding KVM to ExecPlan_run

50f0b4b

Factor out strings_to_kvm()

5f412ff

Move KVM handling from collect to Run

a6ddb1b

Refactor r_metadata handling to encapsulate serialization better

2c46225

Consolidate plan running in as_record_batch_reader and push more from…

fd76b33

… collect to ExecPlan

Some cleanups

a5617d7

Consolidate metadata handling in an ExecNode method; add JIRAs for TODOs

9c069e5

Update for UnionNode

2be6f48

nealrichardson force-pushed the kvm branch from bace949 to 2be6f48 Compare May 24, 2022 11:54

paleolimbot reviewed May 25, 2022

View reviewed changes

nealrichardson closed this in a6025f1 May 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-16607: [R] Improve KeyValueMetadata handling#13210

ARROW-16607: [R] Improve KeyValueMetadata handling#13210
nealrichardson wants to merge 8 commits intoapache:masterfrom
nealrichardson:kvm

nealrichardson commented May 21, 2022

Uh oh!

github-actions bot commented May 21, 2022

Uh oh!

github-actions bot commented May 21, 2022

Uh oh!

wjones127 left a comment

Uh oh!

nealrichardson commented May 24, 2022

Uh oh!

github-actions bot commented May 24, 2022

Uh oh!

paleolimbot left a comment

Uh oh!

nealrichardson commented May 26, 2022 •

edited

Loading

Uh oh!

paleolimbot commented May 26, 2022

Uh oh!

ursabot commented May 26, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

nealrichardson commented May 21, 2022

Uh oh!

github-actions bot commented May 21, 2022

Uh oh!

github-actions bot commented May 21, 2022

Uh oh!

wjones127 left a comment

Choose a reason for hiding this comment

Uh oh!

nealrichardson commented May 24, 2022

Uh oh!

github-actions bot commented May 24, 2022

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

nealrichardson commented May 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paleolimbot commented May 26, 2022

Uh oh!

ursabot commented May 26, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nealrichardson commented May 26, 2022 •

edited

Loading