ARROW-16607: [R] Improve KeyValueMetadata handling#13210

Closed
nealrichardson wants to merge 8 commits into apache:master from nealrichardson:kvm
Conversation

@nealrichardson (Member)
  • Pushes KVM handling into ExecPlan so that Run() preserves the R metadata we want.
  • Also pushes special handling for a kind of collapsed query from collect() into Build().
  • Better encapsulate KVM in $metadata and $r_metadata so that, as a user/developer, you never have to touch the serialize/deserialize functions; you just have a list to work with. This is a slight API change, most noticeable if you were to print(tab$metadata); better is print(str(tab$metadata)).
  • Factor out a common utility in r/src for taking cpp11::strings (named character vector) and producing arrow::KeyValueMetadata

The upshot of all of this is that we can push the ExecPlan evaluation into as_record_batch_reader(), and all that collect() does on top is read the RBR into a Table/data.frame. This means that we can plug dplyr queries into anything else that expects a RecordBatchReader, and it will be (to the maximum extent possible, given the limitations of ExecPlan) streaming, not requiring you to compute() and materialize things first.
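To illustrate the behavior described above, a minimal sketch using the arrow and dplyr R packages (the function names are from the arrow R API; the streaming claim is the point of the change, not demonstrated here):

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tab <- arrow_table(mtcars)

# $metadata is now just a named list; no manual serialize/deserialize
str(tab$metadata)

# A dplyr query can be handed to anything expecting a RecordBatchReader,
# without calling compute() to materialize a Table first
reader <- tab %>%
  filter(cyl > 4) %>%
  select(mpg, cyl) %>%
  as_record_batch_reader()

# collect() is now just "read the reader into a Table/data.frame"
df <- as.data.frame(reader$read_table())
```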

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127 (Member) left a comment
This is a very nice cleanup. Looks like backwards compatibility is well tested in r/tests/testthat/test-backwards-compatibility.R, which was my one concern.

@nealrichardson (Member, Author)

@github-actions crossbow submit test-r-arrow-backwards-compatibility

@github-actions

Revision: bace9494a2454e417d935babfb162fdd0ed7c0cb

Submitted crossbow builds: ursacomputing/crossbow @ actions-f099564b37

Task Status
test-r-arrow-backwards-compatibility — GitHub Actions

@paleolimbot (Member) left a comment

My thoughts about this PR are mostly high-level. Notably, there seems to be a lot of code dedicated to the preservation of attributes...are we sure that preserving them is worth it? Classes are now preserved via vctrs_extension_type() (e.g., how POSIXlt is now handled, which doesn't require R metadata), so it might not be as important now as it once was.

I personally find the automatic restoration of attributes a pain to program around...the current example I have to find a workaround for is this one:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)

open_dataset(tf) %>% 
  select(NAME, FIPS) %>% 
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?

Because there are some common situations where the metadata will get dropped anyway (e.g., by renaming or dropping a column), I would personally prefer to see a dplyr query drop all the metadata (I'm aware that I'm missing a lot of history as to why that metadata is preserved!).

@nealrichardson (Member, Author) commented May 26, 2022

> are we sure that preserving them is worth it?

The short answer for why we do this is that people expect to be able to save a data.frame to a Parquet file and get the same thing back when they load it. Pandas took a similar approach (attaching metadata to the schema).
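For illustration, this is the kind of roundtrip users expect (the attribute name here is hypothetical; the preservation behavior is what this thread describes):

```r
library(arrow)

df <- data.frame(x = 1:3)
# a custom attribute that base R would keep on the data.frame
attr(df, "my_note") <- "kept via R metadata on the schema"

tf <- tempfile(fileext = ".parquet")
write_parquet(df, tf)

df2 <- read_parquet(tf)
# the attribute survives the Parquet roundtrip
attr(df2, "my_note")
```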

The longer answer is that I'm not familiar with vctrs_extension_type and what it does or does not cover. Maybe some of this can be dropped (though for backwards compatibility, not dropped entirely yet). That said, I suspect it doesn't cover everything, because I get failing tests without this code.

Would you like to make a followup JIRA where you can explore deleting some/all of this special metadata handling?

@paleolimbot (Member)

Done! (ARROW-16670).

@ursabot commented May 26, 2022

Benchmark runs are scheduled for baseline = 156dc72 and contender = a6025f1. a6025f1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.5% ⬆️0.04%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.2% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a6025f15 ec2-t3-xlarge-us-east-2
[Finished] a6025f15 test-mac-arm
[Finished] a6025f15 ursa-i9-9960x
[Finished] a6025f15 ursa-thinkcentre-m75q
[Finished] 156dc72c ec2-t3-xlarge-us-east-2
[Finished] 156dc72c test-mac-arm
[Finished] 156dc72c ursa-i9-9960x
[Finished] 156dc72c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
