ARROW-16607: [R] Improve KeyValueMetadata handling #13210
nealrichardson wants to merge 8 commits into apache:master
Conversation
wjones127 left a comment
This is a very nice cleanup. Looks like backwards compatibility is well tested in r/tests/testthat/test-backwards-compatibility.R, which was my one concern.
@github-actions crossbow submit test-r-arrow-backwards-compatibility
Revision: bace9494a2454e417d935babfb162fdd0ed7c0cb Submitted crossbow builds: ursacomputing/crossbow @ actions-f099564b37
… collect to ExecPlan
paleolimbot left a comment
My thoughts about this PR are mostly high-level. Notably, there seems to be a lot of code dedicated to the preservation of attributes... are we sure that preserving them is worth it? Classes are now preserved via vctrs_extension_type() (e.g., that is how POSIXlt is now handled, which doesn't require R metadata), so attribute preservation may not be as important now as it once was.
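To illustrate what I mean (a minimal sketch, not from this PR; it assumes a recent arrow release where POSIXlt is encoded as a vctrs extension type rather than via R schema metadata):

```r
library(arrow, warn.conflicts = FALSE)

# Assigning a POSIXlt column directly keeps its class (data.frame() would coerce it)
df <- data.frame(x = 1:2)
df$when <- as.POSIXlt(c("2022-05-01 10:00:00", "2022-05-02 11:00:00"), tz = "UTC")

tab <- arrow_table(df)
tab$when$type                    # a vctrs extension type wrapping POSIXlt
class(as.data.frame(tab)$when)   # the POSIXlt class survives the round trip
```

Because the class travels with the column's extension type, no table-level "r" metadata is needed to restore it.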
I personally find the automatic restoration of attributes a pain to program around...the current example I have to find a workaround for is this one:
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)
open_dataset(tf) %>%
select(NAME, FIPS) %>%
collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
Because there are some common situations where the metadata will get dropped anyway (e.g., by renaming or dropping a column), I would personally prefer to see a dplyr query drop all the metadata (I'm aware that I'm missing a lot of history as to why that metadata is preserved!).
The short answer for why we do this is that people expect to be able to save a data.frame to a Parquet file and get the same thing back when they load it. Pandas took a similar approach (attaching metadata to the schema). Would you like to make a followup JIRA where you can explore deleting some/all of this special metadata handling?
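That round-trip expectation looks like this (an illustrative sketch; `my_note` is a hypothetical attribute name, not anything arrow treats specially):

```r
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3)
attr(df, "my_note") <- "survives the round trip"  # hypothetical custom attribute

tf <- tempfile(fileext = ".parquet")
write_parquet(df, tf)

df2 <- read_parquet(tf)
attr(df2, "my_note")  # restored from the "r" key in the file's schema metadata
```

The attribute is serialized into the schema's key-value metadata on write and reattached on read, which is exactly the machinery this PR is cleaning up.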
Done! (ARROW-16670).
Benchmark runs are scheduled for baseline = 156dc72 and contender = a6025f1. a6025f1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
print(tab$metadata); better is to print(str(tab$metadata)).

The upshot of all of this is that we can push the ExecPlan evaluation into as_record_batch_reader(), and all that collect() does on top is read the RecordBatchReader into a Table/data.frame. This means that we can plug dplyr queries into anything else that expects a RecordBatchReader, and it will be (to the maximum extent possible, given the limitations of ExecPlan) streaming, not requiring you to compute() and materialize things first.
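A sketch of what that enables (assuming the behavior described above; the round trip through an in-memory dataset is just for illustration):

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Build a dplyr query against a dataset without evaluating it
query <- InMemoryDataset$create(mtcars) %>%
  filter(cyl > 4) %>%
  select(mpg, cyl)

# The ExecPlan runs inside as_record_batch_reader(); nothing is materialized yet
reader <- as_record_batch_reader(query)

# The reader can be handed to anything that accepts a RecordBatchReader,
# e.g. streamed straight back out to disk without compute()/collect():
write_dataset(reader, tempfile())
```

Because the reader streams batches as the plan produces them, the full result never has to fit in memory at once.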