[python, R] fix incorrect schema property for dict dataframe attributes#4209
[python, R] fix incorrect schema property for dict dataframe attributes#4209bkmartinjr merged 7 commits intomainfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4209 +/- ##
==========================================
+ Coverage 69.02% 74.96% +5.94%
==========================================
Files 168 228 +60
Lines 19074 30777 +11703
Branches 1267 1266 -1
==========================================
+ Hits 13166 23073 +9907
- Misses 5461 7269 +1808
+ Partials 447 435 -12
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
mojaveazure
left a comment
There was a problem hiding this comment.
A couple of changes, one stylistic for NEWS.md and one to strengthen string-checking in check_arrow_data_types()
Also, do we want to port any of the new Python tests to R for this?
| is_dict_of_string <- function(x) { | ||
| startsWith(x$ToString(), "dictionary<values=large_string,") || | ||
| startsWith(x$ToString(), "dictionary<values=string,") | ||
| } | ||
|
|
||
| compatible <- if ((is_string(from) && is_string(to)) || (is_dict_of_string(from) && is_dict_of_string(to))) { |
There was a problem hiding this comment.
Can we change is_string() to
is_string <- function(x) {
if (inherits(x, what = "DictionaryType")) {
return(inherits(x$value_type, what = c("Utf8", "LargeUtf8")))
}
return(inherits(x, what = c("Utf8", "LargeUtf8")))
}to avoid grepping __repr__() calls and eliminate is_dict_of_string()?
If we want string types and dictionaries of strings to be separate, then these should be
is_string <- function(x) inherits(x, what = c("Utf8", "LargeUtf8"))
is_dict_of_string <- function(x) inherits(x, "DictionaryType") &&
inherits(x$value_type, what = c("Utf8", "LargeUtf8"))There was a problem hiding this comment.
I am fine with your first suggestion - happy to benefit from your R expertise!
There was a problem hiding this comment.
One question I had while reading this -- what about binary? From what I can see, the code has always been buggy, e.g., binary is up-cast to large_binary, but the code never handled that case.
There was a problem hiding this comment.
R does not differentiate between Python str and bytes, everything is always a character(); we do have a raw vector, but those are relatively unused
There was a problem hiding this comment.
I filed this as SOMA-448
I am going to leave that to you and @jp-dark. The R tests look quite far behind the python unit tests, so it is likely either a very big job, or a question of which tests deserve cherry-picking. From a visual inspection, it does look like there was a latent bug around binary/large_binary equivalence, so it is very likely a good idea to add at least some type equivalence tests. I propose as a separate task. |
|
@mojaveazure - changes made, and I filed the binary/large_binary issue as SOMA-440. If you are otherwise good with the PR, could you approve (merge is blocked at the moment). TY! |
mojaveazure
left a comment
There was a problem hiding this comment.
Thanks for taking care of this Bruce! Can you bump the develop version in DESCRIPTION from 1.18.99.5 to 1.18.99.6 before merging?
…than string (#4220) * return Arrow tables with dictionary of large_string, rather than string * disable expect_no_condition * expose optional downcast of dict-of-large-var types in ArrowAdaptor::to_arrow * revert R changes in #4209, toggle dict-of-large handling by language * further trim CI mem usage on MacOS * bump R dev version
All TileDB string and array types are automatically up-cast to
large_stringandlarge_binaryrespectively, including those used in an enumeration (aka dictionary). TileDB internally stores all of these types using 64-bit offsets, aka thelargeArrow variant.The
schemaaccess for DataFrame arrays was incorrectly reporting attributes with a dictionary/enum value type for string/binary as the Arrow small variant (stringandbinaryrespectively), instead of the actuallarge_string/large_binarytype.This PR fixes the underlying derivation of the schema for enum/dict, and updates the Python and R tests which were incorrectly hard-coded to presume the bug. I also added (in Python) a more robust set of tests to cover all of the expected up-cast cases.
Fixes SOMA-440
Note to reviewers: please check the R changes, as I have less comfort with the way the R code is checking for compatible types (in particular, it seems to be relying on the internal string representation, rather than an actual type enumeration).