Skip to content

[python, R] fix incorrect schema property for dict dataframe attributes#4209

Merged
bkmartinjr merged 7 commits intomainfrom
bkm/soma-440
Aug 28, 2025
Merged

[python, R] fix incorrect schema property for dict dataframe attributes#4209
bkmartinjr merged 7 commits intomainfrom
bkm/soma-440

Conversation

@bkmartinjr
Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr commented Aug 27, 2025

All TileDB string and array types are automatically up-cast to large_string and large_binary respectively, including those used in an enumeration (aka dictionary). TileDB internally stores all of these types using 64-bit offsets, aka the large Arrow variant.

The schema access for DataFrame arrays was incorrectly reporting attributes with a dictionary/enum value type for string/binary as the Arrow small variant (string and binary respectively), instead of the actual large_string/large_binary type.

This PR fixes the underlying derivation of the schema for enum/dict, and updates the Python and R tests which were incorrectly hard-coded to presume the bug. I also added (in Python) a more robust set of tests to cover all of the expected up-cast cases.

Fixes SOMA-440

Note to reviewers: please check the R changes, as I have less comfort with the way the R code is checking for compatible types (in particular, it seems to be relying on the internal string representation, rather than an actual type enumeration).

@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 74.96%. Comparing base (b0526de) to head (12a75b6).
⚠️ Report is 76 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4209      +/-   ##
==========================================
+ Coverage   69.02%   74.96%   +5.94%     
==========================================
  Files         168      228      +60     
  Lines       19074    30777   +11703     
  Branches     1267     1266       -1     
==========================================
+ Hits        13166    23073    +9907     
- Misses       5461     7269    +1808     
+ Partials      447      435      -12     
Flag Coverage Δ
libtiledbsoma 56.63% <0.00%> (+<0.01%) ⬆️
python 89.63% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.63% <ø> (-0.03%) ⬇️
libtiledbsoma 46.16% <0.00%> (+0.82%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bkmartinjr bkmartinjr marked this pull request as ready for review August 28, 2025 01:53
@bkmartinjr bkmartinjr changed the title [python] fix incorrect schema property for dict dataframe attributes [python, R] fix incorrect schema property for dict dataframe attributes Aug 28, 2025
Copy link
Copy Markdown
Member

@mojaveazure mojaveazure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of changes, one stylistic for NEWS.md and one to strengthen string-checking in check_arrow_data_types()

Also, do we want to port any of the new Python tests to R for this?

Comment thread apis/r/R/utils-arrow.R Outdated
Comment on lines +319 to +324
is_dict_of_string <- function(x) {
startsWith(x$ToString(), "dictionary<values=large_string,") ||
startsWith(x$ToString(), "dictionary<values=string,")
}

compatible <- if ((is_string(from) && is_string(to)) || (is_dict_of_string(from) && is_dict_of_string(to))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change is_string() to

is_string <- function(x) {
  if (inherits(x, what = "DictionaryType")) {
    return(inherits(x$value_type, what = c("Utf8", "LargeUtf8")))
  }
  return(inherits(x, what = c("Utf8", "LargeUtf8")))
}

to avoid grepping __repr__() calls and eliminate is_dict_of_string()?

If we want string types and dictionaries of strings to be separate, then these should be

is_string <- function(x) inherits(x, what = c("Utf8", "LargeUtf8"))
is_dict_of_string <- function(x) inherits(x, "DictionaryType") && 
  inherits(x$value_type, what = c("Utf8", "LargeUtf8"))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with your first suggestion - happy to benefit from your R expertise!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question I had while reading this -- what about binary? From what I can see, the code has always been buggy, e.g., binary is up-cast to large_binary, but the code never handled that case.

Copy link
Copy Markdown
Member

@mojaveazure mojaveazure Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R does not differentiate between Python str and bytes, everything is always a character(); we do have a raw vector, but those are relatively unused

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed this as SOMA-448

Comment thread apis/r/NEWS.md Outdated
@bkmartinjr
Copy link
Copy Markdown
Member Author

bkmartinjr commented Aug 28, 2025

Also, do we want to port any of the new Python tests to R for this?

I am going to leave that to you and @jp-dark. The R tests look quite far behind the python unit tests, so it is likely either a very big job, or a question of which tests deserve cherry-picking.

From a visual inspection, it does look like there was a latent bug around binary/large_binary equivalence, so it is very likely a good idea to add at least some type equivalence tests. I propose as a separate task.

@bkmartinjr
Copy link
Copy Markdown
Member Author

@mojaveazure - changes made, and I filed the binary/large_binary issue as SOMA-440. If you are otherwise good with the PR, could you approve (merge is blocked at the moment). TY!

@bkmartinjr bkmartinjr requested a review from mojaveazure August 28, 2025 18:50
Copy link
Copy Markdown
Member

@mojaveazure mojaveazure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of this Bruce! Can you bump the develop version in DESCRIPTION from 1.18.99.5 to 1.18.99.6 before merging?

@bkmartinjr bkmartinjr merged commit 8b561fc into main Aug 28, 2025
24 checks passed
@bkmartinjr bkmartinjr deleted the bkm/soma-440 branch August 28, 2025 23:04
bkmartinjr pushed a commit that referenced this pull request Sep 8, 2025
…than string (#4220)

* return Arrow tables with dictionary of large_string, rather than string

* disable expect_no_condition

* expose optional downcast of dict-of-large-var types in ArrowAdaptor::to_arrow

* revert R changes in #4209, toggle dict-of-large handling by language

* further trim CI mem usage on MacOS

* bump R dev version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants