[python, R] fix incorrect schema property for dict dataframe attributes by bkmartinjr · Pull Request #4209 · single-cell-data/TileDB-SOMA

bkmartinjr · 2025-08-27T20:35:48Z

All TileDB string and array types are automatically up-cast to large_string and large_binary respectively, including those used in an enumeration (aka dictionary). TileDB internally stores all of these types using 64-bit offsets, aka the large Arrow variant.

The schema access for DataFrame arrays was incorrectly reporting attributes with a dictionary/enum value type for string/binary as the Arrow small variant (string and binary respectively), instead of the actual large_string/large_binary type.

This PR fixes the underlying derivation of the schema for enum/dict, and updates the Python and R tests which were incorrectly hard-coded to presume the bug. I also added (in Python) a more robust set of tests to cover all of the expected up-cast cases.

Fixes SOMA-440

Note to reviewers: please check the R changes, as I have less comfort with the way the R code is checking for compatible types (in particular, it seems to be relying on the internal string representation, rather than an actual type enumeration).

codecov · 2025-08-27T20:41:56Z

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 74.96%. Comparing base (b0526de) to head (12a75b6).
⚠️ Report is 76 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4209      +/-   ##
==========================================
+ Coverage   69.02%   74.96%   +5.94%     
==========================================
  Files         168      228      +60     
  Lines       19074    30777   +11703     
  Branches     1267     1266       -1     
==========================================
+ Hits        13166    23073    +9907     
- Misses       5461     7269    +1808     
+ Partials      447      435      -12

Flag	Coverage Δ
libtiledbsoma	`56.63% <0.00%> (+<0.01%)`	⬆️
python	`89.63% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
python_api	`89.63% <ø> (-0.03%)`	⬇️
libtiledbsoma	`46.16% <0.00%> (+0.82%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mojaveazure

A couple of changes, one stylistic for NEWS.md and one to strengthen string-checking in check_arrow_data_types()

Also, do we want to port any of the new Python tests to R for this?

mojaveazure · 2025-08-28T15:46:31Z

+  is_dict_of_string <- function(x) {
+    startsWith(x$ToString(), "dictionary<values=large_string,") ||
+      startsWith(x$ToString(), "dictionary<values=string,")
+  }
+
+  compatible <- if ((is_string(from) && is_string(to)) || (is_dict_of_string(from) && is_dict_of_string(to))) {


Can we change is_string() to

is_string <- function(x) { if (inherits(x, what = "DictionaryType")) { return(inherits(x$value_type, what = c("Utf8", "LargeUtf8"))) } return(inherits(x, what = c("Utf8", "LargeUtf8"))) }

to avoid grepping __repr__() calls and eliminate is_dict_of_string()?

If we want string types and dictionaries of strings to be separate, then these should be

is_string <- function(x) inherits(x, what = c("Utf8", "LargeUtf8")) is_dict_of_string <- function(x) inherits(x, "DictionaryType") && inherits(x$value_type, what = c("Utf8", "LargeUtf8"))

I am fine with your first suggestion - happy to benefit from your R expertise!

One question I had while reading this -- what about binary? From what I can see, the code has always been buggy, e.g., binary is up-cast to large_binary, but the code never handled that case.

R does not differentiate between Python str and bytes, everything is always a character(); we do have a raw vector, but those are relatively unused

I filed this as SOMA-448

bkmartinjr · 2025-08-28T16:26:44Z

Also, do we want to port any of the new Python tests to R for this?

I am going to leave that to you and @jp-dark. The R tests look quite far behind the python unit tests, so it is likely either a very big job, or a question of which tests deserve cherry-picking.

From a visual inspection, it does look like there was a latent bug around binary/large_binary equivalence, so it is very likely a good idea to add at least some type equivalence tests. I propose as a separate task.

bkmartinjr · 2025-08-28T18:48:07Z

@mojaveazure - changes made, and I filed the binary/large_binary issue as SOMA-440. If you are otherwise good with the PR, could you approve (merge is blocked at the moment). TY!

mojaveazure

Thanks for taking care of this Bruce! Can you bump the develop version in DESCRIPTION from 1.18.99.5 to 1.18.99.6 before merging?

…than string (#4220) * return Arrow tables with dictionary of large_string, rather than string * disable expect_no_condition * expose optional downcast of dict-of-large-var types in ArrowAdaptor::to_arrow * revert R changes in #4209, toggle dict-of-large handling by language * further trim CI mem usage on MacOS * bump R dev version

fix incorrect schema property for dict dataframe attributes

3dfd8d0

bkmartinjr added 2 commits August 27, 2025 16:47

debugging

498f87f

more debugging

8f650f5

bkmartinjr marked this pull request as ready for review August 28, 2025 01:53

bkmartinjr requested review from jp-dark and mojaveazure August 28, 2025 01:54

bkmartinjr changed the title ~~[python] fix incorrect schema property for dict dataframe attributes~~ [python, R] fix incorrect schema property for dict dataframe attributes Aug 28, 2025

add to changelog

f21571c

jp-dark approved these changes Aug 28, 2025

View reviewed changes

mojaveazure requested changes Aug 28, 2025

View reviewed changes

bkmartinjr added 2 commits August 28, 2025 09:24

suggestions from PR review

8763936

Merge branch 'main' into bkm/soma-440

d78bf80

bkmartinjr requested a review from mojaveazure August 28, 2025 18:50

mojaveazure approved these changes Aug 28, 2025

View reviewed changes

bump R version

12a75b6

bkmartinjr merged commit 8b561fc into main Aug 28, 2025
24 checks passed

bkmartinjr deleted the bkm/soma-440 branch August 28, 2025 23:04

bkmartinjr pushed a commit that referenced this pull request Sep 8, 2025

revert R changes in #4209, toggle dict-of-large handling by language

e66948a

bkmartinjr mentioned this pull request Sep 8, 2025

[python] return Arrow tables with dictionary of large_string, rather than string #4220

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python, R] fix incorrect schema property for dict dataframe attributes#4209

[python, R] fix incorrect schema property for dict dataframe attributes#4209
bkmartinjr merged 7 commits intomainfrom
bkm/soma-440

bkmartinjr commented Aug 27, 2025 •

edited

Loading

Uh oh!

codecov Bot commented Aug 27, 2025 •

edited

Loading

Uh oh!

mojaveazure left a comment

Uh oh!

mojaveazure Aug 28, 2025

Uh oh!

bkmartinjr Aug 28, 2025

Uh oh!

bkmartinjr Aug 28, 2025

Uh oh!

mojaveazure Aug 28, 2025 •

edited

Loading

Uh oh!

bkmartinjr Aug 28, 2025

Uh oh!

Uh oh!

bkmartinjr commented Aug 28, 2025 •

edited

Loading

Uh oh!

bkmartinjr commented Aug 28, 2025

Uh oh!

mojaveazure left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

bkmartinjr commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mojaveazure left a comment

Choose a reason for hiding this comment

Uh oh!

mojaveazure Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

bkmartinjr Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

bkmartinjr Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

mojaveazure Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bkmartinjr Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bkmartinjr commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bkmartinjr commented Aug 28, 2025

Uh oh!

mojaveazure left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bkmartinjr commented Aug 27, 2025 •

edited

Loading

codecov Bot commented Aug 27, 2025 •

edited

Loading

mojaveazure Aug 28, 2025 •

edited

Loading

bkmartinjr commented Aug 28, 2025 •

edited

Loading