PRs
Merged PRs:
Closed/abandoned PRs:
Issues which are related but non-blocking:
See also: [sc-51048].
Problem to be solved
Users want to know the shape of an array, in the SciPy sense:
- Reads and writes are bounds-checked against the shape
- This retains its value regardless of which values of a sparse array are or are not actually occupied
- Users can resize.
  - Some users need the ability to grow their datasets later, using either tiledbsoma.io's append mode, or subsequent writes using the tiledbsoma API.
  - Note that the cellxgene census doesn't need this: each week's published census has fixed shape, and any updates will happen in new storage, on a new week.
Using TileDB-SOMA up until the present:
- The TileDB domain is immutable after array creation
  - This does bounds-checking for reads and writes, which is good
  - To leverage this to function as a shape, users would need to set the domain at array-creation time. However, users would then lose the ability to grow their datasets later.
- There is a non_empty_domain accessor
  - This only indicates min/max coordinates at which data exists. Consider an X array for 100 cells and 200 genes. If non-zero expression counts exist only for cell join IDs 2-17, then the non_empty_domain will indicate (2,17) along soma_dim_0.
  - Consider an obsm["X_pca"] within the same experiment. This may be 100 cells by 50 PCA components: we need a place to store the number 50.
  - Therefore users cannot leverage this to function as a shape accessor.
- We have offered a used_shape accessor since TileDB-SOMA 1.5.
  - This functions as a shape accessor, in the SciPy sense, but it is not multi-writer safe.
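The gap between non_empty_domain and a true shape can be illustrated with a small toy model. This is only a sketch in plain Python, not the TileDB-SOMA implementation: the dictionary-of-coordinates storage and the function name below are hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy model: a sparse 2-D array stored as {(row, col): value}.
Coords = Tuple[int, int]


def non_empty_domain(
    cells: Dict[Coords, float],
) -> Optional[Tuple[Tuple[int, int], ...]]:
    """Return (min, max) of the occupied coordinates per dimension.

    Returns None when nothing has been written yet.
    """
    if not cells:
        return None
    ndim = 2  # this toy model is fixed at two dimensions
    return tuple(
        (min(c[d] for c in cells), max(c[d] for c in cells))
        for d in range(ndim)
    )


# X is declared as 100 cells by 200 genes, but only cell join IDs 2..17
# carry non-zero expression counts.
declared_shape = (100, 200)
x_cells = {(2, 0): 1.0, (9, 150): 3.0, (17, 199): 2.0}

ned = non_empty_domain(x_cells)
print(ned)             # ((2, 17), (0, 199))
print(declared_shape)  # (100, 200)
```

The occupied coordinates say nothing about the 100 in the declared shape, which is why the shape has to be stored separately from the data.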
New feature for TileDB-SOMA 1.15:
- Arrays will have a shape
  - Reads and writes are bounds-checked against the shape
  - This retains its value regardless of which values of a sparse array are or are not actually occupied
  - Users can resize
- The used_shape accessor will be deprecated in TileDB-SOMA 1.13, and slated for removal in TileDB-SOMA 1.14.
Compatibility:
This will now require users to do an explicit resize before appending/growing TileDB-SOMA Experiments. Guidance in the form of example notebooks will be provided.
Tracking
See also: [sc-41074] and [sc-51048].
Scheduling
Support arrives in TileDB Core 2.25. Deprecations for TileDB-SOMA will be released with 1.13. Full support within TileDB-SOMA will be released in 1.14.
Details
SOMA API mods as we've discussed in a Google doc are as follows.
SOMADataFrame create: Retain the domain argument
- Issue:
  - Core has a (lo, hi) tuple per dim, e.g. (0,99) or (10,19)
  - SOMA has count per dim, with 0 implicit: e.g. 100 or 20
- For SparseNDArray and DenseNDArray, core can have (lo, hi) and SOMA can have count
- For DataFrame there can be multiple dims; the default is a single soma_joinid
  - That could be treated either in (lo, hi) fashion or count fashion
  - However additional dims (e.g. cell_type) can be of any type, including strings, floats, etc. where there is no implicit lo=0
- Therefore we need to keep the current SOMA API wherein DataFrame takes a domain argument (in (lo, hi) fashion) and not a shape argument (in count fashion)
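The difference between the two conventions can be sketched in a few lines. The helper below is hypothetical (it is not part of TileDB-SOMA or TileDB core); it only illustrates that a count exists for integer dims, where the lower bound 0 is implicit, and does not exist for string or float dims.

```python
from typing import Optional, Tuple, Union

Bound = Union[int, float, str]


def count_from_domain(domain: Tuple[Bound, Bound]) -> Optional[int]:
    """Map a core-style (lo, hi) pair to a SOMA-style count, if one exists.

    For integer dims the count is hi + 1, since 0 is the implicit lower
    bound. For other types there is no implicit lo=0, so return None.
    """
    lo, hi = domain
    if isinstance(lo, int) and isinstance(hi, int):
        return hi + 1
    return None


print(count_from_domain((0, 99)))               # 100
print(count_from_domain((10, 19)))              # 20
print(count_from_domain(("B cell", "T cell")))  # None
print(count_from_domain((0.0, 1.0)))            # None
```

Because the string and float cases have no meaningful count, a DataFrame with such index columns can only be described in (lo, hi) fashion.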
SparseNDArray and DenseNDArray create
- Have an optional shape argument which is of type Tuple[int, ...] where each element is the cell count of the corresponding dimension
- If unsupplied, or if supplied but None in any slot: use the minimum 0 in each slot, since nothing larger makes sense given that we will not support downsize
- User guidance should make clear that it will not be possible to create an ‘old’ style array with the ‘new’ style API. (See also the upgrade logic below.)
All three of SOMADataFrame, SparseNDArray, DenseNDArray
- write
  - For new arrays, created with the new shape feature:
    - Core will bounds-check that coordinates provided at write time are within the current shape
    - Core will raise tiledb.cc.TileDBError to TileDB-SOMA, which will catch it and raise IndexError in Python, with R-standard error behavior on the R side
  - For old arrays created before this feature:
    - Core will not bounds-check that coordinates provided at write time are within the current shape
- Existing used_shape accessor
  - TileDB-SOMA will deprecate this over a release cycle.
  - For new arrays: raise NotImplementedError
  - For old arrays: return what’s currently returned, with a deprecation warning.
  - Mechanism for determining old vs. new: array.schema.version (the core storage version).
- Existing shape accessor
  - For new arrays:
    - Have this return the new shape as proposed by core, no longer returning the TileDB domain.
  - For old arrays created before this feature:
    - Return the TileDB domain as now.
- Existing non_empty_domain accessor
  - Same behavior for old and new arrays (unaffected by this proposal).
  - Keep this accessor supported, but with user notes that it’s generally not useful
  - This should return None (or the R equivalent) when there is a schema but no data have been written.
- New maxshape accessor
  - Maps the core-level (lo, hi) accessor for domain to the count-style accessor hi+1. E.g. if the core domain is either (0,99) or (50,99) then TileDB-SOMA maxshape will say 100.
  - Same behavior for old and new arrays.
  - Let users query for what the TileDB domain is, with user notes that it’s the maximum that users can resize to.
  - Issac suggests: maybe domain or maxshape (see h5py).
- New resize mutator
  - Note: reshape means something else in the community (numpy, zarr, h5py), e.g. a 5x20 (total 100 cells) being reinterpreted as 4x25 (still 100 cells). The standard name for changing cell-count is resize.
  - For old arrays created before this feature: raise NotImplementedError.
  - For new arrays:
    - Will raise ValueError if the new shape is smaller on any dim than currently in storage
      - Regardless of whether any data have been written whatsoever
    - Will raise ValueError if the new shape exceeds the TileDB domain from create time, which will serve TileDB-SOMA in a role of “max possible shape the user can resize to”
    - Otherwise, any calls to write from this point will bounds-check writes within this new shape
    - We don’t expect resize to be multi-writer safe with regard to write; user notes must be clear on this point
- New tiledbsoma_upgrade_shape method for SparseNDArray and DenseNDArray
  - This will leverage array.schema.version to see if an upgrade is needed
  - Leverage core support for storage-version updates
  - This will take a shape argument as in create
  - For arrays created with “just-right” size: this will succeed
  - For arrays created with “room-for-growth” / “two billion-ish” size: this will succeed
  - If the user passes a shape which exceeds the current TileDB domain: this will fail
- New tiledbsoma_upgrade_domain method for DataFrame
  - Same as for SparseNDArray/DenseNDArray except it will take a domain at the SOMA-API level just as DataFrame's create method does
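The write and resize rules proposed above can be modeled with a toy in-memory class. This is a sketch under stated assumptions, not the TileDB-SOMA or TileDB core implementation: ToyNDArray and its attributes are hypothetical names, and only the error types (IndexError on out-of-bounds writes, ValueError on invalid resizes) are taken from the proposal.

```python
from typing import Set, Tuple


class ToyNDArray:
    """Hypothetical stand-in for a new-style array with shape and maxshape."""

    def __init__(self, shape: Tuple[int, ...], maxshape: Tuple[int, ...]):
        if len(shape) != len(maxshape):
            raise ValueError("shape and maxshape must have the same rank")
        if any(s > m for s, m in zip(shape, maxshape)):
            raise ValueError("shape exceeds maxshape")
        self.shape = tuple(shape)
        self.maxshape = tuple(maxshape)  # fixed at create time
        self.coords: Set[Tuple[int, ...]] = set()

    def write(self, coord: Tuple[int, ...]) -> None:
        """Record a cell, bounds-checked against the current shape."""
        in_bounds = len(coord) == len(self.shape) and all(
            0 <= c < s for c, s in zip(coord, self.shape)
        )
        if not in_bounds:
            raise IndexError(f"coordinate {coord} is outside shape {self.shape}")
        self.coords.add(tuple(coord))

    def resize(self, newshape: Tuple[int, ...]) -> None:
        """Grow the shape; shrinking or exceeding maxshape is an error."""
        if len(newshape) != len(self.shape):
            raise ValueError("rank mismatch")
        # Compared against the current shape, whether or not data exist.
        if any(n < s for n, s in zip(newshape, self.shape)):
            raise ValueError("downsize is not supported")
        if any(n > m for n, m in zip(newshape, self.maxshape)):
            raise ValueError("new shape exceeds maxshape")
        self.shape = tuple(newshape)


arr = ToyNDArray(shape=(100, 200), maxshape=(1000, 200))
arr.write((99, 199))     # inside the current shape
arr.resize((150, 200))   # grow dim 0 within maxshape
arr.write((100, 0))      # legal only after the resize
```

In this model an out-of-bounds write before the resize raises IndexError, and a resize to (120, 200) or (2000, 200) after the resize above raises ValueError.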
tiledbsoma.io
- The user-facing API has no shape arguments and thus won’t need changing.
- Internally to tiledbsoma.io, we’ll still ask the tiledbsoma API for the “big domain” (2 billionish)
- Append mode:
  - Will need a new resize method at the Experiment level
  - Users will need to:
    - Register as now
    - Call the experiment-level resize
      - Could be exp.resize(...), or (better) this could be tiledbsoma.io.reshape_experiment
      - In either case: this method will take the new obs and var counts as inputs:
        - exp.obs.reshape to new obs count
        - exp.ms[name].var.reshape to new var count
        - exp.ms[name].X[name].reshape to new obs count x var count
        - exp.ms[name].obsm[name].reshape to new obs count x same width
        - exp.ms[name].obsp[name].reshape to new obs count x obs count
        - exp.ms[name].varm[name].reshape to new var count x same width
        - exp.ms[name].varp[name].reshape to new var count x var count
    - Do the individual append-mode writes as now
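The fan-out from new obs and var counts to the per-array shapes can be sketched as a pure function. This is an illustration only: plan_experiment_resize and the "kind/name" keys are hypothetical and do not correspond to any existing tiledbsoma function or storage layout; the sketch covers a single measurement.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def plan_experiment_resize(
    current_shapes: Dict[str, Shape], nobs: int, nvar: int
) -> Dict[str, Shape]:
    """Compute the target shape of each array for new obs/var counts.

    Keys are hypothetical labels such as "obs", "X/data" or "obsm/X_pca";
    the part before the slash says which rule applies.
    """
    plan: Dict[str, Shape] = {}
    for key, shape in current_shapes.items():
        kind = key.split("/")[0]
        if kind == "obs":
            plan[key] = (nobs,)
        elif kind == "var":
            plan[key] = (nvar,)
        elif kind == "X":
            plan[key] = (nobs, nvar)
        elif kind == "obsm":
            plan[key] = (nobs, shape[1])  # same width as before
        elif kind == "obsp":
            plan[key] = (nobs, nobs)
        elif kind == "varm":
            plan[key] = (nvar, shape[1])  # same width as before
        elif kind == "varp":
            plan[key] = (nvar, nvar)
        else:
            raise ValueError(f"unrecognized array kind: {key}")
    return plan


# 100 cells by 200 genes, with a 50-component PCA, growing to 150 cells.
current = {
    "obs": (100,),
    "var": (200,),
    "X/data": (100, 200),
    "obsm/X_pca": (100, 50),
    "obsp/distances": (100, 100),
}
plan = plan_experiment_resize(current, nobs=150, nvar=200)
print(plan["X/data"])      # (150, 200)
print(plan["obsm/X_pca"])  # (150, 50)
```

Computing the full plan before touching any array would also make it possible to validate every target shape up front, so that a failure on one array does not leave the experiment partially resized.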