[c++,python] Add ManagedQuery.submit_batch for global-order write#3983
[c++,python] Add ManagedQuery.submit_batch for global-order write#3983
ManagedQuery.submit_batch for global-order write#3983Conversation
ManagedQuery.submit_batch for GOWManagedQuery.submit_batch for global-order write
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3983 +/- ##
=======================================
Coverage 65.90% 65.90%
=======================================
Files 158 158
Lines 21000 21008 +8
Branches 1242 1241 -1
=======================================
+ Hits 13839 13845 +6
- Misses 6748 6750 +2
Partials 413 413
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
fec813c to
324d940
Compare
da4ad07 to
7c98894
Compare
7c98894 to
791e1ec
Compare
|
Closes SOMA-162 |
791e1ec to
f9de5ae
Compare
7abcc31 to
c36ce1b
Compare
| ).close() | ||
|
|
||
|
|
||
| def test_fragments_in_writes(tmp_path): |
There was a problem hiding this comment.
Need additional tests. Stuff I can think of:
- tests that probe the expected error cases the ManagedQuery should catch/check? OTOMH:
- batches have out-of-order indices
- out-of-order submits
- submitted batches with incompatible schema
- anything else you can think of
- tests on arrays with more dimensions, and dimensions of other types (both the common cases, e.g., spatial dataframes, and user-specified cases such as a soma_joinid+string dimensions)
- edge cases, such as empty batches/chunks
- tests on sparse & dense array
- (if relevant?) nullability layer and batched? (likely not needed)
There was a problem hiding this comment.
thanks for adding additional tests. The one important case missing (IMHO) is a DataFrame with user-specified index columns. PointCloud covers some of this (multiple numeric indices), but not the case where the user specifies a string index.
Something like a DataFrame with index_column_names containing a string and float column in addition to the normal soma_joinid
I'm just trying to cover the edge cases of variable length or non-integral dimensions, e.g.,
df = pd.DataFrame(
{
"soma_joinid": pd.Series([4], dtype=np.int64),
"str_idx": pd.Series(['a'],dtype=str),
"float_idx": pd.Series([1.1],dtype=np.float32),
"attr": pd.Series(["hi"], dtype="str"),
}
)
soma.DataFrame.create(uri, schema=pa.Schema.from_pandas(df), index_column_names=['soma_joinid','str_idx','float_idx'])
which creates:
ArraySchema(
domain=Domain(*[
Dim(name='str_col', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=3), ])),
Dim(name='float_idx', domain=(-3.4028235e+38, 3.4028235e+38), tile=1.0, dtype='float32', filters=FilterList([ZstdFilter(level=3), ])),
Dim(name='soma_joinid', domain=(0, 9223372036854775805), tile=1, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
]),
attrs=[
Attr(name='attr', dtype='<U0', var=True, nullable=True, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=100000,
sparse=True,
allows_duplicates=False,
)
bkmartinjr
left a comment
There was a problem hiding this comment.
Do we need C++ tests for the ManagedQuery class changes?
| if not batches: | ||
| return | ||
|
|
||
| layout = clib.ResultOrder.unordered if sort_coords else clib.ResultOrder.globalorder |
There was a problem hiding this comment.
Where/how is this flag (sort_coords) documented for the end user? Or is it only a private interface?
If public, we need docstrings and a HISTORY.md entry at a minimum.
There was a problem hiding this comment.
This flag is only used internally. The user sets this option using the platform config.
There was a problem hiding this comment.
AFAIK, platform_config is also undocumented.
@aaronwolen - do we need a plan to document this, or are you OK if it lands w/o user-facing help/examples?
There was a problem hiding this comment.
I think for now the most logical place to document this for users is in TileDBWriteOptions. @nguyenv could you add those docstrings in this PR? I think your explanation in the CHANGELOG looks good.
bkmartinjr
left a comment
There was a problem hiding this comment.
I have done a preliminary review:
- definitely needs more unit tests
- the user-facing aspects of this need work - e.g., docstrings/docs, HISTORY.md, etc.
a3eccb1 to
407637b
Compare
407637b to
0529003
Compare
|
|
||
| ### Changed | ||
|
|
||
| - \[[#3983](https://github.com/single-cell-data/TileDB-SOMA/pull/3983)\] [python] Global-order writes across multiple batches now generate a single fragment instead of multiple fragments. This internal optimization improves efficiency without changing the user interface. |
There was a problem hiding this comment.
is "without changing the user interface" true? It looks like there is a user-facing parameter in platform_config, which has a specific effect on package behavior, so that isn't just an internal optimization.
Suggest something like:
Multiple writes of pre-sorted data may now be written to a single fragment using TileDB global order writes. Enable this performance optimization by setting the
platform_configparametersort_coordstoTruein the call towrite. Will raise an error if data is not written in global sort order.
@aaronwolen - please feel free to suggest better prose for the HISTORY, as we do not have any other docs on this new feature
bkmartinjr
left a comment
There was a problem hiding this comment.
Looking good! I would like to see one additional test (DataFrame with both joinid and a user-specified str index) to make sure we cover the string index corner case.
Also a question (maybe for Aaron?) about docs on this feature
bkmartinjr
left a comment
There was a problem hiding this comment.
On the tests -- none of the tests confirm that only a single fragment was created. Given that a primary goal of GOW is minimizing fragments, it would be a good idea if the tests actually confirm it.
Can we add something simple to do so? (e.g., count elements in the fragments directory before & after)?
bkmartinjr
left a comment
There was a problem hiding this comment.
Recommend checking fragment count in more tests, but otherwise it LGTM!
|
Closes SOMA-162 |
Issue and/or context:
Original PR: #3764
Reverted here: #3886 due to segfault (#3879)
But now re-opening with modifications
Changes:
On top of the changes already implemented in #3764, this code introduces
submit_batchwhich combines theManagedQuery::set_array_dataandManagedQuery::submit_writecalls.Closes SOMA-162.