Skip to content

[c++,python] Add ManagedQuery.submit_batch for global-order write#3983

Merged
nguyenv merged 4 commits intomainfrom
viviannguyen/separate-writemanagedquery
Jun 9, 2025
Merged

[c++,python] Add ManagedQuery.submit_batch for global-order write#3983
nguyenv merged 4 commits intomainfrom
viviannguyen/separate-writemanagedquery

Conversation

@nguyenv
Copy link
Copy Markdown
Contributor

@nguyenv nguyenv commented Apr 15, 2025

Issue and/or context:

Original PR: #3764
Reverted here: #3886 due to segfault (#3879)
But now re-opening with modifications

Changes:

On top of the changes already implemented in #3764, this code introduces submit_batch which combines the ManagedQuery::set_array_data and ManagedQuery::submit_write calls.

Closes SOMA-162.

@johnkerl johnkerl changed the title [c++,python] Add ManagedQuery.submit_batch for GOW [c++,python] Add ManagedQuery.submit_batch for global-order write Apr 15, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 86.27451% with 7 lines in your changes missing coverage. Please review.

Project coverage is 65.90%. Comparing base (5eca866) to head (522cb23).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3983   +/-   ##
=======================================
  Coverage   65.90%   65.90%           
=======================================
  Files         158      158           
  Lines       21000    21008    +8     
  Branches     1242     1241    -1     
=======================================
+ Hits        13839    13845    +6     
- Misses       6748     6750    +2     
  Partials      413      413           
Flag Coverage Δ
libtiledbsoma 53.94% <75.00%> (+0.01%) ⬆️
python 89.36% <89.74%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.36% <89.74%> (-0.05%) ⬇️
libtiledbsoma 44.62% <72.72%> (+0.02%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch from fec813c to 324d940 Compare April 16, 2025 23:31
@nguyenv nguyenv marked this pull request as ready for review April 16, 2025 23:31
@nguyenv nguyenv requested review from bkmartinjr and johnkerl April 17, 2025 17:04
@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch 2 times, most recently from da4ad07 to 7c98894 Compare April 24, 2025 17:36
@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch from 7c98894 to 791e1ec Compare May 15, 2025 20:25
@jp-dark
Copy link
Copy Markdown
Collaborator

jp-dark commented May 22, 2025

Closes SOMA-162

@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch from 791e1ec to f9de5ae Compare May 28, 2025 17:32
@nguyenv nguyenv requested a review from jp-dark May 28, 2025 17:34
@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch from 7abcc31 to c36ce1b Compare May 28, 2025 20:53
Comment thread apis/python/tests/test_dataframe.py Outdated
).close()


def test_fragments_in_writes(tmp_path):
Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need additional tests. Stuff I can think of:

  1. tests that probe the expected error cases the ManagedQuery should catch/check? OTOMH:
    • batches have out-of-order indices
    • out-of-order submits
    • submitted batches with incompatible schema
    • anything else you can think of
  2. tests on arrays with more dimensions, and dimensions of other types (both the common cases, e.g., spatial dataframes, and user-specified cases such as a soma_joinid+string dimensions)
  3. edge cases, such as empty batches/chunks
  4. tests on sparse & dense array
  5. (if relevant?) nullability layer and batched? (likely not needed)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding additional tests. The one important case missing (IMHO) is a DataFrame with user-specified index columns. PointCloud covers some of this (multiple numeric indices), but not the case where the user specifies a string index.

Something like a DataFrame with index_column_names containing a string and float column in addition to the normal soma_joinid

I'm just trying to cover the edge cases of variable length or non-integral dimensions, e.g.,

df = pd.DataFrame(
    {
        "soma_joinid": pd.Series([4], dtype=np.int64),
        "str_idx": pd.Series(['a'],dtype=str),
        "float_idx": pd.Series([1.1],dtype=np.float32),
        "attr": pd.Series(["hi"], dtype="str"),
    }
)
soma.DataFrame.create(uri, schema=pa.Schema.from_pandas(df), index_column_names=['soma_joinid','str_idx','float_idx'])

which creates:

ArraySchema(
  domain=Domain(*[
    Dim(name='str_col', domain=('', ''), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=3), ])),
    Dim(name='float_idx', domain=(-3.4028235e+38, 3.4028235e+38), tile=1.0, dtype='float32', filters=FilterList([ZstdFilter(level=3), ])),
    Dim(name='soma_joinid', domain=(0, 9223372036854775805), tile=1, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='attr', dtype='<U0', var=True, nullable=True, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need C++ tests for the ManagedQuery class changes?

if not batches:
return

layout = clib.ResultOrder.unordered if sort_coords else clib.ResultOrder.globalorder
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where/how is this flag (sort_coords) documented for the end user? Or is it only a private interface?

If public, we need docstrings and a HISTORY.md entry at a minimum.

Copy link
Copy Markdown
Contributor Author

@nguyenv nguyenv Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is only used internally. The user sets this option using the platform config.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, platform_config is also undocumented.

@aaronwolen - do we need a plan to document this, or are you OK if it lands w/o user-facing help/examples?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now the most logical place to document this for users is in TileDBWriteOptions. @nguyenv could you add those docstrings in this PR? I think your explanation in the CHANGELOG looks good.

Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done a preliminary review:

  • definitely needs more unit tests
  • the user-facing aspects of this need work - e.g., docstrings/docs, HISTORY.md, etc.

@nguyenv nguyenv requested a review from bkmartinjr June 3, 2025 19:49
@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch 2 times, most recently from a3eccb1 to 407637b Compare June 3, 2025 21:21
@nguyenv nguyenv force-pushed the viviannguyen/separate-writemanagedquery branch from 407637b to 0529003 Compare June 3, 2025 23:37
Comment thread apis/python/HISTORY.md Outdated

### Changed

- \[[#3983](https://github.com/single-cell-data/TileDB-SOMA/pull/3983)\] [python] Global-order writes across multiple batches now generate a single fragment instead of multiple fragments. This internal optimization improves efficiency without changing the user interface.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "without changing the user interface" true? It looks like there is a user-facing parameter in platform_config, which has a specific effect on package behavior, so that isn't just an internal optimization.

Suggest something like:

Multiple writes of pre-sorted data may now be written to a single fragment using TileDB global order writes. Enable this performance optimization by setting the platform_config parameter sort_coords to True in the call to write. Will raise an error if data is not written in global sort order.

@aaronwolen - please feel free to suggest better prose for the HISTORY, as we do not have any other docs on this new feature

Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I would like to see one additional test (DataFrame with both joinid and a user-specified str index) to make sure we cover the string index corner case.

Also a question (maybe for Aaron?) about docs on this feature

@nguyenv nguyenv requested a review from bkmartinjr June 9, 2025 15:23
Comment thread apis/python/HISTORY.md Outdated
Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the tests -- none of the tests confirm that only a single fragment was created. Given that a primary goal of GOW is minimizing fragments, it would be a good idea if the tests actually confirm it.

Can we add something simple to do so? (e.g., count elements in the fragments directory before & after)?

Copy link
Copy Markdown
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend checking fragment count in more tests, but otherwise it LGTM!

@nguyenv nguyenv merged commit 0522d9b into main Jun 9, 2025
29 of 35 checks passed
@nguyenv nguyenv deleted the viviannguyen/separate-writemanagedquery branch June 9, 2025 18:03
@aaronwolen
Copy link
Copy Markdown
Member

Closes SOMA-162

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants