
Conversation

@d-chambers d-chambers commented Oct 11, 2025

Description

This PR adds support for FBE (frequency band extracted) patches in ProdML files. It also tests that out-of-bounds slicing returns empty spools for all commonly tested FiberIO instances.
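A hedged usage sketch of the resulting behavior (the registry fetch helper and exact read call are assumptions based on this thread; prodml_fbe_1.h5 is the sample file added to the data registry here):

import numpy as np
import dascore as dc
from dascore.utils.downloader import fetch  # assumed pooch-backed registry helper

path = fetch("prodml_fbe_1.h5")  # sample FBE file added in this PR
spool = dc.read(path)  # FBE patches are now parsed alongside raw ones
print(len(spool), spool[0].dims)

# Selecting a time window entirely outside the file now yields an empty spool.
t1 = spool[0].attrs.time_max + np.timedelta64(1, "s")
assert len(dc.read(path, time=(t1, None))) == 0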

Checklist

I have (if applicable):

  • referenced the GitHub issue this PR closes.
  • documented the new feature with docstrings and/or appropriate doc page.
  • included tests. See testing guidelines.
  • added the "ready_for_review" tag once the PR is ready to be reviewed.

Summary by CodeRabbit

  • New Features

    • Expanded ProdML support and added ProdML patch attribute types.
  • Bug Fixes

    • Readers across formats now return empty results when time/distance selections yield no data (preventing empty patches).
  • Behavior Changes

    • SEGY reads no longer perform channel/time trimming; full patch returned unless selection excludes all samples.
  • Tests

    • Added tests verifying empty-read behavior when slices exclude all patches; minor docstring fix.
  • Chores

    • Added a new sample dataset entry to the data registry.

coderabbitai bot commented Oct 11, 2025

Walkthrough

Adds empty-data guards across many IO readers to short-circuit when selections yield no data, renames several internal helpers from _get_patch to _get_patches (now returning lists), refactors ProdML parsing to a processor/iterator-driven flow, adds ProdML attr classes and a data-registry entry, and updates tests for empty-selection reads.
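As a rough illustration of the processor-registry dispatch this refactor relies on (the decorator mirrors dascore.utils.misc.register_func, which is referenced below; the registry and handler names are simplified stand-ins):

_NODE_ATTRS_PROCESSORS = {}

def register_func(registry, key):
    """Decorator that files a function under `key` in a registry dict."""
    def decorator(func):
        registry[key] = func
        return func
    return decorator

@register_func(_NODE_ATTRS_PROCESSORS, key="raw")
def _raw_attrs(node):
    return {"patch_type": "raw", "node": node}

@register_func(_NODE_ATTRS_PROCESSORS, key="fbe")
def _fbe_attrs(node):
    return {"patch_type": "fbe", "node": node}

# Dispatch on the node's patch type instead of branching inline.
attrs = _NODE_ATTRS_PROCESSORS["fbe"]("some_node")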

Changes

  • ProdML overhaul (dascore/io/prodml/core.py, dascore/io/prodml/utils.py, dascore/data_registry.txt): Replaces dict-based attr construction with a processor/iterator-driven flow; adds ProdMLRawPatchAttrs/ProdMLFbePatchAttrs, processor registries, and _yield_prodml_attrs_coords; and updates _read_prodml usage. Also adds a new data registry entry.
  • _get_patch → _get_patches conversions (dascore/io/ap_sensing/core.py, dascore/io/ap_sensing/utils.py, dascore/io/silixah5/core.py, dascore/io/silixah5/utils.py, dascore/io/sintela_binary/core.py, dascore/io/sintela_binary/utils.py): Renames the helper to _get_patches, updates imports and call sites, changes the return shape to lists of Patch(es), and adds early empty-data short-circuits in utils.
  • Readers: empty-data guards (dascore/io/dashdf5/core.py, dascore/io/gdr/core.py, dascore/io/h5simple/core.py, dascore/io/neubrex/core.py, dascore/io/segy/core.py, dascore/io/tdms/core.py, dascore/io/terra15/core.py, dascore/io/dasdae/core.py, dascore/io/optodas/utils.py, dascore/io/febus/utils.py): Adds early-return or skip behavior when selected coords/data are empty, avoiding empty Patch objects and invalid reshapes; adjusts control flow accordingly. A sketch of this guard pattern follows the list.
  • dasdae utilities (dascore/io/dasdae/utils.py): Adds _KWARG_NON_KEYS and _kwargs_empty(kwargs) to determine when kwargs are effectively empty and to control filtering of empty patches.
  • Tests & minor fixes (tests/test_io/test_common_io.py, tests/test_io/test_dasdae/test_dasdae.py): Adds tests asserting that reads return empty spools when selections exclude all data, removes a runtime skip, and fixes a docstring typo.
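A minimal sketch of the empty-data guard pattern these readers now share (make_patch stands in for the real Patch constructor; the helper shape is illustrative, not any module's actual code):

import numpy as np

def _maybe_patch(data, make_patch):
    """Return [] when a selection removed every sample, else one patch."""
    if not data.size:  # selection excluded all samples
        return []
    return [make_patch(data)]

assert _maybe_patch(np.empty((0, 8)), list) == []
assert len(_maybe_patch(np.ones((4, 8)), list)) == 1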

Possibly related PRs

  • add Sintela binary v3 support #516 — Renames and call-site updates for the sintela_binary _get_patch → _get_patches helper, which directly touches the same helper and affected modules.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title Check — ❓ Inconclusive: The title “Prodml fbe” is too terse and does not clearly communicate the nature of the change; it omits that support for frequency-band-extracted patches is being added and reads more like a label than a descriptive summary. Resolution: expand the title to a clear, imperative statement such as “Add support for FBE patches in ProdML files” so that teammates can immediately understand the primary change.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: The pull request description follows the repository template by including a clear “## Description” section explaining the feature and a “## Checklist” with the required items, satisfying the structural requirements for a complete description.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0aa431 and f66a73f.

📒 Files selected for processing (1)
  • dascore/io/prodml/utils.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (5)
  • get_coord (1039-1058)
  • step (1068-1070)
  • keys (1027-1029)
  • update (240-308)
  • get_coord_manager (1127-1215)
dascore/core/coords.py (4)
  • get_coord (1515-1703)
  • update (738-751)
  • update (965-967)
  • data (606-608)
dascore/utils/misc.py (4)
  • iterate (257-269)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Run benchmarks
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.12)
  • GitHub Check: test_code (windows-latest, 3.10)
  • GitHub Check: test_code (windows-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.11)
  • GitHub Check: test_code (ubuntu-latest, 3.11)
  • GitHub Check: test_code (macos-latest, 3.10)
  • GitHub Check: test_code_min_deps (windows-latest, 3.12)
  • GitHub Check: test_code_min_deps (macos-latest, 3.13)
  • GitHub Check: test_code_min_deps (macos-latest, 3.12)
  • GitHub Check: test_code (macos-latest, 3.12)
  • GitHub Check: test_code_min_deps (windows-latest, 3.13)
  • GitHub Check: test_code (ubuntu-latest, 3.10)
  • GitHub Check: test_code (ubuntu-latest, 3.12)
  • GitHub Check: test_code (macos-latest, 3.13)
  • GitHub Check: test_code (ubuntu-latest, 3.13)
🔇 Additional comments (5)
dascore/io/prodml/utils.py (5)

176-176: Dict union operator requires Python 3.9+.

The dict union operator (fbe | raws) on line 176 requires Python 3.9 or later. Verify that the project's minimum Python version requirement is at least 3.9.

If Python 3.8 support is still needed, use {**fbe, **raws} instead:

-    for name, node in (fbe | raws).items():
+    for name, node in {**fbe, **raws}.items():

220-231: Verify data units are read from the correct location.

For many ProdML files, RawDataUnit may be attached to the RawData dataset rather than the parent group. The current code on line 225 only reads from node_info.node (the group). Consider reading from both the group and the dataset to ensure units are found:

 @register_func(_NODE_ATTRS_PROCESSORS, key="raw")
 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
     info.update(_get_data_unit_and_type(node_info.node))
+    # Override with dataset-level units if present (some files store units on dataset).
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords

Based on learnings


233-256: LGTM! FBE distance slicing correctly implemented.

The distance slicing logic now correctly:

  • Uses StartLocusIndex for the starting position (line 245)
  • Uses NumberOfLoci for the count (line 246)
  • Handles both bytes and regular types with proper type conversion (lines 247-248)

This addresses the previous concern about using the wrong attribute for start_ind.
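A hedged sketch of that slice, with HDF5-style attrs that may arrive as bytes or NumPy scalars (the attribute names follow the review; the coordinate is a plain array here rather than a dascore coord):

import numpy as np

attrs = {"StartLocusIndex": b"4", "NumberOfLoci": np.int64(16)}
d_coord = np.arange(100) * 1.02  # full distance coordinate, meters

def _as_int(value):
    # Normalize bytes / NumPy scalars to a plain int before slicing.
    return int(value.decode() if isinstance(value, bytes) else value)

start = _as_int(attrs.get("StartLocusIndex", 0))
count = _as_int(attrs["NumberOfLoci"])
distance = d_coord[start : start + count]
assert distance.shape == (16,)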


307-321: LGTM! Clean iterator-based implementation.

The refactored implementation using a single iteration over _yield_data_nodes with registry-based processor dispatch is clear and maintainable. The empty-data check on line 319 correctly prevents creating patches with no data.


48-55: Mapping is now consistent with ProdMLFbePatchAttrs.

The mapping from "WindowFunction" to "window_function" on line 52 correctly matches the field definition on line 81. Previous review comments flagged a mismatch with window_type, but the code is now internally consistent.



@coderabbitai coderabbitai bot added the IO Work for reading/writing different formats label Oct 11, 2025

codecov bot commented Oct 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.92%. Comparing base (f8d7949) to head (f66a73f).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #571    +/-   ##
========================================
  Coverage   99.92%   99.92%            
========================================
  Files         127      128     +1     
  Lines       10622    10878   +256     
========================================
+ Hits        10614    10870   +256     
  Misses          8        8            
Flag Coverage Δ
unittests 99.92% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)

141-156: Guard against single-element time arrays (division by zero risk)

If array_len == 1, (end - start) / (array_len - 1) divides by zero. Fall back to using the time array directly.

-    step = (end - start) / (array_len - 1)
-    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
+    if array_len == 1:
+        time_coord = get_coord(data=np.asarray(time_array[:]), units="s")
+    else:
+        step = (end - start) / (array_len - 1)
+        time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
🧹 Nitpick comments (4)
dascore/io/terra15/core.py (1)

73-74: Consider checking before patch creation for efficiency.

The empty-data guard works correctly but checks patch.data.size after the patch has been created by _read_terra15. Other IO modules in this PR check before creating the patch, which is more efficient. If feasible, consider adding the empty-data check within _read_terra15 or checking the data before patch construction.

dascore/io/sintela_binary/utils.py (1)

224-229: Move the empty-data check before Patch creation.

The current implementation creates a Patch object before checking if the data is empty. This is inefficient because Patch creation can be expensive for memory-mapped data. Move the empty-data guard before the Patch instantiation to avoid unnecessary allocation.

Apply this diff:

     if time is not None or distance is not None:
         cm, data = cm.select(data, time=time, distance=distance)
-    patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     if not data.size:  # no data in this slice.
         return []
+    patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     return [patch]
dascore/io/dasdae/core.py (1)

120-123: Document empty-patch guard rationale

  • Add an inline comment above the if not patch.data.size and kwargs guard explaining that empty patches are preserved on full reads (for round-trip support) but skipped only when filters (kwargs) are provided.
  • Add a test for dc.read(path, time=…) (or other selection kwargs) to assert that empty patches are correctly filtered out under selection.
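A hypothetical shape for that test (the example-patch fixture and write call are assumptions, not the repo's exact helpers):

import numpy as np
import dascore as dc

def test_selection_skips_empty_patches(tmp_path):
    patch = dc.get_example_patch()
    path = tmp_path / "example.h5"
    patch.io.write(path, "dasdae")
    # A window entirely after the data should filter the patch out.
    t1 = patch.attrs.time_max + np.timedelta64(1, "s")
    assert len(dc.read(path, time=(t1, None))) == 0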
dascore/io/prodml/utils.py (1)

264-269: Avoid eager loading FBE data

Returning node[:] loads full arrays into memory. Return the dataset and let downstream selection handle slicing/laziness, matching raw behavior.

-    node = node_info.node
-    return node[:]
+    return node_info.node
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d6b3ae and 71c2dba.

📒 Files selected for processing (21)
  • dascore/data_registry.txt (1 hunks)
  • dascore/io/ap_sensing/core.py (2 hunks)
  • dascore/io/ap_sensing/utils.py (1 hunks)
  • dascore/io/dasdae/core.py (1 hunks)
  • dascore/io/dashdf5/core.py (1 hunks)
  • dascore/io/febus/utils.py (3 hunks)
  • dascore/io/gdr/core.py (1 hunks)
  • dascore/io/h5simple/core.py (1 hunks)
  • dascore/io/neubrex/core.py (2 hunks)
  • dascore/io/optodas/utils.py (1 hunks)
  • dascore/io/prodml/core.py (3 hunks)
  • dascore/io/prodml/utils.py (6 hunks)
  • dascore/io/segy/core.py (1 hunks)
  • dascore/io/silixah5/core.py (2 hunks)
  • dascore/io/silixah5/utils.py (1 hunks)
  • dascore/io/sintela_binary/core.py (2 hunks)
  • dascore/io/sintela_binary/utils.py (2 hunks)
  • dascore/io/tdms/core.py (1 hunks)
  • dascore/io/terra15/core.py (1 hunks)
  • tests/test_io/test_common_io.py (1 hunks)
  • tests/test_io/test_dasdae/test_dasdae.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (16)
dascore/io/tdms/core.py (4)
dascore/core/patch.py (2)
  • data (272-283)
  • size (302-315)
dascore/core/coords.py (2)
  • data (606-608)
  • size (473-475)
dascore/core/coordmanager.py (1)
  • size (802-804)
dascore/core/spool.py (1)
  • spool (663-692)
dascore/io/silixah5/core.py (1)
dascore/io/silixah5/utils.py (3)
  • _get_attr (94-101)
  • _get_patches (104-117)
  • _get_version_string (32-39)
dascore/io/sintela_binary/utils.py (2)
dascore/io/ap_sensing/utils.py (1)
  • _get_patches (90-101)
dascore/io/silixah5/utils.py (1)
  • _get_patches (104-117)
dascore/io/terra15/core.py (2)
dascore/core/patch.py (2)
  • data (272-283)
  • size (302-315)
dascore/core/spool.py (1)
  • spool (663-692)
dascore/io/h5simple/core.py (2)
dascore/core/patch.py (1)
  • size (302-315)
dascore/core/spool.py (1)
  • spool (663-692)
dascore/io/dasdae/core.py (4)
dascore/io/dasdae/utils.py (1)
  • _read_patch (155-167)
dascore/core/patch.py (2)
  • data (272-283)
  • size (302-315)
dascore/core/coords.py (2)
  • data (606-608)
  • size (473-475)
dascore/core/coordmanager.py (1)
  • size (802-804)
dascore/io/silixah5/utils.py (3)
dascore/io/ap_sensing/utils.py (1)
  • _get_patches (90-101)
dascore/io/sintela_binary/utils.py (1)
  • _get_patches (216-229)
dascore/compat.py (1)
  • array (31-37)
dascore/io/dashdf5/core.py (1)
dascore/core/spool.py (1)
  • spool (663-692)
dascore/io/prodml/core.py (2)
dascore/io/prodml/utils.py (3)
  • _get_prodml_version_str (96-106)
  • _read_prodml (299-310)
  • _yield_prodml_attrs_coords (271-282)
dascore/core/attrs.py (1)
  • update (234-255)
dascore/io/segy/core.py (2)
dascore/core/spool.py (1)
  • spool (663-692)
dascore/core/patch.py (1)
  • Patch (28-510)
dascore/io/ap_sensing/utils.py (2)
dascore/io/silixah5/utils.py (1)
  • _get_patches (104-117)
dascore/io/sintela_binary/utils.py (1)
  • _get_patches (216-229)
dascore/io/febus/utils.py (2)
dascore/core/patch.py (5)
  • size (302-315)
  • data (272-283)
  • dtype (318-320)
  • shape (286-299)
  • coords (255-269)
dascore/core/coords.py (4)
  • size (473-475)
  • empty (679-696)
  • max (443-445)
  • data (606-608)
tests/test_io/test_common_io.py (3)
dascore/io/core.py (4)
  • scan (483-500)
  • scan (784-900)
  • read (472-481)
  • read (587-651)
dascore/core/coordmanager.py (1)
  • max (1064-1066)
dascore/core/coords.py (1)
  • max (443-445)
dascore/io/prodml/utils.py (3)
dascore/core/coordmanager.py (3)
  • get_coord (1039-1058)
  • update (240-308)
  • get_coord_manager (1127-1215)
dascore/utils/misc.py (3)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
dascore/io/sintela_binary/core.py (3)
dascore/io/ap_sensing/utils.py (1)
  • _get_patches (90-101)
dascore/io/silixah5/utils.py (1)
  • _get_patches (104-117)
dascore/io/sintela_binary/utils.py (1)
  • _get_patches (216-229)
dascore/io/ap_sensing/core.py (1)
dascore/io/ap_sensing/utils.py (3)
  • _get_attrs_dict (76-87)
  • _get_patches (90-101)
  • _get_version_string (10-20)
🪛 Ruff (0.13.3)
dascore/io/prodml/utils.py

302-302: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (21)
dascore/data_registry.txt (1)

34-34: LGTM!

The new data registry entry for prodml_fbe_1.h5 follows the established format and aligns with the PR's objective to add FBE support.

dascore/io/optodas/utils.py (1)

87-88: LGTM!

The empty-data guard correctly prevents patch creation when selection yields no data, returning an empty list consistent with the function's return type.

dascore/io/gdr/core.py (1)

59-60: LGTM!

The empty-data guard appropriately short-circuits processing when trimming results in no data, avoiding unnecessary patch creation.

dascore/io/h5simple/core.py (1)

41-42: LGTM with observation.

The empty-data guard correctly prevents patch creation when trimming yields no data. Note that this implementation checks new_cm.size (coordinate manager size) rather than new_data.size (data array size) as seen in other readers. Both approaches should be equivalent when coordinates and data are properly aligned.

dascore/io/tdms/core.py (1)

62-63: LGTM!

The empty-data guard appropriately short-circuits when coordinate selection yields no data, consistent with the pattern applied across other IO modules.

dascore/io/segy/core.py (2)

50-51: LGTM!

The empty-data guard correctly prevents patch creation when filtering yields no data.


54-54: LGTM!

Explicitly wrapping the patch in a list ([patch]) improves clarity and aligns with the pattern used in other IO modules.

dascore/io/dashdf5/core.py (1)

71-72: LGTM!

The empty-data guard appropriately returns an empty spool when coordinate selection yields no data, consistent with the unified pattern applied across IO modules in this PR.

dascore/io/neubrex/core.py (1)

74-75: LGTM!

The empty-data guards are properly placed after data trimming and consistently return empty spools when no data remains. This aligns with the PR's objective to handle out-of-range selections gracefully across all IO readers.

Also applies to: 121-122

tests/test_io/test_dasdae/test_dasdae.py (1)

133-142: LGTM! Clear documentation of the breaking change.

The xfail marker and explanatory comment properly document why empty patches can no longer round-trip. The note about this being an obscure edge case that won't affect real-world usage is helpful context.

dascore/io/sintela_binary/core.py (1)

18-18: LGTM!

The import and call site updates correctly adapt to the renamed _get_patches function, which now returns a list instead of a single Patch.

Also applies to: 77-79

dascore/io/silixah5/utils.py (1)

104-117: LGTM!

The refactoring correctly:

  • Renames the function to _get_patches reflecting the list return type
  • Checks for empty data before creating the Patch object (efficient)
  • Wraps the result in a list for consistent return type
  • Follows the same pattern as other readers in this PR
dascore/io/silixah5/core.py (1)

14-14: LGTM!

The changes correctly update the import and call site to use the refactored _get_patches function. The variable naming (patches) now accurately reflects that it's a list.

Also applies to: 65-68

tests/test_io/test_common_io.py (1)

353-365: LGTM!

The test correctly verifies that reading with a time window entirely after the file's time range returns an empty spool. The logic is sound:

  • Scans file to determine max time
  • Adds a one-second offset to ensure the query is beyond available data
  • Asserts empty spool is returned

This aligns with the PR's objective to handle out-of-range selections gracefully.

dascore/io/febus/utils.py (3)

231-237: Good early bail-out on empty time selection

Avoids min/max on empty arrays and unnecessary work. LGTM.


263-268: Graceful handling when distance is fully sliced out

Returning a 2D zero-length-distance array is consistent with empty-spool behavior. LGTM.


277-279: Skip creating patches for empty data

Correctly avoids empty patches. LGTM.

dascore/io/prodml/core.py (2)

54-57: Scan integration looks good

Iterating generator and updating coords on attrs is correct. LGTM.


67-68: Read path aligned with new utils

Delegates to _read_prodml and spools. LGTM.

dascore/io/ap_sensing/utils.py (1)

98-99: Empty selection guard is correct

Returns [] when slicing removes all data. Matches other IO behaviors. LGTM.

dascore/io/ap_sensing/core.py (1)

64-66: Updated to _get_patches

Callsite matches new list-returning helper. LGTM.


codspeed-hq bot commented Oct 11, 2025

CodSpeed Performance Report

Merging #571 will degrade performance by 16.27%

Comparing prodml_fbe (f66a73f) with master (f2da091)

Summary

❌ 2 regressions
✅ 52 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_scan 254.4 ms 303.8 ms -16.27%
test_scan_df 349.3 ms 406.5 ms -14.05%

@d-chambers
Contributor Author

Merging #571 will degrade performance by 16.15%

This looks bad, but the real issue isn't that the code is 16% slower; rather, I added another test file. Predictably, the benchmarks that scan all the test data files take longer when there are more files to scan.

Running the benchmark again after removing the new file on this branch I get:

128 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

and on master I get:
127 ms ± 782 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the performance change really isn't significant.

@coderabbitai coderabbitai bot added the ready_for_review PR is ready for review label Oct 13, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)

130-155: Single-sample time array: guard divide-by-zero

If time array length is 1, computing step with (array_len - 1) divides by zero. Handle the singleton case explicitly.

-    step = (end - start) / (array_len - 1)
-    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
+    if array_len == 1:
+        time_coord = get_coord(data=np.asarray(time_array[:]).astype("datetime64[us]"))
+        return time_coord
+    step = (end - start) / (array_len - 1)
+    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)

218-229: Raw units source: read from RawData dataset, not group

RawDataUnit typically lives on the RawData dataset. Use dataset attrs (with group as fallback) to populate units.

 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
-    info.update(_get_data_unit_and_type(node_info.node))
+    # Get type from group; units from dataset (fallback to group).
+    info.update(_get_data_unit_and_type(node_info.node))
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords

Based on learnings


231-252: FBE distance slice: ensure integer indices

Cast StartLocusIndex and NumberOfLoci to int before slicing. Prevents type issues when attrs are bytes/strings or numpy scalars.

 def _get_processed_node_attr_coords(node_info, d_coord, base_info):
@@
-    start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
-    total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+    start_ind = int(unbyte(node_info.parent_node.attrs.get("StartLocusIndex", 0)))
+    total_inds = int(unbyte(node_info.parent_node.attrs["NumberOfLoci"]))
@@
-    distance = d_coord[start_ind : start_ind + total_inds]
+    distance = d_coord[start_ind : start_ind + total_inds]
🧹 Nitpick comments (3)
dascore/io/prodml/utils.py (3)

254-269: Avoid eager read of FBE data; keep it lazy and consistent with raw

Return the dataset instead of materializing it; selection will slice as needed.

 @register_func(_NODE_DATA_PROCESSORS, key="fbe")
 def _get_fbe_data(node_info):
     """Get the data from a fbe node."""
     node = node_info.node
-    return node[:]
+    return node

285-296: Normalize dim names: include 'Distance' mapping

Some files use 'Distance' in Dimensions; map to 'distance' for consistency.

 def _get_dims_from_attrs(attrs):
@@
-    map_ = {"locus": "distance", "Locus": "distance", "Time": "time"}
+    map_ = {"locus": "distance", "Locus": "distance", "Distance": "distance", "Time": "time"}

299-306: Good: strict zip added; consider single-pass to avoid double traversal

zip(strict=True) is correct. To avoid scanning nodes twice, iterate once and compute attrs/data per node.

Example refactor:

def _read_prodml(fi, distance=None, time=None):
    out = []
    acq = fi["Acquisition"]
    base_info = maybe_get_items(acq.attrs, _ROOT_ATTRS)
    d_coord = _get_distance_coord(acq)
    for info in _yield_data_nodes(fi):
        attr_func = _NODE_ATTRS_PROCESSORS[info.patch_type]
        data_func = _NODE_DATA_PROCESSORS[info.patch_type]
        attrs, cm = attr_func(info, d_coord, base_info)
        data = data_func(info)
        if time is not None or distance is not None:
            cm, data = cm.select(array=data, time=time, distance=distance)
        if data.size:
            out.append(dc.Patch(data=data, attrs=attrs, coords=cm))
    return out
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 71c2dba and 7d69ec5.

📒 Files selected for processing (5)
  • dascore/io/dasdae/core.py (2 hunks)
  • dascore/io/dasdae/utils.py (1 hunks)
  • dascore/io/prodml/utils.py (6 hunks)
  • dascore/io/sintela_binary/utils.py (2 hunks)
  • tests/test_io/test_dasdae/test_dasdae.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_io/test_dasdae/test_dasdae.py
🧰 Additional context used
🧬 Code graph analysis (3)
dascore/io/prodml/utils.py (3)
dascore/core/coordmanager.py (4)
  • get_coord (1039-1058)
  • keys (1027-1029)
  • update (240-308)
  • get_coord_manager (1127-1215)
dascore/utils/misc.py (4)
  • iterate (257-269)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
dascore/io/sintela_binary/utils.py (3)
dascore/io/ap_sensing/utils.py (1)
  • _get_patches (90-101)
dascore/io/silixah5/utils.py (1)
  • _get_patches (104-117)
dascore/compat.py (1)
  • array (31-37)
dascore/io/dasdae/core.py (3)
dascore/io/dasdae/utils.py (2)
  • _kwargs_empty (186-192)
  • _read_patch (155-167)
dascore/core/patch.py (2)
  • data (272-283)
  • size (302-315)
dascore/core/coords.py (2)
  • data (606-608)
  • size (473-475)
🔇 Additional comments (6)
dascore/io/sintela_binary/utils.py (3)

216-218: LGTM! Signature aligns with broader refactoring.

The function rename to _get_patches (plural) and the addition of time and distance parameters are consistent with similar changes across other IO modules (e.g., ap_sensing/utils.py, silixah5/utils.py). This standardizes the API for selective data loading.


224-227: LGTM! Early return prevents empty patches efficiently.

The empty-data guard correctly returns an empty list when slicing yields no data, avoiding unnecessary Patch creation and memory allocation. This placement (after slicing the memmap, before loading into memory) maintains optimal memory efficiency.


229-229: All references to _get_patch have been removed; call sites now use _get_patches returning lists.

dascore/io/dasdae/utils.py (1)

186-192: LGTM! Clean helper for filtering selection kwargs.

The implementation correctly identifies effectively empty kwargs by filtering out file metadata keys and None values. The explicit bool() conversion makes the emptiness check clear and readable.

dascore/io/dasdae/core.py (2)

24-24: LGTM! Import of the helper function.

The import is correctly placed with other utility imports.


121-124: LGTM! Correctly filters empty patches from selection results.

The logic properly skips patches with empty data when selection criteria were applied (non-empty kwargs), while preserving empty patches when no selection was performed. This prevents returning empty patches from unsuccessful band selections while maintaining backward compatibility for direct reads.

The condition not patch.data.size and not _kwargs_empty(kwargs) correctly captures the intended behavior: skip only when both the data is empty AND meaningful selection criteria were provided.
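Read as a small truth table, with stand-ins for the real helper (this is not the dasdae implementation, just the logic):

def _kwargs_empty(kwargs):
    # Stand-in: drop None values (the real helper also drops metadata keys).
    return not {k: v for k, v in kwargs.items() if v is not None}

def should_skip(data_size, kwargs):
    return data_size == 0 and not _kwargs_empty(kwargs)

assert should_skip(0, {"time": (1, 2)})       # empty data + selection -> skip
assert not should_skip(0, {"time": None})     # full read keeps empty patches
assert not should_skip(10, {"time": (1, 2)})  # data present -> keep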

Comment on lines +48 to +55
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_function",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}

⚠️ Potential issue | 🟠 Major

Field mismatch: map WindowFunction to window_type

ProdMLFbePatchAttrs defines window_type, but mapping uses window_function. This leaves window_type unset.

 _FBE_PARENT_ATTRS = {
     "RawReference": "raw_reference",
     "TransformSize": "transform_size",
     "TransformType": "transform_type",
-    "WindowFunction": "window_function",
+    "WindowFunction": "window_type",
     "WindowOverlap": "window_overlap",
     "WindowSize": "window_size",
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_function",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_type",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}
🤖 Prompt for AI Agents
In dascore/io/prodml/utils.py around lines 48 to 55, the _FBE_PARENT_ATTRS
mapping incorrectly maps "WindowFunction" to "window_function" but
ProdMLFbePatchAttrs expects "window_type"; update the mapping key-value pair to
map "WindowFunction" to "window_type" so the attribute is set correctly, and run
tests or linters that validate ProdMLFbePatchAttrs to ensure no other mismatches
remain.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)

134-143: Fix divide-by-zero when only one time sample.

array_len == 1 makes (array_len - 1) zero; guard and return direct coord.

Apply this diff:

   array_len = len(time_array)
   assert array_len > 0, "Missing time array in ProdML file."
-  time_attrs = time_array.attrs
+  # Special case: a single timestamp -> return array-based coord.
+  if array_len == 1:
+      time_arr = time_array[:].astype("datetime64[us]")
+      return get_coord(data=time_arr)
+  time_attrs = time_array.attrs
   start_str = unbyte(time_attrs["PartStartTime"]).split("+")[0]
   start = dc.to_datetime64(start_str.rstrip("Z"))
   end_str = unbyte(time_attrs["PartEndTime"]).split("+")[0]
   end = dc.to_datetime64(end_str.rstrip("Z"))
   step = (end - start) / (array_len - 1)
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)

218-229: Read RawDataUnit from the RawData dataset, not the group.

Many ProdMLs store units on the RawData dataset. Keep data_type from group; fetch units from dataset.

Apply this diff:

 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
-    info.update(_get_data_unit_and_type(node_info.node))
+    # data_type from group (eg RawDescription), units from dataset (RawDataUnit)
+    info.update(_get_data_unit_and_type(node_info.node))
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords

231-252: Cast StartLocusIndex/NumberOfLoci to int for FBE distance slice.

Avoid byte/str/NumPy-scalar issues in slicing; previous review noted this.

Apply this diff:

-    start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
-    total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+    start_ind = int(node_info.parent_node.attrs.get("StartLocusIndex", 0))
+    total_inds = int(node_info.parent_node.attrs["NumberOfLoci"])
     # Put the coords back together.
     distance = d_coord[start_ind : start_ind + total_inds]

Also verify WindowOverlap=0 is preserved: maybe_get_items skips falsy values. If needed, set explicitly:

-    out.update(maybe_get_items(node_info.parent_node.attrs, _FBE_PARENT_ATTRS))
+    out.update(maybe_get_items(node_info.parent_node.attrs, _FBE_PARENT_ATTRS))
+    if "WindowOverlap" in node_info.parent_node.attrs:
+        out["window_overlap"] = int(node_info.parent_node.attrs["WindowOverlap"])
🧹 Nitpick comments (4)
dascore/io/dasdae/utils.py (1)

186-193: LGTM! Clean implementation of empty kwargs detection.

The logic correctly distinguishes between metadata keys and actual selection criteria, which aligns well with the empty-data guard pattern described in the PR summary.

Consider these optional enhancements:

  1. Add type hints for clarity:
-def _kwargs_empty(kwargs):
+def _kwargs_empty(kwargs: dict) -> bool:
     """Determine if the keyword arguments are effectively empty."""
  2. Extract the metadata keys as a module-level constant for reusability:
# Near the top of the file
_METADATA_KEYS = {"file_version", "file_format", "path"}

def _kwargs_empty(kwargs: dict) -> bool:
    """Determine if the keyword arguments are effectively empty."""
    out = {i: v for i, v in kwargs.items() if v is not None and i not in _METADATA_KEYS}
    return not bool(out)
dascore/io/sintela_binary/utils.py (1)

216-218: Unused **kwargs parameter.

The **kwargs parameter is declared but never used in the function body. If it's intended for future extensibility or API consistency with similar functions in the codebase, consider adding a comment to document this intention.

dascore/io/prodml/utils.py (2)

119-125: HDF5 attr types: cast to native ints/floats to avoid surprises.

Attrs can be NumPy scalars/bytes; explicit casts make math/slicing robust.

Apply this diff:

-    step = attrs["SpatialSamplingInterval"]
-    num_dist_channels = attrs["NumberOfLoci"]
-    start = attrs["StartLocusIndex"] * step
+    step = float(attrs["SpatialSamplingInterval"])
+    num_dist_channels = int(attrs["NumberOfLoci"])
+    start_ind = int(attrs.get("StartLocusIndex", 0))
+    start = start_ind * step
     stop = start + num_dist_channels * step

298-309: Avoid double traversal of HDF5; compute attrs+data in one pass.

Current zip() walks _yield_data_nodes twice. Single-pass reduces IO and complexity.

Apply this diff:

 def _read_prodml(fi, distance=None, time=None):
     """Read the prodml values into a patch."""
     out = []
-    iterator = zip(_yield_prodml_attrs_coords(fi), _yield_data_nodes(fi), strict=True)
-    for (attrs, cm), info in iterator:
-        data_func = _NODE_DATA_PROCESSORS[info.patch_type]
-        data = data_func(info)
+    acq = fi["Acquisition"]
+    base_info = maybe_get_items(acq.attrs, _ROOT_ATTRS)
+    d_coord = _get_distance_coord(acq)
+    for info in _yield_data_nodes(fi):
+        attr_func = _NODE_ATTRS_PROCESSORS[info.patch_type]
+        attrs, cm = attr_func(info, d_coord, base_info)
+        data = _NODE_DATA_PROCESSORS[info.patch_type](info)
         if time is not None or distance is not None:
             cm, data = cm.select(array=data, time=time, distance=distance)
         if data.size:
             out.append(dc.Patch(data=data, attrs=attrs, coords=cm))
     return out
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 71c2dba and ed4f198.

📒 Files selected for processing (5)
  • dascore/io/dasdae/core.py (2 hunks)
  • dascore/io/dasdae/utils.py (1 hunks)
  • dascore/io/prodml/utils.py (6 hunks)
  • dascore/io/sintela_binary/utils.py (2 hunks)
  • tests/test_io/test_dasdae/test_dasdae.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_io/test_dasdae/test_dasdae.py
  • dascore/io/dasdae/core.py
🧰 Additional context used
🧬 Code graph analysis (2)
dascore/io/sintela_binary/utils.py (3)
dascore/io/ap_sensing/utils.py (1)
  • _get_patches (90-101)
dascore/io/silixah5/utils.py (1)
  • _get_patches (104-117)
dascore/compat.py (1)
  • array (31-37)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (4)
  • get_coord (1039-1058)
  • keys (1027-1029)
  • update (240-308)
  • get_coord_manager (1127-1215)
dascore/core/coords.py (4)
  • get_coord (1515-1703)
  • update (738-751)
  • update (965-967)
  • data (606-608)
dascore/utils/misc.py (3)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
🔇 Additional comments (2)
dascore/io/sintela_binary/utils.py (2)

220-228: Verify coordinate consistency between attrs and explicit coords parameter.

The patch_attrs object is created at line 220 before any slicing occurs. When slicing is applied at line 225, cm is updated to the sliced coordinate manager, but patch_attrs retains the original (unsliced) coordinates. At line 228, the Patch is constructed with both coords=cm (sliced) and attrs=patch_attrs (containing original coords).

This differs from the pattern in ap_sensing/utils.py (lines 89-100) and silixah5/utils.py (lines 103-116), where attrs["coords"] is explicitly updated after slicing, and then the attrs object is validated.

Please verify that the Patch constructor correctly handles this scenario, ensuring that the explicit coords parameter takes precedence and that patch.attrs.coords is synchronized with patch.coords to avoid inconsistencies.

Consider refactoring to match the pattern in similar functions by keeping attrs as a dict, updating attrs["coords"] after slicing, and then validating:

 def _get_patches(
     resource, attr_class=PatchAttrs, extras=None, time=None, distance=None, **kwargs
 ):
     """Get the patch from the sintela file."""
-    patch_attrs, cm, header = _get_attrs_coords_header(resource, attr_class, extras)
+    _, cm, header = _get_attrs_coords_header(resource, attr_class, extras)
+    attrs = _get_attr_dict(header, extras)
+    attrs["coords"] = cm
     data = _load_data(resource, header)
     # Apply slicing if needed. This is done before patch creation so a memmap
     # can be sliced before loading all data into memory.
     if time is not None or distance is not None:
         cm, data = cm.select(data, time=time, distance=distance)
+        attrs["coords"] = cm
     if not data.size:  # no data in this slice.
         return []
+    patch_attrs = attr_class.model_validate(attrs)
     patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     return [patch]

226-229: LGTM: Empty data handling and list-based return.

The empty data check at lines 226-227 correctly returns an empty list when the sliced data has no elements, avoiding unnecessary patch creation. The return type change at line 229 (wrapping the patch in a list) aligns with the broader repository refactoring to standardize patch retrieval across multiple IO modules, as seen in similar functions in ap_sensing/utils.py and silixah5/utils.py.

Comment on lines +48 to +55
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_function",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}

⚠️ Potential issue | 🟠 Major

window_type never populated (mapping mismatch).

_FBE_PARENT_ATTRS maps WindowFunction to window_function, but the attrs model defines window_type. The value is lost.

Apply this diff:

 _FBE_PARENT_ATTRS = {
     "RawReference": "raw_reference",
     "TransformSize": "transform_size",
     "TransformType": "transform_type",
-    "WindowFunction": "window_function",
+    "WindowFunction": "window_type",
     "WindowOverlap": "window_overlap",
     "WindowSize": "window_size",
 }

Optionally, add a field for transform_type in ProdMLFbePatchAttrs if you want it validated instead of becoming an extra.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_function",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
"WindowFunction": "window_type",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}
🤖 Prompt for AI Agents
In dascore/io/prodml/utils.py around lines 48 to 55, the mapping uses
"WindowFunction": "window_function" which doesn't match the attrs model that
defines window_type, so the window type value is dropped; update the mapping to
"WindowFunction": "window_type" so the attribute is populated correctly, and
optionally add a transform_type field to the ProdMLFbePatchAttrs model if you
want transform_type validated rather than treated as an extra.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)

224-229: Still missing raw data units (read from RawData dataset).

Line 225 pulls units from the group, but ProdML files typically store RawDataUnit on the RawData dataset. That leaves data_units empty for raw patches. Please fetch units from node_info.node["RawData"].attrs, falling back to the group only if absent.

Apply this diff:

-    info.update(_get_data_unit_and_type(node_info.node))
+    raw_dataset = node_info.node["RawData"]
+    info.update(_get_data_unit_and_type(raw_dataset))
+    if "data_units" not in info:
+        info.update(_get_data_unit_and_type(node_info.node))

245-248: Convert FBE indices to integers before slicing.

StartLocusIndex / NumberOfLoci often arrive as byte strings or zero-d arrays. Using them directly in distance[start_ind:...] raises TypeError. Decode/collapse first (e.g., int(unbyte(...)) and int(np.asarray(...).item())) before slicing.

Apply this diff:

-    start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
-    total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+    start_raw = node_info.parent_node.attrs.get("StartLocusIndex", 0)
+    total_raw = node_info.parent_node.attrs["NumberOfLoci"]
+    start_ind = int(np.asarray(unbyte(start_raw)) if isinstance(start_raw, (bytes, np.bytes_)) else np.asarray(start_raw).item())
+    total_inds = int(np.asarray(unbyte(total_raw)) if isinstance(total_raw, (bytes, np.bytes_)) else np.asarray(total_raw).item())
🧹 Nitpick comments (1)
dascore/io/dasdae/utils.py (1)

190-197: LGTM! Consider adding a type hint.

The function logic is correct and handles all edge cases properly. The approach of filtering kwargs before checking emptiness is clear and maintainable.

Optionally, you could add a type hint for the kwargs parameter to improve type safety:

-def _kwargs_empty(kwargs) -> bool:
+def _kwargs_empty(kwargs: dict) -> bool:

Or if you want to be more specific:

from typing import Any

def _kwargs_empty(kwargs: dict[str, Any]) -> bool:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed4f198 and 79408d2.

📒 Files selected for processing (2)
  • dascore/io/dasdae/utils.py (2 hunks)
  • dascore/io/prodml/utils.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (5)
  • get_coord (1039-1058)
  • step (1068-1070)
  • update (240-308)
  • get_coord_manager (1127-1215)
  • select (594-636)
dascore/core/coords.py (9)
  • get_coord (1515-1703)
  • update (738-751)
  • update (965-967)
  • data (606-608)
  • select (323-329)
  • select (997-1017)
  • select (1149-1174)
  • select (1320-1346)
  • select (1431-1456)
dascore/utils/misc.py (4)
  • iterate (257-269)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
🔇 Additional comments (1)
dascore/io/dasdae/utils.py (1)

15-16: LGTM!

The constant is well-defined with a clear purpose. Using a set is appropriate for efficient membership testing.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)

74-85: Rename window_function to window_type to match expected field name.

Past reviews flagged that WindowFunction should map to window_type, not window_function. The class definition should use window_type to align with this expectation.

Apply this diff:

 class ProdMLFbePatchAttrs(ProdMLRawPatchAttrs):
     """Patch attrs for fbe (frequency band extracted) data in Prodml."""
 
     raw_reference: UTF8Str = ""
     transform_size: float = np.nan
     transform_type: UTF8Str = ""
     window_size: int | None = None
-    window_function: UTF8Str = ""
+    window_type: UTF8Str = ""
     window_overlap: int | None = None
     start_frequency: float = 0
     end_frequency: float = np.inf

Also update the mapping at line 52:

 _FBE_PARENT_ATTRS = {
     "RawReference": "raw_reference",
     "TransformSize": "transform_size",
     "TransformType": "transform_type",
-    "WindowFunction": "window_function",
+    "WindowFunction": "window_type",
     "WindowOverlap": "window_overlap",
     "WindowSize": "window_size",
 }

220-231: Read units from RawData dataset for consistency.

Line 225 reads units from the group, but line 228 reads dimensions from the RawData dataset. For consistency and correctness (as flagged in past reviews), units should also be read from the dataset.

Apply this diff:

 @register_func(_NODE_ATTRS_PROCESSORS, key="raw")
 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
     info.update(_get_data_unit_and_type(node_info.node))
+    # Override with dataset-level attributes if present
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords
🧹 Nitpick comments (2)
dascore/io/prodml/utils.py (2)

245-257: Simplify the type conversion logic.

The current type conversion for start_ind and total_inds is unnecessarily verbose. Since unbyte returns a string for bytes, you can simplify the conversion.

Apply this diff:

-    start_ind = int(
-        np.asarray(unbyte(start_raw))
-        if isinstance(start_raw, bytes | np.bytes_)
-        else np.asarray(start_raw).item()
-    )
-    total_inds = int(
-        np.asarray(unbyte(total_raw))
-        if isinstance(total_raw, bytes | np.bytes_)
-        else np.asarray(total_raw).item()
-    )
+    start_ind = int(unbyte(start_raw) if isinstance(start_raw, bytes | np.bytes_) else start_raw)
+    total_inds = int(unbyte(total_raw) if isinstance(total_raw, bytes | np.bytes_) else total_raw)

266-274: Consider replacing assert with explicit exception.

Line 271 uses assert for validation, which can be disabled with Python's -O flag. For production code, prefer raising an explicit exception.

 @register_func(_NODE_DATA_PROCESSORS, key="raw")
 def _get_raw_data(node_info):
     """Get the data from a raw node."""
     node = node_info.node
     data_node = [x for x in list(node) if x.lower().endswith("data")]
-    assert len(data_node) == 1, "more than one data node found."
+    if len(data_node) != 1:
+        msg = f"Expected exactly one data node, found {len(data_node)} in {node_info.name}"
+        raise ValueError(msg)
     data = node[data_node[0]]
     return data
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79408d2 and a0aa431.

📒 Files selected for processing (1)
  • dascore/io/prodml/utils.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (5)
  • get_coord (1039-1058)
  • step (1068-1070)
  • keys (1027-1029)
  • update (240-308)
  • get_coord_manager (1127-1215)
dascore/core/coords.py (4)
  • get_coord (1515-1703)
  • update (738-751)
  • update (965-967)
  • data (606-608)
dascore/utils/misc.py (4)
  • iterate (257-269)
  • maybe_get_items (438-460)
  • register_func (35-58)
  • unbyte (370-374)
dascore/core/attrs.py (4)
  • PatchAttrs (65-301)
  • items (181-183)
  • get (174-179)
  • update (234-255)
🔇 Additional comments (4)
dascore/io/prodml/utils.py (4)

27-55: LGTM! Well-structured attribute mappings.

The root and FBE attribute mappings handle multiple naming conventions (e.g., PulseWidthUnit, PulseWidthUnits, PulseWidth.uom) which accommodates different ProdML schema versions.
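A simplified sketch of how such tolerant mappings resolve (this mirrors dascore.utils.misc.maybe_get_items in spirit, not its exact implementation; the attrs dict is invented):

_ROOT_ATTRS = {
    "PulseWidth": "pulse_width",
    "PulseWidthUnit": "pulse_width_units",
    "PulseWidthUnits": "pulse_width_units",  # alternate schema spelling
    "PulseWidth.uom": "pulse_width_units",   # dotted-uom variant
}

def maybe_get_items(attrs, mapping):
    # Keep only source keys that are present (and truthy).
    return {new: attrs[old] for old, new in mapping.items() if attrs.get(old)}

attrs = {"PulseWidth": 20.0, "PulseWidth.uom": "ns"}
print(maybe_get_items(attrs, _ROOT_ATTRS))  # {'pulse_width': 20.0, 'pulse_width_units': 'ns'}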


160-203: LGTM! Well-structured node traversal with empty-data handling.

The iterator-based approach cleanly separates raw and FBE node traversal, and the empty data check (lines 181-182) ensures only valid raw nodes are yielded.


315-329: LGTM! Correct empty-data handling.

The function correctly skips creating patches when selections yield empty data (line 327), which aligns with the PR's objective to return empty spools for out-of-bounds selections.


282-294: LGTM! Single-pass iteration resolves past zip concern.

The refactored single-pass iteration over _yield_data_nodes eliminates the previous concern about missing strict=True on zip.

@d-chambers d-chambers merged commit a7f8e48 into master Oct 13, 2025
24 checks passed
@d-chambers d-chambers deleted the prodml_fbe branch October 13, 2025 14:34
