Prodml fbe #571
Conversation
Walkthrough: Adds empty-data guards across many IO readers to short-circuit when selections yield no data, and renames several internal helpers from `_get_patch` to `_get_patches`.
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master     #571      +/-   ##
==========================================
  Coverage   99.92%   99.92%
==========================================
  Files         127      128       +1
  Lines       10622    10878     +256
==========================================
+ Hits        10614    10870     +256
  Misses          8        8

☔ View full report in Codecov by Sentry.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)
141-156: Guard against single-element time arrays (division-by-zero risk). If `array_len == 1`, `(end - start) / (array_len - 1)` divides by zero. Fall back to using the time array directly.

```diff
-    step = (end - start) / (array_len - 1)
-    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
+    if array_len == 1:
+        time_coord = get_coord(data=np.asarray(time_array[:]), units="s")
+    else:
+        step = (end - start) / (array_len - 1)
+        time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
```
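The guarded computation can be sketched outside of dascore. Here `build_time_coord` is a hypothetical stand-in for the real `get_coord` call, meant only to show why the singleton branch is needed:

```python
import numpy as np

def build_time_coord(start, end, array_len):
    # Hypothetical stand-in for the guarded get_coord logic above.
    if array_len == 1:
        # One sample: no step can be derived, so use the lone timestamp.
        return np.array([start], dtype="datetime64[us]")
    step = (end - start) / (array_len - 1)
    # Evenly spaced timestamps covering [start, end].
    return start + np.arange(array_len) * step
```

Without the guard, calling this with `array_len=1` would raise a division error before any coordinate is built.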
🧹 Nitpick comments (4)
dascore/io/terra15/core.py (1)
73-74: Consider checking before patch creation for efficiency. The empty-data guard works correctly but checks `patch.data.size` after the patch has been created by `_read_terra15`. Other IO modules in this PR check before creating the patch, which is more efficient. If feasible, consider adding the empty-data check within `_read_terra15` or checking the data before patch construction.

dascore/io/sintela_binary/utils.py (1)
224-229: Move the empty-data check before Patch creation. The current implementation creates a Patch object before checking if the data is empty. This is inefficient because Patch creation can be expensive for memory-mapped data. Move the empty-data guard before the Patch instantiation to avoid unnecessary allocation.

Apply this diff:

```diff
     if time is not None or distance is not None:
         cm, data = cm.select(data, time=time, distance=distance)
-    patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     if not data.size:  # no data in this slice.
         return []
+    patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     return [patch]
```

dascore/io/dasdae/core.py (1)
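The saving from reordering the guard is easy to demonstrate with a counting stand-in for `dc.Patch` (a hypothetical class used only for illustration, assuming numpy):

```python
import numpy as np

class CountingPatch:
    """Hypothetical stand-in for dc.Patch that counts constructions."""
    constructed = 0

    def __init__(self, data):
        CountingPatch.constructed += 1
        self.data = np.asarray(data)

def get_patches(data, time_mask):
    data = data[time_mask]
    if not data.size:  # guard fires before any Patch is built
        return []
    return [CountingPatch(data)]
```

With the guard first, an out-of-range selection never pays for object construction at all.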
120-123: Document empty-patch guard rationale

- Add an inline comment above the `if not patch.data.size and kwargs` guard explaining that empty patches are preserved on full reads (for round-trip support) but skipped only when filters (kwargs) are provided.
- Add a test for `dc.read(path, time=…)` (or other selection kwargs) to assert that empty patches are correctly filtered out under selection.

dascore/io/prodml/utils.py (1)
264-269: Avoid eager loading of FBE data. Returning `node[:]` loads full arrays into memory. Return the dataset and let downstream selection handle slicing/laziness, matching raw behavior.

```diff
-    node = node_info.node
-    return node[:]
+    return node_info.node
```
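The lazy-versus-eager distinction can be illustrated with numpy's memmap as a rough analogy (the HDF5 node itself is not used here; this only sketches deferred versus materialized reads):

```python
import os
import tempfile
import numpy as np

# Write a small binary file, then open it without loading it into memory.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(1000, dtype=np.float64).tofile(path)

lazy = np.memmap(path, dtype=np.float64, mode="r")  # like returning the dataset
eager = np.array(lazy)  # like node[:]: a full in-memory copy
```

Slicing `lazy` touches only the requested bytes, while `eager` has already paid the full read cost, which is exactly what the review suggests avoiding.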
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)

- dascore/data_registry.txt (1 hunks)
- dascore/io/ap_sensing/core.py (2 hunks)
- dascore/io/ap_sensing/utils.py (1 hunks)
- dascore/io/dasdae/core.py (1 hunks)
- dascore/io/dashdf5/core.py (1 hunks)
- dascore/io/febus/utils.py (3 hunks)
- dascore/io/gdr/core.py (1 hunks)
- dascore/io/h5simple/core.py (1 hunks)
- dascore/io/neubrex/core.py (2 hunks)
- dascore/io/optodas/utils.py (1 hunks)
- dascore/io/prodml/core.py (3 hunks)
- dascore/io/prodml/utils.py (6 hunks)
- dascore/io/segy/core.py (1 hunks)
- dascore/io/silixah5/core.py (2 hunks)
- dascore/io/silixah5/utils.py (1 hunks)
- dascore/io/sintela_binary/core.py (2 hunks)
- dascore/io/sintela_binary/utils.py (2 hunks)
- dascore/io/tdms/core.py (1 hunks)
- dascore/io/terra15/core.py (1 hunks)
- tests/test_io/test_common_io.py (1 hunks)
- tests/test_io/test_dasdae/test_dasdae.py (1 hunks)
🪛 Ruff (0.13.3)

dascore/io/prodml/utils.py

302-302: `zip()` without an explicit `strict=` parameter. Add an explicit value for parameter `strict=` (B905).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: Run benchmarks
🔇 Additional comments (21)
dascore/data_registry.txt (1)
34-34: LGTM! The new data registry entry for `prodml_fbe_1.h5` follows the established format and aligns with the PR's objective to add FBE support.

dascore/io/optodas/utils.py (1)

87-88: LGTM! The empty-data guard correctly prevents patch creation when selection yields no data, returning an empty list consistent with the function's return type.
dascore/io/gdr/core.py (1)
59-60: LGTM! The empty-data guard appropriately short-circuits processing when trimming results in no data, avoiding unnecessary patch creation.

dascore/io/h5simple/core.py (1)

41-42: LGTM with observation. The empty-data guard correctly prevents patch creation when trimming yields no data. Note that this implementation checks `new_cm.size` (coordinate manager size) rather than `new_data.size` (data array size) as seen in other readers. Both approaches should be equivalent when coordinates and data are properly aligned.

dascore/io/tdms/core.py (1)

62-63: LGTM! The empty-data guard appropriately short-circuits when coordinate selection yields no data, consistent with the pattern applied across other IO modules.
62-63: LGTM!The empty-data guard appropriately short-circuits when coordinate selection yields no data, consistent with the pattern applied across other IO modules.
dascore/io/segy/core.py (2)
50-51: LGTM!The empty-data guard correctly prevents patch creation when filtering yields no data.
54-54: LGTM!Explicitly wrapping the patch in a list (
[patch]) improves clarity and aligns with the pattern used in other IO modules.dascore/io/dashdf5/core.py (1)
71-72: LGTM! The empty-data guard appropriately returns an empty spool when coordinate selection yields no data, consistent with the unified pattern applied across IO modules in this PR.
dascore/io/neubrex/core.py (1)
74-75: LGTM! The empty-data guards are properly placed after data trimming and consistently return empty spools when no data remains. This aligns with the PR's objective to handle out-of-range selections gracefully across all IO readers.
Also applies to: 121-122
tests/test_io/test_dasdae/test_dasdae.py (1)
133-142: LGTM! Clear documentation of the breaking change. The xfail marker and explanatory comment properly document why empty patches can no longer round-trip. The note about this being an obscure edge case that won't affect real-world usage is helpful context.
dascore/io/sintela_binary/core.py (1)
18-18: LGTM! The import and call site updates correctly adapt to the renamed `_get_patches` function, which now returns a list instead of a single Patch.

Also applies to: 77-79
dascore/io/silixah5/utils.py (1)
104-117: LGTM! The refactoring correctly:

- Renames the function to `_get_patches`, reflecting the list return type
- Checks for empty data before creating the Patch object (efficient)
- Wraps the result in a list for consistent return type
- Follows the same pattern as other readers in this PR
dascore/io/silixah5/core.py (1)
14-14: LGTM! The changes correctly update the import and call site to use the refactored `_get_patches` function. The variable naming (`patches`) now accurately reflects that it's a list.

Also applies to: 65-68
tests/test_io/test_common_io.py (1)
353-365: LGTM! The test correctly verifies that reading with a time window entirely after the file's time range returns an empty spool. The logic is sound:
- Scans file to determine max time
- Adds a one-second offset to ensure the query is beyond available data
- Asserts empty spool is returned
This aligns with the PR's objective to handle out-of-range selections gracefully.
dascore/io/febus/utils.py (3)
231-237: Good early bail-out on empty time selection. Avoids min/max on empty arrays and unnecessary work. LGTM.

263-268: Graceful handling when distance is fully sliced out. Returning a 2D zero-length-distance array is consistent with empty-spool behavior. LGTM.

277-279: Skip creating patches for empty data. Correctly avoids empty patches. LGTM.
dascore/io/prodml/core.py (2)
54-57: Scan integration looks good. Iterating the generator and updating coords on attrs is correct. LGTM.

67-68: Read path aligned with new utils. Delegates to `_read_prodml` and spools. LGTM.
dascore/io/ap_sensing/utils.py (1)
98-99: Empty selection guard is correct. Returns `[]` when slicing removes all data. Matches other IO behaviors. LGTM.
dascore/io/ap_sensing/core.py (1)
64-66: Updated to `_get_patches`. Call site matches the new list-returning helper. LGTM.
CodSpeed Performance Report

Merging #571 will degrade performance by 16.27%.

This looks bad, but the real issue isn't that the code is 16% slower; rather, I added another test data file. Predictably, the benchmarks that scan all the test data files take longer when they have to scan more files. Running the benchmark again after removing the new file, the results on this branch and on master are effectively the same, so the performance change really isn't significant.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)
130-155: Single-sample time array: guard divide-by-zero. If the time array length is 1, computing the step with `(array_len - 1)` divides by zero. Handle the singleton case explicitly.

Apply this diff:

```diff
-    step = (end - start) / (array_len - 1)
-    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
+    if array_len == 1:
+        time_coord = get_coord(data=np.asarray(time_array[:]).astype("datetime64[us]"))
+        return time_coord
+    step = (end - start) / (array_len - 1)
+    time_coord = get_coord(start=start, stop=end + step, step=step, units="s")
```
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)
218-229: Raw units source: read from the RawData dataset, not the group. RawDataUnit typically lives on the RawData dataset. Use dataset attrs (with group as fallback) to populate units.

```diff
 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
-    info.update(_get_data_unit_and_type(node_info.node))
+    # Get type from group; units from dataset (fallback to group).
+    info.update(_get_data_unit_and_type(node_info.node))
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords
```

Based on learnings
231-252: FBE distance slice: ensure integer indices. Cast StartLocusIndex and NumberOfLoci to int before slicing. This prevents type issues when attrs are bytes/strings or numpy scalars.

```diff
 def _get_processed_node_attr_coords(node_info, d_coord, base_info):
@@
-    start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
-    total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+    start_ind = int(unbyte(node_info.parent_node.attrs.get("StartLocusIndex", 0)))
+    total_inds = int(unbyte(node_info.parent_node.attrs["NumberOfLoci"]))
@@
     distance = d_coord[start_ind : start_ind + total_inds]
```
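Why the cast matters: slice arithmetic works with native ints but not with byte-string attrs. A quick check, assuming attrs may arrive as numpy scalars or bytes:

```python
import numpy as np

d_coord = list(range(10))
start_ind = np.int64(2)   # numpy scalar, as HDF5 attrs often arrive
total = b"3"              # byte string, as some writers store attrs

# start_ind + total would raise TypeError; casting first is safe.
start = int(start_ind)
n = int(total)            # int() accepts bytes digit strings directly
sliced = d_coord[start : start + n]
```

Normalizing to `int` up front makes the slice expression robust regardless of how the file's writer stored the attribute.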
🧹 Nitpick comments (3)
dascore/io/prodml/utils.py (3)
254-269: Avoid eager read of FBE data; keep it lazy and consistent with raw. Return the dataset instead of materializing it; selection will slice as needed.

```diff
 @register_func(_NODE_DATA_PROCESSORS, key="fbe")
 def _get_fbe_data(node_info):
     """Get the data from a fbe node."""
-    node = node_info.node
-    return node[:]
+    return node_info.node
```
285-296: Normalize dim names: include a 'Distance' mapping. Some files use 'Distance' in Dimensions; map it to 'distance' for consistency.

```diff
 def _get_dims_from_attrs(attrs):
@@
-    map_ = {"locus": "distance", "Locus": "distance", "Time": "time"}
+    map_ = {"locus": "distance", "Locus": "distance", "Distance": "distance", "Time": "time"}
```
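A runnable sketch of the suggested mapping; `normalize_dims` is a hypothetical helper (the real function also reads the `Dimensions` attribute from the HDF5 node):

```python
def normalize_dims(raw_dims):
    # Hypothetical helper mirroring the suggested map_ above;
    # unknown names fall back to a lowercased form.
    map_ = {
        "locus": "distance",
        "Locus": "distance",
        "Distance": "distance",
        "Time": "time",
    }
    return tuple(map_.get(d, d.lower()) for d in raw_dims)
```

This keeps downstream code working with a single canonical `("time", "distance")` pair no matter which spelling the writer used.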
299-306: Good: strict zip added; consider a single pass to avoid double traversal. `zip(strict=True)` is correct. To avoid scanning nodes twice, iterate once and compute attrs/data per node.

Example refactor:

```python
def _read_prodml(fi, distance=None, time=None):
    out = []
    acq = fi["Acquisition"]
    base_info = maybe_get_items(acq.attrs, _ROOT_ATTRS)
    d_coord = _get_distance_coord(acq)
    for info in _yield_data_nodes(fi):
        attr_func = _NODE_ATTRS_PROCESSORS[info.patch_type]
        data_func = _NODE_DATA_PROCESSORS[info.patch_type]
        attrs, cm = attr_func(info, d_coord, base_info)
        data = data_func(info)
        if time is not None or distance is not None:
            cm, data = cm.select(array=data, time=time, distance=distance)
        if data.size:
            out.append(dc.Patch(data=data, attrs=attrs, coords=cm))
    return out
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)

- dascore/io/dasdae/core.py (2 hunks)
- dascore/io/dasdae/utils.py (1 hunks)
- dascore/io/prodml/utils.py (6 hunks)
- dascore/io/sintela_binary/utils.py (2 hunks)
- tests/test_io/test_dasdae/test_dasdae.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/test_io/test_dasdae/test_dasdae.py
🔇 Additional comments (6)
dascore/io/sintela_binary/utils.py (3)
216-218: LGTM! Signature aligns with the broader refactoring. The function rename to `_get_patches` (plural) and the addition of `time` and `distance` parameters are consistent with similar changes across other IO modules (e.g., `ap_sensing/utils.py`, `silixah5/utils.py`). This standardizes the API for selective data loading.

224-227: LGTM! Early return prevents empty patches efficiently. The empty-data guard correctly returns an empty list when slicing yields no data, avoiding unnecessary Patch creation and memory allocation. This placement (after slicing the memmap, before loading into memory) maintains optimal memory efficiency.

229-229: All references to `_get_patch` have been removed; call sites now use `_get_patches`, returning lists.

dascore/io/dasdae/utils.py (1)
186-192: LGTM! Clean helper for filtering selection kwargs. The implementation correctly identifies effectively empty kwargs by filtering out file metadata keys and None values. The explicit `bool()` conversion makes the emptiness check clear and readable.

dascore/io/dasdae/core.py (2)

24-24: LGTM! Import of the helper function. The import is correctly placed with other utility imports.

121-124: LGTM! Correctly filters empty patches from selection results. The logic properly skips patches with empty data when selection criteria were applied (non-empty kwargs), while preserving empty patches when no selection was performed. This prevents returning empty patches from unsuccessful band selections while maintaining backward compatibility for direct reads. The condition `not patch.data.size and not _kwargs_empty(kwargs)` correctly captures the intended behavior: skip only when both the data is empty AND meaningful selection criteria were provided.
The mapping under review in dascore/io/prodml/utils.py:

```python
_FBE_PARENT_ATTRS = {
    "RawReference": "raw_reference",
    "TransformSize": "transform_size",
    "TransformType": "transform_type",
    "WindowFunction": "window_function",
    "WindowOverlap": "window_overlap",
    "WindowSize": "window_size",
}
```
Field mismatch: map WindowFunction to window_type

ProdMLFbePatchAttrs defines `window_type`, but the mapping uses `window_function`. This leaves `window_type` unset.

```diff
 _FBE_PARENT_ATTRS = {
     "RawReference": "raw_reference",
     "TransformSize": "transform_size",
     "TransformType": "transform_type",
-    "WindowFunction": "window_function",
+    "WindowFunction": "window_type",
     "WindowOverlap": "window_overlap",
     "WindowSize": "window_size",
 }
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dascore/io/prodml/utils.py (1)
134-143: Fix divide-by-zero when there is only one time sample. `array_len == 1` makes `(array_len - 1)` zero; guard and return a direct coord.

Apply this diff:

```diff
     array_len = len(time_array)
     assert array_len > 0, "Missing time array in ProdML file."
-    time_attrs = time_array.attrs
+    # Special case: a single timestamp -> return array-based coord.
+    if array_len == 1:
+        time_arr = time_array[:].astype("datetime64[us]")
+        return get_coord(data=time_arr)
+    time_attrs = time_array.attrs
     start_str = unbyte(time_attrs["PartStartTime"]).split("+")[0]
     start = dc.to_datetime64(start_str.rstrip("Z"))
     end_str = unbyte(time_attrs["PartEndTime"]).split("+")[0]
     end = dc.to_datetime64(end_str.rstrip("Z"))
     step = (end - start) / (array_len - 1)
```
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)
218-229: Read RawDataUnit from the RawData dataset, not the group. Many ProdML files store units on the RawData dataset. Keep `data_type` from the group; fetch units from the dataset.

Apply this diff:

```diff
 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
-    info.update(_get_data_unit_and_type(node_info.node))
+    # data_type from group (eg RawDescription), units from dataset (RawDataUnit)
+    info.update(_get_data_unit_and_type(node_info.node))
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords
```
231-252: Cast StartLocusIndex/NumberOfLoci to int for the FBE distance slice. Avoids byte/str/NumPy-scalar issues in slicing; the previous review noted this.

Apply this diff:

```diff
-    start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
-    total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+    start_ind = int(node_info.parent_node.attrs.get("StartLocusIndex", 0))
+    total_inds = int(node_info.parent_node.attrs["NumberOfLoci"])
     # Put the coords back together.
     distance = d_coord[start_ind : start_ind + total_inds]
```

Also verify WindowOverlap=0 is preserved: `maybe_get_items` skips falsy values. If needed, set it explicitly:

```diff
-    out.update(maybe_get_items(node_info.parent_node.attrs, _FBE_PARENT_ATTRS))
+    out.update(maybe_get_items(node_info.parent_node.attrs, _FBE_PARENT_ATTRS))
+    if "WindowOverlap" in node_info.parent_node.attrs:
+        out["window_overlap"] = int(node_info.parent_node.attrs["WindowOverlap"])
```
🧹 Nitpick comments (4)
dascore/io/dasdae/utils.py (1)
186-193: LGTM! Clean implementation of empty kwargs detection. The logic correctly distinguishes between metadata keys and actual selection criteria, which aligns well with the empty-data guard pattern described in the PR summary.
Consider these optional enhancements:
- Add type hints for clarity:

```diff
-def _kwargs_empty(kwargs):
+def _kwargs_empty(kwargs: dict) -> bool:
     """Determine if the keyword arguments are effectively empty."""
```
- Extract the metadata keys as a module-level constant for reusability:

```python
# Near the top of the file
_METADATA_KEYS = {"file_version", "file_format", "path"}

def _kwargs_empty(kwargs: dict) -> bool:
    """Determine if the keyword arguments are effectively empty."""
    out = {i: v for i, v in kwargs.items() if v is not None and i not in _METADATA_KEYS}
    return not bool(out)
```

dascore/io/sintela_binary/utils.py (1)
216-218: Unused `**kwargs` parameter. The `**kwargs` parameter is declared but never used in the function body. If it's intended for future extensibility or API consistency with similar functions in the codebase, consider adding a comment to document this intention.

dascore/io/prodml/utils.py (2)
119-125: HDF5 attr types: cast to native ints/floats to avoid surprises. Attrs can be NumPy scalars/bytes; explicit casts make math/slicing robust.

Apply this diff:

```diff
-    step = attrs["SpatialSamplingInterval"]
-    num_dist_channels = attrs["NumberOfLoci"]
-    start = attrs["StartLocusIndex"] * step
+    step = float(attrs["SpatialSamplingInterval"])
+    num_dist_channels = int(attrs["NumberOfLoci"])
+    start_ind = int(attrs.get("StartLocusIndex", 0))
+    start = start_ind * step
     stop = start + num_dist_channels * step
```
298-309: Avoid double traversal of HDF5; compute attrs and data in one pass. The current zip() walks `_yield_data_nodes` twice. A single pass reduces IO and complexity.

Apply this diff:

```diff
 def _read_prodml(fi, distance=None, time=None):
     """Read the prodml values into a patch."""
     out = []
-    iterator = zip(_yield_prodml_attrs_coords(fi), _yield_data_nodes(fi), strict=True)
-    for (attrs, cm), info in iterator:
-        data_func = _NODE_DATA_PROCESSORS[info.patch_type]
-        data = data_func(info)
+    acq = fi["Acquisition"]
+    base_info = maybe_get_items(acq.attrs, _ROOT_ATTRS)
+    d_coord = _get_distance_coord(acq)
+    for info in _yield_data_nodes(fi):
+        attr_func = _NODE_ATTRS_PROCESSORS[info.patch_type]
+        attrs, cm = attr_func(info, d_coord, base_info)
+        data = _NODE_DATA_PROCESSORS[info.patch_type](info)
         if time is not None or distance is not None:
             cm, data = cm.select(array=data, time=time, distance=distance)
         if data.size:
             out.append(dc.Patch(data=data, attrs=attrs, coords=cm))
     return out
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
dascore/io/dasdae/core.py(2 hunks)dascore/io/dasdae/utils.py(1 hunks)dascore/io/prodml/utils.py(6 hunks)dascore/io/sintela_binary/utils.py(2 hunks)tests/test_io/test_dasdae/test_dasdae.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_io/test_dasdae/test_dasdae.py
- dascore/io/dasdae/core.py
🔇 Additional comments (2)
dascore/io/sintela_binary/utils.py (2)
220-228: Verify coordinate consistency between attrs and the explicit coords parameter. The `patch_attrs` object is created at line 220 before any slicing occurs. When slicing is applied at line 225, `cm` is updated to the sliced coordinate manager, but `patch_attrs` retains the original (unsliced) coordinates. At line 228, the Patch is constructed with both `coords=cm` (sliced) and `attrs=patch_attrs` (containing the original coords).

This differs from the pattern in `ap_sensing/utils.py` (lines 89-100) and `silixah5/utils.py` (lines 103-116), where `attrs["coords"]` is explicitly updated after slicing, and then the attrs object is validated. Please verify that the Patch constructor correctly handles this scenario, ensuring that the explicit `coords` parameter takes precedence and that `patch.attrs.coords` is synchronized with `patch.coords` to avoid inconsistencies.

Consider refactoring to match the pattern in similar functions by keeping attrs as a dict, updating `attrs["coords"]` after slicing, and then validating:

```diff
 def _get_patches(
     resource, attr_class=PatchAttrs, extras=None, time=None, distance=None, **kwargs
 ):
     """Get the patch from the sintela file."""
-    patch_attrs, cm, header = _get_attrs_coords_header(resource, attr_class, extras)
+    _, cm, header = _get_attrs_coords_header(resource, attr_class, extras)
+    attrs = _get_attr_dict(header, extras)
+    attrs["coords"] = cm
     data = _load_data(resource, header)
     # Apply slicing if needed. This is done before patch creation so a memmap
     # can be sliced before loading all data into memory.
     if time is not None or distance is not None:
         cm, data = cm.select(data, time=time, distance=distance)
+        attrs["coords"] = cm
     if not data.size:  # no data in this slice.
         return []
+    patch_attrs = attr_class.model_validate(attrs)
     patch = dc.Patch(data=array(data), coords=cm, attrs=patch_attrs)
     return [patch]
```
226-229: LGTM: Empty data handling and list-based return.

The empty data check at lines 226-227 correctly returns an empty list when the sliced data has no elements, avoiding unnecessary patch creation. The return type change at line 229 (wrapping the patch in a list) aligns with the broader repository refactoring to standardize patch retrieval across multiple IO modules, as seen in similar functions in ap_sensing/utils.py and silixah5/utils.py.
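The empty-data guard pattern discussed above can be sketched in isolation. The `get_patches` function below is a hypothetical, simplified reader (not DASCore's actual implementation) showing how an out-of-range selection short-circuits into an empty list:

```python
import numpy as np


def get_patches(data, time_slice=None):
    """Return a list of patch-like dicts, or an empty list when the
    selection yields no samples (hypothetical simplified reader)."""
    if time_slice is not None:
        data = data[time_slice]
    if not data.size:
        # Short-circuit: out-of-bounds selections produce an empty list,
        # which callers turn into an empty spool.
        return []
    return [{"data": np.asarray(data)}]
```

Callers that concatenate reader output into a spool then get an empty spool for free when every reader returns `[]`.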
_FBE_PARENT_ATTRS = {
    "RawReference": "raw_reference",
    "TransformSize": "transform_size",
    "TransformType": "transform_type",
    "WindowFunction": "window_function",
    "WindowOverlap": "window_overlap",
    "WindowSize": "window_size",
}
window_type never populated (mapping mismatch).
_FBE_PARENT_ATTRS maps WindowFunction to window_function, but the attrs model defines window_type, so the value is lost.
Apply this diff:
_FBE_PARENT_ATTRS = {
"RawReference": "raw_reference",
"TransformSize": "transform_size",
"TransformType": "transform_type",
- "WindowFunction": "window_function",
+ "WindowFunction": "window_type",
"WindowOverlap": "window_overlap",
"WindowSize": "window_size",
}

Optionally, add a field for transform_type in ProdMLFbePatchAttrs if you want it validated instead of becoming an extra.
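To illustrate why the right-hand side of each entry must match the model field name, here is a minimal stand-in for the mapping-driven extraction. `map_attrs` is a hypothetical simplification of DASCore's `maybe_get_items`, not the real helper:

```python
def map_attrs(source_attrs, mapping):
    """Pull keys named in ``mapping`` from a raw attrs dict, renaming
    them to model field names; keys absent from the source are skipped."""
    return {new: source_attrs[old] for old, new in mapping.items() if old in source_attrs}


# The right-hand values must match the attrs-model field names exactly;
# otherwise the value silently becomes an extra or is dropped.
fbe_mapping = {
    "WindowFunction": "window_type",
    "WindowSize": "window_size",
}

raw = {"WindowFunction": "hann", "WindowSize": 256, "Unrelated": 1}
mapped = map_attrs(raw, fbe_mapping)
```

With the corrected mapping, `mapped` carries `window_type` rather than a `window_function` key the model never declared.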
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
  _FBE_PARENT_ATTRS = {
      "RawReference": "raw_reference",
      "TransformSize": "transform_size",
      "TransformType": "transform_type",
-     "WindowFunction": "window_function",
+     "WindowFunction": "window_type",
      "WindowOverlap": "window_overlap",
      "WindowSize": "window_size",
  }
🤖 Prompt for AI Agents
In dascore/io/prodml/utils.py around lines 48 to 55, the mapping uses
"WindowFunction": "window_function" which doesn't match the attrs model that
defines window_type, so the window type value is dropped; update the mapping to
"WindowFunction": "window_type" so the attribute is populated correctly, and
optionally add a transform_type field to the ProdMLFbePatchAttrs model if you
want transform_type validated rather than treated as an extra.
Actionable comments posted: 0
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)
224-229: Still missing raw data units (read from the RawData dataset).

Line 225 pulls units from the group, but ProdML files typically store RawDataUnit on the RawData dataset. That leaves data_units empty for raw patches. Please fetch units from node_info.node["RawData"].attrs, falling back to the group only if absent.

Apply this diff:

- info.update(_get_data_unit_and_type(node_info.node))
+ raw_dataset = node_info.node["RawData"]
+ info.update(_get_data_unit_and_type(raw_dataset))
+ if "data_units" not in info:
+     info.update(_get_data_unit_and_type(node_info.node))
245-248: Convert FBE indices to integers before slicing.
StartLocusIndex/NumberOfLoci often arrive as byte strings or zero-d arrays. Using them directly in distance[start_ind:...] raises TypeError. Decode/collapse first (e.g., int(unbyte(...)) and int(np.asarray(...).item())) before slicing.

Apply this diff:

- start_ind = node_info.parent_node.attrs.get("StartLocusIndex", 0)
- total_inds = node_info.parent_node.attrs["NumberOfLoci"]
+ start_raw = node_info.parent_node.attrs.get("StartLocusIndex", 0)
+ total_raw = node_info.parent_node.attrs["NumberOfLoci"]
+ start_ind = int(np.asarray(unbyte(start_raw)) if isinstance(start_raw, (bytes, np.bytes_)) else np.asarray(start_raw).item())
+ total_inds = int(np.asarray(unbyte(total_raw)) if isinstance(total_raw, (bytes, np.bytes_)) else np.asarray(total_raw).item())
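A helper like the following (hypothetical, not part of DASCore) shows the normalization the diff performs, collapsing the scalar shapes HDF5 attributes commonly take into a plain Python int:

```python
import numpy as np


def to_int(value):
    """Normalize an HDF5-style attribute scalar (bytes, numpy bytes_,
    0-d array, or plain int) into a python int."""
    if isinstance(value, (bytes, np.bytes_)):
        # HDF5 string attributes often surface as raw bytes.
        value = value.decode("utf-8")
    # np.asarray(...).item() collapses 0-d arrays and numpy scalars.
    return int(np.asarray(value).item())
```

Once normalized, the value can be used safely as a slice index.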
🧹 Nitpick comments (1)
dascore/io/dasdae/utils.py (1)
190-197: LGTM! Consider adding a type hint.

The function logic is correct and handles all edge cases properly. The approach of filtering kwargs before checking emptiness is clear and maintainable.

Optionally, you could add a type hint for the kwargs parameter to improve type safety:

-def _kwargs_empty(kwargs) -> bool:
+def _kwargs_empty(kwargs: dict) -> bool:

Or, if you want to be more specific:

from typing import Any

def _kwargs_empty(kwargs: dict[str, Any]) -> bool:
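A self-contained sketch of the suggested signature follows; the filter set and behavior are assumptions based on the discussion, not the exact DASCore helper:

```python
from typing import Any

# Hypothetical set of keyword names that select data; mirrors the
# constant the reader module defines for this purpose.
_LOAD_KWARGS = {"time", "distance"}


def kwargs_empty(kwargs: dict[str, Any]) -> bool:
    """Return True when none of the data-selecting keywords are present."""
    return not (set(kwargs) & _LOAD_KWARGS)
```

The specific hint `dict[str, Any]` documents that only string keys are expected while leaving values unconstrained.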
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
dascore/io/dasdae/utils.py (2 hunks)
dascore/io/prodml/utils.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (5)
get_coord (1039-1058), step (1068-1070), update (240-308), get_coord_manager (1127-1215), select (594-636)
dascore/core/coords.py (9)
  get_coord (1515-1703), update (738-751), update (965-967), data (606-608), select (323-329), select (997-1017), select (1149-1174), select (1320-1346), select (1431-1456)
dascore/utils/misc.py (4)
  iterate (257-269), maybe_get_items (438-460), register_func (35-58), unbyte (370-374)
dascore/core/attrs.py (4)
  PatchAttrs (65-301), items (181-183), get (174-179), update (234-255)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: Run benchmarks
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
🔇 Additional comments (1)
dascore/io/dasdae/utils.py (1)
15-16: LGTM! The constant is well-defined with a clear purpose. Using a set is appropriate for efficient membership testing.
Actionable comments posted: 0
♻️ Duplicate comments (2)
dascore/io/prodml/utils.py (2)
74-85: Rename window_function to window_type to match the expected field name.

Past reviews flagged that WindowFunction should map to window_type, not window_function. The class definition should use window_type to align with this expectation.

Apply this diff:

 class ProdMLFbePatchAttrs(ProdMLRawPatchAttrs):
     """Patch attrs for fbe (frequency band extracted) data in Prodml."""

     raw_reference: UTF8Str = ""
     transform_size: float = np.nan
     transform_type: UTF8Str = ""
     window_size: int | None = None
-    window_function: UTF8Str = ""
+    window_type: UTF8Str = ""
     window_overlap: int | None = None
     start_frequency: float = 0
     end_frequency: float = np.inf

Also update the mapping at line 52:

 _FBE_PARENT_ATTRS = {
     "RawReference": "raw_reference",
     "TransformSize": "transform_size",
     "TransformType": "transform_type",
-    "WindowFunction": "window_function",
+    "WindowFunction": "window_type",
     "WindowOverlap": "window_overlap",
     "WindowSize": "window_size",
 }
220-231: Read units from the RawData dataset for consistency.

Line 225 reads units from the group, but line 228 reads dimensions from the RawData dataset. For consistency and correctness (as flagged in past reviews), units should also be read from the dataset.

Apply this diff:

 @register_func(_NODE_ATTRS_PROCESSORS, key="raw")
 def _get_raw_node_attr_coords(node_info, d_coord, base_info):
     """Get the raw data information."""
     info = dict(base_info)
     t_coord = _get_time_coord(node_info.node)
     info.update(_get_data_unit_and_type(node_info.node))
+    # Override with dataset-level attributes if present
+    info.update(_get_data_unit_and_type(node_info.node["RawData"]))
     coords = dc.get_coord_manager(
         coords={"time": t_coord, "distance": d_coord},
         dims=_get_dims_from_attrs(node_info.node["RawData"].attrs),
     )
     return ProdMLRawPatchAttrs(**info), coords
🧹 Nitpick comments (2)
dascore/io/prodml/utils.py (2)
245-257: Simplify the type conversion logic.

The current type conversion for start_ind and total_inds is unnecessarily verbose. Since unbyte returns a string for bytes, you can simplify the conversion.

Apply this diff:

-    start_ind = int(
-        np.asarray(unbyte(start_raw))
-        if isinstance(start_raw, bytes | np.bytes_)
-        else np.asarray(start_raw).item()
-    )
-    total_inds = int(
-        np.asarray(unbyte(total_raw))
-        if isinstance(total_raw, bytes | np.bytes_)
-        else np.asarray(total_raw).item()
-    )
+    start_ind = int(unbyte(start_raw) if isinstance(start_raw, bytes | np.bytes_) else start_raw)
+    total_inds = int(unbyte(total_raw) if isinstance(total_raw, bytes | np.bytes_) else total_raw)
266-274: Consider replacing assert with an explicit exception.

Line 271 uses assert for validation, which can be disabled with Python's -O flag. For production code, prefer raising an explicit exception.

 @register_func(_NODE_DATA_PROCESSORS, key="raw")
 def _get_raw_data(node_info):
     """Get the data from a raw node."""
     node = node_info.node
     data_node = [x for x in list(node) if x.lower().endswith("data")]
-    assert len(data_node) == 1, "more than one data node found."
+    if len(data_node) != 1:
+        msg = f"Expected exactly one data node, found {len(data_node)} in {node_info.name}"
+        raise ValueError(msg)
     data = node[data_node[0]]
     return data
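The pattern runs like this in isolation; `get_single_data_node` is a hypothetical stand-alone version of the lookup, not DASCore's function:

```python
def get_single_data_node(node_names, group_name="Raw[0]"):
    """Return the unique child name ending in 'data', raising ValueError
    (rather than using assert, which python -O strips) when the
    invariant is violated."""
    matches = [n for n in node_names if n.lower().endswith("data")]
    if len(matches) != 1:
        msg = f"Expected exactly one data node, found {len(matches)} in {group_name}"
        raise ValueError(msg)
    return matches[0]
```

Unlike an assert, the ValueError survives optimized runs and carries a message naming the offending group.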
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dascore/io/prodml/utils.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/io/prodml/utils.py (4)
dascore/core/coordmanager.py (5)
get_coord (1039-1058), step (1068-1070), keys (1027-1029), update (240-308), get_coord_manager (1127-1215)
dascore/core/coords.py (4)
  get_coord (1515-1703), update (738-751), update (965-967), data (606-608)
dascore/utils/misc.py (4)
  iterate (257-269), maybe_get_items (438-460), register_func (35-58), unbyte (370-374)
dascore/core/attrs.py (4)
  PatchAttrs (65-301), items (181-183), get (174-179), update (234-255)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: Run benchmarks
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
🔇 Additional comments (4)
dascore/io/prodml/utils.py (4)
27-55: LGTM! Well-structured attribute mappings.

The root and FBE attribute mappings handle multiple naming conventions (e.g., PulseWidthUnit, PulseWidthUnits, PulseWidth.uom), which accommodates different ProdML schema versions.
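The multiple-spelling handling can be modeled with a small fallback helper. The names below are hypothetical and only sketch the idea, not DASCore's exact code:

```python
def first_present(attrs, *names, default=None):
    """Return the value for the first attribute name present among
    several schema-variant spellings."""
    for name in names:
        if name in attrs:
            return attrs[name]
    return default


# Different ProdML schema versions spell the same attribute differently.
attrs = {"PulseWidth.uom": "ns"}
unit = first_present(attrs, "PulseWidthUnit", "PulseWidthUnits", "PulseWidth.uom")
```

Ordering the candidate names from most to least common keeps the lookup predictable across schema versions.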
160-203: LGTM! Well-structured node traversal with empty-data handling.

The iterator-based approach cleanly separates raw and FBE node traversal, and the empty data check (lines 181-182) ensures only valid raw nodes are yielded.
315-329: LGTM! Correct empty-data handling.

The function correctly skips creating patches when selections yield empty data (line 327), which aligns with the PR's objective to return empty spools for out-of-bounds selections.
282-294: LGTM! Single-pass iteration resolves the past zip concern.

The refactored single-pass iteration over _yield_data_nodes eliminates the previous concern about missing strict=True on zip.
Description
This PR adds support for FBE (frequency band extracted) patches in ProdML files. It also tests that slicing out of bounds returns empty spools for all commonly tested FiberIO instances.
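The behavior under test can be modeled independently of any file format. `select_patches` below is a simplified stand-in for a FiberIO reader's selection path, not DASCore's API:

```python
import numpy as np


def select_patches(patches, t_min, t_max):
    """Filter (times, data) pairs to a time window; a fully
    out-of-range window yields an empty list, mirroring the
    empty-spool behavior this PR tests."""
    out = []
    for times, data in patches:
        mask = (times >= t_min) & (times <= t_max)
        if not mask.any():
            continue  # nothing in range -> contribute no patch
        out.append((times[mask], data[mask]))
    return out
```

A spool built from the returned list is simply empty when every patch falls outside the window, rather than raising an error.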
Checklist
I have (if applicable):
Summary by CodeRabbit
New Features
Bug Fixes
Behavior Changes
Tests
Chores