
Conversation

@d-chambers (Contributor) commented Sep 15, 2025

Description

This PR fixes the double-chunk issue described in #533.
It does so by resetting the spool's dataframe to the source dataframe when Spool.chunk is called, and by ensuring the source dataframe is likewise filtered and time-adjusted in Spool.select.
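
For illustration, a minimal sketch (not code from this PR; it assumes dascore's documented example spool) of the chained chunk pattern the fix targets:

import dascore as dc

# Hypothetical reproduction sketch for the #533 scenario.
spool = dc.get_example_spool("random_das")
merged = spool.chunk(time=...)    # merge contiguous patches into one
rechunked = merged.chunk(time=1)  # then re-chunk into ~1 s patches

# Previously the first chunk's instructions could be applied a second time,
# leaving invalid time bounds; with the fix every patch keeps valid attrs.
for patch in rechunked:
    assert patch.attrs["time_min"] <= patch.attrs["time_max"]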

Checklist

I have (if applicable):

  • referenced the GitHub issue this PR closes.
  • documented the new feature with docstrings or appropriate doc page.
  • included a test. See testing guidelines.
  • added my name to the contributors page (docs/contributors.md).
  • added the "ready_for_review" tag once the PR is ready to be reviewed.

Summary by CodeRabbit

  • Bug Fixes

    • Improved chunking and subsetting behavior for spools/patches so chained operations (split/merge and select paths) produce consistent patch counts and valid time ranges.
  • Documentation

    • Clarified utility behavior for returning a column or broadcasting a provided value when the column is missing.
  • Tests

    • Added tests covering chained chunk operations, validating patch types, integrity, and time_min/time_max consistency.

coderabbitai bot (Contributor) commented Sep 15, 2025

Walkthrough

Chunking and selection in the spool now operate on the source dataframe view; a pandas util docstring was clarified; and tests were added to validate chained chunk operations and resulting patch/dataframe time integrity.

Changes

Spool chunking and selection source alignment (dascore/core/spool.py)
Chunking now operates on self._source_df (drop-columns applied to the source) and new spools are constructed with source_df set accordingly. Selection (select) computes a source-aligned view via adjust_segments(self._source_df.loc[inst.index], ignore_bad_kwargs=True, **kwargs) and uses that source for the new spool; an inline comment marks the dropped rows. A toy sketch of this idea is given just below this change list.

Docstring clarification in pandas utils (dascore/utils/pd.py)
Updated the _column_or_value docstring to state that when the column is missing it returns a NumPy array broadcasting the provided value to the dataframe length. No functional or API changes.

Tests for chained chunk operations (tests/test_core/test_patch_chunk.py)
Added tests exercising chained chunk operations: chunk→merge→chunk, multiple chained chunks, and chunk split then merge. The tests assert that resulting items are dc.Patch, that time_min/time_max are not NaN, and that dataframe time columns align with patch attributes.
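
As a standalone toy illustration of the "derive the working frame from the source frame" idea summarized above (hypothetical function names, plain pandas; not the dascore source):

import pandas as pd

def chunk_frame(source_df: pd.DataFrame, drop_columns: list) -> pd.DataFrame:
    # Toy stand-in for the chunk path: always start from the unmodified
    # source frame so earlier chunk instructions are never applied twice.
    return source_df.drop(columns=list(drop_columns), errors="ignore")

def select_frame(source_df: pd.DataFrame, kept_index: pd.Index) -> pd.DataFrame:
    # Toy stand-in for the select path: keep only rows surviving the filter;
    # the real code additionally time-adjusts segments via adjust_segments(...).
    return source_df.loc[kept_index]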

Pre-merge checks

✅ Passed checks (3 passed)
Title Check: ✅ Passed. The title "fix 533" directly references the issue addressed by this changeset and relates to the primary fix (preventing double-chunking in Spool.chunk and adjusting source handling in Spool.select), but it is very terse and does not describe the actual change for someone scanning PR history.
Description Check: ✅ Passed. The PR description states the problem and the high-level fix (resetting the spool dataframe in Spool.chunk and adjusting the source in Spool.select) and references issue #533, matching the template's structure; however, the Checklist is left unchecked, and the description could briefly list the key changed files or the added tests to help reviewers.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

codecov bot commented Sep 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.88%. Comparing base (4242dc7) to head (69ec910).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #536   +/-   ##
=======================================
  Coverage   99.88%   99.88%           
=======================================
  Files         122      122           
  Lines       10040    10045    +5     
=======================================
+ Hits        10028    10033    +5     
  Misses         12       12           
Flag Coverage Δ
unittests 99.88% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@d-chambers d-chambers added the bug Something isn't working label Sep 15, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
dascore/utils/pd.py (1)

453-457: Docstring over‑specifies dtype; align wording with implementation.

The fallback returns a numpy array broadcast of the provided value (not necessarily bool). Suggest clarifying the docstring.

-    """Return the values from a column, if they exist, else bool array of value."""
+    """Return column values if present; else a numpy array broadcast of `value` to len(df)."""
dascore/core/spool.py (2)

521-523: Correct: chunk on source_df to prevent “double chunk”. Remove stale comment.

Using self._source_df here fixes #533 semantics. Drop the commented legacy line.

-        df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")
-        # df = self._df
+        df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")

574-582: Select: align source view with instruction indices.

Adjusting source via adjust_segments(self._source_df.loc[inst.index], …) eliminates desync between current/instructions/source during chained ops. Good fix.

Tiny cleanup: the “Determine if the instructions are the same as the source dataframe.” comment above no longer reflects the logic; consider removing.

tests/test_core/test_patch_chunk.py (3)

502-536: Solid regression for #533; reduce order‑coupling to DataFrame rows.

Accessing result_contents.iloc[i] assumes DataFrame row order matches iteration order. Safer to reset the index to guarantee alignment.

-        result_contents = result_spool.get_contents()
+        result_contents = result_spool.get_contents().reset_index(drop=True)

537-548: Make the example explicit to avoid fixture drift.

Use get_example_spool("random_das") for determinism if defaults ever change.

-        spool = dc.get_example_spool()
+        spool = dc.get_example_spool("random_das")

549-579: Good chain test; mirror the index reset as above.

Same order assumption here; apply reset_index for robustness.

-        result_contents = result_spool.get_contents()
+        result_contents = result_spool.get_contents().reset_index(drop=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13b2bf2 and f62e9d1.

📒 Files selected for processing (3)
  • dascore/core/spool.py (3 hunks)
  • dascore/utils/pd.py (1 hunks)
  • tests/test_core/test_patch_chunk.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dascore/core/spool.py (1)
dascore/utils/pd.py (1)
  • adjust_segments (255-282)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (2)
  • random_spool (529-552)
  • get_example_spool (686-732)
tests/conftest.py (3)
  • random_spool (447-451)
  • spool (569-571)
  • patch (375-377)
dascore/core/spool.py (5)
  • spool (675-704)
  • chunk (111-158)
  • chunk (511-542)
  • get_contents (186-195)
  • get_contents (633-635)
dascore/utils/chunk.py (1)
  • chunk (419-461)
dascore/utils/time.py (1)
  • to_datetime64 (22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: test_code (windows-latest, 3.11)
  • GitHub Check: test_code (windows-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.10)
  • GitHub Check: test_code (windows-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.11)
  • GitHub Check: test_code (macos-latest, 3.10)
  • GitHub Check: test_code (ubuntu-latest, 3.12)
  • GitHub Check: test_code (macos-latest, 3.13)
  • GitHub Check: test_code (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (windows-latest, 3.13)
  • GitHub Check: test_code_min_deps (macos-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (windows-latest, 3.12)
  • GitHub Check: test_code_min_deps (macos-latest, 3.12)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
🔇 Additional comments (1)
dascore/core/spool.py (1)

538-542: Preserving source_df across chunk is correct.

Passing the original source_df prevents compounding transforms across chained chunks.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/test_core/test_patch_chunk.py (1)

561-576: Repeat: avoid assuming dataframe row order matches iteration order.

Mirror the vectorized checks from the earlier test to reduce brittleness.

Apply this diff:

-        result_contents = result_spool.get_contents().reset_index(drop=True)
-        for i, patch in enumerate(result_spool):
-            # Assert no NaN values in patch attributes
-            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-            # Verify dataframe contains reasonable time values
-            df_row = result_contents.iloc[i]
-            df_time_min = dc.to_datetime64(df_row["time_min"])
-            df_time_max = dc.to_datetime64(df_row["time_max"])
-
-            # Dataframe times should not be NaN or invalid
-            assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-            assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-            assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+        result_contents = result_spool.get_contents()
+        for i, patch in enumerate(result_spool):
+            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+        assert result_contents["time_min"].notna().all()
+        assert result_contents["time_max"].notna().all()
+        dt_min = dc.to_datetime64(result_contents["time_min"])
+        dt_max = dc.to_datetime64(result_contents["time_max"])
+        assert (dt_min <= dt_max).all()

Optional: extract a small helper to DRY these assertions across tests.

def _assert_df_times_valid(spool):
    df = spool.get_contents()
    assert df["time_min"].notna().all()
    assert df["time_max"].notna().all()
    dt_min = dc.to_datetime64(df["time_min"])
    dt_max = dc.to_datetime64(df["time_max"])
    assert (dt_min <= dt_max).all()
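
For example, each chained-chunk test could then end with a single call (hypothetical usage of the helper sketched above):

result_spool = random_spool.chunk(time=1).chunk(time=...)
_assert_df_times_valid(result_spool)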
🧹 Nitpick comments (4)
tests/test_core/test_patch_chunk.py (4)

517-519: Relax assertion to avoid data-dependent flakiness.

Requiring more patches than the original spool is brittle if the merged duration is short. Prefer asserting non-emptiness and/or relative to the merged count.

Apply this diff:

-        # Should have more patches (chunking into smaller pieces)
-        assert len(result_spool) > len(spool)
+        # Result should be non-empty and not smaller than the merged count.
+        assert len(result_spool) >= 1
+        assert len(result_spool) >= len(chunk_1_spool)

521-536: Decouple patch iteration from dataframe row order to prevent order-coupling flakes.

Indexing result_contents.iloc[i] assumes iteration order matches dataframe row order; that isn’t guaranteed. Validate patch attrs and dataframe columns independently.

Apply this diff:

-        result_contents = result_spool.get_contents().reset_index(drop=True)
-        for i, patch in enumerate(result_spool):
-            # Assert no NaN values in patch attributes
-            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-            # Verify dataframe contains reasonable time values
-            df_row = result_contents.iloc[i]
-            df_time_min = dc.to_datetime64(df_row["time_min"])
-            df_time_max = dc.to_datetime64(df_row["time_max"])
-
-            # Dataframe times should not be NaN or invalid
-            assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-            assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-            assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+        result_contents = result_spool.get_contents()
+        # Assert patch attributes are valid
+        for i, patch in enumerate(result_spool):
+            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+        # Assert dataframe consistency without assuming row order == iteration order
+        assert result_contents["time_min"].notna().all()
+        assert result_contents["time_max"].notna().all()
+        dt_min = dc.to_datetime64(result_contents["time_min"])
+        dt_max = dc.to_datetime64(result_contents["time_max"])
+        assert (dt_min <= dt_max).all()

540-546: Strengthen the chained-chunk test with a basic non-emptiness check.

Quick sanity guard before iterating.

Apply this diff:

-        # Should be able to access all patches
+        # Should be able to access all patches
+        assert len(result_spool) > 0
         for i in range(len(result_spool)):
             patch = result_spool[i]
             assert isinstance(patch, dc.Patch)

549-552: Assert equivalence to direct merge to validate end-to-end idempotence.

Merging after splitting should equal a direct merge on the same input (for consistent attrs). Adds a stronger check for #533.

Apply this diff:

         # First chunk into smaller pieces, then merge back
         result_spool = random_spool.chunk(time=1).chunk(time=...)
 
+        # Should match a direct merge from the original spool
+        expected_spool = random_spool.chunk(time=...)
+        assert result_spool == expected_spool
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f62e9d1 and 69ec910.

📒 Files selected for processing (3)
  • dascore/core/spool.py (3 hunks)
  • dascore/utils/pd.py (1 hunks)
  • tests/test_core/test_patch_chunk.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • dascore/utils/pd.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dascore/core/spool.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (1)
  • random_spool (529-552)
tests/conftest.py (3)
  • random_spool (447-451)
  • spool (569-571)
  • patch (375-377)
dascore/core/spool.py (5)
  • spool (673-702)
  • chunk (111-158)
  • chunk (511-541)
  • get_contents (186-195)
  • get_contents (631-633)
dascore/utils/chunk.py (1)
  • chunk (419-461)
dascore/utils/time.py (1)
  • to_datetime64 (22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: test_code_min_deps (windows-latest, 3.12)
  • GitHub Check: test_code_min_deps (windows-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
  • GitHub Check: test_code_min_deps (macos-latest, 3.12)
  • GitHub Check: test_code (windows-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.11)
  • GitHub Check: test_code (macos-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.13)
  • GitHub Check: test_code (macos-latest, 3.11)
  • GitHub Check: test_code (ubuntu-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.11)
  • GitHub Check: test_code (windows-latest, 3.10)

@d-chambers d-chambers merged commit 96c8e10 into master Sep 16, 2025
22 checks passed
@d-chambers d-chambers deleted the fix_double_chunk branch September 16, 2025 07:49
@d-chambers d-chambers mentioned this pull request Sep 16, 2025
