
Conversation

@d-chambers (Contributor) commented Sep 15, 2025

Description

This PR fixes the double-chunk issue described in #533.
It does so by resetting the spool's dataframe to the source dataframe when Spool.chunk is called, and by ensuring the source dataframe is likewise filtered and time-adjusted in Spool.select.
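
For illustration, a minimal sketch (not code from this PR; it assumes dascore's documented example spool) of the chained chunk pattern the fix targets:

import dascore as dc

# Hypothetical reproduction sketch for the #533 scenario.
spool = dc.get_example_spool("random_das")
merged = spool.chunk(time=...)    # merge contiguous patches into one
rechunked = merged.chunk(time=1)  # then re-chunk into ~1 s patches

# Previously the first chunk's instructions could be applied a second time,
# leaving invalid time bounds; with the fix every patch keeps valid attrs.
for patch in rechunked:
    assert patch.attrs["time_min"] <= patch.attrs["time_max"]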

Checklist

I have (if applicable):

  • referenced the GitHub issue this PR closes.
  • documented the new feature with docstrings or appropriate doc page.
  • included a test. See testing guidelines.
  • added my name to the contributors page (docs/contributors.md).
  • added the "ready_for_review" tag once the PR is ready to be reviewed.

Summary by CodeRabbit

  • Bug Fixes

    • Improved chunking and subsetting behavior for spools/patches so chained operations (split/merge and select paths) produce consistent patch counts and valid time ranges.
  • Documentation

    • Clarified utility behavior for returning a column or broadcasting a provided value when the column is missing.
  • Tests

    • Added tests covering chained chunk operations, validating patch types, integrity, and time_min/time_max consistency.

coderabbitai bot (Contributor) commented Sep 15, 2025

Walkthrough

Chunking and selection in the spool now operate on the source dataframe view; a pandas util docstring was clarified; and tests were added to validate chained chunk operations and resulting patch/dataframe time integrity.

Changes

Spool chunking and selection source alignment (dascore/core/spool.py)
Chunking now operates on self._source_df (drop-columns applied to the source) and new spools are constructed with source_df set accordingly. Selection (select) computes a source-aligned view via adjust_segments(self._source_df.loc[inst.index], ignore_bad_kwargs=True, **kwargs) and uses that source for the new spool; an inline comment marks the dropped rows. A toy sketch of this idea is given just below this change list.

Docstring clarification in pandas utils (dascore/utils/pd.py)
Updated the _column_or_value docstring to state that when the column is missing it returns a NumPy array broadcasting the provided value to the dataframe length. No functional or API changes.

Tests for chained chunk operations (tests/test_core/test_patch_chunk.py)
Added tests exercising chained chunk operations: chunk→merge→chunk, multiple chained chunks, and chunk split then merge. The tests assert that resulting items are dc.Patch, that time_min/time_max are not NaN, and that dataframe time columns align with patch attributes.
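
As a standalone toy illustration of the "derive the working frame from the source frame" idea summarized above (hypothetical function names, plain pandas; not the dascore source):

import pandas as pd

def chunk_frame(source_df: pd.DataFrame, drop_columns: list) -> pd.DataFrame:
    # Toy stand-in for the chunk path: always start from the unmodified
    # source frame so earlier chunk instructions are never applied twice.
    return source_df.drop(columns=list(drop_columns), errors="ignore")

def select_frame(source_df: pd.DataFrame, kept_index: pd.Index) -> pd.DataFrame:
    # Toy stand-in for the select path: keep only rows surviving the filter;
    # the real code additionally time-adjusts segments via adjust_segments(...).
    return source_df.loc[kept_index]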

Pre-merge checks

✅ Passed checks (3 passed)
Title Check: ✅ Passed. The title "fix 533" directly references the issue addressed by this changeset and relates to the primary fix (preventing double-chunking in Spool.chunk and adjusting source handling in Spool.select), but it is very terse and does not describe the actual change for someone scanning PR history.
Description Check: ✅ Passed. The PR description states the problem and the high-level fix (resetting the spool dataframe in Spool.chunk and adjusting the source in Spool.select) and references issue #533, matching the template's structure; however, the Checklist is left unchecked, and the description could briefly list the key changed files or the added tests to help reviewers.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

codecov bot commented Sep 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.88%. Comparing base (4242dc7) to head (69ec910).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #536   +/-   ##
=======================================
  Coverage   99.88%   99.88%           
=======================================
  Files         122      122           
  Lines       10040    10045    +5     
=======================================
+ Hits        10028    10033    +5     
  Misses         12       12           
Flag Coverage Δ
unittests 99.88% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@d-chambers d-chambers added the bug Something isn't working label Sep 15, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
dascore/utils/pd.py (1)

453-457: Docstring over‑specifies dtype; align wording with implementation.

The fallback returns a numpy array broadcast of the provided value (not necessarily bool). Suggest clarifying the docstring.

-    """Return the values from a column, if they exist, else bool array of value."""
+    """Return column values if present; else a numpy array broadcast of `value` to len(df)."""
dascore/core/spool.py (2)

521-523: Correct: chunk on source_df to prevent “double chunk”. Remove stale comment.

Using self._source_df here fixes #533 semantics. Drop the commented legacy line.

-        df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")
-        # df = self._df
+        df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")

574-582: Select: align source view with instruction indices.

Adjusting source via adjust_segments(self._source_df.loc[inst.index], …) eliminates desync between current/instructions/source during chained ops. Good fix.

Tiny cleanup: the “Determine if the instructions are the same as the source dataframe.” comment above no longer reflects the logic; consider removing.

tests/test_core/test_patch_chunk.py (3)

502-536: Solid regression for #533; reduce order‑coupling to DataFrame rows.

Accessing result_contents.iloc[i] assumes DataFrame row order matches iteration order. Safer to reset the index to guarantee alignment.

-        result_contents = result_spool.get_contents()
+        result_contents = result_spool.get_contents().reset_index(drop=True)

537-548: Make the example explicit to avoid fixture drift.

Use get_example_spool("random_das") for determinism if defaults ever change.

-        spool = dc.get_example_spool()
+        spool = dc.get_example_spool("random_das")

549-579: Good chain test; mirror the index reset as above.

Same order assumption here; apply reset_index for robustness.

-        result_contents = result_spool.get_contents()
+        result_contents = result_spool.get_contents().reset_index(drop=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13b2bf2 and f62e9d1.

📒 Files selected for processing (3)
  • dascore/core/spool.py (3 hunks)
  • dascore/utils/pd.py (1 hunks)
  • tests/test_core/test_patch_chunk.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dascore/core/spool.py (1)
dascore/utils/pd.py (1)
  • adjust_segments (255-282)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (2)
  • random_spool (529-552)
  • get_example_spool (686-732)
tests/conftest.py (3)
  • random_spool (447-451)
  • spool (569-571)
  • patch (375-377)
dascore/core/spool.py (5)
  • spool (675-704)
  • chunk (111-158)
  • chunk (511-542)
  • get_contents (186-195)
  • get_contents (633-635)
dascore/utils/chunk.py (1)
  • chunk (419-461)
dascore/utils/time.py (1)
  • to_datetime64 (22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: test_code (windows-latest, 3.11)
  • GitHub Check: test_code (windows-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.10)
  • GitHub Check: test_code (windows-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.11)
  • GitHub Check: test_code (macos-latest, 3.10)
  • GitHub Check: test_code (ubuntu-latest, 3.12)
  • GitHub Check: test_code (macos-latest, 3.13)
  • GitHub Check: test_code (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (windows-latest, 3.13)
  • GitHub Check: test_code_min_deps (macos-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (windows-latest, 3.12)
  • GitHub Check: test_code_min_deps (macos-latest, 3.12)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
🔇 Additional comments (1)
dascore/core/spool.py (1)

538-542: Preserving source_df across chunk is correct.

Passing the original source_df prevents compounding transforms across chained chunks.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/test_core/test_patch_chunk.py (1)

561-576: Repeat: avoid assuming dataframe row order matches iteration order.

Mirror the vectorized checks from the earlier test to reduce brittleness.

Apply this diff:

-        result_contents = result_spool.get_contents().reset_index(drop=True)
-        for i, patch in enumerate(result_spool):
-            # Assert no NaN values in patch attributes
-            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-            # Verify dataframe contains reasonable time values
-            df_row = result_contents.iloc[i]
-            df_time_min = dc.to_datetime64(df_row["time_min"])
-            df_time_max = dc.to_datetime64(df_row["time_max"])
-
-            # Dataframe times should not be NaN or invalid
-            assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-            assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-            assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+        result_contents = result_spool.get_contents()
+        for i, patch in enumerate(result_spool):
+            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+        assert result_contents["time_min"].notna().all()
+        assert result_contents["time_max"].notna().all()
+        dt_min = dc.to_datetime64(result_contents["time_min"])
+        dt_max = dc.to_datetime64(result_contents["time_max"])
+        assert (dt_min <= dt_max).all()

Optional: extract a small helper to DRY these assertions across tests.

def _assert_df_times_valid(spool):
    df = spool.get_contents()
    assert df["time_min"].notna().all()
    assert df["time_max"].notna().all()
    dt_min = dc.to_datetime64(df["time_min"])
    dt_max = dc.to_datetime64(df["time_max"])
    assert (dt_min <= dt_max).all()
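
For example, each chained-chunk test could then end with a single call (hypothetical usage of the helper sketched above):

result_spool = random_spool.chunk(time=1).chunk(time=...)
_assert_df_times_valid(result_spool)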
🧹 Nitpick comments (4)
tests/test_core/test_patch_chunk.py (4)

517-519: Relax assertion to avoid data-dependent flakiness.

Requiring more patches than the original spool is brittle if the merged duration is short. Prefer asserting non-emptiness and/or relative to the merged count.

Apply this diff:

-        # Should have more patches (chunking into smaller pieces)
-        assert len(result_spool) > len(spool)
+        # Result should be non-empty and not smaller than the merged count.
+        assert len(result_spool) >= 1
+        assert len(result_spool) >= len(chunk_1_spool)

521-536: Decouple patch iteration from dataframe row order to prevent order-coupling flakes.

Indexing result_contents.iloc[i] assumes iteration order matches dataframe row order; that isn’t guaranteed. Validate patch attrs and dataframe columns independently.

Apply this diff:

-        result_contents = result_spool.get_contents().reset_index(drop=True)
-        for i, patch in enumerate(result_spool):
-            # Assert no NaN values in patch attributes
-            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-            # Verify dataframe contains reasonable time values
-            df_row = result_contents.iloc[i]
-            df_time_min = dc.to_datetime64(df_row["time_min"])
-            df_time_max = dc.to_datetime64(df_row["time_max"])
-
-            # Dataframe times should not be NaN or invalid
-            assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-            assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-            assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+        result_contents = result_spool.get_contents()
+        # Assert patch attributes are valid
+        for i, patch in enumerate(result_spool):
+            assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+            assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+        # Assert dataframe consistency without assuming row order == iteration order
+        assert result_contents["time_min"].notna().all()
+        assert result_contents["time_max"].notna().all()
+        dt_min = dc.to_datetime64(result_contents["time_min"])
+        dt_max = dc.to_datetime64(result_contents["time_max"])
+        assert (dt_min <= dt_max).all()

540-546: Strengthen the chained-chunk test with a basic non-emptiness check.

Quick sanity guard before iterating.

Apply this diff:

-        # Should be able to access all patches
+        # Should be able to access all patches
+        assert len(result_spool) > 0
         for i in range(len(result_spool)):
             patch = result_spool[i]
             assert isinstance(patch, dc.Patch)

549-552: Assert equivalence to direct merge to validate end-to-end idempotence.

Merging after splitting should equal a direct merge on the same input (for consistent attrs). Adds a stronger check for #533.

Apply this diff:

         # First chunk into smaller pieces, then merge back
         result_spool = random_spool.chunk(time=1).chunk(time=...)
 
+        # Should match a direct merge from the original spool
+        expected_spool = random_spool.chunk(time=...)
+        assert result_spool == expected_spool
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f62e9d1 and 69ec910.

📒 Files selected for processing (3)
  • dascore/core/spool.py (3 hunks)
  • dascore/utils/pd.py (1 hunks)
  • tests/test_core/test_patch_chunk.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • dascore/utils/pd.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dascore/core/spool.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (1)
  • random_spool (529-552)
tests/conftest.py (3)
  • random_spool (447-451)
  • spool (569-571)
  • patch (375-377)
dascore/core/spool.py (5)
  • spool (673-702)
  • chunk (111-158)
  • chunk (511-541)
  • get_contents (186-195)
  • get_contents (631-633)
dascore/utils/chunk.py (1)
  • chunk (419-461)
dascore/utils/time.py (1)
  • to_datetime64 (22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: test_code_min_deps (windows-latest, 3.12)
  • GitHub Check: test_code_min_deps (windows-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
  • GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
  • GitHub Check: test_code_min_deps (macos-latest, 3.12)
  • GitHub Check: test_code (windows-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.11)
  • GitHub Check: test_code (macos-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.12)
  • GitHub Check: test_code (ubuntu-latest, 3.13)
  • GitHub Check: test_code (macos-latest, 3.11)
  • GitHub Check: test_code (ubuntu-latest, 3.10)
  • GitHub Check: test_code (macos-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.13)
  • GitHub Check: test_code (windows-latest, 3.11)
  • GitHub Check: test_code (windows-latest, 3.10)

@d-chambers d-chambers merged commit 96c8e10 into master Sep 16, 2025
22 checks passed
@d-chambers d-chambers deleted the fix_double_chunk branch September 16, 2025 07:49
@d-chambers d-chambers mentioned this pull request Sep 16, 2025
