fix 533 #536
Conversation
Walkthrough
Chunking and selection in the spool now operate on the source dataframe view; a pandas util docstring was clarified; and tests were added to validate chained chunk operations and resulting patch/dataframe time integrity.
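The chained pattern at the heart of this change can be sketched as follows; this is an illustrative snippet based on the example-spool and chunk calls referenced elsewhere in this review, not code from the PR itself:

```python
import dascore as dc

# Split the example spool into 1-second patches, then merge them back into
# contiguous patches. Before this fix the second chunk operated on the
# already-chunked view, compounding the first chunk (the "double chunk" in #533);
# chunking now restarts from the source dataframe.
spool = dc.get_example_spool()
merged = spool.chunk(time=1).chunk(time=...)
for patch in merged:
    print(patch.attrs["time_min"], patch.attrs["time_max"])
```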
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #536 +/- ##
=======================================
Coverage 99.88% 99.88%
=======================================
Files 122 122
Lines 10040 10045 +5
=======================================
+ Hits 10028 10033 +5
Misses 12 12
Flags with carried forward coverage won't be shown.
Actionable comments posted: 0
🧹 Nitpick comments (6)
dascore/utils/pd.py (1)
453-457: Docstring over‑specifies dtype; align wording with implementation.
The fallback returns a numpy array broadcast of the provided value (not necessarily bool). Suggest clarifying the docstring.
- """Return the values from a column, if they exist, else bool array of value.""" + """Return column values if present; else a numpy array broadcast of `value` to len(df)."""dascore/core/spool.py (2)
dascore/core/spool.py (2)
521-523: Correct: chunk on source_df to prevent “double chunk”. Remove stale comment.
Using self._source_df here fixes #533 semantics. Drop the commented legacy line.
- df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")
- # df = self._df
+ df = self._source_df.drop(columns=list(self._drop_columns), errors="ignore")
574-582: Select: align source view with instruction indices.
Adjusting source via adjust_segments(self._source_df.loc[inst.index], …) eliminates desync between current/instructions/source during chained ops. Good fix.
Tiny cleanup: the “Determine if the instructions are the same as the source dataframe.” comment above no longer reflects the logic; consider removing.
tests/test_core/test_patch_chunk.py (3)
502-536: Solid regression for #533; reduce order‑coupling to DataFrame rows.
Accessing result_contents.iloc[i] assumes DataFrame row order matches iteration order. Safer to reset the index to guarantee alignment.
- result_contents = result_spool.get_contents()
+ result_contents = result_spool.get_contents().reset_index(drop=True)
537-548: Make the example explicit to avoid fixture drift.
Use get_example_spool("random_das") for determinism if defaults ever change.
- spool = dc.get_example_spool()
+ spool = dc.get_example_spool("random_das")
549-579: Good chain test; mirror the index reset as above.
Same order assumption here; apply reset_index for robustness.
- result_contents = result_spool.get_contents()
+ result_contents = result_spool.get_contents().reset_index(drop=True)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- dascore/core/spool.py (3 hunks)
- dascore/utils/pd.py (1 hunks)
- tests/test_core/test_patch_chunk.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dascore/core/spool.py (1)
dascore/utils/pd.py (1)
adjust_segments(255-282)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (2)
random_spool(529-552), get_example_spool(686-732)
tests/conftest.py (3)
random_spool(447-451), spool(569-571), patch(375-377)
dascore/core/spool.py (5)
spool(675-704), chunk(111-158), chunk(511-542), get_contents(186-195), get_contents(633-635)
dascore/utils/chunk.py (1)
chunk(419-461)
dascore/utils/time.py (1)
to_datetime64(22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
🔇 Additional comments (1)
dascore/core/spool.py (1)
538-542: Preserving source_df across chunk is correct.
Passing the original source_df prevents compounding transforms across chained chunks.
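As a rough illustration of why carrying the source dataframe matters (a toy sketch, not DASCore's actual classes; `_compute_chunk_df` is a hypothetical placeholder for the real chunking logic):

```python
from typing import Optional

import pandas as pd


def _compute_chunk_df(source_df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    """Hypothetical placeholder for the real chunking logic."""
    return source_df


class SpoolSketch:
    """Toy spool that carries its untouched source dataframe through chunking."""

    def __init__(self, source_df: pd.DataFrame, current_df: Optional[pd.DataFrame] = None):
        self._source_df = source_df  # never replaced by chunk
        self._df = current_df if current_df is not None else source_df

    def chunk(self, **kwargs) -> "SpoolSketch":
        # Build the new view from the *source* dataframe and pass the original
        # source forward, so a second chunk does not compound the first.
        new_df = _compute_chunk_df(self._source_df, **kwargs)
        return SpoolSketch(source_df=self._source_df, current_df=new_df)
```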
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/test_core/test_patch_chunk.py (1)
561-576: Repeat: avoid assuming dataframe row order matches iteration order.
Mirror the vectorized checks from the earlier test to reduce brittleness.
Apply this diff:
- result_contents = result_spool.get_contents().reset_index(drop=True)
- for i, patch in enumerate(result_spool):
-     # Assert no NaN values in patch attributes
-     assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-     assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-     # Verify dataframe contains reasonable time values
-     df_row = result_contents.iloc[i]
-     df_time_min = dc.to_datetime64(df_row["time_min"])
-     df_time_max = dc.to_datetime64(df_row["time_max"])
-
-     # Dataframe times should not be NaN or invalid
-     assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-     assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-     assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+ result_contents = result_spool.get_contents()
+ for i, patch in enumerate(result_spool):
+     assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+     assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+ assert result_contents["time_min"].notna().all()
+ assert result_contents["time_max"].notna().all()
+ dt_min = dc.to_datetime64(result_contents["time_min"])
+ dt_max = dc.to_datetime64(result_contents["time_max"])
+ assert (dt_min <= dt_max).all()
Optional: extract a small helper to DRY these assertions across tests.
def _assert_df_times_valid(spool):
    df = spool.get_contents()
    assert df["time_min"].notna().all()
    assert df["time_max"].notna().all()
    dt_min = dc.to_datetime64(df["time_min"])
    dt_max = dc.to_datetime64(df["time_max"])
    assert (dt_min <= dt_max).all()
🧹 Nitpick comments (4)
tests/test_core/test_patch_chunk.py (4)
517-519: Relax assertion to avoid data-dependent flakiness.
Requiring more patches than the original spool is brittle if the merged duration is short. Prefer asserting non-emptiness and/or relative to the merged count.
Apply this diff:
- # Should have more patches (chunking into smaller pieces)
- assert len(result_spool) > len(spool)
+ # Result should be non-empty and not smaller than the merged count.
+ assert len(result_spool) >= 1
+ assert len(result_spool) >= len(chunk_1_spool)
521-536: Decouple patch iteration from dataframe row order to prevent order-coupling flakes.
Indexing result_contents.iloc[i] assumes iteration order matches dataframe row order; that isn’t guaranteed. Validate patch attrs and dataframe columns independently.
Apply this diff:
- result_contents = result_spool.get_contents().reset_index(drop=True)
- for i, patch in enumerate(result_spool):
-     # Assert no NaN values in patch attributes
-     assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
-     assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
-
-     # Verify dataframe contains reasonable time values
-     df_row = result_contents.iloc[i]
-     df_time_min = dc.to_datetime64(df_row["time_min"])
-     df_time_max = dc.to_datetime64(df_row["time_max"])
-
-     # Dataframe times should not be NaN or invalid
-     assert not pd.isna(df_time_min), f"DF row {i} has NaN time_min"
-     assert not pd.isna(df_time_max), f"DF row {i} has NaN time_max"
-     assert df_time_min <= df_time_max, f"DF row {i} has invalid time range"
+ result_contents = result_spool.get_contents()
+ # Assert patch attributes are valid
+ for i, patch in enumerate(result_spool):
+     assert not pd.isna(patch.attrs["time_min"]), f"Patch {i} has NaN time_min"
+     assert not pd.isna(patch.attrs["time_max"]), f"Patch {i} has NaN time_max"
+ # Assert dataframe consistency without assuming row order == iteration order
+ assert result_contents["time_min"].notna().all()
+ assert result_contents["time_max"].notna().all()
+ dt_min = dc.to_datetime64(result_contents["time_min"])
+ dt_max = dc.to_datetime64(result_contents["time_max"])
+ assert (dt_min <= dt_max).all()
540-546: Strengthen the chained-chunk test with a basic non-emptiness check.
Quick sanity guard before iterating.
Apply this diff:
- # Should be able to access all patches
+ # Should be able to access all patches
+ assert len(result_spool) > 0
  for i in range(len(result_spool)):
      patch = result_spool[i]
      assert isinstance(patch, dc.Patch)
549-552: Assert equivalence to direct merge to validate end-to-end idempotence.
Merging after splitting should equal a direct merge on the same input (for consistent attrs). Adds a stronger check for #533.
Apply this diff:
  # First chunk into smaller pieces, then merge back
  result_spool = random_spool.chunk(time=1).chunk(time=...)
+ # Should match a direct merge from the original spool
+ expected_spool = random_spool.chunk(time=...)
+ assert result_spool == expected_spool
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- dascore/core/spool.py (3 hunks)
- dascore/utils/pd.py (1 hunks)
- tests/test_core/test_patch_chunk.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- dascore/utils/pd.py
🚧 Files skipped from review as they are similar to previous changes (1)
- dascore/core/spool.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_core/test_patch_chunk.py (5)
dascore/examples.py (1)
random_spool(529-552)tests/conftest.py (3)
random_spool(447-451)spool(569-571)patch(375-377)dascore/core/spool.py (5)
spool(673-702)chunk(111-158)chunk(511-541)get_contents(186-195)get_contents(631-633)dascore/utils/chunk.py (1)
chunk(419-461)dascore/utils/time.py (1)
to_datetime64(22-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code (macos-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.10)
Description
This PR fixes the double chunk issue mentioned in #533.
This is done by resetting the spool's dataframe to the source dataframe upon using Spool.chunk, and ensuring the source dataframe is also filtered and time adjusted in Spool.select. A rough usage sketch is shown below.
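A minimal sketch of the select-then-chunk path this touches, assuming the example spool and a two-second window chosen only for illustration:

```python
import dascore as dc
import numpy as np

# After select, the source dataframe is also filtered and time adjusted, so a
# later chunk sees the narrowed contents rather than the original ones.
spool = dc.get_example_spool()
t0 = spool[0].attrs["time_min"]
sub = spool.select(time=(t0, t0 + np.timedelta64(2, "s")))
merged = sub.chunk(time=...)
print(len(merged))
```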
Checklist
I have (if applicable):
Summary by CodeRabbit
Bug Fixes
- Chunking and selection now operate on the spool's source dataframe view, fixing the double-chunk behavior.
Documentation
- Clarified a pandas utility docstring.
Tests
- Added tests validating chained chunk operations and resulting patch/dataframe time integrity.