Hampel speed #557
Conversation
Walkthrough

Renames many benchmark tests; refactors the Hampel filter to add an approximate (default) fast path and an exact path using new moving-window utilities; adds a moving-window module and re-exports; extends `optional_import` with `on_missing` behaviors; and adds conditional spool deduplication with related tests, docs, and pyproject changes.

Pre-merge checks and finishing touches: 3 checks passed.
Actionable comments posted: 7
🧹 Nitpick comments (3)
benchmarks/test_patch_benchmarks.py (3)
205-208: Silence Ruff ARG002 by using the fixture without a param

Use the fixture via a decorator and drop the unused function arg.

Apply this diff:

```diff
-    @pytest.mark.benchmark
-    def test_waterfall(self, example_patch, cleanup_mpl):
+    @pytest.mark.usefixtures("cleanup_mpl")
+    @pytest.mark.benchmark
+    def test_waterfall(self, example_patch):
         """Timing for waterfall patch."""
         example_patch.viz.waterfall()
```
220-224: Same fix for the wiggle plot test

Avoid the unused argument while still activating the fixture.

Apply this diff:

```diff
-    @pytest.mark.benchmark
-    def test_wiggle(self, example_patch, cleanup_mpl):
+    @pytest.mark.usefixtures("cleanup_mpl")
+    @pytest.mark.benchmark
+    def test_wiggle(self, example_patch):
         """Time wiggle plot visualization."""
         patch = example_patch.select(distance=(0, 100))  # Subset for performance
         patch.viz.wiggle()
```
129-140: Consider parametrizing `separable` for Hampel to reduce duplication

One parametrized test can cover both `separable=True` and `separable=False`.

Apply this diff:

```diff
-    @pytest.mark.benchmark
-    def test_hampel_filter_non_separable(self, example_patch):
-        """Time the Hampel filter."""
-        example_patch.hampel_filter(
-            threshold=3.0, distance=3, time=5, samples=True, separable=False
-        )
-
-    @pytest.mark.benchmark
-    def test_hampel_filter(self, example_patch):
-        """Time the Hampel filter."""
-        example_patch.hampel_filter(
-            threshold=3.0, distance=3, time=5, samples=True, separable=True
-        )
+    @pytest.mark.parametrize("separable", [False, True])
+    @pytest.mark.benchmark
+    def test_hampel_filter(self, example_patch, separable):
+        """Time the Hampel filter (both non-separable and separable)."""
+        example_patch.hampel_filter(
+            threshold=3.0, distance=3, time=5, samples=True, separable=separable
+        )
```

Also applies to: 135-140
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- benchmarks/test_io_benchmarks.py (1 hunk)
- benchmarks/test_patch_benchmarks.py (9 hunks)
- benchmarks/test_spool_benchmarks.py (4 hunks)
- dascore/proc/hampel.py (5 hunks)
- dascore/utils/__init__.py (1 hunk)
- dascore/utils/misc.py (5 hunks)
- dascore/utils/moving.py (1 hunk)
- tests/test_proc/test_hampel.py (3 hunks)
- tests/test_utils/test_misc.py (1 hunk)
- tests/test_utils/test_moving.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (9)
tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (8): _get_available_engines (100-103), move_max (215-217), move_mean (195-197), move_median (190-192), move_min (210-212), move_std (200-202), move_sum (205-207), moving_window (138-186)

dascore/utils/misc.py (1)
- dascore/exceptions.py (1): MissingOptionalDependencyError (98-99)

dascore/utils/__init__.py (1)
- dascore/utils/moving.py (7): move_max (215-217), move_mean (195-197), move_median (190-192), move_min (210-212), move_std (200-202), move_sum (205-207), moving_window (138-186)

dascore/proc/hampel.py (2)
- dascore/utils/moving.py (1): move_median (190-192)
- dascore/utils/patch.py (1): get_patch_window_size (677-762)

dascore/utils/moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/misc.py (1): optional_import (300-344)

benchmarks/test_io_benchmarks.py (2)
- dascore/exceptions.py (1): MissingOptionalDependencyError (98-99)
- dascore/io/core.py (1): scan_to_df (654-702)

tests/test_utils/test_misc.py (1)
- dascore/utils/misc.py (1): optional_import (300-344)

tests/test_proc/test_hampel.py (2)
- dascore/core/patch.py (2): data (272-283), shape (286-299)
- dascore/proc/hampel.py (1): hampel_filter (98-198)

benchmarks/test_patch_benchmarks.py (1)
- dascore/proc/hampel.py (1): hampel_filter (98-198)
🪛 Ruff (0.13.1)
tests/test_utils/test_moving.py
69-69: Unused method argument: operation
(ARG002)
192-192: Loop control variable description not used within loop body
Rename unused description to _description
(B007)
dascore/utils/moving.py
55-55: Unpacked variable module_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
103-103: Consider ['scipy', *bottle_list] instead of concatenation
Replace with ['scipy', *bottle_list]
(RUF005)
172-172: Avoid specifying long messages outside the exception class
(TRY003)
174-174: Avoid specifying long messages outside the exception class
(TRY003)
benchmarks/test_patch_benchmarks.py
205-205: Unused method argument: cleanup_mpl
(ARG002)
220-220: Unused method argument: cleanup_mpl
(ARG002)
🪛 GitHub Actions: LintCode
tests/test_utils/test_moving.py
[error] 95-95: E501 Line too long (95 > 88)
[error] 105-105: NPY002 Replace legacy np.random.rand call with np.random.Generator
[error] 118-118: E501 Line too long (98 > 88)
dascore/utils/moving.py
[error] 53-53: E501 Line too long (109 > 88)
[error] 75-75: E501 Line too long (114 > 88)
[error] 103-103: RUF005 Consider ["scipy", *bottle_list] instead of concatenation
[error] 190-190: E501 Line too long (108 > 88)
[error] 195-195: E501 Line too long (106 > 88)
[error] 200-200: E501 Line too long (105 > 88)
[error] 205-205: E501 Line too long (105 > 88)
[error] 210-210: E501 Line too long (105 > 88)
[error] 215-215: E501 Line too long (105 > 88)
🔇 Additional comments (4)
benchmarks/test_spool_benchmarks.py (1)
42-42: Benchmark test renames look good

Names are cleaner while preserving behavior and markers. No issues spotted.
Also applies to: 47-47, 54-54, 59-59, 64-64, 78-78, 87-87, 96-96, 103-103
dascore/utils/__init__.py (1)
4-12: Public API re-exports: verify doc discovery / `__all__`

If your docs or API checks rely on `__all__`, ensure these names are added there; otherwise, the re-exports are fine.

Would you like me to add/update an `__all__` list here to make the public surface explicit?
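For context, the effect of an explicit `__all__` can be sketched with a throwaway module (the names here are illustrative, not dascore's actual layout): only names listed in `__all__` are exported by star-imports and picked up by most API-discovery tools, while underscore-prefixed helpers stay reachable but unadvertised.

```python
import types

# Build a tiny stand-in module to show what __all__ controls.
mod = types.ModuleType("utils_demo")
exec(
    "def move_median(x): ...\n"
    "def _private_helper(): ...\n"
    "__all__ = ['move_median']\n",
    mod.__dict__,
)

public = mod.__all__
# Star-import semantics: only names in __all__ are exported.
assert public == ["move_median"]
# The helper still exists on the module; it is just not part of the public surface.
assert hasattr(mod, "_private_helper")
```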
tests/test_utils/test_moving.py (2)
171-175: Engine "invalid_engine" behavior assumption

This test expects a warning and fallback for an unknown engine. Confirm `moving_window`/`_select_engine` indeed warns (not raises) on invalid engine strings; otherwise adjust the test or implementation accordingly.
120-121: Break long function signature to satisfy E501

Wrap the parameters across lines.

Apply this diff:

```diff
-    def test_operation_engine_combinations(self, test_data, operation, engine, available_engines):
+    def test_operation_engine_combinations(
+        self, test_data, operation, engine, available_engines
+    ):
         """Test all operation/engine combinations."""
```

Likely an incorrect or invalid review comment.
dascore/proc/hampel.py (outdated diff)
```diff
-# Convert to float64 to avoid integer overflow/precision loss
-dataf = data.astype(np.float64, copy=False)
-# Local median and MAD via median filters.
+is_int = data.dtype.kind == "i"
+dataf = data.copy() if not is_int else data.astype(np.float32)
+# Apply hampel filtering using appropriate method
 if separable and sum(s > 1 for s in size) >= 2:
-    # Use separable filtering for multi-dimensional windows
-    # This is faster but provides an approximation
-    med = _separable_median(dataf, size, mode)
-    abs_med_diff = np.abs(dataf - med)
-    mad = _separable_median(abs_med_diff, size, mode)
+    # Use separable filtering for multi-dimensional windows (faster approximation)
+    dataf = _hampel_separable(dataf, size, mode, threshold)
 else:
     # Use standard 2D median filter (more accurate but slower)
-    med, abs_med_diff, mad = _calculate_standard_median_and_mad(dataf, size, mode)
-    # Handle mad values of 0 so denominator doesn't blow up.
-    mad_safe = np.where(mad == 0.0, np.finfo(float).eps, mad)
-    # Hampel test and replacement.
-    thresholded = abs_med_diff / mad_safe
-    out = np.where(thresholded > threshold, med, dataf)
+    dataf = _hampel_non_separable(dataf, size, mode, threshold)
 # Cast back to original dtype (round if original was integer)
 if np.issubdtype(data.dtype, np.integer):
-    out = np.rint(out)
-    out = out.astype(data.dtype, copy=False)
+    dataf = np.rint(dataf)
+out = dataf.astype(data.dtype, copy=False)
 return patch.update(data=out)
```
Handle unsigned integer inputs without crashing
Right now only signed integers are promoted to float. Patches with unsigned types (e.g., uint16) keep their dtype, so _hampel_* hits np.finfo(dataf.dtype) and raises ValueError, and arithmetic wraps around. Please cast all integer dtypes before filtering.
```diff
-is_int = data.dtype.kind == "i"
-dataf = data.copy() if not is_int else data.astype(np.float32)
+is_int = np.issubdtype(data.dtype, np.integer)
+dataf = data.astype(np.float32) if is_int else data.copy()
```

Committable suggestion:

```python
is_int = np.issubdtype(data.dtype, np.integer)
dataf = data.astype(np.float32) if is_int else data.copy()
# Apply hampel filtering using appropriate method
if separable and sum(s > 1 for s in size) >= 2:
    # Use separable filtering for multi-dimensional windows (faster approximation)
    dataf = _hampel_separable(dataf, size, mode, threshold)
else:
    # Use standard 2D median filter (more accurate but slower)
    dataf = _hampel_non_separable(dataf, size, mode, threshold)
# Cast back to original dtype (round if original was integer)
if np.issubdtype(data.dtype, np.integer):
    dataf = np.rint(dataf)
out = dataf.astype(data.dtype, copy=False)
```
🤖 Prompt for AI Agents
In dascore/proc/hampel.py around lines 184-198, the current check only treats
signed integer dtypes as needing promotion to float, so unsigned integer patches
(e.g., uint16) are left as-is and cause np.finfo errors or wraparound; change
the dtype check to detect any integer dtype (use np.issubdtype(data.dtype,
np.integer)) and cast those to a floating dtype (e.g., np.float32) before
calling the _hampel_* functions, then after filtering round if original was
integer and cast back to the original dtype; preserve the existing
separable/non-separable flow and ensure copy semantics remain correct.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- benchmarks/test_patch_benchmarks.py (9 hunks)
- dascore/proc/hampel.py (7 hunks)
- dascore/utils/misc.py (5 hunks)
- dascore/utils/moving.py (1 hunk)
- docs/recipes/despiking.qmd (1 hunk)
- tests/test_proc/test_hampel.py (4 hunks)
- tests/test_utils/test_moving.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (6)
dascore/utils/misc.py (1)
- dascore/exceptions.py (1): MissingOptionalDependencyError (98-99)

dascore/proc/hampel.py (2)
- dascore/utils/moving.py (1): move_median (193-197)
- dascore/utils/patch.py (1): get_patch_window_size (677-762)

tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (8): _get_available_engines (103-106), move_max (228-232), move_mean (200-204), move_median (193-197), move_min (221-225), move_std (207-211), move_sum (214-218), moving_window (141-189)

dascore/utils/moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/misc.py (1): optional_import (300-344)

tests/test_proc/test_hampel.py (2)
- dascore/core/patch.py (3): data (272-283), ndim (229-231), shape (286-299)
- dascore/proc/hampel.py (1): hampel_filter (98-211)

benchmarks/test_patch_benchmarks.py (3)
- dascore/proc/coords.py (1): snap_coords (20-53)
- dascore/proc/hampel.py (1): hampel_filter (98-211)
- dascore/viz/waterfall.py (1): waterfall (41-116)
🪛 Ruff (0.13.1)
dascore/utils/moving.py
57-57: Unpacked variable module_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
175-175: Avoid specifying long messages outside the exception class
(TRY003)
177-177: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_proc/test_hampel.py
191-191: Loop control variable spike_val not used within loop body
Rename unused spike_val to _spike_val
(B007)
```diff
 # Need to convert ints to float for calculations to avoid roundoff error.
 # There were issues using np.issubdtype not working so this uses kind.
 is_int = data.dtype.kind in {"i", "u"}
 dataf = data.copy() if not is_int else data.astype(np.float32)
 # Apply hampel filtering using appropriate method
 if approximate:
     dataf = _hampel_separable(dataf, size, mode, threshold)
 else:
     # Use standard 2D median filter (more accurate but slower)
-    med, abs_med_diff, mad = _calculate_standard_median_and_mad(dataf, size, mode)
-    # Handle mad values of 0 so denominator doesn't blow up.
-    mad_safe = np.where(mad == 0.0, np.finfo(float).eps, mad)
-    # Hampel test and replacement.
-    thresholded = abs_med_diff / mad_safe
-    out = np.where(thresholded > threshold, med, dataf)
+    dataf = _hampel_non_separable(dataf, size, mode, threshold)
 # Cast back to original dtype (round if original was integer)
 if np.issubdtype(data.dtype, np.integer):
-    out = np.rint(out)
-    out = out.astype(data.dtype, copy=False)
+    dataf = np.rint(dataf)
+out = dataf.astype(data.dtype, copy=False)
```
Do not downcast integer patches to float32
Casting every integer patch to np.float32 destroys precision for any value above 2**24. For example, an untouched sample with value 16_777_217 becomes 16_777_216 after the round-trip through the filter, so we silently corrupt data even when no spike replacement occurs. We need to promote integers to a dtype that can exactly represent the original range (float64 is sufficient up to 2**53) before filtering.
```diff
-# Need to convert ints to float for calculations to avoid roundoff error.
-# There were issues using np.issubdtype not working so this uses kind.
-is_int = data.dtype.kind in {"i", "u"}
-dataf = data.copy() if not is_int else data.astype(np.float32)
+# Promote integer inputs to a floating dtype that preserves their precision.
+is_int = data.dtype.kind in {"i", "u"}
+dataf = data.copy() if not is_int else data.astype(np.float64, copy=False)
```

Committable suggestion:

```python
# Promote integer inputs to a floating dtype that preserves their precision.
is_int = data.dtype.kind in {"i", "u"}
dataf = data.copy() if not is_int else data.astype(np.float64, copy=False)
# Apply hampel filtering using appropriate method
if approximate:
    dataf = _hampel_separable(dataf, size, mode, threshold)
else:
    # Use standard 2D median filter (more accurate but slower)
    dataf = _hampel_non_separable(dataf, size, mode, threshold)
# Cast back to original dtype (round if original was integer)
if np.issubdtype(data.dtype, np.integer):
    dataf = np.rint(dataf)
out = dataf.astype(data.dtype, copy=False)
```
🤖 Prompt for AI Agents
In dascore/proc/hampel.py around lines 197 to 210, integer patches are being
cast to np.float32 which loses precision for large integers; change the
promotion so integer dtypes are converted to np.float64 (not float32) before
filtering, keep non-integer arrays untouched, run the existing filtering
branches, then perform the same rounding and cast-back logic to the original
dtype; ensure you only promote when data.dtype is integer and preserve copy
semantics.
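The precision-loss claim above is easy to verify with a standalone NumPy check, independent of dascore's code:

```python
import numpy as np

# float32 has a 24-bit significand, so integers above 2**24 are not all
# representable exactly; the round-trip silently changes the value.
x = 16_777_217  # 2**24 + 1

roundtrip32 = int(np.float32(x))
roundtrip64 = int(np.float64(x))

assert roundtrip32 == 16_777_216  # corrupted by the float32 round-trip
assert roundtrip64 == 16_777_217  # float64 is exact for integers up to 2**53
```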
Actionable comments posted: 1
🧹 Nitpick comments (5)
dascore/utils/moving.py (5)
57-57: Clean up naming and fix Ruff unused-variable warning

Rename the unpacked `module_name` to `_` and disambiguate `func_name` to avoid confusion with `_get_engine_function`'s parameter name. No behavior change.

```diff
-    module_name, func_name = OPERATION_REGISTRY[operation]["scipy"]
+    _module_name, scipy_func_name = OPERATION_REGISTRY[operation]["scipy"]
     func = _get_engine_function("scipy", operation)
-    if func_name == "uniform_filter1d":
+    if scipy_func_name == "uniform_filter1d":
     ...
-    elif func_name == "median_filter":
+    elif scipy_func_name == "median_filter":
```

Also applies to: 61-61, 70-70
60-69: Cross-engine boundary/NaN semantics diverge (document or align)

- The SciPy path uses padding (e.g., reflect) via `uniform_filter1d`; scaling by `window` yields a padded-window sum.
- Bottleneck uses sliding windows with `min_count` semantics and NaN-aware reductions.

Results can differ at the edges and in the presence of NaNs. Either document this explicitly or add a `boundary`/`min_count` option at the top API and route it per engine for predictable behavior.

If you want, I can draft tests asserting expected cross-engine behavior at boundaries and with NaNs.

Also applies to: 86-89
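The edge divergence described above can be reproduced with plain NumPy, without SciPy or Bottleneck installed. This is a standalone sketch of the two conventions, not dascore's code: reflect-padding gives every output a full window, while strict sliding windows leave the first `window - 1` outputs undefined.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

# Padded edge handling (what a reflect-mode filter does): the input is
# mirrored at the boundaries, so every output sample sees a full window.
padded = np.pad(data, 1, mode="reflect")  # [2, 1, 2, 3, 4, 5, 4]
mean_padded = np.array([padded[i : i + window].mean() for i in range(len(data))])

# Strict sliding windows (Bottleneck-style): only complete windows exist, so
# the first window - 1 outputs are NaN unless a min_count option relaxes this.
views = np.lib.stride_tricks.sliding_window_view(data, window)
mean_sliding = np.full_like(data, np.nan)
mean_sliding[window - 1 :] = views.mean(axis=1)

# The interiors agree, but edge samples differ between the two conventions.
assert np.allclose(mean_padded[1:4], mean_sliding[2:5])
assert np.isnan(mean_sliding[:2]).all()
```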
105-110: Return type annotation mismatch (tuple vs list)

The function returns a tuple; update the annotation to match and keep type checkers quiet.

```diff
-@cache
-def _get_available_engines() -> list[str]:
+@cache
+def _get_available_engines() -> tuple[str, ...]:
     """Get list of available engines (cached)."""
     bottle_list = [] if bn is None else ["bottleneck"]
     return tuple(["scipy", *bottle_list])
```
112-117: Cache the engine-function lookup and fix param naming

Name the parameter `operation` (it is not a function name) and cache the resolution. This also removes shadowing with the local `func_name` elsewhere.

```diff
-def _get_engine_function(engine: str, func_name: str) -> Callable | None:
-    """Get and cache engine function."""
-    module_name, func_name = OPERATION_REGISTRY[func_name][engine]
+@cache
+def _get_engine_function(engine: str, operation: str) -> Callable:
+    """Get and cache engine function."""
+    module_name, func_name = OPERATION_REGISTRY[operation][engine]
     mod = _get_module(module_name)
     return getattr(mod, func_name)
```
174-186: Normalize operation case and validate/normalize axis

- Accept case-insensitive operations.
- Raise `ParameterError` (not `IndexError`) on a bad axis, and normalize negative axes.

```diff
 data = np.asarray(data)
 # Validate inputs
 if window <= 0:
     raise ParameterError("Window size must be positive")
-if operation not in OPERATION_REGISTRY:
+operation = operation.lower()
+if operation not in OPERATION_REGISTRY:
     raise ParameterError(f"Unknown operation: {operation}")
-if window > data.shape[axis]:
+if not (-data.ndim <= axis < data.ndim):
+    raise ParameterError(
+        f"Axis {axis} is out of bounds for array with ndim={data.ndim}"
+    )
+axis = axis % data.ndim
+if window > data.shape[axis]:
     msg = (
         f"Window size ({window}) larger than data size ({data.shape[axis]}) "
         f"along axis {axis}"
     )
     raise ParameterError(msg)
```
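The axis normalization the diff performs can be sketched in isolation. This is a minimal standalone helper, using `ValueError` as a stand-in for dascore's `ParameterError`:

```python
import numpy as np

def normalize_axis(axis: int, ndim: int) -> int:
    """Validate a (possibly negative) axis and map it into [0, ndim)."""
    if not (-ndim <= axis < ndim):
        raise ValueError(f"Axis {axis} is out of bounds for ndim={ndim}")
    # Python's modulo maps negative axes to their positive equivalents.
    return axis % ndim

arr = np.zeros((10, 20))
assert normalize_axis(-1, arr.ndim) == 1  # last axis
assert normalize_axis(0, arr.ndim) == 0

# Out-of-range axes are rejected instead of raising a bare IndexError later.
try:
    normalize_axis(2, arr.ndim)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```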
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- dascore/utils/moving.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/utils/moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/misc.py (1): optional_import (300-344)
🪛 Ruff (0.13.1)
dascore/utils/moving.py
57-57: Unpacked variable module_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
177-177: Avoid specifying long messages outside the exception class
(TRY003)
179-179: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
dascore/utils/moving.py (3)
24-50: Good registry design and engine mapping

Clear split of SciPy vs Bottleneck, with explicit `None` for unsupported ops (`std` on SciPy). This keeps dispatch simple and discoverable.

63-65: Preserving complex inputs through the SciPy path — nice

Only promoting integer/boolean types to float avoids stripping imaginary parts. Good catch.

119-141: Engine fallback UX is clear

The message and fallback behavior are sensible; raising on unsupported op/engine combos is appropriate. Do we want a short note in the docstring that `std` requires Bottleneck and will error if only SciPy is installed?
Actionable comments posted: 3
🧹 Nitpick comments (2)
dascore/proc/hampel.py (2)
45-74: Minor alloc cuts in the separable path (optional)

Preallocate a boolean mask and use `np.copyto(where=...)` to avoid temporary arrays and fancy indexing.

```diff
-    outlier_mask = abs_med_diff > threshold
-    dataf[outlier_mask] = med[outlier_mask]
+    # Preallocate mask once
+    mask = np.empty(dataf.shape, dtype=bool)
+    np.greater(abs_med_diff, threshold, out=mask)
+    # Copy only where needed
+    np.copyto(dataf, med, where=mask)
```
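That `np.copyto(..., where=...)` produces the same result as boolean fancy-index assignment can be checked with a standalone sketch (random data standing in for a patch; not dascore's actual arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 5))
med = np.zeros_like(data)
mask = np.abs(data) > 0.5

# Fancy-index assignment: gathers the selected values into temporaries.
a = data.copy()
a[mask] = med[mask]

# np.copyto with where=: writes in place, no intermediate gather arrays.
b = data.copy()
np.copyto(b, med, where=mask)

assert np.array_equal(a, b)
```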
76-95: Avoid extra arrays in the exact path; compute in place

Saves two temporaries in the hot path without changing behavior.

```diff
-    # Handle mad values of 0 so denominator doesn't blow up
-    mad_safe = np.where(mad == 0.0, np.finfo(dataf.dtype).eps, mad)
-
-    # Hampel test and replacement
-    thresholded = abs_med_diff / mad_safe
-
-    # Use in-place assignment for consistency
-    outlier_mask = thresholded > threshold
+    # Stabilize denominator in-place
+    eps = np.finfo(dataf.dtype).eps
+    mad[mad == 0.0] = eps
+    # Reuse abs_med_diff buffer
+    np.divide(abs_med_diff, mad, out=abs_med_diff)
+    outlier_mask = abs_med_diff > threshold
     dataf[outlier_mask] = med[outlier_mask]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- dascore/core/spool.py (1 hunk)
- dascore/proc/hampel.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dascore/proc/hampel.py (2)
- dascore/utils/moving.py (1): move_median (202-206)
- dascore/utils/patch.py (1): get_patch_window_size (677-762)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: Run benchmarks
🔇 Additional comments (4)
dascore/proc/hampel.py (4)
12-12: Import looks good

Re-export via utils.moving is clear; usage below is consistent.
204-206: Confirm rounding policy for integer casts

`np.rint` uses bankers' rounding (ties to even), and so does `np.round`, so switching between them does not change tie behavior. If you want "away from zero" semantics you need an explicit formula (e.g., `np.sign(x) * np.floor(np.abs(x) + 0.5)`). Otherwise, please confirm the current behavior is intended.
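The tie-to-even behavior is easy to check directly. A standalone NumPy sketch; the "away from zero" formula is illustrative, not dascore code:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5, -0.5, -1.5])

# np.rint rounds ties to the nearest even integer (bankers' rounding).
assert np.array_equal(np.rint(halves), [0.0, 2.0, 2.0, 4.0, 0.0, -2.0])

# A possible "ties away from zero" alternative, for comparison.
away = np.sign(halves) * np.floor(np.abs(halves) + 0.5)
assert np.array_equal(away, [1.0, 2.0, 3.0, 4.0, -1.0, -2.0])
```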
193-197: Integer round-trip corruption: promote to float64 (not float32) and use a robust dtype check

float32 cannot exactly represent integers above 2**24; the current round-trip corrupts untouched samples. Use `np.issubdtype` and promote integers to float64 before filtering.

```diff
-    # Need to convert ints to float for calculations to avoid roundoff error.
-    # There were issues using np.issubdtype not working so this uses kind.
-    is_int = data.dtype.kind in {"i", "u"}
-    dataf = data.copy() if not is_int else data.astype(np.float32)
+    # Promote integer inputs to float64 to avoid precision loss during filtering.
+    is_int = np.issubdtype(data.dtype, np.integer)
+    dataf = data.copy() if not is_int else data.astype(np.float64, copy=True)
```

Note: If your data can exceed 2**53 (rare for amplitude arrays), we should avoid whole-array round-trips for integer inputs and only write back changed indices. If that's a real use case here, I can propose a mask-based write-back variant.
16-35: Confirm `mode` handling in `move_median`

The `move_median` signature accepts a `mode` kwarg and forwards it to `moving_window`. Under the SciPy backend `mode` is honored by `median_filter`; under the Bottleneck backend it is filtered out (no TypeError), so Bottleneck's default edge behavior is used.
CodSpeed Performance Report

Merging #557 will not alter performance.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           master    #557    +/-   ##
========================================
  Coverage   99.92%   99.92%
========================================
  Files         126      127       +1
  Lines       10481    10586     +105
========================================
+ Hits        10473    10578     +105
  Misses          8        8
```

Flags with carried forward coverage won't be shown. View the full report in Codecov by Sentry.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)

62-69: Pin bottleneck>=1.4.1 in extras (NumPy 2.x / Python 3.13 support).

Unversioned extras can pull incompatible wheels; update `pyproject.toml` and note it in the CHANGELOG:

```diff
-    "bottleneck",
+    "bottleneck>=1.4.1",
```
🧹 Nitpick comments (7)
tests/test_utils/test_moving.py (4)
20-29: Stabilize the test surface by avoiding an internal registry import in parametrization

Importing `OPERATION_REGISTRY` couples tests to internal implementation and can shift test coverage unintentionally if new ops land (e.g., "std"). Prefer an explicit, curated list for this module's scope.

```diff
-TEST_OPERATIONS = list(OPERATION_REGISTRY.keys())
+# Keep this list explicit to avoid accidental coverage drift when registry changes.
+TEST_OPERATIONS = ["median", "mean", "sum", "min", "max", "std"]
```
175-182: Rename the test or docstring: it raises, not warns

The docstring says "warning," but the test asserts a `ParameterError`. Align the wording to avoid confusion.

```diff
-    """Test warning for window larger than data."""
+    """Test error for window larger than data."""
```
65-73: Guard the interior slice for small windows and document the behavior

Consider naming the half-window and adding a brief comment for readability. Note this is not strictly behavior-preserving: for odd windows, `-window // 2` floors to exclude one extra trailing sample, while `-h` trims symmetrically.

```diff
-    interior_slice = slice(window // 2, -window // 2 if window > 2 else None)
+    h = window // 2  # half-window; avoid edges where padding differs by engine
+    interior_slice = slice(h, -h if window > 2 else None)
```
179-182: Remove the unreachable assertion inside the raises block

The `isinstance` check never runs because `moving_window` raises. Drop it to keep the test tight.

```diff
-    with pytest.raises(ParameterError, match="larger than data size"):
-        result = moving_window(data, large_window, "median")
-    assert isinstance(result, np.ndarray)
+    with pytest.raises(ParameterError, match="larger than data size"):
+        moving_window(data, large_window, "median")
```

dascore/proc/hampel.py (2)
203-207: Prevent overflow on the cast back to narrow integer types

Round, then clip to the target dtype's bounds before `astype` to avoid wraparound.

```diff
-    if np.issubdtype(data.dtype, np.integer):
-        dataf = np.rint(dataf)
-    out = dataf.astype(data.dtype, copy=False)
+    if np.issubdtype(data.dtype, np.integer):
+        dataf = np.rint(dataf, out=dataf)
+        info = np.iinfo(data.dtype)
+        np.clip(dataf, info.min, info.max, out=dataf)
+    out = dataf.astype(data.dtype, copy=False)
```
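The value of clipping before the cast can be shown with a small standalone example (uint8 chosen purely for illustration): casting an out-of-range float directly to a narrow integer type wraps or truncates in a platform-dependent way, while clipping first gives a well-defined saturating result.

```python
import numpy as np

# Filtered float values may fall outside the target integer range.
vals = np.array([300.0, -5.0, 128.0])
info = np.iinfo(np.uint8)  # bounds of the target dtype: [0, 255]

# Round, clip to the representable range, then cast: well-defined saturation.
clipped = np.clip(np.rint(vals), info.min, info.max).astype(np.uint8)
assert clipped.tolist() == [255, 0, 128]
```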
84-93: Reduce allocations in exact path by mutating MAD in place
You can avoid creating mad_safe by fixing zeros in-place and dividing with out= to reuse abs_med_diff.
- mad_safe = np.where(mad == 0.0, np.finfo(dataf.dtype).eps, mad) - # Hampel test and replacement - thresholded = abs_med_diff / mad_safe - # Use in-place assignment for consistency - outlier_mask = thresholded > threshold + eps = np.finfo(dataf.dtype).eps + mad[mad == 0.0] = eps + # Reuse abs_med_diff buffer for thresholding + np.divide(abs_med_diff, mad, out=abs_med_diff) + outlier_mask = abs_med_diff > threshold
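The buffer-reuse idea can be seen on its own; the small arrays and threshold here are illustrative, not taken from the PR:

```python
import numpy as np

abs_med_diff = np.array([0.0, 2.0, 9.0])
mad = np.array([0.0, 1.0, 3.0])
threshold = 2.5

# Patch zero MADs in place so the division below stays finite.
mad[mad == 0.0] = np.finfo(float).eps
# Write the ratio into abs_med_diff's own buffer: no new allocation.
np.divide(abs_med_diff, mad, out=abs_med_diff)
outlier_mask = abs_med_diff > threshold
print(outlier_mask.tolist())  # [False, False, True]
```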
147-151: Doc nit: quantify performance claims and qualify with data shape/engine
If you have benchmark numbers, consider citing ranges (e.g., “~3–4x on 2D patches of shape … with bottleneck 1.3.x”). This helps set expectations.
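For context, the Hampel test these comments revolve around can be sketched in a few lines. This is an illustrative 1-D version built on scipy.ndimage.median_filter, not dascore's actual implementation; the window, threshold, and MAD scale factor are conventional choices, not values from the PR:

```python
import numpy as np
from scipy.ndimage import median_filter

def hampel_1d(data, window=5, threshold=3.5):
    """Replace samples that sit far from the rolling median (sketch)."""
    dataf = np.asarray(data, dtype=np.float64).copy()
    med = median_filter(dataf, size=window, mode="nearest")
    abs_diff = np.abs(dataf - med)
    # Scaled MAD approximates a standard deviation for Gaussian data.
    mad = 1.4826 * median_filter(abs_diff, size=window, mode="nearest")
    mad[mad == 0.0] = np.finfo(np.float64).eps
    mask = abs_diff / mad > threshold
    dataf[mask] = med[mask]
    return dataf

spiky = np.array([1.0, 1.0, 1.0, 100.0, 1.0, 1.0, 1.0])
print(hampel_1d(spiky))  # the spike at index 3 is replaced by the local median
```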
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- dascore/core/spool.py (1 hunks)
- dascore/proc/hampel.py (7 hunks)
- pyproject.toml (1 hunks)
- tests/test_utils/test_moving.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dascore/core/spool.py
🧰 Additional context used
🧬 Code graph analysis (2)
dascore/proc/hampel.py (2)
- dascore/utils/moving.py (1): move_median (202-206)
- dascore/utils/patch.py (1): get_patch_window_size (677-762)

tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (7): _get_available_engines (113-116), move_max (237-241), move_mean (209-213), move_median (202-206), move_min (230-234), move_sum (223-227), moving_window (150-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (macos-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code_min_deps (macos-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: Run benchmarks
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
🔇 Additional comments (1)
dascore/proc/hampel.py (1)
28-33: No change needed for the mode kwarg
The Bottleneck wrapper (_apply_bottleneck_operation) explicitly drops unsupported keys (mode, origin, cval), so passing mode into move_median(..., engine="auto") won't reach the Bottleneck function or raise a TypeError.
Likely an incorrect or invalid review comment.
# Need to convert ints to float for calculations to avoid roundoff error.
# There were issues using np.issubdtype not working so this uses kind.
is_int = data.dtype.kind in {"i", "u"}
dataf = data.copy() if not is_int else data.astype(np.float32)
# Apply hampel filtering using appropriate method
Critical: promote integer inputs to float64 (not float32) and use np.issubdtype
Casting integers to float32 loses precision above 2**24 and silently corrupts untouched samples. Also include unsigned integers via np.issubdtype.
- # Need to convert ints to float for calculations to avoid roundoff error.
- # There were issues using np.issubdtype not working so this uses kind.
- is_int = data.dtype.kind in {"i", "u"}
- dataf = data.copy() if not is_int else data.astype(np.float32)
+ # Promote integer inputs to a float dtype that preserves integer precision.
+ is_int = np.issubdtype(data.dtype, np.integer)
+ dataf = data.astype(np.float64, copy=False) if is_int else data.copy()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
# Need to convert ints to float for calculations to avoid roundoff error.
# There were issues using np.issubdtype not working so this uses kind.
is_int = data.dtype.kind in {"i", "u"}
dataf = data.copy() if not is_int else data.astype(np.float32)
# Apply hampel filtering using appropriate method

After:
# Promote integer inputs to a float dtype that preserves integer precision.
is_int = np.issubdtype(data.dtype, np.integer)
dataf = data.astype(np.float64, copy=False) if is_int else data.copy()
# Apply hampel filtering using appropriate method
🤖 Prompt for AI Agents
In dascore/proc/hampel.py around lines 193 to 197, the current logic casts
integer inputs to float32 and detects integers by dtype.kind; change this to use
np.issubdtype(data.dtype, np.integer) (which covers signed and unsigned ints) to
detect integer types and promote integer arrays to float64 instead of float32 to
avoid precision loss; ensure you only cast when needed (leave non-integer dtypes
untouched or copy only when necessary) so downstream calculations use float64
precision.
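Both claims behind this suggestion, that np.issubdtype covers signed and unsigned kinds and that float32 silently loses integers above 2**24, can be verified with NumPy alone:

```python
import numpy as np

# np.issubdtype(dtype, np.integer) covers signed and unsigned kinds alike.
assert np.issubdtype(np.int32, np.integer)
assert np.issubdtype(np.uint16, np.integer)
assert not np.issubdtype(np.float32, np.integer)

# float32 has a 24-bit significand, so 2**24 + 1 cannot be represented exactly:
big = 2**24 + 1
assert int(np.float32(big)) == 2**24  # silently rounded down
assert int(np.float64(big)) == big    # float64 keeps it exact
```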
Actionable comments posted: 0
🧹 Nitpick comments (8)
tests/test_utils/test_moving.py (7)
9-18: Include move_std import to exercise std convenience path (gated by engine).
You're testing all convenience funcs except std; import it so we can add a conditional test without breaking when Bottleneck isn't installed.
from dascore.utils.moving import ( OPERATION_REGISTRY, _get_available_engines, move_max, move_mean, move_median, move_min, + move_std, move_sum, moving_window, )
137-149: Test move_std convenience function when Bottleneck is available.
Std is supported via the Bottleneck engine; add a gated check to cover this path and its equivalence to moving_window.
- def test_convenience_functions(self, test_data): + def test_convenience_functions(self, test_data, available_engines): """Test that convenience functions work correctly.""" data = test_data["1d"] window = 3 for operation, func in TEST_CONVENIENCE_FUNCS.items(): result = func(data, window) self._validate_basic_result(result, data) # Check equivalence with generic function expected = moving_window(data, window, operation) np.testing.assert_array_equal(result, expected) + + # Also validate std where supported (Bottleneck) + if "bottleneck" in available_engines: + result_std = move_std(data, window, engine="bottleneck") + self._validate_basic_result(result_std, data) + expected_std = moving_window(data, window, "std", engine="bottleneck") + np.testing.assert_array_equal(result_std, expected_std)
42-64: Strengthen property check: assert identity sum == mean * window.
Inequality (sum ≥ mean) only holds for non-negative data; here we can make a stricter, window-aware identity to reduce false positives. Pass window into the helper and assert equality on finite positions.
- def _test_finite_properties(self, results, data): + def _test_finite_properties(self, results, data, window): """Test mathematical properties of operations.""" # Find where all results are finite all_finite = np.ones(len(data), dtype=bool) for result in results.values(): all_finite &= np.isfinite(result) if not np.any(all_finite): return # Skip if no finite values finite_indices = np.where(all_finite)[0] # Test properties only where values are finite mean_vals = results["mean"][finite_indices] sum_vals = results["sum"][finite_indices] min_vals = results["min"][finite_indices] max_vals = results["max"][finite_indices] - # Sum should be >= mean (for positive window size) - assert np.all(sum_vals >= mean_vals) + # Identity: sum equals mean * window + np.testing.assert_allclose(sum_vals, mean_vals * window, rtol=1e-12, atol=0.0) # Min should be <= max assert np.all(min_vals <= max_vals)
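The identity being proposed holds by construction for full windows; a quick independent check using NumPy's sliding_window_view (the sample data here is arbitrary):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
window = 3

views = sliding_window_view(data, window)  # shape (3, 3): full windows only
mov_sum = views.sum(axis=1)
mov_mean = views.mean(axis=1)

print(mov_sum.tolist())  # [7.0, 14.0, 28.0]
assert np.allclose(mov_sum, mov_mean * window)
```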
211-223: Propagate window to property helper.
Pass the window so the strengthened identity can be checked.
def test_operation_properties(self, test_data): """Test mathematical properties of operations.""" data = test_data["1d"] window = 3 results = {} for operation in ["median", "mean", "sum", "min", "max"]: results[operation] = moving_window(data, window, operation) - # Test properties where both results are finite - self._test_finite_properties(results, data) + # Test properties where both results are finite + self._test_finite_properties(results, data, window)
175-181: Remove unreachable assertions inside pytest.raises context and align name.
Code after the raising call won't execute; drop it. Optionally rename the test to "raises" for clarity.
def test_large_window_warning(self, test_data): """Test warning for window larger than data.""" data = test_data["1d"] large_window = len(data) + 1 with pytest.raises(ParameterError, match="larger than data size"): - result = moving_window(data, large_window, "median") - assert isinstance(result, np.ndarray) + moving_window(data, large_window, "median")
170-174: Also assert the function returns a valid result when falling back after warning.
Capture the output and check its shape to ensure the fallback engine path works.
def test_unavailable_engine_warns(self, test_data): """Test unavailable engine warns.""" - with pytest.warns(UserWarning, match="not available"): - moving_window(test_data["1d"], 3, "median", engine="invalid_engine") + with pytest.warns(UserWarning, match="not available"): + out = moving_window(test_data["1d"], 3, "median", engine="invalid_engine") + assert isinstance(out, np.ndarray) + assert out.shape == test_data["1d"].shape
115-135: Decouple skip logic from exception message text.
Relying on the substring "not available" is brittle; skip on any ParameterError here since inputs are otherwise valid.
- except ParameterError as e: - if "not available" in str(e): - pytest.skip(f"Operation {operation} not available in {engine}") - else: - raise + except ParameterError as e: + pytest.skip(f"{operation} not supported on engine={engine}: {e}")tests/test_core/test_spool.py (1)
178-194: Strengthen the test with a sanity check that duplicates exist on the dedup subset.
Add a quick pre-check so the test fails loudly if the fixture stops producing duplicates on the expected columns.
def test_drop_duplicates_large_joined_dataframe(self, large_spool_with_duplicates): """Test that duplicates are removed when joined dataframe > 10 rows.""" spool = large_spool_with_duplicates + # Sanity: duplicates must exist on the dedup subset to exercise the path. + df = spool.get_contents() + subset = [c for c in ("network", "station", "data_type", "tag") if c in df.columns] + assert subset, "Expected group columns not present in contents" + assert len(df) > 10, "Joined DataFrame must exceed threshold to hit dedup path" + assert df.duplicated(subset=subset).any(), ( + "Fixture did not create duplicates on the dedup subset" + ) # Access a patch to trigger _get_patches_from_index # This should execute the drop_duplicates code path when len(joined) > 10 patch = spool[0] assert isinstance(patch, dc.Patch)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/test_core/test_spool.py (2 hunks)
- tests/test_utils/test_moving.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (7): _get_available_engines (113-116), move_max (237-241), move_mean (209-213), move_median (202-206), move_min (230-234), move_sum (223-227), moving_window (150-198)

tests/test_core/test_spool.py (4)
- dascore/examples.py (1): get_example_patch (659-706)
- tests/conftest.py (2): patch (375-377), spool (569-571)
- dascore/proc/basic.py (1): update_attrs (96-133)
- dascore/core/spool.py (1): spool (661-690)
🔇 Additional comments (1)
tests/test_core/test_spool.py (1)
128-156: Ignore this suggestion: the fixture’s patches share the default dims attribute, so they already duplicate on (“network”, “station”, “dims”, “data_type”, “tag”) and exercise the dedup path correctly.
Likely an incorrect or invalid review comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_clients/test_dirspool.py (2)
367-372: Strengthen assertions to validate dedup effectiveness, not just success.
Check the chunk spans the full time extent of the (potentially duplicated) contents.
def test_chunk_redundant_index(self, directory_spool_redundant_index): """Ensure redundant indices are handled effectively with chunking""" spool = directory_spool_redundant_index.chunk(time=None) patch = spool[0] assert isinstance(patch, dc.Patch) + contents = directory_spool_redundant_index.get_contents() + assert patch.attrs.time_min == contents["time_min"].min() + assert patch.attrs.time_max == contents["time_max"].max()
71-84: Harden directory_spool_redundant_index fixture to avoid flakiness

- Exclude the index file (.dascore_index.h5) by globbing only *.hdf5 data files.
- Advance each file’s mtime explicitly with os.utime to guarantee duplicates on filesystems with coarse timestamp resolution.
- (Optional) set scope="module" to reduce per-test setup overhead.

Apply this diff:
+import os @pytest.fixture(scope="module") def directory_spool_redundant_index(random_spool, tmp_path_factory): """Force a spool to be indexed many times with same files.""" path = Path(tmp_path_factory.mktemp("redundant_index_spool")) dascore.examples.spool_to_directory(random_spool, path, "dasdae") spool = dc.spool(path).update() - # Touch each file, re-index to saturate index with duplicates. - for _ in range(12): - for file_path in path.glob("*"): - file_path.touch() - spool = spool.update() + # Touch only data files, force monotonic mtime to create duplicates robustly. + for i in range(12): + for file_path in path.glob("*.hdf5"): + st = file_path.stat() + os.utime(file_path, (st.st_atime, st.st_mtime + i + 1)) + spool = spool.update() return spool
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- dascore/core/spool.py (1 hunks)
- pyproject.toml (2 hunks)
- tests/test_clients/test_dirspool.py (2 hunks)
- tests/test_utils/test_moving.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- dascore/core/spool.py
- pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_clients/test_dirspool.py (7)
- dascore/examples.py (2): random_spool (553-576), spool_to_directory (635-655)
- tests/conftest.py (3): random_spool (447-451), spool (569-571), patch (375-377)
- dascore/core/spool.py (4): spool (663-692), update (237-249), chunk (94-141), chunk (501-531)
- dascore/io/indexer.py (2): update (90-95), update (309-346)
- dascore/clients/filespool.py (1): update (71-82)
- dascore/clients/dirspool.py (1): update (105-112)
- dascore/utils/chunk.py (1): chunk (429-471)

tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (8): _get_available_engines (113-116), move_max (237-241), move_mean (209-213), move_median (202-206), move_min (230-234), move_std (216-220), move_sum (223-227), moving_window (150-198)
🔇 Additional comments (15)
tests/test_utils/test_moving.py (15)
1-31: Well-structured test module with comprehensive imports and configuration.
The module imports are clean and appropriate. The test configuration constants (TEST_OPERATIONS, TEST_ENGINES, TEST_CONVENIENCE_FUNCS) provide good coverage for parametrized testing across different operations, engines, and convenience functions.

33-42: Robust basic validation helper method.
The _validate_basic_result helper properly validates fundamental properties: numpy array type, shape preservation, and appropriate numeric dtypes. The dtype check covers float, int, and complex types, which is comprehensive.

43-65: Well-designed finite properties test with mathematical validation.
This helper method intelligently handles cases where operations might produce non-finite values and validates mathematical properties (sum ≥ mean for positive windows, min ≤ max) only where all results are finite. The logic is sound and defensive.

66-96: Thorough engine comparison with reasonable tolerance for edge effects.
The comparison method properly handles interior values to avoid edge effects, validates results within reasonable bounds, and allows for 50% tolerance in relative differences to accommodate different edge handling strategies between engines. This is a pragmatic approach to cross-engine validation.

99-108: Good test data fixtures with deterministic random generation.
The fixtures provide comprehensive test scenarios including 1D/2D arrays, large arrays with deterministic randomness, and edge cases (single element, constant values). Using np.random.default_rng(42) ensures reproducible test runs while satisfying modern NumPy practices.

116-136: Comprehensive parametrized testing with proper error handling.
The test properly handles unavailable engines/operations by skipping rather than failing, which is appropriate for optional dependencies like bottleneck. The exception handling distinguishes between availability issues and actual errors.

137-149: Excellent equivalence testing for convenience functions.
This test validates that convenience functions produce identical results to the generic moving_window interface, which is crucial for API consistency. The approach ensures the convenience functions are proper aliases without behavioral differences.

150-158: Good multi-axis support validation.
Testing operations on different axes ensures the moving window operations work correctly across array dimensions. The test appropriately validates shape preservation for 2D arrays.

160-170: Proper input validation testing.
The tests correctly validate that invalid window sizes and unknown operations raise ParameterError with appropriate error messages. This ensures the API provides clear feedback for invalid inputs.

185-201: Solid edge case testing with meaningful validation.
The edge cases cover important scenarios (single element, constant values) and include the important validation that window size 1 should return the input unchanged. This is mathematically correct behavior.

202-211: Good dtype compatibility testing.
Testing with different numeric types (int32, float32, float64) ensures the moving window operations handle various input dtypes correctly. The test appropriately uses a smaller subset for performance.

213-224: Excellent mathematical property validation.
This test validates fundamental mathematical relationships between different operations (median, mean, sum, min, max) by delegating to the well-designed _test_finite_properties helper. This ensures operations behave consistently with mathematical expectations.

225-240: Strong numerical accuracy testing with known results.
The test validates exact expected values for specific cases, which is crucial for ensuring correctness. Testing both median and mean with known inputs and expected outputs provides confidence in the implementations.

241-262: Robust cross-engine consistency testing.
This test properly handles the optional bottleneck dependency and validates that different engines produce consistent results for the same operations. The graceful handling of unavailable operations and use of the comparison helper method ensures reliable cross-engine validation.

263-270: Appropriate bottleneck-specific testing.
The test specifically validates bottleneck's std operation (which the scipy engine doesn't support) with proper import skipping. This ensures bottleneck-specific functionality is tested when available while gracefully handling its absence.
Actionable comments posted: 0
🧹 Nitpick comments (6)
tests/test_utils/test_moving.py (6)
69-73: Make interior slicing robust; skip tiny/even-window edge cases
For consistent "interior" comparisons across engines, bail early for very small windows and use a symmetric slice without the special-case None. This avoids pulling edge-influenced values (e.g., for window == 2).
83-87: Avoid hard-coded value bounds; remove or derive from dataAsserting results lie in [-10, 10] is brittle and unrelated to input distribution or operation (e.g., “sum” scales with window). Either drop the guard or compute operation-aware bounds. Given you already compare relative means, the simplest fix is to remove this block.
- # Check that both give reasonable results (within data range) - data_min, data_max = -10, 10 # Reasonable range - assert np.all((vals1 >= data_min) & (vals1 <= data_max)) - assert np.all((vals2 >= data_min) & (vals2 <= data_max)) + # Value-range guard removed; relative agreement is checked below.
21-30: Include std convenience check when Bottleneck is availableYou already guard engine availability elsewhere; add
stdhere to increase wrapper coverage.TEST_CONVENIENCE_FUNCS = { "median": move_median, "mean": move_mean, "sum": move_sum, "min": move_min, "max": move_max, } + +# Add std wrapper when bottleneck is available. +if "bottleneck" in _get_available_engines(): + TEST_CONVENIENCE_FUNCS["std"] = move_stdAlso applies to: 137-149
43-65: Strengthen property test: relate sum and mean on interior
sum >= meanis weak and data-sign dependent. Prefersum ≈ mean * windowon interior indices to avoid edge effects. Passwindowinto the helper and assert allclose on the interior slice.- def _test_finite_properties(self, results, data): + def _test_finite_properties(self, results, data, window): """Test mathematical properties of operations.""" @@ - # Sum should be >= mean (for positive window size) - assert np.all(sum_vals >= mean_vals) + # On interior indices, sum should equal mean * window (within tolerance). + left = window // 2 + right = -window // 2 + interior = slice(left, right if window >= 3 else None) + np.testing.assert_allclose( + sum_vals[interior], mean_vals[interior] * window, rtol=1e-6, atol=1e-12 + ) # Min should be <= max assert np.all(min_vals <= max_vals)And update the call site:
- self._test_finite_properties(results, data) + self._test_finite_properties(results, data, window)Also applies to: 215-226
150-158: Broaden multi-axis coverage to more opsAxis handling bugs often differ by op; consider parametrizing over a small set like median/mean/sum/min/max.
- for axis in [0, 1]: - result = move_median(data, window, axis=axis) - assert result.shape == data.shape + for axis in [0, 1]: + for func in (move_median, move_mean, move_sum, move_min, move_max): + result = func(data, window, axis=axis) + assert result.shape == data.shape
208-213: Assert dtype promotion for integer inputs to meanMean over integer arrays should produce a floating dtype; assert it to catch regressions.
- result = move_mean(data, 3) - assert isinstance(result, np.ndarray) + result = move_mean(data, 3) + assert isinstance(result, np.ndarray) + # Mean should promote to floating dtype for integer inputs. + assert result.dtype.kind == "f"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/test_utils/test_moving.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (8): _get_available_engines (113-116), move_max (237-241), move_mean (209-213), move_median (202-206), move_min (230-234), move_std (216-220), move_sum (223-227), moving_window (150-198)
🔇 Additional comments (1)
tests/test_utils/test_moving.py (1)
171-177: select_engine warns and falls back as expected
The function issues a UserWarning containing "not available" and then returns the first available engine ("scipy"), so the test's warning and return-value assertions are valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
tests/test_utils/test_moving.py (6)
45-45: Include unsigned integer dtype in basic result check
Some ops (min/max/sum) can return unsigned kinds for unsigned inputs. Accept "u" as well.
- assert result.dtype.kind in ["f", "i", "c"] # float, int, or complex + assert result.dtype.kind in ("f", "i", "u", "c") # float, (u)int, or complex
147-154: Make skip condition robust to error-message wording
Relying only on "not available" is brittle. Broaden to common phrasings and lowercase once.
- except ParameterError as e: - if "not available" in str(e): - pytest.skip(f"Operation {operation} not available in {engine}") - else: - raise + except ParameterError as e: + msg = str(e).lower() + if any(s in msg for s in ("not available", "unsupported", "unknown operation")): + pytest.skip(f"{operation} not supported by engine={engine}") + raise
169-178: Strengthen multi‑axis checks by asserting equivalence with generic APICurrently only shape is asserted. Compare values to moving_window for each op/axis.
- for axis in [0, 1]: - for func in (move_median, move_mean, move_sum, move_min, move_max): - result = func(data, window, axis=axis) - assert result.shape == data.shape + for axis in [0, 1]: + for operation, func in TEST_CONVENIENCE_FUNCS.items(): + result = func(data, window, axis=axis) + expected = moving_window(data, window, operation, axis=axis) + assert result.shape == data.shape + np.testing.assert_array_equal(result, expected)
191-197: Also verify fallback equivalence on invalid engine
After warning, ensure output matches the automatic engine selection.
def test_unavailable_engine_warns(self, test_data): """Test unavailable engine warns.""" with pytest.warns(UserWarning, match="not available"): - out = moving_window(test_data["1d"], 3, "median", engine="invalid_engine") + out = moving_window(test_data["1d"], 3, "median", engine="invalid_engine") # It should also return an array. assert isinstance(out, np.ndarray) assert out.shape == test_data["1d"].shape + # And it should be equivalent to the auto-selected engine. + expected = moving_window(test_data["1d"], 3, "median", engine="auto") + np.testing.assert_array_equal(out, expected)
288-295: Strengthen bottleneck std test: use constant data and check zeros
Shape-only assertion can miss regressions. Constant input should yield zeros.
def test_bottleneck_std(self, test_data): """ Ensure the bottleneck std works (current scipy engine doesnt have this). """ pytest.importorskip("bottleneck") - out = move_std(test_data["1d"], 10, axis=0, engine="bottleneck") - assert out.shape == test_data["1d"].shape + data = test_data["constant"] + out = move_std(data, 3, axis=0, engine="bottleneck") + assert out.shape == data.shape + assert np.allclose(out, 0.0)
23-23: Optional: build TEST_ENGINES dynamically to avoid skip noise
Derive engines from runtime availability and append "auto".
-TEST_ENGINES = ["scipy", "bottleneck", "auto"] +TEST_ENGINES = [*_get_available_engines(), "auto"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/test_utils/test_moving.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_utils/test_moving.py (2)
- dascore/exceptions.py (1): ParameterError (26-27)
- dascore/utils/moving.py (8): _get_available_engines (113-116), move_max (237-241), move_mean (209-213), move_median (202-206), move_min (230-234), move_std (216-220), move_sum (223-227), moving_window (150-198)
🔇 Additional comments (1)
tests/test_utils/test_moving.py (1)
37-39: Good addition: comprehensive, engine-aware test scaffold

Solid coverage across ops/engines with helpful helpers. Nice use of fixtures and interior checks.
Actionable comments posted: 0
🧹 Nitpick comments (7)
tests/test_utils/test_moving.py (7)
62-76: Unify “interior” indexing with the helper and widen valid region

Use the same symmetric interior definition as _compare_engine_results to avoid off-by-one inconsistencies and cover more valid samples.

```diff
-        if window >= 3 and len(data) >= window:
-            # True interior: start from window-1 to end-(window-1)
-            interior_start = window - 1
-            interior_end = len(data) - (window - 1)
-            if interior_end > interior_start:
-                interior_slice = slice(interior_start, interior_end)
+        if window >= 3 and len(data) >= window:
+            # Symmetric interior consistent with _compare_engine_results
+            interior_slice = slice(window // 2, len(data) - (window // 2))
+            if interior_slice.stop - interior_slice.start > 0:
                 interior_finite = all_finite[interior_slice]
                 if np.any(interior_finite):
                     mean_interior = mean_vals[interior_slice][interior_finite]
                     sum_interior = sum_vals[interior_slice][interior_finite]
                     np.testing.assert_allclose(
                         sum_interior, mean_interior * window, rtol=1e-6, atol=1e-12
                     )
```
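For intuition, the symmetric-interior relation can be checked with plain NumPy; the following is a standalone sketch using `np.convolve` as a stand-in for the library's moving-window helpers, not the test suite's actual fixtures:

```python
import numpy as np

window = 5
data = np.random.default_rng(0).normal(size=50)

# Same-length moving sum/mean; edge windows contain fewer samples.
kernel = np.ones(window)
mov_sum = np.convolve(data, kernel, mode="same")
counts = np.convolve(np.ones_like(data), kernel, mode="same")
mov_mean = mov_sum / counts

# Symmetric interior: drop window // 2 samples from each edge.
interior = slice(window // 2, len(data) - window // 2)

# Every interior window is full, so sum == mean * window there.
np.testing.assert_allclose(mov_sum[interior], mov_mean[interior] * window)
```

On the interior every window holds exactly `window` samples, which is why the sum/mean identity holds there but not at the edges.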
151-154: Make engine-unavailable skip more robust to message phrasing

Avoid coupling to a single substring; handle common variants.

```diff
-        except ParameterError as e:
-            if "not available" in str(e):
-                pytest.skip(f"Operation {operation} not available in {engine}")
-            else:
-                raise
+        except ParameterError as e:
+            msg = str(e).lower()
+            if any(p in msg for p in ("not available", "unsupported", "unavailable")):
+                pytest.skip(f"Operation {operation} not available in {engine}")
+            else:
+                raise
```
191-197: Also assert behavior matches the auto-engine fallback

Strengthen this by verifying output equals the auto-selected engine result.

```diff
     def test_unavailable_engine_warns(self, test_data):
         """Test unavailable engine warns."""
         with pytest.warns(UserWarning, match="not available"):
             out = moving_window(test_data["1d"], 3, "median", engine="invalid_engine")
-        # It should also return an array.
-        assert isinstance(out, np.ndarray)
-        assert out.shape == test_data["1d"].shape
+        # It should also return an array and match auto-engine output.
+        assert isinstance(out, np.ndarray)
+        assert out.shape == test_data["1d"].shape
+        expected = moving_window(test_data["1d"], 3, "median", engine="auto")
+        np.testing.assert_array_equal(out, expected)
```
174-178: Parametrize axis (including negative) for clearer test matrix

Parametrization improves reporting and adds axis=-1 coverage.

```diff
-    def test_multi_axis_operations(self, test_data):
+    @pytest.mark.parametrize("axis", (0, 1, -1))
+    def test_multi_axis_operations(self, test_data, axis):
         """Test operations on different axes."""
         data = test_data["2d"]
         window = 2
-
-        for axis in [0, 1]:
-            for func in (move_median, move_mean, move_sum, move_min, move_max):
-                result = func(data, window, axis=axis)
-                assert result.shape == data.shape
+        for func in (move_median, move_mean, move_sum, move_min, move_max):
+            result = func(data, window, axis=axis)
+            assert result.shape == data.shape
```
276-283: Tighten cross-engine agreement for additive ops on the interior

For mean/sum, engines should agree element-wise on interior. Keep the existing lenient check, but add an allclose assert.

```diff
         for operation in operations_to_test:
             try:
                 result_scipy = moving_window(data, window, operation, engine="scipy")
                 result_bn = moving_window(data, window, operation, engine="bottleneck")
                 # Compare interior values (avoiding edge effects)
                 self._compare_engine_results(result_scipy, result_bn, window)
+                interior = slice(window // 2, -window // 2)
+                if operation in ("mean", "sum"):
+                    np.testing.assert_allclose(
+                        result_scipy[interior], result_bn[interior], rtol=1e-7, atol=1e-12
+                    )
             except ParameterError:
                 # Skip if operation not available in one engine
                 continue
```
228-236: Broaden dtype coverage and simplify promotion check

Include more integer types and assert promotion generically.

```diff
-        dtypes = [np.int32, np.float32, np.float64]
+        dtypes = [np.int32, np.int64, np.uint8, np.float32, np.float64]
         for dtype in dtypes:
             data = base_data.astype(dtype)
             result = move_mean(data, 3)
             assert isinstance(result, np.ndarray)
             # Mean should promote to floating dtype for integer inputs.
-            if dtype in [np.int32]:
-                assert result.dtype.kind == "f"
+            if np.issubdtype(dtype, np.integer):
+                assert result.dtype.kind == "f"
```
107-114: Lax engine-comparison tolerance risks false negatives

Comparing the mean of the interior with a 50% tolerance may miss real mismatches. The per-element allclose suggested in test_engine_consistency will catch genuine divergences; this coarse check can remain as a sanity guard.
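To see why the coarse check is weak, here is a tiny self-contained illustration (synthetic arrays, not the suite's actual fixtures): a single bad sample slips past a 50% mean comparison but fails an element-wise allclose.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 11)
b = a.copy()
b[5] += 0.3  # one clearly wrong sample

interior = slice(2, -2)
# Coarse check: interior means still agree within 50%.
coarse_ok = np.isclose(a[interior].mean(), b[interior].mean(), rtol=0.5)
# Element-wise check: flags the divergence.
strict_ok = np.allclose(a[interior], b[interior], rtol=1e-7, atol=1e-12)
print(bool(coarse_ok), strict_ok)  # -> True False
```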
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_utils/test_moving.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_utils/test_moving.py (2)
dascore/exceptions.py (1)
- ParameterError (26-27)

dascore/utils/moving.py (8)

- _get_available_engines (113-116)
- move_max (237-241)
- move_mean (209-213)
- move_median (202-206)
- move_min (230-234)
- move_std (216-220)
- move_sum (223-227)
- moving_window (150-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: test_code_min_deps (windows-latest, 3.12)
- GitHub Check: test_code_min_deps (macos-latest, 3.13)
- GitHub Check: test_code_min_deps (windows-latest, 3.13)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.12)
- GitHub Check: test_code_min_deps (ubuntu-latest, 3.13)
- GitHub Check: test_code (macos-latest, 3.13)
- GitHub Check: test_code (ubuntu-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.12)
- GitHub Check: test_code (ubuntu-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.13)
- GitHub Check: test_code (windows-latest, 3.11)
- GitHub Check: test_code (windows-latest, 3.10)
- GitHub Check: test_code (windows-latest, 3.12)
- GitHub Check: test_code (macos-latest, 3.11)
- GitHub Check: test_code (macos-latest, 3.10)
- GitHub Check: test_code (ubuntu-latest, 3.13)
- GitHub Check: Run benchmarks
🔇 Additional comments (1)
tests/test_utils/test_moving.py (1)
37-39: Solid, comprehensive test scaffold

Good coverage across ops, engines, axes, validation, and numerical properties. Nice use of importorskip for optional bottleneck.
Description
This PR speeds up the `hampel_filter` by allocating fewer arrays and introducing `bottleneck` as an optional dependency. It also adds a `moving` module for simplifying window operations.
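The optional-dependency dispatch described above can be sketched roughly as follows. This is a hypothetical illustration: the function name, signature, and fallback behavior are assumptions for this sketch, not DASCore's actual `moving` module.

```python
# Sketch of optional-engine dispatch; not DASCore's real implementation.
import warnings

import numpy as np
from scipy.ndimage import median_filter

try:
    import bottleneck as bn
except ImportError:  # bottleneck is an optional dependency
    bn = None


def moving_median(data, window, engine="auto"):
    """Moving median using bottleneck when installed, otherwise scipy."""
    if engine == "auto":
        engine = "bottleneck" if bn is not None else "scipy"
    if engine == "bottleneck" and bn is not None:
        # bottleneck uses a trailing window; min_count=1 keeps edges defined.
        return bn.move_median(data, window=window, min_count=1, axis=-1)
    if engine != "scipy":
        warnings.warn(f"engine {engine!r} not available, falling back to scipy")
    return median_filter(np.asarray(data, dtype=float), size=window, mode="nearest")


out = moving_median(np.arange(10.0), window=3, engine="scipy")
```

Dispatching once per call keeps the fast path allocation-free when `bottleneck` is present, while the scipy path preserves behavior for minimal installs.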
Checklist
I have (if applicable):
Summary by CodeRabbit