Releases: facebookresearch/balance
0.18.0 (2026-03-24)
New Features
- Implemented r_indicator() with validated sample-variance formula
  - Added a public r_indicator(sample_p, target_p) implementation in weighted_comparisons_stats using the documented Eq. 2.2.2 formulation over concatenated propensity vectors and explicit input-size validation.
  - Added validation for non-finite and out-of-range propensity values, and expanded unit coverage for formula correctness and edge cases.
  - Added BalanceDFWeights.r_indicator() as a convenience wrapper, so sample.weights().r_indicator() computes the r-indicator directly.
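
The r-indicator summarizes how much the estimated propensities vary. The following is a minimal sketch, assuming the common definition R = 1 - 2 * S(rho), where S is the sample standard deviation (n - 1 denominator) of the concatenated propensities. The release notes only cite "Eq. 2.2.2", so the exact formula used by balance is an assumption here, and r_indicator_sketch is a hypothetical name, not the library function.

```python
import math
from statistics import stdev


def r_indicator_sketch(sample_p, target_p):
    """Hypothetical stand-in: R = 1 - 2 * S(rho) over concatenated propensities."""
    rho = list(sample_p) + list(target_p)
    if len(rho) < 2:
        raise ValueError("need at least two propensity values")
    # Reject NaN/inf and anything outside the probability range.
    if any(not math.isfinite(p) or p < 0.0 or p > 1.0 for p in rho):
        raise ValueError("propensities must be finite and within [0, 1]")
    return 1.0 - 2.0 * stdev(rho)


# Identical propensities have zero spread, so the indicator is 1.0.
print(r_indicator_sketch([0.5, 0.5], [0.5, 0.5]))
```

Under this definition, values closer to 1 indicate more homogeneous propensities.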
Deprecations
- Sample.design_effect() is deprecated; use sample.weights().design_effect() instead. The method already exists on BalanceDFWeights; the Sample method now emits a DeprecationWarning and delegates. Will be removed in balance 0.19.0.
- Sample.design_effect_prop() is deprecated; use sample.weights().design_effect_prop() instead. New method added to BalanceDFWeights. Will be removed in balance 0.19.0.
- Sample.plot_weight_density() is deprecated; use sample.weights().plot() instead. Will be removed in balance 0.19.0.
- Sample.covar_means() is deprecated; use sample.covars().mean() instead (with .rename(index={'self': 'adjusted'}).reindex([...]).T for the same format). Will be removed in balance 0.19.0.
- Sample.outcome_sd_prop() is deprecated; use sample.outcomes().outcome_sd_prop() instead. New method added to BalanceDFOutcomes. Will be removed in balance 0.19.0.
- Sample.outcome_variance_ratio() is deprecated; use sample.outcomes().outcome_variance_ratio() instead. New method added to BalanceDFOutcomes. Will be removed in balance 0.19.0.
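
The warn-and-delegate pattern described for Sample.design_effect() can be sketched as follows. WeightsView and SampleSketch are hypothetical stand-ins written for this illustration (they are not balance classes), and the design effect is computed with Kish's formula, which is an assumption about what the real method returns.

```python
import warnings


class WeightsView:
    """Hypothetical stand-in for the weights accessor (not balance's class)."""

    def __init__(self, weights):
        self._w = [float(w) for w in weights]

    def design_effect(self):
        # Kish's design effect: n * sum(w^2) / (sum(w))^2.
        n = len(self._w)
        return n * sum(w * w for w in self._w) / sum(self._w) ** 2


class SampleSketch:
    """Hypothetical stand-in for a sample object exposing weights()."""

    def __init__(self, weights):
        self._weights = WeightsView(weights)

    def weights(self):
        return self._weights

    def design_effect(self):
        # Old entry point: warn, then delegate to the new location.
        warnings.warn(
            "design_effect() is deprecated; use weights().design_effect()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.weights().design_effect()


s = SampleSketch([1, 1, 2])
print(s.weights().design_effect())  # preferred call, emits no warning
```

Calling s.design_effect() returns the same value but also emits a DeprecationWarning, which is the behavior callers should expect until the old method is removed.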
LLM/GenAI
- Added CLAUDE.md project context files for Claude Code users, covering architecture, build/test instructions (Meta and open-source), code conventions, and pre-submit checklist.
- Updated .github/copilot-instructions.md review checklist to reduce duplication with CLAUDE.md and add missing conventions (MIT license header, from __future__ import annotations, factory pattern, seed fixing, deprecation style).
Bug Fixes
- prepare_marginal_dist_for_raking / _realize_dicts_of_proportions: fixed memory explosion from LCM expansion
  - When proportions had high decimal precision or many covariates were passed, the LCM of the individual per-variable array lengths could reach tens of millions (or more), causing OOM crashes.
  - Both functions now accept a max_length parameter (default 10000). When the natural LCM exceeds max_length, the output is capped at max_length rows and counts are allocated via the Hare-Niemeyer (largest remainder) method, which guarantees the total stays exactly max_length with minimal rounding error per category.
  - A warning is logged whenever the cap is applied.
  - A new internal helper _hare_niemeyer_allocation implements the allocation logic.
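
The Hare-Niemeyer (largest remainder) method mentioned above can be sketched in a few lines. This is an illustrative version only: largest_remainder is a hypothetical name and signature, not the internal _hare_niemeyer_allocation helper.

```python
import math


def largest_remainder(proportions, total):
    """Split `total` integer counts across categories in proportion to `proportions`."""
    weight_sum = sum(proportions.values())
    quotas = {k: total * p / weight_sum for k, p in proportions.items()}
    # Every category first receives the integer part of its quota.
    counts = {k: math.floor(q) for k, q in quotas.items()}
    leftover = total - sum(counts.values())
    # The remaining units go to the largest fractional remainders.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


print(largest_remainder({"a": 0.5, "b": 0.3, "c": 0.2}, 7))
```

Because only the leftover units are distributed after flooring, the counts always sum to the requested total and each count differs from its exact quota by less than one.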
Contributors
Full Changelog: 0.17.0...0.18.0
0.17.0 (2026-03-17)
Breaking Changes
- CLI: unmentioned columns now go to ignore_columns instead of outcome_columns
  - Previously, when --outcome_columns was not explicitly set, all columns that were not the id, weight, or a covariate were automatically classified as outcome columns. Now those columns are placed into ignore_columns instead.
  - Columns that are explicitly mentioned (the id column, weight column, covariate columns, and outcome columns) are not ignored.
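
The new classification rule can be illustrated with a small sketch. classify_columns and its arguments are hypothetical names chosen for this example; the real CLI logic may differ in detail.

```python
def classify_columns(all_columns, id_column, weight_column, covariates, outcomes=None):
    """Hypothetical sketch of the 0.17.0 rule: unmentioned columns are ignored."""
    outcomes = list(outcomes or [])
    mentioned = {id_column, weight_column, *covariates, *outcomes}
    # Anything not explicitly mentioned is ignored rather than treated as an outcome.
    ignored = [c for c in all_columns if c not in mentioned]
    return {"outcome_columns": outcomes, "ignore_columns": ignored}


cols = ["id", "weight", "age", "gender", "happiness", "notes"]
print(classify_columns(cols, "id", "weight", ["age", "gender"]))
# {'outcome_columns': [], 'ignore_columns': ['happiness', 'notes']}
```

To keep happiness as an outcome under the new behavior, it has to be listed explicitly, for example via --outcome_columns.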
New Features
- ASCII comparative histogram and plot improvements
  - Added ascii_comparative_hist for comparing multiple distributions against a baseline using inline visual indicators (█, ▒, ▐, ░).
  - Comparative ASCII plots now order datasets as population → adjusted → sample.
  - ascii_plot_dist accepts a new comparative keyword (default True) to toggle between comparative and grouped-bar histograms for numeric variables.
Code Quality & Refactoring
- Moved dataset loading implementations out of balance.datasets.__init__
  - Refactored load_sim_data, load_cbps_data, and load_data into balance.datasets.loading_data and re-exported them from balance.datasets to preserve the public API while keeping module responsibilities focused.
Documentation
- ASCII plot documentation and tutorial examples
  - Added rendered text-plot examples to ASCII plot docstrings and documented library="balance" support. Updated balance_quickstart.ipynb with adjusted vs unadjusted ASCII plot examples.
- Improved keep_columns documentation
  - Updated docstrings for has_keep_columns(), keep_columns(), and the --keep_columns argument to clarify that keep columns control which columns appear in the final output CSV. Keep columns that are not id, weight, covariate, or outcome columns will be placed into ignore_columns during processing but are still retained and available in the output.
- Clarified _prepare_input_model_matrix argument docs
  - Updated docstrings in balance.utils.model_matrix with explicit descriptions for sample, target, variables, and add_na behavior when preparing model-matrix inputs.
Bug Fixes
- Weight diagnostics now consistently accept DataFrame inputs
  - design_effect, nonparametric_skew, prop_above_and_below, and weighted_median_breakdown_point now explicitly normalize DataFrame inputs to their first column before computation, matching validation behavior and returning scalar/Series outputs consistently.
- Model-matrix robustness improvements
  - _make_df_column_names_unique() now avoids suffix collisions when columns like a, a_1, and repeated a names appear together, renaming duplicates deterministically to prevent downstream clashes.
  - _prepare_input_model_matrix() now raises a deterministic ValueError when the input sample has zero rows, instead of relying on an assertion.
- Stabilized prop_above_and_below() return paths
  - prop_above_and_below() now builds concatenated outputs only from present Series objects and returns None when both below and above are None, avoiding ambiguous concat inputs while preserving existing behavior for valid threshold sets.
- Validated and normalized comma-separated CLI column arguments
  - CLI column-list arguments now trim surrounding whitespace and reject empty entries (for example, "id,,weight") with clear ValueError messages, preventing malformed column specifications from silently propagating.
  - Applied to --covariate_columns, --covariate_columns_for_diagnostics, --batch_columns, --keep_columns, and --outcome_columns parsing.
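
The trimming and empty-entry validation described for the CLI column arguments can be sketched as follows. parse_column_list is a hypothetical helper written for this illustration; the actual function name and error text in balance are not given in these notes.

```python
def parse_column_list(raw, arg_name):
    """Hypothetical helper: split a comma-separated CLI value into column names."""
    names = [part.strip() for part in raw.split(",")]
    # An empty entry (as in "id,,weight") is treated as a malformed specification.
    if any(name == "" for name in names):
        raise ValueError(f"{arg_name} contains an empty column name: {raw!r}")
    return names


print(parse_column_list(" age , gender", "--covariate_columns"))
# ['age', 'gender']
```

Failing fast here means a typo such as a doubled comma surfaces as a clear error at argument-parsing time instead of as a missing column later in the pipeline.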
Tests
- Added end-to-end adjustment test with ASCII plot output and expanded ASCII plot edge-case coverage
  - TestAsciiPlotsAdjustmentEndToEnd runs the full adjustment pipeline and asserts exact expected ASCII output. Added tests for ascii_plot_dist with comparative=False and mixed categorical+numeric routing.
- Expanded warning coverage for Sample.from_frame() ID inference
  - Added assertions that validate all three expected warnings are emitted when inferring an id column and default weights, including ID guessing, ID string casting, and automatic weight creation.
- Expanded IPW helper and diagnostics test coverage
  - Added tests for link_transform() and calc_dev() to validate behavior for extreme probabilities and finite 10-fold deviance summaries.
  - Refactored diagnostics tests to use a shared IPW setup helper, added edge-case assertions for solver/penalty values, NaN coercion of non-scalar inputs, and now assert labels match fitted model parameters.
- Expanded prop_above_and_below() edge-case coverage
  - Added focused tests for empty threshold iterables, mixed None threshold groups in dict mode, and explicit all-None threshold handling across return formats.
- Added unit coverage for CLI I/O and empty-batch handling
  - Added focused tests for BalanceCLI.process_batch() empty-sample failure payloads, load_and_check_input() CSV loading paths, and write_outputs() delimiter-aware output writing for both adjusted and diagnostics files.
Contributors
@sahil350, @neuralsorcerer, @talgalili
Full Changelog
0.16.0 (2026-02-09)
New Features
- Outcome weight impact diagnostics
  - Added paired outcome-weight impact tests (y*w0 vs y*w1) with confidence intervals.
  - Exposed in BalanceDFOutcomes, Sample.diagnostics(), and the CLI via --weights_impact_on_outcome_method.
- Pandas 3 support
  - Updated compatibility and tests for pandas 3.x
- Categorical distribution metrics without one-hot encoding
  - KLD/EMD/CVMD/KS on BalanceDF.covars() now operate on raw categorical variables (with NA indicators) instead of one-hot encoded columns.
- Misc
  - Raw-covariate adjustment for custom models
    - Sample.adjust() now supports fitting models on raw covariates (without a model matrix) for IPW via use_model_matrix=False. String, object, and boolean columns are converted to pandas Categorical dtype, allowing sklearn estimators with native categorical support (e.g., HistGradientBoostingClassifier with categorical_features="from_dtype") to handle them correctly. Requires scikit-learn >= 1.4 when categorical columns are present.
  - Validate weights include positive values
    - Added a guard in weight diagnostics to error when all weights are zero.
  - Support configurable ID column candidates
    - Sample.from_frame() and guess_id_column() now accept candidate ID column names when auto-detecting the ID column.
  - Formula support for BalanceDF model matrices
    - BalanceDF.model_matrix() now accepts a formula argument to build custom model matrices without precomputing them manually.
Bug Fixes
- Removed deprecated setup build
  - Replaced deprecated setup.py with pyproject.toml build in CI to avoid build failure.
- Hardened ID column candidate validation
  - guess_id_column() now ignores duplicate candidate names and validates that candidates are non-empty strings.
- Hardened pandas 3 compatibility paths
  - Updated string/NA handling and discrete checks for pandas 3 dtypes, and refreshed tests to accept string-backed dtypes.
Packaging & Tests
- Pandas 3.x compatibility
  - Expanded the pandas dependency range to allow pandas 3.x releases.
- Direct util imports in tests
  - Refactored util test modules to import helpers directly from their modules instead of via balance_util.
Breaking Changes
- Require positive weights for weight diagnostics that normalize or aggregate
  - design_effect, nonparametric_skew, prop_above_and_below, and weighted_median_breakdown_point now raise a ValueError when all weights are zero.
  - Migration: ensure your weights include at least one positive value before calling these diagnostics, or catch the ValueError if all-zero weights are possible in your workflow.
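
A short sketch of why such a guard is needed and how calling code might handle it. normalize_weights is a hypothetical function used only for illustration (it assumes non-negative weights); it is not one of the balance diagnostics listed above.

```python
def normalize_weights(weights):
    """Hypothetical sketch: scale non-negative weights to sum to 1."""
    total = sum(weights)
    # Without this guard, all-zero weights would trigger a division by zero.
    if total <= 0:
        raise ValueError("weights must include at least one positive value")
    return [w / total for w in weights]


print(normalize_weights([1, 1, 2]))  # [0.25, 0.25, 0.5]

# Migration pattern: catch the error where all-zero weights are possible.
try:
    normalize_weights([0.0, 0.0])
except ValueError as err:
    print(f"skipping diagnostics: {err}")
```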
Contributors
@neuralsorcerer, @talgalili (with code/methodological review by @talsarig)
Full Changelog: 0.15.0...0.16.0
0.15.0 (2026-01-20)
New Features
- Added EMD/CVMD/KS distribution diagnostics
  - BalanceDF now exposes Earth Mover's Distance (EMD), Cramér-von Mises distance (CVMD), and Kolmogorov-Smirnov (KS) statistics for comparing adjusted samples to targets.
  - These diagnostics support weighted or unweighted comparisons, apply discrete/continuous formulations, and respect aggregate_by_main_covar for one-hot categorical aggregation.
- Exposed outcome columns selection in the CLI
  - Added --outcome_columns to choose which columns are treated as outcomes instead of defaulting to all non-id/weight/covariate columns. Remaining columns are moved to ignored_columns.
- Improved missing data handling in poststratify()
  - poststratify() now accepts na_action to either drop rows with missing values or treat missing values as their own category during weighting.
  - Breaking change: the default behavior now fills missing values in poststratification variables with "__NaN__" and treats this as a distinct category during weighting. Previously, missing values were not handled explicitly, and their treatment depended on pandas groupby and merge defaults. To approximate the legacy behavior where missing values do not form their own category, pass na_action="drop" explicitly.
- Added formula support for descriptive_stats model matrices
  - descriptive_stats() now accepts a formula argument that is always applied to the data (including numeric-only frames), letting callers control which terms and dummy variables are included in summary statistics.
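
The difference between the two missing-data modes of poststratify() can be illustrated with a toy, single-variable version. This is a sketch under assumptions: poststratify_sketch is a hypothetical function, missing values are represented by None, weights are simply target cell count divided by sample cell count, and the value "category" is an invented label for the default mode (only na_action="drop" is named in these notes).

```python
from collections import Counter

NA_LABEL = "__NaN__"  # placeholder value named in the release notes


def poststratify_sketch(sample_cells, target_cells, na_action="category"):
    """Return one weight per kept sample row: target cell count / sample cell count."""
    if na_action == "drop":
        # Legacy-like behavior: rows with missing values are removed.
        sample_cells = [c for c in sample_cells if c is not None]
        target_cells = [c for c in target_cells if c is not None]
    else:
        # Default-like behavior: missing values form their own cell.
        sample_cells = [NA_LABEL if c is None else c for c in sample_cells]
        target_cells = [NA_LABEL if c is None else c for c in target_cells]
    n_sample, n_target = Counter(sample_cells), Counter(target_cells)
    return [n_target[c] / n_sample[c] for c in sample_cells]


sample = ["a", "a", None, "b"]
target = ["a", "b", "b", None, None, None]
print(poststratify_sketch(sample, target))                    # [0.5, 0.5, 3.0, 2.0]
print(poststratify_sketch(sample, target, na_action="drop"))  # [0.5, 0.5, 2.0]
```

In the first call the row with a missing value receives its own weight; in the second it is dropped, so the output has one fewer entry.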
Documentation
- Documented the balance CLI
  - Added full API docstrings for balance.cli and a new CLI tutorial notebook.
- Created Balance CLI tutorial
  - Added CLI command echoing, a load_data() example, and richer diagnostics exploration with metric/variable listings and a browsable diagnostics table. https://import-balance.org/docs/tutorials/balance_cli_tutorial/
- Synchronized docstring examples with test cases
  - Updated user-facing docstrings so the documented examples mirror tested inputs and outputs.
Code Quality & Refactoring
- Added warning when the 'target' sample size is much larger than the 'sample' sample size
  - Sample.adjust() now warns when the target exceeds 100k rows and is at least 10x larger than the sample, highlighting that uncertainty is dominated by the sample (akin to a one-sample comparison).
- Split util helpers into focused modules
  - Broke balance.util into balance.utils submodules for easier navigation.
Bug Fixes
- Updated Sample.__str__() to format weight diagnostics like Sample.summary()
  - Weight diagnostics (design effect, effective sample size proportion, effective sample size) are now displayed on separate lines instead of comma-separated on one line.
  - Replaced "eff." abbreviations with full "effective" word for better readability.
  - Improves consistency with Sample.summary() output format.
- Numerically stable CBPS probabilities
  - The CBPS helper now uses a stable logistic transform to avoid exponential overflow warnings during probability computation in constraint checks.
- Silenced pandas observed default warning
  - Explicitly sets observed=False in weighted categorical KLD calculations to retain current behavior and avoid future pandas default changes.
- Fixed plot_qq_categorical to respect the weighted parameter for target data
  - Previously, the target weights were always applied regardless of the weighted=False setting, causing inconsistent behavior between sample and target proportions in categorical QQ plots.
- Restored CBPS tutorial plots
  - Re-enabled scatter plots in the CBPS comparison tutorial notebook while avoiding GitHub Pages rendering errors and pandas colormap warnings. https://import-balance.org/docs/tutorials/comparing_cbps_in_r_vs_python_using_sim_data/
- Clearer validation errors in adjustment helpers
  - trim_weights() now accepts list/tuple inputs and reports invalid types explicitly.
  - apply_transformations() raises clearer errors for invalid inputs and empty transformations.
- Fixed model_matrix to drop NA rows when requested
  - model_matrix(add_na=False) now actually drops rows containing NA values while preserving categorical levels, matching the documented behavior.
  - Previously, add_na=False only logged a warning without dropping rows; code relying on the old behavior may now see fewer rows and should either handle missingness explicitly or use add_na=True.
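
A "stable logistic transform" of the kind mentioned for the CBPS helper is commonly written by branching on the sign of the input. The following scalar sketch shows the general technique; stable_sigmoid is a hypothetical name and this is not the code used inside balance.

```python
import math


def stable_sigmoid(x):
    """Logistic function that never exponentiates a large positive number."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z lies in [0, 1) and cannot overflow
    return z / (1.0 + z)


print(stable_sigmoid(-1000.0), stable_sigmoid(0.0), stable_sigmoid(1000.0))
# 0.0 0.5 1.0
```

By contrast, the naive form 1 / (1 + math.exp(-x)) raises OverflowError for x = -1000, because it evaluates math.exp(1000).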
Tests
- Aligned formatting toolchain between Meta internal and GitHub CI
  - Added ["fbcode/core_stats/balance"] override to Meta's internal tools/lint/pyfmt/config.toml to use formatter = "black" and sorter = "usort".
  - This ensures both internal (pyfmt/arc lint) and external (GitHub Actions) environments use the same Black 25.1.0 formatter, eliminating formatting drift.
  - Updated CI workflow, pre-commit config, and requirements-fmt.txt to use black==25.1.0.
- Added Pyre type checking to GitHub Actions via .pyre_configuration.external and a new pyre job in the workflow. Tests are excluded due to external typeshed stub differences; library code is fully type-checked.
- Added test coverage workflow and badge to README via .github/workflows/coverage.yml. The workflow collects coverage using pytest-cov, generates HTML and XML reports, uploads them as artifacts, and displays coverage metrics. A coverage badge is now shown in README.md alongside other workflow badges.
- Improved test coverage for edge cases and error handling paths
  - Added targeted tests for previously uncovered code paths across the library, addressing edge cases including empty inputs, verbose logging, error handling for invalid parameters, and boundary conditions in weighting methods (IPW, CBPS, rake).
  - Tests exercise defensive code paths that handle empty DataFrames, NaN convergence values, invalid model types, and non-convergence warnings.
- Split test_util.py into focused test modules
  - Split the large test_util.py file (2325 lines) into 5 modular test files that mirror the balance/utils/ structure:
    - test_util_data_transformation.py - Tests for data transformation utilities
    - test_util_input_validation.py - Tests for input validation utilities
    - test_util_model_matrix.py - Tests for model matrix utilities
    - test_util_pandas_utils.py - Tests for pandas utilities (including high cardinality warnings)
    - test_util_logging_utils.py - Tests for logging utilities
  - This improves test organization and makes it easier to locate tests for specific utilities.
Contributors
Full Changelog: 0.14.0...0.15.0
0.14.0 (2025-12-14)
New Features
- Enhanced adjusted sample summary output
- Richer Sample.summary() diagnostics
  - Adjusted sample summary now groups covariate diagnostics, reports design effect alongside ESSP/ESS, and surfaces weighted outcome means when available.
- Warning of high-cardinality categorical features in .adjust()
- Ignored column handling for Sample inputs
  - Sample.from_frame accepts ignore_columns for columns that should remain on the dataframe but be excluded from covariates and outcome statistics. Ignored columns appear in Sample.df and can be retrieved via Sample.ignored_columns().
Code Quality & Refactoring
- Consolidated diagnostics helpers
  - Added _concat_metric_val_var() helper and balance.util._coerce_scalar for robust diagnostics row construction and scalar-to-float conversion.
  - Breaking change: Sample.diagnostics() for IPW now always emits iteration/intercept summaries plus hyperparameter settings.
Bug Fixes
- Early validation of null weight inputs
  - Sample.from_frame now raises ValueError when weights contain None, NaN, or pd.NA values with count and preview of affected rows.
- Percentile weight trimming across platforms
  - trim_weights() now computes thresholds via percentile quantiles with explicit clipping bounds for consistent behavior across Python/NumPy versions.
  - Breaking change: percentile-based clipping may shift by roughly one observation at typical limits.
- IPW diagnostics improvements
  - Fixed multi_class reporting, normalized scalar hyperparameters to floats, removed deprecated penalty argument warnings, and deduplicated metric entries for stable counts across sklearn versions.
Tests
- Added Windows and macOS CI testing support
  - Expanded GitHub Actions to run on ubuntu-latest, macos-latest, and windows-latest for Python 3.9-3.14.
  - Added tempfile_path() context manager for cross-platform temp file handling and configured matplotlib Agg backend via conftest.py.
Contributors
@neuralsorcerer, @talgalili, @wesleytlee
Full Changelog
0.13.0 (2025-12-02)
New Features
- Propensity modeling beyond static logistic regression
  - ipw() now accepts any sklearn classifier via the model argument, enabling the use of models like random forests and gradient boosting while preserving all existing trimming and diagnostic features. Dense-only estimators and models without linear coefficients are fully supported. Propensity probabilities are stabilized to avoid numerical issues.
  - Allow customization of logistic regression by passing a configured sklearn.linear_model.LogisticRegression instance through the model argument. Also, the CLI now accepts --ipw_logistic_regression_kwargs JSON to build that estimator directly for command-line workflows.
- Covariate diagnostics
  - Added KL divergence calculations for covariate comparisons (numeric and one-hot categorical), exposed via BalanceDF.kld() alongside linked-sample aggregation support.
- Weighting Methods
  - rake() and poststratify() now honour weight_trimming_mean_ratio and weight_trimming_percentile, trimming and renormalising weights through the enhanced trim_weights(..., target_sum_weights=...) API so the documented parameters work as expected (#147).
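
For readers unfamiliar with the KL divergence used in the covariate diagnostics above, here is a minimal sketch for two discrete distributions over the same categories. kl_divergence is a hypothetical standalone function; how BalanceDF.kld() handles weights, numeric covariates, or zero cells is not described in these notes and is not modeled here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned probability lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with p = 0 contributes nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```

A value of 0 means the two distributions agree; larger values indicate that the first distribution diverges more from the second.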
Documentation
- Added comprehensive post-stratification tutorial notebook (balance_quickstart_poststratify.ipynb) (#141, #142, #143).
- Expanded poststratify docstring with clear examples and improved statistical methods documentation (#141).
- Added project badges to README for build status, Python version support, and release tracking (#145).
- Added IPW quickstart tutorial showcasing default logistic regression and custom sklearn classifier usage (balance_quickstart.ipynb).
- Shortened the welcome message (for when importing the package).
Code Quality & Refactoring
- Raking algorithm refactor
  - Removed ipfn dependency and replaced it with a vectorized NumPy implementation (_run_ipf_numpy) for iterative proportional fitting, resulting in significant performance improvements and eliminating an external dependency (#135).
- IPW method refactoring
  - Reduced Cyclomatic Complexity Number (CCN) by extracting repeated code patterns into reusable helper functions: _compute_deviance(), _compute_proportion_deviance(), _convert_to_dense_array().
  - Removed manual ASMD improvement calculation and now uses existing compute_asmd_improvement() from weighted_comparisons_stats.py
- Type safety improvements
  - Migrated 32 Python files from # pyre-unsafe to # pyre-strict mode, covering core modules, statistics, weighting methods, datasets, and test files
  - Modernized type hints to PEP 604 syntax (X | Y instead of Union[X, Y]) across 11 files for improved readability and Python 3.10+ alignment
  - Type alias definitions in typing.py retain Union syntax for Python 3.9 compatibility
  - Enhanced plotting function type safety with TypedDict definitions and proper type narrowing
  - Replaced assert-based type narrowing with _verify_value_type() helper for better error messages and pyre-strict compliance
- Renamed Balance*DF classes to BalanceDF*
  - BalanceCovarsDF to BalanceDFCovars
  - BalanceOutcomesDF to BalanceDFOutcomes
  - BalanceWeightsDF to BalanceDFWeights
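
Iterative proportional fitting, the algorithm behind the raking refactor noted above, alternately rescales a table so that each margin matches its target. The sketch below is a plain-Python, two-dimensional illustration only; ipf_2d is a hypothetical function and is not the vectorized _run_ipf_numpy implementation. It assumes the row and column targets have the same total.

```python
def ipf_2d(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D table until its row and column sums match the targets."""
    cells = [[float(v) for v in row] for row in table]
    for _ in range(max_iter):
        # Step 1: scale each row to its target total.
        for i, row in enumerate(cells):
            total = sum(row)
            if total > 0:
                cells[i] = [v * row_targets[i] / total for v in row]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows match as well.
        max_err = max(abs(sum(row) - t) for row, t in zip(cells, row_targets))
        if max_err < tol:
            break
    return cells


fitted = ipf_2d([[1, 1], [1, 1]], row_targets=[30, 70], col_targets=[40, 60])
print(fitted)
```

With a uniform seed table the fitted cells are proportional to the product of the row and column targets, so this example converges after a single pass.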
Bug Fixes
- Utility Functions
  - Fixed quantize() to preserve column ordering and use proper TypeError exceptions (#133)
- Statistical Functions
  - Fixed division by zero in asmd_improvement() when asmd_mean_before is zero, now returns 0.0 for 0% improvement
- CLI & Infrastructure
  - Replaced deprecated argparse FileType with pathlib.Path (#134)
- Weight Trimming
  - Fixed trim_weights() to consistently return pd.Series with dtype=np.float64 and preserve original index across both trimming methods
  - Fixed percentile-based winsorization edge case: _validate_limit() now automatically adjusts limits to prevent floating-point precision issues (#144)
  - Enhanced documentation for trim_weights() and _validate_limit() with clearer examples and explanations
Tests
- Enhanced test coverage for weight trimming with test_trim_weights_return_type_consistency and 11 comprehensive tests for _validate_limit() covering edge cases, error conditions, and boundary conditions
Contributors
@neuralsorcerer, @talgalili, @wesleytlee
Full Changelog: 0.12.1...0.13.0
0.12.1 (2025-11-03)
New Features
- Added a welcome message when importing the package.
Welcome to balance (Version 0.12.1)!
An open-source Python package for balancing biased data samples.
📖 Documentation: https://import-balance.org/
🛠️ Get Help / Report Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
Sarig, T., Galili, T., & Eilat, R. (2023).
balance - a Python package for balancing biased data samples.
https://arxiv.org/abs/2307.06024
Tip: You can access this information at any time with balance.help()
Documentation
- Added 'CHANGELOG' to the docs website. https://import-balance.org/docs/docs/CHANGELOG/
Bug Fixes
- Fixed plotly figures in all the tutorials. https://import-balance.org/docs/tutorials/
Contributors
Full Changelog: 0.12.0...0.12.1
0.12.0 (2025-10-14)
New Features
- Support for Python 3.13 + 3.14
- Update setup.py and CI/CD integration to include Python 3.13 and 3.14.
- Remove upper version constraints from numpy, pandas, scipy, and scikit-learn dependencies for Python 3.12+.
Contributors
Full Changelog: 0.11.0...0.12.0
0.11.0 (2025-09-24)
New Features
- Python 3.12 support - Complete support for Python 3.12 alongside existing Python 3.9, 3.10, and 3.11 support (with CI/CD integration).
- Implemented Python version-specific dependency constraints - Added conditional version ranges for numpy, pandas, scipy, and scikit-learn that vary based on Python version (e.g., numpy>=1.21.0,<2.0 for Python <3.12, numpy>=1.24.0,<2.1 for Python >=3.12)
- Pandas compatibility improvements - Replaced value_counts(dropna=False) with groupby().size() in frequency table creation to avoid FutureWarning. Fixed various pandas deprecation warnings and improved DataFrame handling.
- Improved raking algorithm - Completely refactored rake weighting from DataFrame-based to array-based ipfn algorithm using multi-dimensional arrays and itertools for better performance and compatibility with latest Python versions. Variables are now automatically alphabetized to ensure consistent results regardless of input order.
- poststratify method enhancement - New strict_matching parameter (default True) handles cases where sample cells are not present in target data. When False, issues a warning and assigns weight 0 to uncovered samples.
Bug Fixes
- Type annotations - Enhanced Pyre type hints throughout the codebase, particularly in utility functions
- Sample class improvements - Fixed weight type assignment (ensuring float64 type), improved DataFrame manipulation with .infer_objects(copy=False) for pandas compatibility, and enhanced weight setting logic
- Website dependencies - Updated various website dependencies including Docusaurus and related packages
Tests
Comprehensive test refactoring, including:
- Enhanced test validation - Added detailed explanations of test methodologies and expected behaviors in docstrings
- Improved test coverage - Tests now include edge cases like NaN handling, different data types, and error conditions
- Improved test organization (more granular) across all test modules (test_stats_and_plots.py, test_balancedf.py, test_ipw.py, test_rake.py, test_cli.py, test_weighted_comparisons_plots.py, test_cbps.py, test_testutil.py, test_adjustment.py, test_util.py, test_sample.py)
- Updated GitHub workflows to include Python 3.12 in build and test matrix
- Fixed 261 "pandas deprecation" warnings!
- Added type annotations - Converted test_balancedf.py to pyre-strict with.
Documentation
- GitHub issue template for support questions - Added structured template to help users ask questions about using the balance package
Contributors
Full Changelog
0.10.0 (2025-01-06)
News
- In this version we transitioned ipw to use sklearn. This enables support for newer Python versions as well as the Windows OS!
- Updated Python and package compatibility. Balance is now compatible with Python 3.11, but no longer compatible with Python 3.8 due to typing errors. Balance is currently incompatible with Python 3.12 due to the removal of distutils.
- Update license from GPL-v2 to the MIT license.
New Features
- Dependency on glmnet has been removed, and the ipw method now uses sklearn.
- The ipw method uses logistic regression with L2-penalties instead of L1-penalties for computational reasons. The transition from glmnet to sklearn and use of L2-penalties will lead to slightly different generated weights compared to previous versions of Balance.
- Unfortunately, the sklearn-based ipw method is generally slower than the previous version by 2-5x. Consider using the new arguments lambda_min, lambda_max, and num_lambdas for a more efficient search over the ipw penalization space.
Bug Fixes
- Fix E721 flake8 issue (see: https://github.com/facebookresearch/balance/actions/runs/5704381365/job/15457952704)
Documentation
- Added links to presentation given at ISA 2023.
- Fixed misc typos.