Scikit Learn Docs PDF
Release 0.23.dev0
scikit-learn developers
1 Welcome to scikit-learn
1.1 Installing scikit-learn
1.2 Frequently Asked Questions
1.3 Support
1.4 Related Projects
1.5 About us
1.6 Who is using scikit-learn?
1.7 Release History
1.8 Roadmap
1.9 Scikit-learn governance and decision-making
5.4 Methods
5.5 Parameters
5.6 Attributes
5.7 Data and sample properties
6 Examples
6.1 Miscellaneous examples
6.2 Biclustering
6.3 Calibration
6.4 Classification
6.5 Clustering
6.6 Covariance estimation
6.7 Cross decomposition
6.8 Dataset examples
6.9 Decision Trees
6.10 Decomposition
6.11 Ensemble methods
6.12 Examples based on real world datasets
6.13 Feature Selection
6.14 Gaussian Mixture Models
6.15 Gaussian Process for Machine Learning
6.16 Generalized Linear Models
6.17 Inspection
6.18 Manifold learning
6.19 Missing Value Imputation
6.20 Model Selection
6.21 Multioutput methods
6.22 Nearest Neighbors
6.23 Neural Networks
6.24 Pipelines and composite estimators
6.25 Preprocessing
6.26 Release Highlights
6.27 Semi Supervised Classification
6.28 Support Vector Machines
6.29 Tutorial exercises
6.30 Working with text documents
7.17 sklearn.impute: Impute
7.18 sklearn.inspection: Inspection
7.19 sklearn.isotonic: Isotonic regression
7.20 sklearn.kernel_approximation: Kernel Approximation
7.21 sklearn.kernel_ridge: Kernel Ridge Regression
7.22 sklearn.linear_model: Linear Models
7.23 sklearn.manifold: Manifold Learning
7.24 sklearn.metrics: Metrics
7.25 sklearn.mixture: Gaussian Mixture Models
7.26 sklearn.model_selection: Model Selection
7.27 sklearn.multiclass: Multiclass and multilabel classification
7.28 sklearn.multioutput: Multioutput regression and classification
7.29 sklearn.naive_bayes: Naive Bayes
7.30 sklearn.neighbors: Nearest Neighbors
7.31 sklearn.neural_network: Neural network models
7.32 sklearn.pipeline: Pipeline
7.33 sklearn.preprocessing: Preprocessing and Normalization
7.34 sklearn.random_projection: Random projection
7.35 sklearn.semi_supervised: Semi-Supervised Learning
7.36 sklearn.svm: Support Vector Machines
7.37 sklearn.tree: Decision Trees
7.38 sklearn.utils: Utilities
7.39 Recently deprecated
Bibliography
Index
CHAPTER ONE
WELCOME TO SCIKIT-LEARN
1.1 Installing scikit-learn
Then run the installation command for your platform and package manager of choice (e.g. pip or conda). In order to check your installation you can use:
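As an illustrative check (assuming scikit-learn was installed into the active environment; the exact commands shown in the original code block are not reproduced here), you can import scikit-learn and print its version and dependency information:

import sklearn

print(sklearn.__version__)   # version of the installed scikit-learn
sklearn.show_versions()      # versions of scikit-learn and its main dependencies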
Note that in order to avoid potential conflicts with other packages it is strongly recommended to use a virtual environment, e.g. a Python 3 virtualenv (see the virtualenv documentation) or a conda environment.
Using an isolated environment makes it possible to install a specific version of scikit-learn and its dependencies independently of any previously installed Python packages. In particular, under Linux it is discouraged to install pip packages alongside the packages managed by the distribution's package manager (apt, dnf, pacman, . . . ).
Also remember to activate the environment of your choice before running any Python command whenever you start a new terminal session.
If you have not installed NumPy or SciPy yet, you can also install these using conda or pip. When using pip, please
ensure that binary wheels are used, and NumPy and SciPy are not recompiled from source, which can happen when
using particular configurations of operating system and hardware (such as Linux on a Raspberry Pi).
If you must install scikit-learn and its dependencies with pip, you can install it as scikit-learn[alldeps].
Scikit-learn plotting capabilities (i.e., functions starting with “plot_” and classes ending with “Display”) require Matplotlib (>= 1.5.1). Running the examples also requires Matplotlib >= 1.5.1; a few examples require scikit-image >= 0.12.3 and a few require pandas >= 0.18.0.
Warning: Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. Scikit-learn now requires
Python 3.5 or newer.
Note: For installing on PyPy, PyPy3-v5.10+, Numpy 1.14.0+, and scipy 1.1.0+ are required.
Some third-party distributions provide versions of scikit-learn integrated with their package-management systems.
These can make installation and upgrading much easier for users since the integration includes the ability to automat-
ically install dependencies (numpy, scipy) that scikit-learn requires.
The following is an incomplete list of OS and python distributions that provide their own version of scikit-learn.
Arch Linux
Arch Linux’s package is provided through the official repositories as python-scikit-learn for Python. It can
be installed by typing the following command:
Debian/Ubuntu
The Debian/Ubuntu package is split into three different packages called python3-sklearn (Python modules), python3-sklearn-lib (low-level implementations and bindings), and python3-sklearn-doc (documentation). Only the Python 3 version is available in Debian Buster (the most recent Debian release). Packages can be installed using apt-get:
Fedora
The Fedora package is called python3-scikit-learn for the Python 3 version, the only one available in Fedora 30. It can be installed using dnf:
NetBSD
MacPorts for Mac OSX
The MacPorts package is named py<XY>-scikits-learn, where XY denotes the Python version. It can be
installed by typing the following command:
Canopy and Anaconda both ship a recent version of scikit-learn, in addition to a large set of scientific Python libraries for Windows, Mac OSX and Linux.
Anaconda offers scikit-learn as part of its free distribution.
This version of scikit-learn comes with alternative solvers for some common estimators. Those solvers come from the
DAAL C++ library and are optimized for multi-core Intel CPUs.
Note that those solvers are not enabled by default; please refer to the daal4py documentation for more details.
Compatibility with the standard scikit-learn solvers is checked by running the full scikit-learn test suite via automated
continuous integration as reported on https://github.com/IntelPython/daal4py.
1.1.3 Troubleshooting
It can happen that pip fails to install packages when reaching the default path size limit of Windows if Python is
installed in a nested location such as the AppData folder structure under the user home directory, for instance:
C:\Users\username>C:\Users\username\AppData\Local\Microsoft\WindowsApps\python.exe -m pip install scikit-learn
Collecting scikit-learn
...
Installing collected packages: scikit-learn
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory:
'C:\\Users\\username\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python37\\site-packages\\sklearn\\datasets\\tests\\data\\openml\\292\\api-v1-json-data-list-data_name-australian-limit-2-data_version-1-status-deactivated.json.gz'
In this case it is possible to lift that limit in the Windows registry by using the regedit tool:
1. Type “regedit” in the Windows start menu to launch regedit.
2. Go to the Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
key.
3. Edit the value of the LongPathsEnabled property of that key and set it to 1.
4. Reinstall scikit-learn (ignoring the previous broken installation):
Here we try to give some answers to questions that regularly pop up on the mailing list.
The project name is scikit-learn, not scikit or SciKit, nor sci-kit learn. It is also not scikits.learn or scikits-learn, which were previously used.
There are multiple scikits, which are scientific toolboxes built around SciPy. You can find a list at https://scikits.appspot.com/scikits. Apart from scikit-learn, another popular one is scikit-image.
See Contributing. Adding a new algorithm is usually a major and lengthy undertaking, so it is recommended to start with known issues. Please do not contact the contributors of scikit-learn directly regarding contributing to scikit-learn.
For general machine learning questions, please use Cross Validated with the [machine-learning] tag.
For scikit-learn usage questions, please use Stack Overflow with the [scikit-learn] and [python] tags. You
can alternatively use the mailing list.
Please make sure to include a minimal reproduction code snippet (ideally shorter than 10 lines) that highlights your
problem on a toy dataset (for instance from sklearn.datasets or randomly generated with functions of numpy.random with a fixed random seed). Please remove any line of code that is not necessary to reproduce your problem.
The problem should be reproducible by simply copy-pasting your code snippet in a Python shell with scikit-learn
installed. Do not forget to include the import statements.
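For illustration only (the toy data and estimator below are arbitrary choices, not part of the original text), a self-contained snippet in that spirit could look like:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data with a fixed random seed so the problem is reproducible.
rng = np.random.RandomState(42)
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))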
More guidance to write good reproduction code snippets can be found at:
https://stackoverflow.com/help/mcve
If your problem raises an exception that you do not understand (even after googling it), please make sure to include
the full traceback that you obtain when running the reproduction script.
For bug reports or feature requests, please make use of the issue tracker on GitHub.
There is also a scikit-learn Gitter channel where some users and developers might be found.
Please do not email any authors directly to ask for assistance, report bugs, or for any other issue related to
scikit-learn.
Don’t make a bunch object! They are not part of the scikit-learn API. Bunch objects are just a way to package some
numpy arrays. As a scikit-learn user you only ever need numpy arrays to feed your model with data.
For instance to train a classifier, all you need is a 2D array X for the input variables and a 1D array y for the target
variables. The array X holds the features as columns and samples as rows. The array y contains integer values to
encode the class membership of each sample in X.
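A minimal sketch of that layout (the numbers and the choice of classifier are purely illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: 2D array, one row per sample and one column per feature.
X = np.array([[0.1, 1.2], [0.4, 0.8], [1.5, 0.3], [1.1, 0.9]])
# y: 1D array with one integer class label per sample.
y = np.array([0, 0, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict([[0.2, 1.0]]))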
1.2.8 How can I load my own datasets into a format usable by scikit-learn?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that
are convertible to numeric arrays such as pandas DataFrame are also acceptable.
For more information on loading your data files into these usable data structures, please refer to loading external
datasets.
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+
citations and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data
structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those which fit well within the current API of
scikit-learn, that is a fit, predict/transform interface and ordinarily having input/output that is a numpy array
or sparse matrix, are accepted.
The contributor should support the importance of the proposed addition with research papers and/or implementations
in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the
methods that are already implemented in scikit-learn at least in some areas.
Inclusion of a new algorithm speeding up an existing model is easier if:
• it does not introduce new hyper-parameters (as it makes the library more future-proof),
• it is easy to document clearly when the contribution improves the speed and when it does not, for instance “when
n_features >> n_samples”,
• benchmarks clearly show a speed up.
Also note that your implementation need not be in scikit-learn to be used together with scikit-learn tools. You can
implement your favorite algorithm in a scikit-learn compatible way, upload it to GitHub and let us know. We will be
happy to list it under Related Projects. If you already have a package on GitHub following the scikit-learn API, you
may also be interested to look at scikit-learn-contrib.
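As a rough, illustrative sketch of what “scikit-learn compatible” means in practice (the class below is a toy example, not an official template), an estimator only needs to follow the fit/predict conventions:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_array, check_X_y

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator that always predicts the most frequent training class."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        classes, counts = np.unique(y, return_counts=True)
        self.majority_ = classes[np.argmax(counts)]
        return self  # fit returns self, as the scikit-learn API expects

    def predict(self, X):
        X = check_array(X)
        return np.full(X.shape[0], self.majority_)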
1.2.10 Why are you so selective on what algorithms you include in scikit-learn?
Code is maintenance cost, and we need to balance the amount of code we have with the size of the team (and add to
this the fact that complexity scales non linearly with the number of features). The package relies on core developers
using their free time to fix bugs, maintain code and review contributions. Any algorithm that is added needs future
attention by the developers, at which point the original author might long have lost interest. See also What are the inclusion criteria for new algorithms? For a great read about long-term maintenance issues in open-source software, look at the Executive Summary of Roads and Bridges.
Not in the foreseeable future. scikit-learn tries to provide a unified API for the basic tasks in machine learning, with
pipelines and meta-algorithms like grid search to tie everything together. The required concepts, APIs, algorithms
and expertise required for structured learning are different from what scikit-learn has to offer. If we started doing
arbitrary structured learning, we’d need to redesign the whole package and the project would likely collapse under its
own weight.
There are two projects with APIs similar to scikit-learn that do structured prediction:
• pystruct handles general structured learning (focuses on SSVMs on arbitrary graph structures with approximate
inference; defines the notion of sample as an instance of the graph structure)
• seqlearn handles sequences only (focuses on exact inference; has HMMs, but mostly for the sake of complete-
ness; treats a feature vector as a sample and uses an offset encoding for the dependencies between feature
vectors)
No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies
and introduce platform specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.
Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed
can often be achieved by a careful choice of algorithms.
In case you didn’t know, PyPy is an alternative Python implementation with a built-in just-in-time compiler. Experi-
mental support for PyPy3-v5.10+ has been added, which requires Numpy 1.14.0+, and scipy 1.1.0+.
scikit-learn estimators assume you’ll feed them real-valued feature vectors. This assumption is hard-coded in pretty
much all of the library. However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see Text feature extraction for the built-in text
vectorizers. For more general feature extraction from any kind of data, see Loading features from dicts and Feature
hashing.
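For example, a brief sketch of turning raw text into a numeric feature matrix with one of the built-in vectorizers (the tiny corpus is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "cats and dogs"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document
print(X.shape)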
Another common case is when you have non-numerical data and a custom distance (or similarity) metric on these data.
Examples include strings with edit distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with distance metrics on arbitrary data can be
done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if the dataset is not too large, you can
compute distances for all pairs of inputs. If the dataset is large, you can use feature vectors with only one “feature”,
which is an index into a separate data structure, and supply a custom metric function that looks up the actual data in
this data structure. E.g., to use DBSCAN with Levenshtein distances:
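The code block from the original document is not reproduced here; a sketch along those lines, using a small pure-Python edit-distance function in place of a dedicated Levenshtein library, could look like:

import numpy as np
from sklearn.cluster import DBSCAN

def lev(a, b):
    # Simple dynamic-programming Levenshtein distance (illustrative only).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])   # each "feature vector" is just an index
    return lev(data[i], data[j])  # look up the actual strings in the list

X = np.arange(len(data)).reshape(-1, 1)
clustering = DBSCAN(eps=5, min_samples=2, metric=lev_metric).fit(X)
print(clustering.labels_)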
1.2.16 Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s
multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as
argument.
The problem is that Python multiprocessing does a fork system call without following it with an exec system
call for performance reasons. Many libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia’s Cuda (and probably many others), manage their own internal thread
pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to change the libraries to make them detect
when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).
But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead
of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX
standard and therefore some software editors like Apple refuse to consider the lack of fork-safety in Accelerate /
vecLib as a bug.
In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods
(instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you
can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However, the user should be aware that using the ‘forkserver’ method prevents joblib.Parallel from calling functions interactively defined in a shell session.
If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the ‘forkserver’ mode globally for your program by inserting the following instructions in your main script:
import multiprocessing

# Set the start method once, at the top of the main script, before any
# process pools are created.
if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
You can find more details on the new start methods in the multiprocessing documentation.
1.2.17 Why does my job use more cores than specified with n_jobs?
This is because n_jobs only controls the number of jobs for routines that are parallelized with joblib, but parallel
code can come from other sources:
• some routines may be parallelized with OpenMP (for code written in C or Cython).
• scikit-learn relies a lot on numpy, which in turn may rely on numerical libraries like MKL, OpenBLAS or BLIS
which can provide parallel implementations.
For more details, please refer to our Parallelism notes.
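As an illustration (the environment variables below are read by OpenMP and by common BLAS implementations such as OpenBLAS and MKL, not by scikit-learn itself; the values are arbitrary), you can cap those thread pools before the numerical libraries are imported, while n_jobs keeps controlling joblib-level parallelism separately:

import os

# Must be set before numpy/scipy (and the BLAS they link against) are imported.
os.environ["OMP_NUM_THREADS"] = "2"        # OpenMP (e.g. Cython/C code)
os.environ["OPENBLAS_NUM_THREADS"] = "2"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "2"        # MKL

import numpy as np  # now limited to the thread counts set above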
1.2.18 Why is there no support for deep or reinforcement learning / Will there be
support for deep or reinforcement learning in scikit-learn?
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning
additionally requiring GPUs for efficient computing. However, neither of these fit within the design constraints of
scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks
to achieve.
You can find more information about addition of gpu support at Will you add GPU support?.
The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a
lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance
and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be
subject to high use immediately and should be of the highest quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working
on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are
busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely
because of this reason.
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-
random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it
relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an
execution’s numpy global random state to 42, one could execute the following in their script:
import numpy as np
np.random.seed(42)
However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure
replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation
splitters have their random_state parameter set.
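A small sketch of that recommendation (the estimator, splitter and data are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
y = rng.randint(0, 2, size=100)

clf = RandomForestClassifier(n_estimators=10, random_state=rng)   # estimator seeded
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)    # splitter seeded
print(cross_val_score(clf, X, y, cv=cv))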
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices of a single numeric dtype. These do
not explicitly represent categorical variables at present. Thus, unlike R’s data.frames or pandas.DataFrame, we require
explicit conversion of categorical features to numeric values, as discussed in Encoding categorical features. See also
Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric)
data.
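As a brief illustration of such a conversion (the toy DataFrame and column names are invented for this example), a categorical column can be one-hot encoded while numeric columns are passed through:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["London", "Paris", "Paris"],
                   "temperature": [21.0, 25.0, 24.0]})

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["city"])],  # encode the categorical column
    remainder="passthrough")                  # keep the numeric column as-is
print(ct.fit_transform(df))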
1.2.22 Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
The homogeneous NumPy and SciPy data objects currently expected are most efficient to process for most operations.
Extensive work would also be needed to support Pandas categorical types. Restricting input to homogeneous types
therefore reduces maintenance cost and encourages usage of efficient data structures.
Currently transform only works for features X in a pipeline. There’s a long-standing discussion about not being able to transform y in a pipeline. Follow github issue #4143. Meanwhile check out sklearn.compose.TransformedTargetRegressor, pipegraph, and imbalanced-learn. Note that scikit-learn solved for the case where y has an invertible transformation applied before training and inverted after prediction. Scikit-learn intends to solve for use cases where y should be transformed at training time and not at test time, for resampling and similar uses, as in imbalanced-learn. In general, these use cases can be solved with a custom meta-estimator rather than a Pipeline.
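For the invertible-transformation case mentioned above, a minimal sketch with TransformedTargetRegressor (the data and the log/exp transform are illustrative) is:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(50, 1))
y = np.exp(0.3 * X.ravel() + rng.normal(scale=0.1, size=50))

# y is log-transformed before fitting and mapped back with exp at predict time.
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
print(reg.predict(X[:3]))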
1.3 Support
• Some scikit-learn developers support users on StackOverflow using the [scikit-learn] tag.
• For general theoretical or methodological machine learning questions, Stack Exchange is probably a more suitable venue.
In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a
question) and put details on what you tried to achieve, what were the expected results and what you observed instead
in the details field.
Code and data snippets are welcome. A minimalistic reproduction script (up to ~20 lines long) is very helpful.
Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the number and type of features (i.e. categorical or numerical), and for supervised learning tasks, what target you are trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification, or continuous variable regression.
If you think you’ve encountered a bug, please report it to the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues
Don’t forget to include:
• steps (or better script) to reproduce,
• expected outcome,
• observed outcome or python (or gdb) tracebacks
To help developers fix your bug faster, please link to a gist on https://gist.github.com holding a standalone minimalistic Python script that reproduces your bug and optionally a minimalistic subsample of your dataset (for instance exported as CSV files using numpy.savetxt).
Note: gists are git cloneable repositories and thus you can use git to push datafiles to them.
1.3.4 IRC
1.3.5 Documentation resources
This documentation is relative to 0.23.dev0. Documentation for other versions can be found here.
Printable pdf documentation for old versions can be found here.
1.4 Related Projects
Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which
facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also
accepts high-quality contributions of repositories conforming to this template.
Below is a list of sister-projects, extensions and domain specific packages.
These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s
estimators.
Data formats
• sklearn_pandas bridge for scikit-learn pipelines and pandas data frames with dedicated transformers.
• sklearn_xarray provides compatibility of scikit-learn estimators with xarray data structures.
Auto-ML
• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects.
Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to
production environments.
• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a ma-
chine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in
replacement for a scikit-learn estimator.
• scikit-optimize A library to minimize (very) expensive and noisy black-box functions. It implements sev-
eral methods for sequential model-based optimization, and includes a replacement for GridSearchCV or
RandomizedSearchCV to do cross-validated parameter search using any of these strategies.
Experimentation frameworks
• REP Environment for conducting data-driven research in a consistent and reproducible way
• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic
interfaces.
• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked
ensembling. Provides a framework for keeping track of model-hyperparameter combinations.
Model inspection and visualisation
• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.
• mlxtend Includes model visualization utilities.
• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine
learning.
• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,
model selection, evaluation, and diagnostics.
Model export for production
• onnxmltools Serializes many Scikit-learn pipelines to ONNX for interchange and prediction.
• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the
help of JPMML-SkLearn library.
• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.
• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)
trained by sklearn. Useful for latency-sensitive production environments.
Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing
interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.
Structured learning
• sktime A scikit-learn compatible toolbox for machine learning with time series including time series classifica-
tion/regression and (supervised/panel) forecasting.
• Seqlearn Sequence classification using HMMs or structured perceptron.
• HMMLearn Implementation of hidden markov models that was previously part of scikit-learn.
• PyStruct General conditional random fields and structured prediction.
• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.
• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).
Deep neural networks etc.
• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.
• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally
• nolearn A number of wrappers and abstractions around existing neural network libraries
• keras Deep Learning library capable of running on top of either TensorFlow or Theano.
• lasagne A lightweight library to build and train neural networks in Theano.
• skorch A scikit-learn compatible neural network library that wraps PyTorch.
Broad scope
• mlxtend Includes a number of additional estimators as well as model visualization utilities.
• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.
Other regression and classification
• xgboost Optimised gradient boosted decision tree library.
• ML-Ensemble Generalized ensemble learning (stacking, blending, subsemble, deep ensembles, etc.).
• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc. . . ).
• py-earth Multivariate adaptive regression splines
• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection
• gplearn Genetic Programming for symbolic regression tasks.
• multiisotonic Isotonic regression on multidimensional features.
• scikit-multilearn Multi-label classification with focus on label space manipulation.
• seglearn Time series and sequence learning using sliding window segmentation.
Decomposition and clustering
• lda: Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling to sample from the true posterior distribution. (scikit-learn’s sklearn.decomposition.LatentDirichletAllocation implementation uses variational inference to sample from a tractable approximation of a topic model’s posterior distribution.)
• Sparse Filtering Unsupervised feature learning based on sparse-filtering
• kmodes k-modes clustering algorithm for categorical data, and several of its variations.
• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.
• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit hyper-
sphere.
Pre-processing
• categorical-encoding A library of sklearn compatible categorical variable encoders.
• imbalanced-learn Various methods to under- and over-sample datasets.
Recommendation Engine packages
• GraphLab Implementation of classical recommendation techniques (in C++, with Python bindings).
• implicit Library for implicit feedback datasets.
• lightfm A Python/Cython implementation of a hybrid recommender system.
• OpenRec TensorFlow-based neural-network inspired recommendation algorithms.
• Spotlight Pytorch-based implementation of deep recommender models.
• Surprise Lib Library for explicit feedback datasets.
1.5 About us
1.5.1 History
This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu
Brucher started work on this project as part of his thesis.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release on February 1st, 2010. Since then, several releases have appeared following a ~3-month cycle, and a thriving international community has been leading the development.
1.5.2 Governance
The decision making process and governance structure of scikit-learn is laid out in the governance document.
1.5.3 Authors
The following people are currently core contributors to scikit-learn’s development and maintenance:
Please do not email the authors directly to ask for assistance or report issues. Instead, please see What’s the best way
to ask questions about scikit-learn in the FAQ.
See also:
How you can contribute to the project
The following people have been active contributors in the past, but are no longer active in the project:
• Mathieu Blondel
• Matthieu Brucher
• Lars Buitinck
• David Cournapeau
• Noel Dawe
• Shiqiao Du
• Vincent Dubourg
• Edouard Duchesnay
• Alexander Fabisch
• Virgile Fritsch
• Satrajit Ghosh
• Angel Soler Gollonet
• Chris Gorgolewski
• Jaques Grobler
• Brian Holt
• Arnaud Joly
• Thouis (Ray) Jones
• Kyle Kastner
• manoj kumar
• Robert Layton
• Wei Li
• Paolo Losi
• Gilles Louppe
• Vincent Michel
• Jarrod Millman
• Alexandre Passos
• Fabian Pedregosa
• Peter Prettenhofer
• (Venkat) Raghav, Rajagopalan
• Jacob Schreiber
• Jake Vanderplas
• David Warde-Farley
• Ron Weiss
If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Bibtex entry:
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
If you want to cite scikit-learn for its API or design, you may also want to consider the following paper:
API design for machine learning software: experiences from the scikit-learn project, Buitinck et al., 2013.
Bibtex entry:
@inproceedings{sklearn_api,
  author    = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
               Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
               Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort
               and Jaques Grobler and Robert Layton and Jake VanderPlas and
               Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
  title     = {{API} design for machine learning software: experiences from the scikit-learn project},
  booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning},
  year      = {2013},
  pages     = {108--122},
}
1.5.6 Artwork
High quality PNG and SVG logos are available in the doc/logos/ source directory.
1.5.7 Funding
Scikit-learn is a community-driven project; however, institutional and private grants help to assure its sustainability.
The project would like to thank the following funders.
The Members of the Scikit-Learn Consortium at Inria Foundation fund Olivier Grisel, Guillaume Lemaitre, Jérémie
du Boisberranger and Chiara Marmo.
Andreas Müller received a grant to improve scikit-learn from the Alfred P. Sloan Foundation. This grant supports the positions of Nicolas Hug and Thomas J. Fan.
Past Sponsors
INRIA actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2017) to work on this project full-time. It also hosts coding sprints and other
events.
Paris-Saclay Center for Data Science funded one year for a developer to work on the project full-time (2014-2015),
50% of the time of Guillaume Lemaitre (2016-2017) and 50% of the time of Joris van den Bossche (2017-2018).
NYU Moore-Sloan Data Science Environment funded Andreas Mueller (2014-2016) to work on this project. The
Moore-Sloan Data Science Environment also funds several students to work on the project part-time.
Télécom Paristech funded Manoj Kumar (2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry Guille-
mot (2016-2017) and Albert Thomas (2017) to work on scikit-learn.
The Labex DigiCosme funded Nicolas Goix (2015-2016), Tom Dupré la Tour (2015-2016 and 2017-2018), Mathurin
Massias (2018-2019) to work part time on scikit-learn during their PhDs. It also funded a scikit-learn coding sprint in
2015.
The following students were sponsored by Google to work on scikit-learn through the Google Summer of Code
program.
• 2007 - David Cournapeau
• 2011 - Vlad Niculae
• 2012 - Vlad Niculae, Immanuel Bayer.
• 2013 - Kemal Eren, Nicolas Trésegnie
• 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar.
• 2015 - Raghav RV, Wei Xue
The NeuroDebian project providing Debian packaging and contributions is supported by Dr. James V. Haxby (Dart-
mouth College).
1.5.8 Sprints
The International 2019 Paris sprint was kindly hosted by AXA. Also some participants could attend thanks to the
support of the Alfred P. Sloan Foundation, the Python Software Foundation (PSF) and the DATAIA Institute.
The 2013 International Paris Sprint was made possible thanks to the support of Télécom Paristech, tinyclues, the
French Python Association and the Fonds de la Recherche Scientifique.
The 2011 International Granada sprint was made possible thanks to the support of the PSF and tinyclues.
If you are interested in donating to the project or to one of our code-sprints, you can use the Paypal button below or the
NumFOCUS Donations Page (if you use the latter, please indicate that you are donating for the scikit-learn project).
All donations will be handled by NumFOCUS, a non-profit organization which is managed by a board of SciPy community members. NumFOCUS’s mission is to foster scientific computing software, in particular in Python. As a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available while in compliance with tax regulations.
The received donations for the scikit-learn project will mostly go towards covering travel expenses for code sprints, as well as towards the organization budget of the project.1
1 Regarding the organization budget in particular, we might use some of the donated funds to pay for other project expenses such as DNS.
Notes
• We would like to thank Rackspace for providing us with a free Rackspace Cloud account to automatically build the documentation and the example gallery for the development version of scikit-learn using this tool.
• We would also like to thank Microsoft Azure, Travis CI, and CircleCI for free CPU time on their Continuous Integration servers.
1.6 Who is using scikit-learn?
1.6.1 J.P.Morgan
Scikit-learn is an indispensable part of the Python machine learning toolkit at JPMorgan. It is very widely used across all parts of the bank for classification, predictive analytics, and very many other machine learning tasks. Its straightforward API, its breadth of algorithms, and the quality of its documentation combine to make scikit-learn simultaneously very approachable and very powerful.
Stephen Simmons, VP, Athena Research, JPMorgan
1.6.2 Spotify
Scikit-learn provides a toolbox with solid implementations of a bunch of state-of-the-art models and makes it easy to
plug them into existing applications. We’ve been using it quite a lot for music recommendations at Spotify and I think
it’s the most well-designed ML package I’ve seen so far.
Erik Bernhardsson, Engineering Manager Music Discovery & Machine Learning, Spotify
1.6.3 Inria
At INRIA, we use scikit-learn to support leading-edge basic research in many teams: Parietal for neuroimaging, Lear
for computer vision, Visages for medical image analysis, Privatics for security. The project is a fantastic tool to address difficult applications of machine learning in an academic environment as it is performant and versatile, but also easy to use and well documented, which makes it well suited to grad students.
Gaël Varoquaux, research at Parietal
1.6.4 betaworks
Betaworks is a NYC-based startup studio that builds new products, grows companies, and invests in others. Over
the past 8 years we’ve launched a handful of social data analytics-driven services, such as Bitly, Chartbeat, digg and
Scale Model. Consistently the betaworks data science team uses Scikit-learn for a variety of tasks. From exploratory
analysis, to product development, it is an essential part of our toolkit. Recent uses are included in digg’s new video
recommender system, and Poncho’s dynamic heuristic subspace clustering.
Gilad Lotan, Chief Data Scientist
1.6.5 Hugging Face
At Hugging Face we’re using NLP and probabilistic models to generate conversational Artificial intelligences that are
fun to chat with. Despite using deep neural nets for a few of our NLP tasks, scikit-learn is still the bread-and-butter of
our daily machine learning routine. The ease of use and predictability of the interface, as well as the straightforward
mathematical explanations that are here when you need them, is the killer feature. We use a variety of scikit-learn
models in production and they are also operationally very pleasant to work with.
Julien Chaumond, Chief Technology Officer
1.6.6 Evernote
Building a classifier is typically an iterative process of exploring the data, selecting the features (the attributes of the
data believed to be predictive in some way), training the models, and finally evaluating them. For many of these tasks,
we relied on the excellent scikit-learn package for Python.
Read more
Mark Ayzenshtat, VP, Augmented Intelligence
1.6.7 Télécom ParisTech
At Telecom ParisTech, scikit-learn is used for hands-on sessions and home assignments in introductory and advanced
machine learning courses. The classes are for undergrads and masters students. The great benefit of scikit-learn is its
fast learning curve that allows students to quickly start working on interesting and motivating problems.
Alexandre Gramfort, Assistant Professor
1.6.8 Booking.com
At Booking.com, we use machine learning algorithms for many different applications, such as recommending ho-
tels and destinations to our customers, detecting fraudulent reservations, or scheduling our customer service agents.
Scikit-learn is one of the tools we use when implementing standard algorithms for prediction tasks. Its API and doc-
umentations are excellent and make it easy to use. The scikit-learn developers do a great job of incorporating state of
the art implementations and new algorithms into the package. Thus, scikit-learn provides convenient access to a wide
spectrum of algorithms, and allows us to readily find the right tool for the right job.
Melanie Mueller, Data Scientist
1.6.9 AWeber
The scikit-learn toolkit is indispensable for the Data Analysis and Management team at AWeber. It allows us to do
AWesome stuff we would not otherwise have the time or resources to accomplish. The documentation is excellent,
allowing new engineers to quickly evaluate and apply many different algorithms to our data. The text feature extraction
utilities are useful when working with the large volume of email content we have at AWeber. The RandomizedPCA
implementation, along with Pipelining and FeatureUnions, allows us to develop complex machine learning algorithms
efficiently and reliably.
Anyone interested in learning more about how AWeber deploys scikit-learn in a production environment should check
out talks from PyData Boston by AWeber’s Michael Becker available at https://github.com/mdbecker/pydata_2013
Michael Becker, Software Engineer, Data Analysis and Management Ninjas
1.6.10 Yhat
The combination of consistent APIs, thorough documentation, and top notch implementation make scikit-learn our
favorite machine learning package in Python. scikit-learn makes doing advanced analysis in Python accessible to
anyone. At Yhat, we make it easy to integrate these models into your production applications. Thus eliminating the
unnecessary dev time encountered productionizing analytical work.
Greg Lamp, Co-founder Yhat
1.6.11 Rangespan
The Python scikit-learn toolkit is a core tool in the data science group at Rangespan. Its large collection of well
documented models and algorithms allow our team of data scientists to prototype fast and quickly iterate to find the
right solution to our learning problems. We find that scikit-learn is not only the right tool for prototyping, but its
careful and well tested implementation give us the confidence to run scikit-learn models in production.
Jurgen Van Gael, Data Science Director at Rangespan Ltd
1.6.12 Birchbox
At Birchbox, we face a range of machine learning problems typical to E-commerce: product recommendation, user
clustering, inventory prediction, trends detection, etc. Scikit-learn lets us experiment with many models, especially in
the exploration phase of a new project: the data can be passed around in a consistent way; models are easy to save and
reuse; updates keep us informed of new developments from the pattern discovery research community. Scikit-learn is
an important tool for our team, built the right way in the right language.
Thierry Bertin-Mahieux, Birchbox, Data Scientist
1.6.13 Bestofmedia Group
Scikit-learn is our #1 toolkit for all things machine learning at Bestofmedia. We use it for a variety of tasks (e.g. spam
fighting, ad click prediction, various ranking models) thanks to the varied, state-of-the-art algorithm implementations
packaged into it. In the lab it accelerates prototyping of complex pipelines. In production I can say it has proven to be
robust and efficient enough to be deployed for business critical components.
Eustache Diemert, Lead Scientist Bestofmedia Group
1.6.14 Change.org
At change.org we automate the use of scikit-learn’s RandomForestClassifier in our production systems to drive email
targeting that reaches millions of users across the world each week. In the lab, scikit-learn’s ease-of-use, performance,
and overall variety of algorithms implemented has proved invaluable in giving us a single reliable source to turn to for
our machine-learning needs.
Vijay Ramesh, Software Engineer in Data/science at Change.org
1.6.15 PHIMECA Engineering
At PHIMECA Engineering, we use scikit-learn estimators as surrogates for expensive-to-evaluate numerical models
(mostly but not exclusively finite-element mechanical models) for speeding up the intensive post-processing operations
involved in our simulation-based decision making framework. Scikit-learn’s fit/predict API together with its efficient
cross-validation tools considerably eases the task of selecting the best-fit estimator. We are also using scikit-learn for
illustrating concepts in our training sessions. Trainees are always impressed by the ease-of-use of scikit-learn despite
the apparent theoretical complexity of machine learning.
Vincent Dubourg, PHIMECA Engineering, PhD Engineer
1.6.16 HowAboutWe
At HowAboutWe, scikit-learn lets us implement a wide array of machine learning techniques in analysis and in pro-
duction, despite having a small team. We use scikit-learn’s classification algorithms to predict user behavior, enabling
us to (for example) estimate the value of leads from a given traffic source early in the lead’s tenure on our site. Also, our
users’ profiles consist of primarily unstructured data (answers to open-ended questions), so we use scikit-learn’s fea-
ture extraction and dimensionality reduction tools to translate these unstructured data into inputs for our matchmaking
system.
Daniel Weitzenfeld, Senior Data Scientist at HowAboutWe
1.6.17 PeerIndex
At PeerIndex we use scientific methodology to build the Influence Graph - a unique dataset that allows us to identify
who’s really influential and in which context. To do this, we have to tackle a range of machine learning and predic-
tive modeling problems. Scikit-learn has emerged as our primary tool for developing prototypes and making quick
progress. From predicting missing data and classifying tweets to clustering communities of social media users, scikit-
learn proved useful in a variety of applications. Its very intuitive interface and excellent compatibility with other Python tools make it an indispensable tool in our daily research efforts.
Ferenc Huszar - Senior Data Scientist at Peerindex
1.6.18 DataRobot
DataRobot is building next generation predictive analytics software to make data scientists more productive, and
scikit-learn is an integral part of our system. The variety of machine learning techniques in combination with the
solid implementations that scikit-learn offers makes it a one-stop-shopping library for machine learning in Python.
Moreover, its consistent API, well-tested code and permissive licensing allow us to use it in a production environment.
Scikit-learn has literally saved us years of work we would have had to do ourselves to bring our product to market.
Jeremy Achin, CEO & Co-founder DataRobot Inc.
1.6.19 OkCupid
We’re using scikit-learn at OkCupid to evaluate and improve our matchmaking system. The range of features it has,
especially preprocessing utilities, means we can use it for a wide variety of projects, and it’s performant enough to
handle the volume of data that we need to sort through. The documentation is really thorough, as well, which makes
the library quite easy to use.
David Koh - Senior Data Scientist at OkCupid
1.6.20 Lovely
At Lovely, we strive to deliver the best apartment marketplace, with respect to our users and our listings. From
understanding user behavior, improving data quality, and detecting fraud, scikit-learn is a regular tool for gathering
insights, predictive modeling and improving our product. The easy-to-read documentation and intuitive architecture of
the API makes machine learning both explorable and accessible to a wide range of python developers. I’m constantly
recommending that more developers and scientists try scikit-learn.
Simon Frid - Data Scientist, Lead at Lovely
1.6.21 Data Publica
Data Publica builds a new predictive sales tool for commercial and marketing teams called C-Radar. We extensively
use scikit-learn to build segmentations of customers through clustering, and to predict future customers based on past
partnerships success or failure. We also categorize companies using their website communication thanks to scikit-learn
and its machine learning algorithm implementations. Eventually, machine learning makes it possible to detect weak
signals that traditional tools cannot see. All these complex tasks are performed in an easy and straightforward way
thanks to the great quality of the scikit-learn framework.
Guillaume Lebourgeois & Samuel Charron - Data Scientists at Data Publica
1.6.22 Machinalis
Scikit-learn is the cornerstone of all the machine learning projects carried at Machinalis. It has a consistent API, a
wide selection of algorithms and lots of auxiliary tools to deal with the boilerplate. We have used it in production en-
vironments on a variety of projects including click-through rate prediction, information extraction, and even counting
sheep!
In fact, we use it so much that we’ve started to freeze our common use cases into Python packages, some of them
open-sourced, like FeatureForge. Scikit-learn in one word: Awesome.
Rafael Carrascosa, Lead developer
1.6.23 Solido
Scikit-learn is helping to drive Moore’s Law, via Solido. Solido creates computer-aided design tools used by the
majority of top-20 semiconductor companies and fabs, to design the bleeding-edge chips inside smartphones, auto-
mobiles, and more. Scikit-learn helps to power Solido’s algorithms for rare-event estimation, worst-case verification,
optimization, and more. At Solido, we are particularly fond of scikit-learn’s libraries for Gaussian Process models,
large-scale regularized linear regression, and classification. Scikit-learn has increased our productivity, because for
many ML problems we no longer need to “roll our own” code. This PyData 2014 talk has details.
Trent McConaghy, founder, Solido Design Automation Inc.
1.6.24 INFONEA
We employ scikit-learn for rapid prototyping and custom-made Data Science solutions within our in-memory based
Business Intelligence Software INFONEA®. As a well-documented and comprehensive collection of state-of-the-art
algorithms and pipelining methods, scikit-learn enables us to provide flexible and scalable scientific analysis solutions.
Thus, scikit-learn is immensely valuable in realizing a powerful integration of Data Science technology within self-
service business analytics.
Thorsten Kranz, Data Scientist, Coma Soft AG.
1.6.25 Dataiku
Our software, Data Science Studio (DSS), enables users to create data services that combine ETL with Machine
Learning. Our Machine Learning module integrates many scikit-learn algorithms. The scikit-learn library is a perfect
integration with DSS because it offers algorithms for virtually all business cases. Our goal is to offer a transparent and
flexible tool that makes it easier to optimize time consuming aspects of building a data service, preparing data, and
training machine learning algorithms on all types of data.
Florian Douetteau, CEO, Dataiku
1.6.26 Otto Group
Here at Otto Group, one of the global Big Five B2C online retailers, we are using scikit-learn in all aspects of our
daily work, from data exploration to the development of machine learning applications to the productive deployment of
those services. It helps us to tackle machine learning problems ranging from e-commerce to logistics. Its consistent
APIs enabled us to build the Palladium REST-API framework around it and continuously deliver scikit-learn based services.
Christian Rammig, Head of Data Science, Otto Group
1.6.27 Zopa
At Zopa, the first ever Peer-to-Peer lending platform, we extensively use scikit-learn to run the business and optimize
our users’ experience. It powers our Machine Learning models involved in credit risk, fraud risk, marketing, and
pricing, and has been used for originating at least 1 billion GBP worth of Zopa loans. It is very well documented,
powerful, and simple to use. We are grateful for the capabilities it has provided, and for allowing us to deliver on our
mission of making money simple and fair.
Vlasios Vasileiou, Head of Data Science, Zopa
1.6.28 MARS
Scikit-Learn is integral to the Machine Learning Ecosystem at Mars. Whether we’re designing better recipes for
petfood or closely analysing our cocoa supply chain, Scikit-Learn is used as a tool for rapidly prototyping ideas and
taking them to production. This allows us to better understand and meet the needs of our consumers worldwide.
Scikit-Learn’s feature-rich toolset is easy to use and equips our associates with the capabilities they need to solve the
business challenges they face every day.
Michael Fitzke Next Generation Technologies Sr Leader, Mars Inc.
Release notes for all scikit-learn releases are linked on this page.
Tip: Subscribe to scikit-learn releases on libraries.io to be notified when new versions are released.
In Development
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• models come here
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.cluster
sklearn.datasets
sklearn.feature_extraction
sklearn.gaussian_process
sklearn.linear_model
• [Fix] Fixed a bug where a sample_weight parameter passed to the fit method of linear_model.RANSACRegressor was not
passed on to the wrapped base_estimator during the fitting of the final model. #15573 by Jeremy Alexandre.
• [Efficiency] linear_model.RidgeCV and linear_model.RidgeClassifierCV no longer allocate a potentially large array
to store dual coefficients for all hyperparameters during their fit, nor an array to store all error or LOO
predictions, unless store_cv_values is True. #15652 by Jérôme Dockès.
• [Fix] Added a best_score_ attribute to linear_model.RidgeCV and linear_model.RidgeClassifierCV. #15653 by Jérôme
Dockès.
sklearn.model_selection
sklearn.preprocessing
sklearn.tree
• [Fix] tree.plot_tree rotate parameter was unused and has been deprecated. #15806 by Chiara Marmo.
sklearn.utils
January 2 2020
This is a bug-fix release to primarily resolve some packaging issues in version 0.22.0. It also includes minor docu-
mentation improvements and some bug fixes.
Changelog
sklearn.cluster
• [Fix] cluster.KMeans with algorithm="elkan" now uses the same stopping criterion as with the default
algorithm="full". #15930 by @inder128.
sklearn.inspection
sklearn.metrics
sklearn.model_selection
sklearn.naive_bayes
sklearn.preprocessing
sklearn.semi_supervised
sklearn.utils
• [Fix] utils.check_array now correctly converts pandas DataFrames with boolean columns to floats. #15797 by Thomas
Fan.
• [Fix] utils.check_is_fitted again accepts an explicit attributes argument to check for specific attributes as
explicit markers of a fitted estimator. When no explicit attributes are provided, only the attributes that end with
an underscore and do not start with a double underscore are used as "fitted" markers. The all_or_any argument is
also no longer deprecated. This change restores some backward compatibility with the behavior of this utility in
version 0.21. #15947 by Thomas Fan.
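For illustration, a minimal sketch of how the restored attributes argument might be used (the LinearRegression
estimator below is an arbitrary example, not part of the change itself):
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

est = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Default behavior: any attribute ending with "_" (and not starting with "__")
# marks the estimator as fitted.
check_is_fitted(est)

# Explicit markers: raises NotFittedError unless the named attributes exist.
check_is_fitted(est, attributes=["coef_", "intercept_"])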
December 3 2019
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 0.22.
Website update
Our website was revamped and given a fresh new look. #14849 by Thomas Fan.
A function or object is public if it is documented in the API Reference and if it can be imported with an import path
without leading underscores. For example sklearn.pipeline.make_pipeline is public, while
sklearn.pipeline._name_estimators is private. sklearn.ensemble._gb.BaseEnsemble is private too because the whole _gb
module is private.
Up to 0.22, some tools were de-facto public (no leading underscore), while they should have been private in the first
place. In version 0.22, these tools have been made properly private, and the public API space has been cleaned. In
addition, importing from most sub-modules is now deprecated: you should for example use from sklearn.cluster import
Birch instead of from sklearn.cluster.birch import Birch (in practice, birch.py has been moved to _birch.py).
Note: All the tools in the public API should be documented in the API Reference. If you find a public tool (without
leading underscore) that isn’t in the API reference, that means it should either be private or documented. Please let us
know by opening an issue!
When deprecating a feature, previous versions of scikit-learn used to raise a DeprecationWarning. Since
DeprecationWarnings aren't shown by default by Python, scikit-learn needed to resort to a custom warning filter to
always show the warnings. That filter would sometimes interfere with users' custom warning filters.
Starting from version 0.22, scikit-learn will show FutureWarnings for deprecations, as recommended by the Python
documentation. FutureWarnings are always shown by default by Python, so the custom filter has been removed and
scikit-learn no longer interferes with user filters. #15080 by Nicolas Hug.
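As a minimal sketch, users who want to silence these deprecations can rely on the standard warnings machinery; the
filter below is an illustrative user-side choice, not something scikit-learn installs:
import warnings

# Silence all FutureWarnings (which is how scikit-learn deprecations now surface);
# narrower filters, e.g. using the message argument, are usually preferable.
warnings.filterwarnings("ignore", category=FutureWarning)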
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans when n_jobs=1. [Fix]
• decomposition.SparseCoder, decomposition.DictionaryLearning, and decomposition.MiniBatchDictionaryLearning [Fix]
• decomposition.SparseCoder with algorithm='lasso_lars' [Fix]
• decomposition.SparsePCA where normalize_components has no effect due to deprecation.
• ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor [Fix], [Feature], [Enhancement]
• impute.IterativeImputer when X has features with no missing values. [Feature]
• linear_model.Ridge when X is sparse. [Fix]
• model_selection.StratifiedKFold and any use of cv=int with a classifier. [Fix]
• cross_decomposition.CCA when using scipy >= 1.3 [Fix]
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.base
• [API Change] From version 0.24 base.BaseEstimator.get_params will raise an AttributeError rather than return None
for parameters that are in the estimator's constructor but not stored as attributes on the instance. #14464 by Joel
Nothman.
sklearn.calibration
sklearn.cluster
sklearn.compose
sklearn.cross_decomposition
sklearn.datasets
sklearn.decomposition
sklearn.dummy
• [Fix] dummy.DummyClassifier now handles checking the existence of the provided constant in multioutput cases.
#14908 by Martina G. Vilas.
• [API Change] The default value of the strategy parameter in dummy.DummyClassifier will change from 'stratified' in
version 0.22 to 'prior' in 0.24. A FutureWarning is raised when the default value is used. #15382 by Thomas Fan.
• [API Change] The outputs_2d_ attribute is deprecated in dummy.DummyClassifier and dummy.DummyRegressor. It is
equivalent to n_outputs > 1. #14933 by Nicolas Hug.
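A small sketch of how the upcoming default change can be handled today, simply by passing strategy explicitly (the
toy data is arbitrary):
from sklearn.dummy import DummyClassifier

# Passing strategy explicitly avoids relying on the changing default
# ('stratified' now, 'prior' from 0.24) and silences the FutureWarning.
clf = DummyClassifier(strategy="prior")
clf.fit([[0], [1], [2]], [0, 1, 1])
print(clf.predict([[3]]))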
sklearn.ensemble
sklearn.feature_extraction
• [Enhancement] A warning will now be raised if a parameter choice means that another parameter will be unused when
calling the fit() method of feature_extraction.text.HashingVectorizer, feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer. #14602 by Gaurav Chawla.
• [Fix] Functions created by build_preprocessor and build_analyzer of feature_extraction.text.VectorizerMixin can
now be pickled. #14430 by Dillon Niederhut.
• [Fix] feature_extraction.text.strip_accents_unicode now correctly removes accents from strings that are in NFKD
normalized form. #15100 by Daniel Grady.
• [Fix] Fixed a bug that caused feature_extraction.DictVectorizer to raise an OverflowError during the transform
operation when producing a scipy.sparse matrix on large input data. #15463 by Norvan Sahiner.
• [API Change] Deprecated the unused copy parameter of feature_extraction.text.TfidfVectorizer.transform; it will be
removed in v0.24. #14520 by Guillem G. Subies.
sklearn.feature_selection
sklearn.gaussian_process
sklearn.impute
• [Major Feature] Added impute.KNNImputer, to impute missing values using k-Nearest Neighbors (see the sketch after
this list). #12852 by Ashim Bhattarai and Thomas Fan and #15010 by Guillaume Lemaitre.
• [Feature] impute.IterativeImputer has a new skip_compute flag that is False by default and, when True, skips
computation on features that have no missing values during the fit phase. #13773 by Sergey Feldman.
• [Efficiency] impute.MissingIndicator.fit_transform avoids repeated computation of the masked matrix. #14356 by
Harsh Soni.
• [Fix] impute.IterativeImputer now works when there is only one feature. By Sergey Feldman.
• [Fix] Fixed a bug in impute.IterativeImputer where features were imputed in the reverse of the desired order with
imputation_order either "ascending" or "descending". #15393 by Venkatachalam N.
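A minimal sketch of impute.KNNImputer on a toy array (the values are illustrative only): each missing entry is
filled with the mean of that feature over the n_neighbors nearest rows, using distances that ignore missing
coordinates.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))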
sklearn.inspection
sklearn.kernel_approximation
sklearn.linear_model
• [Efficiency] The 'liblinear' logistic regression solver is now faster and requires less memory. #14108, #14170,
#14296 by Alex Henrie.
• [Enhancement] linear_model.BayesianRidge now accepts hyperparameters alpha_init and lambda_init which can be used
to set the initial value of the maximization procedure in fit. #13618 by Yoshihiro Uchida.
• [Fix] linear_model.Ridge now correctly fits an intercept when X is sparse, solver="auto" and fit_intercept=True,
because the default solver in this configuration has changed to sparse_cg, which can fit an intercept with sparse
data. #13995 by Jérôme Dockès.
• [Fix] linear_model.Ridge with solver='sag' now accepts F-ordered and non-contiguous arrays and makes a conversion
instead of failing. #14458 by Guillaume Lemaitre.
• [Fix] linear_model.LassoCV no longer forces precompute=False when fitting the final model. #14591 by Andreas
Müller.
• [Fix] linear_model.RidgeCV and linear_model.RidgeClassifierCV now score correctly when cv=None. #14864 by
Venkatachalam N.
• [Fix] Fixed a bug in linear_model.LogisticRegressionCV where the scores_, n_iter_ and coefs_paths_ attributes
would have a wrong ordering with penalty='elastic-net'. #15044 by Nicolas Hug.
• [Fix] linear_model.MultiTaskLassoCV and linear_model.MultiTaskElasticNetCV with X of dtype int and
fit_intercept=True. #15086 by Alex Gramfort.
• [Fix] The liblinear solver now supports sample_weight. #15038 by Guillaume Lemaitre.
sklearn.manifold
sklearn.metrics
• [Major Feature] metrics.plot_roc_curve has been added to plot ROC curves. This function introduces the
visualization API described in the User Guide. #14357 by Thomas Fan.
• [Feature] Added a new parameter zero_division to multiple classification metrics: precision_score, recall_score,
f1_score, fbeta_score, precision_recall_fscore_support, classification_report. It sets the value returned for
ill-defined metrics (see the sketch after this list). #14900 by Marc Torrellas Socastro.
• [Feature] Added the metrics.pairwise.nan_euclidean_distances metric, which calculates Euclidean distances in the
presence of missing values. #12852 by Ashim Bhattarai and Thomas Fan.
• [Feature] New ranking metrics metrics.ndcg_score and metrics.dcg_score have been added to compute Discounted
Cumulative Gain and Normalized Discounted Cumulative Gain. #9951 by Jérôme Dockès.
• [Feature] metrics.plot_precision_recall_curve has been added to plot precision-recall curves. #14936 by Thomas
Fan.
• [Feature] metrics.plot_confusion_matrix has been added to plot confusion matrices. #15083 by Thomas Fan.
• [Feature] Added multiclass support to metrics.roc_auc_score with corresponding scorers 'roc_auc_ovr',
'roc_auc_ovo', 'roc_auc_ovr_weighted', and 'roc_auc_ovo_weighted'. #12789 and #15274 by Kathy Chen, Mohamed Maskani,
and Thomas Fan.
• [Feature] Added metrics.mean_tweedie_deviance measuring the Tweedie deviance for a given power parameter. Also
added the mean Poisson deviance metrics.mean_poisson_deviance and mean Gamma deviance metrics.mean_gamma_deviance,
which are special cases of the Tweedie deviance for power=1 and power=2 respectively. #13938 by Christian Lorentzen
and Roman Yurchak.
• [Efficiency] Improved performance of metrics.pairwise.manhattan_distances in the case of sparse matrices. #15049
by Paolo Toccaceli <ptocca>.
• [Enhancement] The parameter beta in metrics.fbeta_score is updated to accept zero and float('+inf') as values.
#13231 by Dong-hee Na.
• [Enhancement] Added a squared parameter to metrics.mean_squared_error to return the root mean squared error.
#13467 by Urvang Patel.
• [Enhancement] Allow computing averaged metrics in the case of no true positives. #14595 by Andreas Müller.
• [Enhancement] Multilabel metrics now support lists of lists as input. #14865 by Srivatsan Ramesh, Herilalaina
Rakotoarison, Léonard Binet.
• [Enhancement] metrics.median_absolute_error now supports a multioutput parameter. #14732 by Agamemnon Krasoulis.
• [Enhancement] 'roc_auc_ovr_weighted' and 'roc_auc_ovo_weighted' can now be used as the scoring parameter of
model-selection tools. #14417 by Thomas Fan.
• [Enhancement] metrics.confusion_matrix accepts a normalize parameter allowing the confusion matrix to be
normalized by columns, rows, or overall. #15625 by Guillaume Lemaitre <glemaitre>.
• [Fix] Raise a ValueError in metrics.silhouette_score when a precomputed distance matrix contains non-zero diagonal
entries. #12258 by Stephen Tierney.
• [API Change] scoring="neg_brier_score" should be used instead of scoring="brier_score_loss", which is now
deprecated. #14898 by Stefan Matcovici.
sklearn.model_selection
sklearn.multioutput
sklearn.naive_bayes
• [Major Feature] Added naive_bayes.CategoricalNB that implements the Categorical Naive Bayes classifier. #12569 by
Tim Bicker and Florian Wilhelm.
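A minimal sketch of the new estimator on randomly generated, ordinal-encoded features (the data is arbitrary):
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 10))   # each column is a categorical feature encoded as 0..4
y = np.array([1, 2, 3, 4, 5, 6])

clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))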
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [Enhancement] pipeline.Pipeline now supports score_samples if the final estimator does. #13806 by Anaël Beaugnon.
• [Fix] The fit method of FeatureUnion now accepts fit_params to pass to the underlying transformers. #15119 by
Adrin Jalali.
• [API Change] None as a transformer is now deprecated in pipeline.FeatureUnion. Please use 'drop' instead. #15053
by Thomas Fan.
sklearn.preprocessing
sklearn.model_selection
sklearn.svm
• [Enhancement] svm.SVC and svm.NuSVC now accept a break_ties parameter. This parameter results in predict breaking
ties according to the confidence values of decision_function, if decision_function_shape='ovr' and the number of
target classes is greater than 2 (see the sketch after this list). #12557 by Adrin Jalali.
• [Enhancement] SVM estimators now throw a more specific error when kernel='precomputed' and fit on non-square data.
#14336 by Gregory Dexter.
• [Fix] svm.SVC, svm.SVR, svm.NuSVR and svm.OneClassSVM generated an invalid model when negative or zero values were
passed for the sample_weight parameter in fit(). This behavior occurred only in some border scenarios. Now in these
cases, fit() will fail with an Exception. #14286 by Alex Shacked.
• [Fix] The n_support_ attribute of svm.SVR and svm.OneClassSVM was previously uninitialized and had size 2. It now
has size 1 with the correct value. #15099 by Nicolas Hug.
• [Fix] Fixed a bug in BaseLibSVM._sparse_fit where n_SV=0 raised a ZeroDivisionError. #14894 by Danna Naser.
• [Fix] The liblinear solver now supports sample_weight. #15038 by Guillaume Lemaitre.
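A small sketch of the break_ties parameter mentioned above, using the iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# With more than two classes and decision_function_shape='ovr', break_ties=True
# resolves prediction ties using the decision_function values instead of
# returning the first of the tied classes.
clf = SVC(decision_function_shape="ovr", break_ties=True, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))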
sklearn.tree
sklearn.utils
• [API Change] The following utilities have been deprecated and are now private:
– mocking.CheckingClassifier
– optimize.newton_cg
– random.random_choice_csc
– utils.choose_check_classifiers_labels
– utils.enforce_estimator_tags_y
– utils.optimize.newton_cg
– utils.random.random_choice_csc
– utils.safe_indexing
– utils.mocking
– utils.fast_dict
– utils.seq_dataset
– utils.weight_vector
– utils.fixes.parallel_helper (removed)
– All of utils.testing except for all_estimators which is now in utils.
sklearn.isotonic
Miscellaneous
• [Fix] Ported lobpcg from SciPy, which implements some bug fixes that are only available in SciPy 1.3+. #13609 and
#14971 by Guillaume Lemaitre.
• [API Change] Scikit-learn now converts any input data structure implementing a duck array to a numpy array (using
__array__) to ensure consistent behavior instead of relying on __array_function__ (see NEP 18). #14702 by Andreas
Müller.
• [API Change] Replaced manual checks with check_is_fitted. Errors thrown when using a non-fitted estimator are now
more uniform. #13013 by Agamemnon Krasoulis.
• Added check that pairwise estimators raise error on non-square data #14336 by Gregory Dexter.
• Added two common multioutput estimator tests check_classifier_multioutput and
check_regressor_multioutput. #13392 by Rok Mihevc.
• [Fix] Added check_transformer_data_not_an_array to the checks where it was missing.
• [Fix] Estimator tags resolution now follows the regular MRO. They used to be overridable only once. #14884 by
Andreas Müller.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, in-
cluding:
Aaron Alphonsus, Abbie Popa, Abdur-Rahmaan Janhangeer, abenbihi, Abhinav Sagar, Abhishek Jana, Abraham K.
Lagat, Adam J. Stewart, Aditya Vyas, Adrin Jalali, Agamemnon Krasoulis, Alec Peters, Alessandro Surace, Alexan-
dre de Siqueira, Alexandre Gramfort, alexgoryainov, Alex Henrie, Alex Itkes, alexshacked, Allen Akinkunle, Anaël
Beaugnon, Anders Kaseorg, Andrea Maldonado, Andrea Navarrete, Andreas Mueller, Andreas Schuderer, Andrew
Nystrom, Angela Ambroz, Anisha Keshavan, Ankit Jha, Antonio Gutierrez, Anuja Kelkar, Archana Alva, arnaud-
stiegler, arpanchowdhry, ashimb9, Ayomide Bamidele, Baran Buluttekin, barrycg, Bharat Raghunathan, Bill Mill,
Biswadip Mandal, blackd0t, Brian G. Barkley, Brian Wignall, Bryan Yang, c56pony, camilaagw, cartman_nabana,
catajara, Cat Chenal, Cathy, cgsavard, Charles Vesteghem, Chiara Marmo, Chris Gregory, Christian Lorentzen, Chris-
tos Aridas, Dakota Grusak, Daniel Grady, Daniel Perry, Danna Naser, DatenBergwerk, David Dormagen, deeplook,
Dillon Niederhut, Dong-hee Na, Dougal J. Sutherland, DrGFreeman, Dylan Cashman, edvardlindelof, Eric Larson,
Eric Ndirangu, Eunseop Jeong, Fanny, federicopisanu, Felix Divo, flaviomorelli, FranciDona, Franco M. Luque, Frank
Hoang, Frederic Haase, g0g0gadget, Gabriel Altay, Gabriel do Vale Rios, Gael Varoquaux, ganevgv, gdex1, getgau-
rav2, Gideon Sonoiya, Gordon Chen, gpapadok, Greg Mogavero, Grzegorz Szpak, Guillaume Lemaitre, Guillem Gar-
cía Subies, H4dr1en, hadshirt, Hailey Nguyen, Hanmin Qin, Hannah Bruce Macdonald, Harsh Mahajan, Harsh Soni,
Honglu Zhang, Hossein Pourbozorg, Ian Sanders, Ingrid Spielman, J-A16, jaehong park, Jaime Ferrando Huertas,
James Hill, James Myatt, Jay, jeremiedbb, Jérémie du Boisberranger, jeromedockes, Jesper Dramsch, Joan Massich,
Joanna Zhang, Joel Nothman, Johann Faouzi, Jonathan Rahn, Jon Cusick, Jose Ortiz, Kanika Sabharwal, Katarina
Slama, kellycarmody, Kennedy Kang’ethe, Kensuke Arai, Kesshi Jordan, Kevad, Kevin Loftis, Kevin Winata, Kevin
Yu-Sheng Li, Kirill Dolmatov, Kirthi Shankar Sivamani, krishna katyal, Lakshmi Krishnan, Lakshya KD, LalliAcqua,
lbfin, Leland McInnes, Léonard Binet, Loic Esteve, loopyme, lostcoaster, Louis Huynh, lrjball, Luca Ionescu, Lutz
Roeder, MaggieChege, Maithreyi Venkatesh, Maltimore, Maocx, Marc Torrellas, Marie Douriez, Markus, Markus
Frey, Martina G. Vilas, Martin Oywa, Martin Thoma, Masashi SHIBATA, Maxwell Aladago, mbillingr, m-clare,
Meghann Agarwal, m.fab, Micah Smith, miguelbarao, Miguel Cabrera, Mina Naghshhnejad, Ming Li, motmoti,
mschaffenroth, mthorrell, Natasha Borders, nezar-a, Nicolas Hug, Nidhin Pattaniyil, Nikita Titov, Nishan Singh Mann,
Nitya Mandyam, norvan, notmatthancock, novaya, nxorable, Oleg Stikhin, Oleksandr Pavlyk, Olivier Grisel, Omar
Saleem, Owen Flanagan, panpiort8, Paolo, Paolo Toccaceli, Paresh Mathur, Paula, Peng Yu, Peter Marko, pierre-
tallotte, poorna-kumar, pspachtholz, qdeffense, Rajat Garg, Raphaël Bournhonesque, Ray, Ray Bell, Rebekah Kim,
Reza Gharibi, Richard Payne, Richard W, rlms, Robert Juergens, Rok Mihevc, Roman Feldbauer, Roman Yurchak,
R Sanjabi, RuchitaGarde, Ruth Waithera, Sackey, Sam Dixon, Samesh Lakhotia, Samuel Taylor, Sarra Habchi, Scott
Gigante, Scott Sievert, Scott White, Sebastian Pölsterl, Sergey Feldman, SeWook Oh, she-dares, Shreya V, Shub-
ham Mehta, Shuzhe Xiao, SimonCW, smarie, smujjiga, Sönke Behrends, Soumirai, Sourav Singh, stefan-matcovici,
steinfurt, Stéphane Couvreur, Stephan Tulkens, Stephen Cowley, Stephen Tierney, SylvainLan, th0rwas, theoptips,
theotheo, Thierno Ibrahima DIOP, Thomas Edwards, Thomas J Fan, Thomas Moreau, Thomas Schmitt, Tilen Kusterle,
Tim Bicker, Timsaur, Tim Staley, Tirth Patel, Tola A, Tom Augspurger, Tom Dupré la Tour, topisan, Trevor Stephens,
ttang131, Urvang Patel, Vathsala Achar, veerlosar, Venkatachalam N, Victor Luzgin, Vincent Jeanselme, Vincent
Lostanlen, Vladimir Korolev, vnherdeiro, Wenbo Zhao, Wendy Hu, willdarnell, William de Vazelhes, wolframalpha,
xavier dupré, xcjason, x-martian, xsat, xun-tang, Yinglr, yokasre, Yu-Hang “Maxin” Tang, Yulia Zamriy, Zhao Feng
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• The v0.20.0 release notes failed to mention a backwards incompatibility in metrics.make_scorer
when needs_proba=True and y_true is binary. Now, the scorer function is supposed to accept a 1D
y_pred (i.e., probability of the positive class, shape (n_samples,)), instead of a 2D y_pred (i.e., shape
(n_samples, 2)).
Changelog
sklearn.cluster
• [Fix] Fixed a bug in cluster.KMeans where computation with init='random' was single threaded for n_jobs > 1 or
n_jobs = -1. #12955 by Prabakaran Kumaresshan.
• [Fix] Fixed a bug in cluster.OPTICS where users were unable to pass float min_samples and min_cluster_size. #14496
by Fabian Klopfer and Hanmin Qin.
• [Fix] Fixed a bug in cluster.KMeans where KMeans++ initialisation could rarely result in an IndexError. #11756 by
Joel Nothman.
sklearn.compose
sklearn.datasets
sklearn.ensemble
sklearn.impute
sklearn.inspection
sklearn.linear_model
sklearn.neighbors
sklearn.tree
• [Fix] Fixed bug in tree.export_text when the tree has one feature and a single feature name is passed in. #14053
by Thomas Fan.
• [Fix] Fixed an issue with plot_tree where it displayed entropy calculations even for the gini criterion in
DecisionTreeClassifiers. #13947 by Frank Hoang.
24 May 2019
Changelog
sklearn.decomposition
sklearn.metrics
sklearn.preprocessing
• [Fix] Fixed a bug in preprocessing.OneHotEncoder where the new drop parameter was not reflected in
get_feature_names. #13894 by James Myatt.
sklearn.utils.sparsefuncs
• [Fix] Fixed a bug where min_max_axis would fail on 32-bit systems for certain large inputs. This affects
preprocessing.MaxAbsScaler, preprocessing.normalize and preprocessing.LabelBinarizer. #13741 by Roddy MacSween.
17 May 2019
This is a bug-fix release to primarily resolve some packaging issues in version 0.21.0. It also includes minor docu-
mentation improvements and some bug fixes.
Changelog
sklearn.inspection
• [Fix] Fixed a bug in inspection.partial_dependence to only check the classifier and not the regressor for the
multiclass-multioutput case. #14309 by Guillaume Lemaitre.
sklearn.metrics
sklearn.neighbors
May 2019
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• discriminant_analysis.LinearDiscriminantAnalysis for multiclass classification. [Fix]
• discriminant_analysis.LinearDiscriminantAnalysis with 'eigen' solver. [Fix]
• linear_model.BayesianRidge [Fix]
• Decision trees and derived ensembles when both max_depth and max_leaf_nodes are set. [Fix]
• linear_model.LogisticRegression and linear_model.LogisticRegressionCV with 'saga' solver. [Fix]
• ensemble.GradientBoostingClassifier [Fix]
• sklearn.feature_extraction.text.HashingVectorizer, sklearn.feature_extraction.text.TfidfVectorizer, and
sklearn.feature_extraction.text.CountVectorizer [Fix]
• neural_network.MLPClassifier [Fix]
• svm.SVC.decision_function and multiclass.OneVsOneClassifier.decision_function. [Fix]
• linear_model.SGDClassifier and any derived classifiers. [Fix]
• Any model using the linear_model._sag.sag_solver function with a 0 seed, including
linear_model.LogisticRegression, linear_model.LogisticRegressionCV, linear_model.Ridge, and linear_model.RidgeCV
with 'sag' solver. [Fix]
• linear_model.RidgeCV when using generalized cross-validation with sparse inputs. [Fix]
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
• The default max_iter for linear_model.LogisticRegression is too small for many solvers given
the default tol. In particular, we accidentally changed the default max_iter for the liblinear solver from
1000 to 100 iterations in #3591 released in version 0.16. In a future release we hope to choose better default
max_iter and tol heuristically depending on the solver (see #13317).
Changelog
Support for Python 3.4 and below has been officially dropped.
sklearn.base
• [API Change] The R2 score used when calling score on a regressor will use multioutput='uniform_average' from
version 0.23 to keep consistent with metrics.r2_score. This will influence the score method of all the multioutput
regressors (except for multioutput.MultiOutputRegressor). #13157 by Hanmin Qin.
sklearn.calibration
• [Enhancement] Added support to bin the data passed into calibration.calibration_curve by quantiles instead of
uniformly between 0 and 1. #13086 by Scott Cole.
• [Enhancement] Allow n-dimensional arrays as input for calibration.CalibratedClassifierCV. #13485 by William de
Vazelhes.
sklearn.cluster
sklearn.compose
sklearn.datasets
• [Fix] Added support for 64-bit group IDs and pointers in SVMLight files. #10727 by Bryan K Woods.
• [Fix] datasets.load_sample_images returns images with a deterministic order. #13250 by Thomas Fan.
sklearn.decomposition
sklearn.discriminant_analysis
sklearn.dummy
• [Fix] Fixed a bug in dummy.DummyClassifier where the predict_proba method was returning an int32 array instead of
float64 for the stratified strategy. #13266 by Christos Aridas.
• [Fix] Fixed a bug in dummy.DummyClassifier where it was throwing a dimension mismatch error at prediction time if
a column vector y with shape=(n, 1) was given at fit time. #13545 by Nick Sorros and Adrin Jalali.
sklearn.ensemble
• [Major Feature] Add two new implementations of gradient boosting trees: ensemble.HistGradientBoostingClassifier
and ensemble.HistGradientBoostingRegressor. The implementation of these estimators is inspired by LightGBM and can
be orders of magnitude faster than ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier when
the number of samples is larger than tens of thousands of samples. The API of these new estimators is slightly
different, and some of the features from ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
are not yet supported.
These new estimators are experimental, which means that their results or their API might change without any
deprecation cycle. To use them, you need to explicitly import enable_hist_gradient_boosting:
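>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier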
sklearn.externals
• [API Change] Deprecated externals.six since we have dropped support for Python 2.7. #12916 by Hanmin Qin.
sklearn.feature_extraction
sklearn.impute
• [Major Feature] Added impute.IterativeImputer, which is a strategy for imputing missing values by modeling each
feature with missing values as a function of other features in a round-robin fashion. #8478 and #12177 by Sergey
Feldman and Ben Lawson.
The API of IterativeImputer is experimental and subject to change without any deprecation cycle. To use it, you need
to explicitly import enable_iterative_imputer:
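>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_iterative_imputer  # noqa
>>> # now you can import normally from sklearn.impute
>>> from sklearn.impute import IterativeImputer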
the imputer’s transform. That allows a predictive estimator to account for missingness. #12583, #13601 by
Danylo Baibak.
• [Fix] In impute.MissingIndicator, avoid implicit densification by raising an exception if the input is sparse and
missing_values is set to 0. #13240 by Bartosz Telenczuk.
• [Fix] Fixed two bugs in impute.MissingIndicator. First, when X is sparse, all the non-zero non-missing values used
to become explicit False in the transformed data. Second, when features='missing-only', all features used to be kept
if there were no missing values at all. #13562 by Jérémie du Boisberranger.
sklearn.inspection
(new subpackage)
• [Feature] Partial dependence plots (inspection.plot_partial_dependence) are now supported for any regressor or
classifier (provided that they have a predict_proba method). #12599 by Trevor Stephens and Nicolas Hug.
sklearn.isotonic
sklearn.linear_model
• [Enhancement] linear_model.Ridge now preserves float32 and float64 dtypes. #8769 and #11000 by Guillaume Lemaitre
and Joan Massich.
• [Feature] linear_model.LogisticRegression and linear_model.LogisticRegressionCV now support Elastic-Net penalty,
with the 'saga' solver. #11646 by Nicolas Hug.
• [Feature] Added linear_model.lars_path_gram, which is linear_model.lars_path in the sufficient stats mode,
allowing users to compute linear_model.lars_path without providing X and y. #11699 by Kuai Yu.
• [Efficiency] linear_model.make_dataset now preserves float32 and float64 dtypes, reducing memory consumption in
stochastic gradient, SAG and SAGA solvers. #8769 and #11000 by Nelle Varoquaux, Arthur Imbert, Guillaume Lemaitre,
and Joan Massich.
• [Enhancement] linear_model.LogisticRegression now supports an unregularized objective when penalty='none' is
passed. This is equivalent to setting C=np.inf with l2 regularization. Not supported by the liblinear solver. #12860
by Nicolas Hug.
• [Enhancement] The sparse_cg solver in linear_model.Ridge now supports fitting the intercept (i.e.
fit_intercept=True) when inputs are sparse. #13336 by Bartosz Telenczuk.
• [Enhancement] The coordinate descent solver used in Lasso, ElasticNet, etc. now issues a ConvergenceWarning when
it completes without meeting the desired tolerance. #11754 and #13397 by Brent Fagan and Adrin Jalali.
• [F IX ] Fixed a bug in linear_model.LogisticRegression and linear_model.
LogisticRegressionCV with ‘saga’ solver, where the weights would not be correctly updated in
some cases. #11646 by Tom Dupre la Tour.
• [F IX ] Fixed the posterior mean, posterior covariance and returned regularization parameters in
linear_model.BayesianRidge. The posterior mean and the posterior covariance were not the ones
computed with the last update of the regularization parameters and the returned regularization parameters were
not the final ones. Also fixed the formula of the log marginal likelihood used to compute the score when
compute_score=True. #12174 by Albert Thomas.
• [Fix] Fixed a bug in linear_model.LassoLarsIC, where a user-supplied copy_X=False at instance creation would be
overridden by the default parameter value copy_X=True in fit. #12972 by Lucio Fernandez-Arjona.
• [Fix] Fixed a bug in linear_model.LinearRegression that was not returning the same coefficients and intercepts
with fit_intercept=True in the sparse and dense cases. #13279 by Alexandre Gramfort.
• [F IX ] Fixed a bug in linear_model.HuberRegressor that was broken when X was of dtype bool. #13328
by Alexandre Gramfort.
• [F IX ] Fixed a performance issue of saga and sag solvers when called in a joblib.Parallel setting with
n_jobs > 1 and backend="threading", causing them to perform worse than in the sequential case.
#13389 by Pierre Glaser.
• [F IX ] Fixed a bug in linear_model.stochastic_gradient.BaseSGDClassifier that was not
deterministic when trained in a multi-class setting on several threads. #13422 by Clément Doumouro.
• [F IX ] Fixed bug in linear_model.ridge_regression, linear_model.Ridge
and linear_model.RidgeClassifier that caused unhandled exception for arguments
return_intercept=True and solver=auto (default) or any other solver different from sag.
#13363 by Bartosz Telenczuk
• [Fix] linear_model.ridge_regression will now raise an exception if return_intercept=True and the solver is
different from sag. Previously, only a warning was issued. #13363 by Bartosz Telenczuk.
• [Fix] linear_model.ridge_regression will choose the sparse_cg solver for sparse inputs when solver=auto and
sample_weight is provided (previously the cholesky solver was selected). #13363 by Bartosz Telenczuk.
• [API Change] The use of linear_model.lars_path with X=None while passing Gram is deprecated in version 0.21 and
will be removed in version 0.23. Use linear_model.lars_path_gram instead. #11699 by Kuai Yu.
• [API Change] linear_model.logistic_regression_path is deprecated in version 0.21 and will be removed in version
0.23. #12821 by Nicolas Hug.
• [Fix] linear_model.RidgeCV with generalized cross-validation now correctly fits an intercept when
fit_intercept=True and the design matrix is sparse. #13350 by Jérôme Dockès.
sklearn.manifold
sklearn.metrics
• [Feature] Added the metrics.max_error metric and a corresponding 'max_error' scorer for single output regression.
#12232 by Krishna Sangeeth.
sklearn.mixture
• [Fix] Fixed a bug in mixture.BaseMixture and therefore in estimators based on it, i.e. mixture.GaussianMixture and
mixture.BayesianGaussianMixture, where fit_predict and fit.predict were not equivalent. #13142 by Jérémie du
Boisberranger.
sklearn.model_selection
• [F EATURE ] Classes GridSearchCV and RandomizedSearchCV now allow for refit=callable to add flex-
ibility in identifying the best estimator. See Balance model complexity and cross-validated score. #11354 by
sklearn.multiclass
sklearn.multioutput
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [Feature] pipeline.Pipeline can now use indexing notation (e.g. my_pipeline[0:-1]) to extract a subsequence of
steps as another Pipeline instance. A Pipeline can also be indexed directly to extract a particular step (e.g.
my_pipeline['svc']), rather than accessing named_steps (see the sketch after this list). #2568 by Joel Nothman.
• [Feature] Added an optional parameter verbose in pipeline.Pipeline, compose.ColumnTransformer and
pipeline.FeatureUnion and the corresponding make_ helpers for showing progress and timing of each step. #11364 by
Baze Petrushev, Karan Desai, Joel Nothman, and Thomas Fan.
• [Enhancement] pipeline.Pipeline now supports using 'passthrough' as a transformer, with the same effect as None.
#11144 by Thomas Fan.
• [Enhancement] pipeline.Pipeline implements __len__ and therefore len(pipeline) returns the number of steps in the
pipeline. #13439 by Lakshya KD.
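A minimal sketch of the new indexing behaviour (the two-step pipeline is an arbitrary example):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

sub_pipeline = pipe[:-1]   # a new Pipeline containing every step except the last
svc_step = pipe['svc']     # the SVC estimator itself, without going through named_steps
print(len(pipe))           # 2, thanks to __len__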
sklearn.preprocessing
• [Feature] preprocessing.OneHotEncoder now supports dropping one feature per category with a new drop parameter
(see the sketch after this list). #12908 by Drew Johnston.
• [Efficiency] preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder now handle pandas DataFrames more
efficiently. #13253 by @maikia.
• [Efficiency] Make preprocessing.MultiLabelBinarizer cache class mappings instead of calculating them every time on
the fly. #12116 by Ekaterina Krivich and Joel Nothman.
• [Efficiency] preprocessing.PolynomialFeatures now supports compressed sparse row (CSR) matrices as input for
degrees 2 and 3. This is typically much faster than the dense case as it scales with matrix density and expansion
degree (on the order of density^degree), and is much, much faster than the compressed sparse column (CSC) case.
#12197 by Andrew Nystrom.
• [Efficiency] Speed improvement in preprocessing.PolynomialFeatures in the dense case. Also added a new parameter
order which controls output order for further speed improvements. #12251 by Tom Dupre la Tour.
• [Fix] Fixed the calculation overflow when using a float16 dtype with preprocessing.StandardScaler. #13007 by
Raffaello Baluyot.
• [Fix] Fixed a bug in preprocessing.QuantileTransformer and preprocessing.quantile_transform to force n_quantiles
to be at most equal to n_samples. Values of n_quantiles larger than n_samples were either useless or resulted in a
wrong approximation of the cumulative distribution function estimator. #13333 by Albert Thomas.
• [API Change] The default value of copy in preprocessing.quantile_transform will change from False to True in 0.23
in order to make it more consistent with the default copy values of other functions in preprocessing and to prevent
unexpected side effects from modifying the value of X in place. #13459 by Hunter McGushion.
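A small sketch of the new drop parameter of OneHotEncoder (the toy column is arbitrary):
from sklearn.preprocessing import OneHotEncoder

X = [['red'], ['green'], ['blue'], ['green']]

# drop='first' removes the first category of each feature, which avoids
# collinearity among the resulting one-hot columns.
enc = OneHotEncoder(drop='first', sparse=False)
print(enc.fit_transform(X))
print(enc.categories_)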
sklearn.svm
sklearn.tree
• [Feature] Decision Trees can now be plotted with matplotlib using tree.plot_tree without relying on the dot
library, removing a hard-to-install dependency (see the sketch after this list). #8508 by Andreas Müller.
• [Feature] Decision Trees can now be exported in a human readable textual format using tree.export_text. #6261 by
Giuseppe Vettigli <JustGlowing>.
• [Feature] get_n_leaves() and get_depth() have been added to tree.BaseDecisionTree and consequently all estimators
based on it, including tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier, and
tree.ExtraTreeRegressor. #12300 by Adrin Jalali.
• [Fix] Trees and forests did not previously predict multi-output classification targets with string labels, despite
accepting them in fit. #11458 by Mitar Milutinovic.
• [Fix] Fixed an issue with tree.BaseDecisionTree and consequently all estimators based on it, including
tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier, and tree.ExtraTreeRegressor,
where they used to exceed the given max_depth by 1 while expanding the tree if max_leaf_nodes and max_depth were
both specified by the user. Please note that this also affects all ensemble methods using decision trees. #12344 by
Adrin Jalali.
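A minimal sketch of the two new export paths, using the iris dataset purely for illustration (matplotlib is only
needed for plot_tree):
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)

# Human readable text export, no graphviz/dot required.
print(tree.export_text(clf))

# Matplotlib-based plot, also without the dot library.
tree.plot_tree(clf)
plt.show()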
sklearn.utils
• [Feature] utils.resample now accepts a stratify parameter for sampling according to class distributions. #13549 by
Nicolas Hug.
• [API Change] Deprecated warn_on_dtype parameter from utils.check_array and utils.check_X_y. Added explicit warning
for dtype conversion in check_pairwise_arrays if the metric being passed is a pairwise boolean metric. #13382 by
Prathmesh Savale.
Multiple modules
• [Major Feature] The __repr__() method of all estimators (used when calling print(estimator)) has been entirely
re-written, building on Python's pretty printing standard library. All parameters are printed by default, but this
can be altered with the print_changed_only option in sklearn.set_config. #11705 by Nicolas Hug.
• [Major Feature] Add estimator tags: these are annotations of estimators that allow programmatic inspection of
their capabilities, such as sparse matrix support, supported output types and supported methods. Estimator tags also
determine the tests that are run on an estimator when check_estimator is called. Read more in the User Guide. #8022
by Andreas Müller.
• [Efficiency] Memory copies are avoided when casting arrays to a different dtype in multiple estimators. #11973 by
Roman Yurchak.
• [Fix] Fixed a bug in the implementation of the our_rand_r helper function that was not behaving consistently
across platforms. #13422 by Madhura Parikh and Clément Doumouro.
Miscellaneous
• [Enhancement] Joblib is no longer vendored in scikit-learn and becomes a dependency. The minimal supported version
is joblib 0.11; however, using version >= 0.13 is strongly recommended. #13531 by Roman Yurchak.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, in-
cluding:
adanhawth, Aditya Vyas, Adrin Jalali, Agamemnon Krasoulis, Albert Thomas, Alberto Torres, Alexandre Gramfort,
amourav, Andrea Navarrete, Andreas Mueller, Andrew Nystrom, assiaben, Aurélien Bellet, Bartosz Michałowski,
Bartosz Telenczuk, bauks, BenjaStudio, bertrandhaut, Bharat Raghunathan, brentfagan, Bryan Woods, Cat Chenal,
Cheuk Ting Ho, Chris Choe, Christos Aridas, Clément Doumouro, Cole Smith, Connossor, Corey Levinson, Dan
Ellis, Dan Stine, Danylo Baibak, daten-kieker, Denis Kataev, Didi Bar-Zev, Dillon Gardner, Dmitry Mottl, Dmitry
Vukolov, Dougal J. Sutherland, Dowon, drewmjohnston, Dror Atariah, Edward J Brown, Ekaterina Krivich, Eliza-
beth Sander, Emmanuel Arias, Eric Chang, Eric Larson, Erich Schubert, esvhd, Falak, Feda Curic, Federico Caselli,
Frank Hoang, Fibinse Xavier‘, Finn O’Shea, Gabriel Marzinotto, Gabriel Vacaliuc, Gabriele Calvo, Gael Varoquaux,
GauravAhlawat, Giuseppe Vettigli, Greg Gandenberger, Guillaume Fournier, Guillaume Lemaitre, Gustavo De Mari
Pereira, Hanmin Qin, haroldfox, hhu-luqi, Hunter McGushion, Ian Sanders, JackLangerman, Jacopo Notarstefano,
jakirkham, James Bourbeau, Jan Koch, Jan S, janvanrijn, Jarrod Millman, jdethurens, jeremiedbb, JF, joaak, Joan
Massich, Joel Nothman, Jonathan Ohayon, Joris Van den Bossche, josephsalmon, Jérémie Méhault, Katrin Leinwe-
ber, ken, kms15, Koen, Kossori Aruku, Krishna Sangeeth, Kuai Yu, Kulbear, Kushal Chauhan, Kyle Jackson, Lakshya
KD, Leandro Hermida, Lee Yi Jie Joel, Lily Xiong, Lisa Sarah Thomas, Loic Esteve, louib, luk-f-a, maikia, mail-liam,
Manimaran, Manuel López-Ibáñez, Marc Torrellas, Marco Gaido, Marco Gorelli, MarcoGorelli, marineLM, Mark
Hannel, Martin Gubri, Masstran, mathurinm, Matthew Roeschke, Max Copeland, melsyt, mferrari3, Mickaël Schoent-
gen, Ming Li, Mitar, Mohammad Aftab, Mohammed AbdelAal, Mohammed Ibraheem, Muhammad Hassaan Rafique,
mwestt, Naoya Iijima, Nicholas Smith, Nicolas Goix, Nicolas Hug, Nikolay Shebanov, Oleksandr Pavlyk, Oliver
Rausch, Olivier Grisel, Orestis, Osman, Owen Flanagan, Paul Paczuski, Pavel Soriano, pavlos kallis, Pawel Sendyk,
peay, Peter, Peter Cock, Peter Hausamann, Peter Marko, Pierre Glaser, pierretallotte, Pim de Haan, Piotr Szymański,
Prabakaran Kumaresshan, Pradeep Reddy Raamana, Prathmesh Savale, Pulkit Maloo, Quentin Batista, Radostin Stoy-
anov, Raf Baluyot, Rajdeep Dua, Ramil Nugmanov, Raúl García Calvo, Rebekah Kim, Reshama Shaikh, Rohan
Lekhwani, Rohan Singh, Rohan Varma, Rohit Kapoor, Roman Feldbauer, Roman Yurchak, Romuald M, Roopam
Sharma, Ryan, Rüdiger Busche, Sam Waterbury, Samuel O. Ronsin, SandroCasagrande, Scott Cole, Scott Lowe, Se-
bastian Raschka, Shangwu Yao, Shivam Kotwalia, Shiyu Duan, smarie, Sriharsha Hatwar, Stephen Hoover, Stephen
Tierney, Stéphane Couvreur, surgan12, SylvainLan, TakingItCasual, Tashay Green, thibsej, Thomas Fan, Thomas
J Fan, Thomas Moreau, Tom Dupré la Tour, Tommy, Tulio Casagrande, Umar Farouk Umar, Utkarsh Upadhyay,
Vinayak Mehta, Vishaal Kapoor, Vivek Kumar, Vlad Niculae, vqean3, Wenhao Zhang, William de Vazelhes, xhan,
Xing Han Lu, xinyuliu12, Yaroslav Halchenko, Zach Griffith, Zach Miller, Zayd Hammoudeh, Zhuyi Xue, Zijie (ZJ)
Poh, ^__^
Changelog
sklearn.cluster
• [Fix] Fixed a bug in cluster.KMeans where KMeans++ initialisation could rarely result in an IndexError. #11756 by
Joel Nothman.
sklearn.compose
sklearn.decomposition
sklearn.model_selection
• [Fix] Fixed a bug where model_selection.StratifiedKFold shuffles each class's samples with the same random_state,
making shuffle=True ineffective. #13124 by Hanmin Qin.
sklearn.neighbors
March 1, 2019
This is a bug-fix release with some minor documentation improvements and enhancements to features released in
0.20.0.
Changelog
sklearn.cluster
• [Fix] Fixed a bug in cluster.KMeans where computation was single threaded when n_jobs > 1 or n_jobs = -1. #12949
by Prabakaran Kumaresshan.
sklearn.compose
• [Fix] Fixed a bug in compose.ColumnTransformer to handle negative indexes in the columns list of the transformers.
#12946 by Pierre Tallotte.
sklearn.covariance
sklearn.decomposition
sklearn.datasets
sklearn.feature_extraction
sklearn.impute
• [Fix] Added support for non-numeric data in sklearn.impute.MissingIndicator, which was not supported while
sklearn.impute.SimpleImputer supported it for some imputation strategies. #13046 by Guillaume Lemaitre.
sklearn.linear_model
sklearn.preprocessing
sklearn.svm
• [Fix] Fixed a bug in svm.SVC, svm.NuSVC, svm.SVR, svm.NuSVR and svm.OneClassSVM where the scale option of the
gamma parameter was erroneously defined as 1 / (n_features * X.std()). It's now defined as
1 / (n_features * X.var()). #13221 by Hanmin Qin.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• sklearn.neighbors when metric=='jaccard' (bug fix)
• use of 'seuclidean' or 'mahalanobis' metrics in some cases (bug fix)
Changelog
sklearn.compose
sklearn.metrics
sklearn.neighbors
sklearn.utils
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• decomposition.IncrementalPCA (bug fix)
Changelog
sklearn.cluster
• [Efficiency] cluster.MeanShift no longer tries to do nested parallelism, as the overhead would hurt performance
significantly when n_jobs > 1. #12159 by Olivier Grisel.
• [Fix] Fixed a bug in cluster.DBSCAN with a precomputed sparse neighbors graph, which would add explicit zeros on
the diagonal even when already present. #12105 by Tom Dupre la Tour.
sklearn.compose
• [Fix] Fixed an issue in compose.ColumnTransformer when stacking columns with types not convertible to a numeric
type. #11912 by Adrin Jalali.
• [API Change] compose.ColumnTransformer now applies the sparse_threshold even if all transformation results are
sparse. #12304 by Andreas Müller.
sklearn.datasets
• [Fix] datasets.fetch_openml to correctly use the local cache. #12246 by Jan N. van Rijn.
• [Fix] datasets.fetch_openml to correctly handle ignore attributes and row id attributes. #12330 by Jan N. van
Rijn.
• [Fix] Fixed integer overflow in datasets.make_classification for values of the n_informative parameter larger
than 64. #10811 by Roman Feldbauer.
• [Fix] Fixed the olivetti faces dataset DESCR attribute to point to the right location in
datasets.fetch_olivetti_faces. #12441 by Jérémie du Boisberranger.
• [Fix] datasets.fetch_openml to retry downloading when reading from local cache fails. #12517 by Thomas Fan.
sklearn.decomposition
sklearn.ensemble
sklearn.feature_extraction
sklearn.linear_model
sklearn.metrics
sklearn.mixture
sklearn.neighbors
sklearn.preprocessing
sklearn.utils
• [Fix] Use float64 for the mean accumulator to avoid floating point precision issues in
preprocessing.StandardScaler and decomposition.IncrementalPCA when using float32 datasets. #12338 by bauks.
• [Fix] Calling utils.check_array on pandas.Series, which raised an error in 0.20.0, now returns the expected output
again. #12625 by Andreas Müller.
Miscellaneous
• [Fix] When using site joblib by setting the environment variable SKLEARN_SITE_JOBLIB, added compatibility with
joblib 0.11 in addition to 0.12+. #12350 by Joel Nothman and Roman Yurchak.
• [Fix] Make sure to avoid raising FutureWarning when calling np.vstack with numpy 1.16 and later (use list
comprehensions instead of generator expressions in many locations of the scikit-learn code base). #12467 by Olivier
Grisel.
• [API Change] Removed all mentions of sklearn.externals.joblib, and deprecated joblib methods exposed in
sklearn.utils, except for utils.parallel_backend and utils.register_parallel_backend, which allow users to configure
parallel computation in scikit-learn. Other functionalities are part of the joblib package and should be used
directly, by installing it. The goal of this change is to prepare for unvendoring joblib in a future version of
scikit-learn. #12345 by Thomas Moreau.
Warning: Version 0.20 is the last version of scikit-learn to support Python 2.7 and Python 3.4. Scikit-learn 0.21
will require Python 3.5 or higher.
Highlights
We have tried to improve our support for common data-science use-cases including missing values, categorical vari-
ables, heterogeneous data, and features/targets with unusual distributions. Missing values in features, represented by
NaNs, are now accepted in column-wise preprocessing such as scalers. Each feature is fitted disregarding NaNs, and
data containing NaNs can be transformed. The new impute module provides estimators for learning despite missing
data.
ColumnTransformer handles the case where different features or columns of a pandas.DataFrame need dif-
ferent preprocessing. String or pandas Categorical columns can now be encoded with OneHotEncoder or
OrdinalEncoder.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.MeanShift (bug fix)
• decomposition.IncrementalPCA in Python 2 (bug fix)
• decomposition.SparsePCA (bug fix)
• ensemble.GradientBoostingClassifier (bug fix affecting feature importances)
• isotonic.IsotonicRegression (bug fix)
• linear_model.ARDRegression (bug fix)
• linear_model.LogisticRegressionCV (bug fix)
• linear_model.OrthogonalMatchingPursuit (bug fix)
• linear_model.PassiveAggressiveClassifier (bug fix)
• linear_model.PassiveAggressiveRegressor (bug fix)
• linear_model.Perceptron (bug fix)
• linear_model.SGDClassifier (bug fix)
• linear_model.SGDRegressor (bug fix)
• metrics.roc_auc_score (bug fix)
• metrics.roc_curve (bug fix)
• neural_network.BaseMultilayerPerceptron (bug fix)
• neural_network.MLPClassifier (bug fix)
• neural_network.MLPRegressor (bug fix)
• The v0.19.0 release notes failed to mention a backwards incompatibility with model_selection.
StratifiedKFold when shuffle=True due to #7823.
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.cluster
sklearn.compose
• New module.
sklearn.covariance
sklearn.datasets
• [Major Feature] Added datasets.fetch_openml to fetch datasets from OpenML. OpenML is a free, open data sharing
platform and will be used instead of mldata as it provides better service availability. #9908 by Andreas Müller and
Jan N. van Rijn.
• [Feature] In datasets.make_blobs, one can now pass a list to the n_samples parameter to indicate the number of
samples to generate per cluster. #8617 by Maskani Filali Mohamed and Konstantinos Katrioplas.
• [Feature] Add filename attribute to datasets that have a CSV file. #9101 by alex-33 and Maskani Filali Mohamed.
• [Feature] return_X_y parameter has been added to several dataset loaders. #10774 by Chris Catalfo.
• [Fix] Fixed a bug in datasets.load_boston which had a wrong data point. #10795 by Takeshi Yoshizawa.
• [Fix] Fixed a bug in datasets.load_iris which had two wrong data points. #11082 by Sadhana Srinivasan and Hanmin
Qin.
• [Fix] Fixed a bug in datasets.fetch_kddcup99, where data were not properly shuffled. #9731 by Nicolas Goix.
• [Fix] Fixed a bug in datasets.make_circles, where no odd number of data points could be generated. #10045 by
Christian Braune.
• [API Change] Deprecated sklearn.datasets.fetch_mldata to be removed in version 0.22. mldata.org is no longer
operational. Until removal it will remain possible to load cached datasets. #11466 by Joel Nothman.
sklearn.decomposition
sklearn.discriminant_analysis
sklearn.dummy
• [Feature] dummy.DummyRegressor now has a return_std option in its predict method. The returned standard deviations
will be zeros.
• [Feature] dummy.DummyClassifier and dummy.DummyRegressor now only require X to be an object with finite length or
shape. #9832 by Vrishank Bhardwaj.
• [Feature] dummy.DummyClassifier and dummy.DummyRegressor can now be scored without supplying test samples. #11951
by Rüdiger Busche.
sklearn.ensemble
sklearn.feature_extraction
sklearn.feature_selection
sklearn.gaussian_process
when using return_std=True in particular more when called several times in a row. #9234 by andrewww
and Minghui Liu.
sklearn.impute
sklearn.isotonic
sklearn.linear_model
sklearn.manifold
• [Efficiency] Speed improvements for both 'exact' and 'barnes_hut' methods in manifold.TSNE. #10593 and #10610 by
Tom Dupre la Tour.
• [Feature] Support sparse input in manifold.Isomap.fit. #8554 by Leland McInnes.
• [Feature] manifold.t_sne.trustworthiness accepts metrics other than Euclidean. #9775 by William de Vazelhes.
• [Fix] Fixed a bug in manifold.spectral_embedding where the normalization of the spectrum was using a division
instead of a multiplication. #8129 by Jan Margeta, Guillaume Lemaitre, and Devansh D..
• [API Change] [Feature] Deprecated the precomputed parameter in function manifold.t_sne.trustworthiness. Instead,
the new parameter metric should be used with any compatible metric including 'precomputed', in which case the input
matrix X should be a matrix of pairwise distances or squared distances. #9775 by William de Vazelhes.
sklearn.metrics
• [M AJOR F EATURE ] Added the metrics.davies_bouldin_score metric for evaluation of clustering mod-
els without a ground truth. #10827 by Luis Osa.
• [M AJOR F EATURE ] Added the metrics.balanced_accuracy_score metric and a corresponding
'balanced_accuracy' scorer for binary and multiclass classification. #8066 by @xyguo and Aman
Dalmia, and #10587 by Joel Nothman.
• [F EATURE ] Partial AUC is available via max_fpr parameter in metrics.roc_auc_score. #3840 by
Alexander Niederbühl.
• [FEATURE] A scorer based on metrics.brier_score_loss is also available. #9521 by Hanmin Qin.
• [FEATURE] Added control over the normalization in metrics.normalized_mutual_info_score and metrics.adjusted_mutual_info_score via the average_method parameter. In version 0.22, the default normalizer for each will become the arithmetic mean of the entropies of each clustering. #11124 by Arya McCarthy.
• [FEATURE] Added the output_dict parameter in metrics.classification_report to return classification statistics as a dictionary (see the sketch after this list). #11160 by Dan Barkhorn.
• [FEATURE] metrics.classification_report now reports all applicable averages on the given data, including micro, macro and weighted average as well as samples average for multilabel data. #11679 by Alexander Pacha.
• [FEATURE] metrics.average_precision_score now supports binary y_true other than {0, 1} or {-1, 1} through the pos_label parameter. #9980 by Hanmin Qin.
• [FEATURE] metrics.label_ranking_average_precision_score now supports sample_weight. #10845 by Jose Perez-Parras Toledano.
• [FEATURE] Add dense_output parameter to metrics.pairwise.linear_kernel. When False and both inputs are sparse, it will return a sparse matrix. #10999 by Taylor G Smith.
• [EFFICIENCY] metrics.silhouette_score and metrics.silhouette_samples are more memory efficient and run faster. This avoids some reported freezes and MemoryErrors. #11135 by Joel Nothman.
• [FIX] Fixed a bug in metrics.precision_recall_fscore_support when a truncated range(n_labels) is passed as the value for labels. #10377 by Gaurav Dhingra.
• [FIX] Fixed a bug due to floating point error in metrics.roc_auc_score with non-integer sample weights. #9786 by Hanmin Qin.
• [FIX] Fixed a bug where metrics.roc_curve sometimes starts on the y-axis instead of (0, 0), which is inconsistent with the documentation and other implementations. Note that this will not influence the result from metrics.roc_auc_score. #10093 by alexryndin and Hanmin Qin.
• [FIX] Fixed a bug to avoid integer overflow: the product in metrics.mutual_info_score is now cast to a 64-bit integer. #9772 by Kumar Ashutosh.
• [FIX] Fixed a bug where metrics.average_precision_score would sometimes return nan when sample_weight contains 0. #9980 by Hanmin Qin.
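A minimal sketch of several of the new metrics options mentioned above (illustrative data; assumes scikit-learn >= 0.20):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             davies_bouldin_score, roc_auc_score)

# davies_bouldin_score evaluates a clustering without ground-truth labels
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower values indicate better-separated clusters

# balanced accuracy is the macro-average of per-class recall
print(balanced_accuracy_score([0, 0, 0, 1], [0, 0, 1, 1]))

# standardized partial AUC over the FPR range [0, 0.5]
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], max_fpr=0.5))

# classification_report as a nested dictionary, including averaged rows
report = classification_report([0, 1, 2, 2, 0], [0, 0, 2, 1, 0], output_dict=True)
print(report["macro avg"]["f1-score"])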
sklearn.mixture
sklearn.model_selection
sklearn.multioutput
sklearn.naive_bayes
• [MAJOR FEATURE] Added naive_bayes.ComplementNB, which implements the Complement Naive Bayes classifier described in Rennie et al. (2003) (see the sketch after this list). #8190 by Michael A. Alcorn.
• [FEATURE] Add var_smoothing parameter in naive_bayes.GaussianNB to give precise control over the variance calculation. #9681 by Dmitry Mottl.
• [FIX] Fixed a bug in naive_bayes.GaussianNB which incorrectly raised an error for a prior list which summed to 1. #10005 by Gaurav Dhingra.
• [FIX] Fixed a bug in naive_bayes.MultinomialNB which did not accept vector-valued pseudocounts (alpha). #10346 by Tobias Madsen.
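A minimal sketch of the new classifier and parameter (illustrative count data; assumes scikit-learn >= 0.20):

from sklearn.naive_bayes import ComplementNB, GaussianNB

X = [[1, 2, 0], [2, 0, 1], [0, 1, 3], [3, 0, 0]]
y = [0, 0, 1, 1]
# Complement Naive Bayes targets imbalanced, text-like count data
clf = ComplementNB().fit(X, y)
# var_smoothing adds a fraction of the largest feature variance for numerical stability
gnb = GaussianNB(var_smoothing=1e-8).fit(X, y)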
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [FEATURE] The predict method of pipeline.Pipeline now passes keyword arguments on to the pipeline's last estimator, enabling the use of parameters such as return_std in a pipeline, with caution (see the sketch after this list). #9304 by Breno Freitas.
• [API CHANGE] pipeline.FeatureUnion now supports 'drop' as a transformer to drop features. #11144 by Thomas Fan.
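A minimal sketch of both changes (illustrative data and estimators; assumes scikit-learn >= 0.20):

from sklearn.dummy import DummyRegressor
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = [[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0]
pipe = Pipeline([("scale", StandardScaler()), ("reg", DummyRegressor())]).fit(X, y)
# keyword arguments are forwarded to the final estimator's predict
y_pred, y_std = pipe.predict(X, return_std=True)

# 'drop' can now stand in for a transformer in a FeatureUnion
union = FeatureUnion([("scaled", StandardScaler()), ("unused", "drop")])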
sklearn.preprocessing
• [API CHANGE] The NaN marker for missing values has been changed between preprocessing.Imputer and impute.SimpleImputer: missing_values='NaN' should now be missing_values=np.nan (see the sketch after this list). #11211 by Jeremie du Boisberranger.
• [API CHANGE] In preprocessing.FunctionTransformer, the default of validate will change from True to False in 0.22. #10655 by Guillaume Lemaitre.
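For example, the new marker with impute.SimpleImputer (assumes scikit-learn >= 0.20 and NumPy):

import numpy as np
from sklearn.impute import SimpleImputer

X = [[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]]
# np.nan replaces the old missing_values='NaN' string marker
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp.fit_transform(X))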
sklearn.svm
• [FIX] Fixed a bug in svm.SVC where, when the kernel argument was a unicode string in Python 2, the predict_proba method raised an unexpected TypeError given dense inputs. #10412 by Jiongyan Zhang.
• [API CHANGE] Deprecate the random_state parameter in svm.OneClassSVM as the underlying implementation is not random. #9497 by Albert Thomas.
• [API CHANGE] The default value of the gamma parameter of svm.SVC, NuSVC, SVR, NuSVR and OneClassSVM will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features (see the sketch after this list). #8361 by Gaurav Dhingra and Ting Neo.
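Passing the future default explicitly silences the deprecation warning, for example (illustrative data; assumes scikit-learn >= 0.20):

from sklearn.svm import SVC

X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [0, 1, 1, 0]
clf = SVC(gamma="scale").fit(X, y)  # opt in to the upcoming default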
sklearn.tree
• [ENHANCEMENT] Although private (and hence not assured API stability), tree._criterion.ClassificationCriterion and tree._criterion.RegressionCriterion may now be cimported and extended. #10325 by Camil Staps.
• [FIX] Fixed a bug in tree.BaseDecisionTree with splitter="best" where the split threshold could become infinite when values in X were near infinite. #10536 by Jonathan Ohayon.
• [FIX] Fixed a bug in tree.MAE to ensure sample weights are being used during the calculation of tree MAE impurity. The previous behaviour could cause suboptimal splits to be chosen since the impurity calculation considered all samples to be of equal weight. #11464 by John Stott.
sklearn.utils
Multiple modules
• [FEATURE] [API CHANGE] More consistent outlier detection API: Add a score_samples method in svm.OneClassSVM, ensemble.IsolationForest, neighbors.LocalOutlierFactor and covariance.EllipticEnvelope. It gives access to the raw score functions from the original papers. A new offset_ attribute allows linking the score_samples and decision_function methods. The contamination parameter of the ensemble.IsolationForest and neighbors.LocalOutlierFactor decision_function methods is used to define this offset_ such that outliers (resp. inliers) have negative (resp. positive) decision_function values. By default, contamination is kept unchanged at 0.1 for a deprecation period. In 0.22, it will be set to "auto", thus using method-specific score offsets. In the covariance.EllipticEnvelope decision_function method, the raw_values parameter is deprecated, as the shifted Mahalanobis distance will always be returned in 0.22. #9015 by Nicolas Goix.
• [FEATURE] [API CHANGE] A behaviour parameter has been introduced in ensemble.IsolationForest to ensure backward compatibility. In the old behaviour, the decision_function is independent of the contamination parameter; a threshold_ attribute depending on the contamination parameter is used instead. In the new behaviour, the decision_function is dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers (see the sketch after this list). Setting behaviour to "old" is deprecated and will not be possible in version 0.22. Besides, the behaviour parameter will be removed in 0.24. #11553 by Nicolas Goix.
• [API CHANGE] Added a convergence warning to svm.LinearSVC and linear_model.LogisticRegression when verbose is set to 0. #10881 by Alexandre Sevin.
• [API CHANGE] Changed the warning type from UserWarning to exceptions.ConvergenceWarning for failing convergence in linear_model.logistic_regression_path, linear_model.RANSACRegressor, linear_model.ridge_regression, gaussian_process.GaussianProcessRegressor, gaussian_process.GaussianProcessClassifier, decomposition.fastica, cross_decomposition.PLSCanonical, cluster.AffinityPropagation, and cluster.Birch. #10306 by Jonathan Siebert.
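A minimal sketch of the revised outlier detection API described above (illustrative data; assumes scikit-learn 0.20, where the transitional behaviour parameter still exists):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).randn(100, 2)
# behaviour='new' makes 0 the natural decision_function threshold
iso = IsolationForest(behaviour="new", contamination=0.1, random_state=0).fit(X)
raw = iso.score_samples(X)          # raw scores, as in the original paper
shifted = iso.decision_function(X)  # raw scores shifted by the fitted offset_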
Miscellaneous
• [MAJOR FEATURE] A new configuration parameter, working_memory, was added to control memory consumption limits in chunked operations, such as the new metrics.pairwise_distances_chunked (see the sketch after this list). See Limiting Working Memory. #10280 by Joel Nothman and Aman Dalmia.
• [FEATURE] The version of joblib bundled with scikit-learn is now 0.12. This uses a new default multiprocessing implementation, named loky. While this may incur some memory and communication overhead, it should provide greater cross-platform stability than relying on the Python standard library multiprocessing module. #11741 by the Joblib developers, especially Thomas Moreau and Olivier Grisel.
• [FEATURE] An environment variable to use the site joblib instead of the vendored one was added (Environment variables). The main API of joblib is now exposed in sklearn.utils. #11166 by Gael Varoquaux.
• [FEATURE] Add almost complete PyPy 3 support. Known unsupported functionalities are datasets.load_svmlight_file, feature_extraction.FeatureHasher and feature_extraction.text.HashingVectorizer. For running on PyPy, PyPy3-v5.10+, NumPy 1.14.0+, and SciPy 1.1.0+ are required. #11010 by Ronan Lamy and Roman Yurchak.
• [FEATURE] A utility method sklearn.show_versions was added to print out information relevant for debugging. It includes the user system, the Python executable, the versions of the main libraries and BLAS binding information. #11596 by Alexandre Boucaud.
• [FIX] Fixed a bug when setting parameters on a meta-estimator, involving both a wrapped estimator and its parameter. #9999 by Marcus Voss and Joel Nothman.
• [FIX] Fixed a bug where calling sklearn.base.clone was not thread safe and could result in a "pop from empty list" error. #9569 by Andreas Müller.
• [API CHANGE] The default value of n_jobs is changed from 1 to None in all related functions and classes. n_jobs=None means unset. It will generally be interpreted as n_jobs=1, unless the current joblib.Parallel backend context specifies otherwise (see the Glossary for additional information, and the sketch after this list). Note that this change happens immediately (i.e., without a deprecation cycle). #11741 by Olivier Grisel.
• [FIX] Fixed a bug in validation helpers where passing a Dask DataFrame resulted in an error. #12462 by Zachariah Miller.
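A minimal sketch of the working_memory limit, sklearn.show_versions and the new n_jobs/backend behaviour referenced above (illustrative data; assumes scikit-learn >= 0.20 with NumPy):

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances_chunked
from sklearn.utils import parallel_backend

sklearn.show_versions()  # platform, Python and dependency/BLAS information

# cap the temporary memory used by chunked pairwise operations at ~64 MiB
X = np.random.RandomState(0).rand(1000, 20)
with sklearn.config_context(working_memory=64):
    for chunk in pairwise_distances_chunked(X):
        pass  # each chunk holds a block of rows of the full distance matrix

# n_jobs now defaults to None; a joblib backend context can set the effective value
Xc, yc = make_classification(n_samples=200, random_state=0)
with parallel_backend("threading", n_jobs=2):
    RandomForestClassifier(n_estimators=50).fit(Xc, yc)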
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.19, in-
cluding:
211217613, Aarshay Jain, absolutelyNoWarranty, Adam Greenhall, Adam Kleczewski, Adam Richie-Halford, adelr,
AdityaDaflapurkar, Adrin Jalali, Aidan Fitzgerald, aishgrt1, Akash Shivram, Alan Liddell, Alan Yee, Albert Thomas,
Alexander Lenail, Alexander-N, Alexandre Boucaud, Alexandre Gramfort, Alexandre Sevin, Alex Egg, Alvaro Perez-
Diaz, Amanda, Aman Dalmia, Andreas Bjerre-Nielsen, Andreas Mueller, Andrew Peng, Angus Williams, Aniruddha
Dave, annaayzenshtat, Anthony Gitter, Antonio Quinonez, Anubhav Marwaha, Arik Pamnani, Arthur Ozga, Artiem
K, Arunava, Arya McCarthy, Attractadore, Aurélien Bellet, Aurélien Geron, Ayush Gupta, Balakumaran Manoha-
ran, Bangda Sun, Barry Hart, Bastian Venthur, Ben Lawson, Benn Roth, Breno Freitas, Brent Yi, brett koonce,
Caio Oliveira, Camil Staps, cclauss, Chady Kamar, Charlie Brummitt, Charlie Newey, chris, Chris, Chris Catalfo,
Chris Foster, Chris Holdgraf, Christian Braune, Christian Hirsch, Christian Hogan, Christopher Jenness, Clement
Joudet, cnx, cwitte, Dallas Card, Dan Barkhorn, Daniel, Daniel Ferreira, Daniel Gomez, Daniel Klevebring, Danielle
Shwed, Daniel Mohns, Danil Baibak, Darius Morawiec, David Beach, David Burns, David Kirkby, David Nichol-
son, David Pickup, Derek, Didi Bar-Zev, diegodlh, Dillon Gardner, Dillon Niederhut, dilutedsauce, dlovell, Dmitry
Mottl, Dmitry Petrov, Dor Cohen, Douglas Duhaime, Ekaterina Tuzova, Eric Chang, Eric Dean Sanchez, Erich Schu-
bert, Eunji, Fang-Chieh Chou, FarahSaeed, felix, Félix Raimundo, fenx, filipj8, FrankHui, Franz Wompner, Freija
Descamps, frsi, Gabriele Calvo, Gael Varoquaux, Gaurav Dhingra, Georgi Peev, Gil Forsyth, Giovanni Giuseppe
Costa, gkevinyen5418, goncalo-rodrigues, Gryllos Prokopis, Guillaume Lemaitre, Guillaume “Vermeille” Sanchez,
Gustavo De Mari Pereira, hakaa1, Hanmin Qin, Henry Lin, Hong, Honghe, Hossein Pourbozorg, Hristo, Hunan Ros-
tomyan, iampat, Ivan PANICO, Jaewon Chung, Jake VanderPlas, jakirkham, James Bourbeau, James Malcolm, Jamie
Cox, Jan Koch, Jan Margeta, Jan Schlüter, janvanrijn, Jason Wolosonovich, JC Liu, Jeb Bearer, jeremiedbb, Jimmy
Wan, Jinkun Wang, Jiongyan Zhang, jjabl, jkleint, Joan Massich, Joël Billaud, Joel Nothman, Johannes Hansen,
JohnStott, Jonatan Samoocha, Jonathan Ohayon, Jörg Döpfert, Joris Van den Bossche, Jose Perez-Parras Toledano,
josephsalmon, jotasi, jschendel, Julian Kuhlmann, Julien Chaumond, julietcl, Justin Shenk, Karl F, Kasper Primdal
Lauritzen, Katrin Leinweber, Kirill, ksemb, Kuai Yu, Kumar Ashutosh, Kyeongpil Kang, Kye Taylor, kyledrogo,
Leland McInnes, Léo DS, Liam Geron, Liutong Zhou, Lizao Li, lkjcalc, Loic Esteve, louib, Luciano Viola, Lucija
Gregov, Luis Osa, Luis Pedro Coelho, Luke M Craig, Luke Persola, Mabel, Mabel Villalba, Maniteja Nandana, MarkI-
wanchyshyn, Mark Roth, Markus Müller, MarsGuy, Martin Gubri, martin-hahn, martin-kokos, mathurinm, Matthias
Feurer, Max Copeland, Mayur Kulkarni, Meghann Agarwal, Melanie Goetz, Michael A. Alcorn, Minghui Liu, Ming
Li, Minh Le, Mohamed Ali Jamaoui, Mohamed Maskani, Mohammad Shahebaz, Muayyad Alsadi, Nabarun Pal, Na-
garjuna Kumar, Naoya Kanai, Narendran Santhanam, NarineK, Nathaniel Saul, Nathan Suh, Nicholas Nadeau, P.Eng.,
AVS, Nick Hoh, Nicolas Goix, Nicolas Hug, Nicolau Werneck, nielsenmarkus11, Nihar Sheth, Nikita Titov, Nilesh
Kevlani, Nirvan Anjirbag, notmatthancock, nzw, Oleksandr Pavlyk, oliblum90, Oliver Rausch, Olivier Grisel, Oren
Milman, Osaid Rehman Nasir, pasbi, Patrick Fernandes, Patrick Olden, Paul Paczuski, Pedro Morales, Peter, Peter St.
John, pierreablin, pietruh, Pinaki Nath Chowdhury, Piotr Szymański, Pradeep Reddy Raamana, Pravar D Mahajan,
pravarmahajan, QingYing Chen, Raghav RV, Rajendra arora, RAKOTOARISON Herilalaina, Rameshwar Bhaskaran,
RankyLau, Rasul Kerimov, Reiichiro Nakano, Rob, Roman Kosobrodov, Roman Yurchak, Ronan Lamy, rragundez,
Rüdiger Busche, Ryan, Sachin Kelkar, Sagnik Bhattacharya, Sailesh Choyal, Sam Radhakrishnan, Sam Steingold,
Samuel Bell, Samuel O. Ronsin, Saqib Nizam Shamsi, SATISH J, Saurabh Gupta, Scott Gigante, Sebastian Flen-
nerhag, Sebastian Raschka, Sebastien Dubois, Sébastien Lerique, Sebastin Santy, Sergey Feldman, Sergey Melderis,
Sergul Aydore, Shahebaz, Shalil Awaley, Shangwu Yao, Sharad Vijalapuram, Sharan Yalburgi, shenhanc78, Shivam
Rastogi, Shu Haoran, siftikha, Sinclert Pérez, SolutusImmensus, Somya Anand, srajan paliwal, Sriharsha Hatwar, Sri
Krishna, Stefan van der Walt, Stephen McDowell, Steven Brown, syonekura, Taehoon Lee, Takanori Hayashi, tarcusx,
Taylor G Smith, theriley106, Thomas, Thomas Fan, Thomas Heavey, Tobias Madsen, tobycheese, Tom Augspurger,
Tom Dupré la Tour, Tommy, Trevor Stephens, Trishnendu Ghorai, Tulio Casagrande, twosigmajab, Umar Farouk
Umar, Urvang Patel, Utkarsh Upadhyay, Vadim Markovtsev, Varun Agrawal, Vathsala Achar, Vilhelm von Ehren-
heim, Vinayak Mehta, Vinit, Vinod Kumar L, Viraj Mavani, Viraj Navkal, Vivek Kumar, Vlad Niculae, vqean3, Vris-
hank Bhardwaj, vufg, wallygauze, Warut Vijitbenjaronk, wdevazelhes, Wenhao Zhang, Wes Barnett, Will, William de
Vazelhes, Will Rosenfeld, Xin Xiong, Yiming (Paul) Li, ymazari, Yufeng, Zach Griffith, Zé Vinícius, Zhenqing Hu,
Zhiqing Xiao, Zijie (ZJ) Poh
July, 2018
This release exists exclusively to support Python 3.7.
Related changes
Note there may be minor differences in TSNE output in this release (due to #9623), in the case where multiple samples
have equal distance to some sample.
Changelog
API changes
• Reverted the addition of metrics.ndcg_score and metrics.dcg_score, which had been merged into version 0.19.0 in error. The implementations were broken and undocumented.
• return_train_score, which was added to model_selection.GridSearchCV, model_selection.RandomizedSearchCV and model_selection.cross_validate in version 0.19.0, will change its default value from True to False in version 0.21. We found that calculating training scores can significantly increase cross-validation runtime in some cases. Users should explicitly set return_train_score to False if prediction or scoring functions are slow, resulting in a deleterious effect on CV runtime, or to True if they wish to use the calculated scores. #9677 by Kumar Ashutosh and Joel Nothman.
• correlation_models and regression_models from the legacy gaussian processes implementation
have been belatedly deprecated. #9717 by Kumar Ashutosh.
Bug fixes
• Dataset fetchers make sure temporary files are closed before removing them, which caused errors on Windows.
#9847 by Joan Massich.
• Fixed a regression in manifold.TSNE where it no longer supported metrics other than ‘euclidean’ and ‘pre-
computed’. #9623 by Oli Blum.
Enhancements
• Our test suite and utils.estimator_checks.check_estimators can now be run without Nose in-
stalled. #9697 by Joan Massich.
• To improve usability of version 0.19's pipeline.Pipeline caching, memory now allows joblib.Memory instances. This makes use of the new utils.validation.check_memory helper. #9584 by Kumar Ashutosh.
• Some fixes to examples: #9750, #9788, #9815
• Made a FutureWarning in SGD-based estimators less verbose. #9802 by Vrishank Bhardwaj.
Highlights
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans with sparse X and initial centroids given (bug fix)
• cross_decomposition.PLSRegression with scale=True (bug fix)
• ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
where min_impurity_split is used (bug fix)
• gradient boosting loss='quantile' (bug fix)
• ensemble.IsolationForest (bug fix)
• feature_selection.SelectFdr (bug fix)
• linear_model.RANSACRegressor (bug fix)
• linear_model.LassoLars (bug fix)
• linear_model.LassoLarsIC (bug fix)
• manifold.TSNE (bug fix)
• neighbors.NearestCentroid (bug fix)
• semi_supervised.LabelSpreading (bug fix)
• semi_supervised.LabelPropagation (bug fix)
• tree based models where min_weight_fraction_leaf is used (enhancement)
• model_selection.StratifiedKFold with shuffle=True (this change, due to #7823 was not men-
tioned in the release notes at the time)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
New features
• The new solver 'mu' implements a Multiplicative Update in decomposition.NMF, allowing the optimization of all beta-divergences, including the Frobenius norm, the generalized Kullback-Leibler divergence and the Itakura-Saito divergence (see the sketch below). #5295 by Tom Dupre la Tour.
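A minimal sketch of the new solver (illustrative non-negative data; assumes scikit-learn >= 0.19):

import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(0).rand(20, 10)  # non-negative input
# the 'mu' solver supports beta-divergences beyond the Frobenius norm
model = NMF(n_components=5, solver="mu", beta_loss="kullback-leibler",
            max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_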
Model selection and evaluation
• model_selection.GridSearchCV and model_selection.RandomizedSearchCV now support
simultaneous evaluation of multiple metrics. Refer to the Specifying multiple metrics for evaluation section of
the user guide for more information. #7388 by Raghav RV
• Added model_selection.cross_validate, which allows evaluation of multiple metrics. This function returns a dict with more useful information from cross-validation, such as the train scores, fit times and score times (see the sketch after this list). Refer to The cross_validate function and multiple metric evaluation section of the user guide for more information. #7388 by Raghav RV.
• Added metrics.mean_squared_log_error, which computes the mean squared error of the logarithmic transformation of targets, particularly useful for targets with an exponential trend. #7655 by Karan Desai.
• Added metrics.dcg_score and metrics.ndcg_score, which compute Discounted cumulative gain
(DCG) and Normalized discounted cumulative gain (NDCG). #7739 by David Gasquez.
• Added the model_selection.RepeatedKFold and model_selection.
RepeatedStratifiedKFold. #8120 by Neeraj Gangwar.
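A minimal sketch of multiple-metric evaluation with cross_validate (illustrative data and estimator; assumes scikit-learn >= 0.19):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
scores = cross_validate(LogisticRegression(), X, y, cv=5,
                        scoring=["accuracy", "f1_macro"])
# one entry per metric, plus fit and score times
print(scores["test_accuracy"], scores["fit_time"])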
Miscellaneous
• Validation that input data contains no NaN or inf can now be suppressed using config_context, at your
own risk. This will save on runtime, and may be particularly useful for prediction time. #7548 by Joel Nothman.
• Added a test to ensure parameter listing in docstrings match the function/class signature. #9206 by Alexandre
Gramfort and Raghav RV.
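Expanding on the input-validation item above, a minimal sketch of suppressing the finiteness check (illustrative data and estimator; assumes scikit-learn >= 0.19):

import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

X, y = np.random.rand(100, 5), np.random.rand(100)
reg = LinearRegression().fit(X, y)
# skip the NaN/inf input validation to reduce prediction overhead
with sklearn.config_context(assume_finite=True):
    reg.predict(X)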
Enhancements
Bug fixes
• Fixed a bug in svm.OneClassSVM where it returned floats instead of integer classes. #8676 by Vathsala
Achar.
• Fix AIC/BIC criterion computation in linear_model.LassoLarsIC. #9022 by Alexandre Gramfort and
Mehmet Basbug.
• Fixed a memory leak in our LibLinear implementation. #9024 by Sergei Lebedev
• Fix bug where stratified CV splitters did not work with linear_model.LassoCV . #8973 by Paulo Haddad.
• Fixed a bug in gaussian_process.GaussianProcessRegressor where predicting the standard deviation or covariance without a prior fit would fail with an unhelpful error by default. #6573 by Quazi Marufur Rahman and Manoj Kumar.
Other predictors
• Fix semi_supervised.BaseLabelPropagation to correctly implement LabelPropagation and
LabelSpreading as done in the referenced papers. #9239 by Andre Ambrosio Boechat, Utkarsh Upadhyay,
and Joel Nothman.
Decomposition, manifold learning and clustering
• Fixed the implementation of manifold.TSNE:
– The early_exaggeration parameter had no effect and is now used for the first 250 optimization iterations.
– Fixed the AssertionError: Tree consistency failed exception reported in #8992.
– Improved the learning schedule to match the one from the reference implementation lvdmaaten/bhtsne. By Thomas Moreau and Olivier Grisel.
• Fix a bug in decomposition.LatentDirichletAllocation where the perplexity method was
returning incorrect results because the transform method returns normalized document topic distributions as
of version 0.18. #7954 by Gary Foreman.
• Fix output shape and bugs with n_jobs > 1 in decomposition.SparseCoder transform and
decomposition.sparse_encode for one-dimensional data and one component. This also impacts the
output shape of decomposition.DictionaryLearning. #8086 by Andreas Müller.
• Fixed the implementation of explained_variance_ in decomposition.PCA, decomposition.
RandomizedPCA and decomposition.IncrementalPCA. #9105 by Hanmin Qin.
• Fixed the implementation of noise_variance_ in decomposition.PCA. #9108 by Hanmin Qin.
• Fixed a bug where cluster.DBSCAN gives incorrect result when input is a precomputed sparse matrix with
initial rows all zero. #8306 by Akshay Gupta
• Fix a bug regarding fitting cluster.KMeans with a sparse array X and initial centroids, where X’s means
were unnecessarily being subtracted from the centroids. #7872 by Josh Karnofsky.
• Fixes to the input validation in covariance.EllipticEnvelope. #8086 by Andreas Müller.
• Fixed a bug in covariance.MinCovDet where inputting data that produced a singular covariance matrix
would cause the helper method _c_step to throw an exception. #3367 by Jeremy Steward
• Fixed a bug in manifold.TSNE affecting convergence of the gradient descent. #8768 by David DeTomaso.
• Fixed a bug in manifold.TSNE where it stored the incorrect kl_divergence_. #6507 by Sebastian
Saeger.
• Fixed improper scaling in cross_decomposition.PLSRegression with scale=True. #7819 by
jayzed82.
• Deprecate the y parameter in transform and inverse_transform. These methods should not accept a y parameter, as they are used at prediction time. #8174 by Tahar Zanouda, Alexandre Gramfort and Raghav RV.
• SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following
backported functions in utils have been removed or deprecated accordingly. #8854 and #8874 by Naoya
Kanai
• The store_covariances and covariances_ parameters of discriminant_analysis.QuadraticDiscriminantAnalysis have been renamed to store_covariance and covariance_ to be consistent with the corresponding parameter names of discriminant_analysis.LinearDiscriminantAnalysis. They will be removed in version 0.21. #7998 by Jiacheng.
Removed in 0.19:
– utils.fixes.argpartition
– utils.fixes.array_equal
– utils.fixes.astype
– utils.fixes.bincount
– utils.fixes.expit
– utils.fixes.frombuffer_empty
– utils.fixes.in1d
– utils.fixes.norm
– utils.fixes.rankdata
– utils.fixes.safe_copy
Deprecated in 0.19, to be removed in 0.21:
– utils.arpack.eigs
– utils.arpack.eigsh
– utils.arpack.svds
– utils.extmath.fast_dot
– utils.extmath.logsumexp
– utils.extmath.norm
– utils.extmath.pinvh
– utils.graph.graph_laplacian
– utils.random.choice
– utils.sparsetools.connected_components
– utils.stats.rankdata
• Estimators with both methods decision_function and predict_proba are now required to have a
monotonic relation between them. The method check_decision_proba_consistency has been added
in utils.estimator_checks to check their consistency. #7578 by Shubham Bhardwaj
• All checks in utils.estimator_checks, in particular utils.estimator_checks.
check_estimator now accept estimator instances. Most other checks do not accept estimator classes any
more. #9019 by Andreas Müller.
• Ensure that estimators’ attributes ending with _ are not set in the constructor but only in the fit method.
Most notably, ensemble estimators (deriving from ensemble.BaseEnsemble) now only have self.
estimators_ available after fit. #7464 by Lars Buitinck and Loic Esteve.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.18, in-
cluding:
Joel Nothman, Loic Esteve, Andreas Mueller, Guillaume Lemaitre, Olivier Grisel, Hanmin Qin, Raghav RV, Alexandre
Gramfort, themrmax, Aman Dalmia, Gael Varoquaux, Naoya Kanai, Tom Dupré la Tour, Rishikesh, Nelson Liu, Tae-
hoon Lee, Nelle Varoquaux, Aashil, Mikhail Korobov, Sebastin Santy, Joan Massich, Roman Yurchak, RAKOTOARI-
SON Herilalaina, Thierry Guillemot, Alexandre Abadie, Carol Willing, Balakumaran Manoharan, Josh Karnofsky,
Vlad Niculae, Utkarsh Upadhyay, Dmitry Petrov, Minghui Liu, Srivatsan, Vincent Pham, Albert Thomas, Jake Van-
derPlas, Attractadore, JC Liu, alexandercbooth, chkoar, Óscar Nájera, Aarshay Jain, Kyle Gilliam, Ramana Subra-
manyam, CJ Carey, Clement Joudet, David Robles, He Chen, Joris Van den Bossche, Karan Desai, Katie Luangkote,
Leland McInnes, Maniteja Nandana, Michele Lacchia, Sergei Lebedev, Shubham Bhardwaj, akshay0724, omtcyfz,
rickiepark, waterponey, Vathsala Achar, jbDelafosse, Ralf Gommers, Ekaterina Krivich, Vivek Kumar, Ishank Gulati,
Dave Elliott, ldirer, Reiichiro Nakano, Levi John Wolf, Mathieu Blondel, Sid Kapur, Dougal J. Sutherland, midinas,
mikebenfield, Sourav Singh, Aseem Bansal, Ibraim Ganiev, Stephen Hoover, AishwaryaRK, Steven C. Howell, Gary
Foreman, Neeraj Gangwar, Tahar, Jon Crall, dokato, Kathy Chen, ferria, Thomas Moreau, Charlie Brummitt, Nicolas
Goix, Adam Kleczewski, Sam Shleifer, Nikita Singh, Basil Beirouti, Giorgio Patrini, Manoj Kumar, Rafael Possas,
James Bourbeau, James A. Bednar, Janine Harper, Jaye, Jean Helie, Jeremy Steward, Artsiom, John Wei, Jonathan
LIgo, Jonathan Rahn, seanpwilliams, Arthur Mensch, Josh Levy, Julian Kuhlmann, Julien Aubert, Jörn Hees, Kai,
shivamgargsya, Kat Hempstalk, Kaushik Lakshmikanth, Kennedy, Kenneth Lyons, Kenneth Myers, Kevin Yap, Kir-
ill Bobyrev, Konstantin Podshumok, Arthur Imbert, Lee Murray, toastedcornflakes, Lera, Li Li, Arthur Douillard,
Mainak Jas, tobycheese, Manraj Singh, Manvendra Singh, Marc Meketon, MarcoFalke, Matthew Brett, Matthias
Gilch, Mehul Ahuja, Melanie Goetz, Meng, Peng, Michael Dezube, Michal Baumgartner, vibrantabhi19, Artem Golu-
bin, Milen Paskov, Antonin Carette, Morikko, MrMjauh, NALEPA Emmanuel, Namiya, Antoine Wendlinger, Narine
Kokhlikyan, NarineK, Nate Guerin, Angus Williams, Ang Lu, Nicole Vavrova, Nitish Pandey, Okhlopkov Daniil
Olegovich, Andy Craze, Om Prakash, Parminder Singh, Patrick Carlson, Patrick Pei, Paul Ganssle, Paulo Haddad,
Paweł Lorek, Peng Yu, Pete Bachant, Peter Bull, Peter Csizsek, Peter Wang, Pieter Arthur de Jong, Ping-Yao, Chang,
Preston Parry, Puneet Mathur, Quentin Hibon, Andrew Smith, Andrew Jackson, 1kastner, Rameshwar Bhaskaran, Re-
becca Bilbro, Remi Rampin, Andrea Esuli, Rob Hall, Robert Bradshaw, Romain Brault, Aman Pratik, Ruifeng Zheng,
Russell Smith, Sachin Agarwal, Sailesh Choyal, Samson Tan, Samuël Weber, Sarah Brown, Sebastian Pölsterl, Se-
bastian Raschka, Sebastian Saeger, Alyssa Batula, Abhyuday Pratap Singh, Sergey Feldman, Sergul Aydore, Sharan
Yalburgi, willduan, Siddharth Gupta, Sri Krishna, Almer, Stijn Tonk, Allen Riddell, Theofilos Papapanagiotou, Alison,
Alexis Mignon, Tommy Boucher, Tommy Löfstedt, Toshihiro Kamishima, Tyler Folkman, Tyler Lanigan, Alexander
Junge, Varun Shenoy, Victor Poughon, Vilhelm von Ehrenheim, Aleksandr Sandrovskii, Alan Yee, Vlasios Vasileiou,
Warut Vijitbenjaronk, Yang Zhang, Yaroslav Halchenko, Yichuan Liu, Yuichi Fujikawa, affanv14, aivision2020, xor,
andreh7, brady salz, campustrampus, Agamemnon Krasoulis, ditenberg, elena-sharova, filipj8, fukatani, gedeck, guin-
iol, guoci, hakaa1, hongkahjun, i-am-xhy, jakirkham, jaroslaw-weber, jayzed82, jeroko, jmontoyam, jonathan.striebel,
josephsalmon, jschendel, leereeves, martin-hahn, mathurinm, mehak-sachdeva, mlewis1729, mlliou112, mthorrell,
ndingwall, nuffe, yangarbiter, plagree, pldtc325, Breno Freitas, Brett Olsen, Brian A. Alfano, Brian Burns, polmauri,
Brandon Carter, Charlton Austin, Chayant T15h, Chinmaya Pancholi, Christian Danielsen, Chung Yen, Chyi-Kwei
Yau, pravarmahajan, DOHMATOB Elvis, Daniel LeJeune, Daniel Hnyk, Darius Morawiec, David DeTomaso, David
Gasquez, David Haberthür, David Heryanto, David Kirkby, David Nicholson, rashchedrin, Deborah Gertrude Digges,
Denis Engemann, Devansh D, Dickson, Bob Baxley, Don86, E. Lynch-Klarup, Ed Rogers, Elizabeth Ferriss, Ellen-
Co2, Fabian Egli, Fang-Chieh Chou, Bing Tian Dai, Greg Stupp, Grzegorz Szpak, Bertrand Thirion, Hadrien Bertrand,
Harizo Rajaona, zxcvbnius, Henry Lin, Holger Peters, Icyblade Dai, Igor Andriushchenko, Ilya, Isaac Laughlin, Iván
Vallés, Aurélien Bellet, JPFrancoia, Jacob Schreiber, Asish Mahapatra
Scikit-learn 0.18 is the last major release of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
Changelog
• Fixes for compatibility with NumPy 1.13.0: #7946 #8355 by Loic Esteve.
• Minor compatibility changes in the examples #9010 #8040 #9149.
Code Contributors
Changelog
Enhancements
Bug fixes
• Fix issue where min_grad_norm and n_iter_without_progress parameters were not being utilised
by manifold.TSNE. #6497 by Sebastian Säger
• Fix bug for svm’s decision values when decision_function_shape is ovr in svm.SVC. svm.SVC’s
decision_function was incorrect from versions 0.17.0 through 0.18.0. #7724 by Bing Tian Dai
• Attribute explained_variance_ratio of discriminant_analysis.
LinearDiscriminantAnalysis calculated with SVD and Eigen solver are now of the same length.
#7632 by JPFrancoia
• Fixes issue in Univariate feature selection where score functions were not accepting multi-label targets. #7676
by Mohammed Affan
• Fixed setting parameters when calling fit multiple times on feature_selection.SelectFromModel.
#7756 by Andreas Müller
• Fixes issue in partial_fit method of multiclass.OneVsRestClassifier when number of classes
used in partial_fit was less than the total number of classes in the data. #7786 by Srivatsan Ramesh
• Fixes issue in calibration.CalibratedClassifierCV where the sum of probabilities of each class for a sample was not 1, and CalibratedClassifierCV now handles the case where the training set has fewer classes than the full dataset. #7799 by Srivatsan Ramesh
• Fix a bug where sklearn.feature_selection.SelectFdr did not exactly implement Benjamini-
Hochberg procedure. It formerly may have selected fewer features than it should. #7490 by Peng Meng.
• sklearn.manifold.LocallyLinearEmbedding now correctly handles integer inputs. #6282 by Jake
Vanderplas.
• The min_weight_fraction_leaf parameter of tree-based classifiers and regressors now assumes uniform
sample weights by default if the sample_weight argument is not passed to the fit function. Previously, the
parameter was silently ignored. #7301 by Nelson Liu.
• Fixed a numerical issue with linear_model.RidgeCV on centered data when n_features > n_samples. #6178 by Bertrand Thirion
• Tree splitting criterion classes’ cloning/pickling is now memory safe #7680 by Ibraim Ganiev.
• Fixed a bug where decomposition.NMF sets its n_iters_ attribute in transform(). #7553 by Ekate-
rina Krivich.
• sklearn.linear_model.LogisticRegressionCV now correctly handles string labels. #5874 by
Raghav RV.
• Fixed a bug where sklearn.model_selection.train_test_split raised an error when
stratify is a list of string labels. #7593 by Raghav RV.
• Fixed a bug where sklearn.model_selection.GridSearchCV and sklearn.
model_selection.RandomizedSearchCV were not pickleable because of a pickling bug in np.
ma.MaskedArray. #7594 by Raghav RV.
• All cross-validation utilities in sklearn.model_selection now permit one time cross-validation splitters
for the cv parameter. Also non-deterministic cross-validation splitters (where multiple calls to split produce
dissimilar splits) can be used as cv parameter. The sklearn.model_selection.GridSearchCV will
cross-validate each parameter setting on the split produced by the first split call to the cross-validation splitter.
#7660 by Raghav RV.
• Fix bug where preprocessing.MultiLabelBinarizer.fit_transform returned an invalid CSR
matrix. #7750 by CJ Carey.
• Fixed a bug where metrics.pairwise.cosine_distances could return a small negative distance.
#7732 by Artsion.
Scikit-learn 0.18 will be the last version of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
Changelog
New features
Enhancements
Bug fixes
new class solves the computational problems of the old class and computes the Variational Bayesian Gaussian
mixture faster than before. #6651 by Wei Xue and Thierry Guillemot.
• The old mixture.GMM is deprecated in favor of the new mixture.GaussianMixture. The new class computes the Gaussian mixture faster than before, and some of the computational problems have been solved. #6666 by Wei Xue and Thierry Guillemot.
Model evaluation and meta-estimators
• The sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve
have been deprecated and the classes and functions have been reorganized into the sklearn.
model_selection module. Ref Model Selection Enhancements and API Changes for more information.
#4294 by Raghav RV.
• The grid_scores_ attribute of model_selection.GridSearchCV and model_selection.
RandomizedSearchCV is deprecated in favor of the attribute cv_results_. Ref Model Selection En-
hancements and API Changes for more information. #6697 by Raghav RV.
• The parameters n_iter or n_folds in old CV splitters are replaced by the new parameter n_splits since
it can provide a consistent and unambiguous interface to represent the number of train-test splits. #7187 by
YenChen Lin.
• classes parameter was renamed to labels in metrics.hamming_loss. #7260 by Sebastián Vanrell.
• The splitter classes LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and
LeavePLabelsOut are renamed to model_selection.GroupKFold, model_selection.
GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.
LeavePGroupsOut respectively. Also the parameter labels in the split method of the newly renamed
splitters model_selection.LeaveOneGroupOut and model_selection.LeavePGroupsOut
is renamed to groups. Additionally in model_selection.LeavePGroupsOut, the parameter
n_labels is renamed to n_groups. #6660 by Raghav RV.
• Error and loss names for scoring parameters are now prefixed by 'neg_', such as
neg_mean_squared_error. The unprefixed versions are deprecated and will be removed in version 0.20.
#7261 by Tim Head.
Code Contributors
Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander Minyushkin, Alexander Rudy, Alexan-
dre Abadie, Alexandre Abraham, Alexandre Gramfort, Alexandre Saint, alexfields, Alvaro Ulloa, alyssaq, Amlan
Kar, Andreas Mueller, andrew giessel, Andrew Jackson, Andrew McCulloh, Andrew Murray, Anish Shah, Arafat,
Archit Sharma, Ariel Rokem, Arnaud Joly, Arnaud Rachez, Arthur Mensch, Ash Hoover, asnt, b0noI, Behzad Tabib-
ian, Bernardo, Bernhard Kratzwald, Bhargav Mangipudi, blakeflei, Boyuan Deng, Brandon Carter, Brett Naul, Brian
McFee, Caio Oliveira, Camilo Lamus, Carol Willing, Cass, CeShine Lee, Charles Truong, Chyi-Kwei Yau, CJ Carey,
codevig, Colin Ni, Dan Shiebler, Daniel, Daniel Hnyk, David Ellis, David Nicholson, David Staub, David Thaler,
David Warshaw, Davide Lasagna, Deborah, definitelyuncertain, Didi Bar-Zev, djipey, dsquareindia, edwinENSAE,
Elias Kuthe, Elvis DOHMATOB, Ethan White, Fabian Pedregosa, Fabio Ticconi, fisache, Florian Wilhelm, Francis,
Francis O’Donovan, Gael Varoquaux, Ganiev Ibraim, ghg, Gilles Louppe, Giorgio Patrini, Giovanni Cherubin, Gio-
vanni Lanzani, Glenn Qian, Gordon Mohr, govin-vatsan, Graham Clenaghan, Greg Reda, Greg Stupp, Guillaume
Lemaitre, Gustav Mörtberg, halwai, Harizo Rajaona, Harry Mavroforakis, hashcode55, hdmetor, Henry Lin, Hob-
son Lane, Hugo Bowne-Anderson, Igor Andriushchenko, Imaculate, Inki Hwang, Isaac Sijaranamual, Ishank Gulati,
Issam Laradji, Iver Jordal, jackmartin, Jacob Schreiber, Jake Vanderplas, James Fiedler, James Routley, Jan Zikes,
Janna Brettingen, jarfa, Jason Laska, jblackburne, jeff levesque, Jeffrey Blackburne, Jeffrey04, Jeremy Hintz, jere-
mynixon, Jeroen, Jessica Yung, Jill-Jênn Vie, Jimmy Jia, Jiyuan Qian, Joel Nothman, johannah, John, John Boersma,
John Kirkham, John Moeller, jonathan.striebel, joncrall, Jordi, Joseph Munoz, Joshua Cook, JPFrancoia, jrfiedler,
JulianKahnert, juliathebrave, kaichogami, KamalakerDadi, Kenneth Lyons, Kevin Wang, kingjr, kjell, Konstantin
Podshumok, Kornel Kielczewski, Krishna Kalyan, krishnakalyan3, Kvle Putnam, Kyle Jackson, Lars Buitinck, ldavid,
LeiG, LeightonZhang, Leland McInnes, Liang-Chi Hsieh, Lilian Besson, lizsz, Loic Esteve, Louis Tiao, Léonie Borne,
Mads Jensen, Maniteja Nandana, Manoj Kumar, Manvendra Singh, Marco, Mario Krell, Mark Bao, Mark Szepieniec,
Martin Madsen, MartinBpr, MaryanMorel, Massil, Matheus, Mathieu Blondel, Mathieu Dubois, Matteo, Matthias Ek-
man, Max Moroz, Michael Scherer, michiaki ariga, Mikhail Korobov, Moussa Taifi, mrandrewandrade, Mridul Seth,
nadya-p, Naoya Kanai, Nate George, Nelle Varoquaux, Nelson Liu, Nick James, NickleDave, Nico, Nicolas Goix,
Nikolay Mayorov, ningchi, nlathia, okbalefthanded, Okhlopkov, Olivier Grisel, Panos Louridas, Paul Strickland, Per-
rine Letellier, pestrickland, Peter Fischer, Pieter, Ping-Yao, Chang, practicalswift, Preston Parry, Qimu Zheng, Rachit
Kansal, Raghav RV, Ralf Gommers, Ramana.S, Rammig, Randy Olson, Rob Alexander, Robert Lutz, Robin Schucker,
Rohan Jain, Ruifeng Zheng, Ryan Yu, Rémy Léone, saihttam, Saiwing Yeung, Sam Shleifer, Samuel St-Jean, Sar-
taj Singh, Sasank Chilamkurthy, saurabh.bansod, Scott Andrews, Scott Lowe, seales, Sebastian Raschka, Sebastian
Saeger, Sebastián Vanrell, Sergei Lebedev, shagun Sodhani, shanmuga cv, Shashank Shekhar, shawpan, shengxid-
uan, Shota, shuckle16, Skipper Seabold, sklearn-ci, SmedbergM, srvanrell, Sébastien Lerique, Taranjeet, themrmax,
Thierry, Thierry Guillemot, Thomas, Thomas Hallock, Thomas Moreau, Tim Head, tKammy, toastedcornflakes, Tom,
TomDLT, Toshihiro Kamishima, tracer0tong, Trent Hauck, trevorstephens, Tue Vo, Varun, Varun Jewalikar, Viach-
eslav, Vighnesh Birodkar, Vikram, Villu Ruusmann, Vinayak Mehta, walter, waterponey, Wenhua Yang, Wenjian
Huang, Will Welch, wyseguy7, xyguo, yanlend, Yaroslav Halchenko, yelite, Yen, YenChenLin, Yichuan Liu, Yoav
Ram, Yoshiki, Zheng RuiFeng, zivori, Óscar Nájera
Changelog
Bug fixes
• Upgrade vendored joblib to version 0.9.4, which fixes an important bug in joblib.Parallel that can silently yield wrong results when working on datasets larger than 1MB: https://github.com/joblib/joblib/blob/0.9.4/CHANGES.rst
• Fixed reading of Bunch pickles generated with scikit-learn version <= 0.16. This can affect users who have
already downloaded a dataset with scikit-learn 0.16 and are loading it with scikit-learn 0.17. See #6196 for how
this affected datasets.fetch_20newsgroups. By Loic Esteve.
• Fixed a bug that prevented using ROC AUC score to perform grid search on several CPU / cores on large arrays.
See #6147 By Olivier Grisel.
• Fixed a bug that prevented properly setting the presort parameter in ensemble.GradientBoostingRegressor. See #5857. By Andrew McCulloh.
• Fixed a joblib error when evaluating the perplexity of a decomposition.
LatentDirichletAllocation model. See #6258 By Chyi-Kwei Yau.
November 5, 2015
Changelog
New features
• All the Scaler classes but preprocessing.RobustScaler can be fitted online by calling partial_fit.
By Giorgio Patrini.
• The new class ensemble.VotingClassifier implements a “majority rule” / “soft voting” ensemble
classifier to combine estimators for classification. By Sebastian Raschka.
• The new class preprocessing.RobustScaler provides an alternative to preprocessing.
StandardScaler for feature-wise centering and range normalization that is robust to outliers. By Thomas
Unterthiner.
• The new class preprocessing.MaxAbsScaler provides an alternative to preprocessing.
MinMaxScaler for feature-wise range normalization when the data is already centered or sparse. By Thomas
Unterthiner.
• The new class preprocessing.FunctionTransformer turns a Python function into a Pipeline-
compatible transformer object. By Joe Jevnik.
• The new classes cross_validation.LabelKFold and cross_validation.
LabelShuffleSplit generate train-test folds, respectively similar to cross_validation.KFold and
cross_validation.ShuffleSplit, except that the folds are conditioned on a label array. By Brian
McFee, Jean Kossaifi and Gilles Louppe.
• decomposition.LatentDirichletAllocation implements the Latent Dirichlet Allocation topic
model with online variational inference. By Chyi-Kwei Yau, with code based on an implementation by Matt
Hoffman. (#3659)
• The new solver sag implements a Stochastic Average Gradient descent and is available in both
linear_model.LogisticRegression and linear_model.Ridge. This solver is very efficient for
large datasets. By Danny Sullivan and Tom Dupre la Tour. (#4738)
• The new solver cd implements a Coordinate Descent in decomposition.NMF. Previous solver based on
Projected Gradient is still available setting new parameter solver to pg, but is deprecated and will be removed
in 0.19, along with decomposition.ProjectedGradientNMF and parameters sparseness, eta,
beta and nls_max_iter. New parameters alpha and l1_ratio control L1 and L2 regularization, and
shuffle adds a shuffling step in the cd solver. By Tom Dupre la Tour and Mathieu Blondel.
Enhancements
• manifold.TSNE now supports approximate optimization via the Barnes-Hut method, leading to much faster
fitting. By Christopher Erick Moody. (#4025)
• cluster.mean_shift_.MeanShift now supports parallel execution, as implemented in the
mean_shift function. By Martino Sorbaro.
• naive_bayes.GaussianNB now supports fitting with sample_weight. By Jan Hendrik Metzen.
• dummy.DummyClassifier now supports a prior fitting strategy. By Arnaud Joly.
• Added a fit_predict method for mixture.GMM and subclasses. By Cory Lorenz.
• Added the metrics.label_ranking_loss metric. By Arnaud Joly.
• Added the metrics.cohen_kappa_score metric.
• Added a warm_start constructor parameter to the bagging ensemble models to increase the size of the en-
semble. By Tim Head.
• Added option to use multi-output regression metrics without averaging. By Konstantin Shmelkov and Michael
Eickenberg.
• Speed up tree based methods by reducing the number of computations needed when computing the impurity
measure taking into account linear relationship of the computed statistics. The effect is particularly visible with
extra trees and on datasets with categorical or sparse features. By Arnaud Joly.
• ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier now expose an apply method for retrieving the leaf indices each sample ends up in under each tree. By Jacob Schreiber.
• Add sample_weight support to linear_model.LinearRegression. By Sonny Hu. (#4881)
• Add n_iter_without_progress to manifold.TSNE to control the stopping criterion. By Santi Vil-
lalba. (#5186)
• Added optional parameter random_state in linear_model.Ridge , to set the seed of the pseudo random
generator used in sag solver. By Tom Dupre la Tour.
• Added optional parameter warm_start in linear_model.LogisticRegression. If set to True, the
solvers lbfgs, newton-cg and sag will be initialized with the coefficients computed in the previous fit. By
Tom Dupre la Tour.
• Added sample_weight support to linear_model.LogisticRegression for the lbfgs,
newton-cg, and sag solvers. By Valentin Stolbunov. Support added to the liblinear solver. By Manoj
Kumar.
• Added optional parameter presort to ensemble.GradientBoostingRegressor and ensemble.
GradientBoostingClassifier, keeping default behavior the same. This allows gradient boosters to
turn off presorting when building deep trees or using sparse data. By Jacob Schreiber.
• Altered metrics.roc_curve to drop unnecessary thresholds by default. By Graham Clenaghan.
• Added feature_selection.SelectFromModel meta-transformer which can be used along with esti-
mators that have coef_ or feature_importances_ attribute to select important features of the input data.
By Maheshakya Wijewardena, Joel Nothman and Manoj Kumar.
• Added metrics.pairwise.laplacian_kernel. By Clyde Fare.
• covariance.GraphLasso allows separate control of the convergence criterion for the Elastic-Net subprob-
lem via the enet_tol parameter.
• Improved verbosity in decomposition.DictionaryLearning.
• ensemble.RandomForestClassifier and ensemble.RandomForestRegressor no longer ex-
plicitly store the samples used in bagging, resulting in a much reduced memory footprint for storing random
forest models.
• Added positive option to linear_model.Lars and linear_model.lars_path to force coeffi-
cients to be positive. (#5131)
• Added the X_norm_squared parameter to metrics.pairwise.euclidean_distances to provide
precomputed squared norms for X.
• Added the fit_predict method to pipeline.Pipeline.
• Added the preprocessing.min_max_scale function.
Bug fixes
• Fixed bug where grid_search.RandomizedSearchCV could consume a lot of memory for large discrete
grids. By Joel Nothman.
• Fixed bug in linear_model.LogisticRegressionCV where penalty was ignored in the final fit. By
Manoj Kumar.
• Fixed bug in ensemble.forest.ForestClassifier while computing oob_score and X is a
sparse.csc_matrix. By Ankur Ankan.
• All regressors now consistently handle and warn when given y that is of shape (n_samples, 1). By Andreas
Müller and Henry Lin. (#5431)
• Fix in cluster.KMeans cluster reassignment for sparse input by Lars Buitinck.
• Fixed a bug in lda.LDA that could cause asymmetric covariance matrices when using shrinkage. By Martin
Billinger.
• Fixed cross_validation.cross_val_predict for estimators with sparse predictions. By Buddha
Prakash.
• Fixed the predict_proba method of linear_model.LogisticRegression to use soft-max instead
of one-vs-rest normalization. By Manoj Kumar. (#5182)
• Fixed the partial_fit method of linear_model.SGDClassifier when called with
average=True. By Andrew Lamb. (#5282)
• Dataset fetchers use different filenames under Python 2 and Python 3 to avoid pickling compatibility issues. By
Olivier Grisel. (#5355)
• Fixed a bug in naive_bayes.GaussianNB which caused classification results to depend on scale. By Jake
Vanderplas.
• Temporarily fixed linear_model.Ridge, which was incorrect when fitting the intercept in the case of sparse data. The fix automatically changes the solver to 'sag' in this case. #5360 by Tom Dupre la Tour.
• Fixed a performance bug in decomposition.RandomizedPCA on data with a large number of features
and fewer samples. (#4478) By Andreas Müller, Loic Esteve and Giorgio Patrini.
• Fixed bug in cross_decomposition.PLS that yielded unstable and platform dependent output, and failed
on fit_transform. By Arthur Mensch.
• Fixes to the Bunch class used to store datasets.
• Fixed ensemble.plot_partial_dependence ignoring the percentiles parameter.
• Providing a set as vocabulary in CountVectorizer no longer leads to inconsistent results when pickling.
• Fixed the conditions on when a precomputed Gram matrix needs to be recomputed in linear_model.
LinearRegression, linear_model.OrthogonalMatchingPursuit, linear_model.Lasso
and linear_model.ElasticNet.
• Fixed inconsistent memory layout in the coordinate descent solver that affected linear_model.
DictionaryLearning and covariance.GraphLasso. (#5337) By Olivier Grisel.
• manifold.LocallyLinearEmbedding no longer ignores the reg parameter.
• Nearest Neighbor estimators with custom distance metrics can now be pickled. (#4362)
• Fixed a bug in pipeline.FeatureUnion where transformer_weights were not properly handled
when performing grid-searches.
• Fixed a bug in linear_model.LogisticRegression and linear_model.
LogisticRegressionCV when using class_weight='balanced' or class_weight='auto'.
By Tom Dupre la Tour.
Code Contributors
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando
Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly,
Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee,
Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat,
Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard,
Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry
Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric
Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles
Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Im-
manuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen,
Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph,
Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler
Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Ku-
mar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada,
Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson,
Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli
Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert
Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka, Sebas-
tian Saeger, Shivan Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De
Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head, Timo-
thy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta,
Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro,
Zichen Wang
Changelog
Bug fixes
Highlights
• Speed improvements (notably in cluster.DBSCAN ), reduced memory requirements, bug-fixes and better
default settings.
• Multinomial Logistic regression and a path algorithm in linear_model.LogisticRegressionCV .
• Out-of core learning of PCA via decomposition.IncrementalPCA.
• Probability calibration of classifiers using calibration.CalibratedClassifierCV .
• cluster.Birch clustering method for large-scale datasets.
• Scalable approximate nearest neighbors search with Locality-sensitive hashing forests in neighbors.
LSHForest.
• Improved error messages and better validation when using malformed input data.
• More robust integration with pandas dataframes.
Changelog
New features
• The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors
search. By Maheshakya Wijewardena.
• Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression which is
much faster for large sample sizes than svm.SVR with linear kernel. By Fabian Pedregosa and Qiang Luo.
• Incremental fit for GaussianNB.
• Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By
Arnaud Joly.
• Added the metrics.label_ranking_average_precision_score metrics. By Arnaud Joly.
• Add the metrics.coverage_error metrics. By Arnaud Joly.
• Added linear_model.LogisticRegressionCV . By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux
and Alexandre Gramfort.
• Added warm_start constructor parameter to make it possible for any trained forest model to grow additional
trees incrementally. By Laurent Direr.
Enhancements
Documentation improvements
Bug fixes
• RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of
gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-
validated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
• Pipeline object delegate the classes_ attribute to the underlying estimator. It allows, for instance, to make
bagging of a pipeline object. By Arnaud Joly
• neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan.
It was using the mean before. By Manoj Kumar
• Fix numerical stability issues in linear_model.SGDClassifier and linear_model.
SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for
large l2 regularization and large learning rate values). By Olivier Grisel
• When compute_full_tree was set to "auto", the full tree was built when n_clusters was high and construction was stopped early when n_clusters was low, while the behavior should be vice versa in cluster.AgglomerativeClustering (and friends). This has been fixed by Manoj Kumar
• Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was
centered around one. It has been changed to be centered around the origin. By Manoj Kumar
• Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using
connectivity constraints. By Cathy Deng
• Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and
sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
• Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the
multi-label setting. By Andreas Müller.
• Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors,
kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.
NearestNeighbors and family, when the query data is not the same as fit data. By Manoj Kumar.
• Fix log-density calculation in the mixture.GMM with tied covariance. By Will Dawson
• Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By
Andrew Tulloch
• Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weight-
ing and having identical data points. By Garret-R.
• Fixed round off errors with non positive-definite covariance matrices in GMM. By Alexis Mignon.
• Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
• Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying
on the boundary for algorithm='brute'. By Yan Yi.
• Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and
decision_function. By Artem Sobolev.
• Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets
(secondary method). By Andreas Müller and Michael Bommarito.
• GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into
arrays any more, allowing DataFrame specific operations in custom estimators.
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander
Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, An-
drew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu,
Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng,
Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brun-
ner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Suther-
land, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R,
Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil
Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque,
isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff
Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle
Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michel-
bacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz
Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael
Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas,
Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap
Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert
Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan
Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Un-
terthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta,
Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan
Meng, Yan Yi, Yu-Chin
September 4, 2014
Bug fixes
• Fixed handling of the p parameter of the Minkowski distance that was previously ignored in nearest neighbors
models. By Nikolay Mayorov.
• Fixed duplicated alphas in linear_model.LassoLars with early stopping on 32 bit Python. By Olivier
Grisel and Fabian Pedregosa.
• Fixed the build under Windows when scikit-learn is built with MSVC while NumPy is built with MinGW. By
Olivier Grisel and Federico Vaggi.
• Fixed an array index overflow bug in the coordinate descent solver. By Gael Varoquaux.
• Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux.
• Removed unnecessary data copy in cluster.KMeans. By Gael Varoquaux.
• Explicitly close open files to avoid ResourceWarnings under Python 3. By Calvin Giles.
• The transform of discriminant_analysis.LinearDiscriminantAnalysis now projects the
input on the most discriminant directions. By Martin Billinger.
• Fixed potential overflow in _tree.safe_realloc by Lars Buitinck.
August 1, 2014
Bug fixes
Highlights
Changelog
New features
Enhancements
• Reduce memory usage and overhead when fitting and predicting with forests of randomized trees in parallel with n_jobs != 1 by leveraging the new threading backend of joblib 0.8 and releasing the GIL in the tree fitting Cython code. By Olivier Grisel and Gilles Louppe.
• Speed improvement of the sklearn.ensemble.gradient_boosting module. By Gilles Louppe and
Peter Prettenhofer.
• Various enhancements to the sklearn.ensemble.gradient_boosting module: a warm_start ar-
gument to fit additional trees, a max_leaf_nodes argument to fit GBM style trees, a monitor fit argument
to inspect the estimator during training, and refactoring of the verbose code. By Peter Prettenhofer.
• Faster sklearn.ensemble.ExtraTrees by caching feature values. By Arnaud Joly.
• Faster depth-based tree building for decision trees, random forests, extra trees and gradient tree boosting (with a depth-based growing strategy), by avoiding attempts to split on features found to be constant within the sample subset. By Arnaud Joly.
• Add min_weight_fraction_leaf pre-pruning parameter to tree-based methods: the minimum weighted
fraction of the input samples required to be at a leaf node. By Noel Dawe.
• Added metrics.pairwise_distances_argmin_min, by Philippe Gervais.
• Added predict method to cluster.AffinityPropagation and cluster.MeanShift, by Mathieu
Blondel.
• Vector and matrix multiplications have been optimised throughout the library by Denis Engemann and Alexandre Gramfort. In particular, they should take less memory with older NumPy versions (prior to 1.7.2).
• Precision-recall and ROC examples now use train_test_split, and have more explanation of why these metrics
are useful. By Kyle Kastner
• The training algorithm for decomposition.NMF is faster for sparse matrices and has much lower memory
complexity, meaning it will scale up gracefully to large datasets. By Lars Buitinck.
• Added an svd_method option, with default value “randomized”, to decomposition.FactorAnalysis to save memory and significantly speed up computation. By Denis Engemann and Alexandre Gramfort.
• Changed cross_validation.StratifiedKFold to try and preserve as much of the original ordering of samples as possible so as not to hide overfitting on datasets with a non-negligible level of sample dependency. By Daniel Nouri and Olivier Grisel.
• Add multi-output support to gaussian_process.GaussianProcess by John Novak.
• Support for precomputed distance matrices in nearest neighbor estimators by Robert Layton and Joel Nothman.
• Norm computations optimized for NumPy 1.6 and later versions by Lars Buitinck. In particular, the k-means
algorithm no longer needs a temporary data structure the size of its input.
• dummy.DummyClassifier can now be used to predict a constant output value. By Manoj Kumar.
• dummy.DummyRegressor now has a strategy parameter which allows predicting the mean or the median of the training set, or a constant output value. By Maheshakya Wijewardena.
• Multi-label classification output in multilabel indicator format is now supported by metrics.
roc_auc_score and metrics.average_precision_score by Arnaud Joly.
• Significant performance improvements (more than 100x speedup for large problems) in isotonic.
IsotonicRegression by Andrew Tulloch.
• Speed and memory usage improvements to the SGD algorithm for linear models: it now uses threads, not
separate processes, when n_jobs>1. By Lars Buitinck.
• Grid search and cross validation allow NaNs in the input arrays so that preprocessors such as
preprocessing.Imputer can be trained within the cross validation loop, avoiding potentially skewed
results.
• Ridge regression can now deal with sample weights in feature space (previously only in sample space). By Michael Eickenberg. Both solutions are provided by the Cholesky solver.
• Several classification and regression metrics now support weighted samples with the new sample_weight argument: metrics.accuracy_score, metrics.zero_one_loss, metrics.precision_score, metrics.average_precision_score, metrics.f1_score, metrics.fbeta_score, metrics.recall_score, metrics.roc_auc_score, metrics.explained_variance_score, metrics.mean_squared_error, metrics.mean_absolute_error, metrics.r2_score. By Noel Dawe. A short usage sketch follows this list.
• Speed up of the sample generator datasets.make_multilabel_classification. By Joel Nothman.
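As a quick illustration of the sample_weight support described above (the numbers are made up for the sketch):

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
weights = np.array([1.0, 1.0, 3.0, 1.0])  # the misclassified third sample counts three times

print(accuracy_score(y_true, y_pred))                         # unweighted: 0.75
print(accuracy_score(y_true, y_pred, sample_weight=weights))  # weighted:   0.5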
Documentation improvements
• The Working With Text Data tutorial has now been worked into the main documentation’s tutorial section. It includes exercises and skeletons for tutorial presentation. Original tutorial created by several authors including Olivier Grisel, Lars Buitinck and many others. Tutorial integration into the scikit-learn documentation by Jaques Grobler.
• Added Computational Performance documentation. Discussion and examples of prediction latency / throughput and the different factors that influence speed. Additional tips for building faster models and choosing a relevant compromise between speed and predictive power. By Eustache Diemert.
Bug fixes
People
• 6 Daniel Nouri
• 6 Chen Liu
• 6 Michael Eickenberg
• 6 ugurthemaster
• 5 Aaron Schumacher
• 5 Baptiste Lagarde
• 5 Rajat Khanduja
• 5 Robert McGibbon
• 5 Sergio Pascual
• 4 Alexis Metaireau
• 4 Ignacio Rossi
• 4 Virgile Fritsch
• 4 Sebastian Säger
• 4 Ilambharathi Kanniah
• 4 sdenton4
• 4 Robert Layton
• 4 Alyssa
• 4 Amos Waterland
• 3 Andrew Tulloch
• 3 murad
• 3 Steven Maude
• 3 Karol Pysniak
• 3 Jacques Kvam
• 3 cgohlke
• 3 cjlin
• 3 Michael Becker
• 3 hamzeh
• 3 Eric Jacobsen
• 3 john collins
• 3 kaushik94
• 3 Erwin Marsi
• 2 csytracy
• 2 LK
• 2 Vlad Niculae
• 2 Laurent Direr
• 2 Erik Shilts
• 2 Raul Garreta
• 2 Yoshiki Vázquez Baeza
• 2 Yung Siang Liau
• 2 abhishek thakur
• 2 James Yu
• 2 Rohit Sivaprasad
• 2 Roland Szabo
• 2 amormachine
• 2 Alexis Mignon
• 2 Oscar Carlsson
• 2 Nantas Nardelli
• 2 jess010
• 2 kowalski87
• 2 Andrew Clegg
• 2 Federico Vaggi
• 2 Simon Frid
• 2 Félix-Antoine Fortin
• 1 Ralf Gommers
• 1 t-aft
• 1 Ronan Amicel
• 1 Rupesh Kumar Srivastava
• 1 Ryan Wang
• 1 Samuel Charron
• 1 Samuel St-Jean
• 1 Fabian Pedregosa
• 1 Skipper Seabold
• 1 Stefan Walk
• 1 Stefan van der Walt
• 1 Stephan Hoyer
• 1 Allen Riddell
• 1 Valentin Haenel
• 1 Vijay Ramesh
• 1 Will Myers
• 1 Yaroslav Halchenko
• 1 Yoni Ben-Meshulam
• 1 Yury V. Zaytsev
• 1 adrinjalali
• 1 ai8rahim
• 1 alemagnani
• 1 alex
• 1 benjamin wilson
• 1 chalmerlowe
• 1 dzikie drożdże
• 1 jamestwebber
• 1 matrixorz
• 1 popo
• 1 samuela
• 1 François Boulogne
• 1 Alexander Measure
• 1 Ethan White
• 1 Guilherme Trein
• 1 Hendrik Heuer
• 1 IvicaJovic
• 1 Jan Hendrik Metzen
• 1 Jean Michel Rouly
• 1 Eduardo Ariño de la Rubia
• 1 Jelle Zijlstra
• 1 Eddy L O Jansson
• 1 Denis
• 1 John
• 1 John Schmidt
• 1 Jorge Cañardo Alastuey
• 1 Joseph Perla
• 1 Joshua Vredevoogd
• 1 José Ricardo
• 1 Julien Miotte
• 1 Kemal Eren
• 1 Kenta Sato
• 1 David Cournapeau
• 1 Kyle Kelley
• 1 Daniele Medri
• 1 Laurent Luce
• 1 Laurent Pierron
• 1 Luis Pedro Coelho
• 1 DanielWeitzenfeld
• 1 Craig Thompson
• 1 Chyi-Kwei Yau
• 1 Matthew Brett
• 1 Matthias Feurer
• 1 Max Linke
• 1 Chris Filo Gorgolewski
• 1 Charles Earl
• 1 Michael Hanke
• 1 Michele Orrù
• 1 Bryan Lunt
• 1 Brian Kearns
• 1 Paul Butler
• 1 Paweł Mandera
• 1 Peter
• 1 Andrew Ash
• 1 Pietro Zambelli
• 1 staubda
August 7, 2013
Changelog
• Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.
Imputer by Nicolas Trésegnie.
• The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
• Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and
Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
• Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for random-
ized hyperparameter optimization. By Andreas Müller.
• Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and
sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.
datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring met-
rics (sklearn.metrics.consensus_score). By Kemal Eren.
• Added Restricted Boltzmann Machines (neural_network.BernoulliRBM). By Yann Dauphin.
• Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass
under Python 3.3.
• Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu
Blondel.
• Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical significance). By Norbert Crombach and Mathieu Blondel.
• Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the docu-
mentation. See Choosing the right estimator. By Jaques Grobler.
• grid_search.GridSearchCV and cross_validation.cross_val_score now support the use of advanced scoring functions such as area under the ROC curve and f-beta scores. See The scoring parameter: defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from sklearn.metrics as score_func is deprecated.
• Multi-label classification output is now supported by metrics.accuracy_score,
metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.
classification_report, metrics.precision_score and metrics.recall_score by
Arnaud Joly.
• Two new metrics metrics.hamming_loss and metrics.jaccard_similarity_score are added
with multi-label support by Arnaud Joly.
• Speed and memory usage improvements in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
• The min_df parameter in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use (see the first sketch after this list).
• svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now
have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained
using these estimators can be made much more compact.
• linear_model.SGDClassifier now produces multiclass probability estimates when trained under log
loss or modified Huber loss.
• Hyperlinks to documentation in example code on the website by Martin Luessi.
• Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default
feature_range settings. By Andreas Müller.
• max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all
derived ensemble estimators now supports percentage values. By Gilles Louppe.
• Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
• metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples, by Arnaud Joly.
• Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars Buitinck. Both this and the normalize option above are illustrated in the second sketch after this list.
• A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
• Feature selectors now share a mixin providing consistent transform, inverse_transform and
get_support methods. By Joel Nothman.
• A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally
be pickled. By Joel Nothman.
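A minimal sketch of the min_df behaviour discussed above, with made-up toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "one bird flew"]

# min_df=1 (the new default) keeps every term; min_df=2 only keeps terms
# that appear in at least two of the three documents.
print(sorted(CountVectorizer(min_df=1).fit(docs).vocabulary_))
print(sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_))  # ['sat', 'the']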
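And a sketch of the accuracy_score normalize option together with the new metrics.log_loss, again with illustrative numbers:

from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))                   # fraction correct: 0.75
print(accuracy_score(y_true, y_pred, normalize=False))  # number correct: 3

# log_loss expects predicted class probabilities rather than hard labels.
y_proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
print(log_loss(y_true, y_proba))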
People
• 7 Hrishikesh Huilgolkar
• 6 Kyle Kastner
• 6 Martin Luessi
• 6 Rob Speer
• 5 Federico Vaggi
• 5 Raul Garreta
• 5 Rob Zinkov
• 4 Ken Geis
• 3 A. Flaxman
• 3 Denton Cockburn
• 3 Dougal Sutherland
• 3 Ian Ozsvald
• 3 Johannes Schönberger
• 3 Robert McGibbon
• 3 Roman Sinayev
• 3 Szabo Roland
• 2 Diego Molla
• 2 Imran Haque
• 2 Jochen Wersdörfer
• 2 Sergey Karayev
• 2 Yannick Schwartz
• 2 jamestwebber
• 1 Abhijeet Kolhe
• 1 Alexander Fabisch
• 1 Bastiaan van den Berg
• 1 Benjamin Peterson
• 1 Daniel Velkov
• 1 Fazlul Shahriar
• 1 Felix Brockherde
• 1 Félix-Antoine Fortin
• 1 Harikrishnan S
• 1 Jack Hale
• 1 JakeMick
• 1 James McDermott
• 1 John Benediktsson
• 1 John Zwinck
• 1 Joshua Vredevoogd
• 1 Justin Pati
• 1 Kevin Hughes
• 1 Kyle Kelley
• 1 Matthias Ekman
• 1 Miroslav Shubernetskiy
• 1 Naoki Orii
• 1 Norbert Crombach
• 1 Rafael Cunha de Almeida
• 1 Rolando Espinoza La fuente
• 1 Seamus Abshere
• 1 Sergey Feldman
• 1 Sergio Medina
• 1 Stefano Lattarini
• 1 Steve Koch
• 1 Sturla Molden
• 1 Thomas Jarosch
• 1 Yaroslav Halchenko
Changelog
People
Changelog
• metrics.zero_one_loss (formerly metrics.zero_one) now has option for normalized output that
reports the fraction of misclassifications, rather than the raw number of misclassifications. By Kyle Beauchamp.
• tree.DecisionTreeClassifier and all derived ensemble models now support sample weighting, by
Noel Dawe and Gilles Louppe.
• Speedup improvement when using bootstrap samples in forests of randomized trees, by Peter Prettenhofer and
Gilles Louppe.
• Partial dependence plots for Gradient Tree Boosting in ensemble.partial_dependence.
partial_dependence by Peter Prettenhofer. See Partial Dependence Plots for an example.
• The table of contents on the website has now been made expandable by Jaques Grobler.
• feature_selection.SelectPercentile now breaks ties deterministically instead of returning all
equally ranked features.
• feature_selection.SelectKBest and feature_selection.SelectPercentile are more
numerically stable since they use scores, rather than p-values, to rank results. This means that they might
sometimes select different features than they did previously.
• Ridge regression and ridge classification fitting with sparse_cg solver no longer has quadratic memory com-
plexity, by Lars Buitinck and Fabian Pedregosa.
• Ridge regression and ridge classification now support a new fast solver called lsqr, by Mathieu Blondel.
• Speed up of metrics.precision_recall_curve by Conrad Lee.
• Added support for reading/writing svmlight files with pairwise preference attribute (qid in svmlight file format)
in datasets.dump_svmlight_file and datasets.load_svmlight_file by Fabian Pedregosa.
• Faster and more robust metrics.confusion_matrix and Clustering performance evaluation by Wei Li.
• cross_validation.cross_val_score now works with precomputed kernels and affinity matrices, by Andreas Müller. A usage sketch follows this list.
• LARS algorithm made more numerically stable with heuristics to drop regressors that are too correlated and to stop the path when numerical noise becomes predominant, by Gael Varoquaux.
• Faster implementation of metrics.precision_recall_curve by Conrad Lee.
• New kernel metrics.chi2_kernel by Andreas Müller, often used in computer vision applications.
• Longstanding bug in naive_bayes.BernoulliNB fixed by Shaun Jackman.
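A sketch of the precomputed-kernel support mentioned above; the linear Gram matrix and the modern import paths are assumptions made for the example (cross_val_score lived in sklearn.cross_validation in this release):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
gram = np.dot(X, X.T)  # one square precomputed kernel matrix covering all samples

# The cross-validation machinery slices the square kernel matrix consistently
# for the train and test folds.
clf = SVC(kernel="precomputed")
print(cross_val_score(clf, gram, y, cv=3))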
People
• 106 Wei Li
• 101 Olivier Grisel
• 65 Vlad Niculae
• 54 Gilles Louppe
• 40 Jaques Grobler
• 38 Alexandre Gramfort
• 30 Rob Zinkov
• 19 Aymeric Masurelle
• 18 Andrew Winterman
• 17 Fabian Pedregosa
• 17 Nelle Varoquaux
• 16 Christian Osendorfer
• 14 Daniel Nouri
• 13 Virgile Fritsch
• 13 syhw
• 12 Satrajit Ghosh
• 10 Corey Lynch
• 10 Kyle Beauchamp
• 9 Brian Cheung
• 9 Immanuel Bayer
• 9 mr.Shu
• 8 Conrad Lee
• 8 James Bergstra
• 7 Tadej Janež
• 6 Brian Cajes
• 6 Jake Vanderplas
• 6 Michael
• 6 Noel Dawe
• 6 Tiago Nunes
• 6 cow
• 5 Anze
• 5 Shiqiao Du
• 4 Christian Jauvin
• 4 Jacques Kvam
• 4 Richard T. Guy
• 4 Robert Layton
• 3 Alexandre Abraham
• 3 Doug Coleman
• 3 Scott Dickerson
• 2 ApproximateIdentity
• 2 John Benediktsson
• 2 Mark Veronda
• 2 Matti Lyra
• 2 Mikhail Korobov
• 2 Xinfan Meng
• 1 Alejandro Weinstein
• 1 Alexandre Passos
• 1 Christoph Deil
• 1 Eugene Nizhibitsky
• 1 Kenneth C. Arnold
• 1 Luis Pedro Coelho
• 1 Miroslav Batchkarov
• 1 Pavel
• 1 Sebastian Berg
• 1 Shaun Jackman
• 1 Subhodeep Moitra
• 1 bob
• 1 dengemann
• 1 emanuele
• 1 x006
October 8, 2012
The 0.12.1 release is a bug-fix release with no additional features.
Changelog
People
• 14 Peter Prettenhofer
• 12 Gael Varoquaux
• 10 Andreas Müller
• 5 Lars Buitinck
• 3 Virgile Fritsch
• 1 Alexandre Gramfort
• 1 Gilles Louppe
• 1 Mathieu Blondel
September 4, 2012
Changelog
• Add MultiTaskLasso and MultiTaskElasticNet for joint feature selection, by Alexandre Gramfort.
• Added metrics.auc_score and metrics.average_precision_score convenience functions by
Andreas Müller.
• Improved sparse matrix support in the Feature selection module by Andreas Müller.
• New word boundaries-aware character n-gram analyzer for the Text feature extraction module by @kernc.
• Fixed bug in spectral clustering that led to single point clusters by Andreas Müller.
• In feature_extraction.text.CountVectorizer, added an option to ignore infrequent words,
min_df by Andreas Müller.
• Add support for multiple targets in some linear models (ElasticNet, Lasso and OrthogonalMatchingPursuit) by
Vlad Niculae and Alexandre Gramfort.
• Fixes in decomposition.ProbabilisticPCA score function by Wei Li.
• Fixed feature importance computation in Gradient Tree Boosting.
• The old scikits.learn package has disappeared; all code should import from sklearn instead, which
was introduced in 0.9.
• In metrics.roc_curve, the thresholds array is now returned with its order reversed, in order to keep it consistent with the order of the returned fpr and tpr.
• In hmm objects, like hmm.GaussianHMM, hmm.MultinomialHMM, etc., all parameters must be passed to
the object when initialising it and not through fit. Now fit will only accept the data as an input parameter.
• For all SVM classes, a faulty behavior of gamma was fixed. Previously, the default gamma value was only
computed the first time fit was called and then stored. It is now recalculated on every call to fit.
• All Base classes are now abstract metaclasses so that they cannot be instantiated.
• cluster.ward_tree now also returns the parent array. This is necessary for early-stopping in which case
the tree is not completely built.
• In feature_extraction.text.CountVectorizer the parameters min_n and max_n were joined into the parameter ngram_range to enable grid-searching both at once, as sketched after this list.
• In feature_extraction.text.CountVectorizer, words that appear only in one document are now
ignored by default. To reproduce the previous behavior, set min_df=1.
• Fixed API inconsistency: linear_model.SGDClassifier.predict_proba now returns a 2d array when fit on two classes.
• Fixed API inconsistency: discriminant_analysis.QuadraticDiscriminantAnalysis.
decision_function and discriminant_analysis.LinearDiscriminantAnalysis.
decision_function now return 1d arrays when fit on two classes.
• Grid of alphas used for fitting linear_model.LassoCV and linear_model.ElasticNetCV is now
stored in the attribute alphas_ rather than overriding the init parameter alphas.
• Linear models when alpha is estimated by cross-validation store the estimated value in the alpha_ attribute
rather than just alpha or best_alpha.
• ensemble.GradientBoostingClassifier now supports ensemble.GradientBoostingClassifier.staged_predict_proba and ensemble.GradientBoostingClassifier.staged_predict.
• svm.sparse.SVC and other sparse SVM classes are now deprecated. All classes in the Support Vector Machines module now automatically select the sparse or dense representation based on the input.
• All clustering algorithms now interpret the array X given to fit as input data, in particular cluster.
SpectralClustering and cluster.AffinityPropagation which previously expected affinity ma-
trices.
• For clustering algorithms that take the desired number of clusters as a parameter, this parameter is now called
n_clusters.
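A sketch of grid-searching the joined n-gram parameter described above; the toy corpus, the pipeline and the modern GridSearchCV import path are assumptions made for the example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["good movie", "bad movie", "great film", "awful film"] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
# A single tuple-valued parameter replaces the former min_n / max_n pair.
grid = GridSearchCV(pipe, {"vect__ngram_range": [(1, 1), (1, 2)]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)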
People
• 3 flyingimmidev
• 2 Francois Savard
• 2 Hannes Schulz
• 2 Peter Welinder
• 2 Yaroslav Halchenko
• 2 Wei Li
• 1 Alex Companioni
• 1 Brandyn A. White
• 1 Bussonnier Matthias
• 1 Charles-Pierre Astolfi
• 1 Dan O’Huiginn
• 1 David Cournapeau
• 1 Keith Goodman
• 1 Ludwig Schwardt
• 1 Olivier Hervieu
• 1 Sergio Medina
• 1 Shiqiao Du
• 1 Tim Sheerman-Chase
• 1 buguen
May 7, 2012
Changelog
Highlights
• Gradient boosted regression trees (Gradient Tree Boosting) for classification and regression by Peter Prettenhofer and Scott White.
• Simple dict-based feature loader with support for categorical variables (feature_extraction.
DictVectorizer) by Lars Buitinck.
• Added Matthews correlation coefficient (metrics.matthews_corrcoef) and added macro and micro av-
erage options to metrics.precision_score, metrics.recall_score and metrics.f1_score
by Satrajit Ghosh.
• Out of Bag Estimates of generalization error for Ensemble methods by Andreas Müller.
• Randomized sparse linear models for feature selection, by Alexandre Gramfort and Gael Varoquaux
• Label Propagation for semi-supervised learning, by Clay Woolam. Note the semi-supervised API is still work
in progress, and may change.
• Added BIC/AIC model selection to classical Gaussian mixture models and unified the API with the remainder
of scikit-learn, by Bertrand Thirion
• Added sklearn.cross_validation.StratifiedShuffleSplit, which is a sklearn.
cross_validation.ShuffleSplit with balanced splits, by Yannick Schwartz.
• sklearn.neighbors.NearestCentroid classifier added, along with a shrink_threshold param-
eter, which implements shrunken centroid classification, by Robert Layton.
Other changes
• Merged dense and sparse implementations of Stochastic Gradient Descent module and exposed utility extension
types for sequential datasets seq_dataset and weight vectors weight_vector by Peter Prettenhofer.
• Added partial_fit (support for online/minibatch learning) and warm_start to the Stochastic Gradient De-
scent module by Mathieu Blondel.
• Dense and sparse implementations of Support Vector Machines classes and linear_model.
LogisticRegression merged by Lars Buitinck.
• Regressors can now be used as base estimator in the Multiclass and multilabel algorithms module by Mathieu
Blondel.
• Added n_jobs option to metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels for parallel computation, by Mathieu Blondel.
• K-means can now be run in parallel, using the n_jobs argument to either the k_means function or the KMeans class, by Robert Layton.
• Improved Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estimator documentation and introduced the new cross_validation.train_test_split helper function by Olivier Grisel. A usage sketch follows this list.
• svm.SVC members coef_ and intercept_ changed sign for consistency with decision_function;
for kernel==linear, coef_ was fixed in the one-vs-one case, by Andreas Müller.
• Performance improvements to efficient leave-one-out cross-validated Ridge regression, esp. for the
n_samples > n_features case, in linear_model.RidgeCV , by Reuben Fletcher-Costin.
• Refactoring and simplification of the Text feature extraction API and fixed a bug that caused possible negative
IDF, by Olivier Grisel.
• Beam pruning option in _BaseHMM module has been removed since it is difficult to Cythonize. If you are
interested in contributing a Cython version, you can use the python version in the git history as a reference.
• Classes in Nearest Neighbors now support arbitrary Minkowski metric for nearest neighbors searches. The
metric can be specified by argument p.
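For reference, a minimal sketch of the train_test_split helper mentioned above (shown with the modern import path; it was introduced in sklearn.cross_validation):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)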
People
• 1 Claire Revillet
• 1 Conrad Lee
• 1 Edouard Duchesnay
• 1 Jan Hendrik Metzen
• 1 Meng Xinfan
• 1 Rob Zinkov
• 1 Shiqiao
• 1 Udi Weinsberg
• 1 Virgile Fritsch
• 1 Xinfan Meng
• 1 Yaroslav Halchenko
• 1 jansoe
• 1 Leon Palafox
Changelog
• Python 2.5 compatibility was dropped; the minimum Python version needed to use scikit-learn is now 2.6.
• Sparse inverse covariance estimation using the graph Lasso, with associated cross-validated estimator, by Gael
Varoquaux
• New Tree module by Brian Holt, Peter Prettenhofer, Satrajit Ghosh and Gilles Louppe. The module comes with
complete documentation and examples.
• Fixed a bug in the RFE module by Gilles Louppe (issue #378).
• Fixed a memory leak in Support Vector Machines module by Brian Holt (issue #367).
• Faster tests by Fabian Pedregosa and others.
• Silhouette Coefficient cluster analysis evaluation metric added as sklearn.metrics.silhouette_score by Robert Layton (see the sketch after this list).
• Fixed a bug in K-means in the handling of the n_init parameter: the clustering algorithm used to be run
n_init times but the last solution was retained instead of the best solution by Olivier Grisel.
• Minor refactoring in Stochastic Gradient Descent module; consolidated dense and sparse predict methods; En-
hanced test time performance by converting model parameters to fortran-style arrays after fitting (only multi-
class).
• Adjusted Mutual Information metric added as sklearn.metrics.adjusted_mutual_info_score
by Robert Layton.
• Models like SVC/SVR/LinearSVC/LogisticRegression from libsvm/liblinear now support scaling of C regular-
ization parameter by the number of samples by Alexandre Gramfort.
• New Ensemble Methods module by Gilles Louppe and Brian Holt. The module comes with the random forest
algorithm and the extra-trees method, along with documentation and examples.
• Novelty and Outlier Detection: outlier and novelty detection, by Virgile Fritsch.
• Kernel Approximation: a transform implementing kernel approximation for fast SGD on non-linear kernels by
Andreas Müller.
• Fixed a bug due to atom swapping in Orthogonal Matching Pursuit (OMP) by Vlad Niculae.
• Sparse coding with a precomputed dictionary by Vlad Niculae.
• Mini Batch K-Means performance improvements by Olivier Grisel.
• K-means support for sparse matrices by Mathieu Blondel.
• Improved documentation for developers and for the sklearn.utils module, by Jake Vanderplas.
• Vectorized 20newsgroups dataset loader (sklearn.datasets.fetch_20newsgroups_vectorized)
by Mathieu Blondel.
• Multiclass and multilabel algorithms by Lars Buitinck.
• Utilities for fast computation of mean and variance for sparse matrices by Mathieu Blondel.
• Make sklearn.preprocessing.scale and sklearn.preprocessing.Scaler work on sparse
matrices by Olivier Grisel
• Feature importances using decision trees and/or forest of trees, by Gilles Louppe.
• Parallel implementation of forests of randomized trees by Gilles Louppe.
• sklearn.cross_validation.ShuffleSplit can subsample the train sets as well as the test sets by
Olivier Grisel.
• Errors in the build of the documentation fixed by Andreas Müller.
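A small sketch combining the two clustering metrics added in this release, silhouette_score and adjusted_mutual_info_score; the blob data and the KMeans settings are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

X, y_true = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))                 # internal metric, no ground truth needed
print(adjusted_mutual_info_score(y_true, labels))  # chance-adjusted agreement with the true labels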
Here are the code migration instructions when upgrading from scikit-learn version 0.9:
• Some estimators that may overwrite their inputs to save memory previously had overwrite_ parameters;
these have been replaced with copy_ parameters with exactly the opposite meaning.
This particularly affects some of the estimators in linear_model. The default behavior is still to copy
everything passed in.
• The SVMlight dataset loader sklearn.datasets.load_svmlight_file no longer supports loading
two files at once; use load_svmlight_files instead. Also, the (unused) buffer_mb parameter is gone.
• Sparse estimators in the Stochastic Gradient Descent module use dense parameter vector coef_ instead of
sparse_coef_. This significantly improves test time performance.
• The Covariance estimation module now has a robust estimator of covariance, the Minimum Covariance Deter-
minant estimator.
• Cluster evaluation metrics in metrics.cluster have been refactored but the changes are backwards compatible. They have been moved to metrics.cluster.supervised, along with metrics.cluster.unsupervised which contains the Silhouette Coefficient.
• The permutation_test_score function now behaves the same way as cross_val_score (i.e. uses
the mean score across the folds.)
• Cross Validation generators now use integer indices (indices=True) by default instead of boolean masks. This makes it more intuitive to use with sparse matrix data.
• The functions used for sparse coding, sparse_encode and sparse_encode_parallel have been com-
bined into sklearn.decomposition.sparse_encode, and the shapes of the arrays have been trans-
posed for consistency with the matrix factorization setting, as opposed to the regression setting.
• Fixed an off-by-one error in the SVMlight/LibSVM file format handling; files generated using sklearn.
datasets.dump_svmlight_file should be re-generated. (They should continue to work, but acciden-
tally had one extra column of zeros prepended.)
• BaseDictionaryLearning class replaced by SparseCodingMixin.
• sklearn.utils.extmath.fast_svd has been renamed sklearn.utils.extmath.
randomized_svd and the default oversampling is now fixed to 10 additional random vectors instead
of doubling the number of components to extract. The new behavior follows the reference paper.
People
• 1 Félix-Antoine Fortin
• 1 Juan Manuel Caicedo Carvajal
• 1 Nelle Varoquaux
• 1 Nicolas Pinto
• 1 Tiziano Zito
• 1 Xinfan Meng
Changelog
• Nearest Neighbors module refactoring by Jake Vanderplas : general refactoring, support for sparse matrices in
input, speed and documentation improvements. See the next section for a full list of API changes.
• Improvements on the Feature selection module by Gilles Louppe : refactoring of the RFE classes, documenta-
tion rewrite, increased efficiency and minor API changes.
• Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA) by Vlad Niculae, Gael Varo-
quaux and Alexandre Gramfort
• Printing an estimator now behaves independently of architectures and Python version thanks to Jean Kossaifi.
• Loader for libsvm/svmlight format by Mathieu Blondel and Lars Buitinck
• Documentation improvements: thumbnails in example gallery by Fabian Pedregosa.
• Important bugfixes in Support Vector Machines module (segfaults, bad performance) by Fabian Pedregosa.
• Added Multinomial Naive Bayes and Bernoulli Naive Bayes by Lars Buitinck
• Text feature extraction optimizations by Lars Buitinck
• Chi-Square feature selection (feature_selection.univariate_selection.chi2) by Lars Buit-
inck.
• Generated datasets module refactoring by Gilles Louppe
• Multiclass and multilabel algorithms by Mathieu Blondel
• Ball tree rewrite by Jake Vanderplas
• Implementation of the DBSCAN algorithm by Robert Layton (see the sketch after this list)
• KMeans predict and transform by Robert Layton
• Preprocessing module refactoring by Olivier Grisel
• Faster mean shift by Conrad Lee
• New Bootstrap, Random permutations cross-validation a.k.a. Shuffle & Split and various other improve-
ments in cross validation schemes by Olivier Grisel and Gael Varoquaux
• Adjusted Rand index and V-Measure clustering evaluation metrics by Olivier Grisel
• Added Orthogonal Matching Pursuit by Vlad Niculae
• Added 2D-patch extractor utilities in the Feature extraction module by Vlad Niculae
• Implementation of linear_model.LassoLarsCV (cross-validated Lasso solver using the Lars algorithm)
and linear_model.LassoLarsIC (BIC/AIC model selection in Lars) by Gael Varoquaux and Alexandre
Gramfort
• Scalability improvements to metrics.roc_curve by Olivier Hervieu
• Distance helper functions metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels by Robert Layton
• Mini-Batch K-Means by Nelle Varoquaux and Peter Prettenhofer.
• mldata utilities by Pietro Berkes.
• The Olivetti faces dataset by David Warde-Farley.
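A minimal sketch of the DBSCAN implementation mentioned above; the two-moons data and the parameter values are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Cluster ids assigned by density-based clustering; the label -1 marks noise points.
print(sorted(set(labels)))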
Here are the code migration instructions when upgrading from scikit-learn version 0.8:
• The scikits.learn package was renamed sklearn. There is still a scikits.learn package alias for
backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase; under Linux / MacOSX this amounts to a project-wide search-and-replace of scikits.learn with sklearn (make a backup first!).
• Estimators no longer accept model parameters as fit arguments: instead all parameters must be passed as constructor arguments or using the now public set_params method inherited from base.BaseEstimator.
Some estimators can still accept keyword arguments on fit, but this is restricted to data-dependent values (e.g. a Gram matrix or an affinity matrix that are precomputed from the X data matrix).
• The cross_val package has been renamed to cross_validation although there is also a cross_val
package alias in place for backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase; under Linux / MacOSX this amounts to a project-wide search-and-replace of cross_val with cross_validation (make a backup first!).
People
Changelog
People
• 96 Gael Varoquaux
• 96 Vlad Niculae
• 94 Fabian Pedregosa
• 36 Alexandre Gramfort
• 32 Paolo Losi
• 31 Edouard Duchesnay
• 30 Mathieu Blondel
• 25 Peter Prettenhofer
• 22 Nicolas Pinto
• 11 Virgile Fritsch
• 7 Lars Buitinck
• 6 Vincent Michel
• 5 Bertrand Thirion
• 4 Thouis (Ray) Jones
• 4 Vincent Schut
• 3 Jan Schlüter
• 2 Julien Miotte
• 2 Matthieu Perrot
• 2 Yann Malet
• 2 Yaroslav Halchenko
• 1 Amit Aides
• 1 Andreas Müller
• 1 Feth Arezki
• 1 Meng Xinfan
March 2, 2011
scikit-learn 0.7 was released in March 2011, roughly three months after the 0.6 release. This release is marked by speed improvements in existing algorithms like k-Nearest Neighbors and K-Means, and by the inclusion of an efficient algorithm for computing the Ridge Generalized Cross Validation solution. Unlike the preceding release, no new modules were added to this release.
Changelog
• Better handling of collinearity and early stopping in linear_model.lars_path [Alexandre Gramfort and
Fabian Pedregosa].
• Fixes for liblinear ordering of labels and sign of coefficients [Dan Yamins, Paolo Losi, Mathieu Blondel and
Fabian Pedregosa].
• Performance improvements for Nearest Neighbors algorithm in high-dimensional spaces [Fabian Pedregosa].
• Performance improvements for cluster.KMeans [Gael Varoquaux and James Bergstra].
• Sanity checks for SVM-based classes [Mathieu Blondel].
• Refactoring of neighbors.NeighborsClassifier and neighbors.kneighbors_graph: added
different algorithms for the k-Nearest Neighbor Search and implemented a more stable algorithm for finding
barycenter weights. Also added some developer documentation for this module, see notes_neighbors for more
information [Fabian Pedregosa].
• Documentation improvements: Added pca.RandomizedPCA and linear_model.LogisticRegression to the class reference. Also added references of matrices used for clustering and other fixes [Gael Varoquaux, Fabian Pedregosa, Mathieu Blondel, Olivier Grisel, Virgile Fritsch, Emmanuelle Gouillart].
• Bound decision_function in classes that make use of liblinear, dense and sparse variants, like svm.LinearSVC or linear_model.LogisticRegression [Fabian Pedregosa].
• Performance and API improvements to metrics.euclidean_distances and to pca.
RandomizedPCA [James Bergstra].
• Fix compilation issues under NetBSD [Kamel Ibn Hassen Derouiche]
• Allow input sequences of different lengths in hmm.GaussianHMM [Ron Weiss].
• Fix bug in affinity propagation caused by incorrect indexing [Xinfan Meng]
People
• 1 VirgileFritsch
• 1 Yaroslav Halchenko
• 1 Xinfan Meng
Changelog
• New stochastic gradient descent module by Peter Prettenhofer. The module comes with complete documentation
and examples.
• Improved svm module: memory consumption has been reduced by 50%, heuristic to automatically set class
weights, possibility to assign weights to samples (see SVM: Weighted samples for an example).
• New Gaussian Processes module by Vincent Dubourg. This module also has great documentation and some very neat examples. See example_gaussian_process_plot_gp_regression.py or example_gaussian_process_plot_gp_probabilistic_classification_after_regression.py for a taste of what can be done.
• It is now possible to use liblinear’s Multi-class SVC (option multi_class in svm.LinearSVC)
• New features and performance improvements of text feature extraction.
• Improved sparse matrix support, both in main classes (grid_search.GridSearchCV) as in modules
sklearn.svm.sparse and sklearn.linear_model.sparse.
• Lots of cool new examples and a new section that uses real-world datasets was created. These include: Faces
recognition example using eigenfaces and SVMs, Species distribution modeling, Libsvm GUI, Wikipedia princi-
pal eigenvector and others.
• Faster Least Angle Regression algorithm. It is now 2x faster than the R version in the worst case and up to 10x faster in some cases.
• Faster coordinate descent algorithm. In particular, the full path version of lasso (linear_model.lasso_path) is more than 200x faster than before.
• It is now possible to get probability estimates from a linear_model.LogisticRegression model.
• module renaming: the glm module has been renamed to linear_model.