Scikit Learn Docs PDF
Release 0.20.dev0
scikit-learn developers
1 Welcome to scikit-learn
1.1 Installing scikit-learn
1.2 Frequently Asked Questions
1.3 Support
1.4 Related Projects
1.5 About us
1.6 Who is using scikit-learn?
1.7 Release History
1.8 Version 0.20 (under development)
1.9 Version 0.19.1
1.10 Version 0.19
1.11 Previous Releases
5 Examples
5.1 General examples
5.2 Examples based on real world datasets
5.3 Biclustering
5.4 Calibration
5.5 Classification
5.6 Clustering
5.7 Covariance estimation
5.8 Cross decomposition
5.9 Dataset examples
5.10 Decomposition
5.11 Ensemble methods
5.12 Tutorial exercises
5.13 Feature Selection
5.14 Gaussian Process for Machine Learning
5.15 Generalized Linear Models
5.16 Manifold learning
5.17 Gaussian Mixture Models
5.18 Model Selection
5.19 Multioutput methods
5.20 Nearest Neighbors
5.21 Neural Networks
5.22 Preprocessing
5.23 Semi Supervised Classification
5.24 Support Vector Machines
5.25 Working with text documents
5.26 Decision Trees
6.28 sklearn.neighbors: Nearest Neighbors
6.29 sklearn.neural_network: Neural network models
6.30 sklearn.pipeline: Pipeline
6.31 sklearn.preprocessing: Preprocessing and Normalization
6.32 sklearn.random_projection: Random projection
6.33 sklearn.semi_supervised: Semi-Supervised Learning
6.34 sklearn.svm: Support Vector Machines
6.35 sklearn.tree: Decision Trees
6.36 sklearn.utils: Utilities
6.37 Recently deprecated
Bibliography
Index
CHAPTER
ONE
WELCOME TO SCIKIT-LEARN
Note: If you wish to contribute to the project, it’s recommended you install the latest development version.
Scikit-learn requires:
• Python (>= 2.7 or >= 3.4),
• NumPy (>= 1.8.2),
• SciPy (>= 0.13.3).
If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
or conda:
conda install scikit-learn
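Before installing, it can help to check whether the NumPy and SciPy versions already present meet the minimums listed above. The following is a minimal sketch using only the Python standard library (Python 3.8+ for `importlib.metadata`); the helper names `parse_version` and `check_dependencies` are hypothetical and not part of scikit-learn.

```python
from importlib import metadata

# Minimum versions quoted in the requirements list above.
MINIMUMS = {"numpy": (1, 8, 2), "scipy": (0, 13, 3)}


def parse_version(text):
    """Turn '1.8.2' or '1.26.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_dependencies(minimums=MINIMUMS):
    """Return {name: status}, where status is 'ok', 'too old' or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


print(check_dependencies())
```

A package reported as "missing" or "too old" is a hint to install or upgrade it with pip or conda first.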
If you have not installed NumPy or SciPy yet, you can also install these using conda or pip. When using pip, please
ensure that binary wheels are used, and NumPy and SciPy are not recompiled from source, which can happen when
using particular configurations of operating system and hardware (such as Linux on a Raspberry Pi). Building numpy
and scipy from source can be complex (especially on Windows) and requires careful configuration to ensure that they
link against an optimized implementation of linear algebra routines. Instead, use a third-party distribution as described
below.
If you must install scikit-learn and its dependencies with pip, you can install it as scikit-learn[alldeps]. The
most common use case for this is in a requirements.txt file used as part of an automated build process for a
PaaS application or a Docker image. This option is not intended for manual installation from the command line.
If you don’t already have a python installation with numpy and scipy, we recommend installing either via your package
manager or via a python bundle. These come with numpy, scipy, scikit-learn, matplotlib and many other helpful
scikit-learn user guide, Release 0.20.dev0
Canopy and Anaconda both ship a recent version of scikit-learn, in addition to a large set of scientific Python libraries
for Windows, Mac OSX and Linux.
Anaconda offers scikit-learn as part of its free distribution.
Warning: To upgrade or uninstall scikit-learn installed with Anaconda or conda you should not use the pip
command. Instead:
To upgrade scikit-learn:
conda update scikit-learn
To uninstall scikit-learn:
conda remove scikit-learn
Here we try to give some answers to questions that regularly pop up on the mailing list.
scikit-learn, but not scikit or SciKit nor sci-kit learn. Also not scikits.learn or scikits-learn, which were previously
used.
There are multiple scikits, which are scientific toolboxes built around SciPy. You can find a list at
https://scikits.appspot.com/scikits. Apart from scikit-learn, another popular one is scikit-image.
See Contributing. Before wanting to add a new algorithm, which is usually a major and lengthy undertaking, it is
recommended to start with known issues. Please do not contact the contributors of scikit-learn directly regarding
contributing to scikit-learn.
For general machine learning questions, please use Cross Validated with the [machine-learning] tag.
For scikit-learn usage questions, please use Stack Overflow with the [scikit-learn] and [python] tags. You
can alternatively use the mailing list.
Please make sure to include a minimal reproduction code snippet (ideally shorter than 10 lines) that highlights your
problem on a toy dataset (for instance from sklearn.datasets or randomly generated with functions of
numpy.random with a fixed random seed). Please remove any line of code that is not necessary to reproduce your problem.
The problem should be reproducible by simply copy-pasting your code snippet in a Python shell with scikit-learn
installed. Do not forget to include the import statements.
More guidance to write good reproduction code snippets can be found at:
http://stackoverflow.com/help/mcve
If your problem raises an exception that you do not understand (even after googling it), please make sure to include
the full traceback that you obtain when running the reproduction script.
For bug reports or feature requests, please make use of the issue tracker on GitHub.
There is also a scikit-learn Gitter channel where some users and developers might be found.
Please do not email any authors directly to ask for assistance, report bugs, or for any other issue related to
scikit-learn.
Don’t make a bunch object! They are not part of the scikit-learn API. Bunch objects are just a way to package some
numpy arrays. As a scikit-learn user you only ever need numpy arrays to feed your model with data.
For instance to train a classifier, all you need is a 2D array X for the input variables and a 1D array y for the target
variables. The array X holds the features as columns and samples as rows. The array y contains integer values to
encode the class membership of each sample in X.
1.2.7 How can I load my own datasets into a format usable by scikit-learn?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that
are convertible to numeric arrays such as pandas DataFrame are also acceptable.
For more information on loading your data files into these usable data structures, please refer to loading external
datasets.
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+
citations and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data
structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those which fit well within the current API of
scikit-learn, that is a fit, predict/transform interface and ordinarily having input/output that is a numpy array
or sparse matrix, are accepted.
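To make the fit, predict convention mentioned above concrete, here is a minimal sketch of a toy estimator written in plain Python. The class `MajorityClassifier` is hypothetical and only illustrates the shape of the interface; a real contribution would typically build on scikit-learn's base classes and operate on numpy arrays or sparse matrices.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must contain the same number of samples")
        # Attributes learned from data end with an underscore by convention.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows fit(...).predict(...) chaining

    def predict(self, X):
        # One prediction per row of X.
        return [self.majority_ for _ in X]


X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]  # 3 samples (rows), 2 features (columns)
y = [1, 0, 1]                             # one integer class label per sample
print(MajorityClassifier().fit(X, y).predict([[3.0, 3.0]]))  # [1]
```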
The contributor should support the importance of the proposed addition with research papers and/or implementations
in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the
methods that are already implemented in scikit-learn at least in some areas.
Also note that your implementation need not be in scikit-learn to be used together with scikit-learn tools. You can
implement your favorite algorithm in a scikit-learn compatible way, upload it to GitHub and let us know. We will be
happy to list it under Related Projects. If you already have a package on GitHub following the scikit-learn API, you
may also be interested to look at scikit-learn-contrib.
1.2.9 Why are you so selective on what algorithms you include in scikit-learn?
Code is maintenance cost, and we need to balance the amount of code we have with the size of the team (and add to
this the fact that complexity scales non-linearly with the number of features). The package relies on core developers
using their free time to fix bugs, maintain code and review contributions. Any algorithm that is added needs future
attention by the developers, at which point the original author might long have lost interest. See also What are the
inclusion criteria for new algorithms?. For a great read about long-term maintenance issues in open-source software,
look at the Executive Summary of Roads and Bridges.
Not in the foreseeable future. scikit-learn tries to provide a unified API for the basic tasks in machine learning, with
pipelines and meta-algorithms like grid search to tie everything together. The concepts, APIs, algorithms
and expertise required for structured learning are different from what scikit-learn has to offer. If we started doing
arbitrary structured learning, we’d need to redesign the whole package and the project would likely collapse under its
own weight.
There are two projects with APIs similar to scikit-learn that do structured prediction:
• pystruct handles general structured learning (focuses on SSVMs on arbitrary graph structures with approximate
inference; defines the notion of sample as an instance of the graph structure)
• seqlearn handles sequences only (focuses on exact inference; has HMMs, but mostly for the sake of
completeness; treats a feature vector as a sample and uses an offset encoding for the dependencies between feature
vectors)
No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies
and platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.
Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed
can often be achieved by a careful choice of algorithms.
In case you didn’t know, PyPy is the new, fast, just-in-time compiling Python implementation. We don’t support it.
When the NumPy support in PyPy is complete or near-complete, and SciPy is ported over as well, we can start thinking
of a port. We use too much of NumPy to work with a partial implementation.
scikit-learn estimators assume you’ll feed them real-valued feature vectors. This assumption is hard-coded in pretty
much all of the library. However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see Text feature extraction for the built-in text
vectorizers. For more general feature extraction from any kind of data, see Loading features from dicts and Feature
hashing.
Another common case is when you have non-numerical data and a custom distance (or similarity) metric on these data.
Examples include strings with edit distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with distance metrics on arbitrary data can be
done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if the dataset is not too large, you can
compute distances for all pairs of inputs. If the dataset is large, you can use feature vectors with only one “feature”,
which is an index into a separate data structure, and supply a custom metric function that looks up the actual data in
this data structure. E.g., to use DBSCAN with Levenshtein distances:
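A minimal sketch of the index-based approach follows. It assumes a small pure-Python `levenshtein` helper standing in for a dedicated edit-distance library, and the sample sequences, `eps=5` and `min_samples=2` are illustrative values only. The function `cluster_sequences` is hypothetical; it imports numpy and scikit-learn lazily and passes the custom metric to `DBSCAN` with a brute-force neighbor search.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))
        previous = current
    return previous[-1]


# The actual (non-numerical) samples live in a separate data structure.
data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]


def lev_metric(x, y):
    """Custom metric: x and y are one-element vectors holding indices into data."""
    i, j = int(x[0]), int(y[0])
    return levenshtein(data[i], data[j])


def cluster_sequences(eps=5, min_samples=2):
    """Run DBSCAN on the index vectors (requires numpy and scikit-learn)."""
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.arange(len(data)).reshape(-1, 1)  # a single "feature": the index
    model = DBSCAN(eps=eps, min_samples=min_samples,
                   metric=lev_metric, algorithm="brute")
    return model.fit(X).labels_
```

Calling `cluster_sequences()` returns one cluster label per entry of `data`, with -1 marking samples that DBSCAN treats as noise.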
1.2.15 Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s
multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as argument.
The problem is that Python multiprocessing does a fork system call without following it with an exec system
call for performance reasons. Many libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, NVIDIA’s CUDA (and probably many others), manage their own internal thread
pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to change the libraries to make them detect
when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).
But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead
of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX
standard and therefore some software vendors like Apple refuse to consider the lack of fork-safety in Accelerate /
vecLib as a bug.
In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods
(instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you
can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However the user should be aware that
using the ‘forkserver’ method prevents joblib.Parallel from calling functions defined interactively in a shell session.
If you have custom code that uses multiprocessing directly instead of using it via joblib you can enable the
‘forkserver’ mode globally for your program: Insert the following instructions in your main script:
import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    # call scikit-learn utilities with n_jobs > 1 here
You can find more details on the new start methods in the multiprocessing documentation.
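Not every platform offers every start method (for instance, ‘forkserver’ is unavailable on Windows). The sketch below, which uses only the standard library, picks the first fork-safe method that the current platform supports; the helper name `pick_start_method` is hypothetical.

```python
import multiprocessing


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method available on this platform."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    return available[0]  # fall back to the platform default


method = pick_start_method()
# A context object scopes the choice without changing the global default.
ctx = multiprocessing.get_context(method)
print(method)
```

Pools created from `ctx` (for example `ctx.Pool(...)`) then use the selected start method, while other code in the same program keeps the platform default.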
1.2.16 Why does my job use more cores than specified with n_jobs under OSX or Linux?
This happens when vectorized numpy operations are handled by libraries such as MKL or OpenBLAS.
While scikit-learn adheres to the limit set by n_jobs, numpy operations vectorized using MKL (or OpenBLAS) will
make use of multiple threads within each scikit-learn job (thread or process).
The number of threads used by the BLAS library can be set via an environment variable. For example, to set the
maximum number of threads to some integer value N, the following environment variables should be set:
• For MKL: export MKL_NUM_THREADS=N
• For OpenBLAS: export OPENBLAS_NUM_THREADS=N
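The same limits can be set from within a Python script. This is a minimal sketch, assuming the BLAS library reads these variables when it is first loaded, so the assignment has to happen before numpy (and therefore scikit-learn) is imported; the helper name `limit_blas_threads` is hypothetical.

```python
import os


def limit_blas_threads(n_threads):
    """Set the MKL and OpenBLAS thread limits for this process and its children."""
    value = str(int(n_threads))
    for name in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = value
    return value


limit_blas_threads(2)
# Import numpy / scikit-learn only after this point so that the
# BLAS library sees the limits when it initializes.
```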
1.2.17 Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning
additionally requiring GPUs for efficient computing. However, neither of these fits within the design constraints of
scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks
to achieve.
You can find more information about the addition of GPU support at Will you add GPU support?.
The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a
lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance
and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be
subject to high use immediately and should be of the highest quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working
on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are
busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely
because of this reason.
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-
random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it
relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an
execution’s numpy global random state to 42, one could execute the following in their script:
import numpy as np
np.random.seed(42)
However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure
replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation
splitters have their random_state parameter set.
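The fragility of a global random state can be illustrated without numpy. The sketch below uses Python's standard `random` module as a stand-in (an analogy, not scikit-learn's own mechanism): `random.seed` plays the role of `numpy.random.seed`, and a private `random.Random` instance plays the role of a `RandomState` instance. All function names are hypothetical.

```python
import random


def noisy_library_call():
    """Stands in for unrelated code that consumes the global random state."""
    random.random()


def draw_with_global_seed(disturb):
    random.seed(42)               # seed the shared, global generator
    if disturb:
        noisy_library_call()      # other code advances the global state
    return random.random()


def draw_with_private_rng(disturb):
    rng = random.Random(42)       # private generator, owned by the caller
    if disturb:
        noisy_library_call()      # has no effect on rng
    return rng.random()


# The global seed does not protect against interference...
assert draw_with_global_seed(False) != draw_with_global_seed(True)
# ...whereas a private generator yields the same value either way.
assert draw_with_private_rng(False) == draw_with_private_rng(True)
```

With numpy and scikit-learn, the equivalent practice is to create a `RandomState` instance and pass it explicitly through the `random_state` parameter of estimators and cross-validation splitters.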
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices of a single numeric dtype. These do
not explicitly represent categorical variables at present. Thus, unlike R’s data.frames or pandas.DataFrame, we require
explicit conversion of categorical features to numeric values, as discussed in Encoding categorical features. See also
Feature Union with Heterogeneous Data Sources for an example of working with heterogeneous (e.g. categorical and
numeric) data.
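As an illustration of what an explicit conversion of a categorical feature to numeric values looks like, here is a minimal one-hot encoding sketch in plain Python. The function `one_hot_encode` is hypothetical and is not scikit-learn's encoder; the transformers described in Encoding categorical features are the intended tools for real use.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 row with a single 1."""
    categories = sorted(set(values))
    index = {category: position for position, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each resulting column is a numeric feature, so the encoded rows can be stacked next to other numeric columns to form the array X.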
Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
The homogeneous NumPy and SciPy data objects currently expected are most efficient to process for most operations.
Extensive work would also be needed to support Pandas categorical types. Restricting input to homogeneous types
therefore reduces maintenance cost and encourages usage of efficient data structures.
1.3 Support
• Some scikit-learn developers support users on StackOverflow using the [scikit-learn] tag.
• For general theoretical or methodological Machine Learning questions, Stack Exchange is probably a more suitable
venue.
In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a
question) and put details on what you tried to achieve, what were the expected results and what you observed instead
in the details field.
Code and data snippets are welcome. Minimalistic (up to ~20 lines long) reproduction scripts are very helpful.
Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the
number and type of features (i.e. categorical or numerical) and for supervised learning tasks, what target are you
trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification or
continuous variable regression.
If you think you’ve encountered a bug, please report it to the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues
Don’t forget to include:
• steps (or better script) to reproduce,
• expected outcome,
• observed outcome or python (or gdb) tracebacks
To help developers fix your bug faster, please link to a https://gist.github.com holding a standalone minimalistic python
script that reproduces your bug and optionally a minimalistic subsample of your dataset (for instance exported as CSV
files using numpy.savetxt).
Note: gists are git cloneable repositories and thus you can use git to push datafiles to them.
1.3.4 IRC
This documentation is relative to 0.20.dev0. Documentation for other versions can be found here.
Printable pdf documentation for old versions can be found here.
Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which
facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also
accepts high-quality contributions of repositories conforming to this template.
Below is a list of sister-projects, extensions and domain specific packages.
These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s
estimators.
Data formats
• sklearn_pandas bridge for scikit-learn pipelines and pandas data frame with dedicated transformers.
• sklearn_xarray provides compatibility of scikit-learn estimators with xarray data structures.
Auto-ML
• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects.
Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to
production environments.
• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a
machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in
replacement for a scikit-learn estimator.
• scikit-optimize A library to minimize (very) expensive and noisy black-box functions. It implements several
methods for sequential model-based optimization, and includes a replacement for GridSearchCV or
RandomizedSearchCV to do cross-validated parameter search using any of these strategies.
Experimentation frameworks
• REP Environment for conducting data-driven research in a consistent and reproducible way
• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic
interfaces.
• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked
ensembling. Provides a framework for keeping track of model-hyperparameter combinations.
Model inspection and visualisation
• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.
• mlxtend Includes model visualization utilities.
• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine
learning.
• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,
model selection, evaluation, and diagnostics.
Model export for production
• sklearn-pmml Serialization of (some) scikit-learn estimators into PMML.
• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the
help of JPMML-SkLearn library.
• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.
• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)
trained by sklearn. Useful for latency-sensitive production environments.
Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing
interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.
Structured learning
• Seqlearn Sequence classification using HMMs or structured perceptron.
• HMMLearn Implementation of hidden markov models that was previously part of scikit-learn.
• PyStruct General conditional random fields and structured prediction.
• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.
• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).
Deep neural networks etc.
• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn like interface.
• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally
• nolearn A number of wrappers and abstractions around existing neural network libraries
• keras Deep Learning library capable of running on top of either TensorFlow or Theano.
• lasagne A lightweight library to build and train neural networks in Theano.
Broad scope
• mlxtend Includes a number of additional estimators as well as model visualization utilities.
• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.
Other regression and classification
• xgboost Optimised gradient boosted decision tree library.
• ML-Ensemble Generalized ensemble learning (stacking, blending, subsemble, deep ensembles, etc.).
• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc. . . ).
• py-earth Multivariate adaptive regression splines
• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection
• gplearn Genetic Programming for symbolic regression tasks.
• multiisotonic Isotonic regression on multidimensional features.
Decomposition and clustering
• lda: Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling
to sample from the true posterior distribution. (scikit-learn’s sklearn.decomposition.
LatentDirichletAllocation implementation uses variational inference to sample from a tractable
approximation of a topic model’s posterior distribution.)
• Sparse Filtering Unsupervised feature learning based on sparse-filtering
• kmodes k-modes clustering algorithm for categorical data, and several of its variations.
• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.
• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit
hypersphere.
Pre-processing
1.5 About us
This is a community effort, and as such many people have contributed to it over the years.
1.5.1 History
This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu
Brucher started work on this project as part of his thesis.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the
project and made the first public release, February the 1st 2010. Since then, several releases have appeared following
a ~3 month cycle, and a thriving international community has been leading the development.
1.5.2 People
The following people have been core contributors to scikit-learn’s development and maintenance:
• Mathieu Blondel
• Matthieu Brucher
• Lars Buitinck
• David Cournapeau
• Noel Dawe
• Vincent Dubourg
• Edouard Duchesnay
• Tom Dupré la Tour
• Alexander Fabisch
• Virgile Fritsch
• Satra Ghosh
• Angel Soler Gollonet
• Chris Filo Gorgolewski
• Alexandre Gramfort
• Olivier Grisel
• Jaques Grobler
• Yaroslav Halchenko
• Brian Holt
• Arnaud Joly
• Thouis (Ray) Jones
• Kyle Kastner
• Manoj Kumar
• Robert Layton
• Wei Li
• Paolo Losi
• Gilles Louppe
• Jan Hendrik Metzen
• Vincent Michel
• Jarrod Millman
• Andreas Müller (release manager)
• Vlad Niculae
• Joel Nothman
• Alexandre Passos
• Fabian Pedregosa
• Peter Prettenhofer
• Bertrand Thirion
• Jake VanderPlas
• Nelle Varoquaux
• Gael Varoquaux
• Ron Weiss
Please do not email the authors directly to ask for assistance or report issues. Instead, please see What’s the best way
to ask questions about scikit-learn in the FAQ.
See also:
How you can contribute to the project
If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Bibtex entry:
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
If you want to cite scikit-learn for its API or design, you may also want to consider the following paper:
API design for machine learning software: experiences from the scikit-learn project, Buitinck et al., 2013.
Bibtex entry:
@inproceedings{sklearn_api,
author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort
and Jaques Grobler and Robert Layton and Jake VanderPlas and
Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
title = {{API} design for machine learning software: experiences from
the scikit-learn project},
booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine
Learning},
year = {2013},
pages = {108--122},
}
1.5.4 Artwork
High quality PNG and SVG logos are available in the doc/logos/ source directory.
1.5.5 Funding
INRIA actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2017) to work on this project full-time. It also hosts coding sprints and other
events. Paris-Saclay Center for Data Science funded one year for a
developer to work on the project full-time (2014-2015) and 50% of the time of Guillaume Lemaitre (2016-2017).
Thomas (2017) to work on scikit-learn. Columbia University has funded Andreas Müller since
2016. Andreas Müller also received a grant to improve scikit-learn from the Alfred P. Sloan Foundation.
If you are interested in donating to the project or to one of our code-sprints, you can use the Paypal button below or the
NumFOCUS Donations Page (if you use the latter, please indicate that you are donating for the scikit-learn project).
All donations will be handled by NumFOCUS, a non-profit organization managed by a board of SciPy
community members. NumFOCUS’s mission is to foster scientific computing software, in particular in Python. As
a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available
while in compliance with tax regulations.
Donations received for the scikit-learn project will mostly go towards covering travel expenses for code sprints, as
well as towards the organization budget of the project (see note 1 below).
Notes
1 Regarding the organization budget in particular, we might use some of the donated funds to pay for other project expenses such as DNS,
• We would like to thank Rackspace for providing us with a free Rackspace Cloud account to automatically build
the documentation and the example gallery for the development version of scikit-learn using this tool.
• We would also like to thank Shining Panda for free CPU time on their Continuous Integration server.
1.6.1 Spotify
Scikit-learn provides a toolbox with solid implementations of a bunch of state-of-the-art models and makes it easy to
plug them into existing applications. We’ve been using it quite a lot for music recommendations at Spotify and I think
it’s the most well-designed ML package I’ve seen so far.
Erik Bernhardsson, Engineering Manager Music Discovery & Machine Learning, Spotify
1.6.2 Inria
At INRIA, we use scikit-learn to support leading-edge basic research in many teams: Parietal for neuroimaging, Lear
for computer vision, Visages for medical image analysis, Privatics for security. The project is a fantastic tool to
address difficult applications of machine learning in an academic environment as it is performant and versatile, but also
easy-to-use and well documented, which makes it well suited to grad students.
Gaël Varoquaux, researcher at Parietal
1.6.3 betaworks
Betaworks is a NYC-based startup studio that builds new products, grows companies, and invests in others. Over
the past 8 years we’ve launched a handful of social data analytics-driven services, such as Bitly, Chartbeat, digg and
Scale Model. Consistently the betaworks data science team uses Scikit-learn for a variety of tasks. From exploratory
analysis, to product development, it is an essential part of our toolkit. Recent uses are included in digg’s new video
recommender system, and Poncho’s dynamic heuristic subspace clustering.
Gilad Lotan, Chief Data Scientist
1.6.4 Hugging Face
At Hugging Face we’re using NLP and probabilistic models to generate conversational Artificial intelligences that are
fun to chat with. Despite using deep neural nets for a few of our NLP tasks, scikit-learn is still the bread-and-butter of
our daily machine learning routine. The ease of use and predictability of the interface, as well as the straightforward
mathematical explanations that are here when you need them, is the killer feature. We use a variety of scikit-learn
models in production and they are also operationally very pleasant to work with.
Julien Chaumond, Chief Technology Officer
1.6.5 Evernote
Building a classifier is typically an iterative process of exploring the data, selecting the features (the attributes of the
data believed to be predictive in some way), training the models, and finally evaluating them. For many of these tasks,
we relied on the excellent scikit-learn package for Python.
Read more
Mark Ayzenshtat, VP, Augmented Intelligence
1.6.6 Telecom ParisTech
At Telecom ParisTech, scikit-learn is used for hands-on sessions and home assignments in introductory and advanced
machine learning courses. The classes are for undergrads and masters students. The great benefit of scikit-learn is its
fast learning curve that allows students to quickly start working on interesting and motivating problems.
Alexandre Gramfort, Assistant Professor
1.6.7 Booking.com
At Booking.com, we use machine learning algorithms for many different applications, such as recommending hotels
and destinations to our customers, detecting fraudulent reservations, or scheduling our customer service agents.
Scikit-learn is one of the tools we use when implementing standard algorithms for prediction tasks. Its API and
documentation are excellent and make it easy to use. The scikit-learn developers do a great job of incorporating state of
the art implementations and new algorithms into the package. Thus, scikit-learn provides convenient access to a wide
spectrum of algorithms, and allows us to readily find the right tool for the right job.
Melanie Mueller, Data Scientist
1.6.8 AWeber
The scikit-learn toolkit is indispensable for the Data Analysis and Management team at AWeber. It allows us to do
AWesome stuff we would not otherwise have the time or resources to accomplish. The documentation is excellent,
allowing new engineers to quickly evaluate and apply many different algorithms to our data. The text feature extraction
utilities are useful when working with the large volume of email content we have at AWeber. The RandomizedPCA
implementation, along with Pipelining and FeatureUnions, allows us to develop complex machine learning algorithms
efficiently and reliably.
Anyone interested in learning more about how AWeber deploys scikit-learn in a production environment should check
out talks from PyData Boston by AWeber’s Michael Becker available at https://github.com/mdbecker/pydata_2013
Michael Becker, Software Engineer, Data Analysis and Management Ninjas
1.6.9 Yhat
The combination of consistent APIs, thorough documentation, and top notch implementation make scikit-learn our
favorite machine learning package in Python. scikit-learn makes doing advanced analysis in Python accessible to
anyone. At Yhat, we make it easy to integrate these models into your production applications. Thus eliminating the
unnecessary dev time encountered productionizing analytical work.
Greg Lamp, Co-founder Yhat
1.6.10 Rangespan
The Python scikit-learn toolkit is a core tool in the data science group at Rangespan. Its large collection of well
documented models and algorithms allow our team of data scientists to prototype fast and quickly iterate to find the
right solution to our learning problems. We find that scikit-learn is not only the right tool for prototyping, but its
careful and well tested implementation give us the confidence to run scikit-learn models in production.
Jurgen Van Gael, Data Science Director at Rangespan Ltd
1.6.11 Birchbox
At Birchbox, we face a range of machine learning problems typical to E-commerce: product recommendation, user
clustering, inventory prediction, trends detection, etc. Scikit-learn lets us experiment with many models, especially in
the exploration phase of a new project: the data can be passed around in a consistent way; models are easy to save and
reuse; updates keep us informed of new developments from the pattern discovery research community. Scikit-learn is
an important tool for our team, built the right way in the right language.
Thierry Bertin-Mahieux, Birchbox, Data Scientist
1.6.12 Bestofmedia Group
Scikit-learn is our #1 toolkit for all things machine learning at Bestofmedia. We use it for a variety of tasks (e.g. spam
fighting, ad click prediction, various ranking models) thanks to the varied, state-of-the-art algorithm implementations
packaged into it. In the lab it accelerates prototyping of complex pipelines. In production I can say it has proven to be
robust and efficient enough to be deployed for business critical components.
Eustache Diemert, Lead Scientist Bestofmedia Group
1.6.13 Change.org
At change.org we automate the use of scikit-learn’s RandomForestClassifier in our production systems to drive email
targeting that reaches millions of users across the world each week. In the lab, scikit-learn’s ease-of-use, performance,
and overall variety of algorithms implemented has proved invaluable in giving us a single reliable source to turn to for
our machine-learning needs.
Vijay Ramesh, Software Engineer in Data/science at Change.org
1.6.14 PHIMECA Engineering
At PHIMECA Engineering, we use scikit-learn estimators as surrogates for expensive-to-evaluate numerical models
(mostly but not exclusively finite-element mechanical models) for speeding up the intensive post-processing operations
involved in our simulation-based decision making framework. Scikit-learn’s fit/predict API together with its efficient
cross-validation tools considerably eases the task of selecting the best-fit estimator. We are also using scikit-learn for
illustrating concepts in our training sessions. Trainees are always impressed by the ease-of-use of scikit-learn despite
the apparent theoretical complexity of machine learning.
Vincent Dubourg, PHIMECA Engineering, PhD Engineer
1.6.15 HowAboutWe
At HowAboutWe, scikit-learn lets us implement a wide array of machine learning techniques in analysis and in production,
despite having a small team. We use scikit-learn’s classification algorithms to predict user behavior, enabling
us to (for example) estimate the value of leads from a given traffic source early in the lead’s tenure on our site. Also, our
users’ profiles consist of primarily unstructured data (answers to open-ended questions), so we use scikit-learn’s feature
extraction and dimensionality reduction tools to translate these unstructured data into inputs for our matchmaking
system.
Daniel Weitzenfeld, Senior Data Scientist at HowAboutWe
1.6.16 PeerIndex
At PeerIndex we use scientific methodology to build the Influence Graph - a unique dataset that allows us to identify
who’s really influential and in which context. To do this, we have to tackle a range of machine learning and predictive
modeling problems. Scikit-learn has emerged as our primary tool for developing prototypes and making quick
progress. From predicting missing data and classifying tweets to clustering communities of social media users, scikit-learn
proved useful in a variety of applications. Its very intuitive interface and excellent compatibility with other
Python tools make it an indispensable tool in our daily research efforts.
Ferenc Huszar - Senior Data Scientist at Peerindex
1.6.17 DataRobot
DataRobot is building next generation predictive analytics software to make data scientists more productive, and
scikit-learn is an integral part of our system. The variety of machine learning techniques in combination with the
solid implementations that scikit-learn offers makes it a one-stop-shopping library for machine learning in Python.
Moreover, its consistent API, well-tested code and permissive licensing allow us to use it in a production environment.
Scikit-learn has literally saved us years of work we would have had to do ourselves to bring our product to market.
Jeremy Achin, CEO & Co-founder DataRobot Inc.
1.6.18 OkCupid
We’re using scikit-learn at OkCupid to evaluate and improve our matchmaking system. The range of features it has,
especially preprocessing utilities, means we can use it for a wide variety of projects, and it’s performant enough to
handle the volume of data that we need to sort through. The documentation is really thorough, as well, which makes
the library quite easy to use.
David Koh - Senior Data Scientist at OkCupid
1.6.19 Lovely
At Lovely, we strive to deliver the best apartment marketplace, with respect to our users and our listings. From
understanding user behavior, improving data quality, and detecting fraud, scikit-learn is a regular tool for gathering
insights, predictive modeling and improving our product. The easy-to-read documentation and intuitive architecture of
the API makes machine learning both explorable and accessible to a wide range of python developers. I’m constantly
recommending that more developers and scientists try scikit-learn.
Simon Frid - Data Scientist, Lead at Lovely
1.6.20 Data Publica
Data Publica builds a new predictive sales tool for commercial and marketing teams called C-Radar. We extensively
use scikit-learn to build segmentations of customers through clustering, and to predict future customers based on past
partnerships success or failure. We also categorize companies using their website communication thanks to scikit-learn
and its machine learning algorithm implementations. Eventually, machine learning makes it possible to detect weak
signals that traditional tools cannot see. All these complex tasks are performed in an easy and straightforward way
thanks to the great quality of the scikit-learn framework.
Guillaume Lebourgeois & Samuel Charron - Data Scientists at Data Publica
1.6.21 Machinalis
Scikit-learn is the cornerstone of all the machine learning projects carried out at Machinalis. It has a consistent API, a
wide selection of algorithms and lots of auxiliary tools to deal with the boilerplate. We have used it in production
environments on a variety of projects including click-through rate prediction, information extraction, and even counting
sheep!
In fact, we use it so much that we’ve started to freeze our common use cases into Python packages, some of them
open-sourced, like FeatureForge. Scikit-learn in one word: Awesome.
Rafael Carrascosa, Lead developer
1.6.22 solido
Scikit-learn is helping to drive Moore’s Law, via Solido. Solido creates computer-aided design tools used by the
majority of top-20 semiconductor companies and fabs, to design the bleeding-edge chips inside smartphones, automobiles,
and more. Scikit-learn helps to power Solido’s algorithms for rare-event estimation, worst-case verification,
optimization, and more. At Solido, we are particularly fond of scikit-learn’s libraries for Gaussian Process models,
large-scale regularized linear regression, and classification. Scikit-learn has increased our productivity, because for
many ML problems we no longer need to “roll our own” code. This PyData 2014 talk has details.
Trent McConaghy, founder, Solido Design Automation Inc.
1.6.23 INFONEA
We employ scikit-learn for rapid prototyping and custom-made Data Science solutions within our in-memory based
Business Intelligence Software INFONEA®. As a well-documented and comprehensive collection of state-of-the-art
algorithms and pipelining methods, scikit-learn enables us to provide flexible and scalable scientific analysis solutions.
Thus, scikit-learn is immensely valuable in realizing a powerful integration of Data Science technology within self-
service business analytics.
Thorsten Kranz, Data Scientist, Coma Soft AG.
1.6.24 Dataiku
Our software, Data Science Studio (DSS), enables users to create data services that combine ETL with Machine
Learning. Our Machine Learning module integrates many scikit-learn algorithms. The scikit-learn library is a perfect
integration with DSS because it offers algorithms for virtually all business cases. Our goal is to offer a transparent and
flexible tool that makes it easier to optimize time consuming aspects of building a data service, preparing data, and
training machine learning algorithms on all types of data.
Florian Douetteau, CEO, Dataiku
1.6.25 Otto Group
Here at Otto Group, one of the global Big Five B2C online retailers, we are using scikit-learn in all aspects of our daily
work from data exploration to development of machine learning applications to the productive deployment of those
services. It helps us to tackle machine learning problems ranging from e-commerce to logistics. Its consistent APIs
enabled us to build the Palladium REST-API framework around it and continuously deliver scikit-learn based services.
Christian Rammig, Head of Data Science, Otto Group
1.6.26 Zopa
At Zopa, the first ever Peer-to-Peer lending platform, we extensively use scikit-learn to run the business and optimize
our users’ experience. It powers our Machine Learning models involved in credit risk, fraud risk, marketing, and
pricing, and has been used for originating at least 1 billion GBP worth of Zopa loans. It is very well documented,
powerful, and simple to use. We are grateful for the capabilities it has provided, and for allowing us to deliver on our
mission of making money simple and fair.
1.7 Release History
Release notes for current and recent releases are detailed on this page, with previous releases linked below.
1.8 Version 0.20 (under development)
As well as a plethora of new features and enhancements, this release is the first to be accompanied by a Glossary of
Common Terms and API Elements developed by Joel Nothman. The glossary is a reference resource to help users and
contributors become familiar with the terminology and conventions used in Scikit-learn.
1.8.1 Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• decomposition.IncrementalPCA in Python 2 (bug fix)
• isotonic.IsotonicRegression (bug fix)
• linear_model.ARDRegression (bug fix)
• linear_model.OrthogonalMatchingPursuit (bug fix)
• metrics.roc_auc_score (bug fix)
• metrics.roc_curve (bug fix)
• neural_network.BaseMultilayerPerceptron (bug fix)
• neural_network.MLPRegressor (bug fix)
• neural_network.MLPClassifier (bug fix)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
1.8.2 Changelog
New features
• Added naive_bayes.ComplementNB, which implements the Complement Naive Bayes classifier
described in Rennie et al. (2003). #8190 by Michael A. Alcorn.
• Added multioutput.RegressorChain for multi-target regression. #9257 by Kumar Ashutosh.
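As a rough illustration of the first item, the sketch below fits naive_bayes.ComplementNB on a tiny, made-up table of count features with imbalanced classes. The data and variable names are assumptions chosen for illustration only, and the snippet assumes a scikit-learn version that ships this class.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical count features (e.g. word counts); class 0 dominates the sample.
X = np.array([
    [3, 0, 1],
    [4, 0, 0],
    [5, 1, 0],
    [2, 0, 1],
    [0, 3, 2],
])
y = np.array([0, 0, 0, 0, 1])

# Complement NB estimates each class's weights from the samples that do NOT
# belong to that class, which tends to help with imbalanced classes.
clf = ComplementNB()
clf.fit(X, y)

# One document dominated by the first feature, one by the second.
pred = clf.predict(np.array([[4, 0, 0], [0, 4, 1]]))
print(pred)
```

The estimator follows the usual fit/predict interface, so it can be swapped in wherever MultinomialNB is used on non-negative count data.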
Preprocessing
• Added preprocessing.CategoricalEncoder, which encodes categorical features as a numeric
array, using either a one-hot (or dummy) encoding scheme or ordinal integers. Compared
to the existing OneHotEncoder, this new class handles all feature types (including
string-valued features) and derives the categories from the unique values in the features instead of the
maximum value in the features. #9151 by Vighnesh Birodkar and Joris Van den Bossche.
• Added preprocessing.PowerTransformer, which implements the Box-Cox power transformation,
allowing users to map strictly positive data to a distribution that is as close to Gaussian as possible. This is useful as a variance-stabilizing
transformation in situations where normality and homoscedasticity are desirable. #10210 by Eric Chang and
Maniteja Nandana.
• Added preprocessing.TransformedTargetRegressor, which transforms the target y before fitting
a regression model. The predictions are mapped back to the original space via an inverse transform. #9041
by Andreas Müller and Guillaume Lemaitre.
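To make the Box-Cox item more concrete, here is a minimal sketch that applies preprocessing.PowerTransformer to synthetic, right-skewed data. The log-normal sample and the skewness helper are assumptions introduced for illustration; the snippet assumes a version in which method="box-cox" is accepted and standardization is on by default.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Strictly positive, right-skewed data (Box-Cox is only defined for positive values).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

pt = PowerTransformer(method="box-cox")
X_gauss = pt.fit_transform(X)


def skewness(a):
    """Sample skewness (third standardized moment); hypothetical helper."""
    a = np.asarray(a, dtype=float).ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)


skew_before = skewness(X)
skew_after = skewness(X_gauss)
print("skewness before: %.2f, after: %.2f" % (skew_before, skew_after))
```

The fitted exponent is stored in pt.lambdas_, one value per feature, so the same transformation can be applied to new data with pt.transform.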
Model evaluation
• Added the metrics.balanced_accuracy_score metric and a corresponding
'balanced_accuracy' scorer for binary classification. #8066 by @xyguo and Aman Dalmia.
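The following sketch compares plain accuracy with metrics.balanced_accuracy_score on a small imbalanced binary problem. The labels are invented for illustration; the manual computation assumes the usual definition of balanced accuracy as the mean of per-class recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Eight negatives and two positives; the classifier misses one positive.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Manual check: average the recall obtained on each class.
recall_neg = np.mean(y_pred[y_true == 0] == 0)
recall_pos = np.mean(y_pred[y_true == 1] == 1)
manual = (recall_neg + recall_pos) / 2.0

print(acc, bal_acc, manual)
```

Plain accuracy looks high because the majority class dominates, while the balanced score weights both classes equally.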
Decomposition, manifold learning and clustering
• cluster.AgglomerativeClustering now supports Single Linkage clustering via
linkage='single'. #9372 by Leland McInnes and Steve Astels.
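A minimal sketch of single-linkage clustering with cluster.AgglomerativeClustering is shown below. The one-dimensional points are an assumption made for illustration: a chain of evenly spaced points plus a distant pair.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# A chain of points spaced 1.0 apart, and a separate pair far away.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0]])

# Single linkage merges clusters based on the minimum pairwise distance,
# so the whole chain ends up in one cluster.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)
```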
Metrics
• Partial AUC is available via max_fpr parameter in metrics.roc_auc_score. #3273 by Alexander
Niederbühl.
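As a sketch of the max_fpr parameter, the snippet below computes both the full AUC and a partial AUC restricted to false positive rates up to 0.25. The labels and scores are made up for illustration, and the exact scaling of the partial value may depend on the installed version, so only its range is relied on here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# Area under the whole ROC curve.
full_auc = roc_auc_score(y_true, y_score)

# Area restricted to the low false-positive-rate region [0, 0.25].
partial_auc = roc_auc_score(y_true, y_score, max_fpr=0.25)

print(full_auc, partial_auc)
```

Restricting the range is useful when only operating points with few false positives matter, for example in screening applications.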
Enhancements
Bug fixes
• Fixed a bug in naive_bayes.GaussianNB which incorrectly raised an error for a prior list that summed to 1.
#10005 by Gaurav Dhingra.
• Fixed a bug in linear_model.LogisticRegression where, with
multi_class='multinomial', the predict_proba method returned incorrect probabilities
in the case of binary outcomes. #9939 by Roger Westover.
• Fixed a bug in linear_model.OrthogonalMatchingPursuit that was broken when setting
normalize=False. #10071 by Alexandre Gramfort.
• Fixed a bug in linear_model.ARDRegression which caused incorrectly updated estimates for the standard
deviation and the coefficients. #10153 by Jörg Döpfert.
• Fixed a bug when fitting ensemble.GradientBoostingClassifier or
ensemble.GradientBoostingRegressor with warm_start=True, which previously raised a segmentation
fault because a CSC matrix was not converted into the CSR format expected by decision_function. Similarly,
Fortran-ordered arrays are converted to C-ordered arrays in the dense case. #9991 by Guillaume Lemaitre.
• Fixed a bug in neighbors.NearestNeighbors where fitting a NearestNeighbors model fails when a) the
distance metric used is a callable and b) the input to the NearestNeighbors model is sparse. #9579 by Thomas
Kober.
• Fixed a bug in naive_bayes.MultinomialNB which did not accept vector-valued pseudocounts (alpha).
#10346 by Tobias Madsen.
• Fixed a bug in svm.SVC where, when the argument kernel was unicode in Python 2, the predict_proba
method raised an unexpected TypeError given dense inputs. #10412 by Jiongyan Zhang.
• Fixed a bug in tree.BaseDecisionTree with splitter="best" where the split threshold could become infinite
when values in X were near infinite. #10536 by Jonathan Ohayon.
• Fixed a bug in linear_model.ElasticNet which caused the input to be overwritten when using parameter
copy_X=True and check_input=False. #10581 by Yacine Mazari.
• Fixed a bug in sklearn.linear_model.Lasso where the coefficient had the wrong shape when
fit_intercept=False. #10687 by Martin Hahn.
Decomposition, manifold learning and clustering
• Fix for uninformative error in decomposition.IncrementalPCA: an error is now raised if the number of
components is larger than the chosen batch size. The n_components=None case was adapted accordingly.
#6452 by Wally Gauze.
• Fixed a bug where the partial_fit method of decomposition.IncrementalPCA used integer division
instead of float division on Python 2 versions. #9492 by James Bourbeau.
• Fixed a bug where the fit method of cluster.AffinityPropagation stored cluster centers as a 3d array
instead of a 2d array in case of non-convergence. For the same class, fixed undefined and arbitrary behavior in
case of training data where all samples had equal similarity. #9612 by Jonatan Samoocha.
• In decomposition.PCA, selecting an n_components parameter greater than the number of samples now raises
an error. Similarly, the n_components=None case now selects the minimum of n_samples and n_features.
#8484 by Wally Gauze.
• Fixed a bug in datasets.fetch_kddcup99, where data were not properly shuffled. #9731 by Nicolas
Goix.
• Fixed a bug in decomposition.PCA where users would get an unexpected error with large datasets when
n_components='mle' on Python 3 versions. #9886 by Hanmin Qin.
• Fixed a bug when setting parameters on a meta-estimator, involving both a wrapped estimator and its parameter.
#9999 by Marcus Voss and Joel Nothman.
• k_means now gives a warning if the number of distinct clusters found is smaller than n_clusters. This
may occur when the number of distinct points in the data set is smaller than the number of clusters one
is looking for. #10059 by Christian Braune.
• Fixed a bug in datasets.make_circles, where no odd number of data points could be generated. #10037
by Christian Braune.
• Fixed a bug in cluster.spectral_clustering where the normalization of the spectrum was using a
division instead of a multiplication. #8129 by Jan Margeta, Guillaume Lemaitre, and Devansh D..
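The decomposition.PCA behavior described in this list can be sketched as follows. The random matrix is an assumption for illustration, and the snippet assumes a release that includes the fix, in which an oversized n_components raises a ValueError.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(5, 10)  # 5 samples, 10 features

# With n_components=None, the number of components kept is
# min(n_samples, n_features).
pca = PCA(n_components=None).fit(X)
n_kept = pca.n_components_

# Asking for more components than there are samples is rejected.
try:
    PCA(n_components=8).fit(X)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

print(n_kept, raised)
```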
Metrics
• Fixed a bug in metrics.precision_recall_fscore_support when a truncated
range(n_labels) is passed as the value for labels. #10377 by Gaurav Dhingra.
• Fixed a bug due to floating point error in metrics.roc_auc_score with non-integer sample weights.
#9786 by Hanmin Qin.
• Fixed a bug where metrics.roc_curve sometimes started on the y-axis instead of at (0, 0), which was inconsistent
with the documentation and other implementations. Note that this does not influence the result of
metrics.roc_auc_score. #10093 by alexryndin and Hanmin Qin.
• Fixed a bug to avoid integer overflow: the product is now cast to a 64-bit integer in mutual_info_score. #9772 by
Kumar Ashutosh.
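The metrics.roc_curve fix can be checked with a short sketch like the one below. The four-sample input is made up for illustration, and the assertions on the first point assume a release that includes the fix.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The curve is expected to start at the origin and end at (1, 1).
start = (float(fpr[0]), float(tpr[0]))
end = (float(fpr[-1]), float(tpr[-1]))
print(start, end)
```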
Neighbors
• Fixed a bug so that predict in neighbors.RadiusNeighborsRegressor can handle an empty neighbor set
when using non-uniform weights. It also raises a new warning when no neighbors are found for samples. #9655
by Andreas Bjerre-Nielsen.
Feature Extraction
• Fixed a bug in feature_extraction.image.extract_patches_2d which would throw an exception
if max_patches was greater than or equal to the number of all possible patches rather than simply
returning the number of possible patches. #10100 by Varun Agrawal.
• Fixed a bug in feature_extraction.text.CountVectorizer,
feature_extraction.text.TfidfVectorizer and feature_extraction.text.HashingVectorizer to support 64-bit sparse
array indexing, necessary to process large datasets with more than 2·10^9 tokens (words or n-grams). #9147 by
Claes-Fredrik Mannby and Roman Yurchak.
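The extract_patches_2d item can be illustrated with a small sketch. The 4x4 image is an assumption for illustration; it admits 3 x 3 = 9 distinct 2x2 patches, so a larger max_patches is expected to be capped at 9 in releases that include the fix.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

image = np.arange(16, dtype=float).reshape(4, 4)

# All possible 2x2 patches: (4 - 2 + 1) * (4 - 2 + 1) = 9.
all_patches = extract_patches_2d(image, (2, 2))

# Requesting more patches than exist no longer fails; the count is capped.
capped = extract_patches_2d(image, (2, 2), max_patches=20, random_state=0)

print(all_patches.shape, capped.shape)
```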
Utils
• utils.validation.check_array now yields a FutureWarning indicating that arrays of bytes/strings will
be interpreted as decimal numbers beginning in version 0.22. #10229 by Ryan Lee.
Preprocessing
• Fixed bugs in preprocessing.LabelEncoder which would sometimes throw errors when transform
or inverse_transform was called with empty arrays. #10458 by Mayur Kulkarni.
• Fix ValueError in preprocessing.LabelEncoder when using inverse_transform on unseen labels.
#9816 by Charlie Newey.
• Allow tests in estimator_checks.check_estimator to test functions that accept pairwise data. #9701
by Kyle Johnson.
• Allow estimator_checks.check_estimator to check that there are no private settings apart from
parameters during estimator initialization. #9378 by Herilalaina Rakotoarison.
• Add test estimator_checks.check_methods_subset_invariance to check that estimator methods
are invariant if applied to a data subset. #10420 by Jonathan Ohayon.
1.9 Version 0.19.1
1.9.1 Changelog
API changes
• Reverted the addition of metrics.ndcg_score and metrics.dcg_score, which had been merged into
version 0.19.0 by mistake. The implementations were broken and undocumented.
• return_train_score, which was added to model_selection.GridSearchCV,
model_selection.RandomizedSearchCV and model_selection.cross_validate in
version 0.19.0, will change its default value from True to False in version 0.21. We found that calculating
the training score could have a great effect on cross-validation runtime in some cases. Users should explicitly
set return_train_score to False if prediction or scoring functions are slow, resulting in a deleterious
effect on CV runtime, or to True if they wish to use the calculated scores. #9677 by Kumar Ashutosh and Joel
Nothman.
• correlation_models and regression_models from the legacy gaussian processes implementation
have been belatedly deprecated. #9717 by Kumar Ashutosh.
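Since the default of return_train_score is changing, setting it explicitly avoids surprises. The sketch below does so with model_selection.cross_validate; the iris data and the logistic regression model are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Explicitly request training scores (costs extra prediction time per fold).
with_train = cross_validate(clf, X, y, cv=3, return_train_score=True)

# Explicitly skip them to keep cross-validation as cheap as possible.
without_train = cross_validate(clf, X, y, cv=3, return_train_score=False)

print(sorted(with_train))
print(sorted(without_train))
```

The same keyword is accepted by GridSearchCV and RandomizedSearchCV, where it controls whether the train-score columns appear in cv_results_.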
Bug fixes
• Fix regression in pipeline.Pipeline where it no longer accepted steps as a tuple. #9604 by Joris Van
den Bossche.
• Fix bug where n_iter was not properly deprecated, leaving n_iter unavailable for interim use
in linear_model.SGDClassifier, linear_model.SGDRegressor,
linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor and
linear_model.Perceptron. #9558 by Andreas Müller.
• Dataset fetchers now make sure temporary files are closed before removing them; failing to do so caused errors on Windows.
#9847 by Joan Massich.
• Fixed a regression in manifold.TSNE where it no longer supported metrics other than 'euclidean' and
'precomputed'. #9623 by Oli Blum.
Enhancements
• Our test suite and utils.estimator_checks.check_estimator can now be run without Nose
installed. #9697 by Joan Massich.
• To improve usability of version 0.19’s pipeline.Pipeline caching, memory now allows
joblib.Memory instances. This makes use of the new utils.validation.check_memory helper. #9584
by Kumar Ashutosh.
• Some fixes to examples: #9750, #9788, #9815
• Made a FutureWarning in SGD-based estimators less verbose. #9802 by Vrishank Bhardwaj.
1.10 Version 0.19
1.10.1 Highlights
You can also learn faster. For instance, the new option to cache transformations in pipeline.Pipeline makes
grid search over pipelines including slow transformations much more efficient. And you can predict faster: if you’re
sure you know what you’re doing, you can turn off validating that the input is finite using config_context.
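As a sketch of the caching option mentioned above, the snippet below runs a grid search over a pipeline whose transformer results are cached on disk through the memory parameter. The temporary directory, the PCA step and the parameter grid are assumptions chosen for illustration.

```python
import shutil
import tempfile

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Fitted transformers are cached in this directory, so the PCA step is not
# refitted for every value of the classifier parameter on the same fold.
cachedir = tempfile.mkdtemp()
try:
    pipe = Pipeline(
        [("reduce", PCA(n_components=2)),
         ("clf", LogisticRegression(max_iter=1000))],
        memory=cachedir,
    )
    search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    best_C = search.best_params_["clf__C"]
finally:
    # Remove the cache once it is no longer needed.
    shutil.rmtree(cachedir, ignore_errors=True)

print(best_C)
```

Caching pays off mainly when the transformers are expensive to fit; for a cheap step like this small PCA the overhead of writing to disk can outweigh the savings.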
We’ve made some important fixes too. We’ve fixed a longstanding implementation error in
metrics.average_precision_score, so please be cautious with prior results reported from that function. A number
of errors in the manifold.TSNE implementation have been fixed, particularly in the default Barnes-Hut
approximation. semi_supervised.LabelSpreading and semi_supervised.LabelPropagation have had
substantial fixes. LabelPropagation was previously broken. LabelSpreading should now correctly respect its alpha
parameter.
1.10.2 Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans with sparse X and initial centroids given (bug fix)
• cross_decomposition.PLSRegression with scale=True (bug fix)
• ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
where min_impurity_split is used (bug fix)
• gradient boosting loss='quantile' (bug fix)
• ensemble.IsolationForest (bug fix)
• feature_selection.SelectFdr (bug fix)
• linear_model.RANSACRegressor (bug fix)
• linear_model.LassoLars (bug fix)
• linear_model.LassoLarsIC (bug fix)
• manifold.TSNE (bug fix)
• neighbors.NearestCentroid (bug fix)
• semi_supervised.LabelSpreading (bug fix)
• semi_supervised.LabelPropagation (bug fix)
• tree based models where min_weight_fraction_leaf is used (enhancement)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
1.10.3 Changelog
New features
Other estimators
• Added the neighbors.LocalOutlierFactor class for anomaly detection based on nearest neighbors.
#5279 by Nicolas Goix and Alexandre Gramfort.
• Added preprocessing.QuantileTransformer class and
preprocessing.quantile_transform function for feature normalization based on quantiles. #8363 by Denis Engemann,
Guillaume Lemaitre, Olivier Grisel, Raghav RV, Thierry Guillemot, and Gael Varoquaux.
• The new solver 'mu' implements a Multiplicative Update in decomposition.NMF, allowing the optimization
of all beta-divergences, including the Frobenius norm, the generalized Kullback-Leibler divergence and the
Itakura-Saito divergence. #5295 by Tom Dupre la Tour.
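To illustrate the first item in this list, the sketch below uses neighbors.LocalOutlierFactor to flag a single point placed far from a dense cluster. The synthetic data and the choice of n_neighbors are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# Fifty inliers around the origin plus one obvious outlier at (6, 6).
X_inliers = 0.3 * rng.randn(50, 2)
X = np.vstack([X_inliers, [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)  # 1 for inliers, -1 for outliers

# The most negative factor belongs to the most abnormal sample.
most_abnormal = int(np.argmin(lof.negative_outlier_factor_))
print(labels[-1], most_abnormal)
```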
Model selection and evaluation
• model_selection.GridSearchCV and model_selection.RandomizedSearchCV now support
simultaneous evaluation of multiple metrics. Refer to the Specifying multiple metrics for evaluation section of
the user guide for more information. #7388 by Raghav RV
• Added model_selection.cross_validate, which allows evaluation of multiple metrics. This function
returns a dict with more useful information from cross-validation, such as the train scores, fit times and score
times. Refer to The cross_validate function and multiple metric evaluation section of the user guide for more
information. #7388 by Raghav RV
• Added metrics.mean_squared_log_error, which computes the mean square error of the logarithmic
transformation of targets, particularly useful for targets with an exponential trend. #7655 by Karan Desai.
• Added metrics.dcg_score and metrics.ndcg_score, which compute Discounted cumulative gain
(DCG) and Normalized discounted cumulative gain (NDCG). #7739 by David Gasquez.
• Added the model_selection.RepeatedKFold and model_selection.
RepeatedStratifiedKFold. #8120 by Neeraj Gangwar.
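A minimal sketch of how the additions above can be combined: model_selection.cross_validate evaluates two metrics at once over the splits produced by model_selection.RepeatedKFold. The dataset, the estimator and the choice of metrics are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds repeated twice with different shuffles gives 10 train/test splits.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    scoring=["accuracy", "f1_macro"],  # several metrics evaluated in one pass
    cv=cv,
    return_train_score=True,
)

# One array per metric, plus fit and score times, each with one entry per split.
for key in sorted(results):
    print(key, results[key].shape)
```

The returned dict uses a test_ or train_ prefix followed by the metric name, next to the fit_time and score_time entries.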
Miscellaneous
• Validation that input data contains no NaN or inf can now be suppressed using config_context, at your
own risk. This will save on runtime, and may be particularly useful for prediction time. #7548 by Joel Nothman.
• Added a test to ensure parameter listing in docstrings matches the function/class signature. #9206 by Alexandre
Gramfort and Raghav RV.
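The sketch below shows one way the validation switch mentioned above might be used at prediction time. The option name assume_finite comes from the sklearn.config_context API rather than from the text above, and the data and model are assumptions for the example. Skipping the check is only safe when the input is already known to be free of NaN and inf.

```python
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# Inside the block, the finiteness check on inputs is skipped.
# This is only safe if X is already known to contain no NaN or inf.
with sklearn.config_context(assume_finite=True):
    inside = sklearn.get_config()["assume_finite"]
    fast_pred = clf.predict(X)

# The previous setting is restored when the block exits.
outside = sklearn.get_config()["assume_finite"]
checked_pred = clf.predict(X)
```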
Enhancements
Bug fixes
• Deprecate the y parameter in transform and inverse_transform. These methods should not accept a y
parameter, as they are used at prediction time. #8174 by Tahar Zanouda, Alexandre Gramfort and Raghav RV.
• The store_covariances parameter and covariances_ attribute of discriminant_analysis.
QuadraticDiscriminantAnalysis have been renamed to store_covariance and covariance_
to be consistent with the corresponding names in discriminant_analysis.
LinearDiscriminantAnalysis. The old names will be removed in version 0.21. #7998 by Jiacheng
• SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following
backported functions in utils have been removed or deprecated accordingly. #8854 and #8874 by Naoya
Kanai
Removed in 0.19:
– utils.fixes.argpartition
– utils.fixes.array_equal
– utils.fixes.astype
– utils.fixes.bincount
– utils.fixes.expit
– utils.fixes.frombuffer_empty
– utils.fixes.in1d
– utils.fixes.norm
– utils.fixes.rankdata
– utils.fixes.safe_copy
Deprecated in 0.19, to be removed in 0.21:
– utils.arpack.eigs
– utils.arpack.eigsh
– utils.arpack.svds
– utils.extmath.fast_dot
– utils.extmath.logsumexp
– utils.extmath.norm
– utils.extmath.pinvh
– utils.graph.graph_laplacian
– utils.random.choice
– utils.sparsetools.connected_components
– utils.stats.rankdata
• Estimators with both methods decision_function and predict_proba are now required to have a
monotonic relation between them. The method check_decision_proba_consistency has been added
in utils.estimator_checks to check their consistency. #7578 by Shubham Bhardwaj
• All checks in utils.estimator_checks, in particular utils.estimator_checks.
check_estimator, now accept estimator instances. Most other checks no longer accept estimator classes.
#9019 by Andreas Müller.
• Ensure that estimators’ attributes ending with _ are not set in the constructor but only in the fit method.
Most notably, ensemble estimators (deriving from ensemble.BaseEnsemble) now only have self.
estimators_ available after fit. #7464 by Lars Buitinck and Loic Esteve.
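The rule about attributes ending with _ can be checked directly, as in the sketch below. The forest size and dataset are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0)

# Before fit: learned attributes (trailing underscore) do not exist yet.
fitted_before = hasattr(forest, "estimators_")

forest.fit(X, y)

# After fit: the list of fitted sub-estimators is available.
fitted_after = hasattr(forest, "estimators_")
n_trees = len(forest.estimators_)

print(fitted_before, fitted_after, n_trees)
```

Code that used to read estimators_ on an unfitted ensemble therefore needs a hasattr check or a prior call to fit.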
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.18,
including:
Joel Nothman, Loic Esteve, Andreas Mueller, Guillaume Lemaitre, Olivier Grisel, Hanmin Qin, Raghav RV, Alexandre
Gramfort, themrmax, Aman Dalmia, Gael Varoquaux, Naoya Kanai, Tom Dupré la Tour, Rishikesh, Nelson Liu, Tae-
hoon Lee, Nelle Varoquaux, Aashil, Mikhail Korobov, Sebastin Santy, Joan Massich, Roman Yurchak, RAKOTOARI-
SON Herilalaina, Thierry Guillemot, Alexandre Abadie, Carol Willing, Balakumaran Manoharan, Josh Karnofsky,
Vlad Niculae, Utkarsh Upadhyay, Dmitry Petrov, Minghui Liu, Srivatsan, Vincent Pham, Albert Thomas, Jake Van-
derPlas, Attractadore, JC Liu, alexandercbooth, chkoar, Óscar Nájera, Aarshay Jain, Kyle Gilliam, Ramana Subra-
manyam, CJ Carey, Clement Joudet, David Robles, He Chen, Joris Van den Bossche, Karan Desai, Katie Luangkote,
Leland McInnes, Maniteja Nandana, Michele Lacchia, Sergei Lebedev, Shubham Bhardwaj, akshay0724, omtcyfz,
rickiepark, waterponey, Vathsala Achar, jbDelafosse, Ralf Gommers, Ekaterina Krivich, Vivek Kumar, Ishank Gulati,
Dave Elliott, ldirer, Reiichiro Nakano, Levi John Wolf, Mathieu Blondel, Sid Kapur, Dougal J. Sutherland, midinas,
mikebenfield, Sourav Singh, Aseem Bansal, Ibraim Ganiev, Stephen Hoover, AishwaryaRK, Steven C. Howell, Gary
Foreman, Neeraj Gangwar, Tahar, Jon Crall, dokato, Kathy Chen, ferria, Thomas Moreau, Charlie Brummitt, Nicolas
Goix, Adam Kleczewski, Sam Shleifer, Nikita Singh, Basil Beirouti, Giorgio Patrini, Manoj Kumar, Rafael Possas,
James Bourbeau, James A. Bednar, Janine Harper, Jaye, Jean Helie, Jeremy Steward, Artsiom, John Wei, Jonathan
LIgo, Jonathan Rahn, seanpwilliams, Arthur Mensch, Josh Levy, Julian Kuhlmann, Julien Aubert, Jörn Hees, Kai,
shivamgargsya, Kat Hempstalk, Kaushik Lakshmikanth, Kennedy, Kenneth Lyons, Kenneth Myers, Kevin Yap, Kir-
ill Bobyrev, Konstantin Podshumok, Arthur Imbert, Lee Murray, toastedcornflakes, Lera, Li Li, Arthur Douillard,
Mainak Jas, tobycheese, Manraj Singh, Manvendra Singh, Marc Meketon, MarcoFalke, Matthew Brett, Matthias
Gilch, Mehul Ahuja, Melanie Goetz, Meng, Peng, Michael Dezube, Michal Baumgartner, vibrantabhi19, Artem Golu-
bin, Milen Paskov, Antonin Carette, Morikko, MrMjauh, NALEPA Emmanuel, Namiya, Antoine Wendlinger, Narine
Kokhlikyan, NarineK, Nate Guerin, Angus Williams, Ang Lu, Nicole Vavrova, Nitish Pandey, Okhlopkov Daniil
Olegovich, Andy Craze, Om Prakash, Parminder Singh, Patrick Carlson, Patrick Pei, Paul Ganssle, Paulo Haddad,
Paweł Lorek, Peng Yu, Pete Bachant, Peter Bull, Peter Csizsek, Peter Wang, Pieter Arthur de Jong, Ping-Yao, Chang,
Preston Parry, Puneet Mathur, Quentin Hibon, Andrew Smith, Andrew Jackson, 1kastner, Rameshwar Bhaskaran, Re-
becca Bilbro, Remi Rampin, Andrea Esuli, Rob Hall, Robert Bradshaw, Romain Brault, Aman Pratik, Ruifeng Zheng,
Russell Smith, Sachin Agarwal, Sailesh Choyal, Samson Tan, Samuël Weber, Sarah Brown, Sebastian Pölsterl, Se-
bastian Raschka, Sebastian Saeger, Alyssa Batula, Abhyuday Pratap Singh, Sergey Feldman, Sergul Aydore, Sharan
Yalburgi, willduan, Siddharth Gupta, Sri Krishna, Almer, Stijn Tonk, Allen Riddell, Theofilos Papapanagiotou, Alison,
Alexis Mignon, Tommy Boucher, Tommy Löfstedt, Toshihiro Kamishima, Tyler Folkman, Tyler Lanigan, Alexander
Junge, Varun Shenoy, Victor Poughon, Vilhelm von Ehrenheim, Aleksandr Sandrovskii, Alan Yee, Vlasios Vasileiou,
Warut Vijitbenjaronk, Yang Zhang, Yaroslav Halchenko, Yichuan Liu, Yuichi Fujikawa, affanv14, aivision2020, xor,
andreh7, brady salz, campustrampus, Agamemnon Krasoulis, ditenberg, elena-sharova, filipj8, fukatani, gedeck, guin-
iol, guoci, hakaa1, hongkahjun, i-am-xhy, jakirkham, jaroslaw-weber, jayzed82, jeroko, jmontoyam, jonathan.striebel,
josephsalmon, jschendel, leereeves, martin-hahn, mathurinm, mehak-sachdeva, mlewis1729, mlliou112, mthorrell,
ndingwall, nuffe, yangarbiter, plagree, pldtc325, Breno Freitas, Brett Olsen, Brian A. Alfano, Brian Burns, polmauri,
Brandon Carter, Charlton Austin, Chayant T15h, Chinmaya Pancholi, Christian Danielsen, Chung Yen, Chyi-Kwei
Yau, pravarmahajan, DOHMATOB Elvis, Daniel LeJeune, Daniel Hnyk, Darius Morawiec, David DeTomaso, David
Gasquez, David Haberthür, David Heryanto, David Kirkby, David Nicholson, rashchedrin, Deborah Gertrude Digges,
Denis Engemann, Devansh D, Dickson, Bob Baxley, Don86, E. Lynch-Klarup, Ed Rogers, Elizabeth Ferriss, Ellen-
Co2, Fabian Egli, Fang-Chieh Chou, Bing Tian Dai, Greg Stupp, Grzegorz Szpak, Bertrand Thirion, Hadrien Bertrand,
Harizo Rajaona, zxcvbnius, Henry Lin, Holger Peters, Icyblade Dai, Igor Andriushchenko, Ilya, Isaac Laughlin, Iván
Vallés, Aurélien Bellet, JPFrancoia, Jacob Schreiber, Asish Mahapatra
Scikit-learn 0.18 is the last major release of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
Changelog
• Fixes for compatibility with NumPy 1.13.0: #7946 #8355 by Loic Esteve.
• Minor compatibility changes in the examples #9010 #8040 #9149.
Code Contributors
Changelog
Enhancements
Bug fixes
• Fix issue where min_grad_norm and n_iter_without_progress parameters were not being utilised
by manifold.TSNE. #6497 by Sebastian Säger
• Fix bug for svm’s decision values when decision_function_shape is ovr in svm.SVC. svm.SVC’s
decision_function was incorrect from versions 0.17.0 through 0.18.0. #7724 by Bing Tian Dai
• The min_weight_fraction_leaf parameter of tree-based classifiers and regressors now assumes uniform
sample weights by default if the sample_weight argument is not passed to the fit function. Previously, the
parameter was silently ignored. #7301 by Nelson Liu.
• Tree splitting criterion classes’ cloning/pickling is now memory safe. #7680 by Ibraim Ganiev.
Linear, kernelized and related models
• Length of explained_variance_ratio_ of discriminant_analysis.
LinearDiscriminantAnalysis changed for both Eigen and SVD solvers. The attribute now has
a length of min(n_components, n_classes - 1). #7632 by JPFrancoia
• Fixed a numerical issue with linear_model.RidgeCV on centered data when n_features > n_samples.
#6178 by Bertrand Thirion
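The length rule for the explained variance ratio of LinearDiscriminantAnalysis can be verified with a short sketch. The iris dataset (3 classes) and n_components=1 are assumptions chosen so that min(n_components, n_classes - 1) equals 1.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes, so at most 2 discriminant axes

lengths = {}
for solver in ("svd", "eigen"):
    lda = LinearDiscriminantAnalysis(solver=solver, n_components=1).fit(X, y)
    # min(n_components, n_classes - 1) = min(1, 2) = 1
    lengths[solver] = len(lda.explained_variance_ratio_)

print(lengths)
```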
Scikit-learn 0.18 will be the last version of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
The values for each parameter are stored separately as numpy masked object arrays. The value, for
that search candidate, is masked if the corresponding parameter is not applicable. Additionally, a list of all the
parameter dicts is stored at cv_results_['params'].
• Parameters n_folds and n_iter renamed to n_splits
Some parameter names have changed: The n_folds parameter in new model_selection.KFold,
model_selection.GroupKFold (see below for the name change), and model_selection.
StratifiedKFold is now renamed to n_splits. The n_iter parameter in model_selection.
ShuffleSplit, the new class model_selection.GroupShuffleSplit and model_selection.
StratifiedShuffleSplit is now renamed to n_splits.
• Rename of splitter classes which accept group labels along with data
The cross-validation splitters LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and
LeavePLabelOut have been renamed to model_selection.GroupKFold, model_selection.
GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.
LeavePGroupsOut respectively.
Note the change from singular to plural form in model_selection.LeavePGroupsOut.
• Fit parameter labels renamed to groups
The labels parameter in the split method of the newly renamed splitters model_selection.
GroupKFold, model_selection.LeaveOneGroupOut, model_selection.
LeavePGroupsOut, model_selection.GroupShuffleSplit is renamed to groups following the
new nomenclature of their class names.
• Parameter n_labels renamed to n_groups
The parameter n_labels in the newly renamed model_selection.LeavePGroupsOut is changed to
n_groups.
• Training scores and Timing information
cv_results_ also includes the training scores for each cross-validation split (with keys such
as 'split0_train_score'), as well as their mean ('mean_train_score') and stan-
dard deviation ('std_train_score'). To avoid the cost of evaluating training score, set
return_train_score=False.
Additionally, the mean and standard deviation of the times taken to split, train and score the model across all the
cross-validation splits are available at the keys 'mean_time' and 'std_time' respectively.
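To make the structure of cv_results_ more concrete, here is a small sketch of a grid search over two sub-grids. The estimator, the grid values and cv=3 are assumptions for the example. return_train_score=True is passed explicitly so that the training score keys are present whatever the default is.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two sub-grids: gamma only applies to the RBF candidates.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0]},
]

search = GridSearchCV(SVC(), param_grid, cv=3, return_train_score=True)
search.fit(X, y)
results = search.cv_results_

# param_gamma is a masked array: masked for the linear-kernel candidates.
gamma_mask = results["param_gamma"].mask.tolist()

# The full parameter dict of every candidate is kept under 'params'.
candidates = results["params"]

# Per-split and aggregated training scores.
train_keys = sorted(k for k in results if k.endswith("train_score"))

print(gamma_mask)
print(candidates)
print(train_keys)
```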
Changelog
New features
Other estimators
• New mixture.GaussianMixture and mixture.BayesianGaussianMixture replace former mix-
ture models, employing faster inference for sounder results. #7295 by Wei Xue and Thierry Guillemot.
• Class decomposition.RandomizedPCA is now factored into decomposition.PCA and is available
by setting the parameter svd_solver='randomized'. The default value of n_iter for
'randomized' has changed to 4. The old behavior of PCA is recovered by svd_solver='full'. An
additional solver calls arpack and performs truncated (non-randomized) SVD. By default, the best solver is
selected depending on the size of the input and the number of components requested. #5299 by Giorgio Patrini.
• Added two functions for mutual information estimation: feature_selection.
mutual_info_classif and feature_selection.mutual_info_regression. These
functions can be used in feature_selection.SelectKBest and feature_selection.
SelectPercentile as score functions. By Andrea Bravi and Nikolay Mayorov.
• Added the ensemble.IsolationForest class for anomaly detection based on random forests. By Nicolas
Goix.
• Added algorithm="elkan" to cluster.KMeans implementing Elkan’s fast K-Means algorithm. By
Andreas Müller.
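As an illustration of the merged PCA solvers described above, the sketch below fits the same data with svd_solver='full' and svd_solver='randomized'. The dataset and the number of components are assumptions. On such a small, low-dimensional input the two solvers are expected to agree closely, which is not guaranteed in general.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Exact SVD, matching the behaviour of PCA before the solvers were merged.
full = PCA(n_components=2, svd_solver="full").fit(X)

# Randomized SVD, the code path formerly provided by RandomizedPCA.
randomized = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(X)

print(full.explained_variance_ratio_)
print(randomized.explained_variance_ratio_)
```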
Model selection and evaluation
• Added metrics.cluster.fowlkes_mallows_score, the Fowlkes-Mallows index, which measures the
similarity of two clusterings of a set of points. By Arnaud Fouchet and Thierry Guillemot.
• Added metrics.calinski_harabaz_score, which computes the Calinski and Harabaz score to evalu-
ate the resulting clustering of a set of points. By Arnaud Fouchet and Thierry Guillemot.
• Added new cross-validation splitter model_selection.TimeSeriesSplit to handle time series data.
#6586 by YenChen Lin
• The cross-validation iterators are replaced by cross-validation splitters available from sklearn.
model_selection, allowing for nested cross-validation. See Model Selection Enhancements and API
Changes for more information. #4294 by Raghav RV.
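The splitter interface can be sketched as follows, using model_selection.TimeSeriesSplit and model_selection.GroupKFold. The toy arrays and group assignments are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# TimeSeriesSplit: each test fold comes strictly after its training data.
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in ts_splits:
    print("time series:", train_idx, test_idx)

# GroupKFold: samples sharing a group never straddle train and test.
# Note the n_splits constructor argument and the groups argument of split().
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
group_splits = list(GroupKFold(n_splits=2).split(X, y, groups=groups))
for train_idx, test_idx in group_splits:
    print("grouped:", groups[train_idx], groups[test_idx])
```

Because a splitter is constructed without the data and only receives it in split, the same object can be reused for inner and outer loops of a nested cross-validation.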
Enhancements
Bug fixes
new class solves the computational problems of the old class and computes the Variational Bayesian Gaussian
mixture faster than before. #6651 by Wei Xue and Thierry Guillemot.
• The old mixture.GMM is deprecated in favor of the new mixture.GaussianMixture. The new class
computes the Gaussian mixture faster than before, and some computational problems have been solved. #6666
by Wei Xue and Thierry Guillemot.
Model evaluation and meta-estimators
• The sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve
modules have been deprecated and their classes and functions have been reorganized into the sklearn.
model_selection module. See Model Selection Enhancements and API Changes for more information.
#4294 by Raghav RV.
• The grid_scores_ attribute of model_selection.GridSearchCV and model_selection.
RandomizedSearchCV is deprecated in favor of the attribute cv_results_. See Model Selection
Enhancements and API Changes for more information. #6697 by Raghav RV.
• The parameters n_iter or n_folds in old CV splitters are replaced by the new parameter n_splits since
it can provide a consistent and unambiguous interface to represent the number of train-test splits. #7187 by
YenChen Lin.
• classes parameter was renamed to labels in metrics.hamming_loss. #7260 by Sebastián Vanrell.
• The splitter classes LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and
LeavePLabelOut are renamed to model_selection.GroupKFold, model_selection.
GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.
LeavePGroupsOut respectively. Also the parameter labels in the split method of the newly renamed
splitters model_selection.LeaveOneGroupOut and model_selection.LeavePGroupsOut
is renamed to groups. Additionally in model_selection.LeavePGroupsOut, the parameter
n_labels is renamed to n_groups. #6660 by Raghav RV.
• Error and loss names for scoring parameters are now prefixed by 'neg_', such as
neg_mean_squared_error. The unprefixed versions are deprecated and will be removed in version 0.20.
#7261 by Tim Head.
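A short sketch of the 'neg_' scoring names in use. Scorers follow a "greater is better" convention, so error metrics are reported with their sign flipped. The regression data and the Ridge estimator are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Scorers follow a "greater is better" convention, so the error is negated.
neg_mse = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=5)

# Flip the sign to recover the usual mean squared error per fold.
mse = -neg_mse

print(neg_mse)
print(mse)
```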
Code Contributors
Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander Minyushkin, Alexander Rudy, Alexan-
dre Abadie, Alexandre Abraham, Alexandre Gramfort, Alexandre Saint, alexfields, Alvaro Ulloa, alyssaq, Amlan
Kar, Andreas Mueller, andrew giessel, Andrew Jackson, Andrew McCulloh, Andrew Murray, Anish Shah, Arafat,
Archit Sharma, Ariel Rokem, Arnaud Joly, Arnaud Rachez, Arthur Mensch, Ash Hoover, asnt, b0noI, Behzad Tabib-
ian, Bernardo, Bernhard Kratzwald, Bhargav Mangipudi, blakeflei, Boyuan Deng, Brandon Carter, Brett Naul, Brian
McFee, Caio Oliveira, Camilo Lamus, Carol Willing, Cass, CeShine Lee, Charles Truong, Chyi-Kwei Yau, CJ Carey,
codevig, Colin Ni, Dan Shiebler, Daniel, Daniel Hnyk, David Ellis, David Nicholson, David Staub, David Thaler,
David Warshaw, Davide Lasagna, Deborah, definitelyuncertain, Didi Bar-Zev, djipey, dsquareindia, edwinENSAE,
Elias Kuthe, Elvis DOHMATOB, Ethan White, Fabian Pedregosa, Fabio Ticconi, fisache, Florian Wilhelm, Francis,
Francis O’Donovan, Gael Varoquaux, Ganiev Ibraim, ghg, Gilles Louppe, Giorgio Patrini, Giovanni Cherubin, Gio-
vanni Lanzani, Glenn Qian, Gordon Mohr, govin-vatsan, Graham Clenaghan, Greg Reda, Greg Stupp, Guillaume
Lemaitre, Gustav Mörtberg, halwai, Harizo Rajaona, Harry Mavroforakis, hashcode55, hdmetor, Henry Lin, Hob-
son Lane, Hugo Bowne-Anderson, Igor Andriushchenko, Imaculate, Inki Hwang, Isaac Sijaranamual, Ishank Gulati,
Issam Laradji, Iver Jordal, jackmartin, Jacob Schreiber, Jake Vanderplas, James Fiedler, James Routley, Jan Zikes,
Janna Brettingen, jarfa, Jason Laska, jblackburne, jeff levesque, Jeffrey Blackburne, Jeffrey04, Jeremy Hintz, jere-
mynixon, Jeroen, Jessica Yung, Jill-Jênn Vie, Jimmy Jia, Jiyuan Qian, Joel Nothman, johannah, John, John Boersma,
John Kirkham, John Moeller, jonathan.striebel, joncrall, Jordi, Joseph Munoz, Joshua Cook, JPFrancoia, jrfiedler,
JulianKahnert, juliathebrave, kaichogami, KamalakerDadi, Kenneth Lyons, Kevin Wang, kingjr, kjell, Konstantin
Podshumok, Kornel Kielczewski, Krishna Kalyan, krishnakalyan3, Kvle Putnam, Kyle Jackson, Lars Buitinck, ldavid,
LeiG, LeightonZhang, Leland McInnes, Liang-Chi Hsieh, Lilian Besson, lizsz, Loic Esteve, Louis Tiao, Léonie Borne,
Mads Jensen, Maniteja Nandana, Manoj Kumar, Manvendra Singh, Marco, Mario Krell, Mark Bao, Mark Szepieniec,
Martin Madsen, MartinBpr, MaryanMorel, Massil, Matheus, Mathieu Blondel, Mathieu Dubois, Matteo, Matthias Ek-
man, Max Moroz, Michael Scherer, michiaki ariga, Mikhail Korobov, Moussa Taifi, mrandrewandrade, Mridul Seth,
nadya-p, Naoya Kanai, Nate George, Nelle Varoquaux, Nelson Liu, Nick James, NickleDave, Nico, Nicolas Goix,
Nikolay Mayorov, ningchi, nlathia, okbalefthanded, Okhlopkov, Olivier Grisel, Panos Louridas, Paul Strickland, Per-
rine Letellier, pestrickland, Peter Fischer, Pieter, Ping-Yao, Chang, practicalswift, Preston Parry, Qimu Zheng, Rachit
Kansal, Raghav RV, Ralf Gommers, Ramana.S, Rammig, Randy Olson, Rob Alexander, Robert Lutz, Robin Schucker,
Rohan Jain, Ruifeng Zheng, Ryan Yu, Rémy Léone, saihttam, Saiwing Yeung, Sam Shleifer, Samuel St-Jean, Sar-
taj Singh, Sasank Chilamkurthy, saurabh.bansod, Scott Andrews, Scott Lowe, seales, Sebastian Raschka, Sebastian
Saeger, Sebastián Vanrell, Sergei Lebedev, shagun Sodhani, shanmuga cv, Shashank Shekhar, shawpan, shengxid-
uan, Shota, shuckle16, Skipper Seabold, sklearn-ci, SmedbergM, srvanrell, Sébastien Lerique, Taranjeet, themrmax,
Thierry, Thierry Guillemot, Thomas, Thomas Hallock, Thomas Moreau, Tim Head, tKammy, toastedcornflakes, Tom,
TomDLT, Toshihiro Kamishima, tracer0tong, Trent Hauck, trevorstephens, Tue Vo, Varun, Varun Jewalikar, Viach-
eslav, Vighnesh Birodkar, Vikram, Villu Ruusmann, Vinayak Mehta, walter, waterponey, Wenhua Yang, Wenjian
Huang, Will Welch, wyseguy7, xyguo, yanlend, Yaroslav Halchenko, yelite, Yen, YenChenLin, Yichuan Liu, Yoav
Ram, Yoshiki, Zheng RuiFeng, zivori, Óscar Nájera
Changelog
Bug fixes
• Upgrade vendored joblib to version 0.9.4 that fixes an important bug in joblib.Parallel that can silently
yield wrong results when working on datasets larger than 1MB: https://github.com/joblib/joblib/blob/0.9.4/
CHANGES.rst
• Fixed reading of Bunch pickles generated with scikit-learn version <= 0.16. This can affect users who have
already downloaded a dataset with scikit-learn 0.16 and are loading it with scikit-learn 0.17. See #6196 for how
this affected datasets.fetch_20newsgroups. By Loic Esteve.
• Fixed a bug that prevented using ROC AUC score to perform grid search on several CPU / cores on large arrays.
See #6147 By Olivier Grisel.
• Fixed a bug that prevented the presort parameter from being set properly in ensemble.
GradientBoostingRegressor. See #5857 By Andrew McCulloh.
• Fixed a joblib error when evaluating the perplexity of a decomposition.
LatentDirichletAllocation model. See #6258 By Chyi-Kwei Yau.
November 5, 2015
Changelog
New features
• All the Scaler classes but preprocessing.RobustScaler can be fitted online by calling partial_fit. By
Giorgio Patrini.
• The new class ensemble.VotingClassifier implements a “majority rule” / “soft voting” ensemble
classifier to combine estimators for classification. By Sebastian Raschka.
• The new class preprocessing.RobustScaler provides an alternative to preprocessing.
StandardScaler for feature-wise centering and range normalization that is robust to outliers. By Thomas
Unterthiner.
• The new class preprocessing.MaxAbsScaler provides an alternative to preprocessing.
MinMaxScaler for feature-wise range normalization when the data is already centered or sparse. By Thomas
Unterthiner.
• The new class preprocessing.FunctionTransformer turns a Python function into a Pipeline-
compatible transformer object. By Joe Jevnik.
• The new classes cross_validation.LabelKFold and cross_validation.
LabelShuffleSplit generate train-test folds, respectively similar to cross_validation.KFold and
cross_validation.ShuffleSplit, except that the folds are conditioned on a label array. By Brian
McFee, Jean Kossaifi and Gilles Louppe.
• decomposition.LatentDirichletAllocation implements the Latent Dirichlet Allocation topic
model with online variational inference. By Chyi-Kwei Yau, with code based on an implementation by Matt
Hoffman. (#3659)
• The new solver sag implements a Stochastic Average Gradient descent and is available in both
linear_model.LogisticRegression and linear_model.Ridge. This solver is very efficient for
large datasets. By Danny Sullivan and Tom Dupre la Tour. (#4738)
• The new solver cd implements a Coordinate Descent in decomposition.NMF. The previous solver, based on
Projected Gradient, is still available by setting the new parameter solver to pg, but is deprecated and will be removed
in 0.19, along with decomposition.ProjectedGradientNMF and parameters sparseness, eta,
beta and nls_max_iter. New parameters alpha and l1_ratio control L1 and L2 regularization, and
shuffle adds a shuffling step in the cd solver. By Tom Dupre la Tour and Mathieu Blondel.
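Online fitting of the scalers via partial_fit can be sketched as below, using preprocessing.StandardScaler. The synthetic data and the chunk size of 100 rows are assumptions for the example. The statistics accumulated chunk by chunk are expected to match a single fit on the full array up to floating point error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Reference: fit on the whole array at once.
batch = StandardScaler().fit(X)

# Online: feed the data in chunks of 100 rows via partial_fit.
online = StandardScaler()
for start in range(0, X.shape[0], 100):
    online.partial_fit(X[start:start + 100])

print(batch.mean_, online.mean_)
print(batch.var_, online.var_)
```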
Enhancements
• manifold.TSNE now supports approximate optimization via the Barnes-Hut method, leading to much faster
fitting. By Christopher Erick Moody. (#4025)
• cluster.mean_shift_.MeanShift now supports parallel execution, as implemented in the
mean_shift function. By Martino Sorbaro.
• naive_bayes.GaussianNB now supports fitting with sample_weight. By Jan Hendrik Metzen.
• dummy.DummyClassifier now supports a prior fitting strategy. By Arnaud Joly.
• Added a fit_predict method for mixture.GMM and subclasses. By Cory Lorenz.
• Added the metrics.label_ranking_loss metric. By Arnaud Joly.
• Added the metrics.cohen_kappa_score metric.
• Added a warm_start constructor parameter to the bagging ensemble models to increase the size of the en-
semble. By Tim Head.
• Added option to use multi-output regression metrics without averaging. By Konstantin Shmelkov and Michael
Eickenberg.
• Added stratify option to cross_validation.train_test_split for stratified splitting. By
Miroslav Batchkarov.
• The tree.export_graphviz function now supports aesthetic improvements for tree.
DecisionTreeClassifier and tree.DecisionTreeRegressor, including options for coloring
nodes by their majority class or impurity, showing variable names, and using node proportions instead of raw
sample counts. By Trevor Stephens.
• Improved speed of newton-cg solver in linear_model.LogisticRegression, by avoiding loss com-
putation. By Mathieu Blondel and Tom Dupre la Tour.
• The class_weight="auto" heuristic in classifiers supporting class_weight was deprecated and re-
placed by the class_weight="balanced" option, which has a simpler formula and interpretation. By
Hanna Wallach and Andreas Müller.
• Add class_weight parameter to automatically weight samples by class frequency for linear_model.
PassiveAggressiveClassifier. By Trevor Stephens.
• Added backlinks from the API reference pages to the user guide. By Andreas Müller.
• The labels parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score,
sklearn.metrics.recall_score and sklearn.metrics.precision_score has been ex-
tended. It is now possible to ignore one or more labels, such as where a multiclass problem has a majority
class to ignore. By Joel Nothman.
• Add sample_weight support to linear_model.RidgeClassifier. By Trevor Stephens.
• Provide an option for sparse output from sklearn.metrics.pairwise.cosine_similarity. By
Jaidev Deshpande.
• Add minmax_scale to provide a function interface for MinMaxScaler. By Thomas Unterthiner.
• dump_svmlight_file now handles multi-label datasets. By Chih-Wei Chang.
• RCV1 dataset loader (sklearn.datasets.fetch_rcv1). By Tom Dupre la Tour.
• The “Wisconsin Breast Cancer” classical two-class classification dataset is now included in scikit-learn,
available with sklearn.datasets.load_breast_cancer.
• Upgraded to joblib 0.9.3 to benefit from the new automatic batching of short tasks. This makes it possible for
scikit-learn to benefit from parallelism when many very short tasks are executed in parallel, for instance by the
grid_search.GridSearchCV meta-estimator with n_jobs > 1 used with a large grid of parameters
on a small dataset. By Vlad Niculae, Olivier Grisel and Loic Esteve.
• For more details about changes in joblib 0.9.3 see the release notes: https://github.com/joblib/joblib/blob/master/
CHANGES.rst#release-093
• Improved speed (3 times faster per iteration) of decomposition.DictLearning with coordinate descent
method from linear_model.Lasso. By Arthur Mensch.
• Parallel processing (threaded) for queries of nearest neighbors (using the ball-tree) by Nikolay Mayorov.
• Allow datasets.make_multilabel_classification to output a sparse y. By Kashif Rasul.
• cluster.DBSCAN now accepts a sparse matrix of precomputed distances, allowing memory-efficient distance
precomputation. By Joel Nothman.
• tree.DecisionTreeClassifier now exposes an apply method for retrieving the leaf indices samples
are predicted as. By Daniel Galvez and Gilles Louppe.
• Speed up decision tree regressors, random forest regressors, extra trees regressors and gradient boosting estima-
tors by computing a proxy of the impurity improvement during the tree growth. The proxy quantity is such that
the split that maximizes this value also maximizes the impurity improvement. By Arnaud Joly, Jacob Schreiber
and Gilles Louppe.
• Speed up tree based methods by reducing the number of computations needed when computing the impurity
measure taking into account linear relationship of the computed statistics. The effect is particularly visible with
extra trees and on datasets with categorical or sparse features. By Arnaud Joly.
• ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier
now expose an apply method for retrieving the leaf indices each sample ends up in for each tree. By
Jacob Schreiber.
• Add sample_weight support to linear_model.LinearRegression. By Sonny Hu. (#4881)
• Add n_iter_without_progress to manifold.TSNE to control the stopping criterion. By Santi Vil-
lalba. (#5186)
• Added optional parameter random_state in linear_model.Ridge, to set the seed of the pseudo random
generator used in the sag solver. By Tom Dupre la Tour.
• Added optional parameter warm_start in linear_model.LogisticRegression. If set to True, the
solvers lbfgs, newton-cg and sag will be initialized with the coefficients computed in the previous fit. By
Tom Dupre la Tour.
• Added sample_weight support to linear_model.LogisticRegression for the lbfgs,
newton-cg, and sag solvers. By Valentin Stolbunov. Support added to the liblinear solver. By Manoj
Kumar.
• Added optional parameter presort to ensemble.GradientBoostingRegressor and ensemble.
GradientBoostingClassifier, keeping default behavior the same. This allows gradient boosters to
turn off presorting when building deep trees or using sparse data. By Jacob Schreiber.
• Altered metrics.roc_curve to drop unnecessary thresholds by default. By Graham Clenaghan.
• Added feature_selection.SelectFromModel meta-transformer, which can be used along with
estimators that have a coef_ or feature_importances_ attribute to select important features of the input data. By
Maheshakya Wijewardena, Joel Nothman and Manoj Kumar.
• Added metrics.pairwise.laplacian_kernel. By Clyde Fare.
• covariance.GraphLasso allows separate control of the convergence criterion for the Elastic-Net subprob-
lem via the enet_tol parameter.
• Improved verbosity in decomposition.DictionaryLearning.
• ensemble.RandomForestClassifier and ensemble.RandomForestRegressor no longer ex-
plicitly store the samples used in bagging, resulting in a much reduced memory footprint for storing random
forest models.
• Added positive option to linear_model.Lars and linear_model.lars_path to force coeffi-
cients to be positive. (#5131)
• Added the X_norm_squared parameter to metrics.pairwise.euclidean_distances to provide
precomputed squared norms for X.
• Added the fit_predict method to pipeline.Pipeline.
• Added the preprocessing.minmax_scale function.
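Two of the enhancements above, the stratify option of train_test_split and the class_weight="balanced" heuristic, are sketched below. The import path sklearn.model_selection is the one introduced in 0.18, where the entry above refers to cross_validation. The toy labels are an assumption, and the manual formula n_samples / (n_classes * np.bincount(y)) is shown next to utils.class_weight.compute_class_weight for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels (an assumption for this sketch): 80 zeros, 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 80/20 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# "balanced" weights: n_samples / (n_classes * np.bincount(y)).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

print(np.bincount(y_train), np.bincount(y_test))
print(weights, manual)
```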
Bug fixes
Code Contributors
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando
Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly,
Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee,
Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat,
Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard,
Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry
Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric
Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles
Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Im-
manuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen,
Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph,
Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler
Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Ku-
mar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada,
Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson,
Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli
Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert
Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka,
Sebastian Saeger, Shivan Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De
Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head,
Timothy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta,
Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro,
Zichen Wang
Changelog
Bug fixes
Highlights
• Speed improvements (notably in cluster.DBSCAN), reduced memory requirements, bug fixes and better
default settings.
• Multinomial logistic regression and a path algorithm in linear_model.LogisticRegressionCV.
• Out-of-core learning of PCA via decomposition.IncrementalPCA.
• Probability calibration of classifiers using calibration.CalibratedClassifierCV.
• cluster.Birch clustering method for large-scale datasets.
• Scalable approximate nearest neighbors search with locality-sensitive hashing forests in
neighbors.LSHForest.
• Improved error messages and better validation when using malformed input data.
• More robust integration with pandas dataframes.
Changelog
New features
• The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors
search. By Maheshakya Wijewardena.
• Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression which is
much faster for large sample sizes than svm.SVR with linear kernel. By Fabian Pedregosa and Qiang Luo.
• Incremental fit for GaussianNB.
• Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By
Arnaud Joly.
Enhancements
• Make the stopping criterion for mixture.GMM, mixture.DPGMM and mixture.VBGMM less dependent
on the number of samples by thresholding the average log-likelihood change instead of its sum over all samples.
By Hervé Bredin.
• The outcome of manifold.spectral_embedding was made deterministic by flipping the sign of
eigenvectors. By Hasil Sharma.
• Significant performance and memory usage improvements in preprocessing.PolynomialFeatures.
By Eric Martin.
• Numerical stability improvements for preprocessing.StandardScaler and
preprocessing.scale. By Nicolas Goix.
• svm.SVC fitted on sparse input now implements decision_function. By Rob Zinkov and Andreas
Müller.
• cross_validation.train_test_split now preserves the input type, instead of converting to numpy
arrays.
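The stopping-criterion change for the mixture models listed above can be illustrated with a small sketch. It is not scikit-learn's implementation; the function names, the tolerance value and the per-sample log-likelihoods are all hypothetical. It only shows why a threshold on the summed log-likelihood change depends on the number of samples, while a threshold on the average change does not.

```python
# Illustrative sketch only; not the code used by mixture.GMM.
def converged_on_sum(prev_ll, curr_ll, tol):
    """Stop when the change in the summed log-likelihood falls below tol."""
    return abs(sum(curr_ll) - sum(prev_ll)) < tol


def converged_on_mean(prev_ll, curr_ll, tol):
    """Stop when the change in the average log-likelihood falls below tol."""
    n = len(curr_ll)
    return abs(sum(curr_ll) / n - sum(prev_ll) / n) < tol


# Hypothetical per-sample log-likelihoods at two consecutive iterations:
# every sample improves by the same small amount (0.0005).
small_prev = [-1.0] * 10
small_curr = [-0.9995] * 10
large_prev = small_prev * 100   # same data repeated: 1000 samples
large_curr = small_curr * 100

tol = 0.01  # assumed tolerance, chosen for the illustration

# The sum-based rule stops on the small dataset but not on the large one,
# even though each sample improved by exactly the same amount.
print(converged_on_sum(small_prev, small_curr, tol))
print(converged_on_sum(large_prev, large_curr, tol))

# The mean-based rule gives the same answer for both dataset sizes.
print(converged_on_mean(small_prev, small_curr, tol))
print(converged_on_mean(large_prev, large_curr, tol))
```

With the averaged criterion, one tolerance value can be reused across datasets of different sizes.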
Documentation improvements
Bug fixes
• Various fixes to the Gaussian processes subpackage by Vincent Dubourg and Jan Hendrik Metzen.
• Calling partial_fit with class_weight=='auto' throws an appropriate error message and suggests
a workaround. By Danny Sullivan.
• RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of
gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-
validated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
• Pipeline objects now delegate the classes_ attribute to the underlying estimator. This allows, for instance,
bagging of a pipeline object. By Arnaud Joly.
• neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan.
It was using the mean before. By Manoj Kumar.
• Fix numerical stability issues in linear_model.SGDClassifier and
linear_model.SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for
large l2 regularization and large learning rate values). By Olivier Grisel.
• When compute_full_tree was set to “auto”, the full tree was built when n_clusters was high and building stopped early when
n_clusters was low, whereas the behavior should be the reverse in cluster.AgglomerativeClustering (and
friends). This has been fixed. By Manoj Kumar.
• Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was
centered around one. It has been changed to be centered around the origin. By Manoj Kumar.
• Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using
connectivity constraints. By Cathy Deng.
• Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and
sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
• Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the
multi-label setting. By Andreas Müller.
• Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors,
kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors
and family, when the query data is not the same as fit data. By Manoj Kumar.
• Fix log-density calculation in mixture.GMM with tied covariance. By Will Dawson.
• Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By
Andrew Tulloch.
• Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance
weighting and having identical data points. By Garrett-R.
• Fixed round-off errors with non positive-definite covariance matrices in GMM. By Alexis Mignon.
• Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna
Wallach.
• Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying
on the boundary for algorithm='brute'. By Yan Yi.
• Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and
decision_function. By Artem Sobolev.
• Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets
(secondary method). By Andreas Müller and Michael Bommarito.
• GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into
arrays any more, allowing DataFrame specific operations in custom estimators.
• multiclass.fit_ovr, multiclass.predict_ovr, predict_proba_ovr,
multiclass.fit_ovo, multiclass.predict_ovo, multiclass.fit_ecoc and
multiclass.predict_ecoc are deprecated. Use the underlying estimators instead.
• Nearest neighbors estimators used to take arbitrary keyword arguments and pass these to their distance metric.
This will no longer be supported in scikit-learn 0.18; use the metric_params argument instead.
• n_jobs parameter of the fit method shifted to the constructor of the LinearRegression class.
• The predict_proba method of multiclass.OneVsRestClassifier now returns two probabilities
per sample in the multiclass case; this is consistent with other estimators and with the method’s
documentation, but previous versions accidentally returned only the positive probability. Fixed by Will Lamond and Lars
Buitinck.
• Change default value of precompute in ElasticNet and Lasso to False. Setting precompute to “auto” was
found to be slower when n_samples > n_features, since computing the Gram matrix is
expensive and outweighs the benefit of fitting the Gram for just one alpha. precompute="auto" is now
deprecated and will be removed in 0.18. By Manoj Kumar.
• Expose positive option in linear_model.enet_path and linear_model.lasso_path which
constrains coefficients to be positive. By Manoj Kumar.
• Users should now supply an explicit average parameter to sklearn.metrics.f1_score,
sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and
sklearn.metrics.precision_score when performing multiclass or multilabel (i.e. not binary) classification. By Joel
Nothman.
• scoring parameter for cross validation now accepts ‘f1_micro’, ‘f1_macro’ or ‘f1_weighted’. ‘f1’ is now for
binary classification only. Similar changes apply to ‘precision’ and ‘recall’. By Joel Nothman.
• The fit_intercept, normalize and return_models parameters in linear_model.enet_path
and linear_model.lasso_path have been removed. They had been deprecated since 0.14.
• From now onwards, all estimators will uniformly raise NotFittedError
(utils.validation.NotFittedError) when any of the predict-like methods are called before the model is fit. By Raghav
RV.
• Input data validation was refactored for more consistent input validation. The check_arrays function was
replaced by check_array and check_X_y. By Andreas Müller.
• Allow X=None in the methods radius_neighbors, kneighbors, kneighbors_graph and
radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family. If set to
None, then for every sample this avoids setting the sample itself as the first nearest neighbor. By Manoj Kumar.
• Add parameter include_self in neighbors.kneighbors_graph and
neighbors.radius_neighbors_graph which has to be explicitly set by the user. If set to True, then the
sample itself is considered as the first nearest neighbor.
• thresh parameter is deprecated in favor of new tol parameter in GMM, DPGMM and VBGMM. See Enhancements
section for details. By Hervé Bredin.
• Estimators will treat input with dtype object as numeric when possible. By Andreas Müller.
• Estimators now raise ValueError consistently when fitted on empty data (less than 1 sample or less than 1 feature
for 2D input). By Olivier Grisel.
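The requirement listed above to pass an explicit average parameter for multiclass F-scores matters because the averaging modes give different numbers on the same predictions. The following pure-Python sketch is not scikit-learn's code; the helper names and the toy labels are hypothetical, and only the 'micro', 'macro' and 'weighted' modes named in the changelog are shown.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the truth but was missed
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_averaged(y_true, y_pred, average):
    """Multiclass F1 under one of three averaging modes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = per_class_counts(y_true, y_pred)
    if average == "micro":
        # Pool the counts over all classes, then compute a single F1.
        return f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    scores = [f1_from_counts(tp[c], fp[c], fn[c]) for c in labels]
    if average == "macro":
        # Unweighted mean of the per-class scores.
        return sum(scores) / len(scores)
    if average == "weighted":
        # Mean of the per-class scores weighted by class support.
        support = Counter(y_true)
        return sum(s * support[c] for s, c in zip(scores, labels)) / len(y_true)
    raise ValueError("average must be 'micro', 'macro' or 'weighted'")


# Hypothetical three-class example with one misclassified sample.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]

for mode in ("micro", "macro", "weighted"):
    print(mode, round(f1_averaged(y_true, y_pred, mode), 4))
```

Because the three modes disagree even on this small example, a silent default would hide a modelling decision, which is the motivation for making the choice explicit.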
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander
Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh,
Andrew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu,
Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng,
Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens
Brunner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J.
Sutherland, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R,
Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil
Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque,
isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff
Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle
Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas
Michelbacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz
Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael
Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas,
Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap
Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert
Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan
Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas
Unterthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta,
Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan
Meng, Yan Yi, Yu-Chin
September 4, 2014
Bug fixes
• Fixed handling of the p parameter of the Minkowski distance that was previously ignored in nearest neighbors
models. By Nikolay Mayorov.
• Fixed duplicated alphas in linear_model.LassoLars with early stopping on 32 bit Python. By Olivier
Grisel and Fabian Pedregosa.
• Fixed the build under Windows when scikit-learn is built with MSVC while NumPy is built with MinGW. By
Olivier Grisel and Federico Vaggi.
• Fixed an array index overflow bug in the coordinate descent solver. By Gael Varoquaux.
• Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux.
• Removed unnecessary data copy in cluster.KMeans. By Gael Varoquaux.
• Explicitly close open files to avoid ResourceWarnings under Python 3. By Calvin Giles.
August 1, 2014
Bug fixes
Highlights
• Added dimensionality reduction with manifold.TSNE which can be used to visualize high-dimensional data.
Changelog
New features
Enhancements
• Changed the internal storage of decision trees to use a struct array. This fixed some small bugs, while improving
code and providing a small speed gain. By Joel Nothman.
• Reduce memory usage and overhead when fitting and predicting with forests of randomized trees in parallel
with n_jobs != 1 by leveraging new threading backend of joblib 0.8 and releasing the GIL in the tree fitting
Cython code. By Olivier Grisel and Gilles Louppe.
• Speed improvement of the sklearn.ensemble.gradient_boosting module. By Gilles Louppe and
Peter Prettenhofer.
• Various enhancements to the sklearn.ensemble.gradient_boosting module: a warm_start
argument to fit additional trees, a max_leaf_nodes argument to fit GBM style trees, a monitor fit argument
to inspect the estimator during training, and refactoring of the verbose code. By Peter Prettenhofer.
• Faster sklearn.ensemble.ExtraTrees by caching feature values. By Arnaud Joly.
• Faster depth-based tree building for decision trees, random forests, extra trees and gradient tree
boosting (with depth-based growing strategy), by no longer attempting to split on features found to be constant in the sample
subset. By Arnaud Joly.
• Add min_weight_fraction_leaf pre-pruning parameter to tree-based methods: the minimum weighted
fraction of the input samples required to be at a leaf node. By Noel Dawe.
• Added metrics.pairwise_distances_argmin_min, by Philippe Gervais.
• Added predict method to cluster.AffinityPropagation and cluster.MeanShift, by Mathieu
Blondel.
• Vector and matrix multiplications have been optimised throughout the library by Denis Engemann and
Alexandre Gramfort. In particular, they should take less memory with older NumPy versions (prior to 1.7.2).
• Precision-recall and ROC examples now use train_test_split, and have more explanation of why these metrics
are useful. By Kyle Kastner.
• The training algorithm for decomposition.NMF is faster for sparse matrices and has much lower memory
complexity, meaning it will scale up gracefully to large datasets. By Lars Buitinck.
• Added svd_method option with default value “randomized” to decomposition.FactorAnalysis to
save memory and significantly speed up computation. By Denis Engemann and Alexandre Gramfort.
• Changed cross_validation.StratifiedKFold to try and preserve as much of the original ordering of
samples as possible so as not to hide overfitting on datasets with a non-negligible level of sample dependency.
By Daniel Nouri and Olivier Grisel.
• Add multi-output support to gaussian_process.GaussianProcess by John Novak.
• Support for precomputed distance matrices in nearest neighbor estimators by Robert Layton and Joel Nothman.
• Norm computations optimized for NumPy 1.6 and later versions by Lars Buitinck. In particular, the k-means
algorithm no longer needs a temporary data structure the size of its input.
• dummy.DummyClassifier can now be used to predict a constant output value. By Manoj Kumar.
• dummy.DummyRegressor now has a strategy parameter which allows predicting the mean, the median of the
training set or a constant output value. By Maheshakya Wijewardena.
• Multi-label classification output in multilabel indicator format is now supported by
metrics.roc_auc_score and metrics.average_precision_score. By Arnaud Joly.
• Significant performance improvements (more than 100x speedup for large problems) in
isotonic.IsotonicRegression by Andrew Tulloch.
• Speed and memory usage improvements to the SGD algorithm for linear models: it now uses threads, not
separate processes, when n_jobs>1. By Lars Buitinck.
• Grid search and cross validation allow NaNs in the input arrays so that preprocessors such as
preprocessing.Imputer can be trained within the cross validation loop, avoiding potentially skewed
results.
• Ridge regression can now deal with sample weights in feature space (only sample space until then). By Michael
Eickenberg. Both solutions are provided by the Cholesky solver.
• Several classification and regression metrics now support weighted samples with the new
sample_weight argument: metrics.accuracy_score, metrics.zero_one_loss,
metrics.precision_score, metrics.average_precision_score,
metrics.f1_score, metrics.fbeta_score, metrics.recall_score, metrics.roc_auc_score,
metrics.explained_variance_score, metrics.mean_squared_error,
metrics.mean_absolute_error, metrics.r2_score. By Noel Dawe.
• Speed up of the sample generator datasets.make_multilabel_classification. By Joel Nothman.
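As an illustration of what the sample_weight argument mentioned above means for a metric, here is a minimal pure-Python sketch of a weighted accuracy. It is not the library's implementation; the function name and the data are hypothetical.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of total sample weight carried by correctly predicted samples."""
    if sample_weight is None:
        # With no weights, every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


# Hypothetical predictions: the third sample is wrong and carries double weight.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
weights = [1.0, 1.0, 2.0, 1.0]

print(weighted_accuracy(y_true, y_pred))           # 3 of 4 samples correct
print(weighted_accuracy(y_true, y_pred, weights))  # 3 of 5 weight units correct
```

Errors on heavily weighted samples lower the score more than errors on lightly weighted ones.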
Documentation improvements
• The Working With Text Data tutorial has now been worked into the main documentation’s tutorial section.
It includes exercises and skeletons for tutorial presentation. Original tutorial created by several authors including
Olivier Grisel, Lars Buitinck and many others. Tutorial integration into the scikit-learn documentation by Jaques
Grobler.
• Added Computational Performance documentation. Discussion and examples of prediction latency / throughput
and different factors that have influence over speed. Additional tips for building faster models and choosing a
relevant compromise between speed and predictive power. By Eustache Diemert.
Bug fixes
People
• 6 Daniel Nouri
• 6 Chen Liu
• 6 Michael Eickenberg
• 6 ugurthemaster
• 5 Aaron Schumacher
• 5 Baptiste Lagarde
• 5 Rajat Khanduja
• 5 Robert McGibbon
• 5 Sergio Pascual
• 4 Alexis Metaireau
• 4 Ignacio Rossi
• 4 Virgile Fritsch
• 4 Sebastian Säger
• 4 Ilambharathi Kanniah
• 4 sdenton4
• 4 Robert Layton
• 4 Alyssa
• 4 Amos Waterland
• 3 Andrew Tulloch
• 3 murad
• 3 Steven Maude
• 3 Karol Pysniak
• 3 Jacques Kvam
• 3 cgohlke
• 3 cjlin
• 3 Michael Becker
• 3 hamzeh
• 3 Eric Jacobsen
• 3 john collins
• 3 kaushik94
• 3 Erwin Marsi
• 2 csytracy
• 2 LK
• 2 Vlad Niculae
• 2 Laurent Direr
• 2 Erik Shilts
• 2 Raul Garreta
• 2 Yoshiki Vázquez Baeza
• 2 Yung Siang Liau
• 2 abhishek thakur
• 2 James Yu
• 2 Rohit Sivaprasad
• 2 Roland Szabo
• 2 amormachine
• 2 Alexis Mignon
• 2 Oscar Carlsson
• 2 Nantas Nardelli
• 2 jess010
• 2 kowalski87
• 2 Andrew Clegg
• 2 Federico Vaggi
• 2 Simon Frid
• 2 Félix-Antoine Fortin
• 1 Ralf Gommers
• 1 t-aft
• 1 Ronan Amicel
• 1 Rupesh Kumar Srivastava
• 1 Ryan Wang
• 1 Samuel Charron
• 1 Samuel St-Jean
• 1 Fabian Pedregosa
• 1 Skipper Seabold
• 1 Stefan Walk
• 1 Stefan van der Walt
• 1 Stephan Hoyer
• 1 Allen Riddell
• 1 Valentin Haenel
• 1 Vijay Ramesh
• 1 Will Myers
• 1 Yaroslav Halchenko
• 1 Yoni Ben-Meshulam
• 1 Yury V. Zaytsev
• 1 adrinjalali
• 1 ai8rahim
• 1 alemagnani
• 1 alex
• 1 benjamin wilson
• 1 chalmerlowe
• 1 dzikie drożdże
• 1 jamestwebber
• 1 matrixorz
• 1 popo
• 1 samuela
• 1 François Boulogne
• 1 Alexander Measure
• 1 Ethan White
• 1 Guilherme Trein
• 1 Hendrik Heuer
• 1 IvicaJovic
• 1 Jan Hendrik Metzen
• 1 Jean Michel Rouly
• 1 Eduardo Ariño de la Rubia
• 1 Jelle Zijlstra
• 1 Eddy L O Jansson
• 1 Denis
• 1 John
• 1 John Schmidt
• 1 Jorge Cañardo Alastuey
• 1 Joseph Perla
• 1 Joshua Vredevoogd
• 1 José Ricardo
• 1 Julien Miotte
• 1 Kemal Eren
• 1 Kenta Sato
• 1 David Cournapeau
• 1 Kyle Kelley
• 1 Daniele Medri
• 1 Laurent Luce
• 1 Laurent Pierron
• 1 Luis Pedro Coelho
• 1 DanielWeitzenfeld
• 1 Craig Thompson
• 1 Chyi-Kwei Yau
• 1 Matthew Brett
• 1 Matthias Feurer
• 1 Max Linke
• 1 Chris Filo Gorgolewski
• 1 Charles Earl
• 1 Michael Hanke
• 1 Michele Orrù
• 1 Bryan Lunt
• 1 Brian Kearns
• 1 Paul Butler
• 1 Paweł Mandera
• 1 Peter
• 1 Andrew Ash
• 1 Pietro Zambelli
• 1 staubda
August 7, 2013
Changelog
• Missing values with sparse and dense matrices can be imputed with the transformer
preprocessing.Imputer by Nicolas Trésegnie.
• The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction
and lower memory consumption in all tree-based estimators. By Gilles Louppe.
• Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and
Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
• Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for
randomized hyperparameter optimization. By Andreas Müller.
• Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and
sklearn.cluster.bicluster.SpectralBiclustering), data generation methods
(sklearn.datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring
metrics (sklearn.metrics.consensus_score). By Kemal Eren.
• Added Restricted Boltzmann Machines (neural_network.BernoulliRBM). By Yann Dauphin.
• Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass
under Python 3.3.
• Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu
Blondel.
• Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical
significance). By Norbert Crombach and Mathieu Blondel.
• Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the
documentation. See Choosing the right estimator. By Jaques Grobler.
• grid_search.GridSearchCV and cross_validation.cross_val_score now support the use
of advanced scoring functions such as area under the ROC curve and f-beta scores. See The scoring parameter:
defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from
sklearn.metrics as score_func is deprecated.
• Multi-label classification output is now supported by metrics.accuracy_score,
metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score,
metrics.classification_report, metrics.precision_score and metrics.recall_score by
Arnaud Joly.
• Two new metrics metrics.hamming_loss and metrics.jaccard_similarity_score are added
with multi-label support by Arnaud Joly.
• Speed and memory usage improvements in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
• The min_df parameter in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to
avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A
value of at least 2 is still recommended for practical use.
• svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now
have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained
using these estimators can be made much more compact.
• linear_model.SGDClassifier now produces multiclass probability estimates when trained under log
loss or modified Huber loss.
• Hyperlinks to documentation in example code on the website by Martin Luessi.
• Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default
feature_range settings. By Andreas Müller.
• max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all
derived ensemble estimators now supports percentage values. By Gilles Louppe.
• Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
• metrics.accuracy_score has an option normalize to return the fraction or the number of correctly
classified samples. By Arnaud Joly.
• Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars
Buitinck.
• A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
• Feature selectors now share a mixin providing consistent transform, inverse_transform and
get_support methods. By Joel Nothman.
• A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally
be pickled. By Joel Nothman.
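For readers unfamiliar with the log loss (cross-entropy loss) added in this release, the quantity can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code of metrics.log_loss: the function name, the clipping constant eps and the example probabilities are hypothetical, and labels are assumed to be integer column indices into each probability row.

```python
import math


def mean_log_loss(y_true, probas, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true holds integer class indices; probas holds one row of class
    probabilities per sample. Probabilities are clipped to [eps, 1 - eps]
    so that a confident wrong prediction yields a large finite loss.
    """
    total = 0.0
    for label, row in zip(y_true, probas):
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical binary problem: two samples, two classes.
y_true = [0, 1]
probas = [[0.9, 0.1],
          [0.2, 0.8]]

print(mean_log_loss(y_true, probas))
```

Lower values are better: the loss approaches zero as the probability assigned to the true class approaches one.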
People
• 7 Hrishikesh Huilgolkar
• 6 Kyle Kastner
• 6 Martin Luessi
• 6 Rob Speer
• 5 Federico Vaggi
• 5 Raul Garreta
• 5 Rob Zinkov
• 4 Ken Geis
• 3 A. Flaxman
• 3 Denton Cockburn
• 3 Dougal Sutherland
• 3 Ian Ozsvald
• 3 Johannes Schönberger
• 3 Robert McGibbon
• 3 Roman Sinayev
• 3 Szabo Roland
• 2 Diego Molla
• 2 Imran Haque
• 2 Jochen Wersdörfer
• 2 Sergey Karayev
• 2 Yannick Schwartz
• 2 jamestwebber
• 1 Abhijeet Kolhe
• 1 Alexander Fabisch
• 1 Bastiaan van den Berg
• 1 Benjamin Peterson
• 1 Daniel Velkov
• 1 Fazlul Shahriar
• 1 Felix Brockherde
• 1 Félix-Antoine Fortin
• 1 Harikrishnan S
• 1 Jack Hale
• 1 JakeMick
• 1 James McDermott
• 1 John Benediktsson
• 1 John Zwinck
• 1 Joshua Vredevoogd
• 1 Justin Pati
• 1 Kevin Hughes
• 1 Kyle Kelley
• 1 Matthias Ekman
• 1 Miroslav Shubernetskiy
• 1 Naoki Orii
• 1 Norbert Crombach
• 1 Rafael Cunha de Almeida
• 1 Rolando Espinoza La fuente
• 1 Seamus Abshere
• 1 Sergey Feldman
• 1 Sergio Medina
• 1 Stefano Lattarini
• 1 Steve Koch
• 1 Sturla Molden
• 1 Thomas Jarosch
• 1 Yaroslav Halchenko
Changelog
People
Changelog
• metrics.zero_one_loss (formerly metrics.zero_one) now has option for normalized output that
reports the fraction of misclassifications, rather than the raw number of misclassifications. By Kyle Beauchamp.
• tree.DecisionTreeClassifier and all derived ensemble models now support sample weighting, by
Noel Dawe and Gilles Louppe.
• Speedup improvement when using bootstrap samples in forests of randomized trees, by Peter Prettenhofer and
Gilles Louppe.
• Partial dependence plots for Gradient Tree Boosting in
ensemble.partial_dependence.partial_dependence by Peter Prettenhofer. See Partial Dependence Plots for an example.
• The table of contents on the website has now been made expandable by Jaques Grobler.
• feature_selection.SelectPercentile now breaks ties deterministically instead of returning all
equally ranked features.
• feature_selection.SelectKBest and feature_selection.SelectPercentile are more
numerically stable since they use scores, rather than p-values, to rank results. This means that they might
sometimes select different features than they did previously.
• Ridge regression and ridge classification fitting with sparse_cg solver no longer has quadratic memory
complexity, by Lars Buitinck and Fabian Pedregosa.
• Ridge regression and ridge classification now support a new fast solver called lsqr, by Mathieu Blondel.
• Speed up of metrics.precision_recall_curve by Conrad Lee.
• Added support for reading/writing svmlight files with pairwise preference attribute (qid in svmlight file format)
in datasets.dump_svmlight_file and datasets.load_svmlight_file by Fabian Pedregosa.
• Faster and more robust metrics.confusion_matrix and Clustering performance evaluation by Wei Li.
• cross_validation.cross_val_score now works with precomputed kernels and affinity matrices, by
Andreas Müller.
• LARS algorithm made more numerically stable with heuristics to drop regressors too correlated as well as to
stop the path when numerical noise becomes predominant, by Gael Varoquaux.
• Faster implementation of metrics.precision_recall_curve by Conrad Lee.
• New kernel metrics.chi2_kernel by Andreas Müller, often used in computer vision applications.
• Longstanding bug in naive_bayes.BernoulliNB fixed by Shaun Jackman.