
Commit fc56da5

jnothman authored and rth committed

Deprecate fetch_mldata (#11466)

* API Deprecate fetch_mldata and update examples
* Use pytest's filterwarnings
* Rm unused import
* Remove broken doctest
* Refer user to openml URL
* DOC whatsnew tweak

1 parent 4752ea7 commit fc56da5

File tree

11 files changed: +100 −144 lines changed

doc/datasets/index.rst

Lines changed: 0 additions & 83 deletions

@@ -351,89 +351,6 @@ features::
 
     _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader
 
-..
-    For doctests:
-
-    >>> import numpy as np
-    >>> import os
-    >>> import tempfile
-    >>> # Create a temporary folder for the data fetcher
-    >>> custom_data_home = tempfile.mkdtemp()
-    >>> os.makedirs(os.path.join(custom_data_home, 'mldata'))
-
-
-.. _mldata:
-
-Downloading datasets from the mldata.org repository
----------------------------------------------------
-
-`mldata.org <http://mldata.org>`_ is a public repository for machine learning
-data, supported by the `PASCAL network <http://www.pascal-network.org>`_ .
-
-The ``sklearn.datasets`` package is able to directly download data
-sets from the repository using the function
-:func:`sklearn.datasets.fetch_mldata`.
-
-For example, to download the MNIST digit recognition database::
-
-  >>> from sklearn.datasets import fetch_mldata
-  >>> mnist = fetch_mldata('MNIST original', data_home=custom_data_home)
-
-The MNIST database contains a total of 70000 examples of handwritten digits
-of size 28x28 pixels, labeled from 0 to 9::
-
-  >>> mnist.data.shape
-  (70000, 784)
-  >>> mnist.target.shape
-  (70000,)
-  >>> np.unique(mnist.target)
-  array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
-
-After the first download, the dataset is cached locally in the path
-specified by the ``data_home`` keyword argument, which defaults to
-``~/scikit_learn_data/``::
-
-  >>> os.listdir(os.path.join(custom_data_home, 'mldata'))
-  ['mnist-original.mat']
-
-Data sets in `mldata.org <http://mldata.org>`_ do not adhere to a strict
-naming or formatting convention. :func:`sklearn.datasets.fetch_mldata` is
-able to make sense of the most common cases, but allows to tailor the
-defaults to individual datasets:
-
-* The data arrays in `mldata.org <http://mldata.org>`_ are most often
-  shaped as ``(n_features, n_samples)``. This is the opposite of the
-  ``scikit-learn`` convention, so :func:`sklearn.datasets.fetch_mldata`
-  transposes the matrix by default. The ``transpose_data`` keyword controls
-  this behavior::
-
-    >>> iris = fetch_mldata('iris', data_home=custom_data_home)
-    >>> iris.data.shape
-    (150, 4)
-    >>> iris = fetch_mldata('iris', transpose_data=False,
-    ...                     data_home=custom_data_home)
-    >>> iris.data.shape
-    (4, 150)
-
-* For datasets with multiple columns, :func:`sklearn.datasets.fetch_mldata`
-  tries to identify the target and data columns and rename them to ``target``
-  and ``data``. This is done by looking for arrays named ``label`` and
-  ``data`` in the dataset, and failing that by choosing the first array to be
-  ``target`` and the second to be ``data``. This behavior can be changed with
-  the ``target_name`` and ``data_name`` keywords, setting them to a specific
-  name or index number (the name and order of the columns in the datasets
-  can be found at its `mldata.org <http://mldata.org>`_ under the tab "Data"::
-
-    >>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1, data_name=0,
-    ...                      data_home=custom_data_home)
-    >>> iris3 = fetch_mldata('datasets-UCI iris', target_name='class',
-    ...                      data_name='double0', data_home=custom_data_home)
-
-
-..
-    >>> import shutil
-    >>> shutil.rmtree(custom_data_home)
 .. _external_datasets:
 
 Loading from external datasets
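The deleted section above documents why ``fetch_mldata`` transposed arrays by default: mldata.org stored data as ``(n_features, n_samples)``, the opposite of the scikit-learn convention. A minimal sketch of that layout difference, using a made-up array rather than real mldata data:

```python
import numpy as np

# Hypothetical 4-feature, 3-sample array in the mldata.org layout
# (n_features, n_samples):
raw = np.arange(12).reshape(4, 3)

# fetch_mldata's transpose_data=True default produced the scikit-learn
# layout (n_samples, n_features): one row per sample.
data = raw.T
print(data.shape)
```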

doc/modules/classes.rst

Lines changed: 1 addition & 1 deletion

@@ -257,7 +257,6 @@ Loaders
    datasets.fetch_kddcup99
    datasets.fetch_lfw_pairs
    datasets.fetch_lfw_people
-   datasets.fetch_mldata
    datasets.fetch_olivetti_faces
    datasets.fetch_openml
    datasets.fetch_rcv1
@@ -1513,6 +1512,7 @@ To be removed in 0.22
    :template: deprecated_function.rst
 
    covariance.graph_lasso
+   datasets.fetch_mldata
 
 
 To be removed in 0.21

doc/whats_new/v0.20.rst

Lines changed: 4 additions & 0 deletions

@@ -209,6 +209,10 @@ Support for Python 3.3 has been officially dropped.
   data points could be generated. :issue:`10045` by :user:`Christian Braune
   <christianbraune79>`.
 
+- |API| Deprecated :func:`sklearn.datasets.fetch_mldata` to be removed in
+  version 0.22. mldata.org is no longer operational. Until removal it will
+  remain possible to load cached datasets. :issue:`11466` by `Joel Nothman`_.
+
 :mod:`sklearn.decomposition`
 ............................
 
examples/gaussian_process/plot_gpr_co2.py

Lines changed: 42 additions & 7 deletions

@@ -8,7 +8,7 @@
 hyperparameter optimization using gradient ascent on the
 log-marginal-likelihood. The data consists of the monthly average atmospheric
 CO2 concentrations (in parts per million by volume (ppmv)) collected at the
-Mauna Loa Observatory in Hawaii, between 1958 and 1997. The objective is to
+Mauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to
 model the CO2 concentration as a function of the time t.
 
 The kernel is composed of several terms that are responsible for explaining
@@ -57,24 +57,59 @@
 explained by the model. The figure shows also that the model makes very
 confident predictions until around 2015.
 """
-print(__doc__)
-
 # Authors: Jan Hendrik Metzen <[email protected]>
 #
 # License: BSD 3 clause
 
+from __future__ import division, print_function
+
 import numpy as np
 
 from matplotlib import pyplot as plt
 
 from sklearn.gaussian_process import GaussianProcessRegressor
 from sklearn.gaussian_process.kernels \
     import RBF, WhiteKernel, RationalQuadratic, ExpSineSquared
-from sklearn.datasets import fetch_mldata
+try:
+    from urllib.request import urlopen
+except ImportError:
+    # Python 2
+    from urllib2 import urlopen
+
+print(__doc__)
+
 
-data = fetch_mldata('mauna-loa-atmospheric-co2').data
-X = data[:, [1]]
-y = data[:, 0]
+def load_mauna_loa_atmospheric_c02():
+    url = ('http://cdiac.ess-dive.lbl.gov/'
+           'ftp/trends/co2/sio-keel-flask/maunaloa_c.dat')
+    months = []
+    ppmv_sums = []
+    counts = []
+    for line in urlopen(url):
+        line = line.decode('utf8')
+        if not line.startswith('MLO'):
+            # ignore headers
+            continue
+        station, date, weight, flag, ppmv = line.split()
+        y = date[:2]
+        m = date[2:4]
+        month_float = (int(('20' if y < '20' else '19') + y) +
+                       (int(m) - 1) / 12)
+        if not months or month_float != months[-1]:
+            months.append(month_float)
+            ppmv_sums.append(float(ppmv))
+            counts.append(1)
+        else:
+            # aggregate monthly sum to produce average
+            ppmv_sums[-1] += float(ppmv)
+            counts[-1] += 1
+
+    months = np.asarray(months).reshape(-1, 1)
+    avg_ppmvs = np.asarray(ppmv_sums) / counts
+    return months, avg_ppmvs
+
+
+X, y = load_mauna_loa_atmospheric_c02()
 
 # Kernel with parameters given in GPML book
 k1 = 66.0**2 * RBF(length_scale=67.0)  # long term smooth rising trend
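The new ``load_mauna_loa_atmospheric_c02`` helper aggregates per-flask readings into monthly means. The averaging logic can be exercised without network access on made-up records (the ``MLO`` lines below are fabricated, not real CDIAC data):

```python
def monthly_average(lines):
    """Mirror the aggregation loop in load_mauna_loa_atmospheric_c02:
    group consecutive readings by (year, month) and average the ppmv."""
    months, sums, counts = [], [], []
    for line in lines:
        if not line.startswith('MLO'):
            continue  # skip headers
        station, date, weight, flag, ppmv = line.split()
        y, m = date[:2], date[2:4]
        # two-digit years below '20' are 20xx, the rest 19xx
        month = int(('20' if y < '20' else '19') + y) + (int(m) - 1) / 12
        if not months or month != months[-1]:
            months.append(month)
            sums.append(float(ppmv))
            counts.append(1)
        else:
            sums[-1] += float(ppmv)
            counts[-1] += 1
    return months, [s / c for s, c in zip(sums, counts)]


# Fabricated sample records in the assumed station/date/weight/flag/ppmv shape
lines = [
    "HDR some header line",
    "MLO 580315 1.0 . 315.0",
    "MLO 580329 1.0 . 317.0",   # same month -> averaged with the line above
    "MLO 580412 1.0 . 316.0",
]
months, avgs = monthly_average(lines)
print(months, avgs)
```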

examples/linear_model/plot_sgd_early_stopping.py

Lines changed: 4 additions & 3 deletions

@@ -47,7 +47,7 @@
 import matplotlib.pyplot as plt
 
 from sklearn import linear_model
-from sklearn.datasets import fetch_mldata
+from sklearn.datasets import fetch_openml
 from sklearn.model_selection import train_test_split
 from sklearn.utils.testing import ignore_warnings
 from sklearn.exceptions import ConvergenceWarning
@@ -56,9 +56,10 @@
 print(__doc__)
 
 
-def load_mnist(n_samples=None, class_0=0, class_1=8):
+def load_mnist(n_samples=None, class_0='0', class_1='8'):
     """Load MNIST, select two classes, shuffle and return only n_samples."""
-    mnist = fetch_mldata('MNIST original')
+    # Load data from http://openml.org/d/554
+    mnist = fetch_openml('mnist_784', version=1)
 
     # take only two classes for binary classification
     mask = np.logical_or(mnist.target == class_0, mnist.target == class_1)

examples/linear_model/plot_sparse_logistic_regression_mnist.py

Lines changed: 5 additions & 3 deletions

@@ -20,7 +20,7 @@
 import matplotlib.pyplot as plt
 import numpy as np
 
-from sklearn.datasets import fetch_mldata
+from sklearn.datasets import fetch_openml
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import train_test_split
 from sklearn.preprocessing import StandardScaler
@@ -35,9 +35,11 @@
 t0 = time.time()
 train_samples = 5000
 
-mnist = fetch_mldata('MNIST original')
-X = mnist.data.astype('float64')
+# Load data from https://www.openml.org/d/554
+mnist = fetch_openml('mnist_784', version=1)
+X = mnist.data
 y = mnist.target
+
 random_state = check_random_state(0)
 permutation = random_state.permutation(X.shape[0])
 X = X[permutation]

examples/multioutput/plot_classifier_chain_yeast.py

Lines changed: 7 additions & 7 deletions

@@ -32,24 +32,24 @@
 with randomly ordered chains).
 """
 
-print(__doc__)
-
 # Author: Adam Kleczewski
 # License: BSD 3 clause
 
 import numpy as np
 import matplotlib.pyplot as plt
+from sklearn.datasets import fetch_openml
 from sklearn.multioutput import ClassifierChain
 from sklearn.model_selection import train_test_split
 from sklearn.multiclass import OneVsRestClassifier
 from sklearn.metrics import jaccard_similarity_score
 from sklearn.linear_model import LogisticRegression
-from sklearn.datasets import fetch_mldata
 
-# Load a multi-label dataset
-yeast = fetch_mldata('yeast')
-X = yeast['data']
-Y = yeast['target'].transpose().toarray()
+print(__doc__)
+
+# Load a multi-label dataset from https://www.openml.org/d/40597
+yeast = fetch_openml('yeast', version=4)
+X = yeast.data
+Y = yeast.target == 'TRUE'
 X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2,
                                                     random_state=0)
examples/neural_networks/plot_mnist_filters.py

Lines changed: 8 additions & 5 deletions

@@ -20,15 +20,18 @@
 for a very short time. Training longer would result in weights with a much
 smoother spatial appearance.
 """
-print(__doc__)
-
 import matplotlib.pyplot as plt
-from sklearn.datasets import fetch_mldata
+from sklearn.datasets import fetch_openml
 from sklearn.neural_network import MLPClassifier
 
-mnist = fetch_mldata("MNIST original")
+print(__doc__)
+
+# Load data from https://www.openml.org/d/554
+mnist = fetch_openml('mnist_784', version=1)
+X = mnist.data
+y = mnist.target
+
 # rescale the data, use the traditional train/test split
-X, y = mnist.data / 255., mnist.target
 X_train, X_test = X[:60000], X[60000:]
 y_train, y_test = y[:60000], y[60000:]
 

sklearn/datasets/mldata.py

Lines changed: 13 additions & 34 deletions

@@ -25,13 +25,19 @@
 
 from .base import get_data_home
 from ..utils import Bunch
+from ..utils import deprecated
 
 MLDATA_BASE_URL = "http://mldata.org/repository/data/download/matlab/%s"
 
 
+@deprecated('mldata_filename was deprecated in version 0.20 and will be '
+            'removed in version 0.22')
 def mldata_filename(dataname):
     """Convert a raw name for a data set in a mldata.org filename.
 
+    .. deprecated:: 0.20
+        Will be removed in version 0.22
+
     Parameters
     ----------
     dataname : str
@@ -46,10 +52,14 @@ def mldata_filename(dataname):
     return re.sub(r'[().]', '', dataname)
 
 
+@deprecated('fetch_mldata was deprecated in version 0.20 and will be removed '
+            'in version 0.22')
 def fetch_mldata(dataname, target_name='label', data_name='data',
                  transpose_data=True, data_home=None):
     """Fetch an mldata.org data set
 
+    mldata.org is no longer operational.
+
    If the file does not exist yet, it is downloaded from mldata.org .
 
    mldata.org does not have an enforced convention for storing data or
@@ -70,6 +80,9 @@ def fetch_mldata(dataname, target_name='label', data_name='data',
     mldata.org data sets may have multiple columns, which are stored in the
     Bunch object with their original name.
 
+    .. deprecated:: 0.20
+        Will be removed in version 0.22
+
     Parameters
     ----------
 
@@ -99,40 +112,6 @@ def fetch_mldata(dataname, target_name='label', data_name='data',
     'data', the data to learn, 'target', the classification labels,
     'DESCR', the full description of the dataset, and
     'COL_NAMES', the original names of the dataset columns.
-
-    Examples
-    --------
-    Load the 'iris' dataset from mldata.org:
-
-    >>> from sklearn.datasets.mldata import fetch_mldata
-    >>> import tempfile
-    >>> test_data_home = tempfile.mkdtemp()
-
-    >>> iris = fetch_mldata('iris', data_home=test_data_home)
-    >>> iris.target.shape
-    (150,)
-    >>> iris.data.shape
-    (150, 4)
-
-    Load the 'leukemia' dataset from mldata.org, which needs to be transposed
-    to respects the scikit-learn axes convention:
-
-    >>> leuk = fetch_mldata('leukemia', transpose_data=True,
-    ...                     data_home=test_data_home)
-    >>> leuk.data.shape
-    (72, 7129)
-
-    Load an alternative 'iris' dataset, which has different names for the
-    columns:
-
-    >>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1,
-    ...                      data_name=0, data_home=test_data_home)
-    >>> iris3 = fetch_mldata('datasets-UCI iris',
-    ...                      target_name='class', data_name='double0',
-    ...                      data_home=test_data_home)
-
-    >>> import shutil
-    >>> shutil.rmtree(test_data_home)
     """
 
     # normalize dataset name
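The diff wraps both functions in ``sklearn.utils.deprecated``: the functions keep working (so cached datasets remain loadable until 0.22) but emit a warning on every call. A minimal stand-in decorator illustrating that behavior — a simplified sketch, not sklearn's actual implementation:

```python
import functools
import warnings


def deprecated(message):
    """Simplified stand-in for sklearn.utils.deprecated: calling the
    wrapped function emits a DeprecationWarning, then runs it normally."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(message, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated('fetch_mldata was deprecated in version 0.20 and will be removed '
            'in version 0.22')
def fetch_mldata(dataname):
    # Hypothetical body: the real function would load from the local cache
    return 'would fetch %s from cache' % dataname


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = fetch_mldata('iris')
print(result)
```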
