[MRG] Fix SGD non deterministic behavior #13422
Conversation
Windows unit tests seem to fail, I need to investigate a bit.
jnothman left a comment
Thanks for this. I agree the nondeterminism is a problem
    dataset, intercept_decay = make_dataset(X, y_i, sample_weight)
    # XXX should have random_state_!
    random_state = random_state if random_state is not None \
Not sure we need to support retrieving the random_state from the estimator. Let's just use check_random_state and keep it simple.
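For context, a minimal sketch of the semantics of `sklearn.utils.check_random_state` (not its actual implementation): an int seed yields a fresh, reproducible generator on every call, while a `RandomState` instance is passed through as-is and therefore shared by all callers.

```python
import numbers
import numpy as np

def check_random_state_sketch(seed):
    """Minimal sketch of sklearn.utils.check_random_state semantics."""
    if seed is None:
        return np.random.mtrand._rand          # NumPy's global generator
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)     # fresh generator, reproducible
    if isinstance(seed, np.random.RandomState):
        return seed                            # shared instance passes through
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")

# An int seed gives identical draws on every call; a shared instance consumes
# state, so its draws depend on how many calls happened before.
assert check_random_state_sketch(0).randint(10) == check_random_state_sketch(0).randint(10)
```

This difference is exactly why the bug only shows up when the estimator is given a `RandomState` instance rather than an int.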
Force-pushed from 4c4d02d to a5fbfe6
@jnothman thanks, I made the change and used `check_random_state`.
Force-pushed from a5fbfe6 to fe23a0c
Is it possible that MAX_INT is not platform-independent?
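The concern is plausible: NumPy's default integer follows the platform's C `long`, which (for NumPy < 2.0) is 32-bit on Windows even in 64-bit builds, while it is 64-bit on most Linux/macOS builds. Pinning the bound to an explicit fixed-width type sidesteps the issue; a small illustration (the `MAX_INT` name here is illustrative):

```python
import numpy as np

# np.iinfo on an explicit fixed-width type is the same on every platform,
# unlike the platform-dependent default integer (C long).
MAX_INT = np.iinfo(np.int32).max  # 2**31 - 1 everywhere

rng = np.random.RandomState(0)
seed = rng.randint(MAX_INT)       # always fits in 32 bits, on any platform
assert 0 <= seed < MAX_INT
```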
Force-pushed from a89888d to 2015b20
Force-pushed from aeddd29 to 171d863
Force-pushed from e9e94e8 to ab03321
OK, I think I fixed it.
jnothman left a comment
Under the assumption that there's no reasonable way to write a regression test, this LGTM.
@ClemDoum you'll need to rebase on current master.
Codecov Report
@@ Coverage Diff @@
## master #13422 +/- ##
==========================================
- Coverage 96.65% 96.39% -0.27%
==========================================
Files 376 377 +1
Lines 69711 69869 +158
==========================================
- Hits 67381 67347 -34
- Misses 2330 2522 +192
Continue to review full report at Codecov.
Thank you @ClemDoum. Sorry, I forgot to merge it when I approved.
Side note: since GitHub now has a nice squash-and-merge button, we now prefer merging master instead of rebasing. It makes the discussion easier to follow.
This reverts commit 7560d92.
Does @ClemDoum or anyone else involved in this discussion remember the reason for this? I was not able to find any reference to it in the discussion ...
@lesteve yes, sorry, I didn't mention it in the PR, but if I remember well ...
Right, that makes sense, thanks a lot!
This removes the -Woverflow warnings observed when building scikit-learn. RAND_R_MAX is the max value for uint8; incrementing it causes an overflow (hence the warning).
I think this commit fixes the implementation, yet it comes with backwards-incompatible results, and tests for implementations relying on `our_rand_r` fail because results are now different.
I see several alternatives to remove the warning while having tests pass:
- preferred solution: adapt the test suite using the new results so that all tests pass, and acknowledge the change of behavior for impacted user-facing APIs in the changelog
- accept the quirk of this implementation but hardcode and rename the effective constant
- silence the -Woverflow warning by another means
Relates to:
scikit-learn#13422
scikit-learn#24895
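Taking the commit message at face value (an unsigned 8-bit RAND_R_MAX; the constant's actual width in scikit-learn may differ), the wraparound it describes can be illustrated by emulating fixed-width unsigned arithmetic in Python:

```python
# Illustration of the -Woverflow issue described above. The 8-bit width is
# assumed purely for demonstration, following the commit message.
BITS = 8
RAND_R_MAX = (1 << BITS) - 1          # 255: the largest value the type holds

# In C, `RAND_R_MAX + 1` evaluated in the same 8-bit unsigned type wraps to 0,
# which is what -Woverflow warns about:
wrapped = (RAND_R_MAX + 1) & ((1 << BITS) - 1)
assert wrapped == 0

# Evaluated in a wider type, the same expression is the intended modulus 256,
# so `value % (RAND_R_MAX + 1)` maps any value into [0, RAND_R_MAX]:
modulus = RAND_R_MAX + 1              # 256 in ordinary (wider) arithmetic
assert 300 % modulus == 44
```

This is why fixing the expression changes the generated stream: a modulus of 0 (or a compiler-folded wrap) and a modulus of 256 reduce values differently, which is the backwards incompatibility mentioned above.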
What does this fix:
While trying to make my intent classifier training deterministic in my NLU lib, I noticed that I couldn't because the `BaseSGDClassifier._fit_multiclass` method is non-deterministic. To reproduce, you have to initialize the `SGDClassifier` with a `RandomState` instance. I added a small print statement just before the joblib threads pass their seed to the `plain_sgd` or `average_sgd` function. This gives me the following output (you will get a different output on your machine, but it will eventually fail). What is happening is that the joblib threads share the `BaseSGDClassifier.random_state` and set a seed in the `fit_binary` function before it's passed to the `plain_sgd` or `average_sgd` function. Depending on the order in which the threads reach the seed setting, the output of the SGD can differ. The other reason is that the estimator random state was not passed to the `make_dataset` function.
I think the bug was not noticeable in the unit tests because the SGD estimators were initialized with int random states. In this case, the input of the `check_random_state` function in the `fit_binary` function is the integer seed, so each thread actually returns the exact same random state and then samples the exact same random seed for the SGD.
How does this fix the bug:
- Added a `seed` argument to the `fit_binary` function, which defaults to `None`. If the seed is not `None` it is used; otherwise the seed is set from the estimator's `random_state`. This allows setting the jobs' seeds before the jobs are distributed to the threads and avoids the non-deterministic behavior.
- Passed the estimator's random state to the `make_dataset` function.
Help needed
(Now also fixes #5015.)
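The fix described above can be sketched as follows: instead of letting each thread draw its seed from a shared `RandomState` (where the draw depends on scheduling order), all per-job seeds are drawn up front, before dispatch. The function names here are illustrative, not scikit-learn's:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

MAX_INT = np.iinfo(np.int32).max

def fit_jobs(n_jobs, random_state):
    """Sketch of the fixed pattern: pre-draw one seed per job."""
    rng = np.random.RandomState(random_state)
    # Seeds are fixed here, in a single thread, so they no longer depend on
    # the order in which worker threads happen to run.
    seeds = [rng.randint(MAX_INT) for _ in range(n_jobs)]

    def fit_one(seed):
        # Each job builds its own private generator from its fixed seed,
        # standing in for the plain_sgd / average_sgd call.
        return np.random.RandomState(seed).randint(MAX_INT)

    with ThreadPoolExecutor(max_workers=n_jobs) as ex:
        return list(ex.map(fit_one, seeds))

# Two runs with the same estimator-level seed now match exactly, regardless
# of how the threads were scheduled.
```

The buggy pattern would instead call `shared_rng.randint(MAX_INT)` inside each worker, so the seed a given job receives would depend on which thread drew first.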
While fixing the initial bug, another bug was found and fixed: `our_rand_r` in the Cython code of `sklearn/utils/seq_dataset.pyx` was not behaving consistently across platforms; see the full bug description and fix below.
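One way such cross-platform drift can happen, sketched in Python: without explicit fixed-width casts, intermediate shift results can exceed the intended 32-bit range and then behave differently depending on the platform's integer widths. Masking every step to 32 bits (the moral equivalent of explicit `<UINT32_t>` casts in Cython) pins the sequence everywhere. This xorshift-style step is illustrative, not scikit-learn's exact generator:

```python
MASK32 = 0xFFFFFFFF  # emulate fixed-width uint32 arithmetic

def xorshift32_step(seed):
    # Illustrative xorshift-style PRNG step (not scikit-learn's exact code).
    # Masking after each left shift keeps every intermediate within 32 bits,
    # so the output cannot depend on the platform's native integer width.
    seed ^= (seed << 13) & MASK32
    seed ^= seed >> 17
    seed ^= (seed << 5) & MASK32
    return seed & MASK32

# Same seed -> same value, on any platform: 1 -> 270369
assert xorshift32_step(1) == 270369
```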