
Conversation

@TamaraAtanasoska
Contributor

@TamaraAtanasoska TamaraAtanasoska commented Aug 1, 2023

Reference Issues/PRs

Fixes #17568.

What does this implement/fix? Explain your changes.

This PR passes the random_state parameter to every clustering algorithm in the file that accepts it. Additionally, we have added the random_state parameter to two of the dataset generators and removed the global random seed setting.
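As a rough illustration of this kind of change (a minimal sketch, not the actual diff; the seed value and estimator choices below are arbitrary), the idea is to seed the dataset generators and the stochastic estimators explicitly instead of relying on a global seed:

```python
from sklearn import cluster, datasets

n_samples = 500
seed = 8  # arbitrary seed value, for illustration only

# Dataset generators accept random_state directly.
noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=seed
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=seed)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=seed)

# Clustering estimators with a stochastic component also accept random_state.
two_means = cluster.MiniBatchKMeans(n_clusters=2, random_state=seed)
spectral = cluster.SpectralClustering(n_clusters=2, random_state=seed)
```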

Any other comments?

After a few tests, the generated image looks consistently the same, but we aren't as confident as we were for the last PR. Please let us know what we can improve in case it is still not deterministic.

This PR was done as part of the PyLadies workshop meetup.
@adrinjalali and @glemaitre please take a look. Thank you!

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 27c4c1e to aa91f56 Compare August 1, 2023 19:25
@github-actions

github-actions bot commented Aug 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b6d47f3.

@glemaitre glemaitre self-requested a review August 2, 2023 13:38
@glemaitre glemaitre changed the title Create a consistent image via random_state additions in "plot_cluster_comparison.py" ENH Create a consistent image via random_state additions in "plot_cluster_comparison.py" Aug 2, 2023
@glemaitre glemaitre changed the title ENH Create a consistent image via random_state additions in "plot_cluster_comparison.py" FIX Create a consistent image via random_state additions in "plot_cluster_comparison.py" Aug 2, 2023
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=8)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
Member

This one is not seeded.
So we need to create a random number generator, at least for all the np.random calls in the current file:

Suggested change
no_structure = np.random.rand(n_samples, 2), None
rng = np.random.RandomState(42)
no_structure = rng.rand(n_samples, 2), None
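For context, a small reproducibility check along the lines of this suggestion (the array shape is arbitrary):

```python
import numpy as np

# Unseeded global state: different points on every fresh run, so the rendered
# figure is not reproducible.
unseeded = np.random.rand(3, 2)

# A dedicated, seeded generator: the same points on every run.
rng = np.random.RandomState(42)
seeded_once = rng.rand(3, 2)
seeded_again = np.random.RandomState(42).rand(3, 2)
assert np.array_equal(seeded_once, seeded_again)
```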

Contributor Author

Thanks @glemaitre, I have pushed commit 7dd631f with the changes you proposed.

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=8)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
rng = np.random.RandomState(42)
Member

I'd move this rng a few lines above and pass the same thing to make_moons and make_circles.

Contributor Author

How does this relate to the conversation we had at the workshop about call order and RandomState object usage (cc @glemaitre, as it was the two of us talking)? I got the impression that, whenever possible, the random_state parameter is to be preferred. As far as I understand, that isn't directly possible in the no_structure case, so we use a RandomState instance instead of the global numpy functions. From a best-practice perspective, is this specific case different, or is this more of a style suggestion? Thanks!

Member

Yes, I'm not saying not to use random_state, I'm suggesting:

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=rng)

Contributor Author

Right @adrinjalali, sorry, I wasn't clear enough: I meant passing an integer to the random_state parameter rather than a RandomState object. From my conversation with @glemaitre at the event, it sounded like the former is preferred; this is exactly what we discussed. I'm also happy to go with your suggestion, since it's better to merge this than to stall it for a few weeks; I'm just trying to understand.

Member

Then I'll let @glemaitre comment on why :)

Member

@adrinjalali if we pass the rng instance, it will be mutated by each call. So if a future PR changes a line of code that consumes the rng, all subsequent uses of it in the file will change behaviour. Passing an int makes sure that we don't change the behaviour in future PRs.

Basically, random-state instance dilemma stuff :)
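A minimal sketch of that trade-off (the seeds and sample counts below are arbitrary):

```python
import numpy as np
from sklearn import datasets

# Passing a RandomState instance: each call consumes (mutates) the generator,
# so what a dataset looks like depends on which calls ran before it.
rng = np.random.RandomState(0)
datasets.make_moons(n_samples=10, noise=0.05, random_state=rng)
blobs_after_moons, _ = datasets.make_blobs(n_samples=10, random_state=rng)

rng = np.random.RandomState(0)
blobs_fresh, _ = datasets.make_blobs(n_samples=10, random_state=rng)
# blobs_after_moons != blobs_fresh in general: the generator state differs
# at the time of the second call.

# Passing an int: every call re-seeds independently, so adding or reordering
# lines in a future PR does not change the other datasets.
blobs_a, _ = datasets.make_blobs(n_samples=10, random_state=0)
blobs_b, _ = datasets.make_blobs(n_samples=10, random_state=0)
assert np.array_equal(blobs_a, blobs_b)
```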

@glemaitre
Member

I observe a couple of changes that I think we don't want. To be honest, I am surprised this is linked to the random seed; I would not have expected such a difference. The main changes are on the first dataset:

Before

[screenshot: figure generated before the change]

After

[screenshot: figure generated after the change]

Differences

AffinityPropagation went haywire, MeanShift finds a single cluster, AgglomerativeClustering no longer recovers the right clustering (it did before), and OPTICS labels a lot of data as outliers.

For OPTICS, I can see the same behavior on the second dataset.

I think that it could be nice to change the random seed such that we preserve the visual effect.

@TamaraAtanasoska
Contributor Author

> I observe a couple of changes that I think we don't want. [...] I think that it could be nice to change the random seed such that we preserve the visual effect.

We noticed these radical changes at the end of the contribution night, but since they were consistent we thought it might be ok 😊 Would experimenting with different random seed values be the right way to proceed here? Is there any other approach?

@glemaitre
Member

> Would experimenting with different random seed values be the right way to proceed here?

I would say that this is enough for this PR indeed.
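One way to run that experiment (a rough sketch; the candidate seeds and the choice of make_circles as the probe dataset are illustrative assumptions, not the actual process used):

```python
import matplotlib.pyplot as plt
from sklearn import datasets

n_samples = 500
candidate_seeds = [8, 30, 42, 170]  # arbitrary candidates

# Regenerate one of the datasets for each candidate seed and compare the
# layouts visually before settling on one.
fig, axes = plt.subplots(1, len(candidate_seeds), figsize=(12, 3))
for ax, seed in zip(axes, candidate_seeds):
    X, _ = datasets.make_circles(
        n_samples=n_samples, factor=0.5, noise=0.05, random_state=seed
    )
    ax.scatter(X[:, 0], X[:, 1], s=10)
    ax.set_title(f"random_state={seed}")
fig.tight_layout()
plt.show()
```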

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 89e27aa to 7dd631f Compare August 21, 2023 21:13
@TamaraAtanasoska
Contributor Author

[Screenshot: Screen Shot 2023-08-21 at 23 15 32]

@glemaitre Using 30 as the random seed gives an impression close enough to the original image. Some of the panels are still slightly different, but it is much better. Would this suffice in your opinion? The change can be found in 0f3dc62.

I will also keep in mind for future PRs that the image needs to give a similar visual impression.

@glemaitre
Member

I am fine with these changes. Would you mind fixing the linter issues ;)

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 0f3dc62 to b6d47f3 Compare August 22, 2023 22:28
@TamaraAtanasoska
Contributor Author

> I am fine with these changes. Would you mind fixing the linter issues ;)

Done! Looks all green now.

@glemaitre glemaitre merged commit 32f3c24 into scikit-learn:main Aug 23, 2023
@glemaitre
Member

Thanks @TamaraAtanasoska, all good then. Merging.

@TamaraAtanasoska TamaraAtanasoska deleted the image-randomness-removal-cluster-comparison branch August 24, 2023 12:31
akaashpatelmns pushed a commit to akaashp2000/scikit-learn that referenced this pull request Aug 25, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 29, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

Development

Successfully merging this pull request may close these issues.

Reduce the size of some images in the documentation
