
Conversation

@TamaraAtanasoska
Contributor

@TamaraAtanasoska TamaraAtanasoska commented Aug 1, 2023

Reference Issues/PRs

Fixes #17568.

What does this implement/fix? Explain your changes.

This PR passes the random_state parameter to every clustering algorithm in the file that accepts it. Additionally, we have added the random_state parameter to two of the dataset generators and removed the global random seed setting.
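As a rough illustration of this kind of change (a minimal sketch, not the actual diff; the seed value and estimator choices below are arbitrary), the idea is to seed the dataset generators and the stochastic estimators explicitly instead of relying on a global seed:

```python
from sklearn import cluster, datasets

n_samples = 500
seed = 8  # arbitrary seed value, for illustration only

# Dataset generators accept random_state directly.
noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=seed
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=seed)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=seed)

# Clustering estimators with a stochastic component also accept random_state.
two_means = cluster.MiniBatchKMeans(n_clusters=2, random_state=seed)
spectral = cluster.SpectralClustering(n_clusters=2, random_state=seed)
```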

Any other comments?

After a few tests, the generated image looks consistently the same, but we aren't as confident as we were for the last PR. Please let us know what we can improve in case it is still not deterministic.

This PR was done as part of the PyLadies workshop meetup.
@adrinjalali and @glemaitre please take a look. Thank you!

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 27c4c1e to aa91f56 Compare August 1, 2023 19:25
@github-actions

github-actions bot commented Aug 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b6d47f3.

@glemaitre glemaitre self-requested a review August 2, 2023 13:38
@glemaitre glemaitre changed the title Create a consistent image via random_state additions in "plot_cluster_comparison.py" ENH Create a consistent image via random_state additions in "plot_cluster_comparison.py" Aug 2, 2023
@glemaitre glemaitre changed the title ENH Create a consistent image via random_state additions in "plot_cluster_comparison.py" FIX Create a consistent image via random_state additions in "plot_cluster_comparison.py" Aug 2, 2023
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=8)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
Member

This one is not seeded.
So we need to create a random number generator, at least for all the np.random calls in the current file:

Suggested change
no_structure = np.random.rand(n_samples, 2), None
rng = np.random.RandomState(42)
no_structure = rng.rand(n_samples, 2), None
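For context, a small reproducibility check along the lines of this suggestion (the array shape is arbitrary):

```python
import numpy as np

# Unseeded global state: different points on every fresh run, so the rendered
# figure is not reproducible.
unseeded = np.random.rand(3, 2)

# A dedicated, seeded generator: the same points on every run.
rng = np.random.RandomState(42)
seeded_once = rng.rand(3, 2)
seeded_again = np.random.RandomState(42).rand(3, 2)
assert np.array_equal(seeded_once, seeded_again)
```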

Contributor Author

Thanks @glemaitre, I have pushed commit 7dd631f with the changes you proposed.

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=8)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
rng = np.random.RandomState(42)
Member

I'd move this rng a few lines above and pass the same thing to make_moons and make_circles.

Contributor Author

How does this relate to the conversation we had at the workshop about call order and RandomState object usage (cc @glemaitre, as it was the two of us talking)? I got the impression that, whenever possible, the random_state parameter is to be preferred. As far as I understand, that isn't directly possible in the no_structure case, so we use a RandomState instance instead of the global numpy functions. From a best-practice perspective, is this specific case different, or is this more of a style suggestion? Thanks!

Member

Yes, I'm not saying not to use random_state, I'm suggesting:

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=rng)

Contributor Author

Right @adrinjalali, sorry, I wasn't clear enough: I meant passing an integer to the random_state parameter rather than a RandomState object. From my conversation with @glemaitre at the event, it sounded like the former is preferred; this is exactly what we discussed. I'm also happy to go with your suggestion, since it's better to merge this than to stall it for a few weeks; I'm just trying to understand.

Member

Then I'll let @glemaitre comment on why :)

Member

@adrinjalali if we pass the rng instance, it will be mutated by each call. So if a future PR changes a line of code that consumes the rng, all subsequent uses of it in the file will change behaviour. Passing an int makes sure that we don't change the behaviour in future PRs.

Basically, random-state instance dilemma stuff :)
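A minimal sketch of that trade-off (the seeds and sample counts below are arbitrary):

```python
import numpy as np
from sklearn import datasets

# Passing a RandomState instance: each call consumes (mutates) the generator,
# so what a dataset looks like depends on which calls ran before it.
rng = np.random.RandomState(0)
datasets.make_moons(n_samples=10, noise=0.05, random_state=rng)
blobs_after_moons, _ = datasets.make_blobs(n_samples=10, random_state=rng)

rng = np.random.RandomState(0)
blobs_fresh, _ = datasets.make_blobs(n_samples=10, random_state=rng)
# blobs_after_moons != blobs_fresh in general: the generator state differs
# at the time of the second call.

# Passing an int: every call re-seeds independently, so adding or reordering
# lines in a future PR does not change the other datasets.
blobs_a, _ = datasets.make_blobs(n_samples=10, random_state=0)
blobs_b, _ = datasets.make_blobs(n_samples=10, random_state=0)
assert np.array_equal(blobs_a, blobs_b)
```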

@glemaitre
Member

I observe a couple of changes that I think we don't want. To be honest, I am surprised this is linked to the random seed; I would not have expected such a difference. The main changes are on the first dataset:

Before

[screenshot: figure generated before the change]

After

[screenshot: figure generated after the change]

Differences

AffinityPropagation went haywire, MeanShift finds a single cluster, AgglomerativeClustering no longer recovers the right clustering (it did before), and OPTICS labels a lot of data as outliers.

For OPTICS, I can see the same behavior on the second dataset.

I think that it could be nice to change the random seed such that we preserve the visual effect.

@TamaraAtanasoska
Contributor Author

> I observe a couple of changes that I think we don't want. [...] I think that it could be nice to change the random seed such that we preserve the visual effect.

We noticed these radical changes at the end of the contribution night, but since they were consistent we thought it might be ok 😊 Would experimenting with different random seed values be the right way to proceed here? Is there any other approach?

@glemaitre
Member

> Would experimenting with different random seed values be the right way to proceed here?

I would say that this is enough for this PR indeed.
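One way to run that experiment (a rough sketch; the candidate seeds and the choice of make_circles as the probe dataset are illustrative assumptions, not the actual process used):

```python
import matplotlib.pyplot as plt
from sklearn import datasets

n_samples = 500
candidate_seeds = [8, 30, 42, 170]  # arbitrary candidates

# Regenerate one of the datasets for each candidate seed and compare the
# layouts visually before settling on one.
fig, axes = plt.subplots(1, len(candidate_seeds), figsize=(12, 3))
for ax, seed in zip(axes, candidate_seeds):
    X, _ = datasets.make_circles(
        n_samples=n_samples, factor=0.5, noise=0.05, random_state=seed
    )
    ax.scatter(X[:, 0], X[:, 1], s=10)
    ax.set_title(f"random_state={seed}")
fig.tight_layout()
plt.show()
```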

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 89e27aa to 7dd631f Compare August 21, 2023 21:13
@TamaraAtanasoska
Contributor Author

[Screenshot: Screen Shot 2023-08-21 at 23 15 32]

@glemaitre Using 30 as the random seed gives an impression close enough to the original image. Some of the panels are still slightly different, but it is much better. Would this suffice in your opinion? The change can be found in 0f3dc62.

I will also keep in mind for future PRs that the image needs to give a similar visual impression.

@glemaitre
Member

I am fine with these changes. Would you mind fixing the linter issues ;)

@TamaraAtanasoska TamaraAtanasoska force-pushed the image-randomness-removal-cluster-comparison branch from 0f3dc62 to b6d47f3 Compare August 22, 2023 22:28
@TamaraAtanasoska
Contributor Author

> I am fine with these changes. Would you mind fixing the linter issues ;)

Done! Looks all green now.

@glemaitre glemaitre merged commit 32f3c24 into scikit-learn:main Aug 23, 2023
@glemaitre
Member

Thanks @TamaraAtanasoska, all good then. Merging.

@TamaraAtanasoska TamaraAtanasoska deleted the image-randomness-removal-cluster-comparison branch August 24, 2023 12:31
akaashpatelmns pushed a commit to akaashp2000/scikit-learn that referenced this pull request Aug 25, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 29, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

Development

Successfully merging this pull request may close these issues.

Reduce the size of some images in the documentation
