CLN Cleaned cluster/_hdbscan/_linkage.pyx#24857
CLN Cleaned cluster/_hdbscan/_linkage.pyx#24857thomasjpfan merged 12 commits intoscikit-learn:hdbscanfrom
cluster/_hdbscan/_linkage.pyx#24857Conversation
|
@thomasjpfan @glemaitre @jjerphan Wondering if any of you would be interested in reviewing this. |
| cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability( | ||
| cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability | ||
| ): |
There was a problem hiding this comment.
We should only use cnp.ndarray when memoryviews can't be used.
Is the usage of MST_edge_t preventing us to use (const-qualified) memoryview here and in other functions and methods?
Generally, most cnp.ndarray seem replaceable by memoryviews here.
There was a problem hiding this comment.
The current algorithm uses binary masks for indexing into subsets of cnp.ndarrays, simplifying the innermost scope of the algorithm. Hence I think it would be preferable to maintain the use of cnp.ndarray instead. In particular, it is necessary for:
scikit-learn/sklearn/cluster/_hdbscan/_linkage.pyx
Lines 48 to 52 in 39e7d7e
There was a problem hiding this comment.
I could attempt to change the algorithm to avoid using this, but it is a pretty eloquent solution. Ultimately I don't know how costly the python interactions here are, and whether they outweigh the usefulness of the algorithm as it is now.
There was a problem hiding this comment.
There are some overhead of just typing and referencing variable with cnp.ndarray instead of using memoryviews (because cnp.ndarray are PyObjects and memoryviews aren't if I remember correctly).
We can leave this for now. If this shows to be a hotspot, we can have a PR to rewrite this part. What do you think?
In this case, is it possible to add a comment to indicate that numpy arrays are used for binary masks and convenient indexing?
There was a problem hiding this comment.
Sounds good, I will add in such a comment.
There was a problem hiding this comment.
Was this comment added somewhere?
| distance_matrix = pairwise_distances( | ||
| X, metric=metric, n_jobs=n_jobs, **metric_params | ||
| ) | ||
| distance_matrix /= alpha |
There was a problem hiding this comment.
This is technically a bit out of scope for this PR since it lives in _hdbscan_brute, however upon investigation I noticed:
- We do not have any tests for the behavior of the
alphaparameter. - It is unlikely to be used.
- It is unadvised to use.
- It currently has no effect in
_hdbscan_brute(carryover bug?).
Hence I opted to remove it from _linkage.pyx. Since it was essentially only used in mst_from_data_matrix and the ineffective code above, I opted to remove it from HDBSCAN entirely. The above change does not affect any HDBSCAN behavior in a meaningful way.
There was a problem hiding this comment.
Thanks for spotting this. Do you think it is possible to extract it in another PR? This way, we can discuss the problem independently from this PR which can be merged in the meantime.
There was a problem hiding this comment.
I agree removing alpha should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.
|
I'll have a go soonish on the HDBSCAN PRs. |
jjerphan
left a comment
There was a problem hiding this comment.
Just a comment which is somewhat-orthogonal to this PR's goal.
| cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability( | ||
| cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability | ||
| ): |
There was a problem hiding this comment.
Was this comment added somewhere?
| distance_matrix = pairwise_distances( | ||
| X, metric=metric, n_jobs=n_jobs, **metric_params | ||
| ) | ||
| distance_matrix /= alpha |
There was a problem hiding this comment.
I agree removing alpha should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.
Co-authored-by: Thomas J. Fan <[email protected]>
| distance_matrix = pairwise_distances( | ||
| X, metric=metric, n_jobs=n_jobs, **metric_params | ||
| ) | ||
| distance_matrix /= alpha |
There was a problem hiding this comment.
What if alpha=None in this case?
| distance_matrix /= alpha | |
| if alpha is not None: | |
| distance_matrix /= alpha |
There was a problem hiding this comment.
I just left that as it is on hdbscan since we'll be removing/revisiting alpha in the follow-up PR and this portion is in _hdbscan_brute and a bit out of scope for this PR.
| # Note: we utilize ndarray's over memory-views to make use of numpy | ||
| # binary indexing and sub-selection below. |
There was a problem hiding this comment.
Thank you for this comment, this answers one of @Vincent-Maladiere's questions. 👍
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Julien Jerphanion <[email protected]>
Reference Issues/PRs
Towards #24686
What does this implement/fix? Explain your changes.
Cleans up and revises
_hdbscan/_linkage.pyx.Any other comments?