Skip to content

Conversation

@glemaitre
Copy link

Some additional style change

@glemaitre
Copy link
Author

It was easier to make a PR than commenting.
I think that we should test the function mutual_reachibility_distance directly.
We should ensure that dense and sparse would lead to the same results. This would be the minimal thing.

a core point.
max_dist : float, default=0.0
max_distance : float, default=0.0
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can be explicit regarding the naming.

`LIL` format.
min_points : int, default=5
min_samples : int, default=5
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the name to be consistent with the high-level function and DBSCAN as well

)[farther_neighbor_idx]
else:
core_distances[i] = np.infty
core_distances[i] = INFINITY
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would avoid a Python interaction

graph of a distance matrix. Note that computation is performed in-place for
`distance_matrix`. If out-of-place computation is required, pass a copy to
this function.
def mutual_reachability_graph(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we build the graph, I prefer to make it explicit.

@glemaitre
Copy link
Author

glemaitre commented Oct 19, 2022

The test could be something like:

import numpy as np
from scipy import sparse
from sklearn.utils._testing import assert_allclose

dist = np.random.randn(5, 5)
mr_dense = mutual_reachability_graph(dist)
mr_sparse = mutual_reachability_graph(sparse.lil_matrix(dist))

assert_allclose(mr_sparse.A, mr_dense)

With my PR it fails. I will have a closer look to see what is wrong.
I probably introduced a bug :)

@glemaitre glemaitre marked this pull request as draft October 19, 2022 16:07
Copy link
Author

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the PR WIP but basically this is now working.
I have to update the documentation and make a pass to be sure that I like what I wrote.

graph of a distance matrix. Note that computation is performed in-place for
`distance_matrix`. If out-of-place computation is required, pass a copy to
this function.
ctypedef fused integral:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added fused type directly.

return distance_matrix
core_distances[i] = INFINITY

for col_ind in range(n_samples):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nogil interaction.

@glemaitre glemaitre marked this pull request as ready for review October 20, 2022 09:40
@glemaitre
Copy link
Author

I am pretty happy with the current implementation.
Basically, the changes are the following:

  • mainly renaming variables
  • add a CSR implementation that does not require the GIL (nor for the sorting par since this is based on NumPy)
  • add fused type to already be compatible with huge sparse with np.int64_t indices and np.float32 arrays.

A bit later, I will add a file that tests minimally the implementation.

Something that I was wondering and seems quite important, it seems that we assume that the distance used is symmetric. Otherwise, the way we build the core_distances would not work.
@Micky774 is the case from your understanding?

@Micky774 Micky774 merged commit 93f1896 into Micky774:HDBSCAN/clean_reachability Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants