-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
ENH Support Haversine distance in pairwise_distances #12568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jnothman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test of pairwise_distances(..., metric='haversine')
jnothman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. why do you regard this as WIP. is there a set of tasks you still hope to complete?
modify X = rng.rand(12, 2) delete unnecessary code lines
|
@jnothman I don't understand my CI error =( |
|
Could you please benchmark how this Haversine approach compares to #4458 ? |
|
@rth yeah. I have to see the test. Because I have some problems with it |
|
@jnothman I think that is ready. |
|
Are you going to report the benchmarks requested above by Roman? |
|
@eamanu Minor. An entry is needed here for completion ? scikit-learn/sklearn/metrics/pairwise.py Lines 1073 to 1091 in bbbefca
|
|
According to #12552 (comment) and #12552 (comment) I was working on this PR. Using the #4458 (comment) I added the I make little code to test the performance Using this PR I have the next measure: and using #4458 we have: in the same conditions. IMO could be interest use cython to calculate this distances or use code that exist on DistanceMetric. Regards! |
|
oops I closed the PR. Sorry! |
|
@whiletruelearn done you comment thanks! |
|
So this version appears a little faster. Continue using ipython's timeit
magic for quick benchmarks
|
sklearn/metrics/pairwise.py
Outdated
| >>> X = np.ones((5,2)) | ||
| >>> Y = np.full((5,2), 2.) | ||
| >>> haversine_distances(X, Y) | ||
| array([[0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a very helpful example. Better to illustrate usage without Y and with two distinct samples, for instance. It might be informative to use real geographical coordinates, although the dishwasher on earth is then only approximate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case we don't need three examples of usage here. One will suffice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sklearn/metrics/pairwise.py
Outdated
| [0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123], | ||
| [0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123]]) | ||
| """ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rm blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| def slow_haversine_distances(x, y): | ||
| diff_lat = y[0] - x[0] | ||
| diff_lon = y[1] - x[1] | ||
| a = np.sin(diff_lat / 2) ** 2 + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use parentheses rather than \ to continue lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
ping @jnothman :-) |
|
@eamanu If you cannot finish this PR I can take it over during the sklearn sprint that takes place this week. |
|
Hi @zanospi It's ok for me. I will a little busy in the next weeks. Thanks! |
|
Got a green tick. Reviews? |
rth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, a few minor comments are below.
sklearn/metrics/pairwise.py
Outdated
| The Haversine distance is the angular distance between two points on | ||
| the surface of a sphere. The first distance of each point is assumed | ||
| to be the latitude, the second is the longitude, given in radians. | ||
| The dimension of the points must be 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "dimension of the data"? A point doesn't have any dimensions by definition (it can be in an ND space though)...
sklearn/metrics/pairwise.py
Outdated
| def haversine_distances(X, Y=None): | ||
| """Compute the Haversine distance between samples in X and Y | ||
| The Haversine distance is the angular distance between two points on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Haversine (or great-circle) distance is ...
| if metric != 'haversine': | ||
| nn = neighbors.NearestNeighbors(n_neighbors=3, | ||
| algorithm='auto', | ||
| metric=metric).fit(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe factorize the estimator definition outside of the if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A very good nit.
| .. math:: | ||
| D(x, y) = 2\arcsin[\\sqrt{\\sin^2((x1 - y1) / 2) | ||
| + cos(x1)cos(y1)sin^2((x2 - y2) / 2)}] | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add
As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average.
to indicate that is not the exact distance between lat/long coordinates (but likely good enough for ML applications). The reference might be
"Sinnott, Roger W. (August 1984). "Virtues of the Haversine". Sky and Telescope. 68 (2): 159." but I was not able to find a PDF for it, and that sentence is adapted from https://en.wikipedia.org/wiki/Great-circle_distance
Yeah, sorry found more comments to add :) |
|
Okay. I might as well wrap it up. |
|
Thanks ! Also needs a "what's new" entry.. |
rth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
(The one CI failure is due to a conda timeout and can probably be ignored).
Do we need another review or should we merge?
|
I don't think I've done anything to the implementation, only made
tests work, so I'm happy to not have another review.
|
…learn#12568)" This reverts commit e0033d9.
…learn#12568)" This reverts commit e0033d9.
Reference Issues/PRs
Fixes #12552. Resolves #4453. Resolves #4458.
What does this implement/fix? Explain your changes.
This PR add Haversine distance to NearestNeighbors using DistanceMetric.get_metric('haversine').pairwise()
This PR can be use to doing this for all metrics not directly implemented in metrics.pairwise (See: #12552 (comment))
Any other comments?