Skip to content

Conversation

@eamanu
Copy link
Contributor

@eamanu eamanu commented Nov 12, 2018

Reference Issues/PRs

Fixes #12552. Resolves #4453. Resolves #4458.

What does this implement/fix? Explain your changes.

This PR add Haversine distance to NearestNeighbors using DistanceMetric.get_metric('haversine').pairwise()

This PR can be use to doing this for all metrics not directly implemented in metrics.pairwise (See: #12552 (comment))

Any other comments?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test of pairwise_distances(..., metric='haversine')

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. why do you regard this as WIP. is there a set of tasks you still hope to complete?

modify X = rng.rand(12, 2)
delete unnecessary code lines
@eamanu eamanu changed the title [WIP] Fix Support Haversine distance in NearestNeighbors issue [MRG] Fix Support Haversine distance in NearestNeighbors issue Nov 16, 2018
@eamanu
Copy link
Contributor Author

eamanu commented Nov 16, 2018

@jnothman I don't understand my CI error =(

@rth
Copy link
Member

rth commented Nov 16, 2018

Could you please benchmark how this Haversine approach compares to #4458 ?

@eamanu
Copy link
Contributor Author

eamanu commented Nov 16, 2018

@rth yeah. I have to see the test. Because I have some problems with it

@eamanu
Copy link
Contributor Author

eamanu commented Nov 18, 2018

@jnothman I think that is ready.

@jnothman
Copy link
Member

Are you going to report the benchmarks requested above by Roman?

@whiletruelearn
Copy link
Contributor

@eamanu Minor. An entry is needed here for completion ?

def distance_metrics():
"""Valid metrics for pairwise_distances.
This function simply returns the valid pairwise distance metrics.
It exists to allow for a description of the mapping for
each of the valid strings.
The valid distance metrics, and the function they map to, are:
============ ====================================
metric Function
============ ====================================
'cityblock' metrics.pairwise.manhattan_distances
'cosine' metrics.pairwise.cosine_distances
'euclidean' metrics.pairwise.euclidean_distances
'l1' metrics.pairwise.manhattan_distances
'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
============ ====================================

@eamanu
Copy link
Contributor Author

eamanu commented Nov 20, 2018

Hello @rth @jnothman !

According to #12552 (comment) and #12552 (comment) I was working on this PR. Using the #4458 (comment) I added the istanceMetric.get_metric('haversine').pairwise method to metric.pairwise. This way we avoid duplicate code. Other important point is using this PR we can have more performance becasue DistanceMetric is written on Cython.

I make little code to test the performance

(sklearn) eamanu@eamanu:~/dev/scikit-learn$ cat test12552.py
from sklearn.metrics.pairwise import haversine_distances
import numpy as np

X = np.random.random((10000, 2))

result = haversine_distances(X,X)

Using this PR I have the next measure:

         394800 function calls (385299 primitive calls) in 13.313 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    809/1    0.063    0.000   13.315   13.315 {built-in method builtins.exec}
        1    0.000    0.000   13.315   13.315 test12552.py:1(<module>)
        1    0.000    0.000   11.709   11.709 pairwise.py:444(haversine_distances)
        1   11.679   11.679   11.679   11.679 {method 'pairwise' of 'sklearn.neighbors.dist_metrics.DistanceMetric' objects}
        4    0.001    0.000    1.837    0.459 __init__.py:4(<module>)
    724/2    0.007    0.000    1.635    0.818 <frozen importlib._bootstrap>:966(_find_and_load)
    724/2    0.005    0.000    1.635    0.818 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    665/3    0.005    0.000    1.635    0.545 <frozen importlib._bootstrap>:651(_load_unlocked)
    558/3    0.003    0.000    1.635    0.545 <frozen importlib._bootstrap_external>:672(exec_module)
    924/2    0.001    0.000    1.634    0.817 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    461/1    0.001    0.000    1.605    1.605 {built-in method builtins.__import__}
       29    0.001    0.000    0.979    0.034 __init__.py:1(<module>)
        1    0.000    0.000    0.976    0.976 ranking.py:8(<module>)
2995/1227    0.004    0.000    0.862    0.001 <frozen importlib._bootstrap>:997(_handle_fromlist)
      558    0.008    0.000    0.717    0.001 <frozen importlib._bootstrap_external>:743(get_code)
        1    0.000    0.000    0.680    0.680 _function_transformer.py:1(<module>)
        1    0.000    0.000    0.680    0.680 testing.py:1(<module>)
      558    0.017    0.000    0.582    0.001 <frozen importlib._bootstrap_external>:830(get_data)
      558    0.565    0.001    0.565    0.001 {method 'read' of '_io.FileIO' objects}
        1    0.000    0.000    0.564    0.564 __init__.py:14(<module>)
        5    0.000    0.000    0.556    0.111 base.py:1(<module>)
        1    0.000    0.000    0.457    0.457 pytest.py:4(<module>)
  664/639    0.002    0.000    0.335    0.001 <frozen importlib._bootstrap>:564(module_from_spec)
    86/72    0.000    0.000    0.307    0.004 <frozen importlib._bootstrap_external>:919(create_module)
    86/72    0.291    0.003    0.306    0.004 {built-in method _imp.create_dynamic}
...

and using #4458 we have:

         392014 function calls (382600 primitive calls) in 104.730 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    800/1    0.124    0.000  104.734  104.734 {built-in method builtins.exec}
        1    0.000    0.000  104.734  104.734 test12552.py:1(<module>)
    705/1    0.015    0.000   63.475   63.475 <frozen importlib._bootstrap>:966(_find_and_load)
    705/1    0.011    0.000   63.475   63.475 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    907/1    0.002    0.000   63.475   63.475 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    461/1    0.003    0.000   63.475   63.475 {built-in method builtins.__import__}
    652/2    0.013    0.000   63.474   31.737 <frozen importlib._bootstrap>:651(_load_unlocked)
    549/2    0.007    0.000   63.474   31.737 <frozen importlib._bootstrap_external>:672(exec_module)
2925/1157    0.011    0.000   43.158    0.037 <frozen importlib._bootstrap>:997(_handle_fromlist)
        1   37.602   37.602   41.258   41.258 pairwise.py:444(haversine_distances)
        3    0.001    0.000   36.459   12.153 __init__.py:4(<module>)
        1    0.000    0.000   34.235   34.235 __init__.py:14(<module>)
        4    0.000    0.000   33.752    8.438 base.py:1(<module>)
      549    0.020    0.000   33.690    0.061 <frozen importlib._bootstrap_external>:743(get_code)
      549    2.501    0.005   33.400    0.061 <frozen importlib._bootstrap_external>:830(get_data)
      549   30.899    0.056   30.899    0.056 {method 'read' of '_io.FileIO' objects}
       29    0.002    0.000   28.443    0.981 __init__.py:1(<module>)
        1    0.000    0.000   28.356   28.356 ranking.py:8(<module>)
  651/628    0.005    0.000   24.759    0.039 <frozen importlib._bootstrap>:564(module_from_spec)
    82/70    0.001    0.000   24.515    0.350 <frozen importlib._bootstrap_external>:919(create_module)
    82/70   23.565    0.287   24.513    0.350 {built-in method _imp.create_dynamic}
        1    0.000    0.000   20.331   20.331 __init__.py:342(<module>)

in the same conditions.

IMO could be interest use cython to calculate this distances or use code that exist on DistanceMetric.

Regards!

@eamanu eamanu closed this Nov 20, 2018
@eamanu
Copy link
Contributor Author

eamanu commented Nov 20, 2018

oops I closed the PR. Sorry!

@eamanu eamanu reopened this Nov 20, 2018
@eamanu
Copy link
Contributor Author

eamanu commented Nov 20, 2018

@whiletruelearn done you comment thanks!

@jnothman
Copy link
Member

jnothman commented Nov 20, 2018 via email

>>> X = np.ones((5,2))
>>> Y = np.full((5,2), 2.)
>>> haversine_distances(X, Y)
array([[0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a very helpful example. Better to illustrate usage without Y and with two distinct samples, for instance. It might be informative to use real geographical coordinates, although the dishwasher on earth is then only approximate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case we don't need three examples of usage here. One will suffice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

[0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123],
[0.87152123, 0.87152123, 0.87152123, 0.87152123, 0.87152123]])
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rm blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def slow_haversine_distances(x, y):
diff_lat = y[0] - x[0]
diff_lon = y[1] - x[1]
a = np.sin(diff_lat / 2) ** 2 + \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use parentheses rather than \ to continue lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@eamanu
Copy link
Contributor Author

eamanu commented Dec 29, 2018

ping @jnothman :-)

@rmenuet
Copy link
Contributor

rmenuet commented Feb 24, 2019

@eamanu If you cannot finish this PR I can take it over during the sklearn sprint that takes place this week.
If you prefer, I can also PR your branch.

@eamanu
Copy link
Contributor Author

eamanu commented Feb 24, 2019

Hi @zanospi It's ok for me. I will a little busy in the next weeks. Thanks!

@rmenuet rmenuet mentioned this pull request Feb 26, 2019
@jnothman
Copy link
Member

@rth does this comment of yours stand?

Thanks @eamanu ! Minor comments below but otherwise looks good.

If so, can you Approve and merge? Thanks

@jnothman jnothman closed this Apr 10, 2019
@jnothman jnothman reopened this Apr 10, 2019
@jnothman
Copy link
Member

Got a green tick. Reviews?

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few minor comments are below.

The Haversine distance is the angular distance between two points on
the surface of a sphere. The first distance of each point is assumed
to be the latitude, the second is the longitude, given in radians.
The dimension of the points must be 2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "dimension of the data"? A point doesn't have any dimensions by definition (it can be in an ND space though)...

def haversine_distances(X, Y=None):
"""Compute the Haversine distance between samples in X and Y
The Haversine distance is the angular distance between two points on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Haversine (or great-circle) distance is ...

if metric != 'haversine':
nn = neighbors.NearestNeighbors(n_neighbors=3,
algorithm='auto',
metric=metric).fit(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe factorize the estimator definition outside of the if?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very good nit.

.. math::
D(x, y) = 2\arcsin[\\sqrt{\\sin^2((x1 - y1) / 2)
+ cos(x1)cos(y1)sin^2((x2 - y2) / 2)}]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add

As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average.

to indicate that is not the exact distance between lat/long coordinates (but likely good enough for ML applications). The reference might be
"Sinnott, Roger W. (August 1984). "Virtues of the Haversine". Sky and Telescope. 68 (2): 159." but I was not able to find a PDF for it, and that sentence is adapted from https://en.wikipedia.org/wiki/Great-circle_distance

@rth
Copy link
Member

rth commented Apr 24, 2019

@rth does this comment of yours stand?

Yeah, sorry found more comments to add :)

@jnothman
Copy link
Member

Okay. I might as well wrap it up.

@rth
Copy link
Member

rth commented Apr 24, 2019

Thanks ! Also needs a "what's new" entry..

@jnothman jnothman changed the title [MRG] Fix Support Haversine distance in NearestNeighbors issue ENH Support Haversine distance in pairwise_distances Apr 24, 2019
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

(The one CI failure is due to a conda timeout and can probably be ignored).

Do we need another review or should we merge?

@jnothman
Copy link
Member

jnothman commented Apr 24, 2019 via email

@rth rth merged commit d88fec8 into scikit-learn:master Apr 24, 2019
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Haversine distance in NearestNeighbors Implement haversine metric in pairwise, document properly

5 participants