Description
Motivation
SIMD intrinsics can accelerate pairwise distance computation by factors of ~2.5-3.5x for float64 data and ~5-6x for float32 data (benchmarked by this gist: https://gist.github.com/Micky774/bd1b8394fdaa82b25dcdfc111835c19b).
These benefits translate effectively into computation-bound estimators, such as KNeighborsRegressor (based on #26267).
Alternatives Considered
As discussed in #26010 and Micky774#11, there is a significant preference towards avoiding implementing SIMD-based solutions within scikit-learn at this time. While I do believe that there is a reasonable way to maintain such work (at least up to SSE3 instructions), a better-accepted solution is to create a plug-in for DistanceMetric and offer the SIMD-accelerated implementations as an engine. This is indeed a good solution in the long run, but much work remains to be done on the plug-in API (#22438). Working on a separate engine/plug-in for DistanceMetric while the API is still being solidified and #25535 is still unmerged would probably do more harm than good, adding one more moving part to the mix and slowing down the review process.
Suggested Solution
Allow users to pass instances of DistanceMetric directly to metric keyword arguments. This is backwards compatible and doesn't require any significant new infrastructure (mainly small changes to validation and updated docs/tests). This enables third-party libraries to provide their own accelerated solutions immediately.
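The validation change described above can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyMetric`, `ToyEuclidean`, `ToyManhattan`, `resolve_metric`, and `nearest_index` are hypothetical stand-ins, not scikit-learn's actual classes or validation code.

```python
import math
from abc import ABC, abstractmethod


class ToyMetric(ABC):
    """Hypothetical stand-in for DistanceMetric (not the scikit-learn class)."""

    @abstractmethod
    def dist(self, x, y):
        """Return the distance between two equal-length sequences."""


class ToyEuclidean(ToyMetric):
    def dist(self, x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class ToyManhattan(ToyMetric):
    def dist(self, x, y):
        return sum(abs(a - b) for a, b in zip(x, y))


# Built-in metrics that can be requested by name.
_REGISTRY = {"euclidean": ToyEuclidean, "manhattan": ToyManhattan}


def resolve_metric(metric):
    """Accept a metric name (existing behaviour) or a metric instance (proposed)."""
    if isinstance(metric, ToyMetric):
        # Instances, including third-party subclasses, pass straight through.
        return metric
    if isinstance(metric, str):
        try:
            return _REGISTRY[metric]()
        except KeyError:
            raise ValueError(f"Unknown metric: {metric!r}") from None
    raise TypeError("metric must be a str or a ToyMetric instance")


def nearest_index(query, points, metric="euclidean"):
    """Index of the point closest to `query` (a tiny ArgKmin-like helper, k=1)."""
    m = resolve_metric(metric)
    return min(range(len(points)), key=lambda i: m.dist(query, points[i]))
```

String arguments keep working as before, so the change is backwards compatible. A third-party library would only need to subclass the metric base class and pass an instance, for example `nearest_index(q, pts, metric=ToyManhattan())`.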
In practice, this involves changes mainly in the following:
- ArgKmin
- RadiusNeighbors
- ArgKminClassMode
- pairwise_distances
- pairwise_distances_argmin
This will allow us to enable the functionality in parts of the following estimators (non-exhaustive):
- NearestNeighbors
- KNeighborsRegressor
- KNeighborsClassifier
- RadiusNeighborsRegressor
- RadiusNeighborsClassifier
- DBSCAN
- OPTICS
- Isomap
- TSNE (self.method != "exact")
- KernelDensity
- AffinityPropagation
- Birch
- MeanShift
- NearestCentroid
Notes:
- pairwise_distances can't actually use the DistanceMetric in its current state; however, once "FEA Introduce PairwiseDistances, a generic back-end for pairwise_distances" (#25561) is completed, it can benefit from accelerated DistanceMetric options as well.
- Currently, {KD, Ball}Tree support passing DistanceMetric through the metric argument, but do not support DistanceMetric32 (see: "ENH Add float32 implementations for BallTree and KDTree", #25914).
Implementation
I have a sample implementation of this for KNeighborsRegressor, which is achieved by enabling this functionality for ArgKmin along with updating parameter validation in NeighborsBase; please see #26267.



