add avx512 support for svm kernel #15962
Conversation
Thanks for the investigation.
Looks like that's the vector dot product. I wonder if we could do it with BLAS, the advantage being that it would also work with AVX2 or AVX512, with the CPU capability detected at runtime (and so would be accessible to all current users). With the current PR one would need to explicitly compile for AVX512, and few users would do that. We use BLAS via the scipy BLAS Cython API, and I imagine there should be a way to link against the appropriate BLAS function from C++.
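As an illustration of this suggestion, here is a minimal sketch of what calling BLAS for the dot product could look like on the C++ side. It assumes the dense representation where the samples are contiguous double arrays and that a CBLAS header is available; in scikit-learn the actual function would come from scipy's Cython BLAS bindings rather than from cblas.h, so this is only a sketch of the idea:

```cpp
// Sketch only: replacing a hand-written dot loop with a BLAS dot product.
// cblas_ddot is used here purely for illustration; an optimized BLAS
// library selects SSE/AVX2/AVX512 code paths at runtime based on the CPU.
#include <cblas.h>

static double dot_blas(const double *px, const double *py, int n)
{
    return cblas_ddot(n, px, 1, py, 1);  // sum_i px[i] * py[i]
}
```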
@rth Thanks for your advice, but I'm a bit confused by your suggestion. Scikit-learn uses a Cython interface to train the SVM, whereas my code lives inside the C function part, so I cannot invoke the scipy BLAS Cython API to replace that code directly. And if I applied my optimization techniques to a BLAS implementation instead, it would not improve performance for our scikit-learn use case.
New comments, please check.
Thanks for your work on optimizing the SVM module.
I understand this; I'm just saying that the dot product is typically BLAS functionality, and we don't have that many people able to confidently review C code, let alone AVX512 intrinsics. Furthermore, writing AVX512 code without run-time detection is of limited use, since most users don't build packages from source (with custom compile flags) and won't benefit from it. So I think it would be beneficial to use BLAS instead if possible.
Overall the concern is maintainability and readability of the C code. Also, we don't have machines with AVX512 support in CI, so we are currently not able to test this properly in CI...
There's a way to invoke the scipy BLAS from C. We do that for liblinear, for instance. You need to pass a pointer to the BLAS function down to the C code.
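A hypothetical sketch of what passing such a pointer through the libsvm kernel code might look like, similar in spirit to what is done for liblinear. The names here (ddot_func, the simplified Kernel class) are illustrative, not the actual svm.cpp code; the typedef mirrors the Fortran-style ddot signature exposed by scipy.linalg.cython_blas:

```cpp
// Hypothetical sketch (not this PR's code) of threading a BLAS dot-product
// pointer through a simplified Kernel class, as is done for liblinear.
typedef double (*ddot_func)(int *n, double *x, int *incx,
                            double *y, int *incy);

class Kernel {
public:
    // The function pointer would be obtained on the Cython side from
    // scipy.linalg.cython_blas and passed down when the kernel is built.
    explicit Kernel(ddot_func ddot) : ddot_(ddot) {}

    double dot(double *px, double *py, int n) const
    {
        int one = 1;
        return ddot_(&n, px, &one, py, &one);  // sum_i px[i] * py[i]
    }

private:
    ddot_func ddot_;
};
```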
Thanks for your advice. @rth @jeremiedbb
The invoking chain would have to be modified accordingly, and all the initialization of class Kernel and class SVC_Q would need to change as well. Although it's similar to what was done in liblinear.cpp, I don't know whether such a change would be accepted upstream.
Sounds good!
According to the changelog, they are as of v0.3.3.
In my understanding it uses the CPUID instruction to detect the features supported by the CPU (e.g. here in OpenBLAS), then uses the best kernel for that architecture. I'm not familiar enough with OpenBLAS internals to say how well it works in practice. Naively, I would say that just calling the BLAS function on the user side should be enough and it would take care of the dispatch (a minimal sketch of such a run-time check follows this comment).
I'm not sure if you mean scikit-learn upstream or libsvm upstream. The versions of liblinear and libsvm included in scikit-learn have unfortunately diverged significantly from their upstream, and at present there is little hope we would be able to contribute the changes back there. As for the suggested changes in scikit-learn, if you need to modify a few function signatures to pass BLAS function pointers around, I don't see an issue with it, particularly if that is similar to what was done in liblinear. All of this is private API from the scikit-learn perspective.
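As an aside, here is a minimal sketch of the kind of run-time feature check such dispatch relies on, using the GCC/Clang builtin __builtin_cpu_supports. This is only an illustration of run-time detection, not code from this PR or from OpenBLAS, and dot_avx512 is a hypothetical AVX512 implementation assumed to exist elsewhere:

```cpp
// Illustration only: run-time selection between an AVX512 kernel and a
// scalar fallback. OpenBLAS does something conceptually similar with CPUID
// when the library is initialized.
double dot_avx512(const double *x, const double *y, int n);  // hypothetical

double dot_dispatch(const double *x, const double *y, int n)
{
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx512f"))
        return dot_avx512(x, y, n);    // AVX512 path, chosen at run time
#endif
    double s = 0.0;
    for (int i = 0; i < n; ++i)        // portable scalar fallback
        s += x[i] * y[i];
    return s;
}
```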
What does this implement/fix? Explain your changes.
With the help of perf and program instrumentation, we found that the hot spot of the SVM workflow is in the kernel functions. Further investigation showed that the kernels heavily use a function named dot, plus a similar piece of code in k_function. The dot function takes two arrays and sums up their element-wise products. To accelerate it, we replace the old scalar code with AVX512 instructions such as FMA (fused multiply-add) and add a data-alignment step for the AVX512 code (a minimal sketch of such a dot product is given after the configuration below). The overall speedup is around 40% on our CLX (Cascade Lake) machine on the MLpack benchmark. The detailed configuration is as follows.
command: python3 run.py -l SCIKIT -m SVM -c svm.yaml
datasets: ['datasets/webpage_train.csv', 'datasets/webpage_test.csv', 'datasets/webpage_labels.csv']
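For readers, here is a minimal sketch of an AVX512 FMA dot product of the kind described above. It is not the exact code from this PR: it uses unaligned loads rather than the alignment handling mentioned, and it requires compiling with AVX512F support (e.g. -mavx512f):

```cpp
// Minimal sketch of an AVX512 dot product using FMA, with a scalar loop
// for the remainder elements. Not the exact implementation from this PR.
#include <immintrin.h>

double dot_avx512(const double *x, const double *y, int n)
{
    __m512d acc = _mm512_setzero_pd();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);
        __m512d vy = _mm512_loadu_pd(y + i);
        acc = _mm512_fmadd_pd(vx, vy, acc);   // acc += vx * vy (fused)
    }
    double s = _mm512_reduce_add_pd(acc);     // horizontal sum of 8 lanes
    for (; i < n; ++i)                        // scalar tail
        s += x[i] * y[i];
    return s;
}
```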