Conversation

@jim0421
Contributor

@jim0421 jim0421 commented Dec 24, 2019

What does this implement/fix? Explain your changes.

Using perf and program instrumentation, we found that the hot spot of the SVM workflow is in the kernel functions. Further investigation showed that the kernels heavily use a function named dot, plus a similar loop in k_function. The dot function takes two arrays and sums their element-wise products. To accelerate it, we replace the scalar loop with AVX512 instructions such as FMA (fused multiply-add) and add a data-alignment part for profiling the AVX512 code. The overall speed-up is around 40% on our CLX machine on the MLpack benchmark. The detailed configuration is as follows.
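For illustration, the AVX512 path of dot looks roughly like the sketch below. This is a simplified standalone version, not the exact patch: it assumes dense double arrays and leaves out the alignment handling mentioned above.

#include <immintrin.h>

// Simplified sketch: AVX512 dot product using FMA, assuming x and y are
// dense double arrays of length >= dim (compiled only when __AVX512F__ is set).
double dot_avx512(const double *x, const double *y, int dim)
{
        __m512d acc = _mm512_setzero_pd();
        int i = 0;
        // 8 doubles per iteration: acc += x[i..i+7] * y[i..i+7]
        for (; i + 8 <= dim; i += 8) {
                __m512d vx = _mm512_loadu_pd(x + i);
                __m512d vy = _mm512_loadu_pd(y + i);
                acc = _mm512_fmadd_pd(vx, vy, acc);
        }
        double sum = _mm512_reduce_add_pd(acc);
        // scalar tail for the remaining dim % 8 elements
        for (; i < dim; i++)
                sum += x[i] * y[i];
        return sum;
}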

@rth
Member

rth commented Dec 24, 2019

Thanks for the investigation.

The dot function takes two arrays and sums their element-wise products.

Looks like that's the vector dot product. I wonder if we could do it with BLAS, the advantage being that it would also work with AVX2 or AVX512, with CPU capabilities detected at runtime (and so would be accessible to all current users). With the current PR one would need to explicitly compile for AVX512, and few users would do that. We use BLAS via the scipy BLAS Cython API, and I imagine there should be a way to link against the appropriate BLAS function from C++.

@jim0421
Contributor Author

jim0421 commented Jan 10, 2020

@rth Thanks for your advice, but I'm a bit confused by your suggestion. Scikit-learn uses a Cython interface to train the SVM, but my change is inside the C++ code, so I cannot invoke the scipy BLAS Cython API to replace it directly.

// this is the cython interface
/scikit-learn-0.21.3/sklearn/svm/libsvm.pyx
    cdef int fit_status = 0
    with nogil:
        model = svm_train(&problem, &param, &fit_status)

// this is the place I want to modify
/scikit-learn-0.21.3/sklearn/svm/src/libsvm/svm.cpp
double Kernel::dot(const PREFIX(node) &px, const PREFIX(node) &py)
{
        double sum = 0;

        int dim = min(px.dim, py.dim);
        //printf("dim2 is %d\nsize is %d\n", dim, sizeof(px.values[0]));
#if defined(__AVX512F__)
......
// there is a call chain between dot and svm_train in svm.cpp

So if I apply my optimizations to a BLAS implementation, they cannot improve performance in our scikit-learn use case.

@jim0421
Contributor Author

jim0421 commented Jan 14, 2020

Looks like that's the vector dot product. I wonder if we could do it with BLAS ... (quoting @rth above)

New comments, please check.

@rth
Member

rth commented Jan 14, 2020

Thanks for your work on optimizing the SVM module.

Scikit-learn uses a Cython interface to train the SVM, but my change is inside the C++ code, so I cannot invoke the scipy BLAS Cython API to replace it directly.

I understand this; I'm just saying that the dot product is typically BLAS functionality, and we don't have that many people able to confidently review C code, let alone AVX512 intrinsics. Furthermore, writing AVX512 code without runtime detection is of limited use, since most users don't build packages from source (with custom compile flags) and won't benefit from it.

So I think it would be beneficial to use BLAS instead if possible:

  • by using the one bundled with scipy. It is possible to call Cython functions from C, although I'm not sure whether that would involve initializing the Python interpreter, which is not great.
  • alternatively, by bundling some of the OpenBLAS or BLIS routines under sklearn/externals/ as we used to do for the reference BLAS. That is also sub-optimal, but at least we won't have to maintain an AVX512 implementation of the dot product in the long run. Given the issues OpenBLAS had with AVX512 kernels (Calculation goes off the rails with AVX-512 kernels OpenMathLib/OpenBLAS#2029), I think it is better to leave those to specialized numerical libraries when possible.
  • maybe there are better things to try, not sure. cc @jeremiedbb

Overall the concern is maintainability and readability of the C code. Also, we don't have machines with AVX512 support in CI, so we are currently not able to test this properly...

@jeremiedbb
Member

There's a way to invoke the scipy BLAS from C. We do that for liblinear, for instance. You need to pass a pointer to the BLAS function to the Kernel::dot method. There are already helpers to do that in liblinear, next to libsvm. I think it would make more sense than writing specialized code because, as @rth said, the BLAS will likely have runtime CPU detection.
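Roughly along the lines of the sketch below, just to illustrate the idea (the dot_func signature is hypothetical, modeled on the liblinear helpers; the actual scikit-learn typedefs may differ):

// Hypothetical helper header shared by the Cython wrapper and svm.cpp.
// The Cython side fills `dot` with the ddot pointer obtained from
// scipy.linalg.cython_blas, so CPU-specific kernel selection is left
// to the BLAS at runtime.
typedef double (*dot_func)(int n, const double *x, int incx,
                           const double *y, int incy);

struct BlasFunctions {
        dot_func dot;
};

// svm.cpp then calls through the pointer instead of the hand-rolled loop:
//   sum = blas_functions->dot(dim, px->values, 1, py->values, 1);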

@jim0421
Contributor Author

jim0421 commented Jan 19, 2020

Thanks for your advice. @rth @jeremiedbb
As you've mentioned, it sounds better to use the scipy cython_blas API. I think the work can be divided into two parts.

  • The first part is to optimize the BLAS dot function in the OpenBLAS that scipy is built with. I'm not sure whether the sdot and ddot functions in OpenBLAS have been optimized, but probably yes; I will check later. Another question is how OpenBLAS or scipy supports runtime CPU detection, as you've said. Can you explain the process in more detail?
  • The second part is to invoke the scipy interface within the svm.cpp code. A pointer to the dot function has to be passed down, which involves modifying the call chain to the dot function. The details are as follows.

svm.cpp

//
// Interface functions
//
PREFIX(model) *PREFIX(train)(const PREFIX(problem) *prob, const svm_parameter *param,
        int *status, BlasFunctions *blas_functions)

static decision_function svm_train_one(
        const PREFIX(problem) *prob, const svm_parameter *param,
        double Cp, double Cn, int *status, BlasFunctions *blas_functions)

static void solve_c_svc(
        const PREFIX(problem) *prob, const svm_parameter* param,
        double *alpha, Solver::SolutionInfo* si, double Cp, double Cn, BlasFunctions *blas_functions)

        Solver s;
        s.Solve(l, SVC_Q(*prob,*param,y,blas_functions), minus_ones, y,
                alpha, C, param->eps, si, param->shrinking,
                param->max_iter);

        SVC_Q(const PREFIX(problem)& prob, const svm_parameter& param, const schar *y_, BlasFunctions *blas_functions)
        :Kernel(prob.l, prob.x, param, blas_functions)

double Kernel::dot(const PREFIX(node) *px, const PREFIX(node) *py)
{
        double sum = 0;

        int dim = min(px->dim, py->dim);
        //printf("dim is %d\n", dim);
        //for (int i = 0; i < dim; i++)
        //      sum += (px->values)[i] * (py->values)[i];
        sum = blas->dot(dim, px->values, 1, py->values, 1);
        return sum;
}

The call chain would have to be modified like this, and all the constructors of class Kernel and class SVC_Q would have to be updated as well. Though it's similar to what you've done in liblinear.cpp, I don't know whether it will be approved by upstream if I make the change.
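For Kernel::dot to reach the pointer, I imagine Kernel would also keep it as a member, something like the rough sketch below (the member name blas is just what I used above, and the rest of the constructor is elided):

class Kernel {
public:
        Kernel(int l, PREFIX(node) * const * x, const svm_parameter &param,
               BlasFunctions *blas_functions)
                : blas(blas_functions)
        {
                // ... existing kernel initialization ...
        }

        double dot(const PREFIX(node) *px, const PREFIX(node) *py) const
        {
                int dim = min(px->dim, py->dim);
                // delegate to the BLAS dot product instead of the manual loop
                return blas->dot(dim, px->values, 1, py->values, 1);
        }

private:
        BlasFunctions *blas;
};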

@rth
Member

rth commented Jan 19, 2020

Sounds good!

I'm not sure whether the sdot and ddot functions in OpenBLAS have been optimized, but probably yes

According to the changelog, they are as of v0.3.3:

https://github.com/xianyi/OpenBLAS/blob/303869f5724bb86d722bc32f254a976625ea2046/Changelog.txt#L245

x86_64:

  • added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY,
    DSCAL, DGEMVN and DSYMVL

Another question is how OpenBLAS or scipy supports runtime CPU detection, as you've said. Can you explain the process in more detail?

In my understanding it uses the CPUID instruction to detect the features supported by the CPU (e.g. here in OpenBLAS) and then uses the best kernel for the architecture. I'm not familiar enough with OpenBLAS internals to say how well it works in practice. Naively, I would say just calling the BLAS function on the user side should be enough, and it would take care of it.
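Just to illustrate the general pattern (not OpenBLAS's actual code), runtime dispatch boils down to something like the sketch below; OpenBLAS does the CPUID query itself, here I use a compiler builtin for brevity. In practice the AVX512 kernel would live in its own translation unit (or use a target attribute) compiled with the right flags.

// Illustration only: pick an implementation once at runtime.
// dot_avx512 refers to the AVX512 sketch earlier in this thread.
#if defined(__AVX512F__)
double dot_avx512(const double *x, const double *y, int dim);
#endif

static double dot_scalar(const double *x, const double *y, int dim)
{
        double s = 0;
        for (int i = 0; i < dim; i++)
                s += x[i] * y[i];
        return s;
}

double dot_dispatch(const double *x, const double *y, int dim)
{
#if defined(__AVX512F__)
        // GCC/Clang builtin; queries CPUID under the hood
        if (__builtin_cpu_supports("avx512f"))
                return dot_avx512(x, y, dim);
#endif
        return dot_scalar(x, y, dim);
}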

Though it's similar to what you've done in liblinear.cpp, I don't know whether it will be approved by upstream if I make the change.

I'm not sure if you mean scikit-learn upstream or libsvm upstream. The versions of liblinear and libsvm included in scikit-learn have unfortunately diverged significantly from their upstreams, and at present there is little hope we would be able to contribute the changes back.

As to the suggested changes in scikit-learn: if you need to modify a few function signatures to pass BLAS function pointers around, I don't see an issue with it, particularly if that is similar to what was done in liblinear. All of this is private API from the scikit-learn perspective.

@rth
Member

rth commented Aug 3, 2020

Closing in favor of #16530. Thanks again @jim0421!

@rth rth closed this Aug 3, 2020