Implementation - Padé Approximant in Log Softmax Layer #3685
Conversation
shrit left a comment:
Would you use mlpack style for variables?
@shrit, thank you for pointing that out. The new commit includes the fix :)
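For context, mlpack's style uses camelCase for variable names (the later snippets in this thread follow that convention as well). A hypothetical before/after, with made-up variable names:

```c++
// Hypothetical illustration only: mlpack-style variables are camelCase.
double pade_numerator = 24.0;  // not mlpack style
double padeNumerator = 24.0;   // mlpack style
```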
I approved this one too quickly; I did not see that the tests were not passing.
rcurtin left a comment:
Just preventing mlpack-bot from auto-approving until we get the fixes worked out. I'm guessing that the level of approximation might be too high, and things are not converging? (Maybe a threshold like x < 13 is needed?)
```c++
};

output.transform([padeApproximant](double x) {
  return padeApproximant(x);
```
I think it would be a bit cleaner to just inline the whole approximant into the lambda, but, up to you.
@shrit @rcurtin Sorry for the delay with the benchmarks; I needed to run a more detailed analysis to find an effective solution.

Benchmark plots (omitted): Mnist Simple; New Implementation (Scale 4, x < 13.0).

The initial idea of adding only [...]. Despite the graph showing a seemingly doubled duration in runtime, the actual difference in the cnn run is minor. Could this improvement be a viable option for implementation? What do you think?

```c++
auto scaledPadeApproxExpMinusX = [](double x) {
  if (x < 13.0) {
    double s = 4.0;
    double xs = x / s;
    double numerator = 24 - 12*xs + 4*xs*xs - xs*xs*xs;
    double denominator = 24 + 12*xs + 4*xs*xs + xs*xs*xs;
    double pade = numerator / denominator;
    return std::pow(pade, s);
  }
  return 0.0;
};

output.transform([scaledPadeApproxExpMinusX](double x) {
  return scaledPadeApproxExpMinusX(x);
});
```

I think I will also test the algorithm on mnist_cnn soon.
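Written out, the snippet above evaluates the following piecewise approximation of $e^{-x}$, with scale $s = 4$ and $u = x/s$:

$$
e^{-x} \approx
\begin{cases}
\left(\dfrac{24 - 12u + 4u^{2} - u^{3}}{24 + 12u + 4u^{2} + u^{3}}\right)^{s}, & x < 13,\\[2ex]
0, & x \ge 13.
\end{cases}
$$

Raising the rational approximation of $e^{-u}$ to the power $s$ relies on the identity $(e^{-x/s})^{s} = e^{-x}$, which is the scaling trick referred to later in the thread.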
@MarkFischinger give it a try. What I find weird is that when we tested this separately, the time was way faster, while here it looks much slower than the original one.
Hey @shrit, I think the trouble we're seeing comes from the higher [...]. Here are the [...]. I'm thinking, since lower [...].
Yeah, a switch to the existing implementation at about [...]. The scaling trick is definitely a good one for convergence, but I suspect the [...].
@rcurtin I did some backtesting, and the results unfortunately showed no or only minor improvements in runtime. Statistically, combining those two algorithms should reduce the error, but I'm still looking for faster implementations because I'm hopeful that I can find a better solution. I'll update you as soon as possible, though my available time will be limited for the next few days due to exams :/
I also wonder if it would be possible to get additional speedup by writing a loop that the compiler can autovectorize. I doubt that [...].
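As a rough sketch of the kind of loop meant here (not the PR's implementation; the function name is made up and the constants $s = 4$ and the 13.0 cutoff are reused from the snippet earlier in this thread), a flat loop over contiguous memory with no lambda call in the body gives the compiler a chance to autovectorize:

```c++
#include <armadillo>

// Sketch only: apply the scaled Padé approximation of exp(-x) in place,
// written as a plain loop over contiguous memory so the compiler can
// consider autovectorization.
void ScaledPadeExpMinusXInPlace(arma::mat& output)
{
  double* mem = output.memptr();
  const arma::uword n = output.n_elem;
  for (arma::uword i = 0; i < n; ++i)
  {
    const double x = mem[i];
    const double u = x / 4.0;
    const double num = 24 - 12 * u + 4 * u * u - u * u * u;
    const double den = 24 + 12 * u + 4 * u * u + u * u * u;
    const double p = num / den;
    // pow(p, 4) written as repeated multiplication; a simple select that
    // compilers can usually turn into a blend instead of a branch.
    const double p2 = p * p;
    mem[i] = (x < 13.0) ? p2 * p2 : 0.0;
  }
}
```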
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
@shrit @rcurtin I'm adding targeted OpenMP parallelization to the Padé approximant computation. I've applied parallel processing to improve the performance of this approximant without the fast approximation function, which doesn't gain from parallel execution here. Here are the benchmarks of the commits (plots omitted). They have an almost identical error, so that should not be an issue. I'm also working on and testing one more optimization :)
Nice, the numbers look good. 👍 Two quick checks: [...]
@rcurtin You're right, the default fast approx with OpenMP is quicker. I've gone ahead and committed the OpenMP-ized fast approx version :)
rcurtin left a comment:
It's unfortunate the Pade approximant didn't work out, but sometimes that's how it goes---I definitely know the feeling of spending many weeks investigating something only to find at the end of the day that the clever idea didn't work out. Do you think you want to call it case closed on the Pade approximant, or do you still have some ideas or thoughts that might help out?
As a side note, I personally would focus not on microbenchmarks (although they are definitely useful!) but instead on the actual real-world measured time for training in e.g. the mnist_cnn example. Personally I will use microbenchmarks to try and guide my development, but the "ground truth" I use to decide whether or not anything really is faster is an actual fully-working example, since this is closer to what a user will see in practice. 👍
```c++
maxInput.each_row() += log(sum(output));
#pragma omp parallel for collapse(2)
for (size_t i = 0; i < output.n_rows; i++) {
  for (size_t j = 0; j < output.n_cols; j++) {
```
You could just iterate over output.n_elem; the code might be a little simpler. Also, do you think you can fix the style? e.g.

```c++
if (condition)
{
  stuff
}
else
{
  other stuff
}
```

instead of

```c++
if (condition) {
  stuff
} else {
  other stuff
}
```
```c++
  return *this;
}
```
No need for an extra line 👍
```diff
 template<typename MatType>
-void LogSoftMaxType<MatType>::Forward(const MatType& input, MatType& output)
+void LogSoftMaxType<MatType>::ForwardImpl(const MatType& input,
```
How does ForwardImpl() get called? I think we also need a generic implementation that will work with Bandicoot (although I know that will be slower).
Also it would be great if you could provide some updated numbers on how much OpenMP helps, etc., just to give an idea---I haven't run the updated code myself, but I'm sure that you have in your experiments. 👍
(force-pushed from f0646a6 to 822c499)
@rcurtin I'll return to this if I get another idea, but the current commit should be merged, because the OpenMP fast approximation shows a noticeable speed boost. Here are the MNIST dataset benchmarks :)

Benchmark plots (omitted): Fast Approximation without OpenMP; Fast Approximation with OpenMP; Padé approximant with OpenMP.
rcurtin left a comment:
@MarkFischinger nice work, great to see we did end up getting an improvement here (even if it's not the one you were originally looking for :)).
Do you want to add a small bullet point to HISTORY.md indicating the speedup?
@conradsnicta just an FYI: we got some speedup in this PR over using .transform() and .each_col(), but the primary method of speedup is via OpenMP parallelization. I notice that neither .transform() nor .each_col() are OpenMP-ized---wanted to call it to your attention in case you wanted to incorporate that into the next Armadillo release.
shrit left a comment:
In the end, we started with Padé and ended up with the original implementation parallelized, with the transform function removed.

Following our discussion in issue #3662, I've implemented the Padé approximant in the log softmax layer. Due to time constraints, I haven't run the tests yet, but I plan to do so shortly and update you with the results.
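For reference, the standard log-softmax computation this layer implements (consistent with the maxInput.each_row() += log(sum(output)) line quoted above) is, for an input vector $x$:

$$
\operatorname{logsoftmax}(x)_i = x_i - m - \log\sum_j e^{\,x_j - m}, \qquad m = \max_j x_j.
$$

The elementwise exponentials inside the sum are the part that a faster approximation of $e^{x}$, such as the Padé approximant proposed here, would replace.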