In each iteration of Solver::Solve, there are four chances to accelerate the computation.
The first opportunity is the most complex one, because Net::ForwardBackward invokes the Forward and Backward methods of every layer in the net.
Dtype loss = net_->ForwardBackward(bottom_vec);
The second chance is more straightforward. An OpenMP directive is enough to parallelize the independent computation for each param_id.
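A minimal sketch of what such a per-parameter loop could look like. The source does not show the loop body, so the `ParamBlob` struct, the `ScaleDiffs` function, and the gradient scaling are hypothetical stand-ins rather than Caffe code. The point is only that each `param_id` touches its own blob, so a single directive on the outer loop should be safe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one learnable parameter blob (not Caffe's Blob).
struct ParamBlob {
  std::vector<float> data;  // parameter values
  std::vector<float> diff;  // gradients
};

// Scales every blob's gradient by `rate`. Each iteration reads and writes
// only params[param_id], so the iterations are independent of each other.
void ScaleDiffs(std::vector<ParamBlob>& params, float rate) {
  const int n = static_cast<int>(params.size());
#pragma omp parallel for
  for (int param_id = 0; param_id < n; ++param_id) {
    ParamBlob& p = params[param_id];
    assert(p.data.size() == p.diff.size());
    for (std::size_t i = 0; i < p.diff.size(); ++i) {
      p.diff[i] *= rate;
    }
  }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially, which makes it easy to compare results between the two builds.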
The third opportunity needs only one extra trick: distinguishing CPU mode from GPU mode.
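One way to make that distinction, sketched under assumptions: the `RunMode` enum and `ApplyUpdate` function below are hypothetical, and the use of OpenMP's `if` clause is a choice made here, not something the source prescribes. The clause enables threading only in CPU mode; in GPU mode the loop stays serial, with the host arithmetic standing in for whatever device calls the real code would make.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the framework's run mode.
enum class RunMode { CPU, GPU };

// Applies data[id][i] -= diff[id][i] for every blob.
// The loop is threaded only when mode is CPU; otherwise it runs serially.
void ApplyUpdate(std::vector<std::vector<float>>& data,
                 const std::vector<std::vector<float>>& diff,
                 RunMode mode) {
  assert(data.size() == diff.size());
  const int n = static_cast<int>(data.size());
  const bool use_threads = (mode == RunMode::CPU);
#pragma omp parallel for if(use_threads)
  for (int id = 0; id < n; ++id) {
    assert(data[id].size() == diff[id].size());
    for (std::size_t i = 0; i < data[id].size(); ++i) {
      data[id][i] -= diff[id][i];
    }
  }
}
```

Both modes produce the same values in this sketch; only the threading differs.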
The last one involves a plain old OpenMP-friendly nested for loop.
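As an illustration of why a nested loop of this kind is easy to parallelize, here is a sketch over a row-major buffer. The `AddInto` function and its element-wise addition are hypothetical, since the source does not show the loop body. The `collapse(2)` clause is also an assumption made here; a plain `parallel for` on the outer loop would work as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds src into dst element-wise over a rows x cols row-major buffer.
// Every (r, c) pair writes a distinct element, so there are no data races.
void AddInto(std::vector<float>& dst, const std::vector<float>& src,
             int rows, int cols) {
  assert(dst.size() == static_cast<std::size_t>(rows) * cols);
  assert(src.size() == dst.size());
#pragma omp parallel for collapse(2)
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      dst[r * cols + c] += src[r * cols + c];
    }
  }
}
```

Collapsing the two loops gives the runtime rows * cols iterations to distribute, which can help when the outer loop alone has fewer iterations than there are threads.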