I recently updated to the current Theano dev version. While doing so, I noticed a significant slowdown when training Keras models.
It looks like GpuDnnReduction is much slower than the previously used GpuCAReduceCuda. I tried to create a minimal example to reproduce the problem:
import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000, 20000]

a = theano.tensor.matrix()
f_max = theano.function([a], a.max(axis=1))
f_min = theano.function([a], a.min(axis=1))
f_sum = theano.function([a], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f](np.random.random((5, 5)).astype(dtype))

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        t1 = time()
        for i in range(n_runs):
            funcs[f](data)
        print("{}x{}:\t{}".format(s, s, (time() - t1) / n_runs))
I tested both a Maxwell GPU (Titan X) and a Pascal GPU (GTX 1080 Ti), which, for whatever reason, is slower than the Titan X here.
When running with Theano defaults, the output looks like this:
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   3
 |GpuDnnReduction{red_op='maximum', axis=(1,), acc_dtype='float32', dtype='float32', return_indices=False} [id B] <GpuArrayType<None>(float32, vector)> ''   2
   |GpuContiguous [id C] <GpuArrayType<None>(float32, matrix)> ''   1
     |GpuFromHost<None> [id D] <GpuArrayType<None>(float32, matrix)> ''   0
       |<TensorType(float32, matrix)> [id E] <TensorType(float32, matrix)>
...
max
1000x1000: 0.000675480365753
10000x10000: 0.0677583003044
20000x20000: 0.275841779709
sum
1000x1000: 0.00063549041748
10000x10000: 0.0676817893982
20000x20000: 0.275875630379
min
1000x1000: 0.000633809566498
10000x10000: 0.0676900100708
20000x20000: 0.275860497952
--------------------------------
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
...
max
1000x1000: 0.000814170837402
10000x10000: 0.0918325400352
20000x20000: 0.366781489849
sum
1000x1000: 0.000770909786224
10000x10000: 0.0916589212418
20000x20000: 0.366722888947
min
1000x1000: 0.000764350891113
10000x10000: 0.0915873193741
20000x20000: 0.366739079952
When running with THEANO_FLAGS='optimizer_excluding=local_dnn_reduction', the output looks like this:
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   2
 |GpuCAReduceCuda{maximum}{1} [id B] <GpuArrayType<None>(float32, vector)> ''   1
   |GpuFromHost<None> [id C] <GpuArrayType<None>(float32, matrix)> ''   0
     |<TensorType(float32, matrix)> [id D] <TensorType(float32, matrix)>
...
max
1000x1000: 0.000498871803284
10000x10000: 0.0555803990364
20000x20000: 0.221592080593
sum
1000x1000: 0.000458009243011
10000x10000: 0.054373550415
20000x20000: 0.21805713892
min
1000x1000: 0.000489480495453
10000x10000: 0.055532169342
20000x20000: 0.221575219631
--------------------------------
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
...
max
1000x1000: 0.000663950443268
10000x10000: 0.0790155386925
20000x20000: 0.315941710472
sum
1000x1000: 0.000606820583344
10000x10000: 0.0785833787918
20000x20000: 0.314622268677
min
1000x1000: 0.000656189918518
10000x10000: 0.0790068006516
20000x20000: 0.316200020313
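Across all sizes and both cards, the GpuDnnReduction path is roughly 15-35% slower than GpuCAReduceCuda. For example, for the 20000x20000 max reduction the Titan X needs about 0.276 s vs. 0.222 s per call (~24% slower), and the 1080 Ti about 0.367 s vs. 0.316 s (~16% slower).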
For now, I ended up disabling all the related optimizers: local_dnn_reduction, local_cudnn_maxandargmax, and local_dnn_argmax.
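That is, I am currently running everything with the following flag (if I read the flag syntax correctly, the exclude list is colon-separated, so take this as a sketch):

THEANO_FLAGS='optimizer_excluding=local_dnn_reduction:local_cudnn_maxandargmax:local_dnn_argmax'

The same exclusion should also be possible per function by passing an excluding mode instead of setting a global flag (a minimal sketch, assuming Mode.excluding accepts these optimizer tags):

# exclude the cuDNN reduction/argmax optimizers only for this function
mode = theano.compile.get_default_mode().excluding(
    'local_dnn_reduction', 'local_cudnn_maxandargmax', 'local_dnn_argmax')
f_max = theano.function([a], a.max(axis=1), mode=mode)  # should fall back to GpuCAReduceCuda, as in the debugprint above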