I recently updated to the current Theano dev version. While doing so, I noticed a significant slowdown when training Keras models.
It looks like GpuDnnReduction is much slower than the previously used GpuCAReduceCuda. I tried to create a minimal example to reproduce the problem:
import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000, 20000]

a = theano.tensor.matrix()
f_max = theano.function([a], a.max(axis=1))
f_min = theano.function([a], a.min(axis=1))
f_sum = theano.function([a], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f](np.random.random((5, 5)).astype(dtype))

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        t1 = time()
        for i in range(n_runs):
            funcs[f](data)
        print("{}x{}:\t{}".format(s, s, (time() - t1) / n_runs))
I tested both a Maxwell GPU (Titan X) and a Pascal GPU (GTX 1080 Ti), which, for whatever reason, is slower than the Titan X here.
When running with Theano defaults, the output looks like this:
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   3
 |GpuDnnReduction{red_op='maximum', axis=(1,), acc_dtype='float32', dtype='float32', return_indices=False} [id B] <GpuArrayType<None>(float32, vector)> ''   2
   |GpuContiguous [id C] <GpuArrayType<None>(float32, matrix)> ''   1
     |GpuFromHost<None> [id D] <GpuArrayType<None>(float32, matrix)> ''   0
       |<TensorType(float32, matrix)> [id E] <TensorType(float32, matrix)>
...
max
1000x1000: 0.000675480365753
10000x10000: 0.0677583003044
20000x20000: 0.275841779709
sum
1000x1000: 0.00063549041748
10000x10000: 0.0676817893982
20000x20000: 0.275875630379
min
1000x1000: 0.000633809566498
10000x10000: 0.0676900100708
20000x20000: 0.275860497952
--------------------------------
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
...
max
1000x1000: 0.000814170837402
10000x10000: 0.0918325400352
20000x20000: 0.366781489849
sum
1000x1000: 0.000770909786224
10000x10000: 0.0916589212418
20000x20000: 0.366722888947
min
1000x1000: 0.000764350891113
10000x10000: 0.0915873193741
20000x20000: 0.366739079952
When running with THEANO_FLAGS='optimizer_excluding=local_dnn_reduction', the output looks like this:
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   2
 |GpuCAReduceCuda{maximum}{1} [id B] <GpuArrayType<None>(float32, vector)> ''   1
   |GpuFromHost<None> [id C] <GpuArrayType<None>(float32, matrix)> ''   0
     |<TensorType(float32, matrix)> [id D] <TensorType(float32, matrix)>
...
max
1000x1000: 0.000498871803284
10000x10000: 0.0555803990364
20000x20000: 0.221592080593
sum
1000x1000: 0.000458009243011
10000x10000: 0.054373550415
20000x20000: 0.21805713892
min
1000x1000: 0.000489480495453
10000x10000: 0.055532169342
20000x20000: 0.221575219631
--------------------------------
Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
...
max
1000x1000: 0.000663950443268
10000x10000: 0.0790155386925
20000x20000: 0.315941710472
sum
1000x1000: 0.000606820583344
10000x10000: 0.0785833787918
20000x20000: 0.314622268677
min
1000x1000: 0.000656189918518
10000x10000: 0.0790068006516
20000x20000: 0.316200020313
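Across all sizes and both cards, the GpuDnnReduction path is roughly 15-35% slower than GpuCAReduceCuda. For example, for the 20000x20000 max reduction the Titan X needs about 0.276 s vs. 0.222 s per call (~24% slower), and the 1080 Ti about 0.367 s vs. 0.316 s (~16% slower).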
For now, I ended up disabling all the related optimizers: local_dnn_reduction, local_cudnn_maxandargmax, and local_dnn_argmax.
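That is, I am currently running everything with the following flag (if I read the flag syntax correctly, the exclude list is colon-separated, so take this as a sketch):

THEANO_FLAGS='optimizer_excluding=local_dnn_reduction:local_cudnn_maxandargmax:local_dnn_argmax'

The same exclusion should also be possible per function by passing an excluding mode instead of setting a global flag (a minimal sketch, assuming Mode.excluding accepts these optimizer tags):

# exclude the cuDNN reduction/argmax optimizers only for this function
mode = theano.compile.get_default_mode().excluding(
    'local_dnn_reduction', 'local_cudnn_maxandargmax', 'local_dnn_argmax')
f_max = theano.function([a], a.max(axis=1), mode=mode)  # should fall back to GpuCAReduceCuda, as in the debugprint above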