Optimizations to set_to, copy, fill ops #8340
Conversation
|
Is it intended to only support fill for zero and one going forward? |
|
I don't understand your question. Which item are you referring to and what gives you that assumption? |
|
Having the value to fill as the template parameter. |
|
There's more than one thing you could be referring to in this PR. There's a FillCompute() and a set_to template, which are somewhat independent. Currently, zero and one are the only fill values used with either of them. That doesn't mean another value can't be used going forward if the need arises. |
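(For illustration, a fill kernel with a compile-time-constant value of the kind discussed here might look like the sketch below. The names set_to, Map, and Launch echo the discussion, but this is a simplified stand-in, not this PR's actual code.)
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch (not the PR's actual code): the fill value is a
// compile-time template parameter, so each store uses an immediate constant.
template<int value>
struct set_to {
  template<typename DType>
  static inline void Map(std::size_t i, DType* out) {
    out[i] = static_cast<DType>(value);
  }
};

// Stand-in for a Kernel<OP, xpu>::Launch()-style element loop.
template<typename OP, typename DType>
void Launch(DType* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    OP::Map(i, out);
  }
}

int main() {
  std::vector<float> buf(8, -1.0f);
  Launch<set_to<1> >(buf.data(), buf.size());  // the "ones" use-case
  assert(buf[0] == 1.0f && buf[7] == 1.0f);
  return 0;
}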
|
There's an op in NDArray that can benefit from fill: https://mxnet.incubator.apache.org/versions/master/api/python/ndarray.html?highlight=full#mxnet.ndarray.full. Currently it is done in the frontend through |
|
Adding a fill with a runtime-determined value (for example via OpBase::SetToScalar) is a trivial addition. For filling with a predefined constant scalar such as zero or one, using an immediate, as set_to does, is faster. |
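(For contrast, a runtime-valued fill in the spirit of an OpBase::SetToScalar helper might look roughly like this sketch; the names set_to_scalar and FillScalar are assumed for illustration, not taken from the PR.)
#include <cstddef>

// Simplified sketch with assumed names (not the PR's actual code): the fill
// value is a runtime argument, so it is loaded per call rather than encoded
// as an immediate the way a compile-time set_to<value> can be.
struct set_to_scalar {
  template<typename DType>
  static inline void Map(std::size_t i, DType* out, const DType value) {
    out[i] = value;
  }
};

template<typename DType>
void FillScalar(DType* out, std::size_t n, DType value) {
  for (std::size_t i = 0; i < n; ++i) {
    set_to_scalar::Map(i, out, value);
  }
}

// Usage (illustrative): FillScalar(data, n, 3.5f);  // value known only at runtime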
|
Agreed. I'm basically asking to keep both |
|
You can re-add it if you need it sometime, although it's better to use OpBase::SetToScalar or op_with_req<mshadow_op::identity, Req> (the op_with_req Map() override for setting a scalar is in a separate PR) because those properly handle Req. |
|
As long as |
src/operator/tensor/init_op.h (outdated)
if (req[0] != kNullOp) {
  mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();
  MSHADOW_TYPE_SWITCH(outputs[0].type_flag_, DType, {
    mxnet_op::Kernel<mxnet_op::set_to<value>, xpu>::Launch(s,
What if req[0] is kAddTo?
The op_with_req<> wrapper will handle this, along with the other changes.
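(For readers unfamiliar with the wrapper: the request type is a template parameter and the per-element assignment switches on it, so kAddTo accumulates instead of overwriting. This is a simplified sketch of the idea, not the exact mxnet_op.h definition.)
#include <cstddef>

// Simplified sketch of an op_with_req-style wrapper (not the exact MXNet code):
// 'req' is a compile-time constant, so the switch below costs nothing at runtime.
enum OpReqType { kNullOp = 0, kWriteTo, kWriteInplace, kAddTo };

// The element-wise op mentioned above: identity, i.e. a plain copy.
struct identity {
  template<typename DType>
  static inline DType Map(DType a) { return a; }
};

template<typename OP, int req>
struct op_with_req {
  template<typename DType>
  static inline void Map(std::size_t i, DType* out, const DType* in) {
    switch (req) {
      case kWriteTo:
      case kWriteInplace: out[i]  = OP::Map(in[i]); break;
      case kAddTo:        out[i] += OP::Map(in[i]); break;  // accumulate
      case kNullOp:
      default:            break;                            // do nothing
    }
  }
};

// Usage (illustrative): op_with_req<identity, kAddTo>::Map(i, out, in);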
src/operator/tensor/init_op.h (outdated)
const size_t size = outputs[0].Size();
if (size) {
  MSHADOW_TYPE_SWITCH(outputs[0].type_flag_, DType, {
    memset(outputs[0].dptr<DType>(), 0, size * sizeof(DType));
outputs[0].dptr_ is more efficient here than outputs[0].dptr<DType>().
Question: How much faster is this compared to the original implementation of filling up a TBlob?
A lot faster. The assembly for memset looks something like:
; for 32 bit, load each byte of the eax register with the destination value
mov eax, <value>
shl eax, 8
or al, ah
shl eax, 8
or al, ah
shl eax, 8
mov edi, [pointer]
mov ecx, [size]
shr ecx, 2
rep stosd ; clock cycle count: size/4 * https://web.itu.edu.tr/kesgin/mul06/intel/instr/stos.html
mov ecx, [size]
and ecx, 3
rep stosb ; clock cycle count: (size & 3) * https://web.itu.edu.tr/kesgin/mul06/intel/instr/stos.html
This is done without any change in code execution path (jmp, jle, jge, jz, etc.) or compares.
It is also very cache-friendly since it is perfectly sequential.
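(In C++ terms, the zero-fill fast path boils down to roughly the sketch below; it is illustrative, not the exact FillCompute code. Note memset only works when the fill pattern is a repeated byte, which is why zero gets the fast path while an arbitrary constant still needs a per-element loop.)
#include <cstddef>
#include <cstring>

// Illustrative sketch of the zero-fill fast path: all-zero bytes are a valid
// DType(0) for both integer and IEEE float types, so one sequential,
// branch-free memset covers the whole buffer.
template<typename DType>
void FillZero(DType* data, std::size_t size) {
  if (size) {
    std::memset(data, 0, size * sizeof(DType));
  }
}

// Generic fallback for an arbitrary value (e.g. 1.0f is not a repeated byte,
// so it cannot use memset and needs a per-element store).
template<typename DType>
void FillValue(DType* data, std::size_t size, DType value) {
  for (std::size_t i = 0; i < size; ++i) {
    data[i] = value;
  }
}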
|
'full' on the radar? That struct isn't used. If someone wants to use it, they can add it. I am not going to add an unused kernel. OpBase::set_to_scalar is available. |
|
Ok, @szha and I spoke offline. I added the new operator _full to be used for the ndarray.full() function. |
|
The symbol-side can be updated to use the new op as well. The frontend definition that needs update is here: https://github.com/apache/incubator-mxnet/blob/725a5425d49e5e52455ce19055260df7ffaaadb9/python/mxnet/symbol/symbol.py#L2762 |
|
done |
|
Any further comments on this, or everyone ok with me merging once it passes CI? |
if ctx is None:
    ctx = Context.default_ctx
dtype = mx_real_t if dtype is None else dtype
out = _internal._full(shape=shape, ctx=ctx, dtype=dtype, value=val, out=out)
How about changing setitem to use full instead?
What is setitem? Is that Python code? I'm not especially familiar with the Python parts.
I looked at __setitem__ in ndarray.py. It is not clear how full() would work there. Can you elaborate on what you have in mind?
I guess that would be replacing some of the value-assignment functions used in __setitem__ with _full.
To avoid a code conflict, I can make that change, since I am working on __setitem__ right now for advanced indexing.
ok
@reminisce remember to change this back to out[:] = val after you've made the change
|
Going to close and reopen to see if Jenkins starts to see this PR |
|
That worked :) |
* Fill optimizations
* Optimize IdentityCompute for CPU
* lint
* Fix unused type warning (apache#8316)
* remove unused variable
* CR comments
* CR comments
* Added _full operator
* Trigger build
* Trigger build
* Add _full to symbolic
* Merge conflict resolution fix
* lint
* Timing output for test_factorization_module when Verbose enabled (apache#8363)
* Timing output for test_factorization_module when Verbose enabled
* Trigger build
* Trigger build
* Trigger build
* Misc fixes for sparse distributed training (apache#8345)
* remove mshadow::range in init_op.h
* add unit test
* remove pass by ptr, add unit test for pull empty wieghts
* fix range in key partition
* remove wrong comment
* remove change for partition
* remove unused var
* add int64 to arange. add checkpointing example
* Fix the Readme (apache#8369)
* Allow test to converge (apache#8351)
* Allow test to converge
* Trigger build
* Trigger build
* Trigger build
* Update cudnn_algoreg-inl.h (apache#7988)
* [Perl] emulate Python zip() for Perl (apache#8192)
* [Perl] emulate Python zip() for Perl
* [Perl] retool zip() uses away from the callback form
* add profile option for frontend profiling to image script (apache#8171)
* add profile option for frontend profiling to image script
* Update image_classification.py
* Update image_classification.py
* Fix Typo (classification) (apache#8376): Fix a typo in the example readme.
* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (apache#8379)
* CPU optimization for ActivationOp (apache#8296)
* CPU optimization for ActivationOp: Significant improvement on CPU (several magnitudes of order in some cases, especially on backward pass). Very slight improvement on GPU.
  OLD MSHADOW APPROACH, CPU (50 iterations of 10 calls; avg over 500 passes):
    shape [1,1,28,28]:    Forward 18.948 ms (avg 0.037896 ms), Backward 1.658 ms (avg 0.003316 ms)
    shape [1,3,28,28]:    Forward 57.973 ms (avg 0.115946 ms), Backward 4.748 ms (avg 0.009496 ms)
    shape [50,1,18,32]:   Forward 703.446 ms (avg 1.40689 ms), Backward 56.255 ms (avg 0.11251 ms)
    shape [50,3,18,32]:   Forward 2107.77 ms (avg 4.21554 ms), Backward 168.483 ms (avg 0.336966 ms)
    shape [20,3,128,128]: Forward 24122.2 ms (avg 48.2443 ms), Backward 1908.7 ms (avg 3.8174 ms)
  OLD MSHADOW APPROACH, GPU (50 iterations of 10 calls; avg over 500 passes):
    shape [1,1,28,28]:    Forward 1.637 ms (avg 0.003274 ms), Backward 1.665 ms (avg 0.00333 ms)
    shape [1,3,28,28]:    Forward 1.562 ms (avg 0.003124 ms), Backward 1.661 ms (avg 0.003322 ms)
    shape [50,1,18,32]:   Forward 1.635 ms (avg 0.00327 ms), Backward 1.702 ms (avg 0.003404 ms)
    shape [50,3,18,32]:   Forward 1.83 ms (avg 0.00366 ms), Backward 2.041 ms (avg 0.004082 ms)
    shape [20,3,128,128]: Forward 2.08 ms (avg 0.00416 ms), Backward 2.688 ms (avg 0.005376 ms)
  NEW MXNET_OP APPROACH, CPU (50 iterations of 10 calls; avg over 500 passes):
    shape [1,1,28,28]:    Forward 80.748 ms (avg 0.161496 ms), Backward 1.176 ms (avg 0.002352 ms)
    shape [1,3,28,28]:    Forward 7.881 ms (avg 0.015762 ms), Backward 2.181 ms (avg 0.004362 ms)
    shape [50,1,18,32]:   Forward 111.48 ms (avg 0.22296 ms), Backward 5.408 ms (avg 0.010816 ms)
    shape [50,3,18,32]:   Forward 333.439 ms (avg 0.666878 ms), Backward 21.331 ms (avg 0.042662 ms)
    shape [20,3,128,128]: Forward 3429.19 ms (avg 6.85837 ms), Backward 286.324 ms (avg 0.572648 ms)
  NEW MXNET_OP APPROACH, GPU (50 iterations of 10 calls; avg over 500 passes):
    shape [1,1,28,28]:    Forward 1.618 ms (avg 0.003236 ms), Backward 1.671 ms (avg 0.003342 ms)
    shape [1,3,28,28]:    Forward 1.629 ms (avg 0.003258 ms), Backward 1.728 ms (avg 0.003456 ms)
    shape [50,1,18,32]:   Forward 1.753 ms (avg 0.003506 ms), Backward 1.756 ms (avg 0.003512 ms)
    shape [50,3,18,32]:   Forward 1.704 ms (avg 0.003408 ms), Backward 1.791 ms (avg 0.003582 ms)
    shape [20,3,128,128]: Forward 2.032 ms (avg 0.004064 ms), Backward 2.143 ms (avg 0.004286 ms)
* lint
* Trigger build
* Trigger build
* Negative begin and end support for csr slice (apache#8241)
* negative index support for sparse slice
* fix lint
* getitem(int) for csr ndarray, support a[-1]
* remove unneccessary argument
* unittest and doc update
* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)
* Final changes before RC
* Updates to NEWS.md
* Updates
* Enable smoothing in softmax operator (apache#8125)
* v0.12 regression: Fix registration of children for Block (apache#8277)
* Fix Block not registering children: If the attribute was already set to something different than Block (e.g. None), it was not being registered.
* fix if / elif for block children registration
* trigger test
* Add fix from apache#8152
* Add tests from apache#8152
* Revert "[CMAKE] Fix windows cmake build" (apache#8311)
* Revert "Added my code signing key (apache#8293)": This reverts commit 22ab185.
* Revert "[CMAKE] Fix windows cmake build (apache#8227)": This reverts commit 1c1c788.
* fixed broken links. https was pointing to http for mxnet.io (apache#8300)
* Update rnn.md (apache#8320)
* fluent methods for missed ops (apache#8329)
* update ps lite (apache#8327)
* Fix unused type warning (apache#8316)
* Trigger build
* Trigger build
* Misc fixes for sparse distributed training (apache#8345)
* remove mshadow::range in init_op.h
* add unit test
* remove pass by ptr, add unit test for pull empty wieghts
* fix range in key partition
* remove wrong comment
* remove change for partition
* remove unused var
* add int64 to arange. add checkpointing example
* Fix the Readme (apache#8369)
* Allow test to converge (apache#8351)
* Allow test to converge
* Trigger build
* Trigger build
* Trigger build
* Update cudnn_algoreg-inl.h (apache#7988)
* [Perl] emulate Python zip() for Perl (apache#8192)
* [Perl] emulate Python zip() for Perl
* [Perl] retool zip() uses away from the callback form
* add profile option for frontend profiling to image script (apache#8171)
* add profile option for frontend profiling to image script
* Update image_classification.py
* Update image_classification.py
* Fix Typo (classification) (apache#8376): Fix a typo in the example readme.
* Fix GPU copy
* Remove duplicate
* Trigger build
Description
Optimizations to set_to, copy, fill, full ops
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments