This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@XiaotaoChen (Contributor)

Description

  1. Parallelize the NDArray::Copy<cpu, cpu> op with OpenMP when the data size is bigger than MXNET_CPU_PARALLEL_COPY_SIZE.
  2. Introduce an environment variable named MXNET_CPU_PARALLEL_COPY_SIZE (default 200,000). When the data size is bigger than this threshold, NDArray::Copy is implemented with OpenMP using the recommended OMP thread count; otherwise it falls back to the default implementation (see the sketch below).
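
A minimal sketch of that dispatch, pieced together from the review hunks shown further down (the helper name OMPCopy and the exact signatures follow the snippets in this review and may differ slightly from the merged code):

```cpp
// Sketch only: threshold-based dispatch inside ndarray::Copy<cpu, cpu>.
const index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
const index_t size = from.Size();
if (size >= copy_block_size) {
  // Large copies: element-wise copy parallelized with OpenMP.
  OMPCopy<DType>(from, to, size);
} else {
  // Small copies: keep the original single-threaded mshadow path.
  mshadow::Copy(to->FlatTo1D<cpu, DType>(), from.FlatTo1D<cpu, DType>());
}
```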

Comments

We are optimizing Sockeye's performance on CPU and found that the CopyCPU2CPU operation takes too much time, accounting for about 14% of the overall time (detailed data below). So we parallelized the NDArray::Copy<cpu, cpu> operator.

As far as I know, NDArray::Copy<cpu, cpu> is called from the front end by these functions: nd.copy(), nd.copyto(), nd.asnumpy(), nd.array(), nd.ones(), nd.zeros(). These functions gain a 2-14x speedup when the data size ranges from 20,000 to 2e8. Sockeye's throughput increased from 17.10 sent/sec to 19.19 sent/sec for CPU inference, and the CopyCPU2CPU time overhead dropped from 14.68% to 2.3%. Detailed data below.

The logs and scripts are here: logs and scripts
@pengzhao-intel

Detailed data

Hardware: SKX-8180, single socket (28 cores). Sizes: 20K = 20,000 (shape [20000, 1]); 200K = 200,000 ([20000, 10]); 2M = 2,000,000 ([20000, 100]); 20M = 20,000,000 ([20000, 1000]); 200M = 200,000,000 ([20000, 10000]).

| API | 20K default (us) | 20K parallel (us) | 20K speedup | 200K default (us) | 200K parallel (us) | 200K speedup | 2M default (us) | 2M parallel (us) | 2M speedup | 20M default (us) | 20M parallel (us) | 20M speedup | 200M default (us) | 200M parallel (us) | 200M speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nd.copy | 44.98164 | 45.434634 | 0.990029743 | 208.68778 | 79.902013 | 2.611796301 | 1651.295 | 190.01166 | 8.6904931 | 26326.93 | 3634.23824 | 7.24413888 | 291151.8 | 30640.8008 | 9.5020948 |
| nd.copyto | 37.51119 | 37.384033 | 1.003401372 | 192.45942 | 61.178207 | 3.145882062 | 1566.815 | 157.308578 | 9.9601395 | 15042.04 | 2261.47175 | 6.6514396 | 148679.5 | 23665.7222 | 6.28248457 |
| nd.asnumpy | 15.06805 | 15.155474 | 0.994231787 | 146.02343 | 25.184949 | 5.798043585 | 1210.038 | 85.465113 | 14.158267 | 26839.42 | 3322.951 | 8.07698195 | 288313.9 | 30143.3802 | 9.56474928 |
| nd.array | 56.04426 | 55.130323 | 1.016577773 | 651.10525 | 485.841433 | 1.34015998 | 4410.593 | 2852.86109 | 1.5460244 | 60601.87 | 35724.5922 | 1.69636283 | 596539.3 | 339708.265 | 1.75603397 |
| nd.ones | 37.59066 | 31.590462 | 1.189937108 | 42.200089 | 41.898092 | 1.007207894 | 49.11423 | 49.090385 | 1.0004857 | 1991.232 | 1997.4788 | 0.99687277 | 22429.08 | 22386.3522 | 1.00190887 |
| nd.zeros | 32.32956 | 31.971931 | 1.011185687 | 43.678284 | 43.423971 | 1.005856512 | 55.90121 | 56.385994 | 0.9914024 | 2319.415 | 2365.28715 | 0.98060627 | 22368.08 | 22318.8877 | 1.00220413 |

Hardware: BDW-2699 (44 cores). Sizes and shapes as above.

| API | 20K default (us) | 20K parallel (us) | 20K speedup | 200K default (us) | 200K parallel (us) | 200K speedup | 2M default (us) | 2M parallel (us) | 2M speedup | 20M default (us) | 20M parallel (us) | 20M speedup | 200M default (us) | 200M parallel (us) | 200M speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nd.copy | 141.3822 | 139.125188 | 1.016223008 | 283.70222 | 143.504143 | 1.976961843 | 1530.663 | 267.05265 | 5.7316906 | 42425.86 | 3845.33405 | 11.033076 | 424523.7 | 29502.2726 | 14.3895266 |
| nd.copyto | 135.1595 | 136.534373 | 0.989930147 | 274.32442 | 93.11835 | 2.945975922 | 1973.446 | 294.510523 | 6.7007664 | 20426.01 | 3140.17932 | 6.50472761 | 153221.1 | 30291.3427 | 5.05824612 |
| nd.asnumpy | 40.90468 | 41.572253 | 0.983941885 | 124.11277 | 41.222572 | 3.010796245 | 1521.81 | 81.857045 | 18.591068 | 32866.75 | 3491.84672 | 9.4124258 | 370223.1 | 27558.4698 | 13.4340953 |
| nd.array | 227.8487 | 190.917651 | 1.193439621 | 1233.6016 | 648.3078 | 1.902802295 | 4065.005 | 2900.36202 | 1.4015509 | 83189.32 | 49047.1363 | 1.69610964 | 819595.4 | 489327.01 | 1.67494413 |
| nd.ones | 107.8765 | 100.541115 | 1.07295866 | 133.58593 | 126.187007 | 1.058634587 | 117.8583 | 114.28992 | 1.0312217 | 1897.2 | 1758.56749 | 1.07883261 | 18391.24 | 18327.9673 | 1.00345244 |
| nd.zeros | 97.52909 | 94.858805 | 1.028150133 | 123.48493 | 116.785367 | 1.057366451 | 114.3217 | 115.57738 | 0.9891357 | 1996.326 | 1979.92325 | 1.00828477 | 18440.56 | 18458.9148 | 0.99900545 |

Sockeye profile on SKX-8180, single socket

  • official master
Time of each OP:
FullyConnected        47694.792 ms      0.78006594485     ms/call       61142  calls    28.58 %
ImperativeBulk        29801.262 ms      1.89865328746     ms/call       15696  calls    17.86 %
CopyCPU2CPU           24502.718 ms      0.668176979084    ms/call       36671  calls    14.68 %
take                  17288.25  ms      0.507969971205    ms/call       34034  calls    10.36 %
log                   12085.495 ms      6.92975630734     ms/call       1744   calls    7.24 %
SliceChannel          10051.787 ms      0.350407411281    ms/call       28686  calls    6.02 %
softmax               4577.321  ms      1.31230533257     ms/call       3488   calls    2.74 %
_mul_scalar           4498.783  ms      2.57957740826     ms/call       1744   calls    2.70 %
Activation            4148.085  ms      0.028656693218    ms/call       144751 calls    2.49 %
elemwise_add          3534.039  ms      0.0621074654669   ms/call       56902  calls    2.12 %
batch_dot             2317.371  ms      0.664383887615    ms/call       3488   calls    1.39 %
DeleteVariable        2129.944  ms      0.0337539856106   ms/call       63102  calls    1.28 %
elemwise_mul          1196.172  ms      0.0140144107413   ms/call       85353  calls    0.72 %
Concat                887.246   ms      0.0558578443717   ms/call       15884  calls    0.53 %
LayerNorm             828.937   ms      0.332106169872    ms/call       2496   calls    0.50 %
repeat                463.889   ms      0.469998986829    ms/call       987    calls    0.28 %
_slice_assign         396.031   ms      0.131659242021    ms/call       3008   calls    0.24 %
Embedding             101.812   ms      0.0568464544947   ms/call       1791   calls    0.06 %
SetupExec             65.807    ms      0.000731863829977 ms/call       89917  calls    0.04 %
Dropout               59.13     ms      0.0167270155587   ms/call       3535   calls    0.04 %
_full                 58.297    ms      0.0317176278564   ms/call       1838   calls    0.03 %
stack                 40.659    ms      0.288361702128    ms/call       141    calls    0.02 %
SequenceMask          36.445    ms      0.0208973623853   ms/call       1744   calls    0.02 %
sum                   27.585    ms      0.0154020100503   ms/call       1791   calls    0.02 %
argsort               21.859    ms      0.465085106383    ms/call       47     calls    0.01 %
WaitForVar            19.604    ms      0.00400653995504  ms/call       4893   calls    0.01 %
expand_dims           18.804    ms      0.00283876811594  ms/call       6624   calls    0.01 %
SwapAxis              12.079    ms      0.1285            ms/call       94     calls    0.01 %
_zeros                10.918    ms      0.0100999074931   ms/call       1081   calls    0.01 %
Reshape               9.636     ms      0.00272588401697  ms/call       3535   calls    0.01 %
SequenceReverse       7.18      ms      0.0763829787234   ms/call       94     calls    0.00 %
tile                  1.412     ms      0.0300425531915   ms/call       47     calls    0.00 %
SequenceLast          0.833     ms      0.0177234042553   ms/call       47     calls    0.00 %
broadcast_to          0.627     ms      0.0133404255319   ms/call       47     calls    0.00 %
broadcast_not_equal   0.55      ms      0.0117021276596   ms/call       47     calls    0.00 %
_slice_assign_scalar  0.473     ms      0.0100638297872   ms/call       47     calls    0.00 %
broadcast_add         0.447     ms      0.00951063829787  ms/call       47     calls    0.00 %
_ones                 0.418     ms      0.00889361702128  ms/call       47     calls    0.00 %
_unravel_index        0.347     ms      0.0073829787234   ms/call       47     calls    0.00 %
Cast                  0.28      ms      0.00595744680851  ms/call       47     calls    0.00 %
zeros_like            0.244     ms      0.00259574468085  ms/call       94     calls    0.00 %

Total OP Time: 166897.56800000 ms
  • parallelized copy
Time of each OP:
FullyConnected        48756.792 ms      0.797435347224    ms/call       61142  calls    33.61 %
ImperativeBulk        29133.28  ms      1.85609582059     ms/call       15696  calls    20.08 %
take                  16485.268 ms      0.484376447082    ms/call       34034  calls    11.36 %
log                   11493.641 ms      6.59039048165     ms/call       1744   calls    7.92 %
SliceChannel          10047.895 ms      0.350271735341    ms/call       28686  calls    6.93 %
softmax               4443.701  ms      1.27399684633     ms/call       3488   calls    3.06 %
Activation            4366.243  ms      0.0301638192482   ms/call       144751 calls    3.01 %
_mul_scalar           4272.24   ms      2.44967889908     ms/call       1744   calls    2.95 %
elemwise_add          3784.498  ms      0.0665090506485   ms/call       56902  calls    2.61 %
CopyCPU2CPU           3337.278  ms      0.0910059174825   ms/call       36671  calls    2.30 %
batch_dot             2183.933  ms      0.626127580275    ms/call       3488   calls    1.51 %
DeleteVariable        1929.185  ms      0.0305724858166   ms/call       63102  calls    1.33 %
elemwise_mul          1111.778  ms      0.013025646433    ms/call       85353  calls    0.77 %
Concat                919.976   ms      0.0579184084613   ms/call       15884  calls    0.63 %
_slice_assign         808.582   ms      0.268810505319    ms/call       3008   calls    0.56 %
LayerNorm             804.849   ms      0.322455528846    ms/call       2496   calls    0.55 %
repeat                643.624   ms      0.652101317123    ms/call       987    calls    0.44 %
Embedding             98.968    ms      0.0552585147962   ms/call       1791   calls    0.07 %
_full                 96.208    ms      0.0523438520131   ms/call       1838   calls    0.07 %
Dropout               60.788    ms      0.017196039604    ms/call       3535   calls    0.04 %
SetupExec             59.639    ms      0.000663267235339 ms/call       89917  calls    0.04 %
stack                 35.195    ms      0.249609929078    ms/call       141    calls    0.02 %
SequenceMask          34.082    ms      0.0195424311927   ms/call       1744   calls    0.02 %
sum                   32.254    ms      0.0180089335567   ms/call       1791   calls    0.02 %
argsort               21.612    ms      0.459829787234    ms/call       47     calls    0.01 %
WaitForVar            19.927    ms      0.0040725526262   ms/call       4893   calls    0.01 %
expand_dims           17.696    ms      0.00267149758454  ms/call       6624   calls    0.01 %
SwapAxis              15.528    ms      0.165191489362    ms/call       94     calls    0.01 %
SequenceReverse       11.077    ms      0.117840425532    ms/call       94     calls    0.01 %
_zeros                10.9      ms      0.0100832562442   ms/call       1081   calls    0.01 %
Reshape               9.855     ms      0.00278783592645  ms/call       3535   calls    0.01 %
_slice_assign_scalar  1.715     ms      0.0364893617021   ms/call       47     calls    0.00 %
tile                  1.378     ms      0.0293191489362   ms/call       47     calls    0.00 %
broadcast_to          1.365     ms      0.0290425531915   ms/call       47     calls    0.00 %
broadcast_not_equal   0.836     ms      0.0177872340426   ms/call       47     calls    0.00 %
SequenceLast          0.835     ms      0.0177659574468   ms/call       47     calls    0.00 %
broadcast_add         0.452     ms      0.0096170212766   ms/call       47     calls    0.00 %
_unravel_index        0.338     ms      0.0071914893617   ms/call       47     calls    0.00 %
Cast                  0.305     ms      0.00648936170213  ms/call       47     calls    0.00 %
_ones                 0.282     ms      0.006             ms/call       47     calls    0.00 %
zeros_like            0.256     ms      0.00272340425532  ms/call       94     calls    0.00 %

Total OP Time: 145054.25400000 ms

@roywei (Member) commented Oct 23, 2018

@XiaotaoChen Thanks for the contribution!
@mxnet-label-bot [pr-awaiting-review]

@marcoabreu added the pr-awaiting-review (PR is waiting for code review) label on Oct 23, 2018
if (to->type_flag_ == from.type_flag_) {
mshadow::Copy(to->FlatTo1D<cpu, DType>(),
from.FlatTo1D<cpu, DType>());
index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
Member:

static?

Contributor Author:

There is an API, mx.test_utils.set_env_var, to set environment variables. It may not work repeatedly if the static modifier is added, as in my [test case](https://github.com/XiaotaoChen/incubator-mxnet/blob/cxt-test/tests/cxt-test/parallel-copy/test_ndarray_copy.py#L89-L92). What's your opinion? @szha

Member:

It seems wasteful to get the variable every time it launches a copy as people don't usually change this setting at runtime. Try testing it with some other approach.

Contributor Author:

ok, I'll do it.
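
One way to avoid the repeated lookup, as suggested above, is to cache the value in a static local. A minimal sketch (whether this still cooperates with mx.test_utils.set_env_var is exactly the concern raised in the test case linked earlier):

```cpp
// Sketch: read MXNET_CPU_PARALLEL_COPY_SIZE only once per process.
// Note: a static makes later changes to the environment variable invisible,
// which is the trade-off discussed in this thread.
static const index_t copy_block_size =
    dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
```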

namespace mxnet {
namespace ndarray {
template<typename DType>
void OMPCopy(const TBlob &from, TBlob *to, const index_t size) {
Member:

This doesn't seem NDArray-specific. Consider putting it in src/common/(utils.h?)
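
A hypothetical shape for such a shared helper in src/common/utils.h, per the suggestion above (the name ParallelCopy and the placement are illustrative, not necessarily the merged layout):

```cpp
// src/common/utils.h -- illustrative sketch of a generic parallel copy helper.
namespace mxnet {
namespace common {

// Copies `size` elements from `src` to `dst`, parallelized with OpenMP
// above the MXNET_CPU_PARALLEL_COPY_SIZE threshold.
template <typename DType>
void ParallelCopy(DType* dst, const DType* src, index_t size);

}  // namespace common
}  // namespace mxnet
```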

index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
const index_t size = from.Size();
if (size >= copy_block_size) {
OMPCopy<DType>(from, to, size);
@szha (Member), Oct 24, 2018:

size check between from and to?

if (size >= copy_block_size) {
OMPCopy<DType>(from, to, size);
} else {
mshadow::Copy(to->FlatTo1D<cpu, DType>(), from.FlatTo1D<cpu, DType>());
Member:

We can probably get rid of the mshadow::Copy here and merge it into the OMP copy. That way you can put the threshold check as a condition in the omp pragma.
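
That suggestion could be expressed with OpenMP's `if` clause, which runs the loop serially below the threshold. A sketch only, using the variable names from the hunk above (the PR ultimately kept an explicit branch so that small copies can use memcpy; see the reply below):

```cpp
// Sketch of "threshold as a pragma condition": serial below the threshold,
// parallel at or above it.
#pragma omp parallel for if (size >= copy_block_size) \
    num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
for (index_t i = 0; i < size; ++i) {
  dst_dptr[i] = src_dptr[i];
}
```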

Contributor Author:

We tested the performance of a single-threaded OMP copy against the memcpy called by mshadow::Copy for data sizes below 200,000, as shown below.

| size | memcpy (us) | omp copy, single thread (us) | speedup |
| --- | --- | --- | --- |
| 20 | 0.0422 | 0.254033 | 0.16612015 |
| 200 | 0.038967 | 0.2407 | 0.16189032 |
| 200 | 0.172933 | 0.389933 | 0.443494139 |
| 20000 | 2.213567 | 2.541833 | 0.870854616 |
| 200000 | 74.105064 | 49.168999 | 1.507150146 |

It shows that memcpy performs better than direct assignment in a single thread when the data size is small. So we want to keep calling mshadow::Copy when the data size is less than MXNET_CPU_PARALLEL_COPY_SIZE. Looking forward to your suggestions. @szha
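
The resulting structure, sketched with the memcpy fallback folded into the helper (simplified from the hunks shown elsewhere in this review; the signature and the way the threshold is obtained may differ in the merged code):

```cpp
// Sketch: OpenMP loop for large copies, plain memcpy for small ones.
// copy_block_size is the MXNET_CPU_PARALLEL_COPY_SIZE threshold read by the caller.
template <typename DType>
void OMPCopy(const TBlob& from, TBlob* to, const index_t size, const index_t copy_block_size) {
  DType* dst = to->dptr<DType>();
  const DType* src = from.dptr<DType>();
  if (size >= copy_block_size) {
    #pragma omp parallel for num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
    for (index_t i = 0; i < size; ++i) {
      dst[i] = src[i];
    }
  } else {
    std::memcpy(dst, src, sizeof(DType) * size);
  }
}
```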

@XiaotaoChen (Contributor Author)

Hi @szha, the code has been modified according to your suggestions, and the test script and log have also been updated.
Could you review it again?

@ankkhedia (Contributor)

@szha It seems that your comments have been addressed. Could you please review the PR again?

const DType* src_dptr = from.dptr<DType>();
#pragma omp parallel for num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
for (index_t i = 0; i < size; ++i) {
dst_dptr[i] = src_dptr[i];
Member:

Might this exceed the boundaries of src and dst? Do we really need the size parameter here?

Contributor Author:

Yes, you are right. We can fetch the data size from either from or to. We need to check that the sizes of from and to are equal before calling OMPCopy, so I don't want to re-fetch this value here; I just pass it in from the Copy function in ndarray/ndarray_function.cc.

@XiaotaoChen force-pushed the parallelize-copy branch 2 times, most recently from 0410f8e to be32560 on November 7, 2018 at 01:17
@kalyc (Contributor) commented Nov 13, 2018

Thanks for your contribution @XiaotaoChen
@szha @TaoLv @nswamy requesting review!

@stu1130 (Contributor) commented Nov 21, 2018

@TaoLv (Member) left a comment:

A minor comment. The rest LGTM.

mshadow::Copy(to->FlatTo1D<cpu, DType>(),
from.FlatTo1D<cpu, DType>());
const index_t size = from.Size();
CHECK_EQ(size, to->Size());
Member:

Please add descriptive error message here.
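
A minimal sketch of such a check with a descriptive message, assuming the CHECK_EQ shown in the hunk above (the exact wording in the merged code may differ):

```cpp
const index_t size = from.Size();
CHECK_EQ(size, to->Size()) << "copy size mismatch: from.Size() = " << size
                           << " but to->Size() = " << to->Size();
```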

Contributor Author:

done.

dst[i] = src[i];
}
} else {
std::memcpy(dst, src, sizeof(DType) * size);
Contributor:

There is a CheckContinuous function call before the memcpy in the original mshadow::Copy function (https://github.com/dmlc/mshadow/blob/696803bd7723ade8230af878460d96c68a550fbc/mshadow/tensor_cpu-inl.h#L132)

Do we not need this any more?

Contributor Author:

Hi, I don't think the CheckContinuous call is necessary, for two reasons:

  1. A TBlob always appears to be contiguous, because of the CheckContiguous function in the TBlob class;
  2. The TBlob inputs (from, to) are converted to Tensor (via to->FlatTo1D()) in the original implementation of ndarray::Copy<cpu, cpu>. According to the code of FlatTo1D, get_with_shape, the Tensor constructor, and Tensor's CheckContiguous, when a TBlob is converted to a Tensor the Tensor's stride_ is assigned shape[dim - 1], so Tensor's CheckContiguous always returns true. There is no need to check this.
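
For reference, the check being discussed reduces to comparing the last-dimension stride with the last-dimension length; an illustrative (not verbatim) version of why it is trivially true after FlatTo1D:

```cpp
// Illustrative only: after FlatTo1D, stride_ is set to shape[dim - 1],
// so a contiguity check of this form always holds for the flattened tensor.
inline bool IsContiguous(index_t stride, index_t last_dim_size) {
  return stride == last_dim_size;
}
```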

- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
- When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
- When the array size is less than this threshold, NDArray::Copy(from , to)) is implemented by mshadow::Copy(to, from) in single thread.
Contributor:

This is no longer accurate, right, since it is just implemented using memcpy?

Contributor Author:

it's done.

@yuxihu (Member) left a comment:

LGTM overall. Just a minor nitpick.


* MXNET_CPU_PARALLEL_COPY_SIZE
- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
Member:

nit: openMP => OpenMP?

* MXNET_CPU_PARALLEL_COPY_SIZE
- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
- When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
Member:

To be more accurate: bigger than or equal to?

Contributor Author:

Thanks, it's done.

@apeforest (Contributor) left a comment:

LGTM

@yuxihu (Member) left a comment:

LGTM

@anirudh2290 (Member)

@XiaotaoChen I tried your benchmark scripts on p2.8xlarge, which has 16 cores. I don't see much perf difference. Can we keep the default at -1 and use memcpy when it is set to the default? This way we won't impact existing users.

Here are my results:

*******************the default copy perf test*******************
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
<module 'mxnet' from '/home/ubuntu/experimentals/xiataochen_mxnet/python/mxnet/__init__.pyc'>
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 98.951658 (us)
mx.nd.copyto avg time: 83.843867 (us)
mx.nd.asnumpy avg time: 28.649966 (us)
mx.nd.array avg time: 128.841400 (us)
mx.nd.ones avg time: 86.990992 (us)
mx.nd.zeros avg time: 87.682406 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 202.488899 (us)
mx.nd.copyto avg time: 171.939532 (us)
mx.nd.asnumpy avg time: 120.393435 (us)
mx.nd.array avg time: 787.146886 (us)
mx.nd.ones avg time: 292.460124 (us)
mx.nd.zeros avg time: 288.883845 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1131.010056 (us)
mx.nd.copyto avg time: 1066.279411 (us)
mx.nd.asnumpy avg time: 1369.468371 (us)
mx.nd.array avg time: 4298.790296 (us)
mx.nd.ones avg time: 187.500318 (us)
mx.nd.zeros avg time: 167.926153 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 27297.194799 (us)
mx.nd.copyto avg time: 9794.155757 (us)
mx.nd.asnumpy avg time: 27480.689685 (us)
mx.nd.array avg time: 70762.650172 (us)
mx.nd.ones avg time: 3512.398402 (us)
mx.nd.zeros avg time: 2739.222844 (us)
*******************parallelize copy perf test*******************
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
<module 'mxnet' from '/home/ubuntu/experimentals/xiataochen_mxnet/python/mxnet/__init__.pyc'>
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 102.551778 (us)
mx.nd.copyto avg time: 83.732605 (us)
mx.nd.asnumpy avg time: 28.316180 (us)
mx.nd.array avg time: 126.830737 (us)
mx.nd.ones avg time: 93.253454 (us)
mx.nd.zeros avg time: 93.762080 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 187.253952 (us)
mx.nd.copyto avg time: 173.560778 (us)
mx.nd.asnumpy avg time: 120.075544 (us)
mx.nd.array avg time: 787.965457 (us)
mx.nd.ones avg time: 283.519427 (us)
mx.nd.zeros avg time: 283.678373 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1089.898745 (us)
mx.nd.copyto avg time: 1079.169909 (us)
mx.nd.asnumpy avg time: 1450.212797 (us)
mx.nd.array avg time: 4908.100764 (us)
mx.nd.ones avg time: 194.748243 (us)
mx.nd.zeros avg time: 181.714694 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 27326.941490 (us)
mx.nd.copyto avg time: 9780.820211 (us)
mx.nd.asnumpy avg time: 27630.313238 (us)
mx.nd.array avg time: 70585.219065 (us)
mx.nd.ones avg time: 3420.662880 (us)
mx.nd.zeros avg time: 3225.604693 (us)

@XiaotaoChen (Contributor Author)

@anirudh2290 Thanks for your feedback. Because my quota for p2.8xlarge is 0, I tested this on an r4.8xlarge instance (16 physical cores, the same as a p2.8xlarge instance) just now. The speedup is consistent with my previous results. According to your log, the default copy and the parallelized copy both seem to be running in a single thread. Have you set OMP_THREADS or another environment variable? My test log is below.

*******************the default copy perf test*******************
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 89.907646 (us)
mx.nd.copyto avg time: 73.647499 (us)
mx.nd.asnumpy avg time: 27.632713 (us)
mx.nd.array avg time: 140.659014 (us)
mx.nd.ones avg time: 80.291430 (us)
mx.nd.zeros avg time: 64.857801 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 243.878365 (us)
mx.nd.copyto avg time: 185.608864 (us)
mx.nd.asnumpy avg time: 86.077054 (us)
mx.nd.array avg time: 1034.633319 (us)
mx.nd.ones avg time: 160.853068 (us)
mx.nd.zeros avg time: 163.189570 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1628.462474 (us)
mx.nd.copyto avg time: 1602.705320 (us)
mx.nd.asnumpy avg time: 1416.873932 (us)
mx.nd.array avg time: 4925.394058 (us)
mx.nd.ones avg time: 199.715296 (us)
mx.nd.zeros avg time: 169.722239 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 34876.290957 (us)
mx.nd.copyto avg time: 14724.659920 (us)
mx.nd.asnumpy avg time: 27063.266436 (us)
mx.nd.array avg time: 68980.081876 (us)
mx.nd.ones avg time: 3477.557500 (us)
mx.nd.zeros avg time: 3251.512845 (us)
size: 200000000, shape:(20000, 10000)
mx.nd.copy avg time: 318985.883395 (us)
mx.nd.copyto avg time: 123453.982671 (us)
mx.nd.asnumpy avg time: 301323.723793 (us)
mx.nd.array avg time: 671578.780810 (us)
mx.nd.ones avg time: 30080.588659 (us)
mx.nd.zeros avg time: 30204.280217 (us)
*******************parallelize copy perf test*******************
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 87.475777 (us)
mx.nd.copyto avg time: 74.092547 (us)
mx.nd.asnumpy avg time: 27.672450 (us)
mx.nd.array avg time: 139.975548 (us)
mx.nd.ones avg time: 69.546700 (us)
mx.nd.zeros avg time: 65.541267 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 1151.879628 (us)
mx.nd.copyto avg time: 151.093801 (us)
mx.nd.asnumpy avg time: 41.262309 (us)
mx.nd.array avg time: 742.316246 (us)
mx.nd.ones avg time: 110.252698 (us)
mx.nd.zeros avg time: 108.536084 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 914.112727 (us)
mx.nd.copyto avg time: 1570.677757 (us)
mx.nd.asnumpy avg time: 210.316976 (us)
mx.nd.array avg time: 4106.410344 (us)
mx.nd.ones avg time: 218.836466 (us)
mx.nd.zeros avg time: 188.517570 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 6437.516212 (us)
mx.nd.copyto avg time: 5406.737328 (us)
mx.nd.asnumpy avg time: 6036.376953 (us)
mx.nd.array avg time: 44171.086947 (us)
mx.nd.ones avg time: 3299.037615 (us)
mx.nd.zeros avg time: 3055.596352 (us)
size: 200000000, shape:(20000, 10000)
mx.nd.copy avg time: 58142.248789 (us)
mx.nd.copyto avg time: 45039.542516 (us)
mx.nd.asnumpy avg time: 44913.562139 (us)
mx.nd.array avg time: 416787.425677 (us)
mx.nd.ones avg time: 21619.351705 (us)
mx.nd.zeros avg time: 19476.739566 (us)

@anirudh2290 (Member)

@XiaotaoChen Apologies, I was running on a different branch. The numbers look good now. Thanks for the good work!

@anirudh2290 merged commit 6fd4384 into apache:master on Nov 29, 2018
@XiaotaoChen deleted the parallelize-copy branch on September 28, 2019 at 10:25