This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@XiaotaoChen (Contributor)

Description

  1. Parallelize the NDArray::Copy<cpu, cpu> op with OpenMP when the data size is bigger than MXNET_CPU_PARALLEL_COPY_SIZE.
  2. Introduce an environment variable named MXNET_CPU_PARALLEL_COPY_SIZE (default 200,000). When the data size is bigger than this threshold, NDArray::Copy is implemented with OpenMP using the recommended OMP thread count; otherwise it falls back to the default implementation (see the sketch below).
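
A minimal sketch of that dispatch, pieced together from the review hunks shown further down (the helper name OMPCopy and the exact signatures follow the snippets in this review and may differ slightly from the merged code):

```cpp
// Sketch only: threshold-based dispatch inside ndarray::Copy<cpu, cpu>.
const index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
const index_t size = from.Size();
if (size >= copy_block_size) {
  // Large copies: element-wise copy parallelized with OpenMP.
  OMPCopy<DType>(from, to, size);
} else {
  // Small copies: keep the original single-threaded mshadow path.
  mshadow::Copy(to->FlatTo1D<cpu, DType>(), from.FlatTo1D<cpu, DType>());
}
```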

Comments

We are optimizing Sockeye's performance on CPU and found that the CopyCPU2CPU operation takes too much time, accounting for about 14% of the overall time (detailed data below). So we parallelized the NDArray::Copy<cpu, cpu> operator.

As far as I know, NDArray::Copy<cpu, cpu> is called from the front end by these functions: nd.copy(), nd.copyto(), nd.asnumpy(), nd.array(), nd.ones(), nd.zeros(). These functions gain a 2-14x speedup when the data size ranges from 20,000 to 2e8. Sockeye's throughput increased from 17.10 sent/sec to 19.19 sent/sec for CPU inference, and the CopyCPU2CPU time overhead dropped from 14.68% to 2.3%. Detailed data below.

The logs and scripts are here: logs and scripts
@pengzhao-intel

Detailed data

Hardware: SKX-8180, single socket (28 cores). Sizes: 20K = 20,000 (shape [20000, 1]); 200K = 200,000 ([20000, 10]); 2M = 2,000,000 ([20000, 100]); 20M = 20,000,000 ([20000, 1000]); 200M = 200,000,000 ([20000, 10000]).

| API | 20K default (us) | 20K parallel (us) | 20K speedup | 200K default (us) | 200K parallel (us) | 200K speedup | 2M default (us) | 2M parallel (us) | 2M speedup | 20M default (us) | 20M parallel (us) | 20M speedup | 200M default (us) | 200M parallel (us) | 200M speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nd.copy | 44.98164 | 45.434634 | 0.990029743 | 208.68778 | 79.902013 | 2.611796301 | 1651.295 | 190.01166 | 8.6904931 | 26326.93 | 3634.23824 | 7.24413888 | 291151.8 | 30640.8008 | 9.5020948 |
| nd.copyto | 37.51119 | 37.384033 | 1.003401372 | 192.45942 | 61.178207 | 3.145882062 | 1566.815 | 157.308578 | 9.9601395 | 15042.04 | 2261.47175 | 6.6514396 | 148679.5 | 23665.7222 | 6.28248457 |
| nd.asnumpy | 15.06805 | 15.155474 | 0.994231787 | 146.02343 | 25.184949 | 5.798043585 | 1210.038 | 85.465113 | 14.158267 | 26839.42 | 3322.951 | 8.07698195 | 288313.9 | 30143.3802 | 9.56474928 |
| nd.array | 56.04426 | 55.130323 | 1.016577773 | 651.10525 | 485.841433 | 1.34015998 | 4410.593 | 2852.86109 | 1.5460244 | 60601.87 | 35724.5922 | 1.69636283 | 596539.3 | 339708.265 | 1.75603397 |
| nd.ones | 37.59066 | 31.590462 | 1.189937108 | 42.200089 | 41.898092 | 1.007207894 | 49.11423 | 49.090385 | 1.0004857 | 1991.232 | 1997.4788 | 0.99687277 | 22429.08 | 22386.3522 | 1.00190887 |
| nd.zeros | 32.32956 | 31.971931 | 1.011185687 | 43.678284 | 43.423971 | 1.005856512 | 55.90121 | 56.385994 | 0.9914024 | 2319.415 | 2365.28715 | 0.98060627 | 22368.08 | 22318.8877 | 1.00220413 |

Hardware: BDW-2699 (44 cores). Sizes and shapes as above.

| API | 20K default (us) | 20K parallel (us) | 20K speedup | 200K default (us) | 200K parallel (us) | 200K speedup | 2M default (us) | 2M parallel (us) | 2M speedup | 20M default (us) | 20M parallel (us) | 20M speedup | 200M default (us) | 200M parallel (us) | 200M speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nd.copy | 141.3822 | 139.125188 | 1.016223008 | 283.70222 | 143.504143 | 1.976961843 | 1530.663 | 267.05265 | 5.7316906 | 42425.86 | 3845.33405 | 11.033076 | 424523.7 | 29502.2726 | 14.3895266 |
| nd.copyto | 135.1595 | 136.534373 | 0.989930147 | 274.32442 | 93.11835 | 2.945975922 | 1973.446 | 294.510523 | 6.7007664 | 20426.01 | 3140.17932 | 6.50472761 | 153221.1 | 30291.3427 | 5.05824612 |
| nd.asnumpy | 40.90468 | 41.572253 | 0.983941885 | 124.11277 | 41.222572 | 3.010796245 | 1521.81 | 81.857045 | 18.591068 | 32866.75 | 3491.84672 | 9.4124258 | 370223.1 | 27558.4698 | 13.4340953 |
| nd.array | 227.8487 | 190.917651 | 1.193439621 | 1233.6016 | 648.3078 | 1.902802295 | 4065.005 | 2900.36202 | 1.4015509 | 83189.32 | 49047.1363 | 1.69610964 | 819595.4 | 489327.01 | 1.67494413 |
| nd.ones | 107.8765 | 100.541115 | 1.07295866 | 133.58593 | 126.187007 | 1.058634587 | 117.8583 | 114.28992 | 1.0312217 | 1897.2 | 1758.56749 | 1.07883261 | 18391.24 | 18327.9673 | 1.00345244 |
| nd.zeros | 97.52909 | 94.858805 | 1.028150133 | 123.48493 | 116.785367 | 1.057366451 | 114.3217 | 115.57738 | 0.9891357 | 1996.326 | 1979.92325 | 1.00828477 | 18440.56 | 18458.9148 | 0.99900545 |

Sockeye profile on SKX-8180, single socket

  • official master
Time of each OP:
FullyConnected        47694.792 ms      0.78006594485     ms/call       61142  calls    28.58 %
ImperativeBulk        29801.262 ms      1.89865328746     ms/call       15696  calls    17.86 %
CopyCPU2CPU           24502.718 ms      0.668176979084    ms/call       36671  calls    14.68 %
take                  17288.25  ms      0.507969971205    ms/call       34034  calls    10.36 %
log                   12085.495 ms      6.92975630734     ms/call       1744   calls    7.24 %
SliceChannel          10051.787 ms      0.350407411281    ms/call       28686  calls    6.02 %
softmax               4577.321  ms      1.31230533257     ms/call       3488   calls    2.74 %
_mul_scalar           4498.783  ms      2.57957740826     ms/call       1744   calls    2.70 %
Activation            4148.085  ms      0.028656693218    ms/call       144751 calls    2.49 %
elemwise_add          3534.039  ms      0.0621074654669   ms/call       56902  calls    2.12 %
batch_dot             2317.371  ms      0.664383887615    ms/call       3488   calls    1.39 %
DeleteVariable        2129.944  ms      0.0337539856106   ms/call       63102  calls    1.28 %
elemwise_mul          1196.172  ms      0.0140144107413   ms/call       85353  calls    0.72 %
Concat                887.246   ms      0.0558578443717   ms/call       15884  calls    0.53 %
LayerNorm             828.937   ms      0.332106169872    ms/call       2496   calls    0.50 %
repeat                463.889   ms      0.469998986829    ms/call       987    calls    0.28 %
_slice_assign         396.031   ms      0.131659242021    ms/call       3008   calls    0.24 %
Embedding             101.812   ms      0.0568464544947   ms/call       1791   calls    0.06 %
SetupExec             65.807    ms      0.000731863829977 ms/call       89917  calls    0.04 %
Dropout               59.13     ms      0.0167270155587   ms/call       3535   calls    0.04 %
_full                 58.297    ms      0.0317176278564   ms/call       1838   calls    0.03 %
stack                 40.659    ms      0.288361702128    ms/call       141    calls    0.02 %
SequenceMask          36.445    ms      0.0208973623853   ms/call       1744   calls    0.02 %
sum                   27.585    ms      0.0154020100503   ms/call       1791   calls    0.02 %
argsort               21.859    ms      0.465085106383    ms/call       47     calls    0.01 %
WaitForVar            19.604    ms      0.00400653995504  ms/call       4893   calls    0.01 %
expand_dims           18.804    ms      0.00283876811594  ms/call       6624   calls    0.01 %
SwapAxis              12.079    ms      0.1285            ms/call       94     calls    0.01 %
_zeros                10.918    ms      0.0100999074931   ms/call       1081   calls    0.01 %
Reshape               9.636     ms      0.00272588401697  ms/call       3535   calls    0.01 %
SequenceReverse       7.18      ms      0.0763829787234   ms/call       94     calls    0.00 %
tile                  1.412     ms      0.0300425531915   ms/call       47     calls    0.00 %
SequenceLast          0.833     ms      0.0177234042553   ms/call       47     calls    0.00 %
broadcast_to          0.627     ms      0.0133404255319   ms/call       47     calls    0.00 %
broadcast_not_equal   0.55      ms      0.0117021276596   ms/call       47     calls    0.00 %
_slice_assign_scalar  0.473     ms      0.0100638297872   ms/call       47     calls    0.00 %
broadcast_add         0.447     ms      0.00951063829787  ms/call       47     calls    0.00 %
_ones                 0.418     ms      0.00889361702128  ms/call       47     calls    0.00 %
_unravel_index        0.347     ms      0.0073829787234   ms/call       47     calls    0.00 %
Cast                  0.28      ms      0.00595744680851  ms/call       47     calls    0.00 %
zeros_like            0.244     ms      0.00259574468085  ms/call       94     calls    0.00 %

Total OP Time: 166897.56800000 ms
  • parallelized copy
Time of each OP:
FullyConnected        48756.792 ms      0.797435347224    ms/call       61142  calls    33.61 %
ImperativeBulk        29133.28  ms      1.85609582059     ms/call       15696  calls    20.08 %
take                  16485.268 ms      0.484376447082    ms/call       34034  calls    11.36 %
log                   11493.641 ms      6.59039048165     ms/call       1744   calls    7.92 %
SliceChannel          10047.895 ms      0.350271735341    ms/call       28686  calls    6.93 %
softmax               4443.701  ms      1.27399684633     ms/call       3488   calls    3.06 %
Activation            4366.243  ms      0.0301638192482   ms/call       144751 calls    3.01 %
_mul_scalar           4272.24   ms      2.44967889908     ms/call       1744   calls    2.95 %
elemwise_add          3784.498  ms      0.0665090506485   ms/call       56902  calls    2.61 %
CopyCPU2CPU           3337.278  ms      0.0910059174825   ms/call       36671  calls    2.30 %
batch_dot             2183.933  ms      0.626127580275    ms/call       3488   calls    1.51 %
DeleteVariable        1929.185  ms      0.0305724858166   ms/call       63102  calls    1.33 %
elemwise_mul          1111.778  ms      0.013025646433    ms/call       85353  calls    0.77 %
Concat                919.976   ms      0.0579184084613   ms/call       15884  calls    0.63 %
_slice_assign         808.582   ms      0.268810505319    ms/call       3008   calls    0.56 %
LayerNorm             804.849   ms      0.322455528846    ms/call       2496   calls    0.55 %
repeat                643.624   ms      0.652101317123    ms/call       987    calls    0.44 %
Embedding             98.968    ms      0.0552585147962   ms/call       1791   calls    0.07 %
_full                 96.208    ms      0.0523438520131   ms/call       1838   calls    0.07 %
Dropout               60.788    ms      0.017196039604    ms/call       3535   calls    0.04 %
SetupExec             59.639    ms      0.000663267235339 ms/call       89917  calls    0.04 %
stack                 35.195    ms      0.249609929078    ms/call       141    calls    0.02 %
SequenceMask          34.082    ms      0.0195424311927   ms/call       1744   calls    0.02 %
sum                   32.254    ms      0.0180089335567   ms/call       1791   calls    0.02 %
argsort               21.612    ms      0.459829787234    ms/call       47     calls    0.01 %
WaitForVar            19.927    ms      0.0040725526262   ms/call       4893   calls    0.01 %
expand_dims           17.696    ms      0.00267149758454  ms/call       6624   calls    0.01 %
SwapAxis              15.528    ms      0.165191489362    ms/call       94     calls    0.01 %
SequenceReverse       11.077    ms      0.117840425532    ms/call       94     calls    0.01 %
_zeros                10.9      ms      0.0100832562442   ms/call       1081   calls    0.01 %
Reshape               9.855     ms      0.00278783592645  ms/call       3535   calls    0.01 %
_slice_assign_scalar  1.715     ms      0.0364893617021   ms/call       47     calls    0.00 %
tile                  1.378     ms      0.0293191489362   ms/call       47     calls    0.00 %
broadcast_to          1.365     ms      0.0290425531915   ms/call       47     calls    0.00 %
broadcast_not_equal   0.836     ms      0.0177872340426   ms/call       47     calls    0.00 %
SequenceLast          0.835     ms      0.0177659574468   ms/call       47     calls    0.00 %
broadcast_add         0.452     ms      0.0096170212766   ms/call       47     calls    0.00 %
_unravel_index        0.338     ms      0.0071914893617   ms/call       47     calls    0.00 %
Cast                  0.305     ms      0.00648936170213  ms/call       47     calls    0.00 %
_ones                 0.282     ms      0.006             ms/call       47     calls    0.00 %
zeros_like            0.256     ms      0.00272340425532  ms/call       94     calls    0.00 %

Total OP Time: 145054.25400000 ms

@roywei (Member) commented Oct 23, 2018

@XiaotaoChen Thanks for the contribution!
@mxnet-label-bot [pr-awaiting-review]

@marcoabreu added the pr-awaiting-review (PR is waiting for code review) label on Oct 23, 2018
if (to->type_flag_ == from.type_flag_) {
mshadow::Copy(to->FlatTo1D<cpu, DType>(),
from.FlatTo1D<cpu, DType>());
index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
Member:

static?

Contributor Author:

There is an API, mx.test_utils.set_env_var, to set environment variables. It may not work repeatedly if the static modifier is added, as in my [test case](https://github.com/XiaotaoChen/incubator-mxnet/blob/cxt-test/tests/cxt-test/parallel-copy/test_ndarray_copy.py#L89-L92). What's your opinion? @szha

Member:

It seems wasteful to get the variable every time it launches a copy as people don't usually change this setting at runtime. Try testing it with some other approach.

Contributor Author:

ok, I'll do it.
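
One way to avoid the repeated lookup, as suggested above, is to cache the value in a static local. A minimal sketch (whether this still cooperates with mx.test_utils.set_env_var is exactly the concern raised in the test case linked earlier):

```cpp
// Sketch: read MXNET_CPU_PARALLEL_COPY_SIZE only once per process.
// Note: a static makes later changes to the environment variable invisible,
// which is the trade-off discussed in this thread.
static const index_t copy_block_size =
    dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
```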

namespace mxnet {
namespace ndarray {
template<typename DType>
void OMPCopy(const TBlob &from, TBlob *to, const index_t size) {
Member:

This doesn't seem NDArray-specific. Consider putting it in src/common/(utils.h?)
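
A hypothetical shape for such a shared helper in src/common/utils.h, per the suggestion above (the name ParallelCopy and the placement are illustrative, not necessarily the merged layout):

```cpp
// src/common/utils.h -- illustrative sketch of a generic parallel copy helper.
namespace mxnet {
namespace common {

// Copies `size` elements from `src` to `dst`, parallelized with OpenMP
// above the MXNET_CPU_PARALLEL_COPY_SIZE threshold.
template <typename DType>
void ParallelCopy(DType* dst, const DType* src, index_t size);

}  // namespace common
}  // namespace mxnet
```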

index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
const index_t size = from.Size();
if (size >= copy_block_size) {
OMPCopy<DType>(from, to, size);
@szha (Member), Oct 24, 2018:

size check between from and to?

if (size >= copy_block_size) {
OMPCopy<DType>(from, to, size);
} else {
mshadow::Copy(to->FlatTo1D<cpu, DType>(), from.FlatTo1D<cpu, DType>());
Member:

We can probably get rid of the mshadow::Copy here and merge it into the OMP copy. That way you can put the threshold check as a condition in the omp pragma.
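
That suggestion could be expressed with OpenMP's `if` clause, which runs the loop serially below the threshold. A sketch only, using the variable names from the hunk above (the PR ultimately kept an explicit branch so that small copies can use memcpy; see the reply below):

```cpp
// Sketch of "threshold as a pragma condition": serial below the threshold,
// parallel at or above it.
#pragma omp parallel for if (size >= copy_block_size) \
    num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
for (index_t i = 0; i < size; ++i) {
  dst_dptr[i] = src_dptr[i];
}
```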

Contributor Author:

We tested the performance of a single-threaded OMP copy against the memcpy called by mshadow::Copy for data sizes below 200,000, as shown below.

| size | memcpy (us) | omp copy, single thread (us) | speedup |
| --- | --- | --- | --- |
| 20 | 0.0422 | 0.254033 | 0.16612015 |
| 200 | 0.038967 | 0.2407 | 0.16189032 |
| 200 | 0.172933 | 0.389933 | 0.443494139 |
| 20000 | 2.213567 | 2.541833 | 0.870854616 |
| 200000 | 74.105064 | 49.168999 | 1.507150146 |

It shows that memcpy performs better than direct assignment in a single thread when the data size is small. So we want to keep calling mshadow::Copy when the data size is less than MXNET_CPU_PARALLEL_COPY_SIZE. Looking forward to your suggestions. @szha
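
The resulting structure, sketched with the memcpy fallback folded into the helper (simplified from the hunks shown elsewhere in this review; the signature and the way the threshold is obtained may differ in the merged code):

```cpp
// Sketch: OpenMP loop for large copies, plain memcpy for small ones.
// copy_block_size is the MXNET_CPU_PARALLEL_COPY_SIZE threshold read by the caller.
template <typename DType>
void OMPCopy(const TBlob& from, TBlob* to, const index_t size, const index_t copy_block_size) {
  DType* dst = to->dptr<DType>();
  const DType* src = from.dptr<DType>();
  if (size >= copy_block_size) {
    #pragma omp parallel for num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
    for (index_t i = 0; i < size; ++i) {
      dst[i] = src[i];
    }
  } else {
    std::memcpy(dst, src, sizeof(DType) * size);
  }
}
```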

@XiaotaoChen (Contributor Author)

Hi @szha, the code has been modified according to your suggestions, and the test script and log have also been updated.
Could you review it again?

@ankkhedia (Contributor)

@szha It seems that your comments have been addressed. Could you please review the PR again?

const DType* src_dptr = from.dptr<DType>();
#pragma omp parallel for num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
for (index_t i = 0; i < size; ++i) {
dst_dptr[i] = src_dptr[i];
Member:

Might this exceed the boundaries of src and dst? Do we really need the size parameter here?

Contributor Author:

Yes, you are right. We can fetch the data size from either from or to. We need to check that the sizes of from and to are equal before calling OMPCopy, so I don't want to re-fetch this value here; I just pass it in from the Copy function in ndarray/ndarray_function.cc.

@XiaotaoChen force-pushed the parallelize-copy branch 2 times, most recently from 0410f8e to be32560 on November 7, 2018 at 01:17
@kalyc (Contributor) commented Nov 13, 2018

Thanks for your contribution @XiaotaoChen
@szha @TaoLv @nswamy requesting review!

@stu1130 (Contributor) commented Nov 21, 2018

@TaoLv (Member) left a comment:

A minor comment. The rest LGTM.

mshadow::Copy(to->FlatTo1D<cpu, DType>(),
from.FlatTo1D<cpu, DType>());
const index_t size = from.Size();
CHECK_EQ(size, to->Size());
Member:

Please add descriptive error message here.
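
A minimal sketch of such a check with a descriptive message, assuming the CHECK_EQ shown in the hunk above (the exact wording in the merged code may differ):

```cpp
const index_t size = from.Size();
CHECK_EQ(size, to->Size()) << "copy size mismatch: from.Size() = " << size
                           << " but to->Size() = " << to->Size();
```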

Contributor Author:

done.

dst[i] = src[i];
}
} else {
std::memcpy(dst, src, sizeof(DType) * size);
Contributor:

There is a CheckContinuous function call before the memcpy in the original mshadow::Copy function (https://github.com/dmlc/mshadow/blob/696803bd7723ade8230af878460d96c68a550fbc/mshadow/tensor_cpu-inl.h#L132)

Do we not need this any more?

Contributor Author:

Hi, I don't think the CheckContinuous call is necessary, for two reasons:

  1. A TBlob always appears to be contiguous, because of the CheckContiguous function in the TBlob class;
  2. The TBlob inputs (from, to) are converted to Tensor (via to->FlatTo1D()) in the original implementation of ndarray::Copy<cpu, cpu>. According to the code of FlatTo1D, get_with_shape, the Tensor constructor, and Tensor's CheckContiguous, when a TBlob is converted to a Tensor the Tensor's stride_ is assigned shape[dim - 1], so Tensor's CheckContiguous always returns true. There is no need to check this.
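
For reference, the check being discussed reduces to comparing the last-dimension stride with the last-dimension length; an illustrative (not verbatim) version of why it is trivially true after FlatTo1D:

```cpp
// Illustrative only: after FlatTo1D, stride_ is set to shape[dim - 1],
// so a contiguity check of this form always holds for the flattened tensor.
inline bool IsContiguous(index_t stride, index_t last_dim_size) {
  return stride == last_dim_size;
}
```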

- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
- When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
- When the array size is less than this threshold, NDArray::Copy(from , to)) is implemented by mshadow::Copy(to, from) in single thread.
Contributor:

This is no longer accurate, right, since it is just implemented using memcpy?

Contributor Author:

it's done.

@yuxihu (Member) left a comment:

LGTM overall. Just a minor nitpick.


* MXNET_CPU_PARALLEL_COPY_SIZE
- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
Member:

nit: openMP => OpenMP?

* MXNET_CPU_PARALLEL_COPY_SIZE
- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
- When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
Member:

To be more accurate: bigger than or equal to?

Contributor Author:

Thanks, it's done.

@apeforest (Contributor) left a comment:

LGTM

@yuxihu (Member) left a comment:

LGTM

@anirudh2290 (Member)

@XiaotaoChen I tried your benchmark scripts on p2.8xlarge, which has 16 cores. I don't see much perf difference. Can we keep the default at -1 and use memcpy when it is set to the default? This way we won't impact existing users.

Here are my results:

*******************the default copy perf test*******************
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
<module 'mxnet' from '/home/ubuntu/experimentals/xiataochen_mxnet/python/mxnet/__init__.pyc'>
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 98.951658 (us)
mx.nd.copyto avg time: 83.843867 (us)
mx.nd.asnumpy avg time: 28.649966 (us)
mx.nd.array avg time: 128.841400 (us)
mx.nd.ones avg time: 86.990992 (us)
mx.nd.zeros avg time: 87.682406 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 202.488899 (us)
mx.nd.copyto avg time: 171.939532 (us)
mx.nd.asnumpy avg time: 120.393435 (us)
mx.nd.array avg time: 787.146886 (us)
mx.nd.ones avg time: 292.460124 (us)
mx.nd.zeros avg time: 288.883845 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1131.010056 (us)
mx.nd.copyto avg time: 1066.279411 (us)
mx.nd.asnumpy avg time: 1369.468371 (us)
mx.nd.array avg time: 4298.790296 (us)
mx.nd.ones avg time: 187.500318 (us)
mx.nd.zeros avg time: 167.926153 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 27297.194799 (us)
mx.nd.copyto avg time: 9794.155757 (us)
mx.nd.asnumpy avg time: 27480.689685 (us)
mx.nd.array avg time: 70762.650172 (us)
mx.nd.ones avg time: 3512.398402 (us)
mx.nd.zeros avg time: 2739.222844 (us)
*******************parallelize copy perf test*******************
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
<module 'mxnet' from '/home/ubuntu/experimentals/xiataochen_mxnet/python/mxnet/__init__.pyc'>
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 102.551778 (us)
mx.nd.copyto avg time: 83.732605 (us)
mx.nd.asnumpy avg time: 28.316180 (us)
mx.nd.array avg time: 126.830737 (us)
mx.nd.ones avg time: 93.253454 (us)
mx.nd.zeros avg time: 93.762080 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 187.253952 (us)
mx.nd.copyto avg time: 173.560778 (us)
mx.nd.asnumpy avg time: 120.075544 (us)
mx.nd.array avg time: 787.965457 (us)
mx.nd.ones avg time: 283.519427 (us)
mx.nd.zeros avg time: 283.678373 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1089.898745 (us)
mx.nd.copyto avg time: 1079.169909 (us)
mx.nd.asnumpy avg time: 1450.212797 (us)
mx.nd.array avg time: 4908.100764 (us)
mx.nd.ones avg time: 194.748243 (us)
mx.nd.zeros avg time: 181.714694 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 27326.941490 (us)
mx.nd.copyto avg time: 9780.820211 (us)
mx.nd.asnumpy avg time: 27630.313238 (us)
mx.nd.array avg time: 70585.219065 (us)
mx.nd.ones avg time: 3420.662880 (us)
mx.nd.zeros avg time: 3225.604693 (us)

@XiaotaoChen (Contributor Author)

@anirudh2290 Thanks for your feedback. Because my quota for p2.8xlarge is 0, I tested this on an r4.8xlarge instance (16 physical cores, the same as a p2.8xlarge instance) just now. The speedup is consistent with my previous results. According to your log, the default copy and the parallelized copy both seem to be running in a single thread. Have you set OMP_THREADS or another environment variable? My test log is below.

*******************the default copy perf test*******************
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 89.907646 (us)
mx.nd.copyto avg time: 73.647499 (us)
mx.nd.asnumpy avg time: 27.632713 (us)
mx.nd.array avg time: 140.659014 (us)
mx.nd.ones avg time: 80.291430 (us)
mx.nd.zeros avg time: 64.857801 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 243.878365 (us)
mx.nd.copyto avg time: 185.608864 (us)
mx.nd.asnumpy avg time: 86.077054 (us)
mx.nd.array avg time: 1034.633319 (us)
mx.nd.ones avg time: 160.853068 (us)
mx.nd.zeros avg time: 163.189570 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 1628.462474 (us)
mx.nd.copyto avg time: 1602.705320 (us)
mx.nd.asnumpy avg time: 1416.873932 (us)
mx.nd.array avg time: 4925.394058 (us)
mx.nd.ones avg time: 199.715296 (us)
mx.nd.zeros avg time: 169.722239 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 34876.290957 (us)
mx.nd.copyto avg time: 14724.659920 (us)
mx.nd.asnumpy avg time: 27063.266436 (us)
mx.nd.array avg time: 68980.081876 (us)
mx.nd.ones avg time: 3477.557500 (us)
mx.nd.zeros avg time: 3251.512845 (us)
size: 200000000, shape:(20000, 10000)
mx.nd.copy avg time: 318985.883395 (us)
mx.nd.copyto avg time: 123453.982671 (us)
mx.nd.asnumpy avg time: 301323.723793 (us)
mx.nd.array avg time: 671578.780810 (us)
mx.nd.ones avg time: 30080.588659 (us)
mx.nd.zeros avg time: 30204.280217 (us)
*******************parallelize copy perf test*******************
size: 20000, shape:(20000, 1)
mx.nd.copy avg time: 87.475777 (us)
mx.nd.copyto avg time: 74.092547 (us)
mx.nd.asnumpy avg time: 27.672450 (us)
mx.nd.array avg time: 139.975548 (us)
mx.nd.ones avg time: 69.546700 (us)
mx.nd.zeros avg time: 65.541267 (us)
size: 200000, shape:(20000, 10)
mx.nd.copy avg time: 1151.879628 (us)
mx.nd.copyto avg time: 151.093801 (us)
mx.nd.asnumpy avg time: 41.262309 (us)
mx.nd.array avg time: 742.316246 (us)
mx.nd.ones avg time: 110.252698 (us)
mx.nd.zeros avg time: 108.536084 (us)
size: 2000000, shape:(20000, 100)
mx.nd.copy avg time: 914.112727 (us)
mx.nd.copyto avg time: 1570.677757 (us)
mx.nd.asnumpy avg time: 210.316976 (us)
mx.nd.array avg time: 4106.410344 (us)
mx.nd.ones avg time: 218.836466 (us)
mx.nd.zeros avg time: 188.517570 (us)
size: 20000000, shape:(20000, 1000)
mx.nd.copy avg time: 6437.516212 (us)
mx.nd.copyto avg time: 5406.737328 (us)
mx.nd.asnumpy avg time: 6036.376953 (us)
mx.nd.array avg time: 44171.086947 (us)
mx.nd.ones avg time: 3299.037615 (us)
mx.nd.zeros avg time: 3055.596352 (us)
size: 200000000, shape:(20000, 10000)
mx.nd.copy avg time: 58142.248789 (us)
mx.nd.copyto avg time: 45039.542516 (us)
mx.nd.asnumpy avg time: 44913.562139 (us)
mx.nd.array avg time: 416787.425677 (us)
mx.nd.ones avg time: 21619.351705 (us)
mx.nd.zeros avg time: 19476.739566 (us)

@anirudh2290 (Member)

@XiaotaoChen Apologies, I was running on a different branch. The numbers look good now. Thanks for the good work!

@anirudh2290 merged commit 6fd4384 into apache:master on Nov 29, 2018
@XiaotaoChen deleted the parallelize-copy branch on September 28, 2019 at 10:25