parallelize NDArray::Copy<cpu, cpu> when data size is large #12926
Conversation
@XiaotaoChen Thanks for the contribution!
src/ndarray/ndarray_function.cc (Outdated)
```cpp
if (to->type_flag_ == from.type_flag_) {
  mshadow::Copy(to->FlatTo1D<cpu, DType>(),
                from.FlatTo1D<cpu, DType>());
  index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
```
static?
There is an API, mx.test_utils.set_env_var, for setting environment variables. It may not work repeatedly if the static modifier is added, as in my [test case](https://github.com/XiaotaoChen/incubator-mxnet/blob/cxt-test/tests/cxt-test/parallel-copy/test_ndarray_copy.py#L89-L92). What's your opinion? @szha
It seems wasteful to read the variable every time a copy is launched, since people don't usually change this setting at runtime. Try testing it with some other approach.
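For reference, a common way to avoid the repeated lookup is a function-local static; a minimal sketch under the assumption that dmlc::GetEnv is used as in the patch (the helper name is made up):

```cpp
#include <dmlc/parameter.h>  // dmlc::GetEnv

#include <cstdint>

// Hypothetical helper, not the code in this PR: read
// MXNET_CPU_PARALLEL_COPY_SIZE once and cache it in a function-local static.
// Trade-off discussed above: once cached, later changes to the environment
// variable (e.g. via mx.test_utils.set_env_var) are not observed.
inline int64_t ParallelCopyThreshold() {
  static const int64_t threshold =
      dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
  return threshold;
}
```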
ok, I'll do it.
src/ndarray/ndarray_function.cc (Outdated)
```cpp
namespace mxnet {
namespace ndarray {
template<typename DType>
void OMPCopy(const TBlob &from, TBlob *to, const index_t size) {
```
This doesn't seem NDArray-specific. Consider putting it in src/common/(utils.h?)
src/ndarray/ndarray_function.cc (Outdated)
```cpp
index_t copy_block_size = dmlc::GetEnv("MXNET_CPU_PARALLEL_COPY_SIZE", 200000);
const index_t size = from.Size();
if (size >= copy_block_size) {
  OMPCopy<DType>(from, to, size);
```
size check between from and to?
src/ndarray/ndarray_function.cc (Outdated)
```cpp
if (size >= copy_block_size) {
  OMPCopy<DType>(from, to, size);
} else {
  mshadow::Copy(to->FlatTo1D<cpu, DType>(), from.FlatTo1D<cpu, DType>());
```
We can probably get rid of the mshadow::Copy here and merge it into the OMP copy. That way you can put the threshold check as a condition in the pragma omp.
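If I read the suggestion correctly, the threshold could become an OpenMP `if` clause so that one loop covers both paths; a rough sketch with illustrative names (not the actual patch):

```cpp
#include <cstdint>

// Sketch only: the loop is parallelized when `size` reaches `threshold`
// (e.g. the value of MXNET_CPU_PARALLEL_COPY_SIZE) and runs serially otherwise.
template <typename DType>
void ThresholdedCopy(DType* dst, const DType* src,
                     int64_t size, int64_t threshold) {
#pragma omp parallel for if (size >= threshold)
  for (int64_t i = 0; i < size; ++i) {
    dst[i] = src[i];
  }
}
```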
We tested the performance of the OMP copy in a single thread against the memcpy called by mshadow::Copy when the data size is less than 200,000, as shown below.

| size | 20 | 200 | 2000 | 20000 | 200000 |
|---|---|---|---|---|---|
| memcpy (us) | 0.0422 | 0.038967 | 0.172933 | 2.213567 | 74.105064 |
| omp copy, single thread (us) | 0.254033 | 0.2407 | 0.389933 | 2.541833 | 49.168999 |
| speedup | 0.16612015 | 0.16189032 | 0.443494139 | 0.870854616 | 1.507150146 |

This shows that memcpy performs better than direct element-wise assignment in a single thread when the data size is small, so we want to keep calling mshadow::Copy when the data size is less than MXNET_CPU_PARALLEL_COPY_SIZE. Looking forward to your suggestions. @szha
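In other words, the shape being argued for is roughly the following, keeping memcpy on the small-size path; an illustrative sketch assuming contiguous buffers of the same element type (the names are not from the PR):

```cpp
#include <cstdint>
#include <cstring>  // std::memcpy

// Illustrative only: element-wise OpenMP copy at or above the threshold,
// plain memcpy below it (where the table above shows memcpy winning).
template <typename DType>
void HybridCopy(DType* dst, const DType* src, int64_t size, int64_t threshold) {
  if (size >= threshold) {
#pragma omp parallel for
    for (int64_t i = 0; i < size; ++i) {
      dst[i] = src[i];
    }
  } else {
    std::memcpy(dst, src, sizeof(DType) * size);
  }
}
```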
Force-pushed from 5575ad1 to da53ec1.
Hi @szha, the code has been revised according to your suggestions, and the test script and log have also been updated.
@szha It seems that your comments have been addressed. Could you please review the PR again?
src/common/utils.h (Outdated)
```cpp
const DType* src_dptr = from.dptr<DType>();
#pragma omp parallel for num_threads(engine::OpenMP::Get()->GetRecommendedOMPThreadCount())
for (index_t i = 0; i < size; ++i) {
  dst_dptr[i] = src_dptr[i];
```
Could this exceed the boundaries of src and dst? Do we really need the size parameter here?
Yes, you are right. We can fetch the data size from either from or to, but we need to check that the sizes of from and to are equal before calling OMPCopy. So I don't want to re-fetch it there; I just pass the parameter in from the Copy function in ndarray/ndarray_function.cc.
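A sketch of that arrangement, with the size check done once in the caller and a single validated size passed down (illustrative; it assumes the OMPCopy signature from this PR):

```cpp
#include <dmlc/logging.h>       // CHECK_EQ
#include <mxnet/tensor_blob.h>  // mxnet::TBlob

// Illustrative caller: verify that `from` and `to` hold the same number of
// elements, then pass the already-validated size to the parallel copy routine.
template <typename DType>
void CheckedCopy(const mxnet::TBlob& from, mxnet::TBlob* to) {
  const mxnet::index_t size = from.Size();
  CHECK_EQ(size, to->Size())
      << "Copy requires from and to to have the same size: "
      << size << " vs " << to->Size();
  OMPCopy<DType>(from, to, size);  // OMPCopy as introduced in this PR
}
```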
Force-pushed from 0410f8e to be32560.
Force-pushed from 98ae065 to 11d20e4.
Thanks for your contribution @XiaotaoChen
@szha @TaoLv @eric-haibin-lin @apeforest @yuxihu @samskalicky ping for review
TaoLv left a comment:
A minor comment. The rest LGTM.
src/ndarray/ndarray_function.cc (Outdated)
```cpp
mshadow::Copy(to->FlatTo1D<cpu, DType>(),
              from.FlatTo1D<cpu, DType>());
const index_t size = from.Size();
CHECK_EQ(size, to->Size());
```
Please add a descriptive error message here.
done.
```cpp
    dst[i] = src[i];
  }
} else {
  std::memcpy(dst, src, sizeof(DType) * size);
```
There is a CheckContinuous function call before the memcpy in the original mshadow::Copy function (https://github.com/dmlc/mshadow/blob/696803bd7723ade8230af878460d96c68a550fbc/mshadow/tensor_cpu-inl.h#L132). Do we not need this any more?
Hi, I don't think the CheckContinuous call is necessary, for two reasons:
- the TBlob class always appears to be contiguous, because of the CheckContiguous function in the TBlob class;
- the inputs (from, to) of TBlob type are converted to Tensor (by to->FlatTo1D()) in the original implementation of ndarray::Copy<cpu, cpu>. According to the code of FlatTo1D, get_with_shape, the constructor of Tensor, and the CheckContiguous function in Tensor, when a TBlob input is converted to a Tensor, the Tensor's stride_ is assigned shape[dim - 1], so the CheckContiguous function in Tensor always returns true. There is no need to check this.
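A toy model of that argument (not mshadow source code; the names are made up): flattening to 1-D sets the stride equal to the single dimension, so a "stride equals last dimension" contiguity test cannot fail afterwards.

```cpp
#include <cassert>
#include <cstdint>

// Toy 1-D view: `dim` is the number of elements, `stride` the row stride.
struct Flat1D {
  int64_t dim;
  int64_t stride;
  // Mimics a contiguity check of the form "stride equals the last dimension".
  bool CheckContiguousLike() const { return stride == dim; }
};

// Mimics the flattening step: the stride is assigned from shape[dim - 1],
// which for a 1-D view is the total element count itself.
inline Flat1D FlattenTo1D(int64_t total_elements) {
  return Flat1D{total_elements, total_elements};
}

int main() {
  assert(FlattenTo1D(12345).CheckContiguousLike());  // trivially true
  return 0;
}
```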
docs/faq/env_var.md (Outdated)
- Values: Int ```(default=200000)```
- The minimum size to call parallel copy by openMP in CPU2CPU mode.
- When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
- When the array size is less than this threshold, NDArray::Copy(from, to) is implemented by mshadow::Copy(to, from) in single thread.
This is no longer accurate, right, since it is just implemented using memcpy?
it's done.
yuxihu left a comment:
LGTM overall. Just a minor nitpick.
docs/faq/env_var.md (Outdated)
* MXNET_CPU_PARALLEL_COPY_SIZE
  - Values: Int ```(default=200000)```
  - The minimum size to call parallel copy by openMP in CPU2CPU mode.
nit: openMP => OpenMP?
docs/faq/env_var.md (Outdated)
* MXNET_CPU_PARALLEL_COPY_SIZE
  - Values: Int ```(default=200000)```
  - The minimum size to call parallel copy by openMP in CPU2CPU mode.
  - When the array size is bigger than this threshold, NDArray::Copy(from, to) is implemented by OpenMP with the Recommended OMP Thread Count.
To be more accurate: bigger than or equal to?
Thanks, it's done.
apeforest left a comment:
LGTM
yuxihu left a comment:
LGTM
@XiaotaoChen I tried your benchmark scripts on p2.8xlarge, which has 16 cores. I don't see much perf difference. Can we keep the default at -1 and use memcpy when it is set to the default? This way we won't impact existing users. Here are my results:
@anirudh2290 Thanks for your feedback. Because my quota for p2.8xlarge is 0, I tested this on an r4.8xlarge instance (16 physical cores, same as the p2.8xlarge instance) just now. The speedup is consistent with my previous results. According to your log, the
@XiaotaoChen Apologies, I was running on a different branch. The numbers look good now. Thanks for the good work!
Description
Comments
We are optimizing Sockeye's performance on CPU and found that the CopyCPU2CPU operation takes too much time, accounting for roughly 14% of the overall time (detailed data below). So we parallelized the NDArray::Copy<cpu, cpu> operator.
As far as I know, NDArray::Copy<cpu, cpu> is called from the front end by these functions: nd.copy(), nd.copyto(), nd.asnumpy(), nd.array(), nd.ones(), nd.zeros(). These functions gain a 2-14x speedup when the data size ranges from 20000 to 2e8. Sockeye's performance increased from 17.10 sent/sec to 19.19 sent/sec on CPU inference, and the CopyCPU2CPU time overhead dropped from 14.68% to 2.3%. Detailed data is below.
The logs and scripts are here: logs and scripts.
@pengzhao-intel
Detailed data
Hardware: skx-8180, single socket (28 cores)
Hardware: bdw-2699 (44 cores)
Sockeye profile on skx-8180, single socket