
Improve performance of join

Open umar456 opened this issue 7 years ago • 15 comments

The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.

This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals. Ping me if you want additional info.

umar456 avatar Jan 21 '19 23:01 umar456

I could be interested if the new implementation would be faster for either afcuda or afcpu. Would it be?

WilliamTambellini avatar Jan 22 '19 03:01 WilliamTambellini

This would be faster on afcuda. I haven't looked into afcpu's implementation. This will require you to modify the CUDA kernel implementation.

umar456 avatar Jan 22 '19 03:01 umar456

I don't think there will be a big improvement on the CPU backend.

umar456 avatar Jan 22 '19 03:01 umar456

OK, a speed-up for afcuda would be good enough. Please add more details here and I will see whether I can do it.

WilliamTambellini avatar Jan 22 '19 21:01 WilliamTambellini

The join function seems to be calling the join kernel for each buffer that needs to be joined. It would be better if we performed this operation in one kernel call instead of multiple calls. Here are the steps you can take to do this:

  1. Allocate a buffer to store the pointers to buffers that need to be joined.
  2. Allocate a buffer for the offsets. You could do this with the previous buffer but a separate buffer may be easier.
  3. Modify the kernel so that you loop over each of the arrays. This may be a simple modification based on what I saw. You probably just need to add another loop to iterate over each of the buffers.
  4. Modify the host code to call the new function. Remove the join wrapper, as it's now unnecessary.

umar456 avatar Jan 26 '19 18:01 umar456

Is this issue still open? If it is, I would like to work on it

UmashankarTriforce avatar Sep 30 '19 08:09 UmashankarTriforce

Hello, is it possible to still work on this?

rahul-ar avatar Feb 03 '21 13:02 rahul-ar

@UmashankarTriforce Sorry for the late response.

@UmashankarTriforce @rahul-ar Yes, join CUDA/OpenCL can still use this performance improvement.

9prady9 avatar Feb 03 '21 15:02 9prady9

@9prady9 I am an absolute beginner with CUDA and would like to use this as an opportunity to become familiar. What is your preferred mode of communication to get in touch with you to know more about this?

rahul-ar avatar Feb 03 '21 15:02 rahul-ar

@umar456 I am not convinced that putting everything together into 1 kernel will be the fastest solution. I'm especially thinking about:

  • a low cache hit rate, since you are reading from multiple buffers in parallel, which will hurt performance for large arrays.
  • initializing the extra buffers, which also launches 1 kernel each, hurting small arrays due to the overhead.

Would it be possible to publish different models/methods, so that they can be compared on speed? At the same time, it would be an excellent opportunity to learn from each other's models (very small changes can have big impacts). I suppose just creating different PRs for the same issue is not the best approach.

willyborn avatar Feb 03 '21 18:02 willyborn

@rahul-ar We have our Slack channel for quick questions, but it would be better to iron out the approach/design here before going ahead with the implementation. Given that you are new to CUDA/OpenCL programming, I would say play with hello-world style samples from the CUDA toolkit to get a taste of the differences in syntax and the APIs you need to use for CUDA programming. Only then would it make sense to venture into ArrayFire's code base. At that point you can go through the Development Guidelines section on our wiki page to understand the ArrayFire code workflow. Given that you are new to CUDA programming itself, I would personally suggest playing with the CUDA samples first, then starting with something like adding batch support for existing functions, which lets you understand the ArrayFire code base before doing performance enhancements.

@willyborn Definitely, what works for small arrays may not work for big arrays. I think it would be best to lay out the methods here (in this GitHub issue), benchmark them with different problem sizes, and then weigh the pros/cons of the different methods. Feel free to share your method here along with any available benchmarks. We don't have any numbers for the single-kernel approach yet; nevertheless, that's one way to handle multiple array joins. We shall share the numbers once we generate them.

9prady9 avatar Feb 04 '21 10:02 9prady9

Here are intermediate results on join improvements for OpenCL (CUDA & CPU will follow later).

Improvements vary depending on the array dimensions (from 7% up to 700x faster). Please consult the attached spreadsheet for the individual measurements. The best algorithm is join-Copy11. Please provide suggestions/remarks for further improvement. Confirmation of the performance results on other hardware would be welcome.

Join.xlsx

willyborn avatar Mar 25 '21 10:03 willyborn

@willyborn Thank you, will have a look soon and let you know.

9prady9 avatar Apr 06 '21 11:04 9prady9

@willyborn Sorry about the delay on my end, I just returned to work after a long break. I see that you have already raised PRs #3144 and #3145. Do they address this issue in total? Also, are they related in any fashion?

9prady9 avatar Jun 14 '21 14:06 9prady9

@pradeep No worries. Remarks remain welcome, up to the point they are merged.

PR#3144 is about join, memcopy and the JIT. PR#3145 is about the usage of join in 2 places. Both places have the following scenario: "{ af::array tmp = join(1,A,B); af::array FIN = join(1,tmp,C); }", which can be faster as "{ af::array FIN = join(1,A,B,C); }". Without PR#3144 you already get a 2x improvement, thanks to the elimination of the tmp buffer (1 read and 1 write less). With PR#3144 it is even more, because A, B and C remain un-evaluated JIT arrays and the improved kernel parameters also help.

While testing PR#3144, I checked all places that use join, and found both cases.


willyborn avatar Jun 14 '21 18:06 willyborn