arrayfire icon indicating copy to clipboard operation
arrayfire copied to clipboard

OPT: Improved memcopy, JIT & join

Open willyborn opened this issue 4 years ago • 31 comments

  • workload per thread is increased to optimize for memcopy and JIT.
  • memcopy can now increase its unit size
  • JIT now can write directly into the final buffer, instead of first evaluating into a temp buffer
  • JIT now performs multiple copies in parallel (when same size)
  • join is making optimal selection between (memcopy and JIT)

Description

  • No feature changes for the end-user, only a stable performance improvement. Improvements are extreme for:
    • vectors in 2nd, 3th or 4th dimension
    • not yet evaluated arrays (JIT nodes)
    • arrays fully contained in L2 cache

Fixes: #2417

Changes to Users

Performance improvement, less dependent on dimension.

Checklist

  • [ x ] Rebased on latest master
  • [ x ] Code compiles
  • [ x ] Tests pass
  • [ ] Functions added to unified API
  • [ ] Functions documented

willyborn avatar Jun 09 '21 18:06 willyborn

Some extra comments to the realized performance impact:

  • copy linear array (MAX throughput) -- should be highest possible throughput is performed by the enclosed copy functions (OCL & CUDA). The small improvements come from C++ optimizations in copy.cpp
  • A[+10,,][,,].copy (Non linear) -- improvement: (CU)1x..84x / (CL)3x..120x an evaluated array is copy to a new array which 1st dimension is 10 items bigger. When the 1st dim is 1, this means that only 1 byte in 10 of the cached memory will be written explaining the slow performance at the end. (Still 120x faster than with master)
  • JIT.copy (Linear) -- improvement: (CU)3.7x / (CL)3.5x The array to be copied is not yet evaluated. In this case we do no longer evaluate it into a intermediate buffer after which the result is joined into the final buffer. The evaluation writes directly into the final buffer, eliminating a write to + read from the internal buffer. This explains why have a throughput higher than HW specified. Memory consumption is also lower, since the intermediate buffer is no longer allocated.
  • join(0,A,B) -- improvement: (CU)2%..71% / (CL)4%..3x The common (optimized) memcopy function is now used. Only evaluated arrays are measured here. The performances are now more stable, less independent from the array dimensions. In case we hit the HW limit, no further improvement is possible.
  • join(1,A,B) -- improvement: (CU)5%..47x / (CL)6%..50x Performance for vectors in 2nd, 3th or 4th dimension were very bad. Now they perform similar as the rest.
  • join(3,A,B) -- improvement: (CU)6%..360x / (CL)8%..711x Same remark as join(1,A,B)
  • join(0,A,B,C,D) -- improvement: (CU)7%..70% / (CL)8%..3x Same remark as join(0,A,B)
  • join(1,A,B,C,D) -- improvement: (CU)5%..46x / (CL)7%..50x same remark as join(1,A,B)
  • join(0,4xJIT) -- improvement: (CU)3x..5x / (CL)3x..7x Improvements of JIT generation and direct writing into final buffer
  • join(1,4xJIT) -- improvement: (CU)3x..101x / (CL)3x..108x Improvements of JIT generation and direct writing into final buffer. The JIT operation itself is also further optimized resulting in the exceptional improvement of 108x
  • Linear.as(f32|c32) only char -- improvement: (CU)20%..27% / (CL)18%..20% The improvements to the JIT engine is the only contributor this result.
  • B[+1,,][,,f32|c32] -- improvement: (CU)0%..211x / (CL)0%.. 297x This is the only place where the copy (reshape & scale) function is measured.

The CUDA scheduler optimizes much more than the OCL scheduler, even on the same HW. In the machine learning example, the needed time for 250 epochs reduced by 10%.

More details, with the specific measures can be found in the file below: Join.xlsx

willyborn avatar Jun 09 '21 19:06 willyborn

Hi @willyborn,

This is a tremendous change. Thank you for your contribution. I am looking at the benchmark numbers and they look fantastic. I am running my own benchmarks and I am seeing some good results as well. I did notice that the kernels are failing if the older kernel cache is not removed. I suppose that would only be an issue in development branches. Once the version number is updated it the errors will not appear for the end user. I want to understand the kernels that are being generated and I will give my feedback in a couple of days.

umar456 avatar Jun 11 '21 23:06 umar456

@9prady9 The reason the old kernels are no longer valid is a result of the hash calculation. For JIT kernels, we only take the function name (backend/cuda/jit.cpp:207) into account but not the real code. The function name remained the same, although the code behind (and corresponding parameters to call it) changed. A solution could be that we include a 'fixed version number' into the hash calculation, so that we can force the generation of a new set of hash codes.

willyborn avatar Jul 31 '21 22:07 willyborn

@willyborn I can't recollect why I avoided hashing JIT code, perhaps to avoid extra computation time.

Fixed version solution might not work in a couple of corner cases such as when a particular Node's logic changed but the name of the JIT tree comes out same as earlier even with a version(JIT version sort of this) suffix. I think it is better we hash JIT source as well to prevent problems such as these unless it is adding a performance hit to JIT engine.

9prady9 avatar Aug 01 '21 04:08 9prady9

This will have a serious performance impact, since the code generation is take more time than the exécution of the resulting kernel. You will also have to generate the kernel code, even when cached, because the hash key Is used to search for the cached kernel.

The problem is however not that big as it seams because the version nr of arrayfire (3.9) is included in the filename. As result endusers will never notice it.

On Sun, Aug 1, 2021, 06:38 pradeep @.***> wrote:

@willyborn https://github.com/willyborn I can't recollect why I avoided hashing JIT code, perhaps to avoid extra computation time.

Fixed version solution might not work in a couple of corner cases such as when a particular Node's logic changed but the name of the JIT tree comes out same as earlier even with a version(JIT version sort of this) suffix. I think it is better we hash JIT source as well to prevent problems such as these unless it is adding a performance hit to JIT engine.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#issuecomment-890449055, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPDKLO3BRKWWA77GKMDT2TFVNANCNFSM46MTN7OQ .

willyborn avatar Aug 01 '21 10:08 willyborn

This will have a serious performance impact, since the code generation is take more time than the exécution of the resulting kernel. You will also have to generate the kernel code, even when cached, because the hash key Is used to search for the cached kernel. The problem is however not that big as it seams because the version nr of arrayfire (3.9) is included in the filename. As result endusers will never notice it. On Sun, Aug 1, 2021, 06:38 pradeep @.***> wrote: @willyborn https://github.com/willyborn I can't recollect why I avoided hashing JIT code, perhaps to avoid extra computation time. Fixed version solution might not work in a couple of corner cases such as when a particular Node's logic changed but the name of the JIT tree comes out same as earlier even with a version(JIT version sort of this) suffix. I think it is better we hash JIT source as well to prevent problems such as these unless it is adding a performance hit to JIT engine. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3144 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPDKLO3BRKWWA77GKMDT2TFVNANCNFSM46MTN7OQ .

I haven't timed the hashing mechanism. But yes, it needs to be hashed at each lookup which would be inefficient.

I think we can do away with re-using AF_REVISION defined in version.hpp which is generated during build. It has the commit hash and that should do the trick - adding AF_REVISION to the hashed inputs for JIT workflow.

9prady9 avatar Aug 01 '21 14:08 9prady9

I already had all the dims available in int format and no longer in dim_t format, because they are updated inside the 'memcopy.hpp calls'. On top, most of the OpenCL kernels are converting all dims / idx into int inside their kernels.

Finally, I wanted to speed things up more, avoiding conversions and reducing the bytes sent to the GPU. If I have to send the dim_t version, I have to convert on the CPU and later convert again on GPU.

Continuing calculations in dim_t format is really much slower than int on OpenCL. I did not have the same impression on CUDA, even when running on the same HW.

On Tue, 3 Aug 2021 at 16:55, pradeep @.***> wrote:

@.**** commented on this pull request.

In src/backend/opencl/kernel/copy.cl https://github.com/arrayfire/arrayfire/pull/3144#discussion_r681840303:

@@ -8,16 +8,14 @@ ********************************************************/

typedef struct {

  • dim_t dim[4];
  • int dims[4];

is there a reason we can't use KParam like the other kernels ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#discussion_r681840303, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPDHXLLXZN3DT3EHBHDT277P7ANCNFSM46MTN7OQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

willyborn avatar Aug 03 '21 19:08 willyborn

I notice that this PR is already closed.

Does it still make sense to update the code with the comments? Can it still be merged into master or am I too late for code updates?

I will independently continue trying to answer all your the questions.

willyborn avatar Aug 03 '21 19:08 willyborn

@willyborn I was under the impression you closed it intentionally to avoid repetitive ci builds or something similar.

9prady9 avatar Aug 04 '21 01:08 9prady9

@willyborn Also, can you please remove the merge commit from this branch

With your master even with arayfire's master branch, git rebase master should remove it the merge commit. Also, feel free to squash the trace statement fix into your commit. I did these locally for easy review/test. If you want me to push, I can do that as well. Just let me know.

9prady9 avatar Aug 04 '21 03:08 9prady9

All remarks are included now, except 1 on 'src/backend/cuda/kernel/memcopy.cuh ' from @9prady9 where I need some help.

I will start the testing before Releasing. After the discussions, I got some ideas to improve further on collapseShape & increasePerThreadWorkload. This can take some time, because test runs typically take a few hours each.

willyborn avatar Aug 11 '21 20:08 willyborn

@willyborn Sorry about the delay on my end. Have been caught up with some work and other things. I will take a stab at this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR that is touching jit code #3177 which will have some conflicts in it. We might have rebase this one at that point again. Just giving you an heads up.

9prady9 avatar Oct 11 '21 11:10 9prady9

No problem, I am continuing with the info provided last time. It made many things clear, resulting in much simplification.

I will provide a separate PR for the JIT disk caching. I found a way so that it reacts now identical to the rest, without extra performance impact. For this PR the JIT code is even not impacted. I will prepare this before continuing with the copy bench tests.

Willy

On Mon, Oct 11, 2021, 13:23 pradeep @.***> wrote:

@willyborn https://github.com/willyborn Sorry about the delay on my end. Have been caught up with some work and other things. I will take a stab at this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR that is touching jit code #3177 https://github.com/arrayfire/arrayfire/pull/3177 which will have some conflicts in it. We might have rebase this one at that point again. Just giving you an heads up.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#issuecomment-939938466, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPEGTN2QIWNQNCYFPATUGLCMHANCNFSM46MTN7OQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

willyborn avatar Oct 11 '21 13:10 willyborn

Pradeep,

Do not rebase, since the code still is in progress. I know it is taking a long time, although I'm using this opportunity to better understand which factors are impacting performance. For the moment, I am around 40% better for u8 and 15% for f32, so still progressing (although I am approaching the HW limit). I will rebase before pushing new code.

On Mon, 11 Oct 2021 at 15:40, Sabine & Willy Born < @.***> wrote:

No problem, I am continuing with the info provided last time. It made many things clear, resulting in much simplification.

I will provide a separate PR for the JIT disk caching. I found a way so that it reacts now identical to the rest, without extra performance impact. For this PR the JIT code is even not impacted. I will prepare this before continuing with the copy bench tests.

Willy

On Mon, Oct 11, 2021, 13:23 pradeep @.***> wrote:

@willyborn https://github.com/willyborn Sorry about the delay on my end. Have been caught up with some work and other things. I will take a stab at this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR that is touching jit code #3177 https://github.com/arrayfire/arrayfire/pull/3177 which will have some conflicts in it. We might have rebase this one at that point again. Just giving you an heads up.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#issuecomment-939938466, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPEGTN2QIWNQNCYFPATUGLCMHANCNFSM46MTN7OQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

willyborn avatar Oct 11 '21 14:10 willyborn

@willyborn Any idea on the ETA when this should be ready ? rough estimate is good.

9prady9 avatar Oct 13 '21 05:10 9prady9

For the JIT disk caching bug, you will have the PR today. All tests ran successfully yesterday evening.

For the copy, this will take more time and I do not have an ETA.

Willy

On Wed, Oct 13, 2021, 07:42 pradeep @.***> wrote:

@willyborn https://github.com/willyborn Any idea on the ETA when this should be ready ? rough estimate is good.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#issuecomment-941940826, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPBRMYODPVSJLURCY73UGUL5JANCNFSM46MTN7OQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

willyborn avatar Oct 13 '21 05:10 willyborn

@willyborn Sure, please do. We can have the JIT caching issue fix in 3.8.1. I am curious, if this fix is relevant to master or is it on top of this branch ? If it isn't associated with master, then we don't need to rush that one either.

9prady9 avatar Oct 13 '21 05:10 9prady9

@willyborn Just following up on this PR. Where you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.

umar456 avatar Apr 05 '22 19:04 umar456

@willyborn Just following up on this PR. Where you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.

umar456 avatar Apr 05 '22 19:04 umar456

@Umar I am still searching for an algorithm to determine the number of loops I need inside the kernel. The potential goes up to 2x faster speeds for certain conditions. For the moment, my algorithm is not yet consistent enough to release. For the rest, thanks to your comments, I succeeded in simplifying much compared to the previous release of PR.

I still have 1 idea left to check. Each full test cycle takes currently 150Hrs to complete all different combinations, explaining why the progress is so slow. I would find it a pity, if I does not succeed in getting this incorporated in a consistent manner. In case you want the intermediate results (3.8.2 for example), I can release an intermediate version if requested.

BR, Willy

On Tue, 5 Apr 2022 at 21:32, Umar Arshad @.***> wrote:

@willyborn https://github.com/willyborn Just following up on this PR. Where you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.

— Reply to this email directly, view it on GitHub https://github.com/arrayfire/arrayfire/pull/3144#issuecomment-1089234960, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPACDLYBFIXK362FJ7LVDSILBANCNFSM46MTN7OQ . You are receiving this because you were mentioned.Message ID: @.***>

willyborn avatar Apr 06 '22 10:04 willyborn

I am not trying to rush you but there are a lot of good changes in here. It would be nice to get an intermediate PR unless it will take too much effort to finalize. Its up to you. Additionally, we can help you with some testing if you want.

umar456 avatar Apr 06 '22 16:04 umar456

I joined a document with all my findings during this journey. I was surprised by some of the results. You will also find the final performance results on many cases of joins. It is important to look at the full curve, instead of only large array's. Variations in speed are mostly resulting from optimizations being possible or not (like vectorization).

Findings Copy&Join.docx Join-Final.xlsx

willyborn avatar Aug 02 '22 14:08 willyborn

Hey @willyborn, This is amazing work. I am still going through the document and will go over the code later today. These are excellent findings. I'd often assumed that this function could be implemented in a better way. There is a lot of useful information here and I will try to turn this around asap. Do you believe this PR is ready to be merged from your perspective? Do you still have some unfinished work before I begin? I see there are a couple of conflicts. They seem pretty minor. I can help you with them if you have any issues resolving them.

umar456 avatar Aug 02 '22 15:08 umar456

I have the impression that those conflicts are from yet unpublished changes. I verified if the remote master was updated in the mean time, although it wasn't. (I have the impression, internally it is)

Looking at the conflicts, the changes are minor, although I have not idea how to solve this without trowing away my changes.

willyborn avatar Aug 02 '22 15:08 willyborn

Do you mind if I resolve them? I can push once I make the relevant changes to your branch.

umar456 avatar Aug 02 '22 15:08 umar456

No problem. At the end it is a team result.

willyborn avatar Aug 02 '22 15:08 willyborn

I found the missing updates, github will remain a mystery for me, or perhaps it is the MS Visual Studio implementation which puts me frequently on the wrong foot.

willyborn avatar Aug 02 '22 15:08 willyborn

Ahh okay. Go on ahead and push it with the update. Maybe the conflicts will be resolved with the rebase.

umar456 avatar Aug 02 '22 16:08 umar456

Rebased and reran all tests.

willyborn avatar Aug 03 '22 10:08 willyborn

@umar456 I missed one of your questions. I have no outstanding work.

willyborn avatar Sep 05 '22 14:09 willyborn