Make the global EC a BatchedExecutor (performance) #7470
Conversation
Seems like this should be backported to 2.12.x

@hepin1989 The 2.12.x tree does not use the same implementation as 2.13.x. I'd have to run benches for 2.12.x separately.

I should note that this also applies to any ECs created using the
Force-pushed 8339e6a to 2754cc4
@SethTisue Let me know what is needed to get this merged. Would love to see this in before RC1.

There are some ~15% regressions for
@plokhotnyuk I hope to be able to bring them back to around where they were for 2.13.0-M5. The lower performance for CallbackBenchmark is due to how the bench is written: it prefers as much parallelism as possible (one can argue that this is a faulty test, but given that it is a callback it can only do side effects). As for the regression of FirstCompletedOfBenchmark, this is due to the fact that FirstCompletedOf is an inherent race and does not benefit from trampolining. I'll see if I can do something about that. Given the impact for the typical use cases of stacking transformations, I think the improvements for
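For context on why that combinator is a race, here is a minimal sketch of a firstCompletedOf-style implementation (an illustration of the usual promise-racing shape, not the stdlib source; the name firstCompletedOfSketch is made up):

```scala
import scala.concurrent.{ExecutionContext, Future, Promise}

// All input futures race to complete a single shared promise; only the
// first tryComplete wins. There is no chain of dependent transformations
// here for a batching/trampolining executor to keep on one thread.
def firstCompletedOfSketch[T](futures: Seq[Future[T]])(implicit ec: ExecutionContext): Future[T] = {
  val p = Promise[T]()
  futures.foreach(_.onComplete(p.tryComplete)) // callbacks race on the promise
  p.future
}
```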
Well, I can run the community build on it if you like (though the 2.13 build has ~70 projects currently, compared to ~180 on 2.12, so it's somewhat less assurance).

(RC1 won't be out before January, so there's time.)
@SethTisue :) I'm always uneasy about approving my own work!

@plokhotnyuk I've made some extra optimizations and we're almost back to HEAD in performance of FirstCompletedOf: (We've got about 10% to go; I will see if I can improve that further, but it is doing more work in that case so it is hard to regain everything. I'll try.) For CallbackFutureBenchmark I'm at: Which means the post-bench is back to baseline and the pre-bench has about 10% to go. Will look into this some more.

@plokhotnyuk In the pre case it is really difficult to claw back extra performance, because the 10% hit is the extra allocation for the batch, which is always discarded after only having had a single item in it. I will see what the effect is when the benchmark is executed from within a Future.

@plokhotnyuk When the CallbackFutureBenchmark is executed from within a Future:
[info] CallbackFutureBenchmark.post gbl 8192 1 thrpt 10 4.014 ± 0.042 ops/ms
I'll see what I can do to make that 10% perf gap even smaller, but it does seem like a trade-off situation.
@viktorklang

If I add a method to Batchable, and let each batchable decide whether it is ready to be batched or not, then I can mark

I added a bit of optimization to FirstCompletedOf which gives a pretty significant boost in the precompleted section:
@NthPortal Exactly. pre = precompleted (completed before the operations are added), post = postcompleted (completed after the operations are added).
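To make the two cases concrete, here is a small hypothetical snippet (not the benchmark code itself):

```scala
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

// "pre": the future is already completed when the transformation is added.
val pre = Future.successful(1).map(_ + 1)

// "post": the transformation is added first; completion happens afterwards.
val promise = Promise[Int]()
val post = promise.future.map(_ + 1)
promise.success(1)
```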
@viktorklang do we care about one of those more than the other (perhaps because it is more common)?

@NthPortal It's use-case specific and timing-related (as the futures could be completed either before, during, or after).

@NthPortal @plokhotnyuk On second thought, I'd prefer not to introduce the new method to Batchable, since it is likely to create difficult-to-reason-about performance implications for end-users.

I just added some improvements. Will update the text of the PR when I have cleaned up the bench results.
Force-pushed 246c292 to 4ccc9e6
@NthPortal @SethTisue Alright, I believe this one is ready for the community build and then RC1 :)

It finished at https://scala-ci.typesafe.com/job/scala-2.13.x-integrate-community-build/1637/ ; I see only the usual failures.

Thanks @SethTisue !
    val state = get()
    if (state.isInstanceOf[Try[T]]) "Future("+state+")"
-   else if (state.isInstanceOf[Link[T]]) state.asInstanceOf[Link[T]].promise(this).toString
+   else if (state.getClass eq classOf[Link[T]]) state.asInstanceOf[Link[T]].promise(this).toString
should this be classOf[Link[_]]?
Good question! I tend to add the explicit type even if it is erased, mainly as documentation.
I assume you benchmarked this and it was faster? I'm surprised the JIT can't tell that the type has no children and not walk the object's class hierarchy.
Yeah, it was faster, and I too was surprised :(
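For reference, a small sketch of the semantic difference between the two checks, using an illustrative Node class rather than the actual Link: isInstanceOf also matches subclasses, whereas the getClass test matches only the exact runtime class.

```scala
class Node[T](val value: T) // illustrative stand-in for Link[T]

def isNodeByInstance(state: AnyRef): Boolean =
  state.isInstanceOf[Node[_]]        // true for Node and any subclass of it

def isNodeByExactClass(state: AnyRef): Boolean =
  state.getClass eq classOf[Node[_]] // true only when the runtime class is exactly Node
```

For a final class with no subclasses the two tests accept the same values, which is why the JIT could in principle compile them identically.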
I do remember getClass being faster a long time ago (Java 1.1/1.2), but that was fixed.
I did a quick JMH synthetic benchmark to compare, and I can see no real performance advantage of getClass, as the differences are within the noise.
Can someone review the benchmark, or re-run it on another system to compare?
@viktorklang can you share the benchmark that you used and the results?
[info] Benchmark                                      Mode  Cnt   Score   Error  Units
[info] InstanceOfBenchmark.innerByClassAllInner       avgt  400  38.009 ± 0.106  ns/op
[info] InstanceOfBenchmark.innerByClassAllInnerT      avgt  400  29.694 ± 0.083  ns/op
[info] InstanceOfBenchmark.innerByClassMixed          avgt  400  37.584 ± 0.049  ns/op
[info] InstanceOfBenchmark.innerByInstanceAllInner    avgt  400  37.709 ± 0.087  ns/op
[info] InstanceOfBenchmark.innerByInstanceAllInnerT   avgt  400  29.733 ± 0.088  ns/op
[info] InstanceOfBenchmark.innerByInstanceMixed       avgt  400  35.275 ± 0.122  ns/op
[info] InstanceOfBenchmark.innerTByClassAllInner      avgt  400  29.586 ± 0.076  ns/op
[info] InstanceOfBenchmark.innerTByClassAllInnerT     avgt  400  37.945 ± 0.076  ns/op
[info] InstanceOfBenchmark.innerTByClassMixed         avgt  400  34.131 ± 0.227  ns/op
[info] InstanceOfBenchmark.innerTByInstanceAllInner   avgt  400  29.834 ± 0.126  ns/op
[info] InstanceOfBenchmark.innerTByInstanceAllInnerT  avgt  400  37.654 ± 0.046  ns/op
[info] InstanceOfBenchmark.innerTByInstanceMixed      avgt  400  33.725 ± 0.046  ns/op
full code and run details in https://gist.github.com/mkeskells/c88dd1201d750d4edd2f306f8ad60f41
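A stripped-down sketch of what such a JMH comparison can look like (illustrative names only; the gist above has the real benchmark code):

```scala
import org.openjdk.jmh.annotations._

class Inner // illustrative class under test

@State(Scope.Benchmark)
class InstanceOfSketchBenchmark {
  val value: AnyRef = new Inner

  @Benchmark
  def byInstanceOf: Boolean = value.isInstanceOf[Inner]

  @Benchmark
  def byGetClass: Boolean = value.getClass eq classOf[Inner]
}
```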
@mkeskells The difference was small, yes. (I run these benches: https://github.com/scala/scala/blob/76b34c455e54adea171bd618e7ea5d37cc2c4af1/test/benchmarks/src/main/scala/scala/concurrent/FutureBenchmark.scala) I found another, much bigger, source of performance gain so I am happy to switch back to using isInstanceOf (I just need to do some extra benching and cleanup first).
It would be good to understand if and when a different approach is faster; after all, if it is an optimisation that the compiler can apply for us, we can keep the code idiomatic and still fast.
I would ask @retronym to have a look at whether this optimisation is generalisable in the compiler codegen,
or if it is just an optimisation to pass on to a JIT/Graal team, i.e. a runtime rather than a codegen optimisation.
AFAIK, HotSpot and Graal generate equivalent code for instanceof Foo and .getClass == Foo.class testing against an effectively final class. I'd suggest switching back to instanceof.
@retronym Agreed—already done :)
@mkeskells I'm going to compile a list of proposed changes to the compiler based on my observations, for sure!
-     throw e
-   } else throw inner
+     e
+   } else inner
should inner suppress cause?
IMO yes—inner suppresses cause when cause is non-fatal and inner is fatal—this is to make sure that we propagate serious problems.
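A minimal sketch of that rule, assuming NonFatal is the fatality test (this is an illustration, not the PR's exact code):

```scala
import scala.util.control.NonFatal

// When the original cause is non-fatal but the new exception is fatal,
// attach the cause as suppressed so the serious error propagates without
// losing the original diagnostic information.
def resolveException(cause: Throwable, inner: Throwable): Throwable =
  if (NonFatal(cause) && !NonFatal(inner)) {
    inner.addSuppressed(cause)
    inner
  } else inner
```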
Force-pushed 76b34c4 to 4ccc9e6
I just pushed a new commit which should improve performance a bit further.

@SethTisue Could you run the community build against this latest commit? I believe it should be as good as it gets right now.
Force-pushed 8772676 to e63b1b0
@SethTisue Found even more optimizations! Looking really good now!
Force-pushed e63b1b0 to c2fd06b
Just updated the PR with the latest perf numbers compared to the 2.13.x mainline—some operators have dramatically better performance on the Global EC with these changes.
Force-pushed 9b0362d to 4981c02
* Making the global EC a BatchedExecutor (performance)
* Implements a new BatchedExecutor.Batch with less overhead
* Switches to tryComplete0 instead of tryComplete to reduce calls
* Simplifies the relinking logic
* Moves unlinking to DefaultPromise
* Inlines foreach and onComplete
* Runs both the map and the flatMap of zipWith on the same supplied EC
* Switches to use tryComplete0 for more transformations, for fewer calls
Force-pushed 4981c02 to fbada39
@SethTisue I'm merging this now. I've squashed the commits. Let me know if/when you run the next community build on the 2.13.x mainline.

I haven't seen any regressions since scala/community-build@ee79f0f
Is there a reason why the BatchedExecutor trait is private[concurrent]? Are users supposed to copy and paste that code, or make their own, if they need a separate ExecutionContext, e.g. to avoid blocking the global pool, but want it not to have worse performance?
@shado23 I wasn't sure it should be a supported end-user API. If you want to avoid blocking the global pool then you can use
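The truncated reference is presumably to scala.concurrent.blocking, which marks a section as potentially blocking so the global pool can compensate with extra threads; a minimal example:

```scala
import scala.concurrent.{Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global

val result: Future[String] = Future {
  blocking {           // hint to the pool: this task may block
    Thread.sleep(1000) // simulated blocking I/O
    "done"
  }
}
```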
@viktorklang Thanks for the reply. I hear what you're saying, but I think there are cases where having a truly separate ExecutionContext is advantageous. Slick, for instance, does exactly that for its JDBC IO threads.

In my experience the blocking-context behavior can be a cure worse than the disease. Yes, you may avoid thread starvation, but you can in return end up with thousands(!) of threads and an OOM exception if you're not very careful about how much work you schedule (which Future doesn't make particularly easy). In the cases I've run into, it was always easier and safer to provide a right-sized thread pool from the get-go, because even if you happen to avoid the OOM issue, your system won't be able to deal with an infinite amount of concurrent IO ops anyway, and it could even be that you're dealing with background jobs that you don't want to take up a significant amount of resources.

Even limiting the extra threads of the global pool isn't the best solution, as you still end up mixing IO and computation work, which have quite different needs, and without control of the scheduling you can still end up blocking all of them. Ultimately even pure computational tasks can look like blocking to the rest of the system if they're heavy enough, so even there you have to put thought into how and where they are executed.

I do think that having the executor be batching by default is a good change; if I'm not mistaken, it essentially gives you the behavior that Twitter Future and Monix Task (and probably others) already have, where a simple map doesn't immediately incur a trip through the ExecutionContext and the associated context switch in most cases. I think that's great, because you currently have to contort yourself to try to avoid this if your transformations are small, or accept a significant performance loss.

Ultimately, I think it would be useful to allow users to create their own BatchedExecutor. It doesn't seem like it would expose any delicate internals or be very risky. I understand that binary-compatibility concerns etc. create an incentive to be careful about the public API you expose, but in some cases it can also be quite annoying as a user. In this case copy-pasting the implementation wouldn't be the worst thing in the world, but it feels petty to me to disallow external usage if there aren't good reasons to do it. (Maybe there are?)
The global default EC has a default maximum number of extra threads of 256 (configurable via System Properties)—this means that if you use

The biggest issue I see is that users/extenders of BatchingExecutor will not realize the implications for the code it executes—which is why the synchronous EC used by Future is not exposed directly to end users. Also note that you are able to create a "clone" of the global EC by calling this: https://github.com/scala/scala/blob/2.13.x/src/library/scala/concurrent/ExecutionContext.scala#L153. I have some additional optimizations to BatchingExecutor that I hope I'll be able to ship with 2.13.0-RC1.
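The link appears to point at the ExecutionContext.fromExecutor factory; passing null for the executor makes the implementation construct a fresh default thread pool, i.e. a separate EC configured like global. A sketch (behavior should be verified against your Scala version):

```scala
import scala.concurrent.ExecutionContext

// null executor => the factory builds a new default pool, so this EC is a
// separate "clone" of global rather than a wrapper around an existing pool.
// The extra-thread limit is tunable via -Dscala.concurrent.context.maxExtraThreads=N.
val separate: ExecutionContext =
  ExecutionContext.fromExecutor(null, ExecutionContext.defaultReporter)
```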
Sure, in many cases that will probably work well enough. To be really seamless and safe, it most likely requires some cooperation from the VM to allow avoiding blocking IO (such as what Go does). Maybe Project Loom will provide something here in the future.

That's good enough for me then.

Sounds good. I'm looking forward to being able to take advantage of all the great optimization work you're doing.

@shado23 Thanks!
This PR makes the global EC implement BatchingExecutor, which on benches for Future increases performance significantly for most combinators.
Numbers from my box using scala.concurrent.global:
Note that not all operations have improved performance, but in reality some of the benches are not typical end-user code.
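Conceptually, the batching idea can be sketched as follows; this is a naive illustration of the technique, not the actual BatchingExecutor implementation:

```scala
import java.util.concurrent.Executor

// Naive batching: tasks submitted while a batch is already running on the
// current thread are queued locally and run on that same thread, instead of
// each task paying for a separate submission to the underlying pool.
abstract class NaiveBatchingExecutor extends Executor {
  private[this] val current = new ThreadLocal[java.util.ArrayDeque[Runnable]]

  protected def submitForExecution(runnable: Runnable): Unit // the real pool

  override def execute(runnable: Runnable): Unit = {
    val queue = current.get()
    if (queue ne null) queue.addLast(runnable) // already batching: just enqueue
    else submitForExecution(new Runnable {
      def run(): Unit = {
        val q = new java.util.ArrayDeque[Runnable]
        current.set(q)
        try {
          runnable.run()
          var next = q.pollFirst()
          while (next ne null) { next.run(); next = q.pollFirst() }
        } finally current.remove()
      }
    })
  }
}
```

This mirrors why a chain of map/flatMap steps can run back-to-back on one thread, which is the effect the benchmark improvements above reflect.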