Skip to content

Conversation

@gfoidl
Copy link
Member

@gfoidl gfoidl commented Oct 8, 2020

  • updated the sample code to use long for _requestCount which is more realisitc than using int
  • for int a torn read can happen on 32-bit architectures, so added a note therefore

For background see e.g. dotnet/aspnetcore#26630

@gfoidl gfoidl requested review from a team and sdmaclea as code owners October 8, 2020 09:36
@dotnet-bot dotnet-bot added this to the October 2020 milestone Oct 8, 2020
TABS instead of spaces.
@sdmaclea
Copy link
Contributor

sdmaclea commented Oct 8, 2020

/cc @dotnet/dotnet-diag

DisplayRateTimeScale = TimeSpan.FromSeconds(1)
};
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this might better belong with the code above at line 111.

Volatile.Read might be the better default sample code. As it probably affects weakly ordered architectures as well as 32-bit architectures.

I wonder if _requestCount itself should be marked volatile. Does that work fix the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_requestCount itself should be marked volatile. Does that work fix the issue?

I don't think so, and on 32-bit torn-reads, which we want to avoid, are still a concern.

There's an interesting discusssion in npgsql/npgsql#3223 whether Volatile.Read is sufficient or not in contrast to Interlocked.Read.

the better default sample code

Isn't the default code sample to show "how it's not done"?
Maybe it should be changed so that the example shows a correct reference implementation and then the text points out for what to watch for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly read through the conversation in npgsql/npgsql#3223. Semantically Interlocked.Read seems more explicit, The Volatile anti-tearing seems more like a side-effect. Given policy of non-breaking changes it is likely to stay a permanent side-effect though.

I would have expected an Interlocked<T> C# generic type which mandated interlocked access, so it could be used in member variables.... Perhaps I'm too optimistic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any objection or downside to Interlocked.Read? Interlocked would be my preference. The conversation made it sound like "Interlocked is clearly documented to do what we want, Volatile.Read also technically works but relies on more obscure detail." I didn't notice any rationale why we'd prefer using Volatile given the choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any rationale why we'd prefer using Volatile

I think this goes on my hat. I've chosen it because on x64 it produces so nice asm 😉
But now I'm not sure if this choice was a good one and I'm clueless hoping for a conclusion which can be applied on "all" uses for the polling-counter.

As background:
I needed to implement an event counter, looked at the docs and searched for references. The docs lacked that point -- hence this PR -- and in ASP.NET Core the reads were just plain reads, but there was an issue to fix this.
Fast forward, we landed here, and I'd like to have the docs show a good and proper reference implementation which can be copied by folks without further thinking whether it's correct or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I've chosen it because on x64 it produces so nice asm

These event counter callbacks are invoked at most once a second (per counter). I would expect any sub-microsecond optimization will have no measurable impact on app performance and we can prioritize coding style that has clear intent.

I'd like to have the docs show a good and proper reference implementation which can be copied by folks without further thinking whether it's correct or not

I think Interlocked.Read makes a clear and correct default implementation with neglible downsides - that is what I'd recommend we use in the sample.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to Interlocked.Read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prioritize coding style that has clear intent.

Absolutely, especially in the docs.

no measurable impact on app performance

Agree.

The following is more a theoretical digression than real observation.
It's just about to help me understand how things work or if I have a failure in my mental model so that it can be corrected.

Let's assume a high-perf server with lots of requests (e.g. kestrel) running on x64.

For each request a long field in the event source object will be incremented by Interlocked.Increment, thus lock add qword ptr [rcx], 1 will be executed. The lock-prefix ensures that the cpu has exclusive ownership of the cache line in play for the duration of the operation (MESI-protocol and variants of it), and it turns the instruction into a full memory barrier (mfence).

Now the timer (1s interval) for the event counter triggers, so the field is read by Interlocked.Read, which is lock cmpxchg [rcx], rdx -- it's not only a read, but a read-modify-write. From Intel Developer manual "The processor never produces a locked read without also producing a locked write." and also quote here.
So, same as before, exclusive ownership of the cache line is ensured.
This also means that during this read no other write can happen -- only before or after the instruction -- and so a increment at the same time will be "blocked", also the caller of the increment which will ultimately "block" the processing of that request.

So being on x64 on we don't care if the snapshot taken from the counter is the most to up to date value (we don't need memory barriers to ensure the freshest value), a ordinary load mov rax, [rcx+8] would be the least disruptive operation.

Notes:

  • "block" is in quotes, as I really wouldn't say blocking as it's just a mini-fraction of time. But I don't any better word therefore
  • I'm aware that this won't show up in RPS or any other measure -- at least it will be within noise, so negligible
  • I don't want contrast this with Volatile or put any argument pro or contra a variant disussed here -- it's only for understanding as written above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any specific question. I don't see any gross error in your description.

Coming from the weakly ordered side of the micro-architectural world, I would replace the word "blocked" with "ordered". When the lock cmpxchg executes as a read or a write it is guaranteed to write, so it must take exclusive ownership of the cache line. So its write is ordered with respect to other writes/reads. Effectively the writes/reads are strongly ordered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the confirmation 👍, that's what I was looking for with the previous comment.
And for the better wording with "ordered".

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants