
Conversation

@TimLovellSmith (Contributor) commented Jun 21, 2021

Fixes #1719.

@TimLovellSmith (Contributor, Author):

Hm. Only one test failure on AppVeyor, and it seems like it could just be not enough worker threads again?

22:26:40.679: Awaiting 2 available task completion(s) for 5000ms, IOCP: (Busy=1,Free=999,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)
22:26:40.679: Response from Interactive/127.0.0.1:6379 / PING: SimpleString: PONG
22:26:45.699: Not all available tasks completed cleanly (from ReconfigureAsync#1747, timeout 5000ms), IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)

dotnet : [xUnit.net 00:00:07.99]     MultiMaster.TestMultiWithTiebreak(a: null, b: "127.0.0.1:6381", elected: "127.0.0.1:6381") [FAIL]
 
  Skipped TestInfoReplicationChecks.Exec [1 ms]
  Failed MultiMaster.TestMultiWithTiebreak(a: null, b: "127.0.0.1:6381", elected: "127.0.0.1:6381") [5 s]
  Error Message:
   elected
Expected: True
Actual:   False
  Stack Trace:
     at StackExchange.Redis.Tests.MultiMaster.TestMultiWithTiebreak(String a, String b, String elected) in C:\projects\stackexchange-redis\tests\StackExchange.Redis.Tests\MultiMaster.cs:line 79
  Standard Output Messages:
 Running under .NET Framework 4.8.4121.0 (4.0.30319.42000)
   Running on: Microsoft Windows 10.0.17763 
 22:26:40.663: 127.0.0.1:6379,127.0.0.1:6381,name=TestMultiWithTiebreak,password=*****,tiebreaker=__tie__
 22:26:40.663: 127.0.0.1:6379/Interactive: Connecting...
 22:26:40.663: 127.0.0.1:6379: BeginConnectAsync
 22:26:40.663: 127.0.0.1:6379/Interactive: Connected 
 22:26:40.663: 127.0.0.1:6379: Server handshake
 22:26:40.663: 127.0.0.1:6379: Authenticating (password)
 22:26:40.663: 127.0.0.1:6379: Setting client name: TestMultiWithTiebreak
 22:26:40.663: 127.0.0.1:6379: Auto-configuring...
 22:26:40.663: 127.0.0.1:6379: Sending critical tracer (handshake): ECHO
 22:26:40.663: 127.0.0.1:6379/Interactive: Writing: ECHO
 22:26:40.663: 127.0.0.1:6379: Flushing outbound buffer
 22:26:40.663: 127.0.0.1:6379: OnEstablishingAsync complete
 22:26:40.663: 127.0.0.1:6379: Starting read
 22:26:40.663: 127.0.0.1:6379: Auto-configured (CONFIG) read-only replica: true
 22:26:40.663: 127.0.0.1:6379: Auto-configured (CONFIG) databases: 2000
 22:26:40.663: 127.0.0.1:6381/Interactive: Connecting...
 22:26:40.663: 127.0.0.1:6379: Auto-configured (INFO) version: 3.0.503
 22:26:40.663: 127.0.0.1:6379: Auto-configured (INFO) server-type: standalone
 22:26:40.663: 127.0.0.1:6381: BeginConnectAsync
 22:26:40.663: 127.0.0.1:6379: Auto-configured (INFO) role: master
 22:26:40.663: Response from Interactive/127.0.0.1:6379 / ECHO: BulkString: 16 bytes
 22:26:40.679: 127.0.0.1:6381/Interactive: Connected 
 22:26:40.679: 127.0.0.1:6381: Server handshake
 22:26:40.679: 127.0.0.1:6381: Authenticating (password)
 22:26:40.679: 127.0.0.1:6381: Setting client name: TestMultiWithTiebreak
 22:26:40.679: 127.0.0.1:6381: Auto-configuring...
 22:26:40.679: 127.0.0.1:6381: Sending critical tracer (handshake): ECHO
 22:26:40.679: 127.0.0.1:6381/Interactive: Writing: ECHO
 22:26:40.679: 127.0.0.1:6381: Flushing outbound buffer
 22:26:40.679: 127.0.0.1:6381: OnEstablishingAsync complete
 22:26:40.679: 127.0.0.1:6381: Starting read
 22:26:40.679: 127.0.0.1:6381: Auto-configured (CONFIG) read-only replica: true
 22:26:40.679: 127.0.0.1:6381: Auto-configured (CONFIG) databases: 2000
 22:26:40.679: 127.0.0.1:6381: Auto-configured (INFO) version: 3.0.503
 22:26:40.679: 127.0.0.1:6381: Auto-configured (INFO) server-type: standalone
 22:26:40.679: 127.0.0.1:6381: Auto-configured (INFO) role: master
 22:26:40.679: Response from Interactive/127.0.0.1:6381 / ECHO: BulkString: 16 bytes
 22:26:40.679: 2 unique nodes specified
 22:26:40.679: 127.0.0.1:6379: OnConnectedAsync init (State=ConnectedEstablished)
 22:26:40.679: 127.0.0.1:6379: OnConnectedAsync completed (already connected start)
 22:26:40.679: 127.0.0.1:6379/Interactive: Writing: PING
 22:26:40.679: 127.0.0.1:6381: OnConnectedAsync init (State=ConnectedEstablished)
 22:26:40.679: 127.0.0.1:6381: OnConnectedAsync completed (already connected start)
 22:26:40.679: 127.0.0.1:6381/Interactive: Writing: PING
 22:26:40.679: Response from Interactive/127.0.0.1:6381 / PING: SimpleString: PONG
 22:26:40.679: 127.0.0.1:6381: OnConnectedAsync completed (already connected end)
 22:26:40.679: Allowing 2 endpoint(s) 00:00:05 to respond...
 22:26:40.679: Awaiting 2 available task completion(s) for 5000ms, IOCP: (Busy=1,Free=999,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)
 22:26:40.679: Response from Interactive/127.0.0.1:6379 / PING: SimpleString: PONG
 22:26:45.699: Not all available tasks completed cleanly (from ReconfigureAsync#1747, timeout 5000ms), IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)
 22:26:45.699:   Task[0] Status: WaitingForActivation
 22:26:45.699:   Task[1] Status: RanToCompletion
 22:26:45.699: 127.0.0.1:6379: Endpoint is ConnectedEstablished
 22:26:45.699: 127.0.0.1:6381: Endpoint is ConnectedEstablished
 22:26:45.699: 127.0.0.1:6379: Requesting tie-break (Key="__tie__")...
 22:26:45.699: 127.0.0.1:6379/Interactive: Writing: GET __tie__
 22:26:45.699: 127.0.0.1:6381: Requesting tie-break (Key="__tie__")...
 22:26:45.699: 127.0.0.1:6381/Interactive: Writing: GET __tie__
 22:26:45.699: 127.0.0.1:6379: Did not respond
 22:26:45.699: 127.0.0.1:6381: Returned with success as Standalone primary (Source: Already connected)
 22:26:45.699: Waiting for tiebreakers...
 22:26:45.699: Awaiting 2 tiebreaker task completion(s) for 50ms, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)
 22:26:45.699: Response from Interactive/127.0.0.1:6379 / GET __tie__: (null)
 22:26:45.699: Response from Interactive/127.0.0.1:6381 / GET __tie__: BulkString: 14 bytes
 22:26:45.699: All 2 tiebreaker tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=2042,Min=4,Max=2047)
 22:26:45.699: Election: 127.0.0.1:6379 had no tiebreaker set
 22:26:45.699: Election: 127.0.0.1:6381 nominates: 127.0.0.1:6381
 22:26:45.699: Election: Single master detected: 127.0.0.1:6381
 22:26:45.699: 127.0.0.1:6381: Clearing as RedundantMaster
 22:26:45.699: 127.0.0.1:6379: Standalone v3.0.503, master; 2000 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active; not in use: DidNotRespond
 22:26:45.699: 127.0.0.1:6379: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
 22:26:45.699: 127.0.0.1:6379: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
 22:26:45.699: 127.0.0.1:6381: Standalone v3.0.503, master; 2000 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
 22:26:45.699: 127.0.0.1:6381: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
 22:26:45.699: 127.0.0.1:6381: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
 22:26:45.699: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
 22:26:45.699: Starting heartbeat...
 22:26:45.699: 22:26:45.699: Connect took: 5028ms
 22:26:45.715: Service Counts: (Scheduler) Queue: 2897, Pool: 0
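
The WORKER pool in the log above shows Min=4 with Busy=5, which is consistent with thread-pool starvation while the pool slowly injects new threads. One common mitigation in test harnesses (not something this PR does; the helper and its names below are hypothetical) is to raise the thread-pool minimums before the tests run:

```csharp
using System;
using System.Threading;

// Hypothetical test-fixture helper, not part of this PR: raising the thread-pool
// minimums reduces the chance that connection/handshake tasks sit queued while
// the pool slowly injects additional worker threads.
public static class TestThreadPoolConfig
{
    public static void EnsureMinThreads(int minWorkers, int minIocp)
    {
        ThreadPool.GetMinThreads(out int workers, out int iocp);

        // Only raise the minimums, never lower them; SetMinThreads returns
        // false if the requested values are out of range.
        if (workers < minWorkers || iocp < minIocp)
        {
            ThreadPool.SetMinThreads(
                Math.Max(workers, minWorkers),
                Math.Max(iocp, minIocp));
        }
    }
}
```

Calling something like `TestThreadPoolConfig.EnsureMinThreads(64, 64)` once at fixture startup would make the 5000ms waits in `ReconfigureAsync` far less likely to time out purely because no worker thread was available.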

@NickCraver (Collaborator):

Talked about this on the call - this is the remaining failure I'm seeing but very rarely - we need a log of it (I added to the logging) to get more info about the cause. Merging main in here to get better logging.

@TimLovellSmith (Contributor, Author):

@NickCraver Any questions on this PR still? It is still needed to prevent ordinary user messages from being sent out of order due to race conditions.
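
To make the race concrete: the out-of-order case arises when one thread writes directly to the connection while another thread has just placed a message on the backlog queue. A minimal sketch of the ordering rule the fix enforces (hypothetical names, heavily simplified relative to the real StackExchange.Redis code, which also synchronizes the check itself):

```csharp
using System.Collections.Concurrent;

public sealed class PhysicalBridgeSketch
{
    // Stand-in for the real message type.
    public sealed class Message { public string Command = ""; }

    private readonly ConcurrentQueue<Message> _backlog = new ConcurrentQueue<Message>();

    public void Write(Message message)
    {
        // Ordering rule: never bypass a non-empty backlog. If another thread has
        // already queued messages (e.g. while the connection was busy or
        // reconnecting), writing directly here would overtake them.
        if (!_backlog.IsEmpty)
        {
            _backlog.Enqueue(message); // keep FIFO order behind the queued messages
            return;
        }

        // Common case: the backlog is empty, so the extra cost is just the
        // IsEmpty check above before the normal direct write.
        WriteDirect(message);
    }

    private void WriteDirect(Message message)
    {
        // Placeholder for the actual socket write.
    }
}
```

In the real code the emptiness check and the decision to enqueue have to happen under the same synchronization the backlog writer uses, otherwise the check itself is a new race; the sketch only illustrates why the check must come before the direct write.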

@TimLovellSmith changed the title from 'Revert "More reverts" to reapply the fix to #1719.' to 'Fixes "messages get sent out-of-order when a backlog is created by another thread" #1719' on Jun 23, 2021
@TimLovellSmith changed the title from 'Fixes "messages get sent out-of-order when a backlog is created by another thread" #1719' to 'Fixes "messages get sent out-of-order when a backlog is created by another thread"' on Jun 23, 2021
@NickCraver (Collaborator):

@TimLovellSmith hoping to get time on it tomorrow or Friday, busy release and house week here - just need to wrap my head around it, basics seem sound.

@NickCraver (Collaborator):

cc @mgravell for eyes

@NickCraver linked an issue on Jun 26, 2021 that may be closed by this pull request
@NickCraver (Collaborator):

Fired up benchmarks here; we're within jitter. The main concern was around performance, but with this inlined, the 99.99% case is just a bool check on the queue:

Main:

| Method | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| StringSet/s | .NET 4.7.2 | 1,018.4 us | 19.99 us | 21.39 us | 982.0 | - | 426 B |
| StringGet/s | .NET 4.7.2 | 970.6 us | 19.38 us | 39.15 us | 1,030.3 | - | 402 B |
| 'HashGetAll F+F/s' | .NET 4.7.2 | 2,844.2 us | 60.29 us | 176.83 us | 351.6 | 40.0000 | 133404 B |
| 'HashGetAll F+F/a' | .NET 4.7.2 | 2,894.5 us | 57.59 us | 142.35 us | 345.5 | 40.0000 | 134452 B |
| StringSet/s | .NET Core 5.0 | 961.8 us | 15.89 us | 14.09 us | 1,039.7 | - | 168 B |
| StringGet/s | .NET Core 5.0 | 958.2 us | 18.97 us | 37.45 us | 1,043.7 | - | 153 B |
| 'HashGetAll F+F/s' | .NET Core 5.0 | 2,330.2 us | 45.54 us | 83.28 us | 429.1 | 40.0000 | 128289 B |
| 'HashGetAll F+F/a' | .NET Core 5.0 | 2,304.4 us | 45.62 us | 111.91 us | 434.0 | 40.0000 | 128405 B |

Branch:

| Method | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| StringSet/s | .NET 4.7.2 | 1,006.6 us | 17.71 us | 18.95 us | 993.4 | - | 410 B |
| StringGet/s | .NET 4.7.2 | 960.8 us | 18.70 us | 20.01 us | 1,040.8 | - | 417 B |
| 'HashGetAll F+F/s' | .NET 4.7.2 | 2,916.7 us | 57.77 us | 151.17 us | 342.8 | 40.0000 | 133543 B |
| 'HashGetAll F+F/a' | .NET 4.7.2 | 2,989.4 us | 59.54 us | 169.86 us | 334.5 | 40.0000 | 134475 B |
| StringSet/s | .NET Core 5.0 | 1,008.5 us | 19.77 us | 30.77 us | 991.6 | - | 153 B |
| StringGet/s | .NET Core 5.0 | 937.2 us | 18.68 us | 16.56 us | 1,067.0 | - | 153 B |
| 'HashGetAll F+F/s' | .NET Core 5.0 | 2,437.0 us | 48.22 us | 96.31 us | 410.3 | 40.0000 | 128289 B |
| 'HashGetAll F+F/a' | .NET Core 5.0 | 2,392.8 us | 47.20 us | 96.41 us | 417.9 | 40.0000 | 128405 B |
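
For context, the tables above are BenchmarkDotNet output. A minimal sketch of how comparable StringSet/StringGet benchmarks might be set up (hypothetical key names and endpoint; not the repository's actual benchmark project):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using StackExchange.Redis;

[MemoryDiagnoser] // produces the Gen 0 / Allocated columns seen above
public class RedisOpsBenchmarks
{
    private ConnectionMultiplexer _muxer;
    private IDatabase _db;

    [GlobalSetup]
    public void Setup()
    {
        // Assumed local test server, matching the 127.0.0.1 endpoints in the CI logs.
        _muxer = ConnectionMultiplexer.Connect("127.0.0.1:6379");
        _db = _muxer.GetDatabase();
        _db.StringSet("benchKey", "value");
    }

    [Benchmark(Description = "StringSet/s")]
    public bool StringSetSync() => _db.StringSet("benchKey", "value");

    [Benchmark(Description = "StringGet/s")]
    public RedisValue StringGetSync() => _db.StringGet("benchKey");

    [GlobalCleanup]
    public void Cleanup() => _muxer?.Dispose();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<RedisOpsBenchmarks>();
}
```

Running such a suite against both main and the branch, on both target frameworks, is what supports the "within jitter" conclusion: the per-operation numbers stay inside the error bars after the backlog check is added.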

@NickCraver (Collaborator) left a review comment:

I want to double check this with @mgravell Monday, but I think we're good 👍 - nice work @TimLovellSmith!

@philon-msft (Collaborator):

@mgravell, did you have any concerns with this change? If not, I'll pull it into an internal release for our impacted teams.

@mgravell (Collaborator) commented Jul 1, 2021

@philon-msft I'm reviewing it this morning; will let you know!

@mgravell (Collaborator) left a review comment:

LGTM

@mgravell merged commit cd6cafa into main on Jul 1, 2021
@mgravell deleted the fix1719 branch on Jul 1, 2021 07:30
@mgravell (Collaborator) commented Jul 1, 2021

@philon-msft v2.2.62

@TimLovellSmith (Contributor, Author):

Cool, it's in! Thank you for the extensive benchmarking and reviewing of this and the related backlog changes, Nick and Marc.

Linked issues that may be closed by this pull request: "Publish messages out-of-order without waiting the result" and "Pipelined operations are sent out of order".