Skip to content

Conversation

@rkarthick
Copy link
Contributor

Envoy proxy for redis: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_protocols/redis

Envoy does the heavy lifting of redis cluster instance discovery and partitioning commands among redis cluster instances from the clients to a set of proxy instances. Clients can round robin or randomly select a proxy to connect to and the proxy will do the right thing of picking the right cluster instance for a given command.

Apart from the glue of wiring up the new proxy in Enums and disabling pub/sub for the proxy (similar to Twemproxy), this PR contains the following major change in the primary node selection; When an envoy proxy mode is selected for connection, the check for verifying if the node is master or replica is disabled. In the envoy proxy world we can connect to any proxy instance. All we care about here is ensuring that the load is balanced between the proxy instances (which is accomplished through the current round-robin logic).

@mgravell raised a valid point offline that the order of operations wouldn't be guaranteed in this mode and this is indeed one of the limitation of this approach. The commands that form the foundation for redis transactions (exec, discard, multi, watch, unwatch) (https://redis.io/topics/transactions) are not supported by Envoy (https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/network/common/redis/supported_commands.h) and hence added to the exclusion list in PR.

The testing is inspired from this PR: #1425. Open to suggestions on that.

Copy link
Collaborator

@NickCraver NickCraver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I put some tidy suggestions inline - want to add testing here as well but initial is 👍

@rkarthick
Copy link
Contributor Author

rkarthick commented Feb 11, 2022 via email

@rkarthick
Copy link
Contributor Author

Interesting, the failing testcase (ServerTakesPrecendenceOverSnapshot) is working locally. @NickCraver do you know if SimulateConnectionFailure can be flaky for any reason?

I will also make changes to docker files to test basic envoy connection this week :)

@rkarthick rkarthick requested a review from NickCraver February 17, 2022 17:41
@rkarthick
Copy link
Contributor Author

@NickCraver Envoy doesnt seem to have windows binary, that we could run locally using *.cmd file. Do you have any thoughts on how we can get the environment going in Windows Server? The tests are passing in Ubuntu, except for some random flakiness with others.

@ravgupms
Copy link

@rkarthick you can download Envoy for Windows from the following link:
https://archive.tetratelabs.io/envoy/download/v1.21.0/envoy-v1.21.0-windows-amd64.tar.xz

@NickCraver
Copy link
Collaborator

@rkarthick Sorry it's a crazy week this week in particular (bootcamp at work), nice progress here! I think we can use the envoy image instead of installing for better caching/re-use/speed here and only mount our yaml - I think? Is there a reason that approach doesn't work?

I'm not too worried about Windows - can help with skipping tests if we don't have the server as an approach there (based on TestConfig)

@rkarthick
Copy link
Contributor Author

No worries @NickCraver and thank you for the reply.

@rkarthick Sorry it's a crazy week this week in particular (bootcamp at work), nice progress here! I think we can use the envoy image instead of installing for better caching/re-use/speed here and only mount our yaml - I think? Is there a reason that approach doesn't work?

The usage of envoy proxy docker container within docker-compose will involve the following changes:

  1. Make redis in docker-compose.yaml use a fixed ipv4_address
  2. Use that ipv4_address in the Cluster/node-700*.conf files instead of 127.0.0.1
  3. Use the ipv4_address in envoy.yaml

This is because envoy redis cluster instance discovery relies on cluster slots redis command. cluster slots command returns the IP address in nodes.conf file (hostnames are supported in Redis 7.0, however envoy still has to support this discovery). If the IP address is not reachable (such as 127.0.0.1 from a different envoy container), envoy will not be able to communicate with the redis cluster instance.

Hence I decided to install the envoy within the docker image to make the change minimal. Mostly because I havent worked within sentinel before and wasnt sure if modifying the IP address from 127.0.0.1 in nodes.conf to a fixed ipv4 address will have a side effect in it.

I could potentially make these changes to see if affects sentinel or other testcases. But it would be great to get your thoughts before I go down that route :).

@NickCraver
Copy link
Collaborator

@rkarthick I think I follow - I agree current is likely the better path, let me get a free evening (been a busy week) to poke at this and get things green. I really appreciate all the effort here, will help get some final polish ideas tested for the build pipeline :)

@NickCraver NickCraver self-assigned this Feb 18, 2022
@NickCraver
Copy link
Collaborator

I noticed the tests took 10 seconds, then noticed it took 2 seconds....whatever the connect timeout is. Will dig into why, but it's indeed exhausting timeouts:

23:15:10.3019: 127.0.0.1:7015: Starting read
23:15:10.3140: Response from 127.0.0.1:7015/Interactive / EXISTS: Integer: 0
23:15:12.2798: Not all available tasks completed cleanly (from ReconfigureAsync#1760, timeout 2000ms), IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=3,Free=2044,Min=16,Max=2047)

In waiting for the subscriber connection to connect _when it's not supported_, we were being dumb and waiting the full timeout to proceed. So...don't do that, do the right thing.
@NickCraver
Copy link
Collaborator

Well I found a few issues here:

  1. The command map wasn't right...but it wasn't right for Twemproxy either :)
  2. We were always waiting the full timeout when subscriptions aren't supported at all (via command map) - changed the logic to behave properly.

I'll split the above fixes into their own PR as they are existing bugs in connection logic, but getting there! These tests now run instantly - we should probably add some more command testing but looking good on actually connecting now.

NickCraver added a commit that referenced this pull request Feb 19, 2022
In investigating an issue in #1989, I found a few gaps. Overall:
1. Twemproxy has an out of date CommandMap, which propagated to Envoy.
2. We were expecting both interactive and subscription connections to complete to complete the async handler...but we shouldn't because subscriptions may be disabled.
3. RedisSubscriber changes on the sync path weren't validating the message (asserting the command map has it enabled).

This fixes all of the above and adds another test considering all 3.
@NickCraver
Copy link
Collaborator

Prepping the fixes to the underlying issues in #2001 - I'll merge main in here once it's in place.


RedisCommand.SCRIPT,

RedisCommand.ECHO, RedisCommand.PING, RedisCommand.QUIT, RedisCommand.SELECT,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for catching this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkarthick
Copy link
Contributor Author

Thank you for making these fixes and catching the subscription functionality issue :). I can add a few more command tests over the weekend. Is there a reason why we removed PING though in envoy. Its supported by envoy.

@NickCraver
Copy link
Collaborator

Yep, look at the argument name ;) Please check me for the rest though, it definitely needs more eyes and is probably missing more.

NickCraver added a commit that referenced this pull request Feb 22, 2022
In investigating an issue in #1989, I found a few gaps. Overall:
1. Twemproxy has an out of date CommandMap, which propagated to Envoy.
2. We were expecting both interactive and subscription connections to complete to complete the async handler...but we shouldn't because subscriptions may be disabled.
3. RedisSubscriber changes on the sync path weren't validating the message (asserting the command map has it enabled).

This fixes all of the above and adds another test considering all 3.
RedisCommand.SCRIPT,

RedisCommand.ECHO, RedisCommand.PING, RedisCommand.SELECT,
RedisCommand.ECHO, RedisCommand.SELECT,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@NickCraver NickCraver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now - thanks for this! We can tweak the command map if needed :)

@NickCraver NickCraver merged commit 8197e97 into StackExchange:main Feb 22, 2022
@rkarthick
Copy link
Contributor Author

This is great. Thank you for getting this into the master :).

@NickCraver
Copy link
Collaborator

@rkarthick Thanks for getting this done! I appreciate the iteration as we found existing issues here <3

NickCraver added a commit that referenced this pull request Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants