Skip to content

Conversation

@thrau
Copy link
Member

@thrau thrau commented May 2, 2023

Motivation

In SQS, FIFO queues behave very differently from standard queues when receiving messages. A feature we haven't implemented at all were message group IDs. While we were able to send and track them, we didn't actually do anything with them. But they greatly affect how receiving of messages work. Here's the gist from the FIFO delivery logic.

When receiving messages from a FIFO queue with multiple message group IDs, Amazon SQS first attempts to return as many messages with the same message group ID as possible. This allows other consumers to process messages with a different message group ID. When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.

The issue was first raised in #6766, and requested by several Pro users.

Turns out that this was a non-trivial change to make because of a) the way our SqsProvider communicates with the SqsQueue, and b) how messages are organized in the FifoQueue.

Changes

Here's a summary of changes and their respective commits to make it easier to review:

  • f8e87a2 Added a bunch of tests to validate the behavior of AWS
  • 16821e5 Refactor the communication between the SqsProvider and SqsQueue.
    The problem was that abstracting over a get method that only returns a single message was blocking a way to retrieve multiple messages from the queue atomically, which is exactly what the FIFO behavior requires. So I replaced the SqsQueue.get method with a receive method that much more closely resembles what ReceiveMessage is doing, and allows you to get multiple messages at once. Now the access to the underlying datastructure holding the messages is hidden behind receive.
  • 8f1d70e Implement the actual message group logic (significant changes in the FifoQueue - completely new receive logic).
    Instead of storing individual messages in the queue, we now have a queue that orders MessageGroup objects that in turn contain a priority queue of messages. The implication is that we need more memory to store the messages. The worst case if each message has its own unique message group. The trade off is a simplified implementation of making sure that only one consumer receives a message group when there's a race for the queue. We first dequeue the message group (making it inaccessible to other consumers), and then get the messages from the group. Because we know that only one thread will access the messages in the group, it also vastly simplifies concurrent access. The visible queue is therefore now a construct that has moved completely into the StandardQueue
  • f08d4ac Adapt SQS developer endpoints to the changes

Future work

Expired receipt handles

I added a currently xfailed test that shows that you cannot delete a message with an expired receipt handle in FIFO queues. I was too lazy to implement this right now, since the PR is already quite large, and it needs some additional tinkering with the receipt handle encoder.

Message group ordering

It's not clear exactly how message groups themselves are ordered in the queue. There seems to be some sort of priority in AWS given to groups when they expire, but it behaved differently depending on whether messages were removed or expired. I couldn't figure out how AWS does it exactly, and it's not documented. The only mention I could find is a vague statement I interpret as "there is no particular guarantee in which order message groups will be received".

FIFO queue logic applies only per message group ID. Each message group ID represents a distinct ordered message group within an Amazon SQS queue. For each message group ID, all messages are sent and received in strict order. However, messages with different message group ID values might be sent and received out of order.

For this reason, I had to currently allow different orderings of message groups (note that the order within a group is guaranteed)

ordering = [message["Body"] for message in response["Messages"]]
# these are in principle all valid orderings
assert ordering in (
["g2-m1", "g2-m2", "g3-m1"],
["g3-m1", "g2-m1", "g2-m2"],
)

More tests

There are a bunch more tests one could write. Including concurrent access, ordering behavior of groups with more groups to see how they are ordered after deletion etc.

Code duplication

There is a bit of code duplication especially in the receive implementations. It seems hard to generalize, and maybe not worth it at all, given that we are not going to introduce other queue types. Maybe one could extract some util methods that do some of the work (e.g. collect messages from a queue or queue-like interface).

Performance

There is a lot of locking going on in FIFO queues. Basically all access to message groups is currently completely linearized across consumers. This will affect performance mostly when there are many parallel consumers. But more systematic benchmarks are necessary, that can maybe be preformed as part of #7342

Fixes

@github-actions
Copy link

github-actions bot commented May 2, 2023

LocalStack Community integration with Pro

2 024 tests   1 743 ✔️  1h 12m 41s ⏱️
       2 suites     281 💤
       2 files           0

Results for commit 84837c3.

♻️ This comment has been updated with latest results.

@thrau thrau added semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases aws:sqs Amazon Simple Queue Service labels May 3, 2023
@thrau thrau force-pushed the fix-sqs-fifo-message-group branch 2 times, most recently from 79dd6a5 to ad05a48 Compare May 4, 2023 14:29
@thrau thrau force-pushed the fix-sqs-fifo-message-group branch 2 times, most recently from 8a80530 to f08d4ac Compare May 6, 2023 13:17
@thrau
Copy link
Member Author

thrau commented May 6, 2023

do you have a test for #7036 @dominikschubert ?

@coveralls
Copy link

coveralls commented May 6, 2023

Coverage Status

Coverage: 82.223% (+0.06%) from 82.161% when pulling 84837c3 on fix-sqs-fifo-message-group into 42f62d3 on master.

@dominikschubert
Copy link
Member

do you have a test for #7036 @dominikschubert ?

Yep! Still needs to be adapted for the new client fixture though: #7037

@thrau thrau marked this pull request as ready for review May 6, 2023 16:52
@thrau thrau requested a review from baermat as a code owner May 6, 2023 16:52
Copy link
Member

@baermat baermat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great 🎉 and seems like it was quite the undertaking, I can feel brainpower radiating from the screen :)
I have a couple of suggestions, questions and nits, but nothing breaking, so we should be ready to go soon 🚀

Comment on lines 1285 to 1288
assert ordering in (
["g3-m1", "g1-m1", "g1-m2", "g1-m3", "g1-m4", "g2-m1"],
["g3-m1", "g2-m1", "g1-m1", "g1-m2", "g1-m3", "g1-m4"],
["g1-m1", "g1-m2", "g1-m3", "g1-m4", "g2-m1", "g3-m1"],
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused by this: Is there a specific reason why there is only a subset of the valid orderings?
Let's merge all group 1 messages into "g1" for now as they need to appear in order, then the valid orderings are
g1-g2-g3
g1-g3-g2
g2-g1-g3
g2-g3-g1
g3-g1-g2
g3-g2-g1
Are these 3 are really all we need, or was this simply omitted for simplicity? But then it implies we don't need the missing ones, hence my confusion.
Or I am missing something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you pointed this out, I was a bit sloppy here. Instead of adding all orderings, I instead changed all the tests assert to be much more specific now. Some test cases were actually the same in AWS and LocalStack. There was just one case where I couldn't at all reproduce how AWS does the ordering of groups, and another where there's clearly a race and two orderings are possible. I also added comments whether the ordering is related to AWS or LocalStack. I think this is much clearer.

@thrau thrau force-pushed the fix-sqs-fifo-message-group branch from f08d4ac to 84837c3 Compare May 8, 2023 19:23
@thrau thrau requested a review from baermat May 8, 2023 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

aws:sqs Amazon Simple Queue Service semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases

Projects

None yet

5 participants