Skip to content

Don't abort signing if an old queue ring buffer item is in the aborted state.#210

Merged
ximon18 merged 6 commits intomainfrom
209-cannot-acquire-the-queue-semaphore
Oct 14, 2025
Merged

Don't abort signing if an old queue ring buffer item is in the aborted state.#210
ximon18 merged 6 commits intomainfrom
209-cannot-acquire-the-queue-semaphore

Conversation

@ximon18
Copy link
Member

@ximon18 ximon18 commented Oct 13, 2025

Fixes #209.

Note: We should revisit all of this before we plan a production release as the queuing was thrown together for the alpha release and not designed or properly reviewed, nor does it have any tests at present leading to the kind of bug seen in #209.

Below I go into detail about how the queuing works. Along the way we come across the issues that this PR fixes.

Internally the signer uses a ring buffer as a queue, of type VecDeque, called zones_being_signed, and a semaphore with the same number of permits as the queue is long. When the queue is full signing requests block until a semaphore permit becomes free. The signer takes items from the queue when it is not too busy signing. each time returning a permit back to the semaphore thus enabling a pending queue push to occur. Items in the queue move through the states Requested -> InProgress -> Finished / Aborted.

For example, given a queue with 5 slots. In Cascade alpha-0.1.0 there are actually 100 slots (this limit is hard-coded) incoming signing requests exercise the queue like so (where T is a moment in time):

T: 0
    SP = 5/5
   +------------------------+
Q: ||
   +------------------------+

Initially the queue is empty. There is no front or back of the queue. There are 5 semaphore permits available.

T: 1
     v Front, SP = 4/5
   +------------------------+
Q: | Ra |
   +------------------------+
          ^ Back

A signing request for zone a is pushed to the back of the queue. Queue items begin life in the queue in state Requested. One semaphore permit has been acquired. The queue now has a back and a front that are the same item.

T: 2
     v Front, SP = 3/5
   +------------------------+
Q: | Ia | Rb |
   +------------------------+
               ^ Back

A signing request for zone b is pushed to the back of the queue. It too begins life in state Requested. In the time since T:1 the enqueued signing request for zone a has started signing and has thus entered state InProgress. Two semaphore permits have been acquired.

T: 3
     v Front, SP = 2/5
   +------------------------+
Q: | Ia | Rb | Ra |
   +------------------------+
                    ^ Back

A second signing request for zone a is received but will not be able to start signing until (a) the existing signing request for the same zone enters the Finished or Aborted state, and (b) a concurrent signing slot becomes free. In Cascade 1.0.1-alpha only 1 concurrent signing operation at a time was permitted at a time. This should have been higher but was mistakenly hard-coded to 1. This PR doesn't change this at this time.

T: 4
     v Front, SP = 0/5
   +------------------------+
Q: | Ia | Rb | Ra | Rc | Rd |                      | Re |
   +------------------------+
                              ^ Back

Zone a is still signing, and requests to sign for zones c, d and e were received. The request for zone e cannot be pushed onto the queue as there are no free sempahore permits to acquire, and so the async task that is attempting to sign yields until a semaphore permit (and thus queue slot) becomes free.

Imagine for a moment that the operator now runs cascade zone status a. This will cause a read of the queue using a front-to-back iterator, looking for the first occurence of an entry for zone a. Without the queue semaphore, a blocking attempt to push to the queue would hold a write lock on the queue, but because of the semaphore that write lock is only held briefly when a slot becomes available, so we are free to take a read lock. We find the InProgress entry for zone a and stored with it are statistics about the on-going signing operation that are used to report back to the operator.

T: 5i
     v Front, SP = 0/5
   +------------------------+
Q: | Fa | Ib | Ra | Rc | Rd |                      | Re |
   +------------------------+
                              ^ Back
T: 5ii
     v Front, SP = 0/5
   +-----------------------------+
Q: | Fa | Ib | Ra | Rc | Rd | Re |
   +-----------------------------+
     ^ Back

T: 5iii
          v Front, SP = 0/5
   +-----------------------------+
Q: | -- | Ib | Ra | Rc | Rd | Re |
   +-----------------------------+
     ^ Back

Zone a finished signing, thus enabling zone b to start signing, and also returning a permit to the queue semaphore which the waiting task for zone e is able to acquire and thus join the queue. In alpha-0.1.0 the push to the back of the queue causes the VecDeque to reallocate as it was at capacity. In 5ii we see that the queue is temporarily over capacity by one. After the zone a signing request is enqueued the front of the queue (Fa) is popped, and checked to make sure it was in state Finished. We now arrive at state 5iii.

Note that at this point the statistics held by the Finished queue item for zone a will be lost. It will no longer be possible to report the signing statistics for that signing operation (as they are not persisted in state, only held in memory in the queue), but this is allowed to fail, it will simply fail to give more detailed information when cascade zone status is called if there are no other queue items for the same zone, but in this case there is another queue item for the same zone so its statistics will be read instead.

IF however the signing of zone a had been aborted, the pop front would have found a zone a queue item in the aborted state and wrongly abandoned the signing attempt for zone e. This PR fixes that by also allowing the queue item to be in state Aborted, not only Finished.

With a very busy system with many many zones and short enough expiration durations, signing statistics won't persist for long for use by cascade zone status, though will show while the zones are signing. For a system with fewer zones and/or less frequent re-signing of zones, the statistics will persist for a while such that cascade zone status will be able to show the statistics for the oldest queue record for the zone in question. This PR also changes the iteration order over the queue when making a status report to be back-to-front so that the statistics reflect the newest signing operation for a zone, and as the same zone cannot be signed twice at once (this is prevented by a semaphore per zone that each have only one permit) that means that the signing statistics will be reported for the zone that is actively queuing/being signed/most recently finished, not the oldest one as before.

Finally, this PR also updates the get() fn used by zone cascade status to skip queue items in status Abandoned, because the zone status report only tries to get this report when the zone is signing or has been previously signed, so we want the current InProgress or most recent Finished queue item to get statistics for, not an Abdanoned queue item (which can happen when dnst keyset cron triggered a re-signing operation for a zone that has not yet been signed once, e.g. if cron starts while initial signing is in-progress and finished before signing finishes, the zone has not yet been signed once so cannot be re-signed).

- Increases the maximum number of concurrent zone signing operations to
whatever is lower: 3 or the maximum possible parallelism detected by
Rust.
- Increases queue capacity by one to avoid reallocation to due exceeding
capacity.
- Fixes wrongly aborting signing due to the queue item being in state
Aborted.
- Fixes missing signing statistics in `cascade zone status` caused by
(a) walking the queue in the wrong direction and (b) stopping the walk
at an Abandoned queue item.
@ximon18 ximon18 requested a review from a team October 13, 2025 22:36
@ximon18 ximon18 added the bug Something isn't working label Oct 13, 2025
@ximon18 ximon18 linked an issue Oct 13, 2025 that may be closed by this pull request
@ximon18 ximon18 added this to the 0.1.0-alpha2 milestone Oct 13, 2025
@ximon18 ximon18 merged commit 3c2a154 into main Oct 14, 2025
27 checks passed
@ximon18 ximon18 deleted the 209-cannot-acquire-the-queue-semaphore branch October 14, 2025 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Cannot acquire the queue semaphore""

2 participants