- Increases the maximum number of concurrent zone signing operations to whichever is lower: 3 or the maximum possible parallelism detected by Rust.
- Increases queue capacity by one to avoid reallocation due to exceeding capacity.
- Fixes wrongly aborting signing due to the queue item being in state Aborted.
- Fixes missing signing statistics in `cascade zone status` caused by (a) walking the queue in the wrong direction and (b) stopping the walk at an Abandoned queue item.
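As a rough illustration of the first bullet, the limit could be computed along these lines. This is a minimal sketch, not the code in this PR: the constant and function names are hypothetical, and it assumes that "parallelism detected by Rust" refers to `std::thread::available_parallelism`.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical name for the upper bound of 3 mentioned in the summary.
const MAX_CONCURRENT_SIGNING: usize = 3;

/// Returns the lower of 3 and the parallelism reported by the standard
/// library, falling back to 1 if detection fails (assumed fallback).
fn max_concurrent_signing_ops() -> usize {
    let detected = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    detected.min(MAX_CONCURRENT_SIGNING)
}

fn main() {
    println!("concurrent signing limit: {}", max_concurrent_signing_ops());
}
```

On a single-core host this yields 1, and on any host with three or more cores it yields 3.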
Philip-NLnetLabs
approved these changes
Oct 14, 2025
Fixes #209.
Note: We should revisit all of this before we plan a production release, as the queuing was thrown together for the alpha release and was not designed or properly reviewed, nor does it have any tests at present, which led to the kind of bug seen in #209.
Below I go into detail about how the queuing works. Along the way we come across the issues that this PR fixes.
Internally the signer uses a ring buffer as a queue, of type `VecDeque`, called `zones_being_signed`, and a semaphore with the same number of permits as the queue is long. When the queue is full, signing requests block until a semaphore permit becomes free. The signer takes items from the queue when it is not too busy signing, each time returning a permit to the semaphore and thus enabling a pending queue push to occur. Items in the queue move through the states `Requested` -> `InProgress` -> `Finished`/`Aborted`.

For example, take a queue with 5 slots (in Cascade alpha-0.1.0 there are actually 100 slots; this limit is hard-coded). Incoming signing requests exercise the queue like so (where T is a moment in time):
Initially the queue is empty. There is no front or back of the queue. There are 5 semaphore permits available.
A signing request for zone `a` is pushed to the back of the queue. Queue items begin life in the queue in state `Requested`. One semaphore permit has been acquired. The queue now has a back and a front that are the same item.

A signing request for zone `b` is pushed to the back of the queue. It too begins life in state `Requested`. In the time since T:1 the enqueued signing request for zone `a` has started signing and has thus entered state `InProgress`. Two semaphore permits have been acquired.

A second signing request for zone `a` is received but will not be able to start signing until (a) the existing signing request for the same zone enters the `Finished` or `Aborted` state, and (b) a concurrent signing slot becomes free. In Cascade 1.0.1-alpha only 1 concurrent signing operation was permitted at a time. This should have been higher but was mistakenly hard-coded to 1. This PR doesn't change this at this time.

Zone `a` is still signing, and requests to sign zones `c`, `d` and `e` were received. The request for zone `e` cannot be pushed onto the queue as there are no free semaphore permits to acquire, and so the async task that is attempting to sign yields until a semaphore permit (and thus a queue slot) becomes free.

Imagine for a moment that the operator now runs `cascade zone status a`. This will cause a read of the queue using a front-to-back iterator, looking for the first occurrence of an entry for zone `a`. Without the queue semaphore, a blocking attempt to push to the queue would hold a write lock on the queue, but because of the semaphore that write lock is only held briefly when a slot becomes available, so we are free to take a read lock. We find the `InProgress` entry for zone `a`, and stored with it are statistics about the ongoing signing operation that are used to report back to the operator.

Zone `a` finished signing, thus enabling zone `b` to start signing, and also returning a permit to the queue semaphore which the waiting task for zone `e` is able to acquire and thus join the queue. In alpha-0.1.0 the push to the back of the queue causes the `VecDeque` to reallocate as it was at capacity. In 5ii we see that the queue is temporarily over capacity by one. After the zone `a` signing request is enqueued, the front of the queue (Fa) is popped and checked to make sure it was in state `Finished`. We now arrive at state 5iii.

Note that at this point the statistics held by the `Finished` queue item for zone `a` will be lost. It will no longer be possible to report the signing statistics for that signing operation, as they are not persisted in state, only held in memory in the queue. This is allowed to fail: `cascade zone status` will simply give less detailed information if there are no other queue items for the same zone. In this case there is another queue item for the same zone, so its statistics will be read instead.

If, however, the signing of zone `a` had been aborted, the pop front would have found a zone `a` queue item in the aborted state and wrongly abandoned the signing attempt for zone `e`. This PR fixes that by also allowing the queue item to be in state `Aborted`, not only `Finished`.

On a very busy system with many zones and short enough expiration durations, signing statistics won't persist for long for use by `cascade zone status`, though they will show while the zones are signing. On a system with fewer zones and/or less frequent re-signing of zones, the statistics will persist for a while, such that `cascade zone status` will be able to show the statistics for the oldest queue record for the zone in question. This PR also changes the iteration order over the queue when making a status report to be back-to-front, so that the statistics reflect the newest signing operation for a zone. As the same zone cannot be signed twice at once (this is prevented by a per-zone semaphore with only one permit), the signing statistics will be reported for the signing operation that is actively queuing, being signed or most recently finished, not the oldest one as before.

Finally, this PR also updates the `get()` fn used by `cascade zone status` to skip queue items in state `Abandoned`. The zone status report only tries to get this report when the zone is signing or has previously been signed, so we want the current `InProgress` or most recent `Finished` queue item to get statistics from, not an `Abandoned` queue item. An `Abandoned` item can occur when `dnst keyset cron` triggers a re-signing operation for a zone that has not yet been signed once, e.g. if cron starts while initial signing is in progress and finishes before signing finishes: the zone has not yet been signed once, so it cannot be re-signed.
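The three queue fixes described above can be sketched in a small, single-threaded model. This is an illustrative sketch under stated assumptions, not the code in this PR: `SigningQueue`, `Item`, `push`, `latest` and the `stats` field are hypothetical names, the semaphore and locking are left out, and `Aborted` and `Abandoned` are assumed to be distinct states.

```rust
use std::collections::VecDeque;

// Queue item states named in the description; treating Aborted and
// Abandoned as separate variants is an assumption of this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State { Requested, InProgress, Finished, Aborted, Abandoned }

#[derive(Clone, Debug, PartialEq, Eq)]
struct Item {
    zone: &'static str,
    state: State,
    stats: u32, // hypothetical stand-in for the real signing statistics
}

struct SigningQueue {
    slots: usize,
    items: VecDeque<Item>,
}

impl SigningQueue {
    fn new(slots: usize) -> Self {
        // One spare slot: the push happens before the pop, so the queue is
        // briefly one item over its slot count and must not reallocate.
        Self { slots, items: VecDeque::with_capacity(slots + 1) }
    }

    /// Pushes to the back, then retires the front item if the queue is over
    /// its slot count. The front may be Finished or Aborted; any other state
    /// is treated as an error and the new item is handed back.
    fn push(&mut self, item: Item) -> Result<(), Item> {
        self.items.push_back(item);
        if self.items.len() > self.slots {
            match self.items.front().map(|i| i.state) {
                Some(State::Finished | State::Aborted) => {
                    self.items.pop_front();
                }
                _ => return Err(self.items.pop_back().expect("just pushed")),
            }
        }
        Ok(())
    }

    /// Walks back-to-front so the newest item for the zone wins, skipping
    /// Abandoned items instead of stopping at them.
    fn latest(&self, zone: &str) -> Option<&Item> {
        self.items
            .iter()
            .rev()
            .find(|i| i.zone == zone && i.state != State::Abandoned)
    }
}

fn main() {
    let mut q = SigningQueue::new(2);
    // States are set directly here as a shortcut; real items start as Requested.
    q.push(Item { zone: "a", state: State::Aborted, stats: 1 }).unwrap();
    q.push(Item { zone: "a", state: State::InProgress, stats: 2 }).unwrap();
    // The queue is full and the front is Aborted: the push must still succeed.
    q.push(Item { zone: "e", state: State::Requested, stats: 0 }).unwrap();
    println!("{:?}", q.latest("a"));
}
```

In the real signer the semaphore guarantees that a push only happens once a slot has been released, so the error branch in `push` models the "front item is in an unexpected state" case rather than ordinary backpressure.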