Add support for retries with delays (ex. rate limiting) #3449

zkry · 2018-12-26T15:47:53Z

Fixes #3122
This pull request introduces a new type of error which contains information regarding "Retry-After" header when a rate limit is hit. The error then behaves similar to the is-retriable error, except that this doesn't get re-added to the queue immediately, but rather waits until the time is up.

To test this I modified the git-lfs/lfs-test-server server to rate limit requests (maybe I can clean this up and make a PR for this as well), though I guess some automatic test would be needed. What would you suggest for this?

bk2204 · 2019-01-03T15:20:23Z

Hey, this looks like a great improvement. Welcome to Git LFS!

We have a set of integration tests which use the test server in t/cmd and can be run using make integration. The best way to handle this would be to wire up a special name in the test server and then write a test case in shell that used that special repo name.

zkry · 2019-01-04T18:27:17Z

Thanks! Perfect. That makes sense. I'll try to work on this then.

zkry · 2019-01-07T10:29:03Z

I think I got the integration tests figured out along with some cases I missed the first time.

bk2204 · 2019-01-07T18:03:53Z

t/t-batch-retries-ratelimit.sh

+      -c "filter.lfs.process=" \
+      -c "filter.lfs.smudge=cat" \
+      -c "filter.lfs.required=false" \
+      clone "$GITSERVER/$reponame" "$reponame-assert"


Should we not have a test for the clone case as well, since that's a situation in which we're going to have a large number of requests?

Yeah, that makes sense. I updated the PR with the added test.

bk2204

This looks good to me. Thanks again for working on this; I know a bunch of people have been wanting this.

max-wittig · 2019-01-08T15:42:49Z

Awesome! Thanks for implementing this 👍

zkry · 2019-01-08T18:49:26Z

No problem. 👍 Thank you guys for the support.

In commits 2197f76 and 4fe0a8d of PR git-lfs#3449 support was added to parse any Retry-After headers provided by a server with a 429 response and delay making further requests until the appropriate time. As part of this work, the NewRetriableLaterError() function in the "errors" package was introduced, which attempts to parse the Retry-After header value. This method accepts an input error, such as that defined by the handleResponse() method of the "lfshttp" package, with the expectation that it will be wrapped in a new retriableLaterError structure. However, the NewRetriableLaterError() function overwrites its input "err" argument when it attempts to parse the header, and so only wraps a "nil" value. We can resolve this problem by using a different name for the errors returned from the header parsing attempts. As well, we rename the value returned from the attempt to parse the header as a timestamped formatted according to RFC 1123, so it does not collide with the name of the standard library's "time" package.

A prior commit in this PR resolves a bug where a 429 response to an upload or download request causes a Go panic in the client if the response lacks a Retry-After header. The same condition, when it occurs in the response to a batch API request, does not trigger a Go panic; instead, though, we simply fail without retrying the batch API request at all. This stands in constrast to how we now handle 429 responses for object uploads and downloads when no Retry-After header is provided, because in that case, we perform multiple retries, following the exponential backoff logic introduced in PR git-lfs#4097. This difference stems in part from the fact that the download() function of the basicDownloadAdapter structure and the DoTransfer() function of the basicUploadAdapter structure both handle 429 responses by first calling the NewRetriableLaterError() function of the "errors" package to try to parse any Retry-After header, and if that returns nil, then calling the NewRetriableError() function, so they always return some form of retriable error after a 429 status code is received. We therefore modify the handleResponse() method of the Client structure in the "lfshttp" package to likewise always return a retriable error of some kind after a 429 response. If a Retry-After header is found and is able to be parsed, then a retriableLaterError (from the "errors" package) is returned; otherwise, a generic retriableError is returned. This change is not sufficient on its own, however. When the batch API returns 429 responses without a Retry-After header, the transfer queue now retries its requests following the exponential backoff logic, as we expect. If one of those eventually succeeds, though, the batch is still processed as if it encountered an unrecoverable failure, and the Git LFS client ultimately returns a non-zero exit code. The reason this occurs is because the enqueueAndCollectRetriesFor() method of the TransferQueue structure in the "tq" package sets the flag which causes it to return an error for the batch both when an object in the batch cannot be retried (because it has reached its retry limit) or when an object in the batch can be retried but no specific retry wait time was provided by a retriableLaterError. The latter of these two cases is what is now triggered when the batch API returns a 429 status code and no Retry-After header. In commit a3ecbcc of PR git-lfs#4573 this code was updated to improve how batch API 429 responses with Retry-After headers are handled, building on the original code introduced in PR git-lfs#3449 and some fixes in PR git-lfs#3930. This commit added the flag, named hasNonScheduledErrors, which is set if any objects in a batch which experiences an error either can not be retried, or can be retried but don't have a specific wait time as provided by a Retry-After header. If the flag is set, then the error encountered during the processing of the batch is returned by the enqueueAndCollectRetriesFor() method, and although it is wrapped by NewRetriableError function, because the error is returned instead of just a nil, it is collected into the errors channel of the queue by the collectBatches() caller method, and this ultimately causes the client to report the error and return a non-zero exit code. By constrast, the handleTransferResult() method of the TransferQueue structure treats retriable errors from individual object uploads and downloads in the same way for both errors with a specified wait time and those without. To bring our handling of batch API requests into alignment with this approach, we can simply avoid setting the flag variable when a batch encounters an error and an object can be retried but without a specified wait time. We also rename the flag variable to hasNonRetriableObjects, which better reflects its meaning, as it signals the fact that at least one object in the batch can not be retried. As well, we update some related comments to clarify the current actions and intent of this section of code in the enqueueAndCollectRetriesFor() method. We then add a test to the t/t-batch-retries-ratelimit.sh test suite like the ones we added to the t/t-batch-storage-retries-ratelimit.sh script in a previous commit in this PR. The test relies on a new sentinel value in the test repository name which now recognize in our lfstest-gitserver test server, and which causes the test server to return a 429 response to batch API requests, but without a Retry-After header. This test fails without both of the changes we make in this commit to ensure we handle 429 batch API responses without Retry-After headers.

In commit 4fe0a8d of PR git-lfs#3449 we updated our lfstest-gitserver test utility program to simulate rate limits on specific resources, allowing it to be used to validate the changes made in that PR to respect 429 HTTP status code responses sent by servers to the Git LFS client. In the lfstest-gitserver utility, the checkRateLimit() function is called when specific resources are requested by the client, and it determines whether the test server should allow the request to proceed as normal or whether a 429 response should be returned. When each resource is first requested, a timer is enabled by the checkRateLimit() function and the function returns values to the caller indicating that a 429 response with a Retry-After header should be sent to the client. Once the timer has expired, the function permits a fixed number of subsequent requests for the same resource to proceed without delay, after which the timer is re-enabled and future requests will again receive 429 responses. To track the timer and request count values for each resource, the checkRateLimit() function constructs a key from three of its four arguments. However, the function's first argument, named "api", is at present left unused. This oversight stems from the design of the similar incrementRetriesFor(), which also constructs per-resource keys to help track request counts. These counts are used when validating the retry behaviour of the Git LFS client after it receives repeated 5xx HTTP status codes while uploading or downloading objects. In both functions, the "api" argument is unused in the construction of per-resource key strings. The argument is used by the incrementRetriesFor() function to construct a value it checks against the entry, if any, in the oidHandlers map for a given object ID. This map determines whether a specific object's content (and therefore Git LFS SHA-256 object ID) indicates that the test utility server should behave differently for that object. In the case of the incrementRetriesFor() function, this indicates whether a retry count should be kept and reported for the requests for the same object. Arguably, we could simply drop the "api" argument from the set of values passed to the checkRateLimit() function. However, since all our callers of both this function and the incrementRetriesFor() function take care to pass an appropriate value for this argument, either "batch" or "storage", we instead just incorporate the "api" argument into the per-resource keys constructed by both functions, and refactor this key generation into a small getResourceKey() helper function. This change should allow for the checkRateLimit() and incrementRetriesFor() functions to be used more broadly in the future, without concern that we may accidentally replicate the same resource keys across tests of different APIs.

The "batch clone causes retries" test was added to our t/t-batch-retries-ratelimit.sh test script in commit 540d93a of PR git-lfs#3449, with the intent that it should validate the Git LFS client's handling of 429 HTTP status code responses to individual object transfer requests during clone operations. Later, in commit 32f571d of PR git-lfs#4573, this test was revised to instead validate how the client handles 429 responses to Git LFS Batch API requests during clone operations. In the same commit, the "batch clone with multiple files causes retries" test was also added, which performs the same checks but when more than one Git LFS object is being transferred. Like the other tests in this test script, these tests depend on our lfstest-gitserver utility program to track the requests made for certain resources and generate 429 responses under specific conditions. In particular, when a resource is first requested, a timer is enabled by the checkRateLimit() function and the function returns values to the caller indicating that a 429 response with a Retry-After header should be sent to the client. Once the timer has expired, the function permits a fixed number of subsequent requests for the same resource to proceed without delay, after which the timer is re-enabled and future requests will again receive 429 responses. In the case of the "batch clone causes retries" and "batch clone with multiple files causes retries" tests, the tests clone a bare repository, create Git LFS objects in it, and push that change to the test server. The tests then clone the repository again, and expect that during this second clone the server will apply its rate-limiting behaviour. The intent of the tests is that the Git LFS client should be forced to retry its Batch API requests during this clone operation until it no longer receives 429 responses from the test server. However, in practice, the push operation that uploads the Git LFS objects to the server is the one which is rate-limited and delayed by the test server, because the checkRateLimit() function starts its per-resource timer the first time it is invoked for a given resource. (The resource is defined by just the API name "batch" and the test repository name, as the other elements of the resource key are left blank in the calls to the function for these repositories.) The second clone operation then proceeds without delay and without receiving a 429 response. To resolve this problem with the tests, we add a /limits/ endpoint to our lfstest-gitserver program which accepts query arguments defining a given resource key, and a value to set as that key's remaining number of non-rate-limited requests. If the value provided is the string "max", the default maximum value as defined by the refillTokenCount constant is used. Next, we define a set_server_rate_limit() test helper function in our t/testhelpers.sh shell library, which makes a GET request to our new /limits/ endpoint and passes its input arguments as query parameters. We can then invoke this helper function twice in our "batch clone causes retries" and "batch clone with multiple files causes retries" tests, the first time before pushing the commit with the new Git LFS objects, and the second time afterwards. For the first call we provide a "max" value for the number of available non-rate-limited requests, ensuring the push proceeds without delay, and for the second call we reset the number of available non-rate-limited requests to zero, so that the clone operation will now receive a 429 response to its Git LFS Batch API request and be forced to wait and retry that request. Finally, this allows us to check the number of retries performed by the client during the second clone operation. To do this we utilize the same approach we introduced in a prior commit in this PR for several other tests in the same test script, where we check the number of times the "enqueue retry" trace log message is output. For the test with multiple Git LFS objects, we confirm that we see this message exactly once for each object, and only with the "#1" retry count value. These checks requires that we set the GIT_TRACE environment variable for the "git lfs clone" commands and capture the commands' output into clone.log files. Because we use shell pipelines and the tee(1) command for this purpose, we also have to adjust how we check the exit code from the "git lfs clone" commands since they are the first command in the pipelines.

As reported in git-lfs#6007, at present the Git LFS client does not honour the delay specified in a Retry-After header when it is received along with a 429 HTTP status code in response to a request to upload or download an individual Git LFS object. Since PR git-lfs#4097 we have employed an exponential backoff retry strategy after receiving a 429 response without a Retry-After header. This PR also modified the logic used to handle retries of object transfer requests after those receive a 429 response with a Retry-After header and a defined delay interval or fixed retry time, and in making these modifications may have inadvertently caused the problem described above. Starting with PR git-lfs#3449, when a 429 response with a Retry-After header is received in reply to an object transfer request, the handleTransferResult() method of the TransferQueue structure in our "tq" package sets the ReadyTime element of the objectTuple structure representing the object to reflect the interval or fixed time value specified in the Retry-After header. The handleTransferResult() method then sends this objectTuple to the "retries" channel passed to the method as an argument. The enqueueAndCollectRetriesFor() method then reads this channel and appends each objectTuple structure to the set of objects that will be enumerated in the next request to the Git LFS Batch API. Prior to PR git-lfs#4097 this was done with a simple call to the built-in append() Go function inside the loop over the "retries" channel at the end of the enqueueAndCollectRetriesFor() method. In PR git-lfs#4097 this method was refactored so that it defines an anonymous function to append an objectTuple to the set of objects assigned to the next batch request, assigns this anonymous function to the enqueueRetry variable, and then calls the function via the variable whenever a retry request should be made for a given object, including within the loop over the "retries" channel at the end of the method. In this specific use case the function is called with a nil value for its readyTime argument, presumably because the objectTuple's ReadyTime element may already contain the time value determined from a Retry-After header. However, unlike the earlier implementation from PR git-lfs#3449, where the objectTuple was simply appended to the set of objects for the next batch request, the anonymous function assigned to the enqueueRetry variable always resets the objectTuple's ReadyTime value, either to the value of the readyTime function argument, if that is not nil, or to a time value calculated by the ReadyTime() method of the retryCounter structure element of the TransferQueue structure. This latter condition, where the retryCounter structure's ReadyTime() method is used to set the objectTuple's ReadyTime element, is what implements the exponential backoff retry strategy added in PR git-lfs#4097. But this behaviour is now utilized even when the objectTuple has a valid ReadyTime value already set by the handleTransferResult() method, and so this causes the Git LFS client to fail to honour the delays specified in Retry-After headers sent with 429 responses to individual object transfer requests. The same problem does not occur when a Retry-After header is sent with a 429 response to a Batch API request because in that case the enqueueRetry variable's anonymous function is called with a non-nil readyTime argument by the enqueueAndCollectRetriesFor() when it handles the error returned by the Batch() function. The regression only affects responses to individual object transfers because the enqueueAndCollectRetriesFor() method now always calls the enqueueRetry variable's anonymous function with a nil value for its readyTime argument when the method is looping over the "retries" channel. To resolve this problem, we can not simply adjust the anonymous function to respect any pre-existing value in the objectTuple's ReadyTime element, as this may be in the past if more than one retry request needs to be made for an object. That is, the value in the ReadyTime element may have been set by a previous call to the anonymous function, and so we need to reset it in every such call. Instead, we introduce a new retryLaterTime element to the objectTuple structure, and set it in the handleTransferResult() method rather than the ReadyTime element. Next, in the enqueueRetry variable's anonymous function, we check if this value is non-zero, and if so, we use it to populate the ReadyTime element of the objectTuple. We then set the retryLaterTime element back to the zero-equivalent time value so it will not be reused in subsequent calls for any future retry requests for the same object. With these changes in place, we can then update the three tests in our t/t-batch-storage-retries-ratelimit.sh test script where we now expect to see a single pair of "retrying object" and "enqueue retry" trace log messages generated by the tests' "git lfs push", "git lfs pull", and "git lfs clone" commands. These checks follow the same approach we introduced to other tests in this test script and the t/t-batch-retries-ratelimit.sh test script in several prior commits in this PR. We were not previously able to add such checks to the three tests modified in this commit because they would have revealed that multiple retry requests were being made for the single Git LFS object in each test, since the Git LFS client was ignoring the delay specified in the Retry-After headers sent by our lfstest-gitserver test utility program. With the changes in this commit, we can now add checks to these tests to confirm that only one retry request is made in each test.

In PR git-lfs#3449 we added support for the detection of 429 HTTP status code responses to either Git LFS Batch API requests or individual object transfer requests, with an implementation which expected a Retry-After header with a specified retry time or delay period. Commit 2197f76 of that PR introduced two trace log output statements which attempted to report the delay period calculated from the Retry-After header, one each in the enqueueAndCollectRetriesFor() and handleTransferResult() methods of the TransferQueue structure in our "tq" package. Both of these statements attempted to format a value of the Duration type (from the "time" package of the Go standard library) into a number of seconds, expressed as a string, for their trace log messages. However, the type returned by the Seconds() method of the Duration type is a float64 type, so the trace log messages would contain format error markers along with the floating-point value, for instance, "%!s(float64=8.999544366) seconds". The format string for one of these trace log output statements was later updated in commit cf02216 of PR git-lfs#4097, when an exponential backoff technique was introduced to better respect servers which sent responses with 429 status codes but no Retry-After headers. Specifically, the trace log message in the enqueueAndCollectRetriesFor() method was revised to replace the "%s" formatting verb for the float64 return type from the Duration type's Seconds() method with a "%.2f" verb, which correctly converts the float64 value into a string. We now correct the related trace log message in the handleTransferResult() to use the same "%2f" formatting verb for the return type from the Duration type's Seconds() method. We also adjust the text of the message slightly to align more closely with the corresponding message in the enqueueAndCollectRetriesFor() method.

In commit 4fe0a8d of PR git-lfs#3449 we updated our lfstest-gitserver test utility program to simulate rate limits on specific resources, allowing it to be used to validate the changes made in that PR to respect 429 HTTP status code responses sent by servers to the Git LFS client. In the lfstest-gitserver utility, the checkRateLimit() function is called when specific resources are requested by the client, and it determines whether the test server should allow the request to proceed as normal or whether a 429 response should be returned. When each resource is first requested, a timer is enabled by the checkRateLimit() function and the function returns values to the caller indicating that a 429 response with a Retry-After header should be sent to the client. Once the timer has expired, the function permits a fixed number of subsequent requests for the same resource to proceed without delay, after which the timer is re-enabled and future requests will again receive 429 responses. To track the timer and request count values for each resource, the checkRateLimit() function constructs a key from three of its four arguments. However, the function's first argument, named "api", is at present left unused. This oversight stems from the design of the similar incrementRetriesFor(), which also constructs per-resource keys to help track request counts. These counts are used when validating the retry behaviour of the Git LFS client after it receives repeated 5xx HTTP status codes while uploading or downloading objects. In both functions, the "api" argument is unused in the construction of per-resource key strings. The argument is used by the incrementRetriesFor() function to construct a value it checks against the entry, if any, in the oidHandlers map for a given object ID. This map determines whether a specific object's content (and therefore Git LFS SHA-256 object ID) indicates that the test utility server should behave differently for that object. In the case of the incrementRetriesFor() function, this indicates whether a retry count should be kept and reported for the requests for the same object. Arguably, we could simply drop the "api" argument from the set of values passed to the checkRateLimit() function. However, since all our callers of both this function and the incrementRetriesFor() function take care to pass an appropriate value for this argument, either "batch" or "storage", we instead just incorporate the "api" argument into the per-resource keys constructed by both functions, and refactor this key generation into a small getResourceKey() helper function. This change should allow for the checkRateLimit() and incrementRetriesFor() functions to be used more broadly in the future, without concern that we may accidentally replicate the same resource keys across tests of different APIs.

The "batch clone causes retries" test was added to our t/t-batch-retries-ratelimit.sh test script in commit 540d93a of PR git-lfs#3449, with the intent that it should validate the Git LFS client's handling of 429 HTTP status code responses to individual object transfer requests during clone operations. Later, in commit 32f571d of PR git-lfs#4573, this test was revised to instead validate how the client handles 429 responses to Git LFS Batch API requests during clone operations. In the same commit, the "batch clone with multiple files causes retries" test was also added, which performs the same checks but when more than one Git LFS object is being transferred. Like the other tests in this test script, these tests depend on our lfstest-gitserver utility program to track the requests made for certain resources and generate 429 responses under specific conditions. In particular, when a resource is first requested, a timer is enabled by the checkRateLimit() function and the function returns values to the caller indicating that a 429 response with a Retry-After header should be sent to the client. Once the timer has expired, the function permits a fixed number of subsequent requests for the same resource to proceed without delay, after which the timer is re-enabled and future requests will again receive 429 responses. In the case of the "batch clone causes retries" and "batch clone with multiple files causes retries" tests, the tests clone a bare repository, create Git LFS objects in it, and push that change to the test server. The tests then clone the repository again, and expect that during this second clone the server will apply its rate-limiting behaviour. The intent of the tests is that the Git LFS client should be forced to retry its Batch API requests during this clone operation until it no longer receives 429 responses from the test server. However, in practice, the push operation that uploads the Git LFS objects to the server is the one which is rate-limited and delayed by the test server, because the checkRateLimit() function starts its per-resource timer the first time it is invoked for a given resource. (The resource is defined by just the API name "batch" and the test repository name, as the other elements of the resource key are left blank in the calls to the function for these repositories.) The second clone operation then proceeds without delay and without receiving a 429 response. To resolve this problem with the tests, we add a /limits/ endpoint to our lfstest-gitserver program which accepts query arguments defining a given resource key, and a value to set as that key's remaining number of non-rate-limited requests. If the value provided is the string "max", the default maximum value as defined by the refillTokenCount constant is used. Next, we define a set_server_rate_limit() test helper function in our t/testhelpers.sh shell library, which makes a GET request to our new /limits/ endpoint and passes its input arguments as query parameters. We can then invoke this helper function twice in our "batch clone causes retries" and "batch clone with multiple files causes retries" tests, the first time before pushing the commit with the new Git LFS objects, and the second time afterwards. For the first call we provide a "max" value for the number of available non-rate-limited requests, ensuring the push proceeds without delay, and for the second call we reset the number of available non-rate-limited requests to zero, so that the clone operation will now receive a 429 response to its Git LFS Batch API request and be forced to wait and retry that request. Finally, this allows us to check the number of retries performed by the client during the second clone operation. To do this we utilize the same approach we introduced in a prior commit in this PR for several other tests in the same test script, where we check the number of times the "enqueue retry" trace log message is output. For the test with multiple Git LFS objects, we confirm that we see this message exactly once for each object, and only with the "#1" retry count value. These checks requires that we set the GIT_TRACE environment variable for the "git lfs clone" commands and capture the commands' output into clone.log files. Because we use shell pipelines and the tee(1) command for this purpose, we also have to adjust how we check the exit code from the "git lfs clone" commands since they are the first command in the pipelines.

As reported in git-lfs#6007, at present the Git LFS client does not honour the delay specified in a Retry-After header when it is received along with a 429 HTTP status code in response to a request to upload or download an individual Git LFS object. Since PR git-lfs#4097 we have employed an exponential backoff retry strategy after receiving a 429 response without a Retry-After header. This PR also modified the logic used to handle retries of object transfer requests after those receive a 429 response with a Retry-After header and a defined delay interval or fixed retry time, and in making these modifications may have inadvertently caused the problem described above. Starting with PR git-lfs#3449, when a 429 response with a Retry-After header is received in reply to an object transfer request, the handleTransferResult() method of the TransferQueue structure in our "tq" package sets the ReadyTime element of the objectTuple structure representing the object to reflect the interval or fixed time value specified in the Retry-After header. The handleTransferResult() method then sends this objectTuple to the "retries" channel passed to the method as an argument. The enqueueAndCollectRetriesFor() method then reads this channel and appends each objectTuple structure to the set of objects that will be enumerated in the next request to the Git LFS Batch API. Prior to PR git-lfs#4097 this was done with a simple call to the built-in append() Go function inside the loop over the "retries" channel at the end of the enqueueAndCollectRetriesFor() method. In PR git-lfs#4097 this method was refactored so that it defines an anonymous function to append an objectTuple to the set of objects assigned to the next batch request, assigns this anonymous function to the enqueueRetry variable, and then calls the function via the variable whenever a retry request should be made for a given object, including within the loop over the "retries" channel at the end of the method. In this specific use case the function is called with a nil value for its readyTime argument, presumably because the objectTuple's ReadyTime element may already contain the time value determined from a Retry-After header. However, unlike the earlier implementation from PR git-lfs#3449, where the objectTuple was simply appended to the set of objects for the next batch request, the anonymous function assigned to the enqueueRetry variable always resets the objectTuple's ReadyTime value, either to the value of the readyTime function argument, if that is not nil, or to a time value calculated by the ReadyTime() method of the retryCounter structure element of the TransferQueue structure. This latter condition, where the retryCounter structure's ReadyTime() method is used to set the objectTuple's ReadyTime element, is what implements the exponential backoff retry strategy added in PR git-lfs#4097. But this behaviour is now utilized even when the objectTuple has a valid ReadyTime value already set by the handleTransferResult() method, and so this causes the Git LFS client to fail to honour the delays specified in Retry-After headers sent with 429 responses to individual object transfer requests. The same problem does not occur when a Retry-After header is sent with a 429 response to a Batch API request because in that case the enqueueRetry variable's anonymous function is called with a non-nil readyTime argument by the enqueueAndCollectRetriesFor() when it handles the error returned by the Batch() function. The regression only affects responses to individual object transfers because the enqueueAndCollectRetriesFor() method now always calls the enqueueRetry variable's anonymous function with a nil value for its readyTime argument when the method is looping over the "retries" channel. To resolve this problem, we can not simply adjust the anonymous function to respect any pre-existing value in the objectTuple's ReadyTime element, as this may be in the past if more than one retry request needs to be made for an object. That is, the value in the ReadyTime element may have been set by a previous call to the anonymous function, and so we need to reset it in every such call. Instead, we introduce a new retryLaterTime element to the objectTuple structure, and set it in the handleTransferResult() method rather than the ReadyTime element. Next, in the enqueueRetry variable's anonymous function, we check if this value is non-zero, and if so, we use it to populate the ReadyTime element of the objectTuple. We then set the retryLaterTime element back to the zero-equivalent time value so it will not be reused in subsequent calls for any future retry requests for the same object. With these changes in place, we can then update the three tests in our t/t-batch-storage-retries-ratelimit.sh test script where we now expect to see a single pair of "retrying object" and "enqueue retry" trace log messages generated by the tests' "git lfs push", "git lfs pull", and "git lfs clone" commands. These checks follow the same approach we introduced to other tests in this test script and the t/t-batch-retries-ratelimit.sh test script in several prior commits in this PR. We were not previously able to add such checks to the three tests modified in this commit because they would have revealed that multiple retry requests were being made for the single Git LFS object in each test, since the Git LFS client was ignoring the delay specified in the Retry-After headers sent by our lfstest-gitserver test utility program. With the changes in this commit, we can now add checks to these tests to confirm that only one retry request is made in each test.

In PR git-lfs#3449 we added support for the detection of 429 HTTP status code responses to either Git LFS Batch API requests or individual object transfer requests, with an implementation which expected a Retry-After header with a specified retry time or delay period. Commit 2197f76 of that PR introduced two trace log output statements which attempted to report the delay period calculated from the Retry-After header, one each in the enqueueAndCollectRetriesFor() and handleTransferResult() methods of the TransferQueue structure in our "tq" package. Both of these statements attempted to format a value of the Duration type (from the "time" package of the Go standard library) into a number of seconds, expressed as a string, for their trace log messages. However, the type returned by the Seconds() method of the Duration type is a float64 type, so the trace log messages would contain format error markers along with the floating-point value, for instance, "%!s(float64=8.999544366) seconds". The format string for one of these trace log output statements was later updated in commit cf02216 of PR git-lfs#4097, when an exponential backoff technique was introduced to better respect servers which sent responses with 429 status codes but no Retry-After headers. Specifically, the trace log message in the enqueueAndCollectRetriesFor() method was revised to replace the "%s" formatting verb for the float64 return type from the Duration type's Seconds() method with a "%.2f" verb, which correctly converts the float64 value into a string. We now correct the related trace log message in the handleTransferResult() to use the same "%2f" formatting verb for the return type from the Duration type's Seconds() method. We also adjust the text of the message slightly to align more closely with the corresponding message in the enqueueAndCollectRetriesFor() method.

Add support for retries with delays (ex rate limiting)

2197f76

zkry changed the title ~~Add support for retries with delays (ex rate limiting)~~ Add support for retries with delays (ex. rate limiting) Dec 26, 2018

Add integration tests; check other places where 429 could occur

4fe0a8d

bk2204 reviewed Jan 7, 2019

View reviewed changes

Add a test to check that git lfs clone handles rate limits

540d93a

bk2204 approved these changes Jan 8, 2019

View reviewed changes

bk2204 merged commit aab5de6 into git-lfs:master Jan 8, 2019

dlouzan mentioned this pull request Aug 2, 2021

Obey rate-limits, instead of crashing #3122

Closed

DanRunfola mentioned this pull request Jan 20, 2023

Retries with Delays not working in some cases #5261

Closed

chrisd8088 mentioned this pull request Jun 19, 2024

Fix panic caused by accessing non-existent header #5804

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for retries with delays (ex. rate limiting) #3449

Add support for retries with delays (ex. rate limiting) #3449

Uh oh!

zkry commented Dec 26, 2018

Uh oh!

bk2204 commented Jan 3, 2019

Uh oh!

zkry commented Jan 4, 2019

Uh oh!

zkry commented Jan 7, 2019

Uh oh!

bk2204 Jan 7, 2019

Uh oh!

zkry Jan 8, 2019

Uh oh!

sharidz313 Jan 8, 2019

Uh oh!

bk2204 left a comment

Uh oh!

max-wittig commented Jan 8, 2019

Uh oh!

zkry commented Jan 8, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add support for retries with delays (ex. rate limiting) #3449

Add support for retries with delays (ex. rate limiting) #3449

Uh oh!

Conversation

zkry commented Dec 26, 2018

Uh oh!

bk2204 commented Jan 3, 2019

Uh oh!

zkry commented Jan 4, 2019

Uh oh!

zkry commented Jan 7, 2019

Uh oh!

bk2204 Jan 7, 2019

Choose a reason for hiding this comment

Uh oh!

zkry Jan 8, 2019

Choose a reason for hiding this comment

Uh oh!

sharidz313 Jan 8, 2019

Choose a reason for hiding this comment

Uh oh!

bk2204 left a comment

Choose a reason for hiding this comment

Uh oh!

max-wittig commented Jan 8, 2019

Uh oh!

zkry commented Jan 8, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants