fix: table-not-found issue with executeSelect while running long queries #2222
prash-mi merged 14 commits into googleapis:main
Conversation
…n table_not_found
# Conflicts: # README.md
```java
// for getQueryResults) per iteration of the loop
long startTimeMs = System.currentTimeMillis();
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job
long poolingIntervalMs = 60000; // 1 min
```
Thanks for pointing it out, I have updated it.
```java
// This logic will wait for approx (poolingIntervalMs + 10 seconds which is the default timeout
// for getQueryResults) per iteration of the loop
long startTimeMs = System.currentTimeMillis();
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job
```
I think it's reasonable to leave the totalTimeoutMs and assorted logic out of this PR. You may in the future want to add a user configurable timeout, but per-rpc timeouts are sufficient.
BQ guarantees that jobs make forward progress (a job won't get stuck in pending forever). This interval is so long you may as well just trust the system.
Sure. I have updated the logic and have added a timeoutMs param at the RPC layer, which is currently hardcoded to 60 seconds. QQ: in the very corner case where the job runs for 18 hours and is still not complete, will the backend throw an error? Otherwise this becomes an infinite loop, as jobComplete will never be true.
Also, I have captured a task to make timeoutMs user configurable here: #2240 (point #6)
```java
} else { // wait for the defined poolingIntervalMs and the loop will retry
  try {
    Thread.sleep(poolingIntervalMs);
```
The behavior here seems wrong. getQueryResults will poll from the server side for up to 10 seconds, and then you sleep for 60 in the client? Effective latency for an 11 sec query becomes more than a minute.
You should not wait on the client side at all, just re-issue a subsequent getQueryResults call. Allow the poll and wait to happen within BQ.
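The pattern being suggested can be sketched as below. This is a minimal, self-contained simulation, not the library's actual code: the `Supplier<Boolean>` stands in for a `jobs.getQueryResults` call whose server-side wait (`timeoutMs`) happens inside BigQuery, so the client re-issues the call immediately instead of sleeping:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ServerSideWaitSketch {
    // Stand-in for jobs.getQueryResults: in the real API the server holds the
    // call open for up to timeoutMs and reports whether the job has completed.
    static boolean getQueryResults(Supplier<Boolean> jobState) {
        return jobState.get();
    }

    // Re-issue getQueryResults as soon as an incomplete result comes back;
    // note there is no Thread.sleep here -- the poll-and-wait is server side.
    static int pollUntilComplete(Supplier<Boolean> jobState) {
        int calls = 0;
        boolean complete = false;
        while (!complete) {
            complete = getQueryResults(jobState);
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Simulated job that completes on the 3rd poll.
        AtomicInteger remaining = new AtomicInteger(3);
        Supplier<Boolean> jobState = () -> remaining.decrementAndGet() <= 0;
        System.out.println(pollUntilComplete(jobState)); // prints 3
    }
}
```

With this shape, an 11-second query costs two round trips rather than a full client-side sleep interval.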
Done, updated the logic to use RPC-based poll and wait.
shollyman left a comment
Thanks for getting all this together!
Internal bug ref: b/241134681.
A `table-not-found` issue has been observed with `executeSelect` while running long queries. The issue occurs when the storage read session is initialised while the query job is not yet complete. This is a short-term fix where we poll the job's status using `jobs.getQueryResults` and initialise the session with the Read API only once the job is complete, thus avoiding `table-not-found`. Capturing this FR for the long-term fix/re-design: #2240
Ref: go/executeselect-re-design
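The short-term fix described above can be sketched as follows. This is a runnable simulation with hypothetical names (`QueryResults`, `fakeJob`, `runQuery`), not the actual `executeSelect` implementation; the comments note where the real google-cloud-bigquery and Storage Read API calls would go:

```java
import java.util.Iterator;
import java.util.List;

public class ExecuteSelectFixSketch {
    // Stand-in for a jobs.getQueryResults response: only jobComplete matters here.
    record QueryResults(boolean jobComplete) {}

    // Simulated job that completes on the 3rd poll.
    static Iterator<QueryResults> fakeJob =
        List.of(new QueryResults(false), new QueryResults(false), new QueryResults(true))
            .iterator();

    // Poll the job's status until it completes, and only then initialise the
    // read session -- so the session never sees a table that doesn't exist yet.
    static String runQuery() {
        QueryResults results;
        do {
            // Real code: call jobs.getQueryResults with a server-side timeoutMs.
            results = fakeJob.next();
        } while (!results.jobComplete());
        // Real code: create the Storage Read API session here.
        return "read-session";
    }

    public static void main(String[] args) {
        System.out.println(runQuery()); // prints read-session
    }
}
```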