Observed
We have recently observed calls rows.Next() blocking indefinitely.
We noticed that a query, which appears to have at-least returned its first block, get stuck on the CH server side, then never results in a read timeout. Instead, our code was permanently blocked on a call to rows.Next().
goroutine 1098 [select, 72 minutes]:
github.com/ClickHouse/clickhouse-go/v2.(*rows).Next(0xc019ae3c80)
/go/pkg/mod/github.com/!click!house/clickhouse-go/[email protected]/clickhouse_rows.go:52 +0x239
Taken from a goroutine profile. Here is where we observed the rows.Next call being blocked for 72 minutes.
Blocking indefinitely on this select statement:
|
select { |
|
case err := <-r.errors: |
|
if err != nil { |
|
r.err = err |
|
return false |
|
} |
|
case block := <-r.stream: |
|
if block == nil { |
|
return false |
|
} |
|
if block.Packet == proto.ServerTotals { |
|
r.row, r.block, r.totals = 0, nil, block |
|
return false |
|
} |
|
r.row, r.block = 0, block |
|
} |
Digging into the client code it appears that ReadTimeout is currently only being applied to the first block:
|
// set a read deadline - alternative to context.Read operation will fail if no data is received after deadline. |
|
c.conn.SetReadDeadline(time.Now().Add(c.readTimeout)) |
|
defer c.conn.SetReadDeadline(time.Time{}) |
|
// context level deadlines override any read deadline |
|
if deadline, ok := ctx.Deadline(); ok { |
|
c.conn.SetDeadline(deadline) |
|
defer c.conn.SetDeadline(time.Time{}) |
|
} |
This is because we're deferring resetting the read (and read/write) deadline to the zero value.
Since the rest of the connection is processed asynchronously in the goroutine below, this read timeout can be cleared regardless whether or not a subsequent is scheduled before returning the rows. Even so, as the docs for SetReadDeadline state, setting this deadline appears for future and current block read calls anyway:
// SetReadDeadline sets the deadline for future Read calls
// and any currently-blocked Read call.
// A zero value for t means Read will not time out.
SetReadDeadline(t time.Time) error
Expected behaviour
Our understanding is that the read deadline will apply to all reads performed during the lifetime of a query.
Such that if CH server fails to deliver subsequent bytes for whatever reason, the client will fail with a timeout and can move on.
Here are a couple thoughts on how ReadTimeout could be interpretted.
I am unsure what the expectation is, I can't see any documentation for it:
read deadline per call to conn.Read
Maybe what we want is to apply the read timeout per call to Read.
Setting the deadline explicit before starting each call.
So that a long query continues to run if we're still getting bytes back from CH.
read deadline on rows.Close
Or perhaps clearing the read timeout should move to occur within rows.Close().
Instead of before returning the rows iterator itself.
If we want ReadTimeout to apply to the entire query.
This obviously could be problematic for users who are forgetting to close their rows iterator and starting another query on the same connection. Since the read deadline would extend to the next query.
We could put in a fail-safe here for folks that miss this, by also clearing it via runtime.SetFinalizer on the rows struct pointer.
In all cases, we should make sure to only clear the deadline once and not if another query has been started and taken its place.
Code example
I sadly don't have a reproducible query, which can create this behaviour in CH to trigger the bad behaviour here in the client. However, maybe there is some mockable test server implementations we can use in this repository? (open question). It is a little fiddly to test as we need to simulate a query sending at least the first block before blocking indefinitely.
If anyone can point me in the direction of where might be best to test that, I would be happy to create a test case!
Error log
Repeat of what we have above, but this is the smoking gun we have in our application:
goroutine 1098 [select, 72 minutes]:
github.com/ClickHouse/clickhouse-go/v2.(*rows).Next(0xc019ae3c80)
/go/pkg/mod/github.com/!click!house/clickhouse-go/[email protected]/clickhouse_rows.go:52 +0x239
Details
Environment
I can ask around and see if I can get some better reproductions and CH server configuration / data.
This is happening in internally within ClickHouse Cloud.
Observed
We have recently observed calls
rows.Next()blocking indefinitely.We noticed that a query, which appears to have at-least returned its first block, get stuck on the CH server side, then never results in a read timeout. Instead, our code was permanently blocked on a call to
rows.Next().Blocking indefinitely on this
selectstatement:clickhouse-go/clickhouse_rows.go
Lines 52 to 67 in 2482bad
Digging into the client code it appears that
ReadTimeoutis currently only being applied to the first block:clickhouse-go/conn_query.go
Lines 41 to 48 in 2482bad
This is because we're deferring resetting the read (and read/write) deadline to the zero value.
Since the rest of the connection is processed asynchronously in the goroutine below, this read timeout can be cleared regardless whether or not a subsequent is scheduled before returning the rows. Even so, as the docs for
SetReadDeadlinestate, setting this deadline appears for future and current block read calls anyway:Expected behaviour
Our understanding is that the read deadline will apply to all reads performed during the lifetime of a query.
Such that if CH server fails to deliver subsequent bytes for whatever reason, the client will fail with a timeout and can move on.
Here are a couple thoughts on how
ReadTimeoutcould be interpretted.I am unsure what the expectation is, I can't see any documentation for it:
read deadline per call to
conn.ReadMaybe what we want is to apply the read timeout per call to Read.
Setting the deadline explicit before starting each call.
So that a long query continues to run if we're still getting bytes back from CH.
read deadline on rows.Close
Or perhaps clearing the read timeout should move to occur within
rows.Close().Instead of before returning the rows iterator itself.
If we want ReadTimeout to apply to the entire query.
This obviously could be problematic for users who are forgetting to close their rows iterator and starting another query on the same connection. Since the read deadline would extend to the next query.
We could put in a fail-safe here for folks that miss this, by also clearing it via
runtime.SetFinalizeron therowsstruct pointer.In all cases, we should make sure to only clear the deadline once and not if another query has been started and taken its place.
Code example
I sadly don't have a reproducible query, which can create this behaviour in CH to trigger the bad behaviour here in the client. However, maybe there is some mockable test server implementations we can use in this repository? (open question). It is a little fiddly to test as we need to simulate a query sending at least the first block before blocking indefinitely.
If anyone can point me in the direction of where might be best to test that, I would be happy to create a test case!
Error log
Repeat of what we have above, but this is the smoking gun we have in our application:
Details
Environment
clickhouse-goversion:v2.37.2CREATE TABLEstatements for tables involved:I can ask around and see if I can get some better reproductions and CH server configuration / data.
This is happening in internally within ClickHouse Cloud.