Skip to content

ReadTimeout appears to apply to the first read block only #1607

@GeorgeMac

Description

@GeorgeMac

Observed

We have recently observed calls rows.Next() blocking indefinitely.

We noticed that a query, which appears to have at-least returned its first block, get stuck on the CH server side, then never results in a read timeout. Instead, our code was permanently blocked on a call to rows.Next().

goroutine 1098 [select, 72 minutes]:
github.com/ClickHouse/clickhouse-go/v2.(*rows).Next(0xc019ae3c80)
	/go/pkg/mod/github.com/!click!house/clickhouse-go/[email protected]/clickhouse_rows.go:52 +0x239

Taken from a goroutine profile. Here is where we observed the rows.Next call being blocked for 72 minutes.

Blocking indefinitely on this select statement:

select {
case err := <-r.errors:
if err != nil {
r.err = err
return false
}
case block := <-r.stream:
if block == nil {
return false
}
if block.Packet == proto.ServerTotals {
r.row, r.block, r.totals = 0, nil, block
return false
}
r.row, r.block = 0, block
}

Digging into the client code it appears that ReadTimeout is currently only being applied to the first block:

// set a read deadline - alternative to context.Read operation will fail if no data is received after deadline.
c.conn.SetReadDeadline(time.Now().Add(c.readTimeout))
defer c.conn.SetReadDeadline(time.Time{})
// context level deadlines override any read deadline
if deadline, ok := ctx.Deadline(); ok {
c.conn.SetDeadline(deadline)
defer c.conn.SetDeadline(time.Time{})
}

This is because we're deferring resetting the read (and read/write) deadline to the zero value.

Since the rest of the connection is processed asynchronously in the goroutine below, this read timeout can be cleared regardless whether or not a subsequent is scheduled before returning the rows. Even so, as the docs for SetReadDeadline state, setting this deadline appears for future and current block read calls anyway:

	// SetReadDeadline sets the deadline for future Read calls
	// and any currently-blocked Read call.
	// A zero value for t means Read will not time out.
	SetReadDeadline(t time.Time) error

Expected behaviour

Our understanding is that the read deadline will apply to all reads performed during the lifetime of a query.
Such that if CH server fails to deliver subsequent bytes for whatever reason, the client will fail with a timeout and can move on.

Here are a couple thoughts on how ReadTimeout could be interpretted.
I am unsure what the expectation is, I can't see any documentation for it:

read deadline per call to conn.Read

Maybe what we want is to apply the read timeout per call to Read.
Setting the deadline explicit before starting each call.
So that a long query continues to run if we're still getting bytes back from CH.

read deadline on rows.Close

Or perhaps clearing the read timeout should move to occur within rows.Close().
Instead of before returning the rows iterator itself.
If we want ReadTimeout to apply to the entire query.

This obviously could be problematic for users who are forgetting to close their rows iterator and starting another query on the same connection. Since the read deadline would extend to the next query.
We could put in a fail-safe here for folks that miss this, by also clearing it via runtime.SetFinalizer on the rows struct pointer.
In all cases, we should make sure to only clear the deadline once and not if another query has been started and taken its place.

Code example

I sadly don't have a reproducible query, which can create this behaviour in CH to trigger the bad behaviour here in the client. However, maybe there is some mockable test server implementations we can use in this repository? (open question). It is a little fiddly to test as we need to simulate a query sending at least the first block before blocking indefinitely.

If anyone can point me in the direction of where might be best to test that, I would be happy to create a test case!

Error log

Repeat of what we have above, but this is the smoking gun we have in our application:

goroutine 1098 [select, 72 minutes]:
github.com/ClickHouse/clickhouse-go/v2.(*rows).Next(0xc019ae3c80)
	/go/pkg/mod/github.com/!click!house/clickhouse-go/[email protected]/clickhouse_rows.go:52 +0x239

Details

Environment

  • clickhouse-go version: v2.37.2
  • Interface: ClickHouse API
  • Go version: 1.23
  • Operating system: Linux/Arm
  • ClickHouse version: 25.4
  • Is it a ClickHouse Cloud?
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for all these tables, use clickhouse-obfuscator if necessary

I can ask around and see if I can get some better reproductions and CH server configuration / data.
This is happening in internally within ClickHouse Cloud.

Metadata

Metadata

Assignees

Labels

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions