Conversation
@timvaillancourt (Contributor) commented Nov 25, 2020

This PR causes Freno to ignore Vitess tablets that return unhealthy realtime tablet stats, similar to the way vtgate filters replicas for serving traffic (minus the replication and minimum-node-count checks).

The added logic in this PR relies on realtime stats (an optional feature) and new vtctld API fields added in Vitess 8.0.0. If these fields are not found, the old logic is used, making this change backwards-compatible: tablets with no stats are assumed to be healthy, as they are today.

Example response from the new /keyspace/<ks>/tablets/ API with additional realtime stats:

$ curl -s https://<VTCTLD HOSTNAME>/api/keyspace/test_ks/tablets/ | jq '.[] | select( .hostname == "<REDACTED>" ).stats'
{
  "last_error": "vttablet error: replication is not running",
  "realtime": {
    "health_error": "replication is not running"
  },
  "serving": false,
  "up": true
}

For the most part, the logic is:

  1. serving must be true (this later changed: we need to ignore serving, see comments below)
  2. last_error must be "" (empty)
  3. the realtime sub-document must not be nil
    • I believe this check ignores tablets that haven't registered with vtctld yet, but that's a guess; I'm just copying what Vitess does so we see the same tablets as vtgate/clients 👍
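The checks above can be sketched in Go (Freno's language). The struct and field names below are hypothetical illustrations of the vtctld response shape shown earlier, not Freno's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tabletRealtimeStats and tabletStats are hypothetical types mirroring the
// "stats" sub-document of the vtctld /api/keyspace/<ks>/tablets/ response.
type tabletRealtimeStats struct {
	HealthError         string `json:"health_error"`
	SecondsBehindMaster uint32 `json:"seconds_behind_master"`
}

type tabletStats struct {
	LastError string               `json:"last_error"`
	Realtime  *tabletRealtimeStats `json:"realtime"`
	Serving   bool                 `json:"serving"`
	Up        bool                 `json:"up"`
}

// isHealthy applies the three checks above. A nil stats document falls back
// to the pre-existing behaviour: assume the tablet is healthy.
func isHealthy(s *tabletStats) bool {
	if s == nil {
		return true // no stats found: assume healthy, as today
	}
	return s.Serving && // 1. serving must be true
		s.LastError == "" && // 2. last_error must be "" (empty)
		s.Realtime != nil // 3. realtime sub-document must not be nil
}

func main() {
	// The broken-replication response from the PR description.
	raw := []byte(`{
		"last_error": "vttablet error: replication is not running",
		"realtime": {"health_error": "replication is not running"},
		"serving": false,
		"up": true
	}`)
	var stats tabletStats
	if err := json.Unmarshal(raw, &stats); err != nil {
		panic(err)
	}
	fmt.Println(isHealthy(&stats)) // false: this tablet is ignored
	fmt.Println(isHealthy(nil))    // true: no stats, assume healthy
}
```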

cc @tomkrouper / @drogart / @shlomi-noach

@timvaillancourt changed the title Vitess: ignore replicas with replication not running Vitess: ignore replicas with 'replication is not running' error Nov 25, 2020
@timvaillancourt added this to the v1.1.1 milestone Nov 25, 2020
@timvaillancourt changed the title Vitess: ignore replicas with 'replication is not running' error Vitess: ignore unhealthy replicas with realtime stats Nov 26, 2020
@timvaillancourt temporarily deployed to staging November 26, 2020 18:53 Inactive
@timvaillancourt temporarily deployed to staging November 26, 2020 19:18 Inactive
@timvaillancourt temporarily deployed to staging November 28, 2020 20:42 Inactive

@shlomi-noach left a comment

This looks to be right. Mind you, I'm not that familiar with realtime stats in Vitess.

@timvaillancourt temporarily deployed to staging November 30, 2020 17:12 Inactive
@timvaillancourt (Contributor, Author) commented Dec 9, 2020

This looks to be right. Mind you, I'm not that familiar with realtime stats in Vitess.

After some testing with Vitess, the logic in this PR won't work as it stands. When a node is considered "unhealthy" by Vitess, the vtctld API returns the following:

$ curl -ks https://<vtctld hostname>/api/keyspace/test_ks/tablets/?cells=dc1 \
    | jq '.[] | select( .hostname == "<hostname>" ).stats'
{
  "realtime": {
    "seconds_behind_master": 30
  },
  "serving": false,
  "up": true
}

This response was gathered from a node that exceeds the -discovery_high_replication_lag_minimum_serving=# threshold.

While Vitess won't send reads to a replica with serving: false, this behaviour means the logic in this PR would cause Freno to ignore nodes that lag beyond -discovery_high_replication_lag_minimum_serving=#. Because valid lagging nodes can become serving: false, Freno would hide the lag instead of throttling when replication is overwhelmed.

This means:

  1. This PR needs to consider both serving: false AND serving: true tablets for Freno probes
  2. There will still be cases where a broken node causes Freno to throttle (even though the replica is not serving in Vitess):
    • The replication SQL thread hits an error
    • Replication was running but is stopped later on (e.g. STOP SLAVE, a restart of mysqld, etc.)
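A minimal sketch of the adjusted check, dropping the serving requirement so valid-but-lagging replicas stay visible to Freno's probes. The type and field names are hypothetical, mirroring the vtctld response shape shown above:

```go
package main

import "fmt"

// Hypothetical types mirroring the vtctld "stats" sub-document shown above.
type realtimeStats struct {
	SecondsBehindMaster uint32 `json:"seconds_behind_master"`
}

type tabletStats struct {
	LastError string         `json:"last_error"`
	Realtime  *realtimeStats `json:"realtime"`
	Serving   bool           `json:"serving"`
	Up        bool           `json:"up"`
}

// shouldProbe deliberately ignores the serving flag: a replica lagging past
// -discovery_high_replication_lag_minimum_serving becomes serving: false,
// but Freno must still probe it so the lag is throttled, not hidden.
func shouldProbe(s *tabletStats) bool {
	if s == nil {
		return true // no stats: assume healthy, as today
	}
	return s.LastError == "" && s.Realtime != nil
}

func main() {
	// The lagging-but-valid replica from the response above: serving is
	// false, yet Freno should still probe it and observe the 30s of lag.
	lagging := &tabletStats{
		Realtime: &realtimeStats{SecondsBehindMaster: 30},
		Serving:  false,
		Up:       true,
	}
	fmt.Println(shouldProbe(lagging)) // true
}
```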

In my testing, vtctld didn't "notice" situations where replication breaks after previously being healthy: only the replication lag caused by the problem is reported, as seconds_behind_master, without a last_error being set. This means that in some situations we are unable to differentiate a node with "broken" replication from a valid "lagging" one.

If Vitess periodically updated the realtime stats last_error field with the health of the SQL replication thread, we could tell when replication is broken vs merely overwhelmed with writes. Currently, replication is not running is returned for a new tablet that has never been "seen" replicating, but not when a previously-healthy tablet hits a problem.

cc @tomkrouper / @shlomi-noach for thoughts

@timvaillancourt (Contributor, Author)

Re-requesting review from @shlomi-noach, @drogart and @tomkrouper

@timvaillancourt had a problem deploying to production/role=mysqlutil&environment=staging March 17, 2021 20:24 Failure
@timvaillancourt temporarily deployed to staging March 17, 2021 20:25 Inactive
@timvaillancourt temporarily deployed to production March 17, 2021 20:28 Inactive
@timvaillancourt merged commit 8d8b3f5 into master Mar 18, 2021
@timvaillancourt deleted the vitess-ignore-replication-not-running branch March 18, 2021 18:57