Conversation
@timvaillancourt (Contributor) commented Nov 25, 2020

This PR causes Freno to ignore Vitess tablets that return unhealthy realtime tablet stats, similar to the way vtgate filters replicas for serving traffic (minus the replication and minimum-node-count checks).

The added logic in this PR relies on realtime stats (an optional feature) and new vtctld API fields added in Vitess 8.0.0. If these fields are not found, the old logic is used, making this change backwards-compatible: tablets with no stats are assumed to be healthy, as they are today.

Example response from the new /keyspace/<ks>/tablets/ API with additional realtime stats:

$ curl -s https://<VTCTLD HOSTNAME>/api/keyspace/test_ks/tablets/ | jq '.[] | select( .hostname == "<REDACTED>" ).stats'
{
  "last_error": "vttablet error: replication is not running",
  "realtime": {
    "health_error": "replication is not running"
  },
  "serving": false,
  "up": true
}

For the most part, the logic is:

  1. serving must be true (this later changed: we need to ignore serving, see comments below)
  2. last_error must be "" (empty)
  3. the realtime sub-document must not be nil
    • I believe this check ignores tablets that haven't registered with vtctld yet, but that's a guess; I'm just copying what Vitess does so we see the same tablets as vtgate/clients 👍
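The checks above can be sketched in Go (Freno's language). The struct and field names below are hypothetical illustrations of the vtctld response shape shown earlier, not Freno's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tabletRealtimeStats and tabletStats are hypothetical types mirroring the
// "stats" sub-document of the vtctld /api/keyspace/<ks>/tablets/ response.
type tabletRealtimeStats struct {
	HealthError         string `json:"health_error"`
	SecondsBehindMaster uint32 `json:"seconds_behind_master"`
}

type tabletStats struct {
	LastError string               `json:"last_error"`
	Realtime  *tabletRealtimeStats `json:"realtime"`
	Serving   bool                 `json:"serving"`
	Up        bool                 `json:"up"`
}

// isHealthy applies the three checks above. A nil stats document falls back
// to the pre-existing behaviour: assume the tablet is healthy.
func isHealthy(s *tabletStats) bool {
	if s == nil {
		return true // no stats found: assume healthy, as today
	}
	return s.Serving && // 1. serving must be true
		s.LastError == "" && // 2. last_error must be "" (empty)
		s.Realtime != nil // 3. realtime sub-document must not be nil
}

func main() {
	// The broken-replication response from the PR description.
	raw := []byte(`{
		"last_error": "vttablet error: replication is not running",
		"realtime": {"health_error": "replication is not running"},
		"serving": false,
		"up": true
	}`)
	var stats tabletStats
	if err := json.Unmarshal(raw, &stats); err != nil {
		panic(err)
	}
	fmt.Println(isHealthy(&stats)) // false: this tablet is ignored
	fmt.Println(isHealthy(nil))    // true: no stats, assume healthy
}
```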

cc @tomkrouper / @drogart / @shlomi-noach

@timvaillancourt changed the title Vitess: ignore replicas with replication not running Vitess: ignore replicas with 'replication is not running' error Nov 25, 2020
@timvaillancourt added this to the v1.1.1 milestone Nov 25, 2020
@timvaillancourt changed the title Vitess: ignore replicas with 'replication is not running' error Vitess: ignore unhealthy replicas with realtime stats Nov 26, 2020
@timvaillancourt temporarily deployed to staging November 26, 2020 18:53 Inactive
@timvaillancourt temporarily deployed to staging November 26, 2020 19:18 Inactive
@timvaillancourt temporarily deployed to staging November 28, 2020 20:42 Inactive

@shlomi-noach left a comment

This looks to be right. Mind you, I'm not that familiar with realtime stats in Vitess.

@timvaillancourt temporarily deployed to staging November 30, 2020 17:12 Inactive
@timvaillancourt (Contributor, Author) commented Dec 9, 2020

This looks to be right. Mind you, I'm not that familiar with realtime stats in Vitess.

After some testing with Vitess, the logic in this PR won't work as it stands. When a node is considered "unhealthy" by Vitess, the vtctld API returns the following:

$ curl -ks https://<vtctld hostname>/api/keyspace/test_ks/tablets/?cells=dc1 \
    | jq '.[] | select( .hostname == "<hostname>" ).stats'
{
  "realtime": {
    "seconds_behind_master": 30
  },
  "serving": false,
  "up": true
}

This response was gathered from a node that exceeds the -discovery_high_replication_lag_minimum_serving=# threshold.

While Vitess won't send reads to a replica with serving: false, this behaviour means the logic in this PR would cause Freno to ignore nodes that lag beyond -discovery_high_replication_lag_minimum_serving=#. Because valid lagging nodes can become serving: false, Freno would hide the lag instead of throttling when replication is overwhelmed.

This means:

  1. This PR needs to consider both serving: false AND serving: true tablets for Freno probes
  2. There will still be cases where a broken node causes Freno to throttle (even though the replica is not serving in Vitess):
    • The replication SQL thread hits an error
    • Replication was running but is stopped later on (e.g. STOP SLAVE, a restart of mysqld, etc.)
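A minimal sketch of the adjusted check, dropping the serving requirement so valid-but-lagging replicas stay visible to Freno's probes. The type and field names are hypothetical, mirroring the vtctld response shape shown above:

```go
package main

import "fmt"

// Hypothetical types mirroring the vtctld "stats" sub-document shown above.
type realtimeStats struct {
	SecondsBehindMaster uint32 `json:"seconds_behind_master"`
}

type tabletStats struct {
	LastError string         `json:"last_error"`
	Realtime  *realtimeStats `json:"realtime"`
	Serving   bool           `json:"serving"`
	Up        bool           `json:"up"`
}

// shouldProbe deliberately ignores the serving flag: a replica lagging past
// -discovery_high_replication_lag_minimum_serving becomes serving: false,
// but Freno must still probe it so the lag is throttled, not hidden.
func shouldProbe(s *tabletStats) bool {
	if s == nil {
		return true // no stats: assume healthy, as today
	}
	return s.LastError == "" && s.Realtime != nil
}

func main() {
	// The lagging-but-valid replica from the response above: serving is
	// false, yet Freno should still probe it and observe the 30s of lag.
	lagging := &tabletStats{
		Realtime: &realtimeStats{SecondsBehindMaster: 30},
		Serving:  false,
		Up:       true,
	}
	fmt.Println(shouldProbe(lagging)) // true
}
```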

In my testing, vtctld didn't "notice" situations where replication breaks after previously being healthy: only the replication lag caused by the problem is reported, as seconds_behind_master, without a last_error being set. This means that in some situations we are unable to differentiate a node with "broken" replication from a valid "lagging" one.

If Vitess periodically updated the realtime stats last_error field with the health of the SQL replication thread, we could tell when replication is broken vs merely overwhelmed with writes. Currently, replication is not running is returned for a new tablet that has never been "seen" replicating, but not when a previously-healthy tablet hits a problem.

cc @tomkrouper / @shlomi-noach for thoughts

@timvaillancourt (Contributor, Author)

Re-requesting review from @shlomi-noach, @drogart and @tomkrouper

@timvaillancourt had a problem deploying to production/role=mysqlutil&environment=staging March 17, 2021 20:24 Failure
@timvaillancourt temporarily deployed to staging March 17, 2021 20:25 Inactive
@timvaillancourt temporarily deployed to production March 17, 2021 20:28 Inactive
@timvaillancourt merged commit 8d8b3f5 into master Mar 18, 2021
@timvaillancourt deleted the vitess-ignore-replication-not-running branch March 18, 2021 18:57