Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

monitoring: restart alerts for services w/o 0-downtime deploys #17299

Merged
merged 1 commit into from
Jan 15, 2021

Conversation

bobheadxi
Copy link
Member

Working container restart alerts were recently added for Kubernetes deployments as part of https://github.com/sourcegraph/sourcegraph/pull/17239

However, it seems that periods of container restarts during deployments are expected for certain services. This change removes the critical alert and warn on extended restarts instead.

@bobheadxi bobheadxi force-pushed the container-restart-zerodowntime branch from 07df01b to 52e9e53 Compare January 15, 2021 03:08
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Jan 15, 2021

Notifying subscribers in CODENOTIFY files for diff 57098f2...52e9e53.

Notify File(s)
@christinaforney doc/admin/observability/alert_solutions.md
@slimsag monitoring/definitions/git_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go
@sourcegraph/distribution monitoring/definitions/git_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go

@codecov
Copy link

codecov bot commented Jan 15, 2021

Codecov Report

Merging #17299 (52e9e53) into main (57098f2) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main   #17299   +/-   ##
=======================================
  Coverage   52.15%   52.15%           
=======================================
  Files        1712     1712           
  Lines       85398    85398           
  Branches     7605     7548   -57     
=======================================
+ Hits        44537    44539    +2     
+ Misses      36954    36950    -4     
- Partials     3907     3909    +2     
Flag Coverage Δ
go 51.23% <ø> (+<0.01%) ⬆️
integration 30.60% <ø> (ø)
storybook 30.23% <ø> (ø)
typescript 54.39% <ø> (ø)
unit 34.87% <ø> (ø)
Impacted Files Coverage Δ
.../internal/codeintel/resolvers/graphql/locations.go 83.50% <0.00%> (ø)
...nal/campaigns/resolvers/changeset_apply_preview.go 60.68% <0.00%> (+1.70%) ⬆️

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Alternatively you should be able to find out restarts per replica and could alert on that? IE is an individual replica restarting more than once will remove the noise of deployments.

@bobheadxi
Copy link
Member Author

restarts per replica and could alert on that?

I don't think we can do dynamic thresholds without a bit of work (update how the thresholds are drawn, etc), and I'm not sure if there's a great way to reliably count replicas for a service across deployment types just from metrics

I'll leave that as some possible future work, will merge this for now because it is currently paging

@bobheadxi bobheadxi changed the title monitoring: adjust restart alerts for services w/o 0-downtime deploys monitoring: restart alerts for services w/o 0-downtime deploys Jan 15, 2021
@bobheadxi bobheadxi merged commit 966827b into main Jan 15, 2021
@bobheadxi bobheadxi deleted the container-restart-zerodowntime branch January 15, 2021 10:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants