ATC uses 100% CPU and ~40 DB connections #2346

@wagdav

Hi there!

Thanks for making Concourse!

Bug Report

Just after updating from 3.13.0 to 3.14.1, we noticed 100% CPU usage and a 10x increase in DB connections on the ATC:

  • the high CPU usage lasts for about 3 minutes and recurs every 5 minutes (graph: atc-cpu)
  • when the CPU usage is high, the ATC seems to be busy checking resources
  • the "build scheduling duration" shows peaks of ~20 minutes, with the same 5-minute period as the CPU load (graph: atc-scheduling-duration)
  • the worker load graph forms a sawtooth-like shape, also with a 5-minute period (graph: worker-load)
  • the number of database connections goes as high as 40 (the nominal value is around 4), and the number of database accesses also increases (graph: atc-db)

In relation to this problem (not sure if it's a cause or an effect), our SSM credential store starts to throttle and the resource checks fail.

Very similar behavior is reported in

I have the feeling that when a resource check fails, the ATC leaks DB (and possibly other) handles. This results in an avalanche of queries to the credential store.

The following can also be handy:

  • Concourse version: 3.14.1
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS
  • Browser (if applicable):
  • Did this used to work? yes, in 3.13.0 it worked fine

Workaround

We managed to stabilize the system by increasing the resource check interval from 1 minute to 5 minutes.
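
For reference, a minimal sketch of how the check interval can be raised for a single resource in the pipeline configuration, assuming the change is made with the resource-level `check_every` field. The resource name and repository URL are placeholders, and this report does not say whether the interval was changed per resource or globally on the ATC.

```yaml
resources:
- name: my-repo                              # hypothetical resource name
  type: git
  source:
    uri: https://example.com/org/repo.git    # placeholder URL
  check_every: 5m                            # default is 1m; a longer interval means fewer checks
```

With this setting, each such resource is checked once every 5 minutes instead of once a minute, which should also cut the number of credential lookups triggered by checks.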
