Welcome to Codidact Meta!
Codidact Meta is the "town hall" (meta-discussion site) for the Codidact community network and the Codidact software. Whether you have bug reports or feature requests, support questions or rule discussions that touch the whole network – this is the site for you.
Incident postmortem: 20th Sep site availability problems
On 20th September 2025, Codidact communities experienced a period of slow responses or complete unavailability that took several hours to resolve. This is what happened.
Incident
The culmination of this incident was yesterday, the 20th September. From the early hours of the morning (UTC) our availability monitoring started alerting that communities were periodically unavailable, and in manual testing any requests that were getting through were very slow.
We identified the cause and put a fix in place around 15:00 UTC and continued to monitor. This appeared to resolve the problem and the incident was finally closed this morning, 21st September.
Root cause
In the lead-up to this incident, we’ve been experiencing a number of issues with site reliability and availability. Primarily, we’ve been under very heavy load for several weeks, serving an average of 20,000 requests per hour. Although this has caused some slowdowns recently, we have broadly coped (bar the problems in the linked post) with additional caching and by updating our code to speed up slow or inefficient actions, which took enough demand off our server to allow us to manage.
Over the last week or so, however, we’ve seen a higher sustained peak rate of requests, topping out at 35,000 requests per hour. We’ve also seen a reduction in the effectiveness of our CDN caching, dropping from around 70% of requests cached at peak to around 25% this week. Together, these factors have meant more requests hitting our server. We were still able to deal with this level of demand, so although communities slowed down we remained up.
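To put rough numbers on that combined effect (using the round figures above, and assuming the cache-hit rates applied uniformly across all traffic), the volume of requests actually reaching our origin server grew far faster than the headline request rate:

```python
# Rough estimate of origin-bound traffic, using the figures from the post.
# Assumes the CDN cache-hit rates applied uniformly across all requests.

def origin_requests_per_hour(total_per_hour: int, cache_hit_rate: float) -> float:
    """Requests that miss the CDN cache and therefore hit the origin server."""
    return total_per_hour * (1 - cache_hit_rate)

before = origin_requests_per_hour(20_000, 0.70)  # earlier weeks: ~6,000/hour
after = origin_requests_per_hour(35_000, 0.25)   # incident week: 26,250/hour

print(round(before), round(after))
print(round(after / before, 1))  # roughly a 4x increase in origin traffic
```

So a 75% rise in total requests, combined with the cache-hit drop, translated into more than four times the load on the origin.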
However, the level of demand on our server meant that we were making significantly more requests of our database than normal. Our database uses AWS' gp2 storage, which has the concept of “burst performance” - it can sustain performance above the baseline for a while, but not indefinitely. The level of demand we were under meant we exhausted our burst performance credit, leaving the database running at its relatively slow baseline performance for all requests. This caused slow responses, which caused requests to queue up - these requests eventually timed out, resulting in the 504 errors that we were seeing.
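For readers unfamiliar with gp2's burst mechanics, the documented behaviour is a baseline of 3 IOPS per GiB (with a 100 IOPS floor), the ability to burst up to 3,000 IOPS, and a bucket of 5.4 million I/O credits that refills at the baseline rate. The sketch below shows how quickly a sustained load drains that bucket; the volume size and workload figures are illustrative assumptions, not our actual configuration:

```python
# Illustrative sketch of gp2 "burst bucket" mechanics (documented AWS
# behaviour: baseline of 3 IOPS per GiB with a 100 IOPS floor, burst up to
# 3,000 IOPS, and a 5.4 million I/O credit bucket refilled at the baseline
# rate). Volume size and workload below are assumptions, not our real ones.

BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket

def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline performance: 3 IOPS per GiB, minimum 100 IOPS."""
    return max(100, 3 * size_gib)

def hours_until_exhausted(size_gib: int, sustained_iops: int) -> float:
    """How long a full bucket lasts under a constant IOPS demand."""
    baseline = gp2_baseline_iops(size_gib)
    if sustained_iops <= baseline:
        return float("inf")  # never drains: baseline covers the demand
    drain_per_second = sustained_iops - baseline
    return BUCKET_CREDITS / drain_per_second / 3600

# e.g. a hypothetical 100 GiB volume (300 IOPS baseline) under a sustained
# 1,000 IOPS load exhausts its credit in roughly two hours - after which
# every I/O runs at the 300 IOPS baseline.
print(round(hours_until_exhausted(100, 1000), 1))
```

The key property is that a workload can run fine for hours (or weeks, intermittently) and then fall off a cliff once the bucket empties, which matches what we observed.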
Response & Remediation
The issue was identified by our automated availability monitoring in the early hours UTC, although at this point nobody with access was available to investigate.
Once an investigator came online, we initially looked into server performance, thinking the web server was overloaded. This showed that it was running well within its capacity. Our web request profiler was showing some long waits for the database, so our initial working theory became exhaustion of the database connection pool. Our web server is set up with as many request threads as database connections in the pool, so we increased both request threads and available pooled connections from 5 to 20, then to 40, and restarted. This briefly appeared to resolve the issue, but it returned within 10-15 minutes or so.
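The reasoning behind those pool increases can be sketched with a back-of-the-envelope throughput bound: with N request threads (matched 1:1 to pooled database connections) and an average request time of t seconds, the server can complete at most N/t requests per second, and anything arriving faster queues up. The per-request times below are hypothetical, not measured values:

```python
# Back-of-the-envelope for why the connection pool looked like the culprit.
# With N request threads (matched 1:1 to pooled DB connections) and an
# average request time of t seconds, throughput is capped at N/t requests
# per second; arrivals beyond that queue up and eventually time out.
# The per-request times here are hypothetical, not measured values.

def max_requests_per_hour(threads: int, avg_request_seconds: float) -> float:
    """Upper bound on throughput for a fixed-size thread/connection pool."""
    return threads / avg_request_seconds * 3600

# Healthy database (say 200 ms per request): even 5 threads give a
# ~90,000/hour ceiling, comfortably above 35,000/hour of demand.
print(round(max_requests_per_hour(5, 0.2)))

# Database stuck at slow baseline storage (say 10 s per request): even 80
# threads give only a 28,800/hour ceiling - below demand, so queues grow.
print(round(max_requests_per_hour(80, 10.0)))
```

This is consistent with what we saw: growing the pool raised the ceiling and bought temporary relief, but while the database itself stayed slow, no pool size could keep up.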
We moved our investigation to AWS, and briefly took the database server offline for maintenance to enable additional logging. This showed nothing particularly outstanding or unexpected. However, it did have the effect of reducing demand on the database to zero for a period of around 30 minutes, and crucially involved a database reboot, which - unknown to us at the time - restored our burst credit to full. With the issue still unresolved, we increased the thread and connection pools again to 80 each and restarted.
This initially resulted in the connection pool being fully used, at all 80 connections, but usage then dropped back to pre-incident levels of ~15 concurrent connections. This indicated to us that the backlog of requests had been served and that incoming requests were once again being served sufficiently quickly, and communities returned to normal.
Image shows concurrent database connections over the incident investigation period, initially tracking with the increases we made, then spiking to 80 connections before quickly returning to normal at around 15 connections.
While the communities appeared to be working normally we looked further into AWS monitoring, and finally identified the exhausted burst credit as the ultimate cause. We were able to see the period of baseline performance followed by the restoration to 100% following the database reboot.
Image shows burst credit over the month preceding the incident. There is a period towards the end of the graph hovering around 0% which correlates with the incident, followed by a jump to 100% correlating with the database reboot.
This concern was quickly resolved by moving our database storage to AWS' gp3 storage, which can sustain a higher performance level indefinitely. After all the investigation that went into this, the final fix took all of two minutes to apply!
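The documented difference between the two volume types makes it clear why this was such a quick fix: gp3 replaces gp2's size-dependent baseline and credit bucket with a flat baseline of 3,000 IOPS (provisionable up to 16,000) regardless of volume size. The volume size below is an illustrative assumption:

```python
# Documented contrast between the two EBS volume types: gp2 earns a baseline
# of 3 IOPS per GiB (100 IOPS floor) and relies on a burst-credit bucket to
# exceed it; every gp3 volume gets a flat 3,000 IOPS baseline with no credit
# mechanism at all. The 100 GiB size below is an illustrative assumption.

def gp2_baseline_iops(size_gib: int) -> int:
    return max(100, 3 * size_gib)

def gp3_baseline_iops(size_gib: int) -> int:
    return 3000  # flat baseline regardless of size (provisionable to 16,000)

size = 100
print(gp2_baseline_iops(size))  # 300 IOPS, plus only a finite burst allowance
print(gp3_baseline_iops(size))  # 3,000 IOPS, sustainable indefinitely
```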
Learning & Next Steps
Burst performance of gp2 storage is not something we have ever been concerned about before now. The available level of performance and burst credit is high enough to handle a volume of requests we previously had no reason to expect we’d reach. However, the unprecedented level of traffic we’ve received in the last month or so, and particularly the last week, crossed that threshold. Since this isn’t a concern we’ve had before, we had no active monitoring or alerting in place for it, which is why it took as long as it did to identify this as the cause.
The move to gp3 storage has negated this as a concern completely, because gp3 storage does not use the concept of burst performance and can sustain a higher level of performance indefinitely. This has also removed the need for us to add additional monitoring or alerting for this. We have, however, added alerting for database CPU credit, and connected all our existing alarms to additional alerting channels so that the right people can be aware of any alarms earlier.
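For the curious, an alarm of the kind described above can be created through CloudWatch's put_metric_alarm API (here via boto3). The instance identifier, threshold, and SNS topic below are hypothetical placeholders, not our actual configuration:

```python
# Sketch of a CloudWatch alarm on the RDS CPUCreditBalance metric, of the
# kind described above, using boto3's put_metric_alarm parameters. Instance
# identifier, threshold, and SNS topic are hypothetical placeholders.

alarm_params = {
    "AlarmName": "rds-cpu-credit-balance-low",
    "Namespace": "AWS/RDS",            # RDS CloudWatch metrics namespace
    "MetricName": "CPUCreditBalance",  # CPU credits remaining (burstable instances)
    "Dimensions": [
        {"Name": "DBInstanceIdentifier", "Value": "example-db"}  # placeholder
    ],
    "Statistic": "Average",
    "Period": 300,              # evaluate over 5-minute windows
    "EvaluationPeriods": 3,     # alarm after 15 minutes below threshold
    "Threshold": 50.0,          # illustrative credit floor
    "ComparisonOperator": "LessThanThreshold",
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:example-alerts"],  # placeholder
}

# With AWS credentials configured, this would create the alarm:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(alarm_params["MetricName"])
```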
We don’t believe there are any remaining impacts from this incident that still need resolving. As an additional bonus, we were able to deploy the changes we’ve been working on recently, which we believe should resolve the vast majority of the other issues we’ve been experiencing, including those persistent 422 errors. If you do see any remaining issues, please flag them to us via Discord or Meta as usual.
Thank you for your patience with us as we’ve worked through the issues these last few weeks. It’s been very frustrating chasing down issues that we’ve struggled to reproduce, or in this case with no obvious cause - your support is appreciated.