Check run infrastructure impaired

Incident Report for Checkly

Postmortem

On May 1st at 11:15 pm UTC, memory usage in our check scheduling service began to build up. The cause was a user creating an extremely high number of client certificates for their checks. Over time, this caused the scheduler instances to crash, until almost no checks could be scheduled between ~11 pm and 1 am UTC.

After an initial investigation, the on-call engineer decided to declare an incident and scale up the scheduling service to handle the high memory consumption.

This partially restored service but was not enough for a full recovery. Memory usage was still increasing very quickly and causing crashes, which led to some check runs being lost. Check scheduling remained significantly impaired.

After investigating using logs, metrics, and events stored in our data lake, we were able to identify an account with an extremely high number of client certificates. The check scheduling service executed a database query through an implicit model dependency for every check run job, which loaded all certificates into memory, leading to high memory consumption.
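The effect of that per-job query can be illustrated with a small simulation. This is only a sketch: the types, the `CertificateStore` class, and the host-scoped lookup are hypothetical stand-ins, not our actual schema or the fix we shipped. It compares how many certificate rows are materialized when every job loads all of an account's certificates versus when a job loads only what it needs.

```typescript
// Hypothetical shapes; the real schema is not described in this report.
interface ClientCertificate {
  id: string;
  accountId: string;
  host: string;
}

interface CheckRunJob {
  accountId: string;
  targetHost: string;
}

// In-memory stand-in for the certificates table that counts materialized rows.
class CertificateStore {
  rowsRead = 0;

  constructor(private readonly rows: ClientCertificate[]) {}

  // Mimics the implicit dependency: every certificate of the account is loaded.
  findAllForAccount(accountId: string): ClientCertificate[] {
    const result = this.rows.filter((r) => r.accountId === accountId);
    this.rowsRead += result.length;
    return result;
  }

  // Assumed alternative: a lookup scoped to the job's host and bounded by a limit.
  findForHost(accountId: string, host: string, limit: number): ClientCertificate[] {
    const result: ClientCertificate[] = [];
    for (const r of this.rows) {
      if (r.accountId === accountId && r.host === host) {
        result.push(r);
        if (result.length >= limit) break;
      }
    }
    this.rowsRead += result.length;
    return result;
  }
}

// Builds `count` certificates for one account, each bound to a distinct host.
function makeCertificates(accountId: string, count: number): ClientCertificate[] {
  const certs: ClientCertificate[] = [];
  for (let i = 0; i < count; i++) {
    certs.push({ id: `cert-${i}`, accountId, host: `host-${i}.example.com` });
  }
  return certs;
}

// Runs all jobs against a fresh store and returns the total rows materialized.
function rowsReadFor(
  certs: ClientCertificate[],
  jobs: CheckRunJob[],
  scoped: boolean,
): number {
  const store = new CertificateStore(certs);
  for (const job of jobs) {
    if (scoped) {
      store.findForHost(job.accountId, job.targetHost, 1);
    } else {
      store.findAllForAccount(job.accountId);
    }
  }
  return store.rowsRead;
}
```

With 5,000 certificates on one account and 100 check run jobs, `rowsReadFor(certs, jobs, false)` materializes 500,000 rows, while the scoped variant materializes 100. The cost of the unscoped query grows with the number of certificates multiplied by the number of jobs, which is why a single account could exhaust the scheduler's memory.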

The incident was fully resolved when we enforced a hard limit of 10 certificates per account. The limit brought the scheduling service's memory consumption back to regular levels. After fixing the memory issues, we manually scaled up the platform to handle the build-up of queued check runs. Within ~30 minutes, around 5:30 am UTC, the service returned to normal.
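A hard limit of this kind can be enforced with a small validation step before a certificate is stored. The following is a minimal sketch, not our production code: the function name, the result shape, and the reading of the size limit as 50 KiB per certificate are assumptions made for illustration.

```typescript
// Limits taken from this report; interpreting "50kb" as 50 * 1024 bytes
// per certificate is an assumption.
const MAX_CERTIFICATES_PER_ACCOUNT = 10;
const MAX_CERTIFICATE_SIZE_BYTES = 50 * 1024;

type LimitCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: decides whether an account may add one more certificate.
// `existingCount` is the number of certificates the account already has,
// `sizeBytes` is the size of the certificate being uploaded.
function checkCertificateLimits(existingCount: number, sizeBytes: number): LimitCheck {
  if (existingCount >= MAX_CERTIFICATES_PER_ACCOUNT) {
    return {
      ok: false,
      reason: `account already has ${existingCount} certificates (max ${MAX_CERTIFICATES_PER_ACCOUNT})`,
    };
  }
  if (sizeBytes > MAX_CERTIFICATE_SIZE_BYTES) {
    return {
      ok: false,
      reason: `certificate is ${sizeBytes} bytes (max ${MAX_CERTIFICATE_SIZE_BYTES})`,
    };
  }
  return { ok: true };
}
```

For example, `checkCertificateLimits(10, 2048)` is rejected because the account is already at the limit, while `checkCertificateLimits(3, 2048)` passes. Rejecting at creation time keeps the amount of data any single account can attach to a check run bounded, regardless of how the data is later loaded.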

How we move forward:

We have limited the number of certificates users can create to 10 and their size to a maximum of 50 KB, preventing the same issue from happening a second time.
We are adding tooling to investigate the heap usage of our Node.js services so we can debug memory problems much faster if scaling up does not resolve them. Our current methods proved insufficient under the pressure of an incident.
We are currently reviewing all APIs to make sure we have proper limits in place for other entities as well.
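As an illustration of what heap tooling for a Node.js service can look like, the sketch below uses the built-in `node:v8` module. `v8.getHeapStatistics()` and `v8.writeHeapSnapshot()` are real Node.js APIs; the helper names and the threshold are hypothetical and do not describe the tooling we are building.

```typescript
import * as v8 from "node:v8";

// Fraction of the V8 heap limit currently in use (0 means empty, 1 means at the limit).
function heapUsageRatio(): number {
  const stats = v8.getHeapStatistics();
  return stats.used_heap_size / stats.heap_size_limit;
}

// Hypothetical helper: writes a heap snapshot when usage reaches `threshold`
// and returns the snapshot file name, or null when usage is below the threshold.
// The snapshot writer is injectable so the logic can be exercised without
// touching the disk.
function snapshotIfAbove(
  threshold: number,
  writeSnapshot: () => string = () => v8.writeHeapSnapshot(),
): string | null {
  if (heapUsageRatio() >= threshold) {
    return writeSnapshot();
  }
  return null;
}
```

A service could call `snapshotIfAbove(0.8)` from a periodic timer and open the resulting `.heapsnapshot` file in Chrome DevTools to see which objects dominate the heap. Note that writing a snapshot is synchronous and blocks the event loop while it runs, so the threshold and call frequency need to be chosen with care.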

Posted May 03, 2024 - 17:43 UTC

Resolved

Check runs that were successfully scheduled have been backfilled. This incident has been resolved.

Posted May 02, 2024 - 05:42 UTC

Update

We are continuing to monitor for any further issues.

Posted May 02, 2024 - 05:28 UTC

Monitoring

We rolled out another fix and all metrics have returned to normal. Checks are processing fine again and check results are rolling in.

Posted May 02, 2024 - 04:41 UTC

Identified

We are still seeing some related errors and checks are not processing as expected. We have moved back to "partial outage".

Posted May 02, 2024 - 04:00 UTC

Update

Checks have been processing since ~1:11 UTC, but it will take some time for results to show up.

Posted May 02, 2024 - 02:41 UTC

Update

While catching up after the outage, we are seeing higher-than-usual scheduling delays in us-west-2, eu-south-1, ap-southeast-2 and me-south-1.

Posted May 02, 2024 - 01:36 UTC

Monitoring

A fix has been implemented and we are monitoring our infrastructure as it works through the backlog of checks.

Posted May 02, 2024 - 01:14 UTC

Identified

The failing service has been identified and we are looking into a fix.

Posted May 02, 2024 - 01:08 UTC

Investigating

We are currently investigating this issue.

Posted May 02, 2024 - 00:51 UTC

This incident affected: Browser check runtime and API Check Runtime.