On May 1st at 11:15 pm UTC, memory usage in our check scheduling service began to build up. The buildup was caused by a user creating an extremely high number of client certificates for their checks. Over time, this caused scheduler instances to crash, until almost no checks could be scheduled between ~11 pm and 1 am UTC.
After an initial investigation, the on-call engineer decided to declare an incident and scale up the scheduling service to handle the high memory consumption.
This partially restored service, but it was not enough for a full recovery. Memory usage was still climbing rapidly and causing crashes, which meant some check runs were lost. Check scheduling remained significantly impaired.
After investigating logs, metrics, and events stored in our data lake, we identified an account with an extremely high number of client certificates. For every check run job, the check scheduling service executed a database query through an implicit model dependency, which loaded all certificates into memory and drove the high memory consumption.
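To make the failure mode concrete, here is a minimal sketch of the pattern, not our actual code: all type and function names are hypothetical, and the ORM query is replaced with a stand-in. The key point is that the certificate list is hydrated as a side effect of loading the account model on every job, whether or not the check uses a client certificate.

```typescript
// Hypothetical sketch of the problematic implicit dependency (names invented).

interface ClientCertificate {
  id: string
  pem: string // certificate body, potentially several KB each
}

interface Account {
  id: string
  certificates: ClientCertificate[] // implicitly loaded with the account
}

interface CheckRunJob {
  checkId: string
  accountId: string
}

// Stand-in for the database query executed per check run job. In the real
// service this went through an ORM relation that hydrated every certificate
// row belonging to the account.
async function loadAccountWithCertificates(accountId: string): Promise<Account> {
  return { id: accountId, certificates: await loadAllCertificates(accountId) }
}

async function loadAllCertificates(accountId: string): Promise<ClientCertificate[]> {
  // With an extreme number of certificates on a single account, this array
  // dominated the scheduler's heap.
  return [] // placeholder for the actual query result
}

async function scheduleCheckRun(job: CheckRunJob): Promise<void> {
  // Every scheduled job paid the full certificate-loading cost, even when the
  // check did not use a client certificate at all.
  const account = await loadAccountWithCertificates(job.accountId)
  // ...build and enqueue the check run using `account`...
  void account
}
```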
The incident was fully resolved once we enforced a hard limit of 10 certificates per account, which brought the scheduling service's memory consumption back to regular levels. After fixing the memory issue, we manually scaled up the platform to work through the backlog of queued check runs, and within ~30 minutes the service returned to normal, around 5:30 am UTC.
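For illustration, a cap like this can be enforced at certificate creation time. The sketch below is an assumption about where such a check could live, with a hypothetical repository interface; it is not our actual implementation.

```typescript
// Hypothetical sketch of enforcing a per-account certificate limit.

const MAX_CERTIFICATES_PER_ACCOUNT = 10

class CertificateLimitError extends Error {}

interface CertificateRepo {
  countForAccount(accountId: string): Promise<number>
  insert(accountId: string, pem: string): Promise<void>
}

async function createClientCertificate(
  repo: CertificateRepo,
  accountId: string,
  pem: string,
): Promise<void> {
  // Reject the request before writing anything if the account is at the cap.
  const existing = await repo.countForAccount(accountId)
  if (existing >= MAX_CERTIFICATES_PER_ACCOUNT) {
    throw new CertificateLimitError(
      `Account ${accountId} already has ${existing} certificates (limit ${MAX_CERTIFICATES_PER_ACCOUNT})`,
    )
  }
  await repo.insert(accountId, pem)
}
```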
How we move forward: