Increased scheduling delay across multiple regions
Incident Report for Checkly
Postmortem

Timeline:

  • 11:32 UTC: Database migration caused locks on critical tables, leading to temporary API unavailability.

  • 11:39 UTC: Database locks were removed, and check scheduling began to recover.

Impact:

  • Our public API was partially unavailable.

  • Checks were not scheduled while the locks were in effect.

  • Delayed checks were eventually processed across all regions.

  • Some high-frequency checks (intervals ≤ 5 minutes) may have skipped a check run during the incident window.
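
To see which check intervals had a scheduled slot inside the incident window, the arithmetic can be sketched as below. This is a rough illustration only: the report does not describe how runs are aligned, so the sketch assumes slots fall on multiples of the interval counted from midnight UTC, and `slots_in_window` is a hypothetical helper. A slot inside the window could have been delayed or skipped; it does not mean the run was lost.

```python
from datetime import datetime, timedelta, timezone


def slots_in_window(interval_minutes, window_start, window_end):
    """List scheduled slots in [window_start, window_end).

    Assumption (not from the report): runs are aligned to multiples of the
    interval counted from midnight UTC.
    """
    interval = timedelta(minutes=interval_minutes)
    midnight = window_start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Ceiling division: index of the first slot at or after window_start.
    first_index = -(-(window_start - midnight) // interval)
    slot = midnight + first_index * interval
    slots = []
    while slot < window_end:
        slots.append(slot)
        slot += interval
    return slots


# Incident window taken from the timeline: 11:32 to 11:39 UTC on Nov 11, 2024.
START = datetime(2024, 11, 11, 11, 32, tzinfo=timezone.utc)
END = datetime(2024, 11, 11, 11, 39, tzinfo=timezone.utc)

for minutes in (1, 5, 10):
    print(minutes, len(slots_in_window(minutes, START, END)))
```

Under this alignment assumption, a 1-minute check has seven slots in the window, a 5-minute check has one (11:35 UTC), and a 10-minute check has none.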

Root Cause:

The issue stemmed from a database migration that inadvertently locked critical tables, impacting API and scheduling functionality.

Resolution:

Once the database locks were removed, system recovery began immediately, with check processing resuming across all regions.

Next Steps:

We are reviewing our database migration procedures to minimize locking risks, and we are implementing additional safeguards to keep the API and check scheduling resilient during similar operations.
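
One common safeguard of this kind is to cap how long a migration step may wait for a table lock, and to retry with backoff instead of holding up other traffic. The sketch below is illustrative only and is not our actual tooling: the report does not name the database or migration framework, and `run_step`, `LockTimeout`, and all parameter values are hypothetical. In PostgreSQL, for instance, the cap would correspond to the `lock_timeout` setting.

```python
import time


class LockTimeout(Exception):
    """Raised by the hypothetical run_step when a lock is not acquired in time."""


def run_with_lock_timeout(run_step, lock_timeout_ms=2000, max_attempts=5,
                          base_delay_s=1.0, sleep=time.sleep):
    """Run one migration step, giving up on the lock quickly and retrying.

    run_step(lock_timeout_ms) is a hypothetical callable that applies the
    step with the given lock timeout and raises LockTimeout on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_step(lock_timeout_ms)
        except LockTimeout:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the operator.
            # Exponential backoff: 1s, 2s, 4s, ... with the default base delay.
            sleep(base_delay_s * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_step(timeout_ms):
        # Simulated step: the lock is busy on the first two attempts.
        calls["n"] += 1
        if calls["n"] < 3:
            raise LockTimeout()
        return "applied"

    print(run_with_lock_timeout(flaky_step, sleep=lambda s: None))
```

The design choice here is to fail fast: a step that cannot get its lock within the timeout backs off and tries again later, so API and scheduling queries are not left waiting behind it.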

Feel free to reach out to support if you have any questions. We apologize for the inconvenience and appreciate your understanding as we work to improve our systems.

Posted Nov 11, 2024 - 12:43 UTC

Resolved
This incident has been resolved.
Posted Nov 11, 2024 - 12:06 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 11, 2024 - 11:48 UTC
Investigating
We are currently investigating this issue.
Posted Nov 11, 2024 - 11:42 UTC
This incident affected: Browser check runtime, API Check Runtime, and Multistep check runtime.