Elevated error rate for check scheduling
Incident Report for Checkly
Postmortem

Impact: Partial unavailability of internal observability (o11y) stack

Summary: On January 29, 2024, between 15:19 UTC and 16:37 UTC, Checkly experienced a partial outage of its internal observability stack. As a result, metrics and logs for our runtime environments were lost during this window.

Customer Impact: After analyzing our database records, we concluded that customer checks were unaffected during this period.

Root Cause: The issue was traced to a recently released version of a monitoring dependency that introduced a breaking change. The impact was amplified because our configuration did not pin the version of this dependency.

Resolution and Prevention: To resolve this issue and prevent similar occurrences, we have implemented version pinning for our observability dependencies. This gives us greater stability and control over our internal systems.
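
As an illustration of what version pinning guards against, the sketch below flags dependency entries that do not specify an exact version. It is not Checkly's actual tooling: the pip-style `name==version` syntax, the `find_unpinned` helper, and the dependency names are all assumptions made for the example.

```python
import re

# Matches an exact pin such as "name==1.2.3".
# Pip-style syntax is assumed here purely for illustration.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-+!]+$")


def find_unpinned(requirements: str) -> list[str]:
    """Return requirement lines that do not specify an exact version."""
    unpinned = []
    for raw in requirements.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical dependency names, not the ones involved in the incident.
EXAMPLE = """
metrics-agent==2.4.1
log-shipper>=1.0
trace-exporter
"""

if __name__ == "__main__":
    # Entries without an exact pin can pick up a new release automatically.
    print(find_unpinned(EXAMPLE))
```

A check like this could run in CI so that an unpinned entry is caught before a new upstream release is picked up automatically.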

We apologize for any inconvenience caused and are committed to continuously improving our systems for a reliable user experience.

Posted Jan 29, 2024 - 18:09 UTC

Resolved
This incident is resolved.
Posted Jan 29, 2024 - 16:41 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 29, 2024 - 16:30 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2024 - 16:19 UTC
Investigating
We are currently investigating this issue.
Posted Jan 29, 2024 - 15:50 UTC
This incident affected: Check Scheduler and Browser check runtime.