Impact: Partial unavailability of internal observability (o11y) stack
Summary: Today, January 29, between 15:19 UTC and 16:37 UTC (78 minutes), Checkly's internal observability stack was partially unavailable. During this window we lost metrics and logs for our runtime environments.
Customer Impact: After analyzing our database records, we concluded that customer checks were not affected during this period.
Root Cause: We traced the issue to a recently released version of a monitoring dependency that introduced a breaking change. The absence of version pinning in our configuration amplified the incident.
Resolution and Prevention: To resolve this issue and prevent similar occurrences, we have pinned the versions of our observability dependencies. Upgrades to these dependencies now happen only when we explicitly change a version, which gives us greater stability and control over our internal systems.
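As an illustration of what a version-pinning check can look like, here is a minimal sketch in Python. It assumes the dependencies are referenced as container images, which this report does not state; the image names and the helper functions `is_pinned` and `unpinned` are hypothetical and are not taken from Checkly's configuration.

```python
import re

# Assumption: a reference counts as "pinned" if it has an exact x.y.z tag
# or an image digest. Tags such as "latest" or "1.4" can move over time.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+$")


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference has a digest or an exact x.y.z tag."""
    if "@sha256:" in image_ref:
        return True
    _name, sep, tag = image_ref.rpartition(":")
    # No ":" at all, or the ":" belongs to a registry port (e.g. host:5000/img).
    if not sep or "/" in tag:
        return False
    return bool(EXACT_VERSION.match(tag))


def unpinned(image_refs):
    """Return the references that could silently pick up a new release."""
    return [ref for ref in image_refs if not is_pinned(ref)]


if __name__ == "__main__":
    # Hypothetical references, for illustration only.
    refs = [
        "example/collector:1.4.2",
        "example/collector:latest",
        "example/agent",
        "registry.local:5000/agent",
    ]
    for ref in unpinned(refs):
        print(f"not pinned: {ref}")
```

A check along these lines could run in CI so that a floating reference is rejected before it is deployed.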
We apologize for any inconvenience caused and are committed to continuously improving our systems for a reliable user experience.