Issue Summary:
On March 2, increased latency was observed across services in the EU region. During the incident window, users experienced slower page loads while navigating the platform, and certain API requests took longer to complete. Service availability remained intact, but overall performance was degraded until mitigation was applied.
Root Cause:
The incident was caused by an abnormally high number of open database connections. These excess connections increased CPU and memory consumption in a dependent service, exhausting its available resources and degrading its performance. The degradation propagated latency to services across the EU region.
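For context, connection growth of this kind is commonly bounded by capping the client-side connection pool, so that a traffic spike queues for a free connection instead of opening an unbounded number against the database. The sketch below is illustrative only, assuming a Python service using SQLAlchemy; the pool sizes and connection string are hypothetical, not values from this incident.

    from sqlalchemy import create_engine, text

    # Hypothetical DSN; not the actual service configuration.
    DATABASE_URL = "postgresql://app:secret@db.internal:5432/app"

    # Cap the pool so burst traffic waits for a free connection rather
    # than opening an unbounded number of them against the database.
    engine = create_engine(
        DATABASE_URL,
        pool_size=10,        # steady-state connections held open
        max_overflow=5,      # extra connections allowed under burst load
        pool_timeout=30,     # seconds to wait for a free connection
        pool_recycle=1800,   # recycle connections to avoid stale sockets
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))  # simple health-check query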
Corrective Action:
Resource allocation (CPU and memory) for the impacted service pods was increased to stabilize the workload and reduce latency. Following this adjustment, monitoring confirmed that system performance returned to normal levels.
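As an illustration of the kind of adjustment applied, the sketch below patches a deployment's CPU and memory requests and limits using the official Kubernetes Python client. The deployment name, namespace, and resource values are hypothetical, and the example assumes the service runs on Kubernetes, which the report implies by referring to pods.

    from kubernetes import client, config

    # Assumes kubeconfig-based access; names and values are hypothetical.
    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "impacted-service",  # hypothetical container name
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits":   {"cpu": "1",    "memory": "2Gi"},
                        },
                    }]
                }
            }
        }
    }

    # Strategic-merge patch; triggers a rolling restart of the pods.
    apps.patch_namespaced_deployment(
        name="impacted-service", namespace="eu-prod", body=patch
    )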
Preventive Measures:
To reduce the risk of recurrence, we are taking the following actions: