Outage in main cluster

Incident Report for Bureau Works

Postmortem

Today at 5:30 UTC, a volume connected to an MQ cluster became unresponsive after an inflow of messages that created a project with an overwhelming volume of work units, surpassing hundreds of thousands. Even though these disk volumes are elastic and managed by the MQ service, i.e., they don't have a pre-established limit, the cluster became temporarily unresponsive and caused the system to disconnect from it, resulting in an outage in Bureau Works.

We immediately started a recovery procedure to normalize the system. Bureau Works came back online after ~40 minutes, with fully functional API and UI services.

We have deployed a clean MQ cluster with no loss of data, and we are implementing contingencies to prevent this event from happening again.

Posted Nov 14, 2024 - 19:59 UTC

Resolved

We experienced a temporary interruption in one of our managed services hosted in the AWS US-EAST-1 region. The issue has been resolved, and the service is fully operational.

Posted Nov 14, 2024 - 17:30 UTC