Today at 5:30 UTC, a volume connected to an MQ cluster became unresponsive after an influx of messages that created a project with an overwhelming number of work units, exceeding hundreds of thousands. Even though these disk volumes are elastic and managed by the MQ service, i.e., they don't have a pre-established limit, the cluster became temporarily unresponsive and caused the system to disconnect from it, resulting in an outage in Bureau Works.
We immediately started a recovery procedure to normalize the system. Bureau Works came back online after ~40 minutes, with fully functional API and UI services.
We have deployed a clean MQ cluster with no loss of data, and we are implementing contingencies to prevent this event from happening again.