XO Search API outage - 7th March - Resolved – Rezolve Ai Support Portal

SUMMARY

Date of incident: March 7th, 2023
Customer Impact: No items returned by the XO Search API

Thursday 9th March

The RCA document is attached to this article.

The root cause and the improvement actions have been added to the sections below.

Tuesday 7th March 5:13 PM CET

The issue has been resolved by the Cloud team.

The baseline capacity for XO Search API was increased.

The team will now work on improving the scaling policy to avoid similar situations in the future.

Tuesday 7th March 4:11 PM CET

Attraqt has identified a new outage in the XO Search API.

The Cloud team is still working to mitigate it.

We will add an update here as soon as we have more information.

Tuesday 7th March 1:12 PM CET

Attraqt experienced an outage in the XO Search API because demand exceeded the available capacity. Autoscaling was triggered, but it needed a few minutes to add capacity. During this time, we returned a large number of HTTP 429 errors to our customers and were unable to answer incoming queries.

The outage lasted for 12 minutes.
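For client integrations, a burst of 429 responses like the one described above is usually best handled by retrying with exponential backoff rather than failing immediately or retrying in a tight loop. The following is a minimal sketch, not an official client: `call_with_backoff`, the `send` callable and its `(status, headers, body)` return shape are hypothetical, and whether the XO Search API sends a `Retry-After` header is an assumption.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (HTTP status, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while it returns HTTP 429.

    The wait doubles after each attempt (capped at max_delay) and is
    jittered so that many clients do not retry in lockstep. A numeric
    Retry-After header, if present (an assumption), sets a minimum wait.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    response = send()
    for attempt in range(1, max_attempts):
        status, headers, _body = response
        if status != 429:
            return response

        # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        # Jitter: pick a wait between half the delay and the full delay.
        delay = random.uniform(delay / 2, delay)

        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = max(delay, float(retry_after))

        sleep(delay)
        response = send()

    # Either a non-429 response or the last 429 once attempts are exhausted.
    return response
```

The caller still needs to check the returned status, because the last 429 is returned unchanged once `max_attempts` is exhausted.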

ROOT CAUSE ANALYSIS REPORT

On 7th March 2023 at around 12:12 UTC there was an unexpected and significant spike in queries to our XO Search API. For roughly 12 minutes the XO Search API returned error responses while the auto-scaling initiated and added the capacity needed to resolve the issue automatically.
At 15:10 UTC a second, much larger spike in queries was received, and the XO Search API once again returned errors. The Attraqt team investigated further and identified the cause of the spike as rogue API calls sending a volume of queries to the XO Search API an order of magnitude higher than even peak sale traffic. To mitigate the issue, the origin of the rogue API calls was manually blocked, which allowed the service to begin recovering. We observed error rates fall from roughly 100% to 30% between 15:30 UTC and 15:55 UTC.
From 16:00 UTC onwards regular service operations resumed.
Because the XO product is a multi-tenant platform, minor or expected traffic spikes from tenants (such as during peak sales periods) are handled by the auto-scaling implementation, which ensures there is sufficient capacity for the service to function normally for all customers. However, when a sudden, large and unexpected spike in queries is received, the auto-scaling cannot scale up fast enough, resulting in disruption to all tenants.

Improvement Options and Action Items

Attraqt to improve the auto-scaling policy for the XO service in order to scale up faster in instances where a large volume of queries comes in unexpectedly. This work is scheduled and expected to be completed before the end of March.
Attraqt increased the baseline capacity of the XO service.
Attraqt to investigate ways in which we can better insulate tenants from service disruption originating outside their own instance. Because this investigation is at an early stage, we are not currently able to share more details.
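The report does not say how tenant insulation will be achieved. Purely as an illustration of the general idea, one common technique on multi-tenant platforms is a per-tenant token bucket, which caps each tenant's query rate so that one tenant's spike is rejected early instead of consuming shared capacity. The sketch below is generic and is not Attraqt's design: `TokenBucket`, `TenantLimiter`, and the rate and burst values are hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Holds up to `burst` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so normal traffic is unaffected
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """Keeps one independent bucket per tenant (hypothetical sketch)."""

    def __init__(
        self,
        rate: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        """Return True if this tenant's request may proceed."""
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, now)
            self._buckets[tenant_id] = bucket
        return bucket.allow(now)
```

A request for which `allow` returns False would typically be answered with a 429 for that tenant only, while other tenants keep their own full buckets.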

FOR MORE INFORMATION

All currently available information is included in this article. We will continue to provide updates on the issue here as we work to resolve the incident.

If you have logged a ticket with us, we will provide the same information there as soon as possible.

The report of our root cause analysis investigation is usually posted here a few days after the incident has been resolved. If you have additional questions about this incident, please log a ticket with us.