SUMMARY
- Regions affected: EU
- Customer Impact: Service is intermittently unavailable, ability to trade impacted
24/02/2023 - 11:30
The RCA document has been completed and can be found attached to this article.
The root cause is shared below in the Root Cause Analysis Report.
15/02/2023 - 07:27
Following the mitigation of the issue yesterday afternoon, the environments were closely monitored overnight to ensure the issue did not recur.
A root cause analysis report is being prepared and will be made available here within the next 5 business days.
14/02/2023 - 15:07
We have identified an issue where resources responsible for returning Live responses for FHR data were failing. In some cases this affected the responses returned for FHR; in others it resulted in slow requests and increased error rates.
The issue has now been mitigated and service availability is expected to have returned to normal for any affected customers.
If you are still experiencing performance disruption and have not already logged a ticket with the support team, please log a ticket with us so that we can investigate further.
14/02/2023 - 14:03
We are currently investigating an issue affecting multiple customers in the EU region, resulting in the Fredhopper service being intermittently unavailable. This issue may impact site performance and availability.
Thank you for your patience while we investigate this issue.
More information to follow.
ROOT CAUSE ANALYSIS REPORT
On 2023-02-14 at 13:23 UTC, the Attraqt Cloud team received multiple uptime alerts across numerous accounts in the EU region, caused by elevated error rates and increased response times of the Query API. Some of the dedicated customer service instances had insufficient resource capacity to handle the incoming traffic. Mitigation measures were taken to restore the capacity. In some cases, this restoration was delayed by application-level state synchronization issues.
The trigger for the event was a periodic resource refresh, part of the ongoing infrastructure optimization work that simplified the management of the underlying resources.
The root cause was later identified as unexpected behavior of the configuration management system's built-in auto-healing function, which decommissioned otherwise healthy resources. The automated recovery also failed, and the resulting impact varied between customers.
A list of improvement actions has been put in place to prevent a recurrence.
Improvement Options and Action Items
- Disable built-in automated environment cleanup functions (replace with new mechanism).
- Instance refresh improvements:
  - Implement gradual execution of changes (split per region and customer).
  - Prevent interference with a full reindex or other relevant customer jobs.
- Create a feature to facilitate faster state recovery for environments without the requirement of reindexing.
- Improve testing of environment changes/optimizations before production deployments.
- Improve the mechanism of quality assurance for all applied environment changes.
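As an illustration only, the two instance refresh improvements above could be sketched roughly as follows. This is not Attraqt's implementation: the instance tuple shape and the `refresh`, `is_healthy` and `is_busy` callbacks are hypothetical names introduced for the example. The sketch splits a refresh into one batch per region and customer, skips customers that have a reindex or similar job running, and halts at the first batch that does not come back healthy.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical shape for illustration: (region, customer, instance_id)
Instance = Tuple[str, str, str]
BatchKey = Tuple[str, str]


def plan_batches(instances: Iterable[Instance]) -> List[List[Instance]]:
    """Group instances so that each batch touches one region and one customer."""
    groups: Dict[BatchKey, List[Instance]] = defaultdict(list)
    for inst in instances:
        region, customer, _ = inst
        groups[(region, customer)].append(inst)
    # Sorted keys give a deterministic, region-by-region order.
    return [groups[key] for key in sorted(groups)]


def gradual_refresh(
    instances: Iterable[Instance],
    refresh: Callable[[Instance], None],
    is_healthy: Callable[[Instance], bool],
    is_busy: Callable[[str], bool],
) -> Dict[str, object]:
    """Refresh one batch at a time and stop at the first unhealthy batch."""
    done: List[BatchKey] = []
    skipped: List[BatchKey] = []
    halted_at: Optional[BatchKey] = None
    for batch in plan_batches(instances):
        region, customer, _ = batch[0]
        if is_busy(customer):
            # Assumed check: customer has a full reindex or similar job running.
            skipped.append((region, customer))
            continue
        for inst in batch:
            refresh(inst)
        if not all(is_healthy(inst) for inst in batch):
            # Halt so that a bad change cannot spread to later batches.
            halted_at = (region, customer)
            break
        done.append((region, customer))
    return {"done": done, "skipped": skipped, "halted_at": halted_at}
```

Halting on the first unhealthy batch limits the impact of an unexpected change to a single region and customer, and skipped customers can be retried once their jobs have finished.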
FOR MORE INFORMATION
All currently available information is included in this article. We will continue to provide updates on the issue here as we work to resolve the incident.
If you have logged a ticket with us we will provide the same information there as soon as possible.
The report of our root cause analysis investigation is usually posted here a few days after the incident has been resolved. If you have additional questions about this incident, please log a ticket with us.