
Now that we know what can potentially go wrong (and it will go wrong at some point), what can be done to prevent it from affecting customer experience and, ultimately, sales and business confidence?
A good monitoring foundation and accurate metric thresholds need to be in place. By monitoring the critical metrics, many potential bottlenecks and issues can be identified before they become problems. For example, database usage, CPU/memory utilisation, disk/network performance, endpoint latency and application logs can all surface the early warnings of a cascading effect that would eventually cause downtime.
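As a minimal illustration of threshold-based monitoring (not tied to any particular monitoring tool; the metric names and limits below are hypothetical), the core idea is simply comparing sampled metrics against predefined limits:

```python
# Sketch of metric-threshold checking. Metric names and limits are
# illustrative placeholders, not taken from a specific monitoring stack.
THRESHOLDS = {
    "cpu_percent": 80.0,      # alert when CPU utilisation exceeds 80%
    "memory_percent": 85.0,   # alert when memory utilisation exceeds 85%
    "disk_iops": 3000.0,      # provisioned IOPS ceiling
    "p99_latency_ms": 500.0,  # endpoint latency budget
}

def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their threshold."""
    return [name for name, value in metrics.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]]

sample = {"cpu_percent": 91.2, "memory_percent": 60.0,
          "disk_iops": 2400.0, "p99_latency_ms": 750.0}
print(breached(sample))  # CPU and p99 latency are over budget here
```

In practice a monitoring platform evaluates these thresholds continuously and fires alerts, but the comparison at the heart of it is no more complicated than this.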
A good architecture design and decoupling of services are needed to decrease the blast radius of an outage, whether it stems from infrastructure or application-related issues.
Utilising multiple availability zones or data centres for application hosting, with automatic failover, decreases the risk of an infrastructure outage affecting availability. Converting services to microservices orchestrated with Kubernetes can further increase service availability.
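One way to picture automatic failover, independent of any particular provider, is a client that health-checks an ordered list of endpoints and routes to the first healthy one. The endpoint names below are made-up placeholders:

```python
# Sketch of failover across availability zones: route to the first
# endpoint that passes a health check. Endpoint names are illustrative.
from typing import Callable

ENDPOINTS = [
    "app.az-a.example.internal",
    "app.az-b.example.internal",
    "app.az-c.example.internal",
]

def pick_endpoint(endpoints: list[str],
                  is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint that passes its health check."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("no healthy endpoint available")

# Simulate az-a being down: traffic fails over to az-b.
down = {"app.az-a.example.internal"}
chosen = pick_endpoint(ENDPOINTS, lambda ep: ep not in down)
print(chosen)
```

Managed load balancers and DNS failover do exactly this on your behalf, which is why hosting in a single zone forfeits the benefit.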
Partnering with a cloud provider like AWS solves several of the potential issues that colocated or on-premises infrastructure cannot handle, or cannot handle easily.
One example is the ability to rapidly scale resources up or down, either manually or through AWS autoscaling triggers that execute on predefined metric thresholds. This can provide a seamless and predictable experience for your customers.
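The core of a metric-driven scaling trigger can be sketched with the proportional rule the Kubernetes Horizontal Pod Autoscaler uses (desired = ceil(current × observed ÷ target)); the numbers below are illustrative:

```python
# Sketch of proportional autoscaling, in the style of the Kubernetes
# HPA formula: desired = ceil(current * observed / target).
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_r: int = 2, max_r: int = 50) -> int:
    """Scale replica count proportionally to the observed/target metric
    ratio, clamped to a configured minimum and maximum."""
    desired = math.ceil(current * observed / target)
    return max(min_r, min(max_r, desired))

# 4 replicas running at 90% CPU against a 60% target -> scale out to 6.
print(desired_replicas(current=4, observed=90.0, target=60.0))  # 6
```

The same rule scales back down when traffic subsides, which is what keeps the experience predictable without paying for peak capacity all year.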
Synthesis Managed Services provides this to our retail and e-commerce partners: a team of DevOps and systems engineers actively monitors and tests the environment during the day and reacts within seconds to resolve problems as they arise.
Utilising these methodologies and best practices, we have been able to provide our partners with a predictable business outcome during Black Friday, knowing that no matter how many customers sign up or buy toasters, there will be no impact on business as usual.
Usually, the first reaction from the support team is that it's the developers' fault, and from the developers, that it's the infrastructure. With good monitoring in place, finding the root cause of a problem is much faster, and that eliminates the blame game.
We have seen issues ranging from hitting disk IOPS constraints and API limits to CPU load and even memory leaks in a service.
Usually, around 11pm on the Thursday, users start logging in and refreshing your site continuously. Some have become quite crafty, using web scraping tools to poll for deals and keywords, which can lead to hundreds if not thousands of web requests per second.
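A common defence against this kind of scraping traffic is per-client rate limiting, for example with a token bucket. This is a generic sketch (the rate and burst numbers are illustrative, not a recommendation):

```python
# Token-bucket rate limiter sketch: each client gets `rate` requests
# per second, with bursts up to `capacity`. Numbers are illustrative.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client (e.g. per IP): a scraper bursting 20 requests
# is cut off once its burst allowance of 5 is spent.
bucket = TokenBucket(rate=2.0, capacity=5.0)
allowed = sum(bucket.allow() for _ in range(20))
print(allowed)
```

In production this sits at the edge (load balancer, WAF or API gateway) rather than in application code, but the mechanism is the same.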
The teams need to be on standby and ready to deal with any issue that might arise during the day.
After the day has come and gone, take a moment to go through all the issues experienced and come up with remediation plans. Usually the first Black Friday is the most error-prone, but by taking the learnings from the day and implementing fixes for them, the next Black Friday will go a lot more smoothly.
Side note: Do not panic. With good preparation, a good team can mitigate most, if not all, interruptions to business, and when an issue pops up, they will be able to deal with it quickly.