Affino Social Commerce Blog

Apologies for a tough week in the Cloud

20 March 2010 14:21

Markus Karlsson

challenge, hosting, reliability, service, shared

The Comrz team would like to apologise for what’s been a very turbulent week on our shared hosting service.

We use two cloud providers, ACC Yotta and Amazon: Yotta for the high performance cloud service, and Amazon for the entry and mid level services. After the initial teething pains in the early days of Comrz, both services have become increasingly stable, reliable and effective. That was until Friday last week, when things started to go badly wrong with the Amazon setup.

The fact that this coincided with the release of the latest version of Affino (5.5.15) led us to think there must be something wrong with the release. As a result, we’ve invested hundreds of man-hours over the past week in trying to identify what went wrong with Affino, and why it had shown no performance issues on any of our QA servers over the previous three weeks.

To cut a long story short, finally on Friday we ruled out every single possibility that the issue involved Affino, which meant that there had to be a problem specific to the hosting instances which Amazon provide us.

The greatest benefit of being on the cloud is that you don’t have to think about the underlying hardware or infrastructure; in theory, things just work. The problem comes when things start to fail, especially when that failure is both random and very gradual, as it was in this case. Eventually our main hosting setup simply stopped responding, and nothing we did on the Amazon console would get it responsive again. The lack of effective diagnostics and the very vague support from Amazon did little to help.

Normally, instance failures on Amazon would not be a serious issue, since we keep all the server profiles and data on the Amazon storage cloud, i.e. independent of the application servers. A failure would usually mean 10 to 15 minutes of downtime while we create new instances and restore the configurations. Not this time.

Amazon had discontinued the application server we had been using without our knowing it. That is not to say that Amazon hasn’t announced it somewhere, just not in a way that we as customers and active users would see. It meant that we had to learn the new generation of the OS and how to deal with all the foibles of setting it up. Unfortunately there were a significant number of these.

It meant we had to do a complete rebuild of our hosting setup, which took us six and a half hours and was completed at 23:30 GMT.

Suffice it to say, we’ve learnt a lot from this exercise and are taking a number of steps to ensure that we don’t have a repeat of this, ever. A key part of this will be a monthly exercise in which we run through the entire restoration process to ensure that everything we’ve put in place will work. It’s simply too risky not to do so, since it’s clear that we will otherwise be caught out by the continuously evolving cloud services.

We have taken on board your criticism that we need more active communication, and not simply status alerts. We are rolling out a new notification system, which will be up next week and will let us communicate much more effectively with clients about their specific services.

Most importantly, we have taken, and will continue to take, significant steps to ensure that we don’t have a repeat of the events of this past week. We’re now much better able to identify cloud issues, which will allow us to act much faster should a similar situation arise.
