Yesterday one of the disks in the array went offline and failed. During the rebuild process another disk failed, which makes it more likely that the RAID card itself has failed.

Investigation is underway. Once the state of things is known, a decision will be made on the next steps. If recovery is going to take too long, a new RAID card and disks will be fitted and the base VPS will be restored, leaving customers to restore from their own backups.

Sorry for the inconvenience.

Update #1 - Full hardware checks are underway. Sadly the supplier in the DC is responding very slowly, which is impeding progress.

Update #2 - The RAID array has failed and some data corruption has occurred. If this cannot be resolved within 2 hours from now (17:45 GMT), new drives and a new card will be fitted and the server will be reinstalled. Sadly this would mean that customers would need to restore their own data.

Update #3 - It is looking promising that data recovery can be achieved, which is obviously a more convenient option. Hardware options are being investigated now and the window for full recovery starting has been pushed out to 22:00 GMT.

Update #4 - Thank you for your continued patience. A test recovery has been performed and it was successful. Sadly the RAID card is at fault and a replacement is not immediately available. At this stage the more viable option looks to be provisioning all new hardware and removing this server from service, as it has had poor uptime in comparison to others, with a number of faults over the past 24 months.

The good news is that this means an upgrade in terms of server performance. However, it will take time to get the new hardware in place and then move the data. A delivery estimate is pending from the DC and the next update will contain some timescales.

Update #5 - Due to certain decisions needing to be made and some staff simply not being available out of hours, the hardware is not likely to be available until Monday. Sorry for the inconvenience; hardware failures of this nature cannot be predicted. There is limited space on other UK nodes. If you are happy to have your VPS simply recreated on another node, please open a ticket for this. It will be without data (you need to restore the data yourself) and will be on a limited, first come, first served basis.

The next update will be in the morning.

Update #6 - After a lot of work it has been possible to get UK Node 11 online. At least 95% of VPSs seem to be booting without issue. The reason for the minority not booting is not clear yet; it could be due to user error when upgrading Ubuntu/Debian based servers, as this is common with Xen PV.

The array is running in a degraded state after some replacements were possible, so all customers are urged to take backups of their data.

This does change plans going forward, which are being assessed, and the next update will contain the definitive plan of action. If your VPS is still down, please open a ticket so it can be investigated on a case by case basis.


Update #7 - The following email has been sent to all affected users:

Good morning,

As some of you are painfully aware, UK Node 11 suffered a RAID failure in the early hours of Sunday morning. While in the final analysis only a limited amount of user data was impacted, it has caused lengthy downtime for some users, which is less than ideal. The same problem hit around 13 months ago, also on a UK node with an identical RAID card and drive setup.

During the extended downtime a lot of options were looked into: simply replacing the RAID card and disks and leaving people to restore their own data, or extending the downtime further to attempt data recovery for customers. While this may be of little comfort to the few users whose data could not be recovered, the net result was around a 95% recovery.

The concern is that the rest of the UK hardware is essentially the same, and a number of people in the industry who were consulted have experienced exactly the same thing with this hardware. The general impression is that it simply does not provide adequate fault tolerance. Failed disks are common, but they are usually replaced fast and with little or no impact on users. This, however, is now the third chassis/config of this type that has simply been unable to sustain a single drive failure and rebuild process.

With the above in mind, although it is a little premature in the life of the hardware, it has been decided that the risk is simply too great to continue using the current servers in production, and as such replacement hardware is being racked and tested now.

What this means for you is the following:

  • Faster Processors – the new servers will be using E5-2650v2 processors.
  • Faster Storage – the new servers will be using 8 disk arrays instead of 4.
  • Faster Network – the new servers will be on Gbit ports instead of 100 Mbit.
  • Further Migrations – Throughout October, starting as soon as Thursday 9th, all virtual servers will be migrated to the new hardware. Your full disk image will be moved, including all data.
  • IP Changes – Due to technical complications it will be necessary to change IPs. Both IPv4 and IPv6 will be affected, and as much advance notice of your new IP as possible will be given.

This upgrade will result in a faster, more stable and more reliable service long term. While service disruption is never good, this is absolutely necessary and is not being taken lightly. To be clear, this affects all UK Xen PV and Xen HVM users. Details on how to get your new IPv4 address will follow the pre-migration work. IPv6 will change slightly, and each user will be allocated a /112 instead of individual IPs.

Thank you for your continued patience and understanding.

Anthony Smith.

Update #8 - This node is still considered unstable. Some issues have been found that prevent networking on a number of servers, and the node is rebooting now.

Update #9 - Due to having to move some servers around to keep people online, the load on Node 15 has increased. Sorry for any inconvenience; migrations to new hardware should be starting within 24 hours.



Saturday, October 4, 2014
