The issues impacting KVM Node 7 are known and investigation is in progress.

According to the console, the swap partition on the host node went read-only, which locked up the host node.

At this stage it is not known whether this is a result of the updates or a disk issue; initial checks on the disks indicate no problems.

For now the swap partition has been turned off, as there is enough RAM in the system that it is not required at all.

It will be monitored over the next 24 hours.


update: 11:15 - 22nd May 2019 - The server has failed again. Efforts are being made to resolve this, however it is now seeming more like a hardware issue, so migrations will start today to ensure continuity of service.

update: 11:30 - 22nd May 2019 - It is now clear that there is a failure in the RAID array; a recovery plan is being put together now.

update: 12:05 - 22nd May 2019 - It has been discovered that the issues actually started around 10:30 pm last night. However, because the host node OS appeared to be running issue free, the backup started at midnight as normal and essentially wrote corrupt backups to the separate backup array in the server, so only some customer disk images are recoverable from the backup array.

It may be possible to bring the NVMe RAID array back up in read-only mode, in which case full data recovery will be possible; the next few hours will be spent on that.

update: 13:30 - 22nd May 2019 - The array can now be seen by the OS in recovery mode and some data is available. Recovery steps are ongoing and this work will continue until 16:00; if it does not look likely that recovery is possible by that time, a reinstall will be done and fresh servers will be provided.

update 14:25 - 22nd May 2019 - It has only been possible to recover around 60% of customer disk images, and it is not possible to say what state they are in. It appears that the controller card on the drive is failing under load, which is assumed to be temperature related as it is repeatable: as soon as it goes beyond 37C the problems start. It also seems that the RAID array has been reallocating and then replicating bad data.

It has not been possible to confirm that what has been recovered is viable, so a reinstall is recommended regardless. The downside is that the backups ran as expected and backed up the bad data, so customers will need to recover from their own backups.

As a result of this, the backup system will be changed and upgraded at Inception Hosting's expense to keep more than one day of backups at a time where required.

Spare disks are being sent now from a local DC. The OS-level backup is intact, so a restore will be done when they arrive in a few hours, and the aim is to have everyone's servers available for them to start their restores by 10pm. Updates will follow as a more accurate time becomes available.


 

update 20:45 - 22nd May 2019 - The full system has been reinstalled and what was available to be recovered has been restored. All disk images were recovered, however due to the issues it is possible there may be some corruption. All virtual servers have been left shut down; please follow these instructions (if your VPS is already running you can ignore the rest of the steps, please check first):



1. Click Reboot in SolusVM.

 

2. If your server boots, then your data was recovered and you are good to go. If it does not boot, or you get a message about no disk being found, go to step 3.

 

3. If you are at this stage you will need to restore from your own backups or run your own recovery process. Please click Reinstall and select your OS of choice; this will recreate your disk image. Even if you want to install from ISO you must do this first.

 



Tuesday, May 21, 2019
