18th April 2019

 

The performance issues on UK STOR2 are known. However, tests have shown that when either uploading from or writing files to the server from another source (i.e. buffered writes and cached reads), throughput has never fallen much below 100 MB/s, i.e. the link speed. I am therefore happy that the server is performing acceptably for the intended purpose of its attached products. For that reason no action will be taken over the holiday weekend, so that people's backups are not affected during a period when they may not be available to react should any issues occur with a reboot.

A rebootless solution will continue to be investigated over the holiday weekend. If none can be found, a scheduled maintenance reboot will be booked for Tuesday.

Currently it looks like there may be a memory leak in one of the qemu packages, which is slowly forcing some VMs to run in swap some of the time and has an overall hit on performance. This is just a theory right now, however.

Sadly, a large number of users seem to be repeatedly running synthetic benchmarks to see if the issue is still present, which in itself is responsible for much of the issue. It is requested that you remain patient for a resolution and do not run any further benchmarks at this time.

Updates will follow on Tuesday 23rd April.


Update 19:05 19th April - Sadly the issue has slowly worsened through the day, so some disruptive actions need to be taken immediately. The load on the host node has shot up over 120 in the past 30 minutes. Updates will follow.

Update 19:34 19th April - Even with no guests/VPS running, the server became gradually less responsive until it simply went offline. A remote power cycle has been triggered. Updates to follow.

Update 22:20 19th April - Services restored. The server appeared as if it would not boot at all after the reboot was issued, and the DRAC stopped responding as well, so remote hands were called. Due to the issues, the decision was made to run a diagnostic disk on the hardware just to be sure, which passed. The reason the server became unreachable was human error in the VLAN config: it had actually come back up but was simply unreachable. Monitoring will continue to ensure the performance issues are resolved.



Friday, April 19, 2019


