A big thunderstorm hit Austin this evening, and at 9:04 PM local time we lost power to several of our most important infrastructure servers in the lab. Most everything seems to have booted back up on its own. However, 2 of the 3 servers providing Ceph storage to the DeveloperCloud failed to boot back up on their own. Every VM in the DeveloperCloud is backed by Ceph, so this has caused quite a bit of havoc. Additionally, the main network node providing external access to the cloud failed to boot back up properly.
I currently have the Ceph cluster recovering. However, it's looking like it could be a couple of hours until it decides all its data is in the proper state and can be used for write access.
The network node is still giving me lots of trouble. I'll give an update once I have more information.
On 05/28/2017 10:30 PM, Andy Doan wrote:
The Ceph cluster has been restored, but the Neutron network node is still failing to manage external network access.
Additionally, I've found that one of our top-of-rack servers will no longer boot. If you have a server in rack 2, i.e. an r2-* server, you will not have serial console access. I'm bringing the server home to try to recover what I can, so it could be a couple of days before serial consoles are restored.
On 05/29/2017 09:36 PM, Andy Doan wrote:
The network node is still down.
Our rack 2 top-of-rack server has been restored and serial access should be available again.
On 05/30/2017 09:26 PM, Andy Doan wrote:
OpenStack external networking is back up again thanks to some great debugging work by Yibo Cai. At this point, everything I'm aware of in the Colo is working properly again. Please let me know if you have any issues.