At approximately 9:12 AM EST this morning, we discovered a storage node had gone offline.
At 9:18 AM, contact was made with the DC, as no connection to the storage node could be established.
Shortly after, we found the storage server in a boot loop at BIOS. Further diagnostics (removing PCI cards, RAM, and CPUs) pointed to a motherboard failure. Due to how Ceph is configured in our current infrastructure, if an entire storage node goes offline there is no recovery, and the GPU nodes can't mount our storage system. While we can sustain multiple drive failures, there is no recovering from an entire node with 36 drives going dark.
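For context on why a single node loss can be unrecoverable: a common cause of this failure mode in Ceph is a CRUSH rule whose failure domain is `osd` rather than `host`, which allows all replicas of a placement group to land on drives in the same chassis. We don't know this cluster's exact layout; the following is a minimal sketch (pool and rule names are hypothetical) of how one would check for and fix that configuration:

```shell
# Inspect existing CRUSH rules; look at the "type" in the chooseleaf step.
# "osd" means replicas may share a host; "host" spreads them across nodes.
ceph osd crush rule dump

# Create a replicated rule whose failure domain is the host, so losing
# one whole node (even with 36 drives) still leaves surviving replicas.
# "replicated_host" and "default" (the CRUSH root) are assumed names.
ceph osd crush rule create-replicated replicated_host default host

# Point the data pool (hypothetical name "gpu-data") at the new rule;
# Ceph then rebalances placement groups to satisfy the host-level rule.
ceph osd pool set gpu-data crush_rule replicated_host
```

Note this protects future deployments only; it can't recover data from a node that was already the sole holder of its replicas.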
A replacement server motherboard has been ordered with next-day delivery to the datacenter and will be installed as soon as it arrives. Until then, BF will be down. The current ETA for motherboard delivery is Tuesday afternoon, Feb 25th.
We sincerely apologize for any inconvenience this has caused. Once we are back online, trial users can request compensation via a ticket on the website.
*UPDATE* - The motherboard is currently being installed. Prayers that nothing else comes up during the install, as I'm not physically there to see issues.
So... it seems the motherboard swap didn't work. There is a short somewhere in the storage server, either in the power distribution block or the backplanes. I don't have the patience to wait any longer, so I've loaded the car with what was planned to be our future storage server and will swap the servers entirely. The only issue is that it's about a 7-hour drive to the datacenter... I plan on making the drive tonight (currently 8:33 PM EST) and replacing the server around 3-4 AM. I apologize for any inconvenience, but I feel the real resolution is going to require an admin physically in the datacenter.
Reddit thread link - https://www.reddit.com/r/BaconFeet/comments/f8el7v/223_full_service_down/?utm_source=share&utm_medium=web2x