The Situation
Although IT infrastructure is generally designed with N+1 redundancy, a worst-case scenario can still present itself. Today I was asked what would happen to our VMware hosts if our SAN rebooted. The question came from a Change Request for a SAN firmware upgrade, which stated that the SAN would need to be rebooted if the change failed. Admittedly the risk was small, because the nodes are upgraded one at a time and everything keeps running on the other node. But what would happen if we lost all nodes?
The Consequences
This is indeed the worst-case scenario, but if the SAN did crash and reboot, your VMware hosts would see an All-Paths-Down (APD) situation.
In vSphere 4.x, an All-Paths-Down (APD) situation occurs when all paths to a device are down. As there is no indication whether this is a permanent or temporary device loss, the ESXi host keeps reattempting to establish connectivity. APD-style situations commonly occur when the LUN is incorrectly unpresented from the ESXi/ESX host (which would be this case). The ESXi/ESX host, still believing the device is available, retries all SCSI commands indefinitely. This has an impact on the management agents, as their commands are not responded to until the device is again accessible.
This causes the ESXi/ESX host to become inaccessible/not-responding in vCenter Server.
Until everything is restored, it is difficult to say whether you would be able to recover all your devices or whether there is any corruption on the SAN side.
When the storage is back and running, I would recommend:
- Completing a rescan of all the HBAs from the SSH console to try to bring the paths back online:
esxcfg-rescan <vmkernel SCSI adapter name>
Where <vmkernel SCSI adapter name> is the vmhba# to be rescanned (for example, vmhba1).
- Restarting the management agents on the host.
- Rebooting the hosts if they still cannot see the SAN.
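The first two steps above could be scripted roughly as follows. This is only a sketch under some assumptions: the function names run and recover_paths and the DRY_RUN switch are made up for illustration, the adapter names are passed in by hand, and services.sh restart is assumed to be the way to restart the management agents on an ESXi host (classic ESX uses service mgmt-vmware restart instead). By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch: rescan each named HBA, then restart the management agents.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 on the host to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: arguments are the vmhba names to rescan, e.g. vmhba1 vmhba2.
recover_paths() {
    for hba in "$@"; do
        run esxcfg-rescan "$hba"
    done
    # Assumption: ESXi host. On classic ESX use "service mgmt-vmware restart".
    run services.sh restart
}

# Only act when adapter names were actually supplied.
if [ "$#" -gt 0 ]; then
    recover_paths "$@"
fi
```

Saved as, say, recover.sh, running sh recover.sh vmhba1 vmhba2 prints the planned commands, and DRY_RUN=0 sh recover.sh vmhba1 vmhba2 executes them. If the hosts still cannot see the SAN afterwards, the reboot step remains a manual decision.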