The problem turned out to be that the vMotion subnet was getting pinned to a vmk / uplink that was not yet active. I discovered this using esxcfg-route -l.
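For reference, these are the host-shell commands for inspecting the VMkernel interfaces and the routing table (exact output will of course vary per host):

```shell
# List all VMkernel interfaces: IP, netmask, portgroup, and enabled state
esxcfg-vmknic -l

# List the VMkernel routing table to see which vmk each local subnet is pinned to
esxcfg-route -l
```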
So here is a ping on the vMotion network failing.
~ # vmkping 333.444.1.65
PING 333.444.1.65 (333.444.1.65): 56 data bytes
--- 333.444.1.65 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
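On builds where vmkping supports the -I flag, you can pin the ping to a specific source interface, which is a quick way to confirm which vmk the traffic is actually leaving on. A sketch using this host's interface names:

```shell
# Ping the same target, forcing the source interface each time
vmkping -I vmk5 333.444.1.65   # vMotion on Fabric A (the working fabric)
vmkping -I vmk6 333.444.1.65   # vMotion on Fabric B (expected to fail here)
```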
Here is the VMkernel routing table. The host knows about the subnet, but the route is somehow pinned at the VMware layer to vmk6. vmk6 is vMotion on Fabric B, which is currently down for non-VMware-related reasons.
~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway         Interface
777.444.124.0    255.255.255.0    Local Subnet    vmk1
777.444.125.0    255.255.255.0    Local Subnet    vmk3
333.444.1.0      255.255.255.0    Local Subnet    vmk6
555.888.128.0    255.255.255.0    Local Subnet    vmk4
555.888.158.0    255.255.255.0    Local Subnet    vmk0
default          0.0.0.0          555.888.158.1   vmk0
Here is the routing table after I deleted vmk6 via vCenter. 333.444.1.0 is now pinned to vmk5. Neither disabling vMotion, dropping the uplink, nor rebooting would switch the route to vmk5; only deleting vmk6 fixed it. vmk5 is vMotion on Fabric A, which is working.
~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway         Interface
777.444.124.0    255.255.255.0    Local Subnet    vmk1
777.444.125.0    255.255.255.0    Local Subnet    vmk3
333.444.1.0      255.255.255.0    Local Subnet    vmk5
555.888.128.0    255.255.255.0    Local Subnet    vmk4
555.888.158.0    255.255.255.0    Local Subnet    vmk0
default          0.0.0.0          555.888.158.1   vmk0
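I removed vmk6 through the vCenter UI, but the same removal can be done from the host shell. A sketch (on ESXi versions with the esxcli network namespace):

```shell
# Remove the stale vMotion interface; the 333.444.1.0 route should
# then re-pin to the remaining interface on that subnet (vmk5 here)
esxcli network ip interface remove --interface-name=vmk6
```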
Here is the ping working once vmk6 is gone.
~ # vmkping 333.444.1.65
PING 333.444.1.65 (333.444.1.65): 56 data bytes
64 bytes from 333.444.1.65: icmp_seq=0 ttl=64 time=0.346 ms
64 bytes from 333.444.1.65: icmp_seq=1 ttl=64 time=0.121 ms
64 bytes from 333.444.1.65: icmp_seq=2 ttl=64 time=0.146 ms