Hello,
I am encountering a strange problem in my environment and am wondering if anyone has ideas to help with troubleshooting. We have a 12-node WSFC on Windows Server 2008 (not R2; I know it's ancient, but it can't be helped). Six nodes are in the primary datacenter and six are in the secondary datacenter. We have combined SQL Server 2012 FCIs with Availability Groups. When failing over to another node in the same datacenter, we do a traditional FCI failover using Failover Cluster Manager (FCM) in Windows; when failing over between datacenters, we fail over the availability group through SSMS.
The problem we have encountered on occasion is that when we fail over from one node to another in the same datacenter, the SQL Server group moves to the new node and comes online right away, but the AG does not always follow it. Sometimes it refuses to move from the previous node, and sometimes it moves to a different node entirely. This leaves the AG unavailable, and client machines cannot connect to the listener. The workaround is to stop the cluster service on the node currently hosting the AG or, in rare cases, to reboot that machine. This forces the AG to move to the node where the FCI is currently hosted.
This issue is difficult to reproduce because it is sporadic, and because reproducing it affects a customer-facing database, we are treating it as a delicate matter. We were unable to reproduce it after opening a case with MS, but the issue has since started happening again. I am looking for working theories or further direction to investigate.
Thanks,
Bill