I'm investigating an issue with relatively slow manual failover.
We have three SQL Server 2012 servers in an AlwaysOn setup. All servers (including the domain controller and DNS) sit next to each other and are connected over a 1 Gb network.
We also have an availability group listener set up with two IPs: 10.10.10.17 (in the domain network) and 192.168.2.156 (local network), so every SQL server has two IPs.
In our busy environment a SQL failover takes 30 sec on average (sometimes 10 sec, sometimes 1 min). During this time the database is unavailable, and queries immediately fail with an exception saying that the database is resolving or not in the right state.
Being down for even 5 sec is unacceptable for us, as we process a lot of transactions from our clients via our API, and this type of exception is really bad for business.
Right after a failover from SQL1 to SQL3, I captured the cluster log: Get-ClusterLog -Node SQL3 -TimeSpan 15
Here are some results (a complete failover took about 12 seconds):
2014/02/18-20:36:47.790 INFO [RCM] move of group IPSStorage HA from SQL1(3) to SQL3(2) of type MoveType::Manual is about to succeed
<skip>
2014/02/18-20:36:50.994 INFO [RES] Network Name <IPSStorage HA_IPSSTORAGEHA>: Getting Read/Write private properties
2014/02/18-20:36:52.775 INFO [RCM] rcm::RcmApi::OnlineResource: (IPSStorage HA, 0)
2014/02/18-20:36:53.993 INFO [RES] Network Name: [NNLIB] Registered server name IPSSTORAGEHA on transport \Device\NetBt_If2
2014/02/18-20:36:57.022 INFO [RES] Network Name: [NNLIB] Registered workstation name IPSSTORAGEHA, on transport \Device\NetBt_If2
Just registering the names took about 10 sec in total (:57 minus :47). Any idea why it took so long? Or is there anything we can do to make AlwaysOn truly always on, without interruptions?
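To see where the time goes, here is a rough sketch in Python that computes the gap between each of the log entries quoted above. The timestamps are copied from the excerpt; the short labels and the function names are just my own shorthand, not anything from the cluster log itself.

```python
from datetime import datetime

# Timestamps copied from the cluster log excerpt; labels are informal shorthand.
LOG_TIMES = [
    ("move about to succeed", "2014/02/18-20:36:47.790"),
    ("getting private properties", "2014/02/18-20:36:50.994"),
    ("OnlineResource", "2014/02/18-20:36:52.775"),
    ("registered server name", "2014/02/18-20:36:53.993"),
    ("registered workstation name", "2014/02/18-20:36:57.022"),
]

# Format of the timestamps as they appear in the excerpt.
FMT = "%Y/%m/%d-%H:%M:%S.%f"


def parse(ts):
    """Parse one cluster log timestamp into a datetime."""
    return datetime.strptime(ts, FMT)


def step_gaps(entries):
    """Return (label, seconds since the previous entry) for every entry after the first."""
    times = [(label, parse(ts)) for label, ts in entries]
    return [
        (label, (t - prev_t).total_seconds())
        for (_, prev_t), (label, t) in zip(times, times[1:])
    ]


def total_elapsed(entries):
    """Seconds between the first and the last entry."""
    return (parse(entries[-1][1]) - parse(entries[0][1])).total_seconds()


if __name__ == "__main__":
    for label, gap in step_gaps(LOG_TIMES):
        print(f"{gap:6.3f}s  {label}")
    print(f"total: {total_elapsed(LOG_TIMES):.3f}s")
```

For these five entries the total comes out at about 9.2 sec, with the two largest gaps (roughly 3 sec each) before the first network name entry and before the workstation name registration.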
Thank you for your support!
Dima