Channel: SQL Server High Availability and Disaster Recovery forum

SQL Cluster unexpected failover


So one of our SQL clusters failed over unexpectedly recently, for the second time in a few months. It's a two-node active/passive SQL Server 2012 cluster running on Windows Server 2012 Standard.

Here's what we could cull from the application/system logs:

1. "Cluster resource 'SQLServer' of type 'SQL Server' in clustered role 'SQLServerRole' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet."

2. "Cluster resource 'SQLServer' (resource type 'SQL Server', DLL 'sqsrvres.dll') did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource. This may affect other resources hosted in the same RHS process. The resources will then be restarted.

The suspect resource 'SQLServer' will be marked to run in an isolated RHS process to avoid impacting multiple resources in the event that this resource failure occurs again. Please ensure services, applications, or underlying infrastructure (such as storage or networking) associated with the suspect resource is functioning properly."

3. "The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.  Please determine which resource and resource DLL is causing the issue and verify it is functioning properly."

4. "A timeout (30000 milliseconds) was reached while waiting for a transaction response from the MSSQLSERVER service."

Cluster.log wasn't much more helpful on the root cause either:

"

00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name <Cluster Name>: Netbios: Slow Operation, FinishWithReply: 0
00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name:  [NN] got sync reply: 0
00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name <Cluster Name>: Netbios: End of Slow Operation, state: Initialized/Idle, prevWorkState: Idle
00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] IsAlive request.
00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] CheckServiceAlive: returning TRUE (success)
00001134.000001d8::2014/12/04-21:25:57.287 ERR   [RES] SQL Server <SQLServer>: [sqsrvres] Failure detected, diagnostics heartbeat is lost
00001134.000001d8::2014/12/04-21:25:57.287 INFO  [RES] SQL Server <SQLServer>: [sqsrvres] IsAlive returns FALSE
00001134.000001d8::2014/12/04-21:25:57.287 WARN  [RHS] Resource SQLServer IsAlive has indicated failure.
00000880.0000161c::2014/12/04-21:25:57.303 INFO  [NM] Received request from client address HOST-XXX-SQL02.
00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'SQLServer', gen(3) result 1/0.
00000880.000023a4::2014/12/04-21:25:57.303 INFO  [GEM] Sending 1 messages as a batched GEM message
00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] Res SQLServer: Online -> ProcessingFailure( StateUnknown )
00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] TransitionToState(SQLServer) Online-->ProcessingFailure.
00000880.0000161c::2014/12/04-21:25:57.318 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (SQLServerRole, Online --> Pending)
00000880.00001db8::2014/12/04-21:25:57.334 INFO  [GEM] Sending 1 messages as a batched GEM message
00000880.0000161c::2014/12/04-21:25:57.334 ERR   [RCM] rcm::RcmResource::HandleFailure: (SQLServer)
00000880.00001db8::2014/12/04-21:25:57.334 INFO  [GEM] Sending 1 messages as a batched GEM message
00000880.00000bac::2014/12/04-21:25:57.334 INFO  [RCM] ignored non-local state Pending for group SQLServerRole
00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] resource SQLServer: failure count: 1, restartAction: 2 persistentState: 1.
00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] Greater than restartPeriod time has elapsed since first failure of SQLServer, resetting failureTime and failureCount.
00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] Will queue immediate restart (500 milliseconds) of SQLServer after terminate is complete."
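
For anyone wading through a much longer cluster.log, one way to cut it down is to keep only the ERR and WARN lines. Below is a minimal Python sketch, assuming every line follows the layout seen in the excerpt above (hex process.thread IDs, timestamp, level, bracketed component, message). The names `parse_line` and `problems` are made up for illustration and are not part of any Microsoft tool.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[A-Za-z]+)\]\s*"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Split one cluster.log line into fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def problems(lines, levels=("ERR", "WARN")):
    """Return the parsed entries whose level is in `levels`, in file order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in levels]


# Three lines copied from the excerpt above.
sample = [
    "00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] CheckServiceAlive: returning TRUE (success)",
    "00001134.000001d8::2014/12/04-21:25:57.287 ERR   [RES] SQL Server <SQLServer>: [sqsrvres] Failure detected, diagnostics heartbeat is lost",
    "00001134.000001d8::2014/12/04-21:25:57.287 WARN  [RHS] Resource SQLServer IsAlive has indicated failure.",
]

for entry in problems(sample):
    print(entry["ts"], entry["level"], entry["component"], entry["message"])
```

To run it against a real file, pass the open file object instead of `sample`, e.g. `problems(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever the log was exported. The encoding is a guess and may need adjusting.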

Any ideas? Anywhere we could look for more specific info? Any preventative measures we could take?

Thanks,

Ryan

