Hello,
Environment: Windows Server 2012 R2, SQL Server 2012 Enterprise Edition. A two-node failover cluster hosts a 1.5 TB database in an availability group (synchronous commit). Both servers are on the same subnet. I am seeing the following error in the cluster events. After a few minutes, the AG resource comes back online:
Cluster resource 'AOGrp1' of type 'SQL Server Availability Group' in clustered role 'AOGrp1' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
At the same time, I see the following in the SQL Server error log:
Message
SQL Server hosting availability group 'AOGrp1' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.
Message
Error: 19421, Severity: 16, State: 1.
Here is a partial capture from the cluster log. Is this related to network connectivity or to something else?
5c::2019/03/14-10:02:07.561 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'clean' to 'warning' at 2019-03-14 10:02:07.340
00004590.00004838::2019/03/14-10:02:09.248 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:02:14.373 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:02:20.283 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:02:25.358 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00003a6c::2019/03/14-10:02:32.624 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:02:32.780 INFO [RES] Network Name <AOGrp1_SQL20LS>: Dns: HealthCheck: SQL20LS
00004590.00004838::2019/03/14-10:02:32.780 INFO [RES] Network Name <AOGrp1_SQL20LS>: Dns: End of Slow Operation, state: Initialized/Reading, prevWorkState: Reading
00004590.00004adc::2019/03/14-10:02:37.889 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
000036ec.00003458::2019/03/14-10:02:38.874 INFO [RES] SQL Server Availability Group: [hadrag] Availability Group with resource ID '7a245c40-fd05-4b59-99ac-31c395284899' did not receive healthinformation before HealthCheckTimeout
000036ec.00003458::2019/03/14-10:02:38.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
000036ec.00003458::2019/03/14-10:02:38.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Resource Alive result 0.
000036ec.00003458::2019/03/14-10:02:38.874 INFO [RES] SQL Server Availability Group: [hadrag] Availability Group with resource ID '7a245c40-fd05-4b59-99ac-31c395284899' did not receive healthinformation before HealthCheckTimeout
000036ec.00003458::2019/03/14-10:02:38.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
000036ec.00003458::2019/03/14-10:02:38.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Resource Alive result 0.
000036ec.00003458::2019/03/14-10:02:38.874 WARN [RHS] Resource AOGrp1 IsAlive has indicated failure.
000028b0.00000840::2019/03/14-10:02:40.774 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'AOGrp1', gen(11) result 1/0.
000028b0.00000840::2019/03/14-10:02:40.774 INFO [RCM] Res AOGrp1: Online -> ProcessingFailure( StateUnknown )
000028b0.00000840::2019/03/14-10:02:40.774 INFO [RCM] TransitionToState(AOGrp1) Online-->ProcessingFailure.
000028b0.00000840::2019/03/14-10:02:40.774 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AOGrp1, Online --> Pending)
000028b0.00000840::2019/03/14-10:02:40.774 ERR [RCM] rcm::RcmResource::HandleFailure: (AOGrp1)
000028b0.00000840::2019/03/14-10:02:40.999 INFO [RCM] resource AOGrp1: failure count: 1, restartAction: 2 persistentState: 1.
000028b0.00000840::2019/03/14-10:02:40.999 INFO [RCM] Greater than restartPeriod time has elapsed since first failure of AOGrp1, resetting failureTime and failureCount.
000028b0.00000840::2019/03/14-10:02:40.999 INFO [RCM] Will queue immediate restart (500 milliseconds) of AOGrp1 after terminate is complete.
000028b0.00000840::2019/03/14-10:02:40.999 INFO [RCM] Res AOGrp1: ProcessingFailure -> WaitingToTerminate( DelayRestartingResource )
000028b0.00000840::2019/03/14-10:02:40.999 INFO [RCM] TransitionToState(AOGrp1) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
000028b0.00000840::2019/03/14-10:02:41.030 INFO [RCM] Res AOGrp1_FSShare: Online -> WaitingToTerminate( WaitingToComeOnline )
000028b0.00000840::2019/03/14-10:02:41.030 INFO [RCM] TransitionToState(AOGrp1_FSShare) Online-->[WaitingToTerminate to WaitingToComeOnline].
000028b0.00000840::2019/03/14-10:02:41.030 INFO [RCM] Res AOGrp1_FSShare: [WaitingToTerminate to WaitingToComeOnline] -> Terminating( WaitingToComeOnline )
000028b0.00000840::2019/03/14-10:02:41.030 INFO [RCM] TransitionToState(AOGrp1_FSShare) [WaitingToTerminate to WaitingToComeOnline]-->[Terminating to WaitingToComeOnline].
000028b0.00000840::2019/03/14-10:02:41.030 INFO [RCM] AOGrp1 not yet ready to terminate; dependent AOGrp1_FSShare still terminating.
000028b0.00002540::2019/03/14-10:02:41.483 INFO [RCM] ignored non-local state Pending for group AOGrp1
000028b0.000046b0::2019/03/14-10:02:41.546 INFO [GUM] Node 1: executing request locally, gumId:3496, my action: /dm/update, # of updates: 1
000028b0.00002974::2019/03/14-10:02:41.594 ERR [RCM] [GIM] ResType Virtual Machine has no resources, not collecting local utilization info
000028b0.00002974::2019/03/14-10:02:41.594 INFO [RCM] [GIM] Scheduling Local Node Crawler to run in 300000 millisec.
000028b0.00003b60::2019/03/14-10:02:41.624 INFO [GUM] Node 1: executing request locally, gumId:3497, my action: /dm/update, # of updates: 1
000028b0.00003414::2019/03/14-10:02:41.844 INFO [API] s_ApiUnblockGetNotifyCall: for the HDL( 1a )
000028b0.00003414::2019/03/14-10:02:41.859 INFO [API] s_ApiGetQuorumResource final status 0.
000028b0.00002974::2019/03/14-10:02:41.859 INFO [API] s_ApiGetQuorumResource final status 0.
000028b0.000046b0::2019/03/14-10:02:41.874 WARN [API] s_ApiOpenResourceEx: Resource not found, status = 5007
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM] HandleMonitorReply: TERMINATERESOURCE for 'AOGrp1_FSShare', gen(0) result 0/0.
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM] Res AOGrp1_FSShare: [Terminating to WaitingToComeOnline] -> WaitingToComeOnline( StateUnknown )
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM] TransitionToState(AOGrp1_FSShare) [Terminating to WaitingToComeOnline]-->WaitingToComeOnline.
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM] Res AOGrp1: [WaitingToTerminate to DelayRestartingResource] -> Terminating( DelayRestartingResource )
000028b0.00003b60::2019/03/14-10:02:41.874 INFO [RCM] TransitionToState(AOGrp1) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource].
000036ec.00004518::2019/03/14-10:02:41.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Lease Thread terminated
000028b0.000046b0::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000028b0.000046b0::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000028b0.000046b0::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000028b0.000046b0::2019/03/14-10:02:41.874 INFO [RCM-rbtr] giving default token to group AOGrp1
000036ec.00003654::2019/03/14-10:02:41.874 INFO [RES] SQL Server Availability Group: [hadrag] Stopping Health Worker Thread
000036ec.00002c74::2019/03/14-10:02:41.874 INFO [RES] SQL Server Availability Group: [hadrag] Health worker was asked to terminate
000036ec.00001f5c::2019/03/14-10:02:42.847 INFO [RES] SQL Server Availability Group: [hadrag] SQLMoreResults() returns -1 with following information
000036ec.00002c74::2019/03/14-10:02:42.847 INFO [RES] SQL Server Availability Group: [hadrag] Change diagnostics interval worker is stopped
000036ec.00001f5c::2019/03/14-10:02:42.847 ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [HY008] [Microsoft][SQL Server Native Client 11.0]Operation canceled (0)
000036ec.00001f5c::2019/03/14-10:02:42.847 ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [01000] [Microsoft][SQL Server Native Client 11.0][SQL Server] (0)
000036ec.00001f5c::2019/03/14-10:02:42.847 INFO [RES] SQL Server Availability Group: [hadrag] No more diagnostics results
000036ec.00001f5c::2019/03/14-10:02:42.847 INFO [RES] SQL Server Availability Group: [hadrag] Diagnostics is stopped
000036ec.00001f5c::2019/03/14-10:02:42.847 INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
000036ec.00001f5c::2019/03/14-10:02:42.925 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event logging is stopped
000036ec.00001f5c::2019/03/14-10:02:43.421 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event target state:
000036ec.00001f5c::2019/03/14-10:02:43.421 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event session summary: dropped buffers = 0, dropped events = 0
00004590.00004adc::2019/03/14-10:02:43.624 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
000036ec.00003654::2019/03/14-10:02:43.764 INFO [RES] SQL Server Availability Group: [hadrag] Stopping Change Diagnostics interval Worker Thread
000036ec.00003654::2019/03/14-10:02:48.171 INFO [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Connect to SQL Server ...
00004590.00003a6c::2019/03/14-10:02:48.781 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:02:49.405 INFO [RES] Network Name <AOGrp1_SQL20LS>: [cxl::Pinger-"SQL20LS"] Host not registered.
00004590.00004838::2019/03/14-10:02:49.405 WARN [RES] Network Name <AOGrp1_SQL20LS>: [cxl::Pinger-"SQL20LS"] Could not find any endpoints for remote target
00004590.00004838::2019/03/14-10:02:49.421 INFO [RES] Network Name <AOGrp1_SQL20LS>: Setting resource specific message to Name Resolution Not Yet Available
00004590.00004838::2019/03/14-10:02:54.452 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
00004590.00004838::2019/03/14-10:03:00.015 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:6c77a32d-387c-4a20-bfd6-8006b879b8c0:Netbios
000036ec.00003654::2019/03/14-10:03:00.077 INFO [RES] SQL Server Availability Group <AOGrp1>: [hadrag] The connection was established successfully
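To make the sequence of events easier to follow, the WARN and ERR entries can be pulled out of a capture like the one above. The following is a minimal Python sketch, assuming the line layout seen in this excerpt (hex process.thread IDs, `::`, timestamp, level, bracketed component, message). The names `LINE_RE`, `parse_line`, `problems`, and `SAMPLE` are illustrative and not part of any cluster tooling.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
# <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<ids>[0-9a-fA-F.]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Parse one cluster log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def problems(lines):
    """Return the WARN and ERR entries, in their original order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e["level"] in ("WARN", "ERR")]


# Three lines copied from the capture above.
SAMPLE = [
    "000036ec.00003458::2019/03/14-10:02:38.874 INFO [RES] SQL Server Availability Group: [hadrag] Availability Group with resource ID '7a245c40-fd05-4b59-99ac-31c395284899' did not receive healthinformation before HealthCheckTimeout",
    "000036ec.00003458::2019/03/14-10:02:38.874 ERR [RES] SQL Server Availability Group <AOGrp1>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel",
    "000036ec.00003458::2019/03/14-10:02:38.874 WARN [RHS] Resource AOGrp1 IsAlive has indicated failure.",
]

if __name__ == "__main__":
    for entry in problems(SAMPLE):
        print(entry["ts"].time(), entry["level"], entry["component"], entry["message"])
```

Run against the full log file (for example by passing `open(path)` to `problems`), this narrows the capture to the entries leading up to the restart.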
I would greatly appreciate your input. Thanks.
Victor