Channel: SQL Server High Availability and Disaster Recovery forum

Implications of lease & healthcheck timeout in SQL Server 2014 AlwaysOn AG environment


Issue Background

We recently migrated to a production environment running SQL Server 2014 Enterprise Edition and faced an Availability Group (AG) outage last week.

 

The AG has one primary and one synchronous secondary replica. The two replicas are on separate servers, hosted within the same datacenter / LAN. For ease of explanation, I will refer to the original primary replica as A and the original secondary replica as B.

 

The sequence of events started with an automatic failover from A to B. This didn't create any issue, as all workloads were running fine up to this point. Do note that after this failover, B is the primary and A the secondary.

The issue started soon after that, when the AG tried to fail over again within a few minutes of the first failover. At this time, the DBs on A (secondary) were not in sync, so the failover was not successful. Consequently, the AG went into the resolving state on both the B (primary) and A (secondary) instances.
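As an illustration of the "not in sync" condition, here is a minimal sketch of a query that lists the synchronization state of each database on each replica. It assumes the standard AlwaysOn DMVs and is not something taken from the original diagnosis.

```sql
-- Run on the current primary, which reports the state of every replica.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)         AS database_name,
    drs.synchronization_state_desc,  -- e.g. SYNCHRONIZED, SYNCHRONIZING, NOT SYNCHRONIZING
    drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```

With a synchronous-commit secondary, automatic failover is only possible while the secondary's databases show as SYNCHRONIZED.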

To come out of this resolving state, we did a manual failover to B, after which A dropped out of the AG. However, it was still part of the Windows cluster and the server itself was still reachable.

Diagnosis

Went through the SQL Server error logs and saw the following line, which triggered the sequence of events described above:

The lease between availability group 'AG_Name' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster.

Searching on this led us to multiple possible reasons why this might happen.

In our case, a couple of queries were taking 100% CPU. While these queries were running, the lease check process was not receiving any response from the invocation of sp_server_diagnostics within the timeout period of 30 secs. This resulted in the lease timeout and the subsequent failover.

With some more digging and testing, we reconfigured MaxDOP (earlier it was set to 0), the cost threshold for parallelism (CTP), and also the lease timeout, from 30 to 100 secs. It's been 6 days since the issue occurred and it has not happened since. However, it raises some queries and doubts.
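For reference, a minimal sketch of how the two parallelism settings are changed with sp_configure. The values shown are placeholders only, since the actual MaxDOP and CTP values used are not stated here.

```sql
-- Both settings are advanced options, so expose them first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Placeholder values, NOT the ones used in this environment.
EXEC sp_configure 'max degree of parallelism', 4;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;

-- The lease timeout is not set through T-SQL: it is a property
-- (LeaseTimeout, in milliseconds) of the AG resource in the Windows cluster.
```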

Queries

  1. Why did replica A drop out of the AG? We had to manually add it back to the AG and re-join the DBs.
  2. I have understood how the lease timeout works and how often the lease check runs. However, what is not clear is the exact difference between the lease timeout and the health check timeout. Moreover, if both of these checks monitor AG health and trigger a failover when required, why are two separate checks needed?
  3. What should be the ideal range of values for the lease and health check timeouts? A higher timeout value will only delay the failover when an actual issue happens.
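To make the second and third queries concrete, here is a minimal sketch of where the health check timeout can be seen. It assumes the standard catalog view sys.availability_groups; 'AG_Name' is the placeholder name from the log message above, and the value in the commented-out statement is illustrative, not a recommendation.

```sql
-- Current health check timeout (milliseconds) and failure condition level per AG.
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Example of changing it, left commented out on purpose; 60000 ms is illustrative.
-- ALTER AVAILABILITY GROUP [AG_Name] SET (HEALTH_CHECK_TIMEOUT = 60000);
```

The lease timeout does not appear in this view, because it is stored on the cluster resource rather than in the AG definition.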

Have searched on the above queries, esp. 2 and 3, but have not found any reasonable clarity.

Would really appreciate it if these could be adequately answered.

Thanks in advance,

Subho

