Greetings. For several months now we've been getting alerts about high worker thread (WT) counts on the server that hosts our Availability Group (AG). This is a large data warehouse environment (I know, not ideal for an AG) that runs about 1500 ETL processes each night. Last night, however, we had an AG crash that created a stack dump and generated this message in the SQL log at the end of the stack dump messages:
Message
The lease between availability group 'myAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
While my boss is painfully working through the dump, I'm trying to determine whether our high WT counts could be what caused the issue.
A few months back I used this link to start gathering WT information every 5 minutes via a SQL job. Here is the code used to create my job:
--step wtSumOfWorkersCount
insert into wtSumOfWorkersCount
    (dateAdded, [sum_current_tasks_count], [sum_current_workers_count], [sum_active_workers_count])
select getdate(), sum(current_tasks_count), sum(current_workers_count), sum(active_workers_count)
from sys.dm_os_schedulers

--step wtCountWorkersByWaitType
insert into [dbo].[wtCountWorkersByWaitType]
    (is_preemptive, return_code, state, last_wait_type, numWorkers, dateAdded)
select is_preemptive, return_code, state, last_wait_type, count(*), getdate()
from sys.dm_os_workers
group by return_code, state, last_wait_type, is_preemptive

--step wtCountRequestsByUserProcesses
insert into wtCountRequestsByUserProcesses
select is_user_process, count(*), getdate()
from sys.dm_exec_sessions s
inner join sys.dm_exec_requests r on s.session_id = r.session_id
group by is_user_process

--step wtCountNotCountedAgainstMax
;with cte as (
    select s.is_user_process, w.worker_address, w.is_preemptive, w.state, r.status,
           t.task_state, r.command, w.last_wait_type, t.session_id, t.exec_context_id, t.request_id
    from sys.dm_exec_sessions s
    inner join sys.dm_exec_requests r on s.session_id = r.session_id
    inner join sys.dm_os_tasks t on r.task_address = t.task_address
    inner join sys.dm_os_workers w on t.worker_address = w.worker_address
    where s.is_user_process = 0
)
insert into [dbo].[wtCountNotCountedAgainstMax]
    (is_user_process, command, last_wait_type, cmd_cnt, dateAdded)
select is_user_process, command, last_wait_type, count(*), getdate()
from cte
group by is_user_process, command, last_wait_type
Now I'm trying to diagnose all the info I have collected and have several questions:
1) I ran "sp_server_diagnostics" as suggested at the top of the article. It says my max WT count is 2944. The problem is that no matter what I query, my WT counts are nowhere near 2944. That leads me to believe that either my alert is completely bogus or my queries for this job are completely off. However, as you'll note, all I did was copy/paste from the MSDN blog. Any ideas here? One important fact: 6 minutes before our stack dump/AG crash occurred, we got an alert for 100% WT usage, which obviously causes some concern. I'm working with our Monitoring team to get the query that generates this alert.
2) Per this same article, Availability Groups don't really count against the max WTs allocated. However, I've read just the opposite elsewhere. One of the theories for lowering our WT count is to remove some of the DBs from our AG and create a new AG with its Primary on the current Passive node. The logic is that unless there's a failover that leaves one node hosting both Primaries, the number of WTs on each node will go down. All that said, do AGs count against the max number of WTs or not?
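For question 1, this is a rough sketch of the comparison I have in mind: current worker counts set side by side with the configured ceiling. It assumes that max_workers_count in sys.dm_os_sys_info is the same ceiling sp_server_diagnostics reports, and that only 'VISIBLE ONLINE' schedulers should be summed. Both are my assumptions, and the job above sums every scheduler, so the numbers may not line up exactly with what I've been collecting.

```sql
-- Sketch only: compare summed worker counts with the configured maximum.
-- Assumption: filtering to VISIBLE ONLINE schedulers is the right scope;
-- the collection job above does not apply this filter.
SELECT
    si.max_workers_count,
    SUM(s.current_workers_count) AS current_workers,
    SUM(s.active_workers_count)  AS active_workers,
    SUM(s.work_queue_count)      AS queued_tasks,   -- tasks waiting for a worker
    CAST(100.0 * SUM(s.current_workers_count) / si.max_workers_count
         AS decimal(5, 2))       AS pct_of_max
FROM sys.dm_os_schedulers AS s
CROSS JOIN sys.dm_os_sys_info AS si
WHERE s.status = 'VISIBLE ONLINE'
GROUP BY si.max_workers_count;
```

If pct_of_max stays low while the alert reports 100%, that would suggest the alert is measuring something other than this ratio.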
I'll likely have other questions, but this is a good start.
Thanks in advance! ChrisRDBA