Scenario:
We are running SQL Server 2016 with the latest service pack and patches in a three-node Availability Group (AG). We have implemented Service Broker queues and have two worker processes on app servers processing messages out of these queues. DTC is configured for the AG and for the app servers involved. Recently we have had issues with the AG almost daily, with the following behavior:
Queues operate normally until a large set of requests comes in for processing.
We receive errors in our application log with a Transaction In Doubt exception, and at this point the database essentially becomes locked up. Queues do not process, but the locks/waits reports do not show any waits or blocked requests.
DTC shows several pending queries, with a single one pending commit, but nothing ever changes. Attempts to force a commit or kill the pending request are rejected, and restarting DTC on all the servers in the AG and on the app servers has no effect.
After a period of time (roughly 15 minutes?) the primary node of the AG becomes completely unresponsive and does not accept any connections at all.
We are forced to log in to one of the nodes directly. It appears that communication between the nodes has been interrupted, and we have to fail over the AG manually. After the failover, more often than not, we also have to remove and re-add the queue database to the AG manually on the two secondary nodes.
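For anyone who wants to look at the stuck transactions from the SQL Server side rather than from the DTC console, here is a minimal sketch of the kind of check that could be run while the primary still accepts a connection. It assumes VIEW SERVER STATE permission and relies on two documented behaviors: distributed transactions appear in `sys.dm_tran_active_transactions` with a non-NULL `transaction_uow`, and locks held by orphaned distributed transactions appear in `sys.dm_tran_locks` under session ID -2. It is only a diagnostic aid and is not known to explain the behavior described above.

```sql
-- Distributed transactions this instance currently knows about.
-- transaction_type = 4 marks a distributed transaction;
-- transaction_uow is the DTC unit-of-work identifier.
SELECT  tx.transaction_id,
        tx.transaction_uow,
        tx.transaction_begin_time,
        tx.transaction_type,
        tx.transaction_state
FROM    sys.dm_tran_active_transactions AS tx
WHERE   tx.transaction_uow IS NOT NULL
ORDER BY tx.transaction_begin_time;

-- Locks held by orphaned distributed transactions.
-- These belong to session -2, so they are not tied to any
-- live session and are easy to miss in session-based reports.
SELECT  tl.request_owner_guid      AS unit_of_work,
        DB_NAME(tl.resource_database_id) AS database_name,
        tl.resource_type,
        tl.request_mode,
        tl.request_status
FROM    sys.dm_tran_locks AS tl
WHERE   tl.request_session_id = -2;
```

If the second query returns rows, the unit-of-work value can be matched against the transaction listed as pending in the DTC console.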
We strongly suspect that DTC is the root cause, and we believe that it's something we have done wrong in our code that causes this under high load. The issue only occurs in a queue whose processing involves both queue messages and reading data from FILESTREAM, which leads us to think it may be a timing issue between the nodes of the AG, with the file not yet fully committed on one of the secondary nodes. Reviewing the code around both processes has not turned up anything that jumps out at us, especially since the code works just fine without high load.
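To put numbers on the timing theory, the AG configuration and the per-replica synchronization state could be captured while the load is building. The following is a sketch only, assuming it is run on the primary (where rows for all replicas are visible; a secondary only reports its own databases) and using nothing beyond the documented AG catalog views and DMVs. Which values would count as abnormal here is an open question.

```sql
-- Whether the AG was created with DTC support
-- (dtc_support: 1 = enabled, 0 = not enabled).
SELECT  ag.name,
        ag.dtc_support
FROM    sys.availability_groups AS ag;

-- Synchronization state of each database on each replica,
-- including how far log send, redo, and FILESTREAM shipping lag.
SELECT  ar.replica_server_name,
        DB_NAME(drs.database_id)     AS database_name,
        drs.is_primary_replica,
        drs.synchronization_state_desc,
        drs.synchronization_health_desc,
        drs.log_send_queue_size,     -- KB of log not yet sent to the secondary
        drs.redo_queue_size,         -- KB of log not yet redone on the secondary
        drs.filestream_send_rate     -- KB/s of FILESTREAM data being shipped
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```

Sampling the second query every few seconds during a large batch would show whether the secondaries fall behind before the Transaction In Doubt errors start.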
Our big concern right now is that the AG becomes completely unresponsive, and Google searches have not turned up anything resembling our scenario or behavior. We are hoping that someone with more knowledge of DTC and Service Broker might have some ideas for us to investigate.
Thank you in advance!