Channel: SQL Server High Availability and Disaster Recovery forum

SQL Server 2016 Availability Group DISCONNECTED replica


Hello,

I would like to ask the experts for help, because I have exhausted every option I could remember or find.

We have 2 servers (each with 64 CPUs, 1 TB RAM, SSD disks) running SQL Server 2016 (13.0.2164) with an Availability Group deployed in asynchronous-commit mode. The AG contains 8 databases (total size is roughly 1 TB). AG synchronization uses a private 10 Gbit network card.

The secondary replica irregularly changes state to DISCONNECTED. I tried to solve it by stopping and starting the endpoints, and by dropping and re-creating them. Only removing the secondary replica and adding it again solves the problem (but I had to restore all databases again).

Error log message:

A connection timeout has occurred on a previously established connection to availability replica 'abx' with id [xxx-xxx-xxx].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

Last connection error from sys.dm_hadr_availability_replica_states:

An error occurred while receiving data: '10054(An existing connection was forcibly closed by the remote host.)'
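For anyone wanting to look at the same information, a query along these lines reads the last connection error per replica. It is only a sketch and not necessarily the exact query used above; the join to sys.availability_replicas is added here just to show the server name.

```sql
-- Sketch: connection state and last connection error for each replica.
-- Run on either instance; the primary reports all replicas, a secondary only itself.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = ars.replica_id;
```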

There are no other error messages related to this issue in the Windows logs, etc.

The endpoints are created this way:

CREATE ENDPOINT [Hadr_endpoint]
    STATE=STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = (192.168.20.10))
    FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE,
        ENCRYPTION = REQUIRED ALGORITHM AES)
GO
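To confirm what each instance actually ended up with after a drop/create, the endpoint catalog views can be compared on both servers. This is a minimal sketch using only standard catalog views; nothing in it is specific to this environment.

```sql
-- Sketch: show state, port, listener IP, authentication and encryption
-- of the database mirroring (HADR) endpoint on the current instance.
SELECT te.name,
       te.state_desc,
       te.port,
       te.ip_address,
       dme.role_desc,
       dme.connection_auth_desc,
       dme.encryption_algorithm_desc
FROM sys.tcp_endpoints AS te
JOIN sys.database_mirroring_endpoints AS dme
  ON dme.endpoint_id = te.endpoint_id
WHERE te.type_desc = N'DATABASE_MIRRORING';
```

Running it on both replicas makes a mismatch in authentication or encryption settings easy to spot.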

I tried stopping/starting the endpoints, dropping/re-creating the endpoints, disabling endpoint encryption, both Windows and certificate endpoint authentication, granting CONNECT permission on the endpoints to the logins, and ALTER AUTHORIZATION on the endpoints for the SQL Server service account (both instances use the same domain user account). Both servers are listening on port 5022, the firewall is disabled, and an optical cable connects both servers directly (without any network device between them). Both servers have enough worker threads (max worker threads is 1472; the primary replica uses about 1200 all the time, the secondary about 600).
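Worker thread figures like the ones above can be read from the scheduler DMVs. The following is a rough sketch, not necessarily how the numbers in this post were collected; a non-zero queued task count would suggest tasks are waiting for a free worker.

```sql
-- Sketch: configured worker limit versus workers currently in use.
SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers,
       SUM(s.current_workers_count) AS current_workers,
       SUM(s.active_workers_count)  AS active_workers,
       SUM(s.work_queue_count)      AS queued_tasks  -- tasks waiting for a worker
FROM sys.dm_os_schedulers AS s
WHERE s.status = N'VISIBLE ONLINE';
```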

I am really desperate because the DISCONNECTED state can occur at almost any time. Sometimes everything holds for 1-2 days; sometimes the secondary replica transitions to DISCONNECTED 2 hours after I manually remove and re-add the replica, restore the databases and join them. While the secondary replica is DISCONNECTED, the transaction logs of all databases keep growing. The only thing left that I can do is remove the AG, remove the cluster, reinstall both Windows 2012 servers (and drivers, firmware), and configure a new cluster, SQL Servers and AG.
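The log growth while the secondary is disconnected can be watched from the primary with a query like the one below. It is a sketch under the assumption that it runs on the primary replica, where rows with is_local = 0 describe the remote secondary.

```sql
-- Sketch (run on the primary): why each log cannot be truncated and
-- how much log is still waiting to be sent to the secondary.
SELECT d.name,
       d.log_reuse_wait_desc,             -- typically AVAILABILITY_REPLICA in this situation
       drs.synchronization_state_desc,
       drs.log_send_queue_size,           -- KB of log not yet sent
       drs.redo_queue_size                -- KB of log not yet redone on the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.databases AS d
  ON d.database_id = drs.database_id
WHERE drs.is_local = 0;
```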

If you have any idea what I can do to solve my problem, please let me know. All feedback is appreciated.

Thank you

David

