Windows 2008 R2/SQL 2008 SP1 CU5 active/active cluster keeps failing

I have set up many clusters with this same configuration over the last few months, but this one in particular has been having a lot of problems lately.

OS: Windows Server 2008 R2 latest patches

SQL: SQL Server 2008 SP1 CU5 (version: 10.0.2746)

Disks: VMAX SAN

In the very beginning, when I set up the cluster, cluster validation kept failing whenever I tried to add a node. It turned out that we had to unjoin the servers from the domain and rejoin them. Right after that, cluster validation succeeded, the second node joined the cluster, and the quorum was changed to "Node and Disk Majority". SQL Server was installed and set up as active/active (one SQL instance on each node). There were no issues until SQL Server was actually in use.
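For reference, the vote arithmetic behind "Node and Disk Majority" on a two-node cluster can be sketched as below. This is only an illustration, not anything taken from the cluster itself: the function name and the scenario labels are made up, and it assumes the usual rule that each node and the witness disk carry one vote each and that more than half of all votes must be online.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Quorum holds while strictly more than half of all votes are online."""
    return votes_online > total_votes // 2


# Two nodes plus one witness disk: three votes in total (assumed layout).
TOTAL_VOTES = 3

# Hypothetical scenarios: number of votes still reachable in each case.
scenarios = {
    "both nodes + witness disk": 3,
    "one node + witness disk": 2,
    "one node, witness disk offline": 1,
}

results = {name: has_quorum(votes, TOTAL_VOTES) for name, votes in scenarios.items()}

for name, ok in results.items():
    print(f"{name}: {'quorum held' if ok else 'quorum lost'}")
```

Under these assumptions, a node that loses contact with both its partner and the witness disk is left with a single vote out of three and cannot keep the cluster running.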

Symptoms:

The SQL Server instances run heavy ETL processing and are also SQL replication subscribers. The following errors first appeared on 4/24, when I set up SQL replication:

EventID: 1127, Source: FailoverClustering, Task Category: Network Manager

Cluster network interface 'man1fscl01a - Hartbeat to man1fscl91b' for cluster node 'man1fscl01a' on network 'ClusterHeartBeat' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 1130, Source: FailoverClustering, Task Category: Network Manager

Cluster network 'ClusterHeartBeat' is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Then, after I started the ETL processing job on 5/14, we began getting more errors on the cluster. The following entries have appeared consistently in the Windows System log, sometimes multiple times a day. The SQL Server instances have been failing over constantly (I can see multiple SQL error logs per day over the last week).

EventID: 1592 Source FailoverClustering, Task Category: Node-to-Node Communications

Cluster node 'man1fscl01a' lost communication with cluster node 'man1fscl01b'. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 4201, Source: Iphlpsvc

Isatap interface isatap.{640E853E-3232-4A2D-8095-076A798D85AE} is no longer active.

EventID: 1135, Source: FailoverClustering, Task Category: Node Mgr

Cluster node 'man1fscl01b' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 1069, Source: FailoverClustering, Task Category: Resource Control Manager (would happen a few times in a row)

Cluster resource 'Drive Q:' in clustered service or application 'Cluster Group' failed.

EventID: 1177, Source: FailoverClustering, Task Category: Quorum Manager

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 7036, Source: Service Control Manager

The Cluster Service service entered the stopped state.

EventID: 7024, Source: Service Control Manager

The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

I've asked the hardware team to check the NICs, and they said that the drivers are all up to date. I've asked the network team to check for network issues, and they reported that there are none. However, since these servers are nowhere near us (they're on the opposite side of the continent), I am not sure whether they really checked the network at that site. I have also asked the SAN team to check for disk issues, because SQL Server and the cluster seem to fail only when SQL Server is busy with ETL processing and I/O usage is high. The SAN team reported that they found no issues. I am at a loss here. The hardware team insists that these servers need to be re-imaged, which would mean another two weeks before they are back to where they are now. I just cannot believe that is the only solution. Please help!
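One way to check whether the failures really line up with ETL load is to count the cluster events per hour and compare the busy hours against the ETL job schedule. The sketch below assumes the System log has been exported to CSV with `TimeCreated` and `EventID` columns and a `MM/DD/YYYY HH:MM:SS` timestamp. Those column names and that format are assumptions about the export, and the sample rows are made up for illustration, not taken from the real log.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Cluster-related event IDs quoted in this post.
CLUSTER_EVENT_IDS = {"1127", "1130", "1592", "1135", "1069", "1177"}


def count_events_by_hour(csv_text, time_format="%m/%d/%Y %H:%M:%S"):
    """Return a Counter mapping each hour to the number of cluster events in it."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["EventID"] in CLUSTER_EVENT_IDS:
            stamp = datetime.strptime(row["TimeCreated"], time_format)
            # Truncate to the start of the hour so events group together.
            counts[stamp.replace(minute=0, second=0)] += 1
    return counts


# Made-up sample rows; column names and dates are hypothetical.
sample = """TimeCreated,EventID,Source
01/01/2001 02:05:10,1592,FailoverClustering
01/01/2001 02:17:43,1135,FailoverClustering
01/01/2001 02:18:01,1177,FailoverClustering
01/01/2001 09:30:00,7036,Service Control Manager
"""

hourly = count_events_by_hour(sample)
for hour, count in sorted(hourly.items()):
    print(hour.strftime("%m/%d/%Y %H:00"), count)
```

If the hours with the most events consistently fall inside the ETL window, that would support the load theory; if they are spread evenly, the network path between the nodes looks more suspect.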
