I have set up many identical clusters in the last few months, but this one in particular has been having lots of problems lately.
OS: Windows Server 2008 R2 latest patches
SQL: SQL Server 2008 SP1 CU5 (version: 10.0.2746)
Disks: VMAX SAN
In the very beginning, when I set up the cluster, cluster validation kept failing when I tried to add a node. It turned out that we had to unjoin the servers from the domain and rejoin them. Right after that, cluster validation succeeded, the second node joined the cluster, and quorum was changed to "Node and Disk Majority". SQL Server was installed and set up as active/active (one SQL Server instance on each node). There were no issues until SQL Server was actually in use.
Symptoms:
SQL Server runs heavy ETL processing and is also a SQL replication subscriber. The following errors first appeared on 4/24, when I set up SQL replication (these servers are replication subscribers):
EventID: 1127, Source: FailoverClustering, Task Category: Network Manager
Cluster network interface 'man1fscl01a - Hartbeat to man1fscl91b' for cluster node 'man1fscl01a' on network 'ClusterHeartBeat' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
EventID: 1130, Source: FailoverClustering, Task Category: Network Manager
Cluster network 'ClusterHeartBeat' is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
We then started getting more errors on the cluster from 5/14, after I started the ETL processing job. The following entries have been consistently present in the Windows System log, sometimes multiple times a day. The SQL Server instances have been failing over constantly (I can see multiple SQL Server error logs per day over the last week).
EventID: 1592, Source: FailoverClustering, Task Category: Node-to-Node Communications
Cluster node 'man1fscl01a' lost communication with cluster node 'man1fscl01b'. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
EventID: 4201, Source: Iphlpsvc
Isatap interface isatap.{640E853E-3232-4A2D-8095-076A798D85AE} is no longer active.
EventID: 1135, Source: FailoverClustering, Task Category: Node Mgr
Cluster node 'man1fscl01b' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
EventID: 1069, Source: FailoverClustering, Task Category: Resource Control Manager (would happen a few times in a row)
Cluster resource 'Drive Q:' in clustered service or application 'Cluster Group' failed.
EventID: 1177, Source: FailoverClustering, Task Category: Quorum Manager
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
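For context on event 1177: under "Node and Disk Majority", a two-node cluster has three votes (one per node plus one for the disk witness), and the Cluster service keeps running only while more than half of those votes are reachable. A minimal sketch of that arithmetic follows; `has_quorum` is a hypothetical helper for illustration, not part of any cluster API.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Strict majority rule: more than half of all votes must be reachable."""
    return votes_online * 2 > total_votes


# Two nodes plus one disk witness, as in "Node and Disk Majority".
TOTAL_VOTES = 3

# Both nodes and the witness disk are reachable.
print(has_quorum(3, TOTAL_VOTES))  # True

# One node has lost contact with its partner but still owns the witness disk.
print(has_quorum(2, TOTAL_VOTES))  # True

# One node has lost contact with its partner and the witness disk.
print(has_quorum(1, TOTAL_VOTES))  # False
```

If 'Drive Q:' is the witness disk (an assumption, based on it living in 'Cluster Group'), a node that loses both its partner and that disk is left with one vote out of three, which would be consistent with 1135 and 1069 being followed by 1177.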
EventID: 7036, Source: Service Control Manager
The Cluster Service service entered the stopped state.
EventID: 7024, Source: Service Control Manager
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..
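To check whether the entries above really line up with the ETL windows, the events could be tallied per hour and compared against the job schedule. This is only a rough sketch: it assumes the System log has been exported to (timestamp, event ID) pairs, the timestamp layout in `time_format` is a guess that would need adjusting to the actual export, and `events_per_hour` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

# Cluster-related event IDs quoted in this post.
CLUSTER_EVENT_IDS = {1127, 1130, 1592, 1135, 1069, 1177}


def events_per_hour(rows, time_format="%m/%d/%Y %I:%M:%S %p"):
    """Count cluster events per hour.

    rows: iterable of (timestamp_string, event_id) pairs.
    time_format: assumed timestamp layout; adjust it to match the export.
    """
    counts = Counter()
    for stamp, event_id in rows:
        # Skip anything that is not one of the cluster events of interest.
        if int(event_id) not in CLUSTER_EVENT_IDS:
            continue
        when = datetime.strptime(stamp, time_format)
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return counts


# Placeholder rows that only show the input shape; these are NOT real log data.
sample = [
    ("01/01/2000 02:15:07 AM", 1592),
    ("01/01/2000 02:40:31 AM", 1135),
    ("01/01/2000 02:41:00 AM", 7036),  # not in CLUSTER_EVENT_IDS, so skipped
    ("01/01/2000 03:05:12 AM", 1177),
]

for hour, count in sorted(events_per_hour(sample).items()):
    print(hour, count)
```

Hours with a high count that coincide with the ETL job or with replication activity would support the idea that load is the trigger; an even spread would point elsewhere.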
I asked the hardware team to check the NICs, and they said the drivers are all up to date. I asked the network team to check for network issues, and they reported none; however, since these servers are nowhere near us (they are on the opposite side of the continent), I am not sure they really checked the network there. I also asked the SAN team to check for disk issues, because SQL Server and the cluster seem to fail only when SQL Server is busy with ETL processing and I/O usage is high. The SAN team reported that they found no issues. I am at a loss here. The hardware team insists that these servers need to be re-imaged, which would mean another 2 weeks before they are back to where they are now. I just cannot believe that is the only solution. Please help!
G