Hi,
on a SQL Server 2012 cluster with Availability Groups pushing from a primary to a secondary replica, small databases seem to have a very low Redo Rate (KB/s). The reason I'm questioning this is that we have a third-party SQL monitoring tool that marks our servers as 'Critical' all the time, because some of these databases have redo rates below 500 KB/s (that's the monitoring tool's threshold), and some of ours are listed at 4 KB/s. The theory is that in a failover situation, replication is perceived to be so slow that the primary and secondary replicas may not be adequately synced.
We have some more heavily used databases in the same AG, with the same primary and secondary replicas, etc., that run at 20,000+ KB/s, so we don't think it is a bandwidth issue.
As a test I created a new database, added it to the AG and left it alone: no tables, no data, nothing. Its redo rate was logged at 44 KB/s (in sys.dm_hadr_database_replica_states) and my alerting tool marked it as critical. I then created a table (on the primary replica) and wrote a little script to push some junk data into it. I had Perfmon running on the secondary replica monitoring SQL Server:Database Replica > Redone Bytes/sec, redo_queue_size (KB) and redo_rate, amongst others. Immediately upon starting my script on the primary, the counters jumped on the secondary. Within about two minutes of pummelling the database, my redo rate had risen (from 44 KB/s) to 1,850 KB/s and my alerting software decided my replica was no longer in a critical state.
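In case it helps anyone reproduce this, here is roughly what I ran. The table name and row count are just what I happened to use for the test; the DMV columns are the standard ones from sys.dm_hadr_database_replica_states (redo_rate and redo_queue_size are reported in KB/s and KB):

```sql
-- On the PRIMARY replica: create a table and push some junk data in
-- (dbo.RedoTest and the 100,000-row count are just my test values).
CREATE TABLE dbo.RedoTest
(
    id      int IDENTITY PRIMARY KEY,
    payload char(8000) NOT NULL       -- fat rows to generate log quickly
);
GO

SET NOCOUNT ON;
DECLARE @i int = 0;
WHILE @i < 100000
BEGIN
    INSERT dbo.RedoTest (payload) VALUES (REPLICATE('x', 8000));
    SET @i += 1;
END;
GO

-- On the SECONDARY replica: watch the redo numbers per local database.
SELECT
    DB_NAME(drs.database_id)          AS database_name,
    drs.synchronization_state_desc,
    drs.redo_queue_size,              -- KB of log waiting to be redone
    drs.redo_rate,                    -- KB/s, the value the monitoring tool alerts on
    drs.log_send_queue_size,          -- KB
    drs.log_send_rate                 -- KB/s
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;
```

Watching that SELECT (or the equivalent Perfmon counters) while the insert loop runs is what showed the redo rate climbing from 44 KB/s to ~1,850 KB/s.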
Sorry if this is long-winded, but is there some algorithm that averages the amount of data exchanged between replicas over a period of time? The redo rate seems like it should be calculated from something lower in the OSI stack than whether the specific database currently has (or has recently had) much data to replicate.
Thanks!
Matt