Hi!
We have a constant problem with a growing REDO queue on one of our secondaries, which is used for reporting and various calculations – an analytic server of sorts.
The environment is deployed on Azure VMs, so we cannot control much ((
We have 1 primary and 2 secondaries. The primary and one of the secondaries are identical servers (8 CPU, 56 GB RAM), while the problem secondary is smaller (4 CPU, 28 GB RAM). All the servers, though, have the same disk configuration – Premium storage for t-logs (2,400 IOPS, 150 MB/sec) and for data (5,000 IOPS, 200 MB/sec), formatted with a 64 KB cluster size.
Both secondaries are in synchronous-commit mode.
The analytic server sometimes runs heavy jobs that select a lot of data from the AlwaysOn databases and insert it into analytic databases on other disks.
The other secondary is a warm standby; it does nothing besides AlwaysOn processing.
The strange thing is that the Redo Rate on the analytic secondary we have the problem with is only about 1,000-3,000 KB/sec and never rises above 5,000 KB/sec, while on the other secondary it is about 30,000 KB/sec – many times higher.
The Redo Queue Size on the analytic secondary periodically grows enormously – up to several GB and more – and it cannot catch up for hours; sometimes we have to remove the database from the configuration and completely re-add it from a backup.
This is strange because the disks are the same on both secondaries, and they really are fast.
The Log Send Rate is pretty intensive all the time, about 30,000-50,000 KB/sec.
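For what it's worth, this is roughly how I cross-check the perfmon counters on the secondaries (a minimal sketch against sys.dm_hadr_database_replica_states; queue sizes are reported in KB, rates in KB/sec):

-- Redo/log-send figures per database replica, worst redo queue first.
SELECT  DB_NAME(drs.database_id)            AS database_name,
        ar.replica_server_name,
        drs.synchronization_state_desc,
        drs.log_send_queue_size,            -- KB still waiting to be sent
        drs.log_send_rate,                  -- KB/sec
        drs.redo_queue_size,                -- KB still waiting to be redone
        drs.redo_rate                       -- KB/sec
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas           AS ar
        ON ar.replica_id = drs.replica_id
ORDER BY drs.redo_queue_size DESC;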
I collected the WAIT_INFO XEvent during one of the problem periods.
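The session I used looks roughly like this (a sketch only – the session name and file path are placeholders; the event is sqlos.wait_info, and I keep only the "end of wait" rows so that duration is populated):

-- Capture waits server-wide during the problem window.
CREATE EVENT SESSION redo_wait_info ON SERVER
ADD EVENT sqlos.wait_info
(
    ACTION (sqlserver.session_id, sqlserver.database_id)
    WHERE opcode = 1            -- end of the wait, duration is filled in
      AND duration > 0
)
ADD TARGET package0.event_file (SET filename = N'C:\Temp\redo_wait_info.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION redo_wait_info ON SERVER STATE = START;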
The first strange thing is that the query I use to identify the session_id of the REDO thread for the database in question (it shows up with the DB STARTUP command; the query is sketched below) returns 2 rows for the problem database, with the same session_id but two different waits:
- ASYNC_IO_COMPLETION
- SLEEP_BPOOL_FLUSH
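Roughly this (on my build the redo worker shows up in sys.dm_exec_requests with command = 'DB STARTUP'; the database name here is just a placeholder):

-- Find the redo thread for the problem database and see what it waits on.
SELECT  r.session_id,
        r.command,
        r.wait_type,
        r.wait_time,
        r.blocking_session_id
FROM    sys.dm_exec_requests AS r
WHERE   r.command = N'DB STARTUP'
  AND   r.database_id = DB_ID(N'MyAGDatabase');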
Then, after collecting the wait data for about 5 minutes and grouping/aggregating it, I get the picture of top waits shown in the attached screenshot (sorted by SUM of duration DESC).
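The aggregation itself is nothing special – roughly this, reading the file produced by the session above (same placeholder file path):

-- Shred the captured .xel file and sum wait durations per wait type.
WITH ev AS
(
    SELECT CAST(event_data AS xml) AS x
    FROM   sys.fn_xe_file_target_read_file(N'C:\Temp\redo_wait_info*.xel', NULL, NULL, NULL)
),
parsed AS
(
    SELECT x.value('(event/data[@name="wait_type"]/text)[1]', 'nvarchar(60)') AS wait_type,
           x.value('(event/data[@name="duration"]/value)[1]', 'bigint')       AS duration_ms
    FROM   ev
)
SELECT  wait_type,
        COUNT(*)         AS wait_count,
        SUM(duration_ms) AS total_duration_ms
FROM    parsed
GROUP BY wait_type
ORDER BY total_duration_ms DESC;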
So, what could be the cause of the problem?
Is it something with the disks, or could it be something else? I can hardly believe it's a disk problem, because the disks host only the AlwaysOn logs and data and are rated as really fast.
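To rule the disks in or out, I'm also going to look at the per-file latencies on the analytic secondary – roughly like this (sketch; the database name is a placeholder, and the stall counters are cumulative since instance start):

-- Average read/write latency per database file.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        CASE WHEN vfs.num_of_reads  = 0 THEN 0
             ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_ms,
        CASE WHEN vfs.num_of_writes = 0 THEN 0
             ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM    sys.dm_io_virtual_file_stats(DB_ID(N'MyAGDatabase'), NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id;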
Or could some other activity be causing these problems with the redo queue?
Maybe I need to dig into other causes (a couple of the checks I have in mind are sketched after this list), such as:
- Long-running active transactions
- High network latency / low network throughput
- A blocked REDO thread
- A shared REDO target
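For the first and third items I'm planning checks along these lines (same database name placeholder; the blocked-redo check just joins the waiting-tasks DMV to the DB STARTUP request found earlier):

-- Oldest active transaction in the database.
DBCC OPENTRAN (MyAGDatabase);

-- Is anything blocking the redo thread right now?
SELECT  wt.session_id,
        wt.wait_type,
        wt.wait_duration_ms,
        wt.blocking_session_id,
        wt.resource_description
FROM    sys.dm_os_waiting_tasks AS wt
JOIN    sys.dm_exec_requests    AS r
        ON r.session_id = wt.session_id
WHERE   r.command = N'DB STARTUP'
  AND   r.database_id = DB_ID(N'MyAGDatabase');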
Help is greatly appreciated.
I'm almost thinking about switching to good old log shipping )))