We’re running CV 11.24.25 with a two-node physical grid using CIFS mount paths from a Nexsan Unity. It takes secondary copies from the MAs that perform the backups (no direct backups other than the DDB), with a DDB partition on each MA. We decided to replace this with a four-node virtual grid backed by S3 (NetApp) storage. The four-node grid was set up with a global dedupe policy using a 512KB dedupe block size and a partition on each node; the two-node grid uses the standard 128KB dedupe block size.
We had ~600TB of back-end storage (~3.3PB of front-end) and have ~1.75PB front-end left to process after about two months of copying. There were 105 storage policies (multi-tenant env) with retentions ranging from 30 days to 12 years (DB, file, VM, O365 apps) with anything higher than 30 days being extended retentions (normally 30 days/1 cycle and then monthly/yearly with extended retention).
We do not seem able to maintain any reasonably high copy rates. Based on other conversations here we’ve tried disk/network read optimized, toggled space optimized, disabled “pick from running jobs”, and set a few of the read-ahead additional settings (documented and undocumented). Currently we have none of these settings applied (defaults appear to be in effect, as we still see read-aheads in the logs).
Some storage policies have thousands of jobs (from MBs to TBs depending on agent and backup type). Sometimes we will see high rates with a few readers (2TB/hr), which obviously depends on the dedupe rate, but mostly we’ll see tens of GB/hr. Other times we will only get high rates with high numbers of readers (40+), but the rates there are skewed by small jobs. If we run several AUX copies in parallel we soon hit problems reading from storage (SMB timeouts generating poorly trapped errors, with the stream count dropping until the job is suspended/resumed or killed/re-run).
If I look at the CVPerfMgr.log files on the destination media agents, they show high waits for data from the source media agents. So the issue is in reading and sending the data. If we stop all activity and run “validate mount path” jobs we see high throughput rates; similarly if we run CVDiskPerf.exe, and even a recursive copy from a Windows command prompt.
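For comparison with the CVDiskPerf.exe and recursive-copy tests above, a small sequential-read micro-benchmark can be run directly against a mount path to get an independent MB/s figure. This is a generic sketch (not a Commvault tool); the path argument and the 512KB read size mirroring the destination dedupe block size are our assumptions:

```python
import os
import time

def measure_read_throughput(path, block_size=512 * 1024, limit_bytes=1 << 30):
    """Walk `path` (e.g. a CIFS mount path -- placeholder, supply your own),
    read files sequentially in `block_size` chunks, and return MB/s.
    Stops after `limit_bytes` to keep the test short."""
    total = 0
    start = time.monotonic()
    for root, _, files in os.walk(path):
        for name in files:
            try:
                with open(os.path.join(root, name), "rb") as f:
                    while chunk := f.read(block_size):
                        total += len(chunk)
                        if total >= limit_bytes:
                            elapsed = time.monotonic() - start
                            return total / (1024 * 1024) / max(elapsed, 1e-9)
            except OSError:
                continue  # skip unreadable/locked files rather than abort
    elapsed = time.monotonic() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9)
```

Running this while AUX copies are quiesced versus while they are active can show whether the SMB path itself degrades under concurrent load, independent of any Commvault read-ahead settings.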
It seems the AUX copies just can’t request data fast enough from the disk system especially when data is unique. So,
- Does anyone have any suggestions on how we can get the copy rates higher?
- Also, can anyone tell us what processing is performed, and where, to go from the 128KB dedupe block size on the source data to 512KB on the destination data? (We’ve recently seen that the recommendation is now to match source and destination block sizes when the destination is cloud, but obviously we can’t change that easily now.)
- Is there a definitive list of additional settings that can be used for reading the data (with detail on what they do)?
- Where can we extract/consolidate performance data on the source media agents? CVPerfMgr doesn’t show anything there (expected, as this seems to be a destination-server-only feature). I’ve attached log extracts from each destination node for one job for reference, along with an extract from one of the source nodes.
- Is it better to have one node doing all the work for one AUX job or to split between the nodes? I’ve tried both with little obvious difference.
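On the block-size question above, our understanding (hedged, not an authoritative description of Commvault internals) is that a signature computed over a 128KB block cannot be reused for a 512KB block, so the copy has to read and rehydrate the source data before re-signing it at the destination size. A toy illustration with SHA-256 standing in for the real signature algorithm:

```python
import hashlib
import random

def signatures(data: bytes, block_size: int):
    """Split data into fixed-size blocks and hash each one --
    a simplified stand-in for dedupe signature generation."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

random.seed(42)
data = random.randbytes(1024 * 1024)     # 1 MiB of pseudo-random sample data

src = signatures(data, 128 * 1024)       # 8 signatures at the source block size
dst = signatures(data, 512 * 1024)       # 2 signatures at the destination size

# A 512KB signature is not derivable from the four 128KB signatures
# covering the same bytes, so none of the source signatures match:
assert len(src) == 8 and len(dst) == 2
assert not set(src) & set(dst)
```

If that understanding is right, it would explain why unique data is so expensive here and why the current guidance is to match block sizes between copies.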
Apologies if this is too long but I’ve tried to include all the pertinent detail up front.
Best answer by Jordan