Hi @Mike London UK
Looks like you’ve gone through 2 months of pain already. Unfortunately, due to the mismatched block size between the source and destination copies, the entire application size worth of data at the source (at 128KB) has to be read in order to process the destination (at 512KB). This means the DASH copy is essentially not running optimally at all.
Unfortunately there is no way to fix this except to start again from scratch with a matched 128KB destination DDB on the S3 StorageGrid (I assume?).
With a matched 128KB block size you will start seeing the DASH copy benefits: signatures read at the source will often already exist on the destination, which removes the need to read the underlying data blocks at all.
Right now, with the mismatch, no signature will ever match the destination copy, so the source MA has to read four 128KB blocks to generate a new 512KB signature before it can check that against the destination DDB.
A 512KB block size on the destination also means lower overall dedupe savings, so even at the end of the copy you may find the destination copy uses significantly more disk space than the source copy for the same application size.
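To put some rough numbers on the read amplification (the 1 TB application size below is purely an assumed figure for the example, not your environment):

```python
# Back-of-the-envelope read amplification from the 128KB -> 512KB mismatch.
# The 1 TB application size is an assumed figure for illustration only.
SOURCE_BLOCK_KB = 128
DEST_BLOCK_KB = 512
APP_SIZE_TB = 1

app_size_kb = APP_SIZE_TB * 1024**3          # TB -> KB
source_blocks = app_size_kb // SOURCE_BLOCK_KB
dest_signatures = app_size_kb // DEST_BLOCK_KB

print(f"Source 128KB blocks read:   {source_blocks:,}")   # every block must be read
print(f"Destination 512KB sigs:     {dest_signatures:,}")
print(f"Source reads per signature: {DEST_BLOCK_KB // SOURCE_BLOCK_KB}")
```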
Hope this answers your questions here.
Thank you
Thanks @Jordan, we were prepared for the dedupe reduction on this final copy. It’s the reading that’s got me perplexed: one job with 20 readers split between the two source MAs gets the disk array busy at around 600MB/s, with CV reporting 2TB/hr (not great, but reasonable given the dedupe block conversion overhead). Introducing a second job with 20 streams, again split between the two MAs, seems to reduce the backend rate to ~200MB/s, with CV reporting <~100GB/hr for both. Suspending the second job allows the first to ramp back up to the speeds attained before the second job started (resume it and total throughput drops again; suspend it and the first job ramps up).
It’s almost as if there is a single thread somewhere in the AUX copy process that is controlling the reads from the disk library. I know the array’s read cache may take a hit, but not to this extent. Could it be that there is a common cache for all jobs, and that this gets filled by one job so the others have to wait for empty slots/buffers in that cache? All the additional settings I’ve seen (some discussed in “Additional settings and uses, can they be common for all media agents”) seem to be around optimising the signature lookup process. I’m in the market for read-optimising attributes, if any exist.
Hi @Mike London UK
The behaviour you report is actually not down to Commvault, but rather disk contention. Commvault has coordinator threads for each Aux Copy job (they don’t talk to each other or share a common cache, etc.).
Each job will just try to utilize the maximum number of streams possible. Each stream has multiple threads, and each thread has a number of buffers in memory (the default is 90). Each memory buffer is only a small amount of space (usually 64KB).
So unless your MA is running out of RAM, there wouldn’t be any caching or stream limits from the Commvault end.
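To give a feel for how small that memory footprint is, here is a quick back-of-the-envelope calculation; the thread and stream counts are just assumed example values, not anything read from your setup:

```python
# Rough per-job memory estimate for the pipeline buffers (example values only).
BUFFERS_PER_THREAD = 90     # default number of buffers per thread
BUFFER_SIZE_KB = 64         # typical buffer size
THREADS_PER_STREAM = 4      # assumed for illustration
STREAMS = 20                # e.g. 20 readers in one Aux Copy job

per_thread_mb = BUFFERS_PER_THREAD * BUFFER_SIZE_KB / 1024
per_stream_mb = per_thread_mb * THREADS_PER_STREAM
job_total_mb = per_stream_mb * STREAMS

print(f"~{per_thread_mb:.1f} MB per thread, ~{per_stream_mb:.1f} MB per stream, "
      f"~{job_total_mb:.0f} MB for a {STREAMS}-stream job")
```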
Usually, when you see more streams resulting in less throughput, it is due to the IOPS load on the storage system. When the storage can’t keep up with the requests, things start queueing and the disk queue lengths grow; the longer the queue, the longer each request waits. So when you have too many streams going, many of them end up waiting while the disks constantly switch between requests, and the net output is worse than if you had fewer requests in flight.
You can also see this behaviour with any third-party disk performance tool like CrystalDiskMark: up to a certain queue depth, disk performance gets better, but if you keep pushing the queue depth higher and higher the disks reach a threshold and performance essentially “falls off a cliff”.
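If you want to watch that behaviour directly on the MA while the jobs run, a simple sampler like the sketch below (Python with the psutil package; the disk name is an assumption you would replace with whatever backs your disk library) will show read IOPS and throughput per interval, so you can see the plateau or drop as streams are added:

```python
# Simple per-disk read IOPS/throughput sampler using psutil (pip install psutil).
# Replace "PhysicalDrive1" with the disk backing your mount path / disk library.
import time
import psutil

DISK = "PhysicalDrive1"   # assumed disk name; on Linux this would be e.g. "sdb"
INTERVAL = 5              # seconds between samples

prev = psutil.disk_io_counters(perdisk=True)[DISK]
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)[DISK]
    read_iops = (cur.read_count - prev.read_count) / INTERVAL
    read_mbps = (cur.read_bytes - prev.read_bytes) / INTERVAL / 1024**2
    print(f"reads/s={read_iops:,.0f}  read MB/s={read_mbps:,.1f}")
    prev = cur
```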
Thanks @Jordan, do you know of any log entries we can look for in the CV logs to indicate the jobs are waiting on reads from the storage?
Given that we’re using CIFS mount paths, does anyone have any recommendations on SMB configuration on a media agent? From my research there aren’t many options to play with on the client or server side. I have increased SessionTimeout (a client setting) from its default of 60 to 600, which stabilised the AUX copy process.
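In case it helps anyone else, a quick way to check what the media agent currently has set is the snippet below; it assumes the setting lives in the registry as LanmanWorkstation\Parameters\SessTimeout, so verify the value name and path for your Windows build before relying on it:

```python
# Read the SMB client session timeout on a Windows media agent.
# Assumes the value is LanmanWorkstation\Parameters\SessTimeout -- confirm for your OS build.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    try:
        value, _ = winreg.QueryValueEx(key, "SessTimeout")
        print(f"SessTimeout = {value} seconds")
    except FileNotFoundError:
        print("SessTimeout not set; SMB client default (60 seconds) applies")
```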
Hi @Mike London UK
The high wait times you saw in CVPerfMgr indicate CV waiting on storage; coupled with the behaviour you described, where more streams result in less overall throughput, this points to a storage contention issue.
Just to close this out: we have created a new global deduplication policy with a 128KB dedupe block size. The remaining storage policies copied to this much more quickly, and we are now in the process of copying the data in the 512KB global dedupe policy to the new one. This will take some time, but it is more CPU-bound than IO-bound and we can cope with that (we have added new media agents to help with the load).
Thanks for all your help & comments.
Just to clarify… back in the day when DASH Copy was first introduced (V9 times) we introduced two operating modes, “Network Optimized” and “Disk Read Optimized”.
- The difference between the two was all to do with the read operation.
- The outcome was a dedupe optimized secondary copy in both cases.
However, with “Network Optimized” we rehydrate the data (if it is already deduped) to produce a new signature rather than read the signature from the source disk. We check the signature into the destination store in the same way; the only difference is that we have to generate the signature. This mode is best suited to situations where the source is not deduped by Commvault. It will also be used if the source is deduped but with a different dedupe block size, because a 128K signature will not align to a 512K signature.
With “Disk Read Optimized” we do not go through that rehydration: we read the signature from the source disk and check that into the destination. So there is a big difference in read IO here, which explains the throughput outcome.
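To make the difference concrete, here is a deliberately simplified, illustrative sketch of the two read paths as described above; it is not Commvault’s actual implementation, and all names are invented for the example:

```python
# Deliberately simplified model of the two read paths (illustrative only --
# not Commvault's implementation; all names here are invented for the sketch).
import hashlib

destination_ddb = set()   # stands in for the destination dedupe database

def network_optimized_copy(block_data: bytes) -> int:
    """Rehydrate and re-signature every block: the source data is always read."""
    bytes_read = len(block_data)                      # full data read from source disk
    sig = hashlib.sha1(block_data).hexdigest()        # signature generated on the fly
    destination_ddb.add(sig)
    return bytes_read

def disk_read_optimized_copy(block_data: bytes, stored_sig: str) -> int:
    """Reuse the signature stored with the source copy: data is read only if it is new."""
    if stored_sig in destination_ddb:                 # signature lookup only, no data read
        return 0
    destination_ddb.add(stored_sig)
    return len(block_data)                            # read/send only blocks the destination lacks
```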
The default operating mode was originally “Network Optimized”, and customers suffered from poor throughput even when the source and destination dedupe block sizes were the same. Then we switched up, and in V10 the default became “Disk Read Optimized”, which makes more sense.
Also, it used to be a best practice to always use 512K for cloud targets. This is still true when backing up directly to cloud targets (FlashBlade being the exception: always use 128K there), but since 11.23 the revised best practice for DASH Copy is to match the source and destination dedupe block sizes for the optimal throughput outcome. Using 128K will always result in better dedupe reduction, but with DASH Copy the key thing is throughput. Using 128K on cloud storage will incur extra payload when doing large reads, because unravelling 1,000,000 128K blocks is more overhead than 250,000 512K blocks, but DEV have made some performance enhancements to offset some of those overheads.
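To put numbers on that request-count comparison (this just restates the figures in the paragraph above):

```python
# The request-count arithmetic behind the 128K vs 512K cloud read comparison.
blocks_128k = 1_000_000
data_kb = blocks_128k * 128                 # total data covered by those blocks
blocks_512k = data_kb // 512

print(f"~{data_kb / 1024**2:.0f} GB read -> "
      f"{blocks_128k:,} x 128K requests vs {blocks_512k:,} x 512K requests")
```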
If changing from 512K to 128K, be aware that you should immediately seal the store to start a new DDB. Changing the block size means a new dedupe baseline, so extra bandwidth and extra storage will be consumed until things settle down. Also, the DDB for a 128K store will be larger than for a 512K store, since there are 4x more primary records being managed.
Regards.