Solved

AUX Copy optimization from Disklib to S3 library

  • 21 December 2021
  • 7 replies
  • 1092 views


We’re running CV11.24.25 with a two-node (physical) grid using CIFS mount paths from a Nexsan Unity. The grid takes secondary copies from the MAs that perform backups (no direct backups other than DDB), with a partition on each MA. We decided to replace this with a four-node (virtual) grid with S3 (NetApp) storage. The four-node grid was set up with a global dedupe policy based on a 512KB dedupe block size, with a partition on each node. The two-node grid uses the standard 128KB dedupe block size.

We had ~600TB of back-end storage (~3.3PB of front-end) and have ~1.75PB front-end left to process after about two months of copying. There were 105 storage policies (multi-tenant env) with retentions ranging from 30 days to 12 years (DB, file, VM, O365 apps) with anything higher than 30 days being extended retentions (normally 30 days/1 cycle and then monthly/yearly with extended retention).
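As a rough sanity check on those figures, here is a minimal sketch of the average front-end rate so far and the time left at that rate. The 60-day window (for "about two months") and the constant copy rate are assumptions, not measurements from this environment.

```python
# Figures quoted in the post above. ELAPSED_DAYS is an assumption
# ("about two months"), and so is a constant copy rate.
TOTAL_FRONT_END_PB = 3.3
REMAINING_FRONT_END_PB = 1.75
ELAPSED_DAYS = 60

copied_pb = TOTAL_FRONT_END_PB - REMAINING_FRONT_END_PB
# Decimal units: 1 PB = 1000 TB.
avg_tb_per_hour = copied_pb * 1000 / (ELAPSED_DAYS * 24)
days_remaining = REMAINING_FRONT_END_PB / copied_pb * ELAPSED_DAYS

print(f"copied so far: {copied_pb:.2f} PB front-end")
print(f"average rate:  {avg_tb_per_hour:.2f} TB/hr front-end")
print(f"days left at that rate: {days_remaining:.0f}")
```

The average hides the swings described below, so treat the remaining-time figure as an order-of-magnitude estimate only.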

We do not seem able to maintain any reasonably high copy rates. Having looked at other conversations here we’ve tried disk/network read optimized, toggling space optimized, disabled “pick from running jobs” and a few of the read ahead additional settings (documented and undocumented) and currently have none of these settings (there are default values as we see read aheads in the logs).

Some storage policies have thousands of jobs (from MBs to TBs depending on agent and backup type). Sometimes we get a high rate with a few readers (2TB/Hr), which obviously depends on the dedupe rate, but mostly we get tens of GB/Hr. Other times we only get high rates with high numbers of readers (40+), but the rates here are skewed by small jobs. If we run several AUX copies in parallel we soon hit problems reading from storage (SMB timeouts generating poorly trapped errors, resulting in the stream count dropping until the job is suspended/resumed or killed/re-run).

If I look at the CVPerfMgr.log files on the destination media agents, they show high waits for data from the source media agents. So the issue is in reading and sending the data. If we stop all activity and run “validate mount path” jobs we see high throughput rates, and similarly if we run CVDiskPerf.exe or even a recursive copy at the Windows prompt.

It seems the AUX copies just can’t request data fast enough from the disk system especially when data is unique. So,

  1. Does anyone have any suggestions on how we can get the copy rates higher?
  2. Also can anyone tell us what processing and where this processing is performed in order to go from the 128KB dedupe size on the source data to the 512KB on the destination data (we’ve recently seen that the recommendations are now to match source and destination when the destination is cloud but obviously can’t change that easily now).
  3. Is there a definitive list of additional settings that can be used for reading the data (with detail on what they do)?
  4. Where can we extract/consolidate performance data from on the source media agents? CVPerfMgr doesn’t show anything (expected, as this seems to be a destination-server-only feature). I’ve attached log extracts from each destination node for one job for reference, along with an extract from one of the source nodes.
  5. Is it better to have one node doing all the work for one AUX job or to split between the nodes? I’ve tried both with little obvious difference.

Apologies if this is too long but I’ve tried to include all the pertinent detail up front.

 

Best answer by Jordan


7 replies

  • Vaulter
  • 135 replies
  • December 22, 2021

Hi @Mike London UK 

 

Looks like you’ve gone through 2 months of pain already. Unfortunately due to the mismatched block size between source and destination copy, the entire application size worth of data at source (at 128KB) will need to be read in order to process the destination (at 512KB). This means that DASH copy is essentially not running optimally at all.

There is unfortunately no way to fix this except to start again from scratch, but with a matched 128KB destination DDB using the S3 StorageGrid (I assume?).

 

With matched 128KB block size, you will start seeing DASH copy benefits where signatures read at source will actually exist on destination, thus negating the need to actually read any data blocks if that signature already exists on destination.

 

Right now with the mismatch, no signature will ever match the destination copy, so the source MA needs to read four 128KB blocks to generate a new 512KB signature before it can check that against the destination DDB. 
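To illustrate why signatures taken at different block sizes cannot match, here is a minimal sketch that hashes the same data at 128KB and 512KB boundaries. SHA-256 and the `signatures` helper are stand-ins chosen for illustration; this is not a description of Commvault's actual signature algorithm.

```python
import hashlib
import os

KB = 1024

def signatures(data: bytes, block_size: int) -> set[str]:
    """Hash each fixed-size block; a stand-in for dedupe signatures."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }

data = os.urandom(2048 * KB)        # 2MB of sample data

src = signatures(data, 128 * KB)    # source-style blocks
dst = signatures(data, 512 * KB)    # destination-style blocks

print("128KB signatures:", len(src))
print("512KB signatures:", len(dst))
print("signatures in common:", len(src & dst))
```

Because a hash of four blocks taken together is unrelated to the hashes of the individual blocks, the two sets share nothing even though the underlying data is identical.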

 

512KB block size on destination will also mean lower overall dedupe saving, so even at the end of the copy, you may find that destination DDB uses significantly more disk space than source DDB for same application size. 

 

Hope this answers your questions here.

 

Thank you



Thanks @Jordan, we were prepared for the dedupe reduction on this final copy. It’s the reading that’s got me perplexed: one job with 20 readers split between the two source MAs will get the disk array busy at more than ~600MB/s, with CV reporting 2TB/Hr (not great but reasonable given the dedupe block conversion overhead). Introducing another job with 20 streams, again split between the two MAs, seems to reduce the back-end rate to ~200MB/s, with CV reporting less than ~100GB/Hr for both. Suspending the second job allows the first to ramp back up to the speeds attained before the second job started (resume and total throughput drops again; suspend and the first job ramps up).

It’s almost as if there is a single thread somewhere in the AUX copy process that is controlling the reads from the disk library. I know the array’s read cache may take a hit, but not to this extent. Could it be that there is a common cache for all jobs, and that this gets filled by one job while the others have to wait for empty slots/buffers in that cache? All the additional settings I’ve seen (some discussed in “Additional settings and uses , can they be common for all media agents”) seem to be around optimising the signature lookup process. I’m in the market for read-optimising attributes if any exist.
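To compare the array-side and CV-side figures above in the same units, a small conversion sketch follows. It assumes decimal units (1TB = 1,000,000MB), which may not be how either tool counts.

```python
def mb_per_s_to_tb_per_hr(mb_per_s: float) -> float:
    """Convert MB/s to TB/hr using decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_s * 3600 / 1_000_000

one_job = mb_per_s_to_tb_per_hr(600)    # array rate with a single job
two_jobs = mb_per_s_to_tb_per_hr(200)   # array rate with both jobs running

print(f"600MB/s is about {one_job:.2f} TB/hr")
print(f"200MB/s is about {two_jobs:.2f} TB/hr")
```

On those assumptions, the single-job back-end read rate lands close to the 2TB/Hr that CV reports, which suggests nearly every byte being reported is physically read from the source.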


  • Vaulter
  • 135 replies
  • Answer
  • December 22, 2021

Hi @Mike London UK 


The behaviour you report is actually not to do with Commvault, but rather disk contention. Commvault has coordinator threads for each Aux Copy job (they don’t talk to each other or have common cache etc).

Each job will just try to utilize the max number of streams possible. Each stream has multiple threads and each thread has a number of buffers in memory (default is 90). Each memory buffer is only a small amount of space (64KB usually). 

 

So unless your MA is running out of RAM memory, there wouldn’t be any caching or stream limits from Commvault end. 
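As a back-of-envelope check on that point, the sketch below multiplies out the buffer figures quoted above. The thread count per stream is not given in this thread, so `threads_per_stream=4` is a purely hypothetical value.

```python
BUFFERS_PER_THREAD = 90   # default quoted above
BUFFER_SIZE_KB = 64       # "usually", per the reply above

def buffer_memory_mb(streams: int, threads_per_stream: int) -> float:
    """Approximate buffer memory in MB for a given stream/thread count."""
    total_kb = streams * threads_per_stream * BUFFERS_PER_THREAD * BUFFER_SIZE_KB
    return total_kb / 1024

per_thread = buffer_memory_mb(1, 1)
# Hypothetical: 40 streams (two jobs of 20 readers), 4 threads per stream.
forty_streams = buffer_memory_mb(40, threads_per_stream=4)

print(f"one thread: {per_thread} MB")
print(f"40 streams x 4 threads: {forty_streams} MB")
```

Even with generous assumptions, the buffers add up to well under a gigabyte, which fits the statement that RAM is unlikely to be the limit here.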

 

When running more streams results in less throughput, the usual cause is IOPS load on the storage system. When storage can’t keep up with the requests, they start queueing, and the disk queue length grows. The longer the queue, the longer each request waits. This means that with too many streams going, many streams end up waiting while the disks constantly switch between requests, and the net output is worse than if you had fewer requests. 

 

You can also see this behaviour with any third-party disk performance tool such as CrystalDiskMark. Up to a certain queue depth, disk performance gets better, but if you keep pushing the queue depth higher and higher, the disks reach a threshold and essentially “fall off a cliff”. 
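The queueing effect can be sketched with Little's law (mean wait = requests outstanding / requests served per second). The service rate below is a made-up number, not a measurement of any array, and this simple model only shows waits growing once the storage is saturated. It does not model the extra seek and cache thrash that makes real throughput fall.

```python
def wait_seconds(outstanding_requests: int, served_per_second: float) -> float:
    """Little's law: mean time in system = items in system / throughput."""
    return outstanding_requests / served_per_second

SERVED_PER_SECOND = 5_000   # hypothetical saturated array, requests/s

waits = {n: wait_seconds(n, SERVED_PER_SECOND) for n in (50, 500, 5_000)}
for outstanding, wait in waits.items():
    print(f"{outstanding:>5} outstanding -> {wait:.2f}s mean wait per request")
```

Once the array is serving as fast as it can, adding streams only adds outstanding requests, so each request waits proportionally longer.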

 

 



Thanks @Jordan , do you know of any log entries we can look for in the CV logs to indicate the jobs are waiting on reads from the storage?

Given that we’re using CIFS mount paths, does anyone have any recommendations on SMB configuration on a media agent? From my research there aren’t many options to play with on the client or server side. I have increased SessionTimeout (client setting) from its default 60 to 600 which stabilised the AUX copy process.


  • Vaulter
  • 135 replies
  • December 24, 2021

Hi @Mike London UK 

 

The high wait times you saw in CVPerfMgr indicate CV waiting for storage. Coupled with the behaviour you described, where more streams result in less overall throughput, this points to a storage contention issue.



Just to close this out, we have created a new global deduplication policy with a 128KB dedupe block size. The remaining storage policies copied to this much more quickly, and we are now in the process of copying the data in the 512KB global dedupe policy to the new one. This will take some time and is more CPU bound than IO bound, but we can cope with this (we have added new media agents to help with the load).

Thanks for all your help & comments.


Ross
  • Vaulter
  • 30 replies
  • March 3, 2023

Just to clarify… back in the day when DASH Copy was first introduced (V9 times) we introduced two operating modes, “Network Optimized” and “Disk Read Optimized”.

  • The difference between the two was all to do with the read operation.
  • The outcome was a dedupe optimized secondary copy in both cases.

However, with “Network Optimized” we rehydrate the data (if it is deduped already) to produce a new signature rather than read the signature on source disk. We check the signature into the destination store in the same way. The only difference here is that we have to generate the signature. This mode is best suited to situations where the source is not deduped by Commvault. This mode will also run if the source is deduped but with a different dedupe block size, because a 128K signature will not align to a 512K signature.

With “Disk Read Optimized” we do not go through that rehydration. We read the signature from source disk and check that into the destination. So a big difference in read IO here, which explains the throughput outcome.

The default operating mode originally was “Network Optimized” and customers suffered from poor throughput even though source and destination dedupe block may have been the same. Then we switched up and in V10 the default became “Disk Read Optimized”, which makes more sense.

Also, it used to be a best practice to always use 512K for cloud targets. This is still true when backing up directly to cloud targets (FlashBlade being the exception: always use 128K there). Since 11.23, however, the revised best practice is to match source and destination dedupe block size for optimal throughput. Using 128K will always result in better dedupe reduction, but with DASH Copy the key thing is throughput. Using 128K on cloud storage will incur extra overhead when doing large reads, because unravelling 1,000,000 128K blocks is more work than unravelling 250,000 512K blocks, but DEV have made some performance enhancements to offset some of that overhead.

If changing from 512K to 128K, be aware that you should then immediately seal the store to start a new DDB. Changing the block size will mean a new dedupe baseline. So extra bandwidth will be consumed and extra storage will be consumed until things settle down. Also, the DDB for a 128K store will be larger than for a 512K store since there are 4x more primary records being managed.
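The block counts and the 4x record ratio mentioned above follow directly from the block sizes. Here is a minimal sketch, assuming one primary record per unique block and taking "K" as 1024 bytes (both assumptions made for illustration).

```python
KIB = 1024

def block_count(data_bytes: int, block_kib: int) -> int:
    """Number of fixed-size blocks needed to hold data_bytes (rounded up)."""
    block_bytes = block_kib * KIB
    return (data_bytes + block_bytes - 1) // block_bytes

# The example above: the amount of data held in 1,000,000 blocks of 128K.
data_bytes = 1_000_000 * 128 * KIB

small = block_count(data_bytes, 128)
large = block_count(data_bytes, 512)

print(f"128K blocks: {small:,}")
print(f"512K blocks: {large:,}")
print(f"ratio: {small / large:.0f}x")
```

The same ratio applies to the number of signatures looked up during a read and, on the stated assumption, to the number of primary records the DDB has to manage.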

regards.

