Question

Inexplicably and painfully slow Auxiliary Copy throughput for copy between media agents

  • 30 October 2023
  • 8 replies
  • 520 views

Badge +2

Dear Community,

 

Thank you in advance to everyone for your replies. We have a bit of a headscratcher on our hands, one that we have failed to resolve so far despite our best efforts. A support case is ongoing, but we were pointed back towards our own infrastructure. Let's start at the beginning.

 

On September 8th, my former colleague and Commvault admin (who has unfortunately left the company) enabled storage policy level encryption. This led to synthetic full backup jobs no longer running due to decryption issues, which in turn led to our primary media agent filling up.
On September 25th, I disabled storage policy encryption again because I thought that might fix the issue, but it did not. I created a spillover path on our offsite media agent to keep backup jobs running, and a few days later, the issue that caused the synthetic full jobs to fail was resolved. Some data was written to the spillover mount path (associated with the same library, but mounted on the offsite media agent).
Around September 20th, I noticed that auxiliary copy throughput for the copy between the primary and offsite media agents was extremely low (the usual throughput was 10TB/hr; it is now down to ~200GB/hr). At first I thought it might be related to the decryption issue we were battling at the time, but once that was resolved, aux copy throughput did not recover.
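To put the two throughput figures above in perspective, here is a minimal arithmetic sketch. It assumes decimal units (1 TB = 1000 GB), and the backlog size is a hypothetical value chosen only for illustration; it is not a figure from this thread.

```python
# Throughput figures quoted in the post, in GB per hour.
# Assumption: decimal units, so 10 TB/hr = 10,000 GB/hr.
NORMAL_GB_PER_HR = 10_000
DEGRADED_GB_PER_HR = 200


def slowdown_factor(normal: float, degraded: float) -> float:
    """How many times slower the degraded rate is than the normal rate."""
    return normal / degraded


def hours_to_copy(backlog_gb: float, rate_gb_per_hr: float) -> float:
    """Hours needed to copy a backlog at a constant rate."""
    return backlog_gb / rate_gb_per_hr


if __name__ == "__main__":
    # Hypothetical backlog of 100 TB, used only to illustrate the effect.
    backlog_gb = 100_000
    factor = slowdown_factor(NORMAL_GB_PER_HR, DEGRADED_GB_PER_HR)
    print(f"Slowdown factor: {factor:.0f}x")
    print(f"At normal rate:   {hours_to_copy(backlog_gb, NORMAL_GB_PER_HR):.0f} h")
    print(f"At degraded rate: {hours_to_copy(backlog_gb, DEGRADED_GB_PER_HR):.0f} h")
```

At the degraded rate, any backlog takes 50 times longer to clear, which is why data stops ageing on schedule once the copy falls behind.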


I opened a Commvault support case, and after several tests, read speeds on the media agents were found to be low (using Commvault's performance analysis tool), and disk queue length for the drive in question would not go below 50 during job activity. Even now, most auxiliary copy jobs are stuck with the "Total number of data transfer operations on the MediaAgent exceeds the maximum allowed value." error. In addition, aux copy throughput from the offsite media agent to tapes is also severely affected, corroborating the suspicion of a read speed related issue.


Based on this information, we upgraded the storage array firmware, drivers, etc., and opened a case with our hardware vendor, who told us after reviewing various logs that the disks and the storage arrays were completely healthy.


At this point I am at a bit of a loss. Commvault support says the issue must be on the infrastructure side, but we have checked the infrastructure side - network, hardware etc are all more than fine. Could this perhaps still be related to the encryption that was once enabled? Could it be related to the fact that there is a spillover mount path on the offsite media agent, mounted to the library of the primary one? I cannot move the mount path back to the primary media agent yet - since there is a substantial auxiliary copy job backlog, data is not ageing the way it is supposed to, and we are still short on space on our primary media agent.


Any advice would be greatly appreciated.
Thanks so much in advance!

 


8 replies

Badge +3

Hi @SahiNo,

Good day.

Parallel Data Transfer Operations option allows you to set the maximum number of concurrent read/write operations to the MediaAgent. This value controls the maximum number of data streams that can be managed by the MediaAgent.

https://documentation.commvault.com/2023e/expert/8871_setting_maximum_number_of_parallel_data_transfer_operations.html

Also, I understand that the auxcopy performance is due to the low read speed from the source library.

Please run CVDiskPerf on the source mount path to validate this, and involve your storage vendor depending on the results.
https://documentation.commvault.com/11.24/expert/8855_disk_performance_tool_01.html
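As a crude cross-check that is independent of CVDiskPerf, sequential read speed on a mount path can also be estimated with a few lines of Python. This is only a sketch under stated assumptions: the function name `measure_read_throughput` and the 4 MiB block size are illustrative choices, not anything Commvault uses. The operating system's file cache can also inflate the result, so the test file should be large and not recently read.

```python
import sys
import time


def measure_read_throughput(path: str, block_size: int = 4 * 1024 * 1024) -> float:
    """Read the file at `path` sequentially and return throughput in MiB/s.

    Note: results can be inflated by the OS file cache if the file was
    read recently; use a large file that has not been accessed lately.
    """
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; OS caching still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total_bytes / (1024 * 1024) / elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python read_test.py <path to a large file on the mount path>
    print(f"{measure_read_throughput(sys.argv[1]):.1f} MiB/s")
```

A single sequential stream does not reproduce the access pattern of many parallel aux copy streams, so a good number here does not rule out contention under load.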

 

Note: Ensure that the Commvault folders are excluded from AV scanning.

 

Regards,

Wasim

Badge +2

Hi Wasim, and thanks for your response!

 

I’ve already tried setting the maximum parallel transfer operations to a lower value, sadly to no avail.

 

"Also, I understand that the auxcopy performance is due to the low read speed from the source library."

 

The current job details read as follows:

Source Media Agent: OffSiteMA
Source Drive/Mount Path: IP | F:\Path | SourceMA
Source Media: CV_MAGNETIC
Destination Media Agent: OffSiteMA
Destination Drive/Mount Path: F:\Path_OffSiteMA

 

CVDiskPerf has already been run on both media agents with the following results:

Source MA: [results attachment not preserved]

Destination MA: [results attachment not preserved]

We have engaged the storage vendor, who told us that the disks and the storage arrays are completely healthy (upon reviewing various logs).

 

Folders are excluded from AV scanning just as before the issue started.

Badge +2

Update from the hardware vendor: hardware is totally fine. The read speeds are also similar in another Commcell, yet in that Commcell, throughput between media agents is 50x the throughput in this one.

Badge

I’m experiencing something very similar to you, but without the encryption piece. I’ve had many tickets opened with Commvault over the years because of this, and they all keep coming back to the same “Storage read speeds are low. Talk to storage vendor” response you are getting. We’ve talked to the storage vendor, and everything looks fine. We’ve checked our network, and everything is fine. There is lots of bandwidth available over the network. We’ve double and triple checked the A/V exclusions. We’ve even temporarily disabled A/V on the Commvault servers. DASH copies still remain painfully slow, honestly to the point where I don’t know why we’re even bothering to let them continue.

All my tickets are currently closed because I’m at the point of basically giving up on this, but I did recently make a change that had some interesting results.

Normally you would set up your DRmonthlycopy to come from the DRDiskcopy, but since my DRDiskcopy is so far behind, I decided to change the source of my DRmonthlycopy to the Localmonthlycopy to see if I could get that copy caught up. I was quite surprised and impressed by the speeds I got with this setup, but it raised a question: if storage read speeds are to blame for my Localdisk to DRdisk copies being so slow, then why aren’t they causing problems for my monthly copies when the Localmonthlycopy is the source for my DRmonthlycopy? They’re copying from the same storage over the same network.

There is something else going on here, but nobody seems to be able to figure it out, and everyone is basically passing the ball to a different vendor (storage) or piece of the puzzle (network). It’s been really frustrating. I need to open another ticket with this new information, but honestly, I just don’t see the point anymore.

Sean

Badge +2

Hi Sean,

 

Feeling your pain - our issue has not been going on for as long as yours, but it has been over a month. We need to keep the DASH copies, and we rely on a fix for this. We are re-checking some final options on the infrastructure side, but short of engaging Commvault Professional Services there is nothing else we can think of. Everything keeps pointing back at the application side. In any case, I will keep you updated.

Badge

Thanks! I appreciate it. Honestly, this has been happening to us right from the beginning of moving to Commvault. That being said, it was partly our own doing: we didn’t have enough storage on our DR side for quite some time, so our DASH copies were months behind before we even enabled them. The slowness has not helped us, though.

At one point I managed to get them caught up by picking a date and copying nothing from before that date, so that we could start fresh and get back on track, but it didn’t last long, and it’s been a downward spiral ever since.

I’ve had Commvault techs tell me they’ve never seen AUX copies so slow. That doesn’t make me feel any better.

Hope you are able to get some resolution and I look forward to hearing about it. I’ll be opening a new ticket with this new information I’ve found and if we find a resolution, I will be sure to pass that knowledge forward.

Badge +2

Hey @SeanG,

 

Our issue has been resolved. It took a long time, and I’m not sure it will help you since these issues are always very specific, but I promised a follow-up. The cause of the throughput issues for us was this: due to a previous (unrelated) issue, our primary media agent had filled up completely, so we added a spillover mount path from the DR site media agent to the source (primary) media agent’s library. Sharing was enabled for both media agents on that mount path, which was causing I/O errors.


Cheers, and happy holidays!

Badge +13

So during the auxiliary copy process, the primary media agent was reading the data from the DR site mount path and then sending the blocks to the DR MA, which wrote the jobs to the DR site storage. In other words, data was travelling twice over the link between sites.
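The interpretation in this reply can be sketched as a small model that counts how often data crosses the inter-site link. Everything here is an illustrative simplification: the class `CopyPath`, the function `inter_site_hops`, and the site names are hypothetical, and the model ignores deduplication and any other Commvault-specific behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyPath:
    mount_path_site: str  # site where the source data physically resides
    reader_site: str      # site of the MediaAgent reading the source data
    writer_site: str      # site of the MediaAgent writing the destination copy


def inter_site_hops(path: CopyPath) -> int:
    """Count how many times the data crosses the link between sites."""
    hops = 0
    if path.mount_path_site != path.reader_site:
        hops += 1  # the reader pulls the data across the link
    if path.reader_site != path.writer_site:
        hops += 1  # the reader sends the data across the link to the writer
    return hops


if __name__ == "__main__":
    scenarios = {
        "normal aux copy": CopyPath("primary", "primary", "dr"),
        "spillover read via primary MA": CopyPath("dr", "primary", "dr"),
        "spillover read locally by DR MA": CopyPath("dr", "dr", "dr"),
    }
    for name, path in scenarios.items():
        print(f"{name}: {inter_site_hops(path)} link crossing(s)")
```

In this model, the spillover data read through the primary media agent crosses the link twice, while the same data read locally by the DR media agent would not cross it at all.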
