I have a customer that is backing up in-car videos (local sheriff's dept) and is deduping this data. We have created a new stand-alone dedupe database for this data and are using a selective copy to copy weekly fulls to another target. We have set up another stand-alone dedupe database for this copy. The customer is having issues getting this data copied over; it's taking multiple days. The full backup is about 40 TB in size. We have recommended not using dedupe for this copy, but the customer does not have enough capacity for non-deduped copies. Are there any optimizations that can be made to speed up this aux copy? They currently have only 1 subclient backing up all the data; if they split this out, would that help in using multiple streams for the aux copy? Any other suggestions?
Thanks
@Melissa Adams, adding more subclients should potentially help.
If it's only 1 subclient right now, then we're only using a single stream (and the Aux Copy will mirror that).
If the source can handle multiple subclients, then that will result in more connections from Primary to Aux Copy.
Of course, check the actual max throughput between Media Agents/libraries to see if there’s a bottleneck there as well!
There's a load of things you can do, as Mike's mentioned. The first thing is to check whether the destination has 'baselined' the source yet. That is, if the source DDB is 100 TB of data on disk, then once the destination reaches this as well, we'll see an increase in throughput as there's less unique data to send.
Or, as mentioned, seeding the destination copy is a great option where the network is a huge bottleneck.
If the data is being backed up using the File System agent, you can increase the number of readers to increase the number of streams. You don't necessarily need to break it into multiple subclients, though that is not a bad idea, as there are other benefits besides the increased stream count (smaller indexes making browse much easier, as well as smaller index databases).
Once the above is checked, the next step in understanding the Aux Copy performance is to identify the bottleneck. Have a look for a log file on the source MediaAgent called 'CVPerfMgr.log'. If you filter by the job ID (see the sketch below), you'll have Read, Write and Dedupe metrics (plus a bunch of others). We can use this to identify where to focus your energy and get things moving smoothly.
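For reference, here is a minimal sketch of that filtering step in Python. The log path and the assumption that each relevant counter line contains the job ID are hypothetical; adjust them to match your MediaAgent's install directory and the actual CVPerfMgr.log layout.

```python
# Minimal sketch: print only the CVPerfMgr.log lines that mention a given
# Aux Copy job, so the Read/Write/Dedupe counters are easier to eyeball.
# Both the path and the "job ID appears on the line" assumption are
# hypothetical -- verify against your own MediaAgent.
from pathlib import Path

LOG_PATH = Path(r"C:\Program Files\Commvault\ContentStore\Log Files\CVPerfMgr.log")
JOB_ID = "123456"  # replace with the Aux Copy job ID you are investigating

with LOG_PATH.open(encoding="utf-8", errors="ignore") as log:
    for line in log:
        if JOB_ID in line:
            print(line.rstrip())
```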
Cheers, Jase
Another tip: rather than having to wait until the backup job completes, you can choose the option to start copying the data even if the backup job is still running. If the backup takes 8 hours to run, that's 8 hours of extra copy time.
If selected, when an auxiliary copy job with the Use Scalable Resource Allocation option enabled is performed during a backup job, the data available on the primary copy is picked for copy by the running auxiliary copy job. This option causes the auxiliary copy operation to create the secondary copy faster. This option saves time for the auxiliary copy operation, especially when the backups running are huge.
@Mike Struening @Damian Andre @jgeorges The mystery continues. The customer does have a ticket open now to assist with optimization. See the screenshots:
The size on disk is 27 TB and the size Commvault is backing up is 45 TB. Any more ideas? They do have only the ICV folder added as content.
Thanks all for the answers so far!
The only thing I can think of (immediately) is that the job has been running for 5 days… so perhaps it had to restart a few times? Or the data in the folder has changed?
I'm not sure if that is valid because there are no skips or failures… any chance there are nested folders/mount paths within that content?
Also, can you share the case number so I can track it?
@Mike Struening Here is the incident - 220322-453
Thanks! I'll keep an eye on it (I do see some progress on the job discrepancy).
Hi @Melissa Adams
I've had a look at the screenshots you shared on the incident and, from a quick glance, there's no deduplication on these backups. Most commonly this is due to Deduplication being disabled at the Client/Subclient level; this option overrides the storage policy deduplication settings.
Synthetic Full Backup Size Greater Than the Size of Data on Disk
Synthetic full backup size may be greater than the size of the original data on disk in Windows Server 2012 and Windows Server 2016 computers.
Windows Server 2012 and Windows Server 2016 have a deduplication feature that can be enabled. When this feature is enabled, data from deduplicated volumes are backed up in their original, uncompressed state. Therefore, the size of synthetic full backups can be greater than the size of the deduplicated data.
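To put rough numbers on the scenario that documentation describes: if the source volume had Windows deduplication enabled, the logical data the backup reads would be the rehydrated size, not the on-disk size. A minimal sketch of that arithmetic, using a hypothetical savings rate that happens to line up with the 27 TB / 45 TB figures above:

```python
# Hypothetical arithmetic only: shows how a Windows-deduplicated volume can
# report 27 TB on disk while the backup reads the full rehydrated size.
size_on_disk_tb = 27.0     # what the OS reports after its own deduplication
dedup_savings = 0.40       # hypothetical Windows dedup savings rate

logical_size_tb = size_on_disk_tb / (1 - dedup_savings)
print(f"Data the backup would actually read: ~{logical_size_tb:.0f} TB")  # ~45 TB
```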
I would also check whether Object Based Retention is enabled or set up, as this will cause backups to grow indefinitely since we retain all versions of modified files and deleted items.
I've had a look at the screenshots you shared on the incident and, from a quick glance, there's no deduplication on these backups. Most commonly this is due to Deduplication being disabled at the Client/Subclient level; this option overrides the storage policy deduplication settings.
We are the end user that Melissa has been assisting. Deduplication did get disabled at the subclient for that particular job. Not sure how or why it was disabled there, but it was. We normally always ensure that setting is enabled when we create a new subclient, as almost all backups in our environment are done to a global DDB.
We will adjust and monitor to see if things look accurate on the next backup job.
Thank You,
Kevin
It also looks like we have object based retention enabled. We do this because we have a mix of Data Protection and Archiving licenses. There's a long history to this, but archiving was significantly less expensive years back when we had to buy a large amount of licensing.
We enable the archiving functionality so the subclient uses our archiving license. We don't actually want or need any archiving. We always just enabled the checkbox and didn't pay much more attention to it, and it never seemed to be an issue, but looking at this closer now, I think we may want to toggle the file versions to retain 0 versions instead of how it is set up now.
The other thought is that we change it to job based instead of object based and set the archiving rules to something like 20 years for file modified time and file created time. Maybe it doesn't matter, though, if none of the file types are selected?
Thank you again,
Kevin
Using Archiving to get the cheaper licensing mode without actually archiving absolutely was a very common practice.
I’m not sure (at a glance) if you actually have any files archived, so your point is potentially valid that it wouldn’t have any impact at all.
Regarding the Dedupe enabled, definitely keep us posted!
We confirmed that the object based retention settings were one of the culprits in our environment. As we have never used archiving, we didn't know how to look at backup jobs to see "archived" files. We found some archiving documentation that helped us figure out how to view backups through the Command Center, where the "show deleted items" option was easily selectable. That visibility confirmed that we had deleted items, with the little garbage can icon next to them, in the backup jobs.
We adjusted the object based retention settings, ran another synthetic full and we were able to purge the archived files out of the new backup jobs.
At the same time, we are splitting the video files into yearly subclients, which will hopefully help in making the Aux Copy process more efficient.
We have a lot of cleanup to do yet and lots of changes in data so it will take a bit for everything to work its way through to see the result, but this all feels very good and I think we are back on track to getting our process working as we had previously designed it to work.
Thank you again,
Kevin
@KevinH THANK YOU!!! Glad you made some headway!
Melissa
We are continuing to work on this project to optimize this large Aux Copy job to our offline Commvault media agents.
From a bigger-picture view, we have three media agents at our primary site that write all the primary copies. They use a partitioned DDB, and the back-end storage is a pair of SMB-based NAS devices.
At the moment our "offline" Aux Copy destinations are at two different sites, each connected by full-mesh 4 x 10 Gbps dark fiber. The offline servers are single Dell R740xd hosts running VMware. At the moment we have a single media agent running on the Dell that has a global DDB and writes all the back-end storage to local RAID6 sets. We have the server full of SSDs and 16 TB disks to have enough performance and capacity for a copy of all of our data.
This structure worked great for 6-12 months when the media agents were running on vSphere 6.7. Most of our issues started after upgrading the hosts to vSphere 7.0. We believe something changed in the underlying VMFS file system from 6.7 to 7.0 that is causing some bottlenecks or deadlocks of sorts with the media agent. We have had VMware review the server and they cannot isolate any potential storage or performance issues.
Back to the bigger picture, we are wondering if this system would work better if we ran two virtual media agents on the VMware host, using a partitioned DDB and some form of grid store or UNC paths to storage instead of just locally mounted disks. Our thought is that if one media agent is stuck processing something like data aging, then with two of them, one can actively keep churning Aux Copy jobs while the other is busy doing other media agent tasks.
It would take a bit of work to rebuild the environment from a single global DDB with local disks for a disk library to a partitioned DDB with grid store or UNC paths for the disk library. We are looking for some guidance on whether having multiple virtual media agents might be a better solution or whether it will just introduce unnecessary complexity. Keep in mind we will not be adding any CPU, RAM or disk resources, but it's also possible that two media agents might utilize resources more efficiently than one larger media agent.
Thank You again
Kevin
If there's no extra actual 'oomph' behind the extra MA, I'm not sure it's worth it.
I'll defer to the experience of others, though the only benefit you'd get is MA redundancy at the cost of splitting the physical resources across virtual machines (assuming I understand you correctly).
Also, having a partitioned DDB only really helps with scale and balance, but mostly with scale as record counts increase. Is that a concern, or were you aiming at balancing?
We are more looking at balancing. The partitioned DDB would primarily be so that Aux Copy jobs could still run if one of the media agents was tied up doing something else. As an example, we had a large data aging job that was freeing up about 20 TB of data. The data aging job took a good week, maybe two, to run, and during that time Aux Copy jobs were very slow; it almost felt like they were paused. Once the data aging seemed to finish, Aux Copy jobs sped up. I don't know if having two media agents during a period like this would make a difference, but it is a theory.
Another example is that the Aux Copy jobs seem to be starving the media agent of being able to do DDB backup jobs. Sometimes the DDB backup job times out after a few days without finishing. If we kill the Aux Copy jobs, the DDB backup finishes in an hour or two. Right now we have the Aux Copy jobs set to run for 4 hours, then they auto-kill and try to restart every 30 minutes when they are not running. We have five storage policy copies that we have set up for offline copies. I think we have them collectively limited to 200 streams, so maybe there are no streams left for the DDB backups while Aux Copy jobs are running.
This is more of a gut feel, but it seems like Commvault in general does better when job activities are split up more, whether that is more subclients or more media agents. This theory is just an educated guess from years of experience using Commvault, so that is why we are looking for some expert opinions on whether this kind of architectural design may give us better overall results.
Thank You again,
Kevin
I’ll get some of our folks who know dedupe better than I do to chime in here. Want to be sure I’m not neglecting any angle
@KevinH Have you considered throttling the Aux Copies when running to allow the DDB backups to run as well?
It sounds like you may have a resource issue. Are you running Space Optimized Aux Copies, or do you have this option disabled?
@Orazan I just verified we do have Space Optimized Aux Copies enabled. We have monitored network bandwidth and this is not an issue; our media agents only use a small portion of available bandwidth. The two sites are connected with 4 x 10 Gbps full-mesh dark fiber. They are a few miles apart and have sub-2 ms ping times. Of note, we have a synchronous copy going to the same site on a different set of media agents, and that copy works well. These offline copies are selective copies using "All Fulls" from the same source copies.
It really seems like a software deadlock of sorts: something in the scheduling or operations of this particular media agent is causing an application deadlock, so that other media agent tasks are basically paused. I am not sure if having two media agents would resolve this deadlock or if it would just span both media agents. I don't have anything conclusive to show some type of deadlock, and so far the various support cases we have opened have not identified any specific problem. It's just a gut feel right now.
The usual causes of the deadlocks seen are either contention/resource issues or antivirus or other scanning software. Running Space Optimized Aux Copies will cause the job to use fewer streams due to the difference in the way the job runs. Have you tried running some jobs without this setting to compare the performance?
Thank you again. We noticed issues with antivirus in the past and thought they were addressed. I double-checked after your last message, and we found the antivirus client was an older client that was not getting the updated policies applied.
We are correcting that now; the AV client was scanning all the disk library drives and DDB volumes. That will obviously add some overhead and potentially cause some corruption.
We will correct this and monitor to see if this makes improvements on its own or if we still need to address other issues as well.
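As a quick sanity check while we do that, something like the sketch below can compare the paths the media agent uses against what the AV policy actually excludes. All of the paths in it are placeholders, not our real layout.

```python
# Rough sketch: flag MediaAgent paths that are not covered by any AV exclusion.
# Every path here is a hypothetical placeholder -- substitute your real disk
# library mount paths, DDB partition paths, and the Commvault install directory.
media_agent_paths = [
    r"D:\DiskLibrary\MountPath01",   # disk library mount path (placeholder)
    r"E:\DDB\Partition1",            # DDB partition (placeholder)
    r"C:\Program Files\Commvault",   # install/log directory (placeholder)
]
av_exclusions = [
    r"C:\Program Files\Commvault",   # what the AV policy currently excludes (placeholder)
]

not_excluded = [
    p for p in media_agent_paths
    if not any(p.lower().startswith(excl.lower()) for excl in av_exclusions)
]
for path in not_excluded:
    print(f"Not covered by an AV exclusion: {path}")
```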
Thank You,
Kevin
After a few days with the new Anti-Virus policies and exclusions in place, it feels like Aux Copy jobs are speeding along at a pretty good pace again.
I am still wondering if two media agents in a virtual machine infrastructure would work better than one. I feel like two media agents would more efficiently use the VMware hardware resources (RAM, CPU, network, etc.), and that a partitioned DDB and splitting load across multiple media agents would work better from a Commvault software and scheduling perspective.
But it seems like antivirus might have been the ultimate root problem we were experiencing. I think a lot of the other changes we have made have also helped, so overall the structure is better now, but antivirus seems like the biggest limiting factor we were experiencing.
We will try to follow up in a week or two after we flip our offline media agents and let the other media agent run for a bit to see how it performs with the new structure.