Hi @drPhil ,
Is the VSA Proxy or vCenter being Backed-up or Snapshotted during the backup? - This can sometimes cause issues.
I’d always recommend protecting the VSA Proxy and vCenter at times different from the usual VM backups to avoid issues.
Also ensure that Automount (Windows) or lvm2-lvmetad (Linux) is disabled on the VSA proxies.
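For reference, a rough sketch of how that is typically done on each platform (service names and the `automount scrub` step are assumptions; check your distro and the Commvault docs for your version):

```shell
## Windows VSA proxy (elevated prompt) -- disable automatic volume mounting:
#   diskpart
#   DISKPART> automount disable
#   DISKPART> automount scrub    # clears stale mount-point entries from the registry
#   DISKPART> exit

## Linux VSA proxy -- stop LVM metadata auto-activation so hot-added guest
## volumes are not scanned/activated by the proxy's own LVM
## (service/socket names may vary by distribution):
systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
systemctl disable lvm2-lvmetad.service lvm2-lvmetad.socket
# Also set "use_lvmetad = 0" in /etc/lvm/lvm.conf:
sed -i 's/use_lvmetad = 1/use_lvmetad = 0/' /etc/lvm/lvm.conf
```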
The vsbkp.log file (Virtual Server Backup) in the Log Files directory (on VSA Proxies) will cover the VM protection Job.
Let us know how you get on.
Best Regards,
Michael
Hi @MichaelCapon
thanks for your contribution!
We have now disabled backups of the proxies themselves. The Automount feature hasn't been checked yet.
In the log files (vsbkp.log) I found the following errors:
CheckVMInfoError() - VM [vm_name] Error removing virtual machine snapshot. Please check Virtual Machine snapshot tree for any stale or missing snapshot chain.
VSBkpWorker::UnmountVM() - Leaving Cleanup string [VSIDA -j 543656 -a 2:110 -vmHost "vcenter_name" -vmName "vm_name" -vmGuid "guid_name" -vmCleanup "snapshot-xxxx"] with JobManager as the unmount did not clear the cleanup string
RemoveDiskFromVM() - Failed to delete virtual disk [[datastore_name] xxx/vm_name-00000x.vmdk] from VM [proxy_name] with error - [Invalid configuration for device '0'.]
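The lines above can be pulled out of a vsbkp.log with a quick filter; a minimal sketch (the function name is illustrative, and the log path varies by install):

```shell
# Filter a vsbkp.log for the three error signatures quoted above.
# Usage: scan_vsbkp /path/to/Log_Files/vsbkp.log
scan_vsbkp() {
  grep -E 'CheckVMInfoError|did not clear the cleanup string|Failed to delete virtual disk' "$1"
}
```

Running this against the proxy's Log Files directory makes it easy to see whether the snapshot-removal and disk-cleanup errors always occur together, and for which VMs.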
Are these error messages somehow useful for further investigation?
Hello drPhil,
Take the VSA proxy VMs out of the regular subclient content so they do not protect themselves, disable automount, remove any old backup VM disks still attached to the VSA proxy VMs, then run disk consolidation on the affected VMs, and disable DRS for the VSA proxy VMs.
Use multiple VSA proxies, depending on workload, to distribute VMs accordingly, and add multiple SCSI controllers on each proxy VM as required to protect multiple VM disks in parallel. Set the number of streams as required on the VM group (subclient) and stagger jobs to avoid HotAdd lock situations. vsbkp.log and the individual VM status on the job will give details of the failures.
Below are the BOL docs related to this; please check them for the cleanup steps and best-practice implementation.
Hotadd mode:
https://documentation.commvault.com/11.25/expert/32048_hotadd_transport_for_vmware.html
Hotadd data flow:
https://documentation.commvault.com/11.25/expert/135014_data_flow_for_hotadd_transport_mode.html
Best practices: Disable DRS for VSA Proxies on VMs
https://documentation.commvault.com/11.25/expert/32572_best_practices_for_virtual_server_agent_with_vmware.html
snapshot cleanup process:
https://documentation.commvault.com/11.24/expert/113810_snapshot_cleanup.html
disk consolidation KB:
https://kb.commvault.com/article/62743
Regards
Gopinath
Hi @drPhil ,
Thanks for the update.
The combination of “failed to remove snapshot” and “failed to delete vdisk from proxy” sounds like the VSA had a Snapshot present on it when we tried to remove the disk.
I’d suggest the following steps:
1. Remove Snapshots of VSA Proxies
2. Remove any Guest Disks attached to the VSA Proxy from other VMs.
3. Ensure VMs are consolidated properly
4. Move the VSA Proxies into a separate Subclient (and schedule for a different time to VM backups)
5. Ensure Automount is disabled.
Note: If using VMware 6.5 or higher, Paravirtual SCSI Adapters should be used on VSA Proxies.
Once the above has been validated, let us know the success of your VM Backups.
Best Regards,
Michael
Hi @Gopinath and @MichaelCapon ,
thank you for your guidance!
Checked: no VSA Proxy snapshots, no guest disks from other VMs attached to the VSA Proxies now, VM consolidation done, VSA Proxies not being backed up themselves, Automount disabled, Paravirtual SCSI adapters in use, DRS disabled for the VSA Proxies.
I am not sure what staggering jobs does. There is one subclient, which includes about 50 VMs. There are 4 VSA Proxies, and as soon as one VM is backed up the next one starts. This looks good.
However, if we do not find the cause of the failing backups, the issue can reoccur. That is what I am afraid of.
@drPhil , at this point, it might be best to open a support case and get an answer.
Once you do, please share the case number here so I can track it.
Thanks!!
Thanks again for the case number!
Hi !
Not sure it’s the case for you, but I see this warning message from VSA backups of VMs that have too many VMDKs.
I noticed this message in the logs of all backups of VMs that had more than 20 disks configured (VMDKs).
My guess is that an incremental with CBT enabled on some disks is unable to calculate the delta, throwing this message. But that’s only a guess on my part.
Hi @MichaelCapon, @Gopinath, @Mike Struening and @Laurent. Thank you so much for your inputs.
To sum up the thread, I'll write down some points that could help with this kind of issue. However, the root cause has not been identified.
What we have changed
- Commvault settings (nRemoveSnapshotRetryAttempts and nRemoveSnapshotRetryIntervalSecs - see the related thread "Virtual disks needs consolidation - failing VM backup")
- the rest of the old backup software removed
- VSA proxies removed from the backup set so they are not backed up at the same time as the other VMs
What we have checked
- Removed snapshots of the VSA Proxies
- Removed any guest disks attached to the VSA Proxies from other VMs
- Ensured VMs are consolidated properly
- Moved the VSA Proxies into a separate subclient (scheduled at a different time to the VM backups)
- Ensured Automount is disabled
- Checked logs (vsbkp.log, VIXDISKLIB.log)
Maybe, for the future, it would be interesting to open a ticket with VMware support as well.
Have a great one!
Sharing case resolution:
From the CV logs we do not detect any errors with the snapshot removal. Please ensure the AV exclusions are in place, as antivirus can cause issues with snapshot removals and HotAdd backups.
https://documentation.commvault.com/11.24/expert/8665_recommended_antivirus_exclusions_for_windows.html
5960 1f8c 12/05 22:52:11 969364 CVMWareInfo::_RemoveVMSnapshot() - Removing Snapshot [snapshot-72377] of VM [<vm name>] Guid [5039922a-c165-5158-9a26-a1086632f4ac]
5960 1f8c 12/05 22:52:22 969364 CVMWareInfo::_RemoveVMSnapshot() - Successfully removed Snapshot of VM [<vm2 name>] Guid [5039922a-c165-5158-9a26-a1086632f4ac] duration [00:00:11] size [0]
The antivirus on the VSA proxy will cause issues with HotAdd, not the antivirus on the guest VMs. The AV logs rarely show a block event or that it scanned the CV process. Typically, when it scans the CV process it will allow it, but because it opened a handle to our process it can cause the process to hang, resulting in disks that did not clean up properly. This can result in orphan disks, since the process was hung during the mount or unmount operation. Based on your latest logs, I do not see the old snapshots still attached to the VSA proxy, and backups/consolidations are all working. At this point, I recommend monitoring the backups until Friday to ensure everything continues to work.
The keys will increase the number of snapshot removal commands and the frequency at which we send them. If there is already an issue with snapshot cleanup due to HotAdd or orphan snapshots, and it is not cleaned up manually before the backups, these keys will not help.