Question

VMware SRM, DC role-swap and CBT

  • 13 February 2024


We are planning to swap production VMs between our two data centers on a regular basis. We’re using PowerMax storage with SRDF Metro (synchronous) replication, VMware with SRM-protected VMs, a Commvault Primary backup at the DC where the VMs happen to be at the time, and an AUX copy at the other DC. We back up VM snapshot images (so agentless).

We can configure SRM to maintain the UUID after a failover to the Secondary site (secondary vCenter), so the licensing issue seems fine. SRDF Metro replication keeps the sites in sync, so from a VMware perspective, both sites ‘see’ the same disk IDs. My question is about finding a way to configure Commvault to recognize the AUX copies as Primary, AND, more importantly, to continue to leverage the same CBT info from the Primary site. This should be possible given that SRDF Metro maintains bit-level duplication of all disks/blocks.

Without CBT continuation across sites, after failing 500-ish VMs over to the Secondary site, we’re looking at a full re-seeding of all VMs - so tens of TB of data that need to be read from PMAX. Not only will this take days to complete (breaking our RPO), it will also severely impact runtime performance for the Production environment (due to the high IOPS that Commvault reads place on the storage).
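To put rough numbers on the re-seed concern above, here is a minimal back-of-the-envelope sketch. The data volume, sustained read throughput and RPO used in the example are hypothetical placeholders, not measurements from this environment; substitute your own figures.

```python
def full_read_hours(total_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to read total_tb (decimal TB) at a sustained rate in MB/s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = total_tb * 1_000_000  # 1 TB = 1,000,000 MB in decimal units
    return total_mb / throughput_mb_s / 3600


def exceeds_rpo(total_tb: float, throughput_mb_s: float, rpo_hours: float) -> bool:
    """True if a full read of the data would take longer than the RPO."""
    return full_read_hours(total_tb, throughput_mb_s) > rpo_hours


if __name__ == "__main__":
    # Hypothetical inputs: 50 TB to re-read, 500 MB/s sustained, 24 h RPO.
    hours = full_read_hours(total_tb=50, throughput_mb_s=500)
    print(f"Estimated full re-seed: {hours:.1f} h")
    print("Breaks RPO:", exceeds_rpo(50, 500, rpo_hours=24))
```

This only models raw read time; it ignores job scheduling, deduplication and the load that the reads place on production storage, so real-world results may well be worse.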

There is some reference to CBT on this page (vCenter Migration (commvault.com)), but it is not clear whether it matches our use case or how to implement it.

Any ideas we could try would be much appreciated. Thank you


3 replies

Userlevel 1
Badge +2

Without fully understanding the Commvault, VMware and storage configurations in detail, it is difficult to determine what will happen when you fail those VMs over to the secondary site.

 

As long as the VM properties appear identical to Commvault and the Change IDs are maintained on the VM, then the backups should continue to run as an incremental.

 

What you can do is perform a test to determine the outcome. Set up a test VM in your environment as per your planned configuration. Configure a separate subclient/VM group to protect that VM. Run the backup and, once it has completed, fail the VM over to the secondary site.

 

At this point you will need to update the subclient/VM group configuration to target the VM on the secondary site. Commvault will not know that you have failed over from the VMware side. That is only possible if you are replicating those VMs through Commvault.

 

Once you have done that, run an incremental backup of the subclient/VM group again. See if CBT is engaged and an incremental backup is performed.

 

Based on that outcome, you will be able to determine if any further tweaks are required to get this to work.
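One way to make the "do the VM properties appear identical" check concrete is to record a few identifiers before the failover and compare them afterwards. The sketch below is illustrative only: the `VmIdentity` fields and the sample values are hypothetical, the change IDs are treated as opaque strings, and how you collect these values (vSphere client, scripts, and so on) is left to you. It assumes no backup or snapshot is taken between the two recordings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VmIdentity:
    """Identifiers recorded by hand or by script; field names are hypothetical."""
    instance_uuid: str
    bios_uuid: str
    # Disk label -> change ID string, treated as an opaque value.
    disk_change_ids: Dict[str, str] = field(default_factory=dict)


def identity_drift(before: VmIdentity, after: VmIdentity) -> List[str]:
    """Return the names of the properties that differ across the failover."""
    diffs: List[str] = []
    if before.instance_uuid != after.instance_uuid:
        diffs.append("instance_uuid")
    if before.bios_uuid != after.bios_uuid:
        diffs.append("bios_uuid")
    all_disks = sorted(set(before.disk_change_ids) | set(after.disk_change_ids))
    for disk in all_disks:
        if before.disk_change_ids.get(disk) != after.disk_change_ids.get(disk):
            diffs.append(f"change_id:{disk}")
    return diffs


if __name__ == "__main__":
    # Made-up sample values for illustration.
    pre = VmIdentity("uuid-a", "bios-a", {"Hard disk 1": "cid-1/10"})
    post = VmIdentity("uuid-a", "bios-a", {"Hard disk 1": "cid-2/1"})
    print(identity_drift(pre, post))
```

An empty list would suggest nothing visible changed; any entry is a lead to follow up on, not proof of why a backup did or did not run as an incremental.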

Userlevel 7
Badge +23

Echoing much of what Peter said:

 

Let me break this down, and you can confirm whether I have it right.

  1. Using SRM, you will periodically fail over VMs between datacenters
  2. You want Commvault to seamlessly back up each VM wherever it resides, and to follow it when it fails over
  3. You need CBT to be preserved so that the backup window is not impacted upon failover

It's been a long time since I’ve worked with SRM, but I have not seen this done successfully in the past, because SRM could not preserve the UUID from vCenter, which is what we use to identify VMs globally and uniquely. That being said, you mentioned that SRM now has a UUID preservation mechanism - I don't know when that came in, but if that is the case and the UUID is truly preserved, then I would imagine this would work OK. You’d have to test it out on your own, though. Here is what I would do to test this:

  1. Build a test VM, record its UUID, configure it in SRM and do a failover - see if the UUID is preserved, and whether the original VM has its UUID changed.
  2. Add the VM to a test subclient by using the browse option and selecting the VM specifically - don't select the datastore, host or any other criteria. When you pick the VM, it is targeted by its UUID.
  3. Back up the VM and confirm that a .ctk file is created in the datastore\folder of the VM. Fail over the VM and ensure the .ctk file is still preserved on the replica VM - this is what will be leveraged for changed block tracking, along with the UUID.
  4. With the VM failed over, run an incremental backup and see if the VM is selected correctly, and if the application size of the backup remains small (indicating CBT is working).
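Steps 3 and 4 above can be turned into two small checks, sketched here under stated assumptions. The sketch assumes the CBT tracking files follow the `-ctk.vmdk` naming pattern, that you have file listings of the VM folder from before and after the failover, and that you know the application size of the last full backup. The 20% threshold and the file names are arbitrary illustrations, not Commvault settings.

```python
from typing import Iterable, List


def missing_ctk_files(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """CBT tracking files present before the failover but absent afterwards."""
    ctk_before = {name for name in before if name.lower().endswith("-ctk.vmdk")}
    return sorted(ctk_before - set(after))


def looks_incremental(app_size_gb: float, full_size_gb: float,
                      max_ratio: float = 0.2) -> bool:
    """Heuristic: a backup far smaller than the last full suggests CBT was used."""
    if full_size_gb <= 0:
        raise ValueError("full_size_gb must be positive")
    return app_size_gb / full_size_gb <= max_ratio


if __name__ == "__main__":
    # Hypothetical folder listings and sizes.
    pre = ["testvm.vmx", "testvm.vmdk", "testvm-ctk.vmdk"]
    post = ["testvm.vmx", "testvm.vmdk"]
    print("Missing CTK files:", missing_ctk_files(pre, post))
    print("Looks incremental:", looks_incremental(app_size_gb=100, full_size_gb=100))
```

A job that converts to a full, with an application size matching the full backup, would fail the second check; the first check helps narrow down whether the tracking file itself went missing.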

In terms of reversing the auxcopy, this is where things become difficult. With VSA V2 you may be able to preserve incremental CBT if the VM is in another subclient, dictated by discovery rules (datastore or cluster discovery perhaps). But you’d need to test it out. You would have to send the backups to a different storage policy with the auxcopy reversed. There is no way to leverage the same storage policy but override the primary and secondary copy per VM.

Badge +1

Thank you Peter and Damian for your detailed responses.

Damian, you described our scenario and desired outcome perfectly. 

We have done the test you describe (we used two VMs in our test). We confirmed that the UUID was maintained post failover (there is a new attribute in SRM that allows for that). We then started an incremental backup; the job was immediately converted to a full, and the size of the backup matched a full backup. We believe that is because it did not ‘recognize’ the CBT data.

I will ask my team to look into the .ctk file behavior.

As for converting the auxcopy to primary at the secondary site, and vice versa at the primary site, I guess I don’t particularly need that - I thought it would be a prerequisite for the incremental/CBT to work after failover. I wouldn’t necessarily care as long as I get quick (incremental, CBT-enabled) backups post failover and can also restore a VM post failover if needed. I am not sure what the behavior would be if, say, I needed to restore a VM 2 days after failover, when the last Full was taken while the VM was at the other site (but there is an auxcopy at the now-current site) and two more incrementals were taken at the current site. For efficiency reasons, would the restore be able to leverage the Full from what was the auxcopy (local at the new site) together with the incrementals from the new site?
