Question

Windows Cluster Backups Creating Non __GX_Backup Checkpoints and not removing them

  • 10 May 2023
  • 5 replies
  • 235 views

Badge +1

I’ve had this problem for years, on 2012, 2016, 2019 and now Azure Stack HCI clusters: if there’s been some sort of blip on the network or with a particular host, we can end up with an increasing number of .avhdx files that are not named __GX_Backup (i.e. normally named) after every backup, on a small percentage of VMs (not the same VMs each time, and the problem is intermittent).

The disks show as differencing, but no checkpoints are listed in Hyper-V Manager, and fixing it usually requires downtime to consolidate the disks via a restore.
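
A rough way to spot affected VMs before a backup fails, assuming the Hyper-V PowerShell module on each host (a sketch only, not how Commvault detects this):

```powershell
# Sketch: flag VMs whose attached disks are differencing (.avhdx)
# while Hyper-V reports no checkpoints for the VM.
Get-VM | ForEach-Object {
    $vm  = $_
    $cks = Get-VMSnapshot -VMName $vm.Name -ErrorAction SilentlyContinue
    Get-VMHardDiskDrive -VMName $vm.Name | ForEach-Object {
        $vhd = Get-VHD -Path $_.Path
        if ($vhd.VhdType -eq 'Differencing' -and -not $cks) {
            [pscustomobject]@{
                VM     = $vm.Name
                Disk   = $vhd.Path
                Parent = $vhd.ParentPath
            }
        }
    }
}
```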

Commvault support advised they’ve never seen this particular issue before and have always said it’s a Microsoft issue, but I wondered if anyone else had ever seen it.

I don’t have the issue with non-clustered VMs, where the backup type is set to “File System and Application Consistent” instead of “Crash Consistent” as it is on the clusters.

I think the issue is probably related more to the clusters than to the backup type, but am not sure. If I do change the backup type, I can expect issues with some of our VMs’ services not starting back up properly (hence why I tend to use Crash Consistent).

Listing of .avhdx files from the recent issue with the 14-03-23 backup:


Mode                LastWriteTime         Length Name                                                                  
----                -------------         ------ ----                                                                  
-a----       15/04/2023     02:17     4890558464 LAGAN-LIVE-AS1_015077AB-FC20-47DD-B272-AFB539B8B6B6.avhdx             
-a----       26/04/2023     01:56     5140119552 LAGAN-LIVE-AS1_1FEC9E86-52BA-4107-B35F-CAC4022EABF9.avhdx             
-a----       27/04/2023     00:47    10540285952 LAGAN-LIVE-AS1_50C7A6EF-F7B9-4F82-BDA6-924B56B8A65C.avhdx             
-a----       18/04/2023     02:17     4903141376 LAGAN-LIVE-AS1_7DDF9F73-AE6C-438F-871E-E21169DEC5C9.avhdx             
-a----       21/04/2023     02:17     5226102784 LAGAN-LIVE-AS1_842305DD-3B39-48D6-A1F8-CCB5AACC23D6.avhdx             
-a----       23/04/2023     02:16     4945084416 LAGAN-LIVE-AS1_8B2BC279-BB45-4AB4-AF37-ED2D21838DA1.avhdx             
-a----       20/04/2023     02:16     5318377472 LAGAN-LIVE-AS1_8BA5D77E-3C12-4DEF-B921-204642CE5B9D.avhdx             
-a----       27/04/2023     09:35     3191865344 LAGAN-LIVE-AS1_AC97BBB6-7AE2-497A-9E07-08D606343117.avhdx             
-a----       22/04/2023     02:17     5450498048 LAGAN-LIVE-AS1_C7845CF4-7951-4B73-BFF9-9FD364AFF567.avhdx             
-a----       17/04/2023     02:18     5138022400 LAGAN-LIVE-AS1_CEA5CE72-5E21-42E7-BF93-09D93676F491.avhdx             
-a----       25/04/2023     02:17     5114953728 LAGAN-LIVE-AS1_D529D929-91DE-49C2-9E86-1EF3DD1369AA.avhdx             
-a----       24/04/2023     02:43        4194304 LAGAN-LIVE-AS1_data_1A2E943D-D35C-4DCC-AEFC-1BE11B8C7BD6.avhdx        
-a----       20/04/2023     02:39        4194304 LAGAN-LIVE-AS1_data_2F734270-16AC-4639-B4A4-D61789C07BAD.avhdx        
-a----       21/04/2023     02:40        4194304 LAGAN-LIVE-AS1_data_4319EB66-8D13-4297-B4F2-BDCCA0BE5515.avhdx        
-a----       27/04/2023     02:17       40894464 LAGAN-LIVE-AS1_data_437EE5C3-1F49-4C87-B53D-CFB5F5E0A0B8.avhdx        
-a----       17/04/2023     02:18        4194304 LAGAN-LIVE-AS1_data_51A10FFD-30E9-4079-891A-FCECFBDAB4AA.avhdx        
-a----       25/04/2023     02:44        4194304 LAGAN-LIVE-AS1_data_6B27E566-8709-4649-BC5F-37DCE553B1FA.avhdx        
-a----       16/04/2023     02:17        4194304 LAGAN-LIVE-AS1_data_7A675DA1-351F-4C08-B779-F55A30E47792.avhdx        
-a----       19/04/2023     21:09       38797312 LAGAN-LIVE-AS1_data_8897FFEC-7917-40EC-910A-2C03897A2F4E.avhdx        
-a----       14/04/2023     02:36        4194304 LAGAN-LIVE-AS1_data_9ECB630F-2BAD-40D9-BBD6-A69C57109C14.avhdx        
-a----       27/04/2023     02:17        4194304 LAGAN-LIVE-AS1_data_BA5699F8-D5E2-41DD-A8EF-8889287788B0.avhdx        
-a----       18/04/2023     02:17        4194304 LAGAN-LIVE-AS1_data_C6D5B6B1-6EAA-4846-81E7-D489DBC8BC04.avhdx        
-a----       18/04/2023     11:39        4194304 LAGAN-LIVE-AS1_data_CA7E5E36-80A3-47DE-99A0-873ED2990D3C.avhdx        
-a----       22/04/2023     02:41        4194304 LAGAN-LIVE-AS1_data_E4845E11-992D-4B17-8618-DB0650A2585A.avhdx        
-a----       23/04/2023     02:41        4194304 LAGAN-LIVE-AS1_data_F17283AB-2BD9-4A4D-BC40-241D0AF85356.avhdx        
-a----       16/04/2023     02:17     5171576832 LAGAN-LIVE-AS1_E61BA230-1522-45A0-A700-5E4D5486D00F.avhdx             
-a----       19/04/2023     02:17     5060427776 LAGAN-LIVE-AS1_EE578E42-C31D-4EA3-9316-01DD2B596A31.avhdx             
-a----       24/04/2023     02:18     4928307200 LAGAN-LIVE-AS1_F95AEE15-E708-4AC2-9B9B-7A0B41622D3C.avhdx 

 


5 replies

Badge

We are having the same issue using a Dell HCI Storage Spaces Direct cluster (shared disk). Backups are failing and at some point VMs in the cluster have to be replayed. Does anyone have a workaround for this?

Badge +1

Hi Damian,

Unfortunately I don’t see any more info in the VMMS-Admin log with those errors, just the “Failed to get the disk information.”

I’ll probably need to wait for the issue to recur now to get any more info.

Thanks for your help.

 

 

Userlevel 7
Badge +21

Hi Damian,

Yes, in Hyper-V Manager there are no checkpoints showing but we have differencing disks.

We’ve tried merging via PowerShell commands in the past, but these have been hit and miss, so we’ve ended up just doing restores to avoid increased downtime.
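
For reference, the manual merge we’ve attempted looks roughly like this (a sketch only; paths and VM/controller details are placeholders, and the VM has to be shut down first):

```powershell
# Inspect the chain first - each .avhdx records its parent.
Get-VHD -Path 'C:\ClusterStorage\Volume1\VM1\VM1_GUID.avhdx' |
    Select-Object Path, VhdType, ParentPath

# Merge the child into its immediate parent (repeat from the deepest
# child upward if there is a chain of differencing disks).
Merge-VHD -Path 'C:\ClusterStorage\Volume1\VM1\VM1_GUID.avhdx' `
          -DestinationPath 'C:\ClusterStorage\Volume1\VM1\VM1.vhdx'

# Re-point the VM at the merged parent before starting it again.
Set-VMHardDiskDrive -VMName 'VM1' -ControllerType SCSI `
    -ControllerNumber 0 -ControllerLocation 0 `
    -Path 'C:\ClusterStorage\Volume1\VM1\VM1.vhdx'
```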

We don’t appear to have verbose logging turned on for our Azure Stack HCI clusters, so have no updated VMMS logs to show.

Commvault support advised under Ticket 230427-500 that changing the backup type would make no difference in this case.

In any case, I’m loath to change the backup type to one that quiesces the VMs, as it will break some of our systems and I’d then have to separate those out to use Crash Consistent again.

 

Ah fair enough. I am not too familiar with Azure Stack, but I saw that the Microsoft-Windows-Hyper-V-VMMS-Admin log was captured through our log collection process. Our collection of this is pretty rudimentary and sometimes skips things, but I see a bunch of errors with no detail on our end (you will likely see more in the Windows Event Viewer on your side). This may be nothing, but it could be something:

[TYPE] Error [TIME] 4/27/2023 3:51:50 PM [SOURCE] Microsoft-Windows-Hyper-V-VMMS [COMPUTER] <host censored> [DESCRIPTION] Failed to get the disk information.
[TYPE] Error [TIME] 4/27/2023 3:51:48 PM [SOURCE] Microsoft-Windows-Hyper-V-VMMS [COMPUTER] <host censored> [DESCRIPTION] Failed to get the disk information.
[TYPE] Error [TIME] 4/27/2023 3:51:47 PM [SOURCE] Microsoft-Windows-Hyper-V-VMMS [COMPUTER] <host censored> [DESCRIPTION] Failed to get the disk information.
[TYPE] Error [TIME] 4/27/2023 3:51:47 PM [SOURCE] Microsoft-Windows-Hyper-V-VMMS [COMPUTER] <host censored> [DESCRIPTION] Failed to get the disk information.
[TYPE] Error [TIME] 4/27/2023 3:51:44 PM [SOURCE] Microsoft-Windows-Hyper-V-VMMS [COMPUTER] <host censored> [DESCRIPTION] Failed to get the disk information.

 

I can see background disk merges are happening and completing for some VMs, but we did not capture the entire VMMS-Admin log. What I do not see in the admin log is any mention of the 3 example VMs even attempting a background merge. I don’t know if that is not happening or if we simply didn’t collect enough history of the log.

In example job 3565956, one of the example VMs you provided (L*-L*-AS2) already had 10 child disks from previous backups that remain un-merged. I didn’t see any failures in our logs to indicate an issue converting the checkpoint to a reference point (which, I believe, is what should trigger the background merge).
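
As an aside, if you want to see whether reference points are being left behind, they live in the Hyper-V WMI v2 namespace; a hypothetical check (output fields may vary by OS version):

```powershell
# Sketch: list reference points Hyper-V is still holding for VMs.
Get-CimInstance -Namespace 'root\virtualization\v2' `
    -ClassName 'Msvm_VirtualSystemReferencePoint' |
    Select-Object ElementName, InstanceId
```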

I think the next step is to open Windows Event Viewer (eventvwr), find the Microsoft-Windows-Hyper-V-VMMS-Admin log, and go through it in detail, looking for any mention of the VMs.
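
Something along these lines can pull the relevant entries without clicking through the UI (a sketch; adjust -MaxEvents and the filter to taste):

```powershell
# Sketch: pull recent errors and anything merge-related from the
# Hyper-V VMMS admin channel.
Get-WinEvent -LogName 'Microsoft-Windows-Hyper-V-VMMS-Admin' -MaxEvents 1000 |
    Where-Object { $_.LevelDisplayName -eq 'Error' -or $_.Message -match 'merge' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List
```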


Userlevel 7
Badge +21

Hi @Graham 

The behavior between 2012, 2012 R2 and 2016+ is dramatically different. The 2012 backup/snapshot framework left a lot to be desired: live merge of checkpoints wasn’t even possible until 2012 R2. With 2012 R2 we still saw a lot of issues due to the use of VSS on the host disk for VM snaps; most of these issues went away with 2016 and above, when the new production checkpoints came in (including native change block tracking, yay!).

So 2012/2012 R2 had ‘random’ issues; 2016+ has been a lot more reliable, since the snapshot is at the individual VM level and not at the host disk level.

Just to confirm: in Hyper-V Manager you see no checkpoints, but on disk there are still differencing disks? In that case Commvault successfully deleted the checkpoint, but Hyper-V is not completing the background merge. Is there an option to manually merge the snapshot at the VM level in the Hyper-V UI?

There should be more detail in the System / Hyper-V / VMMS logs as to why a merge may not be occurring, which can help with troubleshooting (an error like “Background disk merge failed to complete”).

I would recommend trying to get application-consistent backups to work; that is how they are intended to function, and the state of the VM upon recovery can be questionable if they are not enabled.
