
Hello,

 

We have run into a very strange issue. After we upgraded to FR26, our Windows team has been complaining that the Hyper-V nodes go into a non-responsive state during backups, and in some cases they have to force a reboot to bring those nodes and VMs back online. They have provided the data below to support their side of the argument:
 

  • The issue is only observed during the backup window.
  • While the backups progress, multiple disk/vhdx related error/warning messages are being logged:
    • Disk 208 has the same disk identifiers as one or more disks connected to the system.
    • A timeout (30000 milliseconds) was reached while waiting for a transaction response from the GxCVD(Instance001) service. All Commvault services report this. 

    • The disk signature of disk 208 is equal to the disk signature of disk 5.

    • DISKPART shows multiple entries like:
       Disk 5    Invalid         200 GB      0 B   *

  • This was not observed prior to the upgrade we performed.
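
To put a number on the DISKPART symptom, the listing can be tallied by status. This is only a minimal Python sketch: it assumes the output of `list disk` was saved as text and that rows follow the layout of the sample line above. The function name `count_disk_statuses` and all sample rows other than the `Disk 5` one are illustrative, not taken from any Commvault or Microsoft tool.

```python
import re
from collections import Counter

# Matches rows such as "Disk 5    Invalid         200 GB      0 B   *".
# The column layout is assumed from the single sample line in this post.
DISK_ROW = re.compile(
    r"^\s*Disk\s+(?P<number>\d+)\s+(?P<status>.+?)\s{2,}"
    r"(?P<size>\d+\s+[KMGT]?B)\s+(?P<free>\d+\s+[KMGT]?B)"
)

def count_disk_statuses(listing: str) -> Counter:
    """Count disks per status in saved DISKPART 'list disk' output."""
    statuses = Counter()
    for line in listing.splitlines():
        match = DISK_ROW.match(line)
        if match:
            statuses[match.group("status")] += 1
    return statuses

# Illustrative rows; only the "Disk 5" row comes from the post.
sample = """\
  Disk 0    Online          500 GB      0 B        *
  Disk 5    Invalid         200 GB      0 B   *
  Disk 208  Invalid         200 GB      0 B   *
"""
print(count_disk_statuses(sample))
```

A growing count of `Invalid` entries between two backup windows would support the observation that the extra disks appear only while backups run.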

We have opened a Commvault incident for this. Does anyone here have any clue what might be going on?

 

Thank you

 

Regards

Abdul 

Hi @AbdulWajid 

Many things in fact…

Disk signatures: is the VSA proxy maybe being backed up at the same time it is trying to back up another VM?

Timeout for CV services: is this observed on a Commvault FS (and SQL?) client installed inside the VM being backed up, with the VM backup options set to Application Aware or File System Consistent (that is, not ‘Crash Consistent’)? Or on the Hyper-V VSA proxy?

What about the storage attachment type of this Hyper-V host, relative to the storage used for the VMs?

What kind of upgrade was performed? Can you provide details (upgrade from which FRxx to FR26)?

As you have also involved Support, I guess they’ll assist you quickly and provide a more accurate troubleshooting guide than we could here 🙂 But whatever they tell you, please share it here for our knowledge.

Regards,

Laurent.  


@AbdulWajid ,

I can see you logged a case with Commvault and it has now been escalated to Development.

They will review and we will get back to you.

Best Regards,

Sebastien


@AbdulWajid , I checked your incident for this issue and it looks like you are going to reach out to MS support.

Do you have any updates from those discussions to share?

Thanks!

 

Here’s the Roll Up Summary:

We use the two APIs below to create a VM snapshot and convert it to a reference point.

https://docs.microsoft.com/en-us/windows/win32/hyperv_v2/createsnapshot-msvm-virtualsystemsnapshotservice
CreateSnapshot method of the Msvm_VirtualSystemSnapshotService class

https://docs.microsoft.com/en-us/windows/win32/hyperv_v2/msvm-virtualsystemsnapshotservice-converttoreferencepoint
ConvertToReferencePoint method of the Msvm_VirtualSystemSnapshotService class
 

Sincerely,
Pushkar

- Issue began after upgrading from 11.23.
- The issue starts by reporting partmgr event ID 58 and Disk event 158 - These events claim there is a disk signature collision.
- Shortly after the above starts, services start becoming unresponsive. The only way to get services running again is to reboot the host.
- Disk queue doesn't appear to be affected at all.
- We do see what appears to be abnormally high disk counts in some of the partmgr errors. One such error stated the disk signature issue was happening between disk 248 and disk 5.
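
The colliding disk pairs in these events can be pulled out of exported event text for a quick overview. A minimal Python sketch, assuming the messages were exported as plain strings and use the wording quoted earlier in this thread; the helper name `collision_pairs` is hypothetical, and the exact wording of the disk 248 message is assumed to match the disk 208 one.

```python
import re

# Wording taken from the event text quoted in this thread.
SIGNATURE_COLLISION = re.compile(
    r"disk signature of disk (\d+) is equal to the disk signature of disk (\d+)",
    re.IGNORECASE,
)

def collision_pairs(messages):
    """Return sorted, de-duplicated (new_disk, existing_disk) pairs."""
    pairs = set()
    for message in messages:
        for new_disk, existing_disk in SIGNATURE_COLLISION.findall(message):
            pairs.add((int(new_disk), int(existing_disk)))
    return sorted(pairs)

events = [
    "The disk signature of disk 208 is equal to the disk signature of disk 5.",
    "Disk 208 has the same disk identifiers as one or more disks connected to the system.",
    # Assumed wording; the thread only says the collision was between disk 248 and disk 5.
    "The disk signature of disk 248 is equal to the disk signature of disk 5.",
]
print(collision_pairs(events))  # [(208, 5), (248, 5)]
```

The highest disk number in the result gives a rough idea of how many disks were attached to the host when the event was logged.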


Nothing has moved yet on the Microsoft side. They are looking into the logs. So far, this is what we know:

  • Once the backups complete, the checkpoint merge operation fails and the Hyper-V console shows no snapshots. If you look at the configuration of the VM, it still points to an AVHDX file. This leads to a delta build-up, in some cases more than 30 AVHDX files. When there are that many AVHDX files, Commvault, as a pre-backup step, tries to mount all these snapshots as disks (to scan for blocks that will be excluded from the backup, such as page files). This puts stress on the Disk Management service, leading the server to go into a non-responsive state and eventually crash.
  • Microsoft is now investigating why the merge operation is failing and a possible fix for it.
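
Until the merge failure is fixed, it may help to spot VMs whose AVHDX chain is building up before the next backup window. A minimal Python sketch, assuming each VM keeps its virtual disks in its own folder under a common root; the function names are hypothetical, and the default threshold of 30 simply mirrors the figure mentioned above.

```python
from collections import Counter
from pathlib import Path

def avhdx_counts(root):
    """Count .avhdx files per directory under root (case-insensitive)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() == ".avhdx":
            counts[str(path.parent)] += 1
    return counts

def long_chains(root, threshold=30):
    """Return folders holding more than `threshold` AVHDX files."""
    return {
        folder: count
        for folder, count in avhdx_counts(root).items()
        if count > threshold
    }
```

Calling `long_chains()` on the folder that holds the virtual hard disks returns the folders worth checking for a failed merge. It only counts files and does not inspect the actual parent/child chain of each disk.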

I will post further updates here.

 


Thanks for the update, @AbdulWajid !  I’ll keep an eye out.

