Solved

Intellisnap Backup Copy kills Virtual Disk Service on Server 2016 Media Agent


Userlevel 2
Badge +3

Hi

Been plagued with this problem for a while. Support has not been able to crack it yet either. I have 4 Media Agents; 2 are using CBT with the crash-consistent backup option for IntelliSnap and work fine. The other 2 Media Agents are also using CBT but with the application-consistent (quiesced) backup option for IntelliSnap, and these Media Agents will sometimes freeze up the Virtual Disk Service, after which the backup copy fails for the remainder of the backup copies. I cannot open the process manager on the MAs when this happens and end up having to reboot the MA to get things back on track. Not a lot of info, but chucking this one out there to see if anyone has had a similar experience.

 


Best answer by s3narasi 12 March 2021, 06:48


39 replies

Userlevel 7
Badge +23

Hey @Neil Cooper !  I see you have a case opened for this, so I’ll get some top brains on it for you.  I also want to make sure we circle back and share the solution for posterity.

If anyone in the community has ideas, please do share for Neil!

Userlevel 2
Badge +3

Thanks. CV is top notch WRT support. I’m trying again to see if they can assist but in the CV logs there is not much to go on which is why I’m reaching out here as well.

 

Cheers

 

 

Userlevel 7
Badge +23

Appreciate that!  I already reached out to the Leadership team that owns your case and said we need a top mind here :nerd:

Badge +13

Hi @Neil Cooper I have a few questions. Are these media agents physical or virtual? Would you be able to get one VM onto a stable VMware Tools version? Please make sure you are not using version 11269; the issue is reported here: https://docs.vmware.com/en/VMware-Tools/11.0/rn/VMware-Tools-1105-Release-Notes.html

My recommendation:

#1 - Create a new subclient, or exclude all VMs except one for testing, and make sure that VM has an up-to-date VMware Tools.

#2 - Increase the debug level for vsbkp and VixDiskLib on the media agent side to maybe 3, with File Size at 10 MB and File Versions at 5. You can revert this back to default once the test is complete.

#3 - Start a backup for the one VM that you have seen failing before.

#4 - Use GxTail to open both logs and filter by “Commvault Failures and Successes”.

 

When you are using Application Consistent, I’d make sure that VMware Tools is a stable version and not a problematic one like 11269. Post the results here.

Userlevel 6
Badge +13

Hi Neil,

Have you checked that Automount is disabled and the SAN policy is OfflineShared on the affected Media Agents?
Is there any AV on the Media Agents that could be scanning the attached disks or interfering with the CV Processes?

I’d also check the vsbkp, VixDiskLib and Event logs while the job is running, or from just before the issue occurs, to get an idea of what is happening.

Best Regards,

Michael
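
For anyone who wants to check the two settings above on several Media Agents, here is a minimal Python sketch. It assumes an elevated prompt on Windows and that `diskpart` reports the settings with the English wording shown in the sample text (the exact wording may differ by Windows version or locale). The helper names `parse_diskpart_settings` and `query_diskpart` are made up for illustration and are not Commvault tooling.

```python
import re
import subprocess
import tempfile

# Sample text in the form diskpart is ASSUMED to print for "automount" and "san".
SAMPLE_OUTPUT = """
Automatic mounting of new volumes disabled.

SAN Policy  : Offline Shared
"""


def parse_diskpart_settings(output):
    """Extract the automount state and SAN policy from diskpart output text."""
    automount = re.search(
        r"Automatic mounting of new volumes (enabled|disabled)", output, re.IGNORECASE
    )
    san = re.search(r"SAN Policy\s*:\s*(.+)", output, re.IGNORECASE)
    return {
        "automount": automount.group(1).lower() if automount else None,
        "san_policy": san.group(1).strip() if san else None,
    }


def query_diskpart():
    """Run diskpart with a two-line script (Windows only, needs admin rights)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as script:
        script.write("automount\nsan\n")
    result = subprocess.run(
        ["diskpart", "/s", script.name], capture_output=True, text=True, check=True
    )
    return result.stdout


# Parse the sample text; on a real Media Agent, pass query_diskpart() instead.
settings = parse_diskpart_settings(SAMPLE_OUTPUT)
print(settings)
```

On a Media Agent you would call `parse_diskpart_settings(query_diskpart())` and compare the result against the expected values (automount disabled, SAN policy Offline Shared).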

Userlevel 2
Badge +3

Appreciate the response. All MAs are over-specced physical machines. No VMs in the CV hardware.

Replies to the recommendations:

  1. It could be a VMware Tools issue on some of the machines, as some are out of date, although this issue does not happen all the time.
  2. Will try this. The CV engineer was on the phone and set this up. Will review in the morning.
  3. The snapshot of the datastore finishes. When the backup copy runs for a random datastore on the schedule, sometimes the VMs just get stuck in waiting and none of the backup copies will run after that. I have to properly reboot the MA in order to get the backup copies to run and finish off the IntelliSnap workflow.
  4. Going through the logs in the morning (if we have failures) and I will report back.

Thank you for your reply.

Cheers

Neil

 

Userlevel 2
Badge +3

@MichaelCapon automount disabled and SAN Policy = Offline Shared

 

No AV

 

Tracking the logs.

 

Cheers

 

Neil

Badge +13

@Neil Cooper What storage array are you using? Can you share the Windows Event Logs from when the error happens? 

Userlevel 2
Badge +3

@dude Dell Compellent SAN.

 

Badge +13

@Neil Cooper Can you please share what the error says? Open one of the error messages and share the details from that error log with us: Event ID, description, etc. Thanks

Userlevel 3
Badge +7

Have you tried using a separate proxy (even just as a test) for the backup copy? Currently I use separate proxies for the snap and the backup copy to avoid any issues with application-consistent backups.

 

https://documentation.commvault.com/commvault/v11/article?p=62414.htm

Userlevel 2
Badge +3

@Matthew M. Magbee The proxy was hard set at the subclient level. I have removed it and will let it balance the load tonight and see how it goes.

Thank you

Neil

Userlevel 2
Badge +3

@dude 

Log Name:      System
Source:        Virtual Disk Service
Date:          3/9/2021 2:37:38 PM
Event ID:      1
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      kil-cvlt-8
Description:
Unexpected failure. Error code: 2@02000018
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Virtual Disk Service" />
    <EventID Qualifiers="49664">1</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2021-03-09T18:37:38.525490600Z" />
    <EventRecordID>730396</EventRecordID>
    <Channel>System</Channel>
    <Computer></Computer>
    <Security />
  </System>
  <EventData>
    <Data>2@02000018</Data>
  </EventData>
</Event>
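
If you collect several of these events, a short Python sketch like the one below can pull out the fields that matter (provider, Event ID, level, UTC timestamp, and the error code in the data payload). The sample string is the event XML posted above; the function name `summarize_event` is just an illustrative choice, and the sketch assumes the standard Windows event schema namespace shown in the XML.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the posted event.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Virtual Disk Service" />
    <EventID Qualifiers="49664">1</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2021-03-09T18:37:38.525490600Z" />
    <EventRecordID>730396</EventRecordID>
    <Channel>System</Channel>
    <Computer></Computer>
    <Security />
  </System>
  <EventData>
    <Data>2@02000018</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Return the key fields of one Windows event exported as XML."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "level": int(system.find("e:Level", NS).text),
        # Kept as a string: the timestamp has more fractional digits
        # than some datetime parsers accept.
        "time_utc": system.find("e:TimeCreated", NS).get("SystemTime"),
        "data": [d.text for d in root.findall("e:EventData/e:Data", NS)],
    }


summary = summarize_event(SAMPLE_EVENT)
print(summary)
```

Running the same function over every Virtual Disk Service event exported from the System log makes it easier to line the timestamps up against the backup copy schedule.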

Userlevel 2
Badge +3

Morning

Setting the proxy at the client level versus the subclient level seems to have load balanced the backup copy much better. A “sea of green” for the IntelliSnaps last night. It’s only one night, but I will monitor and report back. Ben from CV support is really digging into this issue for me. Went through a lot yesterday.

More to follow….

Userlevel 7
Badge +23

That Ben is a good egg!  Keep us posted on your results :-)

Userlevel 2
Badge +3

So setting the proxies at the instance level did not work on the second night. The troubled MA was allocated most of the load and froze up the VDS and the backup copies failed. Back to square one.

Thanks

Badge +13

@Neil Cooper 

As I research this issue online, I see some other very similar cases. What I’m about to share may or may not be connected to your issue, so take all of this with a grain of salt.

From reading this document, it is my understanding that Dell requires its own software integration for application-consistent snapshot backups.

Page 52

“If additional server protection is desired by capturing application-consistent VSS-integrated Replays of Hyper-V guest VMs, please refer to the Dell Compellent Replay Manager 6 Users Guide. Dell Compellent Replay Manager 6 is able to leverage Microsoft VSS to take application-consistent (IO is paused) Replays of Hyper-V guests, Exchange servers, and SQL servers.” 

And then I found this thread that refers to Dell Compellent, which led me to this blog post that says:

Page 3 and 4

“Backup environment set up we need to configure the Compellent SAN LUNs and the Replay Manager software to make sure the hardware VSS provider is used and the transportable snapshots can be presented to the Off host backup proxy server.”

“You need to install the Compellent Replay Manager Service on - - - - - - backup proxy server. Do note that you need a license for this software, which is needed later on when you configure this service on the hosts to interact with the Compellent SAN.”

 

So again, I’m not an expert in Dell Compellent, but after reading some of these docs I do wonder whether you have the software configured with VSS on your media agents. Would you be able to review the links and confirm here?

 

Thank you

Badge

@Neil Cooper I think we are talking about VMware backups, and you are seeing the VDS hang when it is loaded. How many datastores are being presented when there is heavy load?

Are there any Dell Compellent tools installed on the proxy server? Can we also confirm whether there are a lot of leftover phantom or hidden devices? To find them, use the link below:

https://support.microsoft.com/en-us/topic/device-manager-does-not-display-devices-that-are-not-connected-e7148232-40ae-bb07-0077-88f2e859b53f

 

Or you can use the tool below to show and clean up the phantom devices (devnodeclean /n):

https://www.microsoft.com/en-us/download/details.aspx?id=42286 

Badge +13

A few other things.

Dell Compellent is deprecated in the Commvault software as of V11 SP12. Source: https://documentation.commvault.com/commvault/v11_sp16/article?p=33107.htm

It does require Data Instant Replay licensing, as stated above. Source: https://documentation.commvault.com/commvault/v11_sp16/article?p=33107.htm

The Commvault docs say that it should not require any additional Dell Compellent software, as per the System Requirements.

There is an old document from Dell, CommVault Simpana 10 Best Practices for the Dell Compellent Storage Center. Although the articles in my previous post mention the Compellent Replay Manager Service requirement on Hyper-V, it raises the question of whether you have the Dell Compellent Replay software or any other software installed that may be interfering with the way IntelliSnap takes the snaps and mounts them on your Media Agent during a snap mount.

In any case, if you do have it installed, check the version, as some older versions may not integrate well (or integrate at all) with Commvault. From what I could find, it seems version 8.0.1 is the latest.

 

If you do have Replay Manager, make sure you are using the latest version, as it fixes issues with the VSS provider. If you have the latest version and it still does not work, try removing it completely and retrying the operation.

 

 

Userlevel 2
Badge +3

Hi

I have read all the replies. Thank you. I was in the DC yesterday and did not reply, but I will do my best once I have had a chance to try some more things with Ben from CV support. Currently, spreading out the schedules has helped. I’m seeing an error about multiple LUNs causing issues during the snapshot mount on the Compellent. The cleanup tool mentioned that needs to be installed on the MA might be important here, as we are still on V11 SP16. I can see the dead LUNs in VMware, and when I do a rescan of storage they go away:

OS mount failed : [VMWare Mount snapshot failed [0xEC02ECC3:{VMwareSnapOSUtil::MountSnap(1841)/MM.60611-Error mounting the snapshot LUNs because there are multiple copies present.}] (MM.60611)]

Badge +13

Hi, to me the fact that you see multiple LUNs and dead paths has a lot to do with the MPIO software and the Dell Compellent Software I mentioned above. Check it out and let us know when you have a chance. Enjoy your weekend. 

Userlevel 2
Badge +3

@s3narasi Output from the DevNodeClean

 

Userlevel 2
Badge +3

@s3narasi 

 

Userlevel 2
Badge +3

Update

 

So I have gone back to manually setting the proxy at the subclient level, spread equally among the datastore IntelliSnap backups. CV has suggested we remove the snapshot option to collect file details for the snapshot copy. This has changed our snapshot timings from close to an hour (incremental) down to mere minutes. I’m not running any Compellent software on the MAs, but I did run the Microsoft tool to remove old Compellent registry entries.

We are running CV V11 SP16 and still using the deprecated Dell Compellent option for the SAN snap vendor and the datastore backups. Maybe this is part of the issue? Also, the MAs and the CommServe are behind in HPK, so updating might also solve some issues.

It seems like a throughput issue on 2 of the 4 MAs, because the only issue I experienced all weekend was when 2 schedules overlapped. I spread them out by an hour and it was all green again last night for the snap and backup copies. At most we should have 3 datastores (20 VMs each) running. Please see the greyed-out options for Compellent.
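
Since overlapping schedules appear to be the trigger, a quick way to spot overlaps before they run is to compare the expected windows pairwise. The Python sketch below does that; the datastore names, start times, and durations are made-up examples rather than values from this environment, and `overlapping_pairs` is a hypothetical helper, not a Commvault feature.

```python
from datetime import datetime, timedelta
from itertools import combinations


def overlapping_pairs(windows):
    """Return sorted name pairs whose (start, end) windows overlap.

    Windows that merely touch (one ends exactly when the other starts)
    are not counted as overlapping.
    """
    pairs = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(
        sorted(windows.items()), 2
    ):
        if a_start < b_end and b_start < a_end:
            pairs.append((a, b))
    return pairs


# Hypothetical schedule: name -> (start, expected end).
day = datetime(2021, 3, 15)
schedule = {
    "DS1": (day + timedelta(hours=20), day + timedelta(hours=21, minutes=30)),
    "DS2": (day + timedelta(hours=21), day + timedelta(hours=22)),
    "DS3": (day + timedelta(hours=23), day + timedelta(hours=23, minutes=45)),
}
print(overlapping_pairs(schedule))

# Push DS2 back by an hour, as was done manually with the real schedules.
shift = timedelta(hours=1)
staggered = dict(schedule)
staggered["DS2"] = (schedule["DS2"][0] + shift, schedule["DS2"][1] + shift)
print(overlapping_pairs(staggered))
```

With these sample values the first call reports DS1 and DS2 as overlapping, and the second reports no overlaps after the one-hour shift. The expected end times would have to come from observed job durations.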

 

 

Badge +13

I’m quite unsure what your thought process is here, or even what your questions are. I previously shared some links that point to best practices as well as software for the Media Agents. Did you have a chance to look at those?
