Solved

Virtualize Machine stuck at 90%


Userlevel 2
Badge +7

I’ve been fighting with P2V’ing a CentOS machine for over a week. The first phases finish entirely within 30 minutes, but the job hangs at the Post Virtualize phase, waiting for something the logs don’t name. I have a support case open on this, but dev seems stumped, as do I. Plus, the way this was written, the job can’t be suspended, so after some amount of time, even though my machine looks fully virtualized AND functional to me, it destroys it all when the job eventually times out. It has done this several times already and is easily reproduced. I cannot believe nobody else has ever run into this, or have they? I told our folks I could expedite P2V’ing 100 workstations, but now I have egg on the old face. 😡

We are on 11.26.31.

thanks

Best answer by downhill 24 August 2022, 16:02

14 replies

Userlevel 7
Badge +19

Thanks for letting us know! I’ve never used this capability myself. If development is already hooked in, then I assume it will be dealt with, but I can imagine you feel somewhat embarrassed that it doesn’t work as expected.

Userlevel 7
Badge +23

I believe that at the 90% mark the machine, now running as a VM, should run a command that connects to the CommServe to let it know that it’s online (SimCallWrapper). That mostly relies on networking - having correct DNS, being connected to the right adapter in vCenter, having all ports unblocked, or having the correct network rules set up to communicate on restricted ports, etc.
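
If it helps, here is a minimal sanity-check sketch you could run from the restored VM. The CommServe name below is a placeholder, and 8400/8403 are only the default Commvault CVD/tunnel ports - adjust both for your environment and firewall rules:

```python
# Minimal sketch: check DNS (forward and reverse) and basic TCP reachability
# from the restored VM to the CommServe. Hostname and ports are assumptions -
# substitute your own CommServe FQDN and whatever ports your network rules use.
import socket

COMMSERVE = "commserve.example.com"   # placeholder - use your CommServe FQDN
PORTS = [8400, 8403]                  # default CVD / tunnel ports; adjust if restricted

ip = socket.gethostbyname(COMMSERVE)
print(f"{COMMSERVE} resolves to {ip}")

# Reverse lookups that return unexpected names are a classic cause of registration trouble
try:
    print(f"reverse lookup of {ip}: {socket.gethostbyaddr(ip)[0]}")
except socket.herror:
    print(f"no PTR record for {ip}")

for port in PORTS:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        status = "open" if s.connect_ex((ip, port)) == 0 else "blocked/closed"
        print(f"tcp {ip}:{port} -> {status}")
```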

Before the machine is killed and deleted when the job fails (times out) you can clone it in vCenter to investigate further - perhaps even trying to register it as a normal client with an agent to diagnose the steps that have to be taken, and give clues as to why the SimCallWrapper function is not hitting the CommServe. The logs on the client will give the biggest clue - cvd, cvfwd and simcallwrapper most likely.

Of course I don't know what happened in the actual case, but that is my general observation from what you described. 

Userlevel 7
Badge +19

@downhill I think you might be hitting an issue that I have experienced as well but in my case it was happening with the FREL image. 

Assuming an access node is involved, you could try the following reg key on the proxy/access node machine: bCleanupVmOnRestoreFailure (DWORD) = 0

Now try running the job again and see if it keeps on running without being removed. If that is the case, then access the client and look into the Commvault logs. Also note which Commvault client version you are running now.
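
If your access node is a Windows proxy, a minimal sketch of dropping that key straight into the registry could look like the one below. The Galaxy\Instance001 path is an assumption for a default single-instance install - verify it on your node, or just add the setting through the console’s Additional Settings instead:

```python
# Sketch only: create bCleanupVmOnRestoreFailure = 0 (DWORD) on a Windows
# access node. The registry path assumes a default Instance001 install -
# check your own node, or add the setting via the GUI's Additional Settings.
# Needs to be run as Administrator.
import winreg

CATEGORY = r"SOFTWARE\CommVault Systems\Galaxy\Instance001\VirtualServer"  # assumed path

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, CATEGORY, 0, winreg.KEY_WRITE) as key:
    winreg.SetValueEx(key, "bCleanupVmOnRestoreFailure", 0, winreg.REG_DWORD, 0)

print("Key set; Commvault services on the node may need a restart to pick it up.")
```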

Can the client access the Internet?

Userlevel 2
Badge +7

I have cloned the machine to bring it back to a functional state. Of course, when you clone a VM, the resulting copy is not exactly like the source.

Which Category is the bCleanupVmOnRestoreFailure key for?

thanks guys

Userlevel 7
Badge +23

I have cloned the machine to bring it back to a functional state. Of course, when you clone a VM, the resulting copy is not exactly like the source.

Which Category is the bCleanupVmOnRestoreFailure key for?

thanks guys

It goes on the proxy; the category is VirtualServer and the type is Boolean.

Userlevel 7
Badge +19

Are you sure, @Damian Andre? I was told it should be the following:

Name: bCleanupVmOnRestoreFailure
Category: VirtualServer
Type: Integer
Value: 0

Userlevel 2
Badge +7

Doesn’t seem to like Integer. I just re-ran the Virtualize Me, and 2.5 hours after kicking it off, it nukes my VM even though, once again, the machine looks fully functional to me and has no leftover 1-Touch mounts or anything else out of place that I can see; it just says “loss of control process NA”. There are zero dropped or denied ports between the machines, and I can see traffic (what looks like keepalives) periodically between them, so I am sure the issue isn’t “they can’t communicate”. I may try the Boolean type and see how that works.

Userlevel 2
Badge +7

Neither works - this is so frustrating. Dev seems like they’re baffled as well. I simply can’t believe it.

Well, thanks for the tips, guys, but I’ve now blown well over a week on this and P2V’d the same machine at least 10 times, only to have the thing get nuked 2.5 hours after kicking the job off. Faith no more. 😔

Userlevel 7
Badge +23

@downhill , can you share the case number?  I want to track it.

Userlevel 2
Badge +7

Sure Mike. 220811-606

They gave me this key to try to avoid the nuking at the end:

Name: SkipDeleteVm
Category: VirtualizeMe
Type: Integer
Value: 1

I am re-running this and will report back on whether it works. I’m thinking it might.

thanks

Userlevel 2
Badge +7

@Onno van den Berg and @Damian Andre - the correct key is what I previously posted. This prevents the VM from being destroyed even if the job fails. Awaiting dev’s analysis of the most recent attempt.

thanks guys

Userlevel 7
Badge +19

Pity this key is hidden and not listed in the additional settings database… Anyway, I’m curious to hear what the actual issue is that is causing the job to stall.

Userlevel 7
Badge +23

@Onno van den Berg and @Damian Andre - the correct key is what I previously posted. This prevents the VM from being destroyed even if the job fails. Awaiting dev’s analysis of the most recent attempt.

thanks guys

Looks like engineering is waiting for the logs from the restored machine to troubleshoot further, now that the VM is not being deleted. They will probably have to be collected manually - as I suggested earlier, I think this machine is not able to communicate with the CommServe to report restore success. The logs from the restored machine should help figure out why - they should be in /var/log/commvault/Log_Files.
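
If it’s useful for collecting them, here’s a rough sketch assuming that default Linux log location; the log name patterns are just my guess at the most relevant files (cvd, cvfwd, SimCallWrapper), so adjust to whatever support actually asks for:

```python
# Rough sketch: bundle the most likely relevant Commvault logs from the restored
# Linux VM into a single tarball for support. Log directory and file name
# patterns are assumptions - adjust them to what is actually on the machine.
import glob
import os
import tarfile

LOG_DIR = "/var/log/commvault/Log_Files"           # default location mentioned above
PATTERNS = ["cvd*", "cvfwd*", "*SimCallWrapper*"]  # guessed names; case may differ

files = sorted({f for p in PATTERNS for f in glob.glob(os.path.join(LOG_DIR, p))})

with tarfile.open("/tmp/cv_restore_logs.tar.gz", "w:gz") as tar:
    for path in files:
        tar.add(path, arcname=os.path.basename(path))
        print(f"added {path}")

print("Bundle written to /tmp/cv_restore_logs.tar.gz")
```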

Userlevel 2
Badge +7

I guess I forgot to put the results here. Basically, I ran out of time to “play” with this particular machine, as it was “mine”, and I needed to abandon the experiment. No, they could never tell me why the full restore couldn’t complete; each time I provided logs, it apparently was never enough. The key did keep it from getting destroyed, but eventually my management told me to pull the plug on this PoC effort and they went a different route. I suspect it was because the support team had done all sorts of alias business in DNS, so reverse lookups were returning names which didn’t match what was expected, even though they pointed to the same machine. Oh well. Thanks, guys.
