Solved

Restore speed vmware

  • 17 September 2021
  • 32 replies
  • 1570 views

Userlevel 2
Badge +5

Hi All.

 

I am having issues with restore speed when restoring a vmware server.

 

When a restore is done with “vCenter Client” set to the vCenter, restore speeds are slow.

When a restore is done with “vCenter Client” set directly to the ESXi host, bypassing the vCenter, restores are about four times faster.

 

Can anyone explain this behavior? I thought that the vCenter was only used for control data and not for data movement.

 

Regards

-Anders


Best answer by Mike Struening 7 January 2022, 23:12



Userlevel 6
Badge +14

Hi @ApK ,

The vCenter should only be used for control data here, such as create VM, create VM Snap, etc.

What transport method was used for both jobs here? Was the same disk provisioning used also?

 

In the vsrst.log on the VSA Proxy used for restore, you should see counters under Stat-. These should give a good indication of the media read and disk write speeds here.

I’d suggest reviewing the logs for both jobs and comparing them. There may have been an operation that took longer, or a difference in speeds for some reason. Hopefully the log will give more insight into this!

 

Best Regards,

Michael

Userlevel 2
Badge +5

Hi Michael.

Thanks for your reply.

 

That was my own thought, that it was only using the vCenter for control data. That’s why I’m wondering what is happening here.

 

I’m using nbd for the restores and thin provisioning disks.

 

I have run 10 tests this morning, and all restores via the ESXi host directly are 3-4 times faster.

 

I checked the vsrst.log file, and the MediaAgent read speeds are fast, so that is not the issue for sure. The issue is that the vCenter is somehow involved in the restore.

 

Regards

-Anders

Userlevel 6
Badge +14

Thanks @ApK ,

Would you be able to share the vsrst.log and the Job IDs of the vCenter and ESX restores?

 

Best Regards,

Michael

Userlevel 2
Badge +5

Hi Michael.

 

Would it be better to raise a case for this issue, to further investigate?

 

Thanks

-Anders

Userlevel 6
Badge +14

Hi @ApK ,

 

Yes, you can raise a case for this. We’ll need the logs and the Job IDs to check it further.

Once raised, let us know the case number and we can monitor it internally.

 

Best Regards,

Michael

Userlevel 7
Badge +18

Hey folks,

This sounds like a textbook case of “clear lazy zero” if you are doing SAN restores - article here:

https://documentation.commvault.com/11.24/expert/32721_vmw0074_san_mode_restores_slow_down_and_display_clear_lazy_zero_or_allocate_blocks_vmware.html

 

I was writing the description but the KB article sums it up well

Userlevel 2
Badge +5

Hi Michael.

I have run some more restore tests, and the vsrst.log for these restores shows some crazy differences in readmedia speed, so I might have a different issue than I thought in the beginning, when I assumed it was a vCenter issue.

 

from the vsrst.log file:

Same vmware server restore, same HyperScale server, two different dates, very different read speeds.

 
09/17 10:31:52 10456669 stat- ID [writedisk], Bytes [90393542656], Time [702.082991] Sec(s), Average Speed [122.786054] MB/Sec
09/17 10:31:57 10456669 stat- ID [readmedia], Bytes [83501234239], Time [43.402569] Sec(s), Average Speed [1834.752742] MB/Sec
09/17 10:31:58 10456669 stat- ID [Datastore Write [SN771-D2250-L0009]], Bytes [91067777024], Time [708.095798] Sec(s), Average Speed [122.651483] MB/Sec
 
 
09/20 12:05:19 10481089 stat- ID [readmedia], Bytes [152791654482], Time [5328.833140] Sec(s), Average Speed [27.344350] MB/Sec
09/20 12:05:21 10481089 stat- ID [Datastore Write [SN771-D224E-L0008]], Bytes [162756820992], Time [1126.319442] Sec(s), Average Speed [137.809039] MB/Sec
09/20 12:05:21 10481089 stat- ID [writedisk], Bytes [162756820992], Time [1126.369482] Sec(s), Average Speed [137.802917] MB/Sec
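
For anyone who wants to compare these counters across jobs, here is a minimal Python sketch that parses the stat- lines quoted above and reports the counter with the lowest average speed per job. The regular expression is inferred only from the six lines in this post, so it is an assumption about the log format and not an official vsrst.log parser. The function names are made up for illustration.

```python
import re

# The six "stat-" lines copied from the vsrst.log excerpt in this post.
SAMPLE = """\
09/17 10:31:52 10456669 stat- ID [writedisk], Bytes [90393542656], Time [702.082991] Sec(s), Average Speed [122.786054] MB/Sec
09/17 10:31:57 10456669 stat- ID [readmedia], Bytes [83501234239], Time [43.402569] Sec(s), Average Speed [1834.752742] MB/Sec
09/17 10:31:58 10456669 stat- ID [Datastore Write [SN771-D2250-L0009]], Bytes [91067777024], Time [708.095798] Sec(s), Average Speed [122.651483] MB/Sec
09/20 12:05:19 10481089 stat- ID [readmedia], Bytes [152791654482], Time [5328.833140] Sec(s), Average Speed [27.344350] MB/Sec
09/20 12:05:21 10481089 stat- ID [Datastore Write [SN771-D224E-L0008]], Bytes [162756820992], Time [1126.319442] Sec(s), Average Speed [137.809039] MB/Sec
09/20 12:05:21 10481089 stat- ID [writedisk], Bytes [162756820992], Time [1126.369482] Sec(s), Average Speed [137.802917] MB/Sec
"""

# Assumed line layout, inferred from the sample above only.
STAT_RE = re.compile(
    r"(?P<job>\d+) stat- ID \[(?P<counter>.+)\], Bytes \[(?P<bytes>\d+)\], "
    r"Time \[(?P<secs>[\d.]+)\] Sec\(s\), Average Speed \[(?P<speed>[\d.]+)\] MB/Sec"
)


def parse_stats(lines):
    """Return {job_id: {counter_name: average_speed_in_mb_per_sec}}."""
    stats = {}
    for line in lines:
        match = STAT_RE.search(line)
        if match:
            job_counters = stats.setdefault(match["job"], {})
            job_counters[match["counter"]] = float(match["speed"])
    return stats


def slowest(counters):
    """Return the (counter_name, speed) pair with the lowest average speed."""
    name = min(counters, key=counters.get)
    return name, counters[name]


if __name__ == "__main__":
    for job, counters in parse_stats(SAMPLE.splitlines()).items():
        name, speed = slowest(counters)
        print(f"job {job}: slowest counter is {name} at {speed:.1f} MB/s")
```

On the sample, the 09/17 job is limited by the datastore write side, while the 09/20 job is limited by readmedia. The sketch only compares the average speeds as logged; it does not know whether the counters overlap in time.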

 

I will create a case to have this investigated.

 

@Damian Andre, the restores were done via NBD, but thanks for your suggestion :-)

 

Regards

-Anders

Userlevel 7
Badge +18

I hate it when my hunch is wrong :joy:

Were both restore tests from the same source job or different jobs? Would be interesting to run it again to see if they are consistent with the last run, and if so what the difference between the jobs is

Userlevel 2
Badge +5

Hi Damian.

:grinning:

Same VMware server restored, same HyperScale server as proxy. The last couple of restores have been really slow, and I just ran a new one with really slow performance.

 

I have created a case now.

Userlevel 7
Badge +18

Sounds good. Be sure to let us know the outcome!

Userlevel 7
Badge +23

@ApK , can you share the case number with me so I can track it properly?

Userlevel 2
Badge +5

Hi Mike.

Case number is: 210920-319

 

Regards

-Anders

Userlevel 7
Badge +23

Thanks!  I see you are working with Alexis….you’re in great hands!

Userlevel 7
Badge +23

Updating the thread as per the case notes.  Development and Alexis discovered that the majority of the job duration was opening SFiles, so you have sealed the store and are monitoring.

Userlevel 7
Badge +23

Also sharing some wise words from Alexis.

You had asked her about the need for a DDB in restores, of which there is none.  Her reply:

Correct, the DDB itself is not used for restores and is only used for backups and data aging. 
However, the files are still deduplicated, and a large number of small files (sfiles) need to be opened to perform the restore. Opening that many files increases the restore time. In the example below, 29 minutes of this 48-minute restore were spent opening files:

J10685855/<name>.dk_logs/cvd_2021_10_14_15_24_37.log:7874 73b4 10/14 14:17:50 10685855 [DM_BASE    ] 30-# SfileRdrCtrs: [Open File] Exp Avg [0.01], Total Avg [0.03], Total Time [1768.24], Total Count [59139]

The keys increase the size of the sfiles, so fewer files need to be opened and less time is spent opening files.
The seal will prevent new backups from referencing the smaller, older sfiles, which in turn should also reduce the restore time.

Once the seal has been performed, please run new fulls and restore from a full taken after the seal. Please let us know how this restore job performs.

If the seal cannot be performed, please let us know. 
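
The "29 of 48 minutes" figure can be checked from the SfileRdrCtrs line quoted above. The following is a minimal Python sketch, assuming that Total Time is reported in seconds (which is consistent with the 29 minutes mentioned) and that the line layout matches this single sample. The function name open_file_cost is hypothetical.

```python
import re

# The SfileRdrCtrs line from the cvd log quoted above (log file prefix omitted).
LINE = (
    "10/14 14:17:50 10685855 [DM_BASE    ] 30-# SfileRdrCtrs: [Open File] "
    "Exp Avg [0.01], Total Avg [0.03], Total Time [1768.24], Total Count [59139]"
)

# Assumed layout, inferred from the one sample line only.
OPEN_RE = re.compile(
    r"SfileRdrCtrs: \[Open File\].*Total Time \[(?P<total>[\d.]+)\], "
    r"Total Count \[(?P<count>\d+)\]"
)


def open_file_cost(line, restore_minutes):
    """Summarize how much of a restore was spent opening sfiles."""
    match = OPEN_RE.search(line)
    if match is None:
        raise ValueError("not an SfileRdrCtrs [Open File] line")
    total_sec = float(match["total"])  # assumption: value is in seconds
    count = int(match["count"])
    return {
        "open_minutes": total_sec / 60,
        "avg_sec_per_open": total_sec / count,
        "share_of_restore": total_sec / (restore_minutes * 60),
    }


if __name__ == "__main__":
    cost = open_file_cost(LINE, restore_minutes=48)
    print(f"{cost['open_minutes']:.1f} min spent on opens, "
          f"{cost['avg_sec_per_open']:.3f} s per open, "
          f"{cost['share_of_restore']:.0%} of the restore")
```

Under these assumptions, 59,139 opens at roughly 0.03 seconds each add up to about 29.5 minutes, or a bit over 60% of the 48-minute restore, which is why reducing the number of sfiles matters more here than raw read throughput.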

Let me know once things are looking better (though I’ll also see how the case progresses).

Thanks!

Userlevel 7
Badge +23

Looks like you sealed the store and things are looking better.  You’ll monitor for a week to confirm.

I’ll keep this open for a week as well :nerd:

Userlevel 7
Badge +23

Sharing case resolution:

Implemented the MaxSFileContainer keys, which helped, but performance was still inconsistent
Issue was escalated to development
When using the keys, the DDB needs to be sealed
Answered additional concerns; after the DDB was sealed, performance is much better and consistent

Userlevel 7
Badge +16

@Mike Struening Thanks for adding the case resolution, but this makes me wonder. This seems to be a generic restore situation and if I read it correctly @ApK had to add a key and seal the DDBs. 

So this will only address the recovery of new recovery points. But as this appears to be a generic situation, what can other customers expect here? What was so unique about this situation? Should other customers do the same?

Userlevel 2
Badge +5

Hi @Onno van den Berg

 

I asked the same question in the support case, and here is the answer I got:

 

 

“Me: Was just thinking, is there any downside to the changes we made? Why are these settings not the default in a HyperScale setup?

 

Support: In the current release the default sfile size is 1GB. The key was added because older sfiles were written in a smaller size and new backups continued to refer to the older sfiles. The sealing forces the creation of the new sfiles in the 1GB size, thus allowing the faster performance.”

 

So my guess is, if you have had a HyperScale setup running for a couple of years like myself, you would have the same issue as I experienced. But a newly installed HyperScale setup would not suffer from the smaller sfile size, as it would be using 1GB sfiles.

 

Regards

-Anders

 

 

Userlevel 7
Badge +16

So in that case I would expect Commvault to release an advisory (and I assume they can easily extract this information from CCID information that is uploaded to cloud.commvault.com) to inform customers that they managed to improve the recovery performance of VM-level backups as of version X, but that it requires customers to seal their DDBs.

Any idea if this was specific to HyperScale alone?

Userlevel 2
Badge +5

That would be a good gesture from Commvault’s side, to publish an advisory like that.

 

I might add, that it did not only help on restore speeds on vmware servers, but on all sorts of restores.

I have daily scheduled restores of MSSQL databases running, and before the changes, restores were running at around 500 GB/h. After the changes, they are now restoring at 2,000-3,000 GB/h depending on the load, so that’s quite a change in speed.

 

I asked that question as well, as I also have “normal” MediaAgents :-)

It is specific to HyperScale setups only; “normal” MediaAgents are not affected by these additional settings.

 

 

Userlevel 7
Badge +23

@Onno van den Berg , @ApK I’ll reach out to the engineer and see if this is something we can easily do (and if the number of customers is high or small).

This may be documented already, though I also own the Proactive Support systems, which makes this quite convenient :nerd:

Userlevel 7
Badge +16

Any update on this one @Mike Struening ?

Userlevel 7
Badge +23

Thanks for the push, @Onno van den Berg !  I followed up again with the engineer involved.  The solution was given to her by another engineer who I am still tracking down.

I’ll loop back once I have more details.

Userlevel 7
Badge +23

Here's the setting:

Key: DataMoverLookAheadLinkReaderSlots

Category: MediaAgent

Value<integer> : 1024 

 

Add the key to all nodes and retry the restore.

 

============================

 

If performance is still slow, I would suggest a support incident to confirm if the following keys are necessary:

 

MediaAgent/MaxSFileContainerSize set to 1073741824

MediaAgent/MaxSFileContainerItems set to 8192

 

However, these keys require sealing the deduplication database, which requires a significant amount of free space.
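
As a sanity check on the MaxSFileContainerSize value, here is a small Python sketch showing that 1073741824 bytes is exactly 1 GiB, matching the "1GB" default sfile size support mentioned earlier in the thread. It also computes a best-case count of containers for a restore of a given size. That count is only a rough lower bound under the assumption that data sits contiguously in full containers; real deduplicated restores can touch many more. The helper name min_container_opens is hypothetical, and the example byte count is the readmedia figure from the 09/20 job earlier in the thread.

```python
GIB = 1024 ** 3

# Value of MediaAgent/MaxSFileContainerSize as given in the post above.
MAX_SFILE_CONTAINER_SIZE = 1073741824


def min_container_opens(restore_bytes, container_bytes=MAX_SFILE_CONTAINER_SIZE):
    """Best-case number of containers needed to hold restore_bytes.

    Ceiling division; ignores deduplication scatter, so the real number of
    files opened during a restore can be far higher.
    """
    return -(-restore_bytes // container_bytes)


if __name__ == "__main__":
    print(MAX_SFILE_CONTAINER_SIZE == GIB)
    # readmedia byte count from the 09/20 restore quoted earlier in the thread
    print(min_container_opens(152791654482))
```

This is only arithmetic on the numbers in the thread. Whether the keys are appropriate for a given environment is still something to confirm with support, as noted above.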
