Solved

Restore speed VMware

  • 17 September 2021
  • 32 replies
  • 3200 views

Userlevel 2
Badge +6

Hi All.

 

I am having issues with restore speed when restoring a VMware server.

 

When a restore is done with “vCenter Client” set to the vCenter, restore speeds are slow.

When a restore is done with “vCenter Client” set directly to the ESXi host, bypassing the vCenter, we see roughly four times the restore speed.

 

Can anyone explain this behavior? I thought the vCenter was only used for control data, not data movement.

 

Regards

-Anders


Best answer by Mike Struening RETIRED 7 January 2022, 23:12


32 replies

Userlevel 7
Badge +19

Exactly! And make them pro-active instead of re-active. So to be clear, I do understand the change to the sfile size and the sealing of the DDB. Improvements are being made, but I would expect Commvault to be pro-active in informing customers. I honestly doubt this is HyperScale specific, so if you can get some details on why they stated that, it would be great. And last but not least, the DataMoverLookAheadLinkReaderSlots setting: what is the relation between this key and a restore? From what I know it really only affects reads from the DDB, while a recovery should not use the DDB at all. The only thing I can think of is that, besides DDB lookups, it also affects the number of segments read at the same time from sfiles, but that is not documented.

Userlevel 7
Badge +23

I’m asking similar questions myself.  Ideally, if such a setting can be detected as beneficial, there should be some notice advising you to seal the store after setting the value (automating the latter part, perhaps).

I’ll reach out to the right people and see what we can start working on here.

Userlevel 7
Badge +19

You are not making it very easy, @Mike Struening, because it's exactly those factors I'm interested in. The issue I have with it is that you only run into these kinds of situations at the moment you need Commvault to be there, and this directly impacts your RTO. As we both know, not many customers run automated, frequent recovery tests. Based on your feedback, and after reading back through the entire thread, I came to the following conclusion; please correct me if I'm wrong:

  • User performed VM-level backups and tried to recover the VM.
  • Recovery was “slow”.
  • Advice from development was to set a key to increase the size of the sfiles and to seal the DDB. The default has been 1 GB since version X, and most likely the user was still running an older version that did not have this value set yet.
  • DDB was sealed.
  • Recovery was much faster after also applying DataMoverLookAheadLinkReaderSlots, which changes the number of slots read from the DDB at the same time to reduce overhead.
  • According to support, this only affects HyperScale installations.

Now I'm curious why they needed to apply the DataMoverLookAheadLinkReaderSlots setting, because it should only affect aux copies. We had the setting in place ourselves in the past to speed up tape-out copies.

The user reported much faster VM-level restores, and SQL restores were much faster as well. My takeaway from this thread is that there is a very good chance that customers who have been running HyperScale for a long time, with storage policies that use long-term retention, should consider sealing their older DDBs to get much better restore performance (it would be nice to have some guidance on which DDBs might be affected).



 

Userlevel 7
Badge +23

I spoke to my colleague who explained that it is definitely helpful for some people, but not necessarily everyone.  It’s designed to increase the size of the sfiles to reduce read times (for what would instead be many smaller files).

Depending on various factors like retention, it may not be beneficial for everyone, though when it is then it works very well.

Userlevel 7
Badge +19

Ok. I'm really keen to learn what specific circumstances in this case made it necessary to put this setting in place.

Hope we get an answer soon! 

Userlevel 7
Badge +23

@Onno van den Berg my understanding is that this was specific to this exact issue and not an overall suggestion for all.

I’ll reach out to the engineer who suggested the key and confirm the above.

If not, I’m with you.  Something that can benefit all should be standard (which is something we do request from dev quite regularly).

Userlevel 7
Badge +19

I'm puzzled to hear this setting (DataMoverLookAheadLinkReaderSlots) is still needed, and that there might be a lot of room for customers to improve their recovery times without knowing it.

What were the specific conditions that made development come up with these additional settings as a possible workaround, e.g. which type of customer should be targeted for these settings to be applied?

Sure, each and every customer environment is different, but depending on factors like storage type there are still settings that can improve recovery speed, which imho is the most important reason we have Commvault in place. So when can we expect this fine-tuning to be automated? I would envision a capability within Commvault to perform (automated) dummy backup/restore operations. The results could then be used to fine-tune performance-related settings in the background to deliver the most optimal backup and, most importantly, recovery experience. That would deliver a solution beneficial to all Commvault customers and would also give development valuable benchmark information to further optimize the experience.

Userlevel 7
Badge +23

Here's the setting:

Key: DataMoverLookAheadLinkReaderSlots

Category: MediaAgent

Value <integer>: 1024

 

Add the key to all nodes, then retry the restore.

 

============================

 

If performance is still slow, I would suggest opening a support incident to confirm whether the following keys are necessary:

 

MediaAgent/MaxSFileContainerSize: set it to 1073741824

MediaAgent/MaxSFileContainerItems: set it to 8192

 

However, these keys require sealing the deduplication database, which in turn requires a significant amount of free space.
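To make the effect of MaxSFileContainerSize a bit more concrete, here is a rough back-of-the-envelope sketch in Python (illustrative only, not a Commvault tool; the smaller “legacy” container size and the per-open latency are hypothetical placeholders, and deduplication means a real restore may touch a different number of containers than this naive estimate):

# Illustrative sketch only -- not a Commvault utility.
# Estimates how many sfile containers a restore touches for a given amount of
# data, and how long the file opens alone could take. The 1 GiB value matches
# the MaxSFileContainerSize key above; the smaller "legacy" size and the
# per-open latency are hypothetical placeholders.

def sfile_open_overhead(restore_bytes, container_bytes, open_latency_s):
    """Return (containers touched, seconds spent just opening them)."""
    containers = -(-restore_bytes // container_bytes)  # ceiling division
    return containers, containers * open_latency_s

ONE_GIB = 1024 ** 3
restore_size = 500 * ONE_GIB       # example: a 500 GiB VM restore
old_size = 64 * 1024 ** 2          # hypothetical smaller legacy container size
new_size = ONE_GIB                 # MaxSFileContainerSize = 1073741824
latency = 0.03                     # hypothetical average seconds per file open

for label, size in (("old sfiles", old_size), ("new sfiles", new_size)):
    count, seconds = sfile_open_overhead(restore_size, size, latency)
    print(f"{label}: {count} containers, ~{seconds / 60:.1f} min spent opening files")

At a few hundredths of a second per open, the difference between thousands of small containers and a few hundred 1 GiB ones adds up quickly, which is the whole point of the larger container size.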

Userlevel 7
Badge +23

Thanks for the push, @Onno van den Berg! I followed up again with the engineer involved. The solution was given to her by another engineer whom I am still tracking down.

I’ll loop back once I have more details.

Userlevel 7
Badge +19

Any update on this one, @Mike Struening?

Userlevel 7
Badge +23

@Onno van den Berg, @ApK I'll reach out to the engineer and see if this is something we can easily do (and whether the number of affected customers is high or small).

This may be documented already, though I also own the Proactive Support systems, which makes this quite convenient :nerd:

Userlevel 2
Badge +6

That would be a good gesture from Commvault's side, to publish an advisory like that.

 

I might add that it did not only help restore speeds for VMware servers, but all sorts of restores.

I have daily scheduled restores of MSSQL databases running, and before the changes restores were running at around 500 GB/h. After the changes they are restoring at 2,000-3,000 GB/h depending on the load, so that's quite a change in speed.
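To put that in perspective, here is a quick illustrative calculation (the 1 TB database size is just an example, not one of my actual databases):

# Illustrative only: restore time for an example database at the throughputs
# reported above (500 GB/h before the changes, 2,000-3,000 GB/h after).
db_size_gb = 1000  # hypothetical 1 TB database

for label, rate_gb_per_h in (("before", 500), ("after (low)", 2000), ("after (high)", 3000)):
    hours = db_size_gb / rate_gb_per_h
    print(f"{label}: {hours:.1f} h ({hours * 60:.0f} min)")

That is the difference between a two-hour restore and one that finishes in well under half an hour, which matters a lot for the RTO discussion earlier in this thread.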

 

I asked that question as well, as I also have “normal” MediaAgents :-)

It's specific to HyperScale setups only; “normal” MediaAgents are not affected by these additional settings.

 

 

Userlevel 7
Badge +19

(Quoting @ApK's reply below.)

So in that case I would expect Commvault to release an advisory (and I assume they can easily extract this information from the CCID data that is uploaded to cloud.commvault.com) to inform customers that recovery performance of VM-level backups has improved as of version X, but that it requires customers to seal their DDBs.

Any idea if this is specific to HyperScale alone?

Userlevel 2
Badge +6

Hi @Onno van den Berg

 

I asked the same question in the support case, and here is the answer I got:

 

 

“Me: Was just thinking, are there any downsides to the changes we made? Why are the settings not defaults in a HyperScale setup?

 

Support: In the current release the default sfile size is 1GB. The key was added because older sfiles were written in a smaller size and new backups continued to refer to the older sfiles. The sealing forces the creation of the new sfiles in the 1GB size, thus allowing the faster performance.”

 

So my guess is, if you have had a HyperScale setup running for a couple of years like myself, you would have the same issue I experienced. But a newly installed HyperScale setup would not suffer from the smaller sfile size, as it is already using 1 GB sfiles.

 

Regards

-Anders

 

 

Userlevel 7
Badge +19

@Mike Struening Thanks for adding the case resolution, but this makes me wonder. This seems to be a generic restore situation, and if I read it correctly @ApK had to add a key and seal the DDBs.

So this will only address the recovery of new recovery points. But as this appears to be a generic situation, what can other customers expect here? What was so unique about this situation? Should other customers do the same?

Userlevel 7
Badge +23

Sharing case resolution:

  • Implemented the MaxSFileContainer keys, which helped but performance was still inconsistent
  • Issue was escalated to development
  • When using the keys, the DDB needs to be sealed
  • Answered additional concerns; after the DDB was sealed, performance is much better and consistent

Userlevel 7
Badge +23

Looks like you sealed the store and things are looking better.  You’ll monitor for a week to confirm.

I’ll keep this open for a week as well :nerd:

Userlevel 7
Badge +23

Also sharing some wise words from Alexis.

You had asked her whether the DDB is needed for restores, which it is not. Her reply:

Correct, the DDB itself is not used for restores and is only used for backups and data aging. 
However, the data is still deduplicated and there is a large number of small files (sfiles) that need to be opened in order to perform the restore. Because so many files have to be opened, the restore time increases. As we can see in the example below, of this 48-minute restore, 29 minutes were spent opening files:

J10685855/<name>.dk_logs/cvd_2021_10_14_15_24_37.log:7874 73b4 10/14 14:17:50 10685855 [DM_BASE    ] 30-# SfileRdrCtrs: [Open File] Exp Avg [0.01], Total Avg [0.03], Total Time [1768.24], Total Count [59139]

The keys increase the size of the sfiles, so fewer files need to be opened and file-open times are faster.
The seal will prevent new backups from referencing the smaller, older sfiles, which in turn should also improve restore times.

Once the seal has been performed, please run new fulls and restore from a full taken after the seal. Please let us know how the performance of that restore job looks.

If the seal cannot be performed, please let us know. 

Let me know once things are looking better (though I’ll also see how the case progresses).

Thanks!
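For anyone who wants to check their own CVD logs for the same pattern, here is a small illustrative sketch (the regex is only a guess based on the single counter line quoted above, and the 48-minute job length comes from that same example):

import re

# Illustrative sketch only: pull "Total Time" and "Total Count" out of an
# SfileRdrCtrs "[Open File]" counter line like the one quoted above, and
# relate the time spent opening sfiles to the overall job duration.
line = ("... [DM_BASE    ] 30-# SfileRdrCtrs: [Open File] Exp Avg [0.01], "
        "Total Avg [0.03], Total Time [1768.24], Total Count [59139]")

match = re.search(r"Total Time \[([\d.]+)\].*Total Count \[(\d+)\]", line)
open_time_s = float(match.group(1))
open_count = int(match.group(2))

job_minutes = 48  # restore duration from the example above
share = (open_time_s / 60) / job_minutes
print(f"{open_count} file opens took {open_time_s / 60:.0f} min "
      f"(~{share:.0%} of the {job_minutes}-minute restore, "
      f"avg {open_time_s / open_count * 1000:.0f} ms per open)")

For the sample line that works out to roughly 61% of the restore spent just opening sfiles, at about 30 ms per open.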

Userlevel 7
Badge +23

Updating the thread as per the case notes.  Development and Alexis discovered that the majority of the job duration was opening SFiles, so you have sealed the store and are monitoring.

Userlevel 7
Badge +23

Thanks!  I see you are working with Alexis….you’re in great hands!

Userlevel 2
Badge +6

Hi Mike.

Case number is: 210920-319

 

Regards

-Anders

Userlevel 7
Badge +23

@ApK , can you share the case number with me so I can track it properly?

Userlevel 7
Badge +23

(Quoting @ApK's reply below.)

Sounds good. Be sure to let us know the outcome!

Userlevel 2
Badge +6

Hi Damian.

:grinning:

Same VMware server restored, same HyperScale server as proxy; the last couple of restores have been really slow, and I just did a new one with really slow performance.

 

I have created a case now.

Userlevel 7
Badge +23

@Damian Andre, the restores were done via NBD, but thanks for your suggestion :-)

 

I hate it when my hunch is wrong :joy:

Were both restore tests from the same source job or different jobs? It would be interesting to run it again to see if the results are consistent with the last run, and if so, what the difference between the jobs is.
