Solved

Show more accurate Restore progress indication/timeline

  • 8 February 2021
  • 19 replies
  • 1607 views

Userlevel 6
Badge +13

Hi,

We are performing several Sybase and SAP HANA restores.
One thing I noticed: the progress shown in Commvault is “completely wrong”, or at least misleading.

It remains at 5%, while the HANA and Sybase logs in Commvault show that the load is, for example, 30% complete.

Because the progress remains at 5%, people think that the restore is stuck (it can stay at 5% for hours with bigger databases), and sometimes people kill the restore because of that.

Why can’t Commvault show the real status of the restore? For Sybase, for example, Commvault’s clsybagent.log shows the exact status, so Commvault knows that status. So why not show it in the GUI/Command Center?

Even if it’s not an exact figure, having that as the job status would remove some frustration for users.

Only a question, not a request to change something :)

Best answer by Damian Andre 9 February 2021, 23:58

19 replies

Userlevel 4
Badge +7

Hey Bart,

Thanks for the question here and it really is a good one…

It’s not always the case that the application/database provides its restore/recovery status during a restore of the data; however, you have indeed given two examples where it does: Sybase provides the status of LOAD progression, and HANA provides the RECOVERY status in its backup.log (in the HANA trace directory).

Generally, all jobs within Commvault are split into phases. Each agent or data type has different phases, such as reading the index, performing data transfer, informing the application to take recovery steps, and then finishing. Using Sybase as an example, if you’re restoring just one database then you could say: sure, we should be able to track how much data has been restored, and also review the database load output that Sybase returns during the load (restore). But what happens when this LOAD is complete? In Sybase’s case specifically, it moves on to perform a redo pass on the database that has just been restored. This can take some time, but how would we then reflect this Sybase phase back into the Commvault job phase/percentage?

Then we move on to restoring multiple databases within the same restore job. It’s not more complicated as such, but the percentages of the entire job then have to be broken down even further, plus the redo of each database being restored.

If we take Microsoft SQL Server as another example: out of the box, and also via the available APIs (VDI), it offers no such way to track restore progress. There are some SQL queries that can be run to give a rough idea, but nothing official.

So this isn’t really a solution for you but an answer to your question :grin: And I certainly agree that there are some agents which could be more detailed when the job is being monitored within Command Center or the Java console.
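
To make the phase question above concrete, here is a minimal sketch of how a per-database LOAD percentage and a redo pass could be folded into one job percentage. This is not how Commvault computes progress. The `DatabaseRestore` type, the 80/20 split between LOAD and redo, and the weighting by database size are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatabaseRestore:
    name: str
    size_bytes: int    # used to weight this database within the whole job
    load_pct: float    # 0-100, as reported by the database during LOAD
    redo_done: bool    # redo reports no percentage here, so it is all-or-nothing


# Hypothetical share of one database's restore attributed to each phase.
LOAD_WEIGHT = 0.8
REDO_WEIGHT = 0.2


def database_progress(db):
    """Return 0-100 for a single database, combining LOAD and redo."""
    load_pct = min(max(db.load_pct, 0.0), 100.0)
    redo_pct = 100.0 if db.redo_done else 0.0
    return LOAD_WEIGHT * load_pct + REDO_WEIGHT * redo_pct


def job_progress(databases):
    """Return 0-100 for the whole job, weighting each database by its size."""
    total = sum(db.size_bytes for db in databases)
    if total == 0:
        return 0.0
    weighted = sum(database_progress(db) * db.size_bytes for db in databases)
    return weighted / total
```

With these assumed weights, a database that has finished LOAD but not redo sits at 80%, which is rough but still more informative than a figure that never moves.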

Userlevel 6
Badge +13

For example, see below: a restore of a Sybase DB.

It gives an exclamation mark, saying that the job is taking too long.
And the console shows ‘last update time’: 4 hours ago, 0 files restored, 0 bytes restored.
Most users think: it’s hanging, I will need to restart it. It has already happened over here, and I bet other customers have had the same.

But when checking the Commvault log file, you see that 40% has already been done.
That’s of course something completely different from ‘0 bytes/0 files, no update in the last 4 hours’.
As you said, you can’t show exact figures; we know that, and that’s normal. But there is a difference between showing no update at all (and even showing an exclamation mark) and showing ‘something’.
The last update time especially is very misleading.

Note: of course, I’m not a Commvault developer, so I know it’s easy for me to say :)

Userlevel 7
Badge +23

 

Totally get your point. That triangle of doom (I just made that up :joy: ) is a generic check that displays a warning on any data movement job type that has not moved data in x minutes. Of course, in some cases there are absolutely legitimate reasons for that. I see this happen often when customers do eager-zeroed thick restores of VMware machines: VMware has to allocate your 1 TB disk before we can start moving any data, and the Media Agent / JM don’t know that this is ‘normal’, hence the warning is displayed. This also used to happen often on VM file restores: in the past, we would dump out all the blocks associated with the files you picked, and then the VSA would traverse the data and reconstruct the files. That could take a lot of time during which we’re not actually moving data from the Media Agent.

We finally got an accurate progress bar for VM restores in SP5/SP6 of V11 (that is an eternity ago now). Before then we’d get a lot of support cases from customers assuming something was broken, as the progress bar would go from 5% to 85% with nothing in between. Lots of those little things have been sorted out, but there is still room for improvement, obviously.
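
As an illustration of the generic "no data moved in x minutes" check described above, and of how it could tell a quiet but healthy job from a stalled one, here is a minimal sketch. The function name, the state labels, and the idea of feeding in a timestamp for the last application-reported progress are assumptions, not Commvault behavior.

```python
from datetime import datetime, timedelta


def job_state(now, last_bytes_moved, last_app_progress, threshold):
    """Classify a running job from two timestamps.

    last_bytes_moved:  when the job last transferred data
    last_app_progress: when the application last reported progress
                       (for example a LOAD percentage), or None if unknown
    threshold:         how long a job may be quiet before it is flagged
    """
    if now - last_bytes_moved <= threshold:
        return "running"
    # No data has moved recently, but the application is still reporting work.
    if last_app_progress is not None and now - last_app_progress <= threshold:
        return "waiting on application"
    return "possibly stalled"
```

Under this sketch the warning would only be raised for the third state, so a restore that is busy inside the application would not look the same as one that has genuinely stopped.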

Userlevel 4
Badge +7

Hey Bart,

Thanks for showing this example. I reached out to a member of our engineering team directly on this, and after discussing it we’ve created a customer modification request (CMR) to see about improving this in the future. There is no ETA for this, however, and also no guarantee that it will be implemented.

There’s certainly room for improvement here though so it has been taken on board :grinning:

CMR Reference is 309028

Userlevel 2
Badge +6

Unfortunately, that progress bar is broken again with VSA Index v2, especially on file restores. There it is even worse: the restore may do nothing for hours until the Persistent Recovery is done counting the files to restore. So far this takes the crown for confusing customers in my book.

Userlevel 7
Badge +23

Hey Stefan,

Appreciate the feedback - Have flagged this internally with the right folks to double click on!

Userlevel 7
Badge +19

I always explain that Commvault still uses Microsoft-like technology to visualize job progress: being at 98% can mean that it will still take many hours. It would be nice if this were enhanced so that it no longer shows the job phase, but instead shows real progress based on data from the scan phase, throughput, and job history.
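
A minimal sketch of the throughput-based estimate suggested above, assuming the total size is known from the scan phase and that a historical rate from earlier jobs is available as a fallback. The function and parameter names are hypothetical.

```python
def estimate_remaining_seconds(total_bytes, bytes_done, elapsed_seconds,
                               historical_bps=None):
    """Estimate the seconds left in a job, or None if no rate is available.

    Uses the throughput observed so far in this job. Before any data has
    moved, it falls back to historical_bps (bytes per second from job
    history), if one was supplied.
    """
    if bytes_done >= total_bytes:
        return 0.0
    if bytes_done > 0 and elapsed_seconds > 0:
        rate = bytes_done / elapsed_seconds
    else:
        rate = historical_bps
    if not rate or rate <= 0:
        return None
    return (total_bytes - bytes_done) / rate
```

For example, a job that has moved 250 of 1000 bytes in 50 seconds is running at 5 bytes per second, so about 150 seconds remain.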

Userlevel 6
Badge +13

I agree. The point is, of course: when you need to restore things, things get escalated very quickly. A lot of people are under stress. Everyone wants to know how long it will take, as it’s hurting the business. When a restore then appears stuck for hours, even though Commvault could know the progress of some parts (again, we understand not everything can be known), this causes a lot of stress for users. The first reaction will always be: why does it take so long? Why is Commvault so slow?
Of course, this will also hurt Commvault’s reputation. (Don’t forget that not everyone knows how to check the status in databases or other logs.)

Userlevel 7
Badge +19

@Bart I totally agree! My response was primarily targeting backup jobs, but the same applies to restores as well, and those are way more important than backups ;-)

Now, for restores it is a bit more complex: all data is sent to the designated client computer, so factors like network throughput and client system resources will affect the restore, as the data needs to be decompressed and decrypted. But from a business value perspective it would definitely be great if Commvault could deliver a better indication. It would be nice if you could kick off a recovery readiness test against a client computer: a regular, planned, generic recovery test that validates the functional recovery readiness of the client. It could then reuse the gathered metrics for better restore job indications. This would deliver value not only for real restores but also when refreshing data for testing purposes. Additionally, it could be used in the RTO calculation to inform the user beforehand in Command Center that configuration changes or compute offering changes are required.

Userlevel 7
Badge +19

Hi @Edd Rimmer,

As you might have noticed, I opened a post about the same thing ~2 months ago, just like Bart did. Anyway, I just wanted to relay back that, although I understand your post and the challenges, the behavior is very inconsistent between the various agent types, and Commvault can do a lot to improve the experience. Instead of focusing on the client side, maybe it would be an idea to measure what is sent towards the client, or to measure the amount of data before it is sent into the API of the target application. That way you are less dependent on the target application, and it would open up the possibility of making the experience more consistent across the various agent types.

@Bart Are you already running FR23? Are any changes to be expected if we upgrade to it?
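
The sender-side measurement suggested above could be sketched as a thin wrapper that counts bytes just before they are handed to the target. `CountingWriter` is a hypothetical helper, and the `sink` stands in for whatever stream or API the data is written to; nothing here reflects an actual Commvault interface.

```python
import io


class CountingWriter:
    """Wrap a writable sink and track how many bytes have been handed to it."""

    def __init__(self, sink, total_bytes):
        self._sink = sink
        self.total_bytes = total_bytes
        self.bytes_sent = 0

    def write(self, chunk):
        written = self._sink.write(chunk)
        # Some sinks return None instead of a count; assume the whole chunk went.
        self.bytes_sent += len(chunk) if written is None else written
        return written

    @property
    def percent(self):
        if self.total_bytes <= 0:
            return 0.0
        return min(100.0, 100.0 * self.bytes_sent / self.total_bytes)


# Usage: an in-memory buffer stands in for the application API.
writer = CountingWriter(io.BytesIO(), total_bytes=200)
writer.write(b"x" * 50)
print(f"{writer.percent:.0f}% handed to the target")
```

Because the count is taken before the data reaches the application, the same figure is available for every agent type, although it says nothing about work the application does afterwards, such as a redo pass.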

 

 

Userlevel 7
Badge +23

@Onno van den Berg , @Bart  I am looking to loop in some developers to either give more detail on why we can’t be more accurate, and/or to see if we can improve our side of things.

Userlevel 7
Badge +19

Everything is possible I would say ;-) I'm also using some alternative paths myself because I find it unacceptable that this behavior is addressed via a CMR. 

Userlevel 7
Badge +23

Sounds good, my friend! Whoever gets word first, report back as the winner!

Userlevel 6
Badge +13

thanks

Badge

Hello,

Has there been any progress since this was handed over to the developers, I wonder? I am restoring a 0.5 TB database, and watching 5% for the last 6 hours is a bit stressful :)

Many thanks,

~Ales 

Userlevel 7
Badge +19

@Ales Vrbas what version are you running and which agent type are you referring to? 

Badge

@Onno van den Berg version 11.28.8 and agent type is ‘Virtual Server’. I am basically extracting a large database backup file from a VM backup. 

Userlevel 7
Badge +19

Check! Well, I think you might need to open a ticket for this! This thread was about the restore progress shown during database restores. I was, to be honest, hoping engineering would look a bit further into it and work on generic improvements covering all agent types, but it seems they only addressed this to a certain extent, and for databases only.

I have not done any restore tests myself recently, but will try to find some time to perform some tests as well.

Badge

Thanks @Onno van den Berg, I certainly will :)

I see. I thought the restore of a native backup was showing the progress just fine!

The more we bug them, the more likely they’ll look into it (I hope).
