Question

Determining RTO

  • 17 April 2023
  • 2 replies
  • 141 views

Userlevel 7
Badge +19

In practice we see many of our customers determining RTOs based on paper exercises, before even validating what it takes to live up to the RTO SLA agreement. When organizations refer to RTO 0, they mean high availability, which can be achieved using, for example, clustering technologies combined with synchronously mirrored storage, or applications with built-in capabilities relying on a distributed backend. However, this doesn't protect you in case you run into cyber security threats, malicious intent, or human failure.

Then you have to rely on your backup copy to recover to a specific point in time. Let's set aside the time it takes before the actual recovery process even starts, which is easily forgotten but adds to the total recovery time. Let's also not take the MTD (maximum tolerable downtime) into account, but focus on the RTO definition within Commvault. What is the rule of thumb that Commvault itself uses to determine the RTO? I seem to recall it being the full backup duration + 20%, but is that still correct? I am of course not taking snapshot possibilities into account, but really focusing on the RTO of a full recovery from a backup copy.
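
To illustrate, a minimal sketch of that rule of thumb as I remember it (the 20% overhead factor is the figure I recall, not a confirmed number):

```python
# Sketch of the "full backup duration + 20%" rule of thumb as I recall it.
# The 0.20 overhead factor is from memory, not a confirmed Commvault figure.

def rule_of_thumb_rto_hours(full_backup_hours: float, overhead: float = 0.20) -> float:
    """Estimate RTO as the full backup duration plus a fixed overhead."""
    return full_backup_hours * (1.0 + overhead)

# Example: an 8-hour full backup would imply an RTO of roughly 9.6 hours.
print(rule_of_thumb_rto_hours(8.0))  # 9.6
```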

I had a look at the readiness report, and although the report looks nice, the calculated RTOs just don't seem realistic. By the way, the latest report was run on a CommCell running the latest FR30 maintenance release.

Curious to hear whether others have noticed the same, to learn from your experiences, and to get confirmation on whether the rule of thumb is still correct.

2 replies

Userlevel 4
Badge +12

Hi @Onno van den Berg 

Thanks for reaching out. Let me cover how the RTO is calculated in the readiness report, which should shed some light.

The Recovery Readiness report's Data Views use certain parameters to make their predictions. For instance, if the recovery copy is a snap copy (persistent snap) or an active LiveSync instance, we estimate a 10-minute RTO to cover the failover or live mount/clone time.

On the other hand, if the copy is hosted as a backup, we determine the restore estimate based on the App Size (FET), which reflects the full size/restored footprint, the measured restore speed of the environment, and the number of streams. If users frequently run restore jobs or restore tests, we can profile those and add the true measurements from the copy to the environment profile, improving the estimate based on the actual configuration.

Initially, we use a base rate of approximately 100 GB/hour per stream from a disk library copy in the report. However, the actual profile will drive that estimate once you start to run actual jobs.
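
To make that concrete, here is a minimal sketch of the estimate as described above. This is an illustration, not the actual report code: the 10-minute snap/LiveSync estimate and the ~100 GB/hour base rate come from the description, while the function and parameter names are assumptions.

```python
# Illustrative sketch of the readiness report's RTO estimate as described
# above -- not the actual product code. Names and structure are assumptions.

SNAP_OR_LIVESYNC_RTO_HOURS = 10 / 60          # flat 10-minute failover / live mount estimate
BASE_RATE_GB_PER_HOUR_PER_STREAM = 100.0      # default rate for a disk library copy

def estimate_rto_hours(app_size_gb: float,
                       streams: int,
                       is_snap_or_livesync: bool = False,
                       measured_rate_gb_per_hour: float | None = None) -> float:
    """Estimate the restore time for a copy.

    Persistent snap copies and active LiveSync instances get the flat
    10-minute estimate. For a backup copy, divide the App Size (FET) by the
    aggregate throughput: the measured per-stream rate from the environment
    profile if restore jobs have been run, otherwise the base rate.
    """
    if is_snap_or_livesync:
        return SNAP_OR_LIVESYNC_RTO_HOURS
    rate = measured_rate_gb_per_hour or BASE_RATE_GB_PER_HOUR_PER_STREAM
    return app_size_gb / (rate * streams)

# Example: a 2 TB application restored over 4 streams at the base rate
# -> 2048 / (100 * 4), roughly 5.1 hours.
print(round(estimate_rto_hours(2048, streams=4), 1))
```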

Userlevel 7
Badge +19

@Emils thanks for coming back to me with so much detail. As for the profiling part: I assume this is done on a per-client basis, meaning you would have to run a restore for each client. Or is one restore for a client attached to the same storage policy/server plan enough to improve the recovery estimate's accuracy?

See for example the picture below. It states that the RTO for some of the clients is 00:00:00, which doesn't seem very accurate to me, right? These are FS backups, so just starting the job already takes more time than projected.

[screenshot of the Recovery Readiness report showing 00:00:00 RTOs]

Is there some documentation available? 

Another point: what happens when you have a client that is protected using a snap primary that is retained for just 1 day, but you also create backup copies? Can you also estimate the RTO based on the primary copy that resides on disk/cloud?
