@EricF
They can always do a spot check of what they have configured to be protected vs. what is backed up vs. what they can restore. We recommend all our customers go through this process when they set up client(s) for protection (and also familiarise themselves with the recovery process).
Once they have a client configured for protection as per their requirements, they can continue to monitor its health using several methods...
At the end of the day, as you mentioned… if the customer isn’t actively checking or acting on weekly reports, generated alerts, or overall system health… then there isn’t much more we can do?
I’d definitely start by suggesting they create an alert that notifies them of any job failures, jobs completing with errors, or failed items. They can filter the alert to report only on critical servers if they find their environment too noisy.
Hope this helps.
Chris
In my experience, one of the biggest mistakes is that customers pick the wrong approach. By this I mean they focus only on the data that needs protection and thus only select the systems and files/folders that require it. They pick this approach because it is less costly (you save on licenses in the case of a capacity-based license) and because they are used to working this way: a server is rebuilt via the provisioning solution and the data is brought back to the system using data recovery.
However:
- People change jobs or go on holiday, which means vital knowledge is not always present; the engineer responsible for the application and/or backup/recovery may not be aware of the configuration.
- Things change over time, and so do applications. The folders where the data is stored are moved to a different location, or maintenance tasks are performed that change the setup.
- Application owners assume that someone takes care of backup/recovery, while the person they think is responsible may have no context about the application and its data.
- Cyber security threats like ransomware not only manipulate data but also destroy systems like AD controllers. Vital infrastructure components then fail and need to be rebuilt, resulting in longer recovery times.
So, my advice is to use the backup-everything approach and only make exclusions if they are really needed, but do make sure these exclusions are set using something that can be reviewed easily, for example tags that are set through code (see the sketch below). This can of course still lead to issues, so frequent reviews are always required.
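To make that concrete, here is a minimal Python sketch of setting such an exclusion tag through code rather than by hand, so the exclusion lives in a script that can be version-controlled and reviewed. It assumes the /rest-style vSphere Automation endpoints (vCenter 6.5–7.x; newer versions use /api paths), and the hostname, credentials, tag ID and VM reference are placeholders, not anything from this thread.

```python
# Minimal sketch, assuming the /rest-style vSphere Automation API.
# All names below (host, account, tag ID, VM MoRef) are placeholders.
import requests

VCENTER = "vcenter.example.local"
TAG_ID = "urn:vmomi:InventoryServiceTag:REPLACE-WITH-TAG-UUID:GLOBAL"
VM_MOREF = "vm-1234"  # managed object reference of the VM to exclude

session = requests.Session()
session.verify = False  # lab only; keep certificate validation on in production

# Authenticate once and reuse the session token on subsequent calls.
login = session.post(f"https://{VCENTER}/rest/com/vmware/cis/session",
                     auth=("svc-backup@vsphere.local", "REPLACE-ME"))
login.raise_for_status()
session.headers["vmware-api-session-id"] = login.json()["value"]

# Attach the pre-created "backup-exclude" tag to the VM. Because this runs
# from a script, the exclusion is visible in code review and change history.
attach = session.post(
    f"https://{VCENTER}/rest/com/vmware/cis/tagging/tag-association/id:{TAG_ID}"
    "?~action=attach",
    json={"object_id": {"id": VM_MOREF, "type": "VirtualMachine"}},
)
attach.raise_for_status()
print(f"Attached exclusion tag to {VM_MOREF}")
```

The point is not the specific API but that the exclusion is applied from something reviewable, so the next frequent review has a single place to look.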
Not sure if the mistake you are referring to was a system not being protected or some data being left out, but just sending daily reports via mail doesn't do the trick. Setting up proper alerting that is sent, for example, via a webhook to Teams/Slack (a minimal sketch follows below) makes sure multiple people have eyes on the alert, which can also be sent to a ticketing system. In addition, introducing a proper OAT procedure that is followed before handing a system over to the business can also mitigate risks.
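As a rough illustration of that webhook idea (not a Commvault feature, just a generic Python sketch with a placeholder webhook URL and a made-up failure event), something like this is enough to get a failure in front of a whole channel:

```python
# Minimal sketch: post a failure summary to an incoming webhook (Slack shown;
# Teams incoming webhooks accept the same simple {"text": ...} payload).
# The URL and the example event are placeholders.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/REPLACE-ME"

def notify_failure(client: str, job_id: int, reason: str) -> None:
    """Send a short, human-readable failure message to the channel."""
    text = f"Backup FAILED on {client} (job {job_id}): {reason}"
    resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

# Hypothetical event, e.g. produced by a script that a backup alert triggers
# or by a poller over the daily backup report.
notify_failure("sql-prod-01", 1428765, "insufficient space on disk library")
```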
One thing I personally miss is the ability to exclude VMs from backup using a tag. The current implementation is a filter that keeps VMs from being protected, resulting in a loss of visibility into the protection state of your environment. Sure, you can script this yourself, for example by creating an inventory based on the tags used to exclude workloads from protection and reviewing it against the inventory in Commvault (a sketch of that reconciliation follows below), but I would expect this to be built-in functionality by default.
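A sketch of that review, assuming you have already exported the two inventories (the tag-based exclusions from vCenter and the unprotected VMs from a Commvault report or API export, both stubbed out with placeholder data here):

```python
# Minimal sketch of the review: everything here is placeholder data; in
# practice the two loaders would query vCenter tags and a Commvault
# report/API export respectively.

def load_excluded_by_tag() -> set[str]:
    """VM names carrying the 'backup-exclude' tag in vCenter."""
    return {"build-agent-01", "scratch-vm-07"}

def load_unprotected_in_commvault() -> set[str]:
    """VM names Commvault reports as not protected."""
    return {"build-agent-01", "scratch-vm-07", "sql-dev-03"}

excluded = load_excluded_by_tag()
unprotected = load_unprotected_in_commvault()

# Unprotected VMs that nobody consciously excluded: these are the ones that
# become unpleasant surprises at restore time.
unexplained = unprotected - excluded
# VMs tagged for exclusion but still being protected: wasted capacity/licenses.
still_protected = excluded - unprotected

print("Unprotected without an exclusion tag:", sorted(unexplained) or "none")
print("Tagged as excluded but still protected:", sorted(still_protected) or "none")
```

Run on a schedule, the non-empty "unprotected without an exclusion tag" list is exactly the gap that daily status reports tend to hide.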
Thank you both for the detailed replies. I am working with the customer on following the suggestions as well as making sure everything is configured for alerting purposes.
At the end of the day, they do have alerting through ServiceNow ticket generation off CV alerts, daily backup reporting, and access to the Dashboard and other reporting. They even have scripted reports generating Excel dumps of backup status.
The main issue is that failures were not being driven to remediation, plus probably a bit of complacency.
You can only do so much to prevent human failures from occurring, but as long as Commvault is unable to perform automatic remediation and/or think for humans, follow-up is required!