We have a very large environment with a high number of backup failures, and we are trying to fix them through automation using the REST API. The planned steps are:
1. Check the backup status for each client.
2. Check whether each client is ready for backup (Check Readiness).
3. If the readiness check fails, recycle the services on that client.
4. Resubmit the backup job for each client.
5. If the resubmitted job also fails, refer to the log cuts from JobManager.log and EvMgrS.log to find the failure/error details.
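The five steps above can be sketched as a small decision flow. This is only a minimal sketch: `ClientActions`, `remediate`, and every hook name are hypothetical placeholders, not Commvault API calls. Each hook would wrap whatever REST call (or non-REST fallback) actually performs that step in your environment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClientActions:
    """Hypothetical hooks; each one would wrap a single call for that step."""
    last_backup_failed: Callable[[str], bool]
    is_ready: Callable[[str], bool]
    recycle_services: Callable[[str], None]
    resubmit_backup: Callable[[str], bool]   # True if the resubmitted job succeeds
    collect_logs: Callable[[str], List[str]]  # relevant lines from the log cuts


def remediate(client: str, actions: ClientActions) -> List[str]:
    """Run steps 1-5 for one client and return the steps that were taken."""
    if not actions.last_backup_failed(client):
        return ["backup ok"]

    steps = ["backup failed"]

    # Steps 2-3: only recycle services when the readiness check fails.
    if not actions.is_ready(client):
        steps.append("readiness failed")
        actions.recycle_services(client)
        steps.append("services recycled")

    # Steps 4-5: resubmit once, and fall back to the logs if that fails too.
    if actions.resubmit_backup(client):
        steps.append("resubmit succeeded")
    else:
        steps.append("resubmit failed")
        log_lines = actions.collect_logs(client)
        steps.append(f"collected {len(log_lines)} log lines")

    return steps
```

Keeping the hooks injectable means a step that cannot be done over REST can be swapped for another mechanism without changing the flow, and the returned step list gives an audit trail per client.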
Let me know if there is anything we should add for other agent-based backups such as SQL, Oracle, or SAP HANA databases.
Please also suggest any additional initial troubleshooting steps that can be done through the REST API and would help reduce failures in our environment.
I think this generates more questions than answers. For one, doing this by REST isn’t possible for every step, so it calls your orchestration layer into question.
Also, what I have found is that alerting on events is much more productive than alerting on failures.
Backup failures are usually preceded by specific events, often by hours.
For example, Commvault can alert on client status, so if there is a connectivity issue you can address it in close to real time rather than waiting for a failure. This is true for all kinds of failure scenarios. What I would suggest is that you pull the event lists correlated with all your failures, find the preceding event for each, and create a corresponding runbook.
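As a rough illustration of that correlation, here is a minimal sketch that counts which event codes most often show up on a client in the hours before a failure. It assumes you have already exported failures and events as plain tuples; the tuple layout, the `preceding_event_counts` name, and the six-hour look-back window are all assumptions for the example, not anything Commvault defines.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Sequence, Tuple

Failure = Tuple[str, datetime]      # (client, time the backup failed)
Event = Tuple[str, datetime, str]   # (client, event time, event code)


def preceding_event_counts(
    failures: Sequence[Failure],
    events: Sequence[Event],
    lookback: timedelta = timedelta(hours=6),  # assumed window; tune to your data
) -> Counter:
    """Count, per event code, how many failures it preceded on the same client."""
    counts: Counter = Counter()
    for client, failed_at in failures:
        # Use a set so repeated events before one failure are counted once.
        codes = {
            code
            for ev_client, ev_time, code in events
            if ev_client == client and failed_at - lookback <= ev_time < failed_at
        }
        counts.update(codes)
    return counts
```

Calling `preceding_event_counts(failures, events).most_common(5)` then gives a shortlist of events worth alerting on and writing runbooks for.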
I highly doubt that, in most cases where check readiness is failing, restarting the services will resolve the issue. If it does, I’d suggest your network config is wrong: recycling the services lets the client re-establish connectivity to the rest of the CommCell, but only until the tunnel times out and connectivity breaks again.
In either case, I think workflows are a much better and easier fit than the REST API. You can even call REST from workflows if you have to, but I’d start there rather than trying to do it all in REST.
Commvault is resilient, and I think simply retrying jobs will not solve your issues; usually something more permanent is going on. I’d be curious to know: what is the most common failure error you are getting?