Question

Anomaly alerts for Job Completed?

  • 14 November 2022
  • 20 replies

Userlevel 1
Badge +6

I have been trying to reduce the number of Alerts coming from Commvault so that the ones which are raised are actionable (something needs to be done about them).

However, I do see some strange ones that I don’t expect to see, like:

The system detected events that are unusual in occurrence or frequency in CSNAME.     

Server    Event time    Event    Occurrence    Anomaly type
CSNAME
2022-11-13 00:56:27    Backup job [1346233] completed. Client [CLIENT], Agent Type [SQL Server], Subclient [default], Backup Level [Transaction Log], Objects [1], Failed [0], Duration [00:01:14], Total Size [9.38 MB], Media or Mount Path Used [[MA001] /ds3/CVLT03, [MA001] /ds1/CVLT01].    93    Occurrence

CSNAME
2022-11-12 23:59:56    Backup job [1345979] completed. Client [CLIENT2], Agent Type [Virtual Server], Subclient [default], Backup Level [Incremental], Objects [3801], Failed [0], Duration [00:59:48], Total Size [3.69 GB], Media or Mount Path Used [[MA001] /ds2/CVLT02, [MA001] /ds1/CVLT01].    177    Occurrence

 

Why does Commvault think that it’s unusual for 177 backup jobs to complete successfully?  :-)

I’m very confused by this.  And also if it happens 177 times is it really that unusual?  This doesn’t seem to be something that the engine should be picking up as unusual since it’s an event saying that the job completed successfully.  Especially since I get this alert every day.  Any idea on how to tell the system that this is normal and it’s working as expected?  Or do I just need to turn off the “Anomaly in number of succeeded jobs” option?


20 replies

Userlevel 7
Badge +23

Hi @RobAbate , thanks for the post.

That’s part of the overall Anomaly report.

https://documentation.commvault.com/11.24/expert/5199_alerts_and_notifications_predefined_alerts.html#list-of-predefined-alerts

Do you see a link to the Unusual Number of Succeeded Jobs Report?  If so, any further details there?  It sounds like it thinks 177 jobs is too LOW based on historical data, not too high.

CommServe Anomaly Alert

Operation / Admin Alert

This alert is enabled by default.

This alert notifies the user when the system finds anomalies compared to the system thresholds calculated based on the historical data:

  • Anomaly in DDB Pruning: The number of deduplication database prunable blocks or CommServe job records that should be deleted does not drop for a considerable period of time, which indicates an unusual performance drop for the deduplication database that reported the issue. The email alert sent for this anomaly contains a link to the DDB Pruning Performance Anomaly Report.

  • Anomaly in events: The frequency or occurrence of the events does not match the system threshold. The software verifies anomaly in events on an hourly basis, and sends an email alert for any anomaly. The email alert sent for this anomaly contains a link to the Anomaly Dashboard Report.

  • Anomaly in number of failed jobs: The number of failed jobs is more than the system threshold. The email alert sent for this anomaly contains a link to the Unusual Number of Failed Jobs Report.

  • Anomaly in number of pending jobs: The number of pending and waiting jobs is more than the system threshold. The email alert sent for this anomaly contains a link to the Unusual Number of Pending Jobs Report.

  • Anomaly in number of succeeded jobs: The number of succeeded jobs is less than the system threshold. The email alert sent for this anomaly contains a link to the Unusual Number of Succeeded Jobs Report.

  • Job activity anomaly: The number of created, deleted, or modified files in a backup job changes abruptly from the normal behavior, or the total backed-up root size increases or decreases sharply. The email alert sent for this anomaly contains a link to the Unusual Backup Job Activity Report.

  • Runtime anomaly in jobs: The runtime of the jobs is more than the system threshold. The email alert sent for this anomaly contains a link to the Anomaly Dashboard Report.

    The system sends alerts about anomalous behavior when there are at least 10 backup jobs.

    Note: A backup job can occasionally run longer than usual for various reasons: the backup of the deduplication database associated with the storage policy happens simultaneously, a user suspends the job for a while, the operation window interrupts the job, or the subclient content being backed up has grown unusually.

    When you associate a client computer in the alert configuration, the alert notifies you of any anomalous backup jobs for that client computer. When you associate a storage policy in the alert configuration, the alert notifies you of any anomalous auxiliary copy jobs for that storage policy.

    You can configure the following notification criteria in the alert configuration:

    • Notify only if condition persists for (hrs and mins): Specify how long the anomaly must persist before the alert is sent.

    • Repeat notification every (hrs and mins): Specify the interval at which the alert is repeated.

    • Notify when the anomalous job count is more than: This option applies only when you select the Repeat notification every criterion. Specify the anomalous job count above which the alert is sent. This condition applies to both the first and the repeat notifications.
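To make the direction of each job-count check in the list above concrete, here is a minimal Python sketch. It is illustrative only: the threshold numbers used with it are made up (the real ones are calculated by the system from historical data), and the class and function names are hypothetical, not part of any Commvault API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobCounts:
    failed: int
    pending: int      # pending and waiting jobs
    succeeded: int


@dataclass
class Thresholds:
    # Hypothetical values; the product derives its thresholds from history.
    failed_max: int
    pending_max: int
    succeeded_min: int


def job_count_anomalies(counts: JobCounts, limits: Thresholds) -> List[str]:
    """Return the job-count anomalies, following the documented directions."""
    anomalies = []
    if counts.failed > limits.failed_max:        # failed: MORE than threshold
        anomalies.append("failed jobs above threshold")
    if counts.pending > limits.pending_max:      # pending: MORE than threshold
        anomalies.append("pending jobs above threshold")
    if counts.succeeded < limits.succeeded_min:  # succeeded: LESS than threshold
        anomalies.append("succeeded jobs below threshold")
    return anomalies
```

Read this way, a "succeeded jobs" anomaly would only be expected when the count drops below the historical threshold, never when many jobs complete.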

Userlevel 1
Badge +6

Mike,

I saw that description in BOL also.  But it doesn’t line up with what I’m seeing.

Here’s an example, as I get this alert every day (and the counts are all around 120 or so, which doesn’t seem out of the ordinary to me since it’s a pretty close number every day):

2022-11-11 00:59:43    Backup job [1341137] completed. Client [CLIENT], Agent Type [Virtual Server], Subclient [default], Backup Level [Incremental], Objects [613], Failed [0], Duration [01:59:31], Total Size [589.97 MB], Media or Mount Path Used [[MA] /ds2/CVLT02, [MA] /ds1/CVLT01].    126    Occurrence

2022-11-10 00:54:33    Backup job [1338682] completed. Client [CLIENT], Agent Type [SQL Server], Subclient [default], Backup Level [Transaction Log], Objects [3], Failed [0], Duration [00:02:01], Total Size [42.96 GB], Media or Mount Path Used [[MA] /ds1/CVLT01].    121    Occurrence

2022-11-09 04:59:46    Backup job [1336175] completed. Client [CLIENT], Agent Type [Virtual Server], Subclient [default], Backup Level [Incremental], Objects [4428], Failed [0], Duration [05:59:39], Total Size [4.30 GB], Media or Mount Path Used [[MA] /ds3/CVLT03, [MA1] /ds2/CVLT02].    109    Occurrence

2022-11-09 02:56:37    Backup job [1336524] completed. Client [CLIENT], Agent Type [SQL Server], Subclient [default], Backup Level [Transaction Log], Objects [3], Failed [0], Duration [00:00:56], Total Size [13.06 MB], Media or Mount Path Used [[MA] /ds2/CVLT02, [MA] /ds3/CVLT03].    136    Occurrence

2022-11-08 00:59:56    Backup job [1334105] completed. Client [CLIENT], Agent Type [Virtual Server], Subclient [default], Backup Level [Incremental], Objects [9934], Failed [0], Duration [01:59:49], Total Size [9.68 GB], Media or Mount Path Used [[MA] /ds2/CVLT02, [MA] /ds3/CVLT03].    183    Occurrence
 

If it actually were telling me that fewer jobs succeeded than usual, I would expect it to tell me what the historical “norm” has been as well as what the low is for that day.  But that’s not what this is showing me.  Plus, clicking on that report doesn’t really give you any more useful info, as it just shows the info from the alert email in a table view...

Userlevel 7
Badge +23

Agreed, something isn’t adding up.  Either the alert is misfiring, or the description doesn’t match the actual behavior (or some mixture of the two).

I would go ahead and open a support case (share the case number here).

Once it’s resolved I want to create a Doc MR to get the page updated with clarifying detail.

Userlevel 1
Badge +6

Sounds like the old “what it says it does” vs “what it should do” vs “what it actually does”

Assuming documentation is correct, then I don’t think this is a useful alert to have in our environment and I would prefer to just turn it off.  If there were an anomaly in the number of successful jobs, there would likely be an anomaly in the number of failed jobs also, or they are all “no run” or are all still pending in the job controller - so I don’t see the value of this default alert anyway. 

I posted the question here specifically because I did not want to open a ticket for this.

And since the built-in alert doesn’t appear to be doing what it says it does, I’m just going to go ahead and not use it.  I’m no longer in the business of fixing Commvault software - I’m an end user now.  My job is to use the software, not to fix both the alert and the documentation - my manager is starting to ask questions like “why does every ticket you open go to dev” and “does this software ever work” (he prefers NetBackup).

So if you guys can go back to your labs and figure out what it’s supposed to do and make sure that the documentation is accurate, then I’d be glad to re-check once this is all fixed.  But I will not be opening a ticket for this, since I don’t see the value in what it says it’s supposed to do and I’m tired of deleting this e-mail every day from multiple CommCells.  Unchecking the box for “anomaly in number of succeeded jobs” is an acceptable solution here (although it should probably say “successful jobs” rather than “succeeded jobs,” but ok...).

Userlevel 7
Badge +23

That’s an absolutely valid point and I understand.

I’ll reach out to the doc team to find the responsible devs and testers and get this cleared up.

Userlevel 1
Badge +6

Hey Mike --

I just noticed that even after turning off the option for Job Anomalies, we’re still seeing the alert that I referenced in the beginning:

The system detected events that are unusual in occurrence or frequency in CS001.     

Server    Event time    Event    Occurrence    Anomaly type
CS001
2022-12-06 00:59:09    Backup job [1398329] completed. Client [CLIENT], Agent Type [Oracle RAC], Subclient [(command line)], Backup Level [Full], Objects [14], Failed [0], Duration [01:23:26], Total Size [39.47 GB], Media or Mount Path Used [[MA001] /ds2/CVLT02, [MA002] /ds3/CVLT03].    156    Occurrence
Please click here for more details. 

 

The system detected events that are unusual in occurrence or frequency in CS001.

Server    Event time    Event    Occurrence    Anomaly type
ens21cvc001
2022-12-05 00:59:39    Backup job [1394241] completed. Client [CLIENT2], Agent Type [MySQL], Subclient [default], Backup Level [Full], Objects [374], Failed [0], Duration [12:59:28], Total Size [9.86 TB], Media or Mount Path Used [[MA003] /ds2/CVLT02].    149    Occurrence

 

Notice that it says “system detected EVENTS that are...”

 

So I’m pretty convinced this is coming from the Event Anomaly selection -- and it’s treating the “job completed” event as anomalous.

Again, I’m not going through the pain of opening a ticket for this when I can turn off that alert criterion or create a rule in Outlook.  But it is annoying to see this alert multiple times per day from multiple CommCells, when there’s absolutely nothing to do - no action for me to take to fix it.  And it’s built-in and turned on by default.
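For anyone who would rather filter these mails in a script than in an Outlook rule, here is a rough Python sketch. The pattern is inferred only from the sample rows pasted in this thread, so treat it as an assumption rather than a documented format; the function names are hypothetical, and a real alert email (for example an HTML one) may need different handling.

```python
import re
from typing import Optional

# Pattern inferred from the sample rows in this thread (an assumption, not a
# documented format).
EVENT_ROW = re.compile(
    r"Backup job \[(?P<job_id>\d+)\] completed\. "
    r"Client \[(?P<client>[^\]]+)\], "
    r"Agent Type \[(?P<agent>[^\]]+)\], "
    r".*?Failed \[(?P<failed>\d+)\].*"
    r"\]\.\s+(?P<occurrence>\d+)\s+Occurrence"
)


def parse_event_row(line: str) -> Optional[dict]:
    """Return the fields of a 'Backup job ... completed' row, or None."""
    match = EVENT_ROW.search(line)
    if match is None:
        return None
    row = match.groupdict()
    row["failed"] = int(row["failed"])
    row["occurrence"] = int(row["occurrence"])
    return row


def is_noise_row(line: str) -> bool:
    """True when the row only reports a backup job that completed with 0 failures."""
    row = parse_event_row(line)
    return row is not None and row["failed"] == 0
```

A mail whose event rows are all "noise" by this test has nothing actionable in it and could be filed away automatically.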

 

Also I noticed that I get a lot of runtime anomaly alerts.  Is there a way to configure the thresholds?  I think that the machine learning/AI is too noisy and needs to back off a bit.  I mean it’s alerting me for a job that took 1:47 when the previous job took 1:15 -- not that big of a deal.  Or am I just stuck with what the system thinks the threshold should be?

Userlevel 7
Badge +23

@RobAbate , I’m not sure if we can change the thresholds; hate to be a downer, though I suspect it’s not something you can configure.

I need to follow up with the docs team on this, so I’ll add in the extra question.

Userlevel 1
Badge +3

Hi Rob,

The model tends to give preference to consistency. If past jobs’ runtimes are all very close, the expected fluctuation is small, so breakouts are tagged early. You can make the system less sensitive to this by adding a gxglobal param “AdminAlertSensitivity” with the value “Low”.

regards

Mrityunjay
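To illustrate why a very consistent runtime history gets flagged early, here is a toy Python sketch. It uses a plain mean-plus-k-standard-deviations rule as a stand-in; the actual Commvault model is not described in this thread, and the multipliers per sensitivity level are invented for the example. The 10-job minimum is taken from the documentation note quoted earlier.

```python
from statistics import mean, stdev

# Hypothetical multipliers: how many standard deviations above the mean a
# runtime must be before it is flagged. These are NOT Commvault's values.
SENSITIVITY_MULTIPLIER = {"High": 2.0, "Medium": 3.0, "Low": 4.0}

MIN_JOBS = 10  # documentation note: alerts need at least 10 backup jobs


def is_runtime_anomaly(history_minutes, latest_minutes, sensitivity="Medium"):
    """Flag the latest runtime if it exceeds mean + k * stdev of the history."""
    if len(history_minutes) < MIN_JOBS:
        return False  # not enough history to judge
    limit = mean(history_minutes) + SENSITIVITY_MULTIPLIER[sensitivity] * stdev(history_minutes)
    return latest_minutes > limit


# Steady history around 1:15 versus a history that swings widely.
steady = [75, 74, 76, 75, 73, 77, 75, 74, 76, 75]
varied = [60, 90, 75, 120, 50, 100, 70, 110, 65, 95]
```

Under this toy rule, a 107-minute run (1:47) is flagged against the steady history even at the "Low" setting, while the same run is not flagged against the varied history even at "High". That is the sense in which consistent jobs are tagged early.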

 

Userlevel 1
Badge +6

Thanks Mrityunjay!

I’ll give this one a try.

We’re slowly reducing the amount of “noise” coming from these anomaly alerts.

The ones which we see daily tend to be these “jobs running longer than usual” alerts, and the Event alerts telling me that there were 156 occurrences of the event “backup job completed” which is really annoying.

I’m guessing there’s no way to turn off the alert for Aux Copy running longer and only keep the one for backup jobs?  Aux copies by nature tend to be very inconsistent in their run time and the alert has no real value since there’s nothing to do/fix other than wait for the job to finish.

Userlevel 1
Badge +3

I am too busy to dig into the details for all y’all, @Mrityunjay Upadhyay, but I also have concerns about the usability of the Anomaly Alerts, as well as the ransomware alerts/reports.  I think they are poorly engineered, features in name only, meant to impress and assuage fear in C-suite-level people with empty security features, or maybe to check off a box for ransomware insurance policies.  If Commvault doesn’t like my assessment, they can provide better documentation and training on how to configure and respond to their anomaly and ransomware spam.  A technical webinar would go a long way here.

Userlevel 4
Badge +13

Hi @Mrityunjay Upadhyay 

I have customers who also suffer from too much noise from Commvault anomaly alerts, and to be honest, not being able to tweak them is a bummer.

I was trying to test the “AdminAlertSensitivity” additional setting as per your suggestion. Unfortunately, I am unable to add it from Command Center, as the setting is not found there.

In Java GUI the lookup is also not giving me anything, but at least it can be manually typed there. 

This is empty too

Additional Settings (commvault.com)

and this

https://documentation.commvault.com/search?q=AdminAlertSensitivity&oem=commvault&majorVersion=11&minorVersion=28&site=essential#q=AdminAlertSensitivity&t=All&sort=relevancy

So I guess this is another hidden setting. Are there more settings (hidden or not) that allow tweaking the behavior of these anomaly alerts? Is there a place where I can learn more about them?

 

Regards

Userlevel 1
Badge +3

Hi Robert,

Which FR are you using? Are you also getting more event-based anomalies than desired? We are working on reducing the noise and will port the fix to your FR. The setting above is for long-running jobs and is hidden. We are unhiding it and adding some more settings. I will add a doc link here once the update is out.

 

regards

Mrityunjay

Userlevel 4
Badge +13

Hi @Mrityunjay Upadhyay 

The customer is on FR28. There are some event-based anomalies, but also File Activity anomalies and “aux copy running longer than usual” alerts. The latter is especially annoying, as it comes almost every day. I would be willing to disable it, as RobAbate already mentioned he would: since it comes every day and there is nothing you can do about it, it has no value.

Let me know once you are able to share the doc link.

Regards,

Robert

Userlevel 1
Badge +3

Hi Robert,

You and Rob are right: aux copy adds more dimensions and can get tagged incorrectly. We will exclude these by default and make aux copy tracking opt-in. We will expedite this for FR28.

regards

Mrityunjay

Userlevel 1
Badge +6

Hey Mrityunjay,

Unfortunately our environment is only on SP24.

I continue to get these Aux Copy anomaly alerts, in addition to the original ones that I had mentioned where it says that X jobs completed and comes from Event anomaly.  Even after setting anomaly sensitivity to low, this is continuing to happen.

We’re considering disabling the anomaly alerts completely, as the CommServe is starting to “cry wolf” where there are so many alerts that everyone ignores all of them and they’re no longer useful/actionable.

I’m also facing an issue with Aux Copies, where it was suggested to implement a max run time for the aux copy jobs. This is working well, but now it is raising a failure alert when the aux copy is killed by the system. I was hoping to find CWE rules for aux copy so that I could say do not alert me if the job is killed by system due to max run time, OR to have an option to treat those jobs as Completed with Errors or just Completed instead of Failed (since a new job will kick off within 20 min and continue). But I can’t seem to find any way to accomplish this, where I’m not alerted for a failed aux copy if it was killed by system due to max run time.

In our environment, there is automation to create tickets for every failure - so we do not want tickets to be generated for these “expected failures” -- we may need to look into handling this specific error code in our scripts for aux copy only: 19:1111 The job has exceeded the total running time.
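A minimal sketch of that kind of script-side filter is below. The `FailedJob` shape and the "Auxiliary Copy" job type label are assumptions about what a ticketing script might receive, not a Commvault API; only the error code 19:1111 comes from this thread.

```python
from dataclasses import dataclass

# Error code from this thread: "The job has exceeded the total running time."
MAX_RUNTIME_ERROR_CODE = "19:1111"


@dataclass
class FailedJob:
    # Hypothetical record as seen by the ticketing automation.
    job_type: str     # e.g. "Auxiliary Copy" or "Backup" (assumed labels)
    error_code: str   # may hold the bare code or the code plus description


def should_create_ticket(job: FailedJob) -> bool:
    """Skip tickets only for aux copies killed for exceeding max run time."""
    expected_failure = (
        job.job_type == "Auxiliary Copy"
        and MAX_RUNTIME_ERROR_CODE in job.error_code
    )
    return not expected_failure
```

The check is deliberately narrow: a backup job failing with the same code, or an aux copy failing with any other code, still produces a ticket.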

Userlevel 1
Badge +3

Hi Rob,

 

Happy new year.


We made fixes in SP28 HFP42 and will backport them to 24. With these fixes, the system will not alert on aux copy and info events.

Regarding the failure alert on aux copy jobs killed by the system after the max run time: I am checking whether these can be marked CWE.

I will get back to you by tomorrow.

 

thanks, and regards

Mrityunjay

Userlevel 1
Badge +3

Hi Rob,

 

From FR24 HFP48 onwards, these aux copy jobs will be marked as partially successful.

 

regards

Mrityunjay

Userlevel 1
Badge +6

Thanks Mrityunjay!

I was able to tweak the alert by adding a Token Criteria Selection of “ERR CODE does not equals 19:1111”, so that the alert would not fire for this type of failure.

But I am waiting on another patch which will be Hotfix’d, so we will get these applied on top of SP24 once all of these patches are ready.

Userlevel 1
Badge +6

Mrityunjay,

Apparently my “filter” didn’t work, as I still got the alert today even though I added the token criteria as shown above.  I’m going to try with “does not contains” instead of “does not equals” and see how that works.  My goal is to have no alert fire when the aux copy is killed by the system due to max run time.

But I just looked at my job history and I noticed that the aux copy that was killed by system is already marked as CWE.  But the alert for “aux copy failed” seems to fire even when it is CWE (our alert is set for job failed and job skipped).

I can also try to modify the token criteria selection to only apply to this storage policy, or some other setting.  But wondering why it wasn’t working as expected with the configuration I mentioned in the previous post?  Shouldn’t putting the ERR CODE does not equals 19:1111 stop the alert from being sent?  Or do I have an “exclude” here with no “include” based on this config (screenshot below) ??

 

Userlevel 1
Badge +6

It seems that my change was effective, as I’m not getting the e-mailed alerts when the aux copy job is getting killed anymore. I changed my token criteria to “does not contains” instead of “does not equals” for the error code.  So it looks like this is now stopping the alert when the system kills the job, as desired.
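One possible explanation for the difference, which nothing in this thread confirms, is that the ERR CODE token carries more text than the bare code (for example, the code plus its description). Under that assumption, the two criteria behave as in this small Python sketch; the function names and the sample token value are hypothetical.

```python
MAX_RUNTIME_CODE = "19:1111"


def alert_passes_not_equals(token_value: str, code: str = MAX_RUNTIME_CODE) -> bool:
    """'does not equals': the alert is sent unless the token is exactly the code."""
    return token_value != code


def alert_passes_not_contains(token_value: str, code: str = MAX_RUNTIME_CODE) -> bool:
    """'does not contains': the alert is sent unless the code appears in the token."""
    return code not in token_value


# Hypothetical token value: the code followed by its description.
token = "19:1111 The job has exceeded the total running time."
```

With a token like that, the "does not equals" test still lets the alert through because the full string differs from the bare code, while "does not contains" suppresses it.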
