Solved

"Alert Data Verification Failure Detected" Alert

  • 19 October 2021
  • 19 replies
  • 2021 views

Badge +15

I noticed a new alert was added to my CommServe and I believe this was after an upgrade we did recently to a new FR.

The alert in question is: Alert Data Verification Failure Detected, and it is briefly discussed at the link below.

https://documentation.commvault.com/11.20/expert/12395_data_verification_faq.html

A few things come to mind that are not completely clear to me.

  1. The alert says that it detected corrupted data on the backup disk. It does not tell me the job ID or the path. Looking at the alert in the GUI, it also does not have any additional option to be selected, such as job ID, storage policy, etc. How come?
  2. Job history for the subclient or storage policy shows everything green and no issues with the jobs themselves. My understanding is that I could have issues trying to restore the particular subclient up to a specific transaction log. Why would job history still show the job as successfully completed?
  3. The alert gives you two options: one to convert the subclient to full during the next backup, and one to automatically convert all failed verifications to full. The link points to a workflow to be executed, but searching for the workflow in the Commvault Store under workflows, I can't find anything related. What are both workflow names? How can I find them?
  4. On my storage policy, Data Verification is disabled. I'm trying to understand the alert considering that the option is disabled.
  5. Was this alert introduced in recent FRs? Which one specifically?

Appreciate the time.


Best answer by Mike Struening RETIRED 2 November 2021, 21:32


19 replies

Userlevel 7
Badge +23

Appreciate the post, @dude!  I agree, without the Job ID it’s not exactly useful (at least not without some digging).

Let me discuss internally and get some of our team to respond here.

Thanks!

Badge

We are experiencing the same here after upgrading from FR 11.20 to 11.24. A few backup jobs that use synthetic full backups are affected. We decided to run a full backup for these jobs.

Userlevel 1
Badge +3

Hello @dude,

As far as I know, this alert has been around since at least FR20. Do you have the system-created schedule for DDB verification still running? I believe the alert triggers based on that. The two workflows do not need to be downloaded from the Store. They are “Toggle Automatic Conversion To Full Backup” and “Mark Selected Subclients To Run Full Backup”, but they are hidden in the GUI.
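
If you want to confirm which workflows actually exist on the CommServe without relying on the GUI (which can hide system workflows), something along these lines may help. This is only a rough sketch against the Commvault REST API; the hostname and credentials are placeholders, and the exact JSON layout of the GET /Workflow response can differ by FR, so adjust the parsing to what your CommServe returns.

```python
# Rough sketch: list workflows on the CommServe via the Commvault REST API.
# Assumes the standard /webconsole/api endpoints (POST /Login, GET /Workflow);
# hostname, credentials, and JSON key names below are illustrative only.
import base64
import requests

BASE = "https://commserve.example.com/webconsole/api"  # hypothetical hostname

def login(user: str, password: str) -> str:
    """Authenticate and return the Authtoken (password is sent Base64-encoded)."""
    body = {"username": user,
            "password": base64.b64encode(password.encode()).decode()}
    r = requests.post(f"{BASE}/Login", json=body,
                      headers={"Accept": "application/json"})
    r.raise_for_status()
    return r.json()["token"]

def list_workflows(token: str) -> None:
    """Print every workflow name the API returns, including ones hidden in the GUI."""
    r = requests.get(f"{BASE}/Workflow",
                     headers={"Accept": "application/json", "Authtoken": token})
    r.raise_for_status()
    for wf in r.json().get("container", []):
        # Inspect r.json() directly if this prints nothing in your FR.
        print(wf.get("entity", {}).get("workflowName"))

if __name__ == "__main__":
    list_workflows(login("admin", "secret"))
```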

Badge +15

I do see two system-created jobs, though the verification one is disabled and has been for a while, as far as I know.

I still do not see the logic behind the alert, and my questions remain open.

Userlevel 7
Badge +23

Hi @dude, based on what @Tim H shared, and the context, I’ll start with your 5 bullets:

  1. The alert says that it detected corrupted data on the backup disk, but it does not tell me the job ID or the path. Looking at the alert in the GUI, it also does not have any additional option to be selected, such as job ID, storage policy, etc. How come? - This is likely due to dedupe, and how several jobs would be affected. The time to pull the list, and the list itself, would be lengthy.
  2. Job history for the subclient or storage policy shows everything green and no issues with the jobs themselves. My understanding is that I could have issues trying to restore the particular subclient up to a specific transaction log. Why would job history still show the job as successfully completed? This sounds like a good CMR… to go back and acknowledge these visually somehow.
  3. The alert gives you two options: one to convert the subclient to full during the next backup, and one to automatically convert all failed verifications to full. The link points to a workflow to be executed, but searching for the workflow in the Commvault Store under workflows, I can't find anything related. What are both workflow names? How can I find them? Similarly, this should be a CMR to update the alert.
  4. On my storage policy, Data Verification is disabled. I'm trying to understand the alert considering that the option is disabled. The alert is clearly not checking that; this should be a CMR as well. That sounds like an easy check to make, or at least add context to the alert.
  5. Was this alert introduced in recent FRs? Which one specifically? IIRC, 11.22.

That leaves 3 improvements:

  1. Can we visually mark completed jobs impacted by the corrupted files?
  2. Update the alert to contain the workflow names
  3. If DV is disabled, then either don’t show the alert or add context

Does that cover your thoughts?

Badge +15

@Mike Struening, honestly I do not see how this is a "solution". It seems to me that the alert was not thought out at all. Like I said, it reports a few things that the alert itself can't explain, nor can I find the info through reports or find the workflows. This is pretty inefficient from my perspective; not only that, it leaves the customer with a lot of questions.

Anyway, looks like this is it for now. Thanks for reviewing this. Hope to see improvements in the future.

Userlevel 7
Badge +23

I agree with you 100%.  My hope is that I capture your concerns, bring them to our Alerts development team, and get your changes considered and hopefully implemented.

I just want to be sure I capture your concerns and ideas accurately, so my message to them covers everything!

Let me know if I missed anything:

  1. Can we visually mark completed jobs impacted by the corrupted files?
  2. Update the alert to contain the workflow names
  3. If DV is disabled, then either don’t show the alert or add context
Badge +15

Looks good. The only thing I'd add to item 2: make the workflow visible, which today apparently is hidden.

Thank you - 

Userlevel 7
Badge +23

ok, great!  I’ll do that.

Userlevel 7
Badge +23

Hi @dude, I have 2 answers for you (I'm asking dev for better details on the visual identifier request):

  1. Update the alert to contain the workflow names and make them visible

[dev] Why are you looking for the workflow names and their availability on the Store? The link itself should take you to the workflow and run it.

  2. If DV is disabled, then either don’t show the alert or add context

[dev] Jobs are marked as verification failed by any read operation, i.e. Synth Full, Aux Copy, or restores. It is not just DV jobs. Even when DV is disabled, if any read operation detects data corruption, we mark those chunks/jobs as data verification failed, and the affected subclients will be listed in the alert if the affected job(s) are part of the latest cycle. The alert helps to protect the affected subclients by running a new Full on them.
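
To restate dev's description in code form, here is a minimal sketch of when a subclient would appear in the alert. The Job/Subclient objects and function names are hypothetical illustrations, not Commvault internals; the point is only that any read operation can flag a job, and the alert only lists subclients whose latest cycle contains a flagged job.

```python
# Minimal sketch of the marking/alerting logic dev describes; hypothetical
# Job/Subclient objects, not Commvault code. Any read operation (Synth Full,
# Aux Copy, restore), not just a DV job, can flag a job as verification failed.
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: int
    in_latest_cycle: bool
    verification_failed: bool = False

@dataclass
class Subclient:
    name: str
    jobs: list = field(default_factory=list)

def mark_read_failure(job: Job) -> None:
    """A read operation hit a corrupt chunk: flag the job, even if DV is disabled."""
    job.verification_failed = True

def subclients_for_alert(subclients: list) -> list:
    """A subclient is alerted on only if a flagged job is part of its latest cycle."""
    return [sc.name for sc in subclients
            if any(j.verification_failed and j.in_latest_cycle for j in sc.jobs)]

# Example: an aux copy hits corruption in job 101, which is in the latest cycle.
sc = Subclient("SQL_logs", [Job(100, False), Job(101, True)])
mark_read_failure(sc.jobs[1])
print(subclients_for_alert([sc]))  # ['SQL_logs'] -> alert suggests a new Full
```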

For the Workflow link, can you confirm that the alert links you right to the WF itself?  You should have it already installed by default.

Thanks,

Userlevel 7
Badge +23

And now I have an answer on the visual request.

If you go to View>Jobs on copy, check the Data Verification status column.

Let me know if this works for your needs!

 

Badge +15

Hi @dude, I have 2 answers for you (I'm asking dev for better details on the visual identifier request):

  1. Update the alert to contain the workflow names and make them visible

[dev] Why are you looking for the workflow names and their availability on the Store? The link itself should take you to the workflow and run it.

[dude] So, say I deleted the email alert that was automatically sent to me, and I later want to run the workflow manually, how do I do it? As a CV admin, if there is a link in the alert pointing to a form/workflow that gives me options to click and allow the conversion to full, why isn't the same form/workflow visible to the admin so I can run it whenever I want?

It does not seem very effective to me to give the admin only one way to run a full conversion, with the email as the single resource.

  2. If DV is disabled, then either don’t show the alert or add context

[dev] Jobs are marked as verification failed by any read operation, i.e. Synth Full, Aux Copy, or restores. It is not just DV jobs. Even when DV is disabled, if any read operation detects data corruption, we mark those chunks/jobs as data verification failed, and the affected subclients will be listed in the alert if the affected job(s) are part of the latest cycle. The alert helps to protect the affected subclients by running a new Full on them.

[dude] If this is not only for DV jobs, then the link I shared above needs to be updated and the alert needs to be better explained to reflect what you are saying: "The data verification job flags the backup jobs with a 'verification failed'…"

 

As for the screenshot you sent, I'm aware of that. Like I said, DV is disabled for the storage policy in question, but with the statement that this alert isn't only for DV jobs, it "sort of" makes sense. Again, the alert is named Data Verification, which is one of the main reasons for my previous questions, but now dev has mentioned that this isn't only for DV. Very confusing. It is DV, but not.

Userlevel 7
Badge +23

Appreciate your well-thought-out reply!  I’ll continue talking to our dev team.

Userlevel 7
Badge +23

I have some more:

  1. Regarding the alert name: the alert does not mention the DV job status. It just says “Data Verification Failure”. I think you hinted at this earlier, in that it kind of makes sense.
  2. Regarding the workflow details, I’m going to work with the docs team to make this easy to find. However, dev provided this idea: if the user wants to enable automatic conversion to Full on data verification failure, they can enable the highlighted option at any time. If the email is deleted, the user will receive the alert email again within the next 24 hours if a new Full has not run yet.

Let me know if you have questions/thoughts/concerns.

Userlevel 1
Badge +5

I have a similar issue after an automatic upgrade from 11.20.xx to 11.24.29.

Did running a full backup solve it?

Does anyone know the cause?

Userlevel 7
Badge +23

@Paul Hutchings, I saw your separate post.  Considering how yours happened right after an upgrade, check directly with support.

Badge

Mike,

Your explanation back from dev, that DV is inherent in any read operation, was VERY helpful! I’ve heard or read each of those examples here or there while working with support or perusing documentation, but I’ve had difficulty finding *that* answer from the perspective of a comprehensive understanding of when, where, and how data is verified within Commvault. Thanks again!

 

Also a word of caution regarding auto-convert to full and a CMR/feature request:

I enabled auto-convert to full, but the result was a flurry of unnecessary Traditional Full jobs due to DV failure false positives and back-end storage bloat. What happened? At our scale and architecture, we have Data Verification jobs that inevitably run concurrently with DASH/Aux copies. So we’ll have chunks of data locked by DASH that then cause DV failures. With auto-convert to full disabled, those chunks would succeed on subsequent DV jobs, and all would be well. I ended up reverting to not auto-converting the next jobs to full.

 

There are many potential approaches to remediating this issue - a few ideas:

1) Queue any chunks that fail DV to the end of the same DV job, with a standard minimum delay of 15 minutes (not a delay per found chunk, but rather: don’t attempt to re-verify a failed chunk less than 15 minutes after the initial attempt, as it might still be locked by DASH). Or perhaps more simply, after an initial DV pass on all chunks, set the job pending for 15 minutes and then try again (I believe there’s already a standard 15-minute wait for conditions that send jobs into a pending state). A rough sketch of this retry idea follows the list. This would also enable DV jobs to Complete Successfully rather than Complete With Errors. When providing proof of DV jobs for auditing or cyber insurance quoting, proof with jobs in the green rather than the orange would be ideal ;).

2) There could also be logic to identify chunks that are locked by DASH. If support can determine (they can) that a chunk was locked by a DASH job, these overlaps can be accounted for: if a chunk is locked, check for an active DASH job, and check that DASH job for concurrent use of the same unique chunk. Or cache chunks as they ‘check in’ or ‘check out’ of a given job as available or not.

3) You could also outright prevent DASH and DV jobs from running concurrently, similar to how you can’t run a DV job and a Space Reclamation job at the same time.
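
Purely to illustrate idea 1, here is a minimal sketch of deferring re-verification of a failed chunk instead of treating the first failed read as a genuine DV failure. The verify callback, chunk IDs, and scheduling are hypothetical stand-ins, not Commvault internals.

```python
# Conceptual sketch of idea 1: requeue chunks that fail verification and only
# retry them after a minimum delay, so a chunk that was merely locked by a
# concurrent DASH copy is not reported as a genuine DV failure.
import time
from collections import deque

def run_dv_pass(chunk_ids, verify, retry_delay_secs=15 * 60):
    """Verify all chunks once, then retry failures after retry_delay_secs."""
    retry = deque()                       # (chunk_id, time of first failure)
    failures = []
    for cid in chunk_ids:
        if not verify(cid):
            retry.append((cid, time.time()))
    while retry:
        cid, first_fail = retry.popleft()
        wait = retry_delay_secs - (time.time() - first_fail)
        if wait > 0:
            time.sleep(wait)              # a real scheduler would set the job pending
        if not verify(cid):
            failures.append(cid)          # still failing -> candidate for convert-to-full
    return failures

# Tiny demo: chunk "c2" fails its first read (locked), then succeeds on retry.
attempts = {}
def fake_verify(cid):
    attempts[cid] = attempts.get(cid, 0) + 1
    return not (cid == "c2" and attempts[cid] == 1)

print(run_dv_pass(["c1", "c2", "c3"], fake_verify, retry_delay_secs=0))  # []
```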

 

Best regards,

Kyle

 

 

Badge

A follow-up to my initial response.

~”Any read operation provides DV.”

If ingesting to the primary copy and then DASH’ing to the secondary copy either immediately or within 24 hours, immediately scheduled DV jobs are operationally and functionally redundant. It seems a better approach might be to schedule DV jobs for the secondary copy’s DDB/data with some periodicity after the data is DASH’d.

The primary copy will at least have effective DV per Synth Full schedules following the initial commit and DASH. I wonder if it’s possible to schedule DV jobs to include only data older than ‘x’ days, weeks, or months (a rough sketch of that idea is below)? I’m aware of the ability to select or deselect jobs for DV, but at scale that’s less automatic than I’d prefer. We’d also need to keep in mind DDB verification vs. data on disk.
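
As an illustration of the "only data older than x days" idea, a selection pass along these lines could pre-filter which jobs a DV run picks up. This is purely hypothetical; the job records and field names are made up, and nothing like this is claimed to exist as a Commvault option today.

```python
# Hypothetical sketch of selecting only jobs older than a cutoff for DV,
# e.g. jobs already DASH'd to the secondary copy more than 7 days ago.
from datetime import datetime, timedelta

def jobs_for_dv(jobs, min_age_days=7):
    """jobs: iterable of dicts like {"job_id": 1, "end_time": datetime(...)}."""
    cutoff = datetime.utcnow() - timedelta(days=min_age_days)
    return [j["job_id"] for j in jobs if j["end_time"] <= cutoff]

sample = [
    {"job_id": 201, "end_time": datetime.utcnow() - timedelta(days=10)},
    {"job_id": 202, "end_time": datetime.utcnow() - timedelta(days=1)},
]
print(jobs_for_dv(sample))  # [201] -> only the older job would be verified
```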

 

Thoughts?

Userlevel 7
Badge +23

Thanks, @Kyle Hebert !

Reply