Question

Cache-Database and Deduplication-Database are out of sync

  • 17 August 2021
  • 5 replies
  • 122 views

Userlevel 2
Badge +3

Do you see these errors for your jobs?

When we updated from 11.20.32 to 11.20.60, we started getting Cache-Database errors on various backup types: FileSystem iDataAgent, NDMP backups, and others.

The exact error we get is:
Error Code: [40:110]
Description: Client-side deduplication enabled job found Cache-Database and Deduplication-Database are out of sync.

I have a ticket open with support, but I am wondering if the issue is unique to us or if it is happening to other customers as well.

 

Thank you,


5 replies

Userlevel 7
Badge +15

Hey @Farmer92,

Couple of possibilities based on this KB article: https://kb.commvault.com/article/DD0010

  • This error message might appear due to one of the following reasons:

    • Source-Side Deduplication Database (SSDB) is out of sync with the Deduplication Database (DDB) on the MediaAgent. When this occurs, the SSDB has reference pointers (signature hashes) that the central DDB doesn't have, so the job(s) of the client associated with the SSDB go to a pending state (see the sketch after this list).
    • Deduplication data verification job marked invalid data blocks as bad.
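
To make the first scenario concrete, here is a minimal Python sketch (purely illustrative, not Commvault code) of how a client-side signature cache can drift ahead of the central DDB: the client skips sending a block because its local cache says the DDB already holds the signature, but the DDB has since dropped it.

```python
# Illustrative model only -- not Commvault's implementation.
import hashlib

ddb = set()          # central Deduplication Database on the MediaAgent
client_cache = set() # Source-Side DDB / cache on the client

def backup_block(block: bytes) -> str:
    sig = hashlib.sha256(block).hexdigest()
    if sig in client_cache:
        # Client believes the MediaAgent already has this block, so it only
        # sends a reference. If the DDB no longer has the signature, the
        # reference cannot be resolved -> cache and DDB are out of sync.
        if sig not in ddb:
            return "OUT_OF_SYNC"
        return "SENT_REFERENCE"
    # Cache miss: send the full block and record the signature on both sides.
    ddb.add(sig)
    client_cache.add(sig)
    return "SENT_DATA"

# First backup seeds both databases.
backup_block(b"some file data")
# A referencing job ages off (or verification marks the block bad) and the DDB
# entry is pruned, but the client cache still remembers the signature.
ddb.clear()
print(backup_block(b"some file data"))  # -> OUT_OF_SYNC
```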

 

Not sure there is much you can do on point #1 - the job will self heal. Most likely a job aged off, which removed a signature from the DDB that was still in the cache DB and required for the backup to proceed.

For #2 - definitely check your verification jobs and see if they picked up an anomaly.

 

Cache DB is really only useful for backups over WAN, high-latency, or low-bandwidth links. For regular datacenter backups you may get better performance by disabling it, and it also frees up space on the client (not much, but every bit counts at scale).
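
As a rough illustration of why the cache mostly matters over high-latency links, here is a back-of-envelope calculation. The block count and RTT values are assumptions, and real clients batch signature lookups, so the serial numbers below overstate the cost - but they show why latency dominates.

```python
# Rough back-of-envelope estimate, not measured numbers.
blocks = 1_000_000    # duplicate 128 KB blocks in a backup (~128 GB of dedupe hits)
lan_rtt_s = 0.0005    # ~0.5 ms round trip inside the datacenter (assumed)
wan_rtt_s = 0.050     # ~50 ms round trip over a WAN link (assumed)

def lookup_overhead_hours(rtt_s: float) -> float:
    """Time spent purely on signature lookups if every lookup goes to the MediaAgent."""
    return blocks * rtt_s / 3600

print(f"LAN lookups: ~{lookup_overhead_hours(lan_rtt_s):.1f} h")  # ~0.1 h
print(f"WAN lookups: ~{lookup_overhead_hours(wan_rtt_s):.1f} h")  # ~13.9 h
```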

 

Userlevel 2
Badge +3

Damian,

  Thank you for the feedback.  I know a long while back I ran speed tests and did see improvement for some backups using Source-Side Cache within our DC.  Maybe it is time for me to run the tests again.
  The jobs do 'self heal' if the content is not only System State.  For our subclients that only contain System State, the jobs fail rather than going pending and resuming 20 minutes later (this seems like another bug).

  My concern is that so many jobs get this error in the middle of their backup, and that it happens to so many client types.  Something has surely changed with the client-side cache between 11.20.32 and 11.20.60.

   Reading the links you provided scares me even more.  How could the Source-Side cache have entries that the DDB no longer has references for?  Do we need a Source-Side cache pruning process to keep them in sync ahead of time?

   Hoping to find out if any other customers are getting this issue as well.

 

Thank you again,

 

Userlevel 7
Badge +15

Thanks @Farmer92 for the additional detail

How could the Source-Side cache have entries that the DDB no longer has references for?

 

The most common scenario is that data verification is finding bad blocks (a block can't be read or is missing from disk) - that will remove the corresponding signatures from the DDB to ensure we do not dedupe against blocks that are no longer available. Do you run verification, and have you checked those jobs to make sure no bad blocks are being detected?
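
For context, here is a hypothetical sketch (not Commvault's actual verification code) of what a data verification pass does conceptually: read each stored block back and drop the signature of any block that is missing or unreadable, so new backups stop deduplicating against it. Any client cache still holding that signature is then ahead of the DDB until it resynchronises.

```python
# Conceptual sketch only -- paths, structures, and names are illustrative.
import hashlib
from pathlib import Path

def verify_blocks(ddb: dict[str, Path]) -> list[str]:
    """Return the signatures removed because their blocks failed verification."""
    removed = []
    for sig, block_path in list(ddb.items()):
        try:
            data = block_path.read_bytes()
        except OSError:
            data = None  # block missing or unreadable on disk
        if data is None or hashlib.sha256(data).hexdigest() != sig:
            del ddb[sig]       # stop referencing the bad block
            removed.append(sig)
    return removed
```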

I spoke with our engineering team, and the other possibility is if you are manually deleting backup jobs (job-based pruning). I had a look at your CommCell health, and I do see that secondary records (duplicate block references) have dropped on 3 out of 4 of your DDBs, which indicates a lot of jobs are being removed.

 

For a our sub clients that only contain System State, the jobs fail

 

Ah, yes, the System State portion of WFS jobs is non-resumable, so that is expected. We don't see separate System State-only subclients much anymore. The newer option to back up System State only with full backups, plus the ability to do incremental System State when Indexing V2 is enabled, has mitigated much of the need for them - but not all of it.

My concern is that there are so many jobs that get this error in the middle of their backup

 

Oh, totally agree with this. This should be a very rare occurrence - I think it certainly needs more investigation. A quick search over recent cases does not show an increase in the number of occurrences here - but I did find your case. If you are not deleting jobs manually, this should definitely be investigated via escalation to our engineering team. Looks like Jesse is helping with that!

Userlevel 2
Badge +3

Damian,

  Thank you again for the thorough response.   

   We have not done any job-based pruning.  I will have to look into the verification.  My concern is how long it will run and the potential impact it will have on running jobs.   The MAG library is 750 TB with about 70 TB of free space currently.  The DDB is partitioned across 4 MediaAgents.

   We do have a very high data change rate; I am sure that is not helping matters, but I would not think it should cause the SSDB cache and the DDB to get out of sync.

   I am hoping our ticket will get to the engineering team soon so we can find the root cause of this.

   By the way, who is Jesse?  Our ticket is with Nick.  Does Jesse have a similar ticket?

Thank you again for your help!

Userlevel 7
Badge +23

@Farmer92 , I reached out to the team that owns your case and they’ll give it some extra attention.  Keep us posted (and I’ll track as well)!
