Solved

Cache-Database and Deduplication-Database are out of sync

  • 17 August 2021
  • 7 replies
  • 1094 views

Userlevel 3
Badge +5

Do you see these errors for your jobs?

When we updated from 11.20.32 to 11.20.60, we started getting Cache-Database errors on various backup types: FileSystem iDataAgent, NDMP backups, and others.

The exact error we get is:
Error Code: [40:110]
Description: Client-side deduplication enabled job found Cache-Database and Deduplication-Database are out of sync.

I have a ticket open with support, but I am wondering if the issue is unique to us or if it is happening to other customers as well.

 

Thank you,


Best answer by Mike Struening RETIRED 14 April 2022, 19:56


7 replies

Userlevel 7
Badge +23

Hey @Farmer92,

Couple of possibilities based on this KB article: https://kb.commvault.com/article/DD0010

  • This error message might appear due to one of the following reasons:

    • Source-Side Deduplication Database (SSDB) is out of sync with the Deduplication Database (DDB) on the MediaAgent. When this occurs, the SSDB has reference pointers (signature hashes) that the central DDB doesn't have, so the jobs of clients associated with that SSDB go to a pending state (see the toy sketch after this list).
    • Deduplication data verification job marked invalid data blocks as bad.
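To picture what "out of sync" means here, a toy Python sketch follows. It is purely illustrative and not Commvault's actual implementation: a client-side signature cache that still holds a signature the central DDB has dropped (because a job aged or a block was marked bad) is exactly the situation behind error 40:110.

# Illustrative only - a stale client-side cache entry vs. the central DDB.
central_ddb = {"sig_A", "sig_B"}            # signatures the MediaAgent still tracks
client_cache = {"sig_A", "sig_B", "sig_C"}  # sig_C has since been pruned from the DDB

def backup_block(signature):
    if signature in client_cache:
        # The client skips sending the block, but the DDB no longer knows sig_C,
        # so it cannot add a reference: the two databases are "out of sync".
        if signature not in central_ddb:
            raise RuntimeError("Error 40:110 - cache DB and dedup DB out of sync")
        return "reference added"
    # Unknown signature: send the full block and register it in both databases.
    central_ddb.add(signature)
    client_cache.add(signature)
    return "new block written"

print(backup_block("sig_A"))                # dedupe hit, works fine
try:
    print(backup_block("sig_C"))            # stale cache entry triggers the error
except RuntimeError as err:
    print(err)                              # the real job would go pending here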

 

Not sure there is much you can do on point #1 - the job will self heal. Most likely a job aged off, which removed a signature that was still in the cache DB and required for the backup to proceed.

For #2 - definitely check out your verification jobs and see if they picked up an anomaly.

 

Cache DB is really only useful for backups over WAN or other high-latency, low-bandwidth links. For regular datacenter backups you may get better performance by disabling it, and it also frees up space on the client (not much, but every bit counts at scale).

 

Userlevel 3
Badge +5

Damian,

  Thank you for the feedback.  I know a long while back I ran speed tests and did see improvement for some backups using Source Side Cache within our DC.  Maybe it is time for me to run the tests again.
  The jobs do ‘self heal’ if the content is not only System State.  For our subclients that contain only System State, the jobs fail rather than go pending and resume 20 minutes later (this seems like another bug).

  My concern is that there are so many jobs that get this error in the middle of their backup, and that it happens to so many client types.  Something has surely changed with the client-side cache between 11.20.32 and 11.20.60.

   Reading the links you provided scares me even more.  How could the Source Side cache have entries that the DDB does not have references for anymore?  Do we need a Source Side cache pruning process to keep them in sync ahead of time?

   Hoping to find out if any other customers are getting this issue as well.

 

Thank you again,

 

Userlevel 7
Badge +23

Thanks @Farmer92 for the additional detail

How could the Source Side cache have entries that the DDB does not have references for anymore

 

The most common scenario is that data verification is finding bad blocks (a block can't be read, or is missing from disk) - that will remove the corresponding signatures from the DDB to ensure we do not dedupe against blocks that are no longer available. Do you run verification, and have you checked those jobs to ensure that no bad blocks are being detected?
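As a purely illustrative sketch (assumed behaviour, not Commvault code), this is roughly the effect such a verification pass has: signatures whose blocks can no longer be read are dropped from the DDB, which leaves any client cache still holding them stale.

# Illustrative only - verification prunes signatures of unreadable blocks.
disk_blocks = {"sig_A": b"data-a", "sig_B": None, "sig_C": b"data-c"}  # sig_B unreadable
ddb = {"sig_A", "sig_B", "sig_C"}

def verify_and_prune(ddb, disk_blocks):
    bad = {sig for sig in ddb if disk_blocks.get(sig) is None}
    ddb -= bad        # remove bad-block signatures from the DDB
    return bad        # client caches still holding these are now stale

print(verify_and_prune(ddb, disk_blocks))   # {'sig_B'}
print(ddb)                                  # {'sig_A', 'sig_C'}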

I spoke with our engineering team, and the other possibility is that you are manually deleting backup jobs (job-based pruning). I had a look at your CommCell health, and I do see that secondary records (duplicate blocks) have dropped on 3 out of 4 of your DDBs, which indicates we are removing a lot of jobs.

 

For our subclients that contain only System State, the jobs fail

 

Ah, yes, the system state portion of WFS jobs is non-resumable, so that is expected. We don't typically see separate subclients for system state only much anymore. The newer option of backing up system state only with full backups, or doing incremental system state when Indexing V2 is enabled, has mitigated much of the need - but not all of it.

My concern is that there are so many jobs that get this error in the middle of their backup

 

Oh, totally agree with this. This should be a very rare occurrence - I think it certainly needs more investigation. A quick search over recent cases does not seem to show an increase in the number of occurrences here - but I did find your case. For sure, if you are not deleting jobs manually, this should be investigated via escalation to our engineering team. Looks like Jesse is helping with that!

Userlevel 3
Badge +5

Damian,

  Thank you again for the thorough response.   

   We have not done any job-based pruning.  I will have to look into the verification.  My concern is how long it will run and the potential impact it will have on running jobs.   The MAG library is 750 TB with about 70 TB of free space currently.  The DDB is partitioned across 4 MediaAgents.

   We do have a very high change rate of data.  I am sure that is not helping matters, but I would not think it should cause the SSDB cache and the DDB to get out of sync.

   I am hoping our ticket will get to the engineering team soon so we can find the root cause of this.

   By the way, who is Jesse?  Our ticket is with Nick.  Does Jesse have a similar ticket?

Thank you again for your help!

Userlevel 7
Badge +23

@Farmer92 , I reached out to the team that owns your case and they’ll give it some extra attention.  Keep us posted (and I’ll track as well)!

Userlevel 7
Badge +23

Sharing what appears to be the case resolution.

@Farmer92 , can you confirm if this worked?

* The two factors together are causing a unique scenario here where we keep hitting a check that causes client cache to go out of sync:

a. References to primary blocks written since 2018

b. Fast rate of recycling of primary blocks.

 

For now, in order to avoid the issue, we are considering the following steps:

1] We need to prevent references to blocks written before Aug 2019 in order to get a window before we hit the problematic check again. To do this we can run the DDBParam script.

 

By avoiding these references we may end up requiring up to 17 TB of space for the blocks to be rewritten, based on the dumps we have collected.

 

As a conservative approach, we can do this in multiple steps, starting with older dates.

 

2] We also recommend moving the subclients that are consuming primaries at a fast rate to a different storage pool.

Userlevel 7
Badge +23

Sharing the latest update, which seems to have explained everything:

The oldest block retained is from 2017, but we are no longer referencing those blocks as they are past 32 billion IDs, so the primary block distribution will not be uniform. The older blocks from 2017 are retained mostly because of valid jobs that are still retained.

We don’t have the latest DDB dump, but by a rough estimate, blocks written up to Jan 31st 2019 will be referenced until the end of April.

So we can start with Jan 31st 2019 as the first step for the setting below, then proceed to April 2019 and then August 2019 after observing space consumption.
 

I believe these are the steps to be carried out:

 

* The two factors together are causing a unique scenario here where we keep hitting a check that causes client cache to go out of sync:
a. References to primary blocks written since 2018
b. Fast rate of recycling of primary blocks.

For now, in order to avoid the issue, we are considering the following steps:
1] We need to prevent references to blocks written before Aug 2019 in order to get a window before we hit the problematic check again. To do this we can run the following command:
qoperation execscript -sn DDBParam -si set -si 66 -si DDBDoNotRefBeforeTime -si "2019-08-14 00:51:24.000"

By avoiding these references we may end up requiring up to 17 TB of space for the blocks to be rewritten, based on the dumps we have collected.

As a conservative approach, we can do this in multiple steps, starting with older dates.

2] We also recommend moving the subclients that are consuming primaries at a fast rate to a different storage pool.
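Putting step 1] together with the staged dates mentioned earlier (Jan 31st 2019, then April 2019, then August 2019), a small sketch of how the command could be staged. Only the "66" value and the August timestamp are copied verbatim from support's command above; the first two timestamps are hypothetical placeholders for the earlier dates, and "66" is presumably this environment's DDB store ID (an assumption), so adjust for your own cell.

# Sketch only - generates the staged DDBParam commands described in the case notes.
stages = [
    "2019-01-31 00:00:00.000",   # first, conservative cutoff (time of day is hypothetical)
    "2019-04-30 00:00:00.000",   # advance after observing space consumption (hypothetical)
    "2019-08-14 00:51:24.000",   # final cutoff, taken verbatim from the case notes
]

for cutoff in stages:
    print('qoperation execscript -sn DDBParam -si set -si 66 '
          '-si DDBDoNotRefBeforeTime -si "%s"' % cutoff)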
