
Hello,

 

I’ve logged a case with Commvault support, but… they’re reviewing logs… (🙄)

https://support.commvault.com/Case/Details/241010-316 (for those who have access to the case and want to see the details of the issue).

One of my MAs, a physical Linux server with of course a dedicated NVMe disk properly set up with LVM to keep enough spare space for LVM ‘snapshots’ of the DDB volume, got a stuck DDB backup job during the night.

The ddbbackup job was killed (😏) and even restarted later (🤐), and as it was still not working, the server was rebooted before another ddbbackup attempt was initiated.

Of course, it failed.

Then I was called, looked at the problem, and (my gosh, sorry for that) CHECKED THE LOGS of the job…

Right away I spotted that the ddbbackup was stuck querying ‘vgdisplay -v’ to get the list of volume groups where the DDB volume is most likely configured. No need to get into details on this.
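
(If you want to reproduce the hang by hand, something like this works; the 30s timeout is just an arbitrary guard so the shell comes back if vgdisplay never returns, and ddb_vg is the VG name from the lsblk extract further below:)

timeout 30 vgdisplay -v ddb_vg || echo "vgdisplay hung or failed"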

So, something’s wrong with the VG/lvol. 

Then I issued the ‘lsblk’ command to see if the disk and the LVM volume of the DDB were there, and discovered that there were multiple LVM snapshots of the DDB volume already present, left over from past (failed) ddb backup jobs.

Of course, in this situation, no new LVM snap of the DDB volume can be taken without filling the disk. So I’m stuck and have had no DDB backup for 2 days.
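
(A plain ‘lvs’ on the VG also shows the leftover snapshots directly: in the default output, snapshot LVs carry an ‘s’ in the Attr column, list lvol1 as their Origin, and Data% shows how full each 50G COW area is:)

lvs ddb_vg
# expected: lvol1 plus the DDBSnap_* snapshot LVs with their COW usage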

(And it’s part of a Gridstor of 4 MAs, for my main datacenter, yes, Murphy’s law😎)

 

So, I have been digging into the depths of the documentation, but found no details on how to ‘recover’ from a failed/partial DDB backup job when it failed after the creation and mount of the LVM DDB volume snap.

How can this be ‘cleaned out’?

Using Windows VSS I would know how to commit/revert, but this Linux DDB process is not documented anywhere I can find, and of course it renamed the ‘source’ volume/filesystem hosting the DDB:

(lsblk extract)

nvme0n1                           259:0    0  1.8T  0 disk
└─nvme0n1p1                       259:1    0  1.8T  0 part
  ├─ddb_vg-lvol1-real             253:107  0  1.7T  0 lvm
  │ ├─ddb_vg-lvol1                253:108  0  1.7T  0 lvm  /logi/cv/ddb
  │ ├─ddb_vg-DDBSnap_461766457    253:110  0  1.7T  0 lvm
  │ └─ddb_vg-DDBSnap_1877354414   253:112  0  1.7T  0 lvm
  ├─ddb_vg-DDBSnap_461766457-cow  253:109  0   50G  0 lvm
  │ └─ddb_vg-DDBSnap_461766457    253:110  0  1.7T  0 lvm
  └─ddb_vg-DDBSnap_1877354414-cow 253:111  0   50G  0 lvm
    └─ddb_vg-DDBSnap_1877354414   253:112  0  1.7T  0 lvm

We can see ‘ddb_vg-lvol1-real’, with ddb_vg-lvol1 hosting the real filesystem of my DDB.

But this VG configuration has been altered by the ddbbackup process; after a normal ddbbackup execution it should look like this:

nvme0n1          259:0    0  1.8T  0 disk
└─nvme0n1p1      259:1    0  1.8T  0 part
  └─ddb_vg-lvol1 253:103  0  1.7T  0 lvm  /logi/cv/ddb

 

So I am asking CV support for assistance to clean out all the DDB snaps and get back to a ‘normal’ LVM configuration.

Does anyone know how to achieve this? Any documentation link?

 

Thanks !

 

Hi @Laurent,

 

If there are old DDB snaps still present, they can be removed using the ‘lvremove /dev/nvmevg/DDBSnap_<number>’ command. The case owner will get back to you with the next plan of action.

 

Regards,

Karthik


Hi Karthik! Thanks a lot, this looks to be exactly what I’m looking for.
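
(Applied to my layout from the lsblk extract above, with my VG name substituted for the generic one, that would be roughly:)

umount /dev/ddb_vg/DDBSnap_461766457 2>/dev/null  # only needed if a snap is still mounted somewhere
lvremove /dev/ddb_vg/DDBSnap_461766457
lvremove /dev/ddb_vg/DDBSnap_1877354414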

After log exchanges and finally a Zoom, Support understood that I was not waiting for an explanation of the root cause of this DDBBackup failure (which I already know), but for how to ‘clean up’ those mounted filesystems and the ddb-backup-created volumes..

They’re just very slow to react.

My company’s business is not really impacted by this issue, so it cannot be qualified as P1 or P2, but only P3. Meaning support works this case only during business hours, and with low priority.

Meanwhile, it’s been a week that my main datacenter has had backup failures because of this stuck MA, and success KPIs are falling.. 😥

 

Regards,

Laurent.


Update: pointing me to this command helped solve my case.

I could safely run multiple ‘lvremove’ commands to delete all the ‘dead’ snapshots; my configuration went back to normal, and new DDB backups ran fine.
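
(A quick check afterwards confirms the cleanup; once all snapshots are removed, the -real/-cow devices disappear and only the plain lvol1 remains, as in the ‘normal’ extract above:)

lsblk /dev/nvme0n1
lvs ddb_vg  # should list only lvol1, no more DDBSnap_* entries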

Thanks a lot again, Karthik!


Why are you running DDB backups at night? The default DDB backup schedule is at 16:00 in the latest versions; do you still have the DDB backup schedule configured to run every 8 hours?


Hi, 

 

I’m running DDB backups every 8 hours because we have many new blocks created daily, and many old ones purged daily as well, on this Gridstor of 4 MAs using GDSPs.

 

We do not only run the usual “night backups”: in addition, about one third of our backups execute throughout the day.

So, as the MA infrastructure can handle this, we do it.

I already had to perform a restore of ¼ of this 4/4 Gridstor, and the DDB reconstruction took much more than 8 hours. So, the shorter the ddbbackup interval, the fewer blocks we have to recover before we can run deduplicated backups again.

 

In a usual single MediaAgent/DDB configuration, a single daily backup is sufficient.

 

Regards,

 

Laurent.

