Solved

OK to kill long running DDBBackup job?

  • 1 September 2022

Userlevel 4
Badge +15

I’m having problems with DDBBackup jobs at my DR site.  I changed the configuration from every 6 hours to once per day but the backup from yesterday is still running and I’m at the 22 hour point.  I’d like to kill this job and let a fresh one start but I understand that the DDBBackup uses snapshots and I’m afraid that if I kill it there won’t be a proper snapshot cleanup.  

Is it OK to kill a DDBBackup job that’s been running for this long?

Ken


Best answer by Ken_H 14 September 2022, 17:27


Userlevel 7
Badge +19

I personally have not seen any issue doing this in the past, so I expect Commvault to handle this properly. If you do run into an issue, then it's quite easy to get rid of the snapshot yourself. Of course, it would be interesting to learn why it is "hanging" all of a sudden when the only change was to the backup frequency.
 

Userlevel 6
Badge +15

I used to have such issues, mostly because multiple DASH aux copies were running on my DR site at DDB backup time, resulting in high DDB usage while the volume hosting the DDB was being backed up/snapped in parallel.

You can look at the DDBbackup job to try to troubleshoot and check what’s happening. 

Depending on the OS of your MA(s): on Windows, you can use vssadmin commands and Resource Monitor to see what’s going on in terms of I/O on the volume hosting your DDB.

 

What you should avoid, especially if you have an MA hosting multiple DDBs of different sizes, is rebooting/resetting your MA during the DDB backup job.

Userlevel 4
Badge +15

I ended up killing the DDB Backup job and needing to use a Force Kill.  I stopped all the aux copies and restarted the CommVault services then resumed the copies.  I never checked the status of the snapshot so I assume it had completed.  A subsequent DDB Backup completed successfully although it took 115 hours (4.8 days).  I have a ticket open with CV support to look into why these backups are so problematic.

Userlevel 7
Badge +23

Glad that part worked out.  It would be worth sharing the information you get regarding the backups.  That way we have a nice, holistic thread.

Userlevel 4
Badge +15

It appears as if one of the controllers for the HPE MSA disks has failed.  Today I was getting a disk response time of 680,692ms and a disk queue length of over 600.  My sysadmin has forced communication through the other controller and my response time is now less than 40ms and the queue length is less than 1.  

Unfortunately the Dedup DB Reconstruction job is failing so my backups still aren’t running.  I have ticket 220912-556 open to get some help with this.  

Userlevel 7
Badge +23

Thanks for the update, @Ken_H .  I’ll keep an eye on it!

Userlevel 4
Badge +15

I’m doing the DDB reconstruction using the “Reconstruct entire DDB without using a previous recovery backup” option.  After 3.1 hours it’s processed 56.4TB out of 841.1TB total.  Based on these numbers, I’m estimating 42.5 hours remaining.
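For anyone wanting to redo this kind of estimate, here is a minimal sketch of the arithmetic, assuming the reconstruction keeps the same average throughput for the rest of the job (which it may well not). The function name `estimate_remaining_hours` is made up for illustration and is not part of any Commvault tooling. With the figures above, a straight linear extrapolation comes out at roughly 43 hours, in the same range as the estimate quoted.

```python
def estimate_remaining_hours(elapsed_hours: float,
                             processed_tb: float,
                             total_tb: float) -> float:
    """Linearly extrapolate the time left on a job.

    Assumption: the average throughput seen so far holds until the end.
    """
    if elapsed_hours <= 0 or processed_tb <= 0:
        raise ValueError("elapsed_hours and processed_tb must be positive")
    if processed_tb > total_tb:
        raise ValueError("processed_tb cannot exceed total_tb")

    rate_tb_per_hour = processed_tb / elapsed_hours
    remaining_tb = total_tb - processed_tb
    return remaining_tb / rate_tb_per_hour


if __name__ == "__main__":
    # Figures from the post above: 56.4 TB of 841.1 TB after 3.1 hours.
    remaining = estimate_remaining_hours(3.1, 56.4, 841.1)
    print(f"Estimated time remaining: {remaining:.1f} hours")
```

Treat the result as a rough guide only, since throughput on a rebuild like this can change a lot as it goes.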

Userlevel 7
Badge +23

I had to check what day of the week today is, first 🤣

Hopefully on Thursday you have good news!

Userlevel 4
Badge +15

When I checked the Dedupe DB rebuild job at 7:30 AM today, it was showing 68% complete after running for 20.5 hours.  When I checked at 9:00 AM, the job had completed and both backups and Aux copy jobs had resumed.  I’m guessing it will take 10 to 12 hours for the queued jobs to fully catch up.

In the end, the long-running DDB Backup job was really a symptom of a failure with the media agent hardware.  This was tricky because not all the drive letters were impacted equally by the problem controller, so it appeared other jobs were running OK… or at least well enough to not raise an error.

Thanks to everyone for their feedback on this topic.

The problem is resolved.

Ken

Userlevel 7
Badge +23

Glad to hear it, and happy you shaved a day off the ETA 🤣

Badge +2

@Ken_H How did you find the disk response time on your HPE MSA disk?
