The file system backups are showing as successful (no warnings), yet they are failing to protect some of the files. I am wondering why the job is not completing as partially successful and flagging this as a VSS issue. Is that because there are no application-specific VSS writers for the application, and hence quiescing is not working its magic?
[C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData\SiteData.db] The process cannot access the file because it is being used by another process.
Hi @Theseeker ,
Backup jobs can be configured with thresholds so that a job with tens of thousands of files will still complete despite one or two failed files. The thresholds can be modified here: Home tab > Control Panel > Data > Job Management
I dislike this approach because you are making a global change here which impacts everything. If I recall correctly you cannot override it on a client computer, right?
Thinking about it, I would like to be able to turn it around: have the global default set to one, meaning a job is already reported as incomplete if it missed one file. The ability to influence the threshold should then sit at the server group or client computer level. This allows a more granular approach and offers more control.
Imagine this one file to be very important which is being filtered out now as the job just completes successfully…. I wouldn't like to be the guy to receive the request to recover the file to discover that it was not protected for months already.
Another example: a customer runs a specially tailored application that holds all application files in use. An agreement is made to stop the application at time X for backup. This works until the application grows, stopping it or backing it up takes too long, and the backup already kicks in while files are still in use…
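The "turn it around" idea above (strict global default, overrides at server group or client level) can be sketched roughly as follows. This is only an illustration in Python: the function names, the precedence order and the status strings are assumptions made for the example, not how Commvault actually stores or evaluates its thresholds.

```python
from typing import Optional

# Hypothetical strict default: a single missed file already flags the job.
GLOBAL_FAILED_FILE_THRESHOLD = 1


def effective_threshold(client: Optional[int] = None,
                        server_group: Optional[int] = None,
                        global_default: int = GLOBAL_FAILED_FILE_THRESHOLD) -> int:
    """Return the most specific threshold that is set: client, then server group, then global."""
    if client is not None:
        return client
    if server_group is not None:
        return server_group
    return global_default


def job_status(failed_files: int, threshold: int) -> str:
    """Flag the job as soon as the number of failed files reaches the threshold."""
    return "Completed with errors" if failed_files >= threshold else "Completed"
```

With this shape, a noisy server group could be relaxed to, say, 50 failed files while every other client keeps the strict default of one.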
Hi @Onno van den Berg ,
The global level is still IDA-specific, so it may clear some very specific conditions. However, I understand there may be FS IDAs used to protect critical systems too.
I think this should cover those scenarios you mentioned, but let me know if not and I’ll do some more digging.
Cheers,
Jase
@Theseeker Just curious why you are seeing this error. If a file is locked and VSS for locked files is enabled (which is the default), we should take a VSS snap and protect the file unless the VSS snap is failing. Can you provide us the backup log so we can see why the snap is not taken here?
4600 18f0 02/09 00:42:54 4041670 CFileBackupFSW2K::CreateOFMSnap(4065) - Failed to create OFM snap, error=0x800704D5:{COpenFileManager::CreateShadow(94)} + {COpenFileManager::CreateShadow1(420)} + {SHADOWSET::CShadowSet::CreateShadow(67)} + {SHADOWSET::CShadowSet::CreateShadowSet(222)} + {SHADOWSET::CShadowSet::CheckIfRetryIsPossible(284)/W32.1237.(The operation could not be completed. A retry should be performed. (1237))-Unable to create a shadow at this time, error=0x8004230F}
4600 18f0 02/09 00:42:54 4041670 FsBackupCtlr::DoFailedFileRetryBackup(10869) - Failed to create OFM snap, error=0x800704D5:{CFileBackupFSW2K::CreateOFMSnap(4066)/Failed to create OFM snap} + {COpenFileManager::CreateShadow(94)} + {COpenFileManager::CreateShadow1(420)} + {SHADOWSET::CShadowSet::CreateShadow(67)} + {SHADOWSET::CShadowSet::CreateShadowSet(222)} + {SHADOWSET::CShadowSet::CheckIfRetryIsPossible(284)/W32.1237.(The operation could not be completed. A retry should be performed. (1237))-Unable to create a shadow at this time, error=0x8004230F}
4600 18f0 02/09 00:42:54 4041670 FsBackupCtlr::DoFailedFileRetryBackup(10896) - Cannot proceed with backup of locked files in FailedFileRetry.cvf due to a previous error. Copying the locked file entries to Failures.cvf for retry in the next job.
4600 18f0 02/09 00:42:54 4041670 FsBackupCtlr::PutInFileTheFailedFilesList(5733) - FailedFileCnt 3194]
4600 18f0 02/09 00:42:54 4041670 FsBackupCtlr::DoFailedFileRetryBackup(10912) - Copied locked file entries from FailedFileRetry.cvf to Failures.cvf
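As a side note, the two hex codes in the log above can be unpacked mechanically. A minimal Python sketch, assuming the standard 32-bit HRESULT layout (severity bit, 11-bit facility, 16-bit code); the helper name is made up for this example:

```python
FACILITY_ITF = 4    # interface-specific codes (COM components define their own)
FACILITY_WIN32 = 7  # a plain Win32 error wrapped in an HRESULT


def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    return {
        "failure": bool((value >> 31) & 1),  # bit 31: 1 means failure
        "facility": (value >> 16) & 0x7FF,   # bits 16-26: facility
        "code": value & 0xFFFF,              # bits 0-15: status code
    }


outer = decode_hresult(0x800704D5)  # facility 7 (Win32), code 1237
inner = decode_hresult(0x8004230F)  # facility 4 (interface-specific), code 0x230F
```

So the outer error is just Win32 error 1237 ("A retry should be performed"), matching the `W32.1237` in the log, while the inner 0x8004230F is an interface-specific code coming from VSS itself. I believe that one corresponds to VSS_E_UNEXPECTED_PROVIDER_ERROR, but it is worth verifying against Microsoft's list of VSS error codes.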
@Theseeker Looks like VSS is not in good enough shape to take a snapshot, hence the failure. Can you check if there is enough space on the volume to take the snapshot? You can use the command below to see where the shadow storage for the volume is located:
vssadmin list shadowstorage
If this is good, can you check the Event Viewer to see if there are any errors from VSS?
You can also restart the VSS service to see if the issue corrects itself.
Thanks,
Karthik
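If the shadow storage check suggested above has to be repeated across many clients, the output of `vssadmin list shadowstorage` can be turned into something easier to compare. A rough Python sketch: the sample text below is illustrative only (the volume GUID and sizes are made up, and real output differs by Windows version and locale), and the parser simply assumes "key: value" lines under each "Shadow Copy Storage association" header.

```python
# Illustrative sample only; not captured from the system discussed in this thread.
SAMPLE_STORAGE = r"""
Shadow Copy Storage association
   For volume: (C:)\\?\Volume{11111111-2222-3333-4444-555555555555}\
   Shadow Copy Storage volume: (C:)\\?\Volume{11111111-2222-3333-4444-555555555555}\
   Used Shadow Copy Storage space: 1.2 GB (1%)
   Allocated Shadow Copy Storage space: 1.5 GB (1%)
   Maximum Shadow Copy Storage space: 11.9 GB (10%)
"""


def parse_shadow_storage(text: str) -> list:
    """Group the 'key: value' lines of each storage association into a dict."""
    associations = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Shadow Copy Storage association"):
            associations.append({})  # start a new association block
        elif ":" in line and associations:
            # Split on the first colon only; volume paths contain more colons.
            key, _, value = line.partition(":")
            associations[-1][key.strip()] = value.strip()
    return associations


storage = parse_shadow_storage(SAMPLE_STORAGE)
```

On a real client the text would come from running the command itself (for example via `subprocess.run`), and an empty result would simply mean no shadow storage association is configured.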
How can a job be successful if it fails to back up files? I’ve never understood the reasoning behind it. If the one file that was missed is the most critical file and it’s lost even though the backups have been successful, how would you explain it?
BR Henke
Same question here. Best guess would be someone didn’t like to see so many Failed or CWE jobs, so they requested this “solution” to hide the minor issues.
It even considers a job OK if CommServe services failed to be backed up, or most other parts of the System Writers. Though that bug should finally be fixed in the next HPK.
Personal solution/workaround: an alert that goes off on specific System Writer events and triggers a workflow to fix the issues, create a ticket with the server owner, or just send an info mail to us for further checking.
Now I would really like to configure this via Command Center. Customers use this as their console.
Good conversation here! The main (historical) reason for job completion status was that it can be normal for a few files to fail for various reasons. If you marked every job that skipped one file as Completed with Errors, they’d all look that way.
Completed with Errors is reserved for either System State components getting missed, or expected databases getting missed (there might be others escaping me now). The status really means “we ran and finished, but something major was missed; we advise looking” as well as “we won’t prune previous Completed Without Errors jobs until this is fixed” by default. There’s some extra logic that goes into play for CwE.
For your request @Onno van den Berg for Command Center inclusion, I’ll get the right people to respond.
For the OP, @Theseeker , let us know if you’re able to address the VSS issues via a reboot and if subsequent backups work as expected.
Hi Mike
That's where “Completed with Warnings” would come into play for me.
Everything backed up fine: Completed
Some minor or non critical files not backed up: Completed with Warnings
Some major or critical files not backed up: Completed with Error
Major parts of Backup or System Writers failed: Failed
CVDB knows the StatusName, but I haven’t seen it used in action yet, nor is it available as an option in Error Threshold Roles.
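The four tiers proposed above map naturally onto a small decision function. A minimal Python sketch of that proposal; the function and its inputs are hypothetical, and it assumes the split between "minor" and "critical" files has already been made by some rule that is not defined here:

```python
def classify_job(minor_failed: int, critical_failed: int, writers_failed: bool) -> str:
    """Map failure counts onto the four proposed statuses, most severe first."""
    if writers_failed:
        return "Failed"                   # major parts of the backup or System Writers failed
    if critical_failed > 0:
        return "Completed with Errors"    # some major or critical files were missed
    if minor_failed > 0:
        return "Completed with Warnings"  # only minor or non-critical files were missed
    return "Completed"                    # everything backed up fine
```

Checking the most severe condition first means a job with both minor and critical misses always lands in the higher tier.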
That’s a clever idea. Let me pass that up the chain and see what I can do!
Thanks, @Stefan Vollrath !
Yes, you are right, there are many jobs that complete with errors, though we thought that missing one file is bad enough. So we changed the setting (I think it’s the one we are talking about) to 0 failed files.
This gives an indication of which clients need attention, and most of them can be dealt with by applying filters, either global or local. Some can’t be filtered out though, such as SystemState and so on.
Most of the problematic ones are systems with a lot of “temp” files that seem to be flagged for backup but are no longer there when the backup occurs, and hence are marked as failed.
In addition we create alerts for systems under compliance audits for failed files/jobs.
My .02 on this:
If the writer is in a bad state, that needs to get corrected before anything else.
If the writer still fails on that database file, I would suggest looking into how the vendor recommends protecting the file. If there are freeze/thaw scripts or quiesce scripts, those should be applied to the subclient. This can be done in Command Center.
This has always baffled me.
How can a job be successful if it fails to back up files? I’ve never understood the reasoning behind it. If the one file that was missed is the most critical file and it’s lost even though the backups have been successful, how would you explain it?
BR Henke
I agree with this Henke… which is why only under extremely rare cases would I personally suggest manipulating how jobs are classified, based on errors.
To me, errors and failed files mean “fix me”. If those are files you don’t need to protect (like tmp/cache files), we can filter them. In Command Center, filtering can be done globally (manage > system global filters), on the server group level (configuration tab > file exceptions), on the plan level (through backup content settings), and on the subclient level (through custom backup content).
Like SLA, if it’s not 100%, something is wrong and needs corrective action.
How do you distinguish between “Some minor or non critical files not backed up” and “Some major or critical files not backed up”? You need some rule to put in place there.
For example, any missed file in a location holding user data goes into the Major category, while non-critical Windows OS files go into Minor.
@Henke what Commvault could introduce is a post-process that goes through the list of failed files and performs another file scan to see if the files still exist. That could filter out the temp files automatically and reduce the number of reported files. The issue with temp files being wiped while the job is progressing will also be more visible with long-running jobs.
Maybe this is something to look into!
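To make the re-scan idea above concrete, here is a minimal Python sketch. It is not an existing Commvault feature or API; the function name is hypothetical, and it assumes the failed-file list is available as plain paths that can be checked from the client itself.

```python
import os


def split_failed_files(failed_paths):
    """Return (still_present, vanished) after a second look at the file system."""
    still_present, vanished = [], []
    for path in failed_paths:
        # A file that no longer exists was most likely a temp file removed mid-job.
        (still_present if os.path.exists(path) else vanished).append(path)
    return still_present, vanished
```

Only the `still_present` list would then count towards the failed-file threshold, while `vanished` could be logged separately as probable false positives.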
@MFasulo maybe an idea to check the writer status before executing a job, so Command Center can deliver sensible information to the user, informing them to investigate the writer status.
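A pre-job writer check along these lines could start from the output of `vssadmin list writers`. A rough Python sketch, assuming the usual English layout of that output (a "Writer name:" line followed by "State:" and "Last error:" lines); the sample text, the GUID-free layout and the 'Example Writer' entry are made up for illustration:

```python
# Illustrative sample only; real output has more fields and varies by locale.
SAMPLE_WRITERS = """
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'Example Writer'
   State: [7] Failed
   Last error: Timed out
"""


def parse_writers(text):
    """Collect name, state and last error for each writer block."""
    writers = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Writer name:"):
            writers.append({"name": line.split(":", 1)[1].strip().strip("'")})
        elif writers and line.startswith("State:"):
            # "State: [1] Stable" -> "Stable"
            writers[-1]["state"] = line.split("]", 1)[-1].strip()
        elif writers and line.startswith("Last error:"):
            writers[-1]["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unhealthy_writers(text):
    """Return writers that are not Stable or that report a last error."""
    return [w for w in parse_writers(text)
            if w.get("state") != "Stable" or w.get("last_error") != "No error"]
```

Anything returned by `unhealthy_writers` would be the cue to warn the user before the job starts rather than after files have been skipped.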
I agree. @Mike Struening do you know if we do any VSS writer remediation as part of some error output? I recall back in my support days we did post the VSS writer status before backup and after backup, not sure if we still do that.
When this error occurred, there should have been corresponding errors in the OS event viewer.
@Mike Struening I would definitely not pause the job because this will definitely have an impact. What I would like to see is an alert being raised/sent, but more importantly I want better feedback from Command Center that something is happening. Both myself and a lot of colleagues who have been forced to use Command Center really miss a single pane of glass. It's too much clicking around and information is scattered all over the place. To be precise, in this case I would like to see a warning sign/indicator/LED/icon beside the client computer in the servers view that indicates something is not OK with that particular client.
Please also look at my idea around initiating a post phase that re-scans all failed files and clears the ones that have been deleted in between the two scan phases. This will rule out false positives like temp files etc.
Next week we'll discuss the bigger picture regarding my statement about Command Center with @MFasulo
Well, @MFasulo is the right guy for Command Center, no question.
Can you link me to your idea about the post phase scan? I want to ensure it’s getting traction/attention.
We highlight the last backup status and depending on the failure type we provide some quick recommended actions:
You can see there is a difference between the failed-to-start recommendations and the failed VSS snapshot. When we first started talking about this, this is where I was thinking we could inject a workflow/action that stops and restarts VSS (or something like that).
Here is from VM group where you can resubmit and backup just the failed VMs:
Here is a shot from databases:
@MFasulo Have to admit that it's getting better and better, but adding in auto remediation would be the next step right ;-)
Shall I open a CMR for reprocessing failed files after the job finishes, i.e. a re-scan of all missed files to identify whether they were deleted during the job run? That would keep a lot of false positives from being taken into account.