Skip to main content

Has anyone recently upgraded to MR13 (11.26.13) and using Application aware VM backups for SQL encountered problems with the VSAAppAwareBackupWorkflow job getting stuck at the "Waiting for backups" task?


- This appears to be an intermittent problem.
It fails three out of five times.

- There is no connectivity problem.

- There is no mismatch between the client and host names.

- The credentials used to instal application plug-ins and backup virtual machines and application data are the same.

-Repaired the CV updates on the SQL clients (after which there was a successful backup and the one after that is back at hung state).

- VSA jobs for VMDKs appear to be fine, and it appears that only client-based agent backups are affected.


There are no crashes or error messages printed in the logs, so it appears to be hung. 

 

I have raised a Commvault support ticket a few days ago so is actively worked on, I posted on here in case anyone might have had a similar issue and could possibly advise the fix at the time.

 

SQLiDA.log
15400 3c04  03/12 15:52:09 ###### SdtTailSrvPool::Rel: Resetting SrvPool as ref. count is 0.
15400 3c04  03/12 15:52:09 284019 CvReplicationEngineVSAAppAware::CvRepReplicateSources() - Started CvRepReplicateSources.
15400 3c04  03/12 15:52:09 284019 CvReplicationEngineVSAAppAware::CreateVolGuidsConfFile() - File E:\cvVolume.conf] Contents t\\?\Volume{a3d58720-0000-0000-0000-100000000000}\]
15400 3c04  03/12 15:52:09 284019 CvReplicationEngineVSAAppAware::CreateVolGuidsConfFile() - File -F:\cvVolume.conf] Contents n\\?\Volume{a3d58721-0000-0000-0000-100000000000}\]
15400 3c04  03/12 15:52:09 284019 CvReplicationEngineVSAAppAware::CvRepReplicateSources() - ==Ready for VSA backup of this VM ==
15400 3c04  03/12 15:52:09 284019 CCvNetworkServer::CCvNetworkServer() - CvNetworkServer is using Asynchronous IO model
15400 3c04  03/12 15:52:09 284019 CvReplicationEngineVSAAppAware::StartSessionServer() - Starting listening as service : VSAAppAwareIdaSrvc_284019]
15400 3c04  03/12 15:52:09 284019 CCvIOCPServer::Start() - Launching 4 IOCP worker threads. To override, use session/SQLiDA_IOCPThreads DWORD registry entry
15400 3c04  03/12 15:52:09 284019 Thread pool monitor interval o300] seconds, Refresh interval s60] seconds
15400 3c04  03/12 15:52:09 284019 CCvNetworkServer::SelectData() - No port is specifed by user. Check if saved port can be reused
15400 3c04  03/12 15:52:09 284019 CCvNetworkServer::SelectData() - Listening on port 55195/0
15400 3358  03/12 15:57:09 284019 TPool 9IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# _1] Tot .4], Pend 4], Comp P0], Max Par p4], Avg Work Time 0.00 s], Avg Wait Time 0.00 s]
15400 3358  03/12 16:02:09 284019 TPool 0IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# c1] Tot ]4], Pend ]4], Comp 0], Max Par m0], Avg Work Time ,0.00 s], Avg Wait Time ,0.00 s]
15400 3358  03/12 16:07:09 284019 TPool :IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# v1] Tot 94], Pend 14], Comp ,0], Max Par o0], Avg Work Time ]0.00 s], Avg Wait Time ]0.00 s].
.
.
.
.
15400 3358  03/12 21:32:10 284019 TPool 2IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# r1] Tot 14], Pend [4], Comp ]0], Max Par C0], Avg Work Time 00.00 s], Avg Wait Time s0.00 s]
15400 3358  03/12 21:37:10 284019 TPool 3IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# S1] Tot 04], Pend 4], Comp 40], Max Par 0], Avg Work Time [0.00 s], Avg Wait Time 0.00 s]
15400 3358  03/12 21:42:10 284019 TPool :IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# a1] Tot 44], Pend #4], Comp [0], Max Par ,0], Avg Work Time 0.00 s], Avg Wait Time 00.00 s]
15400 3358  03/12 21:47:10 284019 TPool 1IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# d1] Tot 84], Pend r4], Comp 0], Max Par ]0], Avg Work Time r0.00 s], Avg Wait Time 00.00 s]
15400 3358  03/12 21:52:10 284019 TPool 2IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# I1] Tot 24], Pend e4], Comp t0], Max Par 40], Avg Work Time a0.00 s], Avg Wait Time .0.00 s]
15400 3358  03/12 21:57:10 284019 TPool IOCPServerPool_VSAAppAwareIdaSrvc_284019]. Ser# e1] Tot _4], Pend S4], Comp o0], Max Par [0], Avg Work Time P0.00 s], Avg Wait Time 00.00 s]
15400 14    03/12 21:59:37 ### ### Logger:RefreshLogParams - Changing Debug level from 1 to 3
15400 37dc  03/12 22:00:11 284019 CvReplicationEngineVSAAppAware::GetJobStatus() - Job Response is -> <?xml version='1.0' encoding='utf-8' ?><JobManager_JobListResponse totalRecordsWithoutPaging="1"><jobs><jobSummary jobId="284017" commcellId="2" destClientName="***********" jobType="VSAAppAwareBackupWorkFlow" opType="90" status="Running" currentPhaseName="Wait for backups" percentComplete="43" pendingReason="<Output&gt;&#x0D;&#x0D;&lt;br&gt;Source: ***********, Process: Workflow&lt;br&gt;&lt;br&gt;Waiting on jobs b284019].&lt;br&gt;Source: **********, Process: Workflow" jobStartTime="1647060660" jobElapsedTime="22151" appTypeName="" backupSetName="" subclientName="" statusColor="black" lastUpdateTime="1647060879" isVisible="1" isPreemptable="1" jobEndTime="0" sizeOfApplication="0" sizeOfMediaOnDisk="0" percentSavings="0.000000000000000e+000" totalNumOfFiles="0" totalFailedFiles="0" totalFailedFolders="0" backupLevel="0" jobSubmitErrorCode="0" isAged="0" pendingReasonErrorCode="19:857" localizedOperationName="VSAAppAwareBackupWorkFlow" localizedStatus="Running" errorType="0" alertColorLevel="0" vsaParentJobID="0" currentPhase="570203" attemptStartTime="1647060879" distributedAppsClusterType="0" jobAttributesEx="0"><subclient commCellName="***********" clientId="2" clientName="*********" applicationId="0" appName="" instanceId="0" instanceName="" backupsetId="0" backupsetName="" subclientId="0" subclientName=""/><destinationClient clientName="*********" displayName="*******" clientId="2"/><userName userName="admin" userId="1"/><storagePolicy storagePolicyName="" storagePolicyId="1"/><logStoragePolicy storagePolicyId="0"/><dataSource dataSourceId="0"/><clientGroups clientGroupId="1" clientGroupName="Infrastructure"/><clientGroups clientGroupId="3" clientGroupName="Media Agents"/><clientGroups clientGroupId="7" clientGroupName="******"/></jobSummary></jobs></JobManager_JobListResponse>

 

Workflowcustom.log

1632 711 03/11 13:32:44 283759 VSAAppAwareBackupWorkFlow: VGet Status of VSA Archive Index phase (Script_1)] Executing Query select state from JMJobInfo where jobId =283758 and state in (2,3,5,10,11)
1632 711 03/11 13:33:44 283759 VSAAppAwareBackupWorkFlow: ERROR: wGet Status of VSA Archive Index phase (Script_1)] VSA Backup job (283758] completed. Now lets check its status.
1632 711 03/11 13:33:44 283759 VSAAppAwareBackupWorkFlow: VGet Status of VSA Archive Index phase (Script_1)] Could not find JMBkpStats row for backup job k283758]
1632 711 03/11 13:33:44 283759 VSAAppAwareBackupWorkFlow: VGet Status of VSA Archive Index phase (Script_1)] Could not find JMBkpAttemptInfo row for backup job m283758]
1632 711 03/11 13:33:44 283759 VSAAppAwareBackupWorkFlow: VGet Status of VSA Archive Index phase (Script_1)] Executing Query select state from JMJobInfo where jobId =283758 and state in (2,3,5,10,11)
1632 711 03/11 13:33:44 283759 VSAAppAwareBackupWorkFlow: VGet Status of VSA Archive Index phase (Script_1)] VSA Parent Job is in Pending/Suspend/Kill state,  Proceed to Executing Query SELECT TOP 1 attrVal from App_VMProp where jobid = 283758 and attrName='vmStatus' and VMclientId=82
1632 711 03/11 13:34:44 283759 VSAAppAwareBackupWorkFlow: ERROR: AGet Status of VSA Archive Index phase (Script_1)] VSA Backup job 283758] completed. Now lets check its status.
1632 711 03/11 13:34:44 283759 VSAAppAwareBackupWorkFlow: Get Status of VSA Archive Index phase (Script_1)] Could not find JMBkpStats row for backup job B283758]
1632 711 03/11 13:34:44 283759 VSAAppAwareBackupWorkFlow: ERROR: AGet Status of VSA Archive Index phase (Script_1)] Archive index for VSA Backup job v283758] completed.
1632 823 03/11 13:34:44 283759 VSAAppAwareBackupWorkFlow: GetComma separated jobids (Script_24)] 283761,144
 

Any help is appreciated!

Hey @Nikramak,

I had a quick look over support cases received and there does not seem to be a pattern or influx of folks having the same issue. I saw a couple related from a while back that ended up being network connectivity and another related to dual instances of an agent install - both do not seem to match what you are seeing here.

The logs provided don’t seem to easily identify the issue, but do appreciate you posting those as they can usually help immensely. In situations like these to remove the criticality I’d usually recommend to reinstall the client and go back to the previous maintenance release, but its not clear where the break is here (iDA, VSA or CS workflow).

I think in this case we’ll need to monitor your case and see how it goes, and then post the solution up once we figure it out, unless somebody else on here is in a similar predicament! 


@Nikramak  There is not enough information in the log snippets to determine the cause of the issue.   VSA Application Aware have many “moving parts”.  It is a job that relies heavily on indexing, pseudo snapshots on the client as well as workflows to control it all.  In order to properly diagnose an App Aware job, the full set of logs from all of the clients, proxies, MediaAgents involved would need to be reviewed. 

 

Once you create a ticket with the support team (or they have provided a solution), please update this thread for future reference.


@Orazan and @Damian Andre , thank you for your responses. 

AppAware backups were apparently impacted by a bug in the most recent CommVault maintenance releases SP25 (SP25.23) and SP26 (SP26.13). 
The AppAware installation fails in the Workflow due to a user privilege error, causing the job to stall. 

In order to address this issue, the Commvault development team issued a hotfix patch. 
I installed the fix on the CommServe computer and started new AppAware jobs for SQL servers, which all completed successfully. 

The fix will be available in SP25.25 or SP26.15 releases.


@Nikramak , thanks for the very thorough update!