Solved

The SDT data transfer was terminated on a request from the Job Manager on 11.20.40

  • 6 April 2021
  • 5 replies
  • 38 views

Badge +1

We’re running v11 SP20.40. Over the last 3 weeks we’ve had a ton of occurrences of this error across multiple environments:

[The SDT data transfer was terminated on a request from the Job Manager.]

I work in an MSP environment with the CS in one datacenter and the MAs spread out around the country. We have a minimum of 10GB on our datacenter links from MA-MA and MA-CS.

I’m curious if anyone has seen this and has any resolution or troubleshooting steps.

Thanks!


Best answer by Damian Andre 7 April 2021, 01:33

Hmm - try increasing the liveliness check interval to the max. The Job Manager (JM) may be incorrectly detecting a stalled job and terminating it.

https://documentation.commvault.com/commvault/v11_sp20/article?p=11022.htm

 

LAN MediaAgent liveliness check interval in Minutes

Definition: Specifies the interval at which the LAN MediaAgent (MediaAgent and Client are not on the same computer) will execute a liveliness check. These intervals tend to be smaller, as frequent liveliness checks are needed for a network environment.

Default Value: 30

Range: 2 to 1440

Usage: Liveliness checks are conducted to ensure necessary services are running and listening. Increasing the interval value may be recommended to minimize network traffic if you have a large number of LAN MediaAgents and you have other mechanisms in place to verify network and services availability.
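
As a quick illustration of the documented limits above, here is a minimal sketch that keeps a requested interval inside the 2 to 1440 minute range. The helper name `clamp_liveliness_interval` is hypothetical and is not part of any Commvault tool or API; only the default and range values come from the documentation excerpt.

```python
# Limits copied from the documentation excerpt above.
# The helper below is a hypothetical illustration, not a Commvault API.
MIN_INTERVAL_MINUTES = 2
MAX_INTERVAL_MINUTES = 1440
DEFAULT_INTERVAL_MINUTES = 30


def clamp_liveliness_interval(minutes: int) -> int:
    """Return the requested interval limited to the documented range."""
    return max(MIN_INTERVAL_MINUTES, min(MAX_INTERVAL_MINUTES, minutes))


if __name__ == "__main__":
    # A value above the documented range is reduced to the maximum.
    print(clamp_liveliness_interval(5000))
    # The maximum expressed in hours (1440 minutes is one full day).
    print(MAX_INTERVAL_MINUTES / 60)
```

Setting the value to 1440 therefore means the check runs once every 24 hours instead of every 30 minutes.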


Userlevel 6
Badge +14

Hi @djustis and thanks for the post!

That’s normally a communications-related error, though we need more logging to see what the actual cause is in your case.

The full error should mention a process that is reporting this issue (clbackup, etc.). 

Look for something like: Source: <servername>, Process: clBackupChild

We need to see what is in the process log file as well as CVD.log on the server reporting the error for that time frame, which will give us the full context.

Can you take a look and copy those excerpts here?

Also, what kind of jobs are failing?  Windows File System?  VSA?  An assortment?

Thanks!

Badge +1

Here’s one; it’s an MA trying to do a DDB backup. Cut from the clBackup log on the MA:

10400 2d7c  04/06 17:11:06 19468916 CPipelayer::SendPipelineBuffer() - Tail has reported error [94][The SDT data transfer was terminated on a request from the Job Manager.]. Cannot continue.
10400 2d7c  04/06 17:11:06 19468916 [PIPELAYER  ] Error in flushing the current buffer.
10400 2d7c  04/06 17:11:06 19468916 CVArchive::WriteBuffer() - Cannot send the buffer. Ret [268435460]
10400 2d7c  04/06 17:11:06 19468916 CFileBackup::WriteBuffer(1683) - writeBuffer failed
10400 2d7c  04/06 17:11:06 19468916 CFileBackup::HandleReadAndSendFileDataError(1497) - WriteBuffer failed
10400 2d7c  04/06 17:11:06 19468916 CBackupBase::DoBackup(3689) - ReadAndSendFileData indicates FAIL_BACKUP
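
When sifting through a longer log for one job, it can help to split each line into its fields and look at the first entry for that job, since the later lines in an excerpt like this are often just the failure propagating up the call stack. The sketch below assumes the layout seen in the excerpt above (process ID, thread ID, `MM/DD HH:MM:SS` timestamp, job ID, message); the regular expression and the names `parse_line` and `first_line_for_job` are illustrative assumptions, not an official log format specification.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed layout, inferred from the excerpt: pid, tid (hex), timestamp, job ID, message.
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<tid>[0-9a-fA-F]+)\s+"
    r"(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<job>\S+)\s+(?P<message>.*)$"
)


class LogLine(NamedTuple):
    pid: int
    tid: str
    stamp: str
    job: str
    message: str


def parse_line(line: str) -> Optional[LogLine]:
    """Split one log line into fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogLine(int(m["pid"]), m["tid"], m["stamp"], m["job"], m["message"])


def first_line_for_job(lines: Iterable[str], job_id: str) -> Optional[LogLine]:
    """Return the first parsed line whose job ID field equals job_id."""
    for raw in lines:
        parsed = parse_line(raw)
        if parsed is not None and parsed.job == job_id:
            return parsed
    return None


# Two lines copied from the excerpt above.
SAMPLE = [
    "10400 2d7c  04/06 17:11:06 19468916 CPipelayer::SendPipelineBuffer() - "
    "Tail has reported error [94][The SDT data transfer was terminated on a "
    "request from the Job Manager.]. Cannot continue.",
    "10400 2d7c  04/06 17:11:06 19468916 [PIPELAYER  ] Error in flushing the current buffer.",
]

if __name__ == "__main__":
    first = first_line_for_job(SAMPLE, "19468916")
    if first is not None:
        print(first.pid, first.stamp, first.message)
```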

Error description from the job is:

Error Code: [10:62] Description: Other end XXXX encountered failure in receiving data [The SDT data transfer was terminated on a request from the Job Manager.] Source: onecvt200mbbdsm, Process: clBackup  

Most of the time the jobs pick back up, run, and often finish; the failures are very intermittent.

 

CVD from the MA:

68  c50   04/06 15:06:24 ######## [JOBCTRL    ] Successfully registered control process for Job [19468916:7:5:8:51234] of type [1].
3968  13e0  04/06 15:26:56 ######## [CVD        ] Remote Command Request from remotehost = <::1>, RemoteClient = <onecvt200caa>, RemoteIP(Sock) = <::1>. Launched Process: <clBackup.exe -j 19468916 -a 2:2378 -t 1 -i 3 -d onecvt200maadsm*onecvt200maadsm*8400*8402 -io 1  -jt 19468916:7:6:8:51234  -idxma onecvt200maadsm*onecvt200maadsm*8400*8402  -OSInfo  -h  -w  -ot 1  -numstreams 1  -ab 0 -r 1617627612 -c 0 -appType 33 -slt -id f6520a39-cea7-4160-a887-8e9b2ab95d27 -cn onecvt200mbbdsm -vm Instance001>. Pid=396
3968  c50   04/06 15:26:57 ######## [JOBCTRL    ] Successfully registered control process for Job [19468916:7:6:8:51234] of type [1].

3968  2934  04/06 16:26:08 ######## [CVD        ] Remote Command Request from remotehost = <::1>, RemoteClient = <onecvt200caa>, RemoteIP(Sock) = <::1>. Launched Process: <clBackup.exe -j 19468916 -a 2:2378 -t 1 -i 3 -d ONECVT200MABDSM*onecvt200mabdsm*8400*8402 -io 1  -jt 19468916:7:7:10:51234  -idxma ONECVT200MABDSM*onecvt200mabdsm*8400*8402  -OSInfo  -h  -w  -ot 1  -numstreams 1  -ab 0 -r 1617627612 -c 0 -appType 33 -slt -id 680b3caa-e30d-4cbe-9d91-cf4991d2d05d -cn onecvt200mbbdsm -vm Instance001>. Pid=10400
3968  c50   04/06 16:26:13 ######## [JOBCTRL    ] Successfully registered control process for Job [19468916:7:7:10:51234] of type [1].
3968  8a0   04/06 17:11:06 ######## [JOBCTRL    ] Got request to terminate job 19468916
3968  8a0   04/06 17:11:06 ######## [JOBCTRL    ] Stopping All pipelines for Job 19468916
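
To line the CVD entries up against any configured interval, it can be useful to work out how long the last attempt ran before the terminate request arrived. The sketch below does that for two timestamps taken from the excerpt above (the last "registered control process" line and the "Got request to terminate job" line). The log lines carry no year, so the year is an assumption based on the date of this thread, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the log timestamps have no year; 2021 matches the date of this thread.
ASSUMED_YEAR = 2021


def parse_stamp(stamp: str) -> datetime:
    """Parse an 'MM/DD HH:MM:SS' log timestamp using the assumed year."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %m/%d %H:%M:%S")


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two log timestamps."""
    return parse_stamp(end) - parse_stamp(start)


if __name__ == "__main__":
    registered = "04/06 16:26:13"  # last control process registration in the excerpt
    terminated = "04/06 17:11:06"  # terminate request in the excerpt
    print(elapsed(registered, terminated))  # 0:44:53
```

Note that the process launched at 16:26:08 has Pid=10400, which matches the process ID on the clBackup log lines that report the error at 17:11:06.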
 


Badge +1

So boost it up to 1440?

Userlevel 5
Badge +13

So boost it up to 1440?

Yup.

You can always drop it back down right away, but it's rare for this to get triggered.
