Solved

Auxiliary copy is stuck at 30%


Userlevel 1
Badge +6

Our weekly secondary AuxCopy has been stuck at 30% since this weekend, which is blocking all the primary disk-to-disk incremental copies. It reports the 2 error messages below.

 

I thought it might be a port communication issue between the Media Server (S01190), to which the tape library is attached, and the CommCell Server (S02116), so I ran the port checks below between the 2 servers:

 

Telnet from the Media Server (S01190) to the CommCell Server (S02116)

Port 8400        OK

Port 8401        OK

Port 8403        OK

 

Telnet from the CommCell Server (S02116) to the Media Server (S01190)

Port 8400        OK

Port 8401        Not OK

Port 8403        Not OK
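
For anyone repeating these checks, the telnet tests above can be scripted. The following is a minimal Python sketch, not a Commvault tool: it only attempts a plain TCP connection to each port, and the function names `check_port` and `check_ports` are made up for illustration. The port list is taken from the checks above. Like telnet, it only tests one direction, so it would need to be run on each server against the other.

```python
import socket

# Ports taken from the telnet checks above.
PORTS = (8400, 8401, 8403)


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host, ports=PORTS, timeout=3.0):
    """Map each port to "OK" or "Not OK", mirroring the telnet results."""
    return {
        port: "OK" if check_port(host, port, timeout) else "Not OK"
        for port in ports
    }
```

On the Media Server this could be called as `check_ports("S02116")`, and on the CommServe as `check_ports("S01190")`. A successful connection only shows that something is listening and reachable on that port; it does not prove the Commvault services behind it are working.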

 

Now, before I speak to our Network/Security administrator, who has recently installed SentinelOne AV on both of the above 2 servers, I’m wondering if I’m heading in the right direction, and if I have checked all the relevant ports?

 

Thanks,

Kelvin


Best answer by Kelvin 2 August 2021, 23:37


29 replies

Userlevel 7
Badge +23

Good post @Kelvin! Before you even said it, I was going to ask if anything is on that server that might be blocking ports/services.

See if the Security team can exclude our services and directory (including the mount paths) from their scans.  AV is notorious for stopping and/or slowing things down on our side.

You can also try a Check Readiness (right-click the server/client/MA and click Check Readiness):

https://documentation.commvault.com/11.23/expert/7110_checking_network_connectivity_between_clients_and_commserve_computer.html

There’s a bunch of operations and options (and it’s useful for future issues)!

Userlevel 1
Badge +6

Hi Mike,

 

Just ran Readiness on both of the Servers, which seems to be OK

 

 

But the 2 commands below failed on both servers:

  1. CvNetworkTestTool -Client -SrvHostName S02116 (ran on the Media Server S01190)
  2. CvNetworkTestTool -Client -SrvHostName S01190 (ran on the CommServe S02116)

with the below error messages

 

Do we know exactly what ports are being blocked between the two ?

 

Thanks,

 

Kelvin

PS. Please ignore the offline tape library “Quantum Scalar i40-i80”, as it has been replaced by “Quantum Scalar i40-i80 21”, although the offline one is still set as the default:

 

 

Userlevel 7
Badge +23

The firewall port would be specified in the client’s firewall config (and we could use CVPing to test that). However, looking at the error, the connection is flat-out refused.

Dollars to donuts, it’s the AV tool preventing our service from acting.

Userlevel 1
Badge +6

OK, I’ll speak to our Network Security administrator about that. In the meantime, could I ask you for a link that shows how to set up Commvault exclusions in the AV?

 

Very much appreciated,

Kelvin 

 

Userlevel 7
Badge +23

Ask, and you shall receive!

Windows:

https://documentation.commvault.com/commvault/v11_sp20/article?p=8665.htm

And Unix:

https://documentation.commvault.com/commvault/v11_sp20/article?p=8670.htm

Userlevel 1
Badge +6

Thanks ! I’ve passed the message to my colleague who will look into this tomorrow.

However, there is another thing: whereas the AuxCopy is failing, its Primary Copy is running fine on this same S01190 Media Server (see below) - how is this possible?

 

Userlevel 7
Badge +23

That definitely puts a potential spin on it….

Let me ask a few questions:

The Aux Copy, where is the data going from and to?  Meaning which 2 Media Agents?

The backups are all running to which of the Media Agents?

I see 2 machines referenced and want to be sure I know which is which:

  • S01190 - Media Agent with Primary jobs running to it just fine.  It looks like this MA is the proxy for the VMs and I assume the MA for the library as well, so there’s no actual transfer of data between servers, which COULD be why….
  • S02116 - Commserve (check readiness failed here),  is this also a Media Agent? is it involved in the Aux Copy?

It is possible that different executables are being blocked, but that’s unlikely to be the cause….it’s all the same directory.

I’m going to loop in a colleague and see if they can see if I’m missing anything.  Still confident it is the AV, but want the whole picture to make sense :nerd:

 

Userlevel 7
Badge +23

Spoke to a friend who confirmed that, if you’re doing LAN-free backups for those VMs, the proxy just goes right to the storage. Aux Copies, on the other hand, have to do the whole connection and transfer, and the AV is likely preventing any of that (it’s an EXTREMELY common issue we encounter).

Userlevel 1
Badge +6

Hi Mike,

 

Both the Auxiliary Copy and the Primary Copy come from the same Storage Policy “Backup-LeLude”, and it is the weekly AuxCopy (highlighted below) that is failing with the communication error message, whereas its Primary Copy is running OK as we speak.

 

The above Storage Policy is set to back up all the VMs managed by the vCentre Server “S01145.neopost.grp”. The backup runs on the Media Agent S01190, which is physically located on the same site as the vCentre Server “S01145.neopost.grp”.

 

S02116 is the CommServe at another physical location which doesn’t have a VSA for vCentre backup, whereas the S01190 is a Media Server with VSA installed.

 

 

And you are right: since the vCentre and the Media Agent S01190 are physically located at the same site, both the disk library (for the Primary Copy) and the tape library (for the AuxCopy) were attached to this Media Agent, so as to confine the backup data flow to that site rather than sending it a long way to the CommServe S02116 at another physical site.

 

This Media Server is FC-connected to both the tape library and a SAN. The SAN provides all the LUNs that are locally mounted on it to serve as disk libraries for the Primary Copy, so SAN is the default transport mode for both of them.

 

Therefore, the data for the AuxCopy must come from disk library E, which is locally mounted on the Media Server S01190 (see below)

 

 

Same idea should go for the Primary Copy as well, i.e. data flow is contained locally between the Media Server S01190 and the vCentre Server S01145.

 

In other words, the CommServe S02116 should not be involved in the actual backup data transfer for either the Primary Copy or the AuxCopy, apart from the control data, perhaps…

 

 

Regards,

Kelvin

Userlevel 1
Badge +6

Hi Mike,

 

Are you saying that, even if the AuxCopy is a LAN free backup, it still needs full communication ports open between the Media Agent and the CommServe, whereas a LAN free Primary Copy doesn’t even need any controlling and commanding data from the CommServe ?

 

Regards,

Kelvin

Userlevel 7
Badge +23

AuxCopyMgr is on the Commserve so there is involvement, though it sounds like the Aux Copy is reading and writing to the same Media Agent?  This may be something different (though definitely send the AV exclusion guide as that is known to cause all sorts of issues, and I can’t rule this one out just yet).

Can you take a look at AuxCopyMgr.log and CVD.log and share what they show at 17:37:22?  Check CVMA.log and any Aux Copy related logs on the Media agent as well.

I’ll get some of my colleagues to chime in.

Userlevel 1
Badge +6

Hi Mike

 

The below is the message from AuxCopyMgr.log from the CommServe S02116

7684  2c80  07/07 17:37:22 1159504 processReceivedMessage Received FAIL message from remote AuxCopy binary for readerId [11]. MA [s01190]. Type [2]
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::updateProgressToJM <Copy/Stream> Source <6/1> Target <11/1>: Application Size, Stream Throughput parameters: [15100746] bytes read in [4301] seconds
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: AuxCopy binary on media agent [S01190.neopost.grp] encountered error [8] MM error [0] when sending chunk [8048096] to media agent [S01190.neopost.grp]: [Failed to write the data to the pipeline. ]
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: Partially Copied Archive File Info:Copy [11] CommCellId [2] AFID [2351326] Physical Size [0]
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: Setting jobstatus to FAIL and release resources - got error code CVA_DESTINATION_MA_ERROR and MM error [0]
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::sendFreeStreamRequest FREE STREAM Request for readerId [11] has been sent to media agent [s01190]
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::tryToReserveStreams No reservations were tried in this invocation.
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::run_innerLoop tryToSendCopyRequests() returned No-More-Chunk
7684  2c80  07/07 17:37:22 1159504 AuxCopyManager::sendStopRequest Ask remote AuxCopy binary to stop.

 

The below is from the CVD.log on the CommServe S02116

3856  1244  07/07 17:37:22 ### checkEventSocket() - setupConnection to EvMgrS...
3856  1244  07/07 17:37:22 ### checkEventSocket() - Socket [2876]: is eventSocket

 

The below is from the AuxCopy.log on the Media Agent S01190

10172 2884  07/07 19:36:26 1159504 Reader [1] <Copy/Stream> Source <6/1> Target <11/1>: Reporting PROGRESS to AuxcpyMgr, Err [0/0]. Chnk [8048137], bytes copied [32252533]
10172 2610  07/07 19:37:26 1159504 Sent alive request to AuxcpyMgr
10172 2610  07/07 19:37:26 1159504 Received AuxCopy alive confirmation response
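
When comparing several logs around one timestamp, as requested above, it can help to filter them programmatically. The following is a rough Python sketch that assumes every line follows the layout seen in the excerpts above: two leading IDs, a date, a time, a job ID (or `###`), then the message. The field names (`pid`, `tid`, `job`) and the helpers `parse_line` and `entries_at` are assumptions made for illustration, not part of any Commvault tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Layout assumed from the excerpts above:
#   <id> <hex id> <MM/DD> <HH:MM:SS> <job id or ###> <message>
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<tid>[0-9a-fA-F]+)\s+"
    r"(?P<date>\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<job>\S+)\s+(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    pid: str      # first column (assumed to be a process ID)
    tid: str      # second column (assumed to be a thread ID, in hex)
    date: str     # MM/DD as written in the log
    time: str     # HH:MM:SS as written in the log
    job: str      # job ID, or "###" when the line is not tied to a job
    message: str  # remainder of the line


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    return LogEntry(**match.groupdict()) if match else None


def entries_at(lines: Iterable[str], date: str, time: str) -> Iterator[LogEntry]:
    """Yield the parsed entries logged at exactly the given date and time."""
    for line in lines:
        entry = parse_line(line)
        if entry and entry.date == date and entry.time == time:
            yield entry
```

For example, `list(entries_at(open("AuxCopyMgr.log"), "07/07", "17:37:22"))` would return only the entries from that second. The timestamps are compared as plain strings, so any clock or time zone difference between the two servers has to be accounted for by hand.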

 

Do the above 2 underlined messages indicate the communication between the 2 Servers was OK ?

 

CVMA.log is 0 bytes in size on the Media Agent S01190

 

Thanks,

Kelvin

 

 

 

 

 

Userlevel 7
Badge +23

@Kelvin , had a chat with one of our Media Management SMEs who suggested opening a case.  He mentioned that pipeline errors could be so many different things, especially since it's all the same server; could be writes, process crashes, etc.

When you do, please share the case number so I can track it accordingly.

ps Yeah, looks like there is a connection made there in your underlined lines….you can see the chunk errors, though these could be from the library as well…..definitely so many possibilities.

Userlevel 1
Badge +6

Hi Mike

Opening a case is exactly part of the problem because our support expired on the 1st of April, and we are still in the process of renewing it LOL

 

Cheers,

Kelvin

Userlevel 7
Badge +23

Oh no!

I sent you a pm about that.

In the meantime, see if the tape library itself has any issues.  Definitely a potential/likely cause.

 

Userlevel 1
Badge +6

Cheers, Mike.

I’ll do a reboot for both the Media Agent and the tape library tomorrow - if setting Commvault exclusion doesn’t fix the problem.

Userlevel 7
Badge +23

Hey @Kelvin, hope all is well! Following up to see if the reboot fixed it or if you opened a case.

Thanks!

Userlevel 1
Badge +6

Hi Mike,

All is good now.

We did 2 things. First, we switched off the AV on both the CommServe and the Media Agent, then rebooted them. After that, the AuxCopy job has been running since.

I re-ran the connectivity test too, which also came back OK. That said, I realised that I didn’t run the test correctly last time, so the communication between the 2 servers may not have been the issue.

However, every now and then we still get the same error message “Error occurred while processing chunk [xxxxxx] in media [xxxxxx], at the time of error in library [xxxxx]”, but it always clears automatically before long, and the job then resumes automatically.

There could be numerous reasons for this error, to name but a few:

  1. The CommServe is a VM, so each time it takes a snapshot (during a VM backup at night), its network connection to the Media Agent misses a beat.
  2. The Media Agent itself has network issues from time to time.
  3. Our tape library is also quite old and needs regular drive cleaning.

Overall, it’s an outdated backup solution that needs to be overhauled in the near future. But until then, we need to keep it going and put up with all the issues.


I did log a case via the support line but the case isn’t visible in my portal, nor have I received any call-back or email since.

Given that the CommCell ID isn’t in my portal yet, I assume the case is still queued, waiting to be approved by the renewal team.

But I’m not here to complain though.

Because, to be honest, I’m happy enough that I could get most of my support through this forum which, to some extent, could give me quicker and more informative advice in helping me to troubleshoot our issues. I couldn’t ask for more.

 

Thanks,

Kelvin

Userlevel 7
Badge +15

@Kelvin - I wasn't exactly sure from your description whether you are doing this, but if you are using VSA to back up your CommServe VM, that is generally not a good idea, for the simple reason that if you lose the CommServe, how do you restore the CommServe :wink:

Better just to have several copies of your DR Backup (including a free service to upload it to us for safe keeping), and then provision a new VM, install the software and restore the DR backup to recover.

Glad the forum has been helpful!

Userlevel 1
Badge +6

Hi Damian,

Just tried to set up “the upload to cloud” but I’m getting the error below

 

Assuming the username and password are OK, what else could go wrong ?

For example, do I need to wait until the renewal of the CommCell is finished?

The below is the error message found in the log file CVCloudService

4824  bb4   07/20 15:50:59 ### LIBCURL::CvInternetDirectConnection::setSecured() - CURL certificate bundle path [D:\Program Files\Commvault\ContentStore\Base\curl-ca-bundle.crt]
4824  bb4   07/20 15:51:01 ### CVCloudService::init() - Failed to create DR backup folder. Error Message [U]. CommcellGUID [ ] 
4824  47c   07/20 15:51:25 ### LIBCURL::CvInternetDirectConnection::setSecured() - CURL certificate bundle path [D:\Program Files\Commvault\ContentStore\Base\curl-ca-bundle.crt]
4824  47c   07/20 15:51:26 ### CVCloudService::init() - Failed to create DR backup folder. Error Message [U]. CommcellGUID [ðÅ] 
4824  434   07/20 15:51:55 ### LIBCURL::CvInternetDirectConnection::setSecured() - CURL certificate bundle path [D:\Program Files\Commvault\ContentStore\Base\curl-ca-bundle.crt]
4824  434   07/20 15:51:56 ### CVCloudService::init() - Failed to create DR backup folder. Error Message [U]. CommcellGUID [ðçÄ] 

Userlevel 1
Badge +3

Hello @Kelvin,

Can you please confirm that you have registered your Commserve as described in the following documentation?

https://documentation.commvault.com/commvault/v11_sp20/article?p=92700.htm

https://documentation.commvault.com/commvault/v11_sp20/article?p=40674.htm

Userlevel 1
Badge +6

Hi Tim,

After some fiddling around, I’ve finally registered the CommCell (FAEE1) in the Cloud Portal, thanks to your links (see below)

 

But within the CommServe Console, I still get the same error while setting up “Upload backup metadata to Commvault Cloud”, with the same error message in the log file, as below

 

4824  13d4  07/20 17:49:07 ### LIBCURL::CvInternetDirectConnection::setSecured() - CURL certificate bundle path [D:\Program Files\Commvault\ContentStore\Base\curl-ca-bundle.crt]
4824  13d4  07/20 17:49:08 ### CVCloudService::init() - Failed to create DR backup folder. Error Message [U]. CommcellGUID [0¹Ÿ] 

 

What else could go wrong ?

 

Cheers,

Kelvin

Userlevel 1
Badge +3

Hello @Kelvin,

Was that the same job that was running prior to registering? If not, please kill the job. Before starting a new job, try disabling the setting in the control panel → DR backup → press OK, then go back in and enable it. If it continues to fail, I would suggest you get a ticket opened to review further.

Userlevel 7
Badge +15

Check if you can access https://cvdrbackup1.blob.core.windows.net from a web browser on the CommServe, and that it's not being blocked. It will come up with a bogus XML error, but that is normal; we just want to see if it is accessible.
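
If a browser is not convenient on the CommServe, roughly the same check can be done from a script. This is a hedged Python sketch using only the standard library; the helper name `is_reachable` and its `opener` parameter are made up for illustration. It treats any HTTP response, including an error status, as "reachable", since the endpoint is expected to answer with an error page anyway.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the server behind url answers at all over HTTP(S).

    `opener` defaults to urllib.request.urlopen and is a parameter only so
    the function can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so the path to it is open.
        return True
    except OSError:
        # URLError, timeouts, DNS failures and refused/blocked connections.
        return False
```

It could be called as `is_reachable("https://cvdrbackup1.blob.core.windows.net")`. A `False` result suggests the URL is blocked or unresolvable from that machine, though the script's proxy handling may differ from the browser's, so the two checks will not necessarily agree.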

If you have a proxy server set it might be trying to use that and may need to be configured in IE proxy settings.

 

Userlevel 1
Badge +6

Hi Damian,

It looks like the URL is being blocked from the CommServe… I’ll check it with our FW administrator tomorrow. Cheers.

 
