Our weekly secondary AuxCopy has been stuck at 30% since this weekend (so it is blocking all the primary disk-to-disk incremental copies), with the 2 error messages below.
Thinking it might be a port communication issue between the Media Server (S01190), which the tape library is attached to, and the CommCell Server (S02116), I ran the port checks below between the 2 servers:
Telnet from the Media Server (S01190) to the CommCell Server (S02116)
Port 8400 OK
Port 8401 OK
Port 8403 OK
Telnet from the CommCell Server (S02116) to the Media Server (S01190)
Port 8400 OK
Port 8401 Not OK
Port 8403 Not OK
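The manual Telnet checks above can also be scripted. The following is a minimal Python sketch, not a Commvault tool: the function name `check_port` and the 3-second timeout are illustrative assumptions, while the host names and ports are the ones listed above. It separates a refused connection (the host answered, but nothing accepted the connection on that port, or something actively rejected it) from a timeout (more typical of traffic being silently dropped), which a plain "Not OK" does not show.

```python
import socket
import sys


def check_port(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The host replied, but the connection was rejected.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"error: {exc}"


if __name__ == "__main__":
    # Hosts are passed on the command line; the ports are the three
    # checked in the post above.
    for host in sys.argv[1:]:
        for port in (8400, 8401, 8403):
            print(f"{host}:{port} -> {check_port(host, port)}")
```

Run it on each server against the other, for example `python check_ports.py S02116` on the Media Server and `python check_ports.py S01190` on the CommCell Server. A "refused" result only says that no connection was accepted on that port at that moment; it does not by itself prove that a firewall or the AV is blocking it.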
Now, before I speak to our Network/Security administrator, who has recently installed SentinelOne AV on both of the above servers, I’m wondering if I’m heading in the right direction, and if I have checked all the necessary ports?
Thanks,
Kelvin
Glad to hear good news (so far) on the original issue!
For the Maintenance Advantage issue, send an email to our Frontline team (via support@commvault.com) and they’ll help get you sorted. If they need anything confirming proof of ownership, they’ll guide you.
Hi Mike,
Since we disabled SentinelOne on both servers, the AuxCopy has been working fine, fingers crossed though...
BTW, we have finally renewed our support for the CommCell faee1, however, there is a problem…
Because the CommCell is registered under the name of our subsidiary in France, it doesn’t show up in my usual support portal, which is registered under the name of our US company.
At the moment, the only person who can access the faee1 support portal (using his own email address) is my colleague in France. He is, by the way, handing over his backup duties to me, so I’m sure that before long he won’t be keen to log any new cases...
So, what do I need to do to get access to this faee1 support portal? Or is there a way to move faee1 to my own support portal?
Thanks,
Kelvin
Hey @Kelvin , how are things looking on this? Curious to see what your Network Administrator said.
Hi Damian,
It looks like the URL is being blocked from the CommServe… I’ll check it with our FW administrator tomorrow. Cheers.
OK, see how you go. I think that in IE, error 400 may be OK - it means the request is reaching the server. I checked in Chrome on my side, so it could look a little different.
Check if you can access https://cvdrbackup1.blob.core.windows.net from a web browser on the CommServe, and that it’s not being blocked. It will come up with a bogus XML error, but that is normal; we just want to see if it is accessible.
If you have a proxy server set, it might be trying to use that, and the proxy may need to be configured in the IE proxy settings.
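The browser check above can be approximated from a script on the CommServe. This is a minimal Python sketch under stated assumptions: the function name `probe` and the 10-second timeout are illustrative, and it uses only the standard library rather than anything Commvault-specific. It treats any HTTP response, including an error status such as 400, as "reachable", and only a failure to connect as "not reachable". A proxy or firewall that answers with its own error page would also look reachable here, so the result is a hint rather than proof.

```python
import sys
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return (reachable, http_status) for the given URL.

    Any HTTP response counts as reachable, even an error status:
    the request got through to something that speaks HTTP.
    DNS, connection, TLS, and timeout failures count as unreachable.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        return True, exc.code
    except (urllib.error.URLError, OSError):
        return False, None


if __name__ == "__main__":
    # URLs are passed on the command line.
    for url in sys.argv[1:]:
        ok, status = probe(url)
        print(url, "reachable" if ok else "blocked or unreachable", status)
```

For example: `python probe_url.py https://cvdrbackup1.blob.core.windows.net`. Note that `urllib` picks up proxy settings from the environment, which may differ from what the browser uses.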
Hello @Kelvin,
Was that the same job that was running prior to registering? If not, please kill the job. Before starting a new job, try disabling the setting in Control Panel → DR Backup, press OK, then go back in and enable it. If it continues to fail, I would suggest you get a ticket opened to review further.
Hi Tim,
After some fiddling around, I’ve finally registered the CommCell (FAEE1) in the Cloud Portal, thanks to your links (see below).
But within the CommServe Console, I still get the same error while setting up “Upload backup metadata to Commvault Cloud”, with the same error message in the log file, as shown below.
@Kelvin - I wasn't exactly sure from your description if you are doing this, but if you are using VSA to back up your CommServe VM, that is generally not a good idea, for the simple reason that if you lose the CommServe, how do you restore the CommServe?
Better just to have several copies of your DR Backup (including a free service to upload it to us for safe keeping), and then provision a new VM, install the software and restore the DR backup to recover.
Glad the forum has been helpful!
Hi Mike,
All is good now.
We did 2 things. First, we switched off the AV on both the CommServe and the Media Agent, then rebooted them. After that, the AuxCopy job has been running since.
I re-ran the connectivity test too, which also came back OK. That said, I have since realised that I didn’t do the test correctly last time, so it could be that the communication between the 2 servers was never the issue.
However, every now and then, we still get the same error message “Error occurred while processing chunk [xxxxx] in media [xxxxx], at the time of error in library [xxxxx]”, but it always clears by itself before long, and then the job resumes automatically.
There could be numerous reasons for this error, to name but a few:
The CommServe is a VM, so each time a snapshot is taken (during a VM backup at night), its network connection to the Media Agent misses a beat.
The Media Agent itself has network issues from time to time.
Our tape library is also quite old and needs regular drive cleaning.
Overall, it’s an outdated backup solution that needs to be overhauled in the near future. But until then, we need to keep it going and put up with all the issues.
I did log a case via the support line but the case isn’t visible in my portal, nor have I received any call-back or email since.
Given that the CommCell ID isn’t in my portal yet, I assume the case is still in a queue, waiting to be approved by the renewal team.
But I’m not here to complain.
To be honest, I’m happy enough that I can get most of my support through this forum, which, to some extent, gives me quicker and more informative advice for troubleshooting our issues. I couldn’t ask for more.
Thanks,
Kelvin
Hey @Kelvin , hope all is well! Following up to see if the reboot fixed it or if you opened a case.
Thanks!
Cheers, Mike.
I’ll reboot both the Media Agent and the tape library tomorrow if setting the Commvault exclusions doesn’t fix the problem.
Oh no!
I sent you a pm about that.
In the meantime, see if the tape library itself has any issues. Definitely a potential/likely cause.
Hi Mike
Opening a case is exactly part of the problem because our support expired on the 1st of April, and we are still in the process of renewing it LOL
Cheers,
Kelvin
@Kelvin , had a chat with one of our Media Management SMEs who suggested opening a case. He mentioned that pipeline errors could be so many different things, especially since it's all the same server; could be writes, process crashes, etc.
When you do, please share the case number so I can track it accordingly.
ps Yeah, looks like there is a connection made there in your underlined lines….you can see the chunk errors, though these could be from the library as well…..definitely so many possibilities.
Hi Mike
The below is the message from AuxCopyMgr.log from the CommServe S02116
7684 2c80 07/07 17:37:22 1159504 processReceivedMessage Received FAIL message from remote AuxCopy binary for readerId [11]. MA [s01190]. Type [2]
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::updateProgressToJM <Copy/Stream> Source <6/1> Target <11/1>: Application Size, Stream Throughput parameters: [15100746] bytes read in [4301] seconds
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: AuxCopy binary on media agent [S01190.neopost.grp] encountered error [8] MM error [0] when sending chunk [8048096] to media agent [S01190.neopost.grp]: [Failed to write the data to the pipeline. ]
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: Partially Copied Archive File Info:Copy [11] CommCellId [2] AFID [2351326] Physical Size [0]
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::handleFailReport <Copy/Stream> Source <6/1> Target <11/1>: Setting jobstatus to FAIL and release resources - got error code CVA_DESTINATION_MA_ERROR and MM error [0]
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::sendFreeStreamRequest FREE STREAM Request for readerId [11] has been sent to media agent [s01190]
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::tryToReserveStreams No reservations were tried in this invocation.
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::run_innerLoop tryToSendCopyRequests() returned No-More-Chunk
7684 2c80 07/07 17:37:22 1159504 AuxCopyManager::sendStopRequest Ask remote AuxCopy binary to stop.
The below is from the CVD.log on the CommServe S02116
Do the 2 underlined messages above indicate that the communication between the 2 servers was OK?
CVMA.log is zero-sized on the Media Agent S01190.
Thanks,
Kelvin
AuxCopyMgr is on the Commserve so there is involvement, though it sounds like the Aux Copy is reading and writing to the same Media Agent? This may be something different (though definitely send the AV exclusion guide as that is known to cause all sorts of issues, and I can’t rule this one out just yet).
Can you take a look at AuxCopyMgr.log and CVD.log and share what they show at 17:37:22? Check CVMA.log and any Aux Copy related logs on the Media agent as well.
I’ll get some of my colleagues to chime in.
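Pulling the entries for one timestamp out of a large log, as requested above, can be done with a short script. This is a minimal Python sketch based only on the AuxCopyMgr.log lines quoted in this thread. The field names are assumptions: the two leading tokens look like process and thread IDs and the number after the time appears to be the job ID, but that is inferred from the sample, not from Commvault documentation. `entries_at` is a hypothetical helper name.

```python
import re
import sys

# Assumed layout, inferred from the quoted sample:
#   <pid> <tid> <MM/DD> <HH:MM:SS> <job id> <message>
LINE_RE = re.compile(
    r"^(?P<pid>\S+)\s+(?P<tid>\S+)\s+(?P<date>\d{2}/\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<job>\d+)\s+(?P<message>.*)$"
)


def entries_at(lines, date, time):
    """Yield parsed entries whose date and time match exactly.

    Lines that do not fit the assumed layout are skipped.
    """
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match["date"] == date and match["time"] == time:
            yield match.groupdict()


if __name__ == "__main__":
    # Usage: python filter_log.py AuxCopyMgr.log 07/07 17:37:22
    if len(sys.argv) == 4:
        path, date, time = sys.argv[1:4]
        with open(path, errors="replace") as handle:
            for entry in entries_at(handle, date, time):
                print(entry["job"], entry["message"])
```

Running the same filter over AuxCopyMgr.log and CVD.log for the same timestamp makes it easier to line up what each process logged at the moment of the failure.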
Hi Mike,
Are you saying that, even if the AuxCopy is a LAN-free backup, it still needs the full set of communication ports open between the Media Agent and the CommServe, whereas a LAN-free Primary Copy doesn’t even need any control data from the CommServe?
Regards,
Kelvin
Hi Mike,
Both the Auxiliary Copy and the Primary Copy come from the same Storage Policy “Backup-LeLude”, and it is the weekly AuxCopy (highlighted below) that is failing with the communication error message, whereas its Primary Copy is running OK as we speak.
The above Storage Policy is set to backup all the VMs managed by the vCentre Server “S01145.neopost.grp”, which happens on the Media Agent S01190 that is physically located on the same site as the vCentre Server “S01145.neopost.grp”.
S02116 is the CommServe at another physical location which doesn’t have a VSA for vCentre backup, whereas the S01190 is a Media Server with VSA installed.
And you are right: since the vCentre and the Media Agent S01190 are physically located at the same site, it was configured so that both the disk library (for the Primary Copy) and the tape library (for the AuxCopy) are attached to this Media Agent S01190. This confines the backup data flow to the same site, without it needing to travel a long way to the CommServe S02116 at another physical site.
This Media Server is FC-connected to both the tape library and a SAN, which provides the LUNs that are locally mounted on it to serve as disk libraries for the Primary Copy, so SAN is the default transport mode for both of them.
Therefore, the data for the AuxCopy must come from disk library E, which is locally mounted on the Media Server S01190 (see below).
The same should go for the Primary Copy as well, i.e. the data flow is contained locally between the Media Server S01190 and the vCentre Server S01145.
In other words, the CommServe S02116 should not be involved in the actual backup data transfer for either the Primary Copy or the AuxCopy, apart from the control data, perhaps…
Regards,
Kelvin
Spoke to a friend who confirmed that if you’re doing LAN-free backups for those VMs, then the proxy just goes right to the storage. Aux Copies, on the other hand, have to do the whole connection and transfer, and the AV is likely preventing any of that (it’s an EXTREMELY common issue we encounter).
That definitely puts a potential spin on it….
Let me ask a few questions:
The Aux Copy, where is the data going from and to? Meaning which 2 Media Agents?
The backups are all running to which of the Media Agents?
I see 2 machines referenced and want to be sure I know which is which:
S01190 - Media Agent with Primary jobs running to it just fine. It looks like this MA is the proxy for the VMs and I assume the MA for the library as well, so there’s no actual transfer of data between servers, which COULD be why….
S02116 - Commserve (check readiness failed here). Is this also a Media Agent? Is it involved in the Aux Copy?
It is possible that different executables are being blocked, but that’s an unlikely cause… it’s all the same directory.
I’m going to loop in a colleague and see if they can spot anything I’m missing. Still confident it is the AV, but I want the whole picture to make sense.
Thanks ! I’ve passed the message to my colleague who will look into this tomorrow.
However, there is another thing: whereas the AuxCopy is failing, its Primary Copy is running fine on this same S01190 Media Server (see below) - how is this possible?