Solved

Intellinsnap NDMP Backup copy failing with error : Socket error

  • 18 August 2022
  • 3 replies
  • 326 views

Userlevel 2
Badge +8

Customer made some changes to the Tipping point IPS device at their end; and all of a sudden the netapp intellisnap backup copies have started failing.  As per the customer, they do not see anything on the logs.  

 

There are two netapp filers on the same network; backup copy on 1 is working while on another one it is failing.  The Logs just say “3880  924   08/18 11:34:52 5974838 NDMP ERROR[TSCNACLUSTER01] 9: DUMP: Error: Write to socket failed”

and the NetApp vendor says they do not see any error at their end as well

I can see in the logs from the NetApp that the connection is made at 11:09am

<LR d="18Aug2022 11:09:49" n="HSCNANODE01" t="1660777789" id="0/155572305395713" p="1" s="Ok" o="ndmpdc" vf="" type="1" seq="1964418" supp="1" >

<ndmp_ctrl_on_e0m_1

        lif_addr="192.168.0.54"

        lif_port="e0M"/>

But after this, there are no entries in the event log of the Netapp. This is the same on cluster 02 which is succeeding in backups.

 

Commvault Logs :


3880  924   08/18 11:31:24 5974838 UPDATEJM: Stream Restart FileCountDelta:[29001] FileCount:[5007677]

3880  924   08/18 11:31:24 5974838 RESTART STRING is [<?xml version="1.0" encoding="UTF-8" standalone="no" ?><NAS_NasBackup_XMLRestartString MultiStreamPathAttemptCount="0" indexRestartCount="0" lastMAClientId="1243" type="NODE" ver="RS_v2.0"><ValidRestartString chunkCount="49" context="32623038396564342d316438352d313165642d393339632d303061303938666462626333" contextLength="36" estimatedBytes="4585028256768" fhInfo="3780727459840" fileCount="5007677" fileNode="7524753" lastAfileBytes="75464769536" lastAfileChunks="1" lastAfileCopyID="443" lastAfileID="13224512" pathName="/svm-hsc-cifs/SharedDrives" pathNum="0" recNum="0" recSize="65536" restartByteOffset="3780728510464"/></NAS_NasBackup_XMLRestartString>]

3880  924   08/18 11:32:01 5974838 NDMP INFO[TSCNACLUSTER01] 8: DUMP: Thu Aug 18 11:31:24 2022 : We have written 3699027528 KB.

3880  924   08/18 11:34:52 5974838 NDMP ERROR[TSCNACLUSTER01] 9: DUMP: Error: Write to socket failed

3880  924   08/18 11:34:52 5974838 NDMP ERROR[TSCNACLUSTER01] 10: DUMP: Error: DUMP IS ABORTED

3880  924   08/18 11:34:52 5974838 NDMP ERROR[TSCNACLUSTER01] 11: DATA: Operation terminated: EVENT: INTERNAL ERROR (for /svm-hsc-cifs/SharedDrives/.snapshot/SP_2_5956286_1659729787)

3880  924   08/18 11:34:52 5974838 DATA HALT with reason [NDMP_DATA_HALT_INTERNAL_ERROR]

icon

Best answer by Mike Struening RETIRED 24 August 2022, 16:31

View original

3 replies

Userlevel 7
Badge +23

I mean… they made changes … it stopped working … they should keep looking ;)

 

IPS devices and firewalls in general are notorious for closing idle connections. I’m not much of an NDMP expert, but it could be closing an idle control connection which is established on that e0m management port on the NetApp. Silently closed TCP connections suck because the application has no idea it has been closed until it tries to send data down it. This is why support often recommends using network configurations to create 1-way firewalls, as they send consistent keep alive packets for ALL traffic traversing the tunnel and hence usually solve the issue.

See if they have some policies on closing idle connections or any differences in the IPS polices between the clusters. Perhaps there are some keepalive settings on the NetApp cluster that differ?

Userlevel 2
Badge +8

Commvault support had advised error is on the Netapp side

“the error was that filer stopped data transfer, meaning error was on the filer side.”

“DUMP: Error: DUMP IS ABORTED”

 

Netapp support came back saying the backup app is issuing the stop command.  

 

So the one way firewall you are referring to that would be from Mediaagent to the Netapp Cluster right 

 

logs from Netapp device 

“00000012.001bca71 0609a591 Tue Aug 23 2022 03:50:01 +12:00 [kern_ndmpd:info:5431] [ 7877] INFO: DATA: received STOP command (+0:00:40) 00000012.001bcfdc 060b51ef Tue Aug 23 2022 06:52:47 +12:00 [kern_ndmpd:info:5431] [ 7882] INFO: DATA notification state 5 (SESS_STATE: HALTING) event 2 (EVENT: ABORTED) (+0:00:40) 00000012.001bcfdd 060b51ef Tue Aug 23 2022 06:52:47 +12:00 [kern_ndmpd:info:5431] [ 7882] ERROR: DATA: Operation terminated: EVENT: ABORTED (for /svm-hsc-cifs/SharedDrives/.snapshot/SP_2_5980163_1660939214) 00000012.001bcfde 060b5206 Tue Aug 23 2022 06:52:50 +12:00 [kern_ndmpd:info:5431] [ 7882] INFO: DATA: received STOP command (+0:00:40) 00000012.001bd5ec 060cac0e Tue Aug 23 2022 09:20:28 +12:00 [kern_ndmpd:info:5431] [28615] ERROR: NDMP Connect: Unable to Connect to [MediaAgent IP]:8701 due to Connection timed out (+0:00:41)”

 

 

Userlevel 7
Badge +23

@Theseeker , it doesn’t really matter since it’s one client establishing, then both can talk freely and sustained.

this thread has a great explanation, as well as a link to an awesome video:

 

Reply