
Hi there!

Is there any way to investigate very poor reader time in an NDMP backup in Commvault?

A quick look at part of the log suggests slow reader time is the culprit behind the poor backup performance:

 

|*1292266*|*Perf*|696353| -----------------------------------------------------------------------------------------------------
|*1292266*|*Perf*|696353|     Perf-Counter                                  Time(seconds)              Size
|*1292266*|*Perf*|696353| -----------------------------------------------------------------------------------------------------
|*1292266*|*Perf*|696353|
|*1292266*|*Perf*|696353| Replicator DashCopy
|*1292266*|*Perf*|696353|  |_Buffer allocation............................         -                            [Samples - 477421] [Avg - 0.000000]
|*1292266*|*Perf*|696353|  |_Media Open...................................        20                            [Samples - 5] [Avg - 4.000000]
|*1292266*|*Perf*|696353|  |_Chunk Recv...................................         -                            [Samples - 2] [Avg - 0.000000]
|*1292266*|*Perf*|696353|  |_Reader.......................................     61454               25428386393  [23.68 GB] [1.39 GBPH]
|*1292266*|*Perf*|696353|
|*1292266*|*Perf*|696353| Reader Pipeline Modules[Client]
|*1292266*|*Perf*|696353|  |_CVA Wait to received data from reader........     61478
|*1292266*|*Perf*|696353|  |_CVA Buffer allocation........................         -
|*1292266*|*Perf*|696353|  |_SDT: Receive Data............................     61436               25457973968  [23.71 GB] [Samples - 477526] [Avg - 0.128655] [1.39 GBPH]
|*1292266*|*Perf*|696353|  |_SDT-Head: CRC32 update.......................         1               25457916432  [23.71 GB] [Samples - 477525] [Avg - 0.000000]
|*1292266*|*Perf*|696353|  |_SDT-Head: Network transfer...................        59               25457916432  [23.71 GB] [Samples - 477525] [Avg - 0.000124] [1446.68 GBPH]
|*1292266*|*Perf*|696353|
|*1292266*|*Perf*|696353| Writer Pipeline Modules[MediaAgent]
|*1292266*|*Perf*|696353|  |_SDT-Tail: Wait to receive data from source....     61480               25457973968  [23.71 GB] [Samples - 477526] [Avg - 0.128747] [1.39 GBPH]
|*1292266*|*Perf*|696353|  |_SDT-Tail: Writer Tasks.......................       192               25457916432  [23.71 GB] [Samples - 477525] [Avg - 0.000402] [444.55 GBPH]
|*1292266*|*Perf*|696353|    |_DSBackup: Update Restart Info..............        19
|*1292266*|*Perf*|696353|    |_DSBackup: Media Write......................       171               25431692857  [23.69 GB] [498.63 GBPH]
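For reference, the GB and GBPH figures in the log are just the Size column converted from bytes and divided by the elapsed hours. A minimal Python sketch (values copied from the Reader line above; the helper names are my own, not anything from Commvault) to reproduce them:

# Sanity-check the Reader counter above: convert the raw Size (bytes) and
# Time (seconds) columns into the GB and GBPH values the log prints.
# Values are copied from the log; the helper names are illustrative only.

def to_gb(size_bytes: int) -> float:
    return size_bytes / (1024 ** 3)          # the log's "GB" behaves like GiB

def gb_per_hour(size_bytes: int, seconds: int) -> float:
    return to_gb(size_bytes) / (seconds / 3600.0)

reader_bytes, reader_seconds = 25428386393, 61454
print(f"{to_gb(reader_bytes):.2f} GB")                                   # ~23.68 GB
print(f"{gb_per_hour(reader_bytes, reader_seconds):.2f} GB per hour")    # ~1.39 GBPH

So the Reader stage alone spent roughly 17 hours moving 23.68 GB, while the network transfer stage needed only 59 seconds for the same data and was essentially waiting on the reader.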

Hi @drPhil. Based on your output, the read speed is the slowest part of the pipeline, which points to the filer itself. I would reach out to the vendor for troubleshooting assistance (and please share their feedback here).


Hi @Mike Struening. Thanks, this will be our next step: to take a deeper look at the filer itself. We will share the results of our findings once available.

 

Just a quick consideration: could the filer be overloaded by too many parallel jobs from different storage policies? In other words, there is a bunch of storage policies performing aux copy jobs at the same time from the filer to the (same) tape library. To be honest, I don't really understand how multistreaming is helpful at all. The more streams, the smaller the throughput for each stream?


@drPhil that is certainly possible, depending on how the filer is spec’d out.

One thing about streams is that it is EASY to make things worse by employing the ‘more is better’ approach.

Streams should correlate to spindles on the drives, or to tape drives.  If the specs of the filer can handle (picking a number at random) 50 streams with solid/reliable/desirable performance, then increasing that to 100 will only cut the per-stream performance in half (at best), while introducing chances of problematic performance and, EVEN WORSE, a troubleshooting conundrum: how can you troubleshoot a bottleneck when you’ve sized the incoming volume out of sync with the size of the gate itself?
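To put rough numbers on that, here is a hypothetical sketch (the aggregate throughput figure and stream counts are invented purely for illustration, not taken from this environment):

# Illustrative only: the aggregate figure below is made up. The point is that
# the filer's total throughput is capped by its hardware (spindles), so adding
# streams beyond that cap just gives each stream a thinner slice, and real-world
# seek contention usually drags the total down as well.

AGGREGATE_GB_PER_HOUR = 100.0    # hypothetical ceiling set by the hardware

def per_stream_rate(streams: int) -> float:
    return AGGREGATE_GB_PER_HOUR / streams   # best case: an even split

for streams in (50, 100):
    print(f"{streams} streams -> {per_stream_rate(streams):.1f} GB/hr per stream")
# 50 streams  -> 2.0 GB/hr per stream
# 100 streams -> 1.0 GB/hr per stream (half the rate, no extra total throughput)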

Keep me posted on what the vendor advises.  It should help to show them performance that corresponds to workload (i.e. things are acceptable in this scenario, but awful at this other time/job volume).


@drPhil Are we taking an NDMP backup of the whole volume, or of folders within it?
That would slow things down as well, because each directory and file below the specified root of the backup must be examined to determine whether it should be included in the backup. This is a time-consuming operation, especially if the sub-directory contains several million inodes (files).
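As a rough illustration of why that enumeration phase alone can matter (both the inode count and the scan rate below are assumed figures, not measurements from any filer):

# Back-of-the-envelope only: estimate how long the filer spends just walking
# the tree before any data is written. Both numbers are assumptions.

inodes = 10_000_000          # files/directories under the backup root (assumed)
scan_rate_per_sec = 5_000    # metadata lookups per second (assumed)

hours = inodes / scan_rate_per_sec / 3600
print(f"~{hours:.1f} hours spent only enumerating the tree")   # ~0.6 hours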
 


There are limits on the NDMP dumps you can run, and they are based on hardware, so it doesn't look like you can overload it:

https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-ptbrg%2FGUID-2FEAD300-21D9-4BA7-9789-77AD1B0D1D05.html



 

Hi @R Anwar! Thanks for your contribution. Of course, NDMP file-level backup plays a significant role in this case, and as you said, it is always very time-consuming when there are millions of files.

 

I have one question. What does "Maximum number of NDMP sessions" mean for Commvault? Is it the maximum number of NDMP backup jobs? One backup job can have many data readers… For example, are approximately 70 data reader streams in use too much for a filer with less than 16 GB of system memory?

 

System memory of a storage system                     Maximum number of NDMP sessions
Less than 16 GB                                       8
Greater than or equal to 16 GB but less than 24 GB    20
Greater than or equal to 24 GB                        36
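For what it's worth, a tiny sketch encoding that table (the thresholds come straight from it; the lookup function itself is just my illustration) makes it easy to check a given filer:

# NDMP session limits from the ONTAP table above, keyed by system memory.
# The thresholds are from the table; the helper function is illustrative.

def max_ndmp_sessions(memory_gb: float) -> int:
    if memory_gb < 16:
        return 8
    if memory_gb < 24:
        return 20
    return 36

print(max_ndmp_sessions(12))   # 8 -> a filer with < 16 GB RAM allows 8 sessions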


After investigating the hard drive array, we can make this statement: the performance of the hard drive array (throughput on its interfaces) is quite good and credible. Put simply, it works just fine. The cumulative throughput displayed in the Job Controller in the Commvault Console also indicates that performance is appropriately high. The core of the issue seems to be that there is too much data to be copied in the given time frame, and one day has only 24 hours, which under some circumstances is not enough. On the other hand, better scheduling of backup jobs (not so many simultaneously running jobs/streams) could theoretically increase the speed of the backup jobs as well. Thanks @Mike Struening and @R Anwar for your contributions.


Thanks for sharing back!  Glad it was something as simple as spreading out the load and not an issue with any hardware!!

