Solved

Aux Copy performance


Userlevel 4
Badge +15

Hello, 

We would like to tier out the data that is stored on the disk library to a Huawei Object Storage. I created a secondary copy and configured an aux copy schedule. The problem is that the disk library's disk space is running low because the job is not as fast as I was hoping.
The amount of data for the copy job can be up to 10 TB.
Is there a solution to speed up the aux copy job? The Media Agents have 2x 10 Gbit cards.

Regards

Thomas


Best answer by Mike Struening RETIRED 5 May 2021, 18:06


11 replies

Badge +2

Can someone please let me know how an aux copy job copies backup jobs?

I can see that our aux copy has been running for more than 10 days, but I am still seeing very old backup jobs in the partially copied list.

How does the aux copy pick up backup jobs for copying?

Ideally, older jobs should be copied first.

Userlevel 7
Badge +23

@thomas.S, thought you’d find this interesting:

 

Userlevel 7
Badge +23

Hello @Mike Struening,

Thank you for the analysis. In this case, are there any points that I could check on the media agents before opening a case with our network team?
I am thinking of settings that could be checked on the media agents.

 

I don't think there is anything Commvault configuration-wise that would cause such slow network performance. You could try toggling the aux copy mode between network-optimized and disk-optimized and see if it makes any difference. You only have to suspend the copy, change the setting, and resume to test.

The rest will come down to benchmarking the system to help isolate the bottleneck. Performance is always tricky, as it could be the local OS, the network cards, the switches/routers in the way, the destination device… you get the picture. So you have to perform some tests to narrow down the problem.
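As a first data point, a raw point-to-point TCP test between the source and destination Media Agents separates the network leg from the read/write legs. Below is a minimal Python sketch, assuming Python is available on both MAs and that the chosen test port (5001 here, purely an example) is open between them; a single-threaded sender will not saturate a 10 Gbit link, so treat the result as a ballpark rather than a line-rate figure (a dedicated tool such as iperf is better for that).

import argparse
import socket
import time

PORT = 5001            # hypothetical test port; pick one your firewall allows
CHUNK = 1024 * 1024    # 1 MiB per send/receive
TOTAL_MB = 2048        # amount of data the client pushes

def run_server() -> None:
    # Receiver side: run this on the destination Media Agent.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", PORT))
        srv.listen(1)
        conn, addr = srv.accept()
        with conn:
            received, start = 0, time.time()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                received += len(data)
            elapsed = time.time() - start
            mib = received / 2**20
            print(f"Received {mib:.0f} MiB from {addr[0]} in {elapsed:.1f}s "
                  f"= {mib / elapsed:.1f} MiB/s")

def run_client(host: str) -> None:
    # Sender side: run this on the source Media Agent, pointing at the receiver.
    payload = b"\x00" * CHUNK
    start = time.time()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, PORT))
        for _ in range(TOTAL_MB):
            cli.sendall(payload)
    elapsed = time.time() - start
    print(f"Sent {TOTAL_MB} MiB in {elapsed:.1f}s = {TOTAL_MB / elapsed:.1f} MiB/s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Point-to-point TCP throughput check")
    parser.add_argument("--server", action="store_true", help="run as the receiver")
    parser.add_argument("--client", metavar="HOST", help="send data to HOST")
    args = parser.parse_args()
    run_server() if args.server else run_client(args.client)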

You could try fiddling with TCP offload options and chimney settings, check the teaming mode, ensure drivers are up to date, and try disabling one network card to see if that helps. It sounds like receiving data is fine, so it could be this particular network segment, or something odd with the network teaming. Depending on your load balancing mode, most non-switch-assisted modes can only load balance transmits (round-robin between adapters), so you could try disabling teaming or one of the NICs to see if that is contributing to the slowness.
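If it helps to compare the two MAs side by side, something like this small sketch just dumps the relevant Windows settings (TCP global/chimney state, adapters, teaming) in one go. It is only a convenience wrapper around built-in Windows commands; Get-NetLbfoTeam exists only where native Windows NIC teaming is configured, so an error from that line simply means there is no team to report on.

import subprocess

# Commands to run on each Media Agent; all are standard Windows tools.
COMMANDS = [
    ["netsh", "int", "tcp", "show", "global"],                      # offload/chimney/RSS state
    ["powershell", "-Command", "Get-NetAdapter | Format-Table -AutoSize"],
    ["powershell", "-Command", "Get-NetLbfoTeam | Format-List"],    # teaming mode, if a team exists
]

for cmd in COMMANDS:
    print("=" * 15, " ".join(cmd), "=" * 15)
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)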

To try to isolate routing/network issues, you could configure a network share (SMB) somewhere and copy some data there as a performance test, either through Windows or a test copy job. We also have the Cloud Test Tool, which can upload data to your Huawei object storage, and you could measure performance from these Media Agents vs. other systems or network segments to help:

https://documentation.commvault.com/commvault/v11/article?p=9234.htm
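For the SMB test, timing a large sequential write to the share already gives a usable MiB/s figure. A rough sketch, with a placeholder UNC path you would swap for your own test share:

import os
import time

DEST = r"\\fileserver\perftest\auxcopy_test.bin"   # placeholder UNC path - point it at your test share
SIZE_MB = 1024                                      # size of the dummy file to write
CHUNK = b"\x00" * (1024 * 1024)                     # 1 MiB writes

start = time.time()
with open(DEST, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())                            # make sure the data actually left the client
elapsed = time.time() - start
os.remove(DEST)                                     # clean up the test file
print(f"Wrote {SIZE_MB} MiB in {elapsed:.1f}s = {SIZE_MB / elapsed:.1f} MiB/s "
      f"(~{SIZE_MB * 8 / 1024 / elapsed:.2f} Gbit/s)")

If this write test is fast but the aux copy is still slow, the bottleneck is more likely on the object-storage path than on the general network.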

 

Userlevel 7
Badge +23

Unless you have any throttling in place, not likely. My initial concern was whether you were somehow sending over the main network, though you addressed that earlier.

Let me know what they find!!

Userlevel 4
Badge +15

Hello @Mike Struening,

Thank you for the analysis. In this case, are there any points that I could check on the media agents before opening a case with our network team?
I am thinking of settings that could be checked on the media agents.

 

Userlevel 7
Badge +23

Thanks, @thomas.S!

I checked a few of the stream counters and it looks like the network is the cause.

If you check the column for ‘Time(seconds)’, that is the time the stream/pipe had to wait for data. In some cases, we’re waiting a minute or two.

The one below has some high wait times, though there are several pipes per MA.

 

3996  6720  05/05 15:03:02 2996475 

|*5852487*|*Perf*|2996475| =======================================================================================

|*5852487*|*Perf*|2996475| Job-ID: 2996475            [Pipe-ID: 5852487]            [App-Type: 0]            [Data-Type: 1]

|*5852487*|*Perf*|2996475| Stream Source:   cvmapapp01

|*5852487*|*Perf*|2996475| Network medium:   SDT

|*5852487*|*Perf*|2996475| Head duration (Local):  [05,May,21 15:01:01  ~  05,May,21 15:03:02] 00:02:01 (121)

|*5852487*|*Perf*|2996475| Tail duration (Local):  [05,May,21 15:01:01  ~  05,May,21 15:03:02] 00:02:01 (121)

|*5852487*|*Perf*|2996475| -----------------------------------------------------------------------------------------------------

|*5852487*|*Perf*|2996475|     Perf-Counter                                  Time(seconds)              Size

|*5852487*|*Perf*|2996475| -----------------------------------------------------------------------------------------------------

|*5852487*|*Perf*|2996475| 

|*5852487*|*Perf*|2996475| Replicator DashCopy

|*5852487*|*Perf*|2996475|  |_Buffer allocation............................        81                            [Samples - 21079] [Avg - 0.003843]

|*5852487*|*Perf*|2996475|  |_Media Open...................................         6                            [Samples - 15] [Avg - 0.400000]

|*5852487*|*Perf*|2996475|  |_Chunk Recv...................................         5                            [Samples - 3] [Avg - 1.666667]

|*5852487*|*Perf*|2996475|  |_Reader.......................................         7                1110032163  [1.03 GB] [531.67 GBPH]

|*5852487*|*Perf*|2996475| 

|*5852487*|*Perf*|2996475| Reader Pipeline Modules[Client]

|*5852487*|*Perf*|2996475|  |_CVA Wait to received data from reader........       119                          

|*5852487*|*Perf*|2996475|  |_CVA Buffer allocation........................         -                          

|*5852487*|*Perf*|2996475|  |_SDT: Receive Data............................         7                1111164840  [1.03 GB]  [Samples - 21113] [Avg - 0.000332] [532.21 GBPH]

|*5852487*|*Perf*|2996475|  |_SDT-Head: CRC32 update.......................         1                1111107304  [1.03 GB]  [Samples - 21112] [Avg - 0.000000]

|*5852487*|*Perf*|2996475|  |_SDT-Head: Network transfer...................        93                1111107304  [1.03 GB]  [Samples - 21112] [Avg - 0.004405] [40.06 GBPH]

|*5852487*|*Perf*|2996475| 

|*5852487*|*Perf*|2996475| Writer Pipeline Modules[MediaAgent]

|*5852487*|*Perf*|2996475|  |_SDT-Tail: Wait to receive data from source....       120                1111164840  [1.03 GB]  [Samples - 21113] [Avg - 0.005684] [31.05 GBPH]

|*5852487*|*Perf*|2996475|  |_SDT-Tail: Writer Tasks.......................        28                1111107304  [1.03 GB]  [Samples - 21112] [Avg - 0.001326] [133.05 GBPH]

|*5852487*|*Perf*|2996475|    |_DSBackup: Media Write......................         8                1110192223  [1.03 GB] [465.28 GBPH]

|*5852487*|*Perf*|2996475| 

|*5852487*|*Perf*|2996475| ----------------------------------------------------------------------------------------------------
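For anyone who wants to scan a longer excerpt the same way, here is a quick sketch that relies only on the layout visible above (counter names prefixed with |_ and padded with dots, followed by the Time(seconds) column) and prints any counter that waited longer than a threshold. Against the block above it would flag the buffer allocation, CVA wait, network transfer, and SDT-Tail wait counters, which is consistent with the network being the slow leg.

import re
import sys

# Matches lines like:  |_SDT-Head: Network transfer...........   93   1111107304 ...
LINE_RE = re.compile(r"\|_(?P<name>[^.]+?)\.{2,}\s+(?P<secs>\d+|-)")
THRESHOLD = 60  # seconds of wait considered "high" for a ~2 minute sample window

def flag_slow_counters(log_text: str, threshold: int = THRESHOLD) -> None:
    for match in LINE_RE.finditer(log_text):
        name = match.group("name").strip()
        secs = match.group("secs")
        if secs != "-" and int(secs) >= threshold:
            print(f"HIGH WAIT: {name:<50} {secs:>6} s")

if __name__ == "__main__":
    # Usage:  python flag_waits.py < cvperfmgr_excerpt.txt
    flag_slow_counters(sys.stdin.read())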

Userlevel 4
Badge +15


I have collected the logs for the Aux Copy job. I only left in the information related to the job ID.
Since these jobs are not that big, I hope you can already read something out of them. I had to deactivate the big jobs first, because otherwise I run into problems with the space on the disk library.

Thomas

Userlevel 7
Badge +23

@thomas.S, check CVperfmgr.log on the destination MA for performance metrics. This will indicate where to focus.
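If that log gets large, a small sketch like the one below pulls out just the lines for one job. The default path and the pipe-delimited job tag are assumptions based on a typical v11 Windows install and the excerpt format shown elsewhere in this thread, so adjust both to match your environment.

import sys

# Assumed default log location - change it to wherever your Media Agent keeps its logs.
LOG_PATH = r"C:\Program Files\Commvault\ContentStore\Log Files\CVPerfMgr.log"

def extract_job_lines(job_id: str, path: str = LOG_PATH) -> None:
    with open(path, "r", errors="replace") as log:
        for line in log:
            # Match either the pipe-delimited job tag or the "Job-ID:" header line.
            if f"|{job_id}|" in line or f"Job-ID: {job_id}" in line:
                print(line.rstrip())

if __name__ == "__main__":
    extract_job_lines(sys.argv[1] if len(sys.argv) > 1 else "2996475")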

Userlevel 7
Badge +23

Sounds good. I’ll add in some people who can advise where we can find the performance counters as well.

Userlevel 4
Badge +15

Hello @Mike Struening
 

From my point of view, the problem is currently the throughput.
The job currently runs every 3 hours and during the day mainly copies the database logs to the object storage. Overnight, the data from the VSA backup is added. That adds up to a few TB.
Tomorrow I can provide the log that shows the performance data.
I am sure that it uses the LAN, because the object storage is only reachable via LAN and nothing in this direction is zoned to the Media Agents via FC.

Regards

Thomas

Userlevel 7
Badge +23

@thomas.S, is the actual throughput the issue, or is the amount of initial data the problem?

Starting with the latter: what is the intended retention on the Aux Copy, and how far back do the To Be Copied jobs go? The reason I ask is that it’s entirely possible the Aux Copy is grabbing data it will want to age off once the whole thing completes.

If it’s a performance issue, then we’d need to see some log files and stats to determine whether the issue is the read speed, the network/transfer, or the write speed. Noting the 2x 10 Gbit cards, are you certain the job is using this interface?
