Solved

Oracle DB backup is 3 times slower when "Optimize for concurrent LAN backups" is enabled on the MA

  • 19 August 2021
  • 27 replies
  • 2524 views

Badge +2

On CV 11.20.9 we back up an Oracle DB from SPARC Solaris to an MA running Red Hat 8.4.

If we set “Optimize for concurrent LAN backups” on the MA, the backup speed is 1800 GB per hour. If we unset it, the speed is 5500 GB per hour.

All other settings are identical.

What changes in the MA configuration when “Optimize for concurrent LAN backups” is set?
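
To put the two rates above on the same scale as network figures, here is a minimal sketch of the arithmetic. It assumes decimal units (1 GB = 8 Gbit), which the original post does not state; the function name is illustrative only.

```python
def gb_per_hour_to_gbit_per_s(gb_per_hour: float) -> float:
    """Convert GB/hour to Gbit/s, assuming decimal units (1 GB = 8 Gbit)."""
    return gb_per_hour * 8 / 3600


# Rates quoted in the post.
with_option = gb_per_hour_to_gbit_per_s(1800)     # option set
without_option = gb_per_hour_to_gbit_per_s(5500)  # option unset
slowdown = 5500 / 1800

print(f"option set:   {with_option:.2f} Gbit/s")
print(f"option unset: {without_option:.2f} Gbit/s")
print(f"slowdown:     {slowdown:.2f}x")
```

Under that assumption, the faster rate is above what a single 10 Gbit/s link can carry, so the slower rate is not simply a link ceiling.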

Best answer by AKuryanov 21 October 2021, 05:26


Userlevel 7
Badge +23

Thanks for sharing, @Sergey !

Userlevel 7
Badge +19

@Sergey thanks for sharing this information! So, to be clear, the only difference between getting 1.2 Gbps and 9 Gbps is the operating system that is installed? Is the test based on VMs running on top of VMware ESXi?

@Mike Struening can you elaborate on my previous question? Reading your post one more time makes me really curious how development makes sure that performance stays in line with expectations during OS certifications. In addition, if a newer OS version comes with performance enhancements that benefit Commvault performance drastically, I would share that more proactively as well.

Hello all. We also had performance issues here and there. I investigated together with Commvault and Microsoft support. It turned out that “Optimized For concurrent LAN backups” makes a huge difference for Windows MA performance, specifically because it enables transfers through the loopback interface. The loopback interface has throughput limitations due to the way the OS processes data (for example, it uses only one CPU core for processing). Starting with Win2019 and 2021, Microsoft introduced several improvements. In my tests, loopback throughput on Win2012 and Win2016 was limited to 1.2 Gbps. With Win2019 I was able to get 9 Gbps on the same hardware/VM. So at least for Windows, it is highly recommended to upgrade MAs to 2019-2021.
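
For anyone who wants a rough loopback number on their own MA without installing extra tools, below is a minimal sketch of a single-connection TCP throughput test over 127.0.0.1. It is not how Commvault moves data and not the test used above. The function name and default sizes are assumptions, and Python's own overhead means the result is a lower bound rather than the OS limit.

```python
import socket
import threading
import time


def measure_loopback_gbps(total_bytes: int = 64 * 1024 * 1024,
                          chunk_size: int = 1024 * 1024) -> float:
    """Push total_bytes through one TCP connection on 127.0.0.1; return Gbit/s."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(10)            # do not wait forever for the client
    server.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]

    def sink() -> None:
        # Receiver side: read until the sender closes the connection.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk_size)
                if not data:
                    break
                received[0] += len(data)

    receiver = threading.Thread(target=sink, daemon=True)
    receiver.start()

    payload = b"\0" * chunk_size
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()                  # wait until everything has been read
    elapsed = time.perf_counter() - start
    server.close()
    return received[0] * 8 / elapsed / 1e9


if __name__ == "__main__":
    print(f"loopback: {measure_loopback_gbps():.2f} Gbit/s")
```

A dedicated tool such as iperf3 gives more trustworthy numbers; this sketch is only useful for a quick before/after comparison on the same machine.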

Userlevel 7
Badge +19

So making some changes on the Commvault side (sPipelineMode) and making some OS changes improved performance drastically?

I'm really interested to hear whether this was a specific case and, if not, how this has been tackled for all Commvault customers. E.g., is sPipelineMode set automatically, or was the logic itself improved in a more recent feature release, and was the OS setting added to the best practices section?

Userlevel 7
Badge +23

@AKuryanov, checking this incident again (thanks to @Onno van den Berg for bringing the thread back up), it looks like this case was archived.

Here’s the archived last step:

Issue:
======
Oracle database backup shows slow backup performance on a media agent with RHEL 8.

Work so far
------------
-- Dev team advised running the backup with pipeline mode.
-- Added sPipelineMode with value B:P under iDataAgent on the client.
-- The above setting provided the expected throughput for the customer.
-- Customer has several clients, so a client group was created for them and pipeline mode was added to the group.
-- The additional setting from the client group is now inherited by the individual servers correctly.
-- Customer raised another issue: backup performance is good, but compression is not working on the client with a Red Hat 8 MA (an MA with RH7 shows no issue).
-- Dev stated: we are 100% sure the compression is happening on the client, but we do see more data transferred over the network.
-- Dev advised checking the ring buffer parameters on the RH8 MA and increasing both the RX and TX buffers to the maximum.
-- After this change, performance increased from 1.5 TB per hour to 3.5 TB per hour using a media agent running the RH 8 operating system.
-- On a media agent with the RH 7 operating system, the performance will increase to 9 TB per hour.
-- Customer then reported that higher network interface usage is observed with sPipelineMode B:P set.
-- Dev confirmed that it is a known issue.
-- Customer confirmed that higher network interface usage is observed without pipeline mode set on the client.
-- Logs show that pipeline mode was used (cvperfmgr log) even though pipeline mode is not configured on the client group or the client itself.

Current status (3rd Dec)
--------
Waiting on customer to provide new logs from the server (with increased file version for ORASBT) to check where pipeline mode is being configured from.

Were you able to move on with this second issue?
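
The ring buffer step in the archived notes (raise RX and TX to the maximum) can be checked with `ethtool -g <iface>` and changed with `ethtool -G <iface> rx N tx N`. As a minimal sketch, the snippet below parses `ethtool -g` output and builds the matching resize command. The interface name `eth0` and the sample numbers are assumptions for illustration, not values from the case.

```python
# Sample `ethtool -g eth0` output; the interface name and values are
# illustrative assumptions, not taken from the support case.
SAMPLE = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""


def parse_ring_params(text: str) -> dict:
    """Split `ethtool -g` output into 'maximums' and 'current' dictionaries."""
    sections = {"maximums": {}, "current": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Pre-set maximums"):
            section = "maximums"
        elif line.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():      # skip non-numeric values such as "n/a"
                sections[section][key.strip()] = int(value)
    return sections


def resize_command(iface: str, params: dict):
    """Return the `ethtool -G` argument list that raises RX/TX to the maximum,
    or None if both are already at the maximum."""
    args = []
    for key, flag in (("RX", "rx"), ("TX", "tx")):
        maximum = params["maximums"].get(key)
        if maximum is not None and params["current"].get(key, 0) < maximum:
            args += [flag, str(maximum)]
    return ["ethtool", "-G", iface, *args] if args else None


params = parse_ring_params(SAMPLE)
print(resize_command("eth0", params))
```

The command is only printed here. Running it requires root, and the setting does not persist across reboots unless it is also added to the distribution's network configuration.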

Userlevel 7
Badge +19

@AKuryanov @Mike Struening so if I understand correctly, the issue still persists and was addressed only via a workaround, namely using RHEL7. I'm curious to hear the root cause and to know when this will be fixed.

I'm also wondering whether @AKuryanov upgraded his environment to a more recent maintenance release to see if that would sort out the problem.

Userlevel 7
Badge +23

Appreciate you updating us!  I see the incident is still ongoing.

Thanks again, @AKuryanov !

Badge +2

After a long correspondence with support, it was found that on RHEL8, during a backup with compression enabled, the network traffic from the client to the MA does not decrease. That is, we see that on RHEL8 the network traffic is 2 times higher and the backup time is 2 times longer compared to RHEL7. The reason for this has not yet been clarified.

Userlevel 7
Badge +19
@Onno van den Berg Over the years of using CV, we realized that it is impossible to patch the infrastructure of several thousand clients once a week (64 patches in 56 weeks). Many patches improve one place and break another. And in Enterprise, stability is the most important thing. Therefore, we do not patch until support says that our problem is resolved in that particular patch.

I did not say you should patch your systems on a weekly basis; that is, by the way, also no longer possible, as MRs are released on a monthly basis. But you haven't patched your environment for more than a year! The version you are running is still close to GA, which means a much bigger chance of underlying software issues that impact reliability and security. Most importantly, patching fixes issues before you run into them while trying to recover data. So I would consider patching at least quarterly! By the way, you also patch your Windows systems every month, right ;-)

Badge +2

@Onno van den Berg Over the years of using CV, we realized that it is impossible to patch the infrastructure of several thousand clients once a week (64 patches in 56 weeks). Many patches improve one place and break another. And in Enterprise, stability is the most important thing. Therefore, we do not patch until support says that our problem is resolved in that particular patch.

Userlevel 7
Badge +19

@AKuryanov I'm amazed that you are still running a very old maintenance release. It is 13 months old! The list of documented fixes and enhancements is huge! Please consider updating your environment!

Note: FR20 is an LTS release and many customers are still running FR20 right now, so if the issue is software related, there is a good chance it has already been addressed: https://documentation.commvault.com/commvault/v11_sp20/article?p=11_20_64.htm

Userlevel 7
Badge +23

Thanks, @AKuryanov !  Looks like case 210813-89 has been escalated to development.  I’ll keep an eye on it for updates!

Badge +2

Hi

We opened a support incident two weeks ago:

https://ma.commvault.com/Case/Details/210813-89

Can you see its details?

Userlevel 4
Badge +13

@AKuryanov Could you paste the log cuts for the slow job from CvPerfMgr.log on the media agent?

Alternatively, open a support incident and provide the logs from the CS, the MA, and the agent for the slow job.

Badge +2

Hi all,

We hope to receive an answer. The problem is very serious because backup times have quadrupled.

Userlevel 7
Badge +23

@AKuryanov , @Vladimir I’ll add in some of our Unix team folks for additional thoughts.

Badge +2

Also, we do not have any firewall between the client and the MA.

Badge +2

I will supplement the message of the topic starter

 

 

@Vladimir is my colleague.

Badge

I will supplement the message of the topic starter

The problem occurs only with backups from Solaris to Red Hat 8.

A backup from Solaris to Red Hat 7 fully utilizes the source network, and a backup from a Red Hat 7 or Red Hat 8 client also fully utilizes the network.

The performance of the loopback interface was measured by iperf3

[  5] local 127.0.0.1 port 48708 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.24 GBytes  53.6 Gbits/sec    0   1.19 MBytes
[  5]   1.00-2.00   sec  6.49 GBytes  55.8 Gbits/sec    0   1.19 MBytes
[  5]   2.00-3.00   sec  6.48 GBytes  55.7 Gbits/sec    0   1.19 MBytes
[  5]   3.00-4.00   sec  6.54 GBytes  56.2 Gbits/sec    0   1.19 MBytes
[  5]   4.00-5.00   sec  6.53 GBytes  56.1 Gbits/sec    0   1.25 MBytes
[  5]   5.00-6.00   sec  6.51 GBytes  55.9 Gbits/sec    0   1.25 MBytes
[  5]   6.00-7.00   sec  6.57 GBytes  56.4 Gbits/sec    0   1.25 MBytes
[  5]   7.00-8.00   sec  6.92 GBytes  59.4 Gbits/sec    0   1.31 MBytes

Perhaps in this case the problem is not in the loopback interface.
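
To make that comparison concrete, here is a minimal sketch that parses the iperf3 interval lines from this post, averages the bitrate, and compares it with the fastest backup rate reported in the thread (5500 GB per hour). The conversion assumes decimal units (1 GB = 8 Gbit), and the regular expression is written only for this interval format.

```python
import re

# Interval lines copied from the iperf3 run in this post.
IPERF3_OUTPUT = """\
[  5]   0.00-1.00   sec  6.24 GBytes  53.6 Gbits/sec    0   1.19 MBytes
[  5]   1.00-2.00   sec  6.49 GBytes  55.8 Gbits/sec    0   1.19 MBytes
[  5]   2.00-3.00   sec  6.48 GBytes  55.7 Gbits/sec    0   1.19 MBytes
[  5]   3.00-4.00   sec  6.54 GBytes  56.2 Gbits/sec    0   1.19 MBytes
[  5]   4.00-5.00   sec  6.53 GBytes  56.1 Gbits/sec    0   1.25 MBytes
[  5]   5.00-6.00   sec  6.51 GBytes  55.9 Gbits/sec    0   1.25 MBytes
[  5]   6.00-7.00   sec  6.57 GBytes  56.4 Gbits/sec    0   1.25 MBytes
[  5]   7.00-8.00   sec  6.92 GBytes  59.4 Gbits/sec    0   1.31 MBytes
"""

# Captures the bitrate column; tolerates "Gbits/sec" and "Gbits / sec".
INTERVAL = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+GBytes\s+([\d.]+)\s+Gbits\s*/\s*sec"
)


def interval_bitrates(text: str) -> list:
    """Return the per-interval bitrates in Gbit/s."""
    return [float(m.group(1)) for m in INTERVAL.finditer(text)]


rates = interval_bitrates(IPERF3_OUTPUT)
mean_gbps = sum(rates) / len(rates)

# Fastest backup rate from the thread: 5500 GB/hour, decimal units assumed.
fastest_backup_gbps = 5500 * 8 / 3600

print(f"mean loopback bitrate: {mean_gbps:.1f} Gbit/s")
print(f"fastest backup rate:   {fastest_backup_gbps:.1f} Gbit/s")
print(f"headroom:              {mean_gbps / fastest_backup_gbps:.1f}x")
```

On these figures the measured loopback rate is several times the fastest backup rate, which is consistent with the conclusion that loopback is not the limit here.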

 

Userlevel 3
Badge +6

I will try that tomorrow; I suspect that job will still be AUX’ing.
In the meantime, I grabbed a Resource Monitor screenshot of 1 of the 4 core media agents while the backups are running tonight.
 

 

Userlevel 7
Badge +23

I killed the AUX and unchecked the Optimize for concurrent LAN option on the 4 source media agents. Starting a fresh AUX copy, it is now running at around 600 GB/hr. So for the AUX, it does not look too different with or without the Optimize option on the source MAs. The target MAs do not have the Optimize option selected.

 

I believe it's the target MA where you want to try toggling it, since the destination controls the transfer type. So perhaps try the target MA(s).

Userlevel 3
Badge +6

@Damian Andre I am trying your idea. I have an AUX copy with one 20 TB backup to work on. The source backup is on a 4-node GridStor with Global Dedupe. The target is a 2-node Grid with Global Dedupe. The WAN in between is 10 Gb with 32 ms latency. The AUX was running at around 700 GB/hr.
I killed the AUX and unchecked the Optimize for concurrent LAN option on the 4 source media agents. Starting a fresh AUX copy, it is now running at around 600 GB/hr. So for the AUX, it does not look too different with or without the Optimize option on the source MAs. The target MAs do not have the Optimize option selected.

The 20 TB backup is highly deduplicated, so I would expect it to run much faster, as it is all DDB lookups and updates. This same backup is also being written to an LTO7 tape drive, and that is bouncing between 800 and 2,700 GB/hr.

I am a bit confused about how the tape AUX (physically writing all that data to media) can run faster than a dedupe-to-dedupe AUX over the WAN with lots of DDB lookups and updates.
 

The performance hunt continues.

 

Thank you again for the idea.  I apologize for taking this thread on a bit of a detour.

Userlevel 7
Badge +23

This post has me fascinated. It mentions the loopback adaptor. We do have Optimize for Concurrent LAN backups enabled, and we see 40 ms latency and higher on the loopback adaptor in Windows Resource Monitor. In one of our support tickets we were told that the loopback is not used that much, so it should not cause an issue. Reading #1 makes me think that maybe there is more to this. I am nervous about making any change and unchecking the optimize box, as the media agents move a lot of data.

  I will need to think about this more...

 

Loopback is absolutely used a lot; it is the way Commvault processes communicate with each other by default, as mentioned above. I’m not really sure latency is such a big deal though. The only latency-sensitive part of the product is deduplication: signature checks in/out affect overall performance, and latency there is often mitigated by parallel transactions etc. Q&I time monitors all that, so as long as it's below the threshold it should be good. For other areas of the product it's really bandwidth that makes the difference.

Regarding the “optimize for concurrent lan backups” option, you can toggle it freely. It only takes effect on a NEW backup, so it's perfectly safe to disable it, run a test backup, and immediately re-enable it as a test. No issues with that. In fact, I had a REALLY strange case years ago where an auxcopy would only be performant with it disabled, so we created a workflow that disabled it before starting the auxcopy job and re-enabled it after the job started. It was just a temporary workaround while the customer moved to new infrastructure.
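
The disable, start, re-enable pattern from that workflow can be sketched generically as below. This is not Commvault's workflow engine or API: `set_option` and `start_job` are hypothetical callables standing in for whatever mechanism toggles the MA option and launches the job in your environment.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def start_with_option_disabled(set_option: Callable[[bool], None],
                               start_job: Callable[[], T]) -> T:
    """Disable the option, start the job, then re-enable the option.

    Both callables are hypothetical placeholders. The option is restored
    even if starting the job raises an exception.
    """
    set_option(False)
    try:
        return start_job()
    finally:
        set_option(True)


# Demo with stand-in callables that only record what happened.
events = []
job_id = start_with_option_disabled(
    set_option=lambda enabled: events.append(f"option={enabled}"),
    start_job=lambda: events.append("job started") or 42,  # 42: dummy job id
)
print(events, job_id)
```

Because the option only takes effect when a job starts, restoring it right after the job has been launched leaves the running job unaffected, which is why the `finally` block runs immediately after `start_job()` returns.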

Userlevel 3
Badge +6

This post has me fascinated. It mentions the loopback adaptor. We do have Optimize for Concurrent LAN backups enabled, and we see 40 ms latency and higher on the loopback adaptor in Windows Resource Monitor. In one of our support tickets we were told that the loopback is not used that much, so it should not cause an issue. Reading #1 makes me think that maybe there is more to this. I am nervous about making any change and unchecking the optimize box, as the media agents move a lot of data.

  I will need to think about this more...

Userlevel 7
Badge +23

@Damian Andre , I wish I could like this post TWICE!!!
