Solved

Oracle DB backup is 3 times slower when "Optimized For concurrent LAN backups" is enabled on the MA

  • 19 August 2021
  • 24 replies
  • 320 views

Badge +2

On CV 11.20.9 we back up an Oracle DB from SPARC Solaris to an MA running Red Hat 8.4.

If we set “Optimized For concurrent LAN backups” on the MA, the speed is 1800 GB per hour; if we unset it, the speed is 5500 GB per hour.
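To put those two figures side by side, here is a small arithmetic sketch. It only uses the numbers quoted above and assumes decimal units (1 GB = 1000 MB); the variable and function names are made up for illustration.

```python
# Throughput figures quoted in the post, in GB per hour.
optimized_on_gb_per_hr = 1800
optimized_off_gb_per_hr = 5500

# How many times slower the backup runs with the option enabled.
slowdown = optimized_off_gb_per_hr / optimized_on_gb_per_hr


def gb_per_hr_to_mb_per_s(gb_per_hr: float) -> float:
    """Convert GB/hr to MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_hr * 1000 / 3600


print(f"slowdown factor: {slowdown:.2f}x")
print(f"option ON:  {gb_per_hr_to_mb_per_s(optimized_on_gb_per_hr):.0f} MB/s")
print(f"option OFF: {gb_per_hr_to_mb_per_s(optimized_off_gb_per_hr):.0f} MB/s")
```

That works out to roughly a 3x difference, or about 500 MB/s versus about 1.5 GB/s.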

All other settings are identical.

What changes in the MA configuration when “Optimized For concurrent LAN backups” is set?

Best answer by AKuryanov 21 October 2021, 05:26

Userlevel 7
Badge +23

Hey @AKuryanov , appreciate the question!  You are behind on Maintenance Packs, though I’m not seeing anything specific to this feature in any of the release notes.

I’ll tag in some Oracle and Media Management folks for advice.

Userlevel 7
Badge +15

There are two major areas that get impacted when toggling this setting. This is a confusing topic, so bear with me. :grinning:

#1 - That option changes the way data is transferred between Commvault processes local to the media agent. This option is also sometimes referred to as SDT (simple data transfer). With “Optimized For concurrent LAN backups” ON (i.e. SDT ON), data is transferred from one process to another using the loopback adapter (i.e. 127.0.0.1 in most cases). Each process binds to a port on the loopback and uses TCP to send data. If this option is turned OFF (SDT OFF), it uses shared memory to transfer data between processes. For example, data from the client is received by the Media Agent (let's say by the CVD process), and the writer process may be CVMA, so CVD puts the data into memory and CVMA reads it out. I’m not 100% sure that is the exact scenario, but the takeaway is that with optimize for concurrent LAN backups disabled, processes on the MA do not use the loopback and TCP connections to communicate. As you can imagine, the shared-memory mode could increase memory usage and lower scale, which is why SDT is on by default. It may be worth investigating the performance of the loopback adapter with a tool like iperf3 to ensure it can transfer data fast (like, gigabytes/sec fast).
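If iperf3 is not at hand, a rough feel for TCP-over-loopback throughput can be had with a few lines of Python. This is only a sketch under assumptions: it is a single-stream test limited by the Python interpreter, it says nothing about how Commvault itself moves data, and the function name and sizes are made up for illustration.

```python
import socket
import threading
import time


def measure_loopback_throughput(total_bytes: int = 256 * 1024 * 1024,
                                chunk_size: int = 1024 * 1024) -> float:
    """Send about total_bytes over TCP on 127.0.0.1 and return bytes/second."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = [0]  # mutable counter shared with the receiver thread

    def sink() -> None:
        # Accept one connection and drain it until the sender closes.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk_size)
                if not data:
                    break
                received[0] += len(data)

    receiver = threading.Thread(target=sink)
    receiver.start()

    payload = b"\0" * chunk_size
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as client:
        sent = 0
        while sent < total_bytes:
            client.sendall(payload)
            sent += len(payload)
    receiver.join()  # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server.close()
    return received[0] / elapsed


if __name__ == "__main__":
    rate = measure_loopback_throughput()
    print(f"loopback TCP: {rate / 1e9:.2f} GB/s")
```

Expect lower numbers than iperf3 reports, since Python adds overhead; the point is only to spot a loopback path that is dramatically slower than it should be.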

 

#2 - Now, if you disable optimize for concurrent LAN backups (SDT off), it also changes the way the client communicates with the Media Agent, especially if you have a network configuration in place restricting ports. I am a bit rusty on this, but with optimize off (SDT off), I believe data will bypass any firewall tunnel (i.e. sending control traffic and data over an established tunneled connection, typically 8403): it will send control traffic over the firewall tunnel but send DATA traffic directly to the CVD port on the Media Agent (i.e. 8400). This sometimes offers a huge performance gain, especially if you are choosing to encrypt traffic over the network tunnel in the networking options. Of course, disabling SDT bypasses network encryption, so be aware of that.

One way you could test this out is to keep optimize for concurrent LAN backups ON, but add 8400 as an additional port in the network/firewall configuration. This will bypass the network tunnel for data traffic (with the encryption caveat I mentioned before). Another option if you are using network configurations is to increase the ‘tunnels per route’ option discussed in the second post in this thread. Disabling SDT allows the ‘additional data ports’ networking option to be used; that option is ignored when optimized for concurrent LAN backups is ON.
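Before changing port settings, it can help to confirm from the client side that the ports mentioned above (8400 and 8403) are actually reachable on the Media Agent. The following is a minimal sketch; the host name `ma01.example.com` is hypothetical, and a successful TCP connect only shows that something is listening, not that Commvault will use that port for data.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    media_agent = "ma01.example.com"  # hypothetical host name, replace with your MA
    for port in (8400, 8403):  # port numbers taken from the post above
        state = "reachable" if tcp_port_reachable(media_agent, port) else "not reachable"
        print(f"{media_agent}:{port} is {state}")
```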

 

So the TL;DR: you could have a performance issue with the loopback adapter on the media agent, OR you are bypassing some of the firewall protocol and encryption overhead, which in turn is improving the performance. Especially on Unix clients, disabling SDT allows additional port usage, which can benefit some environments.

Userlevel 7
Badge +23

@Damian Andre , I wish I could like this post TWICE!!!

Userlevel 2
Badge +3

This post has me fascinated. It mentions the loopback adapter. We do have Optimize for Concurrent LAN backups enabled, and we see 40 ms latency and higher on the loopback adapter in Windows Resource Monitor. In one of our support tickets we were told that the loopback is not used that much, so it should not cause an issue. Reading #1 makes me think that maybe there is more to this. I am nervous about making any change and unchecking the optimize box, as the Media Agents move a lot of data.

I will need to think about this more...

Userlevel 7
Badge +15


Loopback is absolutely used a lot; it is the way Commvault processes communicate with each other by default, as mentioned above. I’m not really sure latency is such a big deal though - the only latency-sensitive part of the product is deduplication - signature checks in/out affect overall performance, and latency there is often mitigated by parallel transactions etc. Q&I time monitors all that, so as long as it's below the threshold it should be good. For other areas of the product it's really bandwidth that makes the difference.

Regarding the “optimize for concurrent LAN backups” option, you can toggle it freely. It only takes effect on a NEW backup, so it's perfectly safe to disable it, run a test backup, and immediately re-enable it as a test. No issues with that. In fact, I had a REALLY strange case years ago where an auxcopy would only be performant with it disabled, so we created a workflow that disabled it before starting the auxcopy job and re-enabled it after the job started. It was just a temporary workaround while the customer moved to new infrastructure.

Userlevel 2
Badge +3

@Damian Andre I am trying your idea. I have an AUX copy with one 20 TB backup to work on. The source backup is on a 4-node GridStor with Global Dedupe. The target is a 2-node grid with Global Dedupe. The WAN in between is 10 Gb with 32 ms latency. The AUX was running at around 700 GB/hr.

I killed the AUX and unchecked the Optimize for concurrent LAN option on the 4 source media agents. After starting a fresh AUX copy, it is now running at around 600 GB/hr. So for the AUX it does not look too different with or without the optimize option on the source MAs. The target MAs do not have the optimize option selected.

The 20 TB backup is highly deduplicated, so I would expect it to run much faster, as it is all DDB lookups and updates. This same backup is also being written to an LTO7 tape drive, and that is bouncing between 800 and 2,700 GB/hr.

I am a bit confused about how the tape AUX (physically writing all that data to media) can run faster than a dedupe-to-dedupe AUX over the WAN with lots of DDB lookups and updates.

The performance hunt continues.

 

Thank you again for the idea.  I apologize for taking this thread on a bit of a detour.

Userlevel 7
Badge +15

  I killed the AUX and unchecked the Optimize for concurrent LAN option on the 4 source media agents.  Starting a fresh AUX copy it is now running around 600 GB/hr.  So for the AUX it does not look to different with or without the Optimize on the source MA’s.   The Target MA’s do not have the optimize option selected.

 

I believe it's the target MA where you want to try toggling it, since the destination controls the transfer type. So perhaps try the target MA(s).

Userlevel 2
Badge +3

I will try that tomorrow; I suspect that job will still be AUX’ing.
In the meantime, I grabbed a Resource Monitor screenshot of 1 of the 4 core media agents while the backups are running tonight.
 

 

Badge

I will add to the topic starter's message.

The problem occurs only with backups from Solaris to Red Hat 8.

A backup from Solaris to Red Hat 7 fully utilizes the source network, and a backup from a Red Hat 7 or Red Hat 8 client also fully utilizes the network.

We measured the performance of the loopback interface with iperf3:

[  5] local 127.0.0.1 port 48708 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.24 GBytes  53.6 Gbits/sec    0   1.19 MBytes
[  5]   1.00-2.00   sec  6.49 GBytes  55.8 Gbits/sec    0   1.19 MBytes
[  5]   2.00-3.00   sec  6.48 GBytes  55.7 Gbits/sec    0   1.19 MBytes
[  5]   3.00-4.00   sec  6.54 GBytes  56.2 Gbits/sec    0   1.19 MBytes
[  5]   4.00-5.00   sec  6.53 GBytes  56.1 Gbits/sec    0   1.25 MBytes
[  5]   5.00-6.00   sec  6.51 GBytes  55.9 Gbits/sec    0   1.25 MBytes
[  5]   6.00-7.00   sec  6.57 GBytes  56.4 Gbits/sec    0   1.25 MBytes
[  5]   7.00-8.00   sec  6.92 GBytes  59.4 Gbits/sec    0   1.31 MBytes

Perhaps in this case the problem is not in the loopback interface.
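A quick back-of-the-envelope check of that conclusion: the sketch below averages the eight per-second bitrates from the iperf3 run above and converts the result to GB/hr so it can be compared with the best backup speed reported in the first post (5500 GB/hr). It assumes decimal units and that one iperf3 stream is a fair stand-in for the loopback capacity, which may not match how the product actually uses it.

```python
# Per-second bitrates from the iperf3 output above, in Gbit/s.
bitrates_gbit_per_s = [53.6, 55.8, 55.7, 56.2, 56.1, 55.9, 56.4, 59.4]

mean_gbit_per_s = sum(bitrates_gbit_per_s) / len(bitrates_gbit_per_s)

# 8 bits per byte, 3600 seconds per hour, decimal gigabytes.
loopback_gb_per_hr = mean_gbit_per_s / 8 * 3600

best_backup_gb_per_hr = 5500  # speed with the option disabled, from the first post
headroom = loopback_gb_per_hr / best_backup_gb_per_hr

print(f"mean loopback bitrate: {mean_gbit_per_s:.1f} Gbit/s")
print(f"loopback capacity:     {loopback_gb_per_hr:.0f} GB/hr")
print(f"headroom over backup:  {headroom:.1f}x")
```

On these numbers the loopback moves roughly 25,000 GB/hr, several times more than the fastest backup, which is consistent with the bottleneck being somewhere else.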

 

Badge +2


@Vladimir is my colleague.

Badge +2

Also, we do not have any firewall between the client and the MA.

Userlevel 7
Badge +23

@AKuryanov , @Vladimir I’ll add in some of our Unix team folks for additional thoughts.

Badge +2
Hi all, we hope to receive an answer. The problem is very serious because backup times have quadrupled.
Userlevel 2
Badge +6

@AKuryanov Could you paste the log cuts for the slow job from the CvPerfMgr.log on the media agent?

Alternatively, open a support incident and provide the logs from the CS/MA and the agent for the slow job.

 

Badge +2

Hi

We opened a support incident two weeks ago:

https://ma.commvault.com/Case/Details/210813-89

Can you see its details?

Userlevel 7
Badge +23

Thanks, @AKuryanov !  Looks like case 210813-89 has been escalated to development.  I’ll keep an eye on it for updates!

Userlevel 6
Badge +12

@AKuryanov I'm amazed that you are still running a very old maintenance release. It is 13 months old! The list of documented fixes and enhancements is huge! Please consider updating your environment!

Note: FR20 is an LTS release and many customers are still running FR20 right now, so if the issue is software related there is a good chance it has already been addressed: https://documentation.commvault.com/commvault/v11_sp20/article?p=11_20_64.htm

Badge +2
@Onno van den Berg Over the years of using CV, we realized that it is impossible to patch an infrastructure of several thousand clients once a week (64 patches in 56 weeks). Many patches improve one place and break another. And in Enterprise, stability is the most important thing. Therefore, we do not patch until support says that our problem is resolved in a particular patch.
Userlevel 6
Badge +12
I did not say you should patch your systems on a weekly basis; that is, by the way, no longer possible, as MRs are released on a monthly basis. But you haven't patched your environment for more than a year! The version you are running is still close to GA, which means a much bigger chance of underlying software issues that impact reliability and security and, most importantly, patching fixes issues before you run into them while trying to recover data. So I would consider patching at least quarterly! By the way, you also patch your Windows systems every month, right ;-)

Badge +2
After a long correspondence with support, it was found that on RHEL8, during backup with compression enabled, the network traffic from the client to the MA does not decrease. That is, we see that on RHEL8 the network traffic is 2 times higher and the backup time is 2 times longer compared to RHEL7. The reason for this has not yet been clarified.
Userlevel 7
Badge +23

Appreciate you updating us!  I see the incident is still ongoing.

Thanks again, @AKuryanov !

Userlevel 6
Badge +12

@AKuryanov @Mike Struening So if I understand correctly, the issue still persists and was addressed only via a workaround, by using RHEL7. Curious to hear the root cause and to know when this will be fixed.

Also wondering whether @AKuryanov upgraded his environment to a more recent maintenance release to see if that would sort out the problem.

Userlevel 7
Badge +23

@AKuryanov , checking this incident again (thanks to @Onno van den Berg bringing thread back up), it looks like this case was archived.

Here’s the archived last step:

Issue:
======
Oracle database backup has slow performance on a Media Agent with RHEL 8.

Work so far
------------
-- Dev team advised running the backup in pipeline mode
-- Added sPipelineMode with value B:P under iDataAgent on the client
-- The above setting provided the expected throughput for the customer
-- Customer has several clients, so a group was created for them and pipeline mode was added to the group
-- The additional setting from the client group is now inherited by the individual servers correctly
-- Customer raised another issue: backup performance is good, but compression is not working on the client with a Red Hat 8 MA (an MA with RH7 shows no issue)
-- Dev stated: we are 100% sure the compression is happening on the client, but we do see more data transferred over the network
-- Dev advised checking the ring buffer parameters on the RH8 MA and increasing both the RX and TX buffers to the maximum
-- After this change, performance increased from 1.5 TB per hour to 3.5 TB per hour using a media agent running the RH 8 operating system
-- On a media agent with the RH 7 operating system, the performance will increase to 9 TB per hour
-- Customer then reported that more network interface usage is observed with sPipelineMode B:P set
-- Dev confirmed that it is a known issue
-- Customer confirmed that more network interface usage is also observed without pipeline mode set on the client
-- Logs show that pipeline mode was used (cvperfmgr log) even though pipeline mode is not configured on the client group or the client itself

Current status (3rd Dec)
--------
Waiting on customer to provide new logs from the server (with increased file version for ORASBT) to check where pipeline mode is being configured from.
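For anyone wanting to try the ring buffer step from the notes above, here is a hedged sketch. It parses the text that `ethtool -g <interface>` prints and builds the matching `ethtool -G` command to raise RX and TX to the reported maximums. The sample output, the interface name `eth0` and the helper names are made up for illustration; the case notes do not say which interface or values were used, and applying the change needs root on the MA.

```python
import re

# Illustrative sample only: real values come from running `ethtool -g eth0`.
SAMPLE_ETHTOOL_G = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""


def parse_ring_params(text: str) -> dict:
    """Return {'max': {'RX': n, 'TX': n}, 'current': {'RX': n, 'TX': n}}."""
    result = {"max": {}, "current": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Pre-set maximums"):
            section = "max"
        elif line.startswith("Current hardware settings"):
            section = "current"
        else:
            # Match only the plain RX/TX rows, not "RX Mini" or "RX Jumbo".
            match = re.match(r"^(RX|TX):\s+(\d+)$", line)
            if match and section:
                result[section][match.group(1)] = int(match.group(2))
    return result


def build_max_ring_command(iface: str, params: dict) -> list:
    """Build the ethtool command that sets RX/TX rings to their maximums."""
    return ["ethtool", "-G", iface,
            "rx", str(params["max"]["RX"]),
            "tx", str(params["max"]["TX"])]


if __name__ == "__main__":
    params = parse_ring_params(SAMPLE_ETHTOOL_G)
    print("current:", params["current"], "max:", params["max"])
    print("command:", " ".join(build_max_ring_command("eth0", params)))
```

The command is only printed here, not executed, so it can be reviewed first; a change made with `ethtool -G` typically does not survive a reboot unless it is also added to the interface configuration.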

Were you able to move on with this second issue?

Userlevel 6
Badge +12

So making some changes on the Commvault side (sPipelineMode) and some OS changes improved performance drastically?

I'm really interested to hear if this was a specific case and, if not, how this has been tackled for all Commvault customers. E.g., is sPipelineMode now set automatically, or was the logic itself improved in a more recent feature release, and was the OS setting added to the best practices section?
