
Hello,

we are running DB2 backups on an AIX LPAR and have noticed that the CPU load is high.

“In virtualized environments (e.g.,LPAR, WPAR, Solaris Zones etc.,) where dedicated CPUs are not allocated, backup jobs may result in high CPU usage on production servers. The following measures can be taken to optimize the CPU usage”

Best Practices - AIX File System (commvault.com)

 

We followed those steps and also set additional registry keys to limit CPU load and core usage. Switching deduplication to the MediaAgent didn’t provide a sufficient change; CPU consumption stayed almost the same. Only jobs without deduplication consume less CPU, but we cannot use them.

 

In the end we can limit the CPU load to 50-60% (down from up to 90%), but that increases the backup duration several times over. Playing with streams is not an option either, since it adds CPU consumption. And since we are mostly talking about log backups, IntelliSnap won’t help much:

Can I perform an IntelliSnap backup for log files?
No. During an IntelliSnap backup, log files are not moved to the snapshot copy even
if you select the Backup Log Files (...)

Frequently Asked Questions for DB2 IntelliSnap (commvault.com)

 

Is there a way to dig further and reduce that CPU consumption, especially when it comes to DB2 backups other than IntelliSnap?

Adding to @Lukas_S: for the same set of backups, below are the additional settings used. The Commvault registry settings are not providing the expected CPU reduction on the client. Please share feedback on whether this is an issue within the product, or any alternative suggestions we could try.

 

Server Compute: 

Entitled Capacity                          : 3.75
Online Virtual CPUs                        : 5
Online Memory                              : 131072 MB
Entitled Capacity of Pool                  : 1505
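
(For reference: the figures above look like output from lparstat -i on AIX. If anyone wants to pull the same fields on their own LPAR, a one-liner along these lines should do it; lparstat and grep are standard AIX tools, and the field names are taken from the output above.)

# Pull the LPAR sizing fields quoted above (AIX)
lparstat -i | grep -E 'Entitled Capacity|Online Virtual CPUs|Online Memory'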

 

Commvault parameters, CPU utilization and backup run time:

sSDTHeadMaxCPUUsage - 25%

Streams: 8, buffers: 8, buffer size: 4096, parallelism: 8 -> CPU utilization 90% avg, backup runtime: 1.5 hours

Streams: 5, buffers: 5, buffer size: 2048, parallelism: 5 -> CPU utilization 75% avg, backup runtime: 6 hours

Commvault parameters, CPU utilization and backup run time:

sSDTHeadMaxCPUUsage - 25%

dnicevalue - 15

process_priority - 1

Streams: 8, buffers: 8, buffer size: 4096, parallelism: 8 -> CPU utilization 85% avg, backup runtime: 2.20 minutes
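
(Side note: a quick way to confirm whether dnicevalue / process_priority are really being applied is to check the PRI and NI columns of the running processes while a job is active. The names below are only examples; cvd is the Commvault services daemon and db2sysc is the DB2 engine, so substitute whatever backup processes are actually consuming CPU on your LPAR.)

# Show the priority (PRI) and nice (NI) columns for backup-related processes
ps -el | head -1
ps -el | grep -E 'cvd|db2sysc'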


If I recall correctly, logs rarely deduplicate well but compress well with standard compression.

That being said, compression still comes at a CPU cost. If these clients were installed prior to FR20 then they could be using GZIP compression, which sometimes can use more CPU.

 

You can try changing the compression using the nCOMPRESSIONSCHEME setting; set it to 1 to force LZO compression. Note that this will change the signatures in the DDB for this client, resulting in a re-baseline and higher storage consumption on the first backup after the change.

I recommend trying it on a test machine first if you can, or on just a single client, to see if it makes a difference.

In either case, limiting CPU performance will limit backup performance. We aren’t using those CPU cycles just for fun :)

 

In V11 SP4 (yes, half a decade ago) we switched to LZO compression for new VSA clients. Here was the CPU performance difference at the time (tested in a virtual environment using HotAdd):

(Chart: CPU % comparison - GZIP (old) vs LZO (new))

 

Algorithm    Streams  vCPU  Application Size (GB)  GB/hr          Avg CPU
GZIP (old)   4        4     127.73                 639            80.2%
LZO (new)    4        4     127.73                 1630 (+155%)   62.8%
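
(For completeness, the +155% is just the throughput gain computed from the GB/hr column; anyone who wants to check can do it with bc:)

# (1630 - 639) / 639 = 1.55, i.e. roughly +155% more GB/hr
echo "scale=2; (1630 - 639) / 639" | bc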

Hello,

thank you both for your answers.

 

The client is running FR20, so it should already be using the LZO algorithm, but it is still worth checking. Maybe compression, not deduplication, is the clue, because MediaAgent-side dedup still didn’t give the expected results.

 

In the end, this seems to be the real issue, because the CPU peaks appeared when the backup software was changed:

“In virtualized environments (e.g.,LPAR, WPAR, Solaris Zones etc.,) where dedicated CPUs are not allocated, backup jobs may result in high CPU usage on production servers. The following measures can be taken to optimize the CPU usage”

 

Is there a way we can test other sets of commands used during backup? Are the devs working on a new approach/engine for LPAR? The whole virtualization stack is based on non-dedicated cores, yet we don’t see such CPU peaks during VMware or Hyper-V backups.

 

 

regards,

Łukasz. 



Not necessarily: it doesn’t matter whether it has been upgraded to FR20, it depends on whether it was installed prior to FR20. The reason we don’t automatically switch algorithms for old clients is the impact on deduplication of a new baseline; changing the compression algorithm will create new blocks.

Setting dedupe on the MediaAgent won’t take effect if you have client-side compression on, since data is compressed before deduplication. In that case you’d have to move both to the MediaAgent side.

I can’t help much on the LPAR side since I’m not a Linux guru, but I’ll see if I can flag somebody who knows that.



 

Hello Damian,

actually the agent was installed on FR20, so the LZO algorithm is in place.

 

Correct, so we’ll try going without compression at all, with MA-side dedup in place.

 

Thank you. 


Tests with compression disabled did not give the expected results.

 

I’m wondering, is there a more advanced way of tuning CPU consumption on an LPAR?

“In virtualized environments (e.g.,LPAR, WPAR, Solaris Zones etc.,) where dedicated CPUs are not allocated, backup jobs may result in high CPU usage on production servers. The following measures can be taken to optimize the CPU usage”

 

 

regards,

Łukasz


@Lukas_S , I’m going to see if we can get some of our support folks to chime in, though a support case might end up being best here.



Yes, you are right. In the end, the CPU consumption issue is more complex / lower level.

 

Thank you all for your time. 


I wasn’t able to get anything helpful quickly.  Create a support case and share the incident number here so I can track it accordingly :nerd:


@Lukas_S , were you able to create a support case to track this one down?

If so, please share the case number with me.



Hello Mike, yes, but it will be a long-lasting case.

 

regards,

Łukasz.


I’m not going anywhere 😂

I’ll keep an eye on the case you pm’d me.


Sharing the case resolution:

Experiencing high CPU in comparison to TSM backups for DB2.

Provided detailed analysis of what is using CPU and methods to reduce it.

 

sSDTHeadMaxCPUUsage, although set to 25%, is not having the desired effect:

DB2SBT log:
8454448 1 04/05 06:46:01 ####### SDT max. CPU thread count is [10] based on reg. value [25%], Procsr count [40]
8454448 1 04/05 06:46:01 ####### SdtBase::InitWrkPool: Initializing SDT head thread pool
8454448 1 04/05 06:46:01 ####### Max head thread count set to 40. CPU # = 40
8454448 1 04/05 06:46:01 ####### Threads per connection set to 20
8454448 1 04/05 06:46:01 ####### Initial max. threads set to 40

 

The logs indicate that we see 40 CPUs, however the machine actually only has 5 virtual CPUs (LPAR). We are performing the calculation based on 40 CPUs, meaning 25% of 40 = 10, so 10 CPU threads are used rather than the expected 25% of 5 (1, or 2 rounded up).
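
(To make that arithmetic explicit, here is the same calculation as a small shell illustration; the variable names are made up for readability, they are not Commvault settings.)

pct=25          # sSDTHeadMaxCPUUsage
cpus_seen=40    # processor count the agent reported (hypervisor view)
cpus_lpar=5     # online virtual CPUs of the LPAR
echo "threads actually used: $(( pct * cpus_seen / 100 ))"          # 25% of 40 = 10
echo "threads expected:      $(( (pct * cpus_lpar + 99) / 100 ))"   # 25% of 5, rounded up = 2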

 

This should be amended to specify threads rather than a % of CPU; during the session, for testing, we changed this to 2.

Noticed that deduplication is happening on the client, but the understanding was that it was happening on the MediaAgent:

 

DB2SBT log:
8126636 1 04/04 21:00:56 3472782 CPipelayer::InitiatePipeline signatureType [yCV_SIGNATURE_SHA_512], signatureWhere [eCV_CLIENTSIDE_DEDUP]

 

This is because, by default, the storage policy has the setting enabled to perform deduplication on clients.

That setting overrides the subclient setting to perform deduplication on the MediaAgent, as per the note in the subclient properties.

For testing purposes, we disabled deduplication for the subclient; this mimics deduplication not happening on the client.

The solution here could be to create a new storage policy with the 'Enable Deduplication on Clients' setting disabled, for clients where deduplication must happen on the MediaAgent.

Encryption was enabled on the agent side (client). This consumes CPU cycles as well. For testing, we changed this setting to 'Media Only'.

 

There are a couple of other factors:

Disabling checksum (CRC) checking at the client side: network CRC helps us detect corruption caused during network transfer, but on some systems/processors it can consume a lot of CPU cycles.

This can only be disabled at media agent level, see https://documentation.commvault.com/v11/expert/qscript/setMediaAgentProperty.html
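
(For reference, qscripts such as setMediaAgentProperty are normally run through qoperation execscript after a qlogin. The sketch below is only illustrative: the exact property name and value for network CRC are in the linked documentation, and the arguments shown here are placeholders, not verified.)

qlogin                                    # authenticate against the CommServe
qoperation execscript -sn setMediaAgentProperty \
    -si 'MEDIA_AGENT_NAME' -si 'NETWORK_CRC_PROPERTY' -si '0'    # placeholders only
qlogout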

 

It is possible to disable this at the client level; however, that would require an escalation to our development team to confirm that CRC checking is actually the cause of the high CPU usage.

Resource Control Groups: see https://documentation.commvault.com/commvault/v11_sp20/article?p=4954.htm for another method to control/throttle CPU usage.

The target CPU usage of 40-50% during backups has now been achieved.

 



 

Hi Mike,

in the end we agreed to go with MediaAgent deduplication only and to limit CPU usage to a few cores. The tricky part was not to use a percentage value, because the agent discovered all the cores from the hypervisor, not just the ones assigned to the machine.

 

Encryption didn’t change much and CRC checks were left unchanged and not tested.

 

In the end the performance issue has been addressed, but the backup still isn’t as efficient as it was with the vendor’s dedicated solution.

 

Thank you for your engagement Mark!


Hello,

do you know how we can check how the nCOMPRESSIONSCHEME setting is set on a client?

 

Thanks

Lucio


@Lucio , do you mean in the gui, or on the client’s registry?


I’ve never set the “nCOMPRESSIONSCHEME” parameter, so I would like to understand which clients are using gzip (nCOMPRESSIONSCHEME=0) and which are using LZO (nCOMPRESSIONSCHEME=1).

We are starting with a new HyperscaleX cluster, and that would be the right moment to run some tests and use the best setting.

I think it would also be useful to be able to see the compression method used for an already completed backup job.

Lucio


@Lucio , I can see that these are listed in Audit Trail reports:

https://documentation.commvault.com/11.24/expert/8677_additional_settings_overview.html

For more direct querying, you can use the REST API:

https://api.commvault.com/#ba066f42-ed7b-9b1b-8eee-79343119df8a
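
If you have shell access to the clients, a quick local check is also possible. This assumes the standard Unix flat-file registry under /etc/CommVaultRegistry (a hedged sketch, not an official method), and like the options above it only shows the key when it has been explicitly set; if nothing comes back, the client is on its install-time default.

# List any registry file on the client where nCOMPRESSIONSCHEME has been set explicitly
find /etc/CommVaultRegistry -type f -exec grep nCOMPRESSIONSCHEME {} /dev/null \;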


Thanks Mike,

I’m going to check all the clients settings.

I’ve just run a DB2 backup test with nCOMPRESSIONSCHEME=1 on an AIX LPAR, and CPU utilization was about 50% lower, with improved throughput, even though it was a first backup with all new data written.

Now I have to evaluate whether the dedup ratio is comparable and acceptable, but from the CPU point of view there is a big difference.

 

I perfectly understand the decision not to implement this change automatically on all clients, but in my opinion this “feature” should be advertised better, considering the benefits.

 

On IBM Power9 with AIX 7.2 there is also a hardware-accelerated version of gzip that performs 10x better than the original one (https://community.ibm.com/community/user/power/blogs/brian-veale1/2020/11/09/power9-gzip-data-acceleration-with-ibm-aix).

Best Regards

Lucio



Hi Mike, 

With these methods I’m able to check only the parameters that were explicitly set, not the parameter defaults.

We have been adding new clients since 2017, so I would like to understand which clients are still using gzip compression and which are already using LZO.

 

Lucio

 


I perfectly understand the decision not to implement this change automatically on all clients, but in my opinion this “feature” should be advertised better, considering the benefits.

 

 

I agree; I ran into the same issue in 2018. Maybe add it just as a footnote to the documentation. At the time we also went through the process of contacting support, so a documentation update would likely save you some internal cycles.