Solved

DDB QI time threshold.


Userlevel 1
Badge +7

Hello 

 

I have an issue related to the DDB.

As shown below, the Q&I time is very high. Note that the media agent serves only Oracle and SAP databases with daily full backups:
around 23 Oracle RAC and 18 SAP clients.

The library is on flash storage.

The DDB disks are SSD and were moved to Pure Storage (NVMe disks) due to insufficient space on the local disks.

Any idea how to address this?

 


Best answer by J Dodson 6 July 2022, 00:58


18 replies

Userlevel 3
Badge +9

Hi @Muhammad Abdullah 

Q&I time is a metric for the performance of the average lookup in the DDB. It is directly tied to the IOPS capability of the disk hosting the DDB. Per the MA hardware requirements, the DDB should be hosted on a local dedicated SSD. If you are using network-attached storage for the DDB, there will be more latency than with a locally attached disk.

Please see the documentation here for the HW requirements for MAs hosting the DDB - https://documentation.commvault.com/11.24/expert/111985_hardware_specifications_for_deduplication_mode_01.html#cpuram

 

Userlevel 1
Badge +7

Hello @Matt Medvedeff 

 

The DDB was hosted on local SSD disks before and the issue already existed. We had to move the DDB partitions to SSD SAN storage due to insufficient space, so the issue is not related to the disks.
The existing SAN storage is NVMe all the way, so it performs pretty well, and we also use it for active-active scenarios.
 

Userlevel 7
Badge +18


It’s not just about IOPS, but IOPS and latency. If you can’t sustain enough IOPS, the latency will increase, but given this is NVMe the issue is likely the latency in the connection between the MA and the storage. Once you move the storage outside of the box, you inherit some level of latency, but 6ms is a lot. How is the Pure Storage SAN connected? FC or iSCSI? If iSCSI, is it sharing the same interface with other network traffic, given that this seems like a workaround?
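To illustrate why latency matters as much as raw IOPS, here is a minimal sketch based on Little's law (throughput = outstanding requests / latency). It is not Commvault's own calculation; the function name and the single-outstanding-request default are assumptions made only for illustration.

```python
def max_lookups_per_second(latency_ms: float, outstanding_requests: int = 1) -> float:
    """Upper bound on lookups per second from Little's law.

    throughput = outstanding requests / latency (in seconds)
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return outstanding_requests / (latency_ms / 1000.0)


# Compare a local-disk style latency with the 3 ms and 6 ms figures from this thread.
for latency in (0.5, 3.0, 6.0):
    rate = max_lookups_per_second(latency)
    print(f"{latency} ms -> {rate:.0f} lookups/s per outstanding request")
```

At 6ms each outstanding request can complete only about 167 lookups per second, no matter how many IOPS the array can deliver in aggregate. More parallel requests raise the total, but every added millisecond of path latency scales the per-request ceiling down.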

 

Badge +2

Hey Muhammad,

 

Can you please generate the DDB stats for 1 day? Here is a screenshot for your reference.

 

 

Userlevel 1
Badge +7


Hello @Damian Andre 

Regarding the latency, I’ve checked and it is less than 6ms; it hits 3ms at most, as shown below.
The connection between the MA and the array is FC.

 

 

Userlevel 1
Badge +7

Dear @Sajan

Kindly find what you asked for.

 

 

Badge +2

Hey Muhammad,

 

Can you please include the Q&I time stats in the chart?

 

Userlevel 1
Badge +7

Hello @Sajan 
 

The previous screenshot was for the whole DDB disk (2 partitions), where the Q&I Time checkbox is not present.

So please find below the screen for one of the partitions, with the Q&I Time checkbox selected.

 

 

Badge +2

This is a good chart. The Q&I times go up and down, which is a good sign (better than seeing Q&I times stuck at a high level).

You might want to review how the backup jobs are scheduled. Review what Commvault jobs run during the peak times. 

 

Userlevel 1
Badge +7

Hello @Sajan

This MA is dedicated to backing up Oracle and SAP clients.

Our Oracle DBs run daily fulls starting at 2 AM. Since the issue showed up, some Oracle jobs take more than 20 hours, so I believe these spikes come from the Oracle clients, which have been running slowly since the issue started.

For example, this one started 12 hours ago and is still at 37%!

 

 

Badge +2

Thanks for sharing this information. This must be such a pain to manage. 

How many such jobs run? How is the performance when you run only one job? Are there any synthetic full backups running at that time?

Userlevel 1
Badge +7

Dear @Sajan

No synthetic jobs run on this MA.

Only full backups for Oracle and SAP.
28 individual full backup jobs run at the same time, at 1 AM.

Badge +2

Oh! 28 full backups running at the same time is not a good idea. How many MAs do you have?

Why don’t you break them into 4 separate backup schedules?

Alternatively, can you suspend all the full backups except one and test the performance of that single job?
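As a rough illustration of splitting the 28 concurrent fulls into 4 staggered schedules, here is a minimal sketch. The client names, the 01:00 first start, and the 2-hour gap between schedules are hypothetical values chosen only for the example; in practice you would group by database size and backup window.

```python
from datetime import datetime, timedelta


def stagger_jobs(clients, waves=4, first_start="01:00", gap_hours=2):
    """Assign clients round-robin to `waves` schedules.

    Each schedule starts `gap_hours` after the previous one.
    Returns a dict mapping start time ("HH:MM") to a list of clients.
    """
    start = datetime.strptime(first_start, "%H:%M")
    schedule = {}
    for wave in range(waves):
        slot = (start + timedelta(hours=wave * gap_hours)).strftime("%H:%M")
        schedule[slot] = clients[wave::waves]  # every `waves`-th client
    return schedule


# Hypothetical names standing in for the 28 full backup jobs.
clients = [f"client{i:02d}" for i in range(1, 29)]
for slot, group in stagger_jobs(clients).items():
    print(slot, len(group), "jobs:", ", ".join(group))
```

Round-robin only balances the job count per schedule. If a few databases are much larger than the rest, spreading those across different schedules first would matter more than an even count.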

Userlevel 1
Badge +7

Hello @Sajan 

I have 3 MAs:

MA1 > SQL, MySQL, MongoDB, Sybase, and dump files
MA2 > VMs, Filesystem
MA3 > Oracle, SAP

Unfortunately, I can’t suspend any of these jobs, as these are critical DBs,
but I will reschedule one of the large DBs to run alone at a different time and check the performance.

 

Userlevel 7
Badge +23

@Muhammad Abdullah , following up on this one.

Did the reschedule increase the throughput?

Thanks!

Userlevel 1
Badge +3

One additional point to this discussion: how many DDB partitions, for any and all DDBs, reside on this particular mount path/disk? With DDB v5, Commvault will spawn new DDBs if the parameters are met for high Q&I times or a large number of unique blocks, which could spread more work across the disks. If this disk is also a target for other DDBs, there could be contention at the disk level.

You can check this by looking at Resource Monitor, if this is a Windows MA, and look at the disk queue depth for the disk during these spike times of QI times. It may be necessary to distribute those DDBs to other disks/mount paths.
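If you prefer to capture the queue depth over a whole backup window instead of watching Resource Monitor live, one option is to log the Windows `\PhysicalDisk(*)\Current Disk Queue Length` performance counter to CSV (for example with `typeperf`) and then look for spikes. The sketch below assumes a two-column CSV (timestamp, queue length) with one header row; the sample data, the host and disk names in the header, and the threshold of 2 are all hypothetical.

```python
import csv
import io


def queue_spikes(csv_text, threshold=2.0):
    """Return (timestamp, queue_length) pairs where queue length exceeds `threshold`."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    spikes = []
    for row in rows:
        if len(row) < 2:
            continue  # incomplete line
        try:
            value = float(row[1])
        except ValueError:
            continue  # blank or non-numeric sample
        if value > threshold:
            spikes.append((row[0], value))
    return spikes


# Hypothetical counter log: header row followed by three samples.
sample = r'''"(PDH-CSV 4.0)","\\MA3\PhysicalDisk(1 E:)\Current Disk Queue Length"
"07/05/2022 01:00:00.000","0.000000"
"07/05/2022 01:00:05.000","7.000000"
"07/05/2022 01:00:10.000","1.000000"
'''

for timestamp, depth in queue_spikes(sample):
    print(timestamp, depth)
```

Lining up the spike timestamps with the Q&I time chart should show whether the disk itself is queuing during the slow periods.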

Also, I have extensive first-hand experience that moving from local NVMe disk to Pure LUNs, both iSCSI and SAN attached, will NOT give similar, and definitely not better, performance. In my experience the degradation in performance and latency is at least 30-40%. Pure itself is a solid performing storage solution, but the DDB lookups are just too intensive for a shared controller, and no one could afford to dedicate a Pure array specifically to DDB operations.

Another contributing factor could be Commvault operations that compete for resources, such as data aging (check the SIDBPrune and SIDBPhysicalDeletes logs for activity to see whether they run at the time of the backups for these servers), data verification, or space reclamation. The latter two have job histories. These operations would negatively impact backup operations if they overlapped.
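Once you have the start and end times from the logs and job histories, checking for overlap is simple interval arithmetic. A minimal sketch follows; every timestamp and window below is a hypothetical placeholder, not taken from this environment.

```python
from datetime import datetime


def overlap_minutes(a_start, a_end, b_start, b_end):
    """Minutes during which two time windows overlap (0 if they do not)."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(0.0, (earliest_end - latest_start).total_seconds() / 60)


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


# Hypothetical backup window and maintenance windows.
backup = (ts("2022-07-05 01:00"), ts("2022-07-05 21:00"))
maintenance = {
    "data aging": (ts("2022-07-05 12:00"), ts("2022-07-05 13:00")),
    "DDB backup": (ts("2022-07-05 20:30"), ts("2022-07-05 22:00")),
    "space reclamation": (ts("2022-07-05 22:00"), ts("2022-07-05 23:30")),
}

for name, (start, end) in maintenance.items():
    print(name, overlap_minutes(*backup, start, end), "minutes of overlap")
```

Any operation with a non-zero overlap is a candidate to move outside the backup window.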

Userlevel 1
Badge +3

Add DDB backups to the list of processes that could negatively affect backup/lookup performance as well. 

Userlevel 7
Badge +16

@J Dodson your observation of increased IO latency for DDBs when moving the DDB from a local NVMe drive to an NVMe-based storage solution like Pure Storage over FC/iSCSI is as expected, of course. The path to the storage is longer and traverses a protocol that is less efficient than a local drive. In this case you might consider using NVMe-oF to bring the latency closer to what you see with a local NVMe drive.

But as @Damian Andre already pointed out, seeing 6ms of latency is not normal, even for an FC connection to a Pure Storage array. Some areas to look into:

  • FC misconfiguration
  • QoS on the Pure Storage side, e.g. a volume IOPS limit

Also, why not consider partitioned DDBs, so that you spread the load across all MAs and also introduce some form of HA? That said, block storage is not optimal for this, assuming you are also using the Pure Storage array to store the backup data.
