As shown below, the Q&I time is very high. Note that this MediaAgent serves Oracle and SAP databases only, with daily full backups for around 23 Oracle RAC and 18 SAP clients.
The library is on flash storage.
The DDB disks were SSD and have been moved to Pure Storage (NVMe disks) due to insufficient space on the local disks.
Any idea how to resolve this?
Hi @Muhammad Abdullah
Q&I time is a metric for the performance of the average lookup in the DDB. It is directly tied to the IOPS capability of the disk hosting the DDB. Per the MA hardware requirements, the DDB should be hosted on a local dedicated SSD. If you are using network-attached storage for the DDB, there will be latency compared to locally attached disk.
Hello @Matt Medvedeff
The DDB was hosted on local SSD disks before and the issue already existed; we had to move the DDB partition to SSD SAN storage due to insufficient space, so the issue is not related to the disks. The existing SAN storage is NVMe all the way, so it performs pretty well, and we also use it for active-active scenarios.
It's not just about IOPS, but IOPS and latency. If you can't sustain enough IOPS, the latency will increase - but given this is NVMe, the issue is likely the latency in the connection between the MA and the storage. Once you move the storage outside of the box, you inherit some level of latency, but 6 ms is a lot. How is the Pure Storage SAN connected, FC or iSCSI? If iSCSI, is it sharing the same interface with other network traffic, given that this seems like a workaround?
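If you want a quick way to see what the DDB volume actually delivers for small random reads (the pattern DDB lookups generate), here is a minimal Python sketch. It is not a Commvault tool, and the test file path and sample count are assumptions you would adjust for your MA; for an honest number, point it at a file much larger than RAM so the page cache does not hide the device latency.

```python
# Minimal sketch (not a Commvault tool): sample random 4 KB read latency on the
# volume that hosts a DDB partition, to see whether per-IO latency (not raw IOPS)
# is the bottleneck. TEST_FILE is an assumption -- use any large existing file
# on the DDB volume of your MediaAgent.
import os
import random
import time

TEST_FILE = r"E:\DDB\latency_probe.bin"   # assumption: large file on the DDB volume
READ_SIZE = 4 * 1024                      # DDB lookups are small random reads
SAMPLES = 2000

def sample_read_latency(path, samples=SAMPLES):
    size = os.path.getsize(path)
    latencies = []
    # Note: without direct I/O the OS page cache can mask device latency,
    # so use a file much larger than RAM for a rough but honest picture.
    with open(path, "rb", buffering=0) as f:
        for _ in range(samples):
            offset = random.randrange(0, max(1, size - READ_SIZE))
            start = time.perf_counter()
            f.seek(offset)
            f.read(READ_SIZE)
            latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    latencies.sort()
    return {
        "avg_ms": sum(latencies) / len(latencies),
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1],
    }

if __name__ == "__main__":
    print(sample_read_latency(TEST_FILE))
```

Comparing the average and p99 values against the sub-millisecond numbers a local NVMe drive typically gives would show how much latency the FC/iSCSI path is adding.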
Hey Muhammad,
Can you please generate the DDB stats for 1 day? Here is a screenshot for your reference.
Hello @Damian Andre
Regarding the latency, I've checked and it's less than 6 ms - it hits 3 ms max, as shown below. The connection between the MA and the array is FC.
Dear @Sajan
Kindly find what you asked for below.
Hey Muhammad,
Can you please include the Q&I time stats in the chart?
Hello @Sajan
The previous screenshot was for the whole DDB disk (2 partitions), where the Q&I Time checkbox is not present.
So please find below the screen for one of the partitions with the Q&I Time checkbox selected.
This is a good chart. The Q&I times are up and down, which is a good sign (better than seeing Q&I times stuck at a high level).
You might want to review how the backup jobs are scheduled. Review what Commvault jobs run during the peak times.
Hello @Sajan,
This MA is dedicated to backing up Oracle and SAP clients.
Our Oracle DBs run daily fulls starting at 2 AM. Since the issue showed up, some Oracle jobs take more than 20 hours, so I believe these spikes come from the Oracle clients, which have been running slowly since the issue started.
For example, this one started 12 hours ago and is still at 37%!
Thanks for sharing this information. This must be such a pain to manage.
How many such jobs run? How is the performance when you run only one job? Are there any synthetic full backups running at that time?
Dear @Sajan,
No synthetic full backups run on this MA.
Only full backups for Oracle and SAP: 28 individual full backup jobs running at the same time, starting at 1 AM.
Oh! 28 full backups running at the same time is not a good idea. How many MAs do you have?
Why don't you break them into 4 separate backup schedules? (See the sketch below for one way to stagger them.)
Alternatively, can you suspend all the full backups except one and test the performance of that single job?
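To make the staggering idea concrete, here is a rough Python sketch that splits the 28 clients into 4 waves with offset start times. The client names and the 90-minute gap are illustrative assumptions only; the actual change would be made in the Commvault schedule policies.

```python
# Rough sketch of the staggering idea: split the 28 full backup clients into
# 4 waves with offset start times instead of launching them all at once.
# Client names and the 90-minute gap are assumptions for illustration only.
from datetime import datetime, timedelta

clients = [f"client-{i:02d}" for i in range(1, 29)]   # 28 clients (placeholder names)
waves = 4
first_start = datetime.strptime("01:00", "%H:%M")
gap = timedelta(minutes=90)                           # assumption: tune to observed runtimes

# Round-robin the clients across the waves so each wave gets a similar load.
schedule = {}
for idx, client in enumerate(clients):
    schedule.setdefault(idx % waves, []).append(client)

for wave, members in sorted(schedule.items()):
    start = (first_start + wave * gap).strftime("%H:%M")
    print(f"Wave {wave + 1} starts {start}: {', '.join(members)}")
```

A grouping like this keeps roughly 7 concurrent fulls per wave instead of 28, which is much kinder to the DDB lookups.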
Hello @Sajan
I have 3 MAs:
MA1 > SQL, MySQL, MongoDB, Sybase, and dump files
MA2 > VMs, file system
MA3 > Oracle, SAP
Unfortunately, I can't suspend any of these jobs as they are critical DBs, but I will reschedule one of the large DBs to run alone at a different time and check the performance.
@Muhammad Abdullah , following up on this one.
Did the reschedule increase the throughput?
Thanks!
One additional point to this discussion: how many DDB partitions, for any and all DDBs, reside on this particular mount path/disk? With DDB v5, Commvault will spawn new DDBs if the thresholds for high Q&I times or a large number of unique blocks are met, which spreads more work across the disks, and if this disk is a DDB target for other DDBs as well, it could be a contention issue at the disk level.
You can check this by opening Resource Monitor, if this is a Windows MA, and watching the disk queue depth for that disk during the Q&I time spikes. It may be necessary to distribute those DDBs to other disks/mount paths.
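If you prefer to capture the queue depth over a window instead of watching Resource Monitor live, a small sketch like the one below (wrapping the built-in typeperf counter tool on a Windows MA) could sample it while the Q&I times spike. The disk instance name and sampling values are assumptions; list the exact instances on your MediaAgent first with `typeperf -qx PhysicalDisk`.

```python
# Hedged sketch for a Windows MA: sample the disk queue length of the DDB disk
# with the built-in typeperf tool while Q&I times are spiking.
# The instance name "2 E:" is an assumption -- run `typeperf -qx PhysicalDisk`
# to list the exact PhysicalDisk instances on your MediaAgent.
import subprocess

COUNTER = r"\PhysicalDisk(2 E:)\Avg. Disk Queue Length"   # assumption: DDB volume is E:
INTERVAL_SEC = 5
SAMPLES = 60                                              # ~5 minutes of data

subprocess.run(
    ["typeperf", COUNTER, "-si", str(INTERVAL_SEC), "-sc", str(SAMPLES)],
    check=True,
)
```

A sustained queue depth well above the number of spindles/paths during the spikes would point to contention on that disk rather than on the DDB itself.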
Also, I have extensive first-hand experience that moving from local NVMe disk to Pure LUNs, whether iSCSI or FC/SAN attached, will NOT give similar, and definitely not better, performance. In my own experience there is at least a 30-40% drop in performance, with latency rising accordingly. Pure itself is a solid-performing storage solution, but DDB lookups are just too intensive for a shared controller, and no one could afford to dedicate a Pure array specifically to DDB operations.
Another contributing factor could be Commvault operations competing for resources, such as data aging (check the SIDBPrune and SIDBPhysicalDeletes logs for activity at the time these servers are backing up), data verification operations, or space reclamation operations; the latter two have job histories. These operations would negatively impact backup operations if they overlapped.
Add DDB backups to the list of processes that could negatively affect backup/lookup performance as well.
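As a rough way to spot that overlap, a sketch like the one below could count SIDBPrune / SIDBPhysicalDeletes log lines that fall inside the backup window. The log directory and the timestamp pattern are assumptions; point it at the MediaAgent's Commvault Log Files folder and adjust the regex to match your log layout.

```python
# Hedged sketch: check whether SIDBPrune / SIDBPhysicalDeletes activity overlaps
# the nightly full-backup window. LOG_DIR and the timestamp regex are
# assumptions -- point LOG_DIR at the MediaAgent's Commvault "Log Files" folder
# and adjust TIME_RE to the timestamp layout actually used in your logs.
import re
from pathlib import Path

LOG_DIR = Path(r"C:\Program Files\Commvault\ContentStore\Log Files")  # assumption
LOG_NAMES = ["SIDBPrune.log", "SIDBPhysicalDeletes.log"]
WINDOW_START, WINDOW_END = 1, 6          # 01:00-06:00, the full-backup window
TIME_RE = re.compile(r"\b(\d{2}):\d{2}:\d{2}\b")  # assumption: HH:MM:SS on each line

for name in LOG_NAMES:
    path = LOG_DIR / name
    if not path.exists():
        print(f"{name}: not found (adjust LOG_DIR)")
        continue
    hits = 0
    with path.open(errors="ignore") as f:
        for line in f:
            m = TIME_RE.search(line)
            if m and WINDOW_START <= int(m.group(1)) < WINDOW_END:
                hits += 1
    print(f"{name}: {hits} log lines inside the "
          f"{WINDOW_START:02d}:00-{WINDOW_END:02d}:00 window")
```

A large hit count for either log during the backup window would suggest pruning is fighting the Oracle/SAP fulls for the same DDB disk.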
@J Dodson your observation of increased I/O latency for DDBs when moving the DDB from a local NVMe drive to an NVMe-based storage solution like Pure Storage over FC/iSCSI is of course expected. The path is longer and traverses a protocol that is not as efficient as a local drive. You might in this case consider using NVMe-oF to reduce the latency to values closer to what you see with a local NVMe drive.
But as @Damian Andre already pointed out, seeing 6 ms of latency is not normal, even for an FC connection to a Pure Storage array. Some areas to look into:
FC misconfiguration
QoS on the Pure Storage side, e.g. a volume IOPS limit
Also, why are you not considering the use of partitioned DDBs, so you spread the load across all MAs and also introduce some form of HA, even though the use of block storage is not optimal, assuming you are also using the Pure Storage array to store the backup data?