I have been working with ExaGrid support on slow restore speeds for about 10 weeks now, and we still cannot figure out why restores are so slow. It takes about 10 hours to restore a single VMDK just to the Media Agent (we are restoring to the MA itself to take VMware overhead out of the picture for now), writing to the high-speed disks on a Windows Server 2022 MA. The MAs are brand-new; with our previous Windows Server 2012 MAs we didn't have any issues. We appear to be getting extremely slow READ performance when reading back from the ExaGrid. All backup jobs seem to work fine. We even moved the ExaGrid nodes to the same VLAN as the MAs to rule out network issues, but please keep in mind that we use Cisco ACI, so everything is software-defined. Moving them to the same VLAN actually DECREASED backup performance by 30%, which we confirmed with iPerf and TCP dumps.
Setup:
** 4 x MAs - all running Windows Server 2022 with separate physical volumes for the DDBs and Indexes.
** The DDBs are partitioned across all 4 MAs.
** Our CS is also running Windows Server 2022 and is a VM. The ExaGrid has 8 nodes running version 7.0.1 (P06 build 448).
- Compression and Encryption are disabled across the board
- Deduplication is enabled in Commvault
- Our Storage Policies (we do not use Plans) are split into separate Full and Incremental SPs.
- The number of streams to the library is not defined
- Backups run properly, although they have been much slower since the VLAN migration
- A restore of a VMDK file to the MA (Job ID 39114625) is averaging 5.35 GB/hr and is 47% complete after running for 4.5 hours
- We have configured the SMB client on Windows Server 2022 to neither request nor require signing and disabled bandwidth throttling, essentially putting it back at 2012 levels, with no performance gain
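As a back-of-the-envelope check on the Job ID 39114625 numbers, here is a minimal Python sketch that converts the reported average to MB/s and projects the total run time from the percent complete. It assumes decimal units (1 GB = 1000 MB), that progress stays roughly linear, and that the reported average reflects the data actually read; the function names are only illustrative.

```python
def gb_per_hour_to_mb_per_s(gb_per_hour: float) -> float:
    """Convert GB/hr to MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_hour * 1000 / 3600


def projected_total_hours(elapsed_hours: float, percent_complete: float) -> float:
    """Project total job duration, assuming progress stays linear."""
    return elapsed_hours / (percent_complete / 100)


def implied_size_gb(gb_per_hour: float, total_hours: float) -> float:
    """Data volume implied by holding the average rate for the whole job."""
    return gb_per_hour * total_hours


# Figures reported for Job ID 39114625
rate = 5.35     # average throughput, GB/hr
elapsed = 4.5   # hours run so far
done = 47.0     # percent complete

total = projected_total_hours(elapsed, done)
print(f"Average rate: {gb_per_hour_to_mb_per_s(rate):.2f} MB/s")
print(f"Projected total: {total:.1f} h")
print(f"Implied VMDK size: {implied_size_gb(rate, total):.0f} GB")
```

With these inputs it prints about 1.49 MB/s and a projected total of roughly 9.6 hours, in line with the ~10 hours per VMDK described above. The implied size of about 51 GB is only meaningful if the job's average throughput is measured against the full VMDK.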
We were told that CV dedupe can remain intact. Data first lands in the ExaGrid "Landing Zone", where it has not yet been deduplicated by the ExaGrid. After it is processed there, it is moved into the "retention area", where the ExaGrid deduplicates the already Commvault-deduplicated data.
I have hesitated to open a ticket with CV because we were confident from the beginning that the issue was on the ExaGrid end, but I am now at my wit's end and grasping for any kind of answer. We have removed AV/IPS and, as mentioned earlier, even placed the MAs on the same VLAN as the ExaGrid storage.
** Things that do not work properly at the moment:
----- Synthetic Fulls (I opened a ticket for this today). Error message:
Error Code: [19:1926] Description: Some jobs were skipped due to read errors. Will retry after some time. Source: bccmvcs01, Process: JobManager
----- DDB Verification / Space Reclamation: any time I start a DDB maintenance process, it seems to bring the whole environment to a crawl and running jobs go into a Pending state, so I can't actually complete these maintenance processes.
----- Copying a VMDK file with Windows Explorer from the ExaGrid "utility share" (a non-deduplicated share) to the SSD on the MA is also extremely slow, and so is copying in the other direction. Read speeds are around 400 MB/s, while we expect 800-1000 MB/s given our 10Gbps NICs, each with dual interfaces configured as Active/Active vPC to our switches.
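To put the Explorer copy numbers in context, here is a small Python sketch that converts MB/s to Gbps and to a share of one 10 Gbps link. It assumes decimal units, treats a single 10 Gbps interface as the ceiling for one copy (a single flow over a port-channel such as a vPC is normally hashed onto one member link), and ignores TCP/SMB overhead, so real-world maxima will be somewhat lower.

```python
LINK_GBPS = 10.0  # assumed ceiling: one 10 Gbps interface, no protocol overhead


def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert MB/s (decimal, 10**6 bytes) to Gbit/s."""
    return mb_per_s * 8 / 1000


def link_utilization(mb_per_s: float, link_gbps: float = LINK_GBPS) -> float:
    """Fraction of the link's raw bit rate that a given MB/s figure represents."""
    return mb_per_s_to_gbps(mb_per_s) / link_gbps


rates = [("observed read", 400), ("expected low", 800), ("expected high", 1000)]
for label, rate in rates:
    print(f"{label:>13}: {rate} MB/s = {mb_per_s_to_gbps(rate):.1f} Gbps "
          f"({link_utilization(rate):.0%} of {LINK_GBPS:.0f} Gbps)")

# Raw payload ceiling of one link, expressed in MB/s
print(f"Raw line rate: {LINK_GBPS * 1000 / 8:.0f} MB/s")
```

This shows the observed 400 MB/s is about 3.2 Gbps (32% of one link), while the 800-1000 MB/s expectation corresponds to 6.4-8.0 Gbps; the raw line rate of a 10 Gbps link is 1250 MB/s.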
I need help analyzing the log files. ExaGrid keeps wanting to upgrade their nodes to a newer version, but I don't see that as the answer at this point: both direct copy/paste and iPerf tests show less than 10Gbps, and the vsrst.log files show very slow read speeds off of the ExaGrid libraries.
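For the iPerf side, if the tests are run with iperf3 and its `-J` (JSON output) flag, a small Python helper can pull the end-of-test totals and express them against the 10 Gbps line rate. This assumes iperf3 (not iperf2) and the `end.sum_sent` / `end.sum_received` fields that iperf3 emits for TCP tests; the script and file names are hypothetical.

```python
import json
from typing import Any, Dict

LINK_GBPS = 10.0  # assumed line rate of one MA/ExaGrid interface


def summarize_iperf3(result: Dict[str, Any], link_gbps: float = LINK_GBPS) -> Dict[str, float]:
    """Summarize an iperf3 TCP result parsed from `iperf3 -J` output.

    Uses end.sum_sent and end.sum_received, which iperf3 reports as totals
    across all parallel streams (-P).
    """
    end = result["end"]
    sent_gbps = end["sum_sent"]["bits_per_second"] / 1e9
    recv_gbps = end["sum_received"]["bits_per_second"] / 1e9
    return {
        "sent_gbps": sent_gbps,
        "received_gbps": recv_gbps,
        "received_pct_of_link": 100 * recv_gbps / link_gbps,
    }


def load_and_summarize(path: str) -> Dict[str, float]:
    """Load a saved `iperf3 -J ... > file.json` result and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize_iperf3(json.load(fh))


if __name__ == "__main__":
    import sys

    # Each command-line argument is a saved iperf3 JSON result file.
    for name in sys.argv[1:]:
        s = load_and_summarize(name)
        print(f"{name}: {s['received_gbps']:.2f} Gbps received "
              f"({s['received_pct_of_link']:.0f}% of {LINK_GBPS:.0f} Gbps)")
```

For example, save a run with `iperf3 -c <server> -P 4 -J > read_test.json` (adding `-R` reverses the direction so the server sends and the client receives), then run `python iperf_summary.py read_test.json`.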
Please let me know if there is any further information I can share to assist you in troubleshooting our issue.