Dear Community,
Thank you in advance to everyone who replies. We have a bit of a head-scratcher on our hands, one we have failed to resolve despite our best efforts so far. A support case is ongoing, but we were pointed back towards our own infrastructure. Let's start at the beginning.
On September 8th, my former colleague and Commvault admin (who has unfortunately since left the company) enabled encryption at the storage policy level. Afterwards, synthetic full backup jobs stopped running because of decryption issues, which in turn caused the storage on our primary media agent to fill up.
On September 25th, I disabled storage policy encryption again, thinking that might fix the issue, but it did not. To keep backup jobs running, I created a spillover mount path on our offsite media agent. A few days later, the issue that caused the synthetic full jobs to fail was resolved. Some backup data was written to this spillover mount path, which belongs to the same library but is mounted on the offsite media agent.
Meanwhile, around September 20th, I noticed that auxiliary copy throughput between the primary and offsite media agents was extremely low: it is normally around 10 TB/hr, but it dropped to roughly 200 GB/hr (about a 50-fold drop) and has stayed there. At first I thought it might be related to the decryption issue we were battling at the time, but once that was resolved, aux copy throughput did not recover.
I opened a Commvault support case. After several tests with Commvault's performance analysis tool, read speeds on the media agents were found to be low, and the disk queue length for the drive in question would not drop below 50 during job activity. Even now, most auxiliary copy jobs are stuck with the error "Total number of data transfer operations on the MediaAgent exceeds the maximum allowed value." Aux copy throughput from the offsite media agent to tape is severely affected as well, which supports the suspicion of a read-speed issue.
Based on this information, we updated the storage array firmware and drivers and opened a case with our hardware vendor. After reviewing various logs, the vendor told us that the disks and storage arrays are completely healthy.
At this point I am at a bit of a loss. Commvault support says the issue must be on the infrastructure side, but we have checked the infrastructure (network, hardware, etc.) and everything looks fine. Could this still be related to the encryption that was briefly enabled? Could it be related to the spillover mount path on the offsite media agent, which belongs to the primary media agent's library? I cannot move that mount path back to the primary media agent yet: because of the substantial auxiliary copy backlog, data is not aging the way it is supposed to, and we are still short on space on the primary media agent.
Any advice would be greatly appreciated.
Thanks so much in advance!