Solved

How to Reduce Random Read Load on NL-SAS Storage Arrays?


Badge +1

Hello,

We are seeing a very large random read load on our Hitachi G350 backup storage arrays with NL-SAS disks. These random reads are completely consuming the arrays' performance. We have two G350s on campus and a third at a remote site. Commvault runs copy jobs between these three G350s.

The DDB is on local NVMe in the Media Agent, as is the Index Cache.

We ran several analyses, and Live Optics showed a daily change rate of 334.9%, driven mainly by the Windows File System policy, which alone shows a 2485.1% daily change rate.

Does anyone know how the random read load could be reduced? Our disk backup is otherwise unusable. What steps could we take to optimize the Commvault configuration?


Thanks for your help! 


Best answer by Torsten 5 October 2022, 17:12


8 replies

Badge +1

Conclusion: 

The random read load is caused mainly by running three different aux copy jobs. Since management requires a fourth copy, we'll upgrade to flash next year and expect the problem to disappear, just as @Onno van den Berg wrote. NL-SAS is not suitable for more than 1-2 aux copies, if at all.

 

Thanks a lot  @Collin Harper and the Commvault community for the guidance. 😁

Userlevel 5
Badge +13

Hello @Torsten 

From the sound of it, you may be maxing out the storage in terms of how many streams it can handle at any one time. Since these are NL-SAS disks, I would recommend confirming that Device Streams are not configured higher than the number of writers on the physical storage.

Device Streams - https://documentation.commvault.com/11.26/expert/10969_streams_overview.html

  • Disk Library Level

    For disk library, the number of device streams is based on the total number of mount path writers for all mount paths in the library. For example, if you have a disk library with two mount paths that have five writers each, a total of ten device streams can be written to the library. When you increase the number of mount path writers, more job streams can be written to device streams.
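For illustration, here is a minimal sketch of that calculation (the mount path names and writer counts are hypothetical, matching the documentation example rather than any real library):

```python
# Conceptual sketch: the device stream ceiling of a disk library is the
# sum of the mount path writers across all of its mount paths.
# Mount path names and writer counts are hypothetical examples.
mount_path_writers = {
    "mount_path_1": 5,
    "mount_path_2": 5,
}

# Total device streams the library can accept at any one time.
device_streams = sum(mount_path_writers.values())
print(f"Device stream ceiling: {device_streams}")  # -> 10
```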

If the streams are configured appropriately, pruning may be causing the performance issue by adding overhead on the storage on top of the running jobs. To test, you can configure a pruning blackout window so the system prevents pruning during peak backup hours.

Avoid Pruning of the Data During Peak Backup - https://documentation.commvault.com/11.26/expert/11955_avoid_pruning_of_data_during_peak_backups.html
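A minimal sketch of the blackout-window logic, assuming a hypothetical 20:00-06:00 peak backup window (this only illustrates the idea; the actual window is configured in Commvault, not in code):

```python
from datetime import datetime, time

# Conceptual sketch: during the pruning blackout window, pruning is
# deferred so backups keep the storage to themselves. The 20:00-06:00
# peak backup window below is a hypothetical assumption.
BLACKOUT_START = time(20, 0)
BLACKOUT_END = time(6, 0)

def pruning_allowed(now: datetime) -> bool:
    # The window wraps past midnight, so pruning is allowed only
    # between 06:00 and 20:00.
    return BLACKOUT_END <= now.time() < BLACKOUT_START

print(pruning_allowed(datetime(2022, 10, 5, 12, 0)))  # True: midday
print(pruning_allowed(datetime(2022, 10, 5, 23, 0)))  # False: backup window
```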

Userlevel 7
Badge +23

Awesome!  @Onno van den Berg and @Collin Harper are community mainstays for sure!!

Userlevel 7
Badge +19

I guess there are still many customers using disk/spindle-based storage solutions as the landing zone for their backup data. As more customers move to flash-based solutions, a lot of these issues will disappear, but the problem will just move somewhere else, for example to the CPUs of the array or the network.

One thing we forget, especially with disk-based arrays, is that all these supporting or administrative jobs like data aging and aux copies can have a big impact on RTO because of the load they generate on the storage. For large customers and MSPs it would be nice if there were an option that automatically paused all administrative jobs related to disk libraries that are servicing active recovery jobs.
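A minimal sketch of what such an option could look like, purely as an illustration of the idea (no such feature exists today; the job fields and type names below are hypothetical):

```python
# Conceptual sketch, not a Commvault API: suspend administrative jobs on
# any disk library that is currently servicing a recovery job, so the
# restore gets the full IOPS budget of the spindles.
ADMIN_JOB_TYPES = {"aux_copy", "data_aging", "ddb_verification"}

def jobs_to_pause(active_jobs):
    """Return admin jobs that share a library with an active restore."""
    busy_libraries = {
        job["library"] for job in active_jobs if job["type"] == "restore"
    }
    return [
        job for job in active_jobs
        if job["type"] in ADMIN_JOB_TYPES and job["library"] in busy_libraries
    ]

active = [
    {"type": "restore", "library": "DiskLib1"},
    {"type": "aux_copy", "library": "DiskLib1"},
    {"type": "aux_copy", "library": "DiskLib2"},
]
print(jobs_to_pause(active))  # -> only the aux_copy on DiskLib1
```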

Userlevel 5
Badge +13

@Torsten You’re welcome! I look forward to hearing back.

Userlevel 5
Badge +13

@Torsten 

You’re welcome! I’m happy to hear I was able to point you in the right direction.

 

Thank you,

Collin

Badge +1

@Collin Harper 

The data streams were definitely set too high. We had 50-100 data streams, where we should have had a maximum of 21-45. Another cause of the high read load is several auxiliary copy operations.

These seem to run in parallel because the same primary copy is on a continuous schedule. We are currently checking this with a support engineer.
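For anyone running the same sanity check on their streams, a minimal sketch (the writer counts below are placeholders, not our actual configuration):

```python
# Quick sanity check with placeholder numbers: configured data streams
# should stay at or below the sum of mount path writers, otherwise the
# excess streams just thrash the NL-SAS spindles with random reads.
configured_streams = 100
writers_per_mount_path = [15, 15, 15]         # placeholder values
stream_ceiling = sum(writers_per_mount_path)  # -> 45

if configured_streams > stream_ceiling:
    print(f"Over-configured: {configured_streams} streams vs a ceiling "
          f"of {stream_ceiling}; reduce the device streams.")
```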

Due to the vacation season, it will take another 2-3 weeks before we can test all measures and see results.

Anyway, you have put us on the right track, thanks for that! 😀👍

Badge +1

Hello @Collin Harper 

Thanks a lot. We're going to check this ASAP. I'll let you know what we find.
