Solved

Slow Isilon Read Speeds


Userlevel 1
Badge +3
  • Certified Master
  • 14 replies

Hi all

We have a customer that has an Isilon for disk storage.

Backup speeds are OK, but DASH to DR and copy-to-tape speeds are terrible. “Last night’s” backups copy to DR at no more than 500GB/hr if we’re lucky, and copy-to-tape speeds do not exceed 200GB/hr.

Index and DDB are on NVMe and tested fine.

Fallen Behind Copies are literally years behind.

Both Commvault and Isilon have checked it out and cannot do anything about it.

We did spec HyperScale before implementation, but we were overruled, and now I sit with this issue. Very frustrating.

 

Has anyone experienced dog-slow Isilon restores, DASH copies or Tape copies?

How were you able to overcome this?

 

It’s gotten so bad that we are going to ask Commvault to either help us fix it or reconsider certifying it as a DL destination.


Best answer by CSumner 22 April 2021, 12:38

Thanks Shane,

 

There definitely looks to be a read issue with the source library.

 

Each thread translates to one stream, so per stream you are seeing sub-optimal performance from the library.

 

Additional settings or a configuration change at this point will not give you the performance boost you need.

 

I would look at engaging your storage support to see why the read speeds are so slow.

 

You should also see very slow speeds if you attempt to copy a large file from the storage to the MA via the configured share. This, along with the two reports above, should be more than enough evidence for your storage support to investigate.

 

Regards,

 

Chris Sumner


18 replies

Userlevel 1
Badge +1

Hi Shane,

 

Thank you for your post.

 

Without the logs, we would not be able to give you a definitive answer. However, if backups to the primary library show better throughput than the aux copy to the tape library, then we could make an educated guess that the issue is with the write speeds to the tape library.

 

The CVJobReplicatorODS log on the source Media Agents will show you the read speeds.

 

The CVD log on the destination Media Agent will show you the write speeds.

 

The write speeds of the library can be tested outside of Commvault using the tapetoolgui application provided in the Commvault binaries: https://documentation.commvault.com/commvault/v11/article?p=10587.htm

 

Please load an empty tape into one of the drives using Commvault, then go to the properties of the drive and click the Details button to make note of the tape device number, for example \\.\Tape0.

 

Now open the tapetoolgui on the Media Agent and go to the Write option. From the drop-down box, select the correct tape number, set the same settings configured for the aux copy (these can be found in the Data Path tab for the copy), and attempt to write 10GB of data.

 

The result will show on screen. If the speeds are good, then this needs to be a case with Commvault Support; if the speeds are slow, then this needs to be looked into by your tape vendor.

 

Userlevel 1
Badge +3

Hi Shane,

 

Thank you for your post.

 

Without the logs, we would not be able to give you a definitive answer. However, if backups to the primary library show better throughput than the aux copy to the tape library, then we could make an educated guess that the issue is with the write speeds to the tape library.

 

The CVJobReplicatorODS log on the source Media Agents will show you the read speeds.

 

The CVD log on the destination Media Agent will show you the write speeds.

 

The write speeds of the library can be tested outside of Commvault using the tapetoolgui application provided in the Commvault binaries: https://documentation.commvault.com/commvault/v11/article?p=10587.htm

 

Please load an empty tape into one of the drives using Commvault, then go to the properties of the drive and click the Details button to make note of the tape device number, for example \\.\Tape0.

 

Now open the tapetoolgui on the Media Agent and go to the Write option. From the drop-down box, select the correct tape number, set the same settings configured for the aux copy (these can be found in the Data Path tab for the copy), and attempt to write 10GB of data.

 

The result will show on screen. If the speeds are good, then this needs to be a case with Commvault Support; if the speeds are slow, then this needs to be looked into by your tape vendor.

 

This is very helpful, thank you.

Will report back with the results

 

Userlevel 1
Badge +3

This doesn’t paint a pretty picture:

The CVJobReplicatorODS log, filtered:

 Job Id [2808483], Bytes [3554012807], Time [150.823810] Sec(s), Average Speed [22.472385] MB/Sec
1320  4020  04/20 20:29:45 2808483 stat- ID [Media Read Speed], Job Id [2808483], Bytes [3569888], Time [6.425903] Sec(s), Average Speed [0.529810] MB/Sec
1320  4020  04/20 20:30:18 2808483 stat- ID [Media Read Speed], Job Id [2808483], Bytes [35959], Time [6.379979] Sec(s), Average Speed [0.005375] MB/Sec
1320  2aa8  04/20 20:30:18 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [2524446018], Time [57.223872] Sec(s), Average Speed [42.071591] MB/Sec
1320  2aa8  04/20 20:30:47 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [407533284], Time [13.753373] Sec(s), Average Speed [28.258815] MB/Sec
1320  2aa8  04/20 20:31:43 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [538582783], Time [20.165669] Sec(s), Average Speed [25.470643] MB/Sec
1320  2aa8  04/20 20:33:33 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [1692375892], Time [73.643517] Sec(s), Average Speed [21.916056] MB/Sec
1320  2aa8  04/20 20:36:54 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [3446449989], Time [136.524469] Sec(s), Average Speed [24.074738] MB/Sec
1320  2aa8  04/20 20:38:49 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [3446384584], Time [80.714226] Sec(s), Average Speed [40.720560] MB/Sec
1320  2aa8  04/20 20:43:20 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [3279470225], Time [199.083116] Sec(s), Average Speed [15.709753] MB/Sec
1320  2aa8  04/20 20:48:44 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [5357413664], Time [163.631480] Sec(s), Average Speed [31.223991] MB/Sec
1320  2aa8  04/20 20:57:02 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [9535605031], Time [290.916488] Sec(s), Average Speed [31.259354] MB/Sec
1320  2aa8  04/20 21:03:12 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [2839144051], Time [268.750481] Sec(s), Average Speed [10.074842] MB/Sec
1320  15b0  04/20 21:05:45 2808562 stat- ID [Media Read Speed], Job Id [2808562], Bytes [909152407], Time [31.214392] Sec(s), Average Speed [27.776780] MB/Sec
1320  2aa8  04/20 21:07:27 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [126195744], Time [10.064503] Sec(s), Average Speed [11.957832] MB/Sec
1320  15b0  04/20 21:08:14 2808562 stat- ID [Media Read Speed], Job Id [2808562], Bytes [190005285], Time [52.600836] Sec(s), Average Speed [3.444872] MB/Sec
1320  2aa8  04/20 21:16:10 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [4526056297], Time [96.700089] Sec(s), Average Speed [44.636812] MB/Sec
1320  2aa8  04/20 21:33:53 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [22591919052], Time [885.865786] Sec(s), Average Speed [24.321216] MB/Sec
1320  43e8  04/20 21:48:05 2808580 stat- ID [Media Read Speed], Job Id [2808580], Bytes [1557346555], Time [163.534822] Sec(s), Average Speed [9.081866] MB/Sec
1320  3e6c  04/20 22:35:37 2808624 stat- ID [Media Read Speed], Job Id [2808624], Bytes [3392684930], Time [889.073823] Sec(s), Average Speed [3.639199] MB/Sec
1320  3e6c  04/20 22:37:04 2808624 stat- ID [Media Read Speed], Job Id [2808624], Bytes [1364], Time [0.000146] Sec(s), Average Speed [8.879496] MB/Sec
1320  2aa8  04/20 22:37:16 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [55096329787], Time [3450.144753] Sec(s), Average Speed [15.229493] MB/Sec
1320  2aa8  04/20 23:06:19 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [21736043560], Time [1619.399755] Sec(s), Average Speed [12.800488] MB/Sec
1320  4250  04/20 23:37:59 2808683 stat- ID [Media Read Speed], Job Id [2808683], Bytes [4887304961], Time [1220.825255] Sec(s), Average Speed [3.817825] MB/Sec
1320  4250  04/20 23:48:20 2808683 stat- ID [Media Read Speed], Job Id [2808683], Bytes [2541425061], Time [430.789823] Sec(s), Average Speed [5.626158] MB/Sec
1320  4250  04/20 23:54:01 2808683 stat- ID [Media Read Speed], Job Id [2808683], Bytes [967632658], Time [275.128657] Sec(s), Average Speed [3.354091] MB/Sec
1320  dbc   04/20 23:59:30 2808683 stat- ID [Media Read Speed], Job Id [2808683], Bytes [300848073], Time [64.534557] Sec(s), Average Speed [4.445852] MB/Sec
1320  2f18  04/21 01:09:56 2808808 stat- ID [Media Read Speed], Job Id [2808808], Bytes [1112128472], Time [258.251365] Sec(s), Average Speed [4.106884] MB/Sec
1320  2c58  04/21 01:50:28 2808827 stat- ID [Media Read Speed], Job Id [2808827], Bytes [1162944605], Time [280.876271] Sec(s), Average Speed [3.948608] MB/Sec
1320  2c58  04/21 02:12:45 2808827 stat- ID [Media Read Speed], Job Id [2808827], Bytes [1393061865], Time [1087.711048] Sec(s), Average Speed [1.221397] MB/Sec
1320  2c58  04/21 02:23:42 2808827 stat- ID [Media Read Speed], Job Id [2808827], Bytes [14573], Time [9.090160] Sec(s), Average Speed [0.001529] MB/Sec
1320  f0c   04/21 03:42:43 2808874 stat- ID [Media Read Speed], Job Id [2808874], Bytes [7378130545], Time [2445.191251] Sec(s), Average Speed [2.877621] MB/Sec
1320  2aa8  04/21 04:17:50 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [167574913710], Time [17754.087481] Sec(s), Average Speed [9.001414] MB/Sec
1320  b90   04/21 04:22:43 2809007 stat- ID [Media Read Speed], Job Id [2809007], Bytes [3643832300], Time [122.084865] Sec(s), Average Speed [28.464047] MB/Sec
1320  2aa8  04/21 04:55:39 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [6002650203], Time [1448.817959] Sec(s), Average Speed [3.951203] MB/Sec
1320  35c0  04/21 04:55:42 2809027 stat- ID [Media Read Speed], Job Id [2809027], Bytes [114125], Time [6.132809] Sec(s), Average Speed [0.017747] MB/Sec
1320  2aa8  04/21 04:59:20 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [1837679], Time [5.702311] Sec(s), Average Speed [0.307340] MB/Sec
1320  35c0  04/21 05:08:16 2809027 stat- ID [Media Read Speed], Job Id [2809027], Bytes [3732997825], Time [482.383491] Sec(s), Average Speed [7.380153] MB/Sec
1320  2aa8  04/21 05:11:00 2807768 stat- ID [Media Read Speed], Job Id [2807768], Bytes [7460494613], Time [577.187766] Sec(s), Average Speed [12.326807] MB/Sec
1320  257c  04/21 05:25:25 2808219 stat- ID [Media Read Speed], Job Id [2808219], Bytes [65818948373], Time [34395.754886] Sec(s), Average Speed [1.824930] MB/Sec
1320  3bb0  04/21 05:49:35 2809083 stat- ID [Media Read Speed], Job Id [2809083], Bytes [4520152292], Time [207.714233] Sec(s), Average Speed [20.753287] MB/Sec
1320  21f0  04/21 05:49:36 2809081 stat- ID [Media Read Speed], Job Id [2809081], Bytes [6019348222], Time [442.818842] Sec(s), Average Speed [12.963536] MB/Sec
1320  257c  04/21 05:49:42 2808219 stat- ID [Media Read Speed], Job Id [2808219], Bytes [3515625379], Time [1195.427071] Sec(s), Average Speed [2.804656] MB/Sec
1320  434c  04/21 07:20:05 2809172 stat- ID [Media Read Speed], Job Id [2809172], Bytes [87469], Time [0.522006] Sec(s), Average Speed [0.159801] MB/Sec
1320  434c  04/21 07:23:51 2809172 stat- ID [Media Read Speed], Job Id [2809172], Bytes [2552744137], Time [230.823423] Sec(s), Average Speed [10.546965] MB/Sec
1320  3b44  04/21 09:52:55 2809175 stat- ID [Media Read Speed], Job Id [2809175], Bytes [3422512], Time [9.055604] Sec(s), Average Speed [0.360436] MB/Sec
1320  3b44  04/21 09:53:56 2809175 stat- ID [Media Read Speed], Job Id [2809175], Bytes [757761354], Time [60.735576] Sec(s), Average Speed [11.898422] MB/Sec
1320  4584  04/21 10:14:26 2808219 stat- ID [Media Read Speed], Job Id [2808219], Bytes [16089697298], Time [3175.778068] Sec(s), Average Speed [4.831676] MB/Sec
1320  4370  04/21 11:29:12 2809343 stat- ID [Media Read Speed], Job Id [2809343], Bytes [10441440139], Time [4279.493356] Sec(s), Average Speed [2.326849] MB/Sec
1320  4370  04/21 11:31:33 2809343 stat- ID [Media Read Speed], Job Id [2809343], Bytes [7805357884], Time [4563.974693] Sec(s), Average Speed [1.630984] MB/Sec
1320  3b44  04/21 11:31:43 2809175 stat- ID [Media Read Speed], Job Id [2809175], Bytes [45820529788], Time [14247.348427] Sec(s), Average Speed [3.067087] MB/Sec
1320  4370  04/21 11:31:43 2809343 stat- ID [Media Read Speed], Job Id [2809343], Bytes [13033779089], Time [4350.706284] Sec(s), Average Speed [2.857003] MB/Sec
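For anyone filtering these logs by hand: the per-entry averages above swing from 0.005 to 44 MB/s, so a byte-weighted average is a fairer picture than eyeballing individual lines. A minimal sketch (assuming the standard Commvault log line format shown above; note the summed times overlap when streams run in parallel, so this is an effective per-stream rate, not aggregate wall-clock throughput):

```python
import re

# Parses "Media Read Speed" stat lines from CVJobReplicatorODS.log and
# computes a byte-weighted average throughput across all entries.
LINE_RE = re.compile(r"Bytes \[(\d+)\], Time \[([\d.]+)\] Sec\(s\)")

def weighted_read_speed(log_lines):
    total_bytes = 0
    total_secs = 0.0
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        total_bytes += int(m.group(1))
        total_secs += float(m.group(2))
    mb_per_sec = (total_bytes / (1024 ** 2)) / total_secs
    gb_per_hr = mb_per_sec * 3600 / 1024
    return mb_per_sec, gb_per_hr

# Usage: pass an open CVJobReplicatorODS.log file object (path varies by
# install) to weighted_read_speed() and print the two values it returns.
```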
 

The TapeToolGUI wrote at 378GB/hr:

 

 

Userlevel 1
Badge +1

Hi Shane,

 

Thank you for sending this over.

 

OK, so rather than a write speed issue, it looks like the issue is with read speeds on the source library.

 

Let’s get this tested outside of Commvault as well.

 

You can use the CVDiskPerf tool also included in the Commvault Binaries: https://documentation.commvault.com/commvault/v11/article?p=8855.htm

 

Please run this on the source media agent; an example of the command is below:

 

CVDiskPerf.exe -PATH D:\DISKLIBRARY1\Folder_2021.04.21-11.23\CVMAGNETIC -RANDOM -OUTFILE c:\temp\perf.txt

 

If possible, I would run this once while the aux copies are running and once with them suspended.

 

Please make sure you are writing the output file to an existing location and that the name of the file changes on each test, to prevent overwriting the previous test data.
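Since the filename has to change on every run, a tiny wrapper that stamps the `-OUTFILE` automatically saves you from clobbering a previous test. A sketch only: the library path is the example from this thread, and `c:\temp` is an assumed output directory, so substitute your own:

```python
import subprocess
import time
from pathlib import Path

# Builds a CVDiskPerf command line with a timestamped -OUTFILE so the
# "copies running" and "copies suspended" runs never overwrite each other.
def build_cvdiskperf_cmd(library_path, out_dir=r"c:\temp"):
    outfile = Path(out_dir) / f"perf_{time.strftime('%Y%m%d_%H%M%S')}.txt"
    return ["CVDiskPerf.exe", "-PATH", library_path, "-RANDOM",
            "-OUTFILE", str(outfile)]

# Example (run on the source media agent, from the Commvault Base folder):
# subprocess.run(build_cvdiskperf_cmd(
#     r"D:\DISKLIBRARY1\Folder_2021.04.21-11.23\CVMAGNETIC"), check=True)
```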

 

Furthermore, the Isilon’s performance will drop if it is close to full. Can you confirm how much free space you have on the source library?

 

Regards,

 

Chris Sumner

 

 

Userlevel 1
Badge +3

Head Office is 69% full and DR is 66% full.

The DiskPerf test has been running for about 30 mins; will update.

Thanks for all your help.

Userlevel 1
Badge +1

Thanks Shane,

 

There should not be any performance impact if the filer has that much free space.

 

The test is writing and reading less than 10GB of data, so if it is running for a while, this may mean there are underlying issues on the source library.

 

However, let’s wait for the test to complete before making any assumptions.

 

Regards,

 

Chris Sumner

Userlevel 6
Badge +13

Windows or Linux MAs?

I have found with Isilon not to rely on SmartConnect DNS round-robin, as Windows SMB connections tend to reuse existing open streams rather than open new ones, so you end up accessing fewer nodes to read or write data. It works well if you are servicing hundreds of clients, each opening a single stream, but not so much for a handful of media agents.

I found the best performance is achieved by having one mount path per Isilon node, using IP addresses for each mount path rather than hostnames, and ensuring the library is configured for round-robin. This allows Commvault to load-balance reads and writes across all nodes and bypass some of the SMB connection re-use logic that could limit connections.

I have seen with some of the ‘archive tier’ Isilon units that performance can be lackluster if you try to back up and copy at the same time. If you suspend backups, copies run fast, and vice versa.

 

 

Badge +3

Hi,

I have a lot of customers suffering the same pain! Not enough disks in the Isilon, and the disks are too big, so there are not enough spindles to achieve good reads once the data is spread across all the disks by the deduplication process!

Userlevel 1
Badge +3

Windows or Linux MAs?

I have found with Isilon not to rely on SmartConnect DNS round-robin, as Windows SMB connections tend to reuse existing open streams rather than open new ones, so you end up accessing fewer nodes to read or write data. It works well if you are servicing hundreds of clients, each opening a single stream, but not so much for a handful of media agents.

I found the best performance is achieved by having one mount path per Isilon node, using IP addresses for each mount path rather than hostnames, and ensuring the library is configured for round-robin. This allows Commvault to load-balance reads and writes across all nodes and bypass some of the SMB connection re-use logic that could limit connections.

I have seen with some of the ‘archive tier’ Isilon units that performance can be lackluster if you try to back up and copy at the same time. If you suspend backups, copies run fast, and vice versa.

 

 

Windows MAs, very strong ones at that.

Not exactly sure how the Isilon is configured in terms of load-balancing, but the tech that did the installation assures us it’s in line with best practice.

Userlevel 1
Badge +3

DiskPerf with DASH copies running completed, DiskPerf without is currently running.

 

---------------------------------------------------------
Report as of   : 21,April,21 09:09:36
Processor      : X64
OS             : Windows Server 2012
---------------------------------------------------------
DiskPerf Version        : 2.2
Path Used               : \\commvault1\backups$\gabstentest\DL
Performance type        : Create new
Read-Write type         : RANDOM
Block Size              : 65536
File Count              : 6
Thread Count            : 6
Block Count             : 16384
Total Bytes Written     : 6442450944
Total Bytes Read        : 6442450944
Total Bytes Deleted     : 6442450944
----
Time Taken to Create(S)     : 13153.09
Time Taken to Write&flush(S): 92.70
Time Taken to Read(S)       : 17035.82
Time Taken to Delete(S)     : 1.71
----
Per thread Throughput Create(GB/H)     : 0.27
Per thread Throughput Write(GB/H)      : 38.84
Per thread Throughput Read(GB/H)       : 0.21
Per thread Throughput Delete(GB/H)     : 2109.77
----
Throughput Create(GB/H)     : 1.64
Throughput Write(GB/H)      : 233.02
Throughput Read(GB/H)       : 1.27
Throughput Delete(GB/H)     : 12658.65

Userlevel 1
Badge +3

DiskPerf without DASH copies running completed

 

---------------------------------------------------------
Report as of   : 22,April,21 08:43:38
Processor      : X64
OS             : Windows Server 2012
---------------------------------------------------------
DiskPerf Version        : 2.2
Path Used               : \\commvault1\backups$\gabstentest\DL
Performance type        : Create new
Read-Write type         : RANDOM
Block Size              : 65536
File Count              : 6
Thread Count            : 6
Block Count             : 16384
Total Bytes Written     : 6442450944
Total Bytes Read        : 6442450944
Total Bytes Deleted     : 6442450944
----
Time Taken to Create(S)     : 7341.29
Time Taken to Write&flush(S): 48.47
Time Taken to Read(S)       : 4842.48
Time Taken to Delete(S)     : 1.10
----
Per thread Throughput Create(GB/H)     : 0.49
Per thread Throughput Write(GB/H)      : 74.27
Per thread Throughput Read(GB/H)       : 0.74
Per thread Throughput Delete(GB/H)     : 3264.65
----
Throughput Create(GB/H)     : 2.94
Throughput Write(GB/H)      : 445.61
Throughput Read(GB/H)       : 4.46
Throughput Delete(GB/H)     : 19587.89

Userlevel 1
Badge +1

Thanks Shane,

 

There definitely looks to be a read issue with the source library.

 

Each thread translates to one stream, so per stream you are seeing sub-optimal performance from the library.

 

Additional settings or a configuration change at this point will not give you the performance boost you need.

 

I would look at engaging your storage support to see why the read speeds are so slow.

 

You should also see very slow speeds if you attempt to copy a large file from the storage to the MA via the configured share. This, along with the two reports above, should be more than enough evidence for your storage support to investigate.

 

Regards,

 

Chris Sumner

Userlevel 1
Badge +3

Thanks, Chris. I have asked the customer to write and read a 20GB file in his own way; will update.

Userlevel 1
Badge +3

Hi,

I have a lot of customers suffering the same pain! Not enough disks in the Isilon, and the disks are too big, so there are not enough spindles to achieve good reads once the data is spread across all the disks by the deduplication process!

Hi Marco,

Do these customers have a plan for how to resolve or work around it?

Badge

I came across similar behavior some time ago when using data access nodes to protect CIFS data located on an Isilon filer (read operations). The Isilon administrator at the time told me this could be related to the Isilon SmartConnect DNS setup (and throttling). It could be worth exploring this path further.

 

Isilon read performance while using one access node (60.79 GB/hour):

 

Isilon read performance while using two access nodes (197.45 GB/hour or 98GB/hour each):

 

Can you share some details on how the setup has been done exactly? Are you accessing the Isilon as a single mount path? How many media agents are connected? How many nodes are in the Isilon cluster?

Userlevel 1
Badge +3

Can you share some details on how the setup has been done exactly? 

Is there a simple way for me to get this from the Isilon admin?

We tried a single mount path, and when the performance proved to be sub-par, we created an additional 4 mount paths, which didn’t make any difference.

But I believe they are using Round Robin DNS for load-balancing.

I am waiting on the customer for the count of nodes and spindles.

Thank you.

Userlevel 1
Badge +3

How many media agents are connected? How many nodes are in the Isilon cluster?

There are 2 Media Agents connected.

10 nodes in the cluster, each with 20 drives.

They are on AOS 8.2.1.0., if it matters.

Userlevel 1
Badge +2

Hello Shane,

Your storage admin would be the best person to answer how exactly it is configured on the backend as we would not see it on the MA OS side or within the Commvault Software. I am sending you a document from Dell’s website about setting up and best practice with Isilon and our software, please use the link below. Based on the performance speeds of the tools ran, I would strongly suggest reaching out to the vendor to ensure that you have the optimal configuration that is listed in the documentation and to see if there are any issues that are causing the slowness.

 

Dell EMC Isilon: Backup Using Commvault (delltechnologies.com)

 

Thank You
