How does HyperScale deal with node failures?
One of the three nodes in our cluster (HS 2300 HSX appliance) failed because of an issue with its M.2 NVMe boot drive. The cluster runs Hedvig 4.7 with roughly 88 TB of usable capacity, hosted on SuperMicro servers.
We noticed that all of our backup jobs had entered a waiting state. There was no indication that this happened immediately after the node failure; the jobs only stalled a few hours later.
When we checked RF3 disk usage on the nodes, we found that the RF3 secondary space on nodes 2 and 3 was heavily used.
Commvault TAC told us that after a node failure the Hedvig file system can only absorb a limited amount of new data, which in our case was 7 TB. Beyond that point the file system stops accepting writes. That doesn't seem fully resilient: the cluster is effectively capped at 7 TB of new writes after a node failure, even though the remaining nodes still have plenty of free capacity.
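To check my understanding, here is a toy model of what I think TAC was describing. This is my own interpretation, not documented Hedvig/HyperScale placement logic, and the 7 TB limit is just the figure quoted for our cluster (the per-node numbers are from the output further down):

# Toy model of what I *think* happens - my interpretation, not actual Hedvig logic.
# With replication factor 3 and one of three nodes down, the copies that would have
# landed on the failed node get parked in "RF3 secondary" space on the two surviving
# nodes, and once that space reaches a fixed limit new writes are refused.

RF3_SECONDARY_LIMIT_TB = 7.0                        # assumption: the limit TAC quoted for us
secondary_used_tb = {"cvhs02": 0.0, "cvhs03": 0.0}  # RF3 secondary space used per surviving node

def write(size_tb):
    """Try to place one write; return False once the RF3 secondary limit is reached."""
    if sum(secondary_used_tb.values()) + size_tb > RF3_SECONDARY_LIMIT_TB:
        return False                                # file system stops accepting writes
    target = min(secondary_used_tb, key=secondary_used_tb.get)  # spread roughly evenly
    secondary_used_tb[target] += size_tb
    return True

written = 0.0
while write(0.1):                                   # keep writing 100 GB chunks until we stall
    written += 0.1

print(f"Writes stalled after ~{written:.1f} TB of new data")
print(", ".join(f"{node}: {tb:.2f} TB" for node, tb in secondary_used_tb.items()))
# Ends up around 3.5 TB on each surviving node, close to the 3.7 TB per node we actually saw.

If that model is right, the cap is on the RF3 secondary space itself rather than on the free capacity of the surviving nodes, which would explain why writes stop even with space left over.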
I looked through the online documentation and searched the web for how CV handles node failures, but didn't find anything useful.
Has anyone run into a similar situation and can explain how the HyperScale architecture operates and handles node failures?
HV150832> getrf3secondarysize
| Sid              | Host Name        | RF3SecondarySize |
|------------------|------------------|------------------|
| 03949f8f783ce6e  | cvhs03.xxx.local | 3.7 TB           |
| 4054b0edd97f7dc1 | cvhs02.xxx.local | 3.7 TB           |
Total RF3 Secondary space used in the cluster: 7.5 TB
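In case it helps anyone watching the same counter, here is a rough sketch for pulling the per-node sizes out of that output and reporting the remaining headroom. The row format is assumed from the paste above, and the 7 TB threshold is just the figure TAC gave us for our cluster:

import re
import sys

RF3_SECONDARY_LIMIT_TB = 7.0   # assumption: the limit TAC quoted for our cluster

def parse_rf3_secondary(output):
    """Pull per-node RF3 secondary sizes (TB) out of getrf3secondarysize output.

    Assumes data rows look like:  | <sid> | <hostname> | 3.7 TB |
    """
    sizes = {}
    for line in output.splitlines():
        m = re.match(r"\s*\|\s*\S+\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*TB", line)
        if m:
            sizes[m.group(1)] = float(m.group(2))
    return sizes

if __name__ == "__main__":
    # Pipe or paste the command output in via stdin.
    sizes = parse_rf3_secondary(sys.stdin.read())
    total = sum(sizes.values())
    for host, tb in sorted(sizes.items()):
        print(f"{host}: {tb:.1f} TB")
    headroom = RF3_SECONDARY_LIMIT_TB - total
    print(f"Total RF3 secondary used: {total:.1f} TB, "
          f"headroom before writes stop: {max(headroom, 0):.1f} TB")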
Thanks