
How does HyperScale deal with node failures?

 

An issue with the M.2 NVMe boot drive caused the failure of one of the nodes in our three-node cluster (HS 2300 HSX appliance). We are running Hedvig version 4.7 with just over 88 TB of usable capacity, hosted on SuperMicro servers.

 

We saw that all of our backup jobs had entered a waiting state (there was no indication of this right away; rather, the jobs stopped after a few hours).

 

When we checked RF3 disk usage on the nodes, we found that the secondary space on nodes 2 and 3 was heavily used.

Commvault TAC informed us that after a node failure, the Hedvig file system can only accept a limited amount of new data, which in our case was 7 TB. Beyond that point the file system no longer accepts writes, which doesn't look completely resilient: the cluster gets hamstrung by a 7 TB cap even though there is plenty of capacity left on the remaining nodes.

 

I looked through the online documentation and searched the web to see how CV handles node failures, but I didn't find anything useful.

 

Has anyone encountered a similar situation who could explain how the HyperScale architecture operates and handles node failures?


HV150832> getrf3secondarysize
______________________________________________________________
| Sid              | Host Name        | RF3SecondarySize |
|==================|==================|==================|
| 03949f8f783ce6e  | cvhs03.xxx.local | 3.7 TB           |
| 4054b0edd97f7dc1 | cvhs02.xxx.local | 3.7 TB           |
Total RF3 Secondary space used in the cluster: 7.5 TB
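For reference, the 7.5 TB total is just the per-node RF3SecondarySize values added up (the displayed 3.7 TB figures are presumably rounded). A minimal Python sketch, assuming the plain-text layout shown above, that tallies the column:

import re

# Two data rows copied from the getrf3secondarysize output above.
raw = """
| 03949f8f783ce6e  | cvhs03.xxx.local | 3.7 TB           |
| 4054b0edd97f7dc1 | cvhs02.xxx.local | 3.7 TB           |
"""

# Pull the "<number> TB" figure out of each row and sum it.
sizes_tb = [float(m) for m in re.findall(r"\|\s*([\d.]+)\s*TB\s*\|", raw)]
print(f"Total RF3 Secondary space used: {sum(sizes_tb):.1f} TB")
# Prints 7.4 TB here; the CLI reports 7.5 TB, presumably because it sums
# the exact per-node values before rounding.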

 

Thanks

Hello Nik

As you may be aware, HyperScale X clusters use Erasure Coding (4+2) to achieve resiliency to node and drive failures. In a three-node cluster, when one of the nodes fails, the other two nodes cannot continue using Erasure Coding for new writes while the node is down, because of the quorum requirements that Erasure Coding demands. To overcome this, HyperScale X automatically switches to a higher-resiliency mode (Replication Factor 3, or RF3) so that new writes continue to meet the resiliency SLA.

All data written to the cluster prior to the node failure is unaffected by this switch (it remains in Erasure Coding format), and there is no impact to reads of that older data either. The automatic switch ensures that newer writes (after the node failure) are stored in a resilient format despite the loss of a node.

When the offline node is recovered, the file system automatically merges the RF3 data back to Erasure Coding so that all data is stored in the more space-efficient resilient format. This conversion needs extra buffer space in the cluster, which is why the file system reserves appropriate capacity (depending on the cluster size and how much data was written while the node was offline) to ensure the conversion happens smoothly. I hope this clarifies why the cluster is not allowed to fill up completely while operating in a degraded (one node down) mode.
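To put rough numbers on this, here is an illustrative Python sketch (back-of-the-envelope assumptions only: EC 4+2 costs 1.5x the logical size in raw capacity, RF3 costs 3x; metadata, compression and deduplication are ignored) comparing the raw cost of writing the same logical data under the two layouts:

def raw_usage_tb(logical_tb: float, scheme: str) -> float:
    """Approximate raw capacity consumed when writing logical_tb of data.

    Illustrative assumptions: EC 4+2 writes 4 data + 2 parity chunks
    (1.5x the logical size); RF3 writes 3 full replicas (3x).
    """
    overhead = {"ec_4_2": 6 / 4, "rf3": 3.0}
    return logical_tb * overhead[scheme]

for logical in (1, 2.5, 5):
    ec = raw_usage_tb(logical, "ec_4_2")
    rf3 = raw_usage_tb(logical, "rf3")
    print(f"{logical:>4} TB logical -> {ec:.1f} TB raw as EC 4+2, {rf3:.1f} TB raw as RF3")
# Every TB written while a node is down costs roughly twice the usual raw space,
# and the file system also reserves buffer capacity for the later RF3-to-EC merge,
# which is why new writes are capped in degraded mode.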

~Pavan Bedadala


@Pavan Bedadala @All

Thank you very much for responding; I understand what you mean now. It appears the algorithm switches to replication mode for all data written after a node failure, keeping three copies of any RF3 data written rather than performing 4+2 erasure coding.

 

If that's okay, I'd like to ask you a few more questions.

I apologise for the long list of questions, but we ran into an unexpected problem when backup jobs froze after a node failure, and the answers will help us understand these aspects of the HyperScale solution.

 

  1. Is there a tool I can use to calculate our cluster's secondary space cap?
  2. As I understand it, any data written with RF3 will have three copies spread across different nodes; however, if only two nodes are available (due to a node failure), where does the third copy reside? Would it consume the secondary space to fulfil the RF requirements (and if so, won't this cause the space to run out much faster)?

  3. Is the solution capable of handling only two disk failures because it employs 4+2 EC?
    What happens if a disk on another node fails while a node is already down?
  4. How much of the cluster's capacity must be consumed before the file system enters read-only mode?

    Based on my previous experience with other HCI products, the limit is between 92% and 95%, after which they stop writing data to avoid loss. We will set our monitoring system's thresholds based on your feedback so that we don't accidentally push the cluster into read-only mode.

  5. CV support discovered that snapshot copy jobs were unable to proceed because they were waiting in the Archive Index phase for metadata to be written and for the index database to become available (unfortunately, the DB was on the failed node). The jobs resumed after the node was brought back online. Have you come across this issue before? Do you have any recommendations or workarounds that would allow jobs to continue running without noticeable interruption in the event of a node failure?

Appreciate your help here

 

-Nik

 


Following up on this one. Could someone please assist me with my queries? 

Appreciate your help!


@Nik I’m following up with @Pavan Bedadala .


Hello Nik

My responses to your questions are below.

  1. I am assuming your request is for a tool that can determine how much space is reserved by the file system when a node fails. Unfortunately, no such tool exists. If you think this information is important for running your HyperScale X cluster efficiently, I can follow up with engineering. Even if we cannot offer a tool, we can provide guidance for a calculation based on the getrf3secondarysize output; perhaps that is a good starting point (see the sketch after this list for one rough way to track these limits in the meantime).
  2. HyperScale X distributes writes (data and parity chunks) across drives and nodes. In clusters with fewer than 6 nodes, some of the chunks can be written to multiple drives within the same node.
  3. Yes, the cluster can tolerate two drive failures at the same time. If one node fails and then an additional drive fails, the cluster does not allow any further writes.
  4. The usable capacity of a HyperScale X cluster already accounts for the buffer capacity needs of the file system; however, at 97% full the file system does not allow any more writes. Please do not mistake this figure for storage and capacity planning guidance. Our best practice is to begin reviewing and planning for additional storage at least six months before the cluster is expected to be full. With supply chain constraints, it is better to leave a sufficient buffer to place an order for an additional node and expand the cluster. By default, the HyperScale X dashboard alerts you when the cluster is expected to be full within one month; you can customize the alert to a value appropriate for your environment.
  5. You are right about the observed behavior. It is possible for a job to fail in the Archive Index phase when a node fails. Would you mind creating a modification request to fix this behavior? You can work with your account team to make this request.
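As a rough starting point while no official tool exists, the Python sketch below (assumptions: the 97% full limit from point 4 applies, and the degraded-mode RF3 limit is site-specific; the 7 TB figure TAC quoted for this cluster is used purely as a placeholder) shows how you might alert on both limits from your own monitoring:

def check_capacity(used_tb: float, usable_tb: float,
                   rf3_secondary_tb: float, rf3_cap_tb: float) -> list[str]:
    """Return warnings for the two limits discussed in this thread.

    Assumptions: writes stop at 97% full (point 4 above); rf3_cap_tb is
    whatever degraded-mode limit applies to your cluster (7 TB is only an
    example quoted by TAC for this specific cluster).
    """
    alerts = []
    if used_tb / usable_tb >= 0.90:
        alerts.append(f"Cluster is {used_tb / usable_tb:.0%} full; writes stop at 97%")
    if rf3_secondary_tb >= 0.8 * rf3_cap_tb:
        alerts.append(f"RF3 secondary usage of {rf3_secondary_tb} TB is approaching "
                      f"the {rf3_cap_tb} TB degraded-mode limit")
    return alerts

# Example with the figures from this thread: 88 TB usable, 7.5 TB RF3 secondary
# space, and a placeholder 7 TB degraded-mode cap.
for alert in check_capacity(used_tb=60, usable_tb=88,
                            rf3_secondary_tb=7.5, rf3_cap_tb=7.0):
    print(alert)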

I appreciate the detailed response @Pavan Bedadala

  1. With the ability to see how much of the cluster's RF3 secondary space has been allocated and how much is being used, we can better plan for the eventuality of a node failure and determine how much data can be written to the cluster at any given time during a hardware failure. If you can provide some calculation guidance based on the output of getrf3secondarysize, that would be very helpful.
  2. Good to know, as other HCI vendors tend to cap utilisation at around 90%. So you're saying that metadata and garbage data don't eat up any more of the usable storage space, right?

  3. After the node was replaced, the backups picked up where they left off, but I'm still concerned enough to file a service request to have the configuration reviewed and make sure this doesn't happen again. I have no idea if that is even possible; however, I believe it is crucial that we have a resilient index DB so that jobs keep running regardless of hardware failures, otherwise the resilient clustering would be counterproductive.


Hello Nik

  1. Let me follow up with engineering on the possibilities. If the math is simple to understand or estimate, then we can perhaps document this guidance.
  2. I have discussed the issue you faced with jobs restarting from the beginning after an Archive Index phase failure. We have identified an issue that can cause this behavior and are working on a solution.

Thank you, @Pavan Bedadala. I appreciate your assistance here. Let me know how you go.


I would like to jump on this thread with a specific use case: a 6-node cluster.
Our cluster was initially created as a 3-node cluster, with 3 more nodes added immediately afterwards, which means that some Hedvig services are only running on 3 of the 6 nodes.

Some time ago we ran into an issue where it took unexpectedly long to fix a node failure, and RF3 data critically filled up the hard disks. During a recent issue with another node, support mentioned that this RF3 behaviour only applies to clusters created with 3 nodes, and that a 6-node cluster continues to do erasure coding without increased disk usage.

Can anyone confirm whether a cluster set up with 6 nodes will continue to do erasure coding even when one node fails, since the 5 remaining nodes should be sufficient for 4+2 erasure coding? Is there any case where a 6-node setup will still fall back to RF3 due to the loss of nodes or components?

