Question

HyperScale failed disk

  • 3 November 2023


Hello all.

 

Received a failed disk alert from our HyperScale installation:

 

    Alert: HyperScale-HardwareAlerts
    Type: Custom Rules - HyperScale-HardwareAlerts
    Detected Time: Thu Nov  2 22:45:33 2023
    CommCell: XXX
    User: Not Applicable
    Alert Rule Name: HyperScale-HardwareAlerts
    Storage Pool: XXXX
    Host Name: XXXX
    Hardware Entity name: /ws/disk14
    Hardware Entity Status: Offline
    Commcell Id: XXX
    Node Serial Number: XXXX
    Emails to contact: Not Applicable
    Condition Cleared: Not Applicable

 

 

I checked the disk in the Command Center, and it was reported as failed.

Needed to reboot the server for another reason, and when the server was up and running again, the disk failure was cleared and the disk was no longer reported as failed.

 

This is a bit of a mystery to me, so what should I believe? Is the disk bad? Is my data at risk? Does the Commvault/HyperScale software have a bug that reports disk failures incorrectly? Or can anyone explain what is going on?

 

Regards

-Anders 


3 replies


Hello @ApK 

You wouldn’t risk data loss from a single failed disk. The distributed storage is tolerant of multiple failed disks. You can use the command below to check the physical disk: #smartctl -a /dev/<disk>

For example: #smartctl -a /dev/sdj
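
To check several disks at once, the overall health verdict can be pulled out of `smartctl -H`. The following is only a rough sketch: the function names `smart_verdict` and `check_disks` are made up for illustration, the device names are placeholders, and it assumes the usual health lines that smartctl prints for ATA and SCSI/SAS disks.

```shell
# Sketch only: classify the overall-health line from `smartctl -H` output.
# Reads smartctl output on stdin and prints PASSED, FAILED, or UNKNOWN.
smart_verdict() (
    verdict="UNKNOWN"
    while IFS= read -r line; do
        case "$line" in
            # ATA disks report "test result: PASSED"; SCSI/SAS disks report "OK".
            *"test result: PASSED"*|*"SMART Health Status: OK"*)
                verdict="PASSED" ;;
            # Any other health result is treated as a failure.
            *"test result: FAILED"*|*"SMART Health Status:"*)
                verdict="FAILED" ;;
        esac
    done
    printf '%s\n' "$verdict"
)

# Hypothetical helper: print a verdict for each device passed as an argument.
# Must be run as root on the node that owns the disks.
check_disks() (
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" "$(smartctl -H "$dev" 2>/dev/null | smart_verdict)"
    done
)

# Example (placeholder device names):
#   check_disks /dev/sdj /dev/sdk
```

An UNKNOWN result means no health line was found, for example because the device does not exist or smartctl could not query it, so it is worth looking at the full `smartctl -a` output in that case.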

 

Thank you,
Collin


Hi @Collin Harper .

 

Sure, I know I won’t risk data if one or two disks fail.

 

My main concern is why the disk was reported as failed, and why the failed disk status was cleared after a reboot.

I had 6 disks reported as failed by Commvault, so I rebooted all servers with reported failed disks, to check if the disk failed status was cleared after the reboot. And the failed disk status was for sure cleared after the reboot.

 

So I’m wondering why the disks were marked as failed in the first place, and why the failed disk status is cleared by a reboot. And now I am wondering whether I’m running with 6 disks that potentially have errors, with the potential for data loss.

 

Regards

-Anders


Hello @ApK 

I cannot say why the alert was cleared after the reboot, but I would recommend running the “smartctl” commands to verify the integrity of the disks. If they are truly facing imminent failure, I would recommend opening a ticket with Support to investigate why the alerts are cleared even though the disk is in fact about to fail.
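
As one possible way to look beyond the overall verdict, the attribute table from `smartctl -A` can be filtered for counters that often precede a failure. This is a sketch under assumptions: `flag_bad_attributes` is a made-up name, the three attribute names are the common ATA ones (SAS disks report different fields, so this would print nothing for them), and a non-zero raw value is only a hint, not proof of a bad disk.

```shell
# Sketch only: read `smartctl -A` output on stdin and print
# "<attribute> <raw value>" for watched ATA attributes with a non-zero raw value.
# Column 2 is ATTRIBUTE_NAME and column 10 is RAW_VALUE in the ATA attribute table.
flag_bad_attributes() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2, $10
        }
    '
}

# Example (placeholder device name, run as root):
#   smartctl -A /dev/sdj | flag_bad_attributes
```

No output means none of the watched counters is above zero on that disk.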

 

Thank you,

Collin
