Question

HSX: failed disk was not correctly replaced

  • 9 November 2022
  • 6 replies
  • 272 views

Userlevel 1
Badge +4

We seem to be running into multiple problems in our new HSX environment. Metadata disk d2 silently filled to 100% on one of three nodes, and the data disks are all at 90-95%, but the GUI shows only 550 of 720 TB as used. There was not a single alert for this; everything is green in the GUI.
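As a stopgap while the GUI raises no alerts, fill levels can be checked from the node itself. The following is only a minimal sketch: it assumes the disks are mounted under /hedvig/dNN (as in the mount points that appear in this thread), the function name `overfull_mounts` and the 90% threshold are made up for illustration, and it uses nothing HSX-specific, only the Python standard library.

```python
import shutil
from pathlib import Path


def overfull_mounts(root="/hedvig", pattern="d*", threshold=0.90):
    """Return (path, used_fraction) for each matching directory at or above threshold.

    root, pattern and threshold are illustrative defaults, not HSX settings.
    """
    results = []
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_dir():
            continue
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        fraction = usage.used / usage.total if usage.total else 0.0
        if fraction >= threshold:
            results.append((str(path), fraction))
    return results


if __name__ == "__main__":
    for path, fraction in overfull_mounts():
        print(f"{path}: {fraction:.0%} used")
```

Note that `shutil.disk_usage` reports on whatever filesystem the directory lives on, so a directory that is not actually mounted will show the usage of its parent filesystem. The percentage can also differ slightly from what `df` prints, because `df` accounts for reserved blocks differently.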

 

And then there is disk d22 / sdv on one node, which failed a few weeks ago and was replaced together with support. The GUI shows it as mounted, but in reality it’s not.

 

sdu                            65:64   0  16.4T  0 disk /hedvig/d21
sdv                            65:80   0  16.4T  0 disk
sdw                            65:96   0  16.4T  0 disk /hedvig/d23
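With many disks per node, a missing mount point is easy to overlook in a listing like this. A minimal sketch that flags such rows, assuming the default `lsblk` column layout (NAME, MAJ:MIN, RM, SIZE, RO, TYPE, MOUNTPOINT); the function name `unmounted_disks` is hypothetical:

```python
def unmounted_disks(lsblk_output):
    """Return the names of 'disk' rows that have no mountpoint column."""
    missing = []
    for line in lsblk_output.splitlines():
        fields = line.split()
        # Expected columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
        if len(fields) < 6 or fields[5] != "disk":
            continue  # header, blank line, or a non-disk row such as a partition
        if len(fields) == 6:  # TYPE is the last column, so nothing is mounted
            missing.append(fields[0])
    return missing


sample = """\
sdu                            65:64   0  16.4T  0 disk /hedvig/d21
sdv                            65:80   0  16.4T  0 disk
sdw                            65:96   0  16.4T  0 disk /hedvig/d23
"""
print(unmounted_disks(sample))  # ['sdv']
```

A disk that carries partitions or LVM volumes also has no mount point of its own, so on a real node the OS disks would need to be filtered out of the result.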
 

I followed Replacing Disks in an HyperScale X Reference Architecture Node (commvault.com) but the disk is not mounted.

 

Nov 9 11:22:38 sdes1701-dp systemd: Dependency failed for /hedvig/d22.
Nov 9 11:22:38 sdes1701-dp systemd: Job hedvig-d22.mount/start failed with result 'dependency'.
Nov 9 11:22:38 sdes1701-dp systemd: Job dev-disk-by\x2duuid-dfcc3e6c\x2d8152\x2d42b2\x2db0a1\x2d6742d4748d3c.device/start failed with result 'timeout'.
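The device unit name in the last message is an escaped path: systemd writes "/" as "-" and a literal "-" as `\x2d`. Decoding it shows which device node the mount was waiting for. The sketch below is a simplified decoder (the function name `unit_to_path` is hypothetical, and it does not handle the special root unit `-.mount`); `systemd-escape --unescape --path` does the same job on the node itself.

```python
import os
import re


def unit_to_path(unit_name):
    """Convert an escaped systemd .device/.mount unit name back to a path."""
    stem = unit_name.rsplit(".", 1)[0]  # drop the unit-type suffix
    stem = stem.replace("-", "/")       # an unescaped '-' separates path components
    # \xNN sequences encode literal characters, e.g. \x2d is '-'
    decoded = re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        stem,
    )
    return "/" + decoded


unit = r"dev-disk-by\x2duuid-dfcc3e6c\x2d8152\x2d42b2\x2db0a1\x2d6742d4748d3c.device"
path = unit_to_path(unit)
print(path)  # /dev/disk/by-uuid/dfcc3e6c-8152-42b2-b0a1-6742d4748d3c
print("present" if os.path.exists(path) else "missing")
```

If the path is missing, no block device on the node currently carries a filesystem with that UUID. That would explain the timeout, although it does not by itself say why the replacement disk ended up without it.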

 

Any idea what to do until support calls me?


6 replies

Userlevel 1
Badge +4

Okay, we had to create a fresh XFS filesystem on the new disk and preserve the old UUID. Something went wrong during the disk replacement.
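For anyone landing here with the same symptom, this is roughly what such a step could look like. It is a sketch only and not necessarily the exact command that was run: it assumes the filesystem sits directly on the whole disk (as the other /hedvig/dNN disks appear to), it uses the `-m uuid=` option of `mkfs.xfs`, and any additional mkfs options HSX may expect are unknown here. The helper name `mkfs_xfs_command` is hypothetical. It only builds and prints the command, because running `mkfs.xfs` destroys whatever is on the target device.

```python
import shlex
import uuid


def mkfs_xfs_command(device, fs_uuid):
    """Build (but do not run) an mkfs.xfs command that sets a specific filesystem UUID."""
    canonical = str(uuid.UUID(fs_uuid))  # raises ValueError on a malformed UUID
    if not device.startswith("/dev/"):
        raise ValueError(f"not a device path: {device!r}")
    return ["mkfs.xfs", "-m", f"uuid={canonical}", device]


# Device and UUID as they appear earlier in this thread; double-check both before use.
cmd = mkfs_xfs_command("/dev/sdv", "dfcc3e6c-8152-42b2-b0a1-6742d4748d3c")
print(shlex.join(cmd))
# mkfs.xfs -m uuid=dfcc3e6c-8152-42b2-b0a1-6742d4748d3c /dev/sdv
```

Device letters can change across reboots, so confirm that the device name really points at the replaced disk, and ideally do this together with support.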

Userlevel 7
Badge +23

Thanks for the quick answer!

Userlevel 1
Badge +4

HSX is really a nice solution, _but_ we need better monitoring capabilities and better and more documentation/KB articles on what to do if something goes wrong. Currently we have to open a case for everything that does not work as expected. Maybe I’m missing something, but I did not find anything about this disk problem, for example.

Userlevel 7
Badge +23

@pirx, can you share the case number? I can follow up and see the status.

Userlevel 1
Badge +4


Yes, but I was too optimistic yesterday. The disk was mounted but not used by HSX; no files were created. I restarted the cluster and rebooted the node (due to other issues), and after that a directory “handle” was created, but still no files or data.

So I had to create another case, which is currently being worked on. It’s a pity that nobody really checks whether everything is correct at the end, and I just don’t have the knowledge to see what is missing. HSX is really a black box, and the GUI does not always reflect reality. I have good Linux knowledge, but I don’t know how to do the simple task of checking whether a replaced disk is really part of Hedvig.

 

221110-283  case to reopen 221109-316

221109-316 yesterday’s case, now active again

220906-341 original case where something went wrong, but in GUI everything looked ok, no indication that anything has gone wrong

 

I’ve never used a storage solution where the simple task of replacing a failed disk went this wrong. I know the documented steps seem easy, but as we see, there seem to be no checks during the process, and if something goes wrong, it’s complicated to get everything running again.

Userlevel 7
Badge +23

Everything is easy until it doesn’t work 🤓

I’ll keep an eye out, though feel free to share any updates here.
