Question

HSX: failed disk was not correctly replaced

pirx
9 November 2022

We seem to be running into multiple problems in our new HSX environment. Metadata disk d2 silently filled to 100% on one of three nodes, and the data disks are all at 90-95%, yet the GUI shows only 550 of 720 TB as used. There was not a single alert for this; everything is green in the GUI.
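Since the GUI raised no alert here, an independent fill-level check on the node itself can act as a stopgap. The following is a minimal sketch, assuming the disks are mounted as `/hedvig/d<N>` (as in the `lsblk` output in this post); the function name `find_full_mounts` and the 90% threshold are illustrative choices, not part of HSX.

```python
import shutil
from pathlib import Path


def find_full_mounts(root="/hedvig", threshold=0.90, usage=shutil.disk_usage):
    """Return (path, used_fraction) for each d<N> directory at or above threshold.

    `usage` is injectable so the function can be exercised without real disks.
    Note: if a d<N> directory is NOT actually mounted, disk_usage reports the
    parent filesystem, so combine this with a mount check.
    """
    full = []
    root_path = Path(root)
    if not root_path.is_dir():
        return full
    for path in sorted(root_path.glob("d[0-9]*")):
        if not path.is_dir():
            continue
        stats = usage(path)
        if stats.total == 0:
            continue
        used_fraction = stats.used / stats.total
        if used_fraction >= threshold:
            full.append((str(path), used_fraction))
    return full


if __name__ == "__main__":
    for path, fraction in find_full_mounts():
        print(f"{path}: {fraction:.0%} used")
```

Run from cron or a monitoring agent, this would have flagged d2 well before it reached 100%. It only reads filesystem statistics and does not touch Hedvig data.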

And then there is disk d22 / sdv on one node, which failed a few weeks ago and was replaced together with support. The GUI shows it as mounted, but in reality it’s not.

sdu 65:64 0 16.4T 0 disk /hedvig/d21
sdv 65:80 0 16.4T 0 disk
sdw 65:96 0 16.4T 0 disk /hedvig/d23
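To spot such a disk across many devices without reading the listing by eye, the output can be filtered. This is a small sketch that assumes the default `lsblk` column order (NAME, MAJ:MIN, RM, SIZE, RO, TYPE, MOUNTPOINT); the helper name `unmounted_disks` is made up for illustration.

```python
def unmounted_disks(lsblk_text):
    """Return names of rows of type 'disk' that have no mount point column."""
    names = []
    for line in lsblk_text.splitlines():
        fields = line.split()
        # Expected columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
        # A row with exactly six fields has an empty MOUNTPOINT column.
        if len(fields) == 6 and fields[5] == "disk":
            names.append(fields[0])
    return names


sample = """\
sdu 65:64 0 16.4T 0 disk /hedvig/d21
sdv 65:80 0 16.4T 0 disk
sdw 65:96 0 16.4T 0 disk /hedvig/d23
"""
print(unmounted_disks(sample))  # ['sdv']
```

A disk that carries partitions would also show up here, because only its partitions have mount points, so the result is a list of candidates to look at rather than a verdict.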

I followed Replacing Disks in an HyperScale X Reference Architecture Node (commvault.com), but the disk is still not mounted.

Nov 9 11:22:38 sdes1701-dp systemd: Dependency failed for /hedvig/d22.
Nov 9 11:22:38 sdes1701-dp systemd: Job hedvig-d22.mount/start failed with result 'dependency'.
Nov 9 11:22:38 sdes1701-dp systemd: Job dev-disk-by\x2duuid-dfcc3e6c\x2d8152\x2d42b2\x2db0a1\x2d6742d4748d3c.device/start failed with result 'timeout'.
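The escaped unit name in the last log entry hides which device systemd was waiting for. systemd encodes path separators as `-` and literal characters such as `-` as `\x2d`. A short sketch that reverses this escaping (the helper `unit_to_path` is illustrative, not a systemd API):

```python
import re


def unit_to_path(unit_name):
    """Convert an escaped systemd device/mount unit name back to a path."""
    stem = unit_name.rsplit(".", 1)[0]  # drop the ".device" / ".mount" suffix
    stem = stem.replace("-", "/")       # an unescaped "-" separates path components
    # "\xNN" sequences encode literal characters, e.g. "\x2d" is "-"
    decoded = re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        stem,
    )
    return "/" + decoded


unit = r"dev-disk-by\x2duuid-dfcc3e6c\x2d8152\x2d42b2\x2db0a1\x2d6742d4748d3c.device"
print(unit_to_path(unit))
# /dev/disk/by-uuid/dfcc3e6c-8152-42b2-b0a1-6742d4748d3c
```

Read this way, the log says that the mount of `/hedvig/d22` depends on a filesystem with UUID `dfcc3e6c-8152-42b2-b0a1-6742d4748d3c`, and that no device with that UUID appeared before the timeout.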

Any idea what to do until support calls me?

6 replies

pirx
Author
9 November 2022

Okay, we had to create a fresh XFS filesystem on the new disk and preserve the old UUID. Something went wrong during the disk replacement.
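For anyone hitting the same symptom, here is a rough sketch of what "fresh XFS with the old UUID" can look like. It only builds and prints the command, it does not run it. The device `/dev/sdv` and the UUID are taken from the log and `lsblk` output in the first post and are assumptions about this particular node; the exact command support used is not known. `-m uuid=` needs a reasonably recent xfsprogs, and `mkfs.xfs -f` destroys whatever is on the device, so double-check the device name with support first.

```python
import shlex
import uuid


def mkfs_xfs_command(device, fs_uuid):
    """Build (but do not run) an mkfs.xfs command that sets a given UUID."""
    # uuid.UUID raises ValueError on malformed input, so typos are caught early.
    canonical = str(uuid.UUID(fs_uuid))
    # -f overwrites an existing filesystem signature (destructive);
    # -m uuid=... sets the UUID of the newly created filesystem.
    return ["mkfs.xfs", "-f", "-m", f"uuid={canonical}", device]


# Assumed values: device from the lsblk output, UUID from the systemd log.
cmd = mkfs_xfs_command("/dev/sdv", "dfcc3e6c-8152-42b2-b0a1-6742d4748d3c")
print(shlex.join(cmd))  # dry run: prints the command for review only
```

Keeping the old UUID means the existing mount configuration that refers to the disk by UUID keeps working without being edited.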

Userlevel 7

+23

Thanks for the quick answer!

pirx
Author
10 November 2022

HSX is really a nice solution, _but_ we need better monitoring capabilities and better (and more) documentation and KB articles on what to do if something goes wrong. Currently we have to open a case for everything that does not work as expected. Maybe I’m missing something, but I did not find anything about this disk problem, for example.

Userlevel 7

+23

@pirx , can you share the case number? I can follow up and see the status.

pirx
Author
10 November 2022

Yes, but I was too optimistic yesterday. The disk was mounted but not used by HSX; no files were created. I restarted the cluster and rebooted the node (due to other issues), and after that a directory “handle” was created, but still no files or data.

So I had to create another case, which is currently being worked on. It’s a pity that nobody really checks whether everything is correct at the end, and I just don’t have the knowledge to see what is missing. HSX is really a black box, and the GUI simply does not always reflect reality. I have good Linux knowledge, but I don’t know how to do the simple task of checking whether a replaced disk is really part of Hedvig.
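Lacking an official check, the symptom described above (mounted, but no files written) can at least be detected from the OS side. This is a minimal sketch, not a Hedvig membership test: it only reports whether a path is a real mount point and whether any files exist beneath it. The function name `mount_activity` and the `limit` parameter are illustrative.

```python
import os


def mount_activity(mount_point, limit=1000):
    """Report whether a path is a mount point and roughly how many files it holds.

    Walking stops once at least `limit` files have been seen, so a healthy,
    well-filled data disk does not cause a long scan.
    """
    file_count = 0
    for _dirpath, _dirnames, filenames in os.walk(mount_point):
        file_count += len(filenames)
        if file_count >= limit:
            break
    return {
        "is_mount": os.path.ismount(mount_point),
        "files": file_count,
    }


if __name__ == "__main__":
    # /hedvig/d22 is the replaced disk from this thread.
    print(mount_activity("/hedvig/d22"))
```

Comparing the result for the replaced disk with a known-good neighbour such as `/hedvig/d21` shows quickly whether the new disk is receiving data. A disk that reports `is_mount: True` but zero files long after the replacement matches the situation described here.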

221110-283 case to reopen 221109-316

221109-316 yesterday’s case, now active again

220906-341 original case where something went wrong, but in GUI everything looked ok, no indication that anything has gone wrong

I’ve never used a storage solution where the simple task of replacing a failed disk went this wrong. I know that the documented steps seem easy, but as we see, there seem to be no checks during the process, and if something goes wrong, it’s complicated to get everything running again.

Userlevel 7

+23

Everything is easy until it doesn’t work 🤓

I’ll keep an eye out, though feel free to share any updates here.
