Solved

How to troubleshoot crashing/freezing HyperScale node(s)?

  • 12 January 2021

Hi,


I’d like some guidance on how to troubleshoot HyperScale nodes when they freeze or crash. Is there anything I should be looking out for?

 

Further information or tips would be much appreciated.

 

Kind regards,

Jon

Best answer by Winston W 12 January 2021, 06:11

Hi Jon 

 

For issues where HyperScale nodes freeze or crash, we usually recommend setting up Kdump. This lets you capture an OS dump on a HyperScale node by initiating an NMI (non-maskable interrupt) from the server's management interface (such as iLO or iRMC).

 

Once an OS dump is captured, support can review it and help determine the root cause (RCA) of the freeze or crash.

 

Here are the steps to configure Kdump:

 

1) Add the following lines to the /etc/sysctl.conf file. You can open the file for editing with the following command:

  • # nano /etc/sysctl.conf
kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1

Once the changes are made, press Ctrl+X, then Y, then Enter to save them. Do not change the file name.
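If you would rather not edit the file by hand, here is a minimal sketch that appends the same settings only when the key is not already present. The function name add_sysctl_settings is hypothetical and not part of any HyperScale tooling. It leaves existing entries untouched, so a key that is already set to a different value still needs to be corrected manually.

```shell
# Sketch only: append each kdump-related sysctl setting to a config file
# unless an entry for that key already exists.
# Usage (as root): add_sysctl_settings            # defaults to /etc/sysctl.conf
#                  add_sysctl_settings /tmp/test  # or any other file
add_sysctl_settings() {
    conf="${1:-/etc/sysctl.conf}"
    while IFS= read -r entry; do
        key="${entry%% =*}"
        # The dots in the key act as regex wildcards here, which is
        # harmless for this fixed list of keys.
        if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf" 2>/dev/null; then
            continue
        fi
        printf '%s\n' "$entry" >> "$conf"
    done <<'EOF'
kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1
EOF
}
```

Running it a second time adds nothing, so it is safe to repeat.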

2) Load the new sysctl entries into the running kernel using the following command (they persist across reboots because they are stored in /etc/sysctl.conf)

  • # sysctl -p /etc/sysctl.conf
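To confirm the values were actually applied, you can compare the live settings under /proc/sys with what was configured. This is a rough sketch; check_kdump_sysctls is a made-up helper name, and kernel.printk is skipped because the kernel reports it as four values rather than one.

```shell
# Sketch only: print every kdump-related sysctl whose live value differs
# from the expected one. Returns 0 when everything matches, 1 otherwise.
# The optional argument is the sysctl tree root (defaults to /proc/sys).
check_kdump_sysctls() {
    root="${1:-/proc/sys}"
    status=0
    for pair in sysrq=1 panic=1 panic_on_oops=1 unknown_nmi_panic=1 \
                panic_on_unrecovered_nmi=1 panic_on_io_nmi=1; do
        name="${pair%%=*}"   # text before the first '='
        want="${pair#*=}"    # text after the first '='
        got="$(cat "$root/kernel/$name" 2>/dev/null)"
        if [ "$got" != "$want" ]; then
            echo "kernel.$name: expected $want, got '${got:-missing}'"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_kdump_sysctls with no arguments checks the running system and prints nothing when all six values are as expected.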

3) In the file /etc/default/grub, replace the existing GRUB_CMDLINE_LINUX line with the following entry:

  • # nano /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"

Once the changes are made, press Ctrl+X, then Y, then Enter to save them. Do not change the file name.
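The same replacement can be scripted. The sketch below assumes GNU sed (for the -i option), and set_grub_cmdline is a hypothetical helper name. It keeps a .bak copy of the file and only touches the GRUB_CMDLINE_LINUX line, not GRUB_CMDLINE_LINUX_DEFAULT.

```shell
# Sketch only: replace the GRUB_CMDLINE_LINUX line in a grub defaults file,
# keeping a backup copy next to it as <file>.bak.
# Usage (as root): set_grub_cmdline            # defaults to /etc/default/grub
set_grub_cmdline() {
    grub_file="${1:-/etc/default/grub}"
    new_line='GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"'
    cp -p "$grub_file" "$grub_file.bak" || return 1
    # The '=' right after the variable name keeps GRUB_CMDLINE_LINUX_DEFAULT
    # from matching. '|' is the sed delimiter because the value contains '/'-free
    # text but plenty of '=' and '.' characters.
    sed -i "s|^GRUB_CMDLINE_LINUX=.*|$new_line|" "$grub_file"
}
```

If the file has no GRUB_CMDLINE_LINUX line at all, this sketch changes nothing, so check the result before regenerating the GRUB configuration.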

4) Check whether the directory /sys/firmware/efi/ exists on the server. If it does, the server boots with EFI; otherwise it boots with BIOS.

5) If it's a server with EFI, regenerate the GRUB configuration using the following command

  • # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

If it's a server with BIOS, regenerate the GRUB configuration using the following command

  • # grub2-mkconfig -o /boot/grub2/grub.cfg
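Steps 4 and 5 can be combined into one small check. This is a sketch using the two paths given above; grub_cfg_path is a hypothetical helper name, and the optional argument only exists so the function can be tried against a different directory.

```shell
# Sketch only: print the grub.cfg path that matches the boot mode.
# EFI is assumed when the firmware directory exists, BIOS otherwise.
grub_cfg_path() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo /boot/efi/EFI/redhat/grub.cfg
    else
        echo /boot/grub2/grub.cfg
    fi
}
```

As root, the GRUB configuration could then be regenerated with: grub2-mkconfig -o "$(grub_cfg_path)"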

6) Reboot the node so that the new configuration takes effect

  • # shutdown -r now

7) Confirm the kdump service is running

  • # systemctl status kdump

We should see an Active status of "active (exited)". If we do not, start the service using

  • # systemctl start kdump

8) After the server reboots, check /proc/cmdline to confirm that crashkernel is now set to 256M

  • # cat /proc/cmdline
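Rather than reading the whole command line by eye, the crashkernel value can be pulled out directly. A minimal sketch follows; crashkernel_value is a made-up helper name.

```shell
# Sketch only: print the crashkernel= value from a kernel command line file.
# Returns non-zero (and prints nothing) when the parameter is absent.
crashkernel_value() {
    cmdline_file="${1:-/proc/cmdline}"
    # tr puts each boot parameter on its own line so sed can match it exactly.
    value="$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p' | head -n 1)"
    [ -n "$value" ] && echo "$value"
}
```

On a node configured as above, crashkernel_value with no arguments should print 256M.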

9) If the server later becomes stuck, generate an NMI from the server's management interface to crash it and produce a core dump

Examples:

  • In HPE iLO
    • Information section -> Diagnostics -> Generate NMI to system button
  • In iRMC, click the power button in the upper-right corner and select 'Pulse NMI'
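After the node comes back up, it is worth confirming that a dump was written before contacting support. The sketch below assumes the default kdump target directory on Red Hat based systems, /var/crash. That path is an assumption here, so check the path setting in /etc/kdump.conf if it has been changed. find_vmcores is a hypothetical helper name.

```shell
# Sketch only: list vmcore files (and the vmcore-dmesg.txt next to them)
# under the kdump target directory. /var/crash is assumed as the default.
find_vmcores() {
    crash_dir="${1:-/var/crash}"
    find "$crash_dir" -type f -name 'vmcore*' 2>/dev/null | sort
}
```

An empty result after an NMI usually means the dump was not captured, in which case recheck steps 7 and 8.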

Kind Regards

WW
