Solved

How to troubleshoot crashing/freezing HyperScale node(s)?

  • 12 January 2021
  • 1 reply
  • 394 views

Hi,


I’d like some guidance on how to troubleshoot HyperScale nodes when they freeze or crash. Is there anything I should be looking out for?

 

Further information or tips would be much appreciated.

 

Kind regards,

Jon

Hi Jon 

 

For HyperScale nodes that freeze or crash, we usually recommend setting up Kdump. With Kdump configured, you can capture an OS dump on a HyperScale node by triggering an NMI (non-maskable interrupt) from the server's management interface, such as iLO or iRMC.

 

Once an OS dump is captured, support can help review it and determine a root cause (RCA) for the freeze or crash.

 

Here are the steps to configure Kdump:

 

1) Add the following lines to the /etc/sysctl.conf file. You can open the file for editing with the following command:

  • # nano /etc/sysctl.conf
kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1

Once the changes are made, press Ctrl+X, then Y, then Enter to save your changes. Do not change the file name.

2) Load the new sysctl entries into the running kernel using the following command (the entries persist across reboots because they are stored in /etc/sysctl.conf):

  • # sysctl -p /etc/sysctl.conf
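Before or after loading the settings, it can help to confirm that every entry from step 1 actually made it into the file. The following is a minimal sketch, not part of the official procedure; the function name check_sysctl_conf and its file argument are hypothetical, chosen so the check can also be run against a copy of the file.

```shell
#!/bin/sh
# check_sysctl_conf FILE
# Hypothetical helper: reports every "key = value" pair from step 1 that is
# absent from FILE (or set to a different value). Returns 0 if all are present.
check_sysctl_conf() {
    conf_file=$1
    missing=0
    for entry in \
        "kernel.sysrq=1" \
        "kernel.printk=8" \
        "kernel.panic=1" \
        "kernel.panic_on_oops=1" \
        "kernel.unknown_nmi_panic=1" \
        "kernel.panic_on_unrecovered_nmi=1" \
        "kernel.panic_on_io_nmi=1"
    do
        key=${entry%%=*}
        want=${entry#*=}
        # Strip whitespace from each line, then keep the value of the last
        # uncommented assignment whose key matches exactly.
        have=$(awk -F= -v k="$key" \
            '{ gsub(/[[:space:]]/, "", $0) } $1 == k { v = $2 } END { print v }' \
            "$conf_file")
        if [ "$have" != "$want" ]; then
            echo "MISSING: $key = $want"
            missing=1
        fi
    done
    return "$missing"
}
```

On a node this would be called as `check_sysctl_conf /etc/sysctl.conf`; no output and an exit status of 0 mean all seven entries are in place.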

3) In the file /etc/default/grub replace the existing GRUB_CMDLINE_LINUX with the following entry:

  • # nano /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"

Once the changes are made, press Ctrl+X, then Y, then Enter to save your changes. Do not change the file name.
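If you would rather not edit the file by hand, the replacement in step 3 can also be scripted. This is only a sketch under the assumption that the file already contains a GRUB_CMDLINE_LINUX line; the function name set_grub_cmdline is hypothetical, and the function keeps a .bak copy of the original next to it.

```shell
#!/bin/sh
# set_grub_cmdline FILE
# Hypothetical helper: replaces the GRUB_CMDLINE_LINUX line in FILE with the
# entry from step 3 and leaves the previous version in FILE.bak.
set_grub_cmdline() {
    grub_file=$1
    new_line='GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"'
    # Keep a backup so the old command line can be restored if needed.
    cp -p "$grub_file" "$grub_file.bak"
    # The trailing "=" in the pattern keeps GRUB_CMDLINE_LINUX_DEFAULT untouched.
    sed "s|^GRUB_CMDLINE_LINUX=.*|$new_line|" "$grub_file.bak" > "$grub_file"
}
```

On a node this would be run as root with `set_grub_cmdline /etc/default/grub`. If the file has no GRUB_CMDLINE_LINUX line, nothing is changed, so check the result before regenerating the GRUB configuration.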

4) Check whether the directory /sys/firmware/efi/ exists on the server. If it does, the server boots with EFI; otherwise it boots with BIOS.

5) If the server uses EFI, regenerate the GRUB configuration using the following command:

  • # grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

If the server uses BIOS, regenerate the GRUB configuration using the following command instead:

  • # grub2-mkconfig -o /boot/grub2/grub.cfg
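Steps 4 and 5 can be combined into one small check so the right output path is picked automatically. The sketch below only prints the path; grub_cfg_path is a hypothetical name, and its optional argument exists purely so the logic can be tried against a directory other than /sys/firmware/efi.

```shell
#!/bin/sh
# grub_cfg_path [EFI_DIR]
# Hypothetical helper: prints the grub.cfg location from step 5, depending on
# whether the EFI directory from step 4 exists (default: /sys/firmware/efi).
grub_cfg_path() {
    efi_dir=${1:-/sys/firmware/efi}
    if [ -d "$efi_dir" ]; then
        # EFI server
        echo /boot/efi/EFI/redhat/grub.cfg
    else
        # BIOS server
        echo /boot/grub2/grub.cfg
    fi
}
```

On a node, the configuration could then be regenerated as root with `grub2-mkconfig -o "$(grub_cfg_path)"`.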

6) Reboot the node so that the new configuration takes effect

  • # shutdown -r now

7) Confirm the kdump service is running

  • # systemctl status kdump

The Active line should show "active (exited)". If it does not, start the service using:

  • # systemctl start kdump

8) Check /proc/cmdline after the server reboots to see if crashkernel is now set to 256M

  • # cat /proc/cmdline
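The output of /proc/cmdline is one long line, so the crashkernel parameter is easy to miss. A minimal sketch for pulling out just that value follows; crashkernel_value is a hypothetical name, and the optional file argument is an assumption added so the function can be tried on a saved copy of the command line.

```shell
#!/bin/sh
# crashkernel_value [CMDLINE_FILE]
# Hypothetical helper: prints the value of the crashkernel= parameter found
# in CMDLINE_FILE (default: /proc/cmdline), or nothing if it is not set.
crashkernel_value() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each space-separated parameter on its own line, then print only
    # the part after "crashkernel=".
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p' | tail -n 1
}
```

After the reboot in step 6, running `crashkernel_value` on the node should print 256M; empty output suggests the GRUB change did not take effect.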

9) If the server later hangs at any point, generate an NMI to force a crash and produce a core dump.

Examples:

  • In HPE iLO
    • Information section -> Diagnostics -> Generate NMI to system button
  • In iRMC, press the power button in the upper-right corner and select 'Pulse NMI'

Kind Regards

WW

