Hi,
I’d like some guidance on how to troubleshoot HyperScale nodes when they’re freezing/crashing. Is there anything I should be looking out for?
Further information or tips would be much appreciated.
Kind regards,
Jon
Hi Jon,
For issues related to HyperScale nodes freezing or crashing, we usually recommend setting up Kdump, which allows users to capture an OS dump on a HyperScale node by initiating an NMI (non-maskable interrupt) from the Server Stack.
Once an OS dump is captured, support can assist with the review and determine an RCA for the freeze/crash.
Here are the steps to configure Kdump:
1) Please add the following lines to the /etc/sysctl.conf file using a text editor:
kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1
Once the changes are made, press Ctrl+X and then Y to save them. Do not change the file name.
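If you would rather script step 1 than edit the file by hand, something like the following should work. It is only a sketch: the helper name add_sysctl_settings is made up for illustration, and it appends each setting only when that exact line is not already in the file.

```shell
# Sketch: append the kdump-related sysctl settings to a config file.
# add_sysctl_settings is a hypothetical helper name, not part of any package.
add_sysctl_settings() {
    conf="$1"
    while IFS= read -r line; do
        # Append only lines that are not already present, so reruns are harmless.
        grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done <<'EOF'
kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1
EOF
}

# Run as root against the real file:
# add_sysctl_settings /etc/sysctl.conf
```

Trying it on a copy of the file first is a cheap way to check the result before touching the real one.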
2) Make the sysctl entries persistent using the following command
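The command for step 2 did not survive in the post. Assuming it was the usual sysctl -p, which loads the settings from a file so they take effect on the running kernel, it would look roughly like this. The wrapper name apply_sysctl is just for illustration.

```shell
# Assumption: the intended command is `sysctl -p`, which loads settings from a file.
# apply_sysctl is a hypothetical wrapper name.
apply_sysctl() {
    # Defaults to /etc/sysctl.conf, the file edited in step 1.
    sysctl -p "${1:-/etc/sysctl.conf}"
}

# Run as root:
# apply_sysctl
```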
3) In the file /etc/default/grub replace the existing GRUB_CMDLINE_LINUX with the following entry:
GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"
Once the changes are made, press Ctrl+X and then Y to save them. Do not change the file name.
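Step 3 can also be scripted. The sketch below assumes GRUB_CMDLINE_LINUX sits on a single line in the file, keeps a backup next to it with a .bak suffix, and uses a made-up helper name, set_grub_cmdline.

```shell
# The replacement entry from step 3.
NEW_CMDLINE='GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"'

# set_grub_cmdline is a hypothetical helper name.
# Assumption: the existing GRUB_CMDLINE_LINUX entry is on one line.
set_grub_cmdline() {
    grub_file="$1"
    # Keep a backup, then rewrite the file from the backup with the line replaced.
    cp -p "$grub_file" "$grub_file.bak"
    sed "s|^GRUB_CMDLINE_LINUX=.*|$NEW_CMDLINE|" "$grub_file.bak" > "$grub_file"
}

# Run as root:
# set_grub_cmdline /etc/default/grub
```

The pattern ends in "=" so that other entries such as GRUB_CMDLINE_LINUX_DEFAULT are left alone.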
4) Check whether the directory /sys/firmware/efi/ exists on the server. If it does, the server uses EFI; otherwise, it uses BIOS.
5) If EFI exists, regenerate the GRUB configuration using the following command
If it's a server with BIOS, generate the GRUB configuration using the following command
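The commands for steps 4 and 5 are missing from the post. A rough sketch, assuming a RHEL-style layout where grub2-mkconfig writes to /boot/grub2/grub.cfg on BIOS systems and to /boot/efi/EFI/<vendor>/grub.cfg on EFI systems. The vendor directory (redhat, centos and so on) varies by distribution, so check what exists under /boot/efi/EFI/ on your node. The helper name grub_cfg_path and the EFI_VENDOR variable are made up for illustration.

```shell
# grub_cfg_path is a hypothetical helper: it prints the GRUB config path
# to regenerate, based on whether the EFI firmware directory exists.
grub_cfg_path() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        # Assumption: RHEL-style EFI layout; set EFI_VENDOR to match your system.
        echo "/boot/efi/EFI/${EFI_VENDOR:-redhat}/grub.cfg"
    else
        echo "/boot/grub2/grub.cfg"
    fi
}

# Run as root:
# grub2-mkconfig -o "$(grub_cfg_path)"
```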
6) Reboot the node so that the new configuration takes effect
7) Confirm the kdump service is running
We should see an Active status of "active (exited)". If we do not, start the kdump service.
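The commands for step 7 are missing from the post. Assuming the service unit is named kdump, as on RHEL-style systems, "systemctl status kdump" shows the Active line mentioned above. The check-and-start logic can be sketched as follows, with ensure_kdump_running being a made-up helper name.

```shell
# ensure_kdump_running is a hypothetical helper name.
# Assumption: the service unit is called "kdump".
ensure_kdump_running() {
    if systemctl is-active --quiet kdump; then
        echo "kdump is active"
    else
        # Not active yet, so try to start it.
        systemctl start kdump
    fi
}

# Run as root:
# ensure_kdump_running
```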
8) Check /proc/cmdline after the server reboots to see if crashkernel is now set to 256M
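For step 8, "cat /proc/cmdline" prints the running kernel's boot parameters. A small sketch that looks for the expected value directly, with check_crashkernel being a made-up helper name:

```shell
# check_crashkernel is a hypothetical helper name.
# It succeeds only if crashkernel=256M appears in the kernel command line.
check_crashkernel() {
    cmdline_file="${1:-/proc/cmdline}"
    if grep -qw 'crashkernel=256M' "$cmdline_file"; then
        echo "crashkernel=256M is set"
    else
        echo "crashkernel=256M not found" >&2
        return 1
    fi
}

# Usage after the reboot:
# check_crashkernel
```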
9) If the server gets stuck at any point later, generate an NMI to crash it and produce a core dump
Examples:
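The original examples are missing from the post, so the following is an assumption rather than the method the reply intended. One common way to send an NMI remotely is through the node's BMC with ipmitool, whose "chassis power diag" subcommand pulses a diagnostic interrupt (NMI) to the processors. The helper name send_nmi is made up, the BMC address and user are placeholders, and the password is read from the IPMI_PASSWORD environment variable through ipmitool's -E option.

```shell
# send_nmi is a hypothetical helper name.
# Assumption: the BMC is reachable over IPMI-over-LAN (lanplus) and the
# password has been exported in IPMI_PASSWORD beforehand.
send_nmi() {
    bmc_host="$1"
    bmc_user="$2"
    ipmitool -I lanplus -H "$bmc_host" -U "$bmc_user" -E chassis power diag
}

# Example with placeholder values:
# export IPMI_PASSWORD='<bmc password>'
# send_nmi 192.0.2.10 admin
```

With the sysctl settings from step 1 in place, the NMI should panic the kernel so that kdump can write the dump.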
Kind Regards
WW