How to troubleshoot crashing/freezing HyperScale node(s)?

Question

Hi,

I’d like some guidance on how to troubleshoot HyperScale nodes when they’re freezing/crashing. Is there anything I should be looking out for? Further information or tips would be much appreciated.

Kind regards,
Jon

Winston W · Accepted Answer

Hi Jon,

For issues where HyperScale nodes freeze or crash, we usually recommend setting up Kdump. This lets you capture an OS dump on a HyperScale node by triggering an NMI (non-maskable interrupt) from the server's management interface.

Once an OS dump is captured, support can review it and help determine an RCA for the freeze/crash.

Here are the steps to configure Kdump:

1) Add the following lines to the /etc/sysctl.conf file. You can open the file with:

# nano /etc/sysctl.conf

kernel.sysrq = 1
kernel.printk = 8
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1

Once the changes are made, press Ctrl+X, then Y to save your changes. Do not change the file name.

2) Load the new sysctl entries using the following command (they persist across reboots because they are stored in /etc/sysctl.conf):

# sysctl -p /etc/sysctl.conf

3) In the file /etc/default/grub, replace the existing GRUB_CMDLINE_LINUX with the following entry:

# nano /etc/default/grub

GRUB_CMDLINE_LINUX="crashkernel=256M selinux=0 rhgb quiet loglevel=0 rd.systemd.show_status=false udev.log_priority=3"

Once the changes are made, press Ctrl+X, then Y to save your changes. Do not change the file name.

4) Check whether the directory /sys/firmware/efi/ exists on the server. If it does, it's a server with EFI; otherwise it's a server with BIOS.

5) If it's a server with EFI, regenerate the GRUB configuration using the following command:

# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

If it's a server with BIOS, regenerate the GRUB configuration using the following command:

# grub2-mkconfig -o /boot/grub2/grub.cfg

6) Reboot the node so that the new configuration takes effect:

# shutdown -r now

7) Confirm the kdump service is running:

# systemctl status kdump

We should see an Active status of "active (exited)". If we do not, start the service using:

# systemctl start kdump

8) Check /proc/cmdline after the server reboots to see if crashkernel is now set to 256M:

# cat /proc/cmdline

9) If the server gets stuck at any point later, generate an NMI to crash it so that it produces a core dump.

Examples:

In HPE iLO: Information section -> Diagnostics -> "Generate NMI to System" button
In iRMC: press the power button in the upper-right corner and select 'Pulse NMI'

Kind regards,
WW
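
The checks in steps 1, 4, 5, 7 and 8 can be scripted so they are easy to repeat on each node. The following is only a rough sketch, not an official tool: the function names (grub_cfg_path, crashkernel_set, missing_sysctl_keys) and the optional path arguments are made up for illustration, and the script only reads files and reports, it does not change anything on the node.

```shell
#!/bin/sh
# Sketch only: helper names and the overridable path arguments are
# illustrative assumptions, not part of any HyperScale tooling.

# Print the GRUB config path to regenerate (step 5): the EFI path if the
# EFI directory exists (step 4), otherwise the BIOS path.
grub_cfg_path() {
    local efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "/boot/efi/EFI/redhat/grub.cfg"
    else
        echo "/boot/grub2/grub.cfg"
    fi
}

# Succeed if the kernel command line contains crashkernel=256M (step 8).
crashkernel_set() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -q 'crashkernel=256M' "$cmdline_file" 2>/dev/null
}

# Print every key from step 1 that has no "key = value" line in the file.
missing_sysctl_keys() {
    local conf="${1:-/etc/sysctl.conf}"
    local key
    for key in kernel.sysrq kernel.printk kernel.panic kernel.panic_on_oops \
               kernel.unknown_nmi_panic kernel.panic_on_unrecovered_nmi \
               kernel.panic_on_io_nmi; do
        # Strip whitespace, keep the part before "=", then match exactly.
        sed 's/[[:space:]]//g' "$conf" 2>/dev/null | cut -d= -f1 \
            | grep -Fxq "$key" || echo "$key"
    done
}

# Read-only report for the current node.
echo "GRUB config to regenerate: $(grub_cfg_path)"

if crashkernel_set; then
    echo "crashkernel=256M is present in /proc/cmdline"
else
    echo "crashkernel=256M was not found in /proc/cmdline"
fi

missing="$(missing_sysctl_keys)"
if [ -n "$missing" ]; then
    echo "Keys missing from /etc/sysctl.conf:"
    echo "$missing"
fi

# Step 7: report the kdump service state if systemctl is available.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active kdump \
        || echo "kdump is not active; try: systemctl start kdump"
fi
```

Run it as root after the reboot in step 6. If it reports missing keys or a missing crashkernel entry, go back to the matching step above before relying on an NMI to produce a dump.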
