From e9ddb3484092f20a4f55e1a5f564cb9b5bbcbb4e Mon Sep 17 00:00:00 2001 From: Lotus Fenn Date: Wed, 29 Oct 2025 01:35:49 +0000 Subject: [PATCH] Add a patch for printing the AMD Zen CPU reset reason If I intentionally trigger a CPU soft reset I see this: ``` admin@gold208-dut:~$ sudo dmesg | grep -i reason [ 0.635233] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9 ``` If I intentionally trigger the CPU FCH Watchdog, I see this: ``` admin@gold208-dut:~$ sudo dmesg | grep reason [ 0.632563] x86/amd: Previous system reset reason [0x02000800]: hardware watchdog timer expired ``` Upstream from here: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=ab8131028710d009ab93d6bffd2a2749ade909b0 The patch had to be adapted to the v6.1 kernel we use: `fch.h` did not exist in v6.1, so its entire contents (5 constants) are added, and the `amd.c` hunks are updated to match the v6.1 context. Signed-off-by: Nate White --- ...-Print-the-reason-for-the-last-reset.patch | 519 ++++++++++++++++++ ...MD-Ignore-invalid-reset-reason-value.patch | 64 +++ patches-sonic/series | 4 + 3 files changed, 587 insertions(+) create mode 100644 patches-sonic/0001-x86-CPU-AMD-Print-the-reason-for-the-last-reset.patch create mode 100644 patches-sonic/0002-x86-CPU-AMD-Ignore-invalid-reset-reason-value.patch diff --git a/patches-sonic/0001-x86-CPU-AMD-Print-the-reason-for-the-last-reset.patch b/patches-sonic/0001-x86-CPU-AMD-Print-the-reason-for-the-last-reset.patch new file mode 100644 index 000000000..7f247882d --- /dev/null +++ b/patches-sonic/0001-x86-CPU-AMD-Print-the-reason-for-the-last-reset.patch @@ -0,0 +1,519 @@ +From d0dc75304bc26dfb821f710ddf0d428484620af1 Mon Sep 17 00:00:00 2001 +From: Yazen Ghannam +Date: Tue, 22 Apr 2025 18:48:30 -0500 +Subject: [PATCH 1/2] x86/CPU/AMD: Print the reason for the last reset + +[ Upstream commit ab8131028710d009ab93d6bffd2a2749ade909b0 ] + +The following register contains bits
that indicate the cause for the +previous reset. + + PMx000000C0 (FCH::PM::S5_RESET_STATUS) + +This is useful for debug. The reasons for reset are broken into 6 high level +categories. Decode it by category and print during boot. + +Specifics within a category are split off into debugging documentation. + +The register is accessed indirectly through a "PM" port in the FCH. Use +MMIO access in order to avoid restrictions with legacy port access. + +Use a late_initcall() to ensure that MMIO has been set up before trying to +access the register. + +This register was introduced with AMD Family 17h, so avoid access on older +families. There is no CPUID feature bit for this register. + + [ bp: Simplify the reason dumping loop. + - merge a fix to not access an array element after the last one: + https://lore.kernel.org/r/20250505133609.83933-1-superm1@kernel.org + Reported-by: James Dutton + ] + + [ mingo: + - Use consistent .rst formatting + - Fix 'Sleep' class field to 'ACPI-State' + - Standardize pin messages around the 'tripped' verbiage + - Remove reference to ring-buffer printing & simplify the wording + - Use curly braces for multi-line conditional statements ] + +Signed-off-by: Yazen Ghannam +Co-developed-by: Mario Limonciello +Signed-off-by: Mario Limonciello +Signed-off-by: Borislav Petkov (AMD) +Signed-off-by: Ingo Molnar +Signed-off-by: Borislav Petkov (AMD) +Link: https://lore.kernel.org/20250422234830.2840784-6-superm1@kernel.org +--- + Documentation/arch/x86/amd-debugging.rst | 368 +++++++++++++++++++++++ + arch/x86/include/asm/amd/fch.h | 13 + + arch/x86/kernel/cpu/amd.c | 54 ++++ + 3 files changed, 435 insertions(+) + create mode 100644 Documentation/arch/x86/amd-debugging.rst + create mode 100644 arch/x86/include/asm/amd/fch.h + +diff --git a/Documentation/arch/x86/amd-debugging.rst b/Documentation/arch/x86/amd-debugging.rst +new file mode 100644 +index 000000000000..d92bf59d62c7 +--- /dev/null ++++ b/Documentation/arch/x86/amd-debugging.rst +@@ -0,0 
+1,368 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++Debugging AMD Zen systems +++++++++++++++++++++++++++ ++ ++Introduction ++============ ++ ++This document describes techniques that are useful for debugging issues with ++AMD Zen systems. It is intended for use by developers and technical users ++to help identify and resolve issues. ++ ++S3 vs s2idle ++============ ++ ++On AMD systems, it's not possible to simultaneously support suspend-to-RAM (S3) ++and suspend-to-idle (s2idle). To confirm which mode your system supports you ++can look at ``cat /sys/power/mem_sleep``. If it shows ``s2idle [deep]`` then ++*S3* is supported. If it shows ``[s2idle]`` then *s2idle* is ++supported. ++ ++On systems that support *S3*, the firmware will be utilized to put all hardware into ++the appropriate low power state. ++ ++On systems that support *s2idle*, the kernel will be responsible for transitioning devices ++into the appropriate low power state. When all devices are in the appropriate low ++power state, the hardware will transition into a hardware sleep state. ++ ++After a suspend cycle you can tell how much time was spent in a hardware sleep ++state by looking at ``cat /sys/power/suspend_stats/last_hw_sleep``. ++ ++This flowchart explains how the AMD s2idle suspend flow works. ++ ++.. kernel-figure:: suspend.svg ++ ++This flowchart explains how the amd s2idle resume flow works. ++ ++.. kernel-figure:: resume.svg ++ ++s2idle debugging tool ++===================== ++ ++As there are a lot of places that problems can occur, a debugging tool has been ++created at ++`amd-debug-tools `_ ++that can help test for common problems and offer suggestions. ++ ++If you have an s2idle issue, it's best to start with this and follow instructions ++from its findings. If you continue to have an issue, raise a bug with the ++report generated from this script to ++`drm/amd gitlab `_. 
++ ++Spurious s2idle wakeups from an IRQ ++=================================== ++ ++Spurious wakeups will generally have an IRQ set to ``/sys/power/pm_wakeup_irq``. ++This can be matched to ``/proc/interrupts`` to determine what device woke the system. ++ ++If this isn't enough to debug the problem, then the following sysfs files ++can be set to add more verbosity to the wakeup process: :: ++ ++ # echo 1 | sudo tee /sys/power/pm_debug_messages ++ # echo 1 | sudo tee /sys/power/pm_print_times ++ ++After making those changes, the kernel will display messages that can ++be traced back to kernel s2idle loop code as well as display any active ++GPIO sources while waking up. ++ ++If the wakeup is caused by the ACPI SCI, additional ACPI debugging may be ++needed. These commands can enable additional trace data: :: ++ ++ # echo enable | sudo tee /sys/module/acpi/parameters/trace_state ++ # echo 1 | sudo tee /sys/module/acpi/parameters/aml_debug_output ++ # echo 0x0800000f | sudo tee /sys/module/acpi/parameters/debug_level ++ # echo 0xffff0000 | sudo tee /sys/module/acpi/parameters/debug_layer ++ ++Spurious s2idle wakeups from a GPIO ++=================================== ++ ++If a GPIO is active when waking up the system ideally you would look at the ++schematic to determine what device it is associated with. If the schematic ++is not available, another tactic is to look at the ACPI _EVT() entry ++to determine what device is notified when that GPIO is active. ++ ++For a hypothetical example, say that GPIO 59 woke up the system. You can ++look at the SSDT to determine what device is notified when GPIO 59 is active. ++ ++First convert the GPIO number into hex. :: ++ ++ $ python3 -c "print(hex(59))" ++ 0x3b ++ ++Next determine which ACPI table has the ``_EVT`` entry. For example: :: ++ ++ $ sudo grep EVT /sys/firmware/acpi/tables/SSDT* ++ grep: /sys/firmware/acpi/tables/SSDT27: binary file matches ++ ++Decode this table:: ++ ++ $ sudo cp /sys/firmware/acpi/tables/SSDT27 . 
++ $ sudo iasl -d SSDT27 ++ ++Then look at the table and find the matching entry for GPIO 0x3b. :: ++ ++ Case (0x3B) ++ { ++ M000 (0x393B) ++ M460 (" Notify (\\_SB.PCI0.GP17.XHC1, 0x02)\n", Zero, Zero, Zero, Zero, Zero, Zero) ++ Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake ++ } ++ ++You can see in this case that the device ``\_SB.PCI0.GP17.XHC1`` is notified ++when GPIO 59 is active. It's obvious this is an XHCI controller, but to go a ++step further you can figure out which XHCI controller it is by matching it to ++ACPI.:: ++ ++ $ grep "PCI0.GP17.XHC1" /sys/bus/acpi/devices/*/path ++ /sys/bus/acpi/devices/device:2d/path:\_SB_.PCI0.GP17.XHC1 ++ /sys/bus/acpi/devices/device:2e/path:\_SB_.PCI0.GP17.XHC1.RHUB ++ /sys/bus/acpi/devices/device:2f/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1 ++ /sys/bus/acpi/devices/device:30/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM0 ++ /sys/bus/acpi/devices/device:31/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT1.CAM1 ++ /sys/bus/acpi/devices/device:32/path:\_SB_.PCI0.GP17.XHC1.RHUB.PRT2 ++ /sys/bus/acpi/devices/LNXPOWER:0d/path:\_SB_.PCI0.GP17.XHC1.PWRS ++ ++Here you can see it matches to ``device:2d``. Look at the ``physical_node`` ++to determine what PCI device that actually is. :: ++ ++ $ ls -l /sys/bus/acpi/devices/device:2d/physical_node ++ lrwxrwxrwx 1 root root 0 Feb 12 13:22 /sys/bus/acpi/devices/device:2d/physical_node -> ../../../../../pci0000:00/0000:00:08.1/0000:c2:00.4 ++ ++So there you have it: the PCI device associated with this GPIO wakeup was ``0000:c2:00.4``. ++ ++The ``amd_s2idle.py`` script will capture most of these artifacts for you. ++ ++s2idle PM debug messages ++======================== ++ ++During the s2idle flow on AMD systems, the ACPI LPS0 driver is responsible ++to check all uPEP constraints. Failing uPEP constraints does not prevent ++s0i3 entry. This means that if some constraints are not met, it is possible ++the kernel may attempt to enter s2idle even if there are some known issues. 
++ ++To activate PM debugging, either specify ``pm_debug_messages`` kernel ++command-line option at boot or write to ``/sys/power/pm_debug_messages``. ++Unmet constraints will be displayed in the kernel log and can be ++viewed by logging tools that process kernel ring buffer like ``dmesg`` or ++``journalctl``. ++ ++If the system freezes on entry/exit before these messages are flushed, a ++useful debugging tactic is to unbind the ``amd_pmc`` driver to prevent ++notification to the platform to start s0i3 entry. This will stop the ++system from freezing on entry or exit and let you view all the failed ++constraints. :: ++ ++ cd /sys/bus/platform/drivers/amd_pmc ++ ls | grep AMD | sudo tee unbind ++ ++After doing this, run the suspend cycle and look specifically for errors around: :: ++ ++ ACPI: LPI: Constraint not met; min power state:%s current power state:%s ++ ++Historical examples of s2idle issues ++==================================== ++ ++To help understand the types of issues that can occur and how to debug them, ++here are some historical examples of s2idle issues that have been resolved. ++ ++Core offlining ++-------------- ++An end user had reported that taking a core offline would prevent the system ++from properly entering s0i3. This was debugged using internal AMD tools ++to capture and display a stream of metrics from the hardware showing what changed ++when a core was offlined. It was determined that the hardware didn't get ++notification the offline cores were in the deepest state, and so it prevented ++CPU from going into the deepest state. The issue was debugged to a missing ++command to put cores into C3 upon offline. ++ ++`commit d6b88ce2eb9d2 ("ACPI: processor idle: Allow playing dead in C3 state") `_ ++ ++Corruption after resume ++----------------------- ++A big problem that occurred with Rembrandt was that there was graphical ++corruption after resume. This happened because of a misalignment of PSP ++and driver responsibility.
The PSP will save and restore DMCUB, but the ++driver assumed it needed to reset DMCUB on resume. ++This actually was a misalignment for earlier silicon as well, but was not ++observed. ++ ++`commit 79d6b9351f086 ("drm/amd/display: Don't reinitialize DMCUB on s0ix resume") `_ ++ ++Back to Back suspends fail ++-------------------------- ++When using a wakeup source that triggers the IRQ to wakeup, a bug in the ++pinctrl-amd driver may capture the wrong state of the IRQ and prevent the ++system going back to sleep properly. ++ ++`commit b8c824a869f22 ("pinctrl: amd: Don't save/restore interrupt status and wake status bits") `_ ++ ++Spurious timer based wakeup after 5 minutes ++------------------------------------------- ++The HPET was being used to program the wakeup source for the system, however ++this was causing a spurious wakeup after 5 minutes. The correct alarm to use ++was the ACPI alarm. ++ ++`commit 3d762e21d5637 ("rtc: cmos: Use ACPI alarm for non-Intel x86 systems too") `_ ++ ++Disk disappears after resume ++---------------------------- ++After resuming from s2idle, the NVME disk would disappear. This was due to the ++BIOS not specifying the _DSD StorageD3Enable property. This caused the NVME ++driver not to put the disk into the expected state at suspend and to fail ++on resume. ++ ++`commit e79a10652bbd3 ("ACPI: x86: Force StorageD3Enable on more products") `_ ++ ++Spurious IRQ1 ++------------- ++A number of Renoir, Lucienne, Cezanne, & Barcelo platforms have a ++platform firmware bug where IRQ1 is triggered during s0i3 resume. ++ ++This was fixed in the platform firmware, but a number of systems didn't ++receive any more platform firmware updates. ++ ++`commit 8e60615e89321 ("platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN") `_ ++ ++Hardware timeout ++---------------- ++The hardware performs many actions besides accepting the values from ++amd-pmc driver. 
As the communication path with the hardware is a mailbox, ++it's possible that it might not respond quickly enough. ++This issue manifested as a failure to suspend: :: ++ ++ PM: dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110 ++ amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110 ++ ++The timing problem was identified by comparing the values of the idle mask. ++ ++`commit 3c3c8e88c8712 ("platform/x86: amd-pmc: Increase the response register timeout") `_ ++ ++Failed to reach hardware sleep state with panel on ++-------------------------------------------------- ++On some Strix systems certain panels were observed to block the system from ++entering a hardware sleep state if the internal panel was on during the sequence. ++ ++Even though the panel got turned off during suspend it exposed a timing problem ++where an interrupt caused the display hardware to wake up and block low power ++state entry. ++ ++`commit 40b8c14936bd2 ("drm/amd/display: Disable unneeded hpd interrupts during dm_init") `_ ++ ++Runtime power consumption issues ++================================ ++ ++Runtime power consumption is influenced by many factors, including but not ++limited to the configuration of the PCIe Active State Power Management (ASPM), ++the display brightness, the EPP policy of the CPU, and the power management ++of the devices. ++ ++ASPM ++---- ++For the best runtime power consumption, ASPM should be programmed as intended ++by the BIOS from the hardware vendor. To accomplish this the Linux kernel ++should be compiled with ``CONFIG_PCIEASPM_DEFAULT`` set to ``y`` and the ++sysfs file ``/sys/module/pcie_aspm/parameters/policy`` should not be modified. ++ ++Most notably, if L1.2 is not configured properly for any devices, the SoC ++will not be able to enter the deepest idle state. ++ ++EPP Policy ++---------- ++The ``energy_performance_preference`` sysfs file can be used to set a bias ++of efficiency or performance for a CPU. 
This has a direct relationship on ++the battery life when more heavily biased towards performance. ++ ++ ++BIOS debug messages ++=================== ++ ++Most OEM machines don't have a serial UART for outputting kernel or BIOS ++debug messages. However BIOS debug messages are useful for understanding ++both BIOS bugs and bugs with the Linux kernel drivers that call BIOS AML. ++ ++As the BIOS on most OEM AMD systems are based off an AMD reference BIOS, ++the infrastructure used for exporting debugging messages is often the same ++as AMD reference BIOS. ++ ++Manually Parsing ++---------------- ++There is generally an ACPI method ``\M460`` that different paths of the AML ++will call to emit a message to the BIOS serial log. This method takes ++7 arguments, with the first being a string and the rest being optional ++integers:: ++ ++ Method (M460, 7, Serialized) ++ ++Here is an example of a string that BIOS AML may call out using ``\M460``:: ++ ++ M460 (" OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", DADR, Arg0, Arg1, PCSA, Zero, Zero) ++ ++Normally when executed, the ``\M460`` method would populate the additional ++arguments into the string. In order to get these messages from the Linux ++kernel a hook has been added into ACPICA that can capture the *arguments* ++sent to ``\M460`` and print them to the kernel ring buffer. ++For example the following message could be emitted into kernel ring buffer:: ++ ++ extrace-0174 ex_trace_args : " OEM-ASL-PCIe Address (0x%X)._REG (%d %d) PCSA = %d\n", ec106000, 2, 1, 1, 0, 0 ++ ++In order to get these messages, you need to compile with ``CONFIG_ACPI_DEBUG`` ++and then turn on the following ACPICA tracing parameters. ++This can be done either on the kernel command line or at runtime: ++ ++* ``acpi.trace_method_name=\M460`` ++* ``acpi.trace_state=method`` ++ ++NOTE: These can be very noisy at bootup. 
If you turn these parameters on ++the kernel command, please also consider turning up ``CONFIG_LOG_BUF_SHIFT`` ++to a larger size such as 17 to avoid losing early boot messages. ++ ++Tool assisted Parsing ++--------------------- ++As mentioned above, parsing by hand can be tedious, especially with a lot of ++messages. To help with this, a tool has been created at ++`amd-debug-tools `_ ++to help parse the messages. ++ ++Random reboot issues ++==================== ++ ++When a random reboot occurs, the high-level reason for the reboot is stored ++in a register that will persist onto the next boot. ++ ++There are 6 classes of reasons for the reboot: ++ * Software induced ++ * Power state transition ++ * Pin induced ++ * Hardware induced ++ * Remote reset ++ * Internal CPU event ++ ++.. csv-table:: ++ :header: "Bit", "Type", "Reason" ++ :align: left ++ ++ "0", "Pin", "thermal pin BP_THERMTRIP_L was tripped" ++ "1", "Pin", "power button was pressed for 4 seconds" ++ "2", "Pin", "shutdown pin was tripped" ++ "4", "Remote", "remote ASF power off command was received" ++ "9", "Internal", "internal CPU thermal limit was tripped" ++ "16", "Pin", "system reset pin BP_SYS_RST_L was tripped" ++ "17", "Software", "software issued PCI reset" ++ "18", "Software", "software wrote 0x4 to reset control register 0xCF9" ++ "19", "Software", "software wrote 0x6 to reset control register 0xCF9" ++ "20", "Software", "software wrote 0xE to reset control register 0xCF9" ++ "21", "ACPI-state", "ACPI power state transition occurred" ++ "22", "Pin", "keyboard reset pin KB_RST_L was tripped" ++ "23", "Internal", "internal CPU shutdown event occurred" ++ "24", "Hardware", "system failed to boot before failed boot timer expired" ++ "25", "Hardware", "hardware watchdog timer expired" ++ "26", "Remote", "remote ASF reset command was received" ++ "27", "Internal", "an uncorrected error caused a data fabric sync flood event" ++ "29", "Internal", "FCH and MP1 failed warm reset handshake" ++ "30", 
"Internal", "a parity error occurred" ++ "31", "Internal", "a software sync flood event occurred" ++ ++This information is read by the kernel at bootup and printed into ++the syslog. When a random reboot occurs this message can be helpful ++to determine the next component to debug. +diff --git a/arch/x86/include/asm/amd/fch.h b/arch/x86/include/asm/amd/fch.h +new file mode 100644 +index 000000000000..2cf5153edbc2 +--- /dev/null ++++ b/arch/x86/include/asm/amd/fch.h +@@ -0,0 +1,13 @@ ++/* SPDX-License-Identifier: GPL-2.0 */ ++#ifndef _ASM_X86_AMD_FCH_H_ ++#define _ASM_X86_AMD_FCH_H_ ++ ++#define FCH_PM_BASE 0xFED80300 ++ ++/* Register offsets from PM base: */ ++#define FCH_PM_DECODEEN 0x00 ++#define FCH_PM_DECODEEN_SMBUS0SEL GENMASK(20, 19) ++#define FCH_PM_SCRATCH 0x80 ++#define FCH_PM_S5_RESET_STATUS 0xC0 ++ ++#endif /* _ASM_X86_AMD_FCH_H_ */ +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c +index 823f44f7bc94..c3194101a92a 100644 +--- a/arch/x86/kernel/cpu/amd.c ++++ b/arch/x86/kernel/cpu/amd.c +@@ -9,6 +9,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -1216,3 +1217,56 @@ void amd_check_microcode(void) + if (cpu_feature_enabled(X86_FEATURE_ZEN2)) + on_each_cpu(zenbleed_check_cpu, NULL, 1); + } ++ ++static const char * const s5_reset_reason_txt[] = { ++ [0] = "thermal pin BP_THERMTRIP_L was tripped", ++ [1] = "power button was pressed for 4 seconds", ++ [2] = "shutdown pin was tripped", ++ [4] = "remote ASF power off command was received", ++ [9] = "internal CPU thermal limit was tripped", ++ [16] = "system reset pin BP_SYS_RST_L was tripped", ++ [17] = "software issued PCI reset", ++ [18] = "software wrote 0x4 to reset control register 0xCF9", ++ [19] = "software wrote 0x6 to reset control register 0xCF9", ++ [20] = "software wrote 0xE to reset control register 0xCF9", ++ [21] = "ACPI power state transition occurred", ++ [22] = "keyboard reset pin KB_RST_L was tripped", ++ [23] = "internal CPU shutdown 
event occurred", ++ [24] = "system failed to boot before failed boot timer expired", ++ [25] = "hardware watchdog timer expired", ++ [26] = "remote ASF reset command was received", ++ [27] = "an uncorrected error caused a data fabric sync flood event", ++ [29] = "FCH and MP1 failed warm reset handshake", ++ [30] = "a parity error occurred", ++ [31] = "a software sync flood event occurred", ++}; ++ ++static __init int print_s5_reset_status_mmio(void) ++{ ++ unsigned long value; ++ void __iomem *addr; ++ int i; ++ ++ if (!cpu_feature_enabled(X86_FEATURE_ZEN)) ++ return 0; ++ ++ addr = ioremap(FCH_PM_BASE + FCH_PM_S5_RESET_STATUS, sizeof(value)); ++ if (!addr) ++ return 0; ++ ++ value = ioread32(addr); ++ iounmap(addr); ++ ++ for (i = 0; i < ARRAY_SIZE(s5_reset_reason_txt); i++) { ++ if (!(value & BIT(i))) ++ continue; ++ ++ if (s5_reset_reason_txt[i]) { ++ pr_info("x86/amd: Previous system reset reason [0x%08lx]: %s\n", ++ value, s5_reset_reason_txt[i]); ++ } ++ } ++ ++ return 0; ++} ++late_initcall(print_s5_reset_status_mmio); +-- +2.43.0 + diff --git a/patches-sonic/0002-x86-CPU-AMD-Ignore-invalid-reset-reason-value.patch b/patches-sonic/0002-x86-CPU-AMD-Ignore-invalid-reset-reason-value.patch new file mode 100644 index 000000000..911861ce2 --- /dev/null +++ b/patches-sonic/0002-x86-CPU-AMD-Ignore-invalid-reset-reason-value.patch @@ -0,0 +1,64 @@ +From aa429e1fbeaa168555aea99038f30a0e05b369e5 Mon Sep 17 00:00:00 2001 +From: Yazen Ghannam +Date: Mon, 21 Jul 2025 18:11:54 +0000 +Subject: [PATCH 2/2] x86/CPU/AMD: Ignore invalid reset reason value + +[ Upstream commit e9576e078220c50ace9e9087355423de23e25fa5 ] + +The reset reason value may be "all bits set", e.g. 0xFFFFFFFF. This is a +commonly used error response from hardware. This may occur due to a real +hardware issue or when running in a VM. + +The user will see all reset reasons reported in this case. + +Check for an error response value and return early to avoid decoding +invalid data. 
+ +Also, adjust the data variable type to match the hardware register size. + +Fixes: ab8131028710 ("x86/CPU/AMD: Print the reason for the last reset") +Reported-by: Libing He +Signed-off-by: Yazen Ghannam +Signed-off-by: Borislav Petkov (AMD) +Reviewed-by: Mario Limonciello +Cc: stable@vger.kernel.org +Link: https://lore.kernel.org/20250721181155.3536023-1-yazen.ghannam@amd.com +--- + arch/x86/kernel/cpu/amd.c | 8 ++++++-- + 1 file changed, 6 insertions(+), 2 deletions(-) + +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c +index c3194101a92a..b40f841479f7 100644 +--- a/arch/x86/kernel/cpu/amd.c ++++ b/arch/x86/kernel/cpu/amd.c +@@ -1243,8 +1243,8 @@ static const char * const s5_reset_reason_txt[] = { + + static __init int print_s5_reset_status_mmio(void) + { +- unsigned long value; + void __iomem *addr; ++ u32 value; + int i; + + if (!cpu_feature_enabled(X86_FEATURE_ZEN)) +@@ -1257,12 +1257,16 @@ static __init int print_s5_reset_status_mmio(void) + value = ioread32(addr); + iounmap(addr); + ++ /* Value with "all bits set" is an error response and should be ignored. 
*/ ++ if (value == U32_MAX) ++ return 0; ++ + for (i = 0; i < ARRAY_SIZE(s5_reset_reason_txt); i++) { + if (!(value & BIT(i))) + continue; + + if (s5_reset_reason_txt[i]) { +- pr_info("x86/amd: Previous system reset reason [0x%08lx]: %s\n", ++ pr_info("x86/amd: Previous system reset reason [0x%08x]: %s\n", + value, s5_reset_reason_txt[i]); + } + } +-- +2.43.0 + diff --git a/patches-sonic/series b/patches-sonic/series index 8d862b559..9f293cfba 100644 --- a/patches-sonic/series +++ b/patches-sonic/series @@ -201,6 +201,10 @@ cisco-npu-disable-other-bars.patch 0001-fix-os-crash-caused-by-optoe-when-class-switch.patch 0001-tty-8250-HSUART-DMA-be-deactivated-for-DNV-CPU.patch +# Nexthop patches +0001-x86-CPU-AMD-Print-the-reason-for-the-last-reset.patch +0002-x86-CPU-AMD-Ignore-invalid-reset-reason-value.patch + # Fix to avoid kernel panic on Kernel 6.1.94 # https://github.com/sonic-net/sonic-buildimage/issues/20901 #PCI-ASPM-Fix-link-state-exit-during-switch-upstream.patch # Upstreamed