PCIe Disable Fatal Error Reporting
When loading a new configuration on a PCIe-connected FPGA, the device can fall off the bus. This is usually harmless as long as the drivers are properly unloaded before the FPGA is reconfigured. However, some systems, such as the Dell R540, complain bitterly when the device falls off the bus:
[ 2750.469657] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 2750.469658] {1}[Hardware Error]: event severity: fatal
[ 2750.469659] {1}[Hardware Error]:  Error 0, type: fatal
[ 2750.469660] {1}[Hardware Error]:   section_type: PCIe error
[ 2750.469660] {1}[Hardware Error]:   port_type: 4, root port
[ 2750.469661] {1}[Hardware Error]:   version: 3.0
[ 2750.469662] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[ 2750.469662] {1}[Hardware Error]:   device_id: 0000:3a:00.0
[ 2750.469663] {1}[Hardware Error]:   slot: 2
[ 2750.469663] {1}[Hardware Error]:   secondary_bus: 0x3b
[ 2750.469664] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030
[ 2750.469664] {1}[Hardware Error]:   class_code: 000406
[ 2750.469665] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[ 2750.469666] Kernel panic - not syncing: Fatal hardware error!
[ 2750.469728] Kernel Offset: 0x20600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
The device falling off the bus triggers a PCIe fatal error that causes the kernel to panic and the iDRAC to reboot the machine. The iDRAC is totally independent from the operating system; it will still reboot the machine even if the operating system ignores the error.
To work around this crash, the error must not be reported to the OS or to the iDRAC. To that end, PCIe fatal error reporting must be disabled on the switch or root port upstream of the FPGA. Specifically, two bits must be cleared: the SERR enable bit (bit 8, mask 0x0100) in the command register, and the fatal error reporting enable bit (bit 2, mask 0x0004) in the device control register of the PCIe capability. The following script performs these operations on the port directly upstream of the specified PCIe device address (bus:device.function, with or without the 0000: domain prefix).
- pcie_disable_fatal.sh
#!/bin/bash

dev=$1

if [ -z "$dev" ]; then
    echo "Error: no device specified"
    exit 1
fi

# accept addresses given without the PCI domain prefix
if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    dev="0000:$dev"
fi

if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    echo "Error: device $dev not found"
    exit 1
fi

# the parent directory of the device in sysfs is the upstream port
port=$(basename "$(dirname "$(readlink "/sys/bus/pci/devices/$dev")")")

if [ ! -e "/sys/bus/pci/devices/$port" ]; then
    echo "Error: device $port not found"
    exit 1
fi

echo "Disabling fatal error reporting on port $port..."

cmd=$(setpci -s "$port" COMMAND)

echo "Command: $cmd"

# clear SERR bit in command register
setpci -s "$port" COMMAND=$(printf "%04x" $((0x$cmd & ~0x0100)))

ctrl=$(setpci -s "$port" CAP_EXP+8.w)

echo "Device control: $ctrl"

# clear fatal error reporting enable bit in device control register
setpci -s "$port" CAP_EXP+8.w=$(printf "%04x" $((0x$ctrl & ~0x0004)))
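The two steps the script relies on, finding the upstream port and masking out a register bit, can be tried without touching any hardware. The following is a minimal sketch, not part of pcie_disable_fatal.sh: the helper names clear_bits and upstream_port are hypothetical, and the sysfs root is passed as a parameter only so that the lookup can be pointed at a mock directory tree instead of /sys.

```shell
#!/bin/bash
# Hypothetical helpers that mirror the logic of pcie_disable_fatal.sh.
# They only compute and print values; nothing is written to PCI config space.

# Print register value $1 (hex digits, no 0x prefix) as 4 hex digits
# with every bit set in mask $2 (e.g. 0x0100) cleared.
clear_bits() {
    local value=$1 mask=$2
    printf "%04x\n" $(( 0x$value & ~mask ))
}

# Print the name of the device directly upstream of device $2.
# $1 is the sysfs root (normally /sys). Entries in bus/pci/devices are
# symlinks into a tree where each device directory sits inside the
# directory of its parent bridge, so the parent directory name is the port.
upstream_port() {
    local sysfs=$1 dev=$2
    basename "$(dirname "$(readlink "$sysfs/bus/pci/devices/$dev")")"
}
```

With the command register value from the log above, `clear_bits 0547 0x0100` prints `0447`, which is the value the script would write back to clear SERR. On a real system, `upstream_port /sys 0000:3b:00.0` prints the port above that device; the address 0000:3b:00.0 is only an illustration and must be replaced with the address of the FPGA. The script itself uses setpci to write configuration registers, which normally requires root privileges.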