====== PCIe Disable Fatal Error Reporting ======
When a new configuration is loaded onto a PCIe-connected FPGA, the device can fall off the bus. This is usually harmless as long as the drivers are properly unloaded before the FPGA is reconfigured. However, some systems, such as the Dell R540, treat the device disappearing from the bus as a fatal hardware error:
[ 2750.469657] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 2750.469658] {1}[Hardware Error]: event severity: fatal
[ 2750.469659] {1}[Hardware Error]: Error 0, type: fatal
[ 2750.469660] {1}[Hardware Error]: section_type: PCIe error
[ 2750.469660] {1}[Hardware Error]: port_type: 4, root port
[ 2750.469661] {1}[Hardware Error]: version: 3.0
[ 2750.469662] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 2750.469662] {1}[Hardware Error]: device_id: 0000:3a:00.0
[ 2750.469663] {1}[Hardware Error]: slot: 2
[ 2750.469663] {1}[Hardware Error]: secondary_bus: 0x3b
[ 2750.469664] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2030
[ 2750.469664] {1}[Hardware Error]: class_code: 000406
[ 2750.469665] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 2750.469666] Kernel panic - not syncing: Fatal hardware error!
[ 2750.469728] Kernel Offset: 0x20600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
The device falling off the bus triggers a PCIe fatal error, which causes the kernel to panic and the iDRAC to reboot the machine. The iDRAC is independent of the operating system, so it will still reboot the machine even if the operating system ignores the error.
To work around this crash, the error must not be reported to the OS or to the iDRAC. To that end, PCIe fatal error reporting must be disabled on the switch or root port upstream of the FPGA. Specifically, two bits must be cleared: the SERR enable bit in the command register (bit 8, mask 0x0100), and the fatal error reporting enable bit in the device control register of the PCIe capability (bit 2, mask 0x0004). The following script performs these operations on the port immediately upstream of the specified PCIe device.
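#!/bin/bash
# Disable fatal error reporting on the port upstream of the given PCIe device.
# The device ID may be given with or without the 0000: domain prefix.
# Writing to configuration space with setpci requires root privileges.

dev=$1

if [ -z "$dev" ]; then
    echo "Error: no device specified" >&2
    exit 1
fi

# if the ID as given is not found, retry with the default domain prefix
if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    dev="0000:$dev"
fi

if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    echo "Error: device $dev not found" >&2
    exit 1
fi

# the sysfs symlink target ends in <upstream port>/<device>,
# so the name of the parent directory is the upstream port
port=$(basename "$(dirname "$(readlink "/sys/bus/pci/devices/$dev")")")

if [ ! -e "/sys/bus/pci/devices/$port" ]; then
    echo "Error: device $port not found" >&2
    exit 1
fi

echo "Disabling fatal error reporting on port $port..."

cmd=$(setpci -s "$port" COMMAND)
echo "Command: $cmd"

# clear SERR bit in command register
setpci -s "$port" COMMAND="$(printf "%04x" $(( 0x$cmd & ~0x0100 )))"

ctrl=$(setpci -s "$port" CAP_EXP+8.w)
echo "Device control: $ctrl"

# clear fatal error reporting enable bit in device control register
setpci -s "$port" CAP_EXP+8.w="$(printf "%04x" $(( 0x$ctrl & ~0x0004 )))"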
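
The masking arithmetic in the script can be checked without touching any hardware. The sketch below is not part of the original script: the helper names ''clear_bits'' and ''bit_is_set'' are hypothetical and exist only for illustration. The command register value 0547 is taken from the log above; the device control value 0027 is made up, since the log does not show that register.

```shell
#!/bin/bash
# Worked example of the bit masking performed by the script above.
# clear_bits and bit_is_set are illustrative helpers, not part of setpci.

# clear_bits VALUE MASK
# Print VALUE with every bit in MASK cleared, as 4-digit hex without a prefix
# (the same format that setpci prints and accepts for 16-bit registers).
clear_bits() {
    printf "%04x\n" $(( 0x$1 & ~0x$2 ))
}

# bit_is_set VALUE MASK
# Succeed (exit status 0) if any bit in MASK is set in VALUE.
bit_is_set() {
    [ $(( 0x$1 & 0x$2 )) -ne 0 ]
}

# Command register from the log (command: 0x0547): SERR is bit 8, mask 0x0100.
if bit_is_set 0547 0100; then
    echo "SERR is enabled in 0547"
fi
clear_bits 0547 0100    # prints 0447

# Device control register: fatal error reporting enable is bit 2, mask 0x0004.
# 0027 is a made-up value used only to show the arithmetic.
clear_bits 0027 0004    # prints 0023
```

To confirm that the script took effect on a real system, the registers can be read back with the same ''setpci'' invocations the script uses (''setpci -s $port COMMAND'' and ''setpci -s $port CAP_EXP+8.w'') and the results passed to ''bit_is_set'' with masks 0100 and 0004 respectively; both checks should fail once the bits are cleared.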