PCIe Disable Fatal Error Reporting

When a new configuration is loaded onto a PCIe-connected FPGA, the device can fall off the bus. This is usually harmless as long as the drivers are properly unloaded before the FPGA is reconfigured. However, some systems, such as the Dell R540, treat the device falling off the bus as a fatal hardware error and panic:

[ 2750.469657] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 2750.469658] {1}[Hardware Error]: event severity: fatal
[ 2750.469659] {1}[Hardware Error]:  Error 0, type: fatal
[ 2750.469660] {1}[Hardware Error]:   section_type: PCIe error
[ 2750.469660] {1}[Hardware Error]:   port_type: 4, root port
[ 2750.469661] {1}[Hardware Error]:   version: 3.0
[ 2750.469662] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[ 2750.469662] {1}[Hardware Error]:   device_id: 0000:3a:00.0
[ 2750.469663] {1}[Hardware Error]:   slot: 2
[ 2750.469663] {1}[Hardware Error]:   secondary_bus: 0x3b
[ 2750.469664] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030
[ 2750.469664] {1}[Hardware Error]:   class_code: 000406
[ 2750.469665] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[ 2750.469666] Kernel panic - not syncing: Fatal hardware error!
[ 2750.469728] Kernel Offset: 0x20600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
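The command value reported in the log can be decoded by hand to see which bits are set. The following is a minimal sketch, not part of the original script: the function name decode_command and the bit labels are illustrative choices, and only the bits relevant to this log are named.

```bash
#!/bin/bash
# decode_command HEX: print the named bits that are set in a PCI command
# register value, given as hex digits without a 0x prefix.
# The function name and labels are hypothetical; the bit positions follow
# the standard PCI command register layout.
decode_command() {
    local val=$((16#$1))
    local -a names=(
        [0]="io-space" [1]="memory-space" [2]="bus-master"
        [6]="parity-error-response" [8]="serr-enable" [10]="intx-disable"
    )
    local bit
    # indices of a sparse bash array are returned in ascending order
    for bit in "${!names[@]}"; do
        if (( val & (1 << bit) )); then
            echo "bit $bit: ${names[$bit]}"
        fi
    done
}

# command value taken from the kernel log above
decode_command 0547
```

For 0x0547 the list includes bit 8 (SERR# enable), which is the command register bit the workaround clears.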

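To work around this crash, PCIe fatal error reporting must be disabled on the switch port or root port directly upstream of the FPGA. Specifically, two bits must be cleared: the SERR# enable bit in the command register (bit 8, mask 0x0100) and the fatal error reporting enable bit in the device control register of the PCIe capability (bit 2, mask 0x0004). The following script performs these operations on the port upstream of the specified PCIe device. It takes the device's PCI address as its only argument, with or without the 0000: domain prefix, and must be run as root because it writes to configuration space with setpci.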

#!/bin/bash
 
dev=$1
 
if [ -z "$dev" ]; then
    echo "Error: no device specified"
    exit 1
fi
 
if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    dev="0000:$dev"
fi
 
if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
    echo "Error: device $dev not found"
    exit 1
fi
 
# the parent directory in the device's sysfs path is the upstream port
port=$(basename "$(dirname "$(readlink "/sys/bus/pci/devices/$dev")")")
 
if [ ! -e "/sys/bus/pci/devices/$port" ]; then
    echo "Error: device $port not found"
    exit 1
fi
 
echo "Disabling fatal error reporting on port $port..."
 
cmd=$(setpci -s "$port" COMMAND)

echo "Command: $cmd"

# clear SERR# enable (bit 8) in the command register
setpci -s "$port" COMMAND=$(printf "%04x" $((0x$cmd & ~0x0100)))

ctrl=$(setpci -s "$port" CAP_EXP+8.w)

echo "Device control: $ctrl"

# clear fatal error reporting enable (bit 2) in the device control register
setpci -s "$port" CAP_EXP+8.w=$(printf "%04x" $((0x$ctrl & ~0x0004)))
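
The masking arithmetic used by the script can be tried without touching any hardware. The sketch below is an illustration only: the helper names clear_bits and bit_is_set are hypothetical, and the commented-out check assumes the upstream port is 0000:3a:00.0, the root port reported in the log above.

```bash
#!/bin/bash
# clear_bits HEX MASK: print HEX with the MASK bits cleared, as 4 hex digits.
# HEX is given without a 0x prefix (the form setpci prints); MASK uses 0x.
clear_bits() {
    printf "%04x\n" $(( 16#$1 & ~$2 & 0xffff ))
}

# bit_is_set HEX MASK: succeed (exit status 0) if any MASK bit is set in HEX.
bit_is_set() {
    (( 16#$1 & $2 ))
}

# command register value from the kernel log, with SERR# enable cleared
clear_bits 0547 0x0100   # prints 0447

# Hypothetical check after running the script (needs root and setpci):
# port=0000:3a:00.0
# bit_is_set "$(setpci -s "$port" COMMAND)" 0x0100 && echo "SERR# still enabled"
# bit_is_set "$(setpci -s "$port" CAP_EXP+8.w)" 0x0004 && echo "fatal error reporting still enabled"
```

Registers written with setpci return to their firmware defaults after a reboot, so the script has to be run again before the next reconfiguration.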