add idl4k kernel firmware version 1.13.0.105

Jaroslav Kysela
2015-03-26 17:22:37 +01:00
parent 5194d2792e
commit e9070cdc77
31064 changed files with 12769984 additions and 0 deletions

@@ -0,0 +1,14 @@
00-INDEX
- this file
MSI-HOWTO.txt
- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
PCI-DMA-mapping.txt
- info for PCI drivers using DMA portably across all platforms
PCIEBUS-HOWTO.txt
- a guide describing the PCI Express Port Bus driver
pci-error-recovery.txt
- info on PCI error recovery
pci.txt
- info on the PCI subsystem for device driver authors
pcieaer-howto.txt
- the PCI Express Advanced Error Reporting Driver Guide HOWTO

@@ -0,0 +1,359 @@
The MSI Driver Guide HOWTO
Tom L Nguyen tom.l.nguyen@intel.com
10/03/2003
Revised Feb 12, 2004 by Martine Silbermann
email: Martine.Silbermann@hp.com
Revised Jun 25, 2004 by Tom L Nguyen
Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
Copyright 2003, 2008 Intel Corporation
1. About this guide
This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how
to change your driver to use MSI or MSI-X and some basic diagnostics to
try if a device doesn't support MSIs.
2. What are MSIs?
A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.
The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
capability was also introduced with PCI 3.0. It supports more interrupts
per device than MSI and allows interrupts to be independently configured.
Devices may support both MSI and MSI-X, but only one can be enabled at
a time.
3. Why use MSIs?
There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.
Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to reduced performance for the system as
a whole. MSIs are never shared, so this problem cannot arise.
When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges). In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt. PCI transaction ordering rules require that all the data
arrives in memory before the value can be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.
PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case. With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose. One possible design gives
infrequent conditions (such as errors) their own interrupt which allows
the driver to handle the normal interrupt handling path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.
4. How to use MSIs
PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts.
4.1 Include kernel support for MSIs
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures,
and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option.
4.2 Using MSI
Most of the hard work is done for the driver in the PCI layer. It simply
has to request that the PCI layer set up the MSI capability for this
device.
4.2.1 pci_enable_msi
int pci_enable_msi(struct pci_dev *dev)
A successful call will allocate ONE interrupt to the device, regardless
of how many MSIs the device supports. The device will be switched from
pin-based interrupt mode to MSI mode. The dev->irq number is changed
to a new number which represents the message signaled interrupt.
This function should be called before the driver calls request_irq()
since enabling MSIs disables the pin-based IRQ and the driver will not
receive interrupts on the old interrupt.
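A minimal sketch of this ordering, assuming a hypothetical foo driver with a
foo_interrupt() handler and a foo_adapter private structure (names not taken
from this document), might look like:

static int foo_setup_irq(struct pci_dev *pdev, struct foo_adapter *adapter)
{
        int err;

        /* Try to switch from pin-based interrupts to MSI; on failure the
         * device simply keeps using its pin-based IRQ. */
        if (pci_enable_msi(pdev))
                dev_info(&pdev->dev, "MSI unavailable, using pin-based IRQ\n");

        /* pdev->irq now holds the MSI vector (or the original pin IRQ). */
        err = request_irq(pdev->irq, foo_interrupt, 0, "foo", adapter);
        if (err)
                pci_disable_msi(pdev);
        return err;
}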
4.2.2 pci_enable_msi_block
int pci_enable_msi_block(struct pci_dev *dev, int count)
This variation on the above call allows a device driver to request multiple
MSIs. The MSI specification only allows interrupts to be allocated in
powers of two, up to a maximum of 2^5 (32).
If this function returns 0, it has succeeded in allocating at least as many
interrupts as the driver requested (it may have allocated more in order
to satisfy the power-of-two requirement). In this case, the function
enables MSI on this device and updates dev->irq to be the lowest of
the new interrupts assigned to it. The other interrupts assigned to
the device are in the range dev->irq to dev->irq + count - 1.
If this function returns a negative number, it indicates an error and
the driver should not attempt to request any more MSI interrupts for
this device. If this function returns a positive number, it will be
less than 'count' and indicate the number of interrupts that could have
been allocated. In neither case will the irq value have been
updated, nor will the device have been switched into MSI mode.
The device driver must decide what action to take if
pci_enable_msi_block() returns a value less than the number asked for.
Some devices can make use of fewer interrupts than the maximum they
request; in this case the driver should call pci_enable_msi_block()
again. Note that it is not guaranteed to succeed, even when the
'count' has been reduced to the value returned from a previous call to
pci_enable_msi_block(). This is because there are multiple constraints
on the number of vectors that can be allocated; pci_enable_msi_block()
will return as soon as it finds any constraint that doesn't allow the
call to succeed.
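A sketch of such a retry loop for the hypothetical foo driver (a minimum
usable count of one vector is assumed here) could look like:

static int foo_enable_msi_block(struct pci_dev *pdev, int nvec)
{
        int rc;

        while (nvec >= 1) {
                rc = pci_enable_msi_block(pdev, nvec);
                if (rc == 0)
                        return nvec;    /* pdev->irq .. pdev->irq + nvec - 1 */
                if (rc < 0)
                        return rc;      /* hard error, give up */
                nvec = rc;              /* retry with the suggested count */
        }
        return -ENOSPC;
}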
4.2.3 pci_disable_msi
void pci_disable_msi(struct pci_dev *dev)
This function should be used to undo the effect of pci_enable_msi() or
pci_enable_msi_block(). Calling it restores dev->irq to the pin-based
interrupt number and frees the previously allocated message signaled
interrupt(s). The interrupt may subsequently be assigned to another
device, so drivers should not cache the value of dev->irq.
A device driver must always call free_irq() on the interrupt(s)
for which it has called request_irq() before calling this function.
Failure to do so will result in a BUG_ON(); the device will be left with
MSI enabled and will leak its vector.
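The required ordering, sketched for the same hypothetical foo driver:

static void foo_teardown_irq(struct pci_dev *pdev, struct foo_adapter *adapter)
{
        free_irq(pdev->irq, adapter);   /* must come before pci_disable_msi() */
        pci_disable_msi(pdev);          /* pdev->irq reverts to the pin-based IRQ */
}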
4.3 Using MSI-X
The MSI-X capability is much more flexible than the MSI capability.
It supports up to 2048 interrupts, each of which can be controlled
independently. To support this flexibility, drivers must use an array of
`struct msix_entry':
struct msix_entry {
        u16     vector; /* kernel uses to write allocated vector */
        u16     entry;  /* driver uses to specify entry */
};
This allows the device to use these interrupts in a sparse fashion;
for example it could use interrupts 3 and 1027 and allocate only a
two-element array. The driver is expected to fill in the 'entry' value
in each element of the array to indicate which entries it wants the kernel
to assign interrupts for. It is invalid to fill in two entries with the
same number.
4.3.1 pci_enable_msix
int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
Calling this function asks the PCI subsystem to allocate 'nvec' MSIs.
The 'entries' argument is a pointer to an array of msix_entry structs
which should be at least 'nvec' entries in size. On success, the
function will return 0 and the device will have been switched into
MSI-X interrupt mode. The 'vector' elements in each entry will have
been filled in with the interrupt number. The driver should then call
request_irq() for each 'vector' that it decides to use.
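As a rough sketch (hypothetical foo driver, FOO_NVEC vectors, and a
msix_entries array embedded in the adapter structure), the fill/enable/request
sequence could look like:

static int foo_setup_msix(struct foo_adapter *adapter)
{
        int i, err;

        /* Tell the kernel which table entries we want vectors assigned for. */
        for (i = 0; i < FOO_NVEC; i++)
                adapter->msix_entries[i].entry = i;

        err = pci_enable_msix(adapter->pdev, adapter->msix_entries, FOO_NVEC);
        if (err)
                return err;     /* negative error, or positive maximum available */

        /* Request each vector the PCI layer filled in for us. */
        for (i = 0; i < FOO_NVEC; i++) {
                err = request_irq(adapter->msix_entries[i].vector,
                                  foo_msix_interrupt, 0, "foo", adapter);
                if (err)
                        goto undo;
        }
        return 0;

undo:
        while (--i >= 0)
                free_irq(adapter->msix_entries[i].vector, adapter);
        pci_disable_msix(adapter->pdev);
        return err;
}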
If this function returns a negative number, it indicates an error and
the driver should not attempt to allocate any more MSI-X interrupts for
this device. If it returns a positive number, it indicates the maximum
number of interrupt vectors that could have been allocated. See example
below.
This function, in contrast with pci_enable_msi(), does not adjust
dev->irq. The device will not generate interrupts for this interrupt
number once MSI-X is enabled. The device driver is responsible for
keeping track of the interrupts assigned to the MSI-X vectors so it can
free them again later.
Device drivers should normally call this function once per device
during the initialization phase.
It is ideal if drivers can cope with a variable number of MSI-X interrupts;
there are many reasons why the platform may not be able to provide the
exact number a driver asks for.
A request loop to achieve that might look like:
static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
{
        int rc;

        while (nvec >= FOO_DRIVER_MINIMUM_NVEC) {
                rc = pci_enable_msix(adapter->pdev,
                                     adapter->msix_entries, nvec);
                if (rc > 0)
                        nvec = rc;
                else
                        return rc;
        }

        return -ENOSPC;
}
4.3.2 pci_disable_msix
void pci_disable_msix(struct pci_dev *dev)
This API should be used to undo the effect of pci_enable_msix(). It frees
the previously allocated message signaled interrupts. The interrupts may
subsequently be assigned to another device, so drivers should not cache
the value of the 'vector' elements over a call to pci_disable_msix().
A device driver must always call free_irq() on the interrupt(s)
for which it has called request_irq() before calling this function.
Failure to do so will result in a BUG_ON(); the device will be left with
MSI-X enabled and will leak its vectors.
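Continuing the hypothetical foo example, the teardown order is the mirror
image of the setup:

static void foo_teardown_msix(struct foo_adapter *adapter)
{
        int i;

        /* Free every vector we requested, then disable MSI-X. */
        for (i = 0; i < FOO_NVEC; i++)
                free_irq(adapter->msix_entries[i].vector, adapter);
        pci_disable_msix(adapter->pdev);
}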
4.3.3 The MSI-X Table
The MSI-X capability specifies a BAR and offset within that BAR for the
MSI-X Table. This address is mapped by the PCI subsystem, and should not
be accessed directly by the device driver. If the driver wishes to
mask or unmask an interrupt, it should call disable_irq() / enable_irq().
4.4 Handling devices implementing both MSI and MSI-X capabilities
If a device implements both MSI and MSI-X capabilities, it can
run in either MSI mode or MSI-X mode but not both simultaneously.
This is a requirement of the PCI spec, and it is enforced by the
PCI layer. Calling pci_enable_msi() when MSI-X is already enabled or
pci_enable_msix() when MSI is already enabled will result in an error.
If a device driver wishes to switch between MSI and MSI-X at runtime,
it must first quiesce the device, then switch it back to pin-interrupt
mode, before calling pci_enable_msi() or pci_enable_msix() and resuming
operation. This is not expected to be a common operation but may be
useful for debugging or testing during development.
4.5 Considerations when using MSIs
4.5.1 Choosing between MSI-X and MSI
If your device supports both MSI-X and MSI capabilities, you should use
the MSI-X facilities in preference to the MSI facilities. As mentioned
above, MSI-X supports any number of interrupts between 1 and 2048.
In contrast, MSI is restricted to a maximum of 32 interrupts (and
must be a power of two). In addition, the MSI interrupt vectors must
be allocated consecutively, so the system may not be able to allocate
as many vectors for MSI as it could for MSI-X. On some platforms, MSI
interrupts must all be targeted at the same set of CPUs whereas MSI-X
interrupts can all be targeted at different CPUs.
4.5.2 Spinlocks
Most device drivers have a per-device spinlock which is taken in the
interrupt handler. With pin-based interrupts or a single MSI, it is not
necessary to disable interrupts (Linux guarantees the same interrupt will
not be re-entered). If a device uses multiple interrupts, the driver
must disable interrupts while the lock is held. If the device sends
a different interrupt, the driver will deadlock trying to recursively
acquire the spinlock.
There are two solutions. The first is to take the lock with
spin_lock_irqsave() or spin_lock_irq() (see
Documentation/DocBook/kernel-locking). The second is to specify
IRQF_DISABLED to request_irq() so that the kernel runs the entire
interrupt routine with interrupts disabled.
If your MSI interrupt routine does not hold the lock for the whole time
it is running, the first solution may be best. The second solution is
normally preferred as it avoids making two transitions from interrupt
disabled to enabled and back again.
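For illustration, a handler for one of several MSI-X vectors taking a
per-device lock with interrupts disabled (hypothetical foo driver) might
look like:

static irqreturn_t foo_msix_interrupt(int irq, void *data)
{
        struct foo_adapter *adapter = data;
        unsigned long flags;

        /* Disabling interrupts here prevents a deadlock if another of the
         * device's vectors fires on this CPU while the lock is held. */
        spin_lock_irqsave(&adapter->lock, flags);
        /* ... service the event associated with this vector ... */
        spin_unlock_irqrestore(&adapter->lock, flags);
        return IRQ_HANDLED;
}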
4.6 How to tell whether MSI/MSI-X is enabled on a device
Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
has an 'Enable' flag which will be followed with either "+" (enabled)
or "-" (disabled).
5. MSI quirks
Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:
1. globally
2. on all devices behind a specific bridge
3. on a single device
5.1. Disabling MSIs globally
Some host chipsets simply don't support MSIs properly. If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table. In this case, Linux will automatically disable MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves. The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.
If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices. It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.
5.2. Disabling MSIs below a bridge
Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.
Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000). As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge which Linux doesn't yet know about, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing:
echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
where $bridge is the PCI address of the bridge you've enabled (eg
0000:00:0e.0).
To disable MSIs, echo 0 instead of 1. Changing this value should be
done with caution as it can break interrupt handling for all devices
below this bridge.
Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.
5.3. Disabling MSIs on a single device
Some devices are known to have faulty MSI implementations. Usually this
is handled in the individual device driver but occasionally it's necessary
to handle this with a quirk. Some drivers have an option to disable use
of MSI. While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.
5.4. Finding why MSIs are disabled on a device
From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device. Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine. You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.
Then, 'lspci -t' gives the list of bridges above a device. Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
or disabled (0). If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.
It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_enable_msi(), pci_enable_msix() or
pci_enable_msi_block().

@@ -0,0 +1,217 @@
The PCI Express Port Bus Driver Guide HOWTO
Tom L Nguyen tom.l.nguyen@intel.com
11/03/2004
1. About this guide
This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.
2. Copyright 2004 Intel Corporation
3. What is the PCI Express Port Bus Driver
A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port. The switch's Downstream Port bridges from the
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.
A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.
4. Why use the PCI Express Port Bus Driver?
In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.
Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:
- Allow multiple service drivers to run simultaneously on
a PCI-PCI Bridge Port device.
- Allow service drivers to be implemented in an independent,
staged approach.
- Allow one service driver to run on multiple PCI-PCI Bridge
Port devices.
- Manage and distribute resources of a PCI-PCI Bridge Port
device to requested service drivers.
5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
5.1 Including the PCI Express Port Bus Driver Support into the Kernel
Including the PCI Express Port Bus driver depends on whether the PCI
Express support is included in the kernel config. The kernel will
automatically include the PCI Express Port Bus driver as a kernel
driver when the PCI Express support is enabled in the kernel.
5.2 Enabling Service Driver Support
PCI device drivers are implemented based on the Linux Device Driver Model.
All service drivers are PCI device drivers. As discussed above, it is
impossible to load any service driver once the kernel has loaded the
PCI Express Port Bus Driver. Meeting the PCI Express Port Bus Driver
Model requires some minimal changes to existing service drivers; these
changes have no impact on the functionality of existing service drivers.
A service driver is required to use the two APIs shown below to
register its service with the PCI Express Port Bus driver (see
section 5.2.1 & 5.2.2). It is important that a service driver
initializes the pcie_port_service_driver data structure, included in
header file /include/linux/pcieport_if.h, before calling these APIs.
Failure to do so will result in an identity mismatch, which prevents
the PCI Express Port Bus driver from loading a service driver.
5.2.1 pcie_port_service_register
int pcie_port_service_register(struct pcie_port_service_driver *new)
This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init. Note that after the service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
necessary since these calls are executed by the PCI Port Bus driver.
5.2.2 pcie_port_service_unregister
void pcie_port_service_unregister(struct pcie_port_service_driver *new)
pcie_port_service_unregister replaces the Linux Driver Model's
pci_unregister_driver. It should always be called by the service driver
when the module exits.
5.2.3 Sample Code
Below is sample service driver code to initialize the port service
driver data structure.
static struct pcie_port_service_id service_id[] = { {
        .vendor = PCI_ANY_ID,
        .device = PCI_ANY_ID,
        .port_type = PCIE_RC_PORT,
        .service_type = PCIE_PORT_SERVICE_AER,
        }, { /* end: all zeroes */ }
};

static struct pcie_port_service_driver root_aerdrv = {
        .name           = (char *)device_name,
        .id_table       = &service_id[0],
        .probe          = aerdrv_load,
        .remove         = aerdrv_unload,
        .suspend        = aerdrv_suspend,
        .resume         = aerdrv_resume,
};
Below is a sample code for registering/unregistering a service
driver.
static int __init aerdrv_service_init(void)
{
        int retval = 0;

        retval = pcie_port_service_register(&root_aerdrv);
        if (!retval) {
                /*
                 * FIX ME
                 */
        }
        return retval;
}

static void __exit aerdrv_service_exit(void)
{
        pcie_port_service_unregister(&root_aerdrv);
}

module_init(aerdrv_service_init);
module_exit(aerdrv_service_exit);
6. Possible Resource Conflicts
Since all service drivers of a PCI-PCI Bridge Port device are
allowed to run simultaneously, a few possible resource conflicts and
their proposed solutions are listed below.
6.1 MSI Vector Resource
The MSI capability structure enables a device software driver to call
pci_enable_msi to request MSI based interrupts. Once MSI interrupts
are enabled on a device, the device stays in this mode until a device
driver calls pci_disable_msi to disable MSI interrupts and revert to
INTx emulation mode. Since service drivers of the same PCI-PCI Bridge
port share the same physical device, if an individual service driver
calls pci_enable_msi/pci_disable_msi it may result in unpredictable
behavior. For example, two service drivers run simultaneously on the
same physical Root Port. Both service drivers call pci_enable_msi to
request MSI based interrupts. A service driver may not know whether
any other service drivers have run on this Root Port. If either one
of them calls pci_disable_msi, it puts the other service driver
in a wrong interrupt mode.
To avoid this situation, service drivers are not permitted to switch
the interrupt mode on their device. The PCI Express Port Bus driver
is responsible for determining the interrupt mode and this should be
transparent to service drivers. Service drivers need to know only
the vector IRQ assigned to the field irq of struct pcie_device, which
is passed in when the PCI Express Port Bus driver probes each service
driver. Service drivers should use (struct pcie_device*)dev->irq to
call request_irq/free_irq. In addition, the interrupt mode is stored
in the field interrupt_mode of struct pcie_device.
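For example, a service driver might request its interrupt like this (a
sketch; aer_irq_handler is a placeholder, not code from this document):

static int aerdrv_request_irq(struct pcie_device *dev)
{
        /* dev->irq was assigned by the PCI Express Port Bus driver; the
         * service driver must not call pci_enable_msi()/pci_enable_msix()
         * itself. */
        return request_irq(dev->irq, aer_irq_handler, IRQF_SHARED,
                           "aerdrv", dev);
}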
6.2 MSI-X Vector Resources
Similar to MSI, a device driver for an MSI-X capable device can
call pci_enable_msix to request MSI-X interrupts. Service drivers
are not permitted to switch the interrupt mode on their device. The PCI
Express Port Bus driver is responsible for determining the interrupt
mode and this should be transparent to service drivers. Any attempt
by a service driver to call pci_enable_msix/pci_disable_msix may
result in unpredictable behavior. Service drivers should use
(struct pcie_device*)dev->irq and call request_irq/free_irq.
6.3 PCI Memory/IO Mapped Regions
Service drivers for PCI Express Power Management (PME), Advanced
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
PCI configuration space on the PCI Express port. In all cases the
registers accessed are independent of each other. This patch assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.
6.4 PCI Config Registers
Each service driver runs its PCI config operations on its own
capability structure except the PCI Express capability structure, in
which Root Control register and Device Control register are shared
between PME and AER. This patch assumes that all service drivers
will be well behaved and not overwrite other service drivers'
configuration settings.

@@ -0,0 +1,431 @@
PCI Error Recovery
------------------
February 2, 2006
Current document maintainer:
Linas Vepstas <linasvepstas@gmail.com>
updated by Richard Lary <rlary@us.ibm.com>
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
busses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses. Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition. The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.
Reporting and recovery is performed in several steps. First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards. This allows device drivers to avoid deadlocking in spinloops,
waiting for some i/o-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.
Next, recovery is performed in several stages. Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.
If any driver requests a slot reset, that is what will be done.
After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.
The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system. If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery. Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device. By contrast,
bus errors are easy to manage in the device driver. Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.
Detailed Design
---------------
Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.
The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add.
This structure has the form:
struct pci_error_handlers
{
        int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*link_reset)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
        void (*resume)(struct pci_dev *dev);
};
The possible channel states are:
enum pci_channel_state {
        pci_channel_io_normal,          /* I/O channel is in normal state */
        pci_channel_io_frozen,          /* I/O to channel is blocked */
        pci_channel_io_perm_failure,    /* PCI card is dead */
};
Possible return values are:
enum pci_ers_result {
        PCI_ERS_RESULT_NONE,            /* no result/none/not supported in device driver */
        PCI_ERS_RESULT_CAN_RECOVER,     /* Device driver can recover without slot reset */
        PCI_ERS_RESULT_NEED_RESET,      /* Device driver wants slot to be reset. */
        PCI_ERS_RESULT_DISCONNECT,      /* Device has completely failed, is unrecoverable */
        PCI_ERS_RESULT_RECOVERED,       /* Device driver is fully recovered and operational */
};
A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset. If link_reset() is not implemented, the card is assumed to
not care about link resets. Typically a driver will want to know about
a slot_reset().
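A sketch of how a driver might wire these callbacks up (hypothetical foo_*
functions; the err_handler field of struct pci_driver points at the
structure):

static struct pci_error_handlers foo_err_handlers = {
        .error_detected = foo_error_detected,
        .mmio_enabled   = foo_mmio_enabled,
        .slot_reset     = foo_slot_reset,
        .resume         = foo_resume,
};

static struct pci_driver foo_driver = {
        /* ... the usual fields (name, id_table, probe, remove) ... */
        .err_handler    = &foo_err_handlers,
};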
The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.
STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.
STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.
At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc). The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
to complete; it can take semaphores, schedule, etc... everything but
touch the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See note about interrupts at the end of this doc.
All drivers participating in this system must implement this call.
The driver must return one of the following result codes:
- PCI_ERS_RESULT_CAN_RECOVER:
Driver returns this if it thinks it might be able to recover
the HW by just banging IOs or if it wants to be given
a chance to extract some diagnostic information (see
mmio_enabled, below).
- PCI_ERS_RESULT_NEED_RESET:
Driver returns this if it can't recover without a
slot reset.
- PCI_ERS_RESULT_DISCONNECT:
Driver returns this if it doesn't want to recover at all.
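Putting the rules above together, a minimal error_detected() implementation
for a hypothetical foo driver might look like the sketch below (it uses the
pci_ers_result_t return type found in current kernels; foo_stop_io() is a
placeholder for whatever the driver does to stop issuing new I/O):

static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
                                           enum pci_channel_state state)
{
        if (state == pci_channel_io_perm_failure)
                return PCI_ERS_RESULT_DISCONNECT;

        /* Quiesce: stop queueing new I/O, but do not touch the device. */
        foo_stop_io(pci_get_drvdata(pdev));
        return PCI_ERS_RESULT_NEED_RESET;
}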
The next step taken will depend on the result codes returned by the
drivers.
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enable).
If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).
If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).
>>> The current powerpc implementation assumes that a device driver will
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.
>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If more than
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
>>> assumes that the device driver has gone into an infinite loop
>>> and prints an error to syslog. A reboot is then required to
>>> get the device working again.
STEP 2: MMIO Enabled
-------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.
This is the "early recovery" call. IOs are allowed again, but DMA is
not, with some restrictions. This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations. This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via
>>> the normal pci_check_whatever() API, no new error_detected()
>>> callback will be issued due to an error happening here. However,
>>> such an error might cause IOs to be re-blocked for the whole
>>> segment, and thus invalidate the recovery that other devices
>>> on the same segment might have done, forcing the whole segment
>>> into one of the next states, that is, link reset or slot reset.
The driver should return one of the following result codes:
- PCI_ERS_RESULT_RECOVERED
Driver returns this if it thinks the device is fully
functional and thinks it is ready to start
normal driver operations again. There is no
guarantee that the driver will actually be
allowed to proceed, as another driver on the
same segment might have failed and thus triggered a
slot reset on platforms that support it.
- PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not
recoverable in its current state and it needs a slot
reset to proceed.
- PCI_ERS_RESULT_DISCONNECT
Same as above. Total failure; no recovery is possible even after
a reset, and the driver is dead. (To be defined more precisely.)
The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).
STEP 3: Link Reset
------------------
The platform resets the link, and then calls the link_reset() callback
on all affected device drivers. This is a PCI-Express specific state
and is done whenever a non-fatal error has been detected that can be
"solved" by resetting the link. This call informs the driver of the
reset and the driver should check to see if the device appears to be
in working condition.
The driver is not supposed to restart normal driver I/O operations
at this point. It should limit itself to "probing" the device to
check its recoverability status. If all is right, then the platform
will call resume() once all drivers have ack'd link_reset().
Result codes:
(identical to STEP 2 (MMIO Enabled))
The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
(Resume Operations).
>>> The current powerpc implementation does not implement this callback.
STEP 4: Slot Reset
------------------
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent. Upon completion of slot reset, the
platform will call the device slot_reset() callback.
Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental reset (optional).
Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BARs and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.
Powerpc fundamental reset is supported by PCI Express cards only
and causes the device's state machines, hardware logic, port states and
configuration registers to be initialized to their default conditions.
For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express PCI devices for which a soft reset is not sufficient
for recovery.
If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.
It is important for the platform to restore the PCI config space
to the "fresh poweron" state, rather than the "last state". After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.
This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
that the card is in a fresh state and is fully functional. The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
will also be available.
Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.
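A sketch of a slot_reset() callback along these lines (hypothetical foo
driver; foo_init_hw() stands in for the driver's normal hardware
initialization, including any firmware re-download):

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
        struct foo_adapter *adapter = pci_get_drvdata(pdev);

        if (pci_enable_device(pdev))
                return PCI_ERS_RESULT_DISCONNECT;
        pci_set_master(pdev);

        /* Re-initialize from a fresh power-on state; do not restart I/O. */
        if (foo_init_hw(adapter))
                return PCI_ERS_RESULT_DISCONNECT;

        return PCI_ERS_RESULT_RECOVERED;
}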
A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.
Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0:
+ if (PCI_FUNC(pdev->devfn) == 0)
+ sym_reset_scsi_bus(np, 0);
Result codes:
- PCI_ERS_RESULT_DISCONNECT
Same as above.
Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types:
+ /* Set EEH reset type to fundamental if required by hba */
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
+ pdev->needs_freset = 1;
+
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).
>>> The current powerpc implementation does not try a power-cycle
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.
STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running. This callback does not return
a result code.
At this point, if a new error happens, the platform will restart
a new error recovery sequence.
STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.
The device driver should, at this point, assume the worst. It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers. The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.
The platform will typically notify the system operator of the
permanent failure in some way. If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMA's to
wild addresses or bogus split transactions due to programming
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.
Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.
Now, a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:
- There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
slot_reset callback is called, at which point interrupts are expected
to be fully operational.
- There is no guarantee that interrupt delivery is stopped, that is,
a driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NONE. It's up to the platform to deal with that
condition, typically by masking the IRQ source during the duration of
the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal
with temporarily disabling that IRQ number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices
anyway :)
>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
>>> As of this writing, there is a growing list of device drivers with
>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet. These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/net/bnx2.c
>>> drivers/net/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge
The End
-------

@@ -0,0 +1,99 @@
PCI Express I/O Virtualization Howto
Copyright (C) 2009 Intel Corporation
Yu Zhao <yu.zhao@intel.com>
1. Overview
1.1 What is SR-IOV
Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
capability which makes one physical device appear as multiple virtual
devices. The physical device is referred to as Physical Function (PF)
while the virtual devices are referred to as Virtual Functions (VF).
Allocation of the VF can be dynamically controlled by the PF via
registers encapsulated in the capability. By default, this feature is
not enabled and the PF behaves as a traditional PCIe device. Once it's
turned on, each VF's PCI configuration space can be accessed by its own
Bus, Device and Function Number (Routing ID). And each VF also has PCI
Memory Space, which is used to map its register set. The VF device
driver operates on the register set so it can be functional and appear
as a real existing PCI device.
2. User Guide
2.1 How can I enable SR-IOV capability
The device driver (PF driver) will control the enabling and disabling
of the capability via the API provided by the SR-IOV core. If the hardware
has SR-IOV capability, loading its PF driver would enable it and all
VFs associated with the PF.
2.2 How can I use the Virtual Functions
VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF
requires a device driver in the same way as a normal PCI device does.
3. Developer Guide
3.1 SR-IOV API
To enable SR-IOV capability:
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
'nr_virtfn' is the number of VFs to be enabled.
To disable SR-IOV capability:
void pci_disable_sriov(struct pci_dev *dev);
To notify SR-IOV core of Virtual Function Migration:
irqreturn_t pci_sriov_migration(struct pci_dev *dev);
3.2 Usage example
The following piece of code illustrates the usage of the SR-IOV API.
static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
        pci_enable_sriov(dev, NR_VIRTFN);

        ...

        return 0;
}

static void __devexit dev_remove(struct pci_dev *dev)
{
        pci_disable_sriov(dev);

        ...
}

static int dev_suspend(struct pci_dev *dev, pm_message_t state)
{
        ...

        return 0;
}

static int dev_resume(struct pci_dev *dev)
{
        ...

        return 0;
}

static void dev_shutdown(struct pci_dev *dev)
{
        ...
}

static struct pci_driver dev_driver = {
        .name           = "SR-IOV Physical Function driver",
        .id_table       = dev_id_table,
        .probe          = dev_probe,
        .remove         = __devexit_p(dev_remove),
        .suspend        = dev_suspend,
        .resume         = dev_resume,
        .shutdown       = dev_shutdown,
};

@@ -0,0 +1,651 @@
How To Write Linux PCI Drivers
by Martin Mares <mj@ucw.cz> on 07-Feb-2000
updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.
A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
http://lwn.net/Kernel/LDD3/
However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.
Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
0. Structure of PCI drivers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.
pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.
Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:
Enable the device
Request MMIO/IOP resources
Set the DMA mask size (for both coherent and streaming DMA)
Allocate and initialize shared control data (pci_allocate_coherent())
Access device configuration space (if needed)
Register IRQ handler (request_irq())
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
Enable DMA/processing engines
When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:
Disable the device from generating IRQs
Release the IRQ (free_irq())
Stop all DMA activity
Release DMA buffers (both streaming and coherent)
Unregister from other subsystems (e.g. scsi or netdev)
Release MMIO/IOP resources
Disable the device
Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h> .
If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions either
completely empty or just returning an appropriate error code to avoid
lots of ifdefs in the drivers.
1. pci_register_driver() call
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI device drivers call pci_register_driver() during their
initialization with a pointer to a structure describing the driver
(struct pci_driver):
field name Description
---------- ------------------------------------------------------
id_table Pointer to table of device IDs the driver is
interested in. Most drivers should export this
table using MODULE_DEVICE_TABLE(pci,...).
probe This probing function gets called (during execution
of pci_register_driver() for already existing
devices or later if a new device gets inserted) for
all PCI devices which match the ID table and are not
"owned" by the other drivers yet. This function gets
passed a "struct pci_dev *" for each device whose
entry in the ID table matches the device. The probe
function returns zero when the driver chooses to
take "ownership" of the device or an error code
(negative number) otherwise.
The probe function always gets called from process
context, so it can sleep.
remove The remove() function gets called whenever a device
being handled by this driver is removed (either during
deregistration of the driver or when it's manually
pulled out of a hot-pluggable slot).
The remove function always gets called from process
context, so it can sleep.
suspend Put device into low power state.
suspend_late Put device into low power state.
resume_early Wake device from low power state.
resume Wake device from low power state.
(Please see Documentation/power/pci.txt for descriptions
of PCI Power Management and the related functions.)
shutdown Hook into reboot_notifier_list (kernel/sys.c).
Intended to stop any idling DMA operations.
Useful for enabling wake-on-lan (NIC) or changing
the power state of a device before reboot.
e.g. drivers/net/e100.c.
err_handler See Documentation/PCI/pci-error-recovery.txt
The ID table is an array of struct pci_device_id entries ending with an
all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred
method of declaring the table. Each entry consists of:
vendor,device Vendor and device ID to match (or PCI_ANY_ID)
subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID)
subdevice,
class Device class, subclass, and "interface" to match.
See Appendix D of the PCI Local Bus Spec or
include/linux/pci_ids.h for a full list of classes.
Most drivers do not need to specify class/class_mask
as vendor/device is normally sufficient.
class_mask limits which sub-fields of the class field are compared.
See drivers/scsi/sym53c8xx_2/ for example of usage.
driver_data Data private to the driver.
Most drivers don't need to use driver_data field.
Best practice is to use driver_data as an index
into a static list of equivalent device types,
instead of using it as a pointer.
Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
a pci_device_id table.
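A sketch of such a table for a hypothetical foo device (FOO_VENDOR_ID and
FOO_DEVICE_ID are placeholders):

static DEFINE_PCI_DEVICE_TABLE(foo_pci_tbl) = {
        { PCI_DEVICE(FOO_VENDOR_ID, FOO_DEVICE_ID) },
        { }     /* terminating all-zero entry */
};
MODULE_DEVICE_TABLE(pci, foo_pci_tbl);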
New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below:
echo "vendor device subvendor subdevice class class_mask driver_data" > \
/sys/bus/pci/drivers/{driver}/new_id
All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional. Users
need to pass only as many optional fields as necessary:
o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
o class and classmask fields default to 0
o driver_data defaults to 0UL.
Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.
Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.
When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
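Tying the fields described above together, a hypothetical foo driver might
register itself like this (a sketch, not a complete driver):

static struct pci_driver foo_driver = {
        .name           = "foo",
        .id_table       = foo_pci_tbl,
        .probe          = foo_probe,
        .remove         = __devexit_p(foo_remove),
};

static int __init foo_init(void)
{
        return pci_register_driver(&foo_driver);
}

static void __exit foo_exit(void)
{
        pci_unregister_driver(&foo_driver);
}

module_init(foo_init);
module_exit(foo_exit);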
1.1 "Attributes" for driver functions/data
Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):
__init Initialization code. Thrown away after the driver
initializes.
__exit Exit code. Ignored for non-modular drivers.
__devinit Device initialization code.
Identical to __init if the kernel is not compiled
with CONFIG_HOTPLUG, normal function otherwise.
__devexit The same for __exit.
Tips on when/where to use the above attributes:
o The module_init()/module_exit() functions (and all
initialization functions called _only_ from these)
should be marked __init/__exit.
o Do not mark the struct pci_driver.
o The ID table array should be marked __devinitconst; this is done
automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE().
o The probe() and remove() functions should be marked __devinit
and __devexit respectively. All initialization functions
exclusively called by the probe() routine, can be marked __devinit.
Ditto for remove() and __devexit.
o If mydriver_remove() is marked with __devexit(), then all address
references to mydriver_remove must use __devexit_p(mydriver_remove)
(in the struct pci_driver declaration for example).
__devexit_p() will generate the function name _or_ NULL if the
function will be discarded. For an example, see drivers/net/tg3.c.
o Do NOT mark a function if you are not sure which mark to use.
Better to not mark the function than mark the function wrong.
2. How to find PCI devices manually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is because one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.
A manual search may be performed using the following constructs:
Searching by vendor and device ID:
struct pci_dev *dev = NULL;
while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL)
        configure_device(dev);
Searching by class ID (iterate in a similar way):
pci_get_class(CLASS_ID, dev)
Searching by both vendor/device and subsystem vendor/device ID:
pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).
You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
specific vendor, for example.
These functions are hotplug-safe. They increment the reference count on
the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().
3. Device Initialization Steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As noted in the introduction, most PCI drivers need the following steps
for device initialization:
Enable the device
Request MMIO/IOP resources
Set the DMA mask size (for both coherent and streaming DMA)
Allocate and initialize shared control data (pci_allocate_coherent())
Access device configuration space (if needed)
Register IRQ handler (request_irq())
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
Enable DMA/processing engines.
The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).
3.1 Enable the PCI device
~~~~~~~~~~~~~~~~~~~~~~~~~
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:
o wake up the device if it was in suspended state,
o allocate I/O and memory regions of the device (if BIOS did not),
o allocate an IRQ (if BIOS did not).
NOTE: pci_enable_device() can fail! Check the return value.
[ OS BUG: we don't check resource allocations before enabling those
resources. The sequence would make more sense if we called
pci_request_resources() before calling pci_enable_device().
Currently, the device drivers can't detect the bug when two
devices have been allocated the same range. This is not a common
problem and unlikely to get fixed soon.
This has been discussed before but not changed as of 2.6.19:
http://lkml.org/lkml/2006/3/2/194
]
pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS. pci_clear_master() will
disable DMA by clearing the bus master bit.
If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate. Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
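Putting the above together, a minimal probe() fragment might look like the
sketch below (mydriver_probe is a hypothetical name and error unwinding of
later steps is omitted):

    static int __devinit mydriver_probe(struct pci_dev *pdev,
                                        const struct pci_device_id *id)
    {
            int err;

            err = pci_enable_device(pdev);
            if (err)
                    return err;          /* enabling can fail! */

            pci_set_master(pdev);        /* enable bus mastering (DMA) */
            pci_try_set_mwi(pdev);       /* Mem-Wr-Inval: nice to have, not required */

            /* ... request regions, set DMA masks, map BARs, etc. ... */
            return 0;
    }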
3.2 Request MMIO/IOP resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.
See Documentation/IO-mapping.txt for how to access device registers
or device memory.
The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.
[ See OS BUG comment above. Currently (2.6.19), the driver can only
determine MMIO and IO Port resource availability _after_ calling
pci_enable_device(). ]
Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.
Also see pci_request_selected_regions() below.
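Continuing the hypothetical probe() fragment above, a hedged sketch of
claiming and mapping a memory BAR (BAR 0 here):

    void __iomem *regs;
    int err;

    err = pci_request_region(pdev, 0, "mydriver");   /* claim BAR 0 */
    if (err)
            return err;

    regs = ioremap(pci_resource_start(pdev, 0),      /* map BAR 0 (MMIO) */
                   pci_resource_len(pdev, 0));
    if (!regs) {
            pci_release_region(pdev, 0);
            return -ENOMEM;
    }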
3.3 Set the DMA mask size
~~~~~~~~~~~~~~~~~~~~~~~~~
[ If anything below doesn't make sense, please refer to
Documentation/DMA-API.txt. This section is just a reminder that
drivers need to indicate DMA capabilities of the device and is not
an authoritative source for DMA interfaces. ]
While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters. In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.
Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.
Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
3.4 Setup shared control data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory. See Documentation/DMA-API.txt for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.
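For instance, a hedged sketch of allocating a descriptor ring with the PCI
DMA API (RING_BYTES is a made-up size constant):

    dma_addr_t ring_dma;    /* bus address handed to the device */
    void *ring;             /* CPU virtual address */

    ring = pci_alloc_consistent(pdev, RING_BYTES, &ring_dma);
    if (!ring)
            return -ENOMEM;

    /* ... program ring_dma into the device, access ring from the CPU ... */

    pci_free_consistent(pdev, RING_BYTES, ring, ring_dma);   /* at teardown */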
3.5 Initialize device registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some drivers will need specific "capability" fields programmed
or other "vendor specific" register initialized or reset.
E.g. clearing pending interrupts.
3.6 Register IRQ handler
~~~~~~~~~~~~~~~~~~~~~~~~
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.
All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the dev_id argument to map IRQs to devices (remember that all PCI IRQ lines
can be shared).
request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".
request_irq() also enables the interrupt. Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.
MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.
MSI capability can be enabled by calling pci_enable_msi() or
pci_enable_msix() before calling request_irq(). This causes
the PCI support to program CPU vector data into the PCI device
capability registers.
If your PCI device supports both, try to enable MSI-X first.
Only one can be enabled at a time. Many architectures, chip-sets,
or BIOSes do NOT support MSI or MSI-X and the call to pci_enable_msi/msix
will fail. This is important to note since many drivers have
two (or more) interrupt handlers: one for MSI/MSI-X and another for IRQs.
They choose which handler to register with request_irq() based on the
return value from pci_enable_msi/msix().
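A hedged sketch of that pattern (mydriver_msi_irq(), mydriver_irq() and
mydev are hypothetical names):

    int err;

    if (pci_enable_msi(pdev) == 0) {
            /* exclusive MSI vector, no IRQF_SHARED needed */
            err = request_irq(pdev->irq, mydriver_msi_irq, 0,
                              "mydriver", mydev);
            if (err)
                    pci_disable_msi(pdev);
    } else {
            /* fall back to the (possibly shared) legacy IRQ line */
            err = request_irq(pdev->irq, mydriver_irq, IRQF_SHARED,
                              "mydriver", mydev);
    }
    if (err)
            return err;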
There are (at least) two really good reasons for using MSI:
1) MSI is an exclusive interrupt vector by definition.
This means the interrupt handler doesn't have to verify
its device caused the interrupt.
2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
to be visible to the host CPU(s) when the MSI is delivered. This
is important for both data coherency and avoiding stale control data.
This guarantee allows the driver to omit MMIO reads to flush
the DMA stream.
See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.
4. PCI device shutdown
~~~~~~~~~~~~~~~~~~~~~~~
When a PCI device driver is being unloaded, most of the following
steps need to be performed:
Disable the device from generating IRQs
Release the IRQ (free_irq())
Stop all DMA activity
Release DMA buffers (both streaming and consistent)
Unregister from other subsystems (e.g. scsi or netdev)
Disable device from responding to MMIO/IO Port addresses
Release MMIO/IO Port resource(s)
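A hedged remove() sketch following those steps; the mydriver_* helpers and
the priv structure are hypothetical placeholders for driver-specific code:

    static void __devexit mydriver_remove(struct pci_dev *pdev)
    {
            struct mydriver_priv *priv = pci_get_drvdata(pdev);

            mydriver_stop_hw(priv);        /* quiesce IRQs and DMA on the chip */
            free_irq(pdev->irq, priv);     /* release the IRQ */
            pci_disable_msi(pdev);         /* only if MSI was enabled in probe() */
            mydriver_free_rings(priv);     /* release streaming and consistent buffers */
            /* unregister from other subsystems (netdev, scsi, ...) here */
            iounmap(priv->regs);           /* drop the MMIO mapping */
            pci_disable_device(pdev);      /* stop responding to MMIO/IO Port */
            pci_release_region(pdev, 0);
    }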
4.1 Stop IRQs on the device
~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.
When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.
This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.
4.2 Release the IRQ
~~~~~~~~~~~~~~~~~~~
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the drivers IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.
4.3 Stop all DMA activity
~~~~~~~~~~~~~~~~~~~~~~~~~
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.
Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.
While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.
4.4 Release DMA buffers
~~~~~~~~~~~~~~~~~~~~~~~
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to their "upstream"
owners if there are any.
Then clean up "consistent" buffers which contain the control data.
See Documentation/DMA-API.txt for details on unmapping interfaces.
4.5 Unregister from other subsystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.
4.6 Disable Device from responding to MMIO/IO Port addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().
4.7 Release MMIO/IO Port Resource(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
5. How to access PCI config space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can use pci_(read|write)_config_(byte|word|dword) to access the config
space of a device represented by struct pci_dev *. All these functions return 0
when successful or an error code (PCIBIOS_...) which can be translated to a text
string by pcibios_strerror. Most drivers expect that accesses to valid PCI
devices don't fail.
If you don't have a struct pci_dev available, you can call
pci_bus_(read|write)_config_(byte|word|dword) to access a given device
and function on that bus.
If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.
If you need to access registers in a PCI capability block, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you. (For PCI Express extended
capabilities, pci_find_ext_capability() does the same.)
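A hedged example of both kinds of access:

    u16 cmd;
    u8 irq_line;
    int pos;

    /* read-modify-write a register in the standard header, using the
     * symbolic names from <linux/pci.h> */
    pci_read_config_word(pdev, PCI_COMMAND, &cmd);
    cmd |= PCI_COMMAND_PARITY;
    pci_write_config_word(pdev, PCI_COMMAND, cmd);

    pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &irq_line);

    /* locate a capability block, e.g. Power Management, then read within it */
    pos = pci_find_capability(pdev, PCI_CAP_ID_PM);
    if (pos)
            pci_read_config_word(pdev, pos + PCI_PM_CTRL, &cmd);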
6. Other interesting functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pci_find_slot()          Find pci_dev corresponding to given bus and
                         slot numbers.
pci_set_power_state()    Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()    Find specified capability in device's capability
                         list.
pci_resource_start()     Returns bus start address for a given PCI region
pci_resource_end()       Returns bus end address for a given PCI region
pci_resource_len()       Returns the byte length of a PCI region
pci_set_drvdata()        Set private driver data pointer for a pci_dev
pci_get_drvdata()        Return private driver data pointer for a pci_dev
pci_set_mwi()            Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()          Disable Memory-Write-Invalidate transactions.
7. Miscellaneous hints
~~~~~~~~~~~~~~~~~~~~~~
When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).
Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.
Don't try to turn on Fast Back to Back writes in your driver. All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.
8. Vendor and device identifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One is not required to add new device ids to include/linux/pci_ids.h.
Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids.
PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary
hex numbers (vendor controlled) and normally used only in a single
location, the pci_device_id table.
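A hedged sketch of that convention (0xabcd is a made-up device id used only
for illustration):

    /* the device id stays local to the driver; only the vendor constant
     * comes from include/linux/pci_ids.h */
    #define MYDRIVER_DEVICE_FOO     0xabcd

    static DEFINE_PCI_DEVICE_TABLE(mydriver_ids) = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, MYDRIVER_DEVICE_FOO) },
            { 0, }
    };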
Please DO submit new vendor/device ids to the pciids.sourceforge.net project.
9. Obsolete functions
~~~~~~~~~~~~~~~~~~~~~
There are several functions which you might come across when trying to
port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot() Superseded by pci_get_slot()
The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.
10. MMIO Space and "Write Posting"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.
Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work. The classic "bit banging"
sequence works fine for I/O Port space:
for (i = 8; --i; val >>= 1) {
outb(val & 1, ioport_reg); /* write bit */
udelay(10);
}
The same sequence for MMIO space should be:
for (i = 8; --i; val >>= 1) {
writeb(val & 1, mmio_reg); /* write bit */
readb(safe_mmio_reg); /* flush posted write */
udelay(10);
}
It is important that "safe_mmio_reg" not have any side effects that
interferes with the correct operation of the device.
Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl(). Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
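A hedged fragment of such a reset sequence (RESET_REG, RESET_BIT and
mmio_base are made-up names; mmio_base is the ioremap()ed BAR):

    u16 cmd;

    writel(RESET_BIT, mmio_base + RESET_REG);       /* posted MMIO write */
    pci_read_config_word(pdev, PCI_COMMAND, &cmd);  /* config read flushes the
                                                       posted write and is handled
                                                       gracefully while the device
                                                       is resetting */
    msleep(10);                                     /* <linux/delay.h>; let the
                                                       chip come back */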
The PCI Express Advanced Error Reporting Driver Guide HOWTO
T. Long Nguyen <tom.l.nguyen@intel.com>
Yanmin Zhang <yanmin.zhang@intel.com>
07/29/2006
1. Overview
1.1 About this guide
This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
PCI Express AER driver.
1.2 Copyright © Intel Corporation 2006.
1.3 What is the PCI Express AER Driver?
PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components providing a minimum defined
set of error reporting requirements. Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure providing more robust error reporting.
The PCI Express AER driver provides the infrastructure to support PCI
Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:
- Gathers comprehensive error information when errors occur.
- Reports errors to the users.
- Performs error recovery actions.
The AER driver only attaches to Root Ports that support the PCI Express
AER capability.
2. User Guide
2.1 Include the PCI Express AER Root Driver into the Linux Kernel
The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the driver
has to be compiled. Option CONFIG_PCIEAER supports this capability. It
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.
2.2 Load PCI Express AER Root Driver
Some systems have AER support in the BIOS. Enabling the AER Root driver
while the BIOS also handles AER may result in unpredictable behavior.
To avoid this conflict, a successful load of the AER Root driver
requires ACPI _OSC support in the BIOS to allow the AER Root driver to
request native control of AER. See the PCI FW 3.0 Specification for
details regarding _OSC usage. Currently, many firmware implementations
don't provide _OSC support even though they use PCI Express. To support
such firmware, forceload, a parameter of type bool, allows AER
initialization to continue even though the firmware has no _OSC support.
To enable this workaround, please add aerdriver.forceload=y to the kernel
boot parameter line when booting the kernel. Note that forceload=n by
default.
nosourceid, another parameter of type bool, can be used when broken
hardware (mostly chipsets) has root ports that cannot obtain the reporting
source ID. nosourceid=n by default.
2.3 AER error output
When a PCI Express AER error is captured, an error message is output to
the console. If it is a correctable error, it is output as a warning;
otherwise, it is printed as an error. Users can therefore choose a log
level that filters out the correctable error messages.
Below shows an example.
+------ PCI-Express Device Error -----+
Error Severity : Uncorrected (Fatal)
PCIE Bus Error type : Transaction Layer
Unsupported Request : First
Requester ID : 0500
VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
TLB Header:
04000001 00200a03 05010000 00050100
In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to the PCI Express
specification for the other fields.
3. Developer Guide
Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.
To support AER well, developers first need to understand how AER works.
PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impacts
of those errors, which may result in degraded performance or function
failure.
Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
When AER is enabled, a PCI Express device will automatically send an
error message to the PCIE root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
the error reporting agent's requestor ID into the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.
Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.
3.1 Configure the AER capability structure
AER-aware drivers for PCI Express components need to change the device
control registers to enable AER. They may also change other AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting() can be used to enable AER. See
section 3.3.
3.2. Provide callbacks
3.2.1 callback reset_link to reset the PCI Express link
This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.
In struct pcie_port_service_driver, a new pointer, reset_link, is
added.
pci_ers_result_t (*reset_link) (struct pci_dev *dev);
Section 3.2.2.2 provides more detailed info on when to call
reset_link.
3.2.2 PCI error-recovery callbacks
The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.
The data structure pci_driver has a pointer, err_handler, that points to
a pci_error_handlers structure, which consists of a couple of callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.
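A hedged sketch of wiring the callbacks into a driver (the mydriver_io_*
functions are hypothetical and would implement the callbacks described in
pci-error-recovery.txt):

    static struct pci_error_handlers mydriver_err_handler = {
            .error_detected = mydriver_io_error_detected,
            .mmio_enabled   = mydriver_io_mmio_enabled,
            .slot_reset     = mydriver_io_slot_reset,
            .resume         = mydriver_io_resume,
    };

    static struct pci_driver mydriver_pci_driver = {
            .name           = "mydriver",
            .id_table       = mydriver_ids,
            .probe          = mydriver_probe,
            .remove         = __devexit_p(mydriver_remove),
            .err_handler    = &mydriver_err_handler,
    };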
Below sections specify when to call the error callback functions.
3.2.2.1 Correctable errors
Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.
3.2.2.2 Non-correctable (non-fatal and fatal) errors
If an error message indicates a non-fatal error, performing link reset
at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) to all drivers associated within a hierarchy in
question. For example,
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover; if it returns PCI_ERS_RESULT_CAN_RECOVER, the
AER driver calls mmio_enabled next.
If an error message indicates a fatal error, kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within
a hierarchy in question. Then, performing link reset at upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide a
function to reset the link. First, the kernel checks whether the upstream
component has an AER driver. If it does, the kernel uses that driver's
reset_link callback. If the upstream component has no AER driver and the
port is a downstream port, the AER driver of the Root Port that reports
the AER error is used. Upstream ports, however, should provide their own
AER service drivers with a reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.
3.3 helper functions
3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.
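A hedged probe-time fragment showing that call:

    /* needs <linux/aer.h> */
    err = pci_enable_pcie_error_reporting(pdev);
    if (err)
            dev_info(&pdev->dev, "AER reporting could not be enabled\n");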
3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.
3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.
3.4 Frequently Asked Questions
Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?
A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print out warning messages. Please refer
to section 3 for more information.
Q: What happens if an upstream port service driver does not provide
callback reset_link?
A: Fatal error recovery will fail if the errors are reported by
upstream ports that are served by that service driver.
Q: How does this infrastructure deal with a driver that is not PCI
Express aware?
A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the Root
Port.
Q: What modifications does such a driver need to be compatible
with the PCI Express AER Root driver?
A: It could call the helper functions to enable AER in devices and to
clean up the uncorrectable status register. Please refer to section 3.3.
4. Software error injection
Debugging PCIE AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIE errors.
First you should enable PCIE AER software error injection in the kernel
configuration, that is, the following item should be in your .config:
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
After rebooting with the new kernel or inserting the module, a device file
named /dev/aer_inject should be created.
Then you need a user space tool named aer-inject, which can be obtained
from:
http://www.kernel.org/pub/linux/utils/pci/aer-inject/
More information about aer-inject can be found in the documentation that
comes with its source code.