Intel® E7520 Memory Controller Hub (MCH) Datasheet
Functional Description
Memory data parity errors are non-fatal. The whole system does not need to be shut down
when one piece of data is found to be bad (whether due to a flipped bit or just a transient). DIMM
counters and similar mechanisms further qualify how frequently these types of errors occur and whether they should be
escalated. FSB data errors are transient events and can be isolated to a process.
An address parity error leaves the address unrecognizable, so the error
probably cannot be isolated to a process and is treated as fatal. The PCI Express subsystem uses data
retries for bad packets. The memory subsystem uses data retries for uncorrectable errors. The FSB
and Hub Interface do not have these types of mechanisms.
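The severity and retry rules above can be summarized as two small lookup tables. This is only a sketch of the policy as described in the text, not part of the MCH programming model: the string labels and the functions `is_fatal` and `can_retry` are hypothetical, and treating FSB data errors as non-fatal is an inference from the statement that they can be isolated to a process.

```python
# Hypothetical labels for the error kinds named in the text.
FATAL = {
    "memory_data_parity": False,  # one bad datum need not stop the system
    "fsb_data": False,            # inferred: transient, can be isolated to a process
    "address_parity": True,       # address unknown, so isolation is unlikely
}

# Hypothetical labels for the interfaces named in the text.
HAS_DATA_RETRY = {
    "pci_express": True,     # retries bad packets
    "memory": True,          # retries uncorrectable errors
    "fsb": False,            # no retry mechanism
    "hub_interface": False,  # no retry mechanism
}


def is_fatal(error_kind: str) -> bool:
    """Return True if the text treats this error kind as fatal."""
    return FATAL[error_kind]


def can_retry(interface: str) -> bool:
    """Return True if the text says this interface retries data."""
    return HAS_DATA_RETRY[interface]
```

For example, `is_fatal("address_parity")` is true while `is_fatal("memory_data_parity")` is false, and `can_retry("fsb")` is false.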
5.11.2 Data Error Propagation between Interfaces/Units
Because the MCH uses several data protection schemes (ECC, parity, and CRC), it must be able
to convert between them. Beyond this requirement, it must also
indicate whether or not incoming data is corrupted. To accomplish this, the MCH implements a
feature referred to as “data poisoning.” Each of the MCH’s external interface units, as well as
the internal Posted Memory Write Buffer (PMWB), implements a “Data Poisoning Enable” bit.
When this bit is set, and errors are detected on data incoming to the MCH from the external
interface (or poisoned data is written to the PMWB), the MCH reports the error via the enabled
mechanism for the interface, and also marks the data as poisoned before propagating it on the
internal data path towards its destination. This could result in the MCH generating a series of error
messages when an error is detected on incoming data via one of the external interfaces, and the
associated poisoned data propagates towards its destination through the MCH. Diagnostic software
could examine the various error status bits in order to track the errant data through the system.
When a data error occurs with data poisoning turned off, an error bit is set, which could
then notify the system, but the propagated data would not indicate that it had a parity error. This
becomes a race between the “bad” data disguised as good and the notification to the OS. Under
this condition, the goal is to “use the biggest hammer” to bring the system down as fast as possible,
before the “bad” data is consumed. Under these circumstances, causing a SERR for a memory data
parity error makes sense. Disabling data poisoning is not the recommended usage model; data poisoning
is assumed to be enabled.
If the Data Poisoning Enable bit is clear, the error condition is not reported and the data is
propagated as if no error was detected.
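The poisoning behavior described in this section can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not a description of the MCH hardware: the `Unit` class, its `forward` method, and the `send_through` helper are hypothetical names. For the disabled case the sketch follows the statement that the error is not reported and the data is propagated as if no error was detected.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical model of an interface unit (or the PMWB)."""
    name: str
    poisoning_enabled: bool = True   # models the "Data Poisoning Enable" bit
    errors_reported: int = 0         # count of error reports from this unit

    def forward(self, payload, incoming_bad):
        """Return (payload, poisoned) to pass along the internal data path."""
        if incoming_bad and self.poisoning_enabled:
            # Report the error and mark the data as poisoned.
            self.errors_reported += 1
            return payload, True
        # Clean data, or poisoning disabled: data looks good downstream.
        return payload, False


def send_through(units, payload, incoming_bad):
    """Pass data through a chain of units, carrying the poison marker."""
    poisoned = incoming_bad
    for unit in units:
        payload, poisoned = unit.forward(payload, poisoned)
    return payload, poisoned
```

With poisoning enabled on every unit, bad data arriving at an external interface is reported once by each unit it passes through, which matches the series of error messages described above. With poisoning disabled on the first unit, downstream units see the data as good and report nothing.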
5.11.3 FERR/NERR Global Register Scheme
The Global FERR consists of three fields. The first (Fatal) field has 10 bits that will indicate the
first signaled fatal error from 10 different sources. The second (Non-fatal) field will indicate the
first non-fatal error that occurs from the same 10 sources. A non-fatal error may be either
correctable or uncorrectable, but not fatal to the system. These two fields will usually have at most
one bit asserted in each field. In the event of simultaneous errors occurring in the same core clock,
more than one bit in a field may be set. These registers also contain several bits that are reserved for
future enhancements.
The Global NERR will consist of these same three fields with slightly different functionality.
Instead of just the first fatal or non-fatal errors recorded, this register will indicate the second, third,
fourth, etc. errors that are reported by the MCH.
Figure 5-12. Global FERR/NERR Register Representation
Fatal Error status bits (10 bits) | Non-fatal Error status bits (11 bits) | Reserved (4 bits)
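The first-error and next-error behavior can be sketched in software as follows. This is an illustrative model only: the class `GlobalErrorRegs`, the method `log_clock`, and the bit positions are hypothetical, each field is modeled with the 10 sources named in the text, and the sketch assumes that "first" is tracked separately for the fatal and non-fatal fields.

```python
NUM_SOURCES = 10  # the text names 10 error sources per field


class GlobalErrorRegs:
    """Hypothetical software model of the Global FERR and NERR fields."""

    def __init__(self):
        self.ferr = {"fatal": 0, "non_fatal": 0}  # first error(s) per field
        self.nerr = {"fatal": 0, "non_fatal": 0}  # all subsequent errors

    def log_clock(self, severity, sources):
        """Record errors of one severity signaled in the same core clock."""
        mask = 0
        for source in sources:
            if not 0 <= source < NUM_SOURCES:
                raise ValueError(f"unknown error source {source}")
            mask |= 1 << source
        if mask == 0:
            return
        if self.ferr[severity] == 0:
            # First error for this field; simultaneous errors in the
            # same clock may set more than one bit.
            self.ferr[severity] = mask
        else:
            # FERR already holds the first error, so later errors
            # accumulate in NERR.
            self.nerr[severity] |= mask
```

In this model, logging a fatal error from source 3 and then one from source 5 leaves bit 3 set in the fatal FERR field and bit 5 set in the fatal NERR field. Two non-fatal errors arriving in the same clock both land in the non-fatal FERR field.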