
Error Handling and Custom Fault Injection Instructions

Functional Description

The CtrlUnit controls ECC error injection in the DCache. Each core's L1 DCache has a controller driven by memory-mapped registers, and each hardware unit that supports ECC is assigned a Control Bank. The CtrlUnit configuration registers are accessed with MMIO load/store instructions. Once the registers are configured, the L1 DCache triggers an ECC error on the first subsequent read access (for example, a load instruction or a MainPipe access).

Feature 1: Address Space

  • The CtrlUnit occupies the address range 0x38022000-0x3802207F (128 bytes), which is local to each hart.

Feature 2: DCache Control Bank

  • As shown in Figure \ref{fig:CtrlBank}, each Control Bank contains the following registers: ECCCTL, ECCEID, and ECCMASK, each of which is 8 bytes in size.

CtrlBank Layout

  • ECCCTL (ECC Control): ECC injection control register

ECCCTL

  • ese (error signaling enable): Indicates that the injection is valid, initialized to 0. Once the injection succeeds, hardware clears ese.

  • pst: Injection support signal. When pst==1, after the ECCEID counter decrements to 0 and the injection succeeds, the counter is reloaded with the previously configured ECCEID value so that injection repeats; when pst==0, injection occurs only once.

  • ede (error delay enable): Indicates the counter is active, initialized to 0.

    • If ese==1 and ede==0, error injection takes effect immediately.

    • When ese==1 and ede==1, the injection becomes effective only after ECCEID decrements to 0.

  • cmp (component): Indicates the injection target, initialized to 0.

    • 1’b0: Injection target is tag

    • 1’b1: Injection target is data

  • bank: Bank valid signal, initialized to 0. When a bit in the bank is set, the corresponding mask becomes active.

  • ECCEID (ECC Error Inject Delay): ECC injection delay controller.

ECCEID

  • When ese==1 and ede==1, the counter decrements until it reaches 0. The counter currently runs at the core clock frequency, although a divided clock could also be used. Because ECC injection depends on a DCache access, the moment the EID counter reaches 0 and the moment the ECC error is actually triggered may not coincide.

  • ECCMASK (ECC Mask): ECC injection mask register.

ECCMASK

  • 0 indicates no inversion, 1 indicates inversion. Tag injection only uses the bits corresponding to the tag length in ECCMASK0; any excess bits have no effect.
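
The register semantics above can be summarized in a small behavioral model. This is a sketch under stated assumptions, not the RTL: the class and method names are hypothetical, the counter is assumed to decrement once per `tick()` call, the injection is assumed to complete on the first cache access after the trigger condition holds, and ese is assumed to stay set when pst==1 so that injection can repeat.

```python
from dataclasses import dataclass


@dataclass
class EccInjector:
    """Hypothetical behavioral model of one Control Bank (not the RTL)."""
    ese: bool = False   # injection is valid (armed)
    pst: bool = False   # reload the counter and inject again after success
    ede: bool = False   # delay counter enabled
    eid: int = 0        # value written to ECCEID
    mask: int = 0       # ECCMASK: 1 inverts the bit, 0 leaves it unchanged
    counter: int = 0    # live countdown, loaded from eid

    def arm(self) -> None:
        """Model setting ese; the countdown starts from the ECCEID value."""
        self.ese = True
        self.counter = self.eid

    def tick(self) -> None:
        """One counter clock: decrement only while ese and ede are both set."""
        if self.ese and self.ede and self.counter > 0:
            self.counter -= 1

    def pending(self) -> bool:
        """True when the next cache access should see an injected error."""
        return self.ese and (not self.ede or self.counter == 0)

    def access(self, word: int) -> int:
        """Model a cache read: flip the masked bits if an injection is pending."""
        if not self.pending():
            return word
        if self.pst:
            self.counter = self.eid  # assumed: stay armed and count down again
        else:
            self.ese = False         # one-shot: ese is cleared on success
        return word ^ self.mask
```

With `ede=True` and `eid=4`, the first four ticks only count down; the next access returns the word with the masked bits inverted, and later accesses return it unchanged because ese has been cleared.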

Feature 3: Bus Error Unit Controller

  • ECC errors from the DCache are uniformly sent to the Bus Error Unit controller for processing. The Bus Error Unit controller stores the following information:

    Table: Information stored by the Bus Error Unit

    | Field            | Description                          | Initial value | Address    |
    |------------------|--------------------------------------|---------------|------------|
    | cause            | Cause of the error event             | 0             | 0x38010000 |
    | value            | Physical address of the error event  | Undefined     | 0x38010008 |
    | enable           | Event valid mask                     | 1             | 0x38010010 |
    | global_interrupt | Global interrupt enable mask         | 0             | 0x38010018 |
    | accrued          | Accumulated event mask               | 0             | 0x38010020 |
    | local_interrupt  | Hart local interrupt enable mask     | 0             | 0x38010028 |

  • Address space

    The physical address space of the Bus Error Unit is: 0x38010000 - 0x38010fff

  • Supported error types

    • ICache ECC Error

    • DCache ECC Error

    • L2Cache ECC Error

  • Controlled interrupt

    • Local interrupt: Reported only to the hart where the Bus Error Unit resides. The interrupt is delivered to the backend, which is responsible for handling it. Currently, the NMI_31 interrupt is used.

    • Global interrupt: If a global interrupt occurs, the Bus Error Unit sends the interrupt information to the PLIC, which is responsible for reporting the interrupt.
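
As a reading aid, the register map and the two interrupt paths can be written down as follows. The addresses come from the table above. The routing rule is an assumption: the text only names the masks, so this sketch supposes an interrupt is raised when an accrued event bit is also set in the corresponding enable mask. The function names are hypothetical.

```python
from typing import Dict, List

BEU_BASE = 0x38010000

# Offsets of the Bus Error Unit registers, taken from the table above.
BEU_OFFSETS: Dict[str, int] = {
    "cause": 0x00,
    "value": 0x08,
    "enable": 0x10,
    "global_interrupt": 0x18,
    "accrued": 0x20,
    "local_interrupt": 0x28,
}


def beu_address(field: str) -> int:
    """Physical address of a Bus Error Unit register."""
    return BEU_BASE + BEU_OFFSETS[field]


def interrupt_targets(accrued: int, local_interrupt: int,
                      global_interrupt: int) -> List[str]:
    """Assumed routing: an accrued event raises each interrupt it is enabled for."""
    targets = []
    if accrued & local_interrupt:
        targets.append("local hart (NMI_31)")  # handled by the backend
    if accrued & global_interrupt:
        targets.append("PLIC")                 # reported by the PLIC
    return targets
```

For example, an event accrued in bit 1 with only the local mask enabling bit 1 is routed to the local hart alone.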

Feature 4: L1 DCache ECC Error Handling Process

  • Report error

  • Tag ECC error: An ECC error is detected as long as it occurs in any way.

    Table: Relationship between tag hit, tag ECC error, and the resulting tag error judgment

    | Hit | Error                 | Tag Error        |
    |-----|-----------------------|------------------|
    | N   | N                     | N                |
    | N   | Y                     | Y (probably hit) |
    | Y   | N                     | N                |
    | Y   | Y (hit with error)    | Y                |
    | Y   | Y (hit with no error) | N                |

    • Data ECC Error: If a hit line has an ECC error, it is considered an ECC error. If there is no hit, it is not handled.

    • If an instruction access triggers an ECC error, it is considered a Hardware error and an exception is reported.

    • Any triggered error must send error information to the BEU. When hardware detects an error, it reports to the BEU, triggering an NMI external interrupt.
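
The tag error judgment table can be expressed as a short function. This is a minimal sketch: the function name and its arguments are hypothetical, and it assumes "hit with error" means the hit way itself has the ECC error while "hit with no error" means the error sits in a different way.

```python
from typing import Optional, Sequence


def tag_error(way_errors: Sequence[bool], hit_way: Optional[int]) -> bool:
    """Return the Tag Error judgment for one set lookup.

    way_errors[i] is True when way i reports a tag ECC error.
    hit_way is the index of the hitting way, or None on a miss.
    """
    if hit_way is None:
        # Miss: an error in any way may be hiding a real hit, so report it.
        return any(way_errors)
    # Hit: only an error in the hit way itself counts.
    return bool(way_errors[hit_way])
```

A miss with an error in any way is reported because the corrupted tag might have matched, which is the "probably hit" row of the table.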

  • Regular memory access instruction

  • For regular memory access instructions such as Load, execution can only trigger tag or data ECC errors. These errors are reported to the BEU, and a Hardware Error exception (19) is raised.

  • Probe/Snoop

  • For Probe/Snoop

    • If a tag ECC error occurs, there is no need to change the cache state, and a ProbeAck request with corrupt=1 must be returned to L2.

    • If a data ECC error occurs, change the cache state according to the rules. If data needs to be returned, a ProbeAckData request with corrupt=1 must be returned to L2.

  • Replace/Evict

  • For Replace/Evict

    • If a tag ECC error occurs, a Release request with corrupt=1 must be returned to L2.

    • If a data ECC error occurs, a ReleaseData request with corrupt=1 must be returned to L2.
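
The Probe/Snoop and Replace/Evict rules above amount to a small decision table, sketched here. The function names are hypothetical. Two points are assumptions because the text does not state them: a tag error is checked before a data error, and a data error on a probe that returns no data is answered with corrupt=0.

```python
from typing import Optional, Tuple


def probe_response(tag_error: bool, data_error: bool,
                   needs_data: bool) -> Tuple[str, bool, bool]:
    """Return (opcode, corrupt, update_state) for a Probe/Snoop."""
    if tag_error:
        # Tag error: leave the cache state unchanged, answer with corrupt=1.
        return ("ProbeAck", True, False)
    if needs_data:
        # State changes per the normal rules; corrupt mirrors the data error.
        return ("ProbeAckData", data_error, True)
    # Assumed: nothing corrupt is sent when no data is returned.
    return ("ProbeAck", False, True)


def release_response(tag_error: bool,
                     data_error: bool) -> Optional[Tuple[str, bool]]:
    """Return (opcode, corrupt) for a Replace/Evict with an ECC error.

    Returns None when there is no error; the normal path is not modeled here.
    """
    if tag_error:
        return ("Release", True)
    if data_error:
        return ("ReleaseData", True)
    return None
```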

  • Store to DCache

  • For Sbuffer writing data to DCache

    • If a tag ECC error occurs, the cacheline is released according to the Replace/Evict process, and the data is written into the DCache without reporting the error to L2.

    • If a data ECC error occurs, the data is written directly without reporting the error to L2.

  • Atomics

  • For Atomic operations, exceptions are reported, but errors are not forwarded to L2.

  • Multiple Error Selection

  • If multiple errors occur simultaneously, the priority order is ldu0 > ldu1 > ldu2 > MainPipe
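
The fixed priority order can be illustrated with a simple arbiter. This is a sketch only: the function name and the idea of passing each source's report as a dictionary entry are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Highest priority first, as listed above.
ERROR_PRIORITY = ("ldu0", "ldu1", "ldu2", "MainPipe")


def select_error(reports: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Pick the report to forward when several sources raise an error at once.

    reports maps a source name to its error payload (for example, an address).
    """
    for source in ERROR_PRIORITY:
        if source in reports:
            return (source, reports[source])
    return None
```

If `ldu1` and `MainPipe` report in the same cycle, the `ldu1` report is selected.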

\newpage

Overall Block Diagram

Error Architecture

Interface Timing

Configuration Register Timing

  • Configuration registers are read and written over the TileLink interface. As shown in Figure \ref{fig:DCache-Error-Config-Timing}, the write address and data are transmitted on the A channel.

  • Configure the EccMask0 register at address 0x38022010 with the data value 0xff.

  • Configure the EccEid register at address 0x38022008 with the data value 0x4.

  • Configure the EccCtl register at address 0x38022000 with the data value 0x5.

Configuration Register Timing
{#fig:DCache-Error-Config-Timing width=80%}
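
The three writes in this example can be sketched as a software sequence. The addresses and values are the ones listed above; everything else is an assumption for illustration. `FakeMmio` is a hypothetical stand-in for a real 64-bit MMIO store, and the comment on ordering is an inference from the fact that ese lives in EccCtl.

```python
ECC_CTRL_BASE = 0x38022000
ECCCTL = ECC_CTRL_BASE + 0x00
ECCEID = ECC_CTRL_BASE + 0x08
ECCMASK0 = ECC_CTRL_BASE + 0x10


class FakeMmio:
    """Hypothetical stand-in for the bus: records 64-bit register writes."""

    def __init__(self) -> None:
        self.mem = {}
        self.log = []

    def write64(self, addr: int, value: int) -> None:
        if addr % 8 != 0:
            raise ValueError("registers are 8 bytes wide and 8-byte aligned")
        self.mem[addr] = value & 0xFFFFFFFFFFFFFFFF
        self.log.append((addr, value))


def configure_injection(bus: FakeMmio, mask: int = 0xFF,
                        delay: int = 0x4, ctl: int = 0x5) -> None:
    """Issue the three writes from the example, in the order listed."""
    bus.write64(ECCMASK0, mask)
    bus.write64(ECCEID, delay)
    # EccCtl goes last, presumably so the mask and delay are already in
    # place when the control register enables the injection.
    bus.write64(ECCCTL, ctl)
```

On real hardware the `write64` calls would be ordinary store instructions to the hart-local CtrlUnit address range.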

Tag Injection Timing

  • As shown in Figure \ref{fig:DCache-Error-TagInj-Timing}, after configuring the registers (EccCtl, EccEid, and EccMask0), injection begins when the timer counts down to 0:

  • The tag injection interface io_pseudoError_0_valid is asserted.

  • Upon successful injection (i.e., when io_pseudoError_0_valid && io_pseudoError_0_ready == 1), the ese bit of EccCtl will be cleared, ending the injection.

  • Taking MainPipe as an example, the s1_tag_error, s2_tag_error, and s3_tag_error signals are sequentially raised, and finally, the error information is reported to the BEU through the io_error port.

Tag Injection Timing
{#fig:DCache-Error-TagInj-Timing width=80%}

\newpage

Data Injection Timing

  • As shown in Figure \ref{fig:DCache-Error-DataInj-Timing}, after configuring the registers (EccCtl, EccEid, and EccMask2), when the timer counts down to 0, injection begins:

  • The data injection interface io_pseudoError_1_valid is asserted.

  • Upon successful injection (i.e., when io_pseudoError_1_valid && io_pseudoError_1_ready == 1), the ese bit of EccCtl will be cleared, ending the injection.

  • Taking MainPipe as an example, s2_data_error and s3_data_error are sequentially raised, and finally, error information is reported to the BEU via the io_error port.

Data Injection Timing
{#fig:DCache-Error-DataInj-Timing width=80%}