Skip to content

ExuUnit

  • Version: V2R2
  • Status: OK
  • Date: 2025/01/20
  • commit:xxx

Glossary of Terms

Glossary of Terms in FU
fu Descrption
alu Arithmetic Logic Unit
mul Multiplication unit
bku B-extension bit manipulation and cryptography unit
brh Conditional Branch Unit
jmp Direct jump unit
i2f Integer to Floating-Point Conversion Unit
i2v Integer Move to Vector Unit
VSetRiWi VSet unit for reading and writing integers
VSetRiWvf The vset unit for reading integers and writing vectors
csr Control and Status Register Unit
fence Memory Synchronization Instruction Unit
div Division unit
falu Floating-point arithmetic logic unit
fcvt Floating-point conversion unit
f2v Floating-point move to vector unit
fmac Floating-point fused multiply-add
fdiv Floating-Point Division Unit
vfma Vector floating-point fused multiply-add unit
vialu Vector Integer Arithmetic Logic Unit
vimac Vector integer multiply-add unit
vppu Vector permutation processing unit
vfalu Vector Floating-Point Arithmetic Logic Unit
vfcvt Vector floating-point conversion unit
vipu Vector integer processing unit
VSetRvfWvf Read vector write vector vset unit
vfdiv Vector floating-point division unit
vidiv Vector integer division unit

Input and Output

flush is a Redirect input with a valid signal.

in is the ExuInput generated according to the specific ExeUnit parameter configuration

out is the ExuOutput generated based on the specific ExeUnit parameter configuration

csrio, csrin, and csrToDecode only exist when CSR is present in the ExeUnit.

Similarly, fenceio only exists when fence is present in the ExeUnit. frm only exists when frm is required as a src in the ExeUnit. vxrm only exists when vxrm is required as a src in the ExeUnit.

vtype, vlIsZero, and vlIsVlmax only exist when Vconfig needs to be written in this ExeUnit.

Additionally, for cases where JmpFu or BrhFu exists in the ExeUnit, the instruction address translation type instrAddrTransType must also be input

Feature

Each ExuUnit generates a series of corresponding FU modules based on its configuration parameters.

busy is used to indicate whether the current ExeUnit is in a busy state. For ExeUnits with deterministic latency, the functional unit is never marked as busy because the latency is fixed, and all tasks will complete in order. In this case, busy is directly set to false, indicating the functional unit is always idle. For ExeUnits with non-deterministic latency, busy is set high when input fires and set low when output fires. Additionally, if the incoming uop or the uop being calculated needs to be redirect flushed, busy is also set low.

Additionally, the ExeUnit checks for mixed latency types, i.e., whether there are functional units on the same port with different latency types (deterministic and non-deterministic). If such mixed cases exist, for functional units with non-deterministic latency, their priority is ensured to be the maximum. This design logic guarantees that when handling functional units with different latency types, the write-back port priorities are appropriately configured to avoid conflicts or inconsistencies in priority.

In addition to containing various FUs, each ExuUnit also includes a submodule called in1ToN, which acts as a Dispatcher. Its role is to further dispatch an ExuInput entering the ExuUnit to different FUs. It must ensure that the same ExuInput enters exactly one FU and does not enter more than one FU.

Additionally, there is a set of registers called inPipe, which consists of (valid, input) pairs with a size of latencyMax + 1. These record the inputs and the computation cycle in which the input resides. For FUs that require pipeline control, the original data can be obtained through inPipe.

Finally, the output results from different FUs need to be aggregated, and one FU's output result is selected as the ExeUnit's output.

ExuUnit Overview

Design specifications

There are a total of 3 ExuBlocks in the Backend: intExuBlock, fpExuBlock, and vfExuBlock, which are the execution modules for integer, floating-point, and vector operations, respectively. Each ExuBlock contains several ExeUnit units.

The intExuBlock contains 8 ExeUnits, each with the following functions:

FUs included in each ExeUnit within intExuBlock
ExeUnit Feature
exus0 ALU, MUL, BKU
exus1 BRH, JMP
exus2 ALU, MUL, BKU
exus3 BRH, JMP
exus4 alu
exus5 brh, jmp, i2f, i2v, VSetRiWi, VSetRiWvf
exus6 alu
exus7 csr, fence, div

The fpExuBlock contains 5 ExeUnits, with each ExuUnit corresponding to the following functions:

FUs included in each ExeUnit in fpExuBlock
ExeUnit Feature
exus0 falu,fcvt,f2v,fmac
exus1 fdiv
exus2 FALU, FMAC
exus3 fdiv
exus4 FALU, FMAC

The vfExuBlock contains 5 ExeUnits, each corresponding to the following functions:

FUs included in each ExeUnit in vfExuBlock
ExeUnit Feature
exus0 vfma,vialu,vimac,vppu
exus1 vfalu,vfcvt,vipu,VSetRvfWvf
exus2 vfma,vialu
exus3 vfalu
exus4 vfdiv, vidiv

Gating

The ExuUnit also supports clock gating for functional units (FUs). Power consumption is reduced by controlling the clock enable signal clk_en of each FU. The clock is only enabled when the FU is needed, and the clock gating enable signal is dynamically calculated based on the FU's latency settings and whether uncertain latency is enabled, thereby optimizing power consumption.

In simple terms, for FUs with fixed latency and a latency cycle count greater than 0, two vectors of length latReal + 1, fuVldVec and fuRdyVec, are used. When the FU input is valid, fuVldVec(0) is set to 1, and this 1 is shifted backward each cycle. Additionally, the value of fuRdyVec(i) depends on fuRdyVec(i+1) and fuVldVec(i+1). Thus, when there is a 1 in fuVldVec, it indicates there is valid computation in progress.

For FUs with uncertain latency, the uncer_en_reg is recorded when the FU input fires and cleared when the FU output fires.

Thus, for FUs that can use gating, the clk_en signal is asserted under the following conditions: for zero-latency FUs when the FU input fires; for multi-cycle latency FUs when the input fires or there is valid computation in the current FU; for non-deterministic latency FUs when the FU input fires or there is valid computation in the current FU. Clock gating is performed based on these conditions.