Vector Load Split Unit VLSplit

Functional Description

VLSplit accepts and processes Uops from Vector Load instructions: it splits the Uops, calculates the offset of each Uop relative to the base address, and generates control signals for the scalar memory access Pipeline. VLSplit is broadly divided into two implementation modules: VLSplitPipeline and VLSplitBuffer.

Feature 1: VLSplitPipeline performs secondary decoding of Uops.

VLSplitPipeline is the split pipeline for Vector Load instructions. It accepts Uops of Vector Load instructions dispatched from the Vector Load issue queue, performs finer-grained decoding, calculates the Mask and address offset, and sends the results to the VLSplitBuffer. The VLSplitPipeline also requests entries in the VLMergeBuffer based on the decoding results.

VLSplitPipeline consists of two pipeline stages:

S0:

- Perform finer-grained decoding based on the incoming Uop information.
- Generate alignedType based on the instruction type, using alignedType to indicate the memory access width of the Load Pipeline.
- Generate the preIsSplit signal based on the instruction type. A high preIsSplit signal indicates that it is not a Unit-Stride instruction.
- Generate the Mask for this Uop based on the instruction type and information such as vm, emul, lmul, eew, and sew.
- Calculate the VdIdx of this Uop for subsequent backend data merging and writeback. Due to out-of-order execution, Uops from the same instruction may not execute consecutively, so this stage computes VdIdx based on instruction type, emul, lmul, and uopidx.

S1:

- Calculate UopOffset and Stride.
- Calculate the FlowNum required for this Uop. The FlowNum sent to the VLMergeBuffer differs from the one sent to the VLSplitBuffer: the VLMergeBuffer uses its FlowNum to determine whether this Uop has completed all valid memory accesses, while the VLSplitBuffer uses its FlowNum to drive the splitting.
- Request a VLMergeBuffer entry. Each Uop requests one entry.
- Send information to VLSplitBuffer.

**Mask calculation:**

First, we calculate and generate the SrcMask representing this Vector Load instruction based on vm, v0, vstart, and evl. Here, evl is the effective vector length, and different types of Vector Load instructions have different evl calculation methods:
- For Load Whole instructions, their evl = NFIELDS*VLEN/EEW.
- For Load Unit-Stride Mask instructions, evl=ceil(vl/8).
- For Vector Load instructions other than the two mentioned above, their evl = vl.
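
The three evl cases above can be written as a small behavioral sketch. This is only an illustration in Python, not the RTL: the function name, the `kind` strings, and the default VLEN of 128 are assumptions introduced here.

```python
def effective_vector_length(kind: str, vl: int, nfields: int = 1,
                            vlen: int = 128, eew: int = 8) -> int:
    """Return evl for a Vector Load, following the three cases in the text.

    `kind` is a hypothetical tag: "whole" for Load Whole, "mask" for
    Load Unit-Stride Mask, anything else for the remaining Vector Loads.
    vlen = 128 is an assumed default, not taken from the document.
    """
    if kind == "whole":
        # Load Whole: evl = NFIELDS * VLEN / EEW
        return nfields * vlen // eew
    if kind == "mask":
        # Load Unit-Stride Mask: evl = ceil(vl / 8), done in integer arithmetic
        return (vl + 7) // 8
    # All other Vector Loads: evl = vl
    return vl
```

For example, with vl = 9 a Load Unit-Stride Mask instruction covers 2 bytes of mask, while an ordinary Vector Load keeps evl = 9.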
Then, we use three quantities to calculate the actual FlowMask: the FlowNum of all Uops before the current Uop in this instruction, the FlowNum of all Uops up to and including the current Uop, and the FlowNum of all Vd before the current Uop. Load Indexed is a special case: when $signed(emul) > $signed(lmul), we need to ensure that the FlowNum of Uops with the same VdIdx is offset within that VdIdx, as illustrated below:
- First, we assume the following configuration for the vector vluxei instruction:
  - vsetvli t1,t0,e8,m1,ta,ma (lmul = 1)
  - vluxei16.v v2,(a0),v8 (emul = 2)
  - vl = 9, v0 = 0x1FF
- Under this configuration, since $signed(emul) > $signed(lmul), it will actually generate two Uops, indicating that indexes need to be fetched from two vector registers, while the destination register for both Uops is the same Vd. That is, the VdIdx of the two Uops should be identical, as they are to be written into the same target register. Therefore, the following result will be produced here:
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x00FF, toMergeBuffMask = 0x01FF
  - uopIdxInField = 1, vdIdxInField = 0, flowMask = 0x0001, toMergeBuffMask = 0x01FF
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
  - uopIdxInField = 0, vdIdxInField = 0, flowMask = 0x0000, toMergeBuffMask = 0x0000
- The FlowNum calculated for each Uop is 8. For more details, refer to VSplit.scala.
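
As a rough illustration of the example above, the following Python sketch reproduces the first two rows (the two Uops that share Vd). It is a simplified model under stated assumptions, not the logic in VSplit.scala: the helper names, the choice to shift flowMask by the flows of preceding Uops and toMergeBuffMask by the flows of preceding Vd, and VLEN = 128 (16 elements of 8 bits per Vd) are all assumptions made here. The two all-zero rows are not modeled.

```python
def low_mask(n: int) -> int:
    """Mask with the lowest n bits set."""
    return (1 << n) - 1

def gen_src_mask(vm: bool, v0: int, vstart: int, evl: int) -> int:
    """Elements in [vstart, evl) are candidates; v0 gates them unless vm is set."""
    body = low_mask(evl) & ~low_mask(vstart)
    return body if vm else body & v0

def flow_masks(src_mask: int, flows_prev_this_uop: int,
               flows_prev_this_vd: int, flow_num: int, vd_elems: int):
    """Hypothetical derivation of (flowMask, toMergeBuffMask) for one Uop."""
    # Flows handled by this Uop: skip the flows of earlier Uops.
    flow_mask = (src_mask >> flows_prev_this_uop) & low_mask(flow_num)
    # Mask seen by the merge buffer: skip only the flows of earlier Vd.
    to_merge_buff_mask = (src_mask >> flows_prev_this_vd) & low_mask(vd_elems)
    return flow_mask, to_merge_buff_mask

# Configuration from the example: vl = 9, v0 = 0x1FF, FlowNum = 8 per Uop.
src = gen_src_mask(vm=False, v0=0x1FF, vstart=0, evl=9)
uop0 = flow_masks(src, flows_prev_this_uop=0, flows_prev_this_vd=0,
                  flow_num=8, vd_elems=16)
uop1 = flow_masks(src, flows_prev_this_uop=8, flows_prev_this_vd=0,
                  flow_num=8, vd_elems=16)
```

Under these assumptions `uop0` is (0x00FF, 0x01FF) and `uop1` is (0x0001, 0x01FF), matching the first two rows listed above.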

Feature 2: VLSplitBuffer splits based on the secondary decoding information generated by VLSplitPipeline.

The VLSplitBuffer is a single-entry buffer that receives relevant information from the VLSplitPipeline and caches the Vector Load Uop that needs to be split.

The VLSplitBuffer will split a Uop into multiple pieces of information that can be sent to the scalar Load Pipeline based on the Uop's details, and then dispatch them to the scalar Load Pipeline for actual memory access.

**Enqueue logic:**

VLSplitBuffer accepts entry requests and related information from VLSplitPipeline. When there are free entries in VLSplitBuffer, it allocates one VLSplitBuffer entry for each request and sets the corresponding entry's Valid flag high.

**Dequeue logic:**

An entry is dequeued once every split of its Uop has been sent to the scalar Load Pipeline, that is, when splitIdx is greater than or equal to the required number of splits. The splitIdx counter is then reset to zero.

**Split:**

VLSplitBuffer splits instructions based on their type.
For Unit-Stride instructions:
- When the base address is aligned (the access does not cross a CacheLine boundary), a single 128-bit access is performed.
- When the base address is unaligned (the access crosses a CacheLine boundary), we perform a split, initiating two 128-bit memory accesses.
For other Vector Load instructions, we split them according to the semantic requirements of the instructions and perform memory accesses element by element.
Each split sends the generated relevant information to the scalar Load Pipeline for actual memory access.
The splitting is controlled by the splitIdx counter, where splitIdx indicates the number of splits already performed for the current entry. When splitIdx is less than the required number of splits and the split can be sent to the scalar Load Pipeline, a split occurs and the splitIdx counter is incremented. When splitIdx is greater than or equal to the required number of splits, the splitting ends, the entry is dequeued, and the splitIdx counter is reset to zero.
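
The splitIdx behavior described above can be modeled with a small cycle-level sketch. This is an illustrative Python model only: the class and method names are hypothetical, and the exact cycle in which the real entry is released may differ from this model, which spends one extra step on the dequeue.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SplitBufferModel:
    """Hypothetical single-entry model of the splitIdx counter."""
    valid: bool = False
    split_idx: int = 0   # splits already performed for the current entry
    split_num: int = 0   # required number of splits

    def enqueue(self, split_num: int) -> bool:
        """Allocate the entry if it is free; return whether it was accepted."""
        if self.valid:
            return False
        self.valid, self.split_idx, self.split_num = True, 0, split_num
        return True

    def step(self, pipeline_ready: bool) -> Optional[int]:
        """Advance one cycle; return the index of the split issued, if any."""
        if not self.valid:
            return None
        if self.split_idx >= self.split_num:
            # All splits done: dequeue the entry and reset the counter.
            self.valid = False
            self.split_idx = 0
            return None
        if pipeline_ready:
            issued = self.split_idx
            self.split_idx += 1
            return issued
        return None  # stalled: the scalar Load Pipeline cannot accept the split

buf = SplitBufferModel()
buf.enqueue(3)
issued = [buf.step(ready) for ready in (True, False, True, True)]
```

Here `issued` is `[0, None, 1, 2]`: the second cycle stalls because the pipeline is not ready, and a further `step` call releases the entry.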

**Address calculation:**

During splitting, it is also necessary to calculate the relevant information to be sent to the scalar Load Pipeline, primarily determining the virtual address for each split memory access.
The virtual address calculation varies based on the instruction type's splitting method.
For Unit-Stride instructions:
- When the base address is aligned (the access does not cross a CacheLine boundary), a single 128-bit aligned access is sufficient.
- When the base address is unaligned (the access crosses a CacheLine boundary), we split it and use two consecutive 128-bit aligned addresses for access.
For other Vector Load instructions, we split them element-wise according to the instruction semantics, with virtual addresses calculated based on the elements and semantics.
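
The address rules above can be sketched as follows. This Python model is an assumption-laden illustration, not the RTL: it treats "aligned" as 16-byte (128-bit) alignment of the base address, whereas the text ties the condition to CacheLine crossing, so the real check may differ. The function names are hypothetical. The strided and indexed formulas follow the usual RVV element semantics (base plus i times stride, and base plus the byte offset held in the index element).

```python
from typing import List

ACCESS_BYTES = 16  # one 128-bit access

def unit_stride_addrs(base: int) -> List[int]:
    """One aligned access if base is 16-byte aligned, otherwise two consecutive ones."""
    aligned = base & ~(ACCESS_BYTES - 1)
    if aligned == base:
        return [aligned]
    return [aligned, aligned + ACCESS_BYTES]

def strided_addrs(base: int, stride: int, num_elems: int) -> List[int]:
    """Element i of a strided load is at base + i * stride (stride in bytes)."""
    return [base + i * stride for i in range(num_elems)]

def indexed_addrs(base: int, indexes: List[int]) -> List[int]:
    """Element i of an indexed load is at base + indexes[i] (byte offsets)."""
    return [base + idx for idx in indexes]
```

For example, `unit_stride_addrs(0x1008)` yields the two aligned addresses 0x1000 and 0x1010, while `unit_stride_addrs(0x1000)` yields only 0x1000.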

**Redirection and exception handling:** When a redirection signal arrives, relevant entries in VLSplitBuffer are flushed based on the redirection information.

Feature 3: Apply backpressure based on the Threshold signal from VLMergeBuffer

Refer to Threshold Backpressure. When the threshold signal from the VLMergeBuffer is asserted, the VLSplitPipeline backpressures the enqueue request, preventing the backend from sending new Uops until the VLMergeBuffer releases the threshold backpressure.
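
A minimal sketch of this backpressure, assuming that the threshold signal simply gates the Ready of the `in` port: the function name and the per-cycle list representation are hypothetical, and the real condition may include further terms.

```python
from typing import List

def accepted_cycles(in_valid: List[bool], can_accept: List[bool],
                    threshold: List[bool]) -> List[bool]:
    """Per cycle, report whether a Uop is accepted on the `in` port.

    Assumed rule: ready = can_accept and not threshold; a Uop is
    accepted when valid and ready are both high in the same cycle.
    """
    return [v and c and not t
            for v, c, t in zip(in_valid, can_accept, threshold)]

fired = accepted_cycles(
    in_valid=[True, True, True, False],
    can_accept=[True, True, True, True],
    threshold=[False, True, False, False],
)
```

In this trace the Uop offered in the second cycle is held back while threshold is high and is accepted in the third cycle, once the backpressure is released.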

Overall Block Diagram

No block diagram for a single module.

Main ports

Only the external interfaces of VLSplit are listed; the internal interfaces of VLSplitPipeline and VLSplitBuffer are excluded.

| Port | Direction | Description |
| --- | --- | --- |
| redirect | In | Redirect port |
| in | In | Receive uop dispatch from the Issue Queue. |
| toMergeBuffer.req | Out | Request MergeBuffer entry |
| toMergeBuffer.resp | In | MergeBuffer response |
| out | Out | Send memory access requests to the Load Unit. |
| threshold | In | Receive the threshold signal from VLMergeBuffer. |

Interface timing

The interface timing is relatively simple, described only in text.

| Port | Description |
| --- | --- |
| redirect | Has Valid status. Data is valid when Valid is asserted. |
| in | Includes Valid and Ready signals. Data is valid when Valid && Ready. |
| toMergeBuffer.req | Includes Valid and Ready signals. Data is valid when Valid && Ready. |
| toMergeBuffer.resp | Has Valid status. Data is valid when Valid is asserted. |
| out | Includes Valid and Ready signals. Data is valid when Valid && Ready. |
| threshold | No Valid signal; data is always considered valid, and responses are generated as soon as the corresponding signal is present. |