CPU Core Overview
CPU核心总览
XiangShan-2 (NANHU) supports single-core and dual-core configurations, where each core has its own private L1/L2 cache. L3 is shared by multiple cores.
第二代香山(南湖)支持单核和双核配置,每个核心都有自己的私有L1/L2高速缓存。L3缓存由多个核心共享。
NANHU communicates with the uncore through 3 AXI interfaces, including the memory port, the DMA port and the peripheral port. It also has clock, reset, and JTAG interfaces. Please refer to the integration guide for more detailed information.
南湖通过3个AXI接口与非核心部分通信,包括内存端口、DMA端口和外设端口。它还具有时钟、复位和JTAG接口。更多详细信息请参阅集成指南。
NANHU targets 2GHz@14nm, and 2.4GHz~2.8GHz@7nm.
南湖的目标是在14nm工艺下频率达到2GHz,在7nm工艺下频率达到2.4GHz到2.8GHz。
Typical Configurations 典型配置
Below is the typical NANHU core configurations:
以下是南湖核心的典型配置:
Feature | NANHU (XiangShan-2) |
---|---|
Pipeline stage 流水级数 |
11 |
Decoder width 译码宽度 |
6 |
Rename width 重命名宽度 |
6 |
ROB 重排序缓冲 |
256 |
Physical register 物理寄存器宽度 |
192(integer), 192(float) |
Load Queue 加载队列 |
80 |
Store Queue 存储队列 |
64 |
L1 Instruction Cache L1指令高速缓存 |
64KB/128KB (4/8-way) |
L1 Data Cache L1数据高速缓存 |
64KB/128KB(4/8 way) |
L2 Cache L2高速缓存 |
512KB/1MB, 8-way, non inclusive |
L3 Cache L3高速缓存 |
2MB~8MB, 8-way, non inclusive |
Physical RF size 物理寄存器堆大小 |
192x64 bits, 14R8W |
ECC Support ECC支持 |
Y |
Virtual Memory Support 虚拟内存支持 |
Y |
Physical memory protection 物理内存保护 |
Y |
Virtualization 虚拟化 |
N |
Vector 向量化 |
N |
ISA Support ISA支持
Instruction Set | Description |
---|---|
I | Integer 整数指令 |
M | Integer Multiplication and Division 整数乘法与除法指令 |
A | Atomics 原子指令 |
F | Single-Precision Floating-Point 单精度浮点数指令 |
D | Double-Precision Floating-Point 双精度浮点数指令 |
C | 16-bit Compressed Instructions 16位压缩指令 |
Zba | Bitmanip Extension - address generation 位操作扩展-地址生成指令 |
Zbb | Bitmanip Extension - basic bit manipulation 位操作扩展-基本位操作指令 |
Zbc | Bitmanip Extension - carryless multiplication 位操作扩展-无进位乘法指令 |
Zbs | Bitmanip Extension - single-bit instructions 位操作扩展-单比特指令 |
zbkb | Cryptography Extensions - Bitmanip instructions 加密扩展-位操作指令 |
Zbkc | Cryptography Extensions - Carry-less multiply instructions 加密扩展-无进位乘法指令 |
zbkx | Cryptography Extensions - Crossbar permutation instructions 加密扩展-交叉排列指令 |
zknd | Cryptography Extensions - AES Decryption 加密扩展-AES解密指令 |
zkne | Cryptography Extensions - AES Encryption 加密扩展-AES加密指令 |
zknh | Cryptography Extensions - Hash Function Instructions AES扩展-哈希函数指令 |
zksed | Cryptography Extensions - SM4 Block Cipher Instructions 加密扩展-SM4块密码指令 |
zksh | Cryptography Extensions - SM3 Hash Function Instructions 加密扩展-SM3哈希函数指令 |
svinval | Fine-Grained Address-Translation Cache Invalidation 细粒度地址转换缓存失效指令 |
Instruction Latency 指令延迟
Most arithmetic instructions are single-cycle (Latency = 1
).
Multi-cycle instructions are listed as follows.
绝大部分算术指令都是单周期指令(延迟为1
)。多周期指令在下表中列出:
Instruction(s) / Operations | Latency | Descriptions |
---|---|---|
LD |
4 (to ALU and LD), 5 (others) | Load operations (to use) 加载操作(用于执行) |
MUL |
3 | Integer multiplier 整数乘法 |
DIV |
4~20 | Integer divider (SRT16) 整数除法(SRT16) |
FMA |
5 | Floating-point multiply-add instruction (cascade FMA) 浮点乘法加法指令(cascade FMA) |
FADD , FMUL |
3 | Floating-point add/multiply operations 浮点加法/乘法运算 |
FDIV/SQRT |
3~18 | Floating-point div/sqrt operations 浮点除法/平方根运算 |
CLZ(W) , CTZ(W) , CPOP(W) , XPERM(8/4) , CLMUL(H/R) |
3 | Complex bit manipulation 复杂位操作 |
AES64* , SHA256* , SHA512* , SM3* , SM4* |
3 | Complex scalar crypto operations 复杂标量加密操作 |
Priviledge Mode 特权等级
NANHU supports three levels of privilege mode: machine (M), supervisor (S), and user (U).
南湖支持三级特权模式:machine(M), supervisor(S), 和user(U)。
Microarchitecture 微架构
Please refer to Section CPU Core for more details.
更多细节请参阅CPU Core章节
Cache Controller 高速缓存控制器
There is a cache controller connected to L3 Cache, which used to perform Cache Maintenance Operation (CMO). Programmers ought to use MMIO based memory access to trigger operation required.
南湖的缓存控制器与L3高速缓存相连,用于执行缓存维护操作(CMO)。编程人员应使用基于 MMIO 的内存访问来触发所需的操作。
The following is a register table of the L3 cache controller.
下面是 L3 缓存控制器的寄存器列表。
Address | Width | Attr. | Description |
---|---|---|---|
0x3900_0100 | 8B | RW | Tag register of the interest cache block 感兴趣缓存块的 标签寄存器 |
0x3900_0108 | 8B | RW | Set register of the interest cache block 感兴趣(or关注)缓存块的 组寄存器 |
0x3900_0110 | 8B | RW | Way register of the interest cache block (deprecated) 感兴趣缓存的 路寄存器 |
0x3900_0118 - 0x3900_0150 | 64B in total | RW | Data register of the interest cache block (deprecated) 感兴趣缓存块的 数据寄存器 |
0x3900_0180 | 8B | RO | Flag register indicates ready for receiving next command 指示可以接收下一命令的 标志寄存器 Value 1 indicates ready, 0 indicates not ready 值为1表示就绪,值为0表示未就绪 |
0x3900_0200 | 8B | WO | Command register for cache operation用于缓存操作的 指令寄存器 Supported commands are: 支持的指令有: Command Number 16 ( CMD_CMO_INV )Command Number 17 ( CMD_CMO_CLEAN )Command Number 18 ( CMD_CMO_FLUSH ) |
A standard Cache operation follows the following process:
标准的缓存操作过程如下:
-
Inquire the
Flag
register, which indicates ready for receiving requests when valid
查询标志寄存器
,该寄存器有效时表示已准备好接收请求 -
Set
Tag
register to the tag of the interest cache block
将标签寄存器
设置为感兴趣缓存块的标签 -
Set
Set
register to the set of the interest cache block
将组寄存器
设置为感兴趣缓存块的组 -
Write command number to
Command
register
将指令编号写入指令寄存器
Afterwards, the command is desired to be done.
此后,指令执行完毕
There are three commands available.
缓存控制器有三个可用的操作:
-
Command Number 16 (
CMD_CMO_INV
): Invalidate the cache block from cache hierarchy (Note that this operation may break cache coherence).
使缓存块从缓存层次结构中失效(注意,此操作可能会破坏缓存一致性)。 -
Command Number 17 (
CMD_CMO_CLEAN
): make cache block data in memory up-to-date. In other words, write back a block to memory if it is dirty in cache hierarchy. In current implementation, this command behaves just likeCMD_CMO_FLUSH
.
换句话说,如果缓存块在缓存层次结构中是脏的,则将其写回内存。在目前的实现中,该命令的行为与CMD_CMO_FLUSH
类似 -
Command Number 18 (
CMD_CMO_FLUSH
): flush the cache block to memory. In other words, write back a block to memory and invalidate the block.
将缓存块刷新到内存中。换句话说,就是将数据块写回内存并使其失效。
Hardware Performance Monitor (HPM) 硬件性能计数器
Architecture 架构
Using distributed HPM (hardware performance monitor).
There is an independent HPM in each block, and the HPM can also contain mirrored values of some other CSRs (the mirrored registers can only be modified by instructions).
Each HPM contains multiple performance counter registers for counting internal events.
For the number of performance counters, refer to the number of performance events to be counted simultaneously for different blocks.
Each performance counter contains the following registers:
南湖使用分布式 HPM(硬件性能监视器)。
每个区块都有一个独立的HPM,HPM还可以包含一些其他CSR的镜像值(镜像寄存器只能通过指令修改)。
每个 HPM包含多个性能计数寄存器,用于统计内部事件。
性能计数器的数量请参考不同区块需要同时统计的性能事件数量。
每个性能计数器包含以下寄存器:
Each hpmevent is 64 bits. In order to count the events combined by multiple events, the event fields are now split according to the function. The split is as follows.
每个 hpmevent 寄存器为 64 位。为了统计由多个事件组合成的事件, event域目前根据功能进行划分。划分方法如下。
Mode represents that the corresponding performance counter is to be counted in a specific mode. Onehot encoding is adopted, and the encoding table is as follows.
Mode表示相应的性能计数器在特定模式下计数,采用Onehot编码,编码表如下。
Table.1 Performance counter register
表1.性能计数寄存器
Privilege Mode | Mode Coding | Event Coding |
---|---|---|
M | Mode[4] | Mode[63] |
H | Mode[3] | Mode[62] |
S | Mode[2] | Mode[61] |
U | Mode[1] | Mode[60] |
D | Mode[0] | Mode[59] |
Event indicates the performance event code to be counted, with a total of four event fields. Where event equals 0 means no event, event equals all 1, means cycle. The Event coding table needs to be supplemented later. When an illegal value is written, the write operation is ignored.
Events are classified by block. Between two consecutive blocks, there can be overlapping parts. The overlap is used for performance counter statistics between blocks.
The optype encoding table is as follows:
Event表示被统计性能事件的编码,一共有四个event域。其中一个event为0表示无事件,为全1表示循环。事件编码表需要日后补充。当写入非法值时,写入操作将被忽略。
事件按块分类.两个连续的块之间可能存在重叠部分,该部分用于块间的性能计数器统计。
操作类型编码表如下:
Table.2 Mode and Event Coding
表2.Mode和Event编码表
Optype | Mode Coding |
---|---|
'b00000 |
Or |
'b00001 |
And |
'b00002 |
Xor |
'b00003 |
Add |
'b00004 |
Sub |
Table.3 Optype encoding table
表3.操作类型编码表
Name | Address | Width | Description |
---|---|---|---|
Mhpmcounter31-3 | 0xB03-0xB1F | 64 | 64-bit accumulator. Counts based on selected events. The maximum data accumulated at one time is not fixed. 64 位累加器。根据所选事件进行计数。每次累加的最大值不固定。 |
Mhpmevent31-3 | 0x323-0x33F | 64 | Event selection, decides under what conditions to count. 事件选择,决定在什么条件下计数。 |
Mcountinhibit | 0x320 | 32 | Each bit controls whether the performance counter can be accumulated. 1: The accumulator does not change; 0: Accumulator counts up according to performance. 每一位控制性能计数器是否可以累加。1:累加器不变;0:累加器根据性能计数。 |
Mcounteren | 0x306 | 32 | Controls whether the S state has permission to access the corresponding performance counter. 0: S-state program access to hpmcounter register will report an illegal instruction; 1: S-state programs can access hpmcounter. 控制S态是否有权限访问相应的性能计数器。0:S态程序访问hpmcounter寄存器将报告非法指令;1:S态程序可以访问hpmcounter。 |
Scounteren | 0x106 | 32 | According to the value of Mcounteren, it is used to control whether the U state has permission to access the corresponding performance counter. Mcounteren[i] & Scounteren[i] == 0: An illegal instruction exception will be reported when a U-state program accesses the hpmcounter[i] register; Mcounteren[i] & Scounteren[i] == 1: U-state programs can access hpmcounter[i]. 根据Mcounteren的值控制U态是否有权限访问相应的性能计数器。Mcounteren[i] & Scounteren[i] == 0:当U态程序访问 hpmcounter[i]寄存器时,将报告非法指令异常;Mcounteren[i] & Scounteren[i] == 1:U 状态程序可以访问 hpmcounter[i]。 |
pmcounter31-3 | 0xC03-0xC1F | 64 | Mirror for Mhpmcounter31-3. Mhpmcounter31-3的镜像。 |
Linux Support Linux支持
We have provided an example implementation with riscv-pk and riscv-linux. If you have any issues regarding the SBI and Linux syscall implementations, please refer to the source code.
我们提供了一个使用riscv-pk和riscv-linux实现的示例。如果您对SBI和Linux系统调用的实现有任何疑问,请参阅源代码。
We have also provided an example of the user program to configure and read the performance counters.
hpmdriver.h
includes macro definition, configuring, reading or clearing methods of performance counters, and wraps syscall; set_hpm.c
and read_hpm.c
are for configuring and reading HPM, respectively.
我们同样提供了一个配置和读取性能计数器的用户程序示例。hpmdriver.h
包括宏定义、配置、读取或清除性能计数器的方法以及封装系统调用;set_hpm.c
和 read_hpm.c
分别用于配置和读取 HPM。
List of the Performance Counters 性能计数器列表
For the update-to-date implemented performance counters, please see the Chisel elaboration logs when generating the verilog.
有关最新实现的性能计数器,请参阅生成verilog时Chisel的详细说明日志。
The table below presents an example of events. Please note that hardware performance monitors are highly configurable, so the information provided may NOT perfectly align with real-world cases. If you require additional counters, we recommend directly modifying the Chisel code.
下表列出了一个事件示例。 请注意,硬件性能监视器是高度可配置的,因此提供的信息可能与实际情况不完全一致。如果需要额外的计数器,我们建议直接修改 Chisel 代码。
Please refer to the source code for the detailed update conditions of these counters. We want to emphasize that we cannot guarantee the accuracy of the existing performance counters. It is important to understand that utilizing these counters is done at your own risk, and we advise taking necessary precautions.
有关这些计数器的详细更新条件,请参阅源代码。我们想强调的是,我们无法保证现有性能计数器的准确性。请务必理解,使用这些计数器的风险由您自行承担,我们建议您采取必要的预防措施。
Table.4 Example of the Performance Event Table
表4.性能事件表示例
BlockName | Event |
---|---|
Frontend | FrontendBubble |
IFU | frontendFlush |
IFU | to_ibuffer_package_num |
IFU | crossline |
IFU | lastInLine |
Ibuffer | ibuffer_flush |
Ibuffer | ibuffer_hungry |
IFU | to_ibuffer_cache_miss_num |
Icache | icache_miss_req |
icache | Icache_miss_penalty |
FTQ | bpu_to_ftq_stall |
FTQ | mispredictRedirect |
FTQ | replayRedirect |
FTQ | predecodeRedirect |
FTQ | to_ifu_bubble |
FTQ | to_ifu_stall |
FTQ | from_bpu_real_bubble |
FTQ | BpInstr |
FTQ | BpRight |
FTQ | BpWrong |
FTQ | BpBInstr |
FTQ | BpBRight |
FTQ | BpBWrong |
FTQ | BpJRight |
FTQ | BpJWrong |
FTQ | BpIRight |
FTQ | BpIWrong |
FTQ | BpCRight |
FTQ | BpCWrong |
FTQ | BpRRight |
FTQ | BpRWrong |
FTQ | ftb_false_hit |
FTQ | ftb_hit" |
TAGE | tage_table_hits |
TAGE | commit_use_altpred_b0 |
TAGE | commit_use_altpred_b1 |
SC | sc_update_on_mispred |
SC | sc_update_on_unconf |
BPU | s2_redirect |
uBTB | ftb_commit_hits |
uBTB | ftb_commit_misses |
uBTB | ubtb_commit_hits |
uBTB | ubtb_commit_misses |
IBUFFER | ibuffer_empty |
IBUFFER | ibuffer_12_valid |
IBUFFER | ibuffer_24_valid |
IBUFFER | ibuffer_36_valid |
IBUFFER | ibuffer_full |
FusionDecoder | fused_instr |
DecodeStage | waitInstr |
DecodeStage | stall_cycle |
DecodeStage | utilization |
DecodeStage | storeset_ssit_hit |
DecodeStage | ssit_update_lxsx |
DecodeStage | ssit_update_lysx |
DecodeStage | ssit_update_lxsy |
DecodeStage | ssit_update_lysy |
Rename | in |
Rename | waitInstr |
Rename | stall_cycle_dispatch |
Rename | stall_cycle_fp |
Rename | stall_cycle_int |
Rename | stall_cycle_walk |
BusyTable | busy_count |
StdFreeList | utilization |
StdFreeList | allocation_blocked |
StdFreeList | can_alloc_wrong |
Dispatch1 | storeset_load_wait |
Dispatch1 | storeset_store_wait |
Dispatch1 | in |
Dispatch1 | empty |
Dispatch1 | utilization |
Dispatch1 | waitInstr |
Dispatch1 | stall_cycle_lsq |
Dispatch1 | stall_cycle_roq |
Dispatch1 | stall_cycle_int_dq |
Dispatch1 | stall_cycle_fp_dq |
Dispatch1 | stall_cycle_ls_dq |
DispatchQueue_int | in |
DispatchQueue_int | out |
DispatchQueue_int | out_try |
DispatchQueue_int | fake_block |
DispatchQueue_int | queue_size_4 |
DispatchQueue_int | queue_size_8 |
DispatchQueue_int | queue_size_12 |
DispatchQueue_int | queue_size_16 |
DispatchQueue_int | queue_size_full |
DispatchQueue_fp | in |
DispatchQueue_fp | out |
DispatchQueue_fp | out_try |
DispatchQueue_fp | fake_block |
DispatchQueue_fp | queue_size_4 |
DispatchQueue_fp | queue_size_8 |
DispatchQueue_fp | queue_size_12 |
DispatchQueue_fp | queue_size_16 |
DispatchQueue_fp | queue_size_full |
DispatchQueue_lsu | in |
DispatchQueue_lsu | out |
DispatchQueue_lsu | out_try |
DispatchQueue_lsu | fake_block |
DispatchQueue_lsu | queue_size_4 |
DispatchQueue_lsu | queue_size_8 |
DispatchQueue_lsu | queue_size_12 |
DispatchQueue_lsu | queue_size_16 |
DispatchQueue_lsu | queue_size_full |
Dispatch2Ls | in |
Dispatch2Ls | out |
Dispatch2Ls | out_load0 |
Dispatch2Ls | out_load1 |
Dispatch2Ls | out_store0 |
Dispatch2Ls | out_store1 |
Dispatch2Ls | blocked |
roq | interrupt_num |
roq | exception_num |
roq | flush_pipe_num |
roq | replay_inst_num |
roq | clock_cycle |
roq | commitUop |
roq | commitInstr |
roq | commitInstrMove |
roq | commitInstrMoveElim |
roq | commitInstrFused |
roq | commitInstrLoad |
roq | commitInstrLoadWait |
roq | commitInstrStore |
roq | writeback |
roq | walkInstr |
roq | walkCycle |
roq | queue_1/4 |
roq | queue_1/2 |
roq | queue_3/4 |
roq | queue_full |
rs | alu0_rs_full |
rs | alu1_rs_full |
rs | alu2_rs_full |
rs | alu3_rs_full |
rs | load0_rs_full |
rs | load1_rs_full |
rs | store0_rs_full |
rs | store1_rs_full |
rs | mdu_rs_full |
rs | misc_rs_full |
rs | fmac0_rs_full |
rs | fmac1_rs_full |
rs | fmac2_rs_full |
rs | fmac3_rs_full |
rs | fmisc0_rs_full |
rs | fmac1_rs_full |
PageTableWalker | fsm_count |
TLB | first_access |
TLB | access |
TLB | first_miss |
TLB | miss |
TLBStorage | access |
TLBStorage | hit |
PageTableCache | access |
PageTableCache | l1_hit |
PageTableCache | l2_hit |
PageTableCache | l3_hit |
PageTableCache | sp_hit |
PageTableCache | pte_hit |
L2TLBMissQueue | mq_in_count |
L2TLBMissQueue | mem_count |
L2TLBMissQueue | mem_cycle |
MemBlock | load_rs_deq_count |
MemBlock | store_rs_deq_count |
StoreQueue | vaddr_match_failed |
StoreQueue | vaddr_match_really_failed |
StoreQueue | queue_1/4 |
StoreQueue | queue_1/2 |
StoreQueue | queue_3/4 |
StoreQueue | queue_full |
LoadQueue | rollback |
LoadQueue | queue_1/4 |
LoadQueue | queue_1/2 |
LoadQueue | queue_3/4 |
LoadQueue | queue_full |
LoadQueue | refill |
LoadQueue | utilization_miss |
LoadUnit | in |
LoadUnit | tlb_miss |
LoadUnit | in |
LoadUnit | dcache_miss |
LoadPipe | load_req |
LoadPipe | load_hit_way |
LoadPipe | load_replay_for_data_nack |
LoadPipe | load_replay_for_no_mshr |
LoadPipe | load_hit |
LoadPipe | load_miss |
NewSbuffer | do_uarch_drain |
NewSbuffer | sbuffer_req_valid |
NewSbuffer | sbuffer_req_fire |
NewSbuffer | sbuffer_merge |
NewSbuffer | dcache_req_valid |
NewSbuffer | dcache_req_fire |
NewSbuffer | StoreQueueSize |
WritebackQueue | wb_req |
WritebackQueue | wb_release |
WritebackQueue | wb_probe_resp |
MainPipe | pipe_req |
MainPipe | pipe_total_penalty |
dcache | MissQueue |
dcache | MissQueue |
dcache | MissQueue |
dcache | MissQueue |
dcache | MissQueue |
dcache | Probe |
dcache | Probe |