跳转至

CPU Core Overview
CPU核心总览

XiangShan-2 (NANHU) supports single-core and dual-core configurations, where each core has its own private L1/L2 cache. L3 is shared by multiple cores.
第二代香山(南湖)支持单核和双核配置,每个核心都有自己的私有L1/L2高速缓存。L3缓存由多个核心共享。

NANHU communicates with the uncore through 3 AXI interfaces, including the memory port, the DMA port and the peripheral port. It also has clock, reset, and JTAG interfaces. Please refer to the integration guide for more detailed information.
南湖通过3个AXI接口与非核心部分通信,包括内存端口、DMA端口和外设端口。它还具有时钟、复位和JTAG接口。更多详细信息请参阅集成指南。

NANHU targets 2GHz@14nm, and 2.4GHz~2.8GHz@7nm.
南湖的目标是在14nm工艺下频率达到2GHz,在7nm工艺下频率达到2.4GHz到2.8GHz。

Typical Configurations   典型配置

Below is the typical NANHU core configurations:
以下是南湖核心的典型配置:

Feature NANHU (XiangShan-2)
Pipeline stage
流水级数
11
Decoder width
译码宽度
6
Rename width
重命名宽度
6
ROB
重排序缓冲
256
Physical register
物理寄存器宽度
192(integer), 192(float)
Load Queue
加载队列
80
Store Queue
存储队列
64
L1 Instruction Cache
L1指令高速缓存
64KB/128KB (4/8-way)
L1 Data Cache
L1数据高速缓存
64KB/128KB(4/8 way)
L2 Cache
L2高速缓存
512KB/1MB, 8-way, non inclusive
L3 Cache
L3高速缓存
2MB~8MB, 8-way, non inclusive
Physical RF size
物理寄存器堆大小
192x64 bits, 14R8W
ECC Support
ECC支持
Y
Virtual Memory Support
虚拟内存支持
Y
Physical memory protection
物理内存保护
Y
Virtualization
虚拟化
N
Vector
向量化
N

ISA Support   ISA支持

Instruction Set Description
I Integer
整数指令
M Integer Multiplication and Division
整数乘法与除法指令
A Atomics
原子指令
F Single-Precision Floating-Point
单精度浮点数指令
D Double-Precision Floating-Point
双精度浮点数指令
C 16-bit Compressed Instructions
16位压缩指令
Zba Bitmanip Extension - address generation
位操作扩展-地址生成指令
Zbb Bitmanip Extension - basic bit manipulation
位操作扩展-基本位操作指令
Zbc Bitmanip Extension - carryless multiplication
位操作扩展-无进位乘法指令
Zbs Bitmanip Extension - single-bit instructions
位操作扩展-单比特指令
zbkb Cryptography Extensions - Bitmanip instructions
加密扩展-位操作指令
Zbkc Cryptography Extensions - Carry-less multiply instructions
加密扩展-无进位乘法指令
zbkx Cryptography Extensions - Crossbar permutation instructions
加密扩展-交叉排列指令
zknd Cryptography Extensions - AES Decryption
加密扩展-AES解密指令
zkne Cryptography Extensions - AES Encryption
加密扩展-AES加密指令
zknh Cryptography Extensions - Hash Function Instructions
AES扩展-哈希函数指令
zksed Cryptography Extensions - SM4 Block Cipher Instructions
加密扩展-SM4块密码指令
zksh Cryptography Extensions - SM3 Hash Function Instructions
加密扩展-SM3哈希函数指令
svinval Fine-Grained Address-Translation Cache Invalidation
细粒度地址转换缓存失效指令

Instruction Latency   指令延迟

Most arithmetic instructions are single-cycle (Latency = 1). Multi-cycle instructions are listed as follows.
绝大部分算术指令都是单周期指令(延迟为1)。多周期指令在下表中列出:

Instruction(s) / Operations Latency Descriptions
LD 4 (to ALU and LD), 5 (others) Load operations (to use)
加载操作(用于执行)
MUL 3 Integer multiplier
整数乘法
DIV 4~20 Integer divider (SRT16)
整数除法(SRT16)
FMA 5 Floating-point multiply-add instruction (cascade FMA)
浮点乘法加法指令(cascade FMA)
FADD, FMUL 3 Floating-point add/multiply operations
浮点加法/乘法运算
FDIV/SQRT 3~18 Floating-point div/sqrt operations
浮点除法/平方根运算
CLZ(W), CTZ(W), CPOP(W), XPERM(8/4), CLMUL(H/R) 3 Complex bit manipulation
复杂位操作
AES64*, SHA256*, SHA512*, SM3*, SM4* 3 Complex scalar crypto operations
复杂标量加密操作

Priviledge Mode   特权等级

NANHU supports three levels of privilege mode: machine (M), supervisor (S), and user (U).
南湖支持三级特权模式:machine(M), supervisor(S), 和user(U)。

Microarchitecture   微架构

Please refer to Section CPU Core for more details.
更多细节请参阅CPU Core章节

Cache Controller   高速缓存控制器

There is a cache controller connected to L3 Cache, which used to perform Cache Maintenance Operation (CMO). Programmers ought to use MMIO based memory access to trigger operation required.
南湖的缓存控制器与L3高速缓存相连,用于执行缓存维护操作(CMO)。编程人员应使用基于 MMIO 的内存访问来触发所需的操作。

The following is a register table of the L3 cache controller.
下面是 L3 缓存控制器的寄存器列表。

Address Width Attr. Description
0x3900_0100 8B RW Tag register of the interest cache block
感兴趣缓存块的标签寄存器
0x3900_0108 8B RW Set register of the interest cache block
感兴趣(or关注)缓存块的组寄存器
0x3900_0110 8B RW Way register of the interest cache block (deprecated)
感兴趣缓存的路寄存器
0x3900_0118 - 0x3900_0150 64B in total RW Data register of the interest cache block (deprecated)
感兴趣缓存块的数据寄存器
0x3900_0180 8B RO Flag register indicates ready for receiving next command
指示可以接收下一命令的标志寄存器
Value 1 indicates ready, 0 indicates not ready
值为1表示就绪,值为0表示未就绪
0x3900_0200 8B WO Command register for cache operation
用于缓存操作的指令寄存器
Supported commands are:
支持的指令有:
Command Number 16 (CMD_CMO_INV)
Command Number 17 (CMD_CMO_CLEAN)
Command Number 18 (CMD_CMO_FLUSH)

A standard Cache operation follows the following process:
标准的缓存操作过程如下:

  1. Inquire the Flag register, which indicates ready for receiving requests when valid
    查询标志寄存器,该寄存器有效时表示已准备好接收请求

  2. Set Tag register to the tag of the interest cache block
    标签寄存器设置为感兴趣缓存块的标签

  3. Set Set register to the set of the interest cache block
    组寄存器设置为感兴趣缓存块的组

  4. Write command number to Command register
    将指令编号写入指令寄存器

Afterwards, the command is desired to be done.
此后,指令执行完毕

There are three commands available.
缓存控制器有三个可用的操作:

  1. Command Number 16 (CMD_CMO_INV): Invalidate the cache block from cache hierarchy (Note that this operation may break cache coherence).
    使缓存块从缓存层次结构中失效(注意,此操作可能会破坏缓存一致性)。

  2. Command Number 17 (CMD_CMO_CLEAN): make cache block data in memory up-to-date. In other words, write back a block to memory if it is dirty in cache hierarchy. In current implementation, this command behaves just like CMD_CMO_FLUSH.
    换句话说,如果缓存块在缓存层次结构中是脏的,则将其写回内存。在目前的实现中,该命令的行为与 CMD_CMO_FLUSH 类似

  3. Command Number 18 (CMD_CMO_FLUSH): flush the cache block to memory. In other words, write back a block to memory and invalidate the block.
    将缓存块刷新到内存中。换句话说,就是将数据块写回内存并使其失效。

Hardware Performance Monitor (HPM)   硬件性能计数器

Architecture   架构

Using distributed HPM (hardware performance monitor). There is an independent HPM in each block, and the HPM can also contain mirrored values of some other CSRs (the mirrored registers can only be modified by instructions). Each HPM contains multiple performance counter registers for counting internal events. For the number of performance counters, refer to the number of performance events to be counted simultaneously for different blocks. Each performance counter contains the following registers:
南湖使用分布式 HPM(硬件性能监视器)。 每个区块都有一个独立的HPM,HPM还可以包含一些其他CSR的镜像值(镜像寄存器只能通过指令修改)。 每个 HPM包含多个性能计数寄存器,用于统计内部事件。 性能计数器的数量请参考不同区块需要同时统计的性能事件数量。 每个性能计数器包含以下寄存器:

Each hpmevent is 64 bits. In order to count the events combined by multiple events, the event fields are now split according to the function. The split is as follows.
每个 hpmevent 寄存器为 64 位。为了统计由多个事件组合成的事件, event域目前根据功能进行划分。划分方法如下。

Mode represents that the corresponding performance counter is to be counted in a specific mode. Onehot encoding is adopted, and the encoding table is as follows.
Mode表示相应的性能计数器在特定模式下计数,采用Onehot编码,编码表如下。

Table.1 Performance counter register
表1.性能计数寄存器

Privilege Mode Mode Coding Event Coding
M Mode[4] Mode[63]
H Mode[3] Mode[62]
S Mode[2] Mode[61]
U Mode[1] Mode[60]
D Mode[0] Mode[59]

Event indicates the performance event code to be counted, with a total of four event fields. Where event equals 0 means no event, event equals all 1, means cycle. The Event coding table needs to be supplemented later. When an illegal value is written, the write operation is ignored. Events are classified by block. Between two consecutive blocks, there can be overlapping parts. The overlap is used for performance counter statistics between blocks. The optype encoding table is as follows:
Event表示被统计性能事件的编码,一共有四个event域。其中一个event为0表示无事件,为全1表示循环。事件编码表需要日后补充。当写入非法值时,写入操作将被忽略。 事件按块分类.两个连续的块之间可能存在重叠部分,该部分用于块间的性能计数器统计。 操作类型编码表如下:

Table.2 Mode and Event Coding
表2.Mode和Event编码表

Optype Mode Coding
'b00000 Or
'b00001 And
'b00002 Xor
'b00003 Add
'b00004 Sub

Table.3 Optype encoding table
表3.操作类型编码表

Name Address Width Description
Mhpmcounter31-3 0xB03-0xB1F 64 64-bit accumulator. Counts based on selected events. The maximum data accumulated at one time is not fixed.
64 位累加器。根据所选事件进行计数。每次累加的最大值不固定。
Mhpmevent31-3 0x323-0x33F 64 Event selection, decides under what conditions to count.
事件选择,决定在什么条件下计数。
Mcountinhibit 0x320 32 Each bit controls whether the performance counter can be accumulated. 1: The accumulator does not change; 0: Accumulator counts up according to performance.
每一位控制性能计数器是否可以累加。1:累加器不变;0:累加器根据性能计数。
Mcounteren 0x306 32 Controls whether the S state has permission to access the corresponding performance counter. 0: S-state program access to hpmcounter register will report an illegal instruction; 1: S-state programs can access hpmcounter.
控制S态是否有权限访问相应的性能计数器。0:S态程序访问hpmcounter寄存器将报告非法指令;1:S态程序可以访问hpmcounter。
Scounteren 0x106 32 According to the value of Mcounteren, it is used to control whether the U state has permission to access the corresponding performance counter. Mcounteren[i] & Scounteren[i] == 0: An illegal instruction exception will be reported when a U-state program accesses the hpmcounter[i] register; Mcounteren[i] & Scounteren[i] == 1: U-state programs can access hpmcounter[i].
根据Mcounteren的值控制U态是否有权限访问相应的性能计数器。Mcounteren[i] & Scounteren[i] == 0:当U态程序访问 hpmcounter[i]寄存器时,将报告非法指令异常;Mcounteren[i] & Scounteren[i] == 1:U 状态程序可以访问 hpmcounter[i]。
pmcounter31-3 0xC03-0xC1F 64 Mirror for Mhpmcounter31-3.
Mhpmcounter31-3的镜像。

Linux Support   Linux支持

We have provided an example implementation with riscv-pk and riscv-linux. If you have any issues regarding the SBI and Linux syscall implementations, please refer to the source code.
我们提供了一个使用riscv-pkriscv-linux实现的示例。如果您对SBI和Linux系统调用的实现有任何疑问,请参阅源代码。

We have also provided an example of the user program to configure and read the performance counters. hpmdriver.h includes macro definition, configuring, reading or clearing methods of performance counters, and wraps syscall; set_hpm.c and read_hpm.c are for configuring and reading HPM, respectively.
我们同样提供了一个配置和读取性能计数器的用户程序示例hpmdriver.h 包括宏定义、配置、读取或清除性能计数器的方法以及封装系统调用;set_hpm.cread_hpm.c 分别用于配置和读取 HPM。

List of the Performance Counters   性能计数器列表

For the update-to-date implemented performance counters, please see the Chisel elaboration logs when generating the verilog.
有关最新实现的性能计数器,请参阅生成verilog时Chisel的详细说明日志。

The table below presents an example of events. Please note that hardware performance monitors are highly configurable, so the information provided may NOT perfectly align with real-world cases. If you require additional counters, we recommend directly modifying the Chisel code.
下表列出了一个事件示例。 请注意,硬件性能监视器是高度可配置的,因此提供的信息可能与实际情况不完全一致。如果需要额外的计数器,我们建议直接修改 Chisel 代码。

Please refer to the source code for the detailed update conditions of these counters. We want to emphasize that we cannot guarantee the accuracy of the existing performance counters. It is important to understand that utilizing these counters is done at your own risk, and we advise taking necessary precautions.
有关这些计数器的详细更新条件,请参阅源代码。我们想强调的是,我们无法保证现有性能计数器的准确性。请务必理解,使用这些计数器的风险由您自行承担,我们建议您采取必要的预防措施。

Table.4 Example of the Performance Event Table
表4.性能事件表示例

BlockName Event
Frontend FrontendBubble
IFU frontendFlush
IFU to_ibuffer_package_num
IFU crossline
IFU lastInLine
Ibuffer ibuffer_flush
Ibuffer ibuffer_hungry
IFU to_ibuffer_cache_miss_num
Icache icache_miss_req
icache Icache_miss_penalty
FTQ bpu_to_ftq_stall
FTQ mispredictRedirect
FTQ replayRedirect
FTQ predecodeRedirect
FTQ to_ifu_bubble
FTQ to_ifu_stall
FTQ from_bpu_real_bubble
FTQ BpInstr
FTQ BpRight
FTQ BpWrong
FTQ BpBInstr
FTQ BpBRight
FTQ BpBWrong
FTQ BpJRight
FTQ BpJWrong
FTQ BpIRight
FTQ BpIWrong
FTQ BpCRight
FTQ BpCWrong
FTQ BpRRight
FTQ BpRWrong
FTQ ftb_false_hit
FTQ ftb_hit"
TAGE tage_table_hits
TAGE commit_use_altpred_b0
TAGE commit_use_altpred_b1
SC sc_update_on_mispred
SC sc_update_on_unconf
BPU s2_redirect
uBTB ftb_commit_hits
uBTB ftb_commit_misses
uBTB ubtb_commit_hits
uBTB ubtb_commit_misses
IBUFFER ibuffer_empty
IBUFFER ibuffer_12_valid
IBUFFER ibuffer_24_valid
IBUFFER ibuffer_36_valid
IBUFFER ibuffer_full
FusionDecoder fused_instr
DecodeStage waitInstr
DecodeStage stall_cycle
DecodeStage utilization
DecodeStage storeset_ssit_hit
DecodeStage ssit_update_lxsx
DecodeStage ssit_update_lysx
DecodeStage ssit_update_lxsy
DecodeStage ssit_update_lysy
Rename in
Rename waitInstr
Rename stall_cycle_dispatch
Rename stall_cycle_fp
Rename stall_cycle_int
Rename stall_cycle_walk
BusyTable busy_count
StdFreeList utilization
StdFreeList allocation_blocked
StdFreeList can_alloc_wrong
Dispatch1 storeset_load_wait
Dispatch1 storeset_store_wait
Dispatch1 in
Dispatch1 empty
Dispatch1 utilization
Dispatch1 waitInstr
Dispatch1 stall_cycle_lsq
Dispatch1 stall_cycle_roq
Dispatch1 stall_cycle_int_dq
Dispatch1 stall_cycle_fp_dq
Dispatch1 stall_cycle_ls_dq
DispatchQueue_int in
DispatchQueue_int out
DispatchQueue_int out_try
DispatchQueue_int fake_block
DispatchQueue_int queue_size_4
DispatchQueue_int queue_size_8
DispatchQueue_int queue_size_12
DispatchQueue_int queue_size_16
DispatchQueue_int queue_size_full
DispatchQueue_fp in
DispatchQueue_fp out
DispatchQueue_fp out_try
DispatchQueue_fp fake_block
DispatchQueue_fp queue_size_4
DispatchQueue_fp queue_size_8
DispatchQueue_fp queue_size_12
DispatchQueue_fp queue_size_16
DispatchQueue_fp queue_size_full
DispatchQueue_lsu in
DispatchQueue_lsu out
DispatchQueue_lsu out_try
DispatchQueue_lsu fake_block
DispatchQueue_lsu queue_size_4
DispatchQueue_lsu queue_size_8
DispatchQueue_lsu queue_size_12
DispatchQueue_lsu queue_size_16
DispatchQueue_lsu queue_size_full
Dispatch2Ls in
Dispatch2Ls out
Dispatch2Ls out_load0
Dispatch2Ls out_load1
Dispatch2Ls out_store0
Dispatch2Ls out_store1
Dispatch2Ls blocked
roq interrupt_num
roq exception_num
roq flush_pipe_num
roq replay_inst_num
roq clock_cycle
roq commitUop
roq commitInstr
roq commitInstrMove
roq commitInstrMoveElim
roq commitInstrFused
roq commitInstrLoad
roq commitInstrLoadWait
roq commitInstrStore
roq writeback
roq walkInstr
roq walkCycle
roq queue_1/4
roq queue_1/2
roq queue_3/4
roq queue_full
rs alu0_rs_full
rs alu1_rs_full
rs alu2_rs_full
rs alu3_rs_full
rs load0_rs_full
rs load1_rs_full
rs store0_rs_full
rs store1_rs_full
rs mdu_rs_full
rs misc_rs_full
rs fmac0_rs_full
rs fmac1_rs_full
rs fmac2_rs_full
rs fmac3_rs_full
rs fmisc0_rs_full
rs fmac1_rs_full
PageTableWalker fsm_count
TLB first_access
TLB access
TLB first_miss
TLB miss
TLBStorage access
TLBStorage hit
PageTableCache access
PageTableCache l1_hit
PageTableCache l2_hit
PageTableCache l3_hit
PageTableCache sp_hit
PageTableCache pte_hit
L2TLBMissQueue mq_in_count
L2TLBMissQueue mem_count
L2TLBMissQueue mem_cycle
MemBlock load_rs_deq_count
MemBlock store_rs_deq_count
StoreQueue vaddr_match_failed
StoreQueue vaddr_match_really_failed
StoreQueue queue_1/4
StoreQueue queue_1/2
StoreQueue queue_3/4
StoreQueue queue_full
LoadQueue rollback
LoadQueue queue_1/4
LoadQueue queue_1/2
LoadQueue queue_3/4
LoadQueue queue_full
LoadQueue refill
LoadQueue utilization_miss
LoadUnit in
LoadUnit tlb_miss
LoadUnit in
LoadUnit dcache_miss
LoadPipe load_req
LoadPipe load_hit_way
LoadPipe load_replay_for_data_nack
LoadPipe load_replay_for_no_mshr
LoadPipe load_hit
LoadPipe load_miss
NewSbuffer do_uarch_drain
NewSbuffer sbuffer_req_valid
NewSbuffer sbuffer_req_fire
NewSbuffer sbuffer_merge
NewSbuffer dcache_req_valid
NewSbuffer dcache_req_fire
NewSbuffer StoreQueueSize
WritebackQueue wb_req
WritebackQueue wb_release
WritebackQueue wb_probe_resp
MainPipe pipe_req
MainPipe pipe_total_penalty
dcache MissQueue
dcache MissQueue
dcache MissQueue
dcache MissQueue
dcache MissQueue
dcache Probe
dcache Probe