HPM

Version: V2R2
Status: OK
Date: 2026/02/02
commit: 0f639b5a

Basic Information

Glossary of Terms

Terminology Explanation
Abbreviation	Full name	Description
HPM	Hardware performance monitor	Hardware Performance Counter Unit

Submodule List

Submodule List
Submodule	Description
HPerfCounter	Single Counter Module
HPerfMonitor	Counter organization module
PFEvent	Copy of Hpmevent register

Design specifications

Implements basic hardware performance monitoring functionality based on the RISC-V Privileged Specification, with additional support for the Sstc and Sscofpmf extensions.
The clock cycles executed by the hart (cycle)
Number of instructions committed by the hart (minstret)
Hardware Timer (time)
Counter overflow flag (scountovf)
29 hardware performance counters (hpmcounter3 - hpmcounter31)
29 hardware performance event selectors (mhpmevent3 - mhpmevent31)
Supports defining up to 2^10 types of performance events

Function

The basic functions of HPM are as follows:

Disable all performance event monitoring via the mcountinhibit register.
Initialize each performance event counter, including: mcycle, minstret, mhpmcounter3 - mhpmcounter31.
Configure the performance event selector of each monitoring unit, including: mhpmevent3 - mhpmevent31. The Xiangshan Kunminghu architecture allows up to four events to be combined per selector. After the event index values, combination method, and sampling privilege level are written into the selector, the configured events are counted at the specified privilege level, and the combined result is accumulated into the event counter.
Configure the xcounteren registers to grant access permissions.
Enable all performance event monitoring via the mcountinhibit register and start counting.
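
The five steps above can be sketched as an ordered list of CSR writes. This is a minimal Python model rather than driver code: the `hpm_setup_sequence` helper and its tuple format are hypothetical, initializing the counters to zero is an assumption, only mcounteren stands in for the xcounteren registers, and the mcountinhibit bit positions (bit n controls counter n, bit 1 read-only zero) are taken from the RISC-V Privileged Specification.

```python
def hpm_setup_sequence(selectors, counteren_mask):
    """Return the ordered (csr_name, value) writes for the five steps.

    selectors: dict mapping a counter index (3..31) to its mhpmevent value.
    counteren_mask: value written to mcounteren in step 4.
    """
    # Bits 0 (CY), 2 (IR) and 3..31 (HPM3..31) set; bit 1 is read-only zero.
    inhibit_all = 0xFFFF_FFFF & ~(1 << 1)
    writes = [("mcountinhibit", inhibit_all)]            # step 1: stop counting
    writes += [("mcycle", 0), ("minstret", 0)]           # step 2: clear counters
    writes += [(f"mhpmcounter{n}", 0) for n in sorted(selectors)]
    writes += [(f"mhpmevent{n}", selectors[n])           # step 3: select events
               for n in sorted(selectors)]
    writes.append(("mcounteren", counteren_mask))        # step 4: permissions
    writes.append(("mcountinhibit", 0))                  # step 5: start counting
    return writes
```

For example, `hpm_setup_sequence({3: 22}, 0b1101)` yields a sequence that programs mhpmevent3 with event index 22 and exposes cycle, instret, and hpmcounter3 to lower privilege levels.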

HPM event overflow interrupt

The overflow interrupt LCOFIP raised by the Kunminghu performance monitoring unit has a unified interrupt vector number of 12. It is enabled and handled in the same way as ordinary private interrupts.

Overall Design

Performance events are defined within each submodule, which assemble them into io_perf by calling generatePerfEvent and output to the four main modules: Frontend, Backend, MemBlock, and CoupledL2.

The above four modules obtain the performance event outputs of submodules by calling the get_perf method. Meanwhile, each main module instantiates the PFEvent module as a replica of mhpmevent in CSR, aggregating the required performance event selector data and the performance event outputs from submodules, which are then fed into the HPerfMonitor module to calculate the incremental results applied to the performance event counters.

Finally, the CSR collects incremental results from performance event counters of four top-level modules and inputs them into CSR registers mhpmcounter3-31 for cumulative counting.

In particular, the performance events of CoupledL2 are directly fed into the CSR module, and according to the event selection information read from the mhpmevent register, after processing by the HPerfMonitor module instantiated in the CSR, they are accumulated into the CSR registers mhpmcounter26-31.

For the detailed HPM overall design block diagram, refer to this figure:

HPerfMonitor Counter Organization Module

Feeds the event selection information (events) into the corresponding HPerfCounter module, and replicates all performance event counting information to each HPerfCounter module.

Collects the outputs of all HPerfCounter modules.

HPerfCounter Single Counter Module

Based on the input event selection information, selects the required performance event counting information, then combines the selected events according to the counting mode in the event selection information and outputs the result.

PFEvent Copy of the mhpmevent Register

Copy of the CSR register mhpmevent: collects CSR write information and synchronizes changes to mhpmevent.

Machine-mode Performance Event Count Inhibit Register (MCOUNTINHIBIT)

The Machine-Mode Performance Event Count Inhibit Register (mcountinhibit) is a 32-bit WARL register primarily used to control whether hardware performance monitoring counters count. In scenarios where performance analysis is not required, counters can be disabled to reduce processor power consumption.

Table: Machine Mode Performance Event Count Inhibit Register Description

+------+------+------+------------------------------------------+-------------+
| Name | Bits | R/W  | Behavior                                 | Reset Value |
+======+======+======+==========================================+=============+
| HPMx | 31:3 | RW   | mhpmcounterx register count inhibit bit: | 0           |
|      |      |      | 0: Normal counting                       |             |
|      |      |      | 1: Counting disabled                     |             |
+------+------+------+------------------------------------------+-------------+
| IR   | 2    | RW   | minstret register count inhibit bit:     | 0           |
|      |      |      | 0: Normal counting                       |             |
|      |      |      | 1: Counting disabled                     |             |
+------+------+------+------------------------------------------+-------------+
| --   | 1    | RO 0 | Reserved                                 | 0           |
+------+------+------+------------------------------------------+-------------+
| CY   | 0    | RW   | mcycle register count inhibit bit:       | 0           |
|      |      |      | 0: Normal counting                       |             |
|      |      |      | 1: Counting disabled                     |             |
+------+------+------+------------------------------------------+-------------+

Machine-mode Performance Event Counter Access Enable Register (MCOUNTEREN)

The Machine-mode Performance Event Counter Access Enable Register (mcounteren) is a 32-bit WARL register primarily used to control access to the user-mode performance monitoring counters from privilege levels below machine mode (HS-mode/VS-mode/HU-mode/VU-mode).

Table: Machine Mode Performance Event Counter Access Authorization Register Description

+------+------+-----+------------------------------------------------------------------+-------+
| Name | Bits | R/W | Behavior                                                         | Reset |
+======+======+=====+==================================================================+=======+
| HPMx | 31:3 | RW  | Access to hpmcounterx from modes below M-mode:                   | 0     |
|      |      |     | 0: Accessing hpmcounterx raises an illegal instruction exception |       |
|      |      |     | 1: Normal access to hpmcounterx is allowed                       |       |
+------+------+-----+------------------------------------------------------------------+-------+
| IR   | 2    | RW  | Access to instret from modes below M-mode:                       | 0     |
|      |      |     | 0: Accessing instret raises an illegal instruction exception     |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+
| TM   | 1    | RW  | Access to time/stimecmp from modes below M-mode:                 | 0     |
|      |      |     | 0: Accessing time raises an illegal instruction exception        |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+
| CY   | 0    | RW  | Access to cycle from modes below M-mode:                         | 0     |
|      |      |     | 0: Accessing cycle raises an illegal instruction exception       |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+

Supervisor-mode Performance Counter Access Enable Register (SCOUNTEREN)

The Supervisor-mode Performance Counter Access Enable Register (scounteren) is a 32-bit WARL register primarily used to control user-mode (HU-mode/VU-mode) access to the performance monitoring counters.

Table: Supervisor-mode Performance Counter Access Enable Register Description

+------+------+-----+------------------------------------------------------------------+-------+
| Name | Bits | R/W | Behavior                                                         | Reset |
+======+======+=====+==================================================================+=======+
| HPMx | 31:3 | RW  | Access to hpmcounterx from user mode:                            | 0     |
|      |      |     | 0: Accessing hpmcounterx raises an illegal instruction exception |       |
|      |      |     | 1: Normal access to hpmcounterx is allowed                       |       |
+------+------+-----+------------------------------------------------------------------+-------+
| IR   | 2    | RW  | Access to instret from user mode:                                | 0     |
|      |      |     | 0: Accessing instret raises an illegal instruction exception     |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+
| TM   | 1    | RW  | Access to time from user mode:                                   | 0     |
|      |      |     | 0: Accessing time raises an illegal instruction exception        |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+
| CY   | 0    | RW  | Access to cycle from user mode:                                  | 0     |
|      |      |     | 0: Accessing cycle raises an illegal instruction exception       |       |
|      |      |     | 1: Normal access is allowed                                      |       |
+------+------+-----+------------------------------------------------------------------+-------+

Virtualization Mode Performance Event Counter Access Authorization Register (HCOUNTEREN)

The Virtualization Mode Performance Event Counter Access Authorization Register (hcounteren) is a 32-bit WARL register primarily used to control user-mode performance monitoring counter access permissions in guest virtual machines (VS-mode/VU-mode).

Table: Virtualization Mode Performance Event Counter Access Authorization Register Description

+------+------+-----+------------------------------------------------------------------+-------+
| Name | Bits | R/W | Behavior                                                         | Reset |
+======+======+=====+==================================================================+=======+
| HPMx | 31:3 | RW  | Access to hpmcounterx from guest VMs:                            | 0     |
|      |      |     | 0: Accessing hpmcounterx raises an illegal instruction exception |       |
|      |      |     | 1: Normal access to hpmcounterx is permitted                     |       |
+------+------+-----+------------------------------------------------------------------+-------+
| IR   | 2    | RW  | Access to instret from guest VMs:                                | 0     |
|      |      |     | 0: Accessing instret raises an illegal instruction exception     |       |
|      |      |     | 1: Normal access is permitted                                    |       |
+------+------+-----+------------------------------------------------------------------+-------+
| TM   | 1    | RW  | Access to time/vstimecmp (via stimecmp) from guest VMs:          | 0     |
|      |      |     | 0: Accessing time raises an illegal instruction exception        |       |
|      |      |     | 1: Normal access is permitted                                    |       |
+------+------+-----+------------------------------------------------------------------+-------+
| CY   | 0    | RW  | Access to cycle from guest VMs:                                  | 0     |
|      |      |     | 0: Accessing cycle raises an illegal instruction exception       |       |
|      |      |     | 1: Normal access is permitted                                    |       |
+------+------+-----+------------------------------------------------------------------+-------+
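
The way the three xcounteren registers stack can be illustrated with a small permission check. This is a sketch under stated assumptions: `counter_access_allowed` is a hypothetical helper, bit n of each register is assumed to control counter n (0 = cycle, 1 = time, 2 = instret, 3..31 = hpmcounterN) as in the RISC-V Privileged Specification, and the function only reports whether the access succeeds, not which exception is raised.

```python
def counter_access_allowed(counter, mode, mcounteren, scounteren=0, hcounteren=0):
    """Return True if `mode` may read the given user-mode counter CSR.

    counter: 0 = cycle, 1 = time, 2 = instret, 3..31 = hpmcounterN.
    mode: one of "M", "HS", "HU", "VS", "VU".
    """
    bit = 1 << counter
    m_ok = bool(mcounteren & bit)   # gate for every mode below M
    s_ok = bool(scounteren & bit)   # extra gate for user modes (HU/VU)
    h_ok = bool(hcounteren & bit)   # extra gate for guest modes (VS/VU)
    if mode == "M":
        return True
    if mode == "HS":
        return m_ok
    if mode == "HU":
        return m_ok and s_ok
    if mode == "VS":
        return m_ok and h_ok
    if mode == "VU":
        return m_ok and h_ok and s_ok
    raise ValueError(f"unknown mode: {mode}")
```

With `mcounteren = scounteren = hcounteren = 1 << 3`, hpmcounter3 is readable from every mode; clearing any one register removes access for the modes it gates.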

Supervisor Mode Time Compare Register (STIMECMP)

The Supervisor Mode Timer Compare Register (stimecmp) is a 64-bit WARL register primarily used to manage timer interrupts (STIP) in supervisor mode.

STIMECMP Register Behavior Description:

The reset value is the 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
When menvcfg.STCE is 0 and the current privilege level is below M-mode (HS-mode/VS-mode/HU-mode/VU-mode), accessing the stimecmp register triggers an illegal instruction exception and does not generate an STIP interrupt.
The stimecmp register is the source of STIP interrupt generation: whenever the unsigned integer comparison time ≥ stimecmp holds, the STIP interrupt pending signal is asserted.
Supervisor mode software can control the generation of timer interrupts by writing to stimecmp.

Guest Virtual Machine Supervisor Mode Time Compare Register (VSTIMECMP)

The Guest Supervisor Time Compare Register (vstimecmp) is a 64-bit WARL register primarily used to manage timer interrupts (VSTIP) in guest supervisor mode.

VSTIMECMP Register Behavior Description:

The reset value is the 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
When henvcfg.STCE is 0 or hcounteren.TM is 0, accessing the vstimecmp register via the stimecmp register triggers a virtual instruction exception and does not generate a VSTIP interrupt.
The vstimecmp register is the source of VSTIP interrupt generation: whenever the unsigned integer comparison time + htimedelta ≥ vstimecmp holds, the VSTIP interrupt pending signal is asserted.
Guest supervisor mode software can control the generation of timer interrupts in VS-mode by writing to vstimecmp.
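
The two comparisons that drive STIP and VSTIP can be written out as follows. This is an illustrative model only: the function names are hypothetical, all values are treated as 64-bit unsigned integers, and the sum time + htimedelta is assumed to wrap at 64 bits, as the RISC-V hypervisor extension specifies.

```python
MASK64 = (1 << 64) - 1


def stip_pending(time, stimecmp):
    """STIP is pending when time >= stimecmp (unsigned 64-bit compare)."""
    return (time & MASK64) >= (stimecmp & MASK64)


def vstip_pending(time, htimedelta, vstimecmp):
    """VSTIP is pending when time + htimedelta >= vstimecmp (unsigned)."""
    guest_time = (time + htimedelta) & MASK64  # overflow of the sum is ignored
    return guest_time >= (vstimecmp & MASK64)
```

With the reset value 64'hffff_ffff_ffff_ffff in the compare register, the pending signal stays low until time itself reaches the maximum value, which is why software must program stimecmp or vstimecmp before relying on the timer interrupt.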

Machine-mode Performance Event Selector (MHPMEVENT3-31)

The Machine-mode Performance Event Selectors (mhpmevent3 - 31) are 64-bit WARL registers used to select the performance events counted by each performance event counter. In the Xiangshan Kunminghu architecture, each counter can be configured to count a combination of up to four performance events. After software writes the event index values, event combination method, and sampling privilege level into a given event selector, the event counter matched by that selector begins normal counting.

Table: Machine Mode Performance Event Selector Description

+----------+-------+-----+---------------------------------------------------------------+-------+
| Name     | Bits  | R/W | Behavior                                                      | Reset |
+==========+=======+=====+===============================================================+=======+
| OF       | 63    | RW  | Performance counter overflow flag:                            | 0     |
|          |       |     | 0: Set to 1 when the counter overflows; triggers an interrupt |       |
|          |       |     | 1: Remains unchanged when the counter overflows; no interrupt |       |
+----------+-------+-----+---------------------------------------------------------------+-------+
| MINH     | 62    | RW  | When set to 1, disables M-mode sampling                       | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| SINH     | 61    | RW  | When set to 1, disables S-mode sampling                       | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| UINH     | 60    | RW  | When set to 1, disables U-mode sampling                       | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| VSINH    | 59    | RW  | When set to 1, disables VS-mode sampling                      | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| VUINH    | 58    | RW  | When set to 1, disables VU-mode sampling                      | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| --       | 57:55 | RW  | --                                                            | 0     |
+----------+-------+-----+---------------------------------------------------------------+-------+
| OP_TYPE2 | 54:50 | RW  | Counter event combination method control bits:                | 0     |
| OP_TYPE1 | 49:45 |     | 5'b00000: OR operation combination                            |       |
| OP_TYPE0 | 44:40 |     | 5'b00001: AND operation combination                           |       |
|          |       |     | 5'b00010: XOR operation combination                           |       |
|          |       |     | 5'b00100: ADD operation combination                           |       |
+----------+-------+-----+---------------------------------------------------------------+-------+
| EVENT3   | 39:30 | RW  | Counter performance event index values:                       | --    |
| EVENT2   | 29:20 |     | 0: The corresponding event counter does not count             |       |
| EVENT1   | 19:10 |     | Non-zero: The counter counts the event with that index        |       |
| EVENT0   | 9:0   |     |                                                               |       |
+----------+-------+-----+---------------------------------------------------------------+-------+
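
The field layout in the table can be exercised with a small packing helper. This is a sketch, not part of the design: `encode_mhpmevent` and `decode_event_indices` are hypothetical names, and the helper simply places each field at the bit positions listed in the table.

```python
def encode_mhpmevent(events=(0, 0, 0, 0), op_types=(0, 0, 0),
                     of=0, minh=0, sinh=0, uinh=0, vsinh=0, vuinh=0):
    """Pack the selector fields into a 64-bit mhpmevent value."""
    if len(events) != 4 or len(op_types) != 3:
        raise ValueError("expected 4 event indices and 3 op types")
    value = 0
    for i, event in enumerate(events):      # EVENT0..EVENT3 at bits 9:0 .. 39:30
        if not 0 <= event < (1 << 10):
            raise ValueError("event index must fit in 10 bits")
        value |= event << (10 * i)
    for i, op in enumerate(op_types):       # OP_TYPE0..OP_TYPE2 at bits 44:40 .. 54:50
        if not 0 <= op < (1 << 5):
            raise ValueError("op type must fit in 5 bits")
        value |= op << (40 + 5 * i)
    flags = ((vuinh, 58), (vsinh, 59), (uinh, 60), (sinh, 61), (minh, 62), (of, 63))
    for flag, bit in flags:                 # single-bit inhibit and overflow fields
        value |= (flag & 1) << bit
    return value


def decode_event_indices(value):
    """Return (EVENT0, EVENT1, EVENT2, EVENT3) from a selector value."""
    return tuple((value >> (10 * i)) & 0x3FF for i in range(4))
```

For instance, `encode_mhpmevent(events=(22, 0, 0, 0), uinh=1)` selects event index 22 in EVENT0 and excludes U-mode from sampling.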

The combination method for counter events is:

EVENT0 and EVENT1 event counts use OP_TYPE0 operation combination to produce RESULT0.
EVENT2 and EVENT3 event counts are combined using OP_TYPE1 operation to produce RESULT1.
The combined results of RESULT0 and RESULT1 are processed using OP_TYPE2 operation to form RESULT2.
RESULT2 is accumulated into the corresponding event counter.
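
The two-level combination tree can be modeled in a few lines. This is a behavioral sketch, not the HPerfCounter implementation: the function names are hypothetical, the operands are assumed to be the per-cycle increment values of the four selected events, OR/AND/XOR are assumed to act bitwise on those values, and rejecting unlisted encodings with an error is a choice made here for illustration.

```python
OP_OR, OP_AND, OP_XOR, OP_ADD = 0b00000, 0b00001, 0b00010, 0b00100


def apply_op(op_type, a, b):
    """Combine two event increments according to an OP_TYPE encoding."""
    if op_type == OP_OR:
        return a | b
    if op_type == OP_AND:
        return a & b
    if op_type == OP_XOR:
        return a ^ b
    if op_type == OP_ADD:
        return a + b
    raise ValueError(f"unsupported OP_TYPE encoding: {op_type:#07b}")


def combine_events(increments, op_types):
    """Return RESULT2 for one cycle.

    increments: (EVENT0, EVENT1, EVENT2, EVENT3) increment values.
    op_types: (OP_TYPE0, OP_TYPE1, OP_TYPE2).
    """
    e0, e1, e2, e3 = increments
    op0, op1, op2 = op_types
    result0 = apply_op(op0, e0, e1)         # EVENT0 and EVENT1 via OP_TYPE0
    result1 = apply_op(op1, e2, e3)         # EVENT2 and EVENT3 via OP_TYPE1
    return apply_op(op2, result0, result1)  # RESULT0 and RESULT1 via OP_TYPE2
```

With all three operations set to ADD, the counter accumulates the sum of the four events; with the reset value (all OR) and only EVENT0 configured, RESULT2 is simply the EVENT0 increment.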

The reset value for the event index portion of the performance event selector is specified as 0.

The Kunminghu architecture categorizes the provided performance events into four types based on their sources: frontend, backend, memory access, and cache. The counters are divided into four sections, each recording performance events from one of these sources:

Frontend: mhpmevent3-10
Backend: mhpmevent11-18
Memory Access: mhpmevent19-26
Cache: mhpmevent27-31
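
Because the same index value means a different event in each group, tooling usually needs to know which index table applies to a given selector. A minimal lookup, assuming exactly the ranges listed above (the `event_source` helper is hypothetical), might look like this:

```python
EVENT_SOURCES = {
    "frontend": range(3, 11),        # mhpmevent3-10
    "backend": range(11, 19),        # mhpmevent11-18
    "memory access": range(19, 27),  # mhpmevent19-26
    "cache": range(27, 32),          # mhpmevent27-31
}


def event_source(selector_index):
    """Return the event source group served by mhpmevent<selector_index>."""
    for source, indices in EVENT_SOURCES.items():
        if selector_index in indices:
            return source
    raise ValueError(f"mhpmevent{selector_index} is not a valid selector")
```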

Kunminghu Frontend Performance Event Index Table
Index	Event
0	noEvent
1	frontendFlush
2	ifu_req
3	ifu_miss
4	ifu_req_cacheline_0
5	ifu_req_cacheline_1
6	ifu_req_cacheline_0_hit
7	ifu_req_cacheline_1_hit
8	only_0_hit
9	only_0_miss
10	hit_0_hit_1
11	hit_0_miss_1
12	miss_0_hit_1
13	miss_0_miss_1
14	IBuffer_Flushed
15	IBuffer_hungry
16	IBuffer_1_4_valid
17	IBuffer_2_4_valid
18	IBuffer_3_4_valid
19	IBuffer_4_4_valid
20	IBuffer_full
21	Front_Bubble
22	Fetch_Latency_Bound
23	icache_miss_cnt
24	icache_miss_penalty
25	bpu_s2_redirect
26	bpu_s3_redirect
27	bpu_to_ftq_stall
28	mispredictRedirect
29	replayRedirect
30	predecodeRedirect
31	to_ifu_bubble
32	from_bpu_real_bubble
33	BpInstr
34	BpBInstr
35	BpRight
36	BpWrong
37	BpBRight
38	BpBWrong
39	BpJRight
40	BpJWrong
41	BpIRight
42	BpIWrong
43	BpCRight
44	BpCWrong
45	BpRRight
46	BpRWrong
47	ftb_false_hit
48	ftb_hit
49	fauftb_commit_hit
50	fauftb_commit_miss
51	tage_tht_hit
52	sc_update_on_mispred
53	sc_update_on_unconf
54	ftb_commit_hits
55	ftb_commit_misses
56	itlb_access
57	itlb_miss

Kunminghu Backend Performance Event Index Table
Index	Event
0	noEvent
1	decoder_fused_instr
2	decoder_waitInstr
3	decoder_stall_cycle
4	decoder_utilization
5	frontend_stall_cycle
6	backend_stall_cycle
7	INST_SPEC
8	RECOVERY_BUBBLE
9	rename_in
10	rename_waitinstr
11	rename_stall
12	rename_stall_cycle_walk
13	rename_stall_cycle_dispatch
14	rename_stall_cycle_int
15	rename_stall_cycle_fp
16	rename_stall_cycle_vec
17	rename_stall_cycle_v0
18	rename_stall_cycle_vl
19	me_freelist_1_4_valid
20	me_freelist_2_4_valid
21	me_freelist_3_4_valid
22	me_freelist_4_4_valid
23	std_freelist_1_4_valid
24	std_freelist_2_4_valid
25	std_freelist_3_4_valid
26	std_freelist_4_4_valid
27	std_freelist_1_4_valid
28	std_freelist_2_4_valid
29	std_freelist_3_4_valid
30	std_freelist_4_4_valid
31	std_freelist_1_4_valid
32	std_freelist_2_4_valid
33	std_freelist_3_4_valid
34	std_freelist_4_4_valid
35	std_freelist_1_4_valid
36	std_freelist_2_4_valid
37	std_freelist_3_4_valid
38	std_freelist_4_4_valid
39	dispatch_in
40	dispatch_empty
41	dispatch_utili
42	dispatch_waitinstr
43	dispatch_stall_cycle_lsq
44	dispatch_stall_cycle_rob
45	dispatch_stall_cycle_int_dq
46	dispatch_stall_cycle_fp_dq
47	dispatch_stall_cycle_ls_dq
48	rob_interrupt_num
49	rob_exception_num
50	rob_flush_pipe_num
51	rob_replay_inst_num
52	rob_commitUop
53	rob_commitInstr
54	rob_commitInstrFused
55	rob_commitInstrLoad
56	rob_commitInstrBranch
57	rob_commitInstrStore
58	rob_walkInstr
59	rob_walkCycle
60	rob_1_4_valid
61	rob_2_4_valid
62	rob_3_4_valid
63	rob_4_4_valid
64	BRANCH_JUMP
65	BR_MIS_PRED
66	TOTAL_FLUSH
67	EXEC_STALL_CYCLE
68	MEMSTALL_STORE
69	MEMSTALL_L1MISS
70	MEMSTALL_L2MISS
71	MEMSTALL_L3MISS
72	issueQueue_enq_fire_cnt
73	IssueQueueAluMulBkuBrhJmp_full
74	IssueQueueAluMulBkuBrhJmp_full
75	IssueQueueAluBrhJmpI2fVsetriwiVsetriwvfI2v_full
76	IssueQueueAluCsrFenceDiv_full
77	issueQueue_enq_fire_cnt
78	IssueQueueFaluFcvtF2vFmacFdiv_full
79	IssueQueueFaluFmacFdiv_full
80	IssueQueueFaluFmac_full
81	issueQueue_enq_fire_cnt
82	IssueQueueVfmaVialuFixVimacVppuVfaluVfcvtVipuVsetrvfwvf_full
83	IssueQueueVfmaVialuFixVfalu_full
84	IssueQueueVfdivVidiv_full
85	issueQueue_enq_fire_cnt
86	IssueQueueStaMou_full
87	IssueQueueStaMou_full
88	IssueQueueLdu_full
89	IssueQueueLdu_full
90	IssueQueueLdu_full
91	IssueQueueVlduVstuVseglduVsegstu_full
92	IssueQueueVlduVstu_full
93	IssueQueueStdMoud_full
94	IssueQueueStdMoud_full

Kunminghu Memory Access Performance Event Index Table
Index	Event
0	noEvent
1	load_s0_in_fire
2	load_to_load_forward
3	stall_dcache
4	load_s1_in_fire
5	load_s1_tlb_miss
6	load_s2_in_fire
7	load_s2_dcache_miss
8	l1D_load_hw_prf_access
9	l1D_load_hw_prf_miss
10	load_s0_in_fire
11	load_to_load_forward
12	stall_dcache
13	load_s1_in_fire
14	load_s1_tlb_miss
15	load_s2_in_fire
16	load_s2_dcache_miss
17	l1D_load_hw_prf_access
18	l1D_load_hw_prf_miss
19	load_s0_in_fire
20	load_to_load_forward
21	stall_dcache
22	load_s1_in_fire
23	load_s1_tlb_miss
24	load_s2_in_fire
25	load_s2_dcache_miss
26	l1D_load_hw_prf_access
27	l1D_load_hw_prf_miss
28	sbuffer_req_valid
29	sbuffer_req_fire
30	sbuffer_merge
31	sbuffer_newline
32	dcache_req_valid
33	dcache_req_fire
34	sbuffer_idle
35	sbuffer_flush
36	sbuffer_replace
37	mpipe_resp_valid
38	replay_resp_valid
39	coh_timeout
40	sbuffer_1_4_valid
41	sbuffer_2_4_valid
42	sbuffer_3_4_valid
43	sbuffer_full_valid
44	MEMSTALL_ANY_LOAD
45	enq
46	ld_ld_violation
47	enq
48	stld_rollback
49	enq
50	deq
51	deq_block
52	replay_full
53	replay_rar_nack
54	replay_raw_nack
55	replay_nuke
56	replay_mem_amb
57	replay_tlb_miss
58	replay_bank_conflict
59	replay_dcache_replay
60	replay_forward_fail
61	replay_dcache_miss
62	full_mask_000
63	full_mask_001
64	full_mask_010
65	full_mask_011
66	full_mask_100
67	full_mask_101
68	full_mask_110
69	full_mask_111
70	nuke_rollback
71	nack_rollback
72	mmioCycle
73	mmioCnt
74	mmio_wb_success
75	mmio_wb_blocked
76	stq_1_4_valid
77	stq_2_4_valid
78	stq_3_4_valid
79	stq_4_4_valid
80	dcache_wbq_req
81	dcache_wbq_1_4_valid
82	dcache_wbq_2_4_valid
83	dcache_wbq_3_4_valid
84	dcache_wbq_4_4_valid
85	l1D_write_dcache_access
86	l1D_write_dcache_miss
87	dcache_mp_req
88	dcache_mp_total_penalty
89	dcache_missq_req
90	dcache_missq_1_4_valid
91	dcache_missq_2_4_valid
92	dcache_missq_3_4_valid
93	dcache_missq_4_4_valid
94	dcache_probq_req
95	dcache_probq_1_4_valid
96	dcache_probq_2_4_valid
97	dcache_probq_3_4_valid
98	dcache_probq_4_4_valid
99	load_req
100	load_replay
101	load_replay_for_data_nack
102	load_replay_for_no_mshr
103	load_replay_for_conflict
104	l1D_read_dcache_access
105	l1D_read_dcache_miss
106	load_req
107	load_replay
108	load_replay_for_data_nack
109	load_replay_for_no_mshr
110	load_replay_for_conflict
111	l1D_read_dcache_access
112	l1D_read_dcache_miss
113	load_req
114	load_replay
115	load_replay_for_data_nack
116	load_replay_for_no_mshr
117	load_replay_for_conflict
118	l1D_read_dcache_access
119	l1D_read_dcache_miss
120	dtlb_ld_access
121	dtlb_ld_miss
122	dtlb_st_access
123	dtlb_st_miss
124	PTW_tlbllptw_incount
125	PTW_tlbllptw_inblock
126	PTW_tlbllptw_memcount
127	PTW_tlbllptw_memcycle
128	PTW_access
129	PTW_l2_hit
130	PTW_l1_hit
131	PTW_l0_hit
132	PTW_sp_hit
133	PTW_pte_hit
134	PTW_rwHazard
135	PTW_out_blocked
136	PTW_fsm_count
137	PTW_fsm_busy
138	PTW_fsm_idle
139	PTW_resp_blocked
140	PTW_mem_count
141	PTW_mem_cycle
142	PTW_mem_blocked
143	ldDeqCount
144	stDeqCount

Kunminghu Cache Performance Event Index Table
Index	Event
0	noEvent
1	Slice0_l2_cache_refill
2	Slice0_l2_cache_rd_refill
3	Slice0_l2_cache_wr_refill
4	Slice0_l2_cache_long_miss
5	Slice0_l2_cache_hit
6	Slice0_l2_cache_miss
7	Slice0_l2_cache_access
8	Slice0_l2_cache_l2wb
9	Slice0_l2_cache_l1wb
10	Slice0_l2_cache_wb_victim
11	Slice0_l2_cache_wb_cleaning_coh
12	Slice0_l2_cache_prefetch_access
13	Slice0_l2_cache_prefetch_miss
14	Slice0_l2_cache_access_rd
15	Slice0_l2_cache_access_wr
16	Slice0_l2_cache_miss_rd
17	Slice0_l2_cache_inv
18	Slice1_l2_cache_refill
19	Slice1_l2_cache_rd_refill
20	Slice1_l2_cache_wr_refill
21	Slice1_l2_cache_long_miss
22	Slice1_l2_cache_hit
23	Slice1_l2_cache_miss
24	Slice1_l2_cache_access
25	Slice1_l2_cache_l2wb
26	Slice1_l2_cache_l1wb
27	Slice1_l2_cache_wb_victim
28	Slice1_l2_cache_wb_cleaning_coh
29	Slice1_l2_cache_prefetch_access
30	Slice1_l2_cache_prefetch_miss
31	Slice1_l2_cache_access_rd
32	Slice1_l2_cache_access_wr
33	Slice1_l2_cache_miss_rd
34	Slice1_l2_cache_inv
35	Slice2_l2_cache_refill
36	Slice2_l2_cache_rd_refill
37	Slice2_l2_cache_wr_refill
38	Slice2_l2_cache_long_miss
39	Slice2_l2_cache_hit
40	Slice2_l2_cache_miss
41	Slice2_l2_cache_access
42	Slice2_l2_cache_l2wb
43	Slice2_l2_cache_l1wb
44	Slice2_l2_cache_wb_victim
45	Slice2_l2_cache_wb_cleaning_coh
46	Slice2_l2_cache_prefetch_access
47	Slice2_l2_cache_prefetch_miss
48	Slice2_l2_cache_access_rd
49	Slice2_l2_cache_access_wr
50	Slice2_l2_cache_miss_rd
51	Slice2_l2_cache_inv
52	Slice3_l2_cache_refill
53	Slice3_l2_cache_rd_refill
54	Slice3_l2_cache_wr_refill
55	Slice3_l2_cache_long_miss
56	Slice3_l2_cache_hit
57	Slice3_l2_cache_miss
58	Slice3_l2_cache_access
59	Slice3_l2_cache_l2wb
60	Slice3_l2_cache_l1wb
61	Slice3_l2_cache_wb_victim
62	Slice3_l2_cache_wb_cleaning_coh
63	Slice3_l2_cache_prefetch_access
64	Slice3_l2_cache_prefetch_miss
65	Slice3_l2_cache_access_rd
66	Slice3_l2_cache_access_wr
67	Slice3_l2_cache_miss_rd
68	Slice3_l2_cache_inv

Topdown PMU

Topdown performance analysis is a top-down approach designed to quickly identify CPU performance bottlenecks. Its core concept involves decomposing high-level performance categories step by step, gradually refining the issues to accurately pinpoint the root cause. We have implemented three levels of Topdown performance events, as detailed below:

Table: Three-Level Top-Down Performance Events

+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| Level 1         | Level 2               | Level 3     | Description                 | Formula                                      |
+=================+=======================+=============+=============================+==============================================+
| Retiring        | -                     | -           | Instruction commit impact   | INST_RETIRED / (IssueBW * CPU_CYCLES)        |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| FrontEnd Bound  | -                     | -           | Front-end impact            | IF_FETCH_BUBBLE / (IssueBW * CPU_CYCLES)     |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Fetch Latency Bound   | -           | Fetch latency impact        | IF_FETCH_BUBBLE_EQ_MAX / CPU_CYCLES          |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Fetch Bandwidth Bound | -           | Fetch bandwidth impact      | FrontEnd Bound - Fetch Latency Bound         |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| Bad Speculation | -                     | -           | Misprediction impact        | (INST_SPEC - INST_RETIRED + RECOVERY_BUBBLE) |
|                 |                       |             |                             | / (IssueBW * CPU_CYCLES)                     |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Branch Mispredict     | -           | Branch misprediction impact | Bad Speculation * BR_MIS_PRED / TOTAL_FLUSH  |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Machine Clears        | -           | Machine clear event impact  | Bad Speculation - Branch Mispredict          |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| BackEnd Bound   | -                     | -           | Back-end impact             | 1 - (FrontEnd Bound + Bad Speculation        |
|                 |                       |             |                             | + Retiring)                                  |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Core Bound            | -           | Core impact                 | (EXEC_STALL_CYCLE - MEMSTALL_ANY_LOAD        |
|                 |                       |             |                             | - MEMSTALL_STORE) / CPU_CYCLES               |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | Memory Bound          | -           | Memory access impact        | (MEMSTALL_ANY_LOAD + MEMSTALL_STORE)         |
|                 |                       |             |                             | / CPU_CYCLES                                 |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | -                     | L1 Bound    | L1 impact                   | (MEMSTALL_ANY_LOAD - MEMSTALL_L1MISS)        |
|                 |                       |             |                             | / CPU_CYCLES                                 |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | -                     | L2 Bound    | L2 impact                   | (MEMSTALL_L1MISS - MEMSTALL_L2MISS)          |
|                 |                       |             |                             | / CPU_CYCLES                                 |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | -                     | L3 Bound    | L3 impact                   | (MEMSTALL_L2MISS - MEMSTALL_L3MISS)          |
|                 |                       |             |                             | / CPU_CYCLES                                 |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | -                     | Mem Bound   | External memory impact      | MEMSTALL_L3MISS / CPU_CYCLES                 |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+
| -               | -                     | Store Bound | Store instruction impact    | MEMSTALL_STORE / CPU_CYCLES                  |
+-----------------+-----------------------+-------------+-----------------------------+----------------------------------------------+

Here, IssueBW represents the issue width, which is currently 6-issue in the Xiangshan Kunminghu architecture.
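
The formulas in the three-level table translate directly into code. The sketch below assumes the raw counts have already been collected into a dictionary keyed by the event names used in the formulas (with the any-load stall event spelled `MEMSTALL_ANY_LOAD`); the `topdown_metrics` function is hypothetical, and returning 0 for Branch Mispredict when no flush was recorded is a choice made here to avoid dividing by zero.

```python
def topdown_metrics(c, issue_bw=6):
    """Compute the Topdown ratios from a dict of raw event counts."""
    cycles = c["CPU_CYCLES"]
    slots = issue_bw * cycles  # issue slots available in the interval
    m = {}
    m["Retiring"] = c["INST_RETIRED"] / slots
    m["FrontEnd Bound"] = c["IF_FETCH_BUBBLE"] / slots
    m["Fetch Latency Bound"] = c["IF_FETCH_BUBBLE_EQ_MAX"] / cycles
    m["Fetch Bandwidth Bound"] = m["FrontEnd Bound"] - m["Fetch Latency Bound"]
    m["Bad Speculation"] = (
        c["INST_SPEC"] - c["INST_RETIRED"] + c["RECOVERY_BUBBLE"]
    ) / slots
    flushes = c["TOTAL_FLUSH"]
    m["Branch Mispredict"] = (
        m["Bad Speculation"] * c["BR_MIS_PRED"] / flushes if flushes else 0.0
    )
    m["Machine Clears"] = m["Bad Speculation"] - m["Branch Mispredict"]
    m["BackEnd Bound"] = 1 - (
        m["FrontEnd Bound"] + m["Bad Speculation"] + m["Retiring"]
    )
    any_load, store = c["MEMSTALL_ANY_LOAD"], c["MEMSTALL_STORE"]
    m["Core Bound"] = (c["EXEC_STALL_CYCLE"] - any_load - store) / cycles
    m["Memory Bound"] = (any_load + store) / cycles
    m["L1 Bound"] = (any_load - c["MEMSTALL_L1MISS"]) / cycles
    m["L2 Bound"] = (c["MEMSTALL_L1MISS"] - c["MEMSTALL_L2MISS"]) / cycles
    m["L3 Bound"] = (c["MEMSTALL_L2MISS"] - c["MEMSTALL_L3MISS"]) / cycles
    m["Mem Bound"] = c["MEMSTALL_L3MISS"] / cycles
    m["Store Bound"] = store / cycles
    return m
```

By construction, the four Level 1 ratios (Retiring, FrontEnd Bound, Bad Speculation, BackEnd Bound) sum to 1, which is a quick sanity check on collected data.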

Table: Topdown Performance Events

+------------------------+---------------------+-----------------------------------------------------------------+
| Name                   | Corresponding Event | Description                                                     |
+========================+=====================+=================================================================+
| CPU_CYCLES             | -                   | Total clock cycles after all instructions commit                |
+------------------------+---------------------+-----------------------------------------------------------------+
| INST_RETIRED           | rob_commitInstr     | Number of successfully committed instructions                   |
+------------------------+---------------------+-----------------------------------------------------------------+
| INST_SPEC              | -                   | Number of speculatively executed instructions                   |
+------------------------+---------------------+-----------------------------------------------------------------+
| IF_FETCH_BUBBLE        | Front_Bubble        | Number of bubbles fetched from the instruction buffer,          |
|                        |                     | with no backend stall                                           |
+------------------------+---------------------+-----------------------------------------------------------------+
| IF_FETCH_BUBBLE_EQ_MAX | Fetch_Latency_Bound | Cycles fetching zero instructions from the instruction buffer,  |
|                        |                     | with no backend stall                                           |
+------------------------+---------------------+-----------------------------------------------------------------+
| BR_MIS_PRED            | -                   | Number of mispredicted branch instructions                      |
+------------------------+---------------------+-----------------------------------------------------------------+
| TOTAL_FLUSH            | -                   | Number of pipeline flush events                                 |
+------------------------+---------------------+-----------------------------------------------------------------+
| RECOVERY_BUBBLE        | -                   | Number of cycles recovering from early mispredictions           |
+------------------------+---------------------+-----------------------------------------------------------------+
| EXEC_STALL_CYCLE       | -                   | Number of cycles issuing few uops                               |
+------------------------+---------------------+-----------------------------------------------------------------+
| MEMSTALL_ANY_LOAD      | -                   | No uops issued, and at least one Load instruction not completed |
+------------------------+---------------------+-----------------------------------------------------------------+
| MEMSTALL_STORE         | -                   | Non-Store uops issued,                                          |
|                        |                     | and Store instructions not completed                            |
+------------------------+---------------------+-----------------------------------------------------------------+
| MEMSTALL_L1MISS        | -                   | No uops issued, at least one Load instruction not completed,    |
|                        |                     | and an L1-cache Miss occurred                                   |
+------------------------+---------------------+-----------------------------------------------------------------+
| MEMSTALL_L2MISS        | -                   | No uops issued, at least one Load instruction not completed,    |
|                        |                     | and an L2-cache Miss occurred                                   |
+------------------------+---------------------+-----------------------------------------------------------------+
| MEMSTALL_L3MISS        | -                   | No uops issued, at least one Load instruction not completed,    |
|                        |                     | and an L3-cache Miss occurred                                   |
+------------------------+---------------------+-----------------------------------------------------------------+

To measure the impact of front-end fetch latency over a period, we can set the EVENT0 field of mhpmevent3 to 22, leaving the remaining bits at their default values, then proceed with testing. Upon completion, the CSR read instruction can be used to access the mhpmcounter3 register, obtaining the cycle count of front-end fetch latency during that period. Through calculation, the impact caused by front-end fetch latency can then be determined.
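
As a worked illustration of this measurement, the sketch below computes the selector value and the resulting ratio from counter snapshots taken before and after the test. The helper names are hypothetical, the counters are assumed to be 64 bits wide and to wrap, and the cycle count is assumed to come from mcycle sampled at the same two points.

```python
FETCH_LATENCY_BOUND = 22      # frontend event index of Fetch_Latency_Bound
MASK64 = (1 << 64) - 1


def fetch_latency_selector():
    """mhpmevent3 value: EVENT0 (bits 9:0) = 22, every other field left at 0."""
    return FETCH_LATENCY_BOUND & 0x3FF


def fetch_latency_fraction(hpm3_start, hpm3_end, cycle_start, cycle_end):
    """Fraction of cycles lost to fetch latency between two snapshots."""
    stalled = (hpm3_end - hpm3_start) & MASK64   # tolerate counter wrap-around
    cycles = (cycle_end - cycle_start) & MASK64
    if cycles == 0:
        raise ValueError("no cycles elapsed between the two snapshots")
    return stalled / cycles
```

If mhpmcounter3 advanced by 250 while mcycle advanced by 1000, the Fetch Latency Bound ratio for the interval is 0.25.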

The performance event counters in the Xiangshan Kunminghu architecture are divided into three groups: machine-mode event counters, supervisor-mode event counters, and user-mode event counters.

Machine Mode Event Counter List
Name	Index	Read/Write	Description	Reset Value
MCYCLE	0xB00	RW	Machine Mode Clock Cycle Counter	-
MINSTRET	0xB02	RW	Machine-mode retired instruction counter	-
MHPMCOUNTER3-31	0xB03-0xB1F	RW	Machine-mode Performance Event Counter	0

Each MHPMCOUNTERx counter is controlled by the corresponding MHPMEVENTx, which specifies the performance events to count.

Supervisor mode event counters include the supervisor mode counter overflow interrupt flag register (SCOUNTOVF).

Table: Supervisor Mode Counter Overflow Interrupt Flag Register (SCOUNTOVF) Description

+-------+------+------+--------------------------------------+-------+
| Name  | Bits | R/W  | Behavior                             | Reset |
+=======+======+======+======================================+=======+
| OFVEC | 31:3 | RO   | mhpmcounterx register overflow flag: | 0     |
|       |      |      | 1: Overflow occurred                 |       |
|       |      |      | 0: No overflow occurred              |       |
+-------+------+------+--------------------------------------+-------+
| --    | 2:0  | RO 0 | --                                   | 0     |
+-------+------+------+--------------------------------------+-------+

scountovf serves as a read-only mapping of the OF bits in the mhpmevent registers, with visibility controlled by xcounteren:

M-mode access to scountovf: all bits read their correct values.
HS-mode access to scountovf: when mcounteren.HPMx is 1, the corresponding OFVECx reads its correct value; otherwise, it reads 0.
VS-mode access to scountovf: when both mcounteren.HPMx and hcounteren.HPMx are 1, the corresponding OFVECx reads its correct value; otherwise, it reads 0.
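
These visibility rules amount to masking the overflow vector with the relevant enable registers. A minimal sketch, assuming a hypothetical `of_bits` input in which bit x holds the current overflow flag of counter x, and assuming HPMx sits at bit x of mcounteren and hcounteren as in the RISC-V Privileged Specification:

```python
OFVEC_MASK = 0xFFFF_FFF8  # OFVEC occupies bits 31:3; bits 2:0 read as zero


def read_scountovf(of_bits, mode, mcounteren, hcounteren=0):
    """Return the scountovf value observed from the given privilege mode."""
    visible = of_bits & OFVEC_MASK
    if mode == "M":
        return visible
    if mode == "HS":
        return visible & mcounteren
    if mode == "VS":
        return visible & mcounteren & hcounteren
    raise ValueError(f"scountovf read not modeled for mode: {mode}")
```

For example, if counters 3 and 4 have overflowed but only mcounteren.HPM3 is set, an HS-mode read returns a value with only bit 3 set.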

User Mode Event Counter List
Name	Index	Read/Write	Description	Reset Value
CYCLE	0xC00	RO	User-mode read-only copy of mcycle register	-
TIME	0xC01	RO	User-mode read-only copy of the memory-mapped mtime register	-
INSTRET	0xC02	RO	User-mode read-only copy of minstret register	-
HPMCOUNTER3-31	0xC03-0xC1F	RO	Read-only user-mode copy of the mhpmcounter3-31 registers	0
