Skip to content

HPM

  • Version: V2R2
  • Status: OK
  • Date: 2026/02/02
  • commit: 0f639b5a

Basic Information

Glossary of Terms

Terminology Explanation
Abbreviation Full name Description
HPM Hardware performance monitor Hardware Performance Counter Unit

Submodule List

Submodule List
Submodule Description
HPerfCounter Single Counter Module
HPerfMonitor Counter organization module
PFEvent Copy of Hpmevent register

Design specifications

  • Implemented basic hardware performance monitoring functionality based on the RISC-V Privileged Specification, with additional support for sstc and sscofpmf extensions.
  • The clock cycles executed by the hart (cycle)
  • Number of instructions committed by the hart (minstret)
  • Hardware Timer (time)
  • Counter overflow flag (time)
  • 29 hardware performance counters (hpmcounter3 - hpmcounter3)
  • 29 hardware performance event selectors (mhpmcounter3 - mhpmcounter31)
  • Supports defining up to 2^10 types of performance events

Function

The basic functions of HPM are as follows:

  • Disable all performance event monitoring via the mcountinhibit register.
  • Initialize echo performance event counters, including: mcycle, minstret, mhpmcounter3 - mhpmcounter31.
  • Configure performance event selectors for each monitoring unit, including: mhpmcounter3 - mhpmcounter31. The Xiangshan Kunminghu architecture allows up to four event combinations per selector. After writing the event index value, combination method, and sampling privilege level into the selector, normal counting of configured events can proceed at the specified privilege level, with results accumulated into the event counter based on the combined outcome.
  • Configure xcounteren for access permission authorization
  • Enable all performance event monitoring via mcountinhibit register and start counting.

HPM event overflow interrupt

The overflow interrupt LCOFIP initiated by the Kunming Lake Performance Monitoring Unit has a unified interrupt vector number of 12. The enabling and handling process of the interrupt is consistent with that of ordinary private interrupts.

Overall Design

Performance events are defined within each submodule, which assemble them into io_perf by calling generatePerfEvent and output to the four main modules: Frontend, Backend, MemBlock, and CoupledL2.

The above four modules obtain the performance event outputs of submodules by calling the get_perf method. Meanwhile, each main module instantiates the PFEvent module as a replica of mhpmevent in CSR, aggregating the required performance event selector data and the performance event outputs from submodules, which are then fed into the HPerfMonitor module to calculate the incremental results applied to the performance event counters.

Finally, the CSR collects incremental results from performance event counters of four top-level modules and inputs them into CSR registers mhpmcounter3-31 for cumulative counting.

In particular, the performance events of CoupledL2 are directly fed into the CSR module, and according to the event selection information read from the mhpmevent register, after processing by the HPerfMonitor module instantiated in the CSR, they are accumulated into the CSR registers mhpmcounter26-31.

For the detailed HPM overall design block diagram, refer to 此图:

HPM Overall Design

HPerfMonitor Counter Organization Module

Input the event selection information (events) into the corresponding HPerfCounter module, and replicate all performance event counting information to each HperfCounter module.

Collect all HperfCounter outputs.

HperfCounter single counter module

Based on the input event selection information, select the required performance event counting information, and according to the counting mode in the event selection information, combine and output the input performance events.

Copy of PFEvent Hpmevent register

Copy of CSR register mhpmevent: Collects CSR write information and synchronizes changes to mhpmevent

Machine-mode Performance Event Count Inhibit Register (MCOUNTINHIBIT)

The Machine-Mode Performance Event Count Inhibit Register (mcountinhibit) is a 32-bit WARL register primarily used to control whether hardware performance monitoring counters count. In scenarios where performance analysis is not required, counters can be disabled to reduce processor power consumption.

Table: Machine Mode Performance Event Count Prohibit Register Description

+--------+--------+-------+--------------------------------------------+----------+ | Name | Bitfield | R/W | Behavior | Reset Value | +========+========+=======+============================================+==========+ | HPMx | 31:4 | RW | mhpmcounterx register count disable bit: | 0 | | | | | | | | | | | 0: Normal counting | | | | | | | | | | | | 1: Counting disabled | | +--------+--------+-------+--------------------------------------------+----------+ | IR | 3 | RW | minstret register count disable bit: | 0 | | | | | | | | | | | 0: Normal counting | | | | | | | | | | | | 1: Counting disabled | | +--------+--------+-------+--------------------------------------------+----------+ | -- | 2 | RO 0 | Reserved | 0 | +--------+--------+-------+--------------------------------------------+----------+ | CY | 1 | RW | mcycle register count disable bit: | 0 | | | | | | | | | | | 0: Normal counting | | | | | | | | | | | | 1: Counting disabled | | +--------+--------+-------+--------------------------------------------+----------+

Machine-mode Performance Counter Event Access Enable Register (MCOUNTEREN)

The Machine-mode Performance Event Counter Access Enable Register (mcounteren) is a 32-bit WARL register primarily used to control access permissions for user-mode performance monitoring counters at privilege levels below machine mode (HS-mode/VS-mode/HU-mode/VU-mode).

Table: Machine Mode Performance Event Counter Access Authorization Register Description

+--------+--------+-------+------------------------------------------------+----------+
| Name | Bits | R/W | Behavior | Reset |
+========+========+=======+================================================+==========+
| HPMx | 31:4 | RW | hpmcounterenx register M-mode lower privilege access bits:
| 0 | | | | | | | | | | | 0: Accessing hpmcounterx raises illegal instruction
exception | | | | | | | | | | | | 1: Allows normal access to hpmcounterx | |
+--------+--------+-------+------------------------------------------------+----------+
| IR | 3 | RW | instret register M-mode lower privilege access bit: | 0 | | | |
| | | | | | | 0: Accessing instret raises illegal instruction exception | | | |
| | | | | | | | 1: Allows normal access | |
+--------+--------+-------+------------------------------------------------+----------+
| TM | 2 | RW | time/stimecmp register M-mode lower privilege access bit: | 0 |
| | | | | | | | | | 0: Accessing time raises illegal instruction exception | | |
| | | | | | | | | 1: Allows normal access | |
+--------+--------+-------+------------------------------------------------+----------+
| CY | 1 | RW | cycle register M-mode lower privilege access bit: | 0 | | | | |
| | | | | | 0: Accessing cycle raises illegal instruction exception | | | | | |
| | | | | | 1: Allows normal access | |
+--------+--------+-------+------------------------------------------------+----------+

Supervisor-mode Performance Counter Access Enable Register (SCOUNTEREN)

Supervisor-mode Performance Counter Access Enable Register (scounteren) is a 32-bit WARL register primarily used to control user-mode access permissions for performance monitoring counters in HU-mode/VU-mode.

Table: Description of access permission registers for supervisor-mode performance event counters

+--------+--------+-------+------------------------------------------------+----------+ | Name | Bits | R/W | Behavior | Reset | +========+========+=======+================================================+==========+ | HPMx | 31:4 | RW | hpmcounterenx register user-mode access bit: | 0 | | | | | | | | | | | 0: Accessing hpmcounterx raises illegal instruction exception | | | | | | | | | | | | 1: Normal access to hpmcounterx allowed | | +--------+--------+-------+------------------------------------------------+----------+ | IR | 3 | RW | instret register user-mode access bit: | 0 | | | | | | | | | | | 0: Accessing instret raises illegal instruction exception | | | | | | | | | | | | 1: Normal access allowed | | +--------+--------+-------+------------------------------------------------+----------+ | TM | 2 | RW | time register user-mode access bit: | 0 | | | | | | | | | | | 0: Accessing time raises illegal instruction exception | | | | | | | | | | | | 1: Normal access allowed | | +--------+--------+-------+------------------------------------------------+----------+ | CY | 1 | RW | cycle register user-mode access bit: | 0 | | | | | | | | | | | 0: Accessing cycle raises illegal instruction exception | | | | | | | | | | | | 1: Normal access allowed | | +--------+--------+-------+------------------------------------------------+----------+

Virtualization Mode Performance Event Counter Access Authorization Register (HCOUNTEREN)

The Virtualization Mode Performance Event Counter Access Authorization Register (hcounteren) is a 32-bit WARL register primarily used to control user-mode performance monitoring counter access permissions in guest virtual machines (VS-mode/VU-mode).

Table: Description of access permission registers for supervisor-mode performance event counters

+--------+--------+-------+------------------------------------------------+----------+
| Name | Bitfield | R/W | Behavior | Reset Value |
+========+========+=======+================================================+==========+
| HPMx | 31:4 | RW | hpmcounterenx register guest VM access permission bit: | 0
| | | | | | | | | | | 0: Accessing hpmcounterx raises illegal instruction
exception | | | | | | | | | | | | 1: Normal access to hpmcounterx is permitted |
|
+--------+--------+-------+------------------------------------------------+----------+
| IR | 3 | RW | instret register guest VM access permission bit: | 0 | | | | | |
| | | | | 0: Accessing instret raises illegal instruction exception | | | | | |
| | | | | | 1: Normal access is permitted | |
+--------+--------+-------+------------------------------------------------+----------+
| TM | 2 | RW | time/vstimecmp(via stimecmp) register guest VM | 0 | | | | |
access permission bit: | | | | | | | | | | | | 0: Accessing time raises illegal
instruction exception | | | | | | | | | | | | 1: Normal access is permitted | |
+--------+--------+-------+------------------------------------------------+----------+
| CY | 1 | RW | cycle register guest VM access permission bit: | 0 | | | | | | |
| | | | 0: Accessing cycle raises illegal instruction exception | | | | | | | |
| | | | 1: Normal access is permitted | |
+--------+--------+-------+------------------------------------------------+----------+

Supervisor Mode Time Compare Register (STIMECMP)

The Supervisor Mode Timer Compare Register (stimecmp) is a 64-bit WARL register primarily used to manage timer interrupts (STIP) in supervisor mode.

STIMECMP Register Behavior Description:

  • The reset value is the 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
  • When menvcfg.STCE is 0 and the current privilege level is below M-mode (HS-mode/VS-mode/HU-mode/VU-mode), accessing the stimecmp register triggers an illegal instruction exception and does not generate an STIP interrupt.
  • The stimecmp register is the source of STIP interrupt generation: when performing an unsigned integer comparison time ≥ stimecmp, it asserts the STIP interrupt pending signal.
  • Supervisor mode software can control the generation of timer interrupts by writing to stimecmp.

Guest Virtual Machine Supervisor Mode Time Compare Register (VSTIMECMP)

The Guest Supervisor Time Compare Register (vstimecmp) is a 64-bit WARL register primarily used to manage timer interrupts (STIP) in guest supervisor mode.

VSTIMECMP Register Behavior Description:

  • The reset value is the 64-bit unsigned number 64'hffff_ffff_ffff_ffff.
  • When henvcfg.STCE is 0 or hcounteren.TM is set, accessing the vstimecmp register via the stimecmp register triggers a virtual illegal instruction exception without generating a VSTIP interrupt.
  • The vstimecmp register is the source of VSTIP interrupt generation: when performing an unsigned integer comparison time + htimedelta ≥ vstimecmp, the VSTIP interrupt pending signal is raised.
  • Guest supervisor mode software can control the generation of timer interrupts in VS-mode by writing to vstimecmp.

Machine-mode Performance Event Selector (mhpmevent3 - 31) is a 64-bit WARL register used to select the performance event corresponding to each performance event counter. In the Xiangshan Kunminghu architecture, each counter can be configured to count up to four performance events in combination. After users write the event index value, event combination method, and sampling privilege level into the designated event selector, the event counter matched by that selector begins normal counting.

Table: Machine Mode Performance Event Selector Description

+----------------+--------+-------+-----------------------------------------------+----------+ | Name | Bits | R/W | Behavior | Reset | +================+========+=======+===============================================+==========+ | OF | 63 | RW | Performance counter overflow flag: | 0 | | | | | | | | | | | 0: Set to 1 when counter overflows, triggers interrupt | | | | | | | | | | | | 1: Counter value remains unchanged on overflow, no interrupt | | +----------------+--------+-------+-----------------------------------------------+----------+ | MINH | 62 | RW | When set to 1, disables M-mode sampling | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | SINH | 61 | RW | When set to 1, disables S-mode sampling | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | UINH | 60 | RW | When set to 1, disables U-mode sampling | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | VSINH | 59 | RW | When set to 1, disables VS-mode sampling | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | VUINH | 58 | RW | When set to 1, disables VU-mode sampling | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | -- | 57:55 | RW | -- | 0 | +----------------+--------+-------+-----------------------------------------------+----------+ | | | | Counter event combination method control bits: | | | | | | | | | | | | 5'b00000: OR operation combination | | | OP_TYPE2 | 54:50 | | | | | OP_TYPE1 | 49:45 | RW | 5'b00001: AND operation combination | 0 | | OP_TYPE0 | 44:40 | | | | | | | | 5'b00010: XOR operation combination | | | | | | | | | | | | 5'b00100: ADD operation combination | | +----------------+--------+-------+-----------------------------------------------+----------+ | | | | Counter performance event index values: | | | EVENT3 | 39:30 | | | | | EVENT2 | 29:20 | RW | 0: Corresponding event counter does not count | -- | | EVENT1 | 19:10 | | | | | EVENT0 | 9:0 | | 1: Corresponding event counter counts the event | | | | | | | | +----------------+--------+-------+-----------------------------------------------+----------+

The combination method for counter events is:

  • EVENT0 and EVENT1 event counts use OP_TYPE0 operation combination to produce RESULT0.
  • EVENT2 and EVENT3 event counts are combined using OP_TYPE1 operation to produce RESULT1.
  • The combined results of RESULT0 and RESULT1 are processed using OP_TYPE2 operation to form RESULT2.
  • RESULT2 is accumulated into the corresponding event counter.

The reset value for the event index portion of the performance event selector is specified as 0

The Kunming Lake architecture categorizes the provided performance events into four types based on their sources: frontend, backend, memory access, and cache. The counters are divided into four sections, each recording performance events from the aforementioned sources:

  • Frontend: mhpmevent 3-10
  • Backend: mhpmevent11-18
  • Memory Access: mhpmevent19-26
  • Cache: mhpmevent27-31
Kunming Lake Frontend Performance Event Index Table
Index Event
0 noEvent
1 frontendFlush
2 ifu_req
3 ifu_miss
4 ifu_req_cacheline_0
5 ifu_req_cacheline_1
6 ifu_req_cacheline_0_hit
7 ifu_req_cacheline_1_hit
8 only_0_hit
9 only_0_miss
10 hit_0_hit_1
11 hit_0_miss_1
12 miss_0_hit_1
13 miss_0_miss_1
14 IBuffer_Flushed
15 IBuffer_hungry
16 IBuffer_1_4_valid
17 IBuffer_2_4_valid
18 IBuffer_3_4_valid
19 IBuffer_4_4_valid
20 IBuffer_full
21 Front_Bubble
22 Fetch_Latency_Bound
23 icache_miss_cnt
24 icache_miss_penalty
25 bpu_s2_redirect
26 bpu_s3_redirect
27 bpu_to_ftq_stall
28 mispredictRedirect
29 replayRedirect
30 predecodeRedirect
31 to_ifu_bubble
32 from_bpu_real_bubble
33 BpInstr
34 BpBInstr
35 BpRight
36 BpWrong
37 BpBRight
38 BpBWrong
39 BpJRight
40 BpJWrong
41 BpIRight
42 BpIWrong
43 BpCRight
44 BpCWrong
45 BpRRight
46 BpRWrong
47 ftb_false_hit
48 ftb_hit
49 fauftb_commit_hit
50 fauftb_commit_miss
51 tage_tht_hit
52 sc_update_on_mispred
53 sc_update_on_unconf
54 ftb_commit_hits
55 ftb_commit_misses
56 itlb_access
57 itlb_miss
Kunming Lake Backend Performance Event Index Table
Index Event
0 noEvent
1 decoder_fused_instr
2 decoder_waitInstr
3 decoder_stall_cycle
4 decoder_utilization
5 frontend_stall_cycle
6 backend_stall_cycle
7 INST_SPEC
8 RECOVERY_BUBBLE
9 rename_in
10 rename_waitinstr
11 rename_stall
12 rename_stall_cycle_walk
13 rename_stall_cycle_dispatch
14 rename_stall_cycle_int
15 rename_stall_cycle_fp
16 rename_stall_cycle_vec
17 rename_stall_cycle_v0
18 rename_stall_cycle_vl
19 me_freelist_1_4_valid
20 me_freelist_2_4_valid
21 me_freelist_3_4_valid
22 me_freelist_4_4_valid
23 std_freelist_1_4_valid
24 std_freelist_2_4_valid
25 std_freelist_3_4_valid
26 std_freelist_4_4_valid
27 std_freelist_1_4_valid
28 std_freelist_2_4_valid
29 std_freelist_3_4_valid
30 std_freelist_4_4_valid
31 std_freelist_1_4_valid
32 std_freelist_2_4_valid
33 std_freelist_3_4_valid
34 std_freelist_4_4_valid
35 std_freelist_1_4_valid
36 std_freelist_2_4_valid
37 std_freelist_3_4_valid
38 std_freelist_4_4_valid
39 dispatch_in
40 dispatch_empty
41 dispatch_utili
42 dispatch_waitinstr
43 dispatch_stall_cycle_lsq
44 dispatch_stall_cycle_rob
45 dispatch_stall_cycle_int_dq
46 dispatch_stall_cycle_fp_dq
47 dispatch_stall_cycle_ls_dq
48 rob_interrupt_num
49 rob_exception_num
50 rob_flush_pipe_num
51 rob_replay_inst_num
52 rob_commitUop
53 rob_commitInstr
54 rob_commitInstrFused
55 rob_commitInstrLoad
56 rob_commitInstrBranch
57 rob_commitInstrStore
58 rob_walkInstr
59 rob_walkCycle
60 rob_1_4_valid
61 rob_2_4_valid
62 rob_3_4_valid
63 rob_4_4_valid
64 BRANCH_JUMP
65 BR_MIS_PRED
66 TOTAL_FLUSH
67 EXEC_STALL_CYCLE
68 MEMSTALL_STORE
69 MEMSTALL_L1MISS
70 MEMSTALL_L2MISS
71 MEMSTALL_L3MISS
72 issueQueue_enq_fire_cnt
73 IssueQueueAluMulBkuBrhJmp_full
74 IssueQueueAluMulBkuBrhJmp_full
75 IssueQueueAluBrhJmpI2fVsetriwiVsetriwvfI2v_full
76 IssueQueueAluCsrFenceDiv_full
77 issueQueue_enq_fire_cnt
78 IssueQueueFaluFcvtF2vFmacFdiv_full
79 IssueQueueFaluFmacFdiv_full
80 IssueQueueFaluFmac_full
81 issueQueue_enq_fire_cnt
82 IssueQueueVfmaVialuFixVimacVppuVfaluVfcvtVipuVsetrvfwvf_full
83 IssueQueueVfmaVialuFixVfalu_full
84 IssueQueueVfdivVidiv_full
85 issueQueue_enq_fire_cnt
86 IssueQueueStaMou_full
87 IssueQueueStaMou_full
88 IssueQueueLdu_full
89 IssueQueueLdu_full
90 IssueQueueLdu_full
91 IssueQueueVlduVstuVseglduVsegstu_full
92 IssueQueueVlduVstu_full
93 IssueQueueStdMoud_full
94 IssueQueueStdMoud_full
Kunminghu Memory Access Performance Event Index Table
Index Event
0 noEvent
1 load_s0_in_fire
2 load_to_load_forward
3 stall_dcache
4 load_s1_in_fire
5 load_s1_tlb_miss
6 load_s2_in_fire
7 load_s2_dcache_miss
8 l1D_load_hw_prf_access
9 l1D_load_hw_prf_miss
10 load_s0_in_fire
11 load_to_load_forward
12 stall_dcache
13 load_s1_in_fire
14 load_s1_tlb_miss
15 load_s2_in_fire
16 load_s2_dcache_miss
17 l1D_load_hw_prf_access
18 l1D_load_hw_prf_miss
19 load_s0_in_fire
20 load_to_load_forward
21 stall_dcache
22 load_s1_in_fire
23 load_s1_tlb_miss
24 load_s2_in_fire
25 load_s2_dcache_miss
26 l1D_load_hw_prf_access
27 l1D_load_hw_prf_miss
28 sbuffer_req_valid
29 sbuffer_req_fire
30 sbuffer_merge
31 sbuffer_newline
32 dcache_req_valid
33 dcache_req_fire
34 sbuffer_idle
35 sbuffer_flush
36 sbuffer_replace
37 mpipe_resp_valid
38 replay_resp_valid
39 coh_timeout
40 sbuffer_1_4_valid
41 sbuffer_2_4_valid
42 sbuffer_3_4_valid
43 sbuffer_full_valid
44 MEMSTALL_ANY_LOAD
45 enq
46 ld_ld_violation
47 enq
48 stld_rollback
49 enq
50 deq
51 deq_block
52 replay_full
53 replay_rar_nack
54 replay_raw_nack
55 replay_nuke
56 replay_mem_amb
57 replay_tlb_miss
58 replay_bank_conflict
59 replay_dcache_replay
60 replay_forward_fail
61 replay_dcache_miss
62 full_mask_000
63 full_mask_001
64 full_mask_010
65 full_mask_011
66 full_mask_100
67 full_mask_101
68 full_mask_110
69 full_mask_111
70 nuke_rollback
71 nack_rollback
72 mmioCycle
73 mmioCnt
74 mmio_wb_success
75 mmio_wb_blocked
76 stq_1_4_valid
77 stq_2_4_valid
78 stq_3_4_valid
79 stq_4_4_valid
80 dcache_wbq_req
81 dcache_wbq_1_4_valid
82 dcache_wbq_2_4_valid
83 dcache_wbq_3_4_valid
84 dcache_wbq_4_4_valid
85 l1D_write_dcache_access
86 l1D_write_dcache_miss
87 dcache_mp_req
88 dcache_mp_total_penalty
89 dcache_missq_req
90 dcache_missq_1_4_valid
91 dcache_missq_2_4_valid
92 dcache_missq_3_4_valid
93 dcache_missq_4_4_valid
94 dcache_probq_req
95 dcache_probq_1_4_valid
96 dcache_probq_2_4_valid
97 dcache_probq_3_4_valid
98 dcache_probq_4_4_valid
99 load_req
100 load_replay
101 load_replay_for_data_nack
102 load_replay_for_no_mshr
103 load_replay_for_conflict
104 l1D_read_dcache_access
105 l1D_read_dcache_miss
106 load_req
107 load_replay
108 load_replay_for_data_nack
109 load_replay_for_no_mshr
110 load_replay_for_conflict
111 l1D_read_dcache_access
112 l1D_read_dcache_miss
113 load_req
114 load_replay
115 load_replay_for_data_nack
116 load_replay_for_no_mshr
117 load_replay_for_conflict
118 l1D_read_dcache_access
119 l1D_read_dcache_miss
120 dtlb_ld_access
121 dtlb_ld_miss
122 dtlb_st_access
123 dtlb_st_miss
124 PTW_tlbllptw_incount
125 PTW_tlbllptw_inblock
126 PTW_tlbllptw_memcount
127 PTW_tlbllptw_memcycle
128 PTW_access
129 PTW_l2_hit
130 PTW_l1_hit
131 PTW_l0_hit
132 PTW_sp_hit
133 PTW_pte_hit
134 PTW_rwHazard
135 PTW_out_blocked
136 PTW_fsm_count
137 PTW_fsm_busy
138 PTW_fsm_idle
139 PTW_resp_blocked
140 PTW_mem_count
141 PTW_mem_cycle
142 PTW_mem_blocked
143 ldDeqCount
144 stDeqCount
Kunming Lake Cache Performance Event Index Table
Index Event
0 noEvent
1 Slice0_l2_cache_refill
2 Slice0_l2_cache_rd_refill
3 Slice0_l2_cache_wr_refill
4 Slice0_l2_cache_long_miss
5 Slice0_l2_cache_hit
6 Slice0_l2_cache_miss
7 Slice0_l2_cache_access
8 Slice0_l2_cache_l2wb
9 Slice0_l2_cache_l1wb
10 Slice0_l2_cache_wb_victim
11 Slice0_l2_cache_wb_cleaning_coh
12 Slice0_l2_cache_prefetch_access
13 Slice0_l2_cache_prefetch_miss
14 Slice0_l2_cache_access_rd
15 Slice0_l2_cache_access_wr
16 Slice0_l2_cache_miss_rd
17 Slice0_l2_cache_inv
18 Slice1_l2_cache_refill
19 Slice1_l2_cache_rd_refill
20 Slice1_l2_cache_wr_refill
21 Slice1_l2_cache_long_miss
22 Slice1_l2_cache_hit
23 Slice1_l2_cache_miss
24 Slice1_l2_cache_access
25 Slice1_l2_cache_l2wb
26 Slice1_l2_cache_l1wb
27 Slice1_l2_cache_wb_victim
28 Slice1_l2_cache_wb_cleaning_coh
29 Slice1_l2_cache_prefetch_access
30 Slice1_l2_cache_prefetch_miss
31 Slice1_l2_cache_access_rd
32 Slice1_l2_cache_access_wr
33 Slice1_l2_cache_miss_rd
34 Slice1_l2_cache_inv
35 Slice2_l2_cache_refill
36 Slice2_l2_cache_rd_refill
37 Slice2_l2_cache_wr_refill
38 Slice2_l2_cache_long_miss
39 Slice2_l2_cache_hit
40 Slice2_l2_cache_miss
41 Slice2_l2_cache_access
42 Slice2_l2_cache_l2wb
43 Slice2_l2_cache_l1wb
44 Slice2_l2_cache_wb_victim
45 Slice2_l2_cache_wb_cleaning_coh
46 Slice2_l2_cache_prefetch_access
47 Slice2_l2_cache_prefetch_miss
48 Slice2_l2_cache_access_rd
49 Slice2_l2_cache_access_wr
50 Slice2_l2_cache_miss_rd
51 Slice2_l2_cache_inv
52 Slice3_l2_cache_refill
53 Slice3_l2_cache_rd_refill
54 Slice3_l2_cache_wr_refill
55 Slice3_l2_cache_long_miss
56 Slice3_l2_cache_hit
57 Slice3_l2_cache_miss
58 Slice3_l2_cache_access
59 Slice3_l2_cache_l2wb
60 Slice3_l2_cache_l1wb
61 Slice3_l2_cache_wb_victim
62 Slice3_l2_cache_wb_cleaning_coh
63 Slice3_l2_cache_prefetch_access
64 Slice3_l2_cache_prefetch_miss
65 Slice3_l2_cache_access_rd
66 Slice3_l2_cache_access_wr
67 Slice3_l2_cache_miss_rd
68 Slice3_l2_cache_inv

Topdown PMU

Topdown performance analysis is a top-down approach designed to quickly identify CPU performance bottlenecks. Its core concept involves decomposing high-level performance categories step by step, gradually refining the issues to accurately pinpoint the root cause. We have implemented three levels of Topdown performance events, as detailed below:

Table: Three-Level Top-Down Performance Events

+-------------+-------------+-------------+--------------+---------------------------------------+
| Level 1 | Level 2 | Level 3 | Description | Formula |
+=============+=============+=============+==============+=======================================+
| Retiring | - | - | Instruction commit impact | INST_RETIRED / | | | | | |
(IssueBW * CPU_CYCLES) |
+-------------+-------------+-------------+--------------+---------------------------------------+
| FrontEnd | - | - | Front-end impact | IF_FETCH_BUBBLE / | | Bound | | | |
(IssueBW * CPU_CYCLES) |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | Fetch | - | Fetch latency impact | IF_FETCH_BUBBLE_EQ_MAX / | | | Latency
| | | CPU_CYCLES | | | Bound | | | |
+-------------+-------------+-------------+--------------+---------------------------------------+
| | Fetch | | | FrontEnd Bound - | | - | Bandwidth | - | Fetch bandwidth impact
| Fetch Latency Bound | | | Bound | | | |
+-------------+-------------+-------------+--------------+---------------------------------------+
| Bad | | | | (INST_SPEC - INST_RETIRED+ | | Speculation | - | - | Misprediction
impact | RECOVERY_BUBBLE) / | | | | | | (IssueBW * CPU_CYCLES) |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | Branch | - | Branch misprediction | Bad Speculation * | | | Misspredict |
| impact | BR_MIS_PRED / TOTAL_FLUSH |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | Machine | - | Machine clear | Bad Speculation - Branch Misspredict | | |
Clears | | event impact | |
+-------------+-------------+-------------+--------------+---------------------------------------+
| BackEnd | - | - | Back-end impact | 1 - (FrontEnd Bound + | | Bound | | | |
Bad Speculation + Retiring) |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | Core | - | Core impact | (EXEC_STALL_CYCLE - MEMSTALL_ANYLOAD -| | | Bound
| | | MEMSTALL_STORE) / CPU_CYCLE |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | Memory | - | Memory access impact | (MEMSTALL_ANYLOAD + MEMSTALL_STORE) /
| | | Bound | | | CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | - | L1 Bound | L1 impact | (MEMSTALL_ANYLOAD - MEMSTALL_L1MISS) /| | | | |
| CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | - | L2 Bound | L2 impact | (MEMSTALL_L1MISS - MEMSTALL_L2MISS) / | | | | |
| CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | - | L3 Bound | L3 impact | (MEMSTALL_L2MISS - MEMSTALL_L3MISS) / | | | | |
| CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | - | Mem Bound | External memory impact | MEMSTALL_L3MISS / CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+
| - | - | Store Bound | Store instruction impact | MEMSTALL_STORE / CPU_CYCLES |
+-------------+-------------+-------------+--------------+---------------------------------------+

Here, IssueBW represents the issue width, which is currently 6-issue in the Xiangshan Kunminghu architecture.

Table: Topdown Performance Events

+----------------------------+----------------------+---------------------------------------------+\ | Name | Corresponding Event | Description |\ +============================+======================+=============================================+\ | CPU_CYCLES | - | Total clock cycles after all instructions commit |\ +----------------------------+----------------------+---------------------------------------------+\ | INST_RETIRED | rob_commitInstr | Number of successfully committed instructions |\ +----------------------------+----------------------+---------------------------------------------+\ | INST_SPEC | - | Number of speculatively executed instructions |\ +----------------------------+----------------------+---------------------------------------------+\ | IF_FETCH_BUBBLE | Front_Bubble | Number of bubbles fetched from the instruction buffer, |\ | | | with no backend stall |\ +----------------------------+----------------------+---------------------------------------------+\ | IF_FETCH_BUBBLE_EQ_MAX | Fetch_Latency_Bound | Cycles fetching zero instructions from the instruction buffer, |\ | | | with no backend stall |\ +----------------------------+----------------------+---------------------------------------------+\ | BR_MIS_PRED | - | Number of mispredicted branch instructions |\ +----------------------------+----------------------+---------------------------------------------+\ | TOTAL_FLUSH | - | Number of pipeline flush events |\ +----------------------------+----------------------+---------------------------------------------+\ | RECOVERY_BUBBLE | - | Number of cycles recovering from early mispredictions |\ +----------------------------+----------------------+---------------------------------------------+\ | EXEC_STALL_CYCLE | - | Number of cycles issuing few uops |\ +----------------------------+----------------------+---------------------------------------------+\ | MEMSTALL_ANY_LOAD | - | No uops issued, and at least one Load instruction not completed |\ +----------------------------+----------------------+---------------------------------------------+\ | MEMSTALL_STORE | - | Non-Store uops issued, |\ | | | and Store instructions not completed |\ +----------------------------+----------------------+---------------------------------------------+\ | MEMSTALL_L1MISS | - | No uops issued, at least one Load instruction not completed, |\ | | | and an L1-cache Miss occurred |\ +----------------------------+----------------------+---------------------------------------------+\ | MEMSTALL_L2MISS | - | No uops issued, at least one Load instruction not completed, |\ | | | and an L2-cache Miss occurred |\ +----------------------------+----------------------+---------------------------------------------+\ | MEMSTALL_L3MISS | - | No uops issued, at least one Load instruction not completed, |\ | | | and an L3-cache Miss occurred |\ +----------------------------+----------------------+---------------------------------------------+

To measure the impact of front-end fetch latency over a period, we can set the EVENT0 field of mhpmevent3 to 22, leaving the remaining bits at their default values, then proceed with testing. Upon completion, the CSR read instruction can be used to access the mhpmcounter3 register, obtaining the cycle count of front-end fetch latency during that period. Through calculation, the impact caused by front-end fetch latency can then be determined.

The performance event counters in the Xiangshan Kunminghu architecture are divided into three groups: machine-mode event counters, supervisor-mode event counters, and user-mode event counters.

Machine Mode Event Counter List
Name Index Read/Write Description Reset Value
MCYCLE 0xB00 RW Machine Mode Clock Cycle Counter -
MINSTRET 0xB02 RW Machine-mode retired instruction counter -
MHPMCOUNTER3-31 0XB03-0XB1F RW Machine-mode Performance Event Counter 0

The corresponding MHPMCOUNTERx counter is controlled by MHPMEVENTx, specifying the counting of relevant performance events.

Supervisor mode event counters include the supervisor mode counter overflow interrupt flag register (SCOUNTOVF)

Table: Supervisor Mode Counter Overflow Interrupt Flag Register (SCOUNTOVF) Description

+------------+--------+-------+-----------------------------------------------+--------+
| Name | Bits | R/W | Behavior | Reset |
+============+========+=======+===============================================+========+
| OFVEC | 31:3 | RO | mhpmcounterx register overflow flag: | 0 | | | | | | | | |
| | 1: Overflow occurred | | | | | | | | | | | | 0: No overflow occurred | |
+------------+--------+-------+-----------------------------------------------+--------+
| -- | 2:0 | RO 0 | -- | 0 |
+------------+--------+-------+-----------------------------------------------+--------+

scountovf serves as a read-only mapping of the OF bit in the mhpmcounter register, controlled by xcounteren:

  • M-mode can read the correct value when accessing scountovf.
  • HS-mode access to scountovf: When mcounteren.HPMx is 1, the corresponding OFVECx can read the correct value; otherwise, it only reads 0.
  • When accessing scountovf in VS-mode: When both mcounteren.HPMx and hcounteren.HPMx are 1, the corresponding OFVECx can be read correctly; otherwise, it only reads 0.
User Mode Event Counter List
Name Index Read/Write Description Reset Value
CYCLE 0xC00 RO User-mode read-only copy of mcycle register -
TIME 0xC01 RO Memory-mapped register mtime user-mode read-only copy -
INSTRET 0xC02 RO User-mode read-only copy of minstret register -
HPMCOUNTER3-31 0XC03-0XC1F RO Read-only user-mode copy of the mhpmcounter3-31 registers 0