[XiangShan Biweekly 92] 20251222

Welcome to the XiangShan biweekly column! Through this column, we regularly share the latest development progress of XiangShan.

This is the 92nd issue of the biweekly report.

In this last issue of the biweekly report for 2025, we are excited to announce, for the first time, the performance evaluation results of the current Kunminghu V3 architecture on SPEC CPU2006! Since performance regression testing of Kunminghu V3 began in August this year, a total of 11 regression runs have been completed. These 11 runs bear witness to the XiangShan team working together to rapidly develop and iterate on the design. The initial version of Kunminghu V3 scored only 3.717 points/GHz on SPEC CPU2006. In the latest regression run, V3 has reached 16.081 points/GHz, surpassing the score of V2. V3 has also replaced V2 as the new mainline of the XiangShan repository!

Performance Regression Results for XiangShan Kunminghu

During this process, ~~the frontend undoubtedly took the biggest blame~~ the most significant change is the brand-new frontend of V3. The new frontend greatly improves instruction bandwidth and can now predict up to 8 branches and provide 32 instructions per cycle. Meanwhile, the backend and memory subsystem have also increased their throughput, including an increase from 6 to 8 issue ports and adjusted sizes for various queues.

It is worth noting that the performance curve of V3 vividly reflects the agile development philosophy of the XiangShan team. Unlike a traditional waterfall process, V3 was not delivered as a single drop of finished code; it is the result of rapid iteration and continuous evolution from the initial code. We believe this philosophy can bring a new development paradigm to the industry and will help Kunminghu V3 reach new heights, further raising the performance bar for open-source processors.

We appreciate your companionship and support for XiangShan, and we look forward to your continued attention to the subsequent progress of Kunminghu V3!

In terms of XiangShan development, the frontend has fixed some BPU-related performance bugs and added numerous performance counters for better performance analysis. The backend continues to advance the design of the new vector unit. The memory subsystem has fixed several bugs in V2 and is continuing with V3 module refactoring and infrastructure construction.

Recent Developments

Frontend

RTL feature
Reduce SRAM write requests when TAGE counters are saturated, thereby reducing stalls caused by SRAM port conflicts (#5309)
Align TAGE prediction selection logic with GEM5 (#5377)
Implement SC bias table (#5234)
Implement ITTAGE prediction for call-type branches (#5311)
Bug Fix
Fix misuse of the branch address (cfiPc) and the prediction block address (startPc) in BPU training, caused by unclear naming (#5317)
Fix UBTB training pipeline hit condition to avoid incorrect replacer updates (#5326)
Fix TAGE folded history signal width typo (#5325)
Fix TAGE cfiPc typo (#5345)
Fix some RAS typos and enable RAS (#5321)
Fix a flush logic error in the FTQ resolveQueue BPU enqueue path (#5344)
Timing/Area optimization
Move TAGE BaseTable into MBTB to synchronize counter allocation with MBTB entries, reducing redundant storage (#5349)
Code quality improvements
Unify the naming of pc-related signals within BPU (#5318)
Add some utility methods to batch generate performance counters with similar prefixes (#5298)
Debugging tools
Add and fix a large number of performance counters in various modules (#5320, #5265, #5319, #5332, #5339, #5347, #5353, #5370, #5383, #5372)
Optimize the branch real address calculation logic of TAGE Trace, considering compressed instructions (#5355)
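
To illustrate the idea behind the TAGE change above that reduces SRAM writes for saturated counters (#5309), here is a minimal behavioral sketch in Python. It is not the XiangShan RTL: the 3-bit counter width, the `SatCounter` class, and the `count_writes` helper are all hypothetical names chosen for illustration. The point is only that an update which leaves a saturated counter unchanged does not need to issue a write, which frees the SRAM port for other requests.

```python
from dataclasses import dataclass


@dataclass
class SatCounter:
    """Unsigned saturating counter; the 3-bit default width is an assumption."""
    value: int
    bits: int = 3

    def update(self, taken: bool) -> bool:
        """Apply one training update.

        Returns True only if the stored value changed, i.e. only if a
        write to the backing table would actually be required.
        """
        max_val = (1 << self.bits) - 1
        if taken:
            new_value = min(self.value + 1, max_val)
        else:
            new_value = max(self.value - 1, 0)
        write_needed = new_value != self.value
        self.value = new_value
        return write_needed


def count_writes(outcomes, start=4, bits=3):
    """Count how many updates in `outcomes` require a table write."""
    ctr = SatCounter(start, bits)
    return sum(1 for taken in outcomes if ctr.update(taken))


if __name__ == "__main__":
    # A long run of taken branches saturates the counter after a few updates;
    # every later update is a no-op and needs no write.
    always_taken = [True] * 10
    print("updates:", len(always_taken), "writes:", count_writes(always_taken))
```

In this sketch a strongly biased branch quickly stops generating writes at all, which is the case where suppressing redundant writes helps most.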

Backend

RTL new features
Implementing the new design of the V3 vector unit
Bug fixes
Fix backend TopDown interface connection issues (#5340)
Modify the value of mvendorid (#5367)
Fix Dispatch pipeline stall cycle counting issue (#5398)
Code optimizations
Make the connection of srcLoadDependencyUpdate more readable (#5404)
Others
Update the list of backend code owners (#5342)

MemBlock and Cache

RTL new features
(V2) Support disabling ClockGate in CoupledL2 via parameters (CoupledL2 #451)
(V2) Parameterize TIMERange in MMIOBridge of CoupledL2 (CoupledL2 #453)
Refactoring and testing of MMU, LoadUnit, StoreQueue, L2, etc. are ongoing
Bug fix
(V2) Fix the incorrect wakeup of load requests in LoadQueueReplay (#5327)
(V2) Fix wlineflag not being delayed by one cycle in LoadQueueRAW (#5352)
(V2) Fix the depth of L1StreamPrefetcher (#5365)
(V2) Remove some RegNext(hartid) in L2Top and MemBlock (#5408)
(V2) Fix the wrong DataCheck logic in TXDAT (CoupledL2 #455)
(V2) Fix the compilation error of l2MissMatch IO (CoupledL2 #456)
Performance Optimizations
(V2) Increase the capacity of uncachebuffer from 4 to 16 (#5364)
Add PerfCCT support for LoadUnit (#5286)
Timing
(V2) Adjust the arbitration sequence for s0 sources in LoadUnit (#5300)
(V2) Optimize the timing of VSegmentUnit and exceptionBuffer (#5330, #5292)
(V2) Remove IO port for store prefetch in Sbuffer (#5329)
(V2) Remove unnecessary Mux in MemBlock when generating paddr for TLB (#5331)
(V2) Change BitmapCache storage from registers to SRAM (#5346)
Debugging tools
Support outputting performance counters in tl-test-new (tl-test-new #84)
Support outputting detailed information when check_paddr fails in NEMU (NEMU #867)
Continuous improvement of CHI infrastructure CHIron
Continue developing CHI Test, a verification tool for the new version of the L2 Cache
Refine the prefetch statistics in L2 Topdown Monitor (CoupledL2 #452)

Performance Evaluation

SPECint 2006 est.	@ 3GHz	SPECfp 2006 est.	@ 3GHz
400.perlbench	36.71	410.bwaves	73.92
401.bzip2	27.45	416.gamess	54.70
403.gcc	42.71	433.milc	45.12
429.mcf	59.65	434.zeusmp	60.17
445.gobmk	35.10	435.gromacs	38.47
456.hmmer	44.18	436.cactusADM	54.20
458.sjeng	32.30	437.leslie3d	52.85
462.libquantum	107.84	444.namd	37.91
464.h264ref	61.89	447.dealII	61.38
471.omnetpp	43.56	450.soplex	54.62
473.astar	30.43	453.povray	56.90
483.xalancbmk	75.89	454.Calculix	19.18
GEOMEAN	45.85	459.GemsFDTD	44.14
		465.tonto	36.35
		470.lbm	93.88
		481.wrf	48.77
		482.sphinx3	56.20
		GEOMEAN	49.72
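
The GEOMEAN rows in the table above are geometric means of the per-benchmark scores. As a quick sanity check, the following Python sketch recomputes them from the values listed in the table. The dictionary and function names are ours, chosen for illustration; only the scores come from the table.

```python
import math

# Per-benchmark estimated scores at 3GHz, copied from the table above.
SPECINT = {
    "400.perlbench": 36.71, "401.bzip2": 27.45, "403.gcc": 42.71,
    "429.mcf": 59.65, "445.gobmk": 35.10, "456.hmmer": 44.18,
    "458.sjeng": 32.30, "462.libquantum": 107.84, "464.h264ref": 61.89,
    "471.omnetpp": 43.56, "473.astar": 30.43, "483.xalancbmk": 75.89,
}
SPECFP = {
    "410.bwaves": 73.92, "416.gamess": 54.70, "433.milc": 45.12,
    "434.zeusmp": 60.17, "435.gromacs": 38.47, "436.cactusADM": 54.20,
    "437.leslie3d": 52.85, "444.namd": 37.91, "447.dealII": 61.38,
    "450.soplex": 54.62, "453.povray": 56.90, "454.Calculix": 19.18,
    "459.GemsFDTD": 44.14, "465.tonto": 36.35, "470.lbm": 93.88,
    "481.wrf": 48.77, "482.sphinx3": 56.20,
}


def geomean(scores):
    """Geometric mean, computed in log space to avoid overflow."""
    values = list(scores)
    return math.exp(sum(math.log(v) for v in values) / len(values))


int_gm = geomean(SPECINT.values())
fp_gm = geomean(SPECFP.values())

if __name__ == "__main__":
    print(f"SPECint 2006 est. geomean @ 3GHz: {int_gm:.2f}")
    print(f"SPECfp 2006 est. geomean @ 3GHz:  {fp_gm:.2f}")
```

Because the geometric mean multiplies ratios, a single outlier such as 462.libquantum or 454.Calculix moves the aggregate far less than it would move an arithmetic mean.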

We use SimPoint to sample programs and create checkpoint images based on our custom format. The coverage of SimPoint clustering reaches 100%. Note that the above scores are estimated from program segments rather than from a complete SPEC CPU2006 run, so they may deviate from the actual performance of real chips.
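
For readers unfamiliar with sampled evaluation, the sketch below shows the usual way SimPoint-style results are combined: each checkpoint's CPI is weighted by the share of execution its cluster represents, and the weighted CPI is turned into an estimated run time and score. This is a generic illustration, not XiangShan's actual scoring scripts. The function names, the checkpoint weights and CPIs, the instruction count, and the reference time are all made-up values.

```python
def weighted_cpi(checkpoints):
    """Combine per-checkpoint CPI using SimPoint cluster weights.

    `checkpoints` is an iterable of (weight, cpi) pairs. Weights are
    normalized here, so they do not have to sum to exactly 1.
    """
    pairs = list(checkpoints)
    total_weight = sum(w for w, _ in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * cpi for w, cpi in pairs) / total_weight


def estimated_score(checkpoints, total_instructions, freq_hz, reference_seconds):
    """Estimate a SPEC-style ratio: reference time / estimated run time."""
    cpi = weighted_cpi(checkpoints)
    estimated_seconds = total_instructions * cpi / freq_hz
    return reference_seconds / estimated_seconds


if __name__ == "__main__":
    # Hypothetical benchmark with three clusters (weight, measured CPI).
    example = [(0.5, 0.4), (0.3, 0.6), (0.2, 1.0)]
    score = estimated_score(
        example,
        total_instructions=1e12,   # hypothetical dynamic instruction count
        freq_hz=3e9,               # 3GHz, as in the table above
        reference_seconds=9000.0,  # hypothetical reference run time
    )
    print(f"weighted CPI: {weighted_cpi(example):.3f}, estimated score: {score:.2f}")
```

Since only the sampled segments are simulated, anything the checkpoints miss, such as long-range cache warm-up effects, is one reason an estimate of this kind can differ from a full run on silicon.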

Compilation parameters are as follows:


Compiler	gcc12
Optimization level	O3
Memory library	jemalloc
-march	RV64GCB
-ffp-contract	fast

Processor and SoC parameters are as follows:


Commit	64e7bff7f
Date	12/19/2025
L1 ICache	64KB
L1 DCache	64KB
L2 Cache	1MB
L3 Cache	16MB
LSU	3ld2st
Bus protocol	TileLink
Memory latency	DDR4-3200

XiangShan technical discussion QQ group: 879550595
XiangShan technical discussion website: https://github.com/OpenXiangShan/XiangShan/discussions
XiangShan Documentation: https://xiangshan-doc.readthedocs.io/
XiangShan User Guide: https://docs.xiangshan.cc/projects/user-guide/
XiangShan Design Doc: https://docs.xiangshan.cc/projects/design/

Editors: Zhihao Xu, Junxiong Ji, Zhuo Chen, Junjie Yu, Yanjun Li