
[XiangShan Biweekly 92] 20251222

Welcome to XiangShan biweekly column! Through this column, we will regularly share the latest development progress of XiangShan.

This is the 92nd issue of the biweekly report.

In this last issue of 2025, we are excited to announce, for the first time, the performance evaluation results of the current Kunminghu V3 architecture on SPEC CPU2006! Since performance regression testing of Kunminghu V3 began in August this year, 11 regression runs have been completed. These 11 runs document how the XiangShan team worked together to rapidly develop and iterate on the design. The initial version of Kunminghu V3 scored only 3.717 points/GHz on SPEC CPU2006. In the latest regression run, V3 reached 16.081 points/GHz, surpassing the score of V2. V3 has also replaced V2 as the new mainline of the XiangShan repository!

Performance Regression Results for XiangShan Kunminghu

During this process, ~~the frontend undoubtedly took the biggest blame~~ the most significant change is the brand-new frontend of V3. The new frontend greatly improves instruction bandwidth and can now predict up to 8 branches and supply 32 instructions per cycle. Meanwhile, the backend and memory subsystem have also increased their throughput, for example by increasing the number of issue ports from 6 to 8 and adjusting the sizes of various queues.

It is worth noting that the performance curve of V3 vividly reflects the agile development philosophy of the XiangShan team. Unlike traditional waterfall development, V3 was not delivered as a single drop of finished code; it is the result of rapid iteration and continuous evolution from the initial code. We believe this philosophy will bring a new development paradigm to the industry and will help Kunminghu V3 reach new heights, further raising the performance bar for open-source processors.

We appreciate your companionship and support for XiangShan, and we look forward to your continued attention to the subsequent progress of Kunminghu V3!

In terms of XiangShan development, the frontend has fixed some BPU-related performance bugs and added numerous performance counters for better performance analysis. The backend continues to advance the design of the new vector unit. The memory subsystem has fixed several bugs in V2 and is continuing with V3 module refactoring and infrastructure construction.

Recent Developments

Frontend

  • RTL features
    • Reduce SRAM write requests when TAGE counters are saturated, thereby reducing stalls caused by SRAM port conflicts (#5309)
    • Align TAGE prediction selection logic with GEM5 (#5377)
    • Implement SC bias table (#5234)
    • Implement ITTAGE prediction for call-type branches (#5311)
  • Bug fixes
    • Fix misuse of the branch address (cfiPc) and the prediction block address (startPc) in BPU training, caused by unclear naming (#5317)
    • Fix UBTB training pipeline hit condition to avoid incorrect replacer updates (#5326)
    • Fix TAGE folded history signal width typo (#5325)
    • Fix TAGE cfiPc typo (#5345)
    • Fix some RAS typos and enable RAS (#5321)
    • Fix a flush logic error in FTQ resolveQueue BPU enqueue (#5344)
  • Timing/Area optimization
    • Move TAGE BaseTable into MBTB to synchronize counter allocation with MBTB entries, reducing redundant storage (#5349)
  • Code quality improvements
    • Unify the naming of pc-related signals within BPU (#5318)
    • Add utility methods to batch-generate performance counters with similar prefixes (#5298)
  • Debugging tools
    • Add and fix a large number of performance counters in various modules (#5320, #5265, #5319, #5332, #5339, #5347, #5353, #5370, #5383, #5372)
    • Optimize the calculation of actual branch addresses in TAGE Trace to account for compressed instructions (#5355)
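
To illustrate the idea behind the saturated-counter item above (#5309), here is a minimal behavioral sketch in Python. It is not the XiangShan RTL: the counter width, the `CounterTable` class, and the `train` method are hypothetical names chosen for illustration. The only point it shows is that when a saturating counter would be rewritten with the same value, the write can be dropped, which leaves the SRAM port free for reads.

```python
from dataclasses import dataclass

CTR_BITS = 3                      # hypothetical counter width, not taken from the RTL
CTR_MAX = (1 << CTR_BITS) - 1     # strongly taken
CTR_MIN = 0                       # strongly not-taken


def next_counter(ctr: int, taken: bool) -> int:
    """Saturating increment (taken) or decrement (not taken)."""
    if taken:
        return min(ctr + 1, CTR_MAX)
    return max(ctr - 1, CTR_MIN)


@dataclass
class CounterTable:
    """Toy model of a direction-counter table backed by an SRAM."""
    entries: list
    writes_issued: int = 0

    def train(self, index: int, taken: bool) -> bool:
        """Train one counter; return True only if an SRAM write was issued."""
        old = self.entries[index]
        new = next_counter(old, taken)
        if new == old:
            # Already saturated in the training direction: the stored value
            # would not change, so no write request is generated.
            return False
        self.entries[index] = new
        self.writes_issued += 1
        return True


table = CounterTable(entries=[CTR_MAX, 3, CTR_MIN])
table.train(0, taken=True)    # saturated high, write skipped
table.train(1, taken=True)    # 3 -> 4, write issued
table.train(2, taken=False)   # saturated low, write skipped
print("writes issued:", table.writes_issued)
```

In a real predictor the saving depends on how often well-trained branches hit saturated entries, which this sketch does not attempt to model.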

Backend

  • RTL new features
    • Implementing the new design of the V3 vector unit
  • Bug fixes
    • Fix backend TopDown interface connection issues (#5340)
    • Modify the value of mvendorid (#5367)
    • Fix Dispatch pipeline stall cycle counting issue (#5398)
  • Code optimizations
    • Make the connection of srcLoadDependencyUpdate more readable (#5404)
  • Others
    • Update the list of backend code owners (#5342)

MemBlock and Cache

  • RTL new features
    • (V2) Support disabling ClockGate in CoupledL2 via parameters (CoupledL2 #451)
    • (V2) Parameterize TIMERange in MMIOBridge of CoupledL2 (CoupledL2 #453)
    • Refactoring and testing of MMU, LoadUnit, StoreQueue, L2, etc. are ongoing
  • Bug fixes
    • (V2) Fix the incorrect wakeup of load requests in LoadQueueReplay (#5327)
    • (V2) Fix wlineflag not being delayed by one cycle in LoadQueueRAW (#5352)
    • (V2) Fix the depth of L1StreamPrefetcher (#5365)
    • (V2) Remove some RegNext(hartid) in L2Top and MemBlock (#5408)
    • (V2) Fix the wrong DataCheck logic in TXDAT (CoupledL2 #455)
    • (V2) Fix the compilation error of l2MissMatch IO (CoupledL2 #456)
  • Performance optimizations
    • (V2) Increase the capacity of uncachebuffer from 4 to 16 (#5364)
    • Add PerfCCT support for LoadUnit (#5286)
  • Timing
    • (V2) Adjust the arbitration sequence for s0 source in LoadUnit (#5300)
    • (V2) Optimize the timing of VSegmentUnit and exceptionBuffer (#5330, #5292)
    • (V2) Remove IO port for store prefetch in Sbuffer (#5329)
    • (V2) Remove unnecessary Mux in MemBlock when generating paddr for TLB (#5331)
    • (V2) Change BitmapCache storage from registers to SRAM (#5346)
  • Debugging tools
    • Support outputting performance counters in tl-test-new (tl-test-new #84)
    • Support outputting detailed information when check_paddr fails in NEMU (NEMU #867)
    • Continuous improvement of CHI infrastructure CHIron
    • Develop a verification tool, CHI Test, for the new version of L2 Cache. Development is ongoing
    • Refine the prefetch statistics in L2 Topdown Monitor (CoupledL2 #452)

Performance Evaluation

| SPECint 2006 est. | @ 3GHz | SPECfp 2006 est. | @ 3GHz |
|---|---|---|---|
| 400.perlbench | 36.71 | 410.bwaves | 73.92 |
| 401.bzip2 | 27.45 | 416.gamess | 54.70 |
| 403.gcc | 42.71 | 433.milc | 45.12 |
| 429.mcf | 59.65 | 434.zeusmp | 60.17 |
| 445.gobmk | 35.10 | 435.gromacs | 38.47 |
| 456.hmmer | 44.18 | 436.cactusADM | 54.20 |
| 458.sjeng | 32.30 | 437.leslie3d | 52.85 |
| 462.libquantum | 107.84 | 444.namd | 37.91 |
| 464.h264ref | 61.89 | 447.dealII | 61.38 |
| 471.omnetpp | 43.56 | 450.soplex | 54.62 |
| 473.astar | 30.43 | 453.povray | 56.90 |
| 483.xalancbmk | 75.89 | 454.calculix | 19.18 |
| GEOMEAN | 45.85 | 459.GemsFDTD | 44.14 |
| | | 465.tonto | 36.35 |
| | | 470.lbm | 93.88 |
| | | 481.wrf | 48.77 |
| | | 482.sphinx3 | 56.20 |
| | | GEOMEAN | 49.72 |
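
The GEOMEAN rows can be rechecked from the per-benchmark scores. The short Python sketch below copies the numbers from the table, computes the geometric mean of each suite, and divides by the 3 GHz clock to get a per-GHz figure. The `geomean` helper and the dictionary names are illustrative and not part of any XiangShan tooling.

```python
import math

FREQ_GHZ = 3.0  # scores in the table are estimated at 3 GHz

SPECINT = {
    "400.perlbench": 36.71, "401.bzip2": 27.45, "403.gcc": 42.71,
    "429.mcf": 59.65, "445.gobmk": 35.10, "456.hmmer": 44.18,
    "458.sjeng": 32.30, "462.libquantum": 107.84, "464.h264ref": 61.89,
    "471.omnetpp": 43.56, "473.astar": 30.43, "483.xalancbmk": 75.89,
}

SPECFP = {
    "410.bwaves": 73.92, "416.gamess": 54.70, "433.milc": 45.12,
    "434.zeusmp": 60.17, "435.gromacs": 38.47, "436.cactusADM": 54.20,
    "437.leslie3d": 52.85, "444.namd": 37.91, "447.dealII": 61.38,
    "450.soplex": 54.62, "453.povray": 56.90, "454.calculix": 19.18,
    "459.GemsFDTD": 44.14, "465.tonto": 36.35, "470.lbm": 93.88,
    "481.wrf": 48.77, "482.sphinx3": 56.20,
}


def geomean(values):
    """Geometric mean, computed in log space for numerical stability."""
    values = list(values)
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


int_score = geomean(SPECINT.values())
fp_score = geomean(SPECFP.values())

print(f"SPECint geomean: {int_score:.2f} ({int_score / FREQ_GHZ:.2f} per GHz)")
print(f"SPECfp  geomean: {fp_score:.2f} ({fp_score / FREQ_GHZ:.2f} per GHz)")
```

Both results should match the GEOMEAN rows to within rounding of the two-decimal table entries.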

We use SimPoint to sample programs and create checkpoint images in our custom format. The coverage of SimPoint clustering reaches 100%. Note that the above scores are estimated from program segments rather than from a complete SPEC CPU2006 run, so they may deviate from the actual performance of real chips.
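
As a rough illustration of how a score can be estimated from sampled segments, the Python sketch below combines per-checkpoint CPI values using SimPoint cluster weights and converts the result into a SPEC-style ratio. This is a generic sketch of the usual SimPoint arithmetic, assuming fixed-length intervals; it is not the XiangShan evaluation scripts. The `Checkpoint` class, the function names, and every number in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    weight: float  # fraction of program intervals represented by this cluster
    cpi: float     # cycles per instruction measured on this checkpoint


def weighted_cpi(checkpoints):
    """Whole-program CPI estimate: weight-averaged CPI over all checkpoints."""
    total_weight = sum(c.weight for c in checkpoints)
    if total_weight <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    # Normalizing by total_weight also handles coverage below 100%.
    return sum(c.weight * c.cpi for c in checkpoints) / total_weight


def estimated_score(checkpoints, total_instructions, freq_hz, reference_seconds):
    """SPEC-style ratio: reference run time divided by estimated run time."""
    cycles = weighted_cpi(checkpoints) * total_instructions
    runtime_seconds = cycles / freq_hz
    return reference_seconds / runtime_seconds


# Hypothetical example data, not measured on any XiangShan core.
example = [
    Checkpoint(weight=0.5, cpi=0.5),
    Checkpoint(weight=0.3, cpi=1.0),
    Checkpoint(weight=0.2, cpi=2.0),
]
score = estimated_score(
    example,
    total_instructions=1e12,   # hypothetical dynamic instruction count
    freq_hz=3e9,               # 3 GHz, as in the table header
    reference_seconds=9500.0,  # hypothetical reference time
)
print(f"estimated CPI: {weighted_cpi(example):.3f}, estimated score: {score:.2f}")
```

Because only the sampled segments are simulated, any behavior not captured by the chosen clusters is invisible to this estimate, which is one reason such scores can differ from full runs on silicon.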

Compilation parameters are as follows:

| Parameter | Value |
|---|---|
| Compiler | gcc12 |
| Optimization level | O3 |
| Memory library | jemalloc |
| -march | RV64GCB |
| -ffp-contraction | fast |

Processor and SoC parameters are as follows:

| Parameter | Value |
|---|---|
| Commit | 64e7bff7f |
| Date | 12/19/2025 |
| L1 ICache | 64KB |
| L1 DCache | 64KB |
| L2 Cache | 1MB |
| L3 Cache | 16MB |
| LSU | 3ld2st |
| Bus protocol | TileLink |
| Memory latency | DDR4-3200 |

Editors: Zhihao Xu, Junxiong Ji, Zhuo Chen, Junjie Yu, Yanjun Li