[XiangShan Biweekly 101] 20260427

Welcome to the XiangShan biweekly column! Through this column, we regularly share the latest development progress of XiangShan. This is the 101st issue of the biweekly report.

The Kunming Lake V3 design document is being released gradually, and we welcome everyone to read and discuss it with us! The new design document currently covers two modules, ICache and BPU; documents for other modules will be released as development progresses. The design document remains available at https://docs.xiangshan.cc/projects/design/zh-cn/. If you are interested in the Kunming Lake V2 design document, you can switch branches at the bottom right corner of the webpage to view it.

In recent XiangShan core development, the frontend team optimized branch predictor timing, while the backend and memory teams fixed bugs and continued module refactoring and testing.

Recent Developments

Frontend

  • RTL features
      • Enable SC Backward Table (#5796)
  • Bug fixes
      • Fix S1 RAS stack top address issue during S3 override (#5680)
  • PPA optimizations
      • Remove SC training metadata stored in FTQ, read on update instead, to save area (#5819)
      • Decouple the write of TAGE taken counter and useful counter to save power (#5782)
      • Fix BPU S3 timing paths (#5797)
      • Fix SC prediction timing paths (#5843)
      • Fix FTQ redirect and branch resolve timing paths (#5835)
  • Code quality
      • Remove unused V2 utility classes (#5821)

Backend

  • RTL new features
      • (V2) Make commit stuck critical error check configurable by CSR (#5806)
      • Add switch to disable dispatch balance optimization (#5815)
      • Resolve the false positive issue caused by insufficient main pipeline resources (#5803)
  • PPA optimizations
      • Optimize dispatch policy to improve performance (#5801)
  • Bug fixes
      • (V2) Fix indirect CSR RegOut (#5823, #5833)

MemBlock and Cache

  • RTL new features
      • Finish the design of the new StoreUnit (#5760)
      • L2 refactoring is ongoing
  • Bug fixes
      • Fix OverlapMask for cross16B forward in StoreUnit (#5814)
      • Skip global CleanInvalid during local Flush All in CoupledL2 (CoupledL2 #499)

XSAI

  • RTL new features
      • Testing FP8 precision support in the matrix unit
      • Evaluating 8-channel cache access for the matrix unit
      • Co-developing BF16 scalar and vector support with the backend team
  • Code quality
      • Optimized the XSAI parameter system (XSAI #59)
  • Debugging tools
      • Added BF16 extension support in NEMU (NEMU #995)
      • HBL2 tests are now compatible with multi-core environments

Performance Evaluation

Processor and SoC parameters are as follows:

| Parameters | Options |
| --- | --- |
| Commit | 82d2669b2 |
| Date | 2026/04/23 |
| L1 ICache | 64KB |
| L1 DCache | 64KB |
| L2 Cache | 1MB |
| L3 Cache | 16MB |
| LSU | 3ld2st |
| Bus protocol | CHI |
| Memory configuration | DDR4-3200 |

The SPEC CPU2006 scores are as follows:

| SPECint 2006 @ 3GHz | GCC15 | XSCC | SPECfp 2006 @ 3GHz | GCC15 | XSCC |
| --- | --- | --- | --- | --- | --- |
| 400.perlbench | 48.55 | 47.58 | 410.bwaves | 85.31 | 90.03 |
| 401.bzip2 | 27.44 | 28.26 | 416.gamess | 57.05 | 53.20 |
| 403.gcc | 55.18 | 39.57 | 433.milc | 64.74 | 64.04 |
| 429.mcf | 61.07 | 55.44 | 434.zeusmp | 71.39 | 64.13 |
| 445.gobmk | 38.93 | 40.08 | 435.gromacs | 37.20 | 34.38 |
| 456.hmmer | 54.39 | 64.70 | 436.cactusADM | 76.02 | 87.74 |
| 458.sjeng | 38.89 | 39.43 | 437.leslie3d | 56.29 | 56.46 |
| 462.libquantum | 136.76 | 294.79 | 444.namd | 43.21 | 45.23 |
| 464.h264ref | 63.44 | 72.03 | 447.dealII | 64.12 | 68.46 |
| 471.omnetpp | 41.05 | 39.51 | 450.soplex | 52.08 | 64.00 |
| 473.astar | 30.46 | 29.66 | 453.povray | 73.34 | 66.37 |
| 483.xalancbmk | 75.80 | 84.53 | 454.Calculix | 43.80 | 39.68 |
| GEOMEAN | 50.92 | 54.14 | 459.GemsFDTD | 63.55 | 64.27 |
| | | | 465.tonto | 52.57 | 35.04 |
| | | | 470.lbm | 125.76 | 133.04 |
| | | | 481.wrf | 54.94 | 41.59 |
| | | | 482.sphinx3 | 59.37 | 62.42 |
| | | | GEOMEAN | 61.05 | 59.23 |
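
For readers who want to check how the GEOMEAN rows relate to the per-benchmark scores, here is a minimal Python sketch that recomputes the SPECint geometric means from the numbers in the table above. The dictionary name `SPECINT_SCORES` and the helper `geomean` are illustrative names chosen for this example, not part of any XiangShan tooling, and only the SPECint columns are covered.

```python
import math

# SPECint 2006 scores copied from the table above: (GCC15, XSCC) per benchmark.
# The dictionary name is an illustrative choice, not part of XiangShan tooling.
SPECINT_SCORES = {
    "400.perlbench": (48.55, 47.58),
    "401.bzip2": (27.44, 28.26),
    "403.gcc": (55.18, 39.57),
    "429.mcf": (61.07, 55.44),
    "445.gobmk": (38.93, 40.08),
    "456.hmmer": (54.39, 64.70),
    "458.sjeng": (38.89, 39.43),
    "462.libquantum": (136.76, 294.79),
    "464.h264ref": (63.44, 72.03),
    "471.omnetpp": (41.05, 39.51),
    "473.astar": (30.46, 29.66),
    "483.xalancbmk": (75.80, 84.53),
}


def geomean(values):
    """Geometric mean, computed as exp of the mean of the logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gcc15_geomean = geomean(gcc for gcc, _ in SPECINT_SCORES.values())
xscc_geomean = geomean(xscc for _, xscc in SPECINT_SCORES.values())

print(f"SPECint GEOMEAN  GCC15: {gcc15_geomean:.2f}  XSCC: {xscc_geomean:.2f}")
```

Both results should land close to the GEOMEAN row of the SPECint columns. Because a geometric mean multiplies the scores together, a single outlier such as 462.libquantum under XSCC moves the aggregate less than it would move an arithmetic mean.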

Compilation parameters are as follows:

| Parameters | GCC15 | XSCC |
| --- | --- | --- |
| Compiler | gcc15 | xscc |
| Optimization level | O3 | O3 |
| Memory library | jemalloc | jemalloc |
| -march | RV64GCB | RV64GCB |
| -ffp-contraction | fast | fast |
| Linker optimization | -flto | -flto |
| Floating-point optimization | -ffast-math | -ffast-math |
| -mcpu | - | xiangshan-kunminghu |

Note: We use SimPoint to sample the programs and create checkpoint images based on our custom checkpoint format, with a SimPoint clustering coverage of 100%. The above scores are estimates based on program segments, not full SPEC CPU2006 evaluations, and may differ from actual chip performance.
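
To illustrate why segment-based scores are estimates, the following Python sketch shows one common way to combine SimPoint checkpoints into a score: weight each checkpoint's CPI by its cluster weight, scale by the program's total instruction count and the clock frequency to get an estimated run time, then divide the SPEC reference time by that estimate. This is an assumption about the general technique, not a description of the scripts used for this report; the function name, parameter names, and all numbers below are hypothetical.

```python
def estimate_spec_score(checkpoints, total_instructions, freq_hz, ref_time_s):
    """Estimate a SPEC-style ratio from sampled checkpoints.

    checkpoints: list of (weight, cpi) pairs, one per SimPoint cluster.
    All parameter names here are hypothetical, chosen for illustration.
    """
    weight_sum = sum(weight for weight, _ in checkpoints)
    # Weighted-average CPI; dividing by weight_sum tolerates weights
    # that do not add up to exactly 1.0.
    weighted_cpi = sum(weight * cpi for weight, cpi in checkpoints) / weight_sum
    # Estimated full-program run time at the given clock frequency.
    estimated_time_s = total_instructions * weighted_cpi / freq_hz
    # A SPEC CPU2006 ratio is reference time divided by measured run time.
    return ref_time_s / estimated_time_s


# Made-up example: three clusters of a hypothetical benchmark at 3 GHz.
example_checkpoints = [(0.5, 1.0), (0.3, 2.0), (0.2, 0.5)]
example_score = estimate_spec_score(
    example_checkpoints,
    total_instructions=1_000_000_000_000,  # hypothetical dynamic instruction count
    freq_hz=3_000_000_000,                 # 3 GHz, matching the table heading
    ref_time_s=8000.0,                     # hypothetical reference time
)
print(f"estimated score: {example_score:.2f}")
```

In this made-up case the weighted CPI is 1.2, so the estimated run time is 400 seconds and the ratio is 20. The estimate depends on how well the sampled segments represent the whole program, which is one reason such numbers can differ from a full run on real silicon.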

Editors: Zhihao Xu, Junxiong Ji, Zhuo Chen, Junjie Yu, Jiru Sun, Yanjun Li