跳转至

[XiangShan Biweekly 89] 20251110

Welcome to XiangShan biweekly column! Through this column, we will regularly share the latest development progress of XiangShan. We look forward to your contribution.

This is the 89th issue of the biweekly report.

XiangShan development-wise, the frontend team continued to fix performance bugs caused by BPU refactoring~~hoping that performance can reach pre-refactoring levels by the next biweekly report~~. The backend team is advancing the new design implementation of the V3 vector unit while optimizing the timing of the V2 backend. The MemBlock team is continuously pushing forward with the V3 refactoring and testing of various modules, fixing some V2 functional bugs, optimizing timing, and improving code quality.

Development Bonus

The dual-core Kunming Lake V2 successfully booted GUI OpenEuler 24.03 on FPGA at 50MHz! We also successfully ran LibreOffice and had a passionate game of DOOM! This marks a milestone in XiangShan's verification work and gives us greater confidence.

Due to the low frequency of the FPGA, the boot is a bit slow, so please be patient~

(Due to GitHub limitations, please watch on the official WeChat public account)

~~Please ignore the image quality issue, this is an extremely artistic shaky camera~~

Recent Developments

Frontend

  • RTL feature
  • Support Hardware Error exception introduced in RISC-V Privileged Spec v1.13 (#4770)
  • Enable UBTB fast training (#5145)
  • Remove takenCnt and valid fields of UBTB (#5157)
  • Implement globalTable for SC predictor (#5150)
  • Support PHR tracking branch target (#5169)
  • ABTB override fast predict, MicroTAGE and other new performance features are being explored
  • Decoupled BPU train and commit path refactoring are in progress
  • Bug Fix
  • Fix the X-state propagation issue caused by ABTB and MBTB SRAM read data not being latched (#5153)
  • Fix the issue related to ABTB training conditions (#5160)
  • Fix the issue of non-one-hot waymask in MBTB multi-hit flush logic (#5181)
  • Fix the issue of incorrect set index calculation for TAGE BaseTable and MBTB (#5155)
  • Fix the issue related to mismatch of PHR pointer metadata (#5139)
  • Fix the issue related to WriteBuffer write port wiring (#5143)
  • Fix the issue where IBuffer still records exceptions when the number of enqueued entries is 0 (#5147)
  • Timing optimization
  • (V2) Replace the dual-port SRAM of FTQ with registers (#5142)
  • Area optimization
  • Support ICache WayLookup to only store the first exception encountered after power-on/redirection (#4959, #5165)
  • Code quality improvements
  • Refactor the fast training interface of s1 predictors (#5144)
  • Refactor MBTB alignBank and fix the issue of incorrect bank index calculation (#5159)

Backend

  • RTL New Features
  • V3 vector unit new design implementation is in progress
  • Support OpenSBI PMU extension (#5172)
  • Parameterize PMP and PMA (#5177)
  • Timing
  • Continue to optimize the timing of V2 vector arithmetic units

MemBlock and Cache

  • RTL new features
  • The refactoring and testing of MMU, LoadUnit, StoreQueue, L2, etc. is ongoing
  • Bug fix
  • Fix the issue with the TLB refresh logic. When hfence.vvma, sfence.vma have v=1, or mbmc.BME = 1 and CMODE = 0, address matching is ignored, and all TLB entries are refreshed directly (#5114)
  • Fix the issue of splitting unaligned elements in VsegmentUnit without locking Paddr. This will cause incorrect access requests to be initiated (#5164)
  • Fix the issue that gpaddr and vaddr are different when TLB is in onlyS2 phase (#5189)
  • Timing
  • Fix the timing issue of PTW (#5170)
  • Fixing the timing issue of LoadReplayQueue and DCache (#5185)
  • Code quality improvements
  • Move IntBuffer for beu into L2Top for partition (#5110)
  • Remove all combinational logic in XSCore top module, only retain connection logic (#5120)
  • Parameterize PMP and PMA (#5177)
  • Debugging tools
  • Add some hardware performance counters in DCache and LDU (#5166)
  • Continue to improve the functionality of CHI related infrastructure CHIron (CHIron)

Performance Evaluation

SPECint 2006 est. @ 3GHz SPECfp 2006 est. @ 3GHz
400.perlbench 35.82 410.bwaves 67.23
401.bzip2 25.40 416.gamess 40.96
403.gcc 47.81 433.milc 45.06
429.mcf 60.26 434.zeusmp 51.80
445.gobmk 30.24 435.gromacs 33.58
456.hmmer 41.60 436.cactusADM 46.20
458.sjeng 30.35 437.leslie3d 47.88
462.libquantum 122.66 444.namd 28.86
464.h264ref 56.55 447.dealII 73.57
471.omnetpp 41.43 450.soplex 52.49
473.astar 29.12 453.povray 53.44
483.xalancbmk 72.71 454.Calculix 16.37
GEOMEAN 44.54 459.GemsFDTD 39.73
465.tonto 36.65
470.lbm 91.98
481.wrf 40.65
482.sphinx3 49.09
GEOMEAN 44.94

We use SimPoint to sample programs and create checkpoints images based on our custom format. The coverage of SimPoint clustering reaches 100%. Note that the above scores are estimated based on program segments rather than a complete SPEC CPU2006 evaluation, which may deviate from the actual performance of real chips.

Compilation parameters are as follows:

Compiler gcc12
Optimization level O3
Memory library jemalloc
-march RV64GCB
-ffp-contraction fast

Processor and SoC parameters are as follows:

Commit 1e9f1b4
Date 11/07/2025
L1 ICache 64KB
L1 DCache 64KB
L2 Cache 1MB
L3 Cache 16MB
LSU 3ld2st
Bus protocol TileLink
Memory latency DDR4-3200

Editors: Zhihao Xu, Junxiong Ji, Zhuo Chen, Junjie Yu, Yanjun Li