BenchCouncil Transactions on Benchmarks, Standards and
Evaluations, 2026
Full Length Articles
TraceRTL: Agile Performance Evaluation for
Microarchitecture Exploration
Zifei Zhang 1,2, Yinan Xu 1, Kaichen Gong 4, Sa Wang 1,2, Dan Tang 1,3 and Yungang Bao 1,2

1 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China; 2 University of Chinese Academy of Sciences, 100190, Beijing, China; 3 Beijing Institute of Open Source Chip, 100080, Beijing, China; 4 School of Information Science and Technology, ShanghaiTech University, 200000, Shanghai, China

Corresponding author: baoyg@ict.ac.cn
Received on 2 February 2026; Accepted on 29 March 2026
Abstract
While agile chip development methodologies have accelerated RTL design and simulation, performance evaluation re-
mains constrained by two key challenges: (1) limited benchmark availability due to incomplete peripheral/software simulation
environments or unavailable source code; (2) inefficient feature prototyping caused by the tight coupling between func-
tional correctness and performance evaluation, particularly for large-scale, error-prone microarchitectures. To address
these challenges, we propose TraceRTL, an agile, trace-driven performance evaluation methodology that decouples the
functional and performance components of CPU RTL designs. It introduces three contributions to the benchmarking com-
munity: (1) a trace-driven exploration framework that bypasses full functional correctness while preserving performance
behavior and supports replaying workload traces on RTL designs; (2) a quantitative analysis and mitigation methodology
to identify and reduce trace-driven performance discrepancies; (3) a trace transformation technique, TraceBridge, that
converts benchmark traces between different formats and instruction sets. Using TraceRTL, we have developed the first
trace-driven RTL CPU derived from XiangShan, a high-performance out-of-order RISC-V processor. TraceRTL achieves
performance accuracy of 99.87% and 99.86% on SPECint2017 and SPECfp2017, respectively. With TraceBridge, we
evaluate x86 Google workload traces on a RISC-V RTL CPU and reveal distinct memory-bound behavior.
Key words: Trace-driven simulation, Performance evaluation, Cross-ISA benchmarking
1. Introduction
Performance has always been a central consideration in CPU
development. As Moore’s Law slows and application demands
diversify, achieving further performance improvements has be-
come increasingly challenging. This highlights the importance
of microarchitecture exploration methodologies. A key question
is: given a baseline CPU design, how can we efficiently quantify
the performance impact of a proposed hardware feature using
representative benchmarks?
Among available evaluation methods for assessing CPU
design changes using diverse benchmarks, the most faithful
approach is to use the register-transfer level (RTL) implementa-
tion. As the definitive description of the microarchitecture, RTL
is the most reliable basis for assessing CPU microarchitecture
designs. Ultimately, any proposed feature must be implemented
and evaluated in RTL to determine its true performance impact.
However, since the RTL development process is time-
consuming, the computer architecture community has adopted
more efficient approaches to accelerate early-stage exploration
before implementing a proposed feature in RTL. As shown in
Fig. 1(a), software-based architectural simulators [1–8] model
low-level hardware components using high-level languages and
abstractions, enabling fast simulation and rapid design itera-
tion. Despite their high productivity in early-stage exploration,
the last mile remains unavoidable: performance must still be re-
evaluated at the RTL level after initial simulator studies, since
the additional modeling layer inevitably introduces discrep-
ancies that require substantial engineering efforts and costly
calibration with the actual implementation [9].
Another fundamental yet often overlooked challenge is the
benchmarking asymmetry across the development workflow.
While software simulators [5, 6, 8, 10] widely adopt trace-driven
methodologies to execute diverse benchmarks in trace format,
RTL models lack the capability to replay traces and face limited
benchmark availability due to immature simulation environments.
This benchmarking gap prevents a consistent and continuous
evaluation flow from early-stage modeling to final hardware
implementation.
© The Author 2026. BenchCouncil Press on Behalf of International Open Benchmark Council.
Figure 1. Microarchitecture exploration methods and workflows. (a) Approaches to microarchitecture exploration, classified by execution model (execution-driven vs. trace-driven) and platform (software vs. RTL). Software simulators: SimpleScalar (1997), GEMS (2005), M5 (2006), gem5 (2011), Sniper (2011), ZSim (2013), ChampSim (2022), Calipers (2022). RTL platforms: Rocket Chip Generator (2016), FireSim (2018), BOOM-Explorer (2021), XiangShan (2022). TraceRTL (this work) is the trace-driven RTL approach. (b) Current and TraceRTL performance evaluation workflows: the current workflow proceeds from workload preparation and RTL implementation through functional correctness to performance evaluation; TraceRTL enables agile performance evaluation without the functional-correctness step.
Recent advancements in RTL design and simulation offer strong potential for a seamless, progressive refinement workflow from early-stage exploration to the last mile. On the design side, high-level hardware construction languages [11–13] enable parameterized and reusable components, allowing rapid implementation and iteration of new microarchitectural ideas. On the evaluation side, efficient RTL simulation methods [14–18], especially FPGAs and emulators [19–22], have significantly accelerated large-scale RTL simulation. These capabilities have already been demonstrated in several open-source, industrially competitive CPUs [23–27], which provide realistic and accessible microarchitecture research platforms [28–32]. For example, it takes less than 200 minutes and approximately 300 lines of modified Chisel code to implement an instruction scheduling policy, PUBS [33], on XiangShan, a high-performance RISC-V CPU achieving >15 SPECint2006/GHz [26, 34].
These trends motivate us to adapt proven exploration tech-
niques from simulators directly to RTL, aiming to inherit the
agile workflow of simulator-based exploration while enabling
a seamless integration into RTL for last-mile evaluation.
To realize this opportunity, we propose TraceRTL, an
RTL-based performance evaluation methodology that derives
a trace-driven RTL model from an existing execution-driven
RTL implementation. As illustrated in Fig. 1(a), TraceRTL
reuses open-source, silicon-validated RTL designs as a solid
foundation for faithful microarchitecture exploration. It drives
performance-critical modules with pre-generated traces, en-
abling simulator-like agility for early-stage exploration on RTL.
By deriving the trace-driven model directly from RTL, it
inherently avoids costly last-mile calibration and preserves
performance fidelity for evaluation of the proposed feature.
One key motivation behind TraceRTL is to overcome the
execution-driven nature of current RTL designs. As illustrated
by the white boxes in Fig. 1(b), the modified CPU must first
pass full functional verification before any performance evalu-
ation can be conducted. This tight coupling forces every RTL
design modification to undergo complete implementation, veri-
fication, and lengthy simulation, even when the modification
is unrelated to performance. For example, evaluating opti-
mizations for virtualized, two-stage address translation requires
implementing complex privileged operations to guarantee cor-
rect functionality, whose functional details, however, do not
affect performance.
With TraceRTL, this strict dependency between functional
correctness and performance evaluation is eliminated. As high-
lighted in Fig. 1(b), TraceRTL enables agile performance
evaluation without first implementing or verifying unrelated
functional details. Additionally, it accepts trace inputs from
a broader range of real-world applications, including those
with unavailable source code, different ISAs, or peripheral
dependencies [10, 35–37], without requiring porting to an RTL
simulation. To realize these capabilities, however, we need to
address three key challenges.
1) Feature prototyping: Can we develop a trace-driven
RTL CPU with minimal modifications to the existing
execution-driven microarchitecture while preserving perfor-
mance accuracy? Our key insight is that hardware module
interfaces can be categorized into functional interfaces, which
determine what each instruction computes and where exe-
cution proceeds next, and performance-sensitive interfaces,
which determine how efficiently the instruction stream is re-
alized. Based on this distinction, TraceRTL selectively takes
over key interfaces to decouple the functional model while
preserving performance behaviors using externally supplied
traces. This preserves cycle-accurate performance accuracy
while eliminating the complexity of managing full functional
correctness.
2) Performance accuracy: Can we mitigate the perfor-
mance discrepancies introduced by trace-driven simulation?
Conventional trace-driven simulation often suffers from fidelity
loss due to the lack of necessary information to replicate
execution-driven behaviors. We observe that these information
gaps stem from two primary sources: intentional abstraction
and dynamic omission. We quantitatively analyze the perfor-
mance impact of these missing components, revealing their es-
sential role for fidelity. TraceRTL proposes a dynamic informa-
tion reconstruction mechanism that synthetically reconstructs
missing data, achieving high performance accuracy.
3) Broader workloads: Can we bridge the semantic gap
across diverse trace formats and ISAs? Industrial workloads
are valuable for microarchitecture exploration, but the scarcity
of RISC-V workloads necessitates cross-ISA transformation to
generate benchmark traces. This transformation is performed
only once during trace preparation. However, differences in
trace formats and ISAs hinder the direct execution of publicly
available traces on RTL CPU models. Since trace-driven simu-
lation relaxes the need for full functional correctness, TraceRTL
introduces TraceBridge, a trace transformation technique that
leverages instruction and register mapping to enable the replay
of traces from different formats and ISAs.
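As one concrete illustration of the register-mapping idea behind TraceBridge, the sketch below renames x86 trace registers onto RISC-V names so that data dependencies in the trace are preserved; the mapping table and the temporary-register fallback are our own assumptions, not the paper's actual tables.

```python
# Map a few x86-64 architectural registers onto RISC-V ABI register
# names (illustrative choices; any consistent bijection preserves the
# dependencies that out-of-order scheduling observes).
X86_TO_RISCV = {
    "rax": "a0", "rdi": "a1", "rsi": "a2",
    "rbx": "s1", "rbp": "s0", "rsp": "sp",
}

def map_operands(x86_regs):
    """Translate a list of x86 register operands to RISC-V names.
    Unmapped registers fall back to temporaries t0, t1, ... assigned on
    first use, so repeated uses keep the same dependency."""
    temps = {}
    out = []
    for r in x86_regs:
        if r in X86_TO_RISCV:
            out.append(X86_TO_RISCV[r])
        else:
            if r not in temps:
                temps[r] = f"t{len(temps)}"
            out.append(temps[r])
    return out

print(map_operands(["rax", "r12", "rdi", "r12"]))
```

Note that both uses of `r12` map to the same RISC-V name, which is what keeps the trace's data dependencies intact after transformation.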
To demonstrate the feasibility of TraceRTL, we develop
a trace-driven RTL model derived from XiangShan [26, 38].
It achieves performance accuracy of 99.87% and 99.86% on
SPECint2017 and SPECfp2017, respectively, reducing perfor-
mance discrepancies by 10.31× and 29.21× compared to a
calibrated XS-gem5 model. By leveraging TraceBridge, we
evaluate x86-based Google workload traces [36] on Xiang-
Shan, and reveal distinct memory-bound behavior compared
to SPECint2017.
TraceRTL expands the possibilities for microarchitecture
research by supporting both RTL-based exploration and seam-
less integration with simulator-based workflows. By preserving
a simulator-like, trace-driven environment for workloads and
simulation, it effectively bridges early-stage exploration on
simulators and last-mile RTL evaluation.
To summarize, this paper makes the following contribu-
tions.
- We propose TraceRTL, bringing trace-driven simulation to RTL CPUs for agile microarchitecture exploration.
- We quantify the sources of performance discrepancies and implement dynamic information reconstruction to achieve high performance accuracy.
- We propose TraceBridge, which enhances trace compatibility to expand the sources of benchmark workloads.
- We demonstrate TraceRTL by using x86 workload traces collected from Google warehouse-scale computers for performance evaluation of XiangShan, a RISC-V CPU.
2. Background
2.1. Out-of-Order Microarchitecture
Modern CPUs improve performance primarily by exploiting
parallelism and speculation. The front-end speculatively fetches
instructions using branch prediction, while the back-end de-
codes, schedules, and issues them to execution units for
computation and to memory subsystem for data access.
The efficiency of this pipeline depends on several critical
microarchitectural components. Branch prediction and instruc-
tion fetching determine the instruction supply rate. Execution
pipelines and scheduling queues affect throughput. The mem-
ory hierarchy bridges the large speed gap between CPU and
DRAM by caching frequently used data. The memory manage-
ment unit (MMU) accelerates address translation by caching
recently used address mappings near the CPU.
2.2. Exploration on RTL
While RTL models offer higher accuracy for design space explo-
ration, directly evaluating performance on RTL presents several
challenges, including the inflexibility of traditional hardware
description languages, slow simulation speeds, and the lack of
open-source RTL processors. Recent efforts have focused on
these issues.
Flexibility. Many emerging high-level hardware descrip-
tion languages [12, 13, 39] offer enhanced expressiveness and
parameterization that accelerate the development of microarchitectures. New hardware design methodologies [11] are also
proposed to further improve design modularity.
Simulation Speed. Novel RTL simulation techniques
have been proposed to accelerate software-based [14–16] or
hardware-based [19, 21, 22] simulation of RTL designs.
Additionally, sampling-based methods [40–42] estimate full-
program performance by aggregating results from several rep-
resentative program segments.
Open-Source RTL Processors. With the rapid growth of
the RISC-V open-source community, a number of RTL pro-
cessors have emerged, including in-order designs [43, 44] and
out-of-order designs such as BOOM [23–25], XuanTie-910 [27],
and XiangShan [26]. These designs provide accessible and re-
alistic platforms for microarchitecture research, enabling agile
exploration directly on RTL.
2.3. Simulation Methodologies
In computer architecture research, performance evaluation
of novel designs predominantly relies on two core method-
ologies: execution-driven and trace-driven simulations. These
approaches fundamentally differ in how they provide program
stimuli to performance models, leading to distinct trade-offs
between fidelity, flexibility, and simulation speed.
The execution-driven methodology emulates the behavior of
real CPUs within the performance model, such as fetching, de-
coding, scheduling and executing instructions. This approach
is inherent to RTL models [23–26, 43] and is also implemented in many software simulators [1–4]. By coupling functional execution with performance modeling, this approach captures
microarchitecture-dependent dynamic behaviors, such as spec-
ulative execution and wrong-path effects, thereby offering high
fidelity. However, this accuracy comes at the cost of significant
complexity, increased error-proneness, and reduced simulation
speed.
In contrast, the trace-driven methodology decouples the functional model from the performance model by replaying pre-generated instruction traces, which include architectural information such as instruction semantics, instruction addresses, memory accesses, and branch outcomes [5, 6, 8, 10, 45]. These
traces are often generated using instrumentation tools like
Pin [46], DynamoRIO [47], and Valgrind [48], or obtained
from public pre-generated traces [10, 35, 36]. This decoupling
affords higher flexibility, enabling researchers to focus on mi-
croarchitectural optimization. However, this flexibility often
comes at the cost of reduced fidelity, as traces lack dynamic
microarchitecture-dependent information.
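For illustration, a minimal trace record carrying the architectural fields listed above might look like the following; the field names and layout are assumptions, not any real trace format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceInstr:
    pc: int                            # virtual instruction address
    encoding: int                      # raw instruction bits (often omitted)
    mem_vaddr: Optional[int] = None    # memory access address, if load/store
    is_branch: bool = False
    branch_taken: bool = False
    branch_target: Optional[int] = None

# A tiny trace: an add, a load, and a taken jump.
trace = [
    TraceInstr(pc=0x100, encoding=0x00A50533),
    TraceInstr(pc=0x104, encoding=0x0005A503, mem_vaddr=0x8000_0000),
    TraceInstr(pc=0x108, encoding=0x0000006F, is_branch=True,
               branch_taken=True, branch_target=0x200),
]
```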
3. Challenge
Agile performance evaluation requires rapid feature prototyp-
ing, support for extensive workloads, and fast simulation. To
meet these goals at the RTL level, trace-driven simulation offers
a promising approach by decoupling performance and func-
tional models and supporting trace-based workloads. However,
integrating trace-driven simulation into existing execution-
driven CPU RTL models introduces non-trivial challenges.
Publicly available traces often vary in trace format, lack in-
formation such as instruction encodings, and are sometimes
generated from different instruction sets.
3.1. Trace-driven RTL Integration
Transforming a complex execution-driven CPU RTL model
into a trace-driven implementation presents unique challenges
compared to building an RTL model from scratch or driving
individual RTL modules independently. In addition to supply-
ing stimuli to existing RTL modules, a trace-driven model must
precisely control the instruction flow based on external traces
while maintaining the original performance behavior.
3.2. Trace-driven Performance Discrepancies
Trace-driven simulation inherently suffers from fidelity loss due
to the lack of necessary information. This gap stems from two
primary sources: intentional abstraction and dynamic infor-
mation omission. First, to balance confidentiality and storage
overhead, conventional traces often omit critical details such as
operand values and instruction opcodes. Second, static traces
fail to capture dynamic execution states, such as wrong-path
instructions and page table walks, which only emerge during
runtime. The absence of these microarchitectural side effects
prevents the accurate replication of execution-driven behaviors,
potentially leading to significant performance discrepancies.
3.3. Trace Compatibility
Trace-driven approaches can bypass the limitations of simu-
lated peripheral environments, thereby enhancing the cover-
age of supported workloads. However, due to confidentiality
constraints, instruction source code is often unavailable for
publicly accessible trace files [10, 35, 36]. Another scenario
involves target applications that require evaluation but have
not been adapted to the target instruction set, rendering direct
assessment infeasible.
Instruction sets share commonalities but also exhibit sig-
nificant differences, which hinder direct trace porting. For
example, differences in general-purpose register conventions,
instruction encodings and sizes, PC alignment rules, and the
range of direct branch instructions all impose constraints on
cross-instruction-set trace evaluation. These challenges are par-
ticularly pronounced for RTL models, which typically lack
sufficient abstraction capabilities.
4. TraceRTL Design
To enable agile performance evaluation of RTL designs, we
first propose a trace-driven simulation methodology at the RTL
level (§ 4.1) while preserving high performance accuracy (§ 4.2).
Building on this, we introduce TraceBridge, a trace transforma-
tion method that enhances compatibility by enabling the replay
of traces from different formats and instruction sets (§ 4.3).
4.1. Trace-Driven Microarchitecture Design
We decompose the CPU into core components and describe how
each component is driven by the trace. Interfaces, defined as the
set of I/O signals between modules, can be driven to control
the module’s behavior. By driving the key interfaces with the
information in traces, TraceRTL replaces the functional model
with external traces while maintaining the original performance
behavior. This section describes the design of trace-driven in-
tegration to meet its objectives: (1) driving RTL modules with
external instruction traces, (2) enforcing the CPU model to
conform to the trace instruction flow, and (3) identifying and
mitigating performance discrepancies inherent in trace-driven
simulation.
4.1.1. Trace-Driven RTL Modules
Our key insight is that hardware module interfaces can be
classified into functional interfaces, which determine what
each instruction computes and where execution proceeds
next (e.g., arithmetic, branching, or exception handling),
and performance-sensitive interfaces, which determine how
efficiently the instruction stream is realized (e.g., branch
prediction, cache access, and memory prefetching). Based
on this distinction, we analyze key module behaviors and
drive performance-sensitive modules using external instruction
traces, preserving original performance characteristics without
requiring full functional execution.
Branch predictor. The branch predictor’s performance-
critical interfaces primarily include two types: training and
prediction. The predictor is trained on the committed branch
outcomes. Therefore, instructions on the mis-speculated path,
which are flushed from the pipeline, leave no side effects. By
substituting the branch outcomes with trace information, which
includes branch direction and target, we are able to stimulate
the training process. For prediction, the predictor takes the
current program counter (PC) and branch history to generate the next instruction fetch request. Although prediction occurs at a speculative stage, the PC and history for correct-path instructions are consistent between execution-driven and trace-driven simulations.
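To make the training substitution concrete, the sketch below replays committed branch outcomes from a trace into a generic bimodal (2-bit saturating counter) predictor; this textbook predictor stands in for the real RTL predictor, which is far more elaborate.

```python
TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE  # 2-bit counters, init weakly not-taken

def predict(pc):
    """Index by PC; counter values 2..3 predict taken."""
    return counters[(pc >> 2) % TABLE_SIZE] >= 2

def train(pc, taken):
    """Update only on committed outcomes supplied by the trace."""
    i = (pc >> 2) % TABLE_SIZE
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

# Replay committed (pc, taken) branch outcomes recorded in a trace.
for pc, taken in [(0x108, True), (0x108, True), (0x108, True)]:
    train(pc, taken)

print(predict(0x108))  # True after repeated taken outcomes
```

Because only committed outcomes train the tables, flushed wrong-path branches leave no side effects, matching the behavior described above.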
Instruction fetch. The instruction fetch unit obtains fetch
requests from the branch predictor and retrieves instructions
from the traces. We propose an interval match mechanism to simulate the fetch bandwidth, as shown in Fig. 2. A fetch request typically specifies a contiguous instruction interval I_bp defined by starting and ending addresses. The fetch unit forwards the request to TraceReader, which extracts a continuous sequence of trace instructions I_trace. Instructions in I_if, which are common to both I_bp and I_trace, are then sent to subsequent pipeline stages for execution. When the starting address does not match the beginning of the trace, I_if is empty, preventing any instructions from being fetched. Consequently, the impact of instructions on the mis-predicted path cannot be modeled. § 4.2.1 presents a refined design to address this limitation.

Figure 2. Trace-driven instruction fetch with interval match mechanism.
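The interval match can be sketched as a simple intersection with a start-address check; a minimal illustration in Python (variable names and data layout are ours, not the RTL implementation's):

```python
def interval_match(i_bp, trace_pcs):
    """i_bp = (start, end): the predictor's fetch request interval.
    trace_pcs: PCs of the next contiguous trace instructions (I_trace).
    Returns the PCs to fetch this cycle (I_if)."""
    start, end = i_bp
    # If the request's starting address does not match the beginning
    # of the trace, I_if is empty and nothing is fetched.
    if not trace_pcs or trace_pcs[0] != start:
        return []
    # Otherwise fetch the instructions common to I_bp and I_trace.
    return [pc for pc in trace_pcs if start <= pc < end]

# Fig. 2 example: request [0x100, 0x110), trace ends at the jump at 0x108.
print(interval_match((0x100, 0x110), [0x100, 0x104, 0x108]))
```

On a start-address mismatch (e.g., a mispredicted fetch request), the function returns an empty list, modeling the blocked fetch described above.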
Out-of-order backend. The backend relies on instruction
encodings to stimulate decoding, register renaming, dynamic
scheduling and execution. These encodings are directly supplied
from the trace. Alternatively, a more aggressive approach is to
provide the results of the decoding directly to drive renaming
and scheduling, although this is beyond the scope of this work.
Particular units, such as FDivSqrt, may need optional operand data to model execution latency accurately.
Cache hierarchy. Cache behavior is mainly influenced by
access addresses. Instruction addresses are derived from fetch
requests generated by the branch predictor. Data addresses, on
the other hand, are dynamically calculated from the operands,
which are invalid in trace-driven simulation. Therefore, memory
access addresses should be included in traces to model mem-
ory behavior. Special modules like the indirect memory access
prefetcher need extra information.
Memory management unit. The virtual-to-physical ad-
dress translation and page-table walk require in-memory page
table entries (PTEs) that are typically absent in traces [49]. We
employ a dynamic page table generation approach: For each in-
struction in the trace, we traverse the page tables using its
virtual address. If a required PTE is invalid, a new page is
allocated, and the corresponding PTE is initialized. This pro-
cess continues recursively until reaching the leaf page, which is
initialized with the physical address in traces.
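A minimal runnable sketch of this on-demand construction, assuming a three-level, Sv39-style table with 4 KiB pages and a dictionary-backed page-table memory (the actual levels and PTE format in the design may differ):

```python
PAGE_SIZE = 4096
MAX_LEVEL = 3                 # assumed three-level (Sv39-style) table
page_table = {}               # pteAddr -> {"valid": bool, "ppn": int}
_next_free = [0x8000_0000]    # simple bump allocator for new pages

def allocate_page():
    pa = _next_free[0]
    _next_free[0] += PAGE_SIZE
    return pa

def pte_addr(va, level, page_base):
    # 9 virtual-address bits per level, most-significant level first
    vpn = (va >> (12 + 9 * (MAX_LEVEL - 1 - level))) & 0x1FF
    return page_base + vpn * 8

def page_walk(va, pa, root=0x1000):
    """Walk the table for va; allocate any invalid intermediate PTE on
    the way, and map the leaf to the physical address pa from the trace."""
    page_base = root
    for level in range(MAX_LEVEL):
        addr = pte_addr(va, level, page_base)
        pte = page_table.get(addr)
        if pte is None or not pte["valid"]:
            target = pa if level == MAX_LEVEL - 1 else allocate_page()
            pte = {"valid": True, "ppn": target >> 12}
            page_table[addr] = pte
        page_base = pte["ppn"] << 12

# Map one trace instruction's (virtual, physical) address pair.
page_walk(0x1234_5000, 0x9000_0000)
```

Re-walking an already mapped address allocates nothing new, matching the "only if a required PTE is invalid" rule in the text.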
4.1.2. Trace-Controlled Instruction Flow
TraceRTL controls the instruction flow by managing branch in-
structions, interrupts, and exceptions, while ensuring processor compliance through instruction stream correctness checks.
Branch instruction. We replace the branch execution unit's outcomes with the target and condition outcome recorded in the trace to control the program's instruction flow.
Exception and interrupt. Traps, including exceptions like
page faults and interrupts like timer interrupts, may be trig-
gered by programs, devices, and the operating system. Traps affect
control flow and pipeline redirection, as illustrated in Fig. 3(a).
These are intercepted and re-injected according to the trace.
Specifically, trace-recorded exceptions are triggered as illegal
instructions, redirecting to the target in trace, as illustrated
in Fig. 3(b). This design ensures that exceptions are preserved
without relying on full functional execution.
Table 1. Key CPU module behaviors and their corresponding trace-driven stimuli in TraceRTL.

Branch Predictor
  Key behavior: Prediction uses PC and history; training uses committed branch outcomes.
  Trace-driven stimulus: Use current PC and history for prediction; use branch outcomes from the trace for training.

Instruction Fetch
  Key behavior: Fetch request defines an instruction interval; wrong-path instructions.
  Trace-driven stimulus: Apply the interval match mechanism to simulate fetch bandwidth; generate wrong paths on mismatch.

Instruction Execution
  Key behavior: Decode, rename, schedule, and execute.
  Trace-driven stimulus: Use instruction encodings and optional operands from the trace.

Instruction Flow
  Key behavior: Branch instruction outcomes; exceptions and interrupts redirect the pipeline.
  Trace-driven stimulus: Intercept branch outcomes and exception generation; support redirects; check the instruction flow.

Cache Hierarchy
  Key behavior: Access cache by addresses.
  Trace-driven stimulus: Memory addresses from the trace; instruction addresses from the branch predictor; optional data from the trace.

MMU
  Key behavior: Virtual-to-physical address translation; access memory for page table entries.
  Trace-driven stimulus: Construct page tables according to addresses from the trace.
Figure 3. Exception and interrupt management in TraceRTL. (a) Native exceptions and interrupts triggered by the CPU pipeline and peripherals, redirecting to the exception handler. (b) Native exceptions intercepted; exceptions recorded in the trace triggered as illegal instructions, redirecting to the target in the trace.
Instruction stream check. A fundamental requirement of
trace-driven simulation is that the performance model must be
guided by the trace, a key aspect of which is to ensure its exe-
cution adheres to the provided instruction stream. We capture
the processor’s actual instruction stream through committed
instructions and compare it against the trace. Differences between the streams indicate implementation flaws in the RTL model itself or in the trace-driven framework.
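A minimal sketch of such a check, comparing committed PCs against the trace in program order (the interfaces and data layout are assumptions):

```python
def check_stream(committed_pcs, trace_pcs):
    """Compare the committed instruction stream against the trace.
    Returns the index of the first divergence, or None if consistent."""
    for i, (c, t) in enumerate(zip(committed_pcs, trace_pcs)):
        if c != t:
            return i  # divergence: flaw in the RTL model or framework
    return None

print(check_stream([0x100, 0x104, 0x108], [0x100, 0x104, 0x108]))  # None
print(check_stream([0x100, 0x104, 0x10c], [0x100, 0x104, 0x108]))  # 2
```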
4.1.3. Overall
In summary, TraceRTL provides a general and adaptable
framework for trace-driven RTL performance evaluation. It is
designed to evolve naturally with RTL designs, require minimal
effort across microarchitectural iterations, remain applicable
across diverse microarchitectures, and flexibly support various
performance optimizations.
Extending TraceRTL to new architectures. We sum-
marize the trace-driven transformation methodology in Table 1.
TraceRTL employs an interface-based modification strategy
that reduces modification overhead while accommodating vari-
ations in module design. The processor module partitioning
methodology is universal across different microarchitectures,
making TraceRTL a reusable and microarchitecture-agnostic
framework for RTL performance evaluation. The specific modi-
fications may vary depending on processor-specific designs. For
instruction fetch, for instance, in-order processors commonly
fetch one or two instructions per cycle, which does not require
the interval match described in § 4.1.1. In contrast, some high-
performance processors may fetch instructions spanning two
intervals per cycle, thus necessitating two interval-match op-
erations. For CPU-driven accelerators, such as matrix units,
the necessary execution information can also be recorded into
trace instructions and dispatched accordingly.
Applicability for microarchitecture features. TraceRTL
is particularly advantageous for evaluating functionally com-
plex yet performance-critical features (§ 7.2). Beyond function-
ality, it captures fine-grained timing effects that are difficult
to model accurately at higher abstraction levels. For exam-
ple, variations in microarchitectural timing may critically affect
the overall performance (§7.4). It can also evaluate microar-
chitectural optimizations in the same way as conventional
trace-driven simulators (e.g., branch prediction, prefetching, re-
placement, memory dependence prediction). With additional
trace information, TraceRTL can be extended to model ad-
vanced optimizations, such as value prediction (with execu-
tion results) and indirect memory prefetching (with memory
values).
4.2. Trace-driven Performance Discrepancy Mitigation
To achieve high accuracy, trace-driven simulation should strive
to mimic the behaviors of execution-driven simulation. This
section details our methodology for bridging this gap by en-
hancing trace-driven simulation of the frontend fetch unit
through wrong-path simulation, refining execution latency
via operand and opcode provisioning, and maintaining MMU
fidelity through dynamic page table construction.
4.2.1. Fetch: Wrong-Path Simulation
Out-of-order processors may execute instructions that are later
discarded due to events like branch mispredictions. These in-
structions, although executed, are flushed by pipeline redirect
operations, preventing them from affecting the architectural
state of the CPU, such as the register file or memory.
Wrong-path instructions’ performance impact, particu-
larly on the cache hierarchy, cannot be ignored. The impact
on the cache can be categorized into prefetching and pollution, leading to positive and negative effects, respectively. Fig. 4 presents a
code example divided into three sections: (1) Code1, executed
unconditionally before the branch; (2) Code2, located within
one branch; and (3) Code3/Code4, placed outside the branch’s
influence, further categorized into the proximate Code3 and
the distant Code4. Upon a mispredicted branch, Code2 is exe-
cuted, and if its execution is swift, Code3 may follow. Once the
branch is resolved, speculatively fetched instructions of Code2
and Code3 are discarded, with Code2 potentially polluting the
cache and Code3 prefetching the cache.
*b = 1; /* Code1 */
a = *b;
if (a == 0)
a = *c;/* Code2 */
a = *d; /* Code3 */
a = *e; /* Code4 */
Figure 4. Code example demonstrating wrong-path instruction
generation.
We statistically analyze the number and addresses of mem-
ory instructions on both correct and wrong paths in the
out-of-order processor XiangShan, focusing on instructions sent
to the load pipeline. These addresses are aligned to cache-
line size. We categorize the address space into three types: (1)
exclusive-arch-path: only accessed by correct path instructions,
(2) exclusive-wrong-path: only accessed by wrong-path instruc-
tions and (3) overlapped: accessed by both paths. As shown
in Fig. 5, we found that most of the address space falls into
types (1) and (3). Therefore, we tentatively conclude that prefetching exerts the predominant influence.
Figure 5. Percentage of memory interval weighted by load access times
for the SPEC CPU2017. Each bar represents a sub-benchmark, sorted
according to “ExclusiveArchPath”.
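The three-way categorization above can be sketched as set operations over cacheline-aligned addresses; a minimal illustration (the 64-byte line size and the input format are assumptions):

```python
LINE = 64  # assumed cacheline size in bytes

def categorize_lines(arch_addrs, wrong_addrs):
    """Split cacheline-aligned load addresses from the correct
    (architectural) and wrong paths into the three address-space types."""
    arch = {a // LINE for a in arch_addrs}
    wrong = {a // LINE for a in wrong_addrs}
    return {
        "exclusive-arch-path": arch - wrong,   # only correct path
        "exclusive-wrong-path": wrong - arch,  # only wrong path
        "overlapped": arch & wrong,            # accessed by both paths
    }

r = categorize_lines([0x100, 0x140, 0x180], [0x140, 0x400])
print({k: sorted(v) for k, v in r.items()})
```

In a real analysis the two address streams would come from the execution-driven RTL simulation, with each stream weighted by access counts as in Fig. 5.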
Based on the observation, we focus on simulating the
prefetching influence by taking the instructions at correct path
as wrong-path instructions. The process involves the follow-
ing steps: (1) When a branch misprediction occurs, we check
whether the fetch request’s starting address exists in traces
within a fixed instruction window; (2) If it exists, the instruc-
tions in the trace are sent to subsequent pipeline stages as
wrong-path instructions. The instruction fetch unit is blocked for simplicity. These instructions are not discarded from
the traces; (3) Once the branch instruction is resolved and the
pipeline is redirected, the correct fetch request is issued.
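The window lookup in steps (1) and (2) can be sketched as follows. This is a minimal C++ sketch; the structure and function names (`TraceInst`, `find_wrong_path_start`) are illustrative, not TraceRTL's actual interfaces:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// One decoded trace entry (simplified; field names are illustrative).
struct TraceInst {
    uint64_t pc;
    bool wrong_path;  // set when the entry is replayed speculatively
};

// Step (1): on a mispredicted branch, look for the speculative fetch
// address in a fixed-size window of upcoming trace entries. If found,
// step (2) replays those entries as wrong-path instructions without
// consuming them from the trace.
std::optional<size_t> find_wrong_path_start(
        const std::vector<TraceInst>& trace,
        size_t next_idx,    // next entry the fetch unit would consume
        uint64_t fetch_pc,  // speculative fetch start address
        size_t window) {    // search window, in instructions
    size_t end = std::min(trace.size(), next_idx + window);
    for (size_t i = next_idx; i < end; ++i)
        if (trace[i].pc == fetch_pc)
            return i;       // replay from here as wrong-path
    return std::nullopt;    // not in trace: block fetch until redirect
}
```

If the fetch address is not found within the window, the fetch unit simply stalls until the redirect, matching the basic approach of § 4.1.1.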
4.2.2. Execution: Instruction Opcode Provisioning
Conventional trace-driven simulators often operate with partial instruction encodings. An instruction encoding carries two types of information: the opcode, which determines functionality (e.g., ADD or SUB), and register indices, which determine dependencies for out-of-order scheduling. Explicit instruction opcodes are frequently
Algorithm 1 Dynamic Page Table Construction
1: procedure InstructionWalk(instList)
2:     for inst in instList do
3:         if inst's PC is valid then
4:             PageWalk(inst.VirtualPC, inst.PhysicalPC)
5:         end if
6:         if inst's memory address is valid then
7:             PageWalk(inst.VirtualAddr, inst.PhysicalAddr)
8:         end if
9:     end for
10: end procedure
11: procedure PageWalk(va, pa)
12:     pageBase = PageTableRootAddr
13:     for level := 0 to MaxLevel − 1 do
14:         pteAddr = getPteAddr(va, level, pageBase)
15:         pte = readPageTable(pteAddr)
16:         if pte not valid then
17:             if level == MaxLevel − 1 then
18:                 pte = genPte(pa)              ▷ Leaf entry: map to physical address
19:             else
20:                 pte = genPte(AllocatePage())  ▷ Allocate next-level table
21:             end if
22:             writePageTable(pteAddr, pte)
23:         end if
24:         pageBase = pte.ppn << 12
25:     end for
26: end procedure
abstracted or omitted, either for confidentiality or because software simulators model the microarchitecture at a high level of abstraction. Consequently, instead of providing detailed opcodes, trace instructions are categorized into coarse-grained functional groups: (1) control flow (unconditional direct, conditional direct, and indirect jumps); (2) memory access (loads and stores); and (3) computation (integer and floating-point).
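To make the grouping concrete, a minimal C++ sketch of such a coarse-grained mapping is shown below; the representative opcodes mirror the substitution evaluated in § 6.2.2, and the names are illustrative:

```cpp
#include <string>

// Coarse-grained functional groups typically available in traces.
enum class InstClass { CondBranch, DirectJump, IndirectJump,
                       Load, Store, IntCompute, FpCompute };

// Map a coarse class to one representative RISC-V opcode; this is the
// kind of substitution whose accuracy cost is quantified in § 6.2.2.
// The representative choices here are illustrative.
std::string representative_opcode(InstClass c) {
    switch (c) {
        case InstClass::CondBranch:   return "beq";
        case InstClass::DirectJump:   return "jal";
        case InstClass::IndirectJump: return "jalr";
        case InstClass::Load:         return "ld";
        case InstClass::Store:        return "sd";
        case InstClass::IntCompute:   return "add";
        case InstClass::FpCompute:    return "fadd.d";
    }
    return "";
}
```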
Our work focuses on quantifying the performance model-
ing deviations induced by this loss of fine-grained opcodes.
Specifically, we investigate how substituting precise opcodes
with coarse-grained categories impacts simulation fidelity. This
analysis aims to isolate the impact of operation abstraction
from other simulation variables, providing a quantitative un-
derstanding of the accuracy trade-offs in abstracted trace
modeling.
4.2.3. Execution: Operand Provisioning
Some operations, such as division, floating-point division, and square root, are implemented in a blocking manner, and their execution latency varies with the operands. This source of performance error is often neglected: simulators typically implement these operations with a fixed latency.
Although these instructions are relatively few, their long
execution cycles and low degree of concurrency amplify their
performance impact. To achieve more accurate simulation for
these types of instructions, we record their operands in traces.
4.2.4. MMU: Dynamic Page Table Construction
User-space programs use virtual addresses, which must be
translated to physical addresses by the memory management
unit (MMU) before accessing the cache or main memory. In
the MMU, the virtual address first consults the L1 translation
lookaside buffer (TLB). If L1 TLB hits, the physical address
6
TraceRTL: Agile Performance Evaluation
is obtained directly. In case of L1 TLB miss, the virtual ad-
dress will be sent to a larger L2 TLB or hardware page table
walker to traverse the memory-resident page tables to find the
physical address corresponding to the virtual address, which
involves multiple memory accesses, especially in hypervisor en-
vironments. Page table caches are used to speed up page table
walks. In summary, the hit rates of TLB and page table cache,
as well as page table walker’s memory latency, are crucial for
MMU-sensitive programs.
To simulate the behavior of MMU and minimize the modi-
fications on RTL modules, we need to provide a self-consistent
page table for the MMU. However, traces typically contain only
the physical and virtual addresses, but not the page table [49].
Therefore, we employ a dynamic page table generation method,
as illustrated in Algorithm 1. By iterating over each instruction in the traces and traversing the page tables based on the virtual address, we allocate new page frames and initialize any invalid page table entries along the walk, until reaching the leaf page. The leaf entry is then initialized with the corresponding physical address. After dynamically generating the page table,
when a TLB miss occurs, the memory-resident page tables are
traversed.
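Algorithm 1 can be sketched in C++ as follows, assuming a simplified Sv39-like three-level table with 4 KiB pages; the helper names (`alloc_page`, `pte_addr`) and the memory map are illustrative, not TraceRTL's actual implementation:

```cpp
#include <cstdint>
#include <unordered_map>

// Simulated physical memory holding PTEs: address -> PTE value.
// A real implementation writes into the simulator's memory image.
static std::unordered_map<uint64_t, uint64_t> mem;
static uint64_t next_free_page = 0x80000000;  // illustrative free-page pool

constexpr int kLevels = 3;       // three translation levels, as in Sv39
constexpr uint64_t kValid = 1;   // PTE valid bit

uint64_t alloc_page() {          // allocate a fresh page frame
    uint64_t p = next_free_page;
    next_free_page += 0x1000;
    return p;
}

// Address of the PTE for `va` at `level` (9 index bits per level).
uint64_t pte_addr(uint64_t va, int level, uint64_t page_base) {
    int shift = 12 + 9 * (kLevels - 1 - level);
    return page_base + ((va >> shift) & 0x1ff) * 8;
}

uint64_t gen_pte(uint64_t pa) { return ((pa >> 12) << 10) | kValid; }

// Walk the table for `va`, creating missing entries; the leaf maps to `pa`.
void page_walk(uint64_t va, uint64_t pa, uint64_t root) {
    uint64_t page_base = root;
    for (int level = 0; level < kLevels; ++level) {
        uint64_t addr = pte_addr(va, level, page_base);
        uint64_t pte = mem.count(addr) ? mem[addr] : 0;
        if (!(pte & kValid)) {
            pte = (level == kLevels - 1)
                      ? gen_pte(pa)             // leaf: map to physical address
                      : gen_pte(alloc_page());  // allocate next-level table
            mem[addr] = pte;
        }
        page_base = (pte >> 10) << 12;  // next-level table base (ppn << 12)
    }
}

// Read-only re-walk, as the RTL page table walker would perform it:
// returns the physical page base the leaf entry maps `va` to.
uint64_t translate(uint64_t va, uint64_t root) {
    uint64_t base = root;
    for (int level = 0; level < kLevels; ++level) {
        uint64_t pte = mem[pte_addr(va, level, base)];
        base = (pte >> 10) << 12;
    }
    return base;
}
```

Calling `page_walk` for every instruction's PC and memory address yields a self-consistent page table that the unmodified RTL page table walker can traverse on a TLB miss.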
4.3. Trace Compatibility with TraceBridge
We introduce a trace transformation methodology, Trace-
Bridge, to bridge the incompatibilities in trace formats and
instruction sets. To support trace-driven simulation, the trace
must contain at least three categories of information: (1) pri-
mary instruction type, including branch types, computation,
and memory operations; (2) execution guidance, including PC,
branch target and conditional result, and memory address;
(3) register dependencies to model instruction-level parallelism.
Such information is typically included in the trace format
of dynamic instrumentation tools [49] and publicly available
traces [10, 35, 36], where fine-grained semantic information such as instruction opcodes is sometimes missing.
TraceBridge retains the key information from the trace,
transforming its format to be compatible with the target model
by refining the execution semantics. However, trace-driven RTL
models pose additional low-level challenges due to their rich
details: (1) instruction correspondence and register seman-
tics; (2) difference in instruction encoding size and program
counter (PC) alignment constraints; (3) variations in branch
offset ranges.
The primary principle of TraceBridge is to maintain per-
formance semantics consistency. This ensures that the perfor-
mance characteristics of the original program are reflected in
the target architecture. For confidentiality, public traces omit
instruction encodings [10] or provide instruction categories [36].
To address this, we observe that an instruction can encom-
pass multiple performance semantics, which fall into four types:
Load, Computation, Store, Branch. To maintain performance
semantics consistency, we map each individual performance se-
mantic to its corresponding instruction(s) in the target ISA.
A single x86 instruction, which may encompass multiple micro-
operations, is translated into an equivalent sequence of RISC-V
instructions. For instance, the x86 RET instruction is mapped
to two RISC-V instructions (LOAD and JR), and x86 memory
accesses exceeding the width of a single RISC-V instruction are
decomposed into multiple instructions to preserve the access
range. The necessary mapping results in instruction inflation,
which is analyzed in § 7.1. In the case of missing opcodes,
compute instructions are mapped to representative types such
as [F]ADD, [F]MUL, and CONVERT due to limited informa-
tion in the traces. For ISAs with a flags mechanism, such as x86, spare registers can be employed to establish inter-instruction dependencies. Furthermore, special handling for architecturally
significant registers, like the return address register, guarantees
the correct correspondence between x86 call/return operations
and their RISC-V counterparts.
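The performance-semantics mapping can be sketched as follows; the rules shown, and names such as `map_to_riscv`, are illustrative simplifications of TraceBridge's rule set:

```cpp
#include <string>
#include <vector>

// Performance semantics carried by one source-ISA trace instruction.
struct Semantics {
    bool load = false, store = false, compute = false, branch = false;
};

// Expand one x86-style trace instruction into a RISC-V instruction
// sequence that preserves each performance semantic.
std::vector<std::string> map_to_riscv(const Semantics& s, bool indirect) {
    std::vector<std::string> out;
    if (s.load)    out.push_back("ld");   // memory-read semantic
    if (s.compute) out.push_back("add");  // representative compute op
    if (s.store)   out.push_back("sd");   // memory-write semantic
    if (s.branch)  out.push_back(indirect ? "jr" : "j");
    return out;
}
```

An x86 RET carries load and indirect-branch semantics (it pops the return address and jumps to it), so a trace entry with those two semantics set expands to the LOAD/JR pair described above.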
Origin traces (PC, INSTR):
    0x100  math
    0x101  math
    0x105  ret 0x201
    0x201  math
    0x203  math
    0x210  j 0x150
    0x150  math fp
    0x154  math fp

PC map (old PC → new PC):
    0x100 → 0x100
    0x101 → 0x104
    0x105 → 0x108
      --  → 0x10C
    0x150 → 0x150
    0x154 → 0x154
    0x201 → 0x200
    0x203 → 0x204
    0x210 → 0x208

RISC-V traces (PC, INSTR):
    0x100  add
    0x104  add
    0x108  load
    0x10C  jr 0x200
    0x200  add
    0x204  add
    0x208  j 0x150
    0x150  fadd
    0x154  fadd

Figure 6. Example of trace transformation, consisting of PC conversion and instruction encoding mapping.
To resolve differences in instruction size, PC alignment, and
instruction inflation, we reorganize PCs in the traces to conform
to RISC-V requirements. As illustrated in Fig. 6, we collect all
instruction PCs and sequentially reassign new addresses based
on RISC-V encoding size. When a PC gap is detected (e.g.,
from 0x105 to 0x150), the current PC is updated accordingly. A
mapping from original PCs to RISC-V PCs is then constructed,
and branch target addresses are updated using this mapping.
While the x86 ISA supports larger offset ranges for di-
rect branch instructions than RISC-V, we observe that branch
target computation mainly occurs in two modules: the pre-
decoding unit at the fetch stage and the branch execution unit.
By overriding the computation result with the target recorded
in the traces, we effectively support larger branch offset ranges
in the trace-driven RISC-V model.
Overall. TraceBridge provides a methodology to evaluate
the microarchitectural behavior of mature, real-world software
ecosystems (e.g., Google workloads) on an emerging hardware
ecosystem (e.g., RISC-V). Admittedly, TraceBridge is unable
to eliminate all performance discrepancies caused by inher-
ent cross-ISA differences and missing execution information in
traces, such as instruction semantics and application binary interfaces (ABIs). Furthermore, while the high-level methodology is consistent, the specific rules must be adapted to the source and target ISAs. For example, x86 and RISC-V differ in the number of
general-purpose registers. Consequently, when translating to
x86, some registers may map to memory (i.e., register spilling).
It is also constrained by information missing from the trace,
forcing a simplified instruction remapping, which inevitably
introduces performance errors. However, according to our evaluation of missing RISC-V opcodes (§ 6.2.2), the accuracy remains above 99% (0.95% error for SPECint2017), which is sufficient for early-stage performance exploration.
5. Put It All Together
TraceRTL improves the performance evaluation workflow by
optimizing stages such as workload preparation, prototyp-
ing, and performance simulation. As illustrated in Fig. 7,
a typical iterative workflow based on TraceRTL is employed
to perform agile performance evaluation. The workflow in-
volves the following steps: (1) Trace Preparation: Program traces for the benchmarks or target applications are prepared
Trace
Preparation
Enhanced
Prototyping
Trace-Driven
Simulation
Functional
Correctness
Execution-
Driven
Simulation
Performance
Analysis
Figure 7. Agile performance evaluation workflow with TraceRTL. The workflow comprises two loops: a trace-driven loop (steps 2→3→4→2) and an execution-driven loop (steps 4→5→6→4).
for subsequent performance evaluation. Each trace represents
a program segment. Traces can be generated by a variety of tools, including dynamic instrumentation tools like Pin [46] and DynamoRIO [47] and instruction-level simulators like QEMU [50], or obtained from publicly available sources such as the Google workload traces [36] and Qualcomm workload traces [35].
TraceRTL can be combined with additional techniques such as
SimPoint [40] to further shorten simulation time, while also
avoiding the overhead and complexity of booting.
(2) Prototyping: New microarchitectural features can be prototyped on an RTL model without full implementation, as shown in § 7.2.
(3) Trace-Driven Simulation: The trace-formatted program segments are replayed in trace-driven simulation, yielding performance results for the CPU model.
(4) Performance Analysis: The performance results and program behaviors are analyzed to identify performance bottlenecks. These insights inform subsequent iterations and guide prototype refinement.
(5) Functional Correctness: When the design meets the expected performance targets, the effort invested in prototype development can be seamlessly carried over. TraceRTL supports compile-time mode switching between execution-driven and trace-driven simulation, enabling a smooth transition to functional validation.
(6) Execution-Driven Simulation: Further performance analysis and iteration are conducted through execution-driven simulation.
TraceRTL facilitates an agile and accurate RTL-level mi-
croarchitecture design exploration process. Rather than re-
placing existing architectural simulators, TraceRTL serves as
a complementary and reinforcing component that enhances
RTL performance exploration and bridges the gap between
high-level models and real RTL behavior. It targets a distinct
sweet spot in the accuracy-productivity trade-off: it preserves the ground-truth RTL model and accepts a manageable maintenance overhead to achieve substantially higher accuracy at a comparable, or with emulation platforms (Palladium/FPGA) even lower, simulation cost. By enabling direct performance evaluation on
real RTL implementations, TraceRTL empowers architects to
broaden application coverage and identify microarchitectural
bottlenecks that high-level simulators may overlook.
6. Evaluation
We conduct evaluations to address two key questions:
1. Can we mitigate trace-driven simulation’s performance
inaccuracies (§ 6.2)?
2. Does TraceRTL achieve high performance accuracy (§6.3)?
To address these questions, we compare the performance of
the original RTL model, TraceRTL, and the state-of-the-art
simulator gem5 [4].
Table 2. Target system configuration.
Component Description
Branch Predictor uBTB, BTB, TAGE-SC, ITTAGE, RAS
Fetch/Decode/Rename Width 8/6/6
RoB/LoadQueue/StoreQueue 160/72/64
Integer/Float Register File 224/192
ALU/FMA/FDivSqrt unit 4/4/2
Load/Store unit 3/2
L1 ICache 64KB, 4-way, 256-set
L1 DCache 64KB, 8-way, 128-set
L2 Cache 1MB, 8-way, 512-set, 4-bank
L3 Cache 16MB, 16-way, 4096-set, 4-bank
L1 ITLB/DTLB 48-entry, fully-associative
L2 TLB 2048-entry, 8-way, 32-set
DRAM DRAMsim3, 8GB, DDR4-3200
6.1. Experimental Setup
Target System. We evaluate TraceRTL by transforming an open-source high-performance RISC-V processor, XiangShan [26, 38], into a trace-driven model. TraceRTL introduces low implemen-
tation overhead while preserving RTL fidelity. It reuses the
original RTL and drives existing modules by intercepting in-
puts and outputs. The modifications consist of three primary
components. First, the simulation environment, implemented
primarily in C++, manages trace file loading, instruction
stream validation, and page table generation. Second, the
TraceRTL module, written in Chisel, retrieves traces via the
DPI and supplies instructions to the processor. Third, in-
terface connections and execution guidance are applied to
existing processor modules. The first two components are
microarchitecture-agnostic, whereas the third requires tighter
coupling with specific microarchitectural details. Specifically,
the microarchitecture-specific modifications account for fewer
than 450 LOC (lines of code). Nevertheless, the modification
methodology remains portable across diverse processor designs.
XiangShan, implemented in Chisel [13], is a tape-out-ready superscalar out-of-order processor. Its latest, third-generation core, Kunminghu, achieves a clock frequency of 3 GHz and a SPECint2006 score exceeding 15/GHz, demonstrating its capability as a platform for exploring high-performance mi-
croarchitecture designs. We use the default configuration of Xi-
angShan, as shown in Table 2. We take the original XiangShan’s
performance as the ground truth.
gem5 is widely used for CPU microarchitecture exploration
and is often referenced as the ground truth in some simulator
works [8, 51, 52] for its rich details. We use the XS-gem5 [53] as
the baseline, which has been carefully calibrated to XiangShan
through over 1,200 git commits and more than 60,000 lines of
source code additions since July 2022, including XiangShan-
specific adjustments.
Simulation Speed. XS-gem5 achieves a simulation speed of
around 35 kHz. As TraceRTL is directly derived from the original RTL model, it inherently shares a comparable simulation speed and benefits from hardware-accelerated emulation tools.
The simulation speeds are both around 6.5kHz using Verila-
tor [14] and around 1.4MHz on Cadence Palladium, which is
40× faster than XS-gem5.
Workloads. We use SPEC CPU2006 [54] and SPEC
CPU2017 [55] benchmark suites. We compare the benchmark
scores between XiangShan, TraceRTL and XS-gem5. The com-
plete execution of SPEC CPU benchmarks takes a very long
time in software simulation. A set of representative program
segments are generated by sampling the SPEC CPU bench-
marks using SimPoint [40]. Each segment consists of 20M
instructions for warm-up and 20M instructions for performance
sampling. To limit simulation time, the program segments covering more than 30% of the SimPoint weight are included for each application.
NEMU [56], an instruction-level simulator, is employed to
execute these segments and generate trace files to feed into
TraceRTL. Both XiangShan and XS-gem5 are functionally veri-
fied against NEMU, guaranteeing they share the same execution
flow.
6.2. Trace-Driven Performance Discrepancies
For the first time, we can evaluate the performance impact of
the trace-driven simulation on an accurate high-performance
RTL processor and the effectiveness of measures to mitigate
its performance errors. We quantify the performance errors arising from wrong-path simulation, memory management unit behaviors, and the absence of operands and opcodes.
Figure 8. Performance errors of TraceRTL w/ and w/o wrong-path sim-
ulation on SPEC CPU2006 and SPEC CPU2017.
6.2.1. Wrong-path Simulation.
We adopt the mechanism detailed in § 4.2.1 to model wrong-
path effects. For comparison, we also consider the basic ap-
proach where the instruction fetch halts upon encountering
a mis-prediction, detailed in § 4.1.1. Fig. 8 illustrates SPEC
CPU2006’s and SPEC CPU2017’s performance differences with
and without simulating wrong-path instructions’ effect, con-
taining the sub-benchmarks whose “w/o WPS” errors are more
than 1%, benchmarks’ overall performance errors and RMSE
(root mean squared error) metric. Although the overall per-
formance impact of neglecting wrong paths is relatively small
(-3.91% and -0.18% for SPECint2006 and SPECfp2006, -2.17%
and 0.14% for SPECint2017 and SPECfp2017), certain benchmarks, such as 429.mcf and 450.soplex on SPEC CPU2006 and 505.mcf and 557.xz on SPEC CPU2017, exhibit substantial performance degradation. Our results demonstrate that
simulating the impact of wrong-path instructions effectively
mitigates these programs’ performance discrepancies, reduc-
ing the overall performance error to 0.14% for SPECint2006
and 0.13% for SPECint2017. The RMSE of SPECint2006 and
SPECint2017 falls from 9.56% and 4.87% to 1.38% and 2.38%.
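For reference, the RMSE metric used above aggregates per-sub-benchmark relative errors; a minimal sketch, assuming the errors are given in percent:

```cpp
#include <cmath>
#include <vector>

// RMSE over per-sub-benchmark relative performance errors (in percent),
// the aggregate metric reported alongside the overall error.
double rmse(const std::vector<double>& errors_pct) {
    double sum_sq = 0.0;
    for (double e : errors_pct) sum_sq += e * e;
    return std::sqrt(sum_sq / errors_pct.size());
}
```

Unlike the overall error, in which positive and negative per-benchmark errors can cancel, the RMSE penalizes every deviation, which is why both metrics are reported.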
6.2.2. Instruction Opcode Provisioning
Figure 9. Performance errors of TraceRTL on SPEC CPU2017 when
omitting computation instruction opcodes.
(a) Branch predictor MPKI.
(b) Data cache MPKI.
Figure 10. BPU and data cache MPKI comparison between TraceRTL
w/ and w/o computation instruction opcodes on SPEC CPU2017 bench-
marks. Each point represents one sub-benchmark.
Coarse-grained opcode abstraction is common in trace-driven
simulators without detailed execution unit modeling, or in
applications that directly provide traces without instruction
encoding. To quantify the performance deviations, we im-
plemented a controlled mapping scheme within the TraceRTL
framework. Specifically, the diverse array of complex compu-
tational opcodes are collapsed into a simplified set of generic
operations: integer addition/multiplication (ADD/MUL) and
floating-point addition/multiplication (FADD/FMUL).
The results across the SPEC CPU2017 demonstrate that
the impact of opcode abstraction varies significantly between
workload types. As shown in Fig. 9, SPECint2017 exhibits
high resilience to coarse-grained semantic mapping, maintain-
ing a negligible average error of 0.95%. In contrast, SPECfp2017
shows a much higher sensitivity, with the average error rising to
5.30% and peaking at 19.29% in 519.lbm. The results suggest
that while coarse-grained opcode traces are sufficient for evalu-
ating general-purpose integer architectures, they may introduce
unacceptable fidelity loss for floating-point heavy workloads.
Despite the divergence, the coarse-grained abstraction ef-
fectively preserves the control-flow and memory-access char-
acteristics of the workloads. As illustrated in Fig. 10, the
MPKI metrics of branch predictor and data cache remain
highly consistent between the abstracted traces and normal
TraceRTL. In summary, coarse-grained opcode abstraction has
limited impact on integer compute-intensive applications, fron-
tend modules (branch prediction and instruction fetch), and
memory-access related research. It is well-suited for studies
where the target workloads or modules have a weak correlation
with floating-point operations.
6.2.3. Uncertain-latency Operations
To model the execution latency of uncertain-latency opera-
tions, represented by floating-point division and square root
(FDivSqrt), we adopt the approach that supplies operands, de-
tailed in § 4.2.3. For comparison, we also evaluated a baseline
configuration where the FDivSqrt is replaced with a fixed-
latency dummy unit, with latencies varying based on the
operation type and data width. As shown in Fig. 11, which
contains sub-benchmarks whose “fixed-latency” errors exceed
0.5%, the fixed-latency model resulted in overall performance
errors of -1.65% and -1.22% on SPECfp2006 and SPECfp2017,
respectively, with significant deviations for sub-benchmarks
such as gromacs and zeusmp in SPECfp2006, and 521.wrf,
527.cam4, and 544.nab in SPECfp2017. By providing operands,
we are able to improve the accuracy of performance for these
applications.
Figure 11. Performance error of simulating FDivSqrt with operand-
dependent vs. fixed latency on SPECfp2006 and SPECfp2017.
6.2.4. Memory Management Unit
To evaluate the performance impact of the MMU, we employ
the dynamic page table (Dynamic PT) approach detailed in
§ 4.2.4. For comparison, we also simulate an ideal L1 TLB
which always hits and a page table walker with fixed-latency of
15 cycles. As shown in Fig. 12, which contains sub-benchmarks
whose “Ideal L1TLB” errors are more than 3%, the ideal
MMU introduces average performance discrepancies of 6.19%
and 2.35% on SPEC CPU2017 int and fp, with 11 out of 23
benchmarks experiencing performance discrepancies exceeding
3%. When simulating a page table walker with fixed memory la-
tency, 4 out of the 23 benchmarks have errors greater than 3%.
In contrast, when simulating the actual MMU behavior, the overall performance error decreases to 0.13% and 0.14% for SPECint2017 and SPECfp2017, and only 1 out of 23 benchmarks exhibits a performance error greater than 3%.
Figure 12. Performance error of simulating the MMU using different
strategies on SPEC CPU2006 and SPEC CPU2017.
6.3. Overall Performance Accuracy
We evaluate the performance accuracy of TraceRTL and XS-
gem5 on SPEC CPU2006 and SPEC CPU2017, with original
XiangShan as the ground truth, as shown in Fig. 13.
Overall. TraceRTL achieves high accuracy in overall performance. For the RMSE metric, TraceRTL achieves
1.45% and 1.00% on SPECint2006 and SPECfp2006, compared
to 9.85% and 19.44% for XS-gem5. Similarly, the RMSEs of TraceRTL on SPECint2017 and SPECfp2017 are 2.38% and 0.67%, compared to 8.02% and 22.53% for XS-gem5.
Sub-benchmarks. TraceRTL exhibits high accuracy at both
the overall and sub-benchmark levels. For XS-gem5, on SPEC
CPU2006, 11 out of 29 sub-benchmarks have errors greater than
10%, and 14 out of 29 have errors greater than 3%. Similarly,
on SPEC CPU2017, 7 out of 23 sub-benchmarks have errors
greater than 10%, and 13 out of 23 have errors greater than
3%. These discrepancies can be attributed to the diversity of
program characteristics, which makes it challenging to perfectly
calibrate. In contrast, by inheriting rich details, TraceRTL
effortlessly achieves high accuracy. TraceRTL achieves perfor-
mance accuracy such that only 1 out of 29 on SPEC CPU2006
and 1 out of 23 on SPEC CPU2017 has an error greater than
3%.
7. Case Studies
In this section, we present case studies to demonstrate how
TraceRTL facilitates agile performance evaluation:
1. Trace Compatibility: Using TraceBridge, we evaluate
x86-based Google workload traces on the RISC-V Xiang-
Shan CPU (§ 7.1).
2. Prototyping: We use TraceRTL to quickly evaluate the performance impact of adopting a two-stage address translation MMU (§ 7.2) and a new floating-point unit (§ 7.3).
3. Performance Sensitivity Accuracy: We compare the
accuracy of performance impact between TraceRTL and
XS-gem5 at frontend, backend and memory (§ 7.4).
7.1. Trace Compatibility: Google Workload Traces
We evaluate datacenter workloads, namely the x86-based Google workload traces [36] collected from warehouse-scale computers, on the RISC-V high-performance processor XiangShan, demonstrating the feasibility of TraceBridge described in § 4.3. Google workload
traces consist of multiple trace groups, each containing many
trace files. For each group, we select the longest trace and apply SimPoint [40] for sampling. Applying SimPoint directly to
the transformed traces can avoid errors caused by instruction
inflation.
While TraceBridge maintains semantic consistency, it intro-
duces the overhead of instruction inflation. We analyze this
inflation across both static and dynamic dimensions, consider-
ing instruction count and size, as shown in Fig. 14. The inflation
ratios for static and dynamic instruction counts remain stable
within a narrow range, from 1.09 for arizona to 1.19 for yankee.
The dynamic instruction size, an indicator of instruction cache
pressure, exhibits an inflation ratio ranging from 0.95 for ari-
zona to 1.20 for bravo.a, with 9 out of 12 applications staying
within a 10% inflation margin.
We provide a Top-down [57] breakdown analysis of per-
formance bottlenecks for both Google workload traces and
SPECint2017, sorted by IPC, as shown in Fig. 15. While only 3
out of 10 SPECint2017 sub-benchmarks exhibit a memory-bound fraction
Figure 13. Performance error of TraceRTL and XS-gem5 on SPEC CPU2006 and SPEC CPU2017, using the execution-driven XiangShan as the baseline.
Figure 14. Instruction inflation rate of TraceBridge on Google workload
traces.
Figure 15. Top-down breakdown comparison between Google workload traces, SPECint2017, and llama2.c.
over 20%, 8 out of 12 Google workload traces demonstrate
this characteristic, with 6 reaching approximately 40%. These
results highlight memory access as the primary performance
bottleneck, underscoring the importance of memory optimiza-
tion for warehouse-scale computing systems. TraceBridge
introduces dynamic instruction size and count expansion, which
primarily affects front-end and core-bound performance cate-
gories. However, since these two factors account for relatively
small proportions in Google workload traces, TraceBridge has
limited impact through expansion effects. Although coarse-
grained instruction encoding may potentially affect floating-
point workloads, the analysis in § 6.2.2 shows that it preserves
accurate instruction streams and cache behavior. This indi-
cates that the impact on memory-bound and bad-speculation
categories is also minimal.
TraceRTL also streamlines porting workloads by leveraging
the well-developed QEMU. It takes less than 30 minutes to compile llama2.c [58] and generate program traces with QEMU.
As shown in Fig. 15, these traces are simulated on TraceRTL,
and, unlike Google workload traces and SPECint2017, exhibit
distinct core-bound behaviors.
7.2. Prototyping #1: Memory Management Unit
TraceRTL enables efficient prototyping and performance evalu-
ation of complex microarchitectural modules. As a case study,
we examine two-stage address translation, a key mechanism
for supporting virtual machines through memory virtualization
defined in the RISC-V Hypervisor extension [59].
Evaluating this module is non-trivial due to its reliance
on privileged operations, complex control and status registers
(CSRs), and software-managed page tables. Additionally, its
performance impact is significant: address translation may trigger multiple memory accesses to the page tables. For instance, the RISC-V Sv39 scheme requires 3 memory accesses, while the virtualized, two-stage Sv39-Sv39x4 scheme requires up to 15 memory accesses, which increase the translation latency.
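The 15-access worst case follows from the nesting of the two stages: each of the three first-stage (guest) page-table accesses operates on a guest-physical address that first needs a full second-stage walk, and the final guest physical address needs one more second-stage walk:

$$N = n_G \,(n_H + 1) + n_H = 3 \times (3 + 1) + 3 = 15,$$

where $n_G = n_H = 3$ are the guest and host Sv39 page-table depths.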
Figure 16. Performance decrease estimation when adopting two-stage
address translation on SPEC CPU2006.
TraceRTL allows performance evaluation of such designs be-
fore full functional implementation is complete. By (1) directly
Figure 17. Performance error of TraceRTL on SPEC CPU2006 under
KVM virtualization.
providing the page table following the two-stage translation
scheme and (2) adding a standalone host page table walker
which performs guest-physical-address to host-physical-address
translation, we enable the MMU to perform the two-stage
Sv39-Sv39x4 scheme, thereby obtaining the performance re-
sults of two-stage address translation. Fig. 16 illustrates the
performance changes of the TraceRTL under normal address
translation and two-stage address translation modes on SPEC
CPU2006. The two-stage address translation results in a performance degradation of 9.99% for SPECint2006 and 5.27% for SPECfp2006. Among the 29 sub-benchmarks, 10 have a degra-
dation exceeding 5%. In summary, TraceRTL simplifies the
requirements for functional correctness and software modifica-
tions, providing a robust development platform for exploration
around MMU.
To evaluate the accuracy, we compare the TraceRTL-based hypervisor against the fully functional XiangShan hypervisor on SPEC CPU2006 under KVM virtualization. As shown in
Fig. 17, TraceRTL achieves high accuracy, with performance
errors below 1% for 19 of 23 sub-benchmarks. The overall error
is 0.32% for SPECint2006 and 0.23% for SPECfp2006.
7.3. Prototyping #2: FDivSqrt Unit
TraceRTL enables the implementation of dummy execution
units with configurable latency behavior without complex be-
havioral modeling. For instance, implementing a functional
FDivSqrt unit in RTL entails substantial effort, as the imple-
mentation in XiangShan exceeds 2,400 LOC and requires exten-
sive verification. To evaluate the pipelined design [60] without
incurring such overhead, alternative modeling approaches are
necessary. XS-gem5 adopts cycle-accurate modeling for execu-
tion units and requires more than 40 LOC of modifications.
In contrast, TraceRTL enables dummy implementation with
configurable latency in fewer than 10 LOC of modifications,
eliminating verification overhead. Figure 18 shows the perfor-
mance impact of replacing two blocking FDivSqrt units with a
pipelined version across benchmarks where TraceRTL changes
exceed 0.5%. Notable discrepancies between XiangShan and
XS-gem5 appear in SPEC CPU2006 wrf and SPEC CPU2017
lbm. Given the differences in performance accuracy, TraceRTL
results are considered more reliable.
7.4. Performance Sensitivity Accuracy
In addition to the performance accuracy of the processor model,
the performance sensitivity to microarchitectural modifications
is also important. To evaluate the performance sensitivity
accuracy to microarchitectural modifications, we adjust key
configurations in the frontend, backend, and memory subsys-
tem. For the frontend, we compare the performance impact of
different branch target buffer (BTB) sizes—specifically, from
Figure 18. Performance improvement when adopting pipelined FDivSqrt
on SPECfp2006 and SPECfp2017.
1024 to 2048 (default) entries. For the backend, we vary the number of floating-point FMA units from 2 to 4 (default). For
the memory subsystem, we evaluate performance with the best-
offset prefetcher in the L2 cache both disabled and enabled
(default).
As shown in Fig. 19, we compare the performance variations
of XiangShan, TraceRTL and XS-gem5 on SPEC CPU2017
benchmarks under microarchitectural modifications mentioned
above. The performance trends observed on TraceRTL match
those of XiangShan more closely than those of XS-gem5 do.
For instance, when enlarging the BTB, sub-benchmarks such
as 500.perlbench, 502.gcc, and 511.povray exhibit similar
trends between TraceRTL and XiangShan. When increasing
the number of FMA units, sub-benchmarks like 507.cactuBSSN,
508.namd, and 519.lbm show consistent behavior. When
adopting the best-offset prefetcher, sub-benchmarks including
500.perlbench and 507.cactuBSSN also demonstrate analogous
performance improvements.
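Sensitivity accuracy as used here can be quantified per benchmark as the relative performance change under a modification, with trend agreement meaning the reference and the model see changes of the same sign. A minimal sketch with hypothetical numbers (not measured data):

```python
def rel_change(baseline_ipc: float, modified_ipc: float) -> float:
    """Relative performance change caused by a microarchitectural
    modification, e.g. 0.02 for a 2% IPC gain."""
    return (modified_ipc - baseline_ipc) / baseline_ipc

def trends_agree(ref: float, model: float, tol: float = 0.005) -> bool:
    """Two platforms agree on a trend if their relative changes have
    the same sign, treating changes within +/-tol as neutral."""
    def sign(x: float) -> int:
        return 0 if abs(x) < tol else (1 if x > 0 else -1)
    return sign(ref) == sign(model)

# Hypothetical example: RTL gains 2% from a larger BTB, a model 1.5%.
print(trends_agree(rel_change(1.00, 1.02), rel_change(1.00, 1.015)))
```

A model that reports the right sign and roughly the right magnitude of change for each sub-benchmark is what the figure-level comparison above is checking.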
We analyze the notable performance errors of XS-gem5 and
find that its prefetching subsystem is considerably more complex
and finely tuned than XiangShan's, yet lacks clear calibration
against the RTL design. This mismatch diminishes the observable
performance gains from new prefetchers such as best-offset.
This observation highlights the fundamental calibration
challenge that motivates the design of TraceRTL: a model
may overfit to the baseline configuration and reproduce similar
overall performance, yet its performance trends for specific
microarchitectural features may diverge significantly.
8. Related Work
Trace-Driven Model Transformation. Prior work has explored
trace-based methods that directly control RTL modules' behavior
for functional verification, coverage analysis, and performance
validation [61, 62]. These works use traces to drive individual
RTL modules, and the main challenge lies in trace generation.
Some works collect the traces generated by CPU
RTL models for coverage analysis [63]. In contrast, TraceRTL,
centered on the whole CPU RTL model, addresses the chal-
lenges of design space exploration at the RTL level. Given that
achieving high performance accuracy is both a fundamental
requirement and a persistent challenge, TraceRTL provides a
solution that not only supports prototyping but also enables
the execution of workloads in trace form. Trace-driven
methodology can also be used to improve existing software
simulators, such as the trace-driven gem5 described in [64]. In
contrast, TraceRTL enhances the RTL simulation itself,
avoiding extra model layers. Accel-Sim [65] adds a new frontend
for GPGPU-Sim [66] to support trace-driven simulation. Unlike
Accel-Sim's high-level GPU modeling, TraceRTL targets
low-level RTL CPU models and addresses challenges of model
calibration.
Figure 19. Performance improvements of microarchitectural modifications on XiangShan, TraceRTL, and XS-gem5: (a) from 1024 to 2048 BTB entries on SPEC CPU2017; (b) increasing the FMA count from 2 to 4 on SPECfp2017; (c) adopting the L2 cache best-offset prefetcher on SPEC CPU2017.
Trace-Driven Performance Inaccuracy. Previous works
have investigated performance inaccuracy in trace-driven
simulation, primarily focusing on wrong-path simulation in
single-core [67–69] and multi-core [70, 71] settings, and on
synchronization in multi-core simulation [72, 73]. Our
methodology mainly addresses the prefetching influence of
wrong paths by treating correct-path instructions as wrong-path
instructions, to suit the RTL model and achieve high accuracy.
Moreover, existing trace-driven simulators operate at a high
level of abstraction, which may introduce performance errors
that mask some influencing factors. TraceRTL provides a
platform for studying trace-driven simulation.
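The wrong-path handling described above can be sketched as follows (a simplified, hypothetical illustration, not the actual implementation): a trace of committed instructions contains no real wrong-path instructions, so after a misprediction the model can feed the frontend with the upcoming correct-path instructions, keeping caches and prefetchers exercised by a realistic access stream.

```python
def wrong_path_filler(trace, mispredict_idx, depth):
    """On a misprediction at trace[mispredict_idx], a trace of
    committed instructions has no real wrong-path instructions.
    As an approximation, reuse the next `depth` correct-path
    instructions as wrong-path fetches so that caches and
    prefetchers still observe memory traffic."""
    start = mispredict_idx + 1
    return trace[start:start + depth]

trace = ["i0", "i1", "br", "i3", "i4", "i5"]
print(wrong_path_filler(trace, 2, 2))  # ["i3", "i4"]
```

The substituted instructions are later squashed like ordinary wrong-path work; only their side effects on the memory hierarchy persist, which is the effect this approximation aims to preserve.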
Error-Prone RTL Model. New RTL languages such as
Bluespec SystemVerilog [12], Chisel [13], and SpinalHDL [39]
provide high expressiveness and abstraction to reduce design
errors. Assassyn [74] introduces a high-level abstraction for
asynchronous event handling of pipelined architectures and can
generate a calibrated C++ simulator. TraceRTL presents an
orthogonal approach, utilizing a trace-driven methodology
to decouple the functional and performance models of existing
CPU designs and to expand the scope of workloads.
Trace Format Transformation. Prior work has explored
trace format transformation, e.g., converting Arm traces into
ChampSim-compatible format [7, 75]. However, ChampSim’s
high-level abstraction bypasses many low-level challenges, such
as differences in instruction semantics, encoding size, PC align-
ment, and branch offset range, which become critical when
executing traces on RTL models.
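These low-level issues can be made concrete with a small sketch (hypothetical, not TraceBridge itself): x86 instructions have variable lengths, whereas RISC-V PCs must be 2-byte aligned with the compressed extension, and a RISC-V conditional branch immediate reaches only about +/-4 KiB, so remapped branch targets may need to route through an unconditional jump.

```python
RVC_ALIGN = 2          # RISC-V PC alignment with the compressed extension
BRANCH_RANGE = 4096    # B-type conditional branch reach: +/- 4 KiB

def remap_pcs(x86_pcs, base=0x8000_0000, insn_size=4):
    """Map variable-length x86 PCs onto fixed-size, aligned RISC-V
    PCs, preserving only instruction order (a simplification)."""
    return {pc: base + i * insn_size for i, pc in enumerate(x86_pcs)}

def branch_fits(src_pc, dst_pc):
    """Check whether a remapped branch offset fits the conditional
    branch immediate; otherwise an unconditional jump is needed."""
    off = dst_pc - src_pc
    return -BRANCH_RANGE <= off < BRANCH_RANGE and off % RVC_ALIGN == 0

pcs = remap_pcs([0x401000, 0x401003, 0x401008])
print(all(pc % RVC_ALIGN == 0 for pc in pcs.values()))  # True
```

A high-level simulator like ChampSim can ignore these constraints, but an RTL model decodes real encodings, so every remapped PC and offset must be legal.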
9. Conclusion
We propose TraceRTL, a methodology to bring trace-driven
simulation to the CPU RTL model to facilitate agile perfor-
mance evaluation. We evaluate TraceRTL by integrating it into
XiangShan, achieving accuracies of 99.87% and 99.86% on
SPECint2017 and SPECfp2017, respectively. We propose a trace
transformation strategy, TraceBridge, and evaluate x86 Google workload
traces on the RISC-V XiangShan. TraceRTL mitigates the
benchmarking gap between software simulators and RTL de-
sign, supports both an RTL-based performance exploration
workflow and seamless integration with simulator-driven flows,
serving as a bridge from early-stage exploration to last-mile
RTL evaluation.
Ethical Statement
No ethical approval was required for this study, as it did not
involve human or animal subjects.
Funding
This work was supported by the National Natural Science Foun-
dation of China (Grant No. 62090022, 62090023, 62172388) and
the Strategic Priority Research Program of Chinese Academy
of Sciences (Grant No. XDA0320000, XDA0320300).
Declaration of competing interests
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Data Availability Statements
The data supporting the findings of this study are openly
available in XiangShan at https://github.com/OpenXiangShan/
XiangShan/tree/dev-tracertl.
Credit authorship contribution statement
Zifei Zhang: Conceptualization; Project administration;
Methodology; Validation; Investigation; Data curation; Formal
analysis; Writing – original draft. Yinan Xu: Methodology;
Writing – review & editing. Kaichen Gong: Software; Validation;
Investigation. Sa Wang: Writing; Visualization. Dan
Tang: Supervision; Funding acquisition; Resources. Yungang
Bao: Supervision; Funding acquisition; Resources; Writing –
review & editing.
References
1. Doug Burger and Todd M. Austin. The simplescalar
tool set, version 2.0. SIGARCH Comput. Archit. News,
25(3):13–25, June 1997. doi:10.1145/268806.268810.
2. Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann,
Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E.
Moore, Mark D. Hill, and David A. Wood. Multifacet’s
general execution-driven multiprocessor simulator (gems)
toolset. SIGARCH Comput. Archit. News, 33(4):92–99,
November 2005. doi:10.1145/1105734.1105747.
3. N.L. Binkert, R.G. Dreslinski, L.R. Hsu, K.T. Lim, A.G.
Saidi, and S.K. Reinhardt. The m5 simulator: Modeling
networked systems. IEEE Micro, 26(4):52–60, 2006. doi:
10.1109/MM.2006.82.
4. Nathan Binkert, Bradford Beckmann, Gabriel Black,
Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel
Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sar-
dashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib,
Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5
simulator. SIGARCH Comput. Archit. News, 39(2):1–7,
August 2011. doi:10.1145/2024716.2024718.
5. Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout.
Sniper: exploring the level of abstraction for scalable and
accurate parallel multi-core simulation. In Proceedings
of 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis, SC ’11,
New York, NY, USA, 2011. Association for Computing
Machinery. doi:10.1145/2063384.2063454.
6. Daniel Sanchez and Christos Kozyrakis. Zsim: fast and
accurate microarchitectural simulation of thousand-core
systems. In Proceedings of the 40th Annual Interna-
tional Symposium on Computer Architecture, ISCA ’13,
page 475–486, New York, NY, USA, 2013. Association for
Computing Machinery. doi:10.1145/2485922.2485963.
7. Nathan Gober, Gino Chacon, Lei Wang, Paul V. Gratz,
Daniel A. Jimenez, Elvira Teran, Seth Pugsley, and Jinchun
Kim. The championship simulator: Architectural simula-
tion for education and competition, 2022. URL: https:
//arxiv.org/abs/2210.14324, arXiv:2210.14324.
8. Hossein Golestani, Rathijit Sen, Vinson Young, and Gagan
Gupta. Calipers: a criticality-aware framework for mod-
eling processor performance. In Proceedings of the 36th
ACM International Conference on Supercomputing, ICS
’22, New York, NY, USA, 2022. Association for Computing
Machinery. doi:10.1145/3524059.3532390.
9. Tony Nowatzki, Jaikrishnan Menon, Chen-Han Ho, and
Karthikeyan Sankaralingam. Architectural simulators con-
sidered harmful. IEEE Micro, 35(6):4–12, 2015. doi:
10.1109/MM.2015.74.
10. Cbp2025 simulator framework. https://ericrotenberg.
wordpress.ncsu.edu/cbp2025-simulator-framework/, 2025.
11. Sizhuo Zhang, Andrew Wright, Thomas Bourgeat, and
Arvind. Composable building blocks to open up proces-
sor design. In Proceedings of the 51st Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-
51, page 68–81. IEEE Press, 2018. doi:10.1109/MICRO.2018.
00015.
12. Thomas Bourgeat, Clément Pit-Claudel, Adam Chlipala,
and Arvind. The essence of bluespec: a core language for
rule-based hardware design. In Proceedings of the 41st
ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI 2020, page 243–257,
New York, NY, USA, 2020. Association for Computing
Machinery. doi:10.1145/3385412.3385965.
13. Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee,
Andrew Waterman, Rimas Avižienis, John Wawrzynek, and
Krste Asanović. Chisel: constructing hardware in a scala
embedded language. In Proceedings of the 49th Annual
Design Automation Conference, pages 1216–1225, 2012.
doi:10.1145/2228360.2228584.
14. Verilator. Verilator user’s guide. https://www.veripool.
org/guide/latest/, 2026.
15. Haoyuan Wang and Scott Beamer. Repcut: Superlinear
parallel rtl simulation with replication-aided partitioning.
In Proceedings of the 28th ACM International Confer-
ence on Architectural Support for Programming Lan-
guages and Operating Systems, Volume 3, ASPLOS 2023,
page 572–585, New York, NY, USA, 2023. Association for
Computing Machinery. doi:10.1145/3582016.3582034.
16. Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and
Ru Huang. Khronos: Fusing memory access for improved
hardware rtl simulation. In Proceedings of the 56th Annual
IEEE/ACM International Symposium on Microarchitec-
ture, MICRO ’23, page 180–193, New York, NY, USA,
2023. Association for Computing Machinery. doi:10.1145/
3613424.3614301.
17. Haoyuan Wang, Thomas Nijssen, and Scott Beamer. Don’t
repeat yourself! coarse-grained circuit deduplication to ac-
celerate rtl simulation. In Proceedings of the 29th ACM
International Conference on Architectural Support for
Programming Languages and Operating Systems, Volume
4, ASPLOS ’24, page 79–93, New York, NY, USA, 2025. As-
sociation for Computing Machinery. doi:10.1145/3622781.
3674184.
18. Mahyar Emami, Thomas Bourgeat, and James R. Larus.
Parendi: Thousand-way parallel rtl simulation. In Pro-
ceedings of the 30th ACM International Conference on
Architectural Support for Programming Languages and
Operating Systems, Volume 2, ASPLOS ’25, page 783–797,
New York, NY, USA, 2025. Association for Computing
Machinery. doi:10.1145/3676641.3716010.
19. Sagar Karandikar, Howard Mao, Donggyu Kim, David
Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton,
Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qi-
jing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz,
Jonathan Bachrach, and Krste Asanović. Firesim: Fpga-
accelerated cycle-exact scale-out system simulation in the
public cloud. In Proceedings of the 45th Annual Interna-
tional Symposium on Computer Architecture, ISCA ’18,
page 29–42. IEEE Press, 2018. doi:10.1109/ISCA.2018.
00014.
20. Sagar Karandikar, Albert Ou, Alon Amid, Howard Mao,
Randy Katz, Borivoje Nikolić, and Krste Asanović.
Fireperf: Fpga-accelerated full-system hardware/software
performance profiling and co-design. In Proceedings of
the Twenty-Fifth International Conference on Architec-
tural Support for Programming Languages and Operating
Systems, ASPLOS ’20, page 715–731, New York, NY,
USA, 2020. Association for Computing Machinery. doi:
10.1145/3373376.3378455.
21. Mahyar Emami, Sahand Kashani, Keisuke Kamahori, Mo-
hammad Sepehr Pourghannad, Ritik Raj, and James R
Larus. Manticore: Hardware-accelerated rtl simulation
with static bulk-synchronous parallelism. In Proceedings
of the 28th ACM International Conference on Architec-
tural Support for Programming Languages and Operating
Systems, Volume 4, pages 219–237, 2023. doi:10.1145/3623278.3624750.
22. Fares Elsabbagh, Shabnam Sheikhha, Victor A Ying,
Quan M Nguyen, Joel S Emer, and Daniel Sanchez. Ac-
celerating rtl simulation with hardware-software co-design.
In Proceedings of the 56th Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture, pages 153–166,
2023. doi:10.1145/3613424.3614257.
23. Christopher Celio, David A Patterson, and Krste
Asanovic. The berkeley out-of-order machine (boom):
An industry-competitive, synthesizable, parameterized
risc-v processor. EECS Department, University of
California, Berkeley, Tech. Rep. UCB/EECS-2015-
167, 2015. URL: https://www2.eecs.berkeley.edu/Pubs/
TechRpts/2015/EECS-2015-167.html.
24. Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste
Asanovic. Sonicboom: The 3rd generation berkeley out-
of-order machine. In Fourth Workshop on Computer
Architecture Research with RISC-V, volume 5, pages 1–
7, 2020. URL: https://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf.
25. Christopher Celio, Pi-Feng Chiu, Borivoje Nikolic, David A
Patterson, and Krste Asanovic. BOOMv2: an open-
source out-of-order RISC-V core. In First Work-
shop on Computer Architecture Research with RISC-V
(CARRV), 2017. URL: https://www2.eecs.berkeley.edu/
Pubs/TechRpts/2017/EECS-2017-157.pdf.
26. Kaifan Wang, Jian Chen, Yinan Xu, Zihao Yu, Zifei Zhang,
Guokai Chen, Xuan Hu, Linjuan Zhang, Xi Chen, Wei
He, Dan Tang, Ninghui Sun, and Yungang Bao. Xi-
angShan: An Open-Source Project for High-Performance
RISC-V Processors Meeting Industrial-Grade Standards.
In 2024 IEEE Hot Chips 36 Symposium (HCS), pages 1–
25, Los Alamitos, CA, USA, August 2024. IEEE Computer
Society. URL: https://doi.ieeecomputersociety.org/10.
1109/HCS61935.2024.10665293, doi:10.1109/HCS61935.2024.
10665293.
27. Chen Chen, Xiaoyan Xiang, Chang Liu, Yunhai Shang, Ren
Guo, Dongqi Liu, Yimin Lu, Ziyi Hao, Jiahui Luo, Zhijian
Chen, Chunqiang Li, Yu Pu, Jianyi Meng, Xiaolang Yan,
Yuan Xie, and Xiaoning Qi. Xuantie-910: A commercial
multi-core 12-stage pipeline out-of-order 64-bit high per-
formance risc-v processor with vector extension: Industrial
product. In 2020 ACM/IEEE 47th Annual International
Symposium on Computer Architecture (ISCA), pages 52–
64. IEEE, 2020. doi:10.1109/ISCA45697.2020.00016.
28. Chen Bai, Qi Sun, Jianwang Zhai, Yuzhe Ma, Bei Yu,
and Martin DF Wong. Boom-explorer: Risc-v boom mi-
croarchitecture design space exploration framework. In
2021 IEEE/ACM International Conference On Computer
Aided Design (ICCAD), pages 1–9. IEEE, 2021. doi:
10.1109/ICCAD51958.2021.9643455.
29. Siddharth Gupta, Yuanlong Li, Qingxuan Kang, Abhishek
Bhattacharjee, Babak Falsafi, Yunho Oh, and Mathias
Payer. Imprecise store exceptions. In Proceedings of
the 50th Annual International Symposium on Computer
Architecture, ISCA ’23, New York, NY, USA, 2023. As-
sociation for Computing Machinery. doi:10.1145/3579371.
3589087.
30. Moein Ghaniyoun, Kristin Barber, Yuan Xiao, Yinqian
Zhang, and Radu Teodorescu. Teesec: Pre-silicon vulner-
ability discovery for trusted execution environments. In
Proceedings of the 50th Annual International Symposium
on Computer Architecture, ISCA ’23, New York, NY,
USA, 2023. Association for Computing Machinery. doi:
10.1145/3579371.3589070.
31. Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang,
Tianyue Lu, Mingyu Chen, Siwei Luo, and Keji Huang.
Asynchronous memory access unit: Exploiting massive par-
allelism for far memory access. ACM Trans. Archit. Code
Optim., 21(3), September 2024. doi:10.1145/3663479.
32. Duo Wang, Mingyu Yan, Yihan Teng, Dengke Han, Hao-
ran Dang, Xiaochun Ye, and Dongrui Fan. A transfer
learning framework for high-accurate cross-workload design
space exploration of cpu. In 2023 IEEE/ACM Interna-
tional Conference on Computer Aided Design (ICCAD),
pages 1–9, 2023. doi:10.1109/ICCAD57390.2023.10323840.
33. Hideki Ando. Performance improvement by prioritizing the
issue of the instructions in unconfident branch slices. In
2018 51st Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pages 82–94, 2018. doi:
10.1109/MICRO.2018.00016.
34. Yinan Xu, Zihao Yu, Dan Tang, Guokai Chen, Lu Chen,
Lingrui Gou, Yue Jin, Qianruo Li, Xin Li, Zuojun Li, Ji-
awei Lin, Tong Liu, Zhigang Liu, Jiazhan Tan, Huaqiang
Wang, Huizhe Wang, Kaifan Wang, Chuanqi Zhang,
Fawang Zhang, Linjuan Zhang, Zifei Zhang, Yangyang
Zhao, Yaoyang Zhou, Yike Zhou, Jiangrui Zou, Ye Cai,
Dandan Huan, Zusong Li, Jiye Zhao, Zihao Chen, Wei He,
Qiyuan Quan, Xingwu Liu, Sa Wang, Kan Shi, Ninghui
Sun, and Yungang Bao. Towards developing high per-
formance risc-v processors using agile methodology. In
2022 55th IEEE/ACM International Symposium on Mi-
croarchitecture (MICRO), pages 1178–1199, 2022. doi:
10.1109/MICRO56248.2022.00080.
35. Championship value prediction. https://microarch.org/
cvp1/. Accessed: 2025-02-20.
36. Google workload traces version 2. https://console.cloud.
google.com/storage/browser/external-traces-v2. Ac-
cessed: 2025-02-20.
37. Wei Su, Abhishek Dhanotia, Carlos Torres, Jayneel Gandhi,
Neha Gholkar, Shobhit Kanaujia, Maxim Naumov, Kalyan
Subramanian, Valentin Andrei, Yifan Yuan, and Chunqiang
Tang. Dcperf: An open-source, battle-tested performance
benchmark suite for datacenter workloads. In Proceedings
of the 52nd Annual International Symposium on Com-
puter Architecture, ISCA ’25, page 1717–1730, New York,
NY, USA, 2025. Association for Computing Machinery.
doi:10.1145/3695053.3731411.
38. OpenXiangShan. XiangShan. https://github.com/
OpenXiangShan/XiangShan, 2020.
39. SpinalHDL. Scala based hdl. https://github.com/
SpinalHDL/SpinalHDL, 2024.
40. Timothy Sherwood, Erez Perelman, Greg Hamerly, and
Brad Calder. Automatically characterizing large scale
program behavior. In Proceedings of the 10th Inter-
national Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, ASPLOS X,
page 45–57, New York, NY, USA, 2002. Association for
Computing Machinery. doi:10.1145/605397.605403.
41. Alen Sabu, Harish Patil, Wim Heirman, and Trevor E
Carlson. Looppoint: Checkpoint-driven sampled simula-
tion for multi-threaded applications. In 2022 IEEE In-
ternational Symposium on High-Performance Computer
Architecture (HPCA), pages 604–618. IEEE, 2022. doi:
10.1109/HPCA53966.2022.00051.
42. Trevor E Carlson, Wim Heirman, Kenzo Van Craeynest,
and Lieven Eeckhout. Barrierpoint: Sampled simulation
of multi-threaded applications. In 2014 IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software (ISPASS), pages 2–12. IEEE, 2014. doi:
10.1109/ISPASS.2014.6844456.
43. Krste Asanovic, Rimas Avizienis, Jonathan Bachrach,
Scott Beamer, David Biancolin, Christopher Celio, Henry
Cook, Daniel Dabbelt, John Hauser, Adam Izraele-
vitz, Sagar Karandikar, Ben Keller, Donggyu Kim, and
John Koenig. The rocket chip generator. EECS De-
partment, University of California, Berkeley, Tech.
Rep. UCB/EECS-2016-17, 4:6–2, 2016. URL: https:
//aspire.eecs.berkeley.edu/wp/wp-content/uploads/2016/
04/Tech-Report-The-Rocket-Chip-Generator-Beamer.pdf.
44. Bruno Sá, Luca Valente, José Martins, Davide Rossi,
Luca Benini, and Sandro Pinto. CVA6 RISC-V virtual-
ization: Architecture, microarchitecture, and design space
exploration. IEEE Transactions on Very Large Scale In-
tegration (VLSI) Systems, 2023. doi:10.1109/TVLSI.2023.
3302837.
45. RISC-V community. Olympia. https://github.com/
riscv-software-src/riscv-perf-model, 2026.
46. Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil,
Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa
Reddi, and Kim Hazelwood. Pin: building customized pro-
gram analysis tools with dynamic instrumentation. In Pro-
ceedings of the 2005 ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation, PLDI
’05, page 190–200, New York, NY, USA, 2005. Association
for Computing Machinery. doi:10.1145/1065010.1065034.
47. Derek Bruening, Evelyn Duesterwald, and Saman Amaras-
inghe. Design and implementation of a dynamic optimiza-
tion framework for windows. In 4th ACM workshop on
feedback-directed and dynamic optimization (FDDO-4),
page 20, 2001.
48. Nicholas Nethercote and Julian Seward. Valgrind: a frame-
work for heavyweight dynamic binary instrumentation.
In Proceedings of the 28th ACM SIGPLAN Conference
on Programming Language Design and Implementation,
PLDI ’07, page 89–100, New York, NY, USA, 2007. As-
sociation for Computing Machinery. doi:10.1145/1250734.
1250746.
49. DynamoRIO. Dynamorio trace format. https://dynamorio.
org/sec_drcachesim_format.html. Accessed: 2026-02-07.
50. Fabrice Bellard. QEMU, a fast and portable dynamic
translator. In USENIX annual technical conference,
FREENIX Track, volume 41, pages 10–5555. California,
USA, 2005. URL: https://www.usenix.org/legacy/event/
usenix05/tech/freenix/full_papers/bellard/bellard.pdf.
51. Santosh Pandey, Amir Yazdanbakhsh, and Hang Liu. Tao:
Re-thinking dl-based microarchitecture simulation. Pro-
ceedings of the ACM on Measurement and Analysis of
Computing Systems, 8(2):1–25, 2024. doi:10.1145/3656012.
52. Muhammad E. S. Elrabaa, Ayman Hroub, Muhamed F.
Mudawar, Amran Al-Aghbari, Mohammed Al-Asli, and
Ahmad Khayyat. A very fast trace-driven simulation plat-
form for chip-multiprocessors architectural explorations.
IEEE Transactions on Parallel and Distributed Systems,
28(11):3033–3045, 2017. doi:10.1109/TPDS.2017.2713782.
53. OpenXiangShan. XS-gem5. https://github.com/
OpenXiangShan/GEM5, 2020.
54. John L Henning. Spec cpu2006 benchmark descriptions.
ACM SIGARCH Computer Architecture News, 34(4):1–
17, 2006. doi:10.1145/1186736.1186737.
55. James Bucek, Klaus-Dieter Lange, and Jóakim
von Kistowski. SPEC CPU2017: Next-generation com-
pute benchmark. In Companion of the 2018 ACM/SPEC
International Conference on Performance Engineering,
pages 41–42, 2018. doi:10.1145/3185768.3185771.
56. OpenXiangShan. NEMU. https://github.com/
OpenXiangShan/NEMU, 2019.
57. Ahmad Yasin. A top-down method for performance anal-
ysis and counters architecture. In 2014 IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software (ISPASS), pages 35–44, 2014. doi:10.1109/
ISPASS.2014.6844459.
58. Andrej Karpathy. llama2.c: Inference Llama 2 in one file of
pure C. https://github.com/karpathy/llama2.c. Accessed:
2026-02-07.
59. RISC-V. RISC-V Instruction Set Manual. https://github.
com/riscv/riscv-isa-manual. Accessed: 2026-02-07.
60. Javier D. Bruguera. Low-latency and high-bandwidth
pipelined radix-64 division and square root unit. In
2022 IEEE 29th Symposium on Computer Arithmetic
(ARITH), pages 10–17, 2022. doi:10.1109/ARITH54963.
2022.00012.
61. Vivekananda M Vedula, Jacob A Abraham, Jayanta
Bhadra, and Raghuram Tupuri. A hierarchical test genera-
tion approach using program slicing techniques on hardware
description languages. Journal of Electronic Testing,
19:149–160, 2003. doi:10.1023/A:1022885523034.
62. Lingyi Liu and Shobha Vasudevan. Efficient validation in-
put generation in rtl by hybridized source code analysis. In
2011 Design, Automation & Test in Europe, pages 1–6,
2011. doi:10.1109/DATE.2011.5763253.
63. Biruk Mammo, Jim Larimer, Matthew Morgan, Dave Fan,
Eric Hennenhoefer, and Valeria Bertacco. Architectural
trace-based functional coverage for multiprocessor verifi-
cation. In 2012 13th International Workshop on Micro-
processor Test and Verification (MTV), pages 1–5, 2012.
doi:10.1109/MTV.2012.12.
64. Sotiris Apostolakis, Chris Kennelly, Xinliang David Li, and
Parthasarathy Ranganathan. Necro-reaper: Pruning away
dead memory traffic in warehouse-scale computers. In Pro-
ceedings of the 30th ACM International Conference on
Architectural Support for Programming Languages and
Operating Systems, Volume 2, ASPLOS ’25, page 689–703,
New York, NY, USA, 2025. Association for Computing
Machinery. doi:10.1145/3676641.3716007.
65. Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and
Timothy G. Rogers. Accel-sim: An extensible simulation
framework for validated gpu modeling. In 2020 ACM/IEEE
47th Annual International Symposium on Computer Ar-
chitecture (ISCA), pages 473–486, 2020. doi:10.1109/
ISCA45697.2020.00047.
66. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry
Wong, and Tor M. Aamodt. Analyzing cuda workloads
using a detailed gpu simulator. In 2009 IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software, pages 163–174, 2009. doi:10.1109/ISPASS.
2009.4919648.
67. Onur Mutlu, Hyesoon Kim, David N Armstrong, and Yale N
Patt. An analysis of the performance impact of wrong-
path memory references on out-of-order and runahead ex-
ecution processors. IEEE Transactions on Computers,
54(12):1556–1571, 2005. doi:10.1109/TC.2005.190.
68. Stijn Eyerman, Sam Van den Steen, Wim Heirman, and
Ibrahim Hur. Simulating wrong-path instructions in de-
coupled functional-first simulation. In 2023 IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software (ISPASS), pages 124–133. IEEE, 2023. doi:
10.1109/ISPASS57527.2023.00021.
69. Bhargav Reddy Godala, Sankara Prasad Ramesh, Krish-
nam Tibrewala, Chrysanthos Pepi, Gino Chacon, Svilen
Kanev, Gilles A Pokam, Daniel A Jiménez, Paul V Gratz,
and David I August. Correct wrong path. arXiv preprint
arXiv:2408.05912, 2024. URL: https://doi.org/10.48550/
arXiv.2408.05912.
70. Resit Sendag, Ayse Yilmazer, Joshua J. Yi, and Augus-
tus K. Uht. The impact of wrong-path memory refer-
ences in cache-coherent multiprocessor systems. Jour-
nal of Parallel and Distributed Computing, 67(12):1256–
1269, 2007. Best Paper Awards: 20th International
Parallel and Distributed Processing Symposium (IPDPS
2006). URL: https://www.sciencedirect.com/science/
article/pii/S0743731507000457, doi:10.1016/j.jpdc.2007.
03.005.
71. R. Sendag, A. Yilmazer, J.J. Yi, and A.K. Uht. Quantifying
and reducing the effects of wrong-path memory references
in cache-coherent multiprocessor systems. In Proceedings
20th IEEE International Parallel & Distributed Process-
ing Symposium, pages 10 pp.–, 2006. doi:10.1109/IPDPS.
2006.1639260.
72. Stephen R Goldschmidt and John L Hennessy. The
accuracy of trace-driven simulations of multiprocessors.
ACM SIGMETRICS Performance Evaluation Review,
21(1):146–157, 1993. doi:10.1145/166962.167001.
73. Karthik Sangaiah, Michael Lui, Radhika Jagtap,
Stephan Diestelhorst, Siddharth Nilakantan, Ankit
More, Baris Taskin, and Mark Hempstead. Synchrotrace:
Synchronization-aware architecture-agnostic traces for
lightweight multicore simulation of cmp and hpc work-
loads. ACM Trans. Archit. Code Optim., 15(1), March
2018. doi:10.1145/3158642.
74. Jian Weng, Boyang Han, Derui Gao, Ruijie Gao, Wanning
Zhang, An Zhong, Ceyu Xu, Jihao Xin, Yangzhixin Luo,
Lisa Wu Wills, and Marco Canini. Assassyn: A unified
abstraction for architectural simulation and implementa-
tion. In Proceedings of the 52nd Annual International
Symposium on Computer Architecture, ISCA ’25, page
1464–1479, New York, NY, USA, 2025. Association for
Computing Machinery. doi:10.1145/3695053.3731004.
75. Josué Feliu, Arthur Perais, Daniel A. Jiménez, and Alberto
Ros. Rebasing microarchitectural research with indus-
try traces. In 2023 IEEE International Symposium on
Workload Characterization (IISWC), pages 100–114, 2023.
doi:10.1109/IISWC59245.2023.00027.