EECS 4340 Final Review

Midterm — Spring 2025


EECS 4340 · Spring 2025

Historically numbered as EECS 4824.

Question 1 12 pts

Performance. MLPerf Inference: Tiny benchmark latencies are given for two vendors. Determine which vendor has better performance and by how much.

Benchmark use case     Dataset                  Model              Vendor A latency   Vendor B latency
Keyword spotting       Google Speech Commands   DS-CNN             5 ms               6 ms
Visual wake words      Visual Wake Words        MobileNetV1 0.25x  42 ms              56 ms
Image classification   CIFAR-10                 ResNet-8           90 ms              81 ms
Anomaly detection      ToyADMOS                 Deep AutoEncoder   165 ms             150 ms

Use performance as inverse latency. For the speedup of A over B:

speedup_A_over_B = latency_B / latency_A
Benchmark              A-over-B speedup
Keyword spotting       6 / 5 = 1.20
Visual wake words      56 / 42 = 1.33
Image classification   81 / 90 = 0.90
Anomaly detection      150 / 165 = 0.91

Use the geometric mean:

GM = (1.20 * 1.33 * 0.90 * 0.91)^(1/4) ~= 1.07

Answer: Vendor A achieves better performance by about 1.07x.

This treats performance as inverse latency and summarizes the heterogeneous benchmark ratios with a geometric mean, the standard summary statistic for normalized ratios.
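The per-benchmark ratios and their geometric mean can be verified with a few lines (a quick sanity check, not part of the original solution):

```python
# Question 1 check: speedup of A over B is latency_B / latency_A,
# summarized across benchmarks with a geometric mean.
latencies_ms = {
    "keyword spotting":     (5, 6),     # (vendor A, vendor B)
    "visual wake words":    (42, 56),
    "image classification": (90, 81),
    "anomaly detection":    (165, 150),
}
speedups = {name: b / a for name, (a, b) in latencies_ms.items()}

gm = 1.0
for s in speedups.values():
    gm *= s
gm **= 1 / len(speedups)      # fourth root of the product

for name, s in speedups.items():
    print(f"{name}: {s:.2f}")
print(f"geometric mean: {gm:.2f}")   # ~1.07
```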

Question 2(a) 3 pts

Reorder Buffer. How does a reorder buffer enhance an out-of-order pipeline like P6?

A reorder buffer supports out-of-order execution while preserving in-order retirement. It also supports precise exceptions/interrupts and speculative execution.

Question 2(b) 3 pts

Reorder Buffer. If adding a ROB to a basic Tomasulo pipeline, what new stages are required and what does each do?

Given original stages:

Fetch, Dispatch, Issue, Execute, Writeback

Replace/extend writeback with:

  • Complete / Write Result: mark the instruction as done and make the result available, often by writing it into the ROB or broadcasting it on the CDB.
  • Retire / Commit: commit instructions in program order, update the architectural register file, and free resources.

Question 2(c) 6 pts

Reorder Buffer. Stage progression for four memory instructions when the ROB and memory execute logic share hardware.

Assumptions:

  • Instruction and data memories are separate and can be accessed in the same cycle.
  • No data dependencies.
  • No branches.
  • Execute stage takes one cycle.
  • Memory execute and the final ROB/retire stage share hardware, so they cannot occur in the same cycle.

Instructions:

LOAD R1, 0[R2]
LOAD R4, 0[R2]
STORE R5, 0[R3]
STORE R6, 0[R7]

Extracted schedule:

Instruction        C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11
LOAD R1, 0[R2]     F    D    S    X    C    R
LOAD R4, 0[R2]          F    D    S    X    C    R
STORE R5, 0[R3]              F    D    S    -    -    X    C    R
STORE R6, 0[R7]                   F    D    S    -    -    X    C    R

Legend:

F = fetch, D = dispatch/decode, S = issue/schedule, X = execute,
C = complete, R = retire, - = stall/wait

This schedule reflects the structural hazard between memory execute and ROB retire.
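The schedule can be cross-checked with a small greedy model. This is a sketch under the stated assumptions only (one fetch per cycle, in-order retirement, memory execute and retire never share a cycle); the `schedule` helper and its bookkeeping are invented for illustration, not the course's simulator:

```python
STAGES = ["F", "D", "S", "X", "C", "R"]

def schedule(n_insts):
    """Return, per instruction, the cycle in which it occupies each stage."""
    done = [{} for _ in range(n_insts)]
    busy_x = set()   # cycles the shared memory-execute hardware is in use
    busy_r = set()   # cycles the shared retire hardware is in use
    for i in range(n_insts):
        cycle = i + 1                  # one instruction fetched per cycle
        for stage in STAGES:
            # Stall while the shared hardware is occupied (X vs R structural
            # hazard) or an older instruction has not yet retired.
            while ((stage == "X" and (cycle in busy_x or cycle in busy_r)) or
                   (stage == "R" and (cycle in busy_r or cycle in busy_x or
                                      (i > 0 and cycle <= done[i - 1]["R"])))):
                cycle += 1
            done[i][stage] = cycle
            if stage == "X":
                busy_x.add(cycle)
            elif stage == "R":
                busy_r.add(cycle)
            cycle += 1
    return done

for i, row in enumerate(schedule(4)):
    print(f"I{i + 1}: " + "  ".join(f"{s}@{c}" for s, c in row.items()))
```

Running it reproduces the table above: the two stores each wait two cycles before their memory execute because the loads are retiring on the shared hardware.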

Question 3(a) 3 pts

DVFS. Given:

P0 = C V0^2 f0
t0 = Ncycles / f0
E0 = P0 t0 = Ncycles C V0^2

If voltage scales by x and frequency scales by y, derive the scaled ratios for power, time, and energy.

Answer:

P / P0 = x^2 y
T / T0 = 1 / y
E / E0 = x^2

Derivation:

P = C (xV0)^2 (y f0) = x^2 y P0
T = Ncycles / (y f0) = T0 / y
E = P*T = (x^2 y P0)(T0/y) = x^2 E0
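The derivation can be spot-checked numerically; the constants C, V0, f0, and Ncycles below are arbitrary placeholders, since only the ratios matter:

```python
# Numeric check of the DVFS scaling ratios P/P0 = x^2 y, T/T0 = 1/y, E/E0 = x^2.
C, V0, f0, Ncycles = 1e-9, 1.0, 2e9, 1e6   # made-up illustrative values
x, y = 0.8, 1.25                           # example voltage/frequency scale factors

P0 = C * V0**2 * f0
T0 = Ncycles / f0
E0 = P0 * T0

P = C * (x * V0)**2 * (y * f0)   # scaled power
T = Ncycles / (y * f0)           # scaled execution time
E = P * T                        # scaled energy

assert abs(P / P0 - x**2 * y) < 1e-12
assert abs(T / T0 - 1 / y) < 1e-12
assert abs(E / E0 - x**2) < 1e-12
print(P / P0, T / T0, E / E0)
```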

Question 3(b) 6 pts

DVFS. Multi-core DVFS table under power and energy budgets.

Assumptions:

  • Program parallelizes perfectly across Ncores.
  • Same voltage/frequency scale factor is used for each core, with x = y.
  • Each core consumes the same power at a given VF point.
  • Power budget is P0; energy budget is E0.

Fill in the table.

Useful formulas:

P_total / P0 = Ncores * x^3
E_total / E0 = x^2
Ncores   Scale factor x = y   P/P0                 Under power budget?   E/E0              Under energy budget?
1        1.00                 1.000                Yes                   1.000             Yes
2        0.75                 2 * 0.75^3 = 0.844   Yes                   0.75^2 = 0.5625   Yes
4        0.75                 4 * 0.75^3 = 1.688   No                    0.75^2 = 0.5625   Yes
8        0.50                 8 * 0.50^3 = 1.000   Yes                   0.50^2 = 0.25     Yes

This is the canonical DVFS tradeoff: scaling cores up while voltage-scaling down keeps power flat and reduces energy quadratically.
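The table rows follow mechanically from the two formulas; a short script can regenerate them (a configuration fits its budget when the ratio is at most 1):

```python
# Recompute the multi-core DVFS table: with x = y,
# P_total/P0 = Ncores * x**3 and E_total/E0 = x**2.
configs = [(1, 1.00), (2, 0.75), (4, 0.75), (8, 0.50)]
results = []
for ncores, x in configs:
    p_ratio = ncores * x**3
    e_ratio = x**2
    results.append((ncores, x, p_ratio, e_ratio))
    print(f"{ncores} cores, x = {x:.2f}: "
          f"P/P0 = {p_ratio:.3f} ({'yes' if p_ratio <= 1 else 'no'}), "
          f"E/E0 = {e_ratio:.4f} ({'yes' if e_ratio <= 1 else 'no'})")
```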

Question 3(c) 3 pts

DVFS. Per-stage DVFS: when could it help, and what are the challenges?

Pros:

  • Can reduce power by scaling down stages that are idle, stalled, or not on the critical path.

Cons:

  • Requires extra synchronization/isolation hardware between stages at different voltage/frequency levels.
  • Adds pipeline complexity and overhead.
  • Frequency/voltage scaling transition time may reduce or eliminate the benefit.

This is essentially fine-grained DVFS applied at sub-pipeline granularity.

Question 4(1) 1 pt

Data Hazards. A RISC in-order scalar processor has five stages:

Fetch, Decode, Execute, Memory, Writeback

Branches are predicted not taken and are resolved in the Execute stage. There is a separate fully pipelined floating-point unit.

Floating-point instruction latencies:

Instruction            Meaning                                        Latency
fld op1, op2           load double from address in op2 into op1       1 cycle
fsd op1, op2           store value in op1 to address in op2           1 cycle
fadd.d op1, op2, op3   floating-point add of op2 and op3 into op1     3 cycles

Loop:

LOOP:
1: fld f0, 0(x1)
2: addi x1, x1, -8
3: fadd.d f4, f0, f2
4: fsd f4, 8(x1)
5: bne x1, x2, LOOP

What is the total stall time for the basic block LOOP and what type of dependence causes it?

Answer: 2 cycles of stall due to a RAW data dependence from fadd.d to fsd on register f4. As a true (flow) dependence, it cannot be removed by register renaming — the consumer must wait for the producer’s value.

Question 4(2) 3 pts

Data Hazards. Using the same loop and pipeline as in 4(1), how many cycles are required for one and two iterations of LOOP to complete?

Answer:

1 iteration: 11 cycles
2 iterations: 20 cycles

Reasoning:

  • The first loop iteration has two stalls from the fadd.d -> fsd RAW dependence (assuming no MEM/WB forwarding into the FSD’s data input).
  • The branch is predicted not taken, but the loop branch is taken until the final iteration. Because the branch resolves in Execute, the next iteration's fetch begins only after the target is known, adding a control-hazard penalty for each taken branch that was mispredicted as not taken.

Question 4(3) 2 pts

Data Hazards. Consider the unrolled loop below. Does the processor need to stall in LOOP_UNROLLED?

LOOP_UNROLLED:
1: fld f0, 0(x1)
2: fld f6, -8(x1)
3: fadd.d f4, f0, f2
4: fadd.d f8, f6, f2
5: fsd f4, 0(x1)
6: fsd f8, -8(x1)
7: addi x1, x1, -16
8: bne x1, x2, LOOP_UNROLLED

Answer: Yes. The stores still depend on the floating-point add results — these are RAW dependences:

3 -> 5 on f4
4 -> 6 on f8

Question 4(4) 2 pts

Data Hazards. In the unrolled loop above, which instruction can be moved to improve performance while retaining program semantics?

Handwritten answer: Move instruction 7 between instructions 4 and 5.

Checked answer: The intended movable instruction is:

7: addi x1, x1, -16

Moving it into the gap between the floating-point adds and the stores can fill delay slots from the RAW stall, but only if the store offsets are adjusted:

fadd.d f4, f0, f2
fadd.d f8, f6, f2
addi x1, x1, -16
fsd f4, 16(x1) # adjusted from 0(x1)
fsd f8, 8(x1) # adjusted from -8(x1)
bne x1, x2, LOOP_UNROLLED
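A quick check confirms the adjusted offsets preserve the store addresses; the starting value of x1 is arbitrary:

```python
# Verify the rescheduled stores hit the same addresses as the original order.
x1 = 100   # made-up base address

# Original order: fsd f4, 0(x1) and fsd f8, -8(x1), then addi x1, x1, -16
orig_addrs = [x1 + 0, x1 - 8]

# Rescheduled: addi x1, x1, -16 first, then stores with adjusted offsets
x1_new = x1 - 16
new_addrs = [x1_new + 16, x1_new + 8]

assert orig_addrs == new_addrs
print(orig_addrs, new_addrs)   # same addresses either way
```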

Question 5 16 pts

Tomasulo. You have the Tomasulo map table/register-alias table and reservation stations. All reservation stations were empty before the last four instructions. Reconstruct the last four instructions in program order.

Map table:

Reg   Tag
x0    -
x1    t0
x2    t5
x3    -
x4    t6

Register file:

Reg   Value
x0    0
x1    3
x2    7
x3    9
x4    0

Multiplier reservation station:

ID   Source 1 tag   Source 1 value   Source 2 tag   Source 2 value
t0   t4             -                t4             -

Adder reservation station:

ID   Source 1 tag   Source 1 value   Source 2 tag   Source 2 value
t4   -              3                -              7
t5   t0             -                -              9
t6   t4             -                t5             -

Reconstructed answer:

Program order   Operation   Dest   Source 1   Source 2   Reasoning
1               add         x4     x1         x2         Tag t4 holds values 3 and 7, the original RF values of x1 and x2.
2               mul         x1     x4         x4         Tag t0 is the current x1 and waits on t4 for both operands.
3               add         x2     x1         x3         Tag t5 is the current x2 and waits on t0, plus value 9 from x3.
4               add         x4     x4         x2         Tag t6 is the current x4 and waits on t4 and t5.

(Register-register operations are written as add/mul rather than addi/mul-immediate, since both sources are registers.)

Equivalent assembly:

1: add x4, x1, x2
2: mul x1, x4, x4
3: add x2, x1, x3
4: add x4, x4, x2

This problem exercises register renaming via the map table: each new write to an architectural register allocates a fresh reservation station tag, so even though x4 is written twice, the second write uses tag t6 and the first (t4) lives on as a source for younger instructions. The Tomasulo machinery dissolves the WAW on x4 by renaming, while preserving the RAW chains via tag-matching when results are broadcast on the CDB.
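The renaming steps can be replayed with a small sketch. Assumptions are labeled in the comments: tags are preassigned to match the tables (real hardware allocates them from a free list), and the `dispatch`/`rename` helpers are invented for illustration:

```python
# Replay Tomasulo renaming for the reconstructed program and check that it
# reproduces the map table and reservation-station contents.
regfile = {"x0": 0, "x1": 3, "x2": 7, "x3": 9, "x4": 0}
map_table = {}   # architectural reg -> producing tag (absent = value in regfile)
stations = {}    # tag -> (src1_tag, src1_val, src2_tag, src2_val)

def rename(src):
    """Return (tag, value): a tag if a producer is pending, else the RF value."""
    if src in map_table:
        return map_table[src], None
    return None, regfile[src]

def dispatch(tag, dest, src1, src2):
    t1, v1 = rename(src1)
    t2, v2 = rename(src2)
    stations[tag] = (t1, v1, t2, v2)
    map_table[dest] = tag   # later readers of dest now wait on this tag

# The four reconstructed instructions, in program order:
dispatch("t4", "x4", "x1", "x2")   # add x4, x1, x2
dispatch("t0", "x1", "x4", "x4")   # mul x1, x4, x4
dispatch("t5", "x2", "x1", "x3")   # add x2, x1, x3
dispatch("t6", "x4", "x4", "x2")   # add x4, x4, x2

assert map_table == {"x4": "t6", "x1": "t0", "x2": "t5"}
assert stations["t4"] == (None, 3, None, 7)    # captured x1, x2 values
assert stations["t0"] == ("t4", None, "t4", None)
assert stations["t5"] == ("t0", None, None, 9)
assert stations["t6"] == ("t4", None, "t5", None)
print("tables reproduced")
```

Note how the second write to x4 simply overwrites the map-table entry with t6 while instruction 4 keeps waiting on t4, which is exactly how renaming dissolves the WAW hazard.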