EECS 4340 Final Review

Midterm — Spring 2025


EECS 4340 · Spring 2025

Historically numbered as EECS 4824.

Question 1 12 pts

Performance. MLPerf Inference: Tiny benchmark latencies are given for two vendors. Determine which vendor has better performance and by how much.

Benchmark use case     Dataset                  Model              Vendor A latency   Vendor B latency
Keyword spotting       Google Speech Commands   DS-CNN             5 ms               6 ms
Visual wake words      Visual Wake Words        MobileNetV1 0.25x  42 ms              56 ms
Image classification   CIFAR-10                 ResNet-8           90 ms              81 ms
Anomaly detection      ToyADMOS                 Deep AutoEncoder   165 ms             150 ms

Use performance as inverse latency. For the speedup of A over B:

speedup_A_over_B = latency_B / latency_A
Benchmark              A-over-B speedup
Keyword spotting       6 / 5 = 1.20
Visual wake words      56 / 42 = 1.33
Image classification   81 / 90 = 0.90
Anomaly detection      150 / 165 = 0.91

Use the geometric mean:

GM = (1.20 * 1.33 * 0.90 * 0.91)^(1/4) ~= 1.07

Answer: Vendor A achieves better performance by about 1.07x.

This treats performance as inverse latency and summarizes the heterogeneous benchmark ratios with a geometric mean, the standard summary statistic for normalized ratios.
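The per-benchmark ratios and their geometric mean can be verified with a few lines (a quick sanity check, not part of the original solution):

```python
# Question 1 check: speedup of A over B is latency_B / latency_A,
# summarized across benchmarks with a geometric mean.
latencies_ms = {
    "keyword spotting":     (5, 6),     # (vendor A, vendor B)
    "visual wake words":    (42, 56),
    "image classification": (90, 81),
    "anomaly detection":    (165, 150),
}
speedups = {name: b / a for name, (a, b) in latencies_ms.items()}

gm = 1.0
for s in speedups.values():
    gm *= s
gm **= 1 / len(speedups)      # fourth root of the product

for name, s in speedups.items():
    print(f"{name}: {s:.2f}")
print(f"geometric mean: {gm:.2f}")   # ~1.07
```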

Question 2(a) 3 pts

Reorder Buffer. How does a reorder buffer enhance an out-of-order pipeline like P6?

A reorder buffer supports out-of-order execution while preserving in-order retirement. It also supports precise exceptions/interrupts and speculative execution.

Question 2(b) 3 pts

Reorder Buffer. If adding a ROB to a basic Tomasulo pipeline, what new stages are required and what does each do?

Given original stages:

Fetch, Dispatch, Issue, Execute, Writeback

Replace/extend writeback with:

  • Complete / Write Result: mark the instruction as done and make the result available, often by writing it into the ROB or broadcasting it on the CDB.
  • Retire / Commit: commit instructions in program order, update the architectural register file, and free resources.

Question 2(c) 6 pts

Reorder Buffer. Stage progression for four memory instructions when the ROB and memory execute logic share hardware.

Assumptions:

  • Instruction and data memories are separate and can be accessed in the same cycle.
  • No data dependencies.
  • No branches.
  • Execute stage takes one cycle.
  • Memory execute and the final ROB/retire stage share hardware, so they cannot occur in the same cycle.

Instructions:

LOAD R1, 0[R2]
LOAD R4, 0[R2]
STORE R5, 0[R3]
STORE R6, 0[R7]

Extracted schedule:

Instruction        C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11
LOAD R1, 0[R2]     F    D    S    X    C    R
LOAD R4, 0[R2]          F    D    S    X    C    R
STORE R5, 0[R3]              F    D    S    -    -    X    C    R
STORE R6, 0[R7]                   F    D    S    -    -    X    C    R

Legend:

F = fetch, D = dispatch/decode, S = issue/schedule, X = execute,
C = complete, R = retire, - = stall/wait

This schedule reflects the structural hazard between memory execute and ROB retire.
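The schedule can be cross-checked with a small greedy model. This is a sketch under the stated assumptions only (one fetch per cycle, in-order retirement, memory execute and retire never share a cycle); the `schedule` helper and its bookkeeping are invented for illustration, not the course's simulator:

```python
STAGES = ["F", "D", "S", "X", "C", "R"]

def schedule(n_insts):
    """Return, per instruction, the cycle in which it occupies each stage."""
    done = [{} for _ in range(n_insts)]
    busy_x = set()   # cycles the shared memory-execute hardware is in use
    busy_r = set()   # cycles the shared retire hardware is in use
    for i in range(n_insts):
        cycle = i + 1                  # one instruction fetched per cycle
        for stage in STAGES:
            # Stall while the shared hardware is occupied (X vs R structural
            # hazard) or an older instruction has not yet retired.
            while ((stage == "X" and (cycle in busy_x or cycle in busy_r)) or
                   (stage == "R" and (cycle in busy_r or cycle in busy_x or
                                      (i > 0 and cycle <= done[i - 1]["R"])))):
                cycle += 1
            done[i][stage] = cycle
            if stage == "X":
                busy_x.add(cycle)
            elif stage == "R":
                busy_r.add(cycle)
            cycle += 1
    return done

for i, row in enumerate(schedule(4)):
    print(f"I{i + 1}: " + "  ".join(f"{s}@{c}" for s, c in row.items()))
```

Running it reproduces the table above: the two stores each wait two cycles before their memory execute because the loads are retiring on the shared hardware.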

Question 3(a) 3 pts

DVFS. Given:

P0 = C V0^2 f0
t0 = Ncycles / f0
E0 = P0 t0 = Ncycles C V0^2

If voltage scales by x and frequency scales by y, derive the scaled ratios for power, time, and energy.

Answer:

P / P0 = x^2 y
T / T0 = 1 / y
E / E0 = x^2

Derivation:

P = C (xV0)^2 (y f0) = x^2 y P0
T = Ncycles / (y f0) = T0 / y
E = P*T = (x^2 y P0)(T0/y) = x^2 E0
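The derivation can be spot-checked numerically; the constants C, V0, f0, and Ncycles below are arbitrary placeholders, since only the ratios matter:

```python
# Numeric check of the DVFS scaling ratios P/P0 = x^2 y, T/T0 = 1/y, E/E0 = x^2.
C, V0, f0, Ncycles = 1e-9, 1.0, 2e9, 1e6   # made-up illustrative values
x, y = 0.8, 1.25                           # example voltage/frequency scale factors

P0 = C * V0**2 * f0
T0 = Ncycles / f0
E0 = P0 * T0

P = C * (x * V0)**2 * (y * f0)   # scaled power
T = Ncycles / (y * f0)           # scaled execution time
E = P * T                        # scaled energy

assert abs(P / P0 - x**2 * y) < 1e-12
assert abs(T / T0 - 1 / y) < 1e-12
assert abs(E / E0 - x**2) < 1e-12
print(P / P0, T / T0, E / E0)
```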

Question 3(b) 6 pts

DVFS. Multi-core DVFS table under power and energy budgets.

Assumptions:

  • Program parallelizes perfectly across Ncores.
  • Same voltage/frequency scale factor is used for each core, with x = y.
  • Each core consumes the same power at a given VF point.
  • Power budget is P0; energy budget is E0.

Fill in the table.

Useful formulas:

P_total / P0 = Ncores * x^3
E_total / E0 = x^2
Ncores   Scale factor x = y   P/P0                 Under power budget?   E/E0              Under energy budget?
1        1.00                 1.000                Yes                   1.000             Yes
2        0.75                 2 * 0.75^3 = 0.844   Yes                   0.75^2 = 0.5625   Yes
4        0.75                 4 * 0.75^3 = 1.688   No                    0.75^2 = 0.5625   Yes
8        0.50                 8 * 0.50^3 = 1.000   Yes                   0.50^2 = 0.25     Yes

This is the canonical DVFS tradeoff: scaling cores up while voltage-scaling down keeps power flat and reduces energy quadratically.
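The table rows follow mechanically from the two formulas; a short script can regenerate them (a configuration fits its budget when the ratio is at most 1):

```python
# Recompute the multi-core DVFS table: with x = y,
# P_total/P0 = Ncores * x**3 and E_total/E0 = x**2.
configs = [(1, 1.00), (2, 0.75), (4, 0.75), (8, 0.50)]
results = []
for ncores, x in configs:
    p_ratio = ncores * x**3
    e_ratio = x**2
    results.append((ncores, x, p_ratio, e_ratio))
    print(f"{ncores} cores, x = {x:.2f}: "
          f"P/P0 = {p_ratio:.3f} ({'yes' if p_ratio <= 1 else 'no'}), "
          f"E/E0 = {e_ratio:.4f} ({'yes' if e_ratio <= 1 else 'no'})")
```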

Question 3(c) 3 pts

DVFS. Per-stage DVFS: when could it help, and what are the challenges?

Pros:

  • Can reduce power by scaling down stages that are idle, stalled, or not on the critical path.

Cons:

  • Requires extra synchronization/isolation hardware between stages at different voltage/frequency levels.
  • Adds pipeline complexity and overhead.
  • Frequency/voltage scaling transition time may reduce or eliminate the benefit.

This is essentially fine-grained DVFS applied at sub-pipeline granularity.

Question 4(1) 1 pt

Data Hazards. A RISC in-order scalar processor has five stages:

Fetch, Decode, Execute, Memory, Writeback

Branches are predicted not taken and are resolved in the Execute stage. There is a separate fully pipelined floating-point unit.

Floating-point instruction latencies:

Instruction            Meaning                                        Latency
fld op1, op2           load double from address in op2 into op1       1 cycle
fsd op1, op2           store value in op1 to address in op2           1 cycle
fadd.d op1, op2, op3   floating-point add of op2 and op3 into op1     3 cycles

Loop:

LOOP:
1: fld f0, 0(x1)
2: addi x1, x1, -8
3: fadd.d f4, f0, f2
4: fsd f4, 8(x1)
5: bne x1, x2, LOOP

What is the total stall time for the basic block LOOP and what type of dependence causes it?

Answer: 2 cycles of stall due to a RAW data dependence from fadd.d to fsd on register f4. As a true (flow) dependence, it cannot be removed by register renaming — the consumer must wait for the producer’s value.

Question 4(2) 3 pts

Data Hazards. Using the same loop and pipeline as in 4(1), how many cycles are required for one and two iterations of LOOP to complete?

Answer:

1 iteration: 11 cycles
2 iterations: 20 cycles

Reasoning:

  • The first loop iteration has two stalls from the fadd.d -> fsd RAW dependence (assuming no MEM/WB forwarding into the FSD’s data input).
  • The branch is predicted not taken, but the loop branch is taken until the final iteration. Because the branch resolves in Execute, the next iteration's fetch begins only after the target is known, adding a control-hazard penalty for each taken branch that was mispredicted as not taken.

Question 4(3) 2 pts

Data Hazards. Consider the unrolled loop below. Does the processor need to stall in LOOP_UNROLLED?

LOOP_UNROLLED:
1: fld f0, 0(x1)
2: fld f6, -8(x1)
3: fadd.d f4, f0, f2
4: fadd.d f8, f6, f2
5: fsd f4, 0(x1)
6: fsd f8, -8(x1)
7: addi x1, x1, -16
8: bne x1, x2, LOOP_UNROLLED

Answer: Yes. The stores still depend on the floating-point add results — these are RAW dependences:

3 -> 5 on f4
4 -> 6 on f8

Question 4(4) 2 pts

Data Hazards. In the unrolled loop above, which instruction can be moved to improve performance while retaining program semantics?

Handwritten answer: Move instruction 7 between instructions 4 and 5.

Checked answer: The intended movable instruction is:

7: addi x1, x1, -16

Moving it into the gap between the floating-point adds and the stores can fill delay slots from the RAW stall, but only if the store offsets are adjusted:

fadd.d f4, f0, f2
fadd.d f8, f6, f2
addi x1, x1, -16
fsd f4, 16(x1) # adjusted from 0(x1)
fsd f8, 8(x1) # adjusted from -8(x1)
bne x1, x2, LOOP_UNROLLED
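A quick check confirms the adjusted offsets preserve the store addresses; the starting value of x1 is arbitrary:

```python
# Verify the rescheduled stores hit the same addresses as the original order.
x1 = 100   # made-up base address

# Original order: fsd f4, 0(x1) and fsd f8, -8(x1), then addi x1, x1, -16
orig_addrs = [x1 + 0, x1 - 8]

# Rescheduled: addi x1, x1, -16 first, then stores with adjusted offsets
x1_new = x1 - 16
new_addrs = [x1_new + 16, x1_new + 8]

assert orig_addrs == new_addrs
print(orig_addrs, new_addrs)   # same addresses either way
```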

Question 5 16 pts

Tomasulo. You have the Tomasulo map table/register-alias table and reservation stations. All reservation stations were empty before the last four instructions. Reconstruct the last four instructions in program order.

Map table:

Reg   Tag
x0    -
x1    t0
x2    t5
x3    -
x4    t6

Register file:

Reg   Value
x0    0
x1    3
x2    7
x3    9
x4    0

Multiplier reservation station:

ID   Source 1 tag   Source 1 value   Source 2 tag   Source 2 value
t0   t4             -                t4             -

Adder reservation station:

ID   Source 1 tag   Source 1 value   Source 2 tag   Source 2 value
t4   -              3                -              7
t5   t0             -                -              9
t6   t4             -                t5             -

Reconstructed answer:

Program order   Operation   Dest   Source 1   Source 2   Reasoning
1               add         x4     x1         x2         Tag t4 holds values 3 and 7, the original RF values of x1 and x2.
2               mul         x1     x4         x4         Tag t0 is the current x1 and waits on t4 for both operands.
3               add         x2     x1         x3         Tag t5 is the current x2 and waits on t0, plus value 9 from x3.
4               add         x4     x4         x2         Tag t6 is the current x4 and waits on t4 and t5.

(Register-register operations are written as add/mul rather than addi/mul-immediate, since both sources are registers.)

Equivalent assembly:

1: add x4, x1, x2
2: mul x1, x4, x4
3: add x2, x1, x3
4: add x4, x4, x2

This problem exercises register renaming via the map table: each new write to an architectural register allocates a fresh reservation station tag, so even though x4 is written twice, the second write uses tag t6 and the first (t4) lives on as a source for younger instructions. The Tomasulo machinery dissolves the WAW on x4 by renaming, while preserving the RAW chains via tag-matching when results are broadcast on the CDB.
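The renaming steps can be replayed with a small sketch. Assumptions are labeled in the comments: tags are preassigned to match the tables (real hardware allocates them from a free list), and the `dispatch`/`rename` helpers are invented for illustration:

```python
# Replay Tomasulo renaming for the reconstructed program and check that it
# reproduces the map table and reservation-station contents.
regfile = {"x0": 0, "x1": 3, "x2": 7, "x3": 9, "x4": 0}
map_table = {}   # architectural reg -> producing tag (absent = value in regfile)
stations = {}    # tag -> (src1_tag, src1_val, src2_tag, src2_val)

def rename(src):
    """Return (tag, value): a tag if a producer is pending, else the RF value."""
    if src in map_table:
        return map_table[src], None
    return None, regfile[src]

def dispatch(tag, dest, src1, src2):
    t1, v1 = rename(src1)
    t2, v2 = rename(src2)
    stations[tag] = (t1, v1, t2, v2)
    map_table[dest] = tag   # later readers of dest now wait on this tag

# The four reconstructed instructions, in program order:
dispatch("t4", "x4", "x1", "x2")   # add x4, x1, x2
dispatch("t0", "x1", "x4", "x4")   # mul x1, x4, x4
dispatch("t5", "x2", "x1", "x3")   # add x2, x1, x3
dispatch("t6", "x4", "x4", "x2")   # add x4, x4, x2

assert map_table == {"x4": "t6", "x1": "t0", "x2": "t5"}
assert stations["t4"] == (None, 3, None, 7)    # captured x1, x2 values
assert stations["t0"] == ("t4", None, "t4", None)
assert stations["t5"] == ("t0", None, None, 9)
assert stations["t6"] == ("t4", None, "t5", None)
print("tables reproduced")
```

Note how the second write to x4 simply overwrites the map-table entry with t6 while instruction 4 keeps waiting on t4, which is exactly how renaming dissolves the WAW hazard.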