Midterm — Spring 2024 | EECS 4340 Final Review

Question 1 12 pts

In-Order Pipelines. A processor has expected CPI = 1 for programs with no hazards. Caches remove the structural memory hazard. Branches are converted to predication, removing the control hazard. For each program, measure expected CPI.

Load (%)	Store (%)	ALU (%)	% of all instructions dependent on the instruction in front of them	Expected CPI
15	15	70	10	?
25	25	50	20	?
40	40	20	30	?

Only a load followed immediately by a dependent instruction causes the remaining one-cycle load-use stall (a RAW hazard not coverable by forwarding). Therefore:

Expected CPI = 1 + LoadFraction * DependentFraction * 1-cycle stall

Load fraction	Dependent fraction	CPI calculation	Expected CPI
0.15	0.10	`1 + 0.15 * 0.10`	1.015
0.25	0.20	`1 + 0.25 * 0.20`	1.05
0.40	0.30	`1 + 0.40 * 0.30`	1.12

Question 2(a) 4 pts

Instruction-Level Parallelism. Given the instruction sequence:

I1: R2 = R6 + R3
I2: R1 = R7 + R3
I3: R2 = R5 * R4
I4: R1 = R4 + R6
I5: R2 = R2 + 1
I6: R5 = R7 + R2
I7: R1 = R3 * R2

List all RAW, WAW, and WAR hazards.

Hazard type	Hazards
RAW	`I3 -> I5`, `I5 -> I6`, `I5 -> I7`
WAW	`I1 -> I3`, `I2 -> I4`, `I3 -> I5`, `I4 -> I7`
WAR	`I3 -> I6`

Question 2(b) 4 pts

For scoreboard scheduling, draw the control-flow graph: explicit checks/stalls are needed for RAW, WAW, and WAR hazards.

Edge list:

I1 -> I3
I3 -> I5
I5 -> I6
I5 -> I7
I2 -> I4
I4 -> I7
I3 -> I6

One way to view the graph:

I1 -> I3 -> I5 -> I6
            \
             -> I7
I2 -> I4 --------^
I3 -> I6   (WAR edge)

Question 2(c) 3 pts

What is the ILP for scoreboard scheduling with infinite resources?

Longest dependence chain length is 4 instructions, e.g.:

I1 -> I3 -> I5 -> I6

So:

ILP = total instructions / critical path length = 7 / 4 = 1.75

Question 2(d) 4 pts

Draw the dependence graph for Tomasulo scheduling. With Tomasulo / register renaming, WAW and WAR false dependences are removed; only RAW dependences remain.

Edge list:

I3 -> I5
I5 -> I6
I5 -> I7

Graph:

I1    I2    I4    independent
I3 -> I5 -> I6
       \
        -> I7

Question 2(e) 3 pts

What is the ILP for Tomasulo scheduling with infinite resources?

Longest RAW chain length is 3 instructions:

I3 -> I5 -> I6

or:

I3 -> I5 -> I7

So:

ILP = 7 / 3 ~= 2.333

Question 3(i) 5 pts

Finite State Machines. The provided SystemVerilog module implements a 2-bit saturating-counter branch predictor. The states are:

STRONGLY_NOT_TAKEN = 2'b00
WEAKLY_NOT_TAKEN   = 2'b01
WEAKLY_TAKEN       = 2'b10
STRONGLY_TAKEN     = 2'b11

predict_taken is 1 in the weakly-taken or strongly-taken states. Draw the FSM diagram.

States and outputs:

State	Encoding	Output `predict_taken`
Strongly not taken	00	0
Weakly not taken	01	0
Weakly taken	10	1
Strongly taken	11	1

Transitions:

Current state	If `branch_taken = 0`	If `branch_taken = 1`
Strongly not taken	Strongly not taken	Weakly not taken
Weakly not taken	Strongly not taken	Weakly taken
Weakly taken	Weakly not taken	Strongly taken
Strongly taken	Weakly taken	Strongly taken

Reset goes to strongly not taken.

Mermaid diagram equivalent:

stateDiagram-v2
    [*] --> SNT: reset
    SNT: Strongly Not Taken / output 0
    WNT: Weakly Not Taken / output 0
    WT: Weakly Taken / output 1
    ST: Strongly Taken / output 1

    SNT --> SNT: 0
    SNT --> WNT: 1
    WNT --> SNT: 0
    WNT --> WT: 1
    WT --> WNT: 0
    WT --> ST: 1
    ST --> WT: 0
    ST --> ST: 1

Question 3(ii) 5 pts OCR

Complete the timing diagram for the 2-bit saturating counter branch predictor given the supplied reset and branch_taken waveform.

The handwritten counter-state sequence is:

XX -> 00 -> 00 -> 01 -> 10 -> 11 -> 10 -> 01

predict_taken is 1 when the counter state is 10 or 11, and 0 otherwise.

Reading the handwritten waveform: before reset deasserts, the counter is unknown (XX). Reset clamps it to STRONGLY_NOT_TAKEN (00). Each subsequent rising edge of the clock advances the state per the FSM in 3(i) using that cycle’s branch_taken:

Cycle	`branch_taken`	Next state	`predict_taken`
reset	—	`00`	0
1	0	`00`	0
2	1	`01`	0
3	1	`10`	1
4	1	`11`	1
5	0	`10`	1
6	0	`01`	0

So the state trace is XX → 00 → 00 → 01 → 10 → 11 → 10 → 01, and predict_taken is high only while the state is 10 or 11, matching the 2-bit counter semantics.

Question 4(a) 5 pts

Performance. A program has 20% sequential and 80% parallel work. An accelerator speeds up the parallel part by 4x. Compute the overall speedup.

Using Amdahl’s law:

Speedup = 1 / (sequential_fraction + parallel_fraction / accelerator_speedup)
        = 1 / (0.2 + 0.8 / 4)
        = 1 / 0.4
        = 2.5x

As percent improvement over baseline:

(2.5 - 1) * 100% = 150% faster

Question 4(b) 5 pts

A program has 40% sequential and 60% parallel work. An accelerator speeds up the parallel part by 6x. Compute the overall speedup.

Speedup = 1 / (0.4 + 0.6 / 6)
        = 1 / 0.5
        = 2.0x

As percent improvement over baseline:

(2.0 - 1) * 100% = 100% faster

Question 4(c) 5 pts

Which program has better area computing efficiency? Assumptions:

Both programs have the same baseline area without accelerators.
Accelerator A gives 2.5x speedup with 150% area increase, so total area is 2.5x baseline.
Accelerator B gives 2.0x speedup with 100% area increase, so total area is 2.0x baseline.

Area computing efficiency can be compared as:

ACE = speedup / area_factor

For program A:

ACE_A = 2.5 / 2.5 = 1

For program B:

ACE_B = 2.0 / 2.0 = 1

Answer: Programs A and B have the same area computing efficiency.