Skip to content
EECS 4340 Final Review

Midterm — Spring 2024

Midterm — Spring 2024

EECS 4340 · Spring 2024

Historically numbered as EECS 4824.

Question 1 12 pts

In-Order Pipelines. A processor has expected CPI = 1 for programs with no hazards. Caches remove the structural memory hazard. Branches are converted to predication, removing the control hazard. For each program, measure expected CPI.

Load (%)Store (%)ALU (%)% of all instructions dependent on the instruction in front of themExpected CPI
15157010?
25255020?
40402030?

Only a load followed immediately by a dependent instruction causes the remaining one-cycle load-use stall (a RAW hazard not coverable by forwarding). Therefore:

Expected CPI = 1 + LoadFraction * DependentFraction * 1-cycle stall
Load fractionDependent fractionCPI calculationExpected CPI
0.150.101 + 0.15 * 0.101.015
0.250.201 + 0.25 * 0.201.05
0.400.301 + 0.40 * 0.301.12

Question 2(a) 4 pts

Instruction-Level Parallelism. Given the instruction sequence:

I1: R2 = R6 + R3
I2: R1 = R7 + R3
I3: R2 = R5 * R4
I4: R1 = R4 + R6
I5: R2 = R2 + 1
I6: R5 = R7 + R2
I7: R1 = R3 * R2

List all RAW, WAW, and WAR hazards.

Hazard typeHazards
RAWI3 -> I5, I5 -> I6, I5 -> I7
WAWI1 -> I3, I2 -> I4, I3 -> I5, I4 -> I7
WARI3 -> I6

Question 2(b) 4 pts

For scoreboard scheduling, draw the control-flow graph: explicit checks/stalls are needed for RAW, WAW, and WAR hazards.

Edge list:

I1 -> I3
I3 -> I5
I5 -> I6
I5 -> I7
I2 -> I4
I4 -> I7
I3 -> I6

One way to view the graph:

I1 -> I3 -> I5 -> I6
\
-> I7
I2 -> I4 --------^
I3 -> I6 (WAR edge)

Question 2(c) 3 pts

What is the ILP for scoreboard scheduling with infinite resources?

Longest dependence chain length is 4 instructions, e.g.:

I1 -> I3 -> I5 -> I6

So:

ILP = total instructions / critical path length = 7 / 4 = 1.75

Question 2(d) 4 pts

Draw the dependence graph for Tomasulo scheduling. With Tomasulo / register renaming, WAW and WAR false dependences are removed; only RAW dependences remain.

Edge list:

I3 -> I5
I5 -> I6
I5 -> I7

Graph:

I1 I2 I4 independent
I3 -> I5 -> I6
\
-> I7

Question 2(e) 3 pts

What is the ILP for Tomasulo scheduling with infinite resources?

Longest RAW chain length is 3 instructions:

I3 -> I5 -> I6

or:

I3 -> I5 -> I7

So:

ILP = 7 / 3 ~= 2.333

Question 3(i) 5 pts

Finite State Machines. The provided SystemVerilog module implements a 2-bit saturating-counter branch predictor. The states are:

STRONGLY_NOT_TAKEN = 2'b00
WEAKLY_NOT_TAKEN = 2'b01
WEAKLY_TAKEN = 2'b10
STRONGLY_TAKEN = 2'b11

predict_taken is 1 in the weakly-taken or strongly-taken states. Draw the FSM diagram.

States and outputs:

StateEncodingOutput predict_taken
Strongly not taken000
Weakly not taken010
Weakly taken101
Strongly taken111

Transitions:

Current stateIf branch_taken = 0If branch_taken = 1
Strongly not takenStrongly not takenWeakly not taken
Weakly not takenStrongly not takenWeakly taken
Weakly takenWeakly not takenStrongly taken
Strongly takenWeakly takenStrongly taken

Reset goes to strongly not taken.

Mermaid diagram equivalent:

stateDiagram-v2
[*] --> SNT: reset
SNT: Strongly Not Taken / output 0
WNT: Weakly Not Taken / output 0
WT: Weakly Taken / output 1
ST: Strongly Taken / output 1
SNT --> SNT: 0
SNT --> WNT: 1
WNT --> SNT: 0
WNT --> WT: 1
WT --> WNT: 0
WT --> ST: 1
ST --> WT: 0
ST --> ST: 1

Question 3(ii) 5 pts OCR

Complete the timing diagram for the 2-bit saturating counter branch predictor given the supplied reset and branch_taken waveform.

The handwritten counter-state sequence is:

XX -> 00 -> 00 -> 01 -> 10 -> 11 -> 10 -> 01

predict_taken is 1 when the counter state is 10 or 11, and 0 otherwise.

Question 4(a) 5 pts

Performance. A program has 20% sequential and 80% parallel work. An accelerator speeds up the parallel part by 4x. Compute the overall speedup.

Using Amdahl’s law:

Speedup = 1 / (sequential_fraction + parallel_fraction / accelerator_speedup)
= 1 / (0.2 + 0.8 / 4)
= 1 / 0.4
= 2.5x

As percent improvement over baseline:

(2.5 - 1) * 100% = 150% faster

Question 4(b) 5 pts

A program has 40% sequential and 60% parallel work. An accelerator speeds up the parallel part by 6x. Compute the overall speedup.

Speedup = 1 / (0.4 + 0.6 / 6)
= 1 / 0.5
= 2.0x

As percent improvement over baseline:

(2.0 - 1) * 100% = 100% faster

Question 4(c) 5 pts

Which program has better area computing efficiency? Assumptions:

  • Both programs have the same baseline area without accelerators.
  • Accelerator A gives 2.5x speedup with 150% area increase, so total area is 2.5x baseline.
  • Accelerator B gives 2.0x speedup with 100% area increase, so total area is 2.0x baseline.

Area computing efficiency can be compared as:

ACE = speedup / area_factor

For program A:

ACE_A = 2.5 / 2.5 = 1

For program B:

ACE_B = 2.0 / 2.0 = 1

Answer: Programs A and B have the same area computing efficiency.