EECS 4340 Final Review

Final — Spring 2025

EECS 4340 · Spring 2025

Historically numbered as EECS 4824.

Question 1 12 pts

The processor has split load and store queues. It retires stores in order, forwards values from older unretired stores to younger loads, allows load speculation, and detects memory ordering violations. For each instruction, identify whether it can issue by cycle 15. Each instruction is given as (address, cycle dispatched, cycle operand ready):

  • load A, 4, 10
  • load B, 5, 20
  • store C, 6, 8
  • store A, 7, 11
  • store B, 8, 18
  • load A, 9, 11
  • load B, 10, 12

| Instruction | Can issue by cycle 15? |
| --- | --- |
| load A, 4, 10 | Yes (operand ready at cycle 10, before cycle 15). |
| load B, 5, 20 | No (operand not ready until cycle 20). |
| store C, 6, 8 | Yes (no older stores block it; operand ready by cycle 15). |
| store A, 7, 11 | Yes (older store C is ready, and this store's operand is ready by cycle 15). |
| store B, 8, 18 | No (store operand not ready until cycle 18). |
| load A, 9, 11 | Yes (receives a forwarded value from older store A, which is ready by cycle 15). |
| load B, 10, 12 | Yes (its operand is ready at cycle 12; load speculation lets it issue past the unready older store B, with any ordering violation detected later). |
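The table above can be cross-checked with a small simulation. The model below is a sketch under stated assumptions, not the course's official simulator: stores issue in program order (so a store also waits on every older store's operand), while loads issue as soon as their own operand is ready, either forwarding from a ready older store to the same address or issuing speculatively.

```python
# Sketch: check whether each memory instruction can issue by a deadline cycle.
# Assumptions (not spelled out in the exam text): stores issue in program
# order; loads forward or speculate, so only their own operand gates them.
DEADLINE = 15

# (kind, address, dispatch_cycle, operand_ready_cycle) in program order
insts = [
    ("load",  "A", 4, 10),
    ("load",  "B", 5, 20),
    ("store", "C", 6, 8),
    ("store", "A", 7, 11),
    ("store", "B", 8, 18),
    ("load",  "A", 9, 11),
    ("load",  "B", 10, 12),
]

def can_issue_by(deadline):
    results = []
    store_ready_so_far = 0  # latest operand-ready cycle among older stores
    for kind, addr, dispatch, ready in insts:
        if kind == "store":
            # In-order store issue: also wait for all older stores.
            issue = max(ready, store_ready_so_far)
            store_ready_so_far = max(store_ready_so_far, ready)
        else:
            # Loads: forward from a ready older store or speculate.
            issue = ready
        results.append(issue <= deadline)
    return results

print(can_issue_by(DEADLINE))
# -> [True, False, True, True, False, True, True], matching the table
```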

Question 2 12 pts

Classify which cache-performance metric each technique helps reduce: miss rate, miss penalty, or hit time.

Techniques:

  • Increasing block / cache-line size
  • Virtually indexed cache
  • Increasing cache size
  • Direct-mapped cache
  • Critical-word first
  • Skewed or pseudo-associative cache
  • Adding L2 cache between L1 and memory

| Technique | Miss rate | Miss penalty | Hit time |
| --- | --- | --- | --- |
| Increasing block / cache-line size | Yes | | |
| Virtually indexed cache (overlaps lookup with TLB translation) | | | Yes |
| Increasing cache size | Yes | | |
| Direct-mapped cache | | | Yes |
| Critical-word first | | Yes | |
| Skewed or pseudo-associative cache | Yes | | |
| Adding L2 cache between L1 and memory | | Yes | |
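The three metrics combine in the average memory access time, AMAT = hit time + miss rate × miss penalty, which shows which term each technique attacks. The numbers below are illustrative assumptions, not figures from the exam.

```python
# AMAT = hit_time + miss_rate * miss_penalty (all in cycles; values made up).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

base            = amat(1.0, 0.10, 20.0)  # baseline: 3.0 cycles
bigger_cache    = amat(1.0, 0.05, 20.0)  # miss-rate technique   -> 2.0
crit_word_first = amat(1.0, 0.10, 12.0)  # miss-penalty technique -> 2.2
direct_mapped   = amat(0.8, 0.12, 20.0)  # hit-time technique (miss rate may rise)
print(base, bigger_cache, crit_word_first, direct_mapped)
```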

Question 3(a) 3 pts

A partner proposes CDB arbitration logic between functional units and the common data bus. Protocol:

  • A functional unit asserts valid when it wants to write to the CDB.
  • The arbitrator provides a grant in the same cycle by asserting that unit’s ready input.
  • Both valid and ready must be asserted for a functional unit to write to the CDB.

The clock has a 5 ns period and the setup constraint is 1 ns before the rising clock edge. What is the issue with this design? Be specific about the problematic signals.

Because the grant is generated combinationally in the same cycle, the adder does not see its ready signal deassert until after the setup deadline (1 ns before the rising edge), so it may still believe it holds the grant when the arbiter has actually granted the multiplier. The adder_block signal likewise does not arrive in the correct cycle.

Question 3(b) 3 pts

Propose a way to resolve the issue without changing the combinational blocks.

Possible fixes:

  1. Extend the clock period.
  2. Add registers to the signals going from the arbiter to the adder and multiplier so the CDB grant is never used in the same cycle in which it is generated.
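Fix 2 can be illustrated with a toy cycle model (a sketch, not the actual RTL): the arbiter's combinational grant is captured in a register, so each unit samples a ready signal computed in the previous cycle and the grant is never consumed in the cycle that generated it.

```python
# Toy model of a registered CDB grant. Unit 0 is the adder, unit 1 the
# multiplier; the arbiter is fixed-priority over the valid requests.
def arbitrate(valids):
    """Combinational arbiter: grant the lowest-index asserted valid."""
    grant = [False] * len(valids)
    for i, v in enumerate(valids):
        if v:
            grant[i] = True
            break
    return grant

def run(valid_trace):
    """Return the registered ready signals the units see each cycle."""
    ready_reg = [False, False]        # registered grant (adder, multiplier)
    seen = []
    for valids in valid_trace:
        seen.append(list(ready_reg))  # units sample the *registered* grant
        ready_reg = arbitrate(valids) # becomes visible next cycle
    return seen

# Cycle 0: both request; cycle 1: the adder drops valid after being granted.
trace = [(True, True), (False, True), (False, False)]
print(run(trace))
# -> [[False, False], [True, False], [False, True]]
```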

Question 3(c) 3 pts

Complete the CDB data timing diagram. The adder produces value 32 for destination 3; the multiplier produces value 64 for destination 2. Show the resulting cdb.data, cdb.dest, and cdb.valid.

Text equivalent of the intended waveform:

| Time interval | cdb.data | cdb.dest | cdb.valid | Meaning |
| --- | --- | --- | --- | --- |
| Before any selected grant is stable | 0 / don't-care | 0 / don't-care | 0 | No valid CDB transfer. |
| Adder grant window | 32 | 3 | 1 | Adder result is broadcast. |
| Multiplier grant window | 64 | 2 | 1 | Multiplier result is broadcast. |

Handwritten note: add registers to the data signals so they are synchronized with the corresponding ready signal.

Question 3(d) 3 pts

What is the shortest clock period after the CDB arbiter modification of part (b), with 1 ns setup padding?

Shortest clock period: 9 ns.

Reasoning from the handwritten solution:

  • 5 ns for the multiplier valid/data arrival
  • 2 ns + 1 ns for the combinational logic delays
  • 1 ns setup padding

Total: 5 + 2 + 1 + 1 = 9 ns.

Question 3(e) 3 pts

Extra credit. What is the maximum clock frequency under a 1 mW power budget?

The handwritten solution estimates total switching energy per cycle as approximately:

x \approx 41.5\;\text{pJ/cycle}

Using P = x / T_{clk} and the 1 mW budget:

\frac{x}{T_{clk}} \leq 1\;\text{mW} \;\Rightarrow\; T_{clk} \geq 41.5\;\text{ns} \;\Rightarrow\; f_{clk} \leq \frac{1}{41.5\;\text{ns}} \approx 24\;\text{MHz}

Answer: Maximum clock frequency is about 24 MHz, which is stricter than the 9 ns functional-timing limit from (d).
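The arithmetic checks out numerically; taking the handwritten 41.5 pJ/cycle estimate as given:

```python
# Part (e) check: with x ≈ 41.5 pJ switched per cycle and P = x / T_clk,
# the 1 mW power budget gives f_max = P_budget / x.
energy_per_cycle = 41.5e-12   # J/cycle (handwritten estimate)
power_budget = 1e-3           # W
f_max = power_budget / energy_per_cycle
print(f"{f_max / 1e6:.1f} MHz")   # ≈ 24.1 MHz
```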

Question 4(a) 3 pts

Consider a loop that computes the elementwise product of vectors A and B and stores per-element results into C:

loop:
1: beq t0, a3, end # if i == N, exit loop
2: slli t1, t0, 2 # t1 = i * 4
3: add t2, A, t1 # t2 = &A[i]
4: lw t3, 0(t2) # t3 = A[i]
5: add t4, B, t1 # t4 = &B[i]
6: lw t5, 0(t4) # t5 = B[i]
7: mul t6, t3, t5 # t6 = A[i] * B[i]
8: add t7, C, t1 # t7 = &C[i]
9: sw t6, 0(t7) # C[i] = result
10: addi t0, t0, 1 # i++
11: j loop
end:

Compute the D-cache miss rate with a 512 B fully associative cache, no prefetching, 8 B blocks, and 32-bit integers.

With 4-byte integers and 8 B blocks, each block holds 8 / 4 = 2 elements. The loop makes 3 memory accesses per iteration (two loads and one store), each streaming sequentially through its own array, so each stream misses once per block: miss rate = 1/2 = 50%.

Question 4(b) 3 pts

Same assumptions as (a), but with a 1024 B cache and 16 B blocks. Compute the D-cache miss rate.

\text{integers per block} = \frac{16\;\text{B}}{4\;\text{B/int}} = 4

\text{miss rate} = \frac{1}{4} = 25\%
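Parts (a) and (b) follow the same streaming pattern, so both miss rates come from one formula. A small sketch, assuming A, B, and C are block-aligned and the three streams do not evict each other (the caches comfortably hold three active blocks):

```python
# Streaming D-cache miss rate: one compulsory miss per block, then hits
# for the remaining elements of that block.
def stream_miss_rate(block_bytes, elem_bytes=4):
    elems_per_block = block_bytes // elem_bytes
    return 1 / elems_per_block

print(stream_miss_rate(8))    # part (a): 8 B blocks  -> 0.5  (50%)
print(stream_miss_rate(16))   # part (b): 16 B blocks -> 0.25 (25%)
```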

Question 4(c) 3 pts

Compute the CPI for parts (a) and (b). The ideal CPI is 1 and each missed load / store costs 20 cycles in total.

There are 11 loop instructions, of which 3 are data-memory instructions. If a missed load / store costs 20 total cycles and a hit / non-memory instruction costs 1 cycle:

For part (a), using a 50% miss rate:

\text{cycles/iter} = 8 \cdot 1 + 3 \cdot (0.5 \cdot 20 + 0.5 \cdot 1) = 8 + 3 \cdot 10.5 = 39.5

\text{CPI} = \frac{39.5}{11} \approx 3.59

For part (b), using a 25% miss rate:

\text{cycles/iter} = 8 \cdot 1 + 3 \cdot (0.25 \cdot 20 + 0.75 \cdot 1) = 8 + 3 \cdot 5.75 = 25.25

\text{CPI} = \frac{25.25}{11} \approx 2.30
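The two CPI results can be reproduced directly from the document's accounting (11 instructions per iteration, 3 of them memory accesses, 20 cycles per missing load/store, 1 cycle otherwise):

```python
# CPI for the loop as a function of D-cache miss rate.
def loop_cpi(miss_rate, n_inst=11, n_mem=3, miss_cost=20, hit_cost=1):
    cycles = (n_inst - n_mem) * 1 + n_mem * (
        miss_rate * miss_cost + (1 - miss_rate) * hit_cost)
    return cycles / n_inst

print(round(loop_cpi(0.50), 2))  # part (a): 39.5 / 11  ≈ 3.59
print(round(loop_cpi(0.25), 2))  # part (b): 25.25 / 11 ≈ 2.3
```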

Question 4(d) 3 pts

Compute the I-cache misses for two iterations of the loop, assuming 4 B instructions, a 32 B fully associative I-cache, and 4 B blocks.

11 compulsory misses and 11 capacity misses.

Reasoning:

  • The loop body has 11 instructions.
  • Each instruction is 4 bytes and each cache block is 4 bytes, so each instruction occupies its own I-cache block.
  • The I-cache holds 32 / 4 = 8 instruction blocks.
  • First iteration: first access to each of the 11 instructions gives 11 compulsory misses.
  • Second iteration: the 11-instruction loop is larger than the 8-block cache. With LRU-like fully associative replacement, each instruction in the second pass has been evicted before reuse, giving 11 capacity misses.
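The 22-miss result can be confirmed with a direct fully associative LRU simulation of the two iterations:

```python
# Fully associative LRU I-cache: 32 B cache, 4 B blocks -> 8 blocks,
# with 11 one-block loop instructions fetched twice.
from collections import OrderedDict

def count_misses(trace, capacity=8):
    cache = OrderedDict()   # keys kept in LRU -> MRU order
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)          # hit: refresh LRU position
        else:
            misses += 1                       # compulsory or capacity miss
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict least recently used
            cache[block] = True
    return misses

two_iterations = list(range(11)) * 2          # instructions 0..10, twice
print(count_misses(two_iterations))           # -> 22
```

With 11 blocks cycling through 8 entries, every second-pass access finds its block already evicted, which is the classic LRU-thrashing pattern the answer describes.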

Question 5(a) 2 pts

A system has 4 KiB of physical address space, 8 KiB of virtual address space, and 256 B pages. Compute the size in bits of each address portion: physical page number (PPN), physical offset (PO), virtual page number (VPN), and virtual offset (VO).

| Address portion | Size |
| --- | --- |
| Physical Page Number (PPN) | 4 bits |
| Physical Offset (PO) | 8 bits |
| Virtual Page Number (VPN) | 5 bits |
| Virtual Offset (VO) | 8 bits |

Reasoning (the OS-managed page table maps VPNs to PPNs):

page size = 256 B = 2^8 -> offset = 8 bits
physical address space = 4 KiB = 2^12 -> PPN = 12 - 8 = 4 bits
virtual address space = 8 KiB = 2^13 -> VPN = 13 - 8 = 5 bits

Question 5(b) 2 pts

Direct-mapped cache: 256 B cache with 8 B blocks. How many address bits are tag, index, and block offset?

Block offset:

8\;\text{B block} = 2^3 \;\Rightarrow\; 3\;\text{offset bits}

Index:

\frac{256\;\text{B}}{8\;\text{B}} = 32\;\text{lines} = 2^5 \;\Rightarrow\; 5\;\text{index bits}

Tag depends on whether the cache is indexed / tagged by virtual or physical address:

| Address type | Total address bits | Tag | Index | Block offset |
| --- | --- | --- | --- | --- |
| Physical | 12 | 4 | 5 | 3 |
| Virtual | 13 | 5 | 5 | 3 |
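Both rows of the table follow from the same split, with only the total address width changing:

```python
# Tag/index/offset breakdown for part (b): 256 B direct-mapped cache,
# 8 B blocks, for either a physical (12-bit) or virtual (13-bit) address.
from math import log2

def fields(addr_bits, cache_bytes=256, block_bytes=8):
    offset = int(log2(block_bytes))                # 3 bits
    index = int(log2(cache_bytes // block_bytes))  # 32 lines -> 5 bits
    tag = addr_bits - index - offset               # remaining bits
    return tag, index, offset

print(fields(12))   # physical: (4, 5, 3)
print(fields(13))   # virtual:  (5, 5, 3)
```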