EECS 4340 Final Review

Final — Spring 2025

EECS 4340 · Spring 2025

Historically numbered as EECS 4824.

Question 1 12 pts

The processor has split load and store queues. It retires stores in order, forwards values from older unretired stores to younger loads, allows load speculation, and detects memory ordering violations. For each instruction, identify whether it can issue by cycle 15. Each instruction is given as (address, cycle dispatched, cycle operand ready):

  • load A, 4, 10
  • load B, 5, 20
  • store C, 6, 8
  • store A, 7, 11
  • store B, 8, 18
  • load A, 9, 11
  • load B, 10, 12

| Instruction | Can issue by cycle 15? |
| --- | --- |
| load A, 4, 10 | Yes (operand ready at cycle 10, before cycle 15). |
| load B, 5, 20 | No (operand not ready until cycle 20). |
| store C, 6, 8 | Yes (no older stores block it; operand ready by cycle 15). |
| store A, 7, 11 | Yes (older store C is ready, and this store's operand is ready by cycle 15). |
| store B, 8, 18 | No (store operand not ready until cycle 18). |
| load A, 9, 11 | Yes (receives a forwarded value from older store A, which is ready by cycle 15). |
| load B, 10, 12 | Yes (its operand is ready at cycle 12; load speculation lets it issue past the unready older store B, with any ordering violation detected later). |
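The table above can be cross-checked with a small simulation. The model below is a sketch under stated assumptions, not the course's official simulator: stores issue in program order (so a store also waits on every older store's operand), while loads issue as soon as their own operand is ready, either forwarding from a ready older store to the same address or issuing speculatively.

```python
# Sketch: check whether each memory instruction can issue by a deadline cycle.
# Assumptions (not spelled out in the exam text): stores issue in program
# order; loads forward or speculate, so only their own operand gates them.
DEADLINE = 15

# (kind, address, dispatch_cycle, operand_ready_cycle) in program order
insts = [
    ("load",  "A", 4, 10),
    ("load",  "B", 5, 20),
    ("store", "C", 6, 8),
    ("store", "A", 7, 11),
    ("store", "B", 8, 18),
    ("load",  "A", 9, 11),
    ("load",  "B", 10, 12),
]

def can_issue_by(deadline):
    results = []
    store_ready_so_far = 0  # latest operand-ready cycle among older stores
    for kind, addr, dispatch, ready in insts:
        if kind == "store":
            # In-order store issue: also wait for all older stores.
            issue = max(ready, store_ready_so_far)
            store_ready_so_far = max(store_ready_so_far, ready)
        else:
            # Loads: forward from a ready older store or speculate.
            issue = ready
        results.append(issue <= deadline)
    return results

print(can_issue_by(DEADLINE))
# -> [True, False, True, True, False, True, True], matching the table
```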

Question 2 12 pts

Classify which cache-performance metric each technique helps reduce: miss rate, miss penalty, or hit time.

Techniques:

  • Increasing block / cache-line size
  • Virtually indexed cache
  • Increasing cache size
  • Direct-mapped cache
  • Critical-word first
  • Skewed or pseudo-associative cache
  • Adding L2 cache between L1 and memory

| Technique | Miss rate | Miss penalty | Hit time |
| --- | --- | --- | --- |
| Increasing block / cache-line size | Yes | | |
| Virtually indexed cache (overlaps lookup with TLB translation) | | | Yes |
| Increasing cache size | Yes | | |
| Direct-mapped cache | | | Yes |
| Critical-word first | | Yes | |
| Skewed or pseudo-associative cache | Yes | | |
| Adding L2 cache between L1 and memory | | Yes | |
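The three metrics combine in the average memory access time, AMAT = hit time + miss rate × miss penalty, which shows which term each technique attacks. The numbers below are illustrative assumptions, not figures from the exam.

```python
# AMAT = hit_time + miss_rate * miss_penalty (all in cycles; values made up).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

base            = amat(1.0, 0.10, 20.0)  # baseline: 3.0 cycles
bigger_cache    = amat(1.0, 0.05, 20.0)  # miss-rate technique   -> 2.0
crit_word_first = amat(1.0, 0.10, 12.0)  # miss-penalty technique -> 2.2
direct_mapped   = amat(0.8, 0.12, 20.0)  # hit-time technique (miss rate may rise)
print(base, bigger_cache, crit_word_first, direct_mapped)
```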

Question 3(a) 3 pts

A partner proposes CDB arbitration logic between functional units and the common data bus. Protocol:

  • A functional unit asserts valid when it wants to write to the CDB.
  • The arbitrator provides a grant in the same cycle by asserting that unit’s ready input.
  • Both valid and ready must be asserted for a functional unit to write to the CDB.

The clock has a 5 ns period and the setup constraint is 1 ns before the rising clock edge. What is the issue with this design? Be specific about the problematic signals.

Because the grant is generated combinationally in the same cycle, the adder does not see its ready signal deassert until after the setup deadline (1 ns before the rising edge), so it may still believe it holds the grant when the arbiter has actually granted the multiplier. The adder_block signal likewise does not arrive in the correct cycle.

Question 3(b) 3 pts

Propose a way to resolve the issue without changing the combinational blocks.

Possible fixes:

  1. Extend the clock period.
  2. Add registers to the signals going from the arbiter to the adder and multiplier so the CDB grant is never used in the same cycle in which it is generated.
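Fix 2 can be illustrated with a toy cycle model (a sketch, not the actual RTL): the arbiter's combinational grant is captured in a register, so each unit samples a ready signal computed in the previous cycle and the grant is never consumed in the cycle that generated it.

```python
# Toy model of a registered CDB grant. Unit 0 is the adder, unit 1 the
# multiplier; the arbiter is fixed-priority over the valid requests.
def arbitrate(valids):
    """Combinational arbiter: grant the lowest-index asserted valid."""
    grant = [False] * len(valids)
    for i, v in enumerate(valids):
        if v:
            grant[i] = True
            break
    return grant

def run(valid_trace):
    """Return the registered ready signals the units see each cycle."""
    ready_reg = [False, False]        # registered grant (adder, multiplier)
    seen = []
    for valids in valid_trace:
        seen.append(list(ready_reg))  # units sample the *registered* grant
        ready_reg = arbitrate(valids) # becomes visible next cycle
    return seen

# Cycle 0: both request; cycle 1: the adder drops valid after being granted.
trace = [(True, True), (False, True), (False, False)]
print(run(trace))
# -> [[False, False], [True, False], [False, True]]
```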

Question 3(c) 3 pts

Complete the CDB data timing diagram. The adder produces value 32 for destination 3; the multiplier produces value 64 for destination 2. Show the resulting cdb.data, cdb.dest, and cdb.valid.

Text equivalent of the intended waveform:

| Time interval | cdb.data | cdb.dest | cdb.valid | Meaning |
| --- | --- | --- | --- | --- |
| Before any selected grant is stable | 0 / don't-care | 0 / don't-care | 0 | No valid CDB transfer. |
| Adder grant window | 32 | 3 | 1 | Adder result is broadcast. |
| Multiplier grant window | 64 | 2 | 1 | Multiplier result is broadcast. |

Handwritten note: add registers to the data signals so they are synchronized with the corresponding ready signal.

Question 3(d) 3 pts

What is the shortest clock period after the CDB arbiter modification of part (b), with 1 ns setup padding?

Shortest clock period: 9 ns.

Reasoning from the handwritten solution:

  • 5 ns for the multiplier valid/data arrival
  • 2 ns + 1 ns for the combinational logic delays
  • 1 ns setup padding

Total: 5 + 2 + 1 + 1 = 9 ns.

Question 3(e) 3 pts

Extra credit. What is the maximum clock frequency under a 1 mW power budget?

The handwritten solution estimates total switching energy per cycle as approximately:

x \approx 41.5\;\text{pJ/cycle}

Using P = x / T_{clk} and the 1 mW budget:

\frac{x}{T_{clk}} \leq 1\;\text{mW} \;\Rightarrow\; T_{clk} \geq 41.5\;\text{ns} \;\Rightarrow\; f_{clk} \leq \frac{1}{41.5\;\text{ns}} \approx 24\;\text{MHz}

Answer: Maximum clock frequency is about 24 MHz, which is stricter than the 9 ns functional-timing limit from (d).
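The arithmetic checks out numerically; taking the handwritten 41.5 pJ/cycle estimate as given:

```python
# Part (e) check: with x ≈ 41.5 pJ switched per cycle and P = x / T_clk,
# the 1 mW power budget gives f_max = P_budget / x.
energy_per_cycle = 41.5e-12   # J/cycle (handwritten estimate)
power_budget = 1e-3           # W
f_max = power_budget / energy_per_cycle
print(f"{f_max / 1e6:.1f} MHz")   # ≈ 24.1 MHz
```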

Question 4(a) 3 pts

Consider a loop that computes the elementwise product of vectors A and B and stores per-element results into C:

loop:
1: beq t0, a3, end # if i == N, exit loop
2: slli t1, t0, 2 # t1 = i * 4
3: add t2, A, t1 # t2 = &A[i]
4: lw t3, 0(t2) # t3 = A[i]
5: add t4, B, t1 # t4 = &B[i]
6: lw t5, 0(t4) # t5 = B[i]
7: mul t6, t3, t5 # t6 = A[i] * B[i]
8: add t7, C, t1 # t7 = &C[i]
9: sw t6, 0(t7) # C[i] = result
10: addi t0, t0, 1 # i++
11: j loop
end:

Compute the D-cache miss rate with a 512 B fully associative cache, no prefetching, 8 B blocks, and 32-bit integers.

With 4-byte integers and 8 B blocks, each block holds 8 / 4 = 2 elements. The loop makes 3 memory accesses per iteration (two loads and one store), each streaming sequentially through its own array, so each stream misses once per block: miss rate = 1/2 = 50%.

Question 4(b) 3 pts

Same assumptions as (a), but with a 1024 B cache and 16 B blocks. Compute the D-cache miss rate.

\text{integers per block} = \frac{16\;\text{B}}{4\;\text{B/int}} = 4

\text{miss rate} = \frac{1}{4} = 25\%
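Parts (a) and (b) follow the same streaming pattern, so both miss rates come from one formula. A small sketch, assuming A, B, and C are block-aligned and the three streams do not evict each other (the caches comfortably hold three active blocks):

```python
# Streaming D-cache miss rate: one compulsory miss per block, then hits
# for the remaining elements of that block.
def stream_miss_rate(block_bytes, elem_bytes=4):
    elems_per_block = block_bytes // elem_bytes
    return 1 / elems_per_block

print(stream_miss_rate(8))    # part (a): 8 B blocks  -> 0.5  (50%)
print(stream_miss_rate(16))   # part (b): 16 B blocks -> 0.25 (25%)
```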

Question 4(c) 3 pts

Compute the CPI for parts (a) and (b). The ideal CPI is 1 and each missed load / store costs 20 cycles in total.

There are 11 loop instructions, of which 3 are data-memory instructions. If a missed load / store costs 20 total cycles and a hit / non-memory instruction costs 1 cycle:

For part (a), using a 50% miss rate:

\text{cycles/iter} = 8 \cdot 1 + 3 \cdot (0.5 \cdot 20 + 0.5 \cdot 1) = 8 + 3 \cdot 10.5 = 39.5

\text{CPI} = \frac{39.5}{11} \approx 3.59

For part (b), using a 25% miss rate:

\text{cycles/iter} = 8 \cdot 1 + 3 \cdot (0.25 \cdot 20 + 0.75 \cdot 1) = 8 + 3 \cdot 5.75 = 25.25

\text{CPI} = \frac{25.25}{11} \approx 2.30
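The two CPI results can be reproduced directly from the document's accounting (11 instructions per iteration, 3 of them memory accesses, 20 cycles per missing load/store, 1 cycle otherwise):

```python
# CPI for the loop as a function of D-cache miss rate.
def loop_cpi(miss_rate, n_inst=11, n_mem=3, miss_cost=20, hit_cost=1):
    cycles = (n_inst - n_mem) * 1 + n_mem * (
        miss_rate * miss_cost + (1 - miss_rate) * hit_cost)
    return cycles / n_inst

print(round(loop_cpi(0.50), 2))  # part (a): 39.5 / 11  ≈ 3.59
print(round(loop_cpi(0.25), 2))  # part (b): 25.25 / 11 ≈ 2.3
```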

Question 4(d) 3 pts

Compute the I-cache misses for two iterations of the loop, assuming 4 B instructions, a 32 B fully associative I-cache, and 4 B blocks.

11 compulsory misses and 11 capacity misses.

Reasoning:

  • The loop body has 11 instructions.
  • Each instruction is 4 bytes and each cache block is 4 bytes, so each instruction occupies its own I-cache block.
  • The I-cache holds 32 / 4 = 8 instruction blocks.
  • First iteration: first access to each of the 11 instructions gives 11 compulsory misses.
  • Second iteration: the 11-instruction loop is larger than the 8-block cache. With LRU-like fully associative replacement, each instruction in the second pass has been evicted before reuse, giving 11 capacity misses.
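The 22-miss result can be confirmed with a direct fully associative LRU simulation of the two iterations:

```python
# Fully associative LRU I-cache: 32 B cache, 4 B blocks -> 8 blocks,
# with 11 one-block loop instructions fetched twice.
from collections import OrderedDict

def count_misses(trace, capacity=8):
    cache = OrderedDict()   # keys kept in LRU -> MRU order
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)          # hit: refresh LRU position
        else:
            misses += 1                       # compulsory or capacity miss
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict least recently used
            cache[block] = True
    return misses

two_iterations = list(range(11)) * 2          # instructions 0..10, twice
print(count_misses(two_iterations))           # -> 22
```

With 11 blocks cycling through 8 entries, every second-pass access finds its block already evicted, which is the classic LRU-thrashing pattern the answer describes.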

Question 5(a) 2 pts

A system has 4 KiB of physical address space, 8 KiB of virtual address space, and 256 B pages. Compute the size in bits of each address portion: physical page number (PPN), physical offset (PO), virtual page number (VPN), and virtual offset (VO).

| Address portion | Size |
| --- | --- |
| Physical Page Number (PPN) | 4 bits |
| Physical Offset (PO) | 8 bits |
| Virtual Page Number (VPN) | 5 bits |
| Virtual Offset (VO) | 8 bits |

Reasoning (the OS-managed page table maps VPNs to PPNs):

page size = 256 B = 2^8 -> offset = 8 bits
physical address space = 4 KiB = 2^12 -> PPN = 12 - 8 = 4 bits
virtual address space = 8 KiB = 2^13 -> VPN = 13 - 8 = 5 bits

Question 5(b) 2 pts

Direct-mapped cache: 256 B cache with 8 B blocks. How many address bits are tag, index, and block offset?

Block offset:

8\;\text{B block} = 2^3 \;\Rightarrow\; 3\;\text{offset bits}

Index:

\frac{256\;\text{B}}{8\;\text{B}} = 32\;\text{lines} = 2^5 \;\Rightarrow\; 5\;\text{index bits}

Tag depends on whether the cache is indexed / tagged by virtual or physical address:

| Address type | Total address bits | Tag | Index | Block offset |
| --- | --- | --- | --- | --- |
| Physical | 12 | 4 | 5 | 3 |
| Virtual | 13 | 5 | 5 | 3 |
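Both rows of the table follow from the same split, with only the total address width changing:

```python
# Tag/index/offset breakdown for part (b): 256 B direct-mapped cache,
# 8 B blocks, for either a physical (12-bit) or virtual (13-bit) address.
from math import log2

def fields(addr_bits, cache_bytes=256, block_bytes=8):
    offset = int(log2(block_bytes))                # 3 bits
    index = int(log2(cache_bytes // block_bytes))  # 32 lines -> 5 bits
    tag = addr_bits - index - offset               # remaining bits
    return tag, index, offset

print(fields(12))   # physical: (4, 5, 3)
print(fields(13))   # virtual:  (5, 5, 3)
```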