Glossary
Every concept term used across the site links here; each term's page lists every slide and exam question that touches it.
- Address translation — The hardware process by which a virtual (effective) address is rewritten to a physical address on every memory reference, using a TLB-cached page-table lookup. The shared mechanism that implements both protection and demand paging.
- Address-correlating prefetcher — A prefetcher that learns repeating relationships between addresses (or deltas) — such as 'address Y follows address X with high probability' — and issues prefetches by replaying the recorded successors. Often implemented with a Markov-style table or a Global History Buffer (GHB) walked via per-key linked lists.
- Amdahl's Law — If an enhancement speeds up a fraction f of execution by factor S, the overall speedup is 1/((1-f) + f/S), bounded above by 1/(1-f). Quantifies the diminishing returns of optimizing only part of a workload and the central limit on parallel speedup.
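The formula above can be sketched directly; the function name is illustrative:

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of execution is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 80% of a program by 10x yields only ~3.57x overall,
# and the 1/(1-f) bound caps any enhancement of that fraction at 5x.
print(amdahl_speedup(0.8, 10))   # ~3.571
```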
- Arithmetic, harmonic, and geometric means — Three averaging techniques used in performance evaluation: arithmetic mean for quantities proportional to time (latencies), harmonic mean for rates (throughput, MIPS), and geometric mean for unit-less ratios (speedups). Choosing the wrong mean systematically distorts conclusions.
- Average Memory Access Time (AMAT) — Hit time + miss rate × miss penalty. The standard analytic model for cache performance, applied recursively for multilevel hierarchies.
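AMAT composes recursively across levels, since a lower level's AMAT is the level above's miss penalty; a minimal sketch with illustrative latencies:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: the L2 AMAT becomes the L1 miss penalty.
l2 = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)   # 30 cycles
l1 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2)     # 2.5 cycles
```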
- Barrier synchronization — A synchronization primitive that requires all participating threads to reach a designated point before any may proceed. Foundational to bulk-synchronous parallel (BSP) algorithms and SPMD parallelism.
- Base-and-bound registers — An early protection mechanism that gives each process a single contiguous physical region described by two privileged registers (base, bound). Each effective address is translated as PA = EA + base, and the access traps if PA >= bound.
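A sketch of the translation step as the entry defines it, with bound taken as the region's physical upper limit:

```python
def base_and_bound_translate(ea, base, bound):
    """Translate effective address EA to PA = EA + base; trap if PA >= bound."""
    pa = ea + base
    if pa >= bound:
        raise MemoryError("protection fault: PA out of bounds")
    return pa
```

For example, `base_and_bound_translate(0x10, base=0x1000, bound=0x2000)` yields `0x1010`, while any EA large enough to push PA past the bound raises the fault.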
- Branch history register (BHR) — A shift register that records the most recent N branch outcomes. Indexes a pattern history table to capture correlation between branches. Can be global (one shared register) or per-PC (one register per branch).
- Branch History Table (BHT) — A table of saturating counters indexed by branch PC (and optionally history) that predicts the direction of a conditional branch.
- Branch prediction — A hardware mechanism that predicts the direction (and target) of a branch before it executes so that the front end can keep fetching instructions speculatively.
- Branch stack (recovery stack) — A small hardware stack that checkpoints recovery state (PC, ROB/LSQ tails, rename map, free list, predictor repair info) for each in-flight predicted branch, enabling single-cycle squash on misprediction. Each entry is associated with a bit in a global branch mask register that flags younger instructions that depend on the branch.
- Branch Target Buffer (BTB) — A small cache indexed by branch PC that stores predicted branch targets so the front end can redirect fetch the cycle a branch is fetched.
- Cache aliasing / synonyms — The hazard, present in virtual or large VIPT caches, where two virtual addresses that translate to the same physical address occupy different cache lines, so that a write to one alias is not visible from the other. Solved by capping cache size, parallel multi-set search, OS page coloring, or L2-mediated detection.
- Cache associativity — The number of ways in each cache set: how many different blocks may map to the same set. A direct-mapped cache has 1 way; a fully associative cache has as many ways as it has blocks.
- Cache inclusion property — An invariant maintained between cache levels stating that the lower (closer) cache's contents are a subset of the higher (farther) cache's contents. Simplifies external coherence (snoops only check the larger cache) but requires back-invalidation on far-level evictions.
- Cache line / block — The unit of transfer between the cache and the next level of the memory hierarchy — typically 32–128 bytes.
- Chip Multiprocessor (CMP) — A single die integrating multiple complete processor cores, each with its own architectural state and (usually) private L1/L2 caches, sharing a last-level cache and memory controller. The dominant architecture since ~2005.
- Clock gating — A power-saving technique that disables the clock signal to idle functional units, eliminating their dynamic switching power until needed. Targets the activity factor A in P ~ ½ C V² A f.
- CMOS (Complementary Metal-Oxide-Semiconductor) — A logic family pairing NMOS pull-down networks with structurally dual PMOS pull-up networks so that exactly one network conducts in steady state. This eliminates static current flow from Vdd to ground, making CMOS the dominant low-power digital logic family.
- Common Data Bus (CDB) — A broadcast bus in a Tomasulo machine that delivers a functional unit's result to every reservation station and ROB entry waiting on its tag.
- Control hazard — A pipeline stall caused by a branch whose direction or target is not yet known when the next fetch must occur.
- Correlated (two-level) branch predictor — A direction predictor (Yeh & Patt, 1991) that maintains a separate prediction per (PC, recent-history) pair, exploiting the observation that branch outcomes correlate with recent control-flow history. Formalized as the {G,P}A{g,p,s} taxonomy.
- Critical word first — A miss-handling optimization in which the requested word of a missed cache line is delivered to the CPU as soon as it arrives from memory, before the rest of the line has filled. Reduces effective miss latency to near the access time of the first word.
- Cycles Per Instruction (CPI) — Average cycles required per executed instruction. Iron-law factor; lower is better. Reciprocal of IPC.
- Dark silicon — The phenomenon where a chip contains more transistors than can be powered on simultaneously within its thermal budget. Forces architects to leave portions of the die idle ("dark") at any given moment, motivating heterogeneous and specialized cores.
- Demand paging — A virtual-memory mechanism that uses disk as an automatically-managed extension of DRAM: a page is brought into a physical frame only when it is referenced and is evicted back to swap when DRAM is full. Hit handling is in hardware, miss handling (page fault) is in software.
- Dennard scaling — Robert Dennard's 1974 observation that if every linear MOSFET dimension and the supply voltage are shrunk by the same factor α, the per-chip power stays roughly constant while density and frequency rise. Held until the early 2000s when leakage broke the voltage-scaling assumption.
- Directory-based coherence — A coherence model where a directory tracks which caches share each block, replacing broadcast snooping with point-to-point messages — scalable to many cores.
- Dynamic (switching) power — CMOS power dissipated when nodes toggle, scaling as P ~ ½ C V² A f, where C is switched capacitance, V the supply voltage, A the activity factor, and f the clock frequency. Includes capacitive charging/discharging plus a small short-circuit component.
- Dynamic Voltage/Frequency Scaling (DVFS) — A power-management technique that adjusts the supply voltage and clock frequency at runtime to match the workload's performance demand. Because dynamic power scales as roughly V^3 when V tracks f, even small frequency reductions yield disproportionate power savings.
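The cubic effect can be checked numerically against the dynamic-power formula above; the constants here are arbitrary:

```python
def dynamic_power(c, v, a, f):
    """Dynamic power P ~ 1/2 * C * V^2 * A * f."""
    return 0.5 * c * v ** 2 * a * f

full = dynamic_power(c=1e-9, v=1.0, a=0.1, f=3e9)
half = dynamic_power(c=1e-9, v=0.5, a=0.1, f=1.5e9)
# Halving V and f together cuts dynamic power by 2^3 = 8x
# while performance falls only ~2x.
```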
- Energy-Delay Product (EDP) — EDP = Energy × delay = Power × time². A combined power-performance metric that penalizes slow designs more heavily than energy alone, used to compare designs that trade speed for energy.
- Energy-Delay² Product (ED²P) — ED²P = Energy × delay² = Power × time³. A combined metric that is approximately voltage-invariant under DVFS, making it a fair way to compare circuit/architecture quality independent of where the operator sets the V/f knob.
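The voltage-invariance claim can be checked with two operating points of the same design under V-tracks-f scaling (numbers illustrative): halving V doubles the delay and cuts power roughly 8x.

```python
def edp(power, time):
    """Energy-Delay Product = (P * t) * t."""
    return power * time ** 2

def ed2p(power, time):
    """Energy-Delay^2 Product = (P * t) * t^2."""
    return power * time ** 3

fast, slow = (8.0, 1.0), (1.0, 2.0)   # (power, time) at high and low V/f
assert ed2p(*fast) == ed2p(*slow)     # ED^2P unchanged by the DVFS knob
assert edp(*fast) != edp(*slow)       # EDP is not
```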
- Functional Unit Status Table (FUST) — The per-functional-unit status table at the heart of a scoreboard. Each entry holds the in-flight instruction's destination/source register names (R, R1, R2), opcode, busy bit, and the producer tags (T, T1, T2) used to detect when source operands are ready and when the FU's writeback should clear downstream waiters.
- GShare predictor — A correlating branch predictor that XORs the global branch history register with the branch PC to index a shared BHT, capturing inter-branch correlation.
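A minimal GShare sketch; the table size and counter initialization are arbitrary choices:

```python
class GShare:
    def __init__(self, hist_bits):
        self.hist_bits = hist_bits
        self.ghr = 0                          # global history register
        self.pht = [1] * (1 << hist_bits)     # 2-bit counters, init weakly not-taken

    def _index(self, pc):
        return (pc ^ self.ghr) & ((1 << self.hist_bits) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # taken if counter in upper half

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.hist_bits) - 1)
```

After a few iterations of an always-taken loop branch, the relevant counters saturate and the predictor answers "taken".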
- Hardware prefetching — A speculative mechanism that issues memory requests for blocks the processor is predicted to access soon — stride, stream, or correlation-based.
- Hierarchical (multi-level) page table — A page table organized as a tree whose levels are themselves paged. The VPN is split into one field per level; each level indexes the next, and unused subtrees consume no physical memory. Storage is proportional to the resident working set, not the virtual address space.
- Hybrid (tournament) predictor — A combined branch direction predictor that runs two sub-predictors in parallel (typically a simple BHT and a correlated/GShare predictor) and uses a per-PC meta-predictor (chooser) to pick which one to trust per branch. Used in the Alpha 21264; reaches 90–95% accuracy.
- Instruction-Level Parallelism (ILP) — A measure of the average number of instructions that could execute concurrently given unlimited hardware: the ratio of dynamic instruction count to the length of the critical dependence chain. Sets the ceiling on what any out-of-order or VLIW machine can extract from a program.
- Instructions Per Cycle (IPC) — Average instructions completed per cycle. Reciprocal of CPI; higher is better.
- Inverted (hashed) page table — A page-table organization with one entry per physical frame rather than per virtual page. Translation hashes (PID, VPN) into a small set of candidate entries that are searched for a matching tuple. Size is proportional to physical memory.
- Iron law of performance — Execution time = Instructions × CPI × Cycle time. Frames every architectural decision as a tradeoff between these three terms.
- Latency vs. throughput — Two distinct definitions of performance: latency is the time to finish a fixed task, throughput is the number of tasks per unit time. Throughput can exploit parallelism; latency cannot. The two metrics often disagree and must be matched to the measurement goal.
- Leakage power — Static current that flows through nominally-off transistors. Grows exponentially as threshold voltage Vth falls or device temperature rises, and linearly with device count. Dominates dynamic power at modern process nodes and is the immediate cause of Dennard scaling's collapse.
- Least Recently Used (LRU) replacement — A cache replacement policy that evicts the way in a set whose most-recent access is furthest in the past. Approximates Belady's optimal policy when temporal locality holds; exact LRU is hardware-expensive beyond 2-way and is often replaced by NMRU, tree-PLRU, or RRIP.
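Exact LRU for one set can be modeled with an ordered map; this is a behavioral sketch, not the hardware encoding:

```python
from collections import OrderedDict

class LRUSet:
    """One set of an n-way cache with true LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> None, oldest access first

    def access(self, tag):
        """Return True on hit; on miss, insert tag, evicting the LRU way if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)     # refresh recency
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = None
        return False
```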
- Linked-data prefetcher — A prefetcher that learns load-to-load value dependences (a producer load's result becomes a consumer load's base address) and issues the consumer template ahead of the processor, walking linked structures via a small FSM. Targets pointer chasing in lists, trees, and hash tables.
- Load queue (LQ) — A FIFO of in-flight load addresses kept in program order and associatively searchable. Used to detect memory ordering violations: when an older store's address resolves and matches a younger load that has already issued speculatively, the LQ raises a flush. Sized at roughly 20–30% of the ROB.
- Load/Store Queue (LSQ) — A pair of FIFOs (or one combined queue) that buffers in-flight loads and stores in program order so that the data cache only sees retired stores and out-of-order loads can disambiguate against older stores via address matching.
- Lock (mutual exclusion) — A synchronization primitive that ensures at most one thread executes a critical section at a time. Implemented atop hardware atomics (test-and-set, compare-and-swap, LL/SC) plus software waiting algorithms.
- Memoization (architectural) — Storing the result of an expensive computation so a future identical query can return it without recomputing. In hardware: branch predictors cache past branch outcomes, trace caches cache decoded instruction traces, value predictors cache likely load/op results.
- Memory consistency model — The contract specifying the order in which memory accesses by one thread become visible to others. Includes sequential consistency (strong), TSO (x86), release consistency (ARM), and weaker models requiring explicit fence instructions.
- Memory disambiguation — Determining whether an out-of-order load may bypass an older store with an unknown address. Implementations range from conservative stalling to speculative bypass with replay on misprediction.
- Memory wall — The widening performance gap between CPU speed (historically 52%/year, then 15%/year) and main-memory access time (~7%/year). One DRAM access today costs roughly 500 arithmetic-op-equivalents, motivating deep cache hierarchies and prefetching.
- MESI protocol — A four-state cache-coherence protocol (Modified, Exclusive, Shared, Invalid) for multiprocessor caches. Adds Exclusive to MSI to avoid bus traffic on private writes.
- Message-passing programming model — A parallel programming model in which processes have private address spaces and communicate via explicit send/receive operations. Common in distributed systems and HPC clusters (MPI).
- MIMD execution model — Flynn taxonomy class in which each processor independently fetches and executes its own instruction stream against its own data — the model used by general-purpose multiprocessors and CMPs.
- MIPS (millions of instructions per second) — Clock frequency divided by CPI, scaled to millions. A partial performance metric that ignores instruction count, so a smarter compiler that removes instructions can lower MIPS while making programs faster. Often derided as 'Meaningless Indicator of Performance for Salespeople.'
- Miss Status Holding Register (MSHR) — A hardware register in a non-blocking cache that records the address, status, and target of each in-flight miss. New misses are associatively compared against the MSHR file to coalesce duplicate requests; the number of MSHRs upper-bounds the cache's memory-level parallelism.
- Moore's Law — Gordon Moore's 1965 observation that the number of transistors on an economically manufacturable die doubles roughly every 18–24 months. Drove four decades of "free" performance and density improvements; now widely viewed as ending or having ended.
- MSI protocol — A three-state cache-coherence protocol (Modified, Shared, Invalid). Simplest snoopy protocol that supports multiple readers and one writer per block.
- Multiple-branch prediction — Hardware that emits predictions for two or more branches per cycle, used by wide-issue cores to sustain fetch bandwidth past the taken branch that occurs roughly once every ~10 instructions. Yeh, Marr & Patt's branch address cache reads multiple PHT rows in parallel and selects between them using freshly-computed predictions.
- Network-on-Chip (NoC) — A regular interconnection topology (mesh, ring, crossbar) replacing ad-hoc wiring within a chip to connect cores, caches, and memory controllers. Used in many-core processors with >8 cores.
- Non-blocking (lockup-free) cache — A cache that continues to service hits and additional misses while one or more earlier misses are still pending, using MSHRs to track outstanding requests. Essential for memory-level parallelism in out-of-order cores. First proposed by Kroft.
- Non-Uniform Cache Architecture (NUCA) — A large on-die cache organization in which different banks have different access latencies based on their physical distance from the CPU. Static NUCA fixes the address-to-bank mapping; Dynamic NUCA migrates frequently-used lines toward the CPU.
- Operand forwarding — Routing a result from a later pipeline stage directly back to a dependent instruction's operand input, avoiding stalls for RAW hazards on adjacent instructions.
- Overriding (multi-level) branch predictor — A two-level predictor organization in which a fast, simple predictor produces an initial prediction every cycle while a slower, more accurate predictor runs in the background; if the two disagree, the wrong-path instructions are squashed and refetched using the misprediction recovery mechanism.
- P6 microarchitecture — Intel's value-based out-of-order microarchitecture (Pentium Pro / II / III / M) that combines Tomasulo-style reservation stations with a separate reorder buffer to support precise interrupts. Tags are ROB indices; the Map Table carries a 'ready-in-ROB' bit so dispatch can pull completed-but-not-yet-retired values directly out of the ROB.
- Page fault — The exception raised when a memory reference's translation either does not exist in the page table or refers to a page that is currently swapped to disk. The OS handles the fault, allocates a frame, reads the page from disk, and updates the page table — at a cost of millions of cycles.
- Page table — An OS-managed data structure mapping virtual pages to physical frames. Modern systems use multilevel (radix) page tables walked by hardware on TLB miss.
- Pattern History Table (PHT) — A table of saturating counters indexed by a hash of the branch PC and the branch history register. Provides per-pattern direction predictions in two-level (correlated) and global-history predictors.
- Physically-Indexed Physically-Tagged (PIPT) cache — A cache that uses bits from the translated physical address for both its index and its tag. Free of synonyms and immune to context switches, but the TLB lookup sits serially on the L1 hit critical path.
- Power gating (Vdd gating) — A static-power-saving technique that cuts the Vdd or ground connection to a circuit block via a header/footer transistor, eliminating leakage entirely. Loses any state stored in the block and incurs a wakeup latency to re-establish supply voltage.
- Pre-decode bits — Extra bits attached to each I-cache line slot (e.g. is-return, is-conditional-branch) written the first time an instruction executes. Allow the front end to route predictions through the appropriate predictor (RAS, DIRP, BTB) before decode.
- Precise architectural state — The ability to abort and restart the machine at every instruction boundary so that the architectural register file and memory look exactly as if execution had stopped between two sequential instructions. Implemented in P6 by deferring all regfile/D-cache writes to the in-order Retire stage of the ROB.
- Precise interrupts — An interrupt model in which the architectural state at the point of the interrupt is exactly the state after committing all earlier instructions and before any later instruction. Implemented via ROB-based in-order commit.
- Read-After-Write hazard (RAW) — A data hazard where an instruction needs to read a register before a prior in-flight instruction has written it. Enforces a real ordering constraint that cannot be removed by renaming.
- Register renaming — Mapping architectural registers to a larger physical register file so that WAR and WAW dependencies disappear, enabling out-of-order execution.
- Reliability wall — The growing impact of transient, manufacturing, and wearout faults at advanced process nodes. As transistors shrink, single-event upsets, electromigration, and per-device variability force architects to treat hardware as statistically unreliable.
- Reorder Buffer (ROB) — A FIFO buffer that holds in-flight instructions in program order so that out-of-order execution results can be committed in order, supporting precise interrupts.
- Reservation station — A buffer at the input of a functional unit that holds an instruction and its operands (or operand tags) until the instruction is ready to execute.
- Return Address Stack (RAS) — A small hardware stack that predicts return targets by mirroring call/return semantics — pushed on call, popped on return.
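Behaviorally the RAS is just a bounded stack; overflow policy varies across designs, and dropping the oldest entry (as here) is one common choice:

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        """Predicted return target, or None when the stack has underflowed."""
        return self.stack.pop() if self.stack else None
```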
- Scoreboarding — A centralized hardware structure (CDC 6600) that tracks instruction status and register dependencies to issue, execute, and write-back instructions out of order without renaming.
- Segmented addressing — A protection/translation scheme that gives each process multiple variable-sized segments, each described by its own (base, bound) pair stored in a per-process segment table. Suffers from external fragmentation when segments are allocated and freed.
- Shared-memory programming model — A parallel programming model in which multiple threads operate on a single global address space, communicating implicitly via loads and stores. Supported in hardware by cache coherence.
- SIMD execution model — Flynn taxonomy class in which one instruction operates on multiple data lanes in lockstep. Implemented as vector extensions (AVX, NEON) or as the underlying mechanism of GPU SIMT execution.
- SIMT execution model — GPU execution model where the programmer writes scalar per-thread code and the hardware groups threads into warps, executing one instruction across all lanes with predication for divergent control flow.
- Skewed-associative cache — A set-associative cache in which each way uses a different hash function to compute its index, so two addresses that collide in one way almost certainly do not collide in others. Achieves near-conventional miss rates with lower associativity.
- Snooping coherence — A coherence model where every cache observes a shared interconnect and reacts to other caches' transactions to keep its state consistent.
- Spatial locality — The empirical property that if a memory location was accessed, locations near it (in address space) are likely to be accessed soon. Justifies fetching whole cache lines (typically 64 bytes) instead of individual words.
- Spatial-correlation prefetcher — A prefetcher that records the bitmap of touched offsets within a memory region (keyed by a triggering load PC), and on revisit of a similar region prefetches the same offset pattern. Exploits repetitive but irregular spatial layouts (e.g., database pages) that strided prefetchers miss.
- Speculative execution — Executing instructions past unresolved branches (or other unresolved events) and squashing their effects if the speculation turns out to be wrong.
- SPMD execution model — Programming idiom in which all processors run copies of the same program but branch on a per-thread ID to operate on different data. The common case for OpenMP, MPI, and CUDA kernels.
- Static (leakage) power — CMOS power dissipated when transistors are nominally off. Dominated by subthreshold leakage (exponential in temperature and inverse of threshold voltage) and gate-oxide tunneling leakage. Increases in importance with each successive process generation.
- Store buffer — A FIFO that holds committed-but-not-yet-written stores so the pipeline can continue after a store, with loads forwarding from matching addresses.
- Store-Sets — A memory-dependence predictor that maps each load PC to a set of store PCs it has aliased with, plus a Last Store Table that maps each store PC to the SQ index of its most recent dynamic instance. Lets a load wait for exactly the predicted older store rather than for all older stores.
- Stream-buffer prefetcher — A prefetcher (Jouppi) that maintains one or more FIFO buffers, each holding sequentially prefetched cache lines beyond a detected stream. Avoids cache pollution by checking the head of each FIFO on demand misses; can be augmented with stride detection.
- Stride prefetcher — A hardware prefetcher that detects constant strides between consecutive accesses by the same static load (or by the same address-history key) and issues prefetches for the predicted next addresses. Captures regular array and matrix patterns including non-unit strides.
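A sketch of the classic per-PC reference prediction table; the field layout and the confidence threshold are illustrative:

```python
class StridePrefetcher:
    """Per-PC table of (last address, stride, confidence)."""
    def __init__(self):
        self.table = {}   # pc -> [last_addr, stride, confidence]

    def access(self, pc, addr):
        """Record one demand access; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = [addr, 0, 0]
            return None
        entry = self.table[pc]
        stride = addr - entry[0]
        if stride == entry[1]:
            entry[2] = min(entry[2] + 1, 3)   # stride confirmed again
        else:
            entry[1], entry[2] = stride, 0    # retrain on a new stride
        entry[0] = addr
        if entry[2] >= 2:                     # confident: prefetch next address
            return addr + entry[1]
        return None
```

A load walking an array of 64-byte lines (0, 64, 128, ...) trains the entry in a few accesses, after which each access triggers a prefetch of the next line.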
- Structural hazard — A pipeline stall caused by two instructions needing the same hardware resource (e.g., a single memory port) in the same cycle.
- Superscalar processor — A processor that issues and executes more than one instruction per cycle by replicating functional units and fetch/decode/issue width. Drives the need for multi-port or multi-bank caches to sustain multiple memory references per cycle.
- Symmetric Multiprocessor (SMP) — A bus-based shared-memory multiprocessor where all processors are identical and have uniform access time to a single shared memory. Typical of 1990s servers, limited to ~8–16 processors by bus bandwidth.
- TAGE branch predictor — Seznec's modern direction predictor that uses multiple tagged PHTs indexed by geometrically increasing global-history lengths, with a base predictor as fallback. The longest-history table with a tag match supplies the prediction. Standard basis for Zen/Cortex BPUs.
- Temporal locality — The empirical property that if a memory location was accessed recently, it is likely to be accessed again soon. Justifies cache replacement policies that retain recently-used lines (e.g., LRU).
- Thermal Design Power (TDP) — The maximum sustained power a chip's package and cooling solution can dissipate without exceeding safe operating temperature. Modern desktop and server CPUs are TDP-limited, capping clock frequency and forcing architectural energy efficiency.
- Tomasulo's algorithm — A dynamic scheduling algorithm using reservation stations and a common data bus to execute instructions out-of-order while resolving RAW hazards via tag matching.
- Tournament (hybrid) branch predictor — A branch direction predictor that runs two component predictors in parallel (typically a simple bimodal and a correlated/gshare) and uses a meta-predictor table to choose between them per branch. McFarling's combining predictor and the Alpha 21264 are canonical examples.
- Trace cache — A replacement for the conventional instruction cache (Peleg & Weiser; Rotenberg et al.) that stores dynamic instruction sequences (traces), including statically non-contiguous instructions. A trace is tagged by its first PC plus the taken/not-taken pattern of its embedded branches, enabling wide single-cycle fetch across taken branches.
- Transactional memory — An optimistic concurrency-control mechanism that executes a critical section speculatively and commits atomically if no conflicting accesses occurred, otherwise rolls back. Trades hardware/software complexity for higher concurrency under low contention.
- Translation Lookaside Buffer (TLB) — A small fully or set-associative cache that holds recent virtual-to-physical address translations, avoiding a full page-table walk on every memory access.
- Two-bit saturating counter — A four-state counter (strongly/weakly taken, weakly/strongly not-taken) used as a per-branch direction predictor. Resists single-mispredict flips.
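The four states and the hysteresis fit in a few lines:

```python
class TwoBitCounter:
    """States 0..3: strongly-NT, weakly-NT, weakly-T, strongly-T."""
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```

A single anomalous outcome (e.g. a loop exit) moves a strongly-taken counter only to weakly-taken, so the prediction survives it.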
- Two-level (correlated) branch predictor — A direction predictor that maintains a recent-outcome history (Branch History Register, BHR) and uses it together with the branch PC to index a Pattern History Table of 2-bit counters. Captures self- and cross-branch correlation that a single bimodal counter misses.
- Victim cache — A small fully-associative buffer (typically 4–16 lines) that captures lines evicted from a direct-mapped or low-associativity main cache; subsequent misses also probe the victim, restoring conflict-evicted lines instead of going to the next level. Targets conflict misses.
- Virtual memory — The illusion of a private, large, contiguous address space per process, implemented by hardware paging and an OS-managed page table.
- Virtually-Indexed Physically-Tagged (VIPT) cache — A cache whose index is taken from the virtual address (typically from page-offset bits, so it is identical in VA and PA) while the tag is the physical page number from the TLB. Permits TLB and cache to be accessed in parallel; the cache hit signal is the AND of cache tag match and TLB-PPN match.
- Way prediction — An optimization for set-associative caches that predicts which way will hit (typically based on PC or address) and reads only that way's data array. Saves energy and sometimes latency on correct predictions; falls back to a full set-associative search on mispredictions.
- Work and critical path — Work T1 is the time to run a computation on one processor; critical path T_inf is its dependence-graph depth (time on infinitely many processors). Average parallelism is T1/T_inf, and a p-wide machine takes at least max(T1/p, T_inf) time.
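The bound max(T1/p, T_inf) makes the diminishing returns concrete; the numbers below are illustrative:

```python
def min_time(t1, t_inf, p):
    """Lower bound on p-processor runtime: max(T1 / p, T_inf)."""
    return max(t1 / p, t_inf)

# Work 1000, critical path 50 => average parallelism 20:
# below 20 processors the work term dominates, above it the path does.
print(min_time(1000, 50, 10))   # 100.0
print(min_time(1000, 50, 40))   # 50.0
```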
- Write-After-Read hazard (WAR) — A data hazard where a later instruction writes a register that an earlier instruction still needs to read. False dependence — eliminated by register renaming.
- Write-After-Write hazard (WAW) — A data hazard where two instructions write the same register out of program order. False dependence — eliminated by register renaming.
- Write-back cache — A cache that updates main memory only when a dirty line is evicted, using a dirty bit per line. Lower bandwidth than write-through, more complex coherence.
- Write-through cache — A cache that propagates every store to the next level immediately. Simpler coherence at the cost of higher write traffic.