Memory Hierarchy
Design
The whole chapter follows from one number that keeps growing — the distance, in cycles, from the processor to its data. Fast memory is small and costly; the hierarchy is the compromise. Read the idea, then drive the simulator beside it.
The chapter in one breath
A memory hierarchy stacks levels — registers, L1/L2/L3 caches (SRAM), main memory (DRAM), then Flash/disk — each larger, slower, and cheaper per byte than the one above. It works because of the principle of locality: programs reuse data in time (temporal) and space (spatial). The figure of merit is average memory access time (AMAT), and the chapter spends its energy on technologies and tricks that drive AMAT down: cache geometry, ten advanced optimizations, DRAM/HBM/Flash technology, virtual memory, and the security holes that the whole edifice opened up.
Locality & the Hierarchy
Why the pyramid exists, the inclusion property, and the processor–memory gap.
AMAT & Performance
The master formula, multilevel composition, and a calculator to feel CPI sensitivity.
Cache Org & the 3 C's
Mapping, tags, write policies — and a simulator that classifies every miss.
Replacement Policies
LRU/FIFO/NRU/Random head-to-head, plus Belady's anomaly.
Memory Technology
SRAM, DRAM banks/rows, DDR, HBM, and Flash.
Ten Optimizations
The chapter's core, each mapped to the AMAT term it attacks.
Virtual Memory
TLBs, page-table walks, page faults, and VIPT caches.
A53 vs Core i9
A real side-by-side of two shipping memory hierarchies.
Locality & the Memory Hierarchy
Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. The economical answer is a hierarchy that exploits locality and the cost/speed trade-offs of different memory technologies.
The principle of locality
Programs don't touch all code and data uniformly. They exhibit two kinds of locality, and every level of the hierarchy is a bet on one or both:
- Temporal locality: a byte referenced now is likely referenced again soon. (Loop counters, hot functions.) Caches exploit this by keeping recently used blocks.
- Spatial locality: a byte referenced now means its neighbors are likely referenced soon. (Array walks, instruction streams.) Caches exploit this by fetching a whole block (line) at once, not a single word.
Add one hardware fact — for a fixed technology and power budget, smaller memories can be made faster — and the hierarchy falls out naturally: keep the hot working set in a small fast store, back it with progressively larger, slower, cheaper stores.
The inclusion property
In most (not all) designs, the data in a lower level is a superset of the next level up. This inclusion property is typically maintained by main memory for caches, and by secondary storage for virtual memory. It simplifies coherence and lookup — but note the modern exception you'll meet in the case study: the Core i9's last-level cache is non-inclusive, so L2 and L3 hold different blocks and the effective capacity is their sum.
The processor–memory gap
The reason this chapter exists: processor demand for memory (requests/second) grew far faster than DRAM latency improved. DRAM latency improved only ~1.07×/year historically and has slowed since ~2010, while cores multiplied and each wants data faster. Bandwidth has scaled far better than latency — which is why so many techniques in this chapter are really about hiding or tolerating latency (prefetching, nonblocking caches, multiple banks) rather than eliminating it.
Spatial vs temporal locality — give a hardware mechanism that targets each.
Why not just build one giant fast memory and skip the hierarchy?
What breaks the hierarchy's effectiveness?
AMAT & Cache Performance
Average Memory Access Time is the single most important formula in the chapter. Every optimization that follows is an attack on one of its three terms.
Three levers, and naming which one you're pulling is the heart of cache design:
| Term | What it is | How you attack it |
|---|---|---|
| Hit time | Time to access a block that is in the cache | Smaller/simpler caches, way prediction, pipelined access, VIPT to overlap translation |
| Miss rate | Fraction of accesses not found | Bigger caches, higher associativity, larger blocks, better replacement, prefetching, compiler blocking |
| Miss penalty | Extra time to fetch a missed block from below | Multilevel caches, critical-word-first, merging write buffers, nonblocking caches, more memory banks/HBM |
It composes recursively
Real machines have multiple cache levels, so the "miss penalty" of one level is itself an AMAT of the level below. Unrolled for three levels with a DRAM backstop:
Try it — feel the sensitivity
The calculator below computes AMAT and effective CPI live. The lesson interviewers want you to internalize: because L1's miss rate multiplies everything beneath it, a tiny change in L1 ripples massively into CPI. Drag L1's miss rate and watch the CPI number move far more than an equal change to L3.
L1 First level
L2 Second level
L3 Third level
DRAM Main memory
CPI model
= 2.60 cycles
Canonical worked examples
The calculator builds intuition; these fixed examples are the arithmetic you should be able to reproduce on a whiteboard. Numbers follow the chapter's formulas (drill values derived).
| Design | Hit time | Miss rate | Miss penalty | AMAT |
|---|---|---|---|---|
| A · direct-mapped | 1.0 ns | 5% | 50 ns | 1.0 + 0.05·50 = 3.5 ns |
| B · 2-way | 1.2 ns | 3% | 50 ns | 1.2 + 0.03·50 = 2.7 ns |
B wins by 0.8 ns. The full answer: (1) compute both; (2) mechanism — associativity cuts conflict misses; (3) tradeoff — higher hit time and energy; (4) workload — only helps if conflicts actually dominate.
The trap: the 25% is L2's local rate (of accesses reaching L2). Only 1% of all CPU references actually miss L2 — that 1% is what AMAT and CPI weight.
OoO caveat: use only the non-overlapped penalty — independent work, prefetch, and multiple outstanding misses hide much of the raw latency.
The on-chip caches must supply the rest via internal bandwidth — multiported, pipelined, banked, nonblocking. This is the bandwidth half of the memory gap, distinct from latency.
Why do architects obsess over L1 hit time even though L1 misses are "rare"?
Your L3 local miss rate is 40%. Is that bad?
Give the CPI form of the cache-performance equation.
Cache Organization & the Three C's
A cache holds fixed-size blocks (lines). The two design questions are: where can a block go, and when it's not there, why did it miss? The second question — the three C's — is where interviews live.
Address decomposition & mapping
Given a block size and a number of sets, every address splits into three fields. This split is the cache's wiring — there's no arithmetic beyond shifting and masking:
- Block offset = log₂(block size) low bits — which byte within the block.
- Set index = log₂(number of sets) middle bits — which set to search.
- Tag = the remaining high bits — stored in the cache to confirm identity on a hit.
Associativity is just how many blocks live in one set. Direct-mapped = 1 way (one home per block, cheap but conflict-prone). Fully associative = 1 set (a block goes anywhere, no conflicts but expensive parallel tag search). n-way set associative is the practical middle.
Write policies
Reads are easy; writes force choices. Two orthogonal decisions:
| Decision | Option A | Option B |
|---|---|---|
| On a write hit, when to update memory? | Write-through: write cache + memory. Simple, always-coherent memory, but heavy write traffic (needs a write buffer). | Write-back: write only the cache, mark the line dirty, flush on eviction. Far less traffic; the common choice for L1/L2/L3. |
| On a write miss, fetch the block? | Write-allocate: bring the block in, then write. Pairs naturally with write-back. | No-write-allocate: write straight to memory, skip the cache. Pairs with write-through. |
The case-study machines both use write-back, write-allocate caches — note that in the A53/i9 sections.
The Three C's — a taxonomy of why misses happen
Every miss is exactly one of:
- Compulsory (cold): the first-ever reference to a block. Unavoidable except by prefetching or larger blocks. Independent of cache size.
- Capacity: the working set is simply bigger than the cache, so blocks are evicted and later re-fetched. Fix: bigger cache, or better locality.
- Conflict (collision): too many hot blocks map to the same set and evict each other, even though the cache as a whole has room. Fix: more associativity, victim cache, better indexing. Fully associative caches have zero conflict misses by definition.
Drive it
Configure a cache, pick a policy, and step an address trace. Watch sets fill and evict, see the live tag·index·offset decode, and read the three-C's breakdown. Try the presets in order — Sequential shows cold then capacity misses; Conflict thrash manufactures conflict misses you can then erase by raising associativity.
Cache state
16 sets × 2 ways · 16B blocks · tag/idx/off = 4/4/4 bitsCurrent access
Results
| # | addr | set | tag | result | type | evicted |
|---|---|---|---|---|---|---|
| 1 | 0x0 | 0 | 0x0 | miss | compulsory | — |
| 2 | 0x10 | 1 | 0x0 | miss | compulsory | — |
| 3 | 0x20 | 2 | 0x0 | miss | compulsory | — |
| 4 | 0x30 | 3 | 0x0 | miss | compulsory | — |
| 5 | 0x0 | 0 | 0x0 | hit | — | — |
| 6 | 0x10 | 1 | 0x0 | hit | — | — |
| 7 | 0x20 | 2 | 0x0 | hit | — | — |
| 8 | 0x30 | 3 | 0x0 | hit | — | — |
| 9 | 0x40 | 4 | 0x0 | miss | compulsory | — |
| 10 | 0x50 | 5 | 0x0 | miss | compulsory | — |
| 11 | 0x0 | 0 | 0x0 | hit | — | — |
| 12 | 0x10 | 1 | 0x0 | hit | — | — |
A workload thrashes one cache set. Which C, and what are your fixes — ranked?
Block size went up, miss rate went up. Explain.
Why are compulsory misses "the same" regardless of cache size?
Replacement Policies
When a set is full and a new block must enter, which resident block dies? Replacement only matters on a miss to a full set, but on associative caches the choice can swing the miss rate — which is why "better replacement policies" is one of the chapter's ten advanced optimizations.
| Policy | Victim | Cost / notes |
|---|---|---|
| LRU (least-recently-used) | The block unused for the longest time | Best intuition for temporal locality, but exact LRU needs per-way age state — expensive at high associativity, so real caches use approximations. |
| NRU / clock (not-recently-used) | A block whose reference bit is 0 | Cheap 1-bit-per-line approximation of LRU. What real "LRU" caches (including the case-study machines) actually ship. |
| FIFO | The oldest-inserted block | Cheap (one pointer per set), ignores reuse. Can suffer Belady's anomaly. |
| Random | A random way | Trivial hardware, surprisingly competitive at high associativity, and immune to pathological patterns. |
Head-to-head
Same cache, same trace, four policies. Replacement ties when there's no reuse pressure; it separates when sets overflow with reuse. Try the Set thrash and Loop > capacity presets to make the policies disagree.
Misses by policy · 4 sets × 4 ways · 14 accesses
Belady's anomaly
Intuition says more capacity can only help. FIFO breaks that. On the classic reference string below, going from 3 to 4 frames increases the fault count — because FIFO isn't a stack algorithm and gives up the inclusion property that guarantees monotonic behavior. LRU, OPT, and LFU are stack algorithms and never show the anomaly.
What's a "stack algorithm" and why does it matter?
At very high associativity, why does Random get competitive with LRU?
Memory Technology & Optimizations
Caches are SRAM; main memory is DRAM; storage is Flash. Knowing the physics of each — and especially how DRAM extracts bandwidth from a fundamentally high-latency array — separates a memory-systems candidate from a generalist.
SRAM vs DRAM
| SRAM (caches) | DRAM (main memory) | |
|---|---|---|
| Cell | ~6 transistors, bistable latch | 1 transistor + 1 capacitor (1T1C) |
| Density / cost | Low density, expensive per bit | High density, cheap per bit |
| Speed | Fast, no refresh | Slower; charge leaks → must refresh periodically |
| Quirk | Holds value while powered | Reads are destructive (must write back); access is row-then-column |
How DRAM makes bandwidth from latency
A DRAM access is two phases: activate a row into the row buffer (the slow part — RAS), then read columns out of that buffer (fast — CAS). The architecture's whole game is amortizing the expensive row activation:
- Row buffer = an open row acts as a small cache; consecutive accesses to the same row are fast (row hits).
- Burst mode + SDRAM/DDR: one address yields a stream of words. DDR (double data rate) transfers on both clock edges. Generations (DDR3→DDR4→DDR5) raise transfer rates; the case study uses DDR5-4800.
- Banks, ranks, channels: independent banks let multiple row activations overlap; multiple channels multiply bus width. This is how you scale bandwidth without lowering latency — a modern i9 can sustain enough memory parallelism to feed many cores.
DDR generations: bandwidth soared, latency stalled
Peak DIMM bandwidth = transfers/s × 8 bytes. Notice the right-hand column: row-miss latency is essentially flat while bandwidth climbs an order of magnitude — the bandwidth-not-latency story in one table.
| Standard | I/O clock | Transfers | DIMM bandwidth | Row-miss latency |
|---|---|---|---|---|
| DDR1 | 200 MHz | 400 MT/s | 3.2 GB/s | ~63 ns |
| DDR3 | 800 MHz | 1600 MT/s | 12.8 GB/s | ~39 ns |
| DDR4 | 1333 MHz | 2666 MT/s | 21.3 GB/s | ~39 ns |
| DDR5 | 2400 MHz | 4800 MT/s | 38.4 GB/s | ~39 ns |
Demand outruns a commodity DRAM bus by ~70×. The gap is closed by on-chip cache bandwidth (banks/ports/nonblocking) plus, for bandwidth-bound work, HBM.
HBM — stacking for bandwidth
High Bandwidth Memory stacks DRAM dies with through-silicon vias, announced across AMD/Intel/NVIDIA from ~2017. It delivers far higher bandwidth (and lower access energy per bit) than a single DDR bus, at higher cost. The chapter discusses HBM as an additional cache level (an L4/LLC, 10×+ the on-chip LLC) and even as main memory; the i9's memory channels can attach HBM or standard DIMMs. HBM is the backbone of GPUs and accelerators — directly relevant for NVIDIA-style interviews.
Flash & the "in-between" trap
Flash (NAND) is nonvolatile, denser and cheaper than DRAM but slower, and it wears out — blocks tolerate a limited number of writes, so controllers do wear leveling. SSDs replace disks for secondary storage. The chapter also flags PCM (phase-change memory) as a cautionary tale: a technology that sat between DRAM and Flash but offered no decisive win in speed or price, and so failed to gain momentum — a pitfall we revisit in §2.7.
Why is opening a DRAM row expensive but reading more columns cheap?
When would you choose HBM over DDR5, and what's the cost?
Why does Flash need wear leveling and what's the architectural consequence?
Dependability: ECC & Chipkill
At warehouse scale, "rare" memory errors become constant. A 10,000-server fleet sees DRAM faults continuously, so protection isn't optional — it's a design requirement, and the level of protection you choose is an architectural decision with real cost.
The ladder of protection
- Parity: 1 extra bit per word — detects a single-bit error, can't correct. Cheap but weak; a single-processor server with only parity has a worse unrecoverable error rate than a huge ECC fleet.
- SECDED ECC: single-error-correct, double-error-detect (Hamming-style codes). The standard for server DRAM. The chapter notes a ~17-server ECC system has roughly the failure rate of a 10,000-server Chipkill system — quantifying how much stronger Chipkill is.
- Chipkill: RAID-for-DRAM. Data and check bits are distributed across multiple chips so the system survives the complete failure of an entire DRAM chip. The chapter's figure: about one undetected/unrecoverable failure every ~2 months for a Chipkill-protected fleet — making Chipkill a requirement for large-scale systems.
| 10,000-processor server | Scheme | Unrecoverable / undetected failure rate |
|---|---|---|
| Detect only | Parity | ~1 every 17 minutes |
| Correct single-bit | ECC (SECDED) | ~1 every 7.5 hours |
| Survive whole-chip loss | Chipkill | ~1 every 2 months |
Same hardware, three protection levels, four orders of magnitude difference in failure interval — the quantitative case for matching protection to fleet size.
Why isn't SECDED ECC enough at warehouse scale?
What does ECC cost you?
The Ten Advanced Optimizations
The chapter classifies ten techniques by which metric they improve: hit time/power, bandwidth, miss penalty, miss rate, or miss-penalty/rate via parallelism. Complexity generally rises as you go down the list. Learn each as a triple: what problem, which AMAT term, what it costs.
| # | Optimization | Attacks | Idea & cost |
|---|---|---|---|
| 1 | Pipelined L1 caches with virtual indexing & set associativity | HIT BW | Pipeline the cache access so a new request starts each cycle; virtual indexing overlaps with translation. Raises throughput & clock; adds pipeline complexity and hit latency in cycles. |
| 2 | Multiple banks & ports to increase L1 D-cache bandwidth | BW HIT | Multiple banks/ports serve several accesses per cycle. Helps superscalar load/store throughput; bank conflicts and area are the cost. |
| 3 | Better replacement policies | MISS RATE | NRU/clock, tree-PLRU, RRIP — approximate LRU cheaply and dodge pathological evictions. Small state per set; the face-off widget above demonstrates the payoff. |
| 4 | Multibanked L2/L3 to cut power & latency, raise bandwidth | BW PWR PEN | Bank the large lower caches; activate only the addressed bank (less energy), serve refills in parallel. The i9's hashed 8-bank L3 is exactly this. |
| 5 | Nonblocking caches (hit-under-miss, miss-under-miss) | BW PEN | Let the cache keep serving hits (and further misses) while a miss is outstanding — essential with out-of-order execution. Needs MSHRs to track multiple in-flight misses; significant control complexity. |
| 6 | Critical word first & early restart | MISS PEN | Fetch the requested word first and resume the CPU immediately, before the rest of the block arrives. Cheap; helps most with large blocks/long transfers. The A53 does this. |
| 7 | Compiler optimizations (loop interchange, blocking/tiling) | MISS RATE | Restructure code/data so the working set fits and is reused before eviction. Zero hardware cost — pure software locality. Blocking a matrix multiply is the canonical example. |
| 8 | Hardware prefetching of instructions & data | PEN MR | Detect streams/strides and fetch ahead (stream buffers, next-line). Hides latency for regular patterns; "bad" prefetches waste bandwidth and can evict useful blocks. The i7/i9 prefetch into L1 and L2. |
| 9 | Compiler-controlled prefetching | PEN MR | Compiler inserts explicit, non-faulting prefetch instructions ahead of use. Precise but adds instruction overhead and needs accurate scheduling/distance tuning. |
| 10 | Multiple memory buses & modules / HBM | BW PEN | More channels and HBM widen and parallelize the path to memory, shortening effective block-fetch latency and feeding many cores. The most "system-level" lever; depends on HBM/packaging. |
Each optimization in depth
Expand any optimization for the full interview checklist: problem, technique, AMAT term, complexity, a concrete cited number, and the one-line takeaway. Cited figures follow the chapter's Section 2.3 discussion.
1 · Pipelined VIPT L1 caches HIT TIME BANDWIDTH
Technique: pipeline the access; index L1 with page-offset bits while the TLB translates, compare physical tags after.
AMAT term: hit time / clock (throughput).
Complexity: moderate — size/block/associativity constrained to avoid synonyms; more stages raise branch & load-use penalties.
Concrete: I-cache access latency grew Pentium 1 cyc → Pentium Pro–III 2 cyc → Pentium 4 / current i7/i9 4 cyc as pipelining deepened.
Takeaway: "fast L1" means high throughput at high clock, even as latency in cycles rises. Way prediction (2-way >90%, 4-way ~80% accurate; needs ≥10% speedup to pay) and victim caches are companion hit-time tricks.
2 · Multiple banks & ports (L1 D-cache) BANDWIDTH HIT TIME
Technique: bank the cache (
bank = block address MOD #banks) and/or add ports so independent accesses proceed in parallel.AMAT term: bandwidth (and effective hit throughput).
Complexity: pure multiporting is expensive; banking is cheaper but bank conflicts create variable service time.
P(no collision, 4 refs, 8 banks) = 7⁄8·6⁄8·5⁄8 ≈ 41%.Concrete: the i9 generates four memory references/clock; its L1 D-cache is dual-ported with eight banks.
Takeaway: banking buys bandwidth cheaply but trades deterministic latency for conflict-dependent service time.
3 · Better replacement policies MISS RATE
Technique: approximate Belady-MIN online — LRU, NRU/clock, or reuse predictors that separate streaming from reused blocks.
AMAT term: miss rate (matters most at L2/L3 where each miss is costly).
Complexity: exact LRU is costly at high associativity; NRU is ~1 bit/way.
Concrete: NRU is about 1% worse than LRU on a cited 2 MiB 16-way L2; a 2-bit reuse predictor is ~5% better than LRU there, and ~7% better in a 4-core multiprogrammed LLC.
Takeaway: L1 favors simple replacement (hit throughput dominates); L2/L3 can afford more state because each miss is expensive. (Drive the face-off widget in Replacement Policies.)
4 · Multibanked L2 / L3 BANDWIDTH POWER PENALTY
Technique: split into banks; activate only the addressed bank; serve multiple refills in parallel.
AMAT term: bandwidth, power, and bank-local latency/penalty.
Complexity: low-moderate; needs a bank-select hash to spread accesses.
Concrete: the i9's 30 MiB L3 is hashed across 8 banks; 3 hash bits pick the bank so only one activates.
Takeaway: banking is how big caches stay both low-power and high-bandwidth.
5 · Nonblocking caches (hit/miss-under-miss) PENALTY BANDWIDTH
Technique: keep serving hits (and further misses) during an outstanding miss, tracked by MSHRs (destination, tag, requesting load/store); returns may be out of order.
AMAT term: effective miss penalty (= non-overlapped stall) and bandwidth.
Complexity: high — arbitration, ordering, deadlock avoidance, coherence; a verification burden.
Concrete: Li et al. model on the i7 — one hit-under-miss reduces cache latency ~9% (SPECINT2006) and ~12.5% (SPECFP2006).
Takeaway: the metric becomes non-overlapped stall, not raw miss latency; effective penalty ≈ latency ÷ MLP.
6 · Critical word first & early restart MISS PENALTY
Technique: request the missed word first and restart the core the instant it arrives; fill the rest in the background (early restart = resume on normal-order arrival).
AMAT term: miss penalty.
Complexity: low; benefit grows with block size and falls if later words are reused immediately.
Concrete: SPECint2006 on i7-6700 averaged 1.23 references to a block with an outstanding miss (range 0.5–3.0) — modest reuse, so the technique helps but isn't dramatic.
Takeaway: cheap penalty reducer; most valuable with large blocks.
7 · Compiler optimizations (locality) MISS RATE
Technique: loop interchange (walk arrays in storage order → spatial locality) and blocking/tiling (operate on B×B submatrices so data is reused before eviction → temporal locality).
AMAT term: miss rate — at zero hardware cost.
Complexity: software; limited by loop structure and alias analysis.
Concrete: blocked matrix-multiply cuts memory words from
2N³ + N² to 2N³⁄B + N².Takeaway: with caches the compiler should expose blocking; with scratchpads, software must manage locality explicitly.
8 · Hardware prefetching PENALTY RATE
Technique: detect streams/strides and fetch ahead into cache or a stream buffer — no ISA/compiler burden.
AMAT term: miss penalty/rate via parallelism.
Complexity: moderate; bad prefetches waste bandwidth and evict useful lines (pollution).
Concrete: Skylake-SP has four data prefetchers; the L2 streamer provides ~70% of the CPI improvement on memory-intensive SPEC CPU2017, while L3 traffic rises ~19%.
Takeaway: accuracy and timeliness matter — prefetch is a bandwidth-for-latency trade that backfires on irregular, bandwidth-bound code.
9 · Compiler-controlled prefetching PENALTY RATE
Technique: compiler inserts explicit non-faulting prefetch instructions at a tuned distance before use.
AMAT term: miss penalty/rate.
Complexity: instruction overhead; needs accurate scheduling; weak for irregular pointer chasing.
Concrete: a cited loop's misses fall 251 → 19; the 232 avoided misses cost ~400 prefetch instructions, turning ~27,200 cycles into ~4,400.
Takeaway: precise and powerful when the access pattern is statically analyzable; overhead-bound otherwise.
10 · More memory channels / HBM as memory or LLC BANDWIDTH PENALTY
Technique: add channels/modules; use stacked HBM as main memory or as a giant L4/LLC.
AMAT term: bandwidth and effective block-fetch penalty.
Complexity: highest / most system-level — packaging, and especially tag/metadata placement for HBM-as-cache.
Concrete: a 1 GiB L4 with 64 B blocks needs ~96 MiB of tags. Loh–Hill places tags+data in the same HBM row; the Alloy cache (direct-mapped, tag+data together) is ~2× faster hit time but 1.13–1.2× higher miss rate.
Takeaway: HBM-as-cache wins when software placement is hard; HBM-as-memory wins when the runtime/OS/compiler can place critical data deliberately.
Nonblocking caches — worth a deeper look
With out-of-order execution, stalling the whole pipeline on one miss wastes enormous parallelism. A nonblocking (lockup-free) cache continues to satisfy hits under a miss, and a more aggressive one allows misses under misses (multiple outstanding). The bookkeeping lives in MSHRs (miss status handling registers), which track each in-flight miss so returning data is matched to the right request — and misses can return out of order, especially if L2 is itself nonblocking. This is the cache-side counterpart to memory-level parallelism in the CPU, and it's why effective miss penalty drops far below the raw DRAM latency on a well-designed core.
Walk me from "small/simple L1" to "big/associative L2" — why the split?
How does a nonblocking cache actually reduce effective miss penalty?
When does hardware prefetching hurt?
Virtual Memory & Protection
Virtual memory treats physical memory as a cache of secondary storage and gives every process its own address space. The TLB is a cache of translations; the page table is the backing store. Same hierarchy ideas, one level up.
The moving parts
- Pages are the blocks of virtual memory. A virtual address splits into a virtual page number (VPN) and a page offset.
- The page table maps VPN → physical frame number (PFN), with protection bits per entry. Only the OS may update it — the basis of memory protection.
- The TLB caches recent VPN→PFN translations so you don't walk the page table on every access. TLBs act as caches on the page table, just as caches act on memory.
Step through a translation
The widget runs a sequence of virtual addresses through TLB → page-table walk → physical address, at a readable teaching scale (16-bit VA, 256-byte pages ⇒ 8-bit VPN + 8-bit offset). The trace deliberately includes a TLB hit (repeat access to a page), a TLB miss that the page table resolves, and an address that page-faults. Press Next step to advance.
TLB · 4 entries, fully associative (LRU)
Page table (window)
The VIPT trick — why L1 size is "capped"
Translation is on the critical path, so we'd love the cache to start before the TLB finishes. If the cache index + block offset fit entirely within the page offset bits, those bits are identical in virtual and physical addresses — so the cache can index using virtual bits while the TLB translates the VPN, then check the physical tag at the end. That's a virtually-indexed, physically-tagged (VIPT) cache. It's why L1 capacity is often limited to roughly page size × associativity: it keeps the index inside the page offset and avoids aliasing. Both case-study L1s are VIPT — and the A53 even handles the one-bit overlap case with hardware alias detection.
Why is L1 commonly ≤ (page size × associativity)?
VIPT, PIPT, VIVT — trade-offs?
How does virtual memory provide protection?
Side-Channel Attacks on the Memory System
Virtual memory enforces protection in the architecture — but the microarchitecture leaks. The same caches and timing tricks that make memory fast can be turned into a covert channel that reads across protection boundaries. This is now core architect knowledge, not a footnote.
How they work
Side-channel memory attacks perturb the memory system and observe the effect through timing — using high-resolution timers or hardware performance counters. The cache is the leak: whether an address is cached or not changes its access latency, and that latency difference encodes secret-dependent behavior. Canonical patterns include Prime+Probe and Flush+Reload, where an attacker arranges cache state, lets the victim run, and then times its own accesses to infer which lines the victim touched.
Mitigations and their cost
Mitigations reduce the probability of leakage but can't eliminate all side channels if any resource is shared: partitioning caches, constant-time code that avoids secret-dependent memory access patterns, restricting fine-grained timers, flushing/isolating predictors and TLBs across boundaries, and disabling or constraining speculation on sensitive paths. Each costs performance — the running theme is that security and performance trade against each other in the memory system.
Why can't protection bits stop a cache side channel?
Sketch Flush+Reload.
clflushes a target line, lets the victim run, then times reloading that line: a fast reload means the victim accessed it (it's cached), a slow one means it didn't. Repeating over addresses reconstructs the victim's secret-dependent access trace — e.g., key-dependent table lookups in crypto.ARM Cortex-A53 vs Intel Core i9-12900
Two shipping memory hierarchies at opposite ends of the design space: a low-power embedded IP core and a high-end big.LITTLE desktop part. The contrast is the lesson — same principles, opposite priorities.
| Dimension | ARM Cortex-A53 | Intel Core i9-12900 |
|---|---|---|
| Role / market | Energy-efficient IP core for tablets & phones (PMD) | High-end desktop, Alder Lake |
| ISA / issue | ARMv8 (32 & 64-bit), 2-issue, up to ~1.3 GHz | x86-64, up to 4 instr/clock per P-core |
| Cores | Configurable; discussion is a single core | big.LITTLE: 8 P-cores + 8 E-cores; focus on one P-core |
| L1 | 8–64 KiB (32 KiB typical), 2-way, 64 B, VIPT, write-back/allocate, LRU-approx; critical-word-first | L1 I 32 KiB 8-way (4 cyc); L1 D 48 KiB 6-way (5 cyc), dual-ported, 8 banks; VIPT |
| L2 | Example 1 MiB; 2-level TLB; up to 4 memory banks | 1.25 MiB, 10-way, ~15-cycle latency (index = 2¹¹) |
| L3 / LLC | — | 30 MiB, 8 hashed banks (3.75 MiB/bank, 15-way), 12-bit index, ~50-cycle, non-inclusive |
| Main memory | 64–128-bit L2↔memory bus | DDR5-4800, 2 channels (HBM or DIMMs); miss penalty ≈ 200 cycles |
| Notable | Page-map cache cuts L2-TLB miss penalty; hardware alias detection for VIPT | Merging write buffer; LLC holds L2 evictions, so effective capacity ≈ L2 + L3 |
What to take from the numbers
- Non-inclusive LLC (i9): because L3 mainly holds blocks ejected from L2, L2 and L3 store different blocks — total cached data ≈ L2 + L3, not just L3. A deliberate capacity win over strict inclusion.
- Hashed L3 banking (i9): 3 bits of a hash select one of 8 banks, so only that bank activates — saving power (optimization 4) and spreading conflicts.
- Penalty dominates rate (A53): the chapter measures L1 miss rates ~7× the L2 rate, but the L2 penalty is ~9.5× larger — so L2 misses slightly dominate the memory-stressing benchmarks. A concrete reminder that miss rate alone never tells the story; you must weight by penalty.
- Local vs global, again: the A53's median L2 stand-alone miss rate is 15.1% but only 0.3% global — the §2.1 trap made real.
The full numbers, side by side
The detailed hierarchies from the chapter's figures. Read them as two answers to the same problem under different budgets.
ARM Cortex-A53 — PMD, energy-first
| Structure | Size | Organization | Penalty |
|---|---|---|---|
| Instr / Data µTLB | 10 entries each | fully associative | 2 cyc |
| L2 unified TLB | 512 entries | 4-way | 20 cyc |
| L1 I-cache | 8–64 KiB | 2-way, 64 B block | 13 cyc |
| L1 D-cache | 8–64 KiB | 2-way, 64 B block | 13 cyc |
| L2 unified | 128 KiB–2 MiB | 16-way, LRU-approx | 124 cyc |
Features: critical-word-first, up to four memory banks, VIPT L1, write-back L1 D and L2 with write-allocate, approximate LRU.
Intel Core i9-12900 — desktop, throughput-first
| Level | Size | Assoc. | Latency | Notes |
|---|---|---|---|---|
| L1 I | 32 KiB | 8-way | 4 cyc | per P-core |
| L1 D | 48 KiB | 6-way | 5 cyc | dual-ported, 8 banks |
| L2 | 1.25 MiB / core | 10-way | 15 cyc | private |
| L3 | 30 MiB shared | 15-way | 50 cyc | distributed, non-inclusive |
| DRAM | DDR4/DDR5, 2 ch | — | ~200 cyc miss | DDR5-4800 up to ~77 GB/s |
On an L3 miss the block is filled into L2 and L1, not inserted into L3 — the LLC primarily holds blocks ejected from L2, so effective capacity ≈ L2 + L3.
Design poles
| Axis | A53-like (power/area) | i9-like (throughput) |
|---|---|---|
| Top priority | energy efficiency, configurability | bandwidth, latency hiding, parallelism |
| L1 | small 2-way VIPT, critical-word-first | small but dual-ported, 8-bank VIPT |
| Mid/LLC | modest unified L2, ≤4 banks | private L2 + large non-inclusive banked L3 |
| Memory | narrow 64–128-bit bus | multi-channel, HBM-capable, prefetchers |
Reading the Measurements
A miss rate is not a cost. The single most common analysis error — and a favorite interview trap — is comparing miss rates across levels without weighting each by its penalty. This section is how to read cache data like an architect.
Weight every miss by its penalty
Misses at L1, L2, and L3 cost wildly different amounts. The honest figure of merit is the weighted contribution, not the raw rate:
MPKI (misses per thousand instructions) is the level-normalized rate; multiply by the level's penalty to get cycles. A 0.3% global L3 miss rate at a 200-cycle penalty can outweigh a 4% L1 rate at a 4-cycle penalty.
Higher rate, lower level — yet the lower level wins on cost because penalty scales faster than rate. Always multiply.
Diagnose by program behavior
| Signature | What it means | Where to look |
|---|---|---|
| High L1D MPKI, low L2/L3 | L1 locality / capacity issue; lower levels catch the working set | block size, associativity, blocking |
| Moderate L1D, high weighted L3 cost | memory-system bottleneck | prefetch accuracy, bandwidth, LLC design |
| Huge variance across programs | don't generalize from one workload | characterize a representative suite |
Why cache-busters dominate averages
Miss behavior varies enormously by program — the A53's L1D miss rate spans 0.5% to 37.3%, a factor of 75. A single cache-buster like MCF sets the upper bound and drags the arithmetic mean with it. So a lone "miss rate" is nearly meaningless: report distributions, and be explicit that one outlier may be steering the average. This is also the chapter's first fallacy — predicting one program's cache behavior from another's.
L1 has a higher miss rate than L2. Which matters more for performance?
How would you quickly find a memory bottleneck on a real workload?
Why is reporting a single average miss rate dangerous?
Fallacies & Pitfalls
Memory hierarchy is the most quantitative subfield in architecture, yet it's riddled with traps. Each of these doubles as an interview "what's wrong with this reasoning?" prompt.
Fallacy — Predicting one program's cache behavior from another's
Miss rates vary enormously by workload. The chapter shows three SPEC programs whose misses-per-1000-instructions for the same large cache differ by huge factors (e.g., 9 vs 2 vs ~90). A cache tuned to benchmark A can be terrible for benchmark B. Lesson: never quote a single miss rate as "the" miss rate — characterize across representative workloads, and beware means dominated by one cache-buster (like MCF).
Pitfall — Not simulating enough instructions for accurate memory measurements
Three nested traps: (1) predicting a large cache's behavior from a short trace (the trace never fills the cache); (2) assuming locality is constant over a run — it isn't; and (3) locality varies by phase, so a snippet misrepresents the whole. Lesson: warm the cache and simulate long, phase-representative traces, or your numbers are fiction.
Pitfall — Not delivering high memory bandwidth in a cache-based system
Caches improve average latency but don't guarantee bandwidth to an application that must keep going to main memory (streaming, large sparse, ML). You can have great hit times and still starve a bandwidth-bound kernel. Lesson: design for bandwidth explicitly — channels, banks, HBM — when the workload blows past the cache.
Pitfall — A memory technology that "fits between" two others but wins at neither
The PCM cautionary tale: a technology slotted between DRAM and Flash that offered no decisive advantage in speed or price over either neighbor, and so failed to gain momentum. Lesson: a new tier must dominate an existing one on a axis that matters, or the ecosystem routes around it.
Interview Rapid-Fire
A consolidated drill set for senior memory-systems loops. Cover the answer, say yours aloud, then check. If you can do all of these crisply, you own Chapter 2.
The numbers & formulas to have cold
| Thing | Have-it-cold version |
|---|---|
| AMAT | Hit time + Miss rate × Miss penalty |
| Multilevel AMAT | T₁ + MR₁(T₂ + MR₂(T₃ + MR₃·T_mem)) |
| CPI from memory | CPI = base + (refs/instr)·MR·penalty |
| Global miss rate | MR_global(Ln) = MR₁·MR₂·…·MR_n |
| The 3 C's | Compulsory (cold), Capacity (too small), Conflict (bad mapping) |
| 5 optimization buckets | ↓hit time/power · ↑bandwidth · ↓miss penalty · ↓miss rate · hide via parallelism (prefetch) |
| VIPT L1 size cap | ≲ page size × associativity |
| TLB miss vs page fault | tens of cycles (walk) vs millions (disk, OS exception) |
Conceptual drills
"Make this kernel faster" — give your memory-system checklist.
Why is "latency" the hard problem and "bandwidth" the easy one?
Design an L1 for a phone vs a server — what changes and why?
Explain why non-inclusion can beat inclusion.
Where does security enter the memory hierarchy?
Source Notes & Corrections
Where the key numbers come from, and how conflicts were resolved. Rule used throughout: when the uploaded review deck and the textbook disagree, prefer the chapter, then note the correction.
Key figures and their source
| Figure used here | Source |
|---|---|
| Memory-hierarchy levels, locality, inclusion, AMAT identity | CAQA 7e §2.1, Fig. 2.1 |
| §2.1 bandwidth example: demand ≈ 3840 GiB/s vs supply ≈ 56 GiB/s; i9 case-study DDR5-4800 peak ≈ 77 GB/s | CAQA 7e §2.1–2.2, §2.6 |
| DDR1–DDR5 transfer rates, ~39 ns flat row-miss latency | CAQA 7e §2.2, Figs. 2.4–2.5 |
| HBM: DDR5-4800 38.4 GiB/s, 4 stacks ≈ 4 TB/s; Loh–Hill, Alloy cache | CAQA 7e §2.2–2.3, Opt. 10, Fig. 2.16 |
| Dependability: parity ~17 min, ECC ~7.5 h, Chipkill ~2 months (10,000-proc) | CAQA 7e §2.2 |
| Ten optimizations + cited results (NRU ~1%; 2-bit predictor ~5–7%; hit-under-miss ~9%/12.5%; Skylake-SP L2 streamer ~70%, +19% L3; compiler 251→19 misses) | CAQA 7e §2.3, Opts. 1–10, Fig. 2.17 |
| VM/TLB, VIPT constraint, side-channel Prime/Probe and mitigations | CAQA 7e §2.4, §2.6 |
| A53 hierarchy (µTLB 10 / L2 TLB 512 / L1 8–64 KiB 2-way / L2 16-way; 2/20/13/124-cyc penalties) | CAQA 7e §2.6, Figs. 2.18–2.19 |
| A53 measured: L1D median 2.4% (0.5–37.3%); global L2 median 0.3%; L2 penalty ≈ 9.5× L1 | CAQA 7e §2.6, Figs. 2.20–2.21 |
| i9-12900: L1I 32 KiB 8-way; L1D 48 KiB 6-way dual-port 8-bank; L2 1.25 MiB 10-way; L3 30 MiB 15-way non-inclusive; ~200-cyc miss | CAQA 7e §2.6, Figs. 2.23–2.24 |
| Fallacies & pitfalls (program-to-program prediction; trace length; bandwidth; in-between tech / PCM) | CAQA 7e §2.7 |
Corrections applied
- i9 L1 size. An earlier draft of this guide listed the i9 L1 generically as "32 KiB, ~4-cycle." Corrected to the chapter/deck figures: L1 I 32 KiB 8-way (4 cyc) and L1 D 48 KiB 6-way (5 cyc), dual-ported, 8 banks. (The 32 KiB 4-cycle figure is the i7 example used elsewhere in the chapter, not the i9-12900 case study.)
- Two bandwidth figures, two contexts. The chapter cites ~56 GiB/s in the §2.1 multicore-demand example and ~77 GB/s for the i9-12900 DDR5-4800 case study — not a conflict, two different numbers. The §2.1 bandwidth-wall example here uses 56 GiB/s; the i9 case study keeps 77 GB/s.
- Case-study processors. This guide uses the 7th-edition "Putting It All Together" pairing — ARM Cortex-A53 and Intel Core i9-12900 — not the 6th-edition Cortex-A8 / Core i7.
- Replacement naming. Both sources describe shipping "LRU" as an approximation (NRU / tree-PLRU / reuse predictors), not exact LRU; the guide says so throughout.
- Derived drill numbers. The single-level AMAT A/B drill and the CPI example use the chapter's formulas with illustrative inputs (so labeled), not measured device data.