Memory Wall · H&P Ch.2
comprehensive · interactive · interview-ready
Hennessy & Patterson · Chapter 2 · Interactive Study Guide

Memory Hierarchy Design

The whole chapter follows from one number that keeps growing — the distance, in cycles, from the processor to its data. Fast memory is small and costly; the hierarchy is the compromise. Read the idea, then drive the simulator beside it.

cool = fast, on-chip SRAM · warm = slow, off-chip · sections marked live embed a working simulator

The chapter in one breath

A memory hierarchy stacks levels — registers, L1/L2/L3 caches (SRAM), main memory (DRAM), then Flash/disk — each larger, slower, and cheaper per byte than the one above. It works because of the principle of locality: programs reuse data in time (temporal) and space (spatial). The figure of merit is average memory access time (AMAT), and the chapter spends its energy on technologies and tricks that drive AMAT down: cache geometry, ten advanced optimizations, DRAM/HBM/Flash technology, virtual memory, and the security holes that the whole edifice opened up.

2.1

Locality & the Hierarchy

Why the pyramid exists, the inclusion property, and the processor–memory gap.

2.1 · live

AMAT & Performance

The master formula, multilevel composition, and a calculator to feel CPI sensitivity.

2.1 · live

Cache Org & the 3 C's

Mapping, tags, write policies — and a simulator that classifies every miss.

2.3 · live

Replacement Policies

LRU/FIFO/NRU/Random head-to-head, plus Belady's anomaly.

2.2

Memory Technology

SRAM, DRAM banks/rows, DDR, HBM, and Flash.

2.3

Ten Optimizations

The chapter's core, each mapped to the AMAT term it attacks.

2.4 · live

Virtual Memory

TLBs, page-table walks, page faults, and VIPT caches.

2.6

A53 vs Core i9

A real side-by-side of two shipping memory hierarchies.

Section 2.1

Locality & the Memory Hierarchy

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. The economical answer is a hierarchy that exploits locality and the cost/speed trade-offs of different memory technologies.

The principle of locality

Programs don't touch all code and data uniformly. They exhibit two kinds of locality, and every level of the hierarchy is a bet on one or both:

  • Temporal locality: a byte referenced now is likely referenced again soon. (Loop counters, hot functions.) Caches exploit this by keeping recently used blocks.
  • Spatial locality: a byte referenced now means its neighbors are likely referenced soon. (Array walks, instruction streams.) Caches exploit this by fetching a whole block (line) at once, not a single word.

Add one hardware fact — for a fixed technology and power budget, smaller memories can be made faster — and the hierarchy falls out naturally: keep the hot working set in a small fast store, back it with progressively larger, slower, cheaper stores.

[Figure: the hierarchy pyramid, top (faster, smaller, $$$/byte) to bottom (slower, bigger, ¢/byte)]
  • Registers
  • L1 cache (SRAM): ~32 KiB · ~1–4 cyc
  • L2 cache: ~256 KiB–1.25 MiB · ~10–15 cyc
  • L3 / LLC: ~2–30 MiB · ~40–50 cyc
  • Main memory (DRAM): GiBs · ~150–250 cyc
  • Flash / Disk (secondary): TiBs · ~10⁴–10⁶+ cyc
The hierarchy: each level trades capacity against speed. Latencies are representative single-core figures from the chapter's examples.

The inclusion property

In most (not all) designs, the data held in a lower level is a superset of the data in the level above it. This inclusion property is always required at the lowest level of each hierarchy: main memory for caches, and secondary storage for virtual memory. It simplifies coherence and lookup, but note the modern exception you'll meet in the case study: the Core i9's last-level cache is non-inclusive, so L2 and L3 can hold different blocks and the effective capacity can approach their sum.

The processor–memory gap

The reason this chapter exists: processor demand for memory (requests/second) grew far faster than DRAM latency improved. DRAM latency improved only ~1.07×/year historically and has slowed since ~2010, while cores multiplied and each wants data faster. Bandwidth has scaled far better than latency — which is why so many techniques in this chapter are really about hiding or tolerating latency (prefetching, nonblocking caches, multiple banks) rather than eliminating it.

Interview framing
Memorize the asymmetry: bandwidth is cheap to scale, latency is not. When you propose a memory-system fix in an interview, classify it: are you reducing latency, hiding latency, or buying bandwidth? Most "advanced optimizations" hide latency or buy bandwidth, because reducing raw DRAM latency is the hardest lever of all.
⚡ Interview rapid-fire
Spatial vs temporal locality — give a hardware mechanism that targets each.
Spatial → multi-word cache blocks and hardware prefetching (fetch the neighbors before they're asked for). Temporal → the cache itself plus a good replacement policy (keep recently used blocks resident). Larger blocks lean on spatial locality; larger caches and smarter replacement lean on temporal.
Why not just build one giant fast memory and skip the hierarchy?
Physics and economics. For a given technology, larger arrays are slower (longer wires, more decode) and SRAM cells are ~6 transistors vs DRAM's 1T1C, so fast memory is far costlier per bit. A hierarchy gives you nearly the speed of the fastest level at nearly the cost-per-byte of the cheapest — provided locality holds.
What breaks the hierarchy's effectiveness?
Workloads with poor locality — huge strided/streaming or pointer-chasing access patterns (graphs, sparse linear algebra) blow past cache capacity and turn into latency-bound DRAM traffic. That's exactly when bandwidth, prefetching, and non-blocking caches matter, and why "cache-busters" like MCF dominate the means in the chapter's measurements.
Section 2.1

AMAT & Cache Performance

Average Memory Access Time is the single most important formula in the chapter. Every optimization that follows is an attack on one of its three terms.

AMAT = Hit time + Miss rate × Miss penalty

Three levers, and naming which one you're pulling is the heart of cache design:

Term | What it is | How you attack it
Hit time | Time to access a block that is in the cache | Smaller/simpler caches, way prediction, pipelined access, VIPT to overlap translation
Miss rate | Fraction of accesses not found | Bigger caches, higher associativity, larger blocks, better replacement, prefetching, compiler blocking
Miss penalty | Extra time to fetch a missed block from below | Multilevel caches, critical-word-first, merging write buffers, nonblocking caches, more memory banks/HBM

It composes recursively

Real machines have multiple cache levels, so the "miss penalty" of one level is itself an AMAT of the level below. Unrolled for three levels with a DRAM backstop:

AMAT = T_L1 + MR_L1·( T_L2 + MR_L2·( T_L3 + MR_L3·T_mem ) )
Local vs global miss rate — the classic trap
Local miss rate = misses at a level ÷ accesses that reach that level. Global miss rate = misses at a level ÷ all CPU references (= product of local rates above it). A 40% local L3 miss rate sounds catastrophic but may be <0.5% global. For AMAT and CPI, the global rate is what generates stall cycles; the local rate is for judging that one cache's design in isolation.
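The recursive composition and the local-versus-global distinction can be sketched in a few lines of Python. This is a minimal illustration, not the guide's calculator code: the function names `amat` and `global_miss_rates` and the `(hit_time, local_miss_rate)` tuple layout are assumptions made here, and the inputs are the calculator defaults from this section.

```python
def amat(levels, t_mem):
    """Average memory access time for a stack of cache levels.

    levels: list of (hit_time_cycles, local_miss_rate), L1 first.
    t_mem:  latency of the backing memory in cycles.
    """
    penalty = t_mem
    # Fold from the last level upward: each level's miss penalty is the
    # AMAT of everything beneath it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


def global_miss_rates(levels):
    """Global miss rate per level: product of the local rates down to it."""
    rates, running = [], 1.0
    for _, local_rate in levels:
        running *= local_rate
        rates.append(running)
    return rates


levels = [(1, 0.04), (10, 0.25), (40, 0.40)]
print(round(amat(levels, t_mem=200), 2))                 # 2.6
print([round(r, 4) for r in global_miss_rates(levels)])  # [0.04, 0.01, 0.004]
```

The 40% local L3 rate shows up as 0.4% global, which is the figure that actually generates stall cycles.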

Try it — feel the sensitivity

The calculator below computes AMAT and effective CPI live. The lesson interviewers want you to internalize: because L1's miss rate multiplies everything beneath it, a tiny change in L1 ripples massively into CPI. Drag L1's miss rate and watch the CPI number move far more than an equal change to L3.

AMAT / CPI Calculator live · computed

Inputs
  • L1 (first level): hit time 1 cyc · miss rate 4.0%
  • L2 (second level): hit time 10 cyc · local miss rate 25%
  • L3 (third level): hit time 40 cyc · local miss rate 40%
  • DRAM (main memory): 200 cyc
  • CPI model: base CPI 1.0 · memory references per instruction 1.35

AMAT
AMAT = T₁ + MR₁·(T₂ + MR₂·(T₃ + MR₃·Tmem))
     = 1 + 0.040·(10 + 0.25·(40 + 0.40·200))
     = 2.60 cycles

Effective CPI
CPI = base + (mem refs/instr) × (AMAT − T₁)
    = 1.0 + 1.35 × 1.60 = 3.16

Global miss rate: L2 1.00% · L3 0.40%
Stall cycles per instruction: 2.16

Canonical worked examples

The calculator builds intuition; these fixed examples are the arithmetic you should be able to reproduce on a whiteboard. Numbers follow the chapter's formulas (drill values derived).

Worked · single-level AMAT — a higher hit time can still win
Design | Hit time | Miss rate | Miss penalty | AMAT
A · direct-mapped | 1.0 ns | 5% | 50 ns | 1.0 + 0.05·50 = 3.5 ns
B · 2-way | 1.2 ns | 3% | 50 ns | 1.2 + 0.03·50 = 2.7 ns

B wins by 0.8 ns. The full answer: (1) compute both; (2) mechanism — associativity cuts conflict misses; (3) tradeoff — higher hit time and energy; (4) workload — only helps if conflicts actually dominate.

Worked · multilevel AMAT + local vs global miss rate
given: HT₁ = 1, MR₁ = 4% · HT₂ = 10, MR₂(local) = 25% · penalty after L2 = 120 cyc
AMAT = HT₁ + MR₁·( HT₂ + MR₂·penalty )
     = 1 + 0.04·( 10 + 0.25·120 )
     = 1 + 0.04·40 = 2.6 cycles
L2 global miss rate = MR₁ · MR₂ = 0.04 × 0.25 = 1%

The trap: the 25% is L2's local rate (of accesses reaching L2). Only 1% of all CPU references actually miss L2 — that 1% is what AMAT and CPI weight.

Worked · turning misses into CPI
misses / instruction = miss rate × (memory accesses / instruction)
CPI_memory = (MPKI / 1000) × miss-penalty cycles
example: 30 MPKI at an effective 80-cycle non-overlapped penalty
       = (30 / 1000) × 80 = 2.4 stall cycles / instruction
effective CPI = base CPI + stall = 1.0 + 2.4 = 3.4

OoO caveat: use only the non-overlapped penalty — independent work, prefetch, and multiple outstanding misses hide much of the raw latency.
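The same arithmetic as a small Python sketch, handy for checking whiteboard answers. The function names are hypothetical; the only inputs are the quantities named in the worked example, and the penalty passed in is assumed to be the non-overlapped portion already.

```python
def mpki_from_rates(miss_rate, refs_per_instr):
    """Misses per 1000 instructions from miss rate and memory refs/instruction."""
    return miss_rate * refs_per_instr * 1000.0


def stall_cycles_per_instr(mpki, penalty_cycles):
    """Memory-stall cycles per instruction: (MPKI / 1000) x penalty."""
    return (mpki / 1000.0) * penalty_cycles


def effective_cpi(base_cpi, mpki, penalty_cycles):
    """Base pipeline CPI plus memory-stall cycles per instruction."""
    return base_cpi + stall_cycles_per_instr(mpki, penalty_cycles)


# The worked example: 30 MPKI at an 80-cycle non-overlapped penalty.
print(round(stall_cycles_per_instr(30, 80), 2))   # 2.4
print(round(effective_cpi(1.0, 30, 80), 2))       # 3.4
```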

Worked · the bandwidth wall (why caches aren't only latency filters)
peak demand = cores × clock × refs/cycle × bytes/ref
            = 8 × 3 GHz × 5 × 16 B ≈ 3840 GiB/s
supply: two-channel DDR5 (chapter §2.1 example) ≈ 56 GiB/s → demand ≫ supply

The on-chip caches must supply the rest via internal bandwidth — multiported, pipelined, banked, nonblocking. This is the bandwidth half of the memory gap, distinct from latency.

⚡ Interview rapid-fire
Why do architects obsess over L1 hit time even though L1 misses are "rare"?
Because L1 hit time sits in the common case of every memory instruction and often on the critical path of the pipeline — it can bound clock frequency. And because MR₁ multiplies the entire cost of everything below it, L1 is the highest-leverage point in the whole hierarchy. That tension (keep L1 fast and low-miss) is why L1s stay small/simple while L2/L3 grow.
Your L3 local miss rate is 40%. Is that bad?
Not necessarily — judge it globally. If L1 and L2 already filter the stream so that only, say, 1.2% of CPU references reach L3, then a 40% local miss rate is ~0.5% global. The chapter's A53 data makes this vivid: a median L2 stand-alone (local) miss rate of 15.1% corresponds to just 0.3% global. Always ask "local or global?" before reacting.
Give the CPI form of the cache-performance equation.
CPI = base CPI + (memory accesses per instruction) × (miss rate) × (miss penalty), i.e. memory-stall cycles per instruction added to the pipeline's base CPI. Equivalently, stall cycles/instr = (mem refs/instr) × (AMAT − hit time) when the hit time is already overlapped in the base CPI — which is exactly what the calculator above computes.
Section 2.1 · Appendix B foundations

Cache Organization & the Three C's

A cache holds fixed-size blocks (lines). The two design questions are: where can a block go, and when it's not there, why did it miss? The second question — the three C's — is where interviews live.

Address decomposition & mapping

Given a block size and a number of sets, every address splits into three fields. This split is the cache's wiring — there's no arithmetic beyond shifting and masking:

  • Block offset = log₂(block size) low bits — which byte within the block.
  • Set index = log₂(number of sets) middle bits — which set to search.
  • Tag = the remaining high bits — stored in the cache to confirm identity on a hit.

Associativity is just how many blocks live in one set. Direct-mapped = 1 way (one home per block, cheap but conflict-prone). Fully associative = 1 set (a block goes anywhere, no conflicts but expensive parallel tag search). n-way set associative is the practical middle.
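The shift-and-mask split can be sketched as follows. This is a minimal illustration that assumes the block size and set count are powers of two; the function name `decode` is made up here.

```python
def decode(addr, block_size, num_sets):
    """Split an address into (tag, set_index, block_offset).

    Assumes block_size and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset = addr & (block_size - 1)            # low bits: byte within block
    index = (addr >> offset_bits) & (num_sets - 1)   # middle bits: which set
    tag = addr >> (offset_bits + index_bits)    # remaining high bits
    return tag, index, offset


# 16 sets x 16-byte blocks: 4 offset bits, 4 index bits, the rest is tag.
print(decode(0x1A7, block_size=16, num_sets=16))  # (1, 10, 7)
```

Setting `num_sets=1` models a fully associative cache: the index field vanishes and every bit above the offset becomes tag.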

[Figure: the address splits into Tag | Index | Offset (high → low). The index selects one of sets 0 … N−1; the tag is then compared against every way in that set in parallel to decide hit or miss.]
Set-associative lookup: index picks the set, then the tag is compared against every way in parallel. More ways = fewer conflict misses but wider comparators and slower hit time.

Write policies

Reads are easy; writes force choices. Two orthogonal decisions:

Decision | Option A | Option B
On a write hit, when to update memory? | Write-through: write cache + memory. Simple, always-coherent memory, but heavy write traffic (needs a write buffer). | Write-back: write only the cache, mark the line dirty, flush on eviction. Far less traffic; the common choice for L1/L2/L3.
On a write miss, fetch the block? | Write-allocate: bring the block in, then write. Pairs naturally with write-back. | No-write-allocate: write straight to memory, skip the cache. Pairs with write-through.

The case-study machines both use write-back, write-allocate caches — note that in the A53/i9 sections.

The Three C's — a taxonomy of why misses happen

Every miss is exactly one of:

  • Compulsory (cold): the first-ever reference to a block. Unavoidable except by prefetching or larger blocks. Independent of cache size.
  • Capacity: the working set is simply bigger than the cache, so blocks are evicted and later re-fetched. Fix: bigger cache, or better locality.
  • Conflict (collision): too many hot blocks map to the same set and evict each other, even though the cache as a whole has room. Fix: more associativity, victim cache, better indexing. Fully associative caches have zero conflict misses by definition.
How the classification actually works
The rigorous test (Hill): run the same trace through a fully-associative LRU cache of the same total size. First touch → compulsory. A miss that also misses in the ideal cache → capacity. A miss that would have hit in the ideal cache → conflict. The simulator below does precisely this, so its breakdown is real, not estimated.
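A compact sketch of that classification procedure, assuming LRU replacement in the cache under test. It is an illustration of the test as described here, not the simulator's own code, and the name `classify_misses` is hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, block_size, num_sets, ways):
    """Classify each miss of a set-associative LRU cache as one of the three C's.

    A fully associative LRU cache with the same number of blocks runs
    alongside as the "ideal" reference.
    """
    sets = [OrderedDict() for _ in range(num_sets)]   # per-set LRU order
    ideal = OrderedDict()                             # fully associative LRU
    capacity = num_sets * ways
    seen = set()                                      # blocks ever referenced
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for addr in trace:
        block = addr // block_size
        target = sets[block % num_sets]

        # Probe the ideal cache first, then update its LRU state.
        ideal_hit = block in ideal
        if ideal_hit:
            ideal.move_to_end(block)
        else:
            if len(ideal) >= capacity:
                ideal.popitem(last=False)
            ideal[block] = True

        if block in target:
            target.move_to_end(block)
            counts["hit"] += 1
        else:
            if block not in seen:
                counts["compulsory"] += 1     # first touch
            elif ideal_hit:
                counts["conflict"] += 1       # ideal cache would have hit
            else:
                counts["capacity"] += 1       # even the ideal cache missed
            if len(target) >= ways:
                target.popitem(last=False)    # evict the set's LRU block
            target[block] = True
        seen.add(block)
    return counts


# Two blocks that collide in set 0 of a direct-mapped cache...
print(classify_misses([0x00, 0x40, 0x00, 0x40], 16, num_sets=4, ways=1))
# ...stop conflicting once the same capacity is organized as 2-way.
print(classify_misses([0x00, 0x40, 0x00, 0x40], 16, num_sets=2, ways=2))
```

In the first call the two repeat accesses are conflict misses; in the second they are hits, which is the "raise associativity" fix in miniature.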

Drive it

Configure a cache, pick a policy, and step an address trace. Watch sets fill and evict, see the live tag·index·offset decode, and read the three-C's breakdown. Try the presets in order — Sequential shows cold then capacity misses; Conflict thrash manufactures conflict misses you can then erase by raising associativity.

Configurable Cache Simulator live · computed

Cache state

16 sets × 2 ways · 16B blocks · tag/idx/off = 4/4/4 bits
sets 0–15: both ways empty
Legend: hit · miss · evicted · empty

Current access

— ready, press Step or Play —

Results

0 accesses · 0 hits · 0 misses · hit rate (no accesses yet)
The three C's: why misses happened
no misses yet
compulsory 0 · capacity 0 · conflict 0
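# | addr | set | tag | result | type | evicted
1 | 0x00 | 0 | 0x0 | miss | compulsory |
2 | 0x10 | 1 | 0x0 | miss | compulsory |
3 | 0x20 | 2 | 0x0 | miss | compulsory |
4 | 0x30 | 3 | 0x0 | miss | compulsory |
5 | 0x00 | 0 | 0x0 | hit | |
6 | 0x10 | 1 | 0x0 | hit | |
7 | 0x20 | 2 | 0x0 | hit | |
8 | 0x30 | 3 | 0x0 | hit | |
9 | 0x40 | 4 | 0x0 | miss | compulsory |
10 | 0x50 | 5 | 0x0 | miss | compulsory |
11 | 0x00 | 0 | 0x0 | hit | |
12 | 0x10 | 1 | 0x0 | hit | |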
⚡ Interview rapid-fire
A workload thrashes one cache set. Which C, and what are your fixes — ranked?
Conflict misses. Ranked fixes: (1) increase associativity — directly attacks it; (2) add a small victim cache to catch recently-evicted conflicting lines; (3) change indexing (e.g. hashed/skewed indexing, like the i9's hashed L3 banking) to spread the hot blocks across sets; (4) at software level, pad/realign data structures so they stop colliding. Prove it in the sim: load "Conflict thrash," then bump associativity and watch the orange segment vanish.
Block size went up, miss rate went up. Explain.
Larger blocks cut compulsory misses (more spatial locality per fetch) but, for a fixed cache size, mean fewer blocks — so capacity and conflict misses rise, and miss penalty grows (more bytes per transfer). Past the sweet spot the capacity/conflict increase dominates. The classic U-shaped miss-rate-vs-block-size curve.
Why are compulsory misses "the same" regardless of cache size?
A compulsory miss is the first-ever reference to a block; no cache, however large, can hold a block it has never seen. Only prefetching (fetch before first use) or larger blocks (amortize one cold miss over more bytes) reduce them. Those are the only two levers that touch the compulsory term; extra capacity and extra associativity do not.
Section 2.3 · Third Optimization

Replacement Policies

When a set is full and a new block must enter, which resident block dies? Replacement only matters on a miss to a full set, but on associative caches the choice can swing the miss rate — which is why "better replacement policies" is one of the chapter's ten advanced optimizations.

Policy | Victim | Cost / notes
LRU (least-recently-used) | The block unused for the longest time | Best intuition for temporal locality, but exact LRU needs per-way age state, which is expensive at high associativity, so real caches use approximations.
NRU / clock (not-recently-used) | A block whose reference bit is 0 | Cheap 1-bit-per-line approximation of LRU. What real "LRU" caches (including the case-study machines) actually ship.
FIFO | The oldest-inserted block | Cheap (one pointer per set), ignores reuse. Can suffer Belady's anomaly.
Random | A random way | Trivial hardware, surprisingly competitive at high associativity, and immune to pathological patterns.
The real-world answer
Production caches rarely implement true LRU. They use NRU/clock, tree-PLRU, or RRIP-style schemes that approximate LRU with a few bits per set. If asked "what replacement does a modern L2 use," the honest answer is "an LRU approximation" — exactly what the chapter says the A53 and i9 do.

Head-to-head

Same cache, same trace, four policies. Replacement ties when there's no reuse pressure; it separates when sets overflow with reuse. Try the Set thrash and Loop > capacity presets to make the policies disagree.

Replacement-Policy Face-off live · computed

Misses by policy · 4 sets × 4 ways · 14 accesses

LRU: 10 misses · FIFO: 12 misses · NRU: 12 misses · Random: 6 misses
Random wins (6 misses); FIFO trails (12). Spread of 6 misses from the policy choice alone. Random is seeded, so results are reproducible.

Belady's anomaly

Intuition says more capacity can only help. FIFO breaks that. On the classic reference string below, going from 3 to 4 frames increases the fault count — because FIFO isn't a stack algorithm and gives up the inclusion property that guarantees monotonic behavior. LRU, OPT, and LFU are stack algorithms and never show the anomaly.

Belady's Anomaly (FIFO) live · computed
Fully-associative FIFO on 1 2 3 4 1 2 5 1 2 3 4 5.
3 frames (FIFO): 9 page faults
4 frames (FIFO): 10 page faults
⚠️ Anomaly confirmed: going from 3 → 4 frames increased faults from 9 to 10. More memory, worse performance — because FIFO isn't a stack algorithm and gives up the inclusion property.
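The fault counts are easy to reproduce by hand or in code. The sketch below is a minimal fully associative model written for this check (the function names are not from the chapter); it runs FIFO and LRU on the same reference string.

```python
from collections import OrderedDict, deque


def fifo_faults(refs, frames):
    """Count faults for fully associative FIFO replacement."""
    resident, order, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue                              # FIFO ignores hits entirely
        faults += 1
        if len(resident) == frames:
            resident.discard(order.popleft())     # evict oldest-inserted page
        resident.add(page)
        order.append(page)
    return faults


def lru_faults(refs, frames):
    """Count faults for fully associative LRU replacement."""
    resident, faults = OrderedDict(), 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)            # refresh recency on a hit
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)          # evict least-recently-used
        resident[page] = True
    return faults


refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print([fifo_faults(refs, n) for n in (3, 4)])  # [9, 10]
print([lru_faults(refs, n) for n in (3, 4)])   # [10, 8]
```

FIFO gets worse with the fourth frame, while LRU improves from 10 faults to 8 on the same string, as a stack algorithm must never get worse.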
⚡ Interview rapid-fire
What's a "stack algorithm" and why does it matter?
A replacement policy is a stack algorithm if, for any reference string, the set of blocks held with N frames is always a superset of the set held with N−1 frames (the inclusion property). That guarantees more capacity never increases misses — so OPT, LRU, and LFU are immune to Belady's anomaly. FIFO violates it, which is why it can get worse with more frames. Run LRU on the same string in the face-off to see the anomaly vanish.
At very high associativity, why does Random get competitive with LRU?
With many ways, the probability that Random evicts the one block you're about to reuse is low, and the cost of tracking true recency grows. Empirically the gap narrows, and Random has two perks: trivial hardware and no pathological worst case (an adversarial stride can defeat LRU but not Random in expectation). That's why some designs use pseudo-random or RRIP rather than chasing exact LRU.
Section 2.2

Memory Technology & Optimizations

Caches are SRAM; main memory is DRAM; storage is Flash. Knowing the physics of each — and especially how DRAM extracts bandwidth from a fundamentally high-latency array — separates a memory-systems candidate from a generalist.

SRAM vs DRAM

Property | SRAM (caches) | DRAM (main memory)
Cell | ~6 transistors, bistable latch | 1 transistor + 1 capacitor (1T1C)
Density / cost | Low density, expensive per bit | High density, cheap per bit
Speed | Fast, no refresh | Slower; charge leaks → must refresh periodically
Quirk | Holds value while powered | Reads are destructive (must write back); access is row-then-column

How DRAM makes bandwidth from latency

A DRAM access is two phases: activate a row into the row buffer (the slow part — RAS), then read columns out of that buffer (fast — CAS). The architecture's whole game is amortizing the expensive row activation:

  • Row buffer = an open row acts as a small cache; consecutive accesses to the same row are fast (row hits).
  • Burst mode + SDRAM/DDR: one address yields a stream of words. DDR (double data rate) transfers on both clock edges. Generations (DDR3→DDR4→DDR5) raise transfer rates; the case study uses DDR5-4800.
  • Banks, ranks, channels: independent banks let multiple row activations overlap; multiple channels multiply bus width. This is how you scale bandwidth without lowering latency — a modern i9 can sustain enough memory parallelism to feed many cores.
[Figure: DRAM bank (rows × columns). 1. RAS: activate a row into the row buffer (slow); the open row acts as a mini-cache. 2. CAS: stream columns out in a burst (fast). A row hit is cheap; a row miss pays RAS again. Banks and channels overlap many of these accesses to build bandwidth.]
DRAM is latency-bound at the array but bandwidth-rich via row buffers, bursts, and bank/channel parallelism. "Open-row" scheduling exploits locality at the DRAM level, just as caches do at the chip level.
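A toy model of one bank under an open-row policy makes the row-hit versus row-miss asymmetry concrete. Everything numeric here is a placeholder: the 8 KiB row size and the 15 ns values for `t_cas`, `t_rcd` (activate) and `t_rp` (precharge) are illustrative assumptions, not chapter figures or datasheet values, and the function name is hypothetical.

```python
def open_row_latency(addresses, row_bytes=8192, t_cas=15, t_rcd=15, t_rp=15):
    """Return (row_hits, row_misses, total_ns) for one bank, open-row policy.

    All timing values and the row size are illustrative placeholders.
    """
    open_row = None
    hits = misses = total = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1
            total += t_cas                      # column access only
        else:
            misses += 1
            # Close the old row (if any), activate the new one, then read.
            total += (t_rp if open_row is not None else 0) + t_rcd + t_cas
            open_row = row
    return hits, misses, total


# Streaming through one row: one activation, then 127 cheap column reads.
print(open_row_latency(range(0, 8192, 64)))     # (127, 1, 1935)
# Ping-ponging between two rows: every access pays the activation again.
print(open_row_latency([0, 8192, 0, 8192]))     # (0, 4, 165)
```

Per access, the streaming pattern costs about 15 ns while the ping-pong pattern costs about 41 ns under these assumed timings, which is why access pattern matters so much for effective DRAM latency.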
[Figure: DRAM bank cutaway showing the cell array, row decoder, active wordline, sense amplifiers / row buffer, column mux, and the activate-read-precharge sequence]
Hardware · DRAM bank, up close. An ACTIVATE (RAS) opens one row (the wordline) into the row buffer / sense amplifiers; READ/WRITE (CAS) then streams columns out cheaply; PRECHARGE closes the row. The open row acts as a mini-cache — row hits are fast, a row miss pays activate + precharge — and independent banks overlap these steps to build bandwidth. Each cell is 1 transistor + 1 capacitor (1T1C). Illustrative.
[Figure: annotated DDR5 DIMM]
Hardware · DDR5 DIMM. Note the DDR5-specific parts: an on-DIMM PMIC and SPD hub, and a module split into two independent 32-bit sub-channels with on-die ECC inside each DRAM die. (Shown as an unbuffered UDIMM; server RDIMMs add a registering clock driver.) Illustrative.

DDR generations: bandwidth soared, latency stalled

Peak DIMM bandwidth = transfers/s × 8 bytes. Notice the right-hand column: row-miss latency is essentially flat while bandwidth climbs an order of magnitude — the bandwidth-not-latency story in one table.

Standard | I/O clock | Transfers | DIMM bandwidth | Row-miss latency
DDR1 | 200 MHz | 400 MT/s | 3.2 GB/s | ~63 ns
DDR3 | 800 MHz | 1600 MT/s | 12.8 GB/s | ~39 ns
DDR4 | 1333 MHz | 2666 MT/s | 21.3 GB/s | ~39 ns
DDR5 | 2400 MHz | 4800 MT/s | 38.4 GB/s | ~39 ns
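The bandwidth column follows directly from the formula above the table. A short Python check, assuming the standard 64-bit (8-byte) DIMM data path and decimal GB (10⁹ bytes); the function name is made up for this sketch.

```python
def peak_dimm_bandwidth_gbs(transfers_mt_s, bus_bytes=8):
    """Peak DIMM bandwidth in GB/s (10^9 bytes/s): transfers/s x bytes/transfer."""
    return transfers_mt_s * 1e6 * bus_bytes / 1e9


for name, mt_s in [("DDR1", 400), ("DDR3", 1600), ("DDR4", 2666), ("DDR5", 4800)]:
    print(name, round(peak_dimm_bandwidth_gbs(mt_s), 1))
# DDR1 3.2
# DDR3 12.8
# DDR4 21.3
# DDR5 38.4
```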
Worked · the multicore bandwidth wall
peak demand = 8 cores × 3 GHz × 5 refs/cyc × 16 B ≈ 3840 GiB/s
supply = two-channel DDR5, §2.1 example ≈ 56 GiB/s

Demand outruns a commodity DRAM bus by ~70×. The gap is closed by on-chip cache bandwidth (banks/ports/nonblocking) plus, for bandwidth-bound work, HBM.

HBM — stacking for bandwidth

High Bandwidth Memory stacks DRAM dies with through-silicon vias, announced across AMD/Intel/NVIDIA from ~2017. It delivers far higher bandwidth (and lower access energy per bit) than a single DDR bus, at higher cost. The chapter discusses HBM as an additional cache level (an L4/LLC, 10×+ the on-chip LLC) and even as main memory; the i9's memory channels can attach HBM or standard DIMMs. HBM is the backbone of GPUs and accelerators — directly relevant for NVIDIA-style interviews.

[Figure: cross-section of a 2.5D HBM stack on a silicon interposer]
Hardware · the real thing. A 2.5D package (e.g., CoWoS): the HBM stack and the compute die sit side by side on a silicon interposer, joined by microbumps; through-silicon vias (TSVs) carry signals vertically through the stack, and C4 bumps connect the interposer to the package substrate. This is what “HBM next to the processor” physically means. Illustrative.
[Figure: HBM in-package placement vs DDR5 DIMMs on the motherboard]
Hardware · placement & scale. Why HBM wins bandwidth: the stacks sit in-package, millimeters from the die, while DDR5 DIMMs live out on the motherboard. Shorter, wider links buy bandwidth; the DIMM’s distance buys capacity. Illustrative, not to scale.

Flash & the "in-between" trap

Flash (NAND) is nonvolatile, denser and cheaper than DRAM but slower, and it wears out — blocks tolerate a limited number of writes, so controllers do wear leveling. SSDs replace disks for secondary storage. The chapter also flags PCM (phase-change memory) as a cautionary tale: a technology that sat between DRAM and Flash but offered no decisive win in speed or price, and so failed to gain momentum — a pitfall we revisit in §2.7.

Interview framing
When discussing memory, always separate latency-bound from bandwidth-bound reasoning. DRAM latency has barely moved in a decade; bandwidth scales with banks/channels/HBM. Accelerator and server design is overwhelmingly about bandwidth and parallelism to hide a latency you can't fix.
⚡ Interview rapid-fire
Why is opening a DRAM row expensive but reading more columns cheap?
Activating a row drives a full wordline and senses thousands of tiny capacitors into the row buffer — slow and energy-heavy, and destructive (the row must be written back on precharge). Once latched in the row buffer, additional columns are just fast reads from SRAM-like sense amps. Hence "open-row" policies and why access pattern (row hits vs row misses) hugely affects effective DRAM latency.
When would you choose HBM over DDR5, and what's the cost?
Choose HBM when the workload is bandwidth-bound — GPUs, ML accelerators, large streaming/sparse workloads — where you need TB/s and better pJ/bit. Costs: higher $/GB, capacity limits, thermal/packaging complexity (stacked dies, TSVs, interposer). For latency-sensitive, capacity-hungry general servers, DDR5 DIMMs often still win. The i9 supporting either on its channels reflects exactly that trade-off.
Why does Flash need wear leveling and what's the architectural consequence?
NAND cells degrade after a bounded number of program/erase cycles, so the controller's FTL (flash translation layer) spreads writes across blocks to avoid hot spots, plus over-provisions and does garbage collection. Consequences: write amplification, variable/asymmetric read-vs-write latency, and the need to treat the device as a log-structured store — all of which leak into how you design the storage tier of the hierarchy.
Section 2.2

Dependability: ECC & Chipkill

At warehouse scale, "rare" memory errors become constant. A 10,000-server fleet sees DRAM faults continuously, so protection isn't optional — it's a design requirement, and the level of protection you choose is an architectural decision with real cost.

The ladder of protection

  • Parity: 1 extra bit per word; detects a single-bit error but can't correct it. Cheap but weak: by the chapter's comparison, even a single-processor server with only parity has a higher unrecoverable error rate than a 10,000-server Chipkill-protected system.
  • SECDED ECC: single-error-correct, double-error-detect (Hamming-style codes). The standard for server DRAM. The chapter notes a ~17-server ECC system has roughly the failure rate of a 10,000-server Chipkill system — quantifying how much stronger Chipkill is.
  • Chipkill: RAID-for-DRAM. Data and check bits are distributed across multiple chips so the system survives the complete failure of an entire DRAM chip. The chapter's figure: about one undetected/unrecoverable failure every ~2 months for a Chipkill-protected fleet — making Chipkill a requirement for large-scale systems.
10,000-processor server | Scheme | Unrecoverable / undetected failure rate
Detect only | Parity | ~1 every 17 minutes
Correct single-bit | ECC (SECDED) | ~1 every 7.5 hours
Survive whole-chip loss | Chipkill | ~1 every 2 months

Same hardware, three protection levels, and a failure interval that stretches from minutes to months (a factor of several thousand): the quantitative case for matching protection to fleet size.
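To make "single-error-correct, double-error-detect" concrete, here is a toy SECDED code in Python: an extended Hamming (8,4) code protecting 4 data bits. It is a teaching-scale sketch only; server DIMMs typically protect 64-bit words with 8 check bits, and nothing here reflects a particular memory controller. The function names and bit layout are assumptions of this example.

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Returns a list of 8 bits: index 0 is the overall parity bit,
    indices 1..7 are the Hamming(7,4) codeword (parity at 1, 2, 4).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def secded_decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'double-error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions; 0 if consistent
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume one and fix it (syndrome 0 means bit 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but nonzero syndrome: two flips, detectable only.
        return "double-error", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = secded_encode(0b1011)
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(word); two_flips[2] ^= 1; two_flips[6] ^= 1
print(secded_decode(word))        # ('ok', 11)
print(secded_decode(one_flip))    # ('corrected', 11)
print(secded_decode(two_flips))   # ('double-error', None)
```

A whole-chip failure flips many bits of the same word at once, which is beyond what this kind of code can correct; that is the gap Chipkill closes by spreading each word across chips.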

The scaling argument to remember
Error rate scales with the number of devices. What's negligible on one laptop is a daily event across a datacenter. So protection strength is chosen against fleet size, not single-machine intuition — parity for the desktop, ECC for servers, Chipkill for warehouse scale.
⚡ Interview rapid-fire
Why isn't SECDED ECC enough at warehouse scale?
SECDED corrects one bad bit and detects two per protected word — great against random bit flips, but it can't survive an entire chip dropping out (which takes out many bits in the protected words at once). Chipkill spreads each word's bits across chips so a whole-chip failure still leaves a correctable pattern. At 10,000-server scale, whole-device failures are frequent enough that you need that stronger guarantee.
What does ECC cost you?
Extra storage (check bits → wider DIMMs), extra latency/energy on the encode/decode path, and memory-controller complexity. Chipkill adds layout constraints (bits striped across chips) and can reduce achievable bandwidth or require wider accesses. It's a classic reliability-vs-cost/performance trade — justified only when fleet error rates demand it.
Section 2.3 · the core of the chapter

The Ten Advanced Optimizations

The chapter classifies ten techniques by which metric they improve: hit time/power, bandwidth, miss penalty, miss rate, or miss-penalty/rate via parallelism. Complexity generally rises as you go down the list. Learn each as a triple: what problem, which AMAT term, what it costs.

Legend: HIT = hit time · BW = bandwidth · PEN = miss penalty · MR = miss rate · PWR = power

# | Optimization | Attacks | Idea & cost
1 | Pipelined L1 caches with virtual indexing & set associativity | HIT, BW | Pipeline the cache access so a new request starts each cycle; virtual indexing overlaps with translation. Raises throughput & clock; adds pipeline complexity and hit latency in cycles.
2 | Multiple banks & ports to increase L1 D-cache bandwidth | BW, HIT | Multiple banks/ports serve several accesses per cycle. Helps superscalar load/store throughput; bank conflicts and area are the cost.
3 | Better replacement policies | MR | NRU/clock, tree-PLRU, RRIP: approximate LRU cheaply and dodge pathological evictions. Small state per set; the face-off widget above demonstrates the payoff.
4 | Multibanked L2/L3 to cut power & latency, raise bandwidth | BW, PWR, PEN | Bank the large lower caches; activate only the addressed bank (less energy), serve refills in parallel. The i9's hashed 8-bank L3 is exactly this.
5 | Nonblocking caches (hit-under-miss, miss-under-miss) | BW, PEN | Let the cache keep serving hits (and further misses) while a miss is outstanding; essential with out-of-order execution. Needs MSHRs to track multiple in-flight misses; significant control complexity.
6 | Critical word first & early restart | PEN | Fetch the requested word first and resume the CPU immediately, before the rest of the block arrives. Cheap; helps most with large blocks/long transfers. The A53 does this.
7 | Compiler optimizations (loop interchange, blocking/tiling) | MR | Restructure code/data so the working set fits and is reused before eviction. Zero hardware cost: pure software locality. Blocking a matrix multiply is the canonical example.
8 | Hardware prefetching of instructions & data | PEN, MR | Detect streams/strides and fetch ahead (stream buffers, next-line). Hides latency for regular patterns; "bad" prefetches waste bandwidth and can evict useful blocks. The i7/i9 prefetch into L1 and L2.
9 | Compiler-controlled prefetching | PEN, MR | Compiler inserts explicit, non-faulting prefetch instructions ahead of use. Precise but adds instruction overhead and needs accurate scheduling/distance tuning.
10 | Multiple memory buses & modules / HBM | BW, PEN | More channels and HBM widen and parallelize the path to memory, shortening effective block-fetch latency and feeding many cores. The most "system-level" lever; depends on HBM/packaging.

Each optimization in depth

Expand any optimization for the full interview checklist: problem, technique, AMAT term, complexity, a concrete cited number, and the one-line takeaway. Cited figures follow the chapter's Section 2.3 discussion.

1 · Pipelined VIPT L1 caches  HIT TIME BANDWIDTH
Problem: translation and cache access on the load-use critical path cap clock speed.
Technique: pipeline the access; index L1 with page-offset bits while the TLB translates, compare physical tags after.
AMAT term: hit time / clock (throughput).
Complexity: moderate — size/block/associativity constrained to avoid synonyms; more stages raise branch & load-use penalties.
Concrete: I-cache access latency grew Pentium 1 cyc → Pentium Pro–III 2 cyc → Pentium 4 / current i7/i9 4 cyc as pipelining deepened.
Takeaway: "fast L1" means high throughput at high clock, even as latency in cycles rises. Way prediction (2-way >90%, 4-way ~80% accurate; needs ≥10% speedup to pay) and victim caches are companion hit-time tricks.
2 · Multiple banks & ports (L1 D-cache)  BANDWIDTH HIT TIME
Problem: wide-issue cores demand several loads/stores per cycle; one port can't keep up.
Technique: bank the cache (bank = block address MOD #banks) and/or add ports so independent accesses proceed in parallel.
AMAT term: bandwidth (and effective hit throughput).
Complexity: pure multiporting is expensive; banking is cheaper but bank conflicts create variable service time. P(no collision, 4 refs, 8 banks) = 7⁄8·6⁄8·5⁄8 ≈ 41%.
Concrete: the i9 generates four memory references/clock; its L1 D-cache is dual-ported with eight banks.
Takeaway: banking buys bandwidth cheaply but trades deterministic latency for conflict-dependent service time.
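The 41% figure quoted for optimization 2 can be checked directly. The sketch below is a minimal illustration, assuming references are independent and uniformly spread over banks (real address streams are not); the function names are hypothetical, not from the chapter.

```python
from fractions import Fraction
from itertools import product


def bank_of(block_address: int, num_banks: int) -> int:
    """Sequential interleaving: bank = block address MOD number of banks."""
    return block_address % num_banks


def p_no_conflict(num_refs: int, num_banks: int) -> Fraction:
    """Chance that num_refs uniformly random references hit distinct banks."""
    p = Fraction(1)
    for already_taken in range(num_refs):
        # The next reference must avoid every bank already in use.
        p *= Fraction(num_banks - already_taken, num_banks)
    return p


def p_no_conflict_enumerated(num_refs: int, num_banks: int) -> Fraction:
    """Same probability by brute-force enumeration of all bank assignments."""
    outcomes = list(product(range(num_banks), repeat=num_refs))
    distinct = sum(1 for banks in outcomes if len(set(banks)) == num_refs)
    return Fraction(distinct, len(outcomes))


if __name__ == "__main__":
    p = p_no_conflict(4, 8)
    print(p, float(p))
```

Under this uniform model, four references per clock collide in at least one bank more often than not, which is why conflict-dependent service time is the stated cost of banking.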
3 · Better replacement policies  MISS RATE
Problem: evicting a block you'll soon reuse wastes the capacity you have.
Technique: approximate Belady-MIN online — LRU, NRU/clock, or reuse predictors that separate streaming from reused blocks.
AMAT term: miss rate (matters most at L2/L3 where each miss is costly).
Complexity: exact LRU is costly at high associativity; NRU is ~1 bit/way.
Concrete: NRU is about 1% worse than LRU on a cited 2 MiB 16-way L2; a 2-bit reuse predictor is ~5% better than LRU there, and ~7% better in a 4-core multiprogrammed LLC.
Takeaway: L1 favors simple replacement (hit throughput dominates); L2/L3 can afford more state because each miss is expensive. (Drive the face-off widget in Replacement Policies.)
4 · Multibanked L2 / L3  BANDWIDTH POWER PENALTY
Problem: large lower caches burn energy if fully activated and serialize parallel misses.
Technique: split into banks; activate only the addressed bank; serve multiple refills in parallel.
AMAT term: bandwidth, power, and bank-local latency/penalty.
Complexity: low-moderate; needs a bank-select hash to spread accesses.
Concrete: the i9's 30 MiB L3 is hashed across 8 banks; 3 hash bits pick the bank so only one activates.
Takeaway: banking is how big caches stay both low-power and high-bandwidth.
5 · Nonblocking caches (hit/miss-under-miss)  PENALTY BANDWIDTH
Problem: stalling the whole pipeline on one miss wastes OoO parallelism.
Technique: keep serving hits (and further misses) during an outstanding miss, tracked by MSHRs (destination, tag, requesting load/store); returns may be out of order.
AMAT term: effective miss penalty (= non-overlapped stall) and bandwidth.
Complexity: high — arbitration, ordering, deadlock avoidance, coherence; a verification burden.
Concrete: Li et al. model on the i7 — one hit-under-miss reduces cache latency ~9% (SPECINT2006) and ~12.5% (SPECFP2006).
Takeaway: the metric becomes non-overlapped stall, not raw miss latency; effective penalty ≈ latency ÷ MLP.
6 · Critical word first & early restart  MISS PENALTY
Problem: the core needs one word now but the whole block is in flight.
Technique: critical word first requests the missed word first and restarts the core the instant it arrives, filling the rest of the block in the background. Early restart fetches the block in normal order but resumes the core as soon as the requested word arrives.
AMAT term: miss penalty.
Complexity: low; benefit grows with block size and falls if later words are reused immediately.
Concrete: SPECint2006 on i7-6700 averaged 1.23 references to a block with an outstanding miss (range 0.5–3.0) — modest reuse, so the technique helps but isn't dramatic.
Takeaway: cheap penalty reducer; most valuable with large blocks.
7 · Compiler optimizations (locality)  MISS RATE
Problem: bad access order misses even when the data would fit.
Technique: loop interchange (walk arrays in storage order → spatial locality) and blocking/tiling (operate on B×B submatrices so data is reused before eviction → temporal locality).
AMAT term: miss rate — at zero hardware cost.
Complexity: software; limited by loop structure and alias analysis.
Concrete: blocked matrix-multiply cuts memory words from 2N³ + N² to 2N³⁄B + N².
Takeaway: with caches the compiler should expose blocking; with scratchpads, software must manage locality explicitly.
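Blocking is easier to see in code than in prose. The sketch below is a plain-Python illustration, not tuned code: it tiles the j and k loops into B-wide strips, checks the result against the naive triple loop, and evaluates the two memory-word formulas quoted above. The function names and the matrix and block sizes are assumptions chosen for illustration.

```python
def matmul_naive(a, b):
    """Textbook i-j-k matrix multiply on square lists of lists."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_blocked(a, b, block):
    """Same product, computed one block-wide strip of j and k at a time."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):
        for kk in range(0, n, block):
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    acc = c[i][j]  # partial sum from earlier kk strips
                    for k in range(kk, min(kk + block, n)):
                        acc += a[i][k] * b[k][j]
                    c[i][j] = acc
    return c


def words_unblocked(n):
    """Worst-case memory words accessed without blocking: 2N^3 + N^2."""
    return 2 * n**3 + n**2


def words_blocked(n, block):
    """Memory words accessed with B x B blocking: 2N^3/B + N^2."""
    return 2 * n**3 / block + n**2


if __name__ == "__main__":
    n = 512
    print(words_unblocked(n) / words_blocked(n, 32))
```

The traffic formulas are the chapter's worst-case counts; the actual benefit on a machine depends on whether the chosen block fits in the cache.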
8 · Hardware prefetching  PENALTY RATE
Problem: demand misses expose full latency.
Technique: detect streams/strides and fetch ahead into cache or a stream buffer — no ISA/compiler burden.
AMAT term: miss penalty/rate via parallelism.
Complexity: moderate; bad prefetches waste bandwidth and evict useful lines (pollution).
Concrete: Skylake-SP has four data prefetchers; the L2 streamer provides ~70% of the CPI improvement on memory-intensive SPEC CPU2017, while L3 traffic rises ~19%.
Takeaway: accuracy and timeliness matter — prefetch is a bandwidth-for-latency trade that backfires on irregular, bandwidth-bound code.
9 · Compiler-controlled prefetching  PENALTY RATE
Problem: hardware can't always see far enough ahead for irregular but analyzable loops.
Technique: compiler inserts explicit non-faulting prefetch instructions at a tuned distance before use.
AMAT term: miss penalty/rate.
Complexity: instruction overhead; needs accurate scheduling; weak for irregular pointer chasing.
Concrete: a cited loop's misses fall 251 → 19; the 232 avoided misses cost ~400 prefetch instructions, turning ~27,200 cycles into ~4,400.
Takeaway: precise and powerful when the access pattern is statically analyzable; overhead-bound otherwise.
10 · More memory channels / HBM as memory or LLC  BANDWIDTH PENALTY
Problem: a single bus can't feed many cores; package traversal is costly.
Technique: add channels/modules; use stacked HBM as main memory or as a giant L4/LLC.
AMAT term: bandwidth and effective block-fetch penalty.
Complexity: highest / most system-level — packaging, and especially tag/metadata placement for HBM-as-cache.
Concrete: a 1 GiB L4 with 64 B blocks needs ~96 MiB of tags. Loh–Hill places tags+data in the same HBM row; the Alloy cache (direct-mapped, tag+data together) is ~2× faster hit time but 1.13–1.2× higher miss rate.
Takeaway: HBM-as-cache wins when software placement is hard; HBM-as-memory wins when the runtime/OS/compiler can place critical data deliberately.
Mental model for interviews
Don't memorize ten names in a list — memorize the five buckets and place each technique: (1) cut hit time/power, (2) buy bandwidth, (3) cut miss penalty, (4) cut miss rate, (5) hide miss penalty/rate via parallelism (prefetch). Then for any proposed technique, you can instantly say which AMAT term it moves and what it costs. That's the reasoning interviewers probe, not recall.

Nonblocking caches — worth a deeper look

With out-of-order execution, stalling the whole pipeline on one miss wastes enormous parallelism. A nonblocking (lockup-free) cache continues to satisfy hits under a miss, and a more aggressive one allows misses under misses (multiple outstanding). The bookkeeping lives in MSHRs (miss status handling registers), which track each in-flight miss so returning data is matched to the right request — and misses can return out of order, especially if L2 is itself nonblocking. This is the cache-side counterpart to memory-level parallelism in the CPU, and it's why effective miss penalty drops far below the raw DRAM latency on a well-designed core.
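To make "effective penalty ≈ latency ÷ MLP" concrete, here is a deliberately idealized model: all misses are independent, issue back to back, and the only limit is how many can be outstanding at once. The 200-cycle latency echoes the miss penalty used elsewhere in this guide; the MSHR counts are hypothetical.

```python
import math


def effective_miss_penalty(num_misses, latency, max_outstanding):
    """Average non-overlapped stall per miss in an idealized machine.

    Assumes independent misses that issue in waves of at most
    max_outstanding, each wave costing one full memory latency.
    """
    if num_misses == 0:
        return 0.0
    waves = math.ceil(num_misses / max_outstanding)
    total_stall = waves * latency
    return total_stall / num_misses


if __name__ == "__main__":
    for mshrs in (1, 2, 4, 8):
        print(mshrs, effective_miss_penalty(8, 200, mshrs))
```

With one outstanding miss the model reduces to a blocking cache (full latency per miss); real cores fall between these points because misses are often dependent and bandwidth is finite.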

⚡ Interview rapid-fire
Walk me from "small/simple L1" to "big/associative L2" — why the split?
L1 is on the critical path of every access and bounds clock, so you keep it small and simple (low hit time, optimizations 1–2). An L1 miss is served by the next level, and a miss at that level is expensive, so L2/L3 optimize for miss rate and bandwidth with size, associativity, and banking (optimizations 3–4), accepting higher hit latency because they are off the common path. The hierarchy is each level specializing for a different AMAT term.
How does a nonblocking cache actually reduce effective miss penalty?
By overlapping: while one miss is being serviced from L2/DRAM, the core keeps issuing independent loads that hit (or that miss and are also serviced concurrently). MSHRs track multiple outstanding misses so their latencies overlap rather than serialize. Net effective penalty ≈ raw latency ÷ memory-level parallelism — far below the single-miss number.
When does hardware prefetching hurt?
When predictions are wrong or the workload is already cache-resident: useless prefetches consume memory bandwidth and can evict live blocks, raising miss rate — the opposite of the goal. Aggressive prefetch is a bandwidth-for-latency trade; it pays on regular, bandwidth-spare workloads and backfires on irregular, bandwidth-bound ones. Good prefetchers throttle based on accuracy and memory pressure.
Section 2.4

Virtual Memory & Protection

Virtual memory treats physical memory as a cache of secondary storage and gives every process its own address space. The TLB is a cache of translations; the page table is the backing store. Same hierarchy ideas, one level up.

The moving parts

  • Pages are the blocks of virtual memory. A virtual address splits into a virtual page number (VPN) and a page offset.
  • The page table maps VPN → physical frame number (PFN), with protection bits per entry. Only the OS may update it — the basis of memory protection.
  • The TLB caches recent VPN→PFN translations so you don't walk the page table on every access. TLBs act as caches on the page table, just as caches act on memory.
TLB miss ≠ page fault
A TLB miss means only that the translation isn't cached. If the page is resident in memory, a page-table walk (often done in hardware) resolves it in tens of cycles. A page fault means the page isn't resident at all: an OS-handled exception that fetches from disk/Flash, costing millions of cycles. Confusing these is a classic interview tell.

Step through a translation

The widget runs a sequence of virtual addresses through TLB → page-table walk → physical address, at a readable teaching scale (16-bit VA, 256-byte pages ⇒ 8-bit VPN + 8-bit offset). The trace deliberately includes a TLB hit (repeat access to a page), a TLB miss that the page table resolves, and an address that page-faults. Press Next step to advance.

TLB & Translation Walk (live widget, initial state)
Trace: 6 virtual addresses
TLB: 4 entries, fully associative (LRU), initially empty

Page table (window)

VPN | present | PFN
0x5 | yes | 0xA
0x12 | yes | 0x3
0x13 | yes | 0x7
0x40 | yes | 0x1
0x41 | yes | 0x9
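The translation steps can also be sketched in a few lines. This is a minimal model, assuming the teaching scale described above (16-bit VA, 256-byte pages, a 4-entry fully associative LRU TLB) and the five page-table entries shown in the window. The trace addresses are hypothetical, chosen to produce a miss, a hit, and a page fault; they are not the widget's own trace.

```python
from collections import OrderedDict

PAGE_BITS = 8  # 256-byte pages: low 8 bits are the page offset
PAGE_TABLE = {0x05: 0x0A, 0x12: 0x03, 0x13: 0x07, 0x40: 0x01, 0x41: 0x09}


class TLB:
    """Fully associative TLB with LRU replacement."""

    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()  # oldest (least recently used) first

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)  # mark as most recently used
            return self.map[vpn]
        return None

    def insert(self, vpn, pfn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict the LRU entry
        self.map[vpn] = pfn


def translate(va, tlb):
    """Return (event, physical address or None) for one virtual address."""
    vpn = va >> PAGE_BITS
    offset = va & ((1 << PAGE_BITS) - 1)
    pfn = tlb.lookup(vpn)
    if pfn is not None:
        return "TLB hit", (pfn << PAGE_BITS) | offset
    if vpn not in PAGE_TABLE:
        return "page fault", None  # the OS would have to bring the page in
    pfn = PAGE_TABLE[vpn]          # page-table walk resolves the miss
    tlb.insert(vpn, pfn)
    return "TLB miss, walk", (pfn << PAGE_BITS) | offset


if __name__ == "__main__":
    tlb = TLB()
    for va in (0x0510, 0x0524, 0x1200, 0x9000):  # hypothetical trace
        event, pa = translate(va, tlb)
        print(hex(va), event, None if pa is None else hex(pa))
```

Note that the offset passes through untouched; only the VPN is replaced by the PFN, which is the property the VIPT trick relies on.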

The VIPT trick — why L1 size is "capped"

Translation is on the critical path, so we'd love the cache to start before the TLB finishes. If the cache index + block offset fit entirely within the page offset bits, those bits are identical in virtual and physical addresses — so the cache can index using virtual bits while the TLB translates the VPN, then check the physical tag at the end. That's a virtually-indexed, physically-tagged (VIPT) cache. It's why L1 capacity is often limited to roughly page size × associativity: it keeps the index inside the page offset and avoids aliasing. Both case-study L1s are VIPT — and the A53 even handles the one-bit overlap case with hardware alias detection.
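The size constraint is just bit counting, which a short sketch makes explicit. It assumes power-of-two sizes and a 4 KiB page in the examples (the page size is an assumption, not a figure from this guide); the function names are hypothetical.

```python
def log2_exact(value):
    """log2 of a power of two; reject anything else."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{value} is not a power of two")
    return value.bit_length() - 1


def vipt_alias_bits(cache_bytes, ways, page_bytes):
    """Number of index bits that spill past the page offset.

    index bits + block-offset bits = log2(bytes per way), so the cache is
    alias-free under VIPT when bytes per way <= page size (result 0).
    """
    bytes_per_way = cache_bytes // ways
    return max(0, log2_exact(bytes_per_way) - log2_exact(page_bytes))


def max_alias_free_size(page_bytes, ways):
    """Largest VIPT cache with no aliasing: page size x associativity."""
    return page_bytes * ways


if __name__ == "__main__":
    page = 4096  # assumed page size
    print(vipt_alias_bits(32 * 1024, 8, page))  # 4 KiB per way
    print(vipt_alias_bits(64 * 1024, 4, page))  # 16 KiB per way
```

A nonzero result is the number of virtual index bits that translation could change, which is what alias-handling hardware or added associativity has to absorb.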

⚡ Interview rapid-fire
Why is L1 commonly ≤ (page size × associativity)?
To stay VIPT without aliasing. If the index+offset bits all lie within the page offset, the virtual and physical index bits match, so you can index in parallel with TLB lookup and never have two virtual addresses for the same physical line landing in different sets. Growing L1 beyond that pushes index bits into the VPN, reintroducing the synonym/aliasing problem — so designers add associativity (more ways, same index width) or pay for alias handling instead.
VIPT, PIPT, VIVT — trade-offs?
PIPT: index & tag both physical — no aliasing, but you must translate before indexing (slow unless overlapped). VIVT: both virtual — fastest, but synonyms/homonyms and flushes on context switch make it painful. VIPT: index virtual (fast, parallel with TLB), tag physical (correct) — the sweet spot for L1, at the cost of the size cap above. Lower levels are typically PIPT.
How does virtual memory provide protection?
Each page-table entry carries protection bits (read/write/execute, user/supervisor), and only the OS can modify the page table. A process can't name physical memory directly or touch a page not mapped (or mapped without permission) in its address space — the hardware faults. That isolation is the foundation everything else (including the side-channel discussion next) builds on or tries to break.
Section 2.4

Side-Channel Attacks on the Memory System

Virtual memory enforces protection in the architecture — but the microarchitecture leaks. The same caches and timing tricks that make memory fast can be turned into a covert channel that reads across protection boundaries. This is now core architect knowledge, not a footnote.

How they work

Side-channel memory attacks perturb the memory system and observe the effect through timing — using high-resolution timers or hardware performance counters. The cache is the leak: whether an address is cached or not changes its access latency, and that latency difference encodes secret-dependent behavior. Canonical patterns include Prime+Probe and Flush+Reload, where an attacker arranges cache state, lets the victim run, and then times its own accesses to infer which lines the victim touched.

Speculation makes it far worse
The chapter notes that adding speculation and multithreading dramatically widens the bandwidth of side-channel attacks — the lineage that leads to Spectre/Meltdown-class attacks explored in the next chapter. Speculative execution can touch memory (and leave cache footprints) for instructions that architecturally should never have run, so secrets leak even though the committed state looks correct.

Mitigations and their cost

Mitigations reduce the probability of leakage but can't eliminate all side channels if any resource is shared: partitioning caches, constant-time code that avoids secret-dependent memory access patterns, restricting fine-grained timers, flushing/isolating predictors and TLBs across boundaries, and disabling or constraining speculation on sensitive paths. Each costs performance — the running theme is that security and performance trade against each other in the memory system.

⚡ Interview rapid-fire
Why can't protection bits stop a cache side channel?
Because the leak isn't an architectural read of protected data — it's an inference from timing. The attacker never reads the victim's bytes; it observes how the victim's execution changed shared microarchitectural state (which cache lines are resident), and times its own accesses to recover secret-dependent patterns. Protection governs architectural visibility; side channels exploit microarchitectural side effects that the ISA doesn't model.
Sketch Flush+Reload.
Attacker and victim share a read-only page (e.g., a library). The attacker clflushes a target line, lets the victim run, then times reloading that line: a fast reload means the victim accessed it (it's cached), a slow one means it didn't. Repeating over addresses reconstructs the victim's secret-dependent access trace — e.g., key-dependent table lookups in crypto.
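The logic of Flush+Reload can be shown with a toy cache model that only tracks which lines are resident. This is a simulation sketch, not attack code: there is no real timing, the latencies (4 and 200 cycles) are illustrative values borrowed from this guide's hierarchy numbers, and the line addresses and function names are hypothetical.

```python
HIT_CYCLES = 4      # illustrative latency for a cached line
MISS_CYCLES = 200   # illustrative latency for a line fetched from DRAM
THRESHOLD = 100     # attacker's cutoff between "fast" and "slow"

LINE_IF_ZERO = 0x1000  # shared line the victim touches when the bit is 0
LINE_IF_ONE = 0x1040   # shared line the victim touches when the bit is 1


class ToyCache:
    """Tracks residency only; capacity and eviction are ignored."""

    def __init__(self):
        self.resident = set()

    def access(self, line):
        latency = HIT_CYCLES if line in self.resident else MISS_CYCLES
        self.resident.add(line)
        return latency

    def flush(self, line):
        self.resident.discard(line)


def victim_step(cache, secret_bit):
    """Victim performs one secret-dependent memory access."""
    cache.access(LINE_IF_ONE if secret_bit else LINE_IF_ZERO)


def flush_reload_round(cache, secret_bit):
    """Flush, let the victim run, then time a reload of one shared line."""
    cache.flush(LINE_IF_ZERO)
    cache.flush(LINE_IF_ONE)
    victim_step(cache, secret_bit)
    reload_latency = cache.access(LINE_IF_ONE)
    return 1 if reload_latency < THRESHOLD else 0


def recover(secret_bits):
    cache = ToyCache()
    return [flush_reload_round(cache, bit) for bit in secret_bits]
```

The attacker never reads the victim's data; the inferred bits come entirely from whether the shared line was resident, which is why protection bits do not help.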
Section 2.6 · Putting It All Together

ARM Cortex-A53 vs Intel Core i9-12900

Two shipping memory hierarchies at opposite ends of the design space: a low-power embedded IP core and a high-end hybrid (big.LITTLE-style) desktop part. The contrast is the lesson: same principles, opposite priorities.

Dimension | ARM Cortex-A53 | Intel Core i9-12900
Role / market | Energy-efficient IP core for tablets & phones (PMD) | High-end desktop, Alder Lake
ISA / issue | ARMv8 (32 & 64-bit), 2-issue, up to ~1.3 GHz | x86-64, up to 4 instr/clock per P-core
Cores | Configurable; discussion is a single core | big.LITTLE: 8 P-cores + 8 E-cores; focus on one P-core
L1 | 8–64 KiB (32 KiB typical), 2-way, 64 B, VIPT, write-back/allocate, LRU-approx; critical-word-first | L1 I 32 KiB 8-way (4 cyc); L1 D 48 KiB 6-way (5 cyc), dual-ported, 8 banks; VIPT
L2 | Example 1 MiB; 2-level TLB; up to 4 memory banks | 1.25 MiB, 10-way, ~15-cycle latency (index = 2¹¹)
L3 / LLC | (none listed) | 30 MiB, 8 hashed banks (3.75 MiB/bank, 15-way), 12-bit index, ~50-cycle, non-inclusive
Main memory | 64–128-bit L2↔memory bus | DDR5-4800, 2 channels (HBM or DIMMs); miss penalty ≈ 200 cycles
Notable | Page-map cache cuts L2-TLB miss penalty; hardware alias detection for VIPT | Merging write buffer; LLC holds L2 evictions, so effective capacity ≈ L2 + L3

What to take from the numbers

  • Non-inclusive LLC (i9): because L3 mainly holds blocks ejected from L2, L2 and L3 store different blocks — total cached data ≈ L2 + L3, not just L3. A deliberate capacity win over strict inclusion.
  • Hashed L3 banking (i9): 3 bits of a hash select one of 8 banks, so only that bank activates — saving power (optimization 4) and spreading conflicts.
  • Penalty dominates rate (A53): the chapter measures L1 miss rates ~7× the L2 rate, but the L2 penalty is ~9.5× larger — so L2 misses slightly dominate the memory-stressing benchmarks. A concrete reminder that miss rate alone never tells the story; you must weight by penalty.
  • Local vs global, again: the A53's median L2 stand-alone miss rate is 15.1% but only 0.3% global — the §2.1 trap made real.
Interview framing
If asked to "design a memory hierarchy for X," anchor on these two as poles. Battery-bound PMD → small VIPT L1, modest L2, aggressive power-gating, no big LLC (A53-like). Throughput desktop/server → deep hierarchy, big banked non-inclusive LLC, multi-channel/HBM bandwidth, prefetchers (i9-like). State the priority (energy vs throughput) first; the structure follows from it.

The full numbers, side by side

The detailed hierarchies from the chapter's figures. Read them as two answers to the same problem under different budgets.

ARM Cortex-A53 — PMD, energy-first

Structure | Size | Organization | Penalty
Instr / Data µTLB | 10 entries each | fully associative | 2 cyc
L2 unified TLB | 512 entries | 4-way | 20 cyc
L1 I-cache | 8–64 KiB | 2-way, 64 B block | 13 cyc
L1 D-cache | 8–64 KiB | 2-way, 64 B block | 13 cyc
L2 unified | 128 KiB–2 MiB | 16-way, LRU-approx | 124 cyc

Features: critical-word-first, up to four memory banks, VIPT L1, write-back L1 D and L2 with write-allocate, approximate LRU.

Intel Core i9-12900 — desktop, throughput-first

Level | Size | Assoc. | Latency | Notes
L1 I | 32 KiB | 8-way | 4 cyc | per P-core
L1 D | 48 KiB | 6-way | 5 cyc | dual-ported, 8 banks
L2 | 1.25 MiB / core | 10-way | 15 cyc | private
L3 | 30 MiB shared | 15-way | 50 cyc | distributed, non-inclusive
DRAM | DDR4/DDR5, 2 ch | n/a | ~200 cyc miss | DDR5-4800 up to ~77 GB/s

On an L3 miss the block is filled into L2 and L1, not inserted into L3 — the LLC primarily holds blocks ejected from L2, so effective capacity ≈ L2 + L3.

Modern multicore die floorplan with private L1/L2 and distributed L3
Hardware · modern die floorplan (schematic). L1 and L2 are private and sit beside each core; the last-level cache is sliced/banked and distributed across the cores on a ring or mesh — which is exactly why the i9’s L3 is eight hashed banks, not one block. Conceptual, not to scale.

Design poles

Axis | A53-like (power/area) | i9-like (throughput)
Top priority | energy efficiency, configurability | bandwidth, latency hiding, parallelism
L1 | small 2-way VIPT, critical-word-first | small but dual-ported, 8-bank VIPT
Mid/LLC | modest unified L2, ≤4 banks | private L2 + large non-inclusive banked L3
Memory | narrow 64–128-bit bus | multi-channel, HBM-capable, prefetchers
How to talk about this in interviews
Don't recite the tables — lead with the priority, then derive the structure. "It's a PMD core, so energy is first-order: keep L1 small, 2-way, VIPT; critical-word-first to cut penalty cheaply; no big LLC." vs. "It's a throughput desktop part, so I spend area on a deep, banked, non-inclusive LLC and multi-channel bandwidth, and hide latency with nonblocking caches and prefetch." If asked to design for a new target (an inference accelerator, a phone SoC, a cloud server), state its dominant constraint first and let A53/i9 be your two anchors.
Section 2.6–2.7

Reading the Measurements

A miss rate is not a cost. The single most common analysis error — and a favorite interview trap — is comparing miss rates across levels without weighting each by its penalty. This section is how to read cache data like an architect.

Weight every miss by its penalty

Misses at L1, L2, and L3 cost wildly different amounts. The honest figure of merit is the weighted contribution, not the raw rate:

weighted cost (per 1000 instr) ≈ MPKI_level × miss-penalty_level

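MPKI (misses per thousand instructions) puts every level's misses on the same per-instruction basis; multiplying by that level's penalty gives stall cycles per thousand instructions. The same weighting applies to per-access rates: a 0.3% global L3 miss rate at a 200-cycle penalty (0.6 cycles per access) outweighs a 4% L1 rate at a 4-cycle penalty (0.16 cycles per access).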

Worked · the A53's lesson — local L1 misses aren't the story
L1 D miss rate: median 2.4% (range 0.5%–37.3%)
global L2 miss rate: median 0.3% (range 0.05%–9.0%)
L1 rate ≈ 7× the L2 rate … but L2 penalty ≈ 9.5× the L1 penalty
⇒ weighted: L2 misses slightly dominate the memory-stressing benchmarks

L1 has the higher miss rate, yet the lower level (L2) wins on cost because its penalty grows faster than its rate shrinks. Always multiply.
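The weighting is a one-line product, shown here with the A53 figures quoted in this guide (median L1 D miss rate 2.4%, median global L2 miss rate 0.3%, penalties of 13 and 124 cycles). Combining medians from two different distributions is only illustrative, and the function name is hypothetical.

```python
def weighted_cost(rate, penalty_cycles):
    """Stall contribution of one level: miss rate (or MPKI) x penalty.

    With a per-access miss rate the result is stall cycles per access;
    with MPKI it is stall cycles per thousand instructions.
    """
    return rate * penalty_cycles


# A53 medians and penalties as quoted in this guide.
l1_cost = weighted_cost(0.024, 13)    # L1 D misses served by L2
l2_cost = weighted_cost(0.003, 124)   # global L2 misses served by memory

if __name__ == "__main__":
    print(f"L1: {l1_cost:.3f} cycles/access, L2: {l2_cost:.3f} cycles/access")
```

On these inputs the L2 term comes out somewhat larger than the L1 term, consistent with "slightly dominate" above.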

Diagnose by program behavior

Signature | What it means | Where to look
High L1D MPKI, low L2/L3 | L1 locality / capacity issue; lower levels catch the working set | block size, associativity, blocking
Moderate L1D, high weighted L3 cost | memory-system bottleneck | prefetch accuracy, bandwidth, LLC design
Huge variance across programs | don't generalize from one workload | characterize a representative suite

Why cache-busters dominate averages

Miss behavior varies enormously by program — the A53's L1D miss rate spans 0.5% to 37.3%, a factor of 75. A single cache-buster like MCF sets the upper bound and drags the arithmetic mean with it. So a lone "miss rate" is nearly meaningless: report distributions, and be explicit that one outlier may be steering the average. This is also the chapter's first fallacy — predicting one program's cache behavior from another's.
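A small numeric sketch shows how one cache-buster steers the mean. Only the endpoints (0.5% and 37.3%) and the 2.4% median come from the figures above; the other per-program rates are made up for illustration.

```python
from statistics import mean, median

# Hypothetical per-program L1D miss rates in percent. The endpoints and
# the middle value match the quoted range and median; the rest are invented.
miss_rates = [0.5, 1.1, 1.8, 2.4, 2.9, 3.6, 37.3]

suite_mean = mean(miss_rates)
suite_median = median(miss_rates)
mean_without_outlier = mean(r for r in miss_rates if r < 37.3)

if __name__ == "__main__":
    print(f"mean={suite_mean:.2f}%  median={suite_median:.2f}%  "
          f"mean w/o outlier={mean_without_outlier:.2f}%")
```

In this made-up suite the single outlier roughly triples the mean while leaving the median untouched, which is the argument for reporting distributions.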

The four questions to ask of any cache number
(1) What's the denominator — local or global, per-instruction or per-access? (2) Demand or prefetch misses? (3) How much penalty is overlapped (OoO/MLP)? (4) One workload or a suite — is a cache-buster steering the mean? Asking these four out loud is itself a strong-candidate signal.
⚡ Interview rapid-fire
L1 has a higher miss rate than L2. Which matters more for performance?
Whichever has the larger weighted cost = MPKI × penalty. Lower levels have much larger penalties, so a smaller-rate L2/L3 miss often dominates stall cycles. The A53 is the textbook case: L1 rate ~7× higher, but L2 penalty ~9.5× higher, so L2 misses slightly dominate. Never compare bare rates across levels.
How would you quickly find a memory bottleneck on a real workload?
Convert to weighted cost: MPKI by level × penalty by level; add bandwidth utilization, prefetch accuracy, and demand-vs-prefetch split; then estimate overlap to isolate the dominant non-overlapped term. Optimize that term, name its AMAT category, and re-measure — don't chase the highest raw miss rate.
Why is reporting a single average miss rate dangerous?
Because per-program variance is enormous (here ~75×) and arithmetic means are dominated by outliers like MCF. The same cache can look great or terrible depending on the workload, so a single number hides the distribution and invites the fallacy of generalizing across programs. Report ranges/medians and flag cache-busters.
Section 2.7

Fallacies & Pitfalls

Memory hierarchy is the most quantitative subfield in architecture, yet it's riddled with traps. Each of these doubles as an interview "what's wrong with this reasoning?" prompt.

Fallacy — Predicting one program's cache behavior from another's

Miss rates vary enormously by workload. The chapter shows three SPEC programs whose misses-per-1000-instructions for the same large cache differ by huge factors (e.g., 9 vs 2 vs ~90). A cache tuned to benchmark A can be terrible for benchmark B. Lesson: never quote a single miss rate as "the" miss rate — characterize across representative workloads, and beware means dominated by one cache-buster (like MCF).

Pitfall — Not simulating enough instructions for accurate memory measurements

Three nested traps: (1) predicting a large cache's behavior from a short trace (the trace never fills the cache); (2) assuming locality is constant over a run, when it actually shifts from phase to phase, so a snippet misrepresents the whole; and (3) assuming locality is independent of the input, when a different input can change it substantially. Lesson: warm the cache and simulate long, phase-representative traces, or your numbers are fiction.

Pitfall — Not delivering high memory bandwidth in a cache-based system

Caches improve average latency but don't guarantee bandwidth to an application that must keep going to main memory (streaming, large sparse, ML). You can have great hit times and still starve a bandwidth-bound kernel. Lesson: design for bandwidth explicitly — channels, banks, HBM — when the workload blows past the cache.

Pitfall — A memory technology that "fits between" two others but wins at neither

The PCM cautionary tale: a technology slotted between DRAM and Flash that offered no decisive advantage in speed or price over either neighbor, and so failed to gain momentum. Lesson: a new tier must dominate an existing one on an axis that matters, or the ecosystem routes around it.

The meta-lesson
Almost every pitfall here is a measurement or positioning error, not a logic error. In interviews, when handed a memory-system claim, first ask: What workload? Measured how long? Latency or bandwidth? Local or global rate? Those four questions defuse most of this section.
Exam & interview prep

Interview Rapid-Fire

A consolidated drill set for senior memory-systems loops. Cover the answer, say yours aloud, then check. If you can do all of these crisply, you own Chapter 2.

The numbers & formulas to have cold

Thing | Have-it-cold version
AMAT | Hit time + Miss rate × Miss penalty
Multilevel AMAT | T₁ + MR₁(T₂ + MR₂(T₃ + MR₃·T_mem))
CPI from memory | CPI = base + (refs/instr)·MR·penalty
Global miss rate | MR_global(Ln) = MR₁·MR₂·…·MR_n (product of local miss rates)
The 3 C's | Compulsory (cold), Capacity (too small), Conflict (bad mapping)
5 optimization buckets | ↓hit time/power · ↑bandwidth · ↓miss penalty · ↓miss rate · hide via parallelism (prefetch)
VIPT L1 size cap | ≲ page size × associativity
TLB miss vs page fault | tens of cycles (walk) vs millions (disk, OS exception)
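Two of the formulas in the table are quick to turn into code for drilling. This is a minimal sketch with illustrative inputs (not measured data); the function names are hypothetical.

```python
from math import prod


def global_miss_rate(local_miss_rates):
    """Global miss rate at level n = product of local rates of levels 1..n."""
    return prod(local_miss_rates)


def cpi_with_memory(base_cpi, refs_per_instr, miss_rate, penalty_cycles):
    """CPI = base + (memory references per instruction) x miss rate x penalty."""
    return base_cpi + refs_per_instr * miss_rate * penalty_cycles


if __name__ == "__main__":
    # Illustrative inputs: 4% local L1 and 25% local L2 miss rates.
    print(global_miss_rate([0.04, 0.25]))
    # Illustrative inputs: base CPI 1.0, 1.3 refs/instr, 2% misses, 100 cycles.
    print(cpi_with_memory(1.0, 1.3, 0.02, 100))
```

The local-versus-global distinction is visible in the first call: a 25% local L2 rate is only 1% of all references once it is scaled by the L1 rate.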

Conceptual drills

⚡ Drill
"Make this kernel faster" — give your memory-system checklist.
(1) Is it latency- or bandwidth-bound? (2) Which C dominates the misses — cold (prefetch/bigger blocks), capacity (blocking/bigger cache), or conflict (associativity/padding/indexing)? (3) Is there memory-level parallelism to exploit (nonblocking cache, more MSHRs, prefetch)? (4) Can the compiler improve locality (loop interchange, tiling)? (5) At the system level, more channels/HBM? Always name the AMAT term you're moving and its cost.
Why is "latency" the hard problem and "bandwidth" the easy one?
Bandwidth scales by adding parallel structure (banks, channels, stacked HBM dies), which is an engineering/cost lever. Latency is bounded by physics: array sense times, wire delay, and the speed of charge movement, none of which has improved much since ~2010. So architecture hides latency (prefetch, nonblocking, OoO, multithreading) far more than it removes it.
Design an L1 for a phone vs a server — what changes and why?
Phone (A53-like): small VIPT L1, modest associativity, energy-gated, LRU-approx, critical-word-first — optimize energy/access and keep hit time low. Server/desktop (i9-like): similar small fast L1 (hit time still bounds clock) but backed by a deep, banked, non-inclusive LLC and multi-channel/HBM bandwidth with aggressive prefetch — optimize throughput. The L1 barely changes; what changes is everything beneath it, driven by the energy-vs-throughput priority.
Explain why non-inclusion can beat inclusion.
Strict inclusion wastes LLC capacity duplicating everything in L2; with a non-inclusive LLC (i9), L2 and L3 hold different blocks, so effective capacity ≈ L2 + L3. The cost is more complex coherence (the LLC can't act as a single snoop filter for everything above it). It's a capacity-vs-complexity trade that high-end parts increasingly take.
Where does security enter the memory hierarchy?
Protection is architectural (page tables, privilege), but performance features create microarchitectural side channels — caches, TLBs, predictors — that leak via timing, amplified by speculation/multithreading (Spectre/Meltdown lineage). Every mitigation (partitioning, constant-time code, limiting timers, taming speculation) costs performance. Modern architects must reason about the security/performance trade as a first-class concern.
Final exam-day mantra
For any memory-system question: name the AMAT term, the C, latency vs bandwidth, and local vs global. Those four axes structure almost every correct answer in this chapter.
Appendix

Source Notes & Corrections

Where the key numbers come from, and how conflicts were resolved. Rule used throughout: when the uploaded review deck and the textbook disagree, prefer the chapter, then note the correction.

Key figures and their source

Figure used here | Source
Memory-hierarchy levels, locality, inclusion, AMAT identity | CAQA 7e §2.1, Fig. 2.1
§2.1 bandwidth example: demand ≈ 3840 GiB/s vs supply ≈ 56 GiB/s; i9 case-study DDR5-4800 peak ≈ 77 GB/s | CAQA 7e §2.1–2.2, §2.6
DDR1–DDR5 transfer rates, ~39 ns flat row-miss latency | CAQA 7e §2.2, Figs. 2.4–2.5
HBM: DDR5-4800 38.4 GiB/s, 4 stacks ≈ 4 TB/s; Loh–Hill, Alloy cache | CAQA 7e §2.2–2.3, Opt. 10, Fig. 2.16
Dependability: parity ~17 min, ECC ~7.5 h, Chipkill ~2 months (10,000-proc) | CAQA 7e §2.2
Ten optimizations + cited results (NRU ~1%; 2-bit predictor ~5–7%; hit-under-miss ~9%/12.5%; Skylake-SP L2 streamer ~70%, +19% L3; compiler 251→19 misses) | CAQA 7e §2.3, Opts. 1–10, Fig. 2.17
VM/TLB, VIPT constraint, side-channel Prime/Probe and mitigations | CAQA 7e §2.4, §2.6
A53 hierarchy (µTLB 10 / L2 TLB 512 / L1 8–64 KiB 2-way / L2 16-way; 2/20/13/124-cyc penalties) | CAQA 7e §2.6, Figs. 2.18–2.19
A53 measured: L1D median 2.4% (0.5–37.3%); global L2 median 0.3%; L2 penalty ≈ 9.5× L1 | CAQA 7e §2.6, Figs. 2.20–2.21
i9-12900: L1I 32 KiB 8-way; L1D 48 KiB 6-way dual-port 8-bank; L2 1.25 MiB 10-way; L3 30 MiB 15-way non-inclusive; ~200-cyc miss | CAQA 7e §2.6, Figs. 2.23–2.24
Fallacies & pitfalls (program-to-program prediction; trace length; bandwidth; in-between tech / PCM) | CAQA 7e §2.7

Corrections applied

  • i9 L1 size. An earlier draft of this guide listed the i9 L1 generically as "32 KiB, ~4-cycle." Corrected to the chapter/deck figures: L1 I 32 KiB 8-way (4 cyc) and L1 D 48 KiB 6-way (5 cyc), dual-ported, 8 banks. (The 32 KiB 4-cycle figure is the i7 example used elsewhere in the chapter, not the i9-12900 case study.)
  • Two bandwidth figures, two contexts. The chapter cites ~56 GiB/s in the §2.1 multicore-demand example and ~77 GB/s for the i9-12900 DDR5-4800 case study — not a conflict, two different numbers. The §2.1 bandwidth-wall example here uses 56 GiB/s; the i9 case study keeps 77 GB/s.
  • Case-study processors. This guide uses the 7th-edition "Putting It All Together" pairing — ARM Cortex-A53 and Intel Core i9-12900 — not the 6th-edition Cortex-A8 / Core i7.
  • Replacement naming. Both sources describe shipping "LRU" as an approximation (NRU / tree-PLRU / reuse predictors), not exact LRU; the guide says so throughout.
  • Derived drill numbers. The single-level AMAT A/B drill and the CPI example use the chapter's formulas with illustrative inputs (so labeled), not measured device data.
Visualize · export-ready diagrams

The whole chapter in six pictures

Static, screenshot-and-paste-into-slides diagrams for the concepts that are easiest to grasp visually. Every one maps to a live simulator or worked example in Study mode — use the jump buttons. Each concept on this page is covered by a diagram, a simulator, a table, or a worked number; nothing is left as prose-only.

Diagram 1 · the pyramid

Memory hierarchy & the latency gradient

Registers · ~1 cyc
L1 · ~4 cyc · 32–48 KiB
L2 · ~15 cyc · 1.25 MiB
L3 / LLC · ~50 cyc · 2–30 MiB
DRAM · ~200 cyc · GiBs
Flash / Disk · 10⁴–10⁶ cyc · TiBs
Legend: cool = fast, warm = slow

Each level is larger, slower, and cheaper per byte than the one above. Temperature encodes latency.

Diagram 2 · AMAT dependency tree

How AMAT composes down the hierarchy

AMAT: T₁ + MR₁ · (…)
  L2 path: T₂ + MR₂ · (…)
    L3 path: T₃ + MR₃ · T_mem
      DRAM: T_mem ≈ 200
AMAT = T₁ + MR₁·( T₂ + MR₂·( T₃ + MR₃·T_mem ) )
worked: 1 + 0.04·(10 + 0.25·(36 + 0.40·200)) = 1 + 0.04·(10 + 0.25·116) = 1 + 0.04·39 = 2.56 cycles

MR₁ multiplies everything beneath it — why L1 is the highest-leverage point in the hierarchy.

Diagram 3 · the three C's

Classifying any miss

A miss occurs.
1. First-ever reference? yes → Compulsory; no → go to 2.
2. Would it also miss in a fully-associative cache of the same size? yes → Capacity; no (would hit) → Conflict.

The exact test the simulator runs against a fully-associative reference cache.

Diagram 4 · VIPT critical path

Overlapping translation with cache indexing

Virtual address = virtual page number (VPN) + page offset
In parallel: VPN → TLB → physical tag (PFN); page-offset bits → index L1
Then: compare physical tag = hit?
CONSTRAINT: index + offset must fit within the page offset → L1 size ≲ page size × associativity, else aliasing.

Virtually-indexed, physically-tagged: index while you translate, check the physical tag at the end.

Diagram 5 · case study side-by-side

ARM Cortex-A53 vs Intel Core i9-12900 — two hierarchies, one principle

ARM Cortex-A53 · PMD, energy-first
  L1 I/D · 8–64 KiB · 2-way
  L2 · 128 KiB–2 MiB · 16-way
  DRAM · 64–128-bit bus
  VIPT L1 · critical-word-first · ≤4 banks · LRU-approx · write-back
  small, simple, low-energy

Intel Core i9-12900 · throughput-first
  L1 I 32K·8w / D 48K·6w
  L2 · 1.25 MiB/core · 10-way
  L3 · 30 MiB · 15-way · noninclusive
  DDR4/5 · 2 ch · ~200 cyc
  dual-ported 8-bank L1D · hashed L3 banks · prefetch
  deep, banked, bandwidth-rich

Same AMAT principles; opposite priorities. The structure follows from energy-vs-throughput.

Diagram 6 · optimization → metric matrix

Which AMAT term does each of the ten optimizations move?

Columns: Hit time | Miss rate | Miss penalty | Bandwidth | Power
Rows (optimizations):
1 · Pipelined VIPT L1
2 · Banks & ports (L1)
3 · Better replacement
4 · Multibanked L2/L3
5 · Nonblocking caches
6 · Critical word first
7 · Compiler locality
8 · Hardware prefetch
9 · Compiler prefetch
10 · More channels / HBM

Legend: primary target, secondary effect, · minimal. Full per-optimization detail in Study → The Ten Optimizations.

Interview · rehearsal

Interview Drill

Answer out loud, then reveal. Every question ships with a rubric — what a strong senior-architect answer must mention — because in real loops the difference is naming the mechanism, the tradeoff, and the workload assumption, not just the headline. Filter by topic or shuffle the whole bank.

The four traps interviewers set

① Local vs global miss rate · mr_local ≠ mr_global

A 40% local L3 miss rate can be <0.5% global. AMAT/CPI care about the global rate (= this level's local rate times the local rates of every level above it). The A53's median L2 is 15.1% local but 0.3% global. Always ask "local or global?"

② TLB miss vs page fault · 10s vs 10⁶ cycles

A TLB miss is a translation cache miss — tens of cycles to walk the page table. A page fault means the page isn't resident — an OS exception fetching from storage, millions of cycles. Different mechanisms, different orders of magnitude.

③ Bandwidth vs latency · structural vs physical

Bandwidth scales structurally (banks, channels, HBM); latency is bounded by physics and barely moves. Most "advanced optimizations" hide or buy around latency rather than reducing it. Name which one your fix touches.

④ Miss rate vs weighted penalty · rate × penalty

A tiny L3 miss rate can dominate stall time if the penalty is hundreds of (non-overlapped) cycles. Weight every miss by its penalty (≈ MPKI × penalty) before declaring a bottleneck.

Drill the question bank

Tip: the full per-topic Q&A also lives inline in Study mode at the end of each section. This bank consolidates them with rubrics and traps for rapid rehearsal.

Chapter 2 · CAQA 7e · review deck

Memory Hierarchy Design

A computer-architect interview review deck — quantitative reasoning, design tradeoffs, Q&A.

answer shape: metric → mechanism → tradeoff → workload
  • Memorize: AMAT/CPI, local vs global miss rate, bandwidth arithmetic, the ten optimizations.
  • Practice: explain why a choice helps one term while hurting another (hit time / miss rate / miss penalty / bandwidth / power).
  • Challenge every technique: workload assumption, hardware cost, power cost, failure mode.
2.1 · concept

The hierarchy exists because locality is exploitable

Smaller memories are faster and costlier per byte; larger are slower and cheaper.

  • Temporal locality: recently used items are reused soon → keep them (caches + replacement).
  • Spatial locality: nearby items used soon → fetch whole blocks + prefetch.
  • Inclusion: a lower level often holds a superset of the level above it, but modern LLCs may be non-inclusive.
Goal: approach the latency of the fastest level at the cost-per-byte of the cheapest.
2.1 · concept

The gap is bandwidth as much as latency

The single-core latency gap has stopped growing as quickly, but aggregate multicore demand overwhelms DRAM bandwidth.

8 cores × 3 GHz × 5 refs × 16 B ≈ 3840 GiB/s peak demand
two-channel DDR5 (§2.1 example) ≈ 56 GiB/s supply
  • Caches aren't only latency filters — multiported, pipelined, banked caches provide internal bandwidth DRAM can't.
Bandwidth is structural (ports, banks, channels, HBM); latency is bounded by physics.
2.1 · concept

AMAT is the accounting identity

Isolate which term a design is trying to improve.

AMAT = hit time + miss rate × miss penalty
misses/instr = miss rate × mem-accesses/instr
CPI_mem = (MPKI / 1000) × miss-penalty cycles
Caveat: on OoO / multithreaded cores the effective penalty is only the non-overlapped stall time.
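
The three identities above can be sketched in a few lines of Python. This is a minimal illustration, not chapter data: the function names and the sample inputs (1-cycle hit, 4% miss rate, 40-cycle penalty, 1.5 memory accesses per instruction) are assumptions chosen only to make the arithmetic easy to follow.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def misses_per_instr(miss_rate, accesses_per_instr):
    """Misses per instruction = miss rate x memory accesses per instruction."""
    return miss_rate * accesses_per_instr


def cpi_mem(mpki, penalty_cycles):
    """Memory-stall CPI = (misses per 1000 instructions / 1000) x penalty."""
    return (mpki / 1000.0) * penalty_cycles


# Illustrative inputs (assumed, not measured).
a = amat(1.0, 0.04, 40.0)
mpi = misses_per_instr(0.04, 1.5)
stall = cpi_mem(mpi * 1000.0, 40.0)

print(f"AMAT = {a:.2f} cycles")
print(f"misses/instr = {mpi:.3f} (MPKI = {mpi * 1000:.0f})")
print(f"CPI_mem = {stall:.2f}")
```

With these inputs the stall CPI (2.4) exceeds the AMAT (2.6 cycles) per access only because each instruction makes more than one access. On a core that overlaps misses, pass the non-overlapped stall as `penalty_cycles` rather than the raw latency.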
2.1 · worked drill

Single-level AMAT: a small hit-time increase can win

Design | Hit time | Miss rate | Penalty | AMAT
A · direct-mapped | 1.0 ns | 5% | 50 ns | 1.0 + .05·50 = 3.5 ns
B · 2-way | 1.2 ns | 3% | 50 ns | 1.2 + .03·50 = 2.7 ns

Mechanism: associativity cuts conflict misses. Cost: higher hit time + energy. Workload: only wins if conflicts matter.

B wins by 0.8 ns — but state mechanism, tradeoff, and workload dependence to score the answer.
2.1 · worked drill

Multilevel AMAT + local vs global

AMAT = HT₁ + MR₁ · ( HT₂ + MR₂,local · MP )

L1: HT 1, MR 4% · L2: HT 10, MR_local 25% · penalty after L2 = 120

AMAT = 1 + .04·(10 + .25·120) = 2.6 cycles
L2 global miss rate = .04 × .25 = 1%
Trap: never quote a local miss rate as if it were global. CPI cares about global.
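
The multilevel formula generalizes to any depth by folding the hierarchy from the last cache level upward. The sketch below is one way to write that fold; the helper names are hypothetical. The two-level inputs are the drill values above, and the three-level inputs are the ones from the AMAT dependency-tree diagram.

```python
def multilevel_amat(levels, memory_latency):
    """levels: list of (hit_time, local_miss_rate) tuples, L1 first.
    Folds from the last cache level upward: each level's AMAT becomes
    the miss penalty of the level above it."""
    penalty = memory_latency
    for hit_time, local_mr in reversed(levels):
        penalty = hit_time + local_mr * penalty
    return penalty


def global_miss_rates(levels):
    """Global miss rate of each level = running product of local rates."""
    rates, running = [], 1.0
    for _, local_mr in levels:
        running *= local_mr
        rates.append(running)
    return rates


two_level = [(1, 0.04), (10, 0.25)]
three_level = [(1, 0.04), (10, 0.25), (36, 0.40)]

print(round(multilevel_amat(two_level, 120), 2))            # 2.6
print(round(multilevel_amat(three_level, 200), 2))          # 2.56
print([round(r, 4) for r in global_miss_rates(three_level)])  # [0.04, 0.01, 0.004]
```

The last line shows the trap numerically: a 40% local L3 miss rate is a 0.4% global miss rate.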
2.1 · concept

Miss taxonomy: the three C's (+ coherence)

  • Compulsory — first access; exists with an infinite cache. Reduce via larger blocks / prefetch.
  • Capacity — working set > cache. Reduce via bigger cache / blocking.
  • Conflict — blocks collide in a set. Reduce via associativity / victim cache / hashing / coloring.
  • Coherence — invalidation misses in multiprocessors (deferred to Ch.5).
Test: capacity = also misses in a same-size fully-associative cache; conflict = would have hit there.
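
The classification test can be sketched as a small trace-driven model. This is an illustrative version under stated assumptions, not the guide's simulator: addresses are already block numbers, both caches use LRU, the set index is `block % num_sets`, and coherence misses are ignored. The function name is hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_addrs, num_blocks, ways=1):
    """Classify each miss of a set-associative LRU cache as compulsory,
    capacity, or conflict, using a same-size fully-associative LRU
    cache as the reference."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # cache under test
    reference = OrderedDict()                          # fully-associative
    seen = set()                                       # ever-referenced blocks
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_addrs:
        # Update the fully-associative reference cache first.
        ref_hit = block in reference
        if ref_hit:
            reference.move_to_end(block)
        else:
            if len(reference) >= num_blocks:
                reference.popitem(last=False)          # evict LRU block
            reference[block] = True

        # Then look the block up in the cache under test.
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)
            counts["hit"] += 1
        else:
            if block not in seen:
                counts["compulsory"] += 1              # first-ever reference
            elif ref_hit:
                counts["conflict"] += 1                # fully-assoc would hit
            else:
                counts["capacity"] += 1                # misses there too
            if len(s) >= ways:
                s.popitem(last=False)
            s[block] = True
        seen.add(block)
    return counts


# Blocks 0 and 4 collide in set 0 of a 4-block direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_blocks=4))
# Five distinct blocks overflow a 4-block cache of any associativity.
print(classify_misses([0, 1, 2, 3, 4, 0], num_blocks=4))
```

The first trace yields two compulsory and two conflict misses; rerunning it with `ways=4` turns the conflict misses into hits. The second trace yields five compulsory misses and one capacity miss.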
2.1 · concept

Write policy = traffic + consistency choice (two axes)

Policy | On a write hit | Pairs with
Write-through | update cache + next level | often no-write-allocate
Write-back | update cache, set dirty bit | often write-allocate

Write buffers decouple store latency; write merging combines stores (unsafe for memory-mapped I/O registers).

Trap: write-back/through and (no-)write-allocate are independent axes — name both.
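
To see the traffic side of the choice, the sketch below counts writes sent to the next level under the two common pairings. It is a simplified model under assumptions not taken from the chapter: a fully-associative LRU cache, block-granularity stores, no write buffer, and a final flush so both policies leave memory up to date. The function name is hypothetical.

```python
from collections import OrderedDict


def next_level_writes(ops, num_blocks, policy):
    """Count writes sent to the next level.

    ops: list of ("r" | "w", block) pairs.
    policy: "write-through" (paired here with no-write-allocate) or
            "write-back" (paired here with write-allocate).
    """
    cache = OrderedDict()          # block -> dirty flag, kept in LRU order
    writes = 0

    def install(block, dirty):
        nonlocal writes
        if len(cache) >= num_blocks:
            _, was_dirty = cache.popitem(last=False)   # evict LRU block
            if was_dirty:
                writes += 1                            # write back dirty victim
        cache[block] = dirty

    for op, block in ops:
        hit = block in cache
        if hit:
            cache.move_to_end(block)
        if op == "r":
            if not hit:
                install(block, dirty=False)
        elif policy == "write-through":
            writes += 1            # every store also goes to the next level
            # no-write-allocate: a store miss does not install the block
        else:                      # write-back
            if hit:
                cache[block] = True                    # set the dirty bit
            else:
                install(block, dirty=True)

    # Final flush of any remaining dirty blocks.
    writes += sum(1 for dirty in cache.values() if dirty)
    return writes


hot = [("w", 0)] * 10                       # ten stores to one block
stream = [("w", b) for b in range(10)]      # ten stores to ten blocks
for name, ops in [("hot", hot), ("stream", stream)]:
    wt = next_level_writes(ops, 4, "write-through")
    wb = next_level_writes(ops, 4, "write-back")
    print(f"{name}: write-through={wt}, write-back={wb}")
```

Write-back collapses the ten hot stores into one write, but saves nothing on the streaming pattern, so the traffic advantage depends on store locality.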
2.2 · concept

DRAM: row buffers turn locality into bandwidth

activate row → row buffer → column read/write → precharge

Std | Transfers | DIMM BW | Row-miss
DDR3 | 1600 MT/s | 12.8 GB/s | ~39 ns
DDR4 | 2666 MT/s | 21.3 GB/s | ~39 ns
DDR5 | 4800 MT/s | 38.4 GB/s | ~39 ns
From DDR1 to DDR5, DIMM bandwidth rose ~12× while row-miss latency stayed flat (~39 ns). Banks/channels add parallelism.
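
The DIMM bandwidth column is plain arithmetic: transfers per second times bytes per transfer. A minimal sketch, assuming the conventional 64-bit (8-byte) DIMM data bus and decimal GB; the function name is hypothetical.

```python
def dimm_bandwidth_gb_s(mega_transfers_per_s, bus_bytes=8):
    """Peak DIMM bandwidth in GB/s (decimal): transfers/s x bytes/transfer.
    bus_bytes=8 assumes a 64-bit DIMM data bus."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


for name, mts in [("DDR3-1600", 1600), ("DDR4-2666", 2666), ("DDR5-4800", 4800)]:
    print(f"{name}: {dimm_bandwidth_gb_s(mts):.1f} GB/s")

# Two channels of DDR5-4800, as in the i9 case study.
print(f"2 x DDR5-4800: {2 * dimm_bandwidth_gb_s(4800):.1f} GB/s")
```

The last line gives 76.8 GB/s, which is where the case study's ≈ 77 GB/s peak comes from. These are peak figures; sustained bandwidth depends on row-buffer hit rate and bank parallelism.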
2.2 · concept

HBM attacks bandwidth and package distance

DDR5-4800 DIMM ≈ 38.4 GB/s · i9 2-channel ≈ 77 GB/s · 4 HBM stacks ≈ 4 TB/s
  • 2.5D: stacks beside CPU/GPU on an interposer (mature for GPUs/accelerators).
  • 3D SRAM L3 stacking is practical; DRAM on hot logic is hard (heat).
Not magic memory — raises bandwidth + cuts traversal cost, but capacity, cost, placement, coherence remain.
2.2 · concept

Dependability scales with device count

10,000-processor server | Unrecoverable/undetected failure
Parity only (detect) | ~1 every 17 min
ECC (SECDED) | ~1 every 7.5 hours
Chipkill (survive whole-chip loss) | ~1 every 2 months
At scale, single-bit ECC isn't enough — chip-level faults become frequent, so Chipkill is required.
2.3 · concept

Ten advanced optimizations → five metrics

  • Hit time / power: pipelined VIPT L1, way prediction, small/simple caches.
  • Bandwidth: L1 banks/ports, multibanked L2/L3, nonblocking caches.
  • Miss penalty: critical word first, early restart, write-buffer merging.
  • Miss rate: replacement, associativity, compiler transforms.
  • Miss rate or miss penalty via parallelism: HW + compiler prefetching, HBM as LLC.
Name the metric before the technique. Complexity rises down the list; no knob is free.
2.3 · cheat sheet

Optimization summary: what improves, what pays

Technique | Main metric | Cost / risk
Pipelined VIPT L1 | hit time / clock | load-use & branch penalty; alias limits
Banks & ports | bandwidth | bank conflicts; port area/energy
Replacement | miss rate | metadata + update complexity
Nonblocking + MSHRs | penalty / BW | ordering, deadlock, verification
Critical word / restart | miss penalty | helps mostly with large blocks
Compiler locality | miss rate | loop structure & alias limits
Prefetching | rate or penalty | bandwidth, pollution, timeliness
HBM / channels | bandwidth / penalty | packaging, tags, capacity, placement
2.3 · concept

Nonblocking caches: the metric is non-overlapped stall

  • Hit-under-miss and miss-under-miss overlap latency with useful work.
  • MSHRs track each in-flight miss (destination, tag, requesting load/store); misses may return out of order.
  • Cited (Li et al., i7): one hit-under-miss cuts cache latency ~9% SPECINT2006, ~12.5% SPECFP2006.
Effective penalty ≈ raw latency ÷ memory-level parallelism — far below the single-miss number.
2.3 · concept

Prefetching shifts misses earlier — if timely & accurate

  • Hardware: dynamic, no ISA burden; bad prefetches waste BW + evict useful lines.
  • Compiler: knows loop structure; instruction overhead, weak on irregular pointers.
  • Cited: Skylake-SP L2 streamer ≈ 70% of CPI gain on memory-intensive SPEC CPU2017 (L3 traffic +19%). Compiler example: 251→19 misses; 27,200→4,400 cycles.
Accuracy and timeliness both matter — prefetch is a bandwidth-for-latency trade.
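
The accuracy half of that trade can be shown with a toy next-line prefetcher. This is a deliberately simplified sketch, not a model of any shipping prefetcher: the cache is unbounded (so pollution never evicts anything), prefetches arrive instantly (so timeliness is ignored), and a prefetch is issued only on a demand miss. The function name and access patterns are assumptions.

```python
def demand_misses(blocks, prefetch_next=False):
    """Return (demand misses, prefetches issued) for a block-address stream.
    The cache is unbounded, so only prefetching changes the miss count."""
    cached = set()
    misses = prefetches = 0
    for block in blocks:
        if block not in cached:
            misses += 1
            cached.add(block)
            if prefetch_next and block + 1 not in cached:
                cached.add(block + 1)       # next-line prefetch on a miss
                prefetches += 1
    return misses, prefetches


sequential = list(range(64))                # unit-stride stream
strided = list(range(0, 640, 10))           # stride of 10 blocks

print(demand_misses(sequential))                        # (64, 0)
print(demand_misses(sequential, prefetch_next=True))    # (32, 32)
print(demand_misses(strided, prefetch_next=True))       # (64, 64)
```

On the sequential stream every prefetch is used and demand misses halve. On the strided stream all 64 prefetches are wasted bandwidth and no miss is removed, which is the case a real cache would also pay for in evicted lines.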
2.4 · concept

Virtual memory, TLBs & the VIPT constraint

virtual addr → TLB / page-table → physical addr → cache

  • TLB = cache of translations; avoids a page-table access per reference.
  • VIPT: index L1 with page-offset bits while the TLB translates; compare physical tag after.
index+offset ⊆ page offset ⇒ L1 ≲ page size × associativity
Trap: TLB miss (10s of cycles) ≠ page fault (millions — OS fetch from storage).
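
The VIPT constraint reduces to comparing the bytes covered by one way (index + block offset) with the page size. A minimal sketch, assuming 4 KiB pages and power-of-two sizes; the function name is hypothetical and the configurations are examples, not a claim about how any particular core resolves aliases.

```python
def vipt_alias_bits(cache_bytes, ways, page_bytes=4096):
    """Number of index bits that fall above the page offset.
    0 means the cache can be indexed with untranslated bits only.
    Assumes power-of-two sizes; page_bytes=4096 is an assumption."""
    way_bytes = cache_bytes // ways          # span of index + block offset
    extra = 0
    while way_bytes > page_bytes:
        way_bytes //= 2
        extra += 1
    return extra


KiB = 1024
print(vipt_alias_bits(32 * KiB, 8))   # 0: 4 KiB per way fits a 4 KiB page
print(vipt_alias_bits(8 * KiB, 2))    # 0: 4 KiB per way
print(vipt_alias_bits(64 * KiB, 2))   # 3: 32 KiB per way
```

A nonzero result means that many index bits come from the virtual page number, so the design needs larger pages, more ways, or an explicit alias-handling mechanism.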
2.4 · concept

Side channels exploit shared hardware, not VM holes

prime → victim runs → probe (time the cache)

  • Protection stops direct access, not timing leakage through shared structures.
  • Speculation + multithreading widen the channel (Spectre/Meltdown lineage, Ch.3).
  • Mitigations (randomize, partition, flush-on-switch) all cost performance.
Security and performance trade against each other in the memory system.
2.6 · case study

ARM Cortex-A53 — energy-efficient, configurable

Structure | Size | Org | Penalty
µTLB I/D | 10 each | fully assoc | 2 cyc
L2 TLB | 512 | 4-way | 20 cyc
L1 I/D | 8–64 KiB | 2-way, 64 B | 13 cyc
L2 unified | 128 KiB–2 MiB | 16-way LRU-approx | 124 cyc

VIPT L1, critical-word-first, ≤4 banks, write-back + write-allocate.

Design for energy and configurability.
2.6 · lesson

A53 lesson: local L1 misses aren't the whole story

Metric | Range | Median
L1 D miss rate | 0.5–37.3% | 2.4%
Global L2 miss rate | 0.05–9.0% | 0.3%

L1 miss rates are ~7× higher, but the L2 penalty is ~9.5× higher.

So L2 misses slightly dominate the memory-stressing benchmarks. Weight by penalty.
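
Weighting by penalty is a one-line calculation. The sketch below uses the median miss rates from this card and the 13-cycle and 124-cycle A53 penalties listed in the case study. Because the two medians are taken across benchmarks rather than from one program, the result is indicative only; the function name is hypothetical.

```python
def stall_per_access(miss_rate, penalty_cycles):
    """Average stall cycles per data access contributed by one miss class."""
    return miss_rate * penalty_cycles


l1_cost = stall_per_access(0.024, 13)    # median L1 D miss rate x L1 penalty
l2_cost = stall_per_access(0.003, 124)   # median global L2 miss rate x L2 penalty

print(f"L1 misses: {l1_cost:.3f} cycles/access")
print(f"L2 misses: {l2_cost:.3f} cycles/access")
print(f"L2 / L1 cost ratio: {l2_cost / l1_cost:.2f}")
```

The L2 term comes out about 1.2 times the L1 term, consistent with L2 misses slightly dominating even though they are far rarer.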
2.6 · case study

Intel Core i9-12900 — aggressive hierarchy machinery

Level | Size | Assoc | Lat (cyc) | Notes
L1 I | 32 KiB | 8-way | 4 | per P-core
L1 D | 48 KiB | 6-way | 5 | dual-port, 8 banks
L2 | 1.25 MiB | 10-way | 15 | private
L3 | 30 MiB | 15-way | 50 | shared, noninclusive

On an L3 miss the block fills L2+L1, not L3 — the LLC mostly holds blocks ejected from L2.

Design for bandwidth, latency hiding, and parallelism.
2.6 · synthesis

Design poles: same principles, different constraints

Aspect | A53-like | i9-like
Priority | power / area | throughput
L1 | small VIPT, 2-way | small fast, dual-port banked
LLC | modest unified L2 | large non-inclusive L3, banked
Memory | narrow bus | multi-channel / HBM, prefetch
State the priority (energy vs throughput) first — the structure follows from it.
2.6–2.7 · concept

Benchmark interpretation: weight misses by penalty

weighted cost ≈ MPKI_level × miss-penalty_level
  • High L1D MPKI, low L2/L3 → L1 locality issue; lower levels catch the working set.
  • Moderate L1D, high L3 weighted cost → memory-system bottleneck (prefetch / BW / LLC).
  • Big benchmark variance → never generalize from one workload; cache-busters dominate means.
Always ask: what's the denominator, demand vs prefetch misses, and how much penalty is overlapped?
2.7 · fallacies

Fallacies & pitfalls — where candidates overgeneralize

  • Fallacy: predicting one program's cache behavior from another. Miss behavior varies wildly.
  • Pitfall: simulating too few instructions, or using too small a trace for a big cache; locality shifts by phase and input.
  • Pitfall: assuming caches alone deliver enough bandwidth; some apps must hit DRAM and need HBM or more channels.
  • Pitfall: assuming an "in-between" memory tech wins — it needs a durable speed/cost/power/NV edge (PCM didn't).
Most pitfalls are measurement or positioning errors, not logic errors.
Ch.2 · takeaways

Four things to walk in with

  • Optimize the right term — name hit time / miss rate / miss penalty / bandwidth / power before the technique.
  • Weight misses by cost — a tiny L3 rate can dominate if penalty is hundreds of non-overlapped cycles.
  • Bandwidth is structural — ports, banks, nonblocking caches, channels, HBM.
  • Security & OS are part of the hierarchy — TLBs, VIPT aliases, page tables, VMs, coherence, side channels.
Mantra: name the AMAT term, the C, latency vs bandwidth, and local vs global.