mm/linux v6.8
Understanding the Linux Virtual Memory Manager · refreshed for 6.8

The memory manager, bottom to top.

From a single 4 KiB physical frame up to a process's virtual address space — how Linux finds, tracks, protects, and reclaims every page of memory. Grounded in Mel Gorman's classic, but every structure here reflects the v6.8 source tree: folios, the maple tree, MGLRU, 5-level paging, memblock.

arch: x86-64 · page size: 4 KiB · released: Mar 2024 · audience: undergrad OS
§9–10: page tables · VMAs · faults (the virtual view)
§6–8: page cache · reclaim · migration (what stays resident)
§5: buddy allocator · GFP · watermarks (handing out frames)
§2–4: struct page · zones · nodes · boot (the physical model)
01

Orientation

"Memory management" in Linux is really several cooperating subsystems stacked on each other. Fix the vocabulary before diving in.

The stack above is the whole guide in miniature. Read it from the hardware up:

Physical page frames

The smallest unit the kernel allocates is a hardware page, almost always 4 KiB on x86-64 (PAGE_SIZE, PAGE_SHIFT == 12). Every frame is described by a struct page, increasingly grouped into a struct folio. The kernel keeps an array of these descriptors so it can map a physical frame number (PFN) to its metadata in O(1).

The physical allocator (buddy system)

Frames are handed out by the buddy allocator, which manages free memory in power-of-two blocks ("orders") per zone. It is the lowest-level allocator; SLUB, vmalloc, the page cache, and user page faults all ultimately call into it.

Zones and nodes

Physical memory is partitioned first by NUMA node (a pglist_data) and within each node by zone (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE). Zones exist because not all physical memory is interchangeable.

Virtual memory, page cache, and reclaim

Each process has an address space (mm_struct), a set of regions (vm_area_structs indexed by a maple tree), and a hierarchical page table. The page cache caches file contents; reclaim (kswapd, direct reclaim, the LRU lists / MGLRU) decides what to evict under pressure.

The mental model that stays accurate: the buddy allocator owns free physical frames; zones and nodes describe where those frames are and what they're good for; page tables describe where frames are virtually visible; and the page cache plus reclaim decide what gets to stay resident.
02

Pages, folios, and mem_map

Every physical frame the kernel manages has a tiny descriptor. How that descriptor array is laid out is the memory model.

struct page

Defined in include/linux/mm_types.h, this descriptor is deliberately tiny — on the order of 64 bytes — because there is one for every frame, and on a large machine that array can consume a meaningful fraction of RAM. To stay small, struct page is heavily unionized: the same bytes mean different things depending on what the frame is being used for (a buddy free block, page-cache page, slab page, anonymous page, page-table page). Durable fields are flags, _refcount, _mapcount; much of the rest is overlaid context.

The flags field is a packed bitfield holding both true flags (PG_locked, PG_dirty, PG_uptodate, PG_lru, …) and, in its high bits, the encoded node id, zone number, and section — which is how page_to_nid() and page_zonenum() work without a separate lookup.

Folios — the major change since the book

The single biggest structural change in recent years is the folio. A folio is a container for one or more physically contiguous pages managed as a unit, guaranteed never to be a "tail" page of a compound page. It makes the head-vs-tail distinction explicit in the type system (killing a class of bugs) and lets the kernel manage memory in chunks larger than 4 KiB.

Keep this distinction: think of the folio as the new unit of memory management, and the page as the unit of hardware addressing. A folio of order 0 is just one page; higher-order folios back large page-cache entries and transparent huge pages.

Mapping a PFN to its descriptor — the memory model

  • FLATMEM: a single contiguous mem_map[]. Simple; used on small/embedded systems.
  • SPARSEMEM_VMEMMAP: the default on x86-64. Memory is divided into fixed-size sections so address spaces with large holes don't waste a descriptor per absent frame, and the descriptors are mapped into a virtually contiguous array (vmemmap) so pfn_to_page() reduces to pointer arithmetic. The result combines FLATMEM's O(1) access with SPARSEMEM's tolerance of holes.

The accessors — pfn_to_page(), page_to_pfn(), virt_to_page() — are stable across all models; only the implementation underneath changes. (DISCONTIGMEM from the book era has been removed entirely.)
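The pointer arithmetic behind the vmemmap model can be sketched in a few lines. This is a user-space illustration, not kernel code: STRUCT_PAGE_SIZE is assumed to be exactly 64 bytes, VMEMMAP_BASE is an illustrative constant, and the function names merely mirror the kernel accessors.

```python
PAGE_SHIFT = 12                          # 4 KiB pages, as in the text
PAGE_SIZE = 1 << PAGE_SHIFT
STRUCT_PAGE_SIZE = 64                    # assumption: one 64-byte descriptor per frame
VMEMMAP_BASE = 0xFFFF_EA00_0000_0000     # illustrative base of the virtual descriptor array


def phys_to_pfn(phys_addr):
    """A frame number is the physical address with the page offset dropped."""
    return phys_addr >> PAGE_SHIFT


def pfn_to_page(pfn):
    """Virtual address of the descriptor: base + index * element size."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


def page_to_pfn(page_addr):
    """Inverse of pfn_to_page()."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


def descriptor_overhead(ram_bytes):
    """Bytes of descriptors needed to describe ram_bytes of RAM."""
    return (ram_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE
```

Under these assumptions the descriptor array costs 64 / 4096, about 1.6% of RAM: 1 GiB of descriptors for a 64 GiB machine.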

03

Zones and NUMA nodes

Memory is partitioned by node (where it physically lives) and by zone (what it's good for).

Figure: physical memory map.
node 0 (pglist_data, local to CPUs 0–7): DMA (<16 MiB) · DMA32 (<4 GiB) · NORMAL (the workhorse) · MOVABLE (migratable only)
node 1 (pglist_data, remote across UPI/Infinity Fabric): NORMAL (bulk of RAM) · MOVABLE (migratable only)
Legend: DMA, ancient 16 MiB devices · DMA32, 32-bit DMA mask (<4 GiB) · NORMAL, directly mapped, any purpose · MOVABLE, hot-unplug and huge pages
Each node owns its zones in address order. ZONE_DEVICE (PMEM, GPU memory) isn't normal RAM and sits outside the buddy pools. ZONE_HIGHMEM is effectively dead on 64-bit — all physical RAM fits in the kernel's direct map.

Why nodes exist: NUMA

On a Non-Uniform Memory Access machine, a CPU reaches memory on its own socket (its local node) faster than memory on another socket (a remote node). Linux models each domain as a node (struct pglist_data). A single-socket machine is just the degenerate one-node case. In 6.8 the LRU lists and most reclaim accounting live at the node level, in the node's lruvec — and with the memory controller enabled, there is one lruvec per memory cgroup per node.

Why zones exist: addressing & migratability

Within a node, not all frames are interchangeable. ZONE_MOVABLE contains only frames the kernel promises are migratable — which is what makes memory hot-unplug and reliable huge-page allocation possible. Its size is set administratively (kernelcore= / movablecore=).

Zonelists & NUMA policy

When code requests memory it names a preferred zone (via GFP flags) and node (usually local). The allocator must fall back when that zone is empty — that order is precomputed into each node's zonelists. There are two per node: ZONELIST_FALLBACK (all nodes' zones, when remote memory is allowed) and ZONELIST_NOFALLBACK (this node only, for __GFP_THISNODE). Default node ordering exhausts the local node's zones before stepping to the nearest remote node, using ACPI SLIT distances. On top sits NUMA memory policy: MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, MPOL_INTERLEAVE, and MPOL_PREFERRED_MANY — set via set_mempolicy()/mbind() and visible through numactl.
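The fallback ordering described above can be modelled as a sort by distance followed by a walk down the zones. This is a simplified sketch, not the kernel's build code: build_fallback_zonelist, ZONE_ORDER, and the two-node distance table are hypothetical names and data chosen for illustration.

```python
ZONE_ORDER = ["MOVABLE", "NORMAL", "DMA32", "DMA"]   # highest zone first


def build_fallback_zonelist(local, distances, zones_per_node, highest="MOVABLE"):
    """Return (node, zone) pairs in the order an allocation would try them.

    distances[a][b] plays the role of the SLIT distance from node a to node b;
    `highest` is the highest zone the request may use (derived from GFP flags).
    """
    allowed = ZONE_ORDER[ZONE_ORDER.index(highest):]
    # Nearest node first; the local node has the smallest distance to itself.
    nodes = sorted(zones_per_node, key=lambda n: (distances[local][n], n))
    return [(n, z) for n in nodes for z in allowed if z in zones_per_node[n]]


zones = {0: {"DMA", "DMA32", "NORMAL", "MOVABLE"}, 1: {"NORMAL", "MOVABLE"}}
slit = [[10, 21], [21, 10]]
normal_from_node0 = build_fallback_zonelist(0, slit, zones, highest="NORMAL")
```

For a request from node 0 limited to ZONE_NORMAL, the list exhausts node 0's NORMAL, DMA32, and DMA zones before trying node 1's NORMAL zone.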

04

Boot-time bootstrap

A paradox: the buddy allocator and the struct page array are themselves data structures that must be allocated in memory — before the allocator that would allocate them exists.

Linux resolves it with a four-step handover. Each step builds the tool the next one needs:

From firmware to the buddy allocator (boot sequence)
1

Discover the physical map

The kernel can't probe RAM by poking addresses — it asks firmware. Legacy BIOS gives the e820 map (base, length, type triples); UEFI gives the EFI memory map. Only ranges marked usable become candidate RAM.

arch/x86/kernel/e820.c · EFI stub
2

Record it in memblock

The boot-time allocator (mm/memblock.c, replacing the old bootmem) keeps two arrays: memblock.memory (all RAM) and memblock.reserved (kernel image, initrd, early page tables). Deliberately simple — a linear list, no struct page dependency.

memblock_add() · memblock_reserve()
3

Build the struct page array

Parse ACPI SRAT (node affinity) and SLIT (distances), compute zone boundaries, then under SPARSEMEM_VMEMMAP allocate physical backing for each present section's descriptors and map it into vmemmap. This is the step that makes pfn_to_page() work.

free_area_init() · memmap_init
4

Hand over to the buddy allocator

memblock_free_all() walks every region in memblock.memory but not in .reserved and releases those frames into the buddy free lists by clearing PG_reserved. Watermarks are computed, PCP lists set up, and memblock is retired.

__free_pages_core · mem_init()
After this, allocations go through the normal alloc_pages() path. The command line can still steer the result: mem=, memmap=, numa=, movablecore=, hugepages=.
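The handover in step 4 amounts to interval subtraction: everything in memblock.memory that is not covered by memblock.reserved goes to the buddy allocator. A minimal sketch, assuming half-open [start, end) byte ranges; free_ranges is a hypothetical helper, not a kernel function.

```python
def free_ranges(memory, reserved):
    """Subtract reserved [start, end) ranges from memory [start, end) ranges."""
    out = []
    for m_start, m_end in sorted(memory):
        cur = m_start
        for r_start, r_end in sorted(reserved):
            if r_end <= cur or r_start >= m_end:
                continue                      # no overlap with what is left
            if r_start > cur:
                out.append((cur, r_start))    # usable gap before the reservation
            cur = max(cur, r_end)
        if cur < m_end:
            out.append((cur, m_end))          # usable tail of the region
    return out


memory = [(0x1000, 0x9F000), (0x100000, 0x8000000)]
reserved = [(0x1000000, 0x2000000), (0x7F00000, 0x8000000)]   # e.g. kernel image, initrd
usable = free_ranges(memory, reserved)
```

Each resulting range would then be released frame by frame into the buddy free lists.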
05

The buddy allocator

The kernel's physical page allocator. Within each zone it keeps, for every order k, a list of free blocks of exactly 2ᵏ contiguous, aligned pages.

Figure: buddy allocator model with 16 frames (PFN 0 to PFN 15), orders 0–4, where buddy = p XOR (1<<order).
Allocation by splitting: a request for order k with only a larger block free repeatedly halves it, depositing the unused buddy halves on lower free lists. Freeing by coalescing: when a block is freed, if its buddy (p XOR (1<<k)) is also free and the same order, the two merge upward. This is a 16-frame teaching model; a real x86-64 zone runs to MAX_PAGE_ORDER 10 — a 1024-page, 4 MiB largest block — and splits free lists per migratetype.
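Splitting and coalescing fit in a short model. The sketch below mirrors the 16-frame teaching model rather than the kernel's implementation: one zone, no migratetypes, no per-CPU lists, and the class and method names are hypothetical.

```python
class Buddy:
    def __init__(self, max_order=4):
        self.max_order = max_order
        # free[k] holds the starting PFN of every free block of 2**k frames.
        self.free = {k: set() for k in range(max_order + 1)}
        self.free[max_order].add(0)           # one free block covering all frames

    def alloc(self, order):
        k = order
        while k <= self.max_order and not self.free[k]:
            k += 1                            # find the smallest block that fits
        if k > self.max_order:
            return None                       # nothing large enough is free
        pfn = min(self.free[k])
        self.free[k].remove(pfn)
        while k > order:                      # split, parking the upper half
            k -= 1
            self.free[k].add(pfn + (1 << k))
        return pfn

    def free_block(self, pfn, order):
        while order < self.max_order:
            buddy = pfn ^ (1 << order)        # the XOR trick from the text
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)    # merge with the free buddy
            pfn = min(pfn, buddy)
            order += 1
        self.free[order].add(pfn)
```

Allocating an order-0 page from a fresh allocator returns PFN 0 and leaves free blocks at PFN 1, 2, 4, and 8 (orders 0 to 3); freeing everything again merges all the way back to a single order-4 block.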

Migratetypes & anti-fragmentation

A pure buddy system still fragments: one unmovable allocation stranded in a free region prevents it from ever coalescing. Linux therefore tags each pageblock (typically the PMD huge-page order, 2 MiB, when huge pages are configured in) with a migratetype and keeps a separate free list per migratetype at every order: MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_HIGHATOMIC, MIGRATE_CMA, MIGRATE_ISOLATE. Unmovable allocations cluster together, and movable regions stay contiguous, so they can be compacted into huge pages or freed for hot-unplug. When a zone runs low on a type it steals whole pageblocks from another by a fixed fallback order.

GFP flags: what you can tolerate (the allocation contract)
GFP_KERNEL
Process context. May sleep, do I/O & writeback, trigger direct reclaim and the OOM killer. The flexible "I can wait" request.
GFP_ATOMIC
Cannot sleep (IRQ handlers, holding a spinlock). May not enter direct reclaim, though it still wakes kswapd, and may dip into the emergency reserves below the min watermark.
GFP_NOWAIT
Like atomic, but without reserve access — fails fast rather than blocking.
GFP_NOFS / GFP_NOIO
May sleep, but must not recurse into the filesystem / any I/O. Avoids deadlock when the caller is itself in the writeback path.
GFP_HIGHUSER_MOVABLE
User-backing pages, marked movable so they land in movable pageblocks.
Single-bit modifiers refine the request: __GFP_ZERO, __GFP_MOVABLE, __GFP_THISNODE, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL.

Watermarks & the reclaim control loop

Each zone carries three watermarks scaled from its managed page count. They form the feedback loop that drives reclaim.

Figure: free-memory control loop, with example watermarks at high 40%, low 25%, and min 10% of the zone. Zone state "Plenty": above the high watermark, memory is ample and nobody reclaims.
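The feedback loop has hysteresis: background reclaim starts below the low watermark and keeps going until free memory is back above the high watermark. A minimal sketch of that state machine, assuming free-page counts as plain integers; reclaim_step and the watermark values are hypothetical.

```python
def reclaim_step(free, kswapd_active, wmark):
    """Return (kswapd_active, direct_reclaim) for the current free-page count."""
    if free < wmark["low"]:
        kswapd_active = True        # wake background reclaim
    elif free >= wmark["high"]:
        kswapd_active = False       # target reached, go back to sleep
    # Between low and high the previous state is kept (hysteresis).
    direct_reclaim = free < wmark["min"]   # allocator must reclaim inline
    return kswapd_active, direct_reclaim


wmark = {"min": 100, "low": 250, "high": 400}
```

With these numbers, a zone at 300 free pages is left alone if kswapd was idle, but kswapd keeps reclaiming if it was already running.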

Fast path (get_page_from_freelist): walk the zonelist, check the watermark, pull a block; for a single page, almost always from the per-CPU page list, which avoids the zone lock. Slow path (__alloc_pages_slowpath) escalates: wake kswapd → lowered watermarks → direct compaction → direct reclaim → OOM killer → NULL.

Per-CPU page lists & the layers above

Because single-page alloc/free is the hottest path, each zone keeps a per-CPU cache (struct per_cpu_pages) so the common case avoids the zone lock and keeps just-freed pages cache-warm. Above the buddy allocator sit SLUB (kmalloc, object caches for sub-page objects like inodes and dentries — some reclaimable via shrinkers; SLUB is the only slab allocator left in 6.8, the older SLAB and SLOB having been removed), vmalloc (large, virtually-contiguous but physically scattered), and alloc_pages / folio_alloc for page-granular memory.

06

The page cache

The modern "file buffer cache" — one unified cache of file contents, keyed by (file, offset), in folio-sized units.

Older Unix had a separate buffer cache for disk blocks and page cache for file contents, causing double-caching. Modern Linux has one unified page cache; the old buffer cache survives only as a view: struct buffer_head objects describe individual disk blocks within a cached folio.

address_space — the core structure

  • i_pages — an XArray mapping page offsets to cached folios. It replaced the radix tree the book describes, with a cleaner API, built-in locking, and inline shadow/value entries (used by refault detection). This answers "is offset N of this file resident, and if so where?"
  • a_ops — the operations vtable: read_folio, writepages, dirty_folio, migrate_folio — how generic code calls into a specific filesystem.
  • host — back-pointer to the owning inode; plus dirty/writeback tags stored as XArray marks.

Reads, readahead, and writeback

On a read, the kernel indexes mapping->i_pages. A hit copies cached data — no I/O. A miss allocates a folio, inserts it, and calls read_folio. On detecting sequential access the kernel does readahead, growing a window of upcoming folios. A write copies into the folio and marks it dirty (write-back caching). Persisting is the job of per-bdi flusher threads (wb_workfn; no longer the single pdflush of the book era, so a slow disk can't stall a fast one), triggered by time (dirty_expire_centisecs), by pressure (dirty_background_ratio / dirty_ratio), or by fsync. balance_dirty_pages() throttles fast writers so dirty memory can't grow unbounded.
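The hit, miss, and write-back paths can be modelled with a dictionary standing in for the XArray. This is a toy for a single file, with no readahead, locking, or throttling; PageCache and its callbacks are hypothetical names, and only i_pages and read_folio echo the kernel's terms.

```python
class PageCache:
    def __init__(self, read_folio):
        self.i_pages = {}             # stand-in for the XArray: index -> folio contents
        self.dirty = set()            # indices modified since the last writeback
        self.read_folio = read_folio  # filesystem callback that performs the I/O
        self.io_reads = 0

    def read(self, index):
        if index not in self.i_pages:             # miss: allocate, insert, read
            self.i_pages[index] = self.read_folio(index)
            self.io_reads += 1
        return self.i_pages[index]                # hit: no I/O

    def write(self, index, data):
        self.i_pages[index] = data                # copy into the cached folio
        self.dirty.add(index)                     # persisted later, not now

    def writeback(self, write_out):
        for index in sorted(self.dirty):
            write_out(index, self.i_pages[index])
        self.dirty.clear()


disk = {0: b"a", 1: b"b"}
cache = PageCache(lambda index: disk.get(index, b""))
```

Reading index 0 twice costs one I/O; a write to index 1 leaves the disk unchanged until writeback() runs.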

Why "free" memory is always small: page-cache folios are ordinary frames on the LRU lists. Clean cache pages are essentially free memory the kernel is borrowing; they can be dropped instantly under pressure. That's why "free" memory on a healthy box is small while "available" memory is large: most RAM is reclaimable page cache.
07

Reclaim & the OPT ideal

When memory is exhausted, the kernel must evict a resident page. Which one? Every real policy approximates an ideal it can never run.

Don't conflate these: reclaim (this section) decides which resident page to evict to free a frame. Migration (§8) relocates a page's contents to a different frame without evicting it. OPT belongs to the first.

The unreachable ideal — OPT / Belady

The theoretically optimal choice is OPT (MIN / Belady's algorithm): evict the page whose next use is furthest in the future. It provably minimizes faults — but is unimplementable online because it needs the future reference stream. OPT is a yardstick. Every practical policy approximates it using the past (recency and frequency), on the locality assumption that recently-used pages will be used soon.
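The gap between the ideal and a recency-based policy is easy to measure offline, where the whole reference stream is known. The sketch below counts faults for OPT and for strict LRU; the function names and the reference string (a common textbook example) are illustrative, and neither function resembles the kernel's reclaim code.

```python
def opt_faults(refs, frames):
    """Belady's OPT: evict the resident page whose next use is furthest away."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            future = refs[i + 1:]

            def next_use(p):
                return future.index(p) if p in future else len(future)

            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults


def lru_faults(refs, frames):
    """Strict LRU: evict the page whose last use is furthest in the past."""
    resident, faults = [], 0          # least recently used at index 0
    for page in refs:
        if page in resident:
            resident.remove(page)     # hit: move to the most-recent end
        else:
            faults += 1
            if len(resident) == frames:
                resident.pop(0)
        resident.append(page)
    return faults


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
```

With three frames, OPT takes 9 faults on this string and LRU takes 12, which is the kind of distance every practical policy tries to narrow.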

The hardware gives only a weak signal: each PTE has an accessed (A) bit the CPU sets on reference and a dirty (D) bit on write. No timestamps, no counts. All of Linux's reclaim cleverness extracts a good recency estimate from those single bits sampled over time — the classic "clock"/"second chance" idea generalized.

Figure: the two-list LRU (active / inactive, anon / file).
ANON (eviction costs a swap write): ACTIVE (not scanned) → INACTIVE (eviction end)
FILE (may be free to drop): ACTIVE (not scanned) → INACTIVE (eviction end)
LRU_UNEVICTABLE is kept off the scan entirely: mlocked pages, ramfs (non-swap-backed), or otherwise pinned. Reclaim never wastes time scanning pages it can't free.
Newly-referenced pages start inactive; referenced again, they're promoted to active (not scanned until inactive is too small). Reclaim scans the cold end, consults the accessed bit via the reverse mapping: referenced → second chance; not → evict (drop if clean, swap/write if dirty). Separating anon from file matters because evicting a file page may be free while an anon page always costs a swap write.
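The second-chance scan of the inactive list can be sketched as follows. It is deliberately simplified: a set stands in for the per-PTE accessed bits, every referenced page is promoted straight to the active list, and scan_inactive is a hypothetical name rather than a kernel function.

```python
from collections import deque


def scan_inactive(inactive, active, referenced, want):
    """Try to evict `want` pages from the cold end (left) of the inactive list."""
    evicted = []
    for _ in range(len(inactive)):
        if len(evicted) == want:
            break
        page = inactive.popleft()
        if page in referenced:
            referenced.discard(page)   # clear the accessed bit...
            active.append(page)        # ...and give the page a second chance
        else:
            evicted.append(page)       # cold: drop if clean, write out if dirty
    return evicted


inactive = deque(["a", "b", "c", "d"])
active = deque()
referenced = {"b"}
evicted = scan_inactive(inactive, active, referenced, want=2)
```

Here pages a and c are evicted, b is promoted because it was referenced, and d is never scanned because the target was already met.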

Reverse mapping (rmap)

To evict a frame the kernel must unmap it from every address space that maps it — find all PTEs pointing at it. rmap (mm/rmap.c) solves this: file pages via the address_space interval tree of VMAs; anon pages via an anon_vma chaining the VMAs (including fork children). With rmap, try_to_unmap() walks to each PTE, samples/clears the accessed bit, flushes the TLB, and arranges writeback. It's also the foundation for migration.

Refault detection & workingset

A plain two-list LRU can thrash: it can't tell a one-off scan from a working set under pressure. On eviction Linux leaves a shadow entry in the XArray recording roughly when (in LRU eviction distance) the page left. If it's read back soon — a refault — the kernel compares the distance to the active-list size to decide whether the page was wrongly evicted, and grows protection if so. The policy gains a memory of its own recent mistakes.

The multi-generational LRU (MGLRU)

6.8 ships MGLRU (CONFIG_LRU_GEN), which can replace the active/inactive lists. Instead of two levels it organizes each lruvec into multiple generations (up to 4): youngest holds recently accessed pages, oldest is the eviction frontier. Aging advances the counter by scanning page tables — using lru_gen_look_around to harvest accessed bits from neighboring PTEs cheaply — rather than walking the global list with rmap per page. The win: far lower CPU cost to find cold pages on large memories. Exposed via /sys/kernel/mm/lru_gen/.

Either way, the same goal: classic two-list or MGLRU, both cheaply approximate OPT by inferring future use from sampled recency and frequency. They differ only in how cheaply and how granularly they sample.

Who runs reclaim · swap · memcg · OOM

kswapd (one thread per node) wakes below the low watermark and reclaims to the high watermark in the background. Direct reclaim runs inline in an allocating thread when the fast path fails — which is why a starved allocation can block. Swap is what makes anonymous pages reclaimable: reclaim writes the page to a swap slot and replaces its PTE with a not-present swap entry (swp_entry_t) encoding the area+slot. With no swap configured, anonymous pages are effectively unevictable — the kernel can only reclaim file pages and goes to the OOM killer sooner. The swappiness knob (/proc/sys/vm/swappiness) tunes how aggressively the kernel prefers swapping anon vs. dropping file pages. memcg (cgroup v2) scopes all of this per control group — memory.max, memory.high, memory.min — which is exactly why lruvecs are per-(cgroup, node). Last resort: the OOM killer picks a process by oom_score and kills it.

08

Migration & compaction

Migration relocates a page's contents to a different frame without evicting it. Compaction uses migration to manufacture contiguity.

What migration is

migrate_pages() physically relocates a page while keeping it transparently usable: isolate from the LRU, lock and unmap from every mapping via rmap, copy contents and flags to a fresh destination, rewrite every PTE and the page-cache XArray entry, then free the old frame. Because every reference is found via rmap and the page is locked throughout, the move is invisible to user space. Only movable pages qualify — pinned kernel memory and get_user_pages(FOLL_PIN) pages cannot move, which is exactly why migratetype segregation exists.

What it's used for

  • Compaction — defragmentation (below).
  • NUMA balancing — moving a task's pages to the node where it runs.
  • Memory hot-unplug — migrating every movable page off a bank before removal (the original reason ZONE_MOVABLE exists).
  • CMA — assembling a contiguous region for a device.
  • Tiered memory — demoting cold pages to slower nodes (CXL/PMEM) instead of swapping, and promoting hot pages back (the WMARK_PROMO watermark and NUMA-balancing paths participate).
Figure: compaction with two scanners. The migration scanner moves up (↑) from the bottom of the zone; the free scanner moves down (↓) from the top.
A migration scanner walks up collecting in-use movable pages; a free scanner walks down collecting free slots. Used pages sweep to one side, free space coalesces into large buddy blocks on the other.
Invoked by kcompactd (per-node), directly from the allocator slow path when a high-order request fails, and via /proc/sys/vm/compact_memory. This is the supply side that makes transparent huge pages succeed on a fragmented system.
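The two-scanner idea can be shown on a small array of frames. This is a toy model under simple assumptions: one frame per page, "M" for a movable page, "U" for an unmovable one, None for a free frame; compact is a hypothetical function name.

```python
def compact(frames):
    """Move movable pages toward the top so free frames gather at the bottom."""
    frames = list(frames)
    migrate, free = 0, len(frames) - 1
    while migrate < free:                 # stop when the scanners meet
        if frames[migrate] != "M":
            migrate += 1                  # migration scanner skips free/unmovable
        elif frames[free] is not None:
            free -= 1                     # free scanner skips in-use frames
        else:
            frames[free], frames[migrate] = "M", None   # migrate the page
            migrate += 1
            free -= 1
    return frames


before = ["M", None, "M", "U", None, "M", None, None]
after = compact(before)
```

After the pass the three lowest frames are free and contiguous, while the single unmovable page at index 3 still blocks a larger free run, which is the fragmentation problem migratetypes are meant to contain.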

Automatic NUMA balancing

To improve locality without application awareness, the kernel periodically clears the present bit on a sample of a task's pages so the next access faults. do_numa_page records which node the accessing CPU is on; a page consistently accessed from a remote node is migrated toward it. "Sample by faulting, then migrate" gradually co-locates each task with its working set, trading a few minor faults for better latency.

09

The page-table walk

The MMU translates a virtual address by walking a multi-level tree of tables rooted in CR3. With 4 KiB pages: four levels (48-bit), or five with LA57 (57-bit).

Figure: translating a virtual address (x86-64, 4-level).
CR3 (root register, holds the PGD's physical address) → PGD (global directory) → PUD (upper directory) → PMD (middle directory) → PTE (leaf) → 4 KiB frame + offset
A 48-bit address slices into four 9-bit indices (512 entries × 8 bytes = one 4 KiB page each) plus a 12-bit offset.
Each level's table is itself one physical page from the buddy allocator. The 5-level variant inserts a P4D index between PGD and PUD and uses a 57-bit address.
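The slicing is plain shift-and-mask arithmetic. A small sketch, assuming 4 KiB pages and 9 index bits per level as described above; split_va is a hypothetical helper and the dictionary keys simply reuse the level names.

```python
def split_va(va, levels=4):
    """Split a virtual address into per-level table indices and a page offset."""
    names = ["pte", "pmd", "pud", "pgd"] if levels == 4 else ["pte", "pmd", "pud", "p4d", "pgd"]
    parts = {"offset": va & 0xFFF}                 # low 12 bits: byte within the page
    for i, name in enumerate(names):
        parts[name] = (va >> (12 + 9 * i)) & 0x1FF  # next 9 bits per level
    return parts


# Build an address from known indices, then take it apart again.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
parts = split_va(va)
```

For this address the walk uses PGD entry 1, PUD entry 2, PMD entry 3, PTE entry 4, and byte 5 within the frame. With levels=5 the P4D index occupies bits 39 to 47 and the PGD index moves up to bits 48 to 56.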

PTE bits

Each 64-bit entry holds the next table's / final frame's physical address plus flags: Present (P) — clear means any access faults (this is how demand paging, swap, and NUMA hinting work); R/W — read-only enforces copy-on-write; U/S; A & D (set by hardware, sampled by reclaim); PS — at PMD/PUD a leaf means a huge page (2 MiB / 1 GiB); G global; NX no-execute. Generic mm code touches these only through accessors (pte_present, pte_mkwrite, set_pte_at), so it runs unchanged on ARM64, RISC-V, etc.
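To make the bit layout concrete, the sketch below decodes a raw x86-64 entry. The bit positions are the architectural ones for the flags named above; decode_pte, PTE_FLAGS, and the sample value are hypothetical, and kernel code would use the pte_* accessors rather than raw masks.

```python
PTE_FLAGS = {"P": 0, "RW": 1, "US": 2, "A": 5, "D": 6, "PS": 7, "G": 8, "NX": 63}
ADDR_MASK = 0x000F_FFFF_FFFF_F000      # bits 12..51 hold the physical address


def decode_pte(pte):
    """Return (pfn, set of flag names) for a 64-bit x86-64 page-table entry."""
    flags = {name for name, bit in PTE_FLAGS.items() if (pte >> bit) & 1}
    return (pte & ADDR_MASK) >> 12, flags


# Present, writable, user, accessed, dirty, no-execute, pointing at PFN 0x1234.
sample = (1 << 63) | (0x1234 << 12) | 0b1100111
pfn, flags = decode_pte(sample)
```

An entry of 0 decodes to an empty flag set, which is the not-present case that makes any access fault.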

Folding · TLB · KPTI · huge pages

Linux writes its mm code once against the five-level abstraction; on hardware with fewer levels, unused upper levels fold away (their offset accessor returns its input). The TLB caches translations the kernel can only invalidate — when it changes a cached entry it must flush it (INVLPG for one address, or a full flush via CR3 reload), and on SMP perform a shootdown: an IPI to the other CPUs that ran the mm (tracked in mm_cpumask so untouched CPUs aren't disturbed), waiting for all to acknowledge. PCID/ASID tagging lets the hardware keep translations for several address spaces at once, so a context switch needn't flush everything. After Meltdown, KPTI gives each process two PGDs (full kernel mapping vs. a minimal user one). Huge pages make a PMD/PUD entry a leaf so one TLB entry covers 2 MiB / 1 GiB — via explicit hugetlbfs or automatic THP (with khugepaged collapsing base pages). THP's supply depends entirely on compaction.

10

VMAs & the page fault handler

The page tables are the hardware's view. The kernel's view — and the fault handler that fills the tables lazily — is the connective tissue that ties every subsystem together.

mm_struct and VMAs

A process's address space is a struct mm_struct owning the page-table root and a set of regions. Each region is a vm_area_struct (a VMA): a contiguous run of virtual addresses with the same backing and permissions — program text, data, heap, an mmaped file, the stack. In 6.8 the VMAs are indexed by a maple tree (replacing the rbtree + linked list). Crucially a VMA describes intent and carries no physical memory by itself — frames and PTEs are created lazily, on fault.

Figure: page fault explorer.
In every legitimate case the handler ends by installing a valid PTE and the faulting instruction is restarted. This lazy, fault-driven population is why a process can mmap gigabytes instantly and only pay for the pages it touches.
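The first decision the fault handler makes, whether the address belongs to a VMA and whether the access is permitted, can be sketched with a sorted list standing in for the maple tree. AddressSpace, lookup, and handle_fault are hypothetical names, and the string results only label the outcome.

```python
import bisect


class AddressSpace:
    def __init__(self):
        self.starts, self.vmas = [], []       # kept sorted by start address

    def mmap(self, start, end, prot):
        """Record a region [start, end); no frames or PTEs are created."""
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.vmas.insert(i, (start, end, prot))

    def lookup(self, addr):
        """Return the VMA containing addr, or None."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and addr < self.vmas[i][1]:
            return self.vmas[i]
        return None


def handle_fault(mm, addr, write):
    vma = mm.lookup(addr)
    if vma is None:
        return "SIGSEGV: no VMA"
    if write and "w" not in vma[2]:
        return "SIGSEGV: write to read-only VMA"
    return "install PTE"                      # allocate or find the page, map it


mm = AddressSpace()
mm.mmap(0x400000, 0x401000, "rx")                     # program text
mm.mmap(0x7F0000000000, 0x7F0000100000, "rw")         # anonymous mapping
```

Only the last outcome reaches the allocator; the two failures are decided from the VMA alone, before any physical memory is involved.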
11

The life of an allocation

Follow a process that touches a new heap page, and watch every subsystem in this guide cooperate. Step through it.

malloc → write → resident (six steps)
1

malloc → mmap, no memory yet

For a large request malloc calls mmap for anonymous memory. The kernel creates a VMA in the mm_struct (indexed in the maple tree) but allocates no physical memory. The page tables for the region are empty.

vm_area_struct · maple tree · lazy
2

The write faults

The process writes an address in the region. The MMU walks the tables, finds the PTE not present, and raises a page fault: do_page_fault → handle_mm_fault.

MMU walk · #PF
3

Handler asks the allocator for a frame

An anonymous region with no page — so it calls alloc_pages / folio_alloc with GFP_HIGHUSER_MOVABLE (user data, movable, zeroed).

alloc_pages · GFP_HIGHUSER_MOVABLE
4

Buddy allocator: fast path or slow path

The allocator picks the local NUMA node, walks its zonelist, and at the first zone above its watermark pulls an order-0 page, almost certainly from the per-CPU list without taking the zone lock. If every zone is below its watermark: wake kswapd → direct reclaim → compaction → OOM killer.

zonelist · watermark · PCP
5

Wire up the frame

The handler zeroes it (no leaking another process's data), adds it to the LRU list (so reclaim can find it), sets up its reverse mapping (anon_vma), and installs a PTE — present, user, writable. It flushes any stale TLB entry.

LRU · rmap · set_pte_at · TLB
6

Restart — and a possible afterlife

The instruction restarts; the MMU walk now succeeds. Later, under pressure, kswapd may find the page cold, unmap it via rmap with a TLB shootdown, write it to swap, and leave a swap entry in the PTE. On the wrong node, NUMA balancing may migrate it closer.

reclaim · swap · migration
Every subsystem in this guide participated in the life of that one page.
12

Glossary

The recurring vocabulary, in one place.

13

Seeing it on a live system

The fastest way to make these concepts concrete is to read them off a running kernel.

/proc/meminfo
Global totals — free, available, cached, dirty, writeback, slab, swap.
/proc/zoneinfo
Per-zone free pages and watermarks (min/low/high). The §5 watermarks, live.
/proc/buddyinfo
Free-block counts per order per zone — watch fragmentation. The §5 free lists, live.
/proc/pagetypeinfo
Free blocks broken down by migratetype — anti-fragmentation state.
/proc/vmstat
Counters for faults, reclaim, compaction, NUMA balancing, swap.
/proc/<pid>/maps · smaps
A process's VMAs and per-region memory. The maple tree's contents, §10.
numastat · /sys/.../node/
Per-NUMA-node memory totals and allocation hit/miss stats (§3).
/sys/.../lru_gen/
MGLRU controls (§7). Also transparent_hugepage/ for THP.
/sys/fs/cgroup/.../memory.*
Per-cgroup usage, limits, pressure, and events (memcg, §7).
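As one worked example of reading these files, the sketch below totals the free pages per zone from /proc/buddyinfo-style text. It assumes the usual layout of "Node N, zone NAME" followed by one free-block count per order; parse_buddyinfo and the sample numbers are illustrative, and on a live system the text would come from reading /proc/buddyinfo.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> total free pages, from /proc/buddyinfo-style text."""
    totals = {}
    for line in text.strip().splitlines():
        head, counts = line.split("zone")
        node = int(head.replace("Node", "").replace(",", "").strip())
        fields = counts.split()
        zone = fields[0]
        per_order = [int(x) for x in fields[1:]]
        # A free block of order k contributes 2**k pages.
        totals[(node, zone)] = sum(n << order for order, n in enumerate(per_order))
    return totals


SAMPLE = """
Node 0, zone      DMA      1      1      0      0      0      0      0      0      1      1      3
Node 0, zone   Normal    100     50     25      0      0      0      0      0      0      0      2
"""
totals = parse_buddyinfo(SAMPLE)
```

In the sample, the Normal zone has 2348 free pages, but only two blocks of the largest order; many small blocks and few large ones is what fragmentation looks like in this file.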

Background reading: Mel Gorman, Understanding the Linux Virtual Memory Manager (2.4/2.6 — the conceptual scaffolding); the kernel's own Documentation/mm/ and Documentation/admin-guide/mm/ (including transhuge.rst, numa_memory_policy.rst, multigen_lru.rst); and LWN.net's long-running coverage of mm changes — folios, the maple tree, MGLRU, memory tiering.