The memory manager,
bottom to top.
From a single 4 KiB physical frame up to a process's virtual address space — how Linux
finds, tracks, protects, and reclaims every page of memory. Grounded in Mel Gorman's
classic, but every structure here reflects the v6.8 source tree: folios, the maple tree, MGLRU,
5-level paging, memblock.
Orientation
"Memory management" in Linux is really several cooperating subsystems stacked on each other. Fix the vocabulary before diving in.
The stack above is the whole guide in miniature. Read it from the hardware up:
Physical page frames
The smallest unit the kernel allocates is a hardware page, almost always 4 KiB on x86-64
(PAGE_SIZE, PAGE_SHIFT == 12). Every frame is described by a struct page,
increasingly grouped into a struct folio. The kernel keeps an array of these descriptors so it can
map a physical frame number (PFN) to its metadata in O(1).
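The relationship between a physical address, its PFN, and the offset inside the frame is plain shift-and-mask arithmetic. A minimal sketch, assuming 4 KiB pages; the function names are illustrative helpers, not kernel APIs:

```python
PAGE_SHIFT = 12                  # 4 KiB pages, as on x86-64
PAGE_SIZE = 1 << PAGE_SHIFT      # 4096 bytes


def phys_to_pfn(paddr: int) -> int:
    """Frame number of the frame that contains physical address paddr."""
    return paddr >> PAGE_SHIFT


def pfn_to_phys(pfn: int) -> int:
    """Physical address of the first byte of frame pfn."""
    return pfn << PAGE_SHIFT


def page_offset(paddr: int) -> int:
    """Byte offset of paddr inside its frame."""
    return paddr & (PAGE_SIZE - 1)
```

For example, physical address 0x12345678 lies in frame 0x12345 at offset 0x678.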
The physical allocator (buddy system)
Frames are handed out by the buddy allocator, which manages free memory in power-of-two blocks
("orders") per zone. It is the lowest-level allocator; SLUB, vmalloc, the page cache, and user
page faults all ultimately call into it.
Zones and nodes
Physical memory is partitioned first by NUMA node (a pglist_data) and within each node by
zone (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE,
ZONE_DEVICE). Zones exist because not all physical memory is interchangeable.
Virtual memory, page cache, and reclaim
Each process has an address space (mm_struct), a set of regions (vm_area_structs indexed
by a maple tree), and a hierarchical page table. The page cache caches file contents;
reclaim (kswapd, direct reclaim, the LRU lists / MGLRU) decides what to evict under pressure.
Pages, folios, and mem_map
Every physical frame the kernel manages has a tiny descriptor. How that descriptor array is laid out is the memory model.
struct page
Defined in include/linux/mm_types.h, this descriptor is deliberately tiny — on the order of
64 bytes — because there is one for every frame, and on a large machine that array can consume a
meaningful fraction of RAM. To stay small, struct page is heavily unionized: the same bytes
mean different things depending on what the frame is being used for (a buddy free block, page-cache page, slab
page, anonymous page, page-table page). Durable fields are flags, _refcount,
_mapcount; much of the rest is overlaid context.
The flags field is a packed bitfield holding both true flags (PG_locked,
PG_dirty, PG_uptodate, PG_lru, …) and, in its high bits, the encoded
node id, zone number, and section — which is how page_to_nid() and page_zonenum()
work without a separate lookup.
Folios — the major change since the book
The single biggest structural change in recent years is the folio. A folio is a container for one or more physically contiguous pages managed as a unit, guaranteed never to be a "tail" page of a compound page. It makes the head-vs-tail distinction explicit in the type system (killing a class of bugs) and lets the kernel manage memory in chunks larger than 4 KiB.
Mapping a PFN to its descriptor — the memory model
- FLATMEM: a single contiguous mem_map[]. Simple; used on small/embedded systems.
- SPARSEMEM_VMEMMAP: the default on x86-64. Memory is divided into fixed-size sections so address spaces with large holes don't waste a descriptor per absent frame, and it maps a virtual array (vmemmap) so pfn_to_page() reduces to pointer arithmetic. FLATMEM's O(1) access with SPARSEMEM's tolerance of holes.
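The "pointer arithmetic" under SPARSEMEM_VMEMMAP can be modeled in a few lines. This is a sketch under stated assumptions: the vmemmap base is the x86-64 4-level value without KASLR, the descriptor is taken to be 64 bytes, and sections are taken to be 128 MiB (SECTION_SIZE_BITS of 27). All three vary by architecture and configuration.

```python
PAGE_SHIFT = 12
VMEMMAP_START = 0xFFFFEA0000000000   # assumed: x86-64, 4-level paging, no KASLR
STRUCT_PAGE_SIZE = 64                # assumed descriptor size in bytes
SECTION_SIZE_BITS = 27               # assumed: 128 MiB sections on x86-64
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT


def pfn_to_page(pfn: int) -> int:
    """Virtual address of the descriptor for frame pfn (array indexing)."""
    return VMEMMAP_START + pfn * STRUCT_PAGE_SIZE


def page_to_pfn(page_addr: int) -> int:
    """Inverse of pfn_to_page: descriptor address back to frame number."""
    return (page_addr - VMEMMAP_START) // STRUCT_PAGE_SIZE


def pfn_to_section_nr(pfn: int) -> int:
    """Which memory section a frame belongs to."""
    return pfn >> PFN_SECTION_SHIFT
```

Only sections that are actually present need physical backing for their slice of the virtual array, which is how the model tolerates holes.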
The accessors — pfn_to_page(), page_to_pfn(), virt_to_page() — are stable
across all models; only the implementation underneath changes. (DISCONTIGMEM from the book era has
been removed entirely.)
Zones and NUMA nodes
Memory is partitioned by node (where it physically lives) and by zone (what it's good for).
[Figure: two NUMA nodes, each a pglist_data: one local to CPUs 0–7, one remote across UPI/Infinity Fabric.]
ZONE_DEVICE (PMEM, GPU memory) isn't normal RAM and sits outside the buddy pools. ZONE_HIGHMEM is effectively dead on 64-bit: all physical RAM fits in the kernel's direct map.
Why nodes exist: NUMA
On a Non-Uniform Memory Access machine, a CPU reaches memory on its own socket (its local node)
faster than memory on another socket (a remote node). Linux models each domain as a node
(struct pglist_data). A single-socket machine is just the degenerate one-node case. In 6.8 the
LRU lists and most reclaim accounting live at the node level, in the node's lruvec — and with
the memory controller enabled, there is one lruvec per memory cgroup per node.
Why zones exist: addressing & migratability
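Within a node, not all frames are interchangeable. Some devices can only address low physical memory, so ZONE_DMA (the first 16 MiB on x86) and ZONE_DMA32 (below 4 GiB) set those frames aside for them. ZONE_MOVABLE contains only frames the kernel
promises are migratable, which is what makes memory hot-unplug and reliable huge-page allocation
possible. Its size is set administratively (kernelcore= / movablecore=).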
Zonelists & NUMA policy
When code requests memory it names a preferred zone (via GFP flags) and node (usually local). The allocator must
fall back when that zone is empty — that order is precomputed into each node's zonelists. There are two per node:
ZONELIST_FALLBACK (all nodes' zones, when remote memory is allowed) and
ZONELIST_NOFALLBACK (this node only, for __GFP_THISNODE). Default node ordering
exhausts the local node's zones before stepping to the nearest remote node, using ACPI SLIT distances. On
top sits NUMA memory policy: MPOL_DEFAULT, MPOL_BIND,
MPOL_PREFERRED, MPOL_INTERLEAVE, and MPOL_PREFERRED_MANY — set via
set_mempolicy()/mbind() and visible through numactl.
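The fallback order described above can be sketched as a small model: sort nodes by distance from the local node, then list each node's zones from highest to lowest. The function names and the dictionary-based inputs are illustrative assumptions; the real kernel builds this in mm/ with additional load-spreading heuristics that are omitted here.

```python
ZONE_ORDER = ["ZONE_NORMAL", "ZONE_DMA32", "ZONE_DMA"]  # highest zone first


def build_fallback_zonelist(local, distance, zones_on_node):
    """Return [(node, zone), ...] in the order an allocation would try them.

    distance[a][b] is a SLIT-style distance (10 means local);
    zones_on_node[n] is the set of populated zones on node n.
    """
    by_distance = sorted(distance[local], key=lambda n: (distance[local][n], n))
    zonelist = []
    for node in by_distance:
        for zone in ZONE_ORDER:
            if zone in zones_on_node[node]:
                zonelist.append((node, zone))
    return zonelist


def first_usable(zonelist, has_free):
    """Walk the zonelist and return the first (node, zone) with free memory."""
    for node, zone in zonelist:
        if has_free(node, zone):
            return (node, zone)
    return None
```

A ZONELIST_NOFALLBACK equivalent would simply keep only the entries whose node equals the local node.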
Boot-time bootstrap
A paradox: the buddy allocator and the struct page array are themselves data structures that must be allocated in memory — before the allocator that would allocate them exists.
Linux resolves it with a four-step handover. Each step builds the tool the next one needs:
Discover the physical map
The kernel can't probe RAM by poking addresses — it asks firmware. Legacy BIOS gives the e820 map
(base, length, type triples); UEFI gives the EFI memory map. Only ranges marked usable
become candidate RAM.
Record it in memblock
The boot-time allocator (mm/memblock.c, replacing the old bootmem) keeps two
arrays: memblock.memory (all RAM) and memblock.reserved (kernel image, initrd,
early page tables). Deliberately simple — a linear list, no struct page dependency.
Build the struct page array
Parse ACPI SRAT (node affinity) and SLIT (distances), compute zone boundaries, then under
SPARSEMEM_VMEMMAP allocate physical backing for each present section's descriptors and map it into
vmemmap. This is the step that makes pfn_to_page() work.
Hand over to the buddy allocator
memblock_free_all() walks every region in memblock.memory but not in
.reserved and releases those frames into the buddy free lists by clearing PG_reserved.
Watermarks are computed, PCP lists set up, and memblock is retired.
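The handover in the last step is, at its core, an interval subtraction: everything in memory that is not in reserved becomes free. A minimal sketch, assuming ranges are half-open (start, end) byte tuples; this is a model of the idea, not the mm/memblock.c interface:

```python
def free_ranges(memory, reserved):
    """Subtract reserved [start, end) ranges from memory [start, end) ranges.

    Returns the ranges that would be released to the buddy allocator.
    """
    out = []
    for m_start, m_end in sorted(memory):
        cursor = m_start
        for r_start, r_end in sorted(reserved):
            if r_end <= cursor or r_start >= m_end:
                continue                      # no overlap with what is left
            if r_start > cursor:
                out.append((cursor, r_start))  # gap before this reservation
            cursor = max(cursor, r_end)
        if cursor < m_end:
            out.append((cursor, m_end))        # tail after the last reservation
    return out
```

With one RAM region 0x1000–0x9000 and reservations at 0x2000–0x3000 and 0x5000–0x6000, three free ranges remain.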
After the handover, allocations go through the normal alloc_pages() path. The command line
can still steer the result: mem=, memmap=, numa=,
movablecore=, hugepages=.
The buddy allocator
The kernel's physical page allocator. Within each zone it keeps, for every order k, a list of free blocks of exactly 2ᵏ contiguous, aligned pages.
When an order-k block at PFN p is freed and its buddy (at p XOR (1<<k)) is also free and the same order, the two merge upward.
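Splitting on allocation and merging on free can be sketched in a short model. It assumes a single zone that starts as one maximal free block, uses sets instead of the kernel's linked free lists, and ignores migratetypes; the class and method names are illustrative.

```python
class Buddy:
    """Toy buddy allocator over 2**max_order frames, PFNs starting at 0."""

    def __init__(self, max_order):
        self.max_order = max_order
        self.free = [set() for _ in range(max_order + 1)]
        self.free[max_order].add(0)          # one maximal free block

    def alloc(self, order):
        """Return the first PFN of a free 2**order block, or None."""
        k = order
        while k <= self.max_order and not self.free[k]:
            k += 1                           # look for a larger block to split
        if k > self.max_order:
            return None
        pfn = min(self.free[k])
        self.free[k].remove(pfn)
        while k > order:                     # split, returning upper halves
            k -= 1
            self.free[k].add(pfn + (1 << k))
        return pfn

    def free_block(self, pfn, order):
        """Free a block and merge with its buddy as far up as possible."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free[order].add(pfn)
```

With Buddy(4), a 16-frame zone, the first order-0 allocation splits the single order-4 block all the way down, and freeing both order-0 pages coalesces it back.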
This is a 16-frame teaching model; a real x86-64 zone runs to MAX_PAGE_ORDER 10 (a 1024-page,
4 MiB largest block) and splits free lists per migratetype.
Migratetypes & anti-fragmentation
A pure buddy system still fragments: one unmovable allocation stranded in a free region prevents it from ever
coalescing. Linux keeps separate free lists per migratetype and tags every pageblock (typically the PMD huge-page
order, 2 MiB, when huge pages are configured in) with one of them: MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_HIGHATOMIC,
MIGRATE_CMA, MIGRATE_ISOLATE. Unmovable allocations therefore cluster together and movable
regions stay contiguous, compactable into huge pages or freeable for hot-unplug. When a zone runs low on a type
it steals whole pageblocks from another by a fixed fallback order.
GFP modifier flags include __GFP_ZERO, __GFP_MOVABLE,
__GFP_THISNODE, __GFP_RETRY_MAYFAIL, and __GFP_NOFAIL.
Watermarks & the reclaim control loop
Each zone carries three watermarks scaled from its managed page count. They form the feedback loop that drives reclaim.
Above the high watermark — memory is ample and nobody reclaims.
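The feedback loop can be sketched as a pair of functions: one derives low and high from a zone's min watermark, the other classifies a free-page count against them. The spacing rule used here (the larger of min/4 and a watermark_scale_factor fraction of managed pages) is a simplified model and should be checked against mm/page_alloc.c before relying on it; the state names are illustrative.

```python
def zone_watermarks(managed_pages, zone_min_pages, watermark_scale_factor=10):
    """Derive min/low/high (in pages) for one zone. Simplified model."""
    step = max(zone_min_pages // 4,
               managed_pages * watermark_scale_factor // 10000)
    return {
        "min": zone_min_pages,
        "low": zone_min_pages + step,
        "high": zone_min_pages + 2 * step,
    }


def reclaim_state(free_pages, wm):
    """Classify a zone's free-page count against its watermarks."""
    if free_pages >= wm["high"]:
        return "ample"            # nobody reclaims; kswapd goes back to sleep
    if free_pages >= wm["low"]:
        return "ok"               # a running kswapd keeps going until high
    if free_pages >= wm["min"]:
        return "wake kswapd"      # background reclaim starts
    return "direct reclaim"       # the allocating thread must reclaim itself
```

The gap between low and high gives the loop hysteresis: kswapd starts at low but only stops at high, so it does not flap on every allocation.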
Fast path (get_page_from_freelist): walk the zonelist, check the watermark,
pull a block; for a single page, almost always from the per-CPU page list, without taking the zone lock. Slow path
(__alloc_pages_slowpath) escalates: wake kswapd → lowered watermarks → direct
compaction → direct reclaim → OOM killer → NULL.
Per-CPU page lists & the layers above
Because single-page alloc/free is the hottest path, each zone keeps a per-CPU cache (struct
per_cpu_pages) so the common case avoids the zone lock and keeps just-freed pages cache-warm. Above the
buddy allocator sit SLUB (kmalloc, object caches for sub-page objects like inodes and
dentries — some reclaimable via shrinkers; SLUB is the only slab allocator left in 6.8, the older SLAB and
SLOB having been removed), vmalloc (large, virtually-contiguous but physically
scattered), and alloc_pages / folio_alloc for page-granular memory.
The page cache
The modern "file buffer cache" — one unified cache of file contents, keyed by (file, offset), in folio-sized units.
Older Unix had a separate buffer cache for disk blocks and page cache for file contents, causing
double-caching. Modern Linux has one unified page cache; the old buffer cache survives only as a
view — struct buffer_head objects describe individual disk blocks within a cached folio.
address_space — the core structure
- i_pages: an XArray mapping page offsets to cached folios. It replaced the radix tree the book describes, with a cleaner API, built-in locking, and inline shadow/value entries (used by refault detection). This answers "is offset N of this file resident, and if so where?"
- a_ops: the operations vtable (read_folio, writepages, dirty_folio, migrate_folio), which is how generic code calls into a specific filesystem.
- host: back-pointer to the owning inode; plus dirty/writeback tags stored as XArray marks.
Reads, readahead, and writeback
On a read, the kernel indexes mapping->i_pages. A hit copies cached data — no I/O. A
miss allocates a folio, inserts it, and calls read_folio. On detecting sequential access the
kernel does readahead, growing a window of upcoming folios. A write copies into the folio and marks it
dirty (write-back caching). Persisting is the job of per-bdi flusher threads (wb_workfn; no longer the single
pdflush of the book era, so a slow disk can't stall a fast one), triggered by time
(dirty_expire_centisecs), by pressure (dirty_background_ratio /
dirty_ratio), or by fsync.
balance_dirty_pages() throttles fast writers so dirty memory can't grow unbounded.
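The hit, miss, dirty, and writeback steps above can be modeled with a dictionary standing in for the XArray. This is a sketch only: the class name mirrors address_space for readability, but the methods, the callbacks, and the io_count field are illustrative assumptions, and readahead and throttling are left out.

```python
class AddressSpace:
    """Toy page cache for one file: page index -> cached folio contents."""

    def __init__(self, read_folio):
        self.i_pages = {}             # stand-in for the XArray
        self.dirty = set()            # stand-in for the XArray dirty marks
        self.read_folio = read_folio  # filesystem callback: index -> bytes
        self.io_count = 0             # how many reads actually hit the disk

    def read(self, index):
        folio = self.i_pages.get(index)
        if folio is None:             # miss: allocate, fill, insert
            folio = bytearray(self.read_folio(index))
            self.i_pages[index] = folio
            self.io_count += 1
        return folio                  # hit: no I/O

    def write(self, index, offset, data):
        folio = self.read(index)
        folio[offset:offset + len(data)] = data
        self.dirty.add(index)         # write-back caching: no I/O yet

    def writeback(self, write_folio):
        for index in sorted(self.dirty):
            write_folio(index, bytes(self.i_pages[index]))
        self.dirty.clear()
```

A second read of the same index costs no I/O, and a write reaches the backing store only when writeback runs.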
Reclaim & the OPT ideal
When memory is exhausted, the kernel must evict a resident page. Which one? Every real policy approximates an ideal it can never run.
The unreachable ideal — OPT / Belady
The theoretically optimal choice is OPT (MIN / Belady's algorithm): evict the page whose next use is furthest in the future. It provably minimizes faults — but is unimplementable online because it needs the future reference stream. OPT is a yardstick. Every practical policy approximates it using the past (recency and frequency), on the locality assumption that recently-used pages will be used soon.
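To see how far a past-based policy sits from the yardstick, the two can be simulated side by side. A minimal sketch: opt_faults is given the whole reference string (which is exactly what a real kernel never has), while lru_faults sees only the past. The reference string in the usage note is the common textbook example, not kernel data.

```python
from collections import OrderedDict


def opt_faults(refs, frames):
    """Belady/OPT: on a fault, evict the page whose next use is furthest away."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            def next_use(p):
                try:
                    return refs.index(p, i + 1)
                except ValueError:
                    return float("inf")      # never used again
            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults


def lru_faults(refs, frames):
    """LRU: on a fault, evict the least recently used page."""
    lru, faults = OrderedDict(), 0
    for page in refs:
        if page in lru:
            lru.move_to_end(page)            # most recently used at the end
            continue
        faults += 1
        if len(lru) == frames:
            lru.popitem(last=False)          # drop the oldest
        lru[page] = True
    return faults
```

On the string 7 0 1 2 0 3 0 4 2 3 0 3 2 1 2 0 1 7 0 1 with three frames, OPT takes 9 faults and LRU takes 12.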
The hardware gives only a weak signal: each PTE has an accessed (A) bit the CPU sets on reference and a dirty (D) bit on write. No timestamps, no counts. All of Linux's reclaim cleverness extracts a good recency estimate from those single bits sampled over time — the classic "clock"/"second chance" idea generalized.
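The classic second-chance idea fits in a few lines. This sketch assumes a fixed ring of frames and keeps a software copy of the accessed bit; as a modeling choice, a freshly installed page starts with the bit clear, so only a later reference earns it a second chance. It illustrates the textbook algorithm, not the kernel's LRU code.

```python
def clock_run(refs, frames):
    """Second-chance (clock) replacement. Returns (faults, resident pages)."""
    slots = [None] * frames        # page held by each frame
    accessed = [False] * frames    # software copy of the A bit
    where = {}                     # page -> frame index
    hand, faults = 0, 0
    for page in refs:
        if page in where:
            accessed[where[page]] = True   # hardware would set A on reference
            continue
        faults += 1
        while accessed[hand]:              # referenced since last sweep:
            accessed[hand] = False         # clear the bit, spare the page
            hand = (hand + 1) % frames
        victim = slots[hand]
        if victim is not None:
            del where[victim]
        slots[hand] = page
        where[page] = hand
        accessed[hand] = False             # assumption: first touch does not count
        hand = (hand + 1) % frames
    return faults, set(where)
```

In the sequence 1 2 3 1 4, plain FIFO would evict page 1; here its set bit spares it and page 2 goes instead.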
[Figure: two LRU lists, each labeled with a "not scanned" end and an "eviction end".]
Unevictable pages are mlocked pages, ramfs (non-swap-backed) pages, or pages otherwise pinned. Reclaim never wastes time scanning pages it can't free.
Reverse mapping (rmap)
To evict a frame the kernel must unmap it from every address space that maps it — find all PTEs pointing
at it. rmap (mm/rmap.c) solves this: file pages via the address_space interval
tree of VMAs; anon pages via an anon_vma chaining the VMAs (including fork children). With
rmap, try_to_unmap() walks to each PTE, samples/clears the accessed bit, flushes the TLB, and
arranges writeback. It's also the foundation for migration.
Refault detection & workingset
A plain two-list LRU can thrash: it can't tell a one-off scan from a working set under pressure. On eviction Linux leaves a shadow entry in the XArray recording roughly when (in LRU eviction distance) the page left. If it's read back soon — a refault — the kernel compares the distance to the active-list size to decide whether the page was wrongly evicted, and grows protection if so. The policy gains a memory of its own recent mistakes.
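The shadow-entry bookkeeping can be sketched as an eviction counter plus a table of stamps. This is a simplified model of the idea: the class and method names are illustrative, a dictionary stands in for the XArray shadow entries, and the comparison uses a fixed active-list size.

```python
class Workingset:
    """Toy refault detector based on eviction distance."""

    def __init__(self, active_size):
        self.evictions = 0           # monotonically increasing eviction counter
        self.shadow = {}             # page index -> counter value at eviction
        self.active_size = active_size

    def evict(self, index):
        """Record when (in evictions) this page left the cache."""
        self.shadow[index] = self.evictions
        self.evictions += 1

    def refault(self, index):
        """Return True if the page should be activated on its return."""
        stamp = self.shadow.pop(index, None)
        if stamp is None:
            return False             # no shadow entry: treat as a first touch
        distance = self.evictions - stamp
        return distance <= self.active_size
```

A page that returns after only a few evictions would have stayed resident had the inactive list been slightly longer, so it is activated; a page that returns after a long gap is treated like a new one.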
The multi-generational LRU (MGLRU)
6.8 ships MGLRU (CONFIG_LRU_GEN), which can replace the active/inactive lists. Instead of two
levels it organizes each lruvec into multiple generations (up to 4): youngest holds recently
accessed pages, oldest is the eviction frontier. Aging advances the counter by scanning page tables —
using lru_gen_look_around to harvest accessed bits from neighboring PTEs cheaply — rather
than walking the global list with rmap per page. The win: far lower CPU cost to find cold pages on large
memories. Exposed via /sys/kernel/mm/lru_gen/.
Who runs reclaim · swap · memcg · OOM
kswapd (one thread per node) wakes below the low watermark and reclaims to the high watermark in the
background. Direct reclaim runs inline in an allocating thread when the fast path fails — which is why a
starved allocation can block. Swap is what makes anonymous pages reclaimable: reclaim writes the
page to a swap slot and replaces its PTE with a not-present swap entry (swp_entry_t) encoding
the area+slot. With no swap configured, anonymous pages are effectively unevictable — the kernel can only
reclaim file pages and goes to the OOM killer sooner. The swappiness knob
(/proc/sys/vm/swappiness) tunes how aggressively the kernel prefers swapping anon vs. dropping file
pages. memcg (cgroup v2) scopes all of this per control group — memory.max, memory.high,
memory.min — which is exactly why lruvecs are per-(cgroup, node). Last resort: the
OOM killer picks a process by oom_score and kills it.
Migration & compaction
Migration relocates a page's contents to a different frame without evicting it. Compaction uses migration to manufacture contiguity.
What migration is
migrate_pages() physically relocates a page while keeping it transparently usable: isolate from
the LRU, lock and unmap from every mapping via rmap, copy contents and flags to a fresh destination,
rewrite every PTE and the page-cache XArray entry, then free the old frame. Because every reference is found via
rmap and the page is locked throughout, the move is invisible to user space. Only movable pages qualify:
pinned kernel memory and pages pinned through pin_user_pages() (FOLL_PIN) cannot move, which is exactly why migratetype
segregation exists.
What it's used for
- Compaction: defragmentation (below).
- NUMA balancing: moving a task's pages to the node where it runs.
- Memory hot-unplug: migrating every movable page off a bank before removal (the original reason ZONE_MOVABLE exists).
- CMA: assembling a contiguous region for a device.
- Tiered memory: demoting cold pages to slower nodes (CXL/PMEM) instead of swapping, and promoting hot pages back (the WMARK_PROMO watermark and NUMA-balancing paths participate).
Compaction runs from kcompactd (per-node), directly from the allocator slow path when a
high-order request fails, and on demand via /proc/sys/vm/compact_memory. This is the supply side that makes
transparent huge pages succeed on a fragmented system.
Automatic NUMA balancing
To improve locality without application awareness, the kernel periodically clears the present bit on a sample of
a task's pages so the next access faults. do_numa_page records which node the accessing CPU is
on; a page consistently accessed from a remote node is migrated to that node. "Sample by faulting, then
migrate" gradually co-locates each task with its working set, trading a few minor faults for better latency.
The page-table walk
The MMU translates a virtual address by walking a multi-level tree of tables rooted in CR3. With 4 KiB pages: four levels (48-bit), or five with LA57 (57-bit).
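The walk consumes the virtual address nine bits at a time above a 12-bit page offset. A minimal sketch of that split, assuming 4 KiB pages and the x86-64 layout; the function name is illustrative:

```python
PAGE_SHIFT = 12      # 4 KiB pages: low 12 bits are the byte offset
BITS_PER_LEVEL = 9   # 512 entries per table


def split_va(va, levels=4):
    """Split a virtual address into per-level table indices and page offset."""
    names = ["pte", "pmd", "pud", "pgd"] if levels == 4 else \
            ["pte", "pmd", "pud", "p4d", "pgd"]
    fields = {"offset": va & ((1 << PAGE_SHIFT) - 1)}
    for i, name in enumerate(names):
        shift = PAGE_SHIFT + BITS_PER_LEVEL * i
        fields[name] = (va >> shift) & ((1 << BITS_PER_LEVEL) - 1)
    return fields
```

Four levels cover 12 + 4 × 9 = 48 bits; adding the P4D level for LA57 gives 12 + 5 × 9 = 57 bits.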
PTE bits
Each 64-bit entry holds the next table's / final frame's physical address plus flags: Present (P) — clear
means any access faults (this is how demand paging, swap, and NUMA hinting work); R/W — read-only enforces
copy-on-write; U/S; A & D (set by hardware, sampled by reclaim); PS — at
PMD/PUD a leaf means a huge page (2 MiB / 1 GiB); G global; NX no-execute. Generic mm code
touches these only through accessors (pte_present, pte_mkwrite, set_pte_at),
so it runs unchanged on ARM64, RISC-V, etc.
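Decoding a raw entry makes the flag layout concrete. The bit positions below follow the x86-64 page-table entry format; the decode_pte helper is an illustrative stand-in for what accessors such as pte_present do, not a kernel function.

```python
# x86-64 entry bits. Bit 7 means PS only at PMD/PUD level (it is PAT in a PTE).
PTE_BITS = {"P": 0, "RW": 1, "US": 2, "A": 5, "D": 6, "PS": 7, "G": 8, "NX": 63}
PFN_MASK = 0x000FFFFFFFFFF000    # physical address bits 12..51


def decode_pte(entry):
    """Return the frame number and the set of flag names set in entry."""
    flags = {name for name, bit in PTE_BITS.items() if (entry >> bit) & 1}
    return {"pfn": (entry & PFN_MASK) >> 12, "flags": flags}


def pte_present(entry):
    """True if the Present bit is set; any access to a clear entry faults."""
    return bool(entry & 1)
```

An entry of 0 decodes to no flags at all, which is the not-present case that demand paging relies on.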
Folding · TLB · KPTI · huge pages
Linux writes its mm code once against the five-level abstraction; on hardware with fewer levels, unused
upper levels fold away (their offset accessor returns its input). The TLB caches translations the
kernel can only invalidate — when it changes a cached entry it must flush it (INVLPG for one address,
or a full flush via CR3 reload), and on SMP perform a shootdown: an IPI to the other CPUs that ran the
mm (tracked in mm_cpumask so untouched CPUs aren't disturbed), waiting for all to
acknowledge. PCID/ASID tagging lets the hardware keep translations for several address spaces at once, so a
context switch needn't flush everything. After Meltdown, KPTI gives each process two PGDs (full
kernel mapping vs. a minimal user one). Huge pages make a PMD/PUD entry a leaf so one TLB entry covers
2 MiB / 1 GiB — via explicit hugetlbfs or automatic THP (with khugepaged collapsing
base pages). THP's supply depends entirely on compaction.
VMAs & the page fault handler
The page tables are the hardware's view. The kernel's view — and the fault handler that fills the tables lazily — is the connective tissue that ties every subsystem together.
mm_struct and VMAs
A process's address space is a struct mm_struct owning the page-table root and a set of
regions. Each region is a vm_area_struct (a VMA): a contiguous run of virtual addresses
with the same backing and permissions — program text, data, heap, an mmaped file, the stack. In 6.8
the VMAs are indexed by a maple tree (replacing the rbtree + linked list). Crucially a VMA describes
intent and carries no physical memory by itself — frames and PTEs are created lazily, on fault.
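What the fault handler needs from the VMA index is one question: which region, if any, contains this address? A minimal sketch using a sorted list and binary search in place of the maple tree, assuming regions never overlap; the class and method names are illustrative.

```python
import bisect
from collections import namedtuple

VMA = namedtuple("VMA", "start end prot")   # covers [start, end)


class MM:
    """Toy address space: non-overlapping VMAs kept sorted by start."""

    def __init__(self):
        self.starts = []
        self.vmas = []

    def mmap(self, start, end, prot):
        """Record intent only; no frames or PTEs are created here."""
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.vmas.insert(i, VMA(start, end, prot))

    def lookup(self, addr):
        """Return the VMA containing addr, or None (which means SIGSEGV)."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and self.vmas[i].start <= addr < self.vmas[i].end:
            return self.vmas[i]
        return None
```

A lookup that returns None is an access outside every region; a lookup that succeeds tells the handler what kind of page to create.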
This laziness is why a process can mmap gigabytes instantly and only pay for the pages it touches.
The life of an allocation
Follow a process that touches a new heap page, and watch every subsystem in this guide cooperate. Step through it.
malloc → mmap, no memory yet
For a large request malloc calls mmap for anonymous memory. The kernel creates a
VMA in the mm_struct (indexed in the maple tree) but allocates no physical
memory. The page tables for the region are empty.
The write faults
The process writes an address in the region. The MMU walks the tables, finds the PTE not present,
and raises a page fault; on x86-64 the entry point exc_page_fault leads through do_user_addr_fault to handle_mm_fault.
Handler asks the allocator for a frame
An anonymous region with no page, so it calls alloc_pages / folio_alloc with
GFP_HIGHUSER_MOVABLE (user data, movable); the frame is zeroed before it is mapped.
Buddy allocator: fast path or slow path
The allocator picks the local NUMA node, walks its zonelist, and at the first zone above its
watermark pulls an order-0 page, almost certainly from the per-CPU list without touching the zone lock. If every
zone is below its watermark: wake kswapd → direct reclaim → compaction → OOM killer.
Wire up the frame
The handler zeroes it (no leaking another process's data), adds it to the LRU list (so reclaim can find it), sets up its reverse mapping (anon_vma), and installs a PTE — present, user, writable. It flushes any stale TLB entry.
Restart — and a possible afterlife
The instruction restarts; the MMU walk now succeeds. Later, under pressure, kswapd may find
the page cold, unmap it via rmap with a TLB shootdown, write it to swap, and leave a swap entry in
the PTE. On the wrong node, NUMA balancing may migrate it closer.
Glossary
The recurring vocabulary, in one place.
Seeing it on a live system
The fastest way to make these concepts concrete is to read them off a running kernel.
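One concrete example is /proc/buddyinfo, which prints one line per zone with the number of free blocks at each order. A minimal parsing sketch; the function names are illustrative, and the sample line in the usage note is made up for the example rather than captured from a real machine.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(x) for x in parts[4:]]
    return zones


def free_pages(counts):
    """Total free pages: each block at order k contributes 2**k pages."""
    return sum(n << order for order, n in enumerate(counts))
```

On a live system the input would come from open("/proc/buddyinfo").read(). For a sample line such as "Node 0, zone DMA 1 0 0 0 0 0 0 0 1 1 3", free_pages gives 1 + 256 + 512 + 3072 = 3841 pages.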
transparent_hugepage/ for THP.
Background reading: Mel Gorman,
Understanding the Linux Virtual Memory Manager (2.4/2.6 — the conceptual scaffolding); the kernel's own
Documentation/mm/ and Documentation/admin-guide/mm/ (including transhuge.rst,
numa_memory_policy.rst, multigen_lru.rst); and LWN.net's long-running coverage of mm
changes — folios, the maple tree, MGLRU, memory tiering.