Some notes taken while reading the paper 'Practical, Transparent Operating System Support for Superpages' (Navarro et al., OSDI 2002).

Background

  • superpage: a page larger than the base page size of 4 KB (e.g., 64 KB or 4 MB)

the memory management system in a modern OS

  • kernel functionality
  • hardware support

capabilities provided by memory management

  • demand paging
  • lazy allocation
  • shared memory
  • zero copy IPC …

why simply increasing the base page size may not work

  • internal fragmentation
  • enlarged application footprint -> higher physical memory requirements and more paging traffic (I/O)

Motivation of superpages

  • building a bigger TLB is expensive (it sits on the critical path of every memory access)
  • physically addressed caches have grown big enough to hold working sets
  • TLB coverage: the total amount of memory mapped by the TLB at one time
  • why superpages? TLB coverage has grown at a much lower pace than main memory size -> we need better coverage per TLB entry
  • memory sizes have increased -> relative TLB coverage has dropped -> programs have grown -> TLB misses increase
  • trend: on-board physically addressed caches are now larger than TLB coverage -> a TLB miss may require a memory access just to find the translation for data that is already cached
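To make the coverage argument concrete, a back-of-the-envelope calculation (the hardware numbers are illustrative, not taken from the paper):

```python
# Illustrative TLB coverage arithmetic (hypothetical hardware numbers).
TLB_ENTRIES = 64           # a plausible data-TLB size for the era
BASE_PAGE = 4 * 1024       # 4 KB base pages
SUPERPAGE = 4 * 1024**2    # 4 MB superpages

base_coverage = TLB_ENTRIES * BASE_PAGE    # 256 KB mapped at once
super_coverage = TLB_ENTRIES * SUPERPAGE   # 256 MB mapped at once

# A 1 MB physically addressed L2 cache already exceeds base-page coverage,
# so a TLB miss can occur even for data that is resident in the cache.
L2_CACHE = 1 * 1024**2
assert base_coverage < L2_CACHE < super_coverage
```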

Design

a general superpage management system

  • when a process allocates memory:
    1. reserve a larger contiguous region of physical memory
    2. increase the superpage size as the process touches pages in this region
    3. preempt portions of unused contiguous regions when memory becomes scarce
    4. restore contiguity by biasing the page replacement scheme to evict contiguous inactive pages
  • contributions:
    • extends a previously proposed reservation-based approach to work with multiple superpage sizes
    • is the first to investigate the effect of fragmentation on superpages
    • proposes a novel contiguity-aware page replacement algorithm to control fragmentation
    • tackles issues such as superpage demotion and eviction of dirty superpages
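The four-step lifecycle above can be sketched in miniature; all names here are hypothetical and the "unused portion" tracking is greatly simplified relative to the paper:

```python
# Minimal sketch of a reservation's lifecycle (hypothetical names).
class Reservation:
    def __init__(self, base_frame, nframes):
        self.base_frame = base_frame   # first physical frame reserved (step 1)
        self.nframes = nframes         # size of the contiguous region
        self.populated = set()         # page offsets the process has touched

    def touch(self, offset):
        """Step 2: mark a base page populated as the process touches it."""
        self.populated.add(offset)

    def unused_tail(self):
        """Step 3: frames beyond the last touched page are preemptible
        (simplified: the paper preempts whole unpopulated extents)."""
        high = max(self.populated, default=-1)
        return self.nframes - (high + 1)

r = Reservation(base_frame=0, nframes=16)  # reserve 16 contiguous frames
r.touch(0); r.touch(1); r.touch(2)
assert r.unused_tail() == 13               # 13 frames could be preempted
```

Step 4 (restoring contiguity via replacement) is a page-daemon concern; see the implementation section.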

hardware constraints

  • superpages come in a fixed, small set of page sizes
  • a superpage must be contiguous and size-aligned in both physical and virtual address space
  • a TLB entry provides only a single reference bit, dirty bit, and set of permission bits for the entire superpage

design space - allocation

  • relocation:
    • can be implemented entirely and transparently in the hardware-dependent layer of the OS
    • creates contiguity by copying pages -> high cost
    • requires a software-managed TLB: associates a counter with each potential superpage, updated by the miss handler
    • incurs more TLB misses than reservation, and each miss is more expensive
    • but is more robust to fragmentation
  • reservation: preserves contiguity; a contiguous region is reserved at page-fault time
    • e.g., allocate 4 KB but reserve 64 KB
    • choose a preferred superpage size (see the policy below)
    • find a contiguous region of physical memory of that size
    • allocate the faulting 4 KB frame from within that region
    • issue: trade off the performance gain of using a large superpage against the benefit of retaining the contiguous region for later use
  • hardware-based: relax the contiguity requirement for superpages
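A sketch of the reservation path at fault time. The free-list representation (a dict of region start -> length, in frames) and the function name are my own assumptions, not the paper's:

```python
# Sketch of reservation-based allocation at page-fault time.
# free_regions: hypothetical free list, {start_frame: length_in_frames}.
BASE = 4096  # base page size in bytes

def reserve_and_allocate(free_regions, fault_addr, preferred_pages):
    """Reserve a contiguous, size-aligned run of `preferred_pages` frames
    and return (reservation_start, faulting_frame), or None if no run fits
    (the caller would then retry with a smaller size or preempt)."""
    for start, length in sorted(free_regions.items()):
        # superpages must be size-aligned in physical memory
        aligned = -(-start // preferred_pages) * preferred_pages
        if aligned + preferred_pages <= start + length:
            # map the faulting page at its natural offset in the reservation
            frame = aligned + (fault_addr // BASE) % preferred_pages
            return aligned, frame
    return None

free = {3: 20}  # frames 3..22 are free
res = reserve_and_allocate(free, fault_addr=5 * BASE, preferred_pages=8)
assert res == (8, 13)  # reserve aligned frames 8..15; the fault maps frame 13
```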

preferred superpage size policy

  • reservation policy
    • fixed-size object:
      • choose the largest aligned superpage that contains the faulting page
      • do not extend the superpage beyond the end of the object
      • the reservation must not overlap existing reservations
    • dynamically sized object:
      • the superpage size is limited to the current size of the object
  • preemption policy
    • when possible, preempt existing reservations rather than refusing an allocation
    • when more than one reservation can be preempted, choose the one whose most recent page allocation occurred least recently
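The fixed-size-object rule above can be sketched as follows (the size hierarchy is hypothetical, and the overlap check against other reservations is omitted for brevity):

```python
# Sketch of the preferred-size policy for a fixed-size object.
# All quantities are in base pages; SIZES is a hypothetical hierarchy
# of hardware-supported superpage sizes.
SIZES = [1, 8, 64, 512]

def preferred_size(fault_page, obj_start, obj_end):
    """Largest supported size whose size-aligned extent contains the
    faulting page without extending beyond the object's boundaries."""
    best = 1
    for s in SIZES:
        lo = (fault_page // s) * s   # aligned extent containing the fault
        if lo >= obj_start and lo + s <= obj_end:
            best = s
    return best

# object spans pages [0, 100)
assert preferred_size(10, 0, 100) == 64  # aligned run 0..63 fits the object
assert preferred_size(70, 0, 100) == 8   # a 64-page run 64..127 would overrun
```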

promotion to superpages

  • promotion updates the page table entry for each of the constituent base pages
  • when to promote? trade-off of early promotion: fewer TLB misses vs. increased memory consumption
  • eager promotion (before all base pages are touched): not transparent, since it inflates the application's memory use
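The paper promotes an aligned extent only once every base page in it is populated; a minimal sketch of that check (the set-of-offsets representation is my simplification):

```python
# Sketch of the promotion condition: an aligned extent becomes a
# superpage only when all of its base pages have been touched.
def promotable(populated, extent_start, extent_pages):
    """True when every base page in [extent_start, extent_start+extent_pages)
    is populated, i.e., it is safe to rewrite the PTEs as one superpage."""
    return all(extent_start + i in populated for i in range(extent_pages))

pop = {0, 1, 2, 3, 5, 6, 7}
assert not promotable(pop, 0, 8)  # page 4 never touched -> no promotion yet
pop.add(4)
assert promotable(pop, 0, 8)      # now the 8 PTEs can be rewritten
```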

fragmentation control

  • preemption of existing reservations
  • treating contiguity as a resource (managed by the page replacement scheme)

demotion

  • speculative demotion: under memory pressure, demote a referenced superpage so the OS can detect which base pages are actually in use
  • permission change: the hardware provides only a single set of protection bits per superpage, so changing protections on part of it requires demotion
  • write to a clean superpage: demote it (so only truly dirtied base pages need write-back), then re-promote later once all base pages are dirty
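The write-triggered demote/re-promote cycle can be sketched like this (class and field names are hypothetical):

```python
# Sketch of write-triggered demotion of a clean superpage.
class SuperpageState:
    def __init__(self, npages):
        self.promoted = True   # starts as a clean, promoted superpage
        self.dirty = set()     # offsets of dirtied base pages
        self.npages = npages

    def on_write(self, offset):
        if self.promoted and not self.dirty:
            self.promoted = False        # first write: demote the clean superpage
        self.dirty.add(offset)
        if len(self.dirty) == self.npages:
            self.promoted = True         # all base pages dirty: re-promote

sp = SuperpageState(4)
sp.on_write(0)
assert not sp.promoted and sp.dirty == {0}   # only one base page is dirty
for i in (1, 2, 3):
    sp.on_write(i)
assert sp.promoted                           # re-promoted as a dirty superpage
```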

population map

  • a radix tree over each object's pages, keeping (somepop, fullpop) population counts per node
  • purpose:
    • enable the OS to find, on a page fault, the page frame that may already be reserved for a virtual address
    • detect and avoid overlapping regions
    • assist in making page promotion decisions
    • help identify unallocated regions
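A toy version of the population map; the fan-out and the exact meaning of the counters are simplified from the paper (here `somepop` counts populated base pages below a node, and `fullpop` counts fully populated children):

```python
# Sketch of a population map: a radix tree with population counts.
FANOUT = 4  # hypothetical fan-out

class PopNode:
    def __init__(self, npages):
        self.npages = npages   # base pages covered by this node
        self.somepop = 0       # populated base pages below this node
        self.fullpop = 0       # fully populated children
        self.kids = {}

    def populate(self, page):
        self.somepop += 1
        if self.npages > 1:
            span = self.npages // FANOUT
            kid = self.kids.setdefault(page // span, PopNode(span))
            kid.populate(page % span)
            self.fullpop = sum(k.somepop == k.npages
                               for k in self.kids.values())

    def is_full(self):
        """A fully populated node is a promotion candidate."""
        return self.somepop == self.npages

root = PopNode(16)
for p in range(4):
    root.populate(p)
assert root.kids[0].is_full()  # first 4-page extent is promotable
assert not root.is_full()      # the 16-page extent is not
```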

multi-list reservation

  • one reservation list for each page size, except the largest superpage size
  • each list is kept sorted by the time of the reservations' most recent page frame allocations; preemption takes the least recently active one
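A small sketch of these lists using an ordered dict as the per-size queue (the class and sizes are hypothetical):

```python
# Sketch of multi-list reservations: one list per supported size (except
# the largest), ordered by each reservation's most recent allocation time.
import collections

class ReservationLists:
    def __init__(self, sizes=(8, 64)):   # no list for the largest size
        self.lists = {s: collections.OrderedDict() for s in sizes}

    def note_allocation(self, size, resv_id, now):
        """Move a reservation to the most-recently-active end of its list."""
        lst = self.lists[size]
        lst.pop(resv_id, None)
        lst[resv_id] = now

    def preempt_victim(self, size):
        """Reservation whose most recent allocation occurred least recently."""
        lst = self.lists[size]
        return next(iter(lst)) if lst else None

rl = ReservationLists()
rl.note_allocation(8, "A", now=1)
rl.note_allocation(8, "B", now=2)
rl.note_allocation(8, "A", now=3)   # A is active again
assert rl.preempt_victim(8) == "B"  # B has been idle the longest
```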

Implementation

  • page daemon
    • keeps 3 lists of pages in approximate-LRU (A-LRU) order: active, inactive, cache
    • cache pages (clean and unmapped) are available for reservations
    • activated on memory pressure and when available contiguity falls low (i.e., an allocation of the preferred size fails)
    • more inactive pages -> higher chance of restoring contiguity
    • recovering contiguity more aggressively -> greater potential overhead induced by the modified page daemon
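The contiguity-aware behavior can be sketched as two pieces: a wakeup condition and an eviction choice biased toward contiguous inactive runs. All thresholds and names here are hypothetical:

```python
# Sketch of a contiguity-aware page daemon (hypothetical names/thresholds).
def daemon_should_run(free_pages, low_watermark, preferred_alloc_failed):
    """Run under memory pressure OR when contiguity is low (i.e., an
    allocation of the preferred superpage size has just failed)."""
    return free_pages < low_watermark or preferred_alloc_failed

def pick_eviction_run(inactive_frames, run_length):
    """Prefer a run of adjacent inactive frames so eviction restores a
    contiguous region instead of reclaiming scattered single pages."""
    frames = sorted(inactive_frames)
    for i in range(len(frames) - run_length + 1):
        window = frames[i:i + run_length]
        if window[-1] - window[0] == run_length - 1:
            return window
    return frames[:run_length]   # fall back to scattered pages

assert daemon_should_run(100, 50, preferred_alloc_failed=True)
assert pick_eviction_run({3, 9, 10, 11, 12, 20}, 4) == [9, 10, 11, 12]
```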

benefits

  • best case due to superpages:
    • when free memory is plentiful and unfragmented
    • in most cases, TLB misses are virtually eliminated
    • Matrix: its access pattern produces one TLB miss for every two memory accesses without superpages
    • mesa: slight degradation, because the allocator no longer differentiates zeroed-out pages from other free pages
  • multiple page sizes: mcf shows the benefit of supporting several sizes rather than one
  • long-term benefits depend on controlling fragmentation