Some notes taken from reading the paper ‘Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism’.

Context for process management

abstraction

  • process: CPU, mem (address space), disk (files)
  • thread: a variable-speed virtual processor (CPU); motivation: modeling multiple CPUs

challenges

  • synchronization: locks
  • scheduling threads over multiple CPUs
    • fairness, priority
  • efficiency (minimize overhead)

Why process management?

  • parallel computation, multi-core CPUs
  • databases, web servers… (software)

Design

how to design the most efficient thread package for parallel programs?

  1. processes + shared memory
    • cons: too inefficient for general-purpose parallel programming; only handles coarse-grained parallelism well
  2. kernel threads
  3. user-level threads

how do kernel threads work

  • state of a thread is maintained in the kernel
  • problem: oblivious two-level scheduling with no visible revocation (the kernel reschedules processors without informing the user level)

limits of kernel threads?

  • performance: boundary-crossing overhead; parameters must be copied and checked to protect the kernel
  • flexibility: e.g., user-level threads could be scheduled with different policies (exokernel-style)
    • problems:
      • blocking: a thread that blocks in the kernel takes its processor with it
      • the kernel scheduler does not know the priorities of user-level threads, which may cause priority inversion
      • locking: preempting a lock holder may delay other threads or cause deadlock

desirable properties of the solution

  • combine the strengths of user-level and kernel-level thread scheduling
  • performance:
    • minimize the boundary crossings
  • when a user-level thread is ready and a processor is available, the user-level thread must be able to run

virtual multi-processor abstraction

  • the kernel may vary the number of processors given to an address space at runtime
  • kernel has complete control over how many processors to give each address space
  • user-level thread system has complete control over which threads to run on its allocated processors
  • each virtual processor handed to an address space is backed by a physical processor
  • visible revocation of virtual processors, notifies user whenever a user-level thread blocks or wakes up in the kernel
  • user-level thread system notifies kernel when app. needs more or fewer processors

so what is scheduler activation?

a kernel abstraction of mapping: virtual processor <-> user-level thread system

a scheduler activation serves 3 roles:

  • serves as a vessel, or execution context, for running user-level threads
  • notifies the user-level thread system of a kernel event
  • provides space in the kernel for saving the processor context of the activation’s current user-level thread, when the thread is stopped by the kernel

what happens when a program starts?

  • kernel:
    • creates a scheduler activation
    • assigns it to a processor
    • upcalls into the application’s address space
  • user-level thread system:
    • receives the upcall
    • uses that activation as the context in which to initialize itself and run main()

Scheduler Activation upcall points

  • add this processor (processor #): execute a runnable user-level thread
  • processor has been preempted (preempted activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the preempted scheduler activation
  • scheduler activation has blocked (blocked activation #): the blocked scheduler activation is no longer using its processor
  • scheduler activation has unblocked (unblocked activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the blocked scheduler activation
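The four upcall points above can be sketched as a dispatcher in the user-level thread system. A minimal Python sketch, not the paper's code; the class, method, and field names are my own assumptions:

```python
# Sketch of a user-level thread system dispatching the four upcalls.
# Threads are plain objects; "running" a thread just means returning it.

class UserLevelScheduler:
    def __init__(self):
        self.ready = []      # runnable user-level threads (the ready list)
        self.running = {}    # activation id -> thread running in that activation
        self.blocked = {}    # activation id -> thread blocked in the kernel

    def add_this_processor(self, activation_id):
        # new processor: pick a runnable thread and run it in this activation
        if self.ready:
            thread = self.ready.pop()
            self.running[activation_id] = thread
            return thread
        return None

    def processor_preempted(self, activation_id, machine_state):
        # return the preempted activation's thread to the ready list
        thread = self.running.pop(activation_id)
        thread.saved_state = machine_state
        self.ready.append(thread)

    def activation_blocked(self, activation_id):
        # the blocked activation is no longer using its processor
        self.blocked[activation_id] = self.running.pop(activation_id)

    def activation_unblocked(self, activation_id, machine_state):
        # the thread that was blocked in the kernel is runnable again
        thread = self.blocked.pop(activation_id)
        thread.saved_state = machine_state
        self.ready.append(thread)
```

Note how every upcall either consumes or produces entries on the user-level ready list — the kernel never resumes a stopped thread itself, matching the comparison below.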

scheduler activation vs. kernel threads

  • scheduler activation:

    • once an activation’s user-level thread is stopped by kernel, the thread is never directly resumed by the kernel.
    • a new scheduler activation is created to notify the user-level thread system that the thread has been stopped
    • the user-level system removes the state of the thread from the old activation, and tells kernel that the old activation can be reused
    • the user-level system decides which thread to run on the processor(based on scheduling)
  • kernel thread

    • when kernel stops a kernel thread, the kernel never notifies the user level of the event
    • kernel directly resumes the kernel thread without notification

user-level thread system

  • no need to tell the kernel about every thread operation, only about the ops that can affect the kernel’s processor allocation decision
  • notify the kernel whenever the address space transitions to a state where #processors != #runnable threads
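The transition rule can be sketched in a few lines. This is an illustrative simplification (one kernel call per transition), with hypothetical names:

```python
# Sketch of "notify only at transitions": the kernel hears from the address
# space only when the number of runnable threads crosses the number of
# allocated processors.

class ParallelismTracker:
    def __init__(self, processors):
        self.processors = processors
        self.runnable = 0
        self.kernel_calls = []   # stand-in for actual kernel calls

    def thread_became_runnable(self):
        self.runnable += 1
        if self.runnable > self.processors:
            # more work than processors: ask the kernel for one more
            self.kernel_calls.append("add_more_processors(1)")

    def thread_finished(self):
        self.runnable -= 1
        if self.runnable < self.processors:
            # more processors than work: offer one back
            self.kernel_calls.append("this_processor_is_idle()")
```

Ordinary thread creation and switching below the processor count triggers no kernel interaction at all, which is where the performance win comes from.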

kernel call(user -> kernel)

  • functions:
    • add more processors(additional #processors): allocate more processors to this address space and start them running scheduler activations
    • this processor is idle(): preempt this processor if another address space needs it
  • drawback:
    • applications may not be honest in reporting their parallelism to the OS
    • solution: multi-level feedback to encourage honest reporting; the processor allocator favors address spaces that use fewer processors and penalizes those that use more

critical sections

  • a user-level thread could be running in a critical section at the time when it is blocked or preempted
  • possible effects:
    • poor performance, e.g., other threads continue to spin on an application-level lock held by the preempted thread
    • deadlock, e.g., the preempted thread could be holding the ready-list lock while the upcall handler tries to place that same thread onto the ready list
  • how to deal with these inopportune preemptions?
    • prevention:
      • avoid through the use of a scheduling and locking protocol between the kernel and the user-level
      • but has serious drawbacks:
        1. requires the kernel to yield control over processor allocation to the user-level, violating the semantics of priorities
        2. inconsistent with the efficient implementation of critical sections
        3. page faults complicate prevention: the relevant code and data would have to be ‘pinned’ to physical memory
    • recovery(used):
      • when an upcall informs the user level of a preemption or block, the user level checks whether the stopped thread was running in a critical section; if so, the thread is continued temporarily via a user-level context switch
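The recovery path can be sketched using the simple flag technique (option 1 in the performance-optimization section below). A hedged sketch; the names and return-value protocol are my own:

```python
# Sketch of recovery from inopportune preemption: a per-thread flag marks
# critical sections, and the preemption upcall handler checks it before
# rescheduling the thread.

class Thread:
    def __init__(self):
        self.in_critical_section = False
        self.continued_temporarily = False

def lock_acquire(thread):
    thread.in_critical_section = True      # flag set on entering the section

def lock_release(thread):
    thread.in_critical_section = False
    if thread.continued_temporarily:
        # we were continued past a preemption only to finish the section;
        # now relinquish the processor back to the user-level scheduler
        thread.continued_temporarily = False
        return "yield_to_scheduler"
    return "keep_running"

def on_preemption_upcall(thread, ready_list):
    if thread.in_critical_section:
        # continue the thread temporarily via a user-level context switch
        thread.continued_temporarily = True
        return "continue_thread"
    ready_list.append(thread)              # safe to reschedule normally
    return "pick_next_thread"
```

This avoids both failure modes above: the lock holder finishes the section before anyone waits on it, and the upcall handler never touches the ready list while its lock is held.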

Implementation

  • Topaz: kernel thread
  • FastThreads: user-level thread package

processor allocation policy

  • space-shares processors:
    • processors are divided evenly among the highest-priority address spaces
    • reduces the number of processor reallocations
    • processors are time-sliced only if the number of available processors is not an integer multiple of the number of address spaces that want them
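The space-sharing arithmetic is just integer division; a tiny illustrative helper (not from the paper):

```python
# Space-sharing: divide P processors evenly among the address spaces that
# want them; any remainder is time-sliced rather than dedicated.

def space_share(processors, address_spaces):
    base = processors // address_spaces      # guaranteed share per space
    leftover = processors % address_spaces   # time-sliced among the spaces
    return base, leftover
```

E.g., 8 processors across 3 address spaces gives each space 2 dedicated processors, with the remaining 2 time-sliced.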

thread scheduling

  • kernel has no knowledge of app’s concurrency model or scheduling policy…(like exokernel style)
  • default in FastThreads:
    • per-processor ready lists accessed by each processor in last-in-first-out order (improves cache locality)
    • a processor scans for work if its own ready list is empty
  • this paper:
    • hysteresis to avoid unnecessary processor reallocations
    • an idle processor spins for a short period before notifying the kernel that it is available for reallocation
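The per-processor ready-list policy can be sketched as follows. The structure and names are assumptions for illustration, not the paper's code:

```python
# Sketch of the default FastThreads policy: per-processor LIFO ready lists
# (cache locality), with a scan of other processors' lists when the local
# list is empty.

class Processor:
    def __init__(self):
        self.ready = []                  # LIFO: push/pop at the tail

def find_work(me, all_processors):
    if me.ready:
        return me.ready.pop()            # most recently queued thread first
    for other in all_processors:         # local list empty: scan for work
        if other is not me and other.ready:
            return other.ready.pop(0)    # take the oldest from another list
    return None    # idle: spin briefly, then tell the kernel (hysteresis)
```

Popping the tail locally keeps recently-touched thread state hot in the cache, while stealing the oldest entry from another list disturbs that processor's locality the least.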

performance optimization

  • critical section: the user-level system must be able to check whether the thread was holding a lock

    1. set a flag when a thread enters a critical section, clear it on exit, and check it when the thread is about to be continued (but this imposes overhead on every lock acquisition and release!)
    2. make a copy of every low-level critical section; at the end of each copy, place code that yields the processor back to the resumer. When a preemption occurs, the activation checks whether the thread is in one of the critical sections; if so, it continues the thread at the corresponding place in the copy, imposing no overhead in the common case
  • management of scheduler activations

    • discarded scheduler activations can be cached for eventual reuse
    • discarded scheduler activations can be collected and returned to the kernel in bulk to reduce boundary crossings
  • benefit?

    • removing this optimization from FastThreads raised the Null Fork time from 37 μsec to 49 μsec and Signal-Wait from 42 μsec to 48 μsec; Null Fork exercises more critical sections than Signal-Wait

Evaluation

overhead of user-level thread operations?

  • incrementing and decrementing the number of busy threads and determining whether the kernel must be notified
  • checking whether a preempted thread is being resumed

why is the cost of an upcall high?

  • the scheduler activation machinery adds overhead to making and completing an I/O request or a page fault:
    • we must maintain more state
    • much of Topaz is written in carefully tuned assembler, this paper’s implementation is entirely in Modula-2+

overall effect on the performance of applications?

  • benchmark: an O(N log N) solution to the N-body problem
  • kernel threads are inefficient because of spin locks
  • with kernel-involved I/O, the original FastThreads performs badly: when a user-level thread blocks in the kernel, the kernel thread serving as its virtual processor also blocks, so the application loses that physical processor for the duration of the I/O