Some notes taken while reading the paper ‘Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism’.
Context for process management
abstraction
- process: CPU, memory (address space), disk (files)
- thread: a variable-speed virtual processor (CPU); motivation: a way to express multiple CPUs
challenges
- synchronization: locks
- schedule the threads over multi CPUs
- fairness, priority
- efficiency (minimize overhead)
Why process management?
- parallel computation, multi-core CPUs
- databases, web servers… (software)
Design
how to design the most efficient thread package for parallel programs?
- process + shared memory
- cons: too inefficient for general-purpose parallel programming; only handles coarse-grained parallelism well!
- kernel threads
- user level threads
how do kernel threads work
- state of a thread is maintained in the kernel
- problem: oblivious two-level scheduling with no visible revocation (the kernel schedules kernel threads without knowing what the user level is doing, and takes processors away without telling it)
limits of kernel threads?
- performance: boundary-crossing overhead; parameters have to be copied and checked in order to protect the kernel
- flexibility: user-level threads can be scheduled with whatever policy the application wants (exokernel-style)
- problems:
- blocking: when a user-level thread blocks in the kernel (e.g. on I/O or a page fault), the kernel thread serving as its virtual processor blocks too, and that physical processor is lost for the duration
- the kernel scheduler does not know the priorities of user-level threads, which can cause priority inversion
- locking: preempting a lock holder may delay other threads or even cause deadlock
desirable properties of the solution
- solution for user/kernel level thread scheduler
- performance:
- minimize the boundary crossings
- when a user-level thread is ready and a processor is available, the user-level thread must be able to run
virtual multi-processor abstraction
- the kernel may vary the number of processors assigned to an address space at runtime
- kernel has complete control over how many processors to give each address space
- user-level thread system has complete control over which threads to run on its allocated processors
- each virtual processor handed to an address space is backed by a physical processor
- revocation of virtual processors is visible: the kernel notifies the user level whenever a user-level thread blocks or wakes up in the kernel
- the user-level thread system notifies the kernel when the application needs more or fewer processors
so what is a scheduler activation?
a kernel abstraction of the mapping: virtual processor <-> user-level thread system
a scheduler activation serves 3 roles:
- serves as a vessel, or execution context, for running user-level threads
- notifies the user-level thread system of a kernel event
- provides space in the kernel for saving the processor context of the activation’s current user-level thread, when the thread is stopped by the kernel
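A minimal C sketch of the per-activation state the kernel might keep, mirroring the three roles; the struct and field names here are hypothetical, not the paper’s Topaz code:

```c
#include <stdint.h>

/* Hypothetical register-save area; real contents are machine-dependent. */
typedef struct { uintptr_t regs[32]; uintptr_t pc, sp; } cpu_state_t;

/* Illustrative per-activation kernel state (names are made up): */
struct scheduler_activation {
    int         id;            /* identifies this activation in upcalls        */
    int         processor;     /* physical processor currently backing it      */
    void       *user_stack;    /* role 1: execution context for user threads   */
    void       *kernel_stack;  /* used while the activation runs in the kernel */
    cpu_state_t saved_state;   /* role 3: processor context of the activation's
                                  current user-level thread, saved when the
                                  kernel stops that thread                      */
};
```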
what happens when a program starts?
- kernel:
- creates a scheduler activation
- assigns it to a processor
- upcalls into the app address space
- user-level thread system:
- receives the upcall
- uses that activation as the context in which to initialize itself and run main()
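A sketch of this start-of-program handshake; every function name below is hypothetical, chosen only to illustrate the control flow:

```c
struct address_space;            /* opaque kernel object */
struct scheduler_activation;

/* Hypothetical kernel helpers: */
struct scheduler_activation *create_activation(struct address_space *as);
int  pick_idle_processor(void);
void assign_to_processor(struct scheduler_activation *sa, int cpu);
void upcall_add_processor(struct scheduler_activation *sa); /* enters user space */

/* Kernel side: build the first activation, bind it to a processor,
 * and upcall into the application's address space. */
void kernel_start_program(struct address_space *as) {
    struct scheduler_activation *sa = create_activation(as);
    assign_to_processor(sa, pick_idle_processor());
    upcall_add_processor(sa);
}

/* Hypothetical user-level thread package entry points: */
void thread_system_init(void);
void run_main_thread(void);      /* creates the main thread and switches to it */

/* User side: the upcall is the thread system's first execution
 * context; it initializes itself there and then runs main(). */
void upcall_entry(void) {
    thread_system_init();
    run_main_thread();
}
```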
Scheduler Activation upcall points
- add this processor (processor #): execute a runnable user-level thread
- processor has been preempted (preempted activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the preempted scheduler activation
- scheduler activation has blocked (blocked activation #): the blocked scheduler activation is no longer using its processor
- scheduler activation has unblocked (unblocked activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the blocked scheduler activation
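Rendered as C prototypes (the four events and their arguments are from the paper’s upcall table; this C spelling of them is mine):

```c
#include <stdint.h>

/* Machine-dependent register state delivered with some upcalls. */
typedef struct { uintptr_t regs[32]; uintptr_t pc, sp; } machine_state_t;

/* The four upcall points, one C prototype each: */
void add_this_processor(int processor);
void processor_has_been_preempted(int preempted_activation,
                                  const machine_state_t *state);
void sa_has_blocked(int blocked_activation);
void sa_has_unblocked(int unblocked_activation,
                      const machine_state_t *state);
```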
scheduler activation vs. kernel threads
scheduler activations:
- once an activation’s user-level thread is stopped by the kernel, the thread is never directly resumed by the kernel
- instead, a new scheduler activation is created to notify the user-level thread system that the thread has been stopped
- the user-level thread system removes the thread’s state from the old activation and tells the kernel that the old activation can be reused
- the user-level thread system then decides which thread to run on the processor (based on its own scheduling policy)
kernel threads:
- when the kernel stops a kernel thread, it never notifies the user level of the event
- the kernel later resumes the kernel thread directly, again without notification
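A sketch of the user-level side of this protocol: a hypothetical handler for the ‘processor has been preempted’ upcall, running inside the fresh activation (helper names are invented; machine_state_t is from the sketch above):

```c
typedef struct thread thread_t;   /* user-level thread descriptor */

/* Hypothetical helpers from the user-level thread package: */
thread_t *thread_of_activation(int activation_id);
void      save_thread_state(thread_t *t, const machine_state_t *state);
void      put_on_ready_list(thread_t *t);
thread_t *pop_ready_list(void);
void      run_user_thread(thread_t *t);        /* switches to t; does not return */
void      sa_can_be_reused(int activation_id); /* downcall to the kernel */

void on_processor_preempted(int old_activation, const machine_state_t *state) {
    /* 1. Pull the stopped thread's state out of the old activation
     *    and put the thread back on the ready list. */
    thread_t *t = thread_of_activation(old_activation);
    save_thread_state(t, state);
    put_on_ready_list(t);

    /* 2. The kernel will never resume the old activation directly;
     *    tell it the activation is free for reuse. */
    sa_can_be_reused(old_activation);

    /* 3. Pick the next thread for this processor according to the
     *    user-level scheduling policy. */
    run_user_thread(pop_ready_list());
}
```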
user-level thread system
- no need to tell the kernel about every thread operation, only about the ones that can affect the kernel’s processor allocation decisions
- notify the kernel whenever the address space transitions to a state where the number of runnable threads no longer matches the number of processors (sketched below, together with the kernel calls)
kernel calls (user -> kernel)
- functions:
- add more processors (additional # of processors): allocate more processors to this address space and start them running scheduler activations
- this processor is idle(): preempt this processor if another address space needs it
- drawback:
- applications may not be honest in reporting their parallelism to the OS
- solution: multi-level feedback to encourage honest reporting; the processor allocator favors address spaces that use fewer processors and penalizes those that use more
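The two kernel calls as illustrative C prototypes, plus a sketch of the notification rule from above (only cross into the kernel when runnable threads and processors fall out of balance); the counting helpers are hypothetical:

```c
/* The two downcalls (names follow the paper's description): */
void add_more_processors(int additional_processors);
void this_processor_is_idle(void);

/* Hypothetical bookkeeping inside the user-level thread system: */
int num_runnable_threads(void);
int num_processors(void);

/* Only notify the kernel on transitions where the counts diverge. */
void on_thread_became_runnable(void) {
    if (num_runnable_threads() > num_processors())
        add_more_processors(1);
}

void on_no_work_for_this_processor(void) {
    if (num_runnable_threads() < num_processors())
        this_processor_is_idle();   /* kernel may preempt and reassign it */
}
```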
critical sections
- a user-level thread could be running in a critical section at the time when it is blocked or preempted
- possible effects:
- poor performance, e.g. other threads keep spinning on an application-level lock held by the preempted thread
- deadlock, e.g. the preempted thread could be holding the ready-list lock while the upcall handler tries to place that same thread onto the ready list
- how to deal with these inopportune preemptions?
- prevention:
- avoid them through a scheduling and locking protocol between the kernel and the user level
- but has serious drawbacks:
- requires the kernel to yield control over processor allocation to the user-level, violating the semantics of priorities
- inconsistent with the efficient implementation of critical sections
- page faults in critical sections would have to be prevented by ‘pinning’ the pages to physical memory
- recovery (used):
- when an upcall informs the user level of a preemption, it checks whether the thread was running in a critical section; if so, the thread is continued temporarily via a user-level context switch
Implementation
- Topaz: kernel thread
- FastThreads: user-level thread package
processor allocation policy
- space-shares processors:
- processors are divided evenly among the highest-priority address spaces
- this reduces the number of processor reallocations
- processors are time-sliced only if the number of available processors is not an integer multiple of the number of address spaces that want them
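A toy illustration of the even-split-plus-remainder arithmetic for equal-priority address spaces (not the paper’s allocator code):

```c
#include <stdio.h>

int main(void) {
    int P = 10, S = 4;         /* processors, address spaces wanting them */
    int dedicated = P / S;     /* each space gets this many outright      */
    int sliced    = P % S;     /* only this remainder is time-sliced      */
    printf("each of %d spaces: %d dedicated CPUs; %d CPUs time-sliced\n",
           S, dedicated, sliced);
    return 0;
}
```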
thread scheduling
- the kernel has no knowledge of the application’s concurrency model or scheduling policy… (exokernel-style)
- default in FastThreads:
- per-processor ready lists, accessed by each processor in last-in-first-out order (improves cache locality)
- a processor scans the other ready lists for work if its own is empty
- this paper:
- hysteresis to avoid unnecessary processor reallocations
- an idle processor spins for a short period before notifying the kernel that it is available for reallocation (both policies are sketched below)
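A sketch combining the FastThreads defaults with the paper’s hysteresis; thread_t, the list helpers, and this_processor_is_idle() are the hypothetical names used in the earlier sketches:

```c
#define MAX_CPUS   64
#define SPIN_LIMIT 1000                     /* tuning knob for the hysteresis */

typedef struct thread thread_t;
typedef struct { thread_t *top; } lifo_t;   /* minimal LIFO ready-list stub */

thread_t *lifo_pop(lifo_t *l);              /* NULL if empty */
void      run_user_thread(thread_t *t);
void      this_processor_is_idle(void);

lifo_t ready_list[MAX_CPUS];                /* one ready list per processor */
int    num_cpus;

/* Own list first (LIFO order improves cache locality), then scan
 * the other processors' lists for work. */
thread_t *find_work(int cpu) {
    thread_t *t = lifo_pop(&ready_list[cpu]);
    for (int i = 0; !t && i < num_cpus; i++)
        t = lifo_pop(&ready_list[i]);
    return t;
}

/* Hysteresis: spin for a while before offering the processor back,
 * so brief idle periods don't trigger a reallocation. */
void idle_loop(int cpu) {
    for (int spins = 0; spins < SPIN_LIMIT; spins++) {
        thread_t *t = find_work(cpu);
        if (t) { run_user_thread(t); return; }
    }
    this_processor_is_idle();
}
```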
performance optimization
critical sections: the user-level thread system must be able to check whether a preempted thread was holding a lock
- option 1: set a flag when a thread enters a critical section, clear it on exit, and check the flag when the thread is continued after a preemption (but this imposes overhead on every lock acquisition and release!)
- option 2 (used): make a copy of every low-level critical section and, at the end of each copy, place code that yields the processor back to the resumer; when a preemption occurs, the upcall handler checks whether the thread was executing in a critical section and, if so, continues the thread at the corresponding place in the copy, so the thread finishes the critical section and then yields
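The paper ships the copy technique; for contrast, here is a sketch of the simpler flag variant from option 1 (all names hypothetical):

```c
typedef struct thread {
    volatile int in_critical;   /* set while inside a critical section */
    /* ... saved context, ready-list links, ... */
} thread_t;

/* Bracket every low-level critical section with these; this is the
 * per-acquire/per-release overhead the copy technique avoids. */
static inline void enter_critical(thread_t *self) { self->in_critical = 1; }
static inline void leave_critical(thread_t *self) { self->in_critical = 0; }

void put_on_ready_list(thread_t *t);
void context_switch_to(thread_t *t);   /* user-level context switch */

/* In the preemption upcall: if the stopped thread held the flag,
 * continue it temporarily so it can finish the critical section and
 * yield back; otherwise schedule it normally. */
void handle_preempted_thread(thread_t *stopped) {
    if (stopped->in_critical)
        context_switch_to(stopped);
    else
        put_on_ready_list(stopped);
}
```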
management of scheduler activations
- discarded scheduler activations can be cached for eventual reuse
- discarded activations can be collected and returned to the kernel in bulk, to reduce boundary crossings
benefit?
- removing this optimization from FastThreads raised the Null Fork time from 37 to 49 usec and the Signal-Wait time from 42 to 48 usec; Null Fork exercises more critical sections than Signal-Wait, hence the larger slowdown
Evaluation
overhead of user-level thread operations?
- incrementing/decrementing the number of busy threads and determining whether the kernel must be notified
- checking whether a preempted thread is being resumed
why is the cost of an upcall high?
- overhead of the scheduler activation machinery on making and completing an I/O request or a page fault:
- more state must be maintained
- much of Topaz is written in carefully tuned assembler, while this paper’s implementation is entirely in Modula-2+
overall effect on application performance?
- benchmark: an O(N log N) solution to the N-body problem
- kernel threads are inefficient because of spin locks
- when the application does I/O in the kernel, the original FastThreads performs poorly: when a user-level thread blocks in the kernel, the kernel thread serving as its virtual processor also blocks, so the physical processor is lost for the duration of the I/O