Some notes taken while reading the paper ‘Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism’.
Context for process management
abstraction
- process: CPU, memory (address space), disk (files)
- thread: a variable-speed virtual processor (CPU); motivation: a way to express multiple CPUs
challenges
- synchronization: locks
- schedule the threads over multi CPUs
- fairness, priority
- efficiency (minimize overhead)
Why process management?
- parallel computation, multi-core CPUs
- databases, web servers… (software)
Design
how to design the most efficient thread package for parallel programs?
- process + shared memory
- cons: too inefficient for general-purpose parallel programming; only handles coarse-grained parallelism well!
- kernel threads
- user level threads
how do kernel threads work
- state of a thread is maintained in the kernel
- problem: oblivious two-level scheduling with no visible revocation (the kernel schedules kernel threads without knowing what the user level is doing, and takes processors away without telling it)
limits of kernel threads?
- performance: boundary-crossing overhead; parameters have to be copied and checked in order to protect the kernel
- flexibility: user-level threads can be scheduled with whatever policy the application wants (exokernel-style)
- problems:
- blocking: when a user-level thread blocks in the kernel (e.g. on I/O or a page fault), the kernel thread serving as its virtual processor blocks too, and that physical processor is lost for the duration
- the kernel scheduler does not know the priorities of user-level threads, which can cause priority inversion
- locking: preempting a lock holder may delay other threads or even cause deadlock
desirable properties of the solution
- solution for user/kernel level thread scheduler
- performance:
- minimize the boundary crossings
- when a user-level thread is ready and a processor is available, the user-level thread must be able to run
virtual multi-processor abstraction
- the kernel may vary the number of processors assigned to an address space at runtime
- kernel has complete control over how many processors to give each address space
- user-level thread system has complete control over which threads to run on its allocated processors
- each virtual processor handed to an address space is backed by a physical processor
- revocation of virtual processors is visible: the kernel notifies the user level whenever a user-level thread blocks or wakes up in the kernel
- the user-level thread system notifies the kernel when the application needs more or fewer processors
so what is a scheduler activation?
a kernel abstraction of the mapping: virtual processor <-> user-level thread system
a scheduler activation serves 3 roles:
- serves as a vessel, or execution context, for running user-level threads
- notifies the user-level thread system of a kernel event
- provides space in the kernel for saving the processor context of the activation’s current user-level thread, when the thread is stopped by the kernel
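A minimal C sketch of the per-activation state the kernel might keep, mirroring the three roles; the struct and field names here are hypothetical, not the paper’s Topaz code:

```c
#include <stdint.h>

/* Hypothetical register-save area; real contents are machine-dependent. */
typedef struct { uintptr_t regs[32]; uintptr_t pc, sp; } cpu_state_t;

/* Illustrative per-activation kernel state (names are made up): */
struct scheduler_activation {
    int         id;            /* identifies this activation in upcalls        */
    int         processor;     /* physical processor currently backing it      */
    void       *user_stack;    /* role 1: execution context for user threads   */
    void       *kernel_stack;  /* used while the activation runs in the kernel */
    cpu_state_t saved_state;   /* role 3: processor context of the activation's
                                  current user-level thread, saved when the
                                  kernel stops that thread                      */
};
```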
what happens when a program starts?
- kernel:
- creates a scheduler activation
- assigns it to a processor
- upcalls into the app address space
- user-level thread system:
- receives the upcall
- uses that activation as the context in which to initialize itself and run main()
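A sketch of this start-of-program handshake; every function name below is hypothetical, chosen only to illustrate the control flow:

```c
struct address_space;            /* opaque kernel object */
struct scheduler_activation;

/* Hypothetical kernel helpers: */
struct scheduler_activation *create_activation(struct address_space *as);
int  pick_idle_processor(void);
void assign_to_processor(struct scheduler_activation *sa, int cpu);
void upcall_add_processor(struct scheduler_activation *sa); /* enters user space */

/* Kernel side: build the first activation, bind it to a processor,
 * and upcall into the application's address space. */
void kernel_start_program(struct address_space *as) {
    struct scheduler_activation *sa = create_activation(as);
    assign_to_processor(sa, pick_idle_processor());
    upcall_add_processor(sa);
}

/* Hypothetical user-level thread package entry points: */
void thread_system_init(void);
void run_main_thread(void);      /* creates the main thread and switches to it */

/* User side: the upcall is the thread system's first execution
 * context; it initializes itself there and then runs main(). */
void upcall_entry(void) {
    thread_system_init();
    run_main_thread();
}
```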
Scheduler Activation upcall points
- add this processor (processor #): execute a runnable user-level thread
- processor has been preempted (preempted activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the preempted scheduler activation
- scheduler activation has blocked (blocked activation #): the blocked scheduler activation is no longer using its processor
- scheduler activation has unblocked (unblocked activation # and its machine state): return to the ready list the user-level thread that was executing in the context of the blocked scheduler activation
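Rendered as C prototypes (the four events and their arguments are from the paper’s upcall table; this C spelling of them is mine):

```c
#include <stdint.h>

/* Machine-dependent register state delivered with some upcalls. */
typedef struct { uintptr_t regs[32]; uintptr_t pc, sp; } machine_state_t;

/* The four upcall points, one C prototype each: */
void add_this_processor(int processor);
void processor_has_been_preempted(int preempted_activation,
                                  const machine_state_t *state);
void sa_has_blocked(int blocked_activation);
void sa_has_unblocked(int unblocked_activation,
                      const machine_state_t *state);
```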
scheduler activation vs. kernel threads
scheduler activations:
- once an activation’s user-level thread is stopped by the kernel, the thread is never directly resumed by the kernel
- instead, a new scheduler activation is created to notify the user-level thread system that the thread has been stopped
- the user-level thread system removes the thread’s state from the old activation and tells the kernel that the old activation can be reused
- the user-level thread system then decides which thread to run on the processor (based on its own scheduling policy)
kernel threads:
- when the kernel stops a kernel thread, it never notifies the user level of the event
- the kernel later resumes the kernel thread directly, again without notification
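A sketch of the user-level side of this protocol: a hypothetical handler for the ‘processor has been preempted’ upcall, running inside the fresh activation (helper names are invented; machine_state_t is from the sketch above):

```c
typedef struct thread thread_t;   /* user-level thread descriptor */

/* Hypothetical helpers from the user-level thread package: */
thread_t *thread_of_activation(int activation_id);
void      save_thread_state(thread_t *t, const machine_state_t *state);
void      put_on_ready_list(thread_t *t);
thread_t *pop_ready_list(void);
void      run_user_thread(thread_t *t);        /* switches to t; does not return */
void      sa_can_be_reused(int activation_id); /* downcall to the kernel */

void on_processor_preempted(int old_activation, const machine_state_t *state) {
    /* 1. Pull the stopped thread's state out of the old activation
     *    and put the thread back on the ready list. */
    thread_t *t = thread_of_activation(old_activation);
    save_thread_state(t, state);
    put_on_ready_list(t);

    /* 2. The kernel will never resume the old activation directly;
     *    tell it the activation is free for reuse. */
    sa_can_be_reused(old_activation);

    /* 3. Pick the next thread for this processor according to the
     *    user-level scheduling policy. */
    run_user_thread(pop_ready_list());
}
```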
user-level thread system
- no need to tell the kernel about every thread operation, only about the ones that can affect the kernel’s processor allocation decisions
- notify the kernel whenever the address space transitions to a state where the number of runnable threads no longer matches the number of processors (sketched below, together with the kernel calls)
kernel calls (user -> kernel)
- functions:
- add more processors (additional # of processors): allocate more processors to this address space and start them running scheduler activations
- this processor is idle(): preempt this processor if another address space needs it
- drawback:
- applications may not be honest in reporting their parallelism to the OS
- solution: multi-level feedback to encourage honest reporting; the processor allocator favors address spaces that use fewer processors and penalizes those that use more
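The two kernel calls as illustrative C prototypes, plus a sketch of the notification rule from above (only cross into the kernel when runnable threads and processors fall out of balance); the counting helpers are hypothetical:

```c
/* The two downcalls (names follow the paper's description): */
void add_more_processors(int additional_processors);
void this_processor_is_idle(void);

/* Hypothetical bookkeeping inside the user-level thread system: */
int num_runnable_threads(void);
int num_processors(void);

/* Only notify the kernel on transitions where the counts diverge. */
void on_thread_became_runnable(void) {
    if (num_runnable_threads() > num_processors())
        add_more_processors(1);
}

void on_no_work_for_this_processor(void) {
    if (num_runnable_threads() < num_processors())
        this_processor_is_idle();   /* kernel may preempt and reassign it */
}
```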
critical sections
- a user-level thread could be running in a critical section at the time when it is blocked or preempted
- possible effects:
- poor performance, e.g. other threads keep spinning on an application-level lock held by the preempted thread
- deadlock, e.g. the preempted thread could be holding the ready-list lock while the upcall handler tries to place that same thread onto the ready list
- how to deal with these inopportune preemptions?
- prevention:
- avoid them through a scheduling and locking protocol between the kernel and the user level
- but has serious drawbacks:
- requires the kernel to yield control over processor allocation to the user-level, violating the semantics of priorities
- inconsistent with the efficient implementation of critical sections
- page faults in critical sections would have to be prevented by ‘pinning’ the pages to physical memory
- recovery (used):
- when an upcall informs the user level of a preemption, it checks whether the thread was running in a critical section; if so, the thread is continued temporarily via a user-level context switch
Implementation
- Topaz: kernel thread
- FastThreads: user-level thread package
processor allocation policy
- space-shares processors:
- processors are divided evenly among the highest-priority address spaces
- this reduces the number of processor reallocations
- processors are time-sliced only if the number of available processors is not an integer multiple of the number of address spaces that want them
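A toy illustration of the even-split-plus-remainder arithmetic for equal-priority address spaces (not the paper’s allocator code):

```c
#include <stdio.h>

int main(void) {
    int P = 10, S = 4;         /* processors, address spaces wanting them */
    int dedicated = P / S;     /* each space gets this many outright      */
    int sliced    = P % S;     /* only this remainder is time-sliced      */
    printf("each of %d spaces: %d dedicated CPUs; %d CPUs time-sliced\n",
           S, dedicated, sliced);
    return 0;
}
```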
thread scheduling
- the kernel has no knowledge of the application’s concurrency model or scheduling policy… (exokernel-style)
- default in FastThreads:
- per-processor ready lists, accessed by each processor in last-in-first-out order (improves cache locality)
- a processor scans the other ready lists for work if its own is empty
- this paper:
- hysteresis to avoid unnecessary processor reallocations
- an idle processor spins for a short period before notifying the kernel that it is available for reallocation (both policies are sketched below)
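A sketch combining the FastThreads defaults with the paper’s hysteresis; thread_t, the list helpers, and this_processor_is_idle() are the hypothetical names used in the earlier sketches:

```c
#define MAX_CPUS   64
#define SPIN_LIMIT 1000                     /* tuning knob for the hysteresis */

typedef struct thread thread_t;
typedef struct { thread_t *top; } lifo_t;   /* minimal LIFO ready-list stub */

thread_t *lifo_pop(lifo_t *l);              /* NULL if empty */
void      run_user_thread(thread_t *t);
void      this_processor_is_idle(void);

lifo_t ready_list[MAX_CPUS];                /* one ready list per processor */
int    num_cpus;

/* Own list first (LIFO order improves cache locality), then scan
 * the other processors' lists for work. */
thread_t *find_work(int cpu) {
    thread_t *t = lifo_pop(&ready_list[cpu]);
    for (int i = 0; !t && i < num_cpus; i++)
        t = lifo_pop(&ready_list[i]);
    return t;
}

/* Hysteresis: spin for a while before offering the processor back,
 * so brief idle periods don't trigger a reallocation. */
void idle_loop(int cpu) {
    for (int spins = 0; spins < SPIN_LIMIT; spins++) {
        thread_t *t = find_work(cpu);
        if (t) { run_user_thread(t); return; }
    }
    this_processor_is_idle();
}
```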
performance optimization
critical sections: the user-level thread system must be able to check whether a preempted thread was holding a lock
- option 1: set a flag when a thread enters a critical section, clear it on exit, and check the flag when the thread is continued after a preemption (but this imposes overhead on every lock acquisition and release!)
- option 2 (used): make a copy of every low-level critical section and, at the end of each copy, place code that yields the processor back to the resumer; when a preemption occurs, the upcall handler checks whether the thread was executing in a critical section and, if so, continues the thread at the corresponding place in the copy, so the thread finishes the critical section and then yields
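The paper ships the copy technique; for contrast, here is a sketch of the simpler flag variant from option 1 (all names hypothetical):

```c
typedef struct thread {
    volatile int in_critical;   /* set while inside a critical section */
    /* ... saved context, ready-list links, ... */
} thread_t;

/* Bracket every low-level critical section with these; this is the
 * per-acquire/per-release overhead the copy technique avoids. */
static inline void enter_critical(thread_t *self) { self->in_critical = 1; }
static inline void leave_critical(thread_t *self) { self->in_critical = 0; }

void put_on_ready_list(thread_t *t);
void context_switch_to(thread_t *t);   /* user-level context switch */

/* In the preemption upcall: if the stopped thread held the flag,
 * continue it temporarily so it can finish the critical section and
 * yield back; otherwise schedule it normally. */
void handle_preempted_thread(thread_t *stopped) {
    if (stopped->in_critical)
        context_switch_to(stopped);
    else
        put_on_ready_list(stopped);
}
```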
management of scheduler activations
- discarded scheduler activations can be cached for eventual reuse
- discarded activations can be collected and returned to the kernel in bulk, to reduce boundary crossings
benefit?
- removing this optimization from FastThreads raised the Null Fork time from 37 to 49 usec and the Signal-Wait time from 42 to 48 usec; Null Fork exercises more critical sections than Signal-Wait, hence the larger slowdown
Evaluation
overhead of user-level thread operations?
- incrementing/decrementing the number of busy threads and determining whether the kernel must be notified
- checking whether a preempted thread is being resumed
why is the cost of an upcall high?
- overhead of the scheduler activation machinery on making and completing an I/O request or a page fault:
- more state must be maintained
- much of Topaz is written in carefully tuned assembler, while this paper’s implementation is entirely in Modula-2+
overall effect on application performance?
- benchmark: an O(N log N) solution to the N-body problem
- kernel threads are inefficient because of spin locks
- when the application does I/O in the kernel, the original FastThreads performs poorly: when a user-level thread blocks in the kernel, the kernel thread serving as its virtual processor also blocks, so the physical processor is lost for the duration of the I/O