Scheduler Activations - my notes
Some notes taken from reading the paper ‘Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism’.
Context for process management
abstraction
- process: CPU, mem(address space), disk(files)
- thread: a variable-speed virtual processor (CPU); motivation: model multiple CPUs
challenges
- synchronization: locks
- schedule the threads over multi CPUs
- fairness, priority
- efficiency (minimize overhead)
Why process management?
- parallel computation, multi-core CPUs
- databases, web servers… (software)
Design
how to design the most efficient thread package for parallel programs?
- process + shared memory
- cons: too inefficient for general-purpose parallel programming! only handles coarse-grained parallelism well!
- kernel threads
- user level threads
how do kernel threads work
- state of a thread is maintained in the kernel
- problem: oblivious 2-level scheduling with no visible revocation
limits of kernel threads?
- performance: boundary-crossing overhead; the kernel has to copy and check parameters to protect itself
- flexibility: ex: user-level threads could be scheduled with different policies (like the exokernel)
- problems:
- blocking: when a user-level thread blocks in the kernel, its kernel thread (and that processor) is lost for the duration
- the kernel scheduler does not know the priorities of user-level threads, which may cause priority inversion
- locking: preempting a lock holder may delay other threads or cause deadlock
desirable properties of the solution
- solution for user/kernel level thread scheduler
- performance:
- minimize the boundary crossings
- when a user-level thread is ready and a processor is available, the user-level thread must be able to run
virtual multi-processor abstraction
- kernel may change the number of processors given to an address space at runtime
- kernel has complete control over how many processors to give each address space
- user-level thread system has complete control over which threads to run on its allocated processors
- each virtual processor is guaranteed to be backed by a physical processor
- visible revocation of virtual processors, notifies user whenever a user-level thread blocks or wakes up in the kernel
- user-level thread system notifies kernel when app. needs more or fewer processors
so what is a scheduler activation?
a kernel abstraction of mapping: virtual processor <-> user-level thread system
a scheduler activation serves 3 roles:
- serves as a vessel, or execution context, for running user-level threads
- notifies the user-level thread system of a kernel event
- provides space in the kernel for saving the processor context of the activation’s current user-level thread, when the thread is stopped by the kernel
what happens when a program starts?
- kernel:
- creates a scheduler activation
- assigns it to a processor
- upcalls into the app address space
- user-level thread system:
- receives the upcall
- uses that activation as the context in which to initialize itself and run main()
Scheduler Activation upcall points
- add this processor(processor#): execute a runnable user-level thread
- processor has been preempted(preempted activation#, its machine state): return to the ready list the user-level thread that was executing in the context of the preempted scheduler activation
- scheduler activation has blocked(blocked activation#): the blocked scheduler activation is no longer using its processor
- scheduler activation has unblocked(unblocked activation#, its machine state): return to the ready list the user-level thread that was executing in the context of the blocked scheduler activation
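A rough C sketch of what these four upcall entry points might look like; all names, types, and the helpers `thread_of_activation`/`run_thread` are my own, not the paper's real interface, locking of the ready list is omitted, and the real unblocked upcall also carries the context of the thread that was preempted to deliver it:

```c
typedef struct machine_state { unsigned long pc, sp, regs[30]; } machine_state;

typedef struct thread {
    machine_state state;
    struct thread *next;
} thread;

static thread *ready_list;   /* global user-level ready list (no locking here) */

static void push_ready(thread *t) { t->next = ready_list; ready_list = t; }

static thread *pop_ready(void) {
    thread *t = ready_list;
    if (t) ready_list = t->next;
    return t;
}

/* assumed helpers: map an activation id to the thread it was carrying,
 * and switch this processor to a given user-level thread */
extern thread *thread_of_activation(int activation_id);
extern void run_thread(thread *t);            /* does not return */
extern void sys_processor_is_idle(void);      /* downcall, see 'kernel call' below */

/* "add this processor(processor#)": pick and run a runnable thread */
void upcall_add_processor(int processor_id) {
    (void)processor_id;
    thread *t = pop_ready();
    if (t) run_thread(t);
    sys_processor_is_idle();   /* nothing to run: offer the processor back */
}

/* "processor has been preempted": the victim's thread goes back on the
 * ready list; note this upcall arrives on a fresh activation */
void upcall_processor_preempted(int preempted_id, machine_state *st) {
    thread *t = thread_of_activation(preempted_id);
    t->state = *st;
    push_ready(t);
    upcall_add_processor(0);
}

/* "scheduler activation has blocked": its processor is free again */
void upcall_activation_blocked(int blocked_id) {
    (void)blocked_id;
    upcall_add_processor(0);
}

/* "scheduler activation has unblocked": its thread is runnable again */
void upcall_activation_unblocked(int unblocked_id, machine_state *st) {
    thread *t = thread_of_activation(unblocked_id);
    t->state = *st;
    push_ready(t);
    upcall_add_processor(0);
}
```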
scheduler activation vs. kernel threads
- scheduler activation:
- once an activation’s user-level thread is stopped by kernel, the thread is never directly resumed by the kernel.
- a new scheduler activation is created to notify the user-level thread system that the thread has been stopped
- the user-level system removes the state of the thread from the old activation, and tells kernel that the old activation can be reused
- the user-level system decides which thread to run on the processor(based on scheduling)
- kernel threads:
- when kernel stops a kernel thread, the kernel never notifies the user level of the event
- kernel directly resumes the kernel thread without notification
user-level thread system
- no need to tell the kernel about every thread operation, only about the ops. that can affect kernel’s processor allocation decision.
- notifies the kernel whenever it makes a transition to a state where #processors != #runnable threads
kernel call(user -> kernel)
- functions:
- add more processors(additional #processors): allocate more processors to this address space and start them running scheduler activations
- this processor is idle(): preempt this processor if another address space needs it
- drawback:
- applications may not be honest in reporting their parallelism to the OS
- solution: multi-level feedback to encourage apps to provide honest info; the processor allocator can favor address spaces that use fewer processors and penalize those that use more
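A minimal sketch of the two downcalls and the notification rule from the previous section; the syscall names and counters are mine, not Topaz's actual interface:

```c
/* assumed kernel entry points for the two downcalls */
extern void sys_add_more_processors(int additional);
extern void sys_processor_is_idle(void);

static int num_processors;   /* processors currently allocated to this address space */
static int num_runnable;     /* currently runnable user-level threads */

/* called after any thread operation that changes runnability: the kernel
 * only hears about transitions where #processors != #runnable threads,
 * i.e. the ones that can affect its allocation decision */
void maybe_notify_kernel(void) {
    if (num_runnable > num_processors)
        sys_add_more_processors(num_runnable - num_processors);
    else if (num_runnable < num_processors)
        sys_processor_is_idle();   /* this processor may be preempted for others */
}
```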
critical sections
- a user-level thread could be running in a critical section at the time when it is blocked or preempted
- possible effects:
- poor performance, ex: other threads continue to spin on an app-level lock held by the preempted thread
- deadlock, ex: the preempted thread could be holding the ready-list lock while the upcall attempts to place that same thread onto the ready list
- how to deal with these inopportune preemptions?
- prevention:
- avoid through the use of a scheduling and locking protocol between the kernel and the user-level
- but has serious drawbacks:
- requires the kernel to yield control over processor allocation to the user-level, violating the semantics of priorities
- inconsistent with the efficient implementation of critical sections
- on a page fault, it requires ‘pinning’ the relevant code and data to physical memory
- recovery(used):
- when an upcall informs the user level, the user level checks whether the affected thread was running in a critical section; if so, the thread is continued temporarily via a user-level context switch, and it relinquishes control back to the upcall once it exits the critical section
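A sketch of this recovery path, assuming a hypothetical `in_critical_section` test (one way to implement it is the copied-code trick under ‘performance optimization’ below) and a user-level context-switch primitive:

```c
#include <stdbool.h>

typedef struct machine_state machine_state;   /* saved registers of the thread */

/* hypothetical helpers */
extern bool in_critical_section(const machine_state *st);
extern void continue_thread_until_exit(machine_state *st); /* user-level context
    switch; returns once the thread leaves its critical section */
extern void push_ready_from_state(machine_state *st);
extern void run_next_thread(void);

/* called from the preemption/unblock upcalls */
void recover(machine_state *preempted) {
    if (in_critical_section(preempted)) {
        /* temporarily resume the lock holder so it can release the lock */
        continue_thread_until_exit(preempted);
    }
    push_ready_from_state(preempted);  /* now safe: no user-level locks held */
    run_next_thread();
}
```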
Implementation
- Topaz: kernel thread
- FastThreads: user-level thread package
processor allocation policy
- space-shares processors:
- processors are divided evenly among the highest-priority address spaces
- reduces the number of processor reallocations
- processors are time-sliced only if the number of available processors is not an integer multiple of the number of address spaces that want them
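A toy worked example of that rule (numbers mine): with P processors and n equally high-priority address spaces, each space gets P/n dedicated processors, and only the remaining P mod n processors ever need to be time-sliced:

```c
#include <stdio.h>

int main(void) {
    int P = 6, n = 4;                              /* 6 processors, 4 address spaces */
    printf("dedicated per space: %d\n", P / n);    /* 1 each */
    printf("time-sliced:         %d\n", P % n);    /* 2 rotate among the spaces */
    return 0;
}
```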
thread scheduling
- kernel has no knowledge of app’s concurrency model or scheduling policy…(like exokernel style)
- default in FastThreads:
- per-processor ready lists, accessed by each processor in last-in-first-out order (improves cache locality)
- a processor scans for work if its own ready list is empty
- this paper:
- hysteresis to avoid unnecessary processor reallocations
- an idle processor spins for a short period before notifying the kernel that it is available for reallocation
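A sketch of this default policy plus the hysteresis; `NPROC`, `SPIN_ITERS`, the list layout, and `sys_processor_is_idle` are hypothetical, and locking of the ready lists is omitted:

```c
#include <stddef.h>

#define NPROC 8
#define SPIN_ITERS 1000

typedef struct thread { struct thread *next; } thread;

static thread *ready[NPROC];   /* one LIFO ready list per processor */

extern void sys_processor_is_idle(void);   /* downcall from earlier */

static thread *pop(int cpu) {
    thread *t = ready[cpu];
    if (t) ready[cpu] = t->next;
    return t;
}

/* LIFO pop from our own list first (the most recently run thread is likely
 * still warm in this processor's cache), then scan the other lists */
static thread *get_work(int cpu) {
    thread *t = pop(cpu);
    for (int i = 0; !t && i < NPROC; i++)
        if (i != cpu) t = pop(i);
    return t;
}

/* spin briefly before telling the kernel we are idle, so a burst of new
 * work does not trigger a pointless processor reallocation */
thread *find_work(int cpu) {
    for (int spin = 0; spin < SPIN_ITERS; spin++) {
        thread *t = get_work(cpu);
        if (t) return t;
    }
    sys_processor_is_idle();   /* kernel may now reallocate this processor */
    return NULL;
}
```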
performance optimization
- critical sections: the user-level system must be able to check whether a preempted thread was holding a lock
- we could set a flag when a thread enters a critical section, clear the flag when it leaves, and check the flag when the thread is continued (but this imposes overhead on every lock acquisition and release!)
- instead, make a copy of every low-level critical section and place code at the end of each copy that yields the processor back to the resumer. When a preemption occurs, the upcall checks whether the thread is in one of the critical sections; if so, it continues the thread at the corresponding place in the copy of that critical section
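A sketch of how the upcall could map a preempted program counter into the copy of its critical section; the section table and its symbols are my own illustration of the technique:

```c
#include <stdint.h>

/* one entry per low-level critical section: its original code range and the
 * start of its copy (whose last instruction yields back to the resumer) */
typedef struct { uintptr_t start, end, copy_start; } crit_section;

extern crit_section sections[];
extern int nsections;

/* if the preempted PC lies inside a critical section, return the
 * corresponding address in the copy; otherwise return the PC unchanged */
uintptr_t resume_pc(uintptr_t pc) {
    for (int i = 0; i < nsections; i++) {
        const crit_section *s = &sections[i];
        if (pc >= s->start && pc < s->end)
            return s->copy_start + (pc - s->start);
    }
    return pc;   /* not in a critical section: no temporary continuation */
}
```

The common path (lock acquire/release) pays nothing; the cost is moved entirely to the rare preemption path and some code-size growth.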
- management of scheduler activations
- discarded scheduler activations can be cached for eventual reuse
- discarded SAs can be collected and returned to the kernel in bulk to reduce boundary crossings
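A small sketch of the cache-and-bulk-return idea; the `sys_return_activations` downcall, `BATCH` size, and list layout are hypothetical:

```c
#define BATCH 16

typedef struct activation { int id; struct activation *next; } activation;

static activation *free_list;   /* discarded activations cached for reuse */
static int nfree;

extern void sys_return_activations(int ids[], int n);  /* assumed downcall */

/* reuse a cached activation if one is available */
activation *reuse_activation(void) {
    activation *a = free_list;
    if (a) { free_list = a->next; nfree--; }
    return a;
}

/* cache a discarded activation; once enough pile up, tell the kernel
 * about all of them in one boundary crossing */
void discard_activation(activation *a) {
    a->next = free_list;
    free_list = a;
    if (++nfree >= BATCH) {
        int ids[BATCH], n = 0;
        while (free_list && n < BATCH) {
            ids[n++] = free_list->id;
            free_list = free_list->next;
        }
        nfree -= n;
        sys_return_activations(ids, n);
    }
}
```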
- benefit?
- removing this optimization from FastThreads yielded a Null Fork time of 49 usec (vs. 37 usec with it) and a Signal-Wait time of 48 usec (vs. 42 usec with it); the Null Fork benchmark exercises more critical sections than Signal-Wait, hence the larger difference
Evaluation
overhead of user-level thread operations?
- incrementing & decrementing the number of busy threads and determining whether the kernel must be notified
- checking whether a preempted thread is being resumed
why is the cost of an upcall high?
- the scheduler activation machinery adds overhead to making and completing an I/O request or a page fault:
- we must maintain more state
- much of Topaz is written in carefully tuned assembler, this paper’s implementation is entirely in Modula-2+
overall effect on the performance of applications?
- benchmark: an O(N log N) solution to the N-body problem
- kernel threads are inefficient because of spin locks
- when the application does I/O in the kernel, the original FastThreads performs poorly: when a user-level thread blocks in the kernel, the kernel thread serving as its virtual processor also blocks, so the address space loses that physical processor for the duration of the I/O
Author xymeow
LastMod 2018-10-09