Linux CPU Scheduling Run-Time System
The goal of SILK's scheduling framework is to generalize scheduling along two dimensions: those of the resource scheduled (e.g., CPU cycles or network bits) and the discipline used to schedule it. The framework is designed to be "dropped in" to any context where such scheduling is needed (e.g., application, OS kernel, device driver). Multiple schedulers are provided, including round-robin, priority, and several proportional sharing disciplines; new disciplines can be created by writing a self-contained object and plugging it into the framework. This framework is currently used to provide CPU isolation between slices on PlanetLab.
An actual resource scheduler built using the scheduling framework consists of two pieces:
As a concrete example, the above picture shows how our framework is used to schedule Linux tasks on PlanetLab. On the right is a hierarchy of objects. The HWResource object represents the CPU and it is scheduled using the Rate-Controlled Proportional Share (RCPS) Scheduler. For each slice on PlanetLab, a Container is created, given some number of shares, and bound to the CPU HWResource. Each Linux task running in a slice has its own Activity that is bound to the slice's Container. In the center, the CPU scheduling Run-Time System gets the next Activity from the HWResource, which makes the decision based on the shares and CPU consumption of all the Containers. Finally, the RTS sets the priority of the Linux task bound to that Activity so that the Linux scheduler will choose to run that task next. Below, we discuss each of these components in more detail.
More information on the object interfaces can be found in the Programmer's Manual (in progress).
The CPU scheduling run-time system (RTS) sits between the object framework and the PlanetLab Linux kernel. It performs the following functions:
The core of the RTS is the scheduling loop. At initialization, the RTS forks a kernel thread and sets this thread to the highest possible system priority (in Linux, this is Real-Time priority 99). This thread then enters a loop in which it calls the object framework to select the next Activity; puts the task associated with this Activity on the run queue and sets its priority to the next-highest system priority; and yields. In Linux, a yielding task is not eligible to run for one scheduling decision, and therefore the Linux scheduler chooses the task with the next-highest priority to run. After the Linux task has run for its timeslice, the RTS kernel thread runs again since it is now eligible and has the highest priority. The thread now scans all tasks bound to Activities to see if any have changed status, and looks for any new tasks that should be bound to an existing slice. All tasks managed by the RTS are pulled from the run queue, so the queue only contains unmanaged tasks (see below) and the task that the RTS wants to run. Finally, the RTS updates the accounting information for the Container of the Activity that just ran; and performs the next iteration of the loop.
There is always at least one Container in the system, the default system Container. All tasks not bound to a slice's Container are assigned to this default Container, and are managed differently by the RTS. Tasks bound to a standard Container are pulled from the run queue and only reinserted (with high priorities) when the RTS wants to run them. In contrast, tasks assigned to the system Container are left on the run queue whenever they are runnable. When the Scheduler decides to allow a task from the system Container to run, the RTS simply yields to Linux without setting any task to high priority. This allows the Linux CPU scheduler to choose the next task to run from the set of tasks on the queue based on its own policy. Note that, since the decision to run a task from the system Container is based on this Container's share, isolation is still preserved between these tasks and those bound to other Containers.
The RTS introduces a number of sources of overhead into the scheduling decision. First, the number of context-switches in the system doubles, since Linux must run the RTS kernel thread in order to make the "real" scheduling decision, and then context-switch to the selected task. Second, since the RTS is not directly informed when a task changes status (e.g., when wake_up() is called), it must scan the tasks to see if any have changed status. Currently, all tasks are inspected, but maybe it is only necessary to inspect blocked tasks to see if they have awakened. Third, the scheduling decision performed by the object framework takes some time. On the other hand, the RTS pulls Linux tasks that it is managing off of Linux's run queue, and only reinserts a task when it wants Linux to run it. This means that less time is spent in the Linux scheduler, which partially offsets the additional overhead introduced by the RTS. Quantifying the RTS overhead is future work.
Following are the limitations of the scheduling framework as currently implemented.
Hierarchical scheduling is not yet supported. Containers are bound to HWResources and Activities bound to Containers. The selected Scheduler chooses the next Container based on its internal policy, and an Activity to run is chosen from this container based on a simple round-robin algorithm. This scheme provides only a flat scheduling space, and does not allow Containers or HWResources to be nested within other Containers.
Multiple Containers cannot be created within a slice. This is related to the "no hierarchical scheduling" limitation mentioned above. In PlanetLab's CPU scheduling RTS, a Container maps to a slice and an Activity maps to a task. Therefore, due to the flat scheduling space, no CPU isolation is provided between tasks bound to the same slice in PlanetLab. This could be fixed without scheduling hierarchies by allowing a slice to create multiple Containers, divide up its available shares between these Containers, and explicitly bind individual tasks to a Container.