Courses/Computer Science/CPSC 457.F2014/Lecture Notes/LinuxSched

Linux Process Scheduling, Context Switching, and Timing

This session is partly a continuation of the previous session.

In this session, we will finish up our consideration of the main scheduling topic by looking at the Linux 2.6 and 2.4 scheduler code, paying special attention to some helper routines (like those calculating priority or those performing the context switch).

  • scheduler_tick (updating time slice of a "SCHED_NORMAL" process)
  • schedule() (actually perform the selection of the next process)
  • goodness() (in 2.4)
  • recalc_task_prio
  • context_switch, switch_to

Along the way, we will more fully consider some of the support available for timing and the quantum or slice (this is related to our Measurement theme). Timing is an important topic in operating systems because a reliable signal source is needed to synchronize and kickstart various procedures. We will consider the hardware support necessary for this functionality to exist.

Focus Question: How can we understand what the conditions are for the scheduler to execute? This is a subtopic of kernel control flow (which isn't like the traditional sequential control flow of many userland programs).


  • finish overview of schedule() routines
  • overview of scheduling helper routines
  • Clock Hardware
  • Clock Software
  • Kernel support for time management


We started the class by highlighting two major principles or themes:

  • Policy vs. Mechanism (an expression of configuration vs. functionality)
  • The idea of kernel control flow as many asynchronous control paths enabled by preemption and a reliable clock in the CPU. This is yet another illustration of the close relationship between the hardware and the OS

We then reviewed the main code of both the 2.4 and the 2.6 scheduler.

In 2.6, the scheduling call chain begins with a call to schedule(). This function is invoked from a number of places in the kernel whenever a scheduling decision has to be made. Typically this happens after the kernel checks a flag indicating that a reschedule is needed, for example because process priorities have changed enough to warrant it. The flag is usually checked when returning from a system call or after servicing an interrupt: we (the kernel) are already executing, so let's take advantage and determine if we should let another process run.
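
The "check a flag on the way back to userland" idea can be sketched in a few lines of C. This is only a toy model and not kernel code: the names toy_task, toy_scheduler_tick, toy_schedule, and toy_return_to_user are hypothetical, and the sketch assumes the flag is raised by an expired time slice (one of several reasons the real kernel might set it).

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-task scheduling state. */
struct toy_task {
    int  time_slice;    /* ticks left in the current quantum */
    bool need_resched;  /* set when a scheduling decision is due */
};

static int toy_schedule_calls;  /* counts how often the toy scheduler ran */

/* Timer-interrupt path: charge one tick and request a reschedule
 * once the slice is used up. Note that it does NOT switch tasks itself. */
static void toy_scheduler_tick(struct toy_task *t)
{
    if (t->time_slice > 0 && --t->time_slice == 0)
        t->need_resched = true;
}

/* Stand-in for schedule(): clear the flag and hand out a fresh quantum. */
static void toy_schedule(struct toy_task *t)
{
    toy_schedule_calls++;
    t->need_resched = false;
    t->time_slice = 3;  /* arbitrary quantum chosen for the sketch */
}

/* Exit path from a system call or interrupt: the kernel is already
 * running, so this is a cheap place to test the flag. */
static void toy_return_to_user(struct toy_task *t)
{
    if (t->need_resched)
        toy_schedule(t);
}
```

The point of the split is that the tick handler only records that a decision is needed; the decision itself is deferred to a safe point on the way out of the kernel.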

The main "work" is to select the next task via a call (at line 5453) to the routine pick_next_task; this fairly simple function picks the highest priority task on the system from the runqueue priority buckets. The execution time of this function is bounded by a constant amount of time - it does not depend on how many processes are actually runnable or extant on the system. In the case of the CFS, pick_next_task delegates to the pick_next_task_fair routine.
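
To see why picking from priority buckets can take constant time, here is a minimal sketch that assumes a bitmap with one bit per bucket and a "find first set bit" operation. The names toy_runqueue, toy_enqueue, and toy_pick_next are hypothetical; the sketch uses 64 levels so that a single word suffices (the 2.6 O(1) scheduler has 140), keeps only one task per bucket, and relies on the GCC/Clang builtin __builtin_ctzll.

```c
#include <stdint.h>

#define TOY_NPRIO 64  /* assumption: 64 levels so one 64-bit word is enough */

struct toy_runqueue {
    uint64_t bitmap;            /* bit p is set <=> bucket p is non-empty */
    int      head[TOY_NPRIO];   /* toy simplification: one task id per bucket */
};

/* Place a task in its bucket and mark the bucket as non-empty. */
static void toy_enqueue(struct toy_runqueue *rq, int prio, int task_id)
{
    rq->head[prio] = task_id;
    rq->bitmap |= (uint64_t)1 << prio;
}

/* Return the task in the best (lowest-numbered) non-empty bucket, or -1
 * if nothing is runnable. The cost does not depend on the number of tasks. */
static int toy_pick_next(const struct toy_runqueue *rq)
{
    if (rq->bitmap == 0)
        return -1;
    int prio = __builtin_ctzll(rq->bitmap);  /* index of the lowest set bit */
    return rq->head[prio];
}
```

As in the kernel, a numerically lower priority value is treated as "more urgent" here, so the lowest set bit identifies the bucket to serve.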

Once the next task is selected, the scheduler has to initiate the actual context switch to the next process by calling the context_switch function. This function relies on the switch_to assembly macro to actually swap in the saved CPU context from the about-to-be-executed process (saved from the time it was previously taken off the CPU).

Understanding Context Switching

To understand context switching, we need to first take a closer look at two things: (1) exactly how the kernel keeps track of each user-level process (this involves more than the task struct) and (2) understanding how the kernel transitions back to userland execution (in particular, how it retrieves the full CPU context for each process --- note this is related to one of your HW1 questions about thread->ip).

The implementation of the CFS scheduler is described in the kernel documentation.

Data Structures

Where is CPU context? task_struct, thread_info, and thread_struct

We previously learned that the kernel keeps track of processes via the task_struct data structure. While this is true, it is not the whole truth. The kernel keeps two additional pieces of state around: the thread_info structure and a kernel-resident "stack" for each process. Both the thread_info structure and the kernel stack for each process are part of a union that is kept in the same two pages of kernel memory. The first field in the thread_info structure points to the task_struct (i.e., PCB) for the process. It is easy to locate the thread_info structure (and, through it, the task_struct) simply by referring to the current stack pointer register, because it points within the 2-page "kernel stack" memory region for the current process. This relationship comes in handy when the kernel provides a convenient symbol for referring to the 'currently' executing process (in process context). The current macro is defined in terms of a call to the current_thread_info() function:

#define get_current() (current_thread_info()->task)
#define current get_current()

The current_thread_info() routine returns a reference to the current (calculated via ESP) thread_info structure:

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

Note that 'current_stack_pointer' is simply an inline assembly reference to the ESP register.

Note that THREAD_SIZE is defined as PAGE_SIZE shifted left by one bit (i.e., multiplied by 2); for the usual 4KB PAGE_SIZE, this means that THREAD_SIZE is 8KB. Subtracting 1 leaves 8191 (0x1FFF). Flipping those bits and bitwise ANDing the result with the current stack pointer clears the low 13 bits, which leaves the lowest address in the "current" kernel stack. This is where the thread_info structure (and its first field, 'task') lives, so 'get_current()' needs only to dereference the first field of this structure (named 'task', a pointer to the task_struct)!
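
The masking arithmetic is easy to try out in userland. The sketch below is not the kernel's code: toy_thread_info_base and the TOY_ constants are hypothetical names, it assumes a 4KB page, and it works on a plain integer rather than the real ESP register.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE   4096u
#define TOY_THREAD_SIZE (TOY_PAGE_SIZE << 1)  /* two pages: 8192 bytes */

/* Given any address inside an 8KB-aligned kernel stack region, clear the
 * low 13 bits to obtain the base of the region, which is where the
 * thread_info structure would sit. */
static uintptr_t toy_thread_info_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)TOY_THREAD_SIZE - 1);
}
```

For example, a stack pointer of 0xC12345F0 and one of 0xC1235FFF both map to 0xC1234000, because they fall inside the same 8KB region; 0xC1236000 is already aligned and maps to itself.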

The task_struct also has a field named 'thread' of type thread_struct. We previously saw 'thread' and its 'ip' field in HW1. This is not the userland value of the EIP register, however. Rather, the kernel keeps the actual CPU context split across two places: partly in this 'thread_struct' structure and partly (the EIP and general purpose registers) saved on the kernel stack for that process.

This context information is also loaded into the pt_regs structure to help service ptrace(2) requests (this is the so-called "USER" area; a reflection of the kernel stack's metadata about the CPU context). See the implementation details in the x86-specific parts of ptrace.

There is yet another tempting place to look for CPU context: the cpuinfo_x86 structure. However, this structure is really metadata about each physical CPU's capabilities, and not the place to store user context (as noted above, that information is stored on the kernel stack for each process). Finally, the x86_hw_tss is a structure that has CPU context, but it exists only for the purpose of filling in a particular GDT entry.


The schedule() function is the entry point to invoke the scheduler's decision making. A thread of kernel control flow can invoke this function explicitly (but only in certain circumstances; for example, it should not be invoked from interrupt handlers), and it likely gets called implicitly when the kernel performs certain naturally disruptive actions (such as returning control to user space).

schedule() very simply tries to pick the next task to run:

5452        put_prev_task(rq, prev);
5453        next = pick_next_task(rq);

When it does so, it needs to perform a context switch between the two tasks 'prev' and 'next':

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        context_switch(rq, prev, next); /* unlocks the rq */

The inline context_switch function eventually invokes the assembly 'switch_to' macro. It is important to realize that the source code has hidden a really important change here: after the call to 'context_switch', the next source statements are not executed until some time later. The context switch literally causes another process to run, and only much later does kernel control flow come back to this point (when that 'next' process is now the 'prev' process).

        /* Here we just switch the register state and the stack. */
        switch_to(prev, next, prev);

The switch_to macro transfers control to a C function named __switch_to:

        "jmp __switch_to\n"        /* regparm call  */     \

and __switch_to completes the transition by loading other CPU state.
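
The "statements after the switch run much later" effect can be reproduced in userland. The sketch below is only an analogy, not what the kernel does: it assumes a glibc-style system that provides the ucontext routines (getcontext, makecontext, swapcontext), and the names toy_log, toy_task_body, and toy_run_switch_demo are hypothetical. Each letter is logged when its statement actually executes.

```c
#include <string.h>
#include <ucontext.h>

static ucontext_t toy_main_ctx, toy_task_ctx;
static char toy_task_stack[64 * 1024];  /* private stack for the second context */
static char toy_trace[8];               /* records the order in which code runs */

/* Append one character to the trace, always leaving room for the NUL. */
static void toy_log(char c)
{
    size_t n = strlen(toy_trace);
    if (n + 1 < sizeof toy_trace)
        toy_trace[n] = c;
}

static void toy_task_body(void)
{
    toy_log('B');
    swapcontext(&toy_task_ctx, &toy_main_ctx);  /* give the CPU back */
    toy_log('D');  /* runs only once somebody switches back to this context */
}

static const char *toy_run_switch_demo(void)
{
    memset(toy_trace, 0, sizeof toy_trace);
    getcontext(&toy_task_ctx);
    toy_task_ctx.uc_stack.ss_sp = toy_task_stack;
    toy_task_ctx.uc_stack.ss_size = sizeof toy_task_stack;
    toy_task_ctx.uc_link = &toy_main_ctx;  /* resumed when toy_task_body returns */
    makecontext(&toy_task_ctx, toy_task_body, 0);

    toy_log('A');
    swapcontext(&toy_main_ctx, &toy_task_ctx);  /* does not "return" until the task switches back */
    toy_log('C');
    swapcontext(&toy_main_ctx, &toy_task_ctx);  /* resume the task after its own swapcontext */
    toy_log('E');
    return toy_trace;
}
```

Reading the source top to bottom suggests 'D' follows 'B' immediately, but in fact the other context runs in between. That is the same surprise hiding behind the call to context_switch in schedule().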


Voluntarily giving up the CPU

There was a question on how to voluntarily give up the CPU; a process can do this by invoking the sched_yield(2) system call. Note that the implementation of this system call invokes the schedule() function to select another process (although it is still possible that the yielding process may be selected next).
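
From userland, yielding is a one-line call. The helper below is a small sketch (toy_yield_rounds is a hypothetical name); it uses only sched_yield() from <sched.h>, which returns 0 on success.

```c
#include <sched.h>

/* Call sched_yield() `rounds` times and return how many calls reported
 * success. Each call asks the kernel to run the scheduler; the caller
 * may well be picked again right away. */
static int toy_yield_rounds(int rounds)
{
    int ok = 0;
    for (int i = 0; i < rounds; i++)
        if (sched_yield() == 0)
            ok++;
    return ok;
}
```

A program might call toy_yield_rounds(3) inside a polling loop to be polite to other runnable processes, although blocking on an event is usually a better design than yielding repeatedly.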

Timing and Scheduling

The sched_info data structure keeps track of, on a per-process basis, how much CPU time a process has received. The sched_entity data structure also keeps track of timing stats; a field of this type is present in the task_struct (struct sched_entity se). The sched_info data structure is conditionally compiled into the task_struct; see lines 1079 to 1081.

The sched_class data structure represents a collection of attributes and function pointers for accomplishing the different scheduler-related activities for different classes (types) of processes under a particular scheduling approach (e.g., normal, realtime).
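
The function-pointer idea behind sched_class can be sketched as follows. This is a toy model under stated assumptions, not the kernel's definition: toy_rq, toy_sched_class, and the toy_pick_* functions are hypothetical, each class exposes a single operation, and each class holds at most one task (with -1 meaning "nothing runnable").

```c
#include <stddef.h>

/* Toy run queue: one candidate task id per class, -1 if none. */
struct toy_rq {
    int rt_task;
    int fair_task;
};

/* A scheduling class is a name plus a table of operations (policy);
 * the core scheduler only knows how to call through the table (mechanism). */
struct toy_sched_class {
    const char *name;
    int (*pick_next_task)(struct toy_rq *rq);
};

static int toy_pick_rt(struct toy_rq *rq)   { return rq->rt_task; }
static int toy_pick_fair(struct toy_rq *rq) { return rq->fair_task; }

/* Classes listed from highest to lowest precedence. */
static const struct toy_sched_class toy_classes[] = {
    { "rt",   toy_pick_rt },
    { "fair", toy_pick_fair },
};

/* Ask each class in turn; the first one with a runnable task wins. */
static int toy_pick_next_task(struct toy_rq *rq)
{
    for (size_t i = 0; i < sizeof toy_classes / sizeof toy_classes[0]; i++) {
        int t = toy_classes[i].pick_next_task(rq);
        if (t >= 0)
            return t;
    }
    return -1;  /* a real kernel would fall back to an idle task here */
}
```

This is a concrete instance of the Policy vs. Mechanism theme: adding a new class means adding a table entry, not changing the loop.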

On the topic of timing, the sched_clock_tick function updates the "current" time through a call chain that starts at ktime_get, continues through ktime_get_ts, and eventually resolves (on x86) to executing the rdtsc assembly instruction via the __native_read_tsc inline function.
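
The last hop of that chain, reading the time stamp counter, can be tried from userland. The sketch below is not the kernel's __native_read_tsc: toy_read_cycles is a hypothetical name, the inline assembly assumes GCC or Clang on x86, and on other architectures it falls back to clock_gettime(CLOCK_MONOTONIC) purely so that the example still compiles.

```c
#include <stdint.h>
#include <time.h>

/* Return a raw, monotonically increasing hardware-derived counter value. */
static uint64_t toy_read_cycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    /* rdtsc places the low 32 bits in EAX and the high 32 bits in EDX. */
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption for portability only: use the OS monotonic clock instead. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}
```

Taking the difference between two readings gives a rough cycle count for the code in between, which ties back to our Measurement theme; converting cycles to seconds requires knowing the counter frequency, which this sketch does not attempt.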

Notice how we cross from the generic "get kernel time" notion to the machine-specific functionality surrounding the hardware clock.

Advanced Topics

We touched on several other pieces of functionality:

  • real time scheduling (as faked in Linux)
  • sys_nice
  • sched_yield (voluntarily relinquish CPU)

We will not cover more advanced topics like:

  • balancing run queues across processors or cores
  • the full set of scheduling-related system calls
    • getpriority() and setpriority()
    • sched_getaffinity(), sched_setaffinity()
    • sched_getscheduler(), sched_setscheduler()
    • sched_getparam(), sched_setparam()
    • sched_get_priority_min(), sched_get_priority_max()
    • sched_rr_get_interval()

Scribe Notes


Reference material for today's session about timing, timers, and clocks.

  • MOS: 5.5 "Clocks"