Courses/Computer Science/CPSC 457.W2012/Lecture Notes/KernCtrlFlow

= Kernel-Level Control Flow =

In this session, we move our discussion of concurrency to the kernel level. We will look at how concurrency, communication, and synchronization issues arise inside the kernel, using several case studies drawn from the kernel itself.

= Agenda =


 * HW1 grades available
 * go over what happens to a child when a parent process receives a SIGKILL
 * Check course support level. Questions on HW3?
 * HW2 is being graded; should be available shortly
 * HW4 and HW5 will be posted soon

Kernel Control Flow
 * kernel asynchronous / interleaved control flow picture (ULK: Figure 1.2)
 * system calls
 * CPU generates exception (e.g., segv, div0)
 * asynchronous interrupt
 * data available, new mode, interrupt handler
 * sidebar: interrupts vs. polling
 * kernel thread scheduled to execute
 * HW5: strace++ -- dynamically generate a log of kernel control flow

Running Example: Issue a Read
 * P1 issues sys_read(2)
 * VFS read
 * file system read
 * device read: kernel issues read to disk / device
 * Kernel takes up another control path
 * interrupt from device notifies kernel data is ready
 * kernel interrupt handler acknowledges, goes back to interrupted work
 * blocked P1 is now Runnable again
 * Kernel delivers data (i.e., finishes read(2)) when P1 scheduled again
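
From user space, the whole path above hides behind a single call. A minimal sketch, assuming a POSIX system; the helper name read_some is hypothetical and is not part of the course code:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a file, issue one read(2), close it.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_some(const char *path, char *buf, size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* This is where P1 enters the kernel. If the data is not already
     * available, P1 blocks here until the device interrupt has been
     * handled and P1 is scheduled again. */
    ssize_t n = read(fd, buf, count);

    close(fd);
    return n;
}
```

The caller sees only a return value; the blocking, the interrupt, and the rescheduling all happen on the kernel side of the call.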

Concepts
 * Define the concepts of "reentrant" and "idempotent"
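 * Reentrant: several processes or tasks may be in "kernel mode" at the same time; that is, several kernel control paths may be partway through execution concurrently
 * Idempotent: performing an operation more than once has the same effect as performing it once
 * How do they not interfere with each other?
 * One idea: kernel functions only modify local variables (i.e., their stack activation record); such functions are reentrant, which is a different property from being idempotent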
 * Plus: careful coding of synchronization and critical sections so that global variables (e.g., the run queue) can be shared / operated on
 * Reentrancy improves device throughput (CPU ack's quickly)
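
The difference between relying on shared static state and touching only arguments and locals can be sketched in user-space C. Both function names are hypothetical, and this is an illustration rather than kernel code:

```c
#include <stddef.h>
#include <stdio.h>

/* Non-reentrant: every caller shares one static buffer, so a second
 * (possibly interleaved) call overwrites the result of the first. */
const char *fmt_pid_static(int pid)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "pid=%d", pid);
    return buf;
}

/* Reentrant: all state lives in the arguments and on the stack, so
 * interleaved calls with different buffers cannot interfere. */
char *fmt_pid_r(int pid, char *buf, size_t len)
{
    snprintf(buf, len, "pid=%d", pid);
    return buf;
}
```

Two calls to fmt_pid_static return the same pointer, so the earlier result is lost; two calls to fmt_pid_r with separate buffers keep both results.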

Events Causing Interleaved Kernel Execution
 * user process invokes system call
 * CPU detects exception (divzero, page fault)
 * device interrupt (interrupts enabled)
 * device interrupt (w/ preemption)
 * device interrupt (interrupts disabled) - nothing

Potential Solutions
 * disable preemption when in kernel mode
   * doesn't work for SMP
   * kernel has to make sure data structures are valid (incoming and outgoing)
 * disable interrupts
   * if critical region is large, system may seem unresponsive, lose incoming data
   * on SMP machines, interrupts are disabled only locally
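
Inside the kernel, the "disable interrupts" approach is usually written with local_irq_save and local_irq_restore around the critical region. That cannot be run from user space, so the sketch below uses a loose analogy instead: blocking signals with sigprocmask around an update. The function and variable names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for sigprocmask under strict ISO C modes */
#include <signal.h>
#include <stddef.h>

/* Shared state that a signal handler might also touch. */
static volatile sig_atomic_t shared_counter;

/* Analogy only: block all signals, update the shared state, then
 * restore the previous mask. Returns the new counter value, or -1
 * if the mask could not be changed. */
int update_with_signals_blocked(int delta)
{
    sigset_t all, saved;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    shared_counter += delta;         /* the critical region: keep it short */

    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0)
        return -1;
    return (int)shared_counter;
}
```

As with disabling interrupts, the cost grows with the length of the critical region: anything that arrives while the mask is in place has to wait until the mask is restored.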

Semaphores
 * As we've seen, this is a counting variable that is accessed atomically
 * Logical structure of a semaphore
   * an integer (the counter)
   * a list of waiting processes (who should be woken up / notified when lock is free)
   * two atomic functions: down (i.e., test/acquire) and up (i.e., free/release)
 * Note that semaphores are like augmented mutexes: several tasks may actually be in the critical section at once (e.g., multiple readers, but no writers), and the semaphore value (i.e., counter) represents the number of available simultaneous "paths" in the critical region.
 * The down function will suspend or block a task that causes the semaphore value to go below zero
 * The up function will reawaken a blocked task if the semaphore value is zero or greater
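
The logical structure above can be sketched in user space with a pthread mutex and condition variable. This is a teaching model, not the kernel's implementation: all names are hypothetical, the condition variable stands in for the list of waiting processes, and this variant keeps the counter at zero or above rather than letting it go negative.

```c
#include <pthread.h>
#include <stdbool.h>

struct toy_sem {
    int count;                /* free "paths" into the critical region */
    pthread_mutex_t lock;     /* makes down and up atomic */
    pthread_cond_t nonzero;   /* stands in for the list of waiters */
};

void toy_sem_init(struct toy_sem *s, int initial)
{
    s->count = initial;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
}

/* down: acquire one path, blocking while none is available. */
void toy_down(struct toy_sem *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

/* Non-blocking variant: returns true if a path was acquired. */
bool toy_try_down(struct toy_sem *s)
{
    bool got = false;

    pthread_mutex_lock(&s->lock);
    if (s->count > 0) {
        s->count--;
        got = true;
    }
    pthread_mutex_unlock(&s->lock);
    return got;
}

/* up: release one path and wake one waiter, if there is one. */
void toy_up(struct toy_sem *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}
```

Initializing the counter to 1 gives mutex-like behavior; a larger initial value admits that many tasks at once.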

Spin Locks
 * Semaphores might represent a significant overhead on SMP machines if critical regions (say, updating a simple count or value) are brief
   * blocking a task, or waking one up, as others enter and leave the critical section is a relatively large amount of work
 * a better approach on SMP machines is to make the calling CPU spin (busy-wait) until the lock is free
   * we've seen spinlocks in action
   * as we've seen in the thread case, spinlocks are a poor fit for uniprocessor environments: while the waiter spins, the task holding the lock cannot run to release it
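
A minimal spin lock sketch using C11 atomics. The type and function names are hypothetical, and the kernel's own spinlock code is architecture-specific and more involved than this:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
    atomic_flag held;   /* set while some CPU owns the lock */
};

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt: test-and-set returns the previous value, so a previous
 * value of false means we just took the lock. */
bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Spin (busy-wait) until the lock is ours. No task is blocked or
 * woken, which is what makes this cheap for short critical regions. */
void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ;   /* spin */
}

void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is declared as `struct toy_spinlock l = TOY_SPINLOCK_INIT;` and the critical region goes between toy_spin_lock(&l) and toy_spin_unlock(&l).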

Exception Handling
 * This takes the form of translating some exception condition raised by the program (e.g., a div0 or a page fault) into some action by the kernel.
 * In many cases, this means that the kernel must generate and deliver a signal to the offending process.
 * In others (e.g., page fault), the kernel attempts to swap in the missing page
 * The IDT plays a big role here, e.g.,:
   * set_trap_gate(1,&debug);
   * set_intr_gate(14,&page_fault);
   * set_system_gate(128,&system_call);
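
The role of the IDT can be pictured as a table indexed by vector number. The toy model below uses plain function pointers, which is a simplification: a real IDT entry is a gate descriptor that also carries a segment selector and privilege information. Every name here is hypothetical.

```c
#include <stddef.h>

#define TOY_NR_VECTORS 256

typedef int (*toy_handler_t)(void);

/* Toy "IDT": one handler slot per vector, all empty at start. */
static toy_handler_t toy_idt[TOY_NR_VECTORS];

void toy_set_gate(unsigned int vector, toy_handler_t handler)
{
    if (vector < TOY_NR_VECTORS)
        toy_idt[vector] = handler;
}

/* Look up the vector and run its handler. Returns the handler's
 * result, or -1 if no handler is installed for that vector. */
int toy_dispatch(unsigned int vector)
{
    if (vector >= TOY_NR_VECTORS || toy_idt[vector] == NULL)
        return -1;
    return toy_idt[vector]();
}

/* Stand-in handlers that just report which vector they serve. */
static int toy_debug(void)       { return 1; }
static int toy_page_fault(void)  { return 14; }
static int toy_system_call(void) { return 128; }

/* Mirrors the three gate registrations listed above. */
void toy_trap_init(void)
{
    toy_set_gate(1, toy_debug);
    toy_set_gate(14, toy_page_fault);
    toy_set_gate(128, toy_system_call);
}
```

After toy_trap_init(), a call such as toy_dispatch(14) runs the page fault stand-in, while a vector with no registered handler yields -1.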

Interrupt Handling
 * Exceptions (see above) typically cause the kernel to generate and deliver a signal; the work of actually handling the error condition is thus offloaded and deferred to the process that generated the exception via its signal handler (if it has one).
 * Interrupts, however, are largely asynchronous. They may not arrive when the process they are related to is even executing or runnable.
 * Types of interrupts
   * I/O interrupts: a device has data (network card, disk controller)
   * Timer interrupt (e.g., clock)
   * Interprocessor interrupts (CPU to CPU)

Deferred Execution
 * kernel control flow is complicated and interleaved
 * kernel may be interrupted doing something important
 * critical to defer work and split work into multiple subtasks

SoftIRQs and Tasklets
 * Main idea: remove work from interrupt handler routines
   * interrupt service handlers execute with interrupts disabled
   * so factoring out work can improve response time
   * this "removed" work can execute with interrupts enabled
 * SoftIRQs are reentrant, statically defined
 * tasklets need not be reentrant, can be dynamically created
 * Work Queues: generic list for a function and a set of kernel threads to service it
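
The idea of splitting work into a short handler plus deferred processing can be modeled with a small queue of function pointers. This is a single-threaded toy under our own assumptions: the names are hypothetical, and it has none of the locking or per-CPU structure that real softirqs, tasklets, or work queues need.

```c
#include <stddef.h>

#define TOY_QUEUE_LEN 8

typedef void (*toy_work_fn)(void *arg);

struct toy_work {
    toy_work_fn fn;
    void *arg;
};

static struct toy_work toy_queue[TOY_QUEUE_LEN];
static size_t toy_head, toy_tail;  /* head: next to run; tail: next free */

/* "Top half": called from the pretend interrupt handler. It only
 * records the work. Returns 0 on success, -1 if the queue is full. */
int toy_defer(toy_work_fn fn, void *arg)
{
    if (toy_tail - toy_head == TOY_QUEUE_LEN)
        return -1;
    toy_queue[toy_tail % TOY_QUEUE_LEN] = (struct toy_work){ fn, arg };
    toy_tail++;
    return 0;
}

/* "Bottom half": runs later, outside the handler, and drains the
 * queue. Returns the number of work items that were run. */
size_t toy_run_deferred(void)
{
    size_t ran = 0;

    while (toy_head != toy_tail) {
        struct toy_work w = toy_queue[toy_head % TOY_QUEUE_LEN];
        toy_head++;
        w.fn(w.arg);
        ran++;
    }
    return ran;
}

/* Example deferred job: count one received packet. */
void toy_count_packet(void *arg)
{
    (*(int *)arg)++;
}
```

The handler side calls toy_defer(toy_count_packet, &packets) and returns immediately; the counter only changes when toy_run_deferred() runs later.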

Code
 * ptrace(2)
 * read(2)
 * code for kernel locks

= Notes =

We looked at the kernel control path involved with sys_read, starting at the actual system call definition itself:

http://lxr.linux.no/#linux+v2.6.35.14/fs/read_write.c#L391

It doesn't take long to run into some synchronization primitives. For example, in the call to fget_light, we see the RCU (read-copy-update) synchronization mechanism used via rcu_read_lock and rcu_read_unlock.

After getting a handle to the file structure representing the file to be read from, sys_read calls vfs_read.

The vfs_read function performs some sanity checks on its input parameters (e.g., do we have permission to read the file, and does the file structure provide a routine for reading the content synchronously or asynchronously?) and then either invokes the read operation registered with the file structure or calls do_sync_read.

Let's deal with do_sync_read first. It invokes the aio_read function registered for the filesystem the file resides on. If that call asks to be retried, do_sync_read calls wait_on_retry_sync_kiocb, which blocks the current process and invokes the kernel's schedule function (which we've looked at before).

The "read" function pointer referred to in the code at read_write.c, line 310:

ret = file->f_op->read(file, buf, count, pos);

is reached through the f_op field defined in the struct file data type.

This field f_op is a pointer to a struct file_operations, and read is one of the function pointers in that structure.

This function pointer refers to the filesystem-specific read function. For example, in the ext4 file system, the file operations structure (http://lxr.linux.no/#linux+v2.6.35.14/fs/ext4/file.c#L132) is initialized like this:

const struct file_operations ext4_file_operations = {
        .llseek         = generic_file_llseek,
        .read           = do_sync_read,
        .write          = do_sync_write,
        .aio_read       = generic_file_aio_read,
        .aio_write      = ext4_file_write,
        .unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
        .compat_ioctl   = ext4_compat_ioctl,
#endif
        .mmap           = ext4_file_mmap,
        .open           = ext4_file_open,
        .release        = ext4_release_file,
        .fsync          = ext4_sync_file,
        .splice_read    = generic_file_splice_read,
        .splice_write   = generic_file_splice_write,
};

We can see that the read function pointer is actually initialized to the do_sync_read function referred to earlier, and this matches our intuition about dispatching a read request to the underlying device driver via the kernel's asynchronous I/O infrastructure and work queues. (See also generic_file_aio_read)
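
The dispatch pattern traced above (check the operations table, call read if it is set, otherwise fall back to a synchronous wrapper around aio_read) can be modeled in a few lines of user-space C. Everything prefixed with toy_ is hypothetical, and the signatures are simplified compared with the kernel's: there is no kiocb, no iovec, and no retry loop.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

struct toy_file;

/* Simplified stand-in for struct file_operations. */
struct toy_file_operations {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t count);
    ssize_t (*aio_read)(struct toy_file *f, char *buf, size_t count);
};

/* Simplified stand-in for struct file, backed by a string in memory. */
struct toy_file {
    const struct toy_file_operations *f_op;
    const char *data;
    size_t pos;
};

/* Generic wrapper in the spirit of do_sync_read: forward to aio_read. */
ssize_t toy_do_sync_read(struct toy_file *f, char *buf, size_t count)
{
    return f->f_op->aio_read(f, buf, count);
}

/* "Filesystem-specific" read: copy from the in-memory string. */
ssize_t toy_mem_aio_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = strlen(f->data) - f->pos;

    if (count > left)
        count = left;
    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (ssize_t)count;
}

/* Same shape as the ext4 table: read points at the generic wrapper. */
const struct toy_file_operations toy_mem_file_operations = {
    .read     = toy_do_sync_read,
    .aio_read = toy_mem_aio_read,
};

/* Sanity-check the table, then dispatch through the function pointer. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    if (f->f_op == NULL ||
        (f->f_op->read == NULL && f->f_op->aio_read == NULL))
        return -1;
    if (f->f_op->read != NULL)
        return f->f_op->read(f, buf, count);
    return toy_do_sync_read(f, buf, count);
}
```

The caller of toy_vfs_read never names the filesystem's function directly; swapping in a different operations table changes which code runs without changing the caller.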

= Readings =

None, work on HW3.