
System Startup

In this session, we will examine how an operating system actually starts up and transitions from a simple sequential loading program to a concurrent system of processes.

Besides highlighting the relationship between system code and the hardware/architecture, it provides an initial introduction to the topic of concurrency.

Focus Question

How does an OS create the environment and conditions on the hardware necessary to support the concurrent execution of multiple processes?

Agenda

  • Boot.
  • OS startup slides
  • OS startup code (see call chain list below)

Notes and References

The call chain involved here is interesting for several reasons:

  • It shows you how deep some kernel call chains are (reflecting the design pattern of doing a little bit of work and deferring the next little bit of work to someone else)
  • It demonstrates how closely the startup code is related to the underlying machine
  • It is an exact reflection of going from sequential assembly code to a concurrent system by "manually" setting up kernel data structures, initializing subsystems, asking the scheduler to start, and creating a new kernel thread (via do_fork()) that eventually calls sys_execve() to load in the "first" user level process: /sbin/init.

Fascinating stuff.
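
For reference --- this is the "call chain list" mentioned in the agenda --- the chain traced step by step in the annotated guide below is:

  start_of_setup (boot/header.S)
    -> main (boot/main.c)
    -> go_to_protected_mode (boot/pm.c)
    -> protected_mode_jump (boot/pmjump.S)
    -> startup_32 (boot/compressed/head_32.S: relocate and decompress)
    -> startup_32 (kernel/head_32.S)
    -> i386_start_kernel (kernel/head32.c)
    -> start_kernel -> rest_init (init/main.c)
    -> kernel_thread / do_fork -> kernel_init -> init_post
    -> run_init_process -> kernel_execve -> sys_execve -> /sbin/init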

Annotated Guide to x86 Linux boot

2.6.32 (a step-by-step annotated guide to boot on x86)

Boot execution of the kernel begins at the start_of_setup label on line 241 of header.S 1 and continues in that file until line 301, an x86 CALL instruction to the 'main' symbol 2.

# Jump to C code (should not return)
 301        calll   main

The 'main' symbol is defined at line 125 of boot/main.c 3, which calls 'go_to_protected_mode' at line 173 4. This makes sense: the machine starts in real address mode, and a modern multiprocessing OS like Linux needs to take advantage of the x86's so-called 'protected mode'. So this call

void main(void)
{
 ...
       /* Do the last things and invoke protected mode */
        go_to_protected_mode();
}

should eventually wind up turning on protected mode in the CPU (which by itself is relatively simple: setting a bit in CR0). But during this setup, the OS does several other important things, like enabling the A20 line hack (a sketch of one A20-enabling method appears after the next snippet). The routine go_to_protected_mode() is defined at line 104 of boot/pm.c 5 and is a relatively short function, but the last few lines are quite interesting:

121       /* Actual transition to protected mode... */
122        setup_idt();
123        setup_gdt();
124        protected_mode_jump(boot_params.hdr.code32_start,
125                            (u32)&boot_params + (ds() << 4));
126 }
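
As an aside on the A20 remark above: here is a lightly paraphrased sketch of one of the several methods the kernel tries, the "fast A20" gate via System Control Port A (cf. boot/a20.c; the u8 type and the inb/outb port-I/O helpers are assumed to come from the boot code's own headers):

/* "Fast A20": set bit 1 of System Control Port A (0x92) to enable the
 * A20 address line.  Bit 0 must stay clear -- setting it resets the
 * machine.  Paraphrased from boot/a20.c; inb()/outb() from boot/boot.h. */
static void enable_a20_fast(void)
{
        u8 port_a = inb(0x92);          /* read System Control Port A */
        port_a |=  0x02;                /* bit 1: enable A20 */
        port_a &= ~0x01;                /* bit 0: do not reset machine */
        outb(port_a, 0x92);
}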

The calls to setup_idt() and setup_gdt() create and initialize very vestigial IDT and GDT structures (looking just above in the source code, we can see that the IDT in particular is empty). This is OK; the kernel will fill it in later during boot with the addresses of the appropriate interrupt handlers. Initializing the GDT and IDT is among the most important jobs during boot. The call to 'protected_mode_jump' on line 124 is interesting; we are not yet in protected mode --- it will be the job of that function to set the PE bit in CR0. Note how the first parameter to this function is the 'code32_start' address from the kernel's header.S file! We'll return to that in a second. First, the 'protected_mode_jump' function is defined in boot/pmjump.S at line 26 6. It, too, is a short function written in assembly language (the booting code in these early stages flips back and forth between specifying actions in C and assembly). Its main purpose is to transition the CPU to protected mode:

 39      movl    %cr0, %edx
 40      orb     $X86_CR0_PE, %dl        # Protected mode
 41      movl    %edx, %cr0

The code continues to set up some segment values and then falls through to the statement on line 76, which jumps to the location held in the EAX register:

 76      jmpl    *%eax                   # Jump to the 32-bit entrypoint
 77 ENDPROC(in_pm32)

The value in EAX is derived from the first parameter passed to 'protected_mode_jump' 7 and holds the address stored at the 'code32_start' location on line 152 of boot/header.S 8 (we've already seen header.S ... this is where kernel execution entered at the 'start_of_setup' label).


152 code32_start:                          # here loaders can put a different
153                                        # start address for 32-bit code.
154                .long   0x100000        # 0x100000 = default for big kernel

To be clear, the kernel's control flow doesn't jump to line 152; rather, it jumps to the address stored at this location. That address is the startup_32 label on line 33 of the compressed boot image header (boot/compressed/head_32.S) 9: a small uncompressed sequence of x86 assembly that relocates and decompresses (lines 138-151) the rest of the kernel code. Control flow then jumps to the decompressed kernel image:

178 /*
179  * Jump to the decompressed kernel.
180  */
181       xorl %ebx,%ebx
182       jmp *%ebp

This location is a second 'startup_32' symbol, defined on line 76 of kernel/head_32.S 7. Here, the kernel can begin to initialize the rest of the CPU and important data structures (such as the GDT and IDT). This piece of assembly sets up the overall paging scheme (2-level, 3-level, etc.), then continues by checking the extended features of the CPU, such as PAE and whether the NX bit is available. The code then enables paging by setting the PG bit in the CR0 register:

326 /*
327  * Enable paging
328  */
329        movl $pa(swapper_pg_dir),%eax
330        movl %eax,%cr3          /* set the page table pointer.. */
331        movl %cr0,%eax
332        orl  $X86_CR0_PG,%eax
333        movl %eax,%cr0          /* ..and set paging (PG) bit */

This second bit-twiddling operation is another major event (the first being the transition to protected mode, which separates user code from so-called supervisor code); enabling paging turns on virtual memory, along with the page-based protection that helps separate each user-level process's address space from every other. After some additional sanity checks and a call to 'setup_idt' (on line 358 10) that creates a basic IDT layout with null interrupt handlers (see line 501 11), on line 468 12 the control flow jumps to the symbol 'initial_code'. 'initial_code' is defined later in the same file (kernel/head_32.S) on line 608 13 and holds the value of the symbol 'i386_start_kernel', which can be thought of as the "C" entry point to the x86 version of the Linux kernel. This symbol is a function defined in kernel/head32.c 14 that winds up calling the 'start_kernel' function; 'start_kernel' is defined on line 536 of init/main.c 15. The start_kernel function invokes a long list of helper functions that initialize important kernel data structures, object pools, and subsystems. A few of these calls are worth pointing out, because they help you tie what you 'see' during boot to the activity that has happened up to this point:


 569      printk(KERN_NOTICE "%s", linux_banner);

(You can see this output in dmesg)
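
As a hedged aside (not from the kernel source above): in 2.6-era kernels the KERN_* markers are literal string prefixes, so the log level rides along inside the format string itself. A tiny user-space stand-in, with printf playing the role of printk:

#include <stdio.h>

/* KERN_NOTICE is the literal "<5>"; C adjacent-string concatenation
 * splices it onto the format, and the kernel's log code later parses
 * it back out as the severity.  printf stands in for printk here. */
#define KERN_NOTICE "<5>"

int main(void)
{
        const char *linux_banner = "Linux version 2.6.32 ...";  /* stand-in */
        printf(KERN_NOTICE "%s\n", linux_banner);  /* -> <5>Linux version ... */
        return 0;
}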

 582      sched_init();

A call to initialize parts of the scheduling subsystem; we have no processes yet, but we can still set up the functionality that multiplexes the CPU(s) and picks the next task.
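
To make "picks the next task" concrete, here is a toy sketch (emphatically not the kernel's scheduler) of the job sched_init() is preparing for: keep a run queue of runnable tasks and rotate through it, round-robin:

#include <stdio.h>

struct task { int pid; struct task *next; };

static struct task *current_task;       /* circular run queue; NULL = empty */

static struct task *pick_next_task(void)
{
        if (current_task)
                current_task = current_task->next;  /* rotate one step */
        return current_task;            /* NULL means: run the idle task */
}

int main(void)
{
        struct task a = { 1, NULL }, b = { 2, NULL };
        a.next = &b;                    /* two runnable tasks, in a ring */
        b.next = &a;
        current_task = &a;

        for (int i = 0; i < 4; i++)
                printf("run pid %d\n", pick_next_task()->pid);
        return 0;
}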

 590      printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);

Another piece of output you can see in dmesg.

 601      trap_init();

The call to 'trap_init()' is very important --- this is, at last, where the kernel sets up the majority of the important/critical interrupt vectors in the IDT. The 'trap_init' function is defined in kernel/traps.c on line 907 15. You can see how it installs interrupt service routines for important events like divide-by-zero exceptions, the INT3 instruction, page faults, and general protection faults, along with the legacy "INT 0x80" system call service routine (lines 948 to 965):

918        ...
919        set_intr_gate(0, &divide_error);
920        set_intr_gate_ist(1, &debug, DEBUG_STACK);
921        set_intr_gate_ist(2, &nmi, NMI_STACK);
922        /* int3 can be called from all */
923        set_system_intr_gate_ist(3, &int3, DEBUG_STACK);
924        /* int4 can be called from all */
925        set_system_intr_gate(4, &overflow);
926        set_intr_gate(5, &bounds);
927        set_intr_gate(6, &invalid_op);
928        set_intr_gate(7, &device_not_available);
929 #ifdef CONFIG_X86_32
930        set_task_gate(8, GDT_ENTRY_DOUBLEFAULT_TSS);
931 #else
932        set_intr_gate_ist(8, &double_fault, DOUBLEFAULT_STACK);
933 #endif
934        set_intr_gate(9, &coprocessor_segment_overrun);
935        set_intr_gate(10, &invalid_TSS);
936        set_intr_gate(11, &segment_not_present);
937        set_intr_gate_ist(12, &stack_segment, STACKFAULT_STACK);
938        set_intr_gate(13, &general_protection);
939        set_intr_gate(14, &page_fault);
940        set_intr_gate(15, &spurious_interrupt_bug);
941        set_intr_gate(16, &coprocessor_error);
942        set_intr_gate(17, &alignment_check);
943 #ifdef CONFIG_X86_MCE
944        set_intr_gate_ist(18, &machine_check, MCE_STACK);
945 #endif
946        set_intr_gate(19, &simd_coprocessor_error);
947
948 #ifdef CONFIG_IA32_EMULATION
949        set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
950 #endif
...
965        set_system_trap_gate(SYSCALL_VECTOR, &system_call);

The 'IA32_SYSCALL_VECTOR' and 'SYSCALL_VECTOR' symbols are defined in irq_vectors.h 16 to be the value 0x80 (128 decimal).
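
To see what installing a handler at vector 0x80 buys us: a 32-bit user program can now enter the kernel with the INT 0x80 instruction. A hedged sketch (build as 32-bit code, e.g. gcc -m32; syscall number 4 is __NR_write on 32-bit x86):

/* Invoke sys_write directly through the legacy INT 0x80 gate that
 * trap_init() installed above.  EAX holds the syscall number; EBX,
 * ECX, EDX hold the first three arguments. */
int main(void)
{
        const char msg[] = "hello via int $0x80\n";
        long ret;

        __asm__ volatile ("int $0x80"
                          : "=a" (ret)          /* result comes back in EAX */
                          : "0" (4L),           /* EAX = __NR_write (4)     */
                            "b" (1L),           /* EBX = fd 1 (stdout)      */
                            "c" (msg),          /* ECX = buffer             */
                            "d" ((long)(sizeof(msg) - 1))  /* EDX = length  */
                          : "memory");
        return ret < 0;
}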

After setting up a number of other subsystems, 'start_kernel' calls 'rest_init' at line 699 17. The 'rest_init' function is defined at line 451 of init/main.c 18 and is interesting because this is the place where the kernel control flow begins to transition from single-threaded sequential code to concurrent execution. This transition happens in a couple of delicate steps that (somewhat surprisingly) simply re-use the existing kernel code for supporting the fork and execve system calls!

The first line of 'rest_init' 19 invokes the 'kernel_thread' function, passing in the name of the function that this new kernel thread should "run". This is quite similar in form to user-level pthread creation, where a function pointer to the thread's 'run' routine is supplied, and also much akin to Java's Thread.run() method --- except in that case the name of the method is fixed. (A user-level sketch of this pattern appears after the next paragraph.)

       kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);

Perhaps surprisingly, the 'kernel_thread' function 20 sets up some arguments and then calls do_fork (on line 229)! The do_fork() routine's purpose is to create the context (i.e., task_struct and other metadata) for a process. Once this completes successfully, note what has happened: the kernel has created an independent 'virtual' version of the EIP. This new line of control flow will proceed to execute the 'kernel_init' function defined on line 846 21, while the original line of control flow will return and execute the 'cpu_idle()' call on line 473 22. This 'idle' or swapper process (process zero) sits in a zero-priority endless loop 23 and has given the kernel something quite important: a second process to schedule. Once the scheduler runs for the first time, it can choose between process zero (idle/swapper) and process one (the 'kernel_init' kernel thread). Thus, the kernel always has a process to give the CPU to.
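
Here is the user-level sketch promised above: the same "hand a function pointer to a new flow of control" pattern, written with pthreads (illustrative only --- the kernel thread above is created via do_fork(), not pthreads):

#include <pthread.h>
#include <stdio.h>

static void *thread_init(void *unused)  /* plays the role of kernel_init */
{
        printf("second flow of control running\n");
        return NULL;
}

int main(void)                          /* plays the role of rest_init */
{
        pthread_t t;

        pthread_create(&t, NULL, thread_init, NULL);
        /* The original flow continues independently, just as the boot
         * CPU goes on to cpu_idle() while kernel_init runs. */
        pthread_join(t, NULL);
        return 0;
}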

What is neat about this arrangement is that by the time the scheduler runs for the first time, the newly created kernel thread executing the 'kernel_init' function has done two special things. First, it has completed the primary boot process (there is a configured, initialized, and viable kernel running the CPU). Second, it has transformed into a user-level process with a special job: being the ancestor of all other processes via the SysV startup scripts.

The last few lines of the 'kernel_init' thread function are:

888        /*
889         * Ok, we have completed the initial bootup, and
890         * we're essentially up and running. Get rid of the
891         * initmem segments and start the user-mode stuff..
892         */
893
894        init_post();
895        return 0;

The call to 'init_post' (defined a few lines above) 24 has the effect of calling, during its last few lines, the 'run_init_process' function up to four times.


838        run_init_process("/sbin/init");
839        run_init_process("/etc/init");
840        run_init_process("/bin/init");
841        run_init_process("/bin/sh");
842
843        panic("No init found.  Try passing init= option to kernel.");
844 }

Note what is happening here. The newly created kernel thread calls the 'run_init_process' function at most four times; 'run_init_process' 25 is a two-line function that invokes kernel_execve 26, which calls sys_execve 27, the implementation of the execve system call 28 --- which, as we all know, is how a new program gets loaded into an existing address space. So our new thread of kernel control has successfully managed to become a user-level process! (A user-space analogue of this execve step appears after the next snippet.)

793 static void run_init_process(char *init_filename)
794 {
795        argv_init[0] = init_filename;
796        kernel_execve(init_filename, argv_init, envp_init);
797 }
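
The user-space analogue promised above: execve() replaces the calling process's image with a new program and, on success, never returns --- which is exactly why init_post() can simply try its four paths in order and only falls through when an execve fails. (The path and message here are illustrative.)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char *argv[] = { "/bin/echo", "I am the replacement program", NULL };
        char *envp[] = { NULL };

        execve(argv[0], argv, envp);

        /* Reached only if execve() failed (cf. falling through to panic()). */
        perror("execve");
        return 1;
}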

If none of those four calls succeeds in replacing the executable code of the init process with the contents of the ELF file named in the call, then the kernel panics at line 843; panic() is defined in the file kernel/panic.c 29 and tries to force a shutdown and reset of the CPU.


2.6.32 (off the CPSC mirror of LXR)

2.6.27.41 (off the LXR site)

Scribe notes

  • s1
  • s2
  • s3

Readings