Courses/Computer Science/CPSC 457.F2014/Lecture Notes/Startup
System Startup
In this session, we will examine how an operating system actually starts up and transitions from a simple sequential loading program to a concurrent system of processes.
Besides highlighting the relationship between system code and the hardware/architecture, it provides an initial introduction to the topic of concurrency.
Focus Question
How does an OS create the environment and conditions on the hardware necessary to support the concurrent execution of multiple processes?
Agenda
- Boot.
- OS startup slides
- OS startup code (see call chain list below)
Notes and References
- http://lxr.cpsc.ucalgary.ca/lxr/#linux/Documentation/x86/boot.txt
- man dmesg (dmesg prints the kernel ring buffer, which logs startup activity)
The call chain involved here is interesting for several reasons:
- It shows you how deep some kernel call chains are (reflecting the design pattern of doing a little bit of work and deferring the next little bit of work to someone else)
- It demonstrates how closely the startup code is related to the underlying machine
- It is an exact reflection of going from sequential assembly code to a concurrent system by "manually" setting up kernel data structures, initializing subsystems, asking the scheduler to start, and creating a new kernel thread (via do_fork()) that eventually calls sys_execve() to load in the "first" user level process: /sbin/init.
Fascinating stuff.
Annotated Guide to x86 Linux boot
2.6.32 (step by step annotated guide to boot on x86)
Boot execution of the kernel begins at the start_of_setup label at line 241 in header.S 1 and then continues in that file until line 301, where an x86 CALL instruction transfers control to the 'main' symbol 2:
	# Jump to C code (should not return)
301	calll	main
The 'main' symbol is defined at line 125 of boot/main.c 3, which calls 'go_to_protected_mode' at line 173 4. This is expected, because the machine has started in real address mode, and a modern multiprocessing OS like Linux is expected to take advantage of the x86's so-called 'protected mode'. So this call
void main(void)
{
	...
	/* Do the last things and invoke protected mode */
	go_to_protected_mode();
}
should eventually wind up turning on protected mode in the CPU (which by itself is relatively simple: setting a bit in CR0). But during this setup, the OS does several other important things (like enabling the A20 pin hack). The routine go_to_protected_mode() is defined at line 104 of boot/pm.c 5 and is a relatively short function, but the last few lines are quite interesting:
121	/* Actual transition to protected mode... */
122	setup_idt();
123	setup_gdt();
124	protected_mode_jump(boot_params.hdr.code32_start,
125			    (u32)&boot_params + (ds() << 4));
126 }
The calls to setup_idt() and setup_gdt() create and initialize very vestigial IDT and GDT structures (looking just above in the source code, we can see that the IDT in particular is empty). This is OK; the kernel will fill it in later during boot with the addresses of the appropriate interrupt handlers. Initializing the GDT and IDT is one of the most important jobs during boot. The call to 'protected_mode_jump' on line 124 is interesting; we are not yet in protected mode --- it will be the job of that function to set the PE bit in CR0. Note how the first parameter to this function is the 'code32_start' address from the kernel's header.S file! We'll return to that in a second. First, the 'protected_mode_jump' function is defined in boot/pmjump.S at line 26 6. It too is a short function, written in assembly language (the booting code in these early stages flips back and forth between specifying actions in C and assembly). Its main purpose is to transition the CPU to protected mode:
39	movl	%cr0, %edx
40	orb	$X86_CR0_PE, %dl	# Protected mode
41	movl	%edx, %cr0
The code continues to set up some segment values and then falls through to the statement on line 76 that jumps to the location in the EAX register:
76	jmpl	*%eax	# Jump to the 32-bit entrypoint
77 ENDPROC(in_pm32)
The value in EAX is derived from the first parameter passed in to the "protected_mode_jump" 7 and holds the address of the value defined in the location 'code32_start' on line 152 of boot/header.S 8 (we've already seen header.S ... this is the entry point of kernel execution at the 'start_of_setup' label).
152 code32_start:			# here loaders can put a different
153					# start address for 32-bit code.
154		.long	0x100000	# 0x100000 = default for big kernel
To be clear, the kernel's control flow doesn't jump to line 152; rather, it jumps to the address defined at this location. This is the startup_32 location on line 33 in the compressed version of the boot image header (at boot/compressed/head_32.S) 9: a small uncompressed sequence of x86 assembly that relocates and uncompresses (lines 138-151) the rest of the kernel code. Control flow then jumps to the uncompressed kernel image:
178 /*
179  * Jump to the decompressed kernel.
180  */
181	xorl	%ebx,%ebx
182	jmp	*%ebp
This location is a second 'startup_32' symbol, defined on line 76 of kernel/head_32.S 7. Here, the kernel can begin to initialize the rest of the CPU and important data structures (such as the GDT and IDT). This piece of assembly sets up the overall paging scheme (2-level, 3-level, etc.), then continues to check the extended features of the CPU, such as PAE and whether the NX bit is available. This code then enables paging by setting the PG bit in the CR0 register:
326 /*
327  * Enable paging
328  */
329	movl $pa(swapper_pg_dir),%eax
330	movl %eax,%cr3		/* set the page table pointer.. */
331	movl %cr0,%eax
332	orl $X86_CR0_PG,%eax
333	movl %eax,%cr0		/* ..and set paging (PG) bit */
This second bit-twiddling operation is another major event (the first being the transition to protected mode, which separates user code from so-called supervisor code): enabling paging turns on virtual memory, along with page-based protection schemes that help separate each user-level process's address space from every other. After some additional sanity checks and a call to 'setup_idt' (on line 358 10) that creates a basic IDT layout with null interrupt handlers (see line 501 11), control flow jumps (on line 468 12) to the symbol 'initial_code'; 'initial_code' is defined later in the same file (kernel/head_32.S) on line 608 13. This location holds the address of 'i386_start_kernel', which can be thought of as the "C" entry point to the x86 version of the Linux kernel. That symbol is a function defined in kernel/head32.c 14 that winds up calling the 'start_kernel' function; 'start_kernel' is defined on line 536 of init/main.c 15. The start_kernel function invokes a long list of helper functions that initialize important kernel data structures, object pools, and subsystems. A few important calls are worth pointing out, to help you tie what you 'see' during boot to the activity that has happened up until now:
569 printk(KERN_NOTICE "%s", linux_banner);
(You can see this output in dmesg)
582 sched_init();
A call to initialize parts of the scheduling subsystem; we have no processes, but we can still set up the functionality that multiplexes the CPU(s) and picks the next task.
590 printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
Another piece of output you can see in dmesg.
601 trap_init();
The call to 'trap_init()' is very important -- this is, at last, where the kernel sets up the majority of the important/critical interrupt vectors in the IDT. The 'trap_init' function is defined in kernel/traps.c on line 907 [lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/traps.c#L907 15]. You can see how it provides interrupt service routines for important events like servicing divide by zero exceptions, the INT3 assembly instruction, page faults, and general protection faults, along with the legacy "INT 0x80" system call service routine (lines 948 to 965):
918 ...
919	set_intr_gate(0, &divide_error);
920	set_intr_gate_ist(1, &debug, DEBUG_STACK);
921	set_intr_gate_ist(2, &nmi, NMI_STACK);
922	/* int3 can be called from all */
923	set_system_intr_gate_ist(3, &int3, DEBUG_STACK);
924	/* int4 can be called from all */
925	set_system_intr_gate(4, &overflow);
926	set_intr_gate(5, &bounds);
927	set_intr_gate(6, &invalid_op);
928	set_intr_gate(7, &device_not_available);
929 #ifdef CONFIG_X86_32
930	set_task_gate(8, GDT_ENTRY_DOUBLEFAULT_TSS);
931 #else
932	set_intr_gate_ist(8, &double_fault, DOUBLEFAULT_STACK);
933 #endif
934	set_intr_gate(9, &coprocessor_segment_overrun);
935	set_intr_gate(10, &invalid_TSS);
936	set_intr_gate(11, &segment_not_present);
937	set_intr_gate_ist(12, &stack_segment, STACKFAULT_STACK);
938	set_intr_gate(13, &general_protection);
939	set_intr_gate(14, &page_fault);
940	set_intr_gate(15, &spurious_interrupt_bug);
941	set_intr_gate(16, &coprocessor_error);
942	set_intr_gate(17, &alignment_check);
943 #ifdef CONFIG_X86_MCE
944	set_intr_gate_ist(18, &machine_check, MCE_STACK);
945 #endif
946	set_intr_gate(19, &simd_coprocessor_error);
947
948 #ifdef CONFIG_IA32_EMULATION
949	set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
950 #endif
...
965	set_system_trap_gate(SYSCALL_VECTOR, &system_call);
The 'IA32_SYSCALL_VECTOR' and "SYSCALL_VECTOR" symbols are defined in irq_vectors.h [lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/include/asm/irq_vectors.h#L36 16] to be the value 0x80 (or 128 decimal).
After setting up a number of other subsystems, 'start_kernel' calls 'rest_init' at line 699 17. The 'rest_init' function is defined at line 451 of init/main.c 18 and is interesting because this is the place where the kernel control flow begins to transition from single-threaded sequential code to concurrent execution. This transition happens in a couple of delicate steps that (somewhat surprisingly) simply re-use the existing kernel code for supporting the fork and execve system calls!
The first line of 'rest_init' 19 invokes the 'kernel_thread' function, passing in the name of the function that this new kernel thread should "run" (this is quite similar in form to user-level pthread creation, where a function pointer to the thread's 'run' routine is supplied; it is also much akin to Java's Thread.run() method, except in that case the name of the method is fixed).
kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
Perhaps surprisingly, the 'kernel_thread' function 20 sets up some arguments and then calls do_fork (on line 229)! The do_fork() routine's purpose is to create the context (i.e., the task_struct and other metadata) for a process. Once this completes successfully, note what has happened here --- the kernel has created an independent 'virtual' version of the EIP. This 'new' line of control flow will proceed to execute the 'kernel_init' function defined on line 846 21, while the original line of control flow will return and execute the 'cpu_idle()' call on line 473 [lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L473 22]. This 'idle' or swapper process (process zero) sits in a zero-priority endless loop [lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L473 23] and has given the kernel something quite important: a second process to schedule. Once the scheduler runs for the first time, it can choose between process zero (idle/swapper) and process one (the 'kernel_init' kernel thread). Thus, the kernel always has a process to give the CPU to.
What is neat about this arrangement is that by the time the scheduler runs for the first time, the newly-created kernel thread executing the 'kernel_init' function has done two special things. First, it has completed the primary boot process (there is a configured, initialized, and viable kernel running the CPU). Second, it has transformed itself into a user-level process with a special job: being the ancestor of all other processes via the SysV startup scripts.
The last few lines of the 'kernel_init' thread function are:
888	/*
889	 * Ok, we have completed the initial bootup, and
890	 * we're essentially up and running. Get rid of the
891	 * initmem segments and start the user-mode stuff..
892	 */
893
894	init_post();
895	return 0;
The call to 'init_post' (defined a few lines above) 24 has the effect of calling (during its last few lines) the 'run_init_process' function up to four times.
838	run_init_process("/sbin/init");
839	run_init_process("/etc/init");
840	run_init_process("/bin/init");
841	run_init_process("/bin/sh");
842
843	panic("No init found. Try passing init= option to kernel.");
844 }
Note what is happening here. The newly created kernel thread is going to call the 'run_init_process' function at most four times; 'run_init_process' 25 is a two-line function that invokes kernel_execve 26, which calls sys_execve 27, the kernel's implementation of the execve system call 28 --- which, as we all know, is how a new program gets loaded into an existing address space. So our new thread of kernel control has successfully managed to become a user-level process!
793 static void run_init_process(char *init_filename)
794 {
795	argv_init[0] = init_filename;
796	kernel_execve(init_filename, argv_init, envp_init);
797 }
If none of those four calls succeeds in replacing the executable code of the init process with the contents of the ELF file named in each call, then the kernel panics at line 843; panic() is defined in the file kernel/panic.c 29 and tries to force a shutdown and RESET of the CPU.
2.6.32 (off the CPSC mirror of LXR)
- http://lxr.cpsc.ucalgary.ca/#linux+v2.6.32/arch/x86/boot/header.S#L301
- http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/boot/main.c#L125
- go_to_protected_mode: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/boot/pm.c#L104
- http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/boot/compressed/head_32.S#L33
- http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/head_32.S#L76
- http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/head_32.S#L608
- i386_start_kernel: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/head32.c#L17
- start_kernel: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L536
- rest_init: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L442
- diversion: creating a kernel thread, which calls do_fork: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/process_32.c#L210
- what kernel thread was created? kernel_init: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L846
- kernel_init finishes by calling init_post: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L802
- init_post uses run_init_process to invoke /sbin/init: http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/init/main.c#L793
- which calls http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/sys_i386_32.c#L241
- which is the execve(2) implementation! http://lxr.cpsc.ucalgary.ca/#linux+v2.6.30/arch/x86/kernel/process_32.c#L451
2.6.27.41 (off the LXR site)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/boot/header.S#L297 (real mode startup assembly code)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/boot/main.c (jumped to from startup assembly code)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/boot/pm.c (transfer to protected mode)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/boot/compressed/head_32.S#L35 (startup_32, version 1)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/head_32.S#L85 (startup_32, uncompressed version)
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/head_32.S#L604 (startup_32 control flow eventually gets here, after executing idt setup); this location is a call to the x86-specific start_kernel routine:
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/head32.c#L16 (which calls start_kernel() at line 40)
- http://lxr.linux.no/#linux+v2.6.27.41/init/main.c#L539, which at line 691 calls rest_init():
- http://lxr.linux.no/#linux+v2.6.27.41/init/main.c#L460. rest_init() then creates a kernel thread via a call to kernel_thread():
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/process_32.c#L233, which winds up asking do_fork():
- http://lxr.linux.no/#linux+v2.6.27.41/kernel/fork.c#L1314 to do the work, which brings us back to the topic of process creation.
- This call to kernel_thread is supplied an argument that points to the function kernel_init():
- http://lxr.linux.no/#linux+v2.6.27.41/init/main.c#L836. kernel_init() then finishes by calling init_post() at:
- http://lxr.linux.no/#linux+v2.6.27.41/init/main.c#L795, which attempts to invoke /sbin/init via run_init_process():
- http://lxr.linux.no/#linux+v2.6.27.41/init/main.c#L786, which asks kernel_execve():
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/sys_i386_32.c#L239 to create a new, in-kernel task, which calls sys_execve() from within the kernel:
- http://lxr.linux.no/#linux+v2.6.27.41/arch/x86/kernel/process_32.c#L670, which brings us full circle to loading a process image.
Scribe notes
- s1
- s2
- s3
Readings
- MOS: 10.3.5: "Booting Linux"
- An Ode to Real Mode Setup Code