Courses/Computer Science/CPSC 457.F2013/Lecture Notes/Scribe1



Links

Directory of Readings, LXR Guide, Code Examples and Slides for Class
Time Period Extra Readings LXR Guide & Code Examples Textbook Readings Slides Wiki Page
Week 1
  • MOS: 1.1: What is an Operating System?
MOS: 1.2: History of Operating Systems
MOS: 1.4: The Operating System Zoo
MOS: 1.5: Operating System Concepts
MOS: 10.1: History of Unix and Linux
MOS: 13.1: The Nature of the Design Problem (optional)
Week 2
2.1: Brief History of Intel 64 and IA-32 Architecture (skim this)
2.2: More on Specific Advances (paying particular attention to the description of the pipelined microarchitecture)
3: Basic Execution Environment (paying particular attention to 3.1 (and Figure 3-1), 3.2, 3.3, 3.4)
The Intel IA32 Developer Manuals, Volume 3A
2: System Architecture Overview (paying special attention to Figure 2-1, 2.2, Figure 2-3, and 9.1)
  • MOS: 1.3: Computer Hardware Review
MOS: 1.6: System Calls
MOS: 1.7: Operating System Structure
  • LDK: 1: Introduction to the Linux Kernel
Week 3
LXR - x86_64 syscall table data structure declaration - kernel code information
LXR - System Call Numbers
LXR - do_fork - kernel code process information
LXR - copy_process - kernel code process information
LXR - load_elf_binary - kernel code process information
LXR - type definition of mm_struct - kernel code information
LXR - type definition of vm_area_struct - kernel code information
LXR - mprotect system call entry point - example of using these vm area structure fields and properties (particularly the flags)
LXR - allocation routine for the page table directory - kernel code information
  • MOS: 1.6: System Calls (you should have read this already)
LDK: 5: System Calls (this describes how to add a system call to your OS, something you'll have an opportunity to practice in your tutorials later and in HW3)
  • LDK: 3: Process Management
MOS: 2.1: Processes (this should be partly review of what we've talked about in class)
MOS:10.2.5: Kernel structure
MOS: 10.3.1: Fundamental Concepts (about processes in Linux, should reinforce some class discussion)
  • MOS: 3.1: No Memory Abstraction
MOS: 3.2: A Memory Abstraction: Address Spaces

Week 4

  • Code - one.c - example code program for address translation
  • Code - Kernel Ooops - segment registers information
  • Code - greedysbrk.c - example code program that calls sbrk() to expand the heap memory region and write to it via memset() to exhaust available memory
Sample Run of Code - example code program run
Sample Run of Activity 1 - example code program run
LXR - declaration of mem_map array - virtual memory information for iterating over page frame meta-data for the purposes of allocation management
LXR - struct page - virtual memory information representing the items in the mem_map array (i.e., meta-data about the physical page frames)
Code - oom_kill.c - code
  • MOS: 3.2.3: Managing Free Memory
LDK: 15, pages 320-322: The Process Address Space
  • MOS: 3.3: Virtual Memory (this subsection also contains a review of paging and TLBs)
MOS: 3.7.3: Segmentation with Paging: The Intel Pentium

Week 5

Output - llist.c - example code output
LXR - do_brk routine - brk(2) system call
  • MOS: 3.4: Page Replacement Algorithms
MOS: 3.5: Design Issues for Paging Systems
MOS: 3.6: Implementation Issues

Week 6

Reading /proc/slabinfo - related topics from article
Additional Allocators in Kernel - related topics from article
SLEB - related topics from article
SLQB - related topics from article
  • Code - grab.c - example code program for demonstrating compiler-declared size of userland structures and how malloc operates by exploring the data "around" a returned pointer
Details for grab.c - more details for grab.c
  • Code - break.c - example code program that invokes mmap to try to allocate a single byte
Output - break.c - example code program output
brk(2) entry point - brk(2) system call
do_brk() helper function - brk(2) system call
helper routine to find vma intersection - brk(2) system call
find_vma routine - brk(2) system call
  • LXR - struct page - kernel memory management information for holding metadata about physical page frames (and metadata used by the SLUB allocator)

Week 7

LXR - value of current from the CPU state - code
O(n) snippet
  • MOS: 10.3.5: Booting Linux
  • MOS: 2.4.1: Introduction to Scheduling (plus 2 intro paragraphs in Section 2.4)
MOS: 2.4.5: Policy vs. Mechanism
MOS: 10.3.4: Scheduling in Linux or LDK: 4: Process Scheduling
MOS: 2.7: Summary (this talks about some things we'll consider next)

Week 8

  • MOS: 5.5: Clocks
  • LDK: 9: An Introduction to Kernel Synchronization
MOS: 2.2: Threads
MOS: 2.3: Interprocess Communication
MOS: 2.5: Classical IPC Problems
Beware typos in Figure 2-23, pg 122 (semicolon placement)
  • MOS: 6.2: Introduction to Deadlocks (NB: MOS 6.2.1 should be memorized!)
MOS: 6.4: Deadlock Detection and Recovery
MOS: 6.5: Deadlock Avoidance
MOS: 6.6: Deadlock Prevention
Livelock: MOS: 6.7.3

Week 9

  • ULK: 5: Kernel Synchronization
Code - urace.c
  • LKD: 10: Kernel Synchronization Methods

Week 10

  • MOS: 4.1: Files
MOS: 4.2: Directories

Week 11

  • ULK: 1: SS: An Overview of the Unix Filesystem (pp. 12-19)
  • MOS: 4.3: File System Implementation
MOS: 4.5: Example File Systems
MOS: 10.6: The Linux File System
  • LKD: 13: The Virtual File System

Week 12

man 5 fs
man 5 fstab
man 8 mount / man 2 mount
  • MOS: 5.1: Principles of I/O Hardware
MOS: 5.2: Principles of I/O Software
MOS 5.3: I/O Software Layers
  • MOS: 5.4.3: Disk Arm Scheduling Algorithm
MOS: 5.4: Disks
LKD: 14: The Block I/O Layer, especially subsection "I/O Schedulers"

Week 13


Introduction

Hello, my name is Carrie Mah and I am currently in my 3rd year of Computer Science with a concentration in Human Computer Interaction. I am also an Executive Officer for the Computer Science Undergraduate Society. If you have any questions (whether they be CPSC-related, events around the city, or an eclectic mix of random things), please do not hesitate to contact me.

I hope you find my notes useful, and if there are any errors please correct me by clicking "Edit" on the top of the page and making the appropriate changes. You should also "Watch this page" for the CPSC 457 class to check when Professor Locasto edits anything in the Wiki page. Just click "Edit" and scroll down to check mark "Watch this page."

You are welcome to use my notes, just credit me when possible. If you have any suggestions or major concerns, please contact me at cmah[at]ucalgary[dot]ca. Thanks in advance for reading!

Week 1

Introductions were set and we learned a bit about operating systems. Take note of the 3 Fundamental Systems Principles we talked about, in addition to the three problems we outlined.

We also learned some new commands for the shell.

September 13: What Problem Do Operating Systems Solve?

Why Operating Systems?

  • Abstract OS concepts relative to a mature, real-world and complex piece of software
  • Translates concepts to hardware with real limitations

Fundamental Systems Principles

  • OS is a term about the software that operates some interesting piece of hardware
  1. Measurement: observing the system -- its state, how fast it is running, and how correctly it is behaving
  2. Concurrency: how to make it do what you want it to do
  3. Resource management: memory, CPU, persistent data -- make it coherent and separate

Main Problem

1. Manage time and space multiplexed resources
  • Multiplex: share expensive resources
  • Ex. Network (both, more time)
  • Ex. CPU (time, a bit of space because of cache and state)
  • Ex. Hard drive (space because of data storage)
2. Mediate hardware access for application programs
  • Higher level code does not need to worry about hardware level, so the Operating System 'distracts' us
3. Loads programs and manages execution
  • Turns programs from dead code into running processes

What's a Program?

  • Source code: in any language, the algorithm in the source code
  • Assembly code: lower level code
  • Machine/binary code: code the machine understands directly
  • A process: program in execution

Example: addi.c

int main (int argc, char* argv[]) 
{
  int i =0;
  i = i++ + ++i;
  return i;
}
  • int argc: Number of arguments
  • char* argv[]: Array of pointers to the argument strings
  • Depending on the platform and compiler, i can end up anywhere between 0 and 4
  • The behavior is undefined: a failure of abstraction, because the compiler (not the language) decides the result

Commands

General Commands Presented in Class
Command Purpose Notes
<text_editor> <file_name> & Opens text editor in background
  • The & runs programs in the background
touch <file_name>.c Creates a text file
emacs -nw <file_name>.c Creates a text file using emacs
./file_name Runs file in the current directory
  • <file_name> alone is the machine code and usually ELF code
file <file_name> Asks the terminal what the file type is

Notes

Commands to Know Based on Lecture Notes
Command Purpose Notes
gcc -Wall -o <file_name> <file_name>.c Compiles and displays warnings
  • -Wall enables all common warnings
  • -o names the output file <file_name> instead of the default a.out
cat <file_name> Concatenates a file and displays to the terminal
hexdump -C Displays the raw bytes of a file (e.g., an ELF file) in three columns to the terminal
  • The columns are: offset, bytes in hex, then the ASCII translation
objdump -d Displays the disassembly of code in three columns to the terminal
  • The columns are: virtual address, byte code, disassembled code in assembly


Week 2

A lot of new commands were introduced. You should understand what the kernel is, what's in it and what isn't. Understand the registers from x86 architecture in addition to the segment registers. Also know about the GDT and IDT, interrupts, and the DPL.

In tutorial, you should have set up your VM and be comfortable with using SVN.

September 16: From Programs to Processes

Introduction to the Command Line

  • Pipes are used when the user wants the output from one command to become the input of another command
  • Some commands are programs, while others are functionality built into the shell
  • Bash builtins (commands compiled into the bash executable) are documented together inside the bash manual page, so you have to search within it
  • Ex. write, read
  • A way to exit a loop is Ctrl+C, however it is not magic
  • The terminal turns the keystroke combination into a signal (SIGINT) that the kernel delivers to the current foreground process, which by default kills it

Commands

Commands for history
Command Purpose Notes
history Displays history of all commands typed
history | more Splits the display of history in screen-sized to the terminal
  • Hit enter to continue to next screen
  • Hit Q/q to quit the process
  • history can be piped through other commands to produce a frequency distribution of your commands, a good way to study for testing purposes
history | gawk '{print $2}' | wc Specifies the history display by printing the second argument plus word count
  • gawk parses the input into records and displays three columns: the number, word, and parameter
  • print $2 prints the second argument (word)
  • wc is word count which displays three columns: lines, words and bytes
history | gawk '{print $2}' | sort Displays the sorted history of commands to the terminal
  • Sorting groups duplicate entries together (preparing the input for uniq)
history | gawk '{print $2}' | sort | uniq -c Displays the sorted history of commands and omits repeated lines to the terminal
  • -c prefixes each line with the number of occurrences
history | gawk '{print $2}' | sort | uniq -c | sort -nr Displays a reverse numerical sort of history and omits repeated lines to the terminal
  • -n sorts in numerical order
  • -r reverses the sort order
General Commands Presented in Class
Command Purpose Notes
ls Displays all contents in the current directory
cd <directory_name> Changes to the directory explicitly written
  • cd .. moves to the parent of the current directory
clear Adds multiple lines to the terminal screen to make it appear like a fresh new terminal screen
make <file_name> A make file controls the process of building a program
  • You do not have to use bash and press up repeatedly
exit Built-in shell command that terminates the shell
su Switches the user and makes a new sub-shell
  • Without arguments, the command automatically switches to root
stat Gets the statuses of files and provides more information
df Displays information of persistent storage (disk usage)
ifconfig Displays information about the state of network adapters
diff <file_name1> <file_name2> Produces a patch, or displays the difference between two files
  • Displays the difference from the previous source to the new source
yes Outputs a string repeatedly until killed
  • Useful for test cases, especially for loops

Notes

Commands to Know Based on Lecture Notes
Command Purpose Notes
man command Displays the UNIX Programmer's Manual section for the given command
ps Displays the processes that are currently running
  • The number is unpredictable and depends on the role of the machine
pstree Displays the relationship between all the software running on the machine
  • All processes descend from a single ancestor process, init
top Displays a list of what programs are running and what resources are being used
  • In class, the Windows environment and the terminal were running
grep yes Filters output (e.g., from ps) to lines containing "yes", which provides information like the process ID
  • Obtaining this information allows you to control the process
kill -<signal_number> <process_id> Kill sends a signal to a process
  • Signal numbers identify basic events that the OS can generate on behalf of hardware and deliver to a process
strace ./<file_name> Records what the program is asking the Operating System to do
objdump -d Displays the disassembly of code in three columns to the terminal
  • The columns are: virtual address, byte code, disassembled code in assembly
readelf -t <file_name> Displays the major portions of the file
  • ELF is a container of the different sections (contract between compiler and the Operating System)

Example: addi.c

int main (int argc, char* argv[]) 
{
  int i =0;
  i = i++ + ++i;
  return i;
}
  • Intel style: Data gets stored from the right hand side to the left
  • AT&T style: Data gets stored from the left hand side to the right
  • When this example is in assembly code, the AT&T syntax is used
  • If unsure whether a program is Intel or AT&T style, look at the next lines
  • In class, we saw that the destination was at the right hand side--thus data being stored into the right side

Example: hello.c

#include <stdio.h>
int main (int argc, char* argv[])
{
  printf("Hello, 457!\n");
  return 0;
}

makefile

all: hello
hello: hello.c
       gcc -Wall -o h hello.c
clean: ...
  • Typing make in the command line does the work (what you coded) for you

More Commands

gcc -S hello.c
more hello.s
  • printf: formatted output conversion
  • puts: output of characters and string
  • Compiler changes the behavior (as we wrote printf, but the hello.s file has puts)


readelf -t h
  • The .text section of the ELF file is where the program code actually goes


objdump -d h
  • In main, a call to <HEX><puts@plt> happens
  • puts@plt: procedure linking table
  • The compiler doesn't know the puts command; but it is linked to the C library
  • It jumps through a table of locations that the program can use (a common way of calling libraries)


strace ./h
  • Check to see if the right system call has been made (It has, which is write())


September 18: What is an Operating System?

Question Protocol

  • Get Michael's attention - politely yell something like "question!"
  • Think of it like: The lecture is the CPU and we are the interrupts
  • Michael likes conversation more than a one-sided diatribe

Readings

  • This is not a computer architecture class (recall CPSC 325/359: Computing Machinery II), but the architecture is the API (application programming interface) that the Operating System programs against
  • These are the control bits and pieces of the hardware that allow the Operating System to do things
  • Illustration of all the layers right down to the Operating System, and a nice tour of the loading mechanism (how a program becomes a process)
  • Good introduction of how system calls are made
  • Very amusing. Try to code along with it
  • MOS: 1.3 - Computer Hardware Review, 1.6 - System Calls, 1.7 - Operating System Structure
  • Finish Chapter 1 to reinforce lecture information

Recap: September 16 lecture

  • Command line agility and transformation of src code to an ELF file
  • Looked at command line introduction to look at how the source code is transformed into a binary
  • Learn a bit about a few topics (breadth first)
  • Example of Cross-Layer Learning, which is a gradual progress on several inter-related topics at once

Reasons

a) Gives overview of some commands from the command line
b) Gives an idea of the rich variety of commands (from ps to sftp) of what the OS supports
c) Demonstrates the use of the UNIX manual (man(1) command)
d) Uses it to drive a discussion of processes (ps, pstree, top)
* Leads to what the OS does (which basically manages processes and allows multiple processes seemingly at the same time)
e) Observes the transformation of code to binary
f) Observes the coarse-level detail of an ELF file (i.e. sections)

Learning Objectives

  • Intuition for what an OS kernel is, and what its roles and defining characteristics are, by placing it in the context of managing the execution of processes on 'top' of some hardware
  • Manage several processes

Operating System Definition

  • Authors
  • MOS: Hard to pin down, other than saying it is the software running in kernel mode (the core internals)
  • LDK: Parts of the system responsible for basic use and administration, including the kernel and device drivers, boot loader, command shell or other user interface, and basic file and system utilities. It is the stuff you need -- not a web browser or music player.
a) A pile of source code (LXR)
b) A program running directly on the hardware
c) Resource manager (multiplex resources, hardware)
d) An execution target (abstraction layer, API)
e) Privilege supervisor (LDK page 4)
f) Userland/kernel split

System Diagram

(system diagram image)

The Operating System Kernel

What is the Kernel

  • Piece of software

  • Part of it is machine language specific to what particular CPU you are running on (Linux runs on many CPUs, so many parts of it are specialized)
  • It is a pile of C code and a pile of assembly code
  • Interface to hardware

  • The kernel has the subsystem "driver manager" which are self contained pieces of software that talk to a particular piece of hardware
  • Mediates access and service requests
  • Supervisor

  • Central location for managing important events
  • Ex. A microkernel is a different style from Linux: a small piece of core code that uses various user level processes to get the work done
  • A piece of privileged software that manages important devices
  • Abstraction

  • Exports some layer of abstraction to programs
  • System Call API: the set of system calls (they look like function calls into the kernel, but are not ordinary function calls)
  • Set of system calls: exports a stable interface to the application programs
  • Ex. A generic operation like write, where it writes some number of bytes to a file
  • OS: important piece of functionality so it implements some number of system calls and will be part of the public interface
  • Services exported to processes
  • Protection

  • The isolation between processes
  • What if processes want to simultaneously run? They cannot have it all at the same time. So who gets ownership of the CPU?
  • The OS kernel decides! It has to somehow enforce isolation/protection between processes
  • Also regards the example: "should process1 manipulate/read/write the state of process2?"
  • Typically not...but sometimes yes
  • The OS can't do it alone. It must rely on the architectural layers (which will be talked about later)

What Does Not Belong in the Kernel

  • The shell is not in the kernel, because it does not have to be
  • The shell is nothing more than a program, which becomes a user level process and gets equal time as everything else

What is in the Kernel

  • To decide what needs to be in the kernel, consider:
  • Complexity of the kernel
  • You could shove all services and major components into the kernel (fast, because of quick access and no separation between kernel components, and it can do anything since it is privileged), but the result is unpredictable and difficult to understand and debug
  • Typically do not want to add code in the kernel -- restrain yourself
  • Mistakes in the code you write inside the kernel (despite how good you are) affect everyone else, especially the scheduler
  • Kernel must be able to handle failure conditions. It needs to be stable
  • Ex. Bugs in the drivers with vulnerabilities; can take down the machine due to errors in kernel
  • Note: No hard and fast rule defines what components make up the kernel
  • Still need small pieces of assembly code that manages parts of the CPU
  • Make a fundamental design decision: Monolithic Kernel vs Micro Kernel
  • Monolithic Kernel: All major subsystems (driver management, etc.) goes into the same executable image. It shares code and data, and there is no isolation
  • Micro Kernel: Makes the division explicit (like processes between different users)

Example: From ELF to a Running Process

./h "Hello, 457!"
  • How does the pile of code (from objdump) become an executable/a running entity?


strace ./h
  • Tool for measuring the state of a process, particularly measuring the interaction between user level processes with the kernel via a set of system calls
  • Allows us to observe that interface in action
  • Sees all the system calls of the life of a process

Output of strace from h Example File

  • Each line is a system call record (an event)
  • Lines are represented as such
(1) Name of system call
(2) Arguments
(3) Return value/result
  • First thing the ./h does is call execve("./h", ["h"], []) = 0


execve()

(1) "./h": the program to execute -- the kernel takes the contents of the ELF file and loads it into a portion of memory
(2) ["h"]: an array of pointers to the command arguments
(3) []: a pointer to the environment (check it using env on the command line)

  • The shell has forked (a system call) itself and created an exact duplicate of itself
  • In its exact duplicate, it's still called execve


fork(2)
  • Creates a new address space
  • execve(2) in the child and loads into the new address space


brk(0)
  • Manipulates the process address space -- manipulates the memory available to a program
  • System calls are sensitive functionality. You do not want another program to be able to call execve on your process and load a new image, so the sensitive set of system calls is privileged functionality
  • Maps in various libraries into process code


write(1, "hello, 457!\n", 12hello, 457!)
exit_group(0) = ?
  • To understand write(), type in man 2 write in the command line instead of man write
  • Arguments:
(1) Writes a file descriptor -fd
(2) Pointer to the message you want to output
(3) Number of bytes
The extra hello, 457! after the 12 is just the output repeated again
  • Implication: If this is the functionality the kernel is offering, then somewhere inside there has to be the code that satisfies the functionality (later we'll learn why it's special and privileged, etc.)

Question & Answer

Q:

What does write(2) mean?

A:

It means the system call write, which is in section 2 of the UNIX manual
The notation is not a call function write with a parameter of 2; it's the system call write

Q:

How do you know the system call (2) is in the right section?

A:

For reference, see the Wikipedia page on man(1)
Section 1: General commands, command line utilities
Section 2: System calls
Section 3: Library functions, covering the C standard library

Q:

What's a shell?

A:

The answer has been written above, but to reiterate:
It is nothing more than a program, which becomes a user level process and gets equal time as everything else


September 20: System Architecture

  • The relationship between the kernel and the system hardware environment

News

  • Homework 1 released, due October 7
  • Before you post on Piazza, please read previous posts or ensure there isn't already an ongoing discussion
  • Difficulty: intermediate
  • Homework 2 and 3 will be more difficult and be released before October 7
  • SVN, virtualbox
  • Everyone should have a CPSC account and SVN
  • USBs are available. Need to install software on it

Looking Forward from September 18 lecture

  • How does a kernel enforce these divisions?
  • Divisions from imaginary (diagram) to something real
  • Architectural support that allows us to enforce the separation. If it didn't occur, every process would be as privileged as the kernel (bad situation) for a general purpose multi-user system
  • There is a cost for enforcing the division. If you're in Mars Rover, it's unlikely that Linux is being run; but if you're on a desktop (isolation between users and kernel) it's a reasonable thing to do
  • What's the physical support from CPU to make it a physical reality?
  • What is the necessary architectural support for basic operating system functionality?
  • How do processes ask the operating system to do things?
  • The kernel, which is a pile of software mostly in C code and various forms of assembly code. It contains critical functionality, mostly for managing hardware
  • When you use the command | command, the kernel is involved: two processes can't communicate without the kernel (nature of separation)
  • The kernel is layered (which is a design consideration and makes it modular)
  • Interaction: the kernel isn't executing processes, it sets conditions for Process1 to get a little bit of time to run on the CPU
  • Process1's code gets run by the CPU and the OS says "you've had enough time"

Learning Objectives

  • Look at the IA-32 architecture, which is a bit different from the Tiny Guide to Programming in 32-bit x86 Assembly Language
  • Distinguish between execution environment for userland processes and the programmable resources available to the system supervisor (i.e. the kernel)
  • Understand that when the CPU is in protected mode, the privileged control infrastructure and instructions, are available to a piece of system software running in 'ring 0'
  • Understand the role that interrupts and exceptions (managed via the IDT) play in managing context switching between userland and kernel space, and understand the role that the GDT and segment selectors play in supporting the isolation between the kernel and user space

The Kernel

Example: Viewing the kernel and Linux

  • Kernel 2.6 is the actual kernel source, which was about 86 MB
  • Linux 2.6.32 is 62 MB in zipped format, and 365 MB otherwise
  • Different Linux versions change some parts of the kernel
  • There are a bunch of directories (this is the root of the Linux tree)
  • As you can see, a lot of it is in C code
  • arch: architecture (Ex. x86)
  • crypto: crypto api
  • kernel: main kernel. Has things like the scheduler
  • mm: memory management code - manages address spaces of process
  • fs: file system - lots of them
  • net: networking code - shared memory
  • virt: virtualization support
  • scripts: scripts for controlling to build the kernel (which is large)
  • drivers: devices, etc.
  • To view the immense code: more net/netfilter/nf_conntrack_<FILE_NAME>.c
  • LXR guide
  • Take the source code and dig through it; lxr is a nice tool
  • Navigable. It uses the source code as input, and resolves all types so they are clickable
  • Encouraged to use for homework

Refresh of x86 Execution Environment

  • What do programs think they have access to when they execute? We'll connect it to what the kernel has access to

Example: silly.c

#include<stdio.h>
int main(int argc, char* argv[])
{
   int i = 42;
   fprintf(stdout, "Hello, there\n");
   return 0;
}

makefile

all: silly
silly: silly.c
       gcc -Wall -g -o silly silly.c
clean:
       @/bin/rm silly *.o

More steps

gdb ./silly
disassemble

System Diagram 2.0

(system diagram image)

Segment Registers and GDT Diagram

(segment registers and GDT diagram image)

Registers

General Registers, Index, Pointers and Indicator

  • eax, ebx, ecx, edi, esi, ebp, esp, eip, eflags
  • General purpose registers except for eip, ebp, esp
  • eax: return value register
  • esp: contains address that points to top of stack
  • ebp: keeps track of the frame pointer for the stack -- points to middle of the activation records that help structure the stack
  • eip: instruction pointer
  • User-level programs cannot access
  • Ex. Must use call instruction or branch. Can't write directly into it from user space
  • eflags: implicit side operands. States of what's happening
  • Code generated from silly should use registers except ebp, esp, eip, eflags
  • Stack is not in the CPU. The CPU knows where it is with esp

Segment Registers

  • These registers hold segment selectors (the stack's selector is stored in ss)
  • Use of these things is one of the mechanisms that supports the kernel and user space
  • Can see the content of things -- they look different (address space from general purpose registers vs decimal values for segment registers)
  • ss: holds the segment selectors of the stack
  • cs: segment selector of the code segment
  • ds: segment selector of data segment
  • gs, fs: general purpose

Global Descriptor Table

  • Content: offsets into a data structure in memory. In particular, the Global Descriptor Table (GDT)
  • The GDT entries are segment descriptors
  • A segment is a portion of memory
  • Part of the CPU's job is to provide access (or interface)

Segments

  • Segments have important properties
  • Start (base)
  • Limit (how large is the segment)
  • Descriptor Privilege Level (DPL): 2 bits that define the essence of a segment
  • The DPL is part of the information in the segment descriptor in the GDT
  • It is the only thing standing between the kernel and the user level
  • Major part of the architecture to protect the kernel from user level programs
  • Recap: These two bits located in some table in memory, pointed to by some registers, are the only thing that separates the kernel and user-level processes (the physical reality)
  • If you want to write a small kernel, it has two jobs:
1. Set up the GDT
  • Provides different memory segments and address spaces to split off from the user level programs and kernel (enforces isolation/separation)
2. Set up a second table called Interrupt Descriptor Table (IDT)
  • Kernel services events from users and hardware
  • Bombarded by interrupts from the user and hardware underneath so the kernel does something interesting about it

Interrupts

  • The CPU can service interrupts from both hardware devices and software
  • Hardware interrupts come from devices that need to tell the kernel they have data to deliver to a process
  • Software interrupts: software-initiated interrupts, often called exceptions
  • Exceptions can happen because of programming errors (divide-by-zero) or important conditions (executing an x86 int3 instruction)
  • The network doesn't issue an interrupt to the kernel, but to the CPU (this is an interrupt)
  • The CPU doesn't know what to do -- it does the fetch, decode, execute cycle
  • So the OS gives it an answer: depending on the interrupt vector (identifier of interrupt), it will look it up in kernel memory: the entry for that particular entry value (has address of function in kernel that knows how to handle that particular interrupt)
1. Can be invoked from userland (Ex. int3: software interrupt)
2. Can come from programs, the same process as before
  • 0x80: software interrupt, a pointer to a routine that handles system calls (more detail next session)
3. Interrupts that can happen because the program code executing has a program (Ex. divide-by-zero)
  • The CPU attempts to do it, then raises an interrupt which has to be serviced somehow
  • Allows the kernel to do interesting things in the hardware and software
  • Enforces some isolation between the kerl and user level processes
  • IDT and GDT are tables that live in kernel memory and contain entries that:
  • Are pointers to kernel function (IDT)
  • Are addresses of portions of memory (and size) and privilege level (GDT)

Descriptor Privilege Level (DPL)

  • Two DPL bits allow separation (four levels: recall the privilege rings)
  • 0: OS kernel
  • 1,2: OS services
  • 3: user level, applications
  • There are no physical rings: they don't exist, just the bits in the descriptors do
  • The kernel's address space (the kernel is just another program) has code and data; the memory segments containing that code/data have descriptors with 00 in the DPL bits
  • User level code is mapped to descriptors that have 11
  • So the memory containing a user-level program's code is marked 11 (the processor observes this information because it knows where the GDT is stored)
  • Say the program tries to access kernel memory, which is mapped with DPL 00
  • As the CPU attempts to execute the instruction, it sees the difference in privilege level and does not allow the user program to execute in the kernel

Concluding Points

  • The Interrupt Descriptor Table Register (IDTR) holds the address in kernel memory of the IDT
  • The IDT contains 256 entries of 8 bytes each (2048 bytes)
  • The Global Descriptor Table Register (GDTR) holds the location of the GDT
  • This machinery underlies the operating system's ability to function as a multi-user system (may be tested on the diagram)
  • Understand the organization of the IDT and GDT
  • The CPL bits in the CPU and the DPL bits in segment descriptors separate user and kernel mode

Continuation of silly.c

break *0x80483fc
run
info reg
  • The address chosen to break at is a mov instruction
  • info reg: current state of CPU (contents of user-visible registers) -- see the location of the stack

Back To Navigation

Tutorial 1: Introduction to VirtualBox

  • Robin's website
  • Find the Virtual Machine image in /home/courses/457/f13
  • Will receive a USB key and you have to download "CPSC Linux" on it (see Piazza)
  • Do not modify anything in the VM
  • The VM is "read access" only

Back To Navigation

Tutorial 2: Introduction to SVN

  • Subversion revision control
  • Documentation for SVN from Apache
  • Purpose: keep versions of software in a repository and control the versions of software, documentation, or files
  • If you're on a team and several people are modifying the code, this is the perfect environment for controlling each version
  • Use SVN to submit the assignment inside the VM

Directory for the SVN

https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>
  • If you need help, e-mail cpschelp[at]ucalgary[dot]ca

Most Used Commands

  • To check if SVN is installed on your computer, type in svn in the terminal and a message should appear
  • Create a folder (Ex. SVN), as it will be the mission control
svn status
  • Structure of the repository
svn info
  • Information about the repository
svn log https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>
  • Shows your comment, the revision number (important when you forget the number and want to restore a version) and the time
svn checkout https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>
  • Set up the SVN
  • Upon success, you will get Checked out revision #
  • You now have communication with the repository--remember the folder you created is the one communicating with the repository (Ex. SVN)
svn add <FILE_NAME_TO_ADD_TO_REPOSITORY>
  • Adds to the repository (usually use before commit)
  • You will get A <FILE_NAME> when added successfully
svn commit
  • Uploads the file into the SVN repository
  • If you get an error about an editor, type svn commit -m "<INSERT_TEXT_TO_EXPLAIN_ACTION>"
  • Upon success, you will get Committed revision #

Guide for SVN

Beginning Steps

svn mkdir https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>/<DIRECTORY_NAME> \
  --username <YOUR_USERNAME> -m "<MESSAGE_TO_DISPLAY>"
  • Now, cd on your local drive to the folder you want to add to repository
svn import <DIRECTORY_TO_ADD_FROM_LOCAL_DRIVE> \
  https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>/<DIRECTORY_NAME>/<DIRECTORY_TO_ADD_FROM_LOCAL_DRIVE> \
  --username <YOUR_USERNAME> -m "<MESSAGE_TO_DISPLAY>"
  • This is now uploaded to the repository
  • Now you need a "Working Copy," which is a copy on the local drive that is used to compare the version to the one stored online; and this is how commit knows that changes have been made
  • You should be in a directory that contains all your projects on SVN
svn checkout \
  https://forge.cpsc.ucalgary.ca/svn/courses/c457/<YOUR_USERNAME>/<DIRECTORY_NAME>/<DIRECTORY_TO_ADD_FROM_LOCAL_DRIVE> \
  --username <YOUR_USERNAME>
  • Now your "Working Copy" is downloaded from the repository to your local drive

Additional Steps (most used)

svn add *
  • Any files not already under version control (i.e. added into the repository) will be added to the repository; after this, you can just use commit
svn commit -m "<MESSAGE_TO_DISPLAY>"
  • Any changes made to the files that already exist in the repository will be updated. If you have created new files, renamed or deleted them, then you have to add them.
  • If you work in the "Working Copy" and commit from that folder on the local drive, the changes will be made directly to your repository!
  • Alternatively, you can disregard all <DIRECTORY_TO_ADD_FROM_LOCAL_DRIVE> and when you checkout, you should be working in the folder that was transferred to your local drive

Back To Navigation

Week 3

September 23: System Calls

  • If system calls are privileged functionality, then HOW can applications invoke them?

Key Questions

  • How does a kernel enforce division between userland and kernel space?
  • Motivation: without a clear line and transition mechanism, any program would be free to do anything -- which does not make for a safe and stable OS
  • What is the necessary architectural support for basic OS functionality?
  • How do processes ask the OS to do stuff?
  • Basic architecture support made available by a CPU helps isolate the kernel from processes by:
1. providing memory separation and
2. providing a mechanism for processes to invoke well-defined OS functionality in a safe and supervised fashion

System Call API: Roles

  • Processes cannot do useful work without invoking a system call
  • In protected mode, only ring zero code can access certain CPU functionality and state (Ex. I/O)
  • OS implements common functionality and makes it available via the syscall interface (application programs do not need to be privileged and they do not need to reimplement common functionality)
1. An API: stable set of well-known service endpoints
2. As a "privilege transition" mechanism via trapping

System Call API Diagram

hvT0h0a.png
Bigger image
  • System call API
  • A set of well-defined service endpoints
  • A mechanism to transition across the invisible virtual line
  • Front half of the API
  • Looks like a set of function calls
  • Ex. write(2), fork(2), execve(2), brk(2)
  • Back half of the API
  • User level programs aren't worried about what device they're writing to
  • The kernel has the details of how to accomplish that for a specific file, in a specific file system, on a specific device
  • The back half is written in the kernel; user code has to jump into the kernel but is not normally allowed to do so
  • The user's memory pages are marked with DPL 11, and the goal is to transfer to pages marked 00
  • So how do you actually accomplish it?

The Set of System Calls: Functionality of System Calls

  • About 300 system calls exist on Linux. The actual definitions are sprinkled throughout the kernel source code; the syscall numbers are listed in:
/usr/include/asm/unistd_32.h

Review: Calling Functions on x86

Stack Diagram

SHVMXlm.png
Bigger image

Example: silly.c version 1

#include<stdio.h>

int foobar(int a, int b)
{ 
  return a+b;
}

int main (int argc, char* argv[])
{
  int i = 42;
  i = foobar(7, 8); // Function call
  fprintf(stdout, "Hello, there\n");
  return 0;
} 

Running the program

gcc -Wall -o silly silly.c
./silly
objdump -d silly
  • Where does main call foobar(7, 8)?
  • A call instruction to an address (0x8048412). It's a jump to that location
  • Look at the instructions before the call: there are agreed semantics for how a caller invokes a callee
  • main sets up part of the activation record; foobar sets up the rest
  • To chain activation records, the previous ebp is saved on the stack
  • The current value of esp is then stored in ebp
  • Arguments are pushed in reverse order; the first one pushed is 8
  • Each 'box' is 4 bytes, so 12 bytes in total
  • foobar adds 7 and 8, and eax stores 15
  • Pop ebp from the stack and return
  • Calling a function branches to that location and executes

Example: silly.c version 2

#include<stdio.h>
#include<unistd.h> // Used for syscall

int foobar(int a, int b)
{ 
  return a+b;
}

int main (int argc, char* argv[])
{
  int i = 42;
  i = foobar(7, 8); // Function call
  write(1, "hi from syscall\n", 16); // System call
  fprintf(stdout, "Hello, there, %d \n", i);
  return 0;
} 
  • The call to write in main is at address 0x8048462
  • It's not actually invoking the system call directly. We could try, but we're only users and do not have a high enough privilege level
  • The code the compiler generated is a jump to the PLT entry
  • This invokes a routine in glibc called write(2) and redirects to it
  • If we strace the program, we see a call to write(2)
  • Eventually glibc, which is C code, has to invoke the system call
  • It can't just call the kernel's sys_write. It has to do something else
  • It speaks to the machine (literally to the CPU and OS, without all the junk)

Example: write.asm

;; write a small assembly program that calls write(2)
;; write(1, "hello, 457\n", 11)
;; eax/4 ebx/1	ecx/X edx/0xb
;; 4 for write, 1 for stdout, ecx holds the pointer to the string, 0xb (11) for the length
;; Where to get string? Data section!

BITS 32
GLOBAL _start ;; declare where it starts. Need entry label to program as no main

SECTION .data ;; create for memory address
mesg	db		"hello, 457", 0xa	;; NASM doesn't tolerate \n	
					
SECTION .text ;; thinking back to ELF; this is the key part
_start:
  mov	eax, 	4 				;; holds number of system call
                                                ;; How to know what numbers to use? 
                                                ;; more /usr/include/asm/unistd_32.h
  mov 	ebx, 	1 ;; stdout
  mov 	ecx, 	mesg				;; memory address to String
  mov 	edx, 	0xb
  int 	0x80
  • How to invoke a system call? Can't just do call write
  • Convention: pass the arguments somewhere the OS can access them, then invoke a control flow transition
  • Goal: code running at privilege 11 needs to reach code at DPL 00
  • How it is accomplished: invoke an interrupt instruction
  • INT is in a code page at ring 3, but the CPU is going to execute it
  • When the CPU sees the vector number (INT N), it knows it's an interrupt
  • Upon bootup, the kernel creates the IDT with a list of registered interrupt handlers (the entry for 0x80 is a function pointer to the code inside the kernel that deals with system calls)
  • The CPU services the event, changes the privilege level to 0, and transfers control to the kernel
  • The kernel runs on behalf of the process
  • There is no direct control transfer from the program or glibc
  • An exceptional condition happens, which is what the CPU notices

Build and Run Program

nasm -f elf32 write.asm
file write.o
ld -o mywrite write.o
./mywrite
  • Which system call should we invoke, and with what arguments? Put them into registers
  • Goal: make an executable, which can run through gcc but want to try without
  • From this, we get an error: Segmentation fault
  • We need to execute the right system call and also a way to exit a process

Add to End of Program

mov     ebx,    eax
mov 	eax, 	1 		;; exit system call
int 	0x80
  • int 0x80: nothing special
  • Form is int, then some arg (an interrupt vector)
  • Some are reserved for handling errors but this is an arbitrary number

Why Write an Assembly Program with System Calls?

  • Run program more efficiently
  • Get direct control of machine - no need for libraries, etc. that normal programming gives you
  • Make small programs
  • Don't have a programming environment with programming infrastructure
  • Want to talk directly to machine
  • Answer: writing malware
  • Injecting code (system call) into somebody's process
  • Hopefully have program execute it
  • Since you're talking directly to machine you can do interesting things

Incorrect Answers

  • Write drivers
  • Drivers are nothing more than a set of functions and data in the kernel
  • Typically you write drivers in C: parts may speak machine code, but there are no system calls in a driver, as you're already in the kernel
  • Make interrupts
  • This is userland code. Interrupts are kernel functionality (interrupt service handlers) and eventually get translated into x86

Example: mydel.asm

;; try to call unlink on ?
;; unlink("secret.txt") is our goal

BITS 32 ;; 32-bit program

GLOBAL _start ;; defines symbol

SECTION .data
filename	db	"secret.txt", 0	;; NUL-terminated: the kernel expects a C string

SECTION .text
_start:
  mov		eax, 	0xA
  mov 		ebx, 	filename
  int 		0x80
  mov 		ebx, 	eax
  mov 		eax, 	1
  int 		0x80

echo "secret" > secret.txt
man 2 unlink
locate unistd.h 
more /usr/include/asm/unistd_32.h

Question & Answer

Q:

Isn't add a system call?

A:

No -- add is not privileged functionality, nor does it change memory it doesn't own
Instructions that use memory locations within the process address space can be used directly. Some instructions are privileged, and user space cannot access those

Back To Navigation

September 25: Process Creation and the Process Control Block

Main Observation

  • There needs to be some way for the OS to expose the CPU to programs (the process abstraction)
  • The OS exists to load and run user programs and software applications

Continuing Discussion Themes

  • Consider the completion of the procedure (how an ELF file becomes something the OS can run)
  • Look at processes and observe how they are created and destroyed from the system call layer (recall destruction from last lecture)

Terminology

  • Task
  • Generic term referring to an instance of one of the other things
  • Commonly used terminology for a running process
  • Process
  • Program in execution, with program counter
  • CPU records what's happening at current instruction
  • Has instruction pointer (what instruction are you executing next), CPU context (state of CPU that talks about current computation)
  • Set of resources, such as open files, meta data about the process state
  • Files stored on hard disk; OS can help keep track of which file they're currently accessing
  • Memory space (i.e. process address space)
  • Thread
  • Does not have all the heavyweight stuff that a process does
  • Some number of threads of control exist within a process
  • May have its own stack and instruction pointer
  • Lightweight process
  • Linux doesn't make a distinction between Process and Thread
  • Difference: lightweight processes may share access to the same process address space

Process Control Block

  • Meta data recording the state of execution, CPU context, memory data, etc. in kernel memory (a number of data structures)
  • task_struct: a C type in the kernel
  • The essence of the process control block
  • A set of metadata about a running process/program
  • Refer to LXR: sched.h (1215-1542)
  • The instantiation of struct task_struct is normal C syntax for declaring a complex data type
  • The struct holds a number of fields that are the meta data
  • Notable: not heavily commented, on purpose, because comments would dirty the source code (usually you comment the why and what, not the how)
  • 1216: volatile long state;
  • A key piece of meta data: the OS needs to know whether this process is allowed to have the CPU
  • It is physically impossible for all processes to use the CPU at once, as it cannot do more than one thing at a time
  • Not all processes are capable of running or ready to execute
  • A process might be waiting on a different process (inter-process communication)
  • Or on some form of I/O (disk, or even a slow user)
  • 1271: struct mm_struct *mm, *active_mm;
  • Pointer to set of memory regions, thus memory pages that define a process address space (i.e. where code is stored, dynamic data)
  • 1274: int exit_state;
  • Holds the status the process hands over when it exits (e.g. returning -1 or 0 from main, or calling the exit system call)
  • The value goes to the kernel through the system call layer, and it lands here
  • 1288: pid_t pid;
  • Process identifier
  • 1296-1308: task_struct parents and list_head children and siblings
  • Relationships between processes
  • 1345: const struct cred *real_cred;
  • Ensures a user can't manipulate another user's processes

Process States

  • Recall line 1216: volatile long state;
  • This field is a piece of information that helps the OS decide whether a process should be given the CPU
  • The existence of this field implies that a process has a life cycle (a set of different states)
(a) TASK_RUNNING: process eligible to have CPU (i.e. run)
(b) TASK_INTERRUPTIBLE: process sleeping (Ex. waiting for I/O)
(c) TASK_UNINTERRUPTIBLE: same as above, ignores signals
(d) misc: traced, stopped, zombie

Process Diagram

JQKKoly.png
Bigger image
  • A process is created through the fork system call
  • A shell is running with PID 1234
  • During its existence it calls fork(2), creating a duplicate of the parent process
  • Process 1235 is its child, and bash runs in it
  • When the kernel code that creates the process completes, the result is in a runnable state
  • Semantics of the state: the process is eligible to get the CPU, but doesn't necessarily have it
  • A list of runnable processes stores them; the kernel switches between processes
  • The kernel scheduler goes to the child's process control block, takes the stored context, and loads it into the CPU (registers and instruction pointer)
  • The OS is no longer in control and the process is running on the CPU
  • Death occurs when the process terminates, giving up control by killing itself
  • A process can voluntarily wait/sleep through system calls that do I/O; this transfers control to the kernel
  • When a process asks the OS to do something, the decision about eligibility occurs
  • The process may be waiting for the call to come back from the kernel

Process Relationships

  • Why does pstree exist? Why isn't the process list flat?
  • Process creation: the system starts up, and one of the results of start-up is the init process
  • It has process ID 1, and roots the process tree
  • Every other process is a descendant of init
  • init runs start-up scripts (each implemented as an ELF file)
  • Each is some program that exists somewhere (Ex. /bin/silly)
  • If init started silly with execve alone, init would no longer exist
  • execve overwrites the current address space with the new image
  • Convention for spawning processes: fork(), then in the child do an execve()
  • Duplicate the process, then immediately replace the duplicate with new code
  • One of init's children is {gnome-terminal}
  • A bash session has a super user shell and pstree
  • The parent of pstree is some bash shell, which runs in a terminal, whose ancestor is init

Process Creation in Linux

  • Processes come from processes, via fork(2) (actually clone(2))
  • This system call creates a (near) copy of the "parent" process
  • (parent) /bin/bash -> fork(2)
  • -> (child) /bin/bash
  • Isn't it wasteful to duplicate the process, then replace it with execve?
  • It should be quick, as fast process creation is important
  • Three mechanisms make process creation cheap:
  • Copy On Write (parent and child read the same physical pages)
  • When you only read memory, you can share it; if you write, you get your own copy
  • Lightweight processes (share kernel data structures)
  • vfork: the child process uses the parent's PAS, and the parent is blocked
  • The fork-execve pair (sometimes just fork) does the bare minimum
  • fork duplicates the task_struct and patches the control block (not the ID)
  • struct mm_struct: fork doesn't actually duplicate the memory pages; it keeps the same references
  • To create a process/thread, just invoke fork and both parent and child will share the mm_struct reference (the set of memory data)
  • What happens when the child dies, but the parent needs to gather final info on it?
  • This is the case in the diagram. The situation occurs when a parent spawns a child and wants to wait for its exit status
  • If the kernel serviced the exit system call from the child by erasing it completely, the parent would never get an answer to its question
  • A process that has called exit() is in the zombie state (this keeps the bookkeeping around for the parent)
  • What about when the parent dies but the children live on?
  • The children get a different parent -- the process is re-parented to init

Example: myfork.c

#include<stdio.h>
#include<unistd.h> // for fork()

int main(int argc, char* argv[]) {
  pid_t pid = 0;
  pid = fork();

  if (0 == pid) {
    // Child
    fprintf(stdout, "Hey, I'm the child\n");
  } else if (pid > 0) {
    // Parent
    fprintf(stdout, "Hey, I'm the parent. Child's pid is %d\n", pid);
  } else {
    // Fail
  }
  return 0;
}
  • pid gets the result from fork
  • If it succeeds, there will be two copies of the program running side by side
  • To distinguish the two, look at what fork returns
  • 0 is returned in the child, the child's pid in the parent, and -1 on failure

Output

Hey, I'm the parent, the child's pid is 4668
Hey, I'm the child
  • Two processes are running
  • Parent and child are both instances of a.out
  • The parent of a.out is bash
  • Relationships are maintained through the use of fork
  • Even with just fork, a program can do interesting computations

Back To Navigation

September 27: The Process Address Space (Memory Regions)

Thoughts

1. Much of what an OS does is:
  • Namespace management: allocate, protect, cleanup/recycle
  • CPU, memory, files, users
  • If we expect these services, then the kernel must contain code and data supporting it
2. What is a process?
  • PCB (task_struct)
  • CPU context (Ex. eip (not shared))
  • Virtual CPU
  • Miscellaneous (files, signals, user/id)
  • Memory space, mm_struct (possibly shared)
  • Virtual memory
3. What makes the lines real?
  • P1 | P2 | P3 | ... | Pn

Process Address Abstraction

Multiplex Memory Diagram

SvKrODH.png
Bigger image
  • Processes have different sizes: a simplistic approach of statically partitioning physical memory can't work, so virtualization is supported (the diagram is a fabrication)
  • Contract with user level programs: the top-most gig of the address space is reserved for the kernel's own code and data structures
  • User level code/data is able to map from 0 to 3 GB
  • Kernel data goes on the other side of the line (from 3 GB up)
  • The process's code lives on the user side of the line
  • The address range is a name space with 3 billion things in it
  • When a process is created through fork, and you want to execute:
  • The OS takes the executable's contents and writes them into the space
  • Heap: dynamically allocated memory, vs. Stack: automatic (local) variables

Bottom of Diagram (Figure 3.1 in slides)

  • For finding stuff in an address range
  • Left: segmentation
  • Causes CPU to talk to a particular segment
  • Linux says there are 4 segments; only difference is the DPL
  • Right: paging
  • Support for virtual memory
  • Input is a particular address or number in the range
  • Every program believes it has unique and independent access to the virtual range
int x = 0x1000;

vs

main(...)
{
  x = 0x300;
}
  • x can be overwritten
  • There is no conflict, but what makes them different programs? The OS sets up distinct address spaces
  • Note: Know the difference between an ELF file and its objdump output
more one.c
objdump -t one | grep michael
  • Basically the symbol named michael stores 0xDEADBEEF
  • The symbol michael has an address; by copying one to two, the same variable of the same name is at the same address
  • Same address, but there's supposed to be no conflict

Example: Look in hex editor for one.c

  • Letters swapped for DEADBEEF (EF BE AD DE)
  • Replace with FECA01C0

Side note: Endianness

  • This is a side note I have added that I hope will be of assistance to you
  • Different systems store information in memory in different ways
  • The two ways would be Big-Endian and Little-Endian
  • Gulliver's Travels


Memory Offset   0x0   0x1   0x2   0x3
Big-Endian      DE    AD    BE    EF
Little-Endian   EF    BE    AD    DE


  • We hacked the ELF and ran the two programs at the same time
  • They were at the same (virtual) addresses, but their values now differ
  • Recall the diagram: a translation process occurs
  • Processes are given their own space: the physical memory is different even though the virtual address may be the same
  • Internally, a virtual address is mapped to some physical location
  • The two programs don't conflict; they don't even need to know the other exists
  • The OS has to allow this to happen
  • Imagine you have 4 gigs of RAM with 140-odd processes
  • Consider that each process believes it can address 4 billion things
  • P1 believes it can use everything, but then P2 would only get a subset
  • If all processes shared raw memory equally, they would overwrite each other's data
  • So can you always allocate memory? Not necessarily. What if the address space is full of processes? Where are you going to get your new memory?

Why have a PAS abstraction?

  • Giving userland programs direct access to physical memory involves risk
  • If you give every process raw physical memory, they can read and write arbitrarily
  • Memory is multiplexed between multiple processes; we don't want them to have direct access
  • Hard-coding physical addresses in programs creates obstacles for concurrency
  • If you didn't have a PAS and gave programs direct access, hardware details would be baked into the code, as you're talking about hardware
  • This is problematic. What if you pull a RAM chip out? You'd have to change the program, and this isn't good
  • If you want to write a program and have it run for years, you don't want to keep changing it
  • Most programs have an insatiable appetite for memory, so statically segmenting available physical memory still leads to overcrowding
  • Virtual memory size (example from the class machine): 240 GB
  • Even though the machine has only 8 GB of physical memory, it can support more

PAS Preliminaries

  • Physical memory is organized as page frames
  • A PAS is a set of pages
  • Pages are collections of bytes
  • MOS: "PAS is an abstraction of memory just like the process is an abstraction of the CPU"
  • Page translation mechanism and virtual memory enable a flexible mapping between pages and page frames

Process Address Space

  • PAS is a namespace
  • PAS is a collection of pages
  • PAS is really a set of linear address ranges
  • This set of virtual address ranges need not be "complete"
  • In Linux, "linear address range" is memory region, and reflected in kernel code. Memory regions do not overlap

How to Multiplex Memory

  • OS is using most of its memory, most of its time

Example: two.c

pmap 12008
  • Print out the internal organization of a process' address space
  • Elements inside address space
cat /proc/self/maps
  • Gives access to the raw data
  • /proc is a virtual file system with views into kernel data; self resolves to this process' ID
  • Stack, heap, etc. appear in relatively the same places
  • Programs such as /bin/cat have their own code and data, but also depend on libraries
  • Two libraries: ld and libc
  • Why does ld appear 3 times, and libc 4 times? Look at the leftmost columns
  • 2nd column: permission markings for the particular entry/component of the process address space
  • The PAS is composed of memory regions, which map to a data structure in the kernel
  • The PAS is made up of regions that have types: code and data
  • libc
  • r-xp: code
  • r--p: no write - global, data (constants)
  • rw-p: static data
  • Heap and stack are roughly where you expect them to be: leftmost column (start, end address)
  • The regions don't overlap - memory regions are distinct from each other
  • They can't bleed into each other, and the middle is where dynamic libraries go
  • Text can't bleed into data
  • Values of those addresses run from a low number to a high number
  • Write 3 billion in hex: you can deduce the answer from this
  • Max value of the start of the stack: 3 GB - 1 = 0xbfffffff
  • 0xbfffffff + 1 = 0xc0000000: you can load this value into a pointer and try to read/write it
  • But you are not allowed to touch these addresses in a user-level program

PAS Operations

  • fork(): create a PAS
  • execve(): replace PAS contents
  • _exit(): destroy PAS
  • brk(): modify heap region size
  • mmap(): grow the PAS; add a new region
  • Create or operate on memory regions (components of PAS)
  • munmap(): shrink PAS; remove region
  • mprotect(): set permissions

break.c

  • Uses mmap() to get a new memory area that is executable, readable, and private (not a file mapping, just raw memory)
  • pid = 12029
cat /proc/12029/maps
  • PAS: the program sleeps, then calls mmap(), which modifies the picture. mmap returns a pointer to an address (the new memory region)
  • Before the call it's not in the list. Look again after the call
  • The new region is marked in the maps output, but has no file mapping
  • May be on homework. In particular, generating code at run time is supported this way

lxr 1065-1097

  • struct mm_struct is declared here
  • It is the pointer to the PAS - where all the data is
  • It logically splits up the address space
  • It helps the OS keep track of where the PAS is and where all the memory regions are (in a list)
  • The rest records where the memory regions are located
  • The struct is a container for all memory regions
  • vm_area_struct *mmap: the first data member, pointing to a particular memory region
  • Each region has a start, an end, and a set of flags, and is maintained in a list
  • The kernel gets the data (that was displayed) from here

Question & Answer

Q:

(From class example) Left is 140 x 4 GB, right is 4 g. So how do you access it all?

A:

Use the hard disk to expand memory (stuff you're not using is kept on the hard disk); that's how you store more than fits in RAM
How does the kernel store a virtual range? We'll get to this later

Q:

(From the r-xp in libc) What does the p stand for?

A:

p is private, and sometimes there's s which means shared
Lightweight processes can share the same address space, which is defined as a thread
Note that /bin/cat is not sharing anything

Q:

Why is there a gap in the addresses between ld and libc?

A:

When you put them directly next to each other, it doesn't allow you to load things in between. In particular, guard pages are put there (in case there's an overwrite, or an attempt to read or write past the end, which C and Assembly allow). The OS can do this too, before anything interferes with the code
If you ask the OS to allocate a memory region and it allocates regions next to each other, they are in essence the same thing

Q:

libc has ---p which is not read, write or execute. What's the deal?

A:

Theorize! Particularly, look at the OS and load the glibc to find out

Q:

Can you tell us more about the test?

A:

Read the textbook (mainly MOS)
No sample exam or previous ones (review provided for the final)
Can ask people who took the course before, but the questions won't be the same
Style: similar to the survey at the beginning of the course -- short answer questions
Likely no writing code (not fair without a compiler)
3-4 questions as it's a 50 minute exam. It won't be a long, laborious test or have lots of questions
Conceptual question
Monday's content included

Back To Navigation

Tutorial 3: Introduction to C

  • C is a small procedural programming language, hence fast
  • Used for:
  • Embedded software
  • Open-source projects
  • Operating systems
  • C is not a low-level language (low-level means machine languages like x86, ARM); it's a high-level language

Simple Data Types

  • char (1 byte)
  • short int (2 bytes)
  • int (either 2 or 4 bytes)
  • Depends on the compiler, but for most compilers it would be 4 bytes
  • long int (either 4 or 8 bytes)
  • Most compilers would be 8 bytes
  • float (4 bytes)
  • double (8 bytes)
  • Size of variables important because you're allocating memory (next tutorial)

Arithmetic Operations

  • Similar to Java and C++
  • Arithmetic: + - * / = ++ -- %
  • Conditional: M M- -- !- <- <
  • Boolean: && || !
  • Bitwise: ~ & | ^ << >>
  • Use any IDE you feel comfortable with
  • Emacs, vim, nano, gedit, Sublime Text

Text Replacement Pre-processing

  • Define macros by using #define, and used as form of simplification
#define MAX_SIZE 10
  • Another form of text-replacement pre-processing is header files (.h). The C library declares its standard functions in header files

Header files included

  • stdio.h: standard input and output
  • printf, scanf functions
  • The cplusplus.com reference has a lot of documentation for these functions

Compiler: GCC

  • Review manual
  • gcc -v or man gcc for compiler options

Makefile

  • Use to automate the compilation of programs
  • Made up of rules
target: source
	(recipe to make target from source)

all: example1

example1: example1.c
	gcc -o example1 example1.c

Back To Navigation

Tutorial 4: Introduction to C

  • Create a makefile, and understand structures
  • Learn about pointers, what they mean and how you use them

Back To Navigation

Week 4

September 30: Memory Addressing in the Process Address Space

  • Concept of memory addresses
  • Very process-based
  • Don't get overly worried about this particular mechanism as a possible test question (it's very technical)
  • Important to know the translation, as it informs our understanding of virtual memory

Recall

  • Process address space: place where it stores its code and data
  • Abstract notion in Linux, is made up of memory areas ("memory regions" in Linux terminology)
  • Memory areas are collections of memory pages belonging to a particular address range
  • PAS has dynamic memory areas not present in the ELF, and a process can dynamically grow and shrink
  • A page is a basic unit of (virtual) memory; a common page size is 4 KB
  • Small contiguous range of memory
  • Memory areas and regions are composed of pages, which have data
  • How does your program address a piece of memory?
  • Naturally it includes a memory expression (Ex. mov eax, 0x0904abcd;) and the CPU finds wherever "0x0904abcd" is, moves it to eax and executes

Memory Addressing Diagram

6CJ4Ddq.png
Bigger image
  • When the CPU is fetching from the .text section (in the RAM)
  • What mediates the communication between CPU and RAM is the Memory Management Unit (MMU)
  • Asks RAM to "read these bytes at this address"
  • Tricky, because the OS space-multiplexes resources like RAM, permitting multiple processes to use a single resource
  • Causes problems:
1. If memory could magically expand at zero cost, we wouldn't need any mapping -- just allocate new addresses
2. Physical memory is limited (clearly every PAS isn't going to fit into physical memory at once)
  • Somehow OS, in cooperating with CPU, need to look up what's at the location

Example: one.c

  • x_michael variable lives at an address (0x8049858)
  • Apparently these variables are located at the same (virtual) address in every running instance
  • Particularly objdump -t one | grep michael
  • See Figure 3.1
  • Part of internals of the MMU
  • Primary task is to translate the "address" into an actual physical location in RAM
  • This circuitry enables the OS to have different processes occurring

Definitions

1. Logical address

  • Parts: [ss]:[offset]
  • Segment selector, 32-bit offset
  • Valid ss stored in segment registers
  • ss is an index to GDT (a data structure the OS sets up when it boots, which keeps track of all the memory segments)
  • An address is implicitly prefixed by the code segment when mapped (Ex. [cs]:0x8049858)
  • cs points to a piece of memory where it describes its beginning/ending and privilege level (segment descriptor user level code)
  • starts at x -> limit is y (look for the address)
  • In Linux, x is 0 and y is 4 GB usually, so you just look at the 32-bit offset
  • See address: ignore logical address, and start at linear/virtual address


2. Linear address/virtual address

  • Virtual address space is simply a range
  • Can be split up and placed in different physical locations
  • Address and bytes next to it don't have to be physically contiguous in RAM, because there's a mapping process (just logically is contiguous)
  • Not an address -- looks like an address (32-bit number) but isn't treated like an index into one contiguous space
  • Can treat it like a collection of offsets or addresses (just a resource), so that you can split up the physical memory
  • Dir, Table, Offset
  • Relationship between a 32-bit number is mapped to a physical location (using pages)


3. Physical address

  • Actual location in real RAM
  • Expect it to get recycled and reused, but is not visible to compiler
  • Good thing as processes shouldn't be hard coded
  • Compiler believes there is a logical address and linear/virtual address
  • MMU translates from 1 (logical address) -> 2 (linear/virtual address) -> 3 (physical address)

Example: Translate an address to a physical one

  • How does this process work?
  • Start with a linear address: 0x08049858 which is split up into three different parts
  • Bits depend on the paging scheme you'll be using
  • Pages are typically 4 KB (4096 bytes = 2^12, addressed by 12 bits)
  • Offset: 12 bits, Table: 10 bits, Dir: 10 bits
  • Contract between MMU and hardware mechanism
  • MMU does translation and fetches the correct data at the correct place in memory (supported by OS)
  • OS needs to know the Page Directory and the Page Table because the calculation is indexing into the data structures
  • Live in kernel memory, and kernel has Page Table
  • pgd_t * pgd; (lxr 214)

Pointer to where the page directory lives

  • Know where page directory is

Example Diagram

0El19Gj.png
Bigger image
  • How many entries are in it? 1024
  • Use the first 10 bits of linear address to find the entry (which is a location of a page table belonging to the process) referred to
  • Next 10 bits index into that page table, telling which actual page is being referred to
  • Last 12 bits index into the page itself
  • Enables the MMU to translate at machine speed, to find the physical address
  • Actually comparatively slow: reads/writes into three data structures in memory
  • Solution: instead of the MMU redoing the translation every time, cache the result
  • Translation lookaside buffer (TLB) maps virtual addresses to physical ones
  • It sits at the beginning of the procedure: the MMU checks it to see whether it has already translated this address
  • TLBs are small, and can't cache all of memory translations, but it speeds the procedure
  • Caching: program does 90% of work, only 10% of code executed
  • Numbers aren't a physical address
  • cr3 maintains the location of the pgd, and each entry holds the physical location of the next-level structure

Question & Answer

Q:

The Page Directory is specific for every process, is that true for TLB as well?

A:

Yes, the page directory is unique to every process, as are the TLB contents. In particular, when you change processes, the context switch (takes one process off the CPU and puts a new one on) partly involves invalidating the TLB, flushing out its contents, and letting it fill with the new process's translations
Don't want to swap too quickly, or overhead happens
Some TLBs have support for marking/tagging TLB lines on a per process basis

Q:

Virtual address maps to a unique physical location. How do you actually get a unique location?

A:

Part 1 of that question is not entirely correct as there is freedom to change mapping. Virtual address is an illusion, it makes processes believe it has access to the full 4 GB range
Part of the OS that changes the Page Tables and underlying mapping is paging, which is part of virtual memory
OS changes mapping when appropriate (makes room for processes)

Q:

What happens in a context switch?

A:

Data structures are maintained for every process -- they're kernel data structures, which are alive for the lifetime of a process
The CPU and MMU know which Page Table to use for the translation

Q:

Page Tables are for every set of processes. Are the Page Directory and Tables sitting in the physical memory?

A:

Yes. It's part of kernel memory, and there's a design trade-off: the kernel sets up an execution environment and lets the processes run
What if you don't want to use half a gig of memory? This procedure lets you have sparse addresses as not all processes use the full 4 GB of memory

Q:

You switch out processes enough to write RAM on disk. What about Page Tables?

A:

Some OSes do that, but Linux does not. In particular, the kernel's page tables and related structures are pinned into memory and do not leave it

Q:

In disc swap, do you have to update the table?

A:

It has to update the mapping, potentially updating page tables and directories. But you only swap out discrete chunks (pages), so the overall structure of the mapping doesn't change

Q:

The kernel memory space gets a certain amount. When a new process is created, is a new page directory and table created as well? What if so many processes exist, wouldn't the kernel not exist as well?

A:

When a process is created, data structures have to be created. We know how they're created (fork(2)), where the address space is copied. Replicate the process control block; the child's mm_struct points to the parent's, so pages are copied on demand (copy on write)
What happens when you create so many processes? Old days, it hangs the machine as it tries to allocate memory to every single process. Some kernels run out of memory, and it can't do anything
Response varies: the kernel could crash (early UNIX days) or be self-aware enough (like Linux) that it's running out of memory, so it tries to kill processes. Also, resource limits prevent it from happening

Back To Navigation

October 2: Test Review Material

  • I wrote this section to lay out what I need to understand (note: does not cover everything in the notes). I hope this is helpful for you regardless

Week 1

  • Fundamental Systems Principles: measurement, concurrency, resource management
  • Problems: time and space multiplexed resources, hardware, load programs and manage execution
  • Difference between hexdump and objdump

Week 2

  • Pipes: output from one command is input for another
  • gawk: parses input into records
  • pstree: displays relationships of processes
  • strace: records system calls and signals used in a program
  • AT&T style: source is to the left-hand side
  • OS definition: software (source code) in kernel; program running on hardware; resource manager; execution target; privilege supervisor; userland/kernel split
  • What is the kernel
  • Software: C and assembly code
  • Hardware interface: driver manager talks to hardware
  • Supervisor: manages important events
  • Abstraction: system calls, services export to processes
  • Protection: isolate processes to ensure a process isn't being incorrectly manipulated
  • Not in kernel: shell because it's a userland program
  • What to put in kernel
  • Complexity: hard to debug, so avoid adding code despite being faster
  • Mistakes: affect compiler, happen despite how good you are
  • Monolithic (subsystems all in one kernel image, no isolation) vs Micro Kernel (divided into separate servers)
  • General Registers, Index, Pointers and Indicator (eax, ebx, ecx, edi, esi, ebp, esp, eip, eflags)
  • Segment registers: supports kernel and user space, decimal values vs address space for general purpose registers (ss, cs, ds, gs, fs)
  • Global Descriptor Table: offsets into a data structure in memory, has segment descriptors
  • Segments: portion of memory with start, limit, DPL (Descriptor Privilege Level)
  • DPL: 2-bit definition for segment, only thing protecting kernel from userland
  • To write a small kernel
  • GDT: enforce separation
  • IDT: services events from users and hardware
  • Interrupts from devices (data to deliver to process) and software (initiated interrupts, often exceptions)
  • Interrupt sent to CPU, OS looks up interrupt vector in kernel memory
  • Can be invoked from userland, come from programs, programming errors
  • IDT: pointers to kernel function
  • GDT: addresses of portions of memory and privilege level
  • CPU checks for the DPL of a program to ensure privilege level is appropriate

Week 3

  • System call invokes software interrupts, sent to the CPU, and transfers control to kernel
  • System call API: well defined entry points to transition across the invisible line
  • Privilege level is not high enough, so for example write(2) jumps to the PLT, invokes the wrapper in glibc, and invokes the system call
  • To get from DPL 3 (binary 11) to DPL 0 (binary 00), invoke an interrupt:
  • CPU services event, changes privilege to 0 and transfers control to kernel
  • System calls in programs: efficient programs, no programming environment, talk to machine, write malware to inject code
  • Processes: running software application vs Userland Processes: programs in execution
  • Task: instance of either of these (refers to a running process)
  • Process: program in execution with program counter and other elements
  • Thread: threads in a process
  • Lightweight process: thread is a process, and vice versa. Difference is that lightweight processes may share access to some process address space
  • volatile long state: is process allowed to have CPU?
  • Process states (running, interruptible, uninterruptible)
  • struct mm_struct: pointer for memory regions
  • Process created through fork(2)
  • Duplicates parent process and result is in runnable state
  • init is root of tree, has start-up scripts
  • execve: overwrites current address space with image
  • Spawns process by fork(2) then execve(2) from the child
  • Copy on Write (reads same page)
  • struct mm: parent and child process (through fork(2)) shares the reference (sets of memory data)
  • If a child dies and parent needs information, it will get no answer; vice versa, child gets a new parent
  • Child is in zombie state, as the process is killed and information is still required of it
  • Process abstraction: OS exports view of virtual CPU (snapshot of executing state of a program)
  • Physical memory has page frames
  • Process Address Space: set of pages (bytes)
  • PAS abstraction of memory like process is abstraction of CPU
  • PAS is a set of linear address ranges (memory regions don't overlap)
  • Linear address space -> memory regions -> memory descriptor
  • Memory region: set of consecutive pages
  • Page: set of linear addresses and the data
  • fork(): OS takes content and writes in space
  • Processes given own space: physical memory (virtual address -> physical location)
  • PAS abstraction: Do not give programs direct access to physical memory
  • Multiplex memory: heap, stack doesn't bleed with each other
  • Use hard disk to expand memory

Week 4

  • Paging: allows a process to see all of its virtual address space without that much physical RAM being installed
  • Memory Management Unit: transforms virtual address to physical address (reads bytes)
  • Logical address: [ss]:[offset]
  • ss index to GDT (keeps track of memory segments)
  • cs: maps address implicitly. Points to a piece of memory describing privilege level (0 -> 4, look for 32-bit offset)
  • Linear/virtual address: a range that can be split up and placed in different locations
  • Mapping process for the collection of offsets of addresses
  • Physical address: actual location in real RAM
  • Page Directory specific for every process
  • TLB also unique (it's at the beginning of MMU to cache memory and speeds up process)
  • Context switch (switches processes on/off CPU)
  • Page Directory and Table sits on physical memory

Back To Navigation

October 4: Virtual Memory

Recall

  • Process Address Space
  • Memory addressing (Eg. MMU's segmentation and paging support)
  • Virtual memory
  • Paging/page replacement algorithms

Example

  • objdump on an a.out file
  • See do_work and main
  • What is mov DWORD PTR ds:0x80496dc, 0x1?
  • Logical memory address because prefixed with it is a segment selector register, that points to GDT
  • Virtual address (series of bits): 0x80496dc
  • Where is it in physical memory? No idea
  • Physical location may change arbitrarily - do not care where variable winds up in the physical realm

Problem Origins

  • See Figure 3.1
  • Segmentation and paging break link between physical location and variables
  • Provide isolation between kernel and userspace programs (use of segment descriptors, etc.)
  • Independent process spaces become real because each process gets a virtual address space
  • Processes cannot access same physical (or virtual) memory location without asking kernel for help setting up IPC (threads, shared memory, sockets, or file I/O)

Focus Question

  • If all processes believe they have 4 GB of memory, but only 2 GB on the machine, the OS doesn't keep all memory spaces
  • All processes believes it has its own virtual address space
  • OS should support the illusion that you can run as many processes on the machine
  • Implication: Math doesn't add up
(1) Only have a limited amount of RAM
(2) On average, NNN processes
(3) Processes don't know how much memory they will actually need
(4) Each believes they can access at least 4 GB

The Gap and Problems

  • Total memory demands outstrip physical supply
  • # of processes * 4GB (worst case on 32-bit)
  • 140 * 4GB (picture above) ~= 560 GB
  • But only 8 GB of physical RAM on the box
  • Internal fragmentation (segments)
  • External fragmentation (pages)
  • Heaps grow dynamically. How to keep the new physical addresses associated with the contiguous virtual addresses?

Seeds of a Solution

Memory Hierarchy Diagram

YbIpqXV.png
Bigger image
  • Array of storage and capacities of devices
  • Hierarchy of storage locations has properties that vary as you go up or down in this stack
  • Registers: Logically every variable declared should live in a register, but that's not true: CPUs only have a limited amount of registers
  • Small number of them, but they're fast (next to data path, access within 1-2 cycles)
  • CPU cache: frequently accessed instructions are close to CPU
  • Primary memory: inexpensive, cheap primary memory available, but needs are never satisfied
  • SSD: Need to go over a set of wires, a bit slower
  • Rotational disk, optical disk: slower, but larger
  • Logically want a variable to live in a register, but the variable can actually live anywhere in the stack (must get it into the right place at the right time)
  • Insight behind virtual memory
  • A process sees as much memory as it can physically address, and the OS knows where the data actually is

History/Previous Approaches

  • Swapping
  • Only have so much physical memory, so the disk holds all of the processes; take one process at a time, load it into the physical memory that's available, run it, then stop it, swap it out, and do the next job (i.e. on a context switch)
  • Space multiplexes resources
  • Limited size box, put a program at a time in a box
  • Problematic for dynamic memory allocation
  • Suboptimal - swap an entire process in and out of available physical memory which is slow and has lots of overhead
  • Programs need to be specific about memory, and programmers don't like that restriction: "Why should I care or worry about how much memory to allocate?"
  • Overlays
  • Programmers split the program into fixed-sized chunks so the OS knows how to deal with one unit at a time
  • Force you to define program as a linked list: run the first bit of code, swap, then so on
  • One overlay points to the next; the program occupies the same memory for its lifetime
  • Take small physical memory and run many different programs and knows what the next one is
  • Programmers need to do the work of splitting
  • Not satisfactory. Program does work and doesn't solve problem of unpredictable memory
  • May allow us to move in hierarchy in a predictable fashion

Virtual Memory

  • More elegant version of overlays
  • As an OS, memory can be addressed or manipulated in these fixed sized chunks called pages
  • OS has conversation with memory in terms of pages
  • Memory reference: talk to memory a page at a time
  • Allows amortized overhead (swapping in physical memory)
  • Particularly, there's a gap and a layer of indirection to insert
  • Pages have standard size, and maintain the collection of pages (subset of entire program) in memory as the "working set" of a process
  • If need to use physical page frame underlying any page, can swap that page out to disk and bring in the new, needed page
  • If working set is relatively stable, and most programs don't use their full virtual address space, virtual memory can multiplex primary memory enough to maintain the fiction implied by the PAS abstraction
  • Page address translation mapping between logical page and physical frame doesn't need to be consistent
  • Logical page can move around between physical frames, and the way the OS keeps track is by updating the corresponding page table entries
  • Adding this layer of indirection is fundamental for virtual memory. Logical pages that are absent from physical memory can spend time on some other storage medium (i.e. disk)

Virtual and Physical Memory Diagram

Bva0Wxw.png
Bigger image
  • Primary memory can be organized into fixed sized chunks called pages
  • Limited amount. Using memory circuitry, break down a particular virtual and physical page
  • Any page from any process, could occupy the physical page at any point in time
  • Every process can have as many pages as it can address (store variables, etc.) and now we juggle: when a process runs, any code and data is referencing and paged, and lives in physical memory. If not, will get it for you (happens in start up)
  • ELF, OS reads and creates PAS and a set of virtual pages, and store content somewhere in physical memory
  • 140 processes later, physical memory looks like a mish-mash of virtual pages from a bunch of different processes
  • Mostly works because not every process (even though it can address 4 GB or more) actually uses that memory -- greedy, but not that greedy
  • Processes access a relatively similar set of pages - notion of working set
  • OS help keep those hot virtual pages in primary physical memory
  • Illusion that virtual memory contains all code and data, and is readily accessible
  • But when you need to load in a virtual page and it's not in physical memory, a Page Fault occurs
  • MMU does that translation and uses it to index in the page tables, and the last translation from page table entry to physical backing page doesn't resolve -- entry is null
  • Error condition. OS notices this, and has other storage where you can store pages that's been swapped out of memory
  • In disk, storing virtual pages not in primary memory, but now you have to bring it back to a physical memory
  • Swapping a new page means servicing a page fault and reading the data in from disk (two costly procedures), but alternative is more expensive: full swapping or inflexible overlays


  • Have an instruction (recall the mov at the beginning of class) which lands in the CPU; the CPU executes it (fetch/decode cycle) and looks at the address. It needs to write 1 into this memory address, so it takes that value and sends it to the MMU: "MMU, do you know where this is in physical memory? Where should I put it?" The MMU obliges and does the translation process
  • Different outcomes:
(1) Virtual address is actually present in physical memory
  • Instruction completes: value 1 is sent over memory bus and written to some physical location (fast)
(2) Virtual address is not present in memory
  • MMU does look up and the pointer is null. CPU notices that it can't resolve the virtual address to a physical one: thus exceptional condition
  • Exceptions handled by OS kernel. CPU raises memory exception through IDT. Part of OS (function in kernel) deals with page faults that register in IDTs
  • Different types of page faults
  • May not be mapped into a physical page yet. In that case (1) free physical frame is available or (2) physical memory is used up
  • (1) OS goes to disk, grabs contents of virtual page, maps to free physical frame and is happy
  • (2) Full, so OS decides which of the physical pages to evict - takes contents to write to disk and swap new virtual page into recently evicted physical frame
  • Instruction like load memory address n - goes through page table translation
(1) Notice no pointer of physical page, so traps OS (i.e. CPU raises exception)
(2) OS gets involved and the page is in either physical memory or not. If not, SIGSEGV (illegal memory reference) occurs. If on hard disk, get it and read it in
(3) Try to find a free page. If so, write it in and update data structures that describe the mapping between virtual and physical, then re-execute instruction in userland

Linux OOM Killer

  • May not have any memory left, which is a bad situation
  • The OS exists to allow programs to execute. If it can't, that's bad: no memory can be given back, and if everything is stuck, no progress can be made
  • Picking a victim is heuristic. Avoid:
  • Processes that might be sensitive (i.e. owned by root)
  • 0, 1, or kernel thread directly accessing hardware
  • init - would make it unstable
  • Processes that access hardware - causes damage to hardware as it's sensitive
  • Long running job - can lose lots of progress despite its accumulation of states
  • Pick:
  • Victim with lots of pages to reclaim (+children)
  • Possibly low static priority

Effective Access Time

  • How much do page faults degrade performance?
  • Can be slow: don't want to have a lot of page faults
  • Don't want to be common access case despite having support and programs not aware that data is living on disk (disks are slow relative to registers and memory)
  • Can calculate on this diagram. Particularly, get an idea on how poor systems perform by looking at page faults and the cost of it
  • EAT is sum of two components:
  • Normal performance: (1-p) * access_time
  • Worst case/fault performance: p * pf_overhead
  • Where p is the rate of page faults [0..1]
  • Generally gets idea of how well system is working

EAT Example

  • Overhead is large relative to access time - involves invoking an interrupt
  • Diagram overhead: disk read, service page fault including initial interrupt like 'no page table', potentially goes to disk and page frame, and restarts vs
  • Access time: normal memory access time
  • 40x more (even at a low rate): avoid this. Even if 1/1000 memory references winds up as a page fault, it is costly
  • Don't want to get in a situation where system is servicing page faults.

Example: greedy.c

  • malloc forever: allocate memory repeatedly and check the result to see if it's NULL
  • Exits if allocation fails; otherwise keeps going forever (soaks up memory quickly)
  • Go to System Monitor
  • Memory + Swap History: 2 GB memory on VM, using less than 10% and swap has half a gig (file on disk somewhere) and not using any of it
  • Good state with no page fault services
  • How to make it use more? See code
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char* argv[])
{
  char * c = NULL;
  for(;;) 
  {
    c = (char*) malloc(100 * 1024 * 1024);
    if (NULL == c)
    { exit (-1); } 
    else 
    {
      /* touch every byte so physical frames are actually consumed */
      memset(c, 'X', 100 * 1024 * 1024);
      fprintf(stdout, "Address of c is: %p\n", c);
    }
    sleep(1);
  }
  return 0;
}
  • Memory goes up. What's going to happen? Swap will be used (still in usable state)
  • Eventually terminal says "killed" as it is a waste of its time (OS rescues)

Question & Answer

Q:

If there are no free frames available, what happens?

A:

OS has a decision: it looks at one of them and picks a victim
Kicks an occupied one out, and writes it (saves it for later)

Q:

What if the hard disk (reserved for extension of physical memory) and the physical memory are full?

A:

Problem: can't do anything - allocate more memory, create new processes, or make progress with current processes
OS fails all requests for memory. All such system calls will return NULL, -1, or errors (not really a solution)
You have to make some space somehow. Pick a different victim: a process or a few to kill, and reclaim the space they're using in swap or primary
Ex. 150 processes, OS makes a decision on some heuristic/criteria and kills the process and reclaims the pages
If physical memory full, not a lot of space was made; bad state of computer to get in (and should avoid it)
Try to control this as a user: make swap reasonably big so it's several gigs
Buy more memory, but can't do a lot of rational things
For heavily loaded systems, usually there is redundancy or load balancing: if one goes down, the others can still handle requests. Hopefully the user system should never get into that

Q:

What if the OS dynamically increases the swap spaces?

A:

Swap usually has a fixed size
Potential downsides: If OS is willing to increase at run time, you'll just feed the addiction of the process (you give it more disk, eventually you run out again)

Q:

What happens if you write a page to a bad block?

A:

Physical devices aren't perfect. Sometimes disk has manufacturing flaws, so not all portions of it are valid
Deal with it: the OS keeps track of disk blocks that are no longer responsive and tries to avoid writing data there
Modern OSes usually maintain a bad-block list for exactly this purpose

Back To Navigation

Tutorial 5: Introduction to Writing Shell Code

  • Shell writing: provides text-only user interface for Unix-like OS
  • Primary function is to read commands that are typed into a terminal window and then execute them

Input/Output

  • Can use FILE type for performing I/O operations
  • FILE type is a pointer to a particular file

Error handling

  • Always check return values. Read man pages to know what a function returns (some return NULL, some -1)
  • Study the different signals when running into errors in programs

Writing UNIX Shell

  • Always use manual for functions not familiar with
  • strtok(): divide string into sub-strings according to a token or character
  • Treat each token as an entity (to get user input like cd <directory>)
  • memset(): Way to put a value in an address
  • Use arrays

Back To Navigation

Tutorial 6: Introduction to Writing Shell Code

  • Finish shell coding example

Back To Navigation

Week 5

October 7: Page Replacement

Recall

  • Talked about virtual memory, what it is (illusion of having entire PAS in memory at same time)
  • Not true because limited physical amount of memory in machine
  • Saw how to put pages on the memory hierarchy
  • If physical memory is full, choose a page to evict to swap

Memory Management Recap

  • Process Address Space: How does the kernel support the ABI memory abstraction?
  • Virtual construct that an OS makes available to every process in the machine
  • Provides a sense of isolation and protection between kernel and userland
  • Memory Addressing: Why don't programs' manipulations of variables conflict/collide?
  • Look-up mechanism - where in physical memory is an instruction stored?
  • Maintain independent process address spaces
  • Map similar or same virtual addresses to different physical locations
  • Virtual Memory: How does a kernel keep all processes in memory at once?
  • Access to PAS and also an unlimited amount of it
  • Figure 3.1: Entry in Page Directory
  • Paging: Page tables set up, and every entry has a special format
  • Information in the entry supports paging but also efficient page replacement

MMU as a Function

  • Look at pseudo-code in slides
  • CPU executing an instruction
Ex. MOV 0x123, 0x1
  • CPU gets a sequence of bits/bytes and activates certain parts of the CPU, and store value in variable
  • CPU begins to execute the instruction - needs to know where to write into memory
  • Systems without virtual memory - physical address
  • But this would be a logical address. CPU must convert to a physical location
  • Takes ds:0x1234 and needs MMU to translate it
  • logical_addr.segsel: look up in register list, get logical address (ss: offset)
  • Value in ds - gdt_idx used to look up into the GDT (in kernel memory)
  • Fetch ss from GDT (in process, may talk to memory) and store in segdesc
  • privcheck(segdesc.dpl, cpl): DPL (limit to check for privilege - make sure the CPL matches with DPL) checks if instruction can access memory. Hopefully yes. If not, CPU generates exception and a segmentation violation
  • virtual_addr: if successful, logical address translates to virtual address (take base attribute descriptor, add offset of logical address -> virtual address)
  • physical_addr: take virtual address, pass through page tables (function that maps virtual address to physical location) and return a physical address
  • isvalid(physical_addr): CPU checks if it's valid. If not (points to null page - no allocation to an address), a page fault may be thrown
  • Checks privileges, etc., important in servicing page faults that help us with page replacement
  • tlb.save(physical_addr, virtual_addr): if valid, derive the mapping from virtual to physical and ask the TLB part of the MMU to cache the relationship. This ensures future instructions don't have to talk to memory to resolve the physical address
  • memory.do(physical_addr, op, N): finally, put out to the memory bus - read/write to/from the physical address
  • Memory addressing supports the notion of separate process address spaces - the OS must set up the data structures that the page-table walk relies on to execute successfully
  • MMU support is needed to provide the abstractions programmers believe they have access to on a modern OS

Bits: Contents of Page Table Entries

Paging Diagram

o57M9fc.png
  • Translation has to do with looking up entries in the data structures
  • The 32-bit virtual address is split 10/10/12 - one index for each level of the page tables
  • pgd -> page directory for a particular process
  • First 10 bits index into the page directory (the entry points to a page table)
  • Second 10 bits index into that particular page table
  • The page table entry points to an actual page, and the last 12 bits index somewhere into the page (some byte)
  • Entries also keep track of metadata important to the OS, communicating the state of the page
  • Valid page? Recently accessed/written?
  • Hardware can set some of the bits, so the OS can make an informed decision about whether to kick a page out or not
  • 20 bits: page frame index
  • Present bit: is this virtual page currently in a physical frame?
  • Protection bits: R/W/X (sometimes R implies X) e.g. a code page may have R set but not W
  • Dirty bit: was this page recently written to?
  • Accessed bit: hardware sets this; the OS must unset it
  • Used to determine candidates for swapping
  • User/Supervisor: accessible from ring 3, or supervisor-only

Page Replacement

  • Virtual pages that belong to different processes
  • Virtual memory allows us to divorce each virtual page from the actual physical frame
  • RAM has a limited number of physical page frames and the idea of virtual memory is to manage the mapping
  • Virtual memory allows us to expand the memory to other devices (hard disks, potentially network)
  • If all spots are taken, should VP5 stay on disk? No, that does no good - you want it in memory so you can reference it
  • Which to take out of memory and store into disk?

Picking a Destination Frame

Free Frame vs No Free Frame

Free frame
  • OS selects a free frame, fetches the page's data from disk, and writes the page into the frame, updating the indexing as needed

No free frame
  • OS will:
(1) Select some frame (via a page replacement algorithm)
(2) Evict its contents (if dirty) to disk, and mark its indexing information in the page tables as 'absent'
(3) Write the new page contents into the frame

No free frame

  • Select one of the frames, break the mapping, and replace its contents with the page to add
  • If the data is dirty, it must be written out to disk first
  • The eviction decision can be informed or uninformed
  • Informed: use the extra 12 bits of metadata to decide which page to evict (which page was recently written to? Can't just overwrite dirty data; it must go to disk first)
  • Prefer to replace a page that was not recently used
  • Hardware is doing the OS a favor by setting these bits; the OS reads them and clears the accessed bit itself
Approaches and Explanations

Each approach below is described by its strategy, pros, cons, and notes.

Simple Approach: Random

Select random frame to evict

Simple to implement and understand

May select a heavily-used frame (causing a page fault soon after, on a page needed to keep running)

  • Simple, but wrong decision made 50% of the time
  • Servicing page faults to bring back code you want to reference is costly

Best Approach: Optimal

Select frame that will be referenced furthest in the future

Optimal; maximally delays each page fault

Fantasy, cannot happen

  • How do you know a frame isn't going to be referenced for a long time? You can't
  • Benchmark: compare instances of other approaches against it
  • If you knew when each page will be referenced in the future, you could generate the minimal number of page faults and compare other algorithms against that

Simple Approach: FIFO (First In, First Out)

Evict frame with 'oldest' contents

Simple to implement and understand, low overhead

May select an important frame

  • Given a memory reference string, keep a linked list of pages and track which has the oldest contents
  • OS tracks when each page was brought in (the hardware has the reference information)
  • Evict the page at the front of the list (it may no longer be needed if it hasn't been referenced for a long time)
  • Sub-optimal: the oldest frame may be important and heavily referenced (being old doesn't imply unimportance)
  • Maintaining precise per-page usage state (as for a priority queue) is difficult and costly
  • As with scheduling, this information isn't actually kept (it's expensive, since processes run independently of the OS)

Second Chance Approach: Improve FIFO

Evict in FIFO order, but check if page is in use before evicting (page gets second chance)

Vastly improves FIFO

Degenerates to FIFO; shuffles pages on list

  • Mostly equivalent to "clock" (clock doesn't shuffle pages around a list); relies on hardware providing a "referenced" bit
  • Using FIFO order, look at the first (oldest) page and check whether its referenced (R) bit is 0 or 1
  • If it has been referenced (R = 1), give it a second chance
  • Clear the bit to 0, move the page to the back of the list, and move on to the next one
  • R gets updated asynchronously
  • Hardware updates the bits independently of this procedure (it marks a page when accessed; the bit is only examined in software when servicing a page fault)
  • Problem: if all frames have R = 1, this is an overly inefficient way to degenerate to FIFO
  • Improvement: circular buffer and pointer (update pointer, no need to move nodes)

Good Approach: Least Recently Used

Evict frame that has not been referenced for a long time

Semantics closely match Optimal Approach

Costly implementation; hardware requirements

  • Keep track of an ordered list of which page frames have been least recently used
  • Needs more hardware support than a single bit; in reality, most implementations are just approximations of LRU
  • Example: 4 physical page frames and the memory reference string (0, 2, 1, 4, 3, 9, 1, 1, 2, 0)
  • The first 4 references are page faults - assume memory starts empty
  • The next reference, 3, overwrites Frame 1 (holding page 0, the least recently used)
  • Referencing 1 hits in Frame 3, since page 1 is already in physical memory
  • Servicing the final 0, Frame 1 is again the least recently used

Question & Answer

Q:

What is a CPL?

A:

Two bits stored in the CPU (in the CS segment register); they tell you the privilege level at which the current code is executing

Q:

What if another process accesses another virtual address? Would that caching still point to the same physical address and then skip it?

A:

It's a different process, so you don't want that mapping to apply. On a context switch, the cached mappings from virtual addresses to physical ones are no longer valid, so the TLB is flushed. Some TLBs support marking whether each entry is valid or not

Q:

How many bits are in the metadata?

A:

Depends on the hardware and OS. Refer to the Bits slide.

Q:

Why leave 20 bits? If you're only accessing 4096 entries, can't we allocate more?

A:

The first 20 bits of the entry are the page frame address, since that is enough to identify a page
In this scheme, each page table has 2^10 = 1024 entries. You want every entry to be able to point to any physical page
This is the beauty of the mechanism - the address space is virtually contiguous, not physically contiguous

Q:

Every entry in the page table should identify a page, but with 4k pages, do we need the last 12 bits of the entry?
Entry is 32 bits, and it's supposed to point to a particular page, so do I need the last 12 bits? Or can they all be 0?

A:

They can all be 0, because it uniquely identifies a page and pages occur only at a 4k boundary, so the last 12 bits can be anything
An entry page address holds the actual page being spoken about. The rest is used however we want

Q:

How do we get what was referenced last?

A:

You don't. This is the problem that real solutions run into. Somehow, you want an unlimited history of every page, and then efficiently compare; but good luck with that
One solution is to take advantage of the bits (some number)

Back To Navigation

October 9: Recap

Test 1

  • Satisfied, because rare to have a normal distribution
  • Points add up to 1000 -- don't panic about this grade
  • Stop by office hours for test results (~1 week we will talk about it)
  • Contact instructors more; can explain more in depth
  • If Michael doesn't see an interrupt, he can't service it

Homework 2

  • Three programming problems (some have questions that you shouldn't neglect in README file)
  • 66% of them are kernel-level programming
  • Problem 1: userlevel program that does something interesting
  • How to manipulate PAS and memory regions
  • Generate code at run time and execute the code
  • Fairly easy if you get the right system calls involved
  • Brush up on NASM, x86
  • Problem 2: kernel that does something interesting
  • Questions are to get you looking in the right direction
  • Have access to Linux Source Code -- valuable tool you should use (purpose is to look at kernel code)
  • Problem asks you to profile the location of the stack in every PAS on the machine (in different ways)
  • From userspace, write whatever to collect that information (write a command to start the stack - hint in Q1)
  • Collect information from the kernel itself by writing code that is dynamically loaded in kernel and extracted
  • Summary: compare the two approaches: which is better, more reliable, more efficient; what is the best way to do it?
  • Problem 3: change the kernel
  • Extend functionality of kernel to implement a new mode that a process can be in
  • Particularly, add a system call that marks a process in a certain way; if it's marked like that, the kernel starts to fail that system call
  • Build user-level utility that uses the system call (add to kernel which is described in LDK)
  • Implement user level in kernel functionality (add system call and implementation) and assess what you did; show that it works

Assessment Exercise

(Q1) If we didn't have computers (devices that implement algorithms), would we still need operating systems?
  • Politics: resources (tax dollars) and a set of programs (social programs, rules, etc.) and need to decide how to multiplex tax dollars to maximize value => Multiplex resources
  • Users don't need to know details of subsystems (superuser is PM/president)
  • Industry: resources (people) and management (manage people) => Abstraction
  • File folders of information on certain projects, and only certain projects can look at certain information => Data protection files
  • Libraries: multi-user book OS (concept of multiple users of a resource, checking in/out, etc.) => Scheduling
  • Point is that concepts and principles behind OS apply to other disciplines
(Q2) What is the difference between:
Protected mode
  • Recall System Architecture slide
  • An architecture mode (a collection of functionality of the x86 CPU)
  • Supports isolation between process address spaces
  • At start-up the CPU is in real mode - no virtual memory, address translation, etc.; switching a bit puts it into protected mode

Kernel Space Recap Diagram

DQMLYhm.png
Kernel space
  • Space in which kernel exists
  • Virtual address space available in CPU
  • Kernel resides in the same virtual address space (i.e. range) as user level programs
  • Is separate and distinct from userland (because of DPL bits that don't allow userland to touch kernel)
  • What's in the kernel: data structure that is the task list (elements are task struct <- PCB)
  • GDT, IDT also in kernel: IDT - interrupt vectors are indexes of the table and function pointers point to the code struct
  • Kernel has some number of structures (just an ELF) and one of the functions is something that implements a syscall dispatch
  • Another table that matches syscall numbers to function pointers that point to kernel code
  • Schedulers and page allocators (code)
Supervisor mode
  • Ring 0, have access to privileged instructions
Root user
  • User account (super user, highly privileged)
Do root's processes run in ring 0?
  • Trick question -- it's non-sensible. Processes don't normally run in ring 0 (that's supervisor mode, the kernel's mode); most user-level processes (including root's) make a system call, and the OS runs in ring 0 on their behalf
  • Kernel threads run as the root user, and actually do run in ring 0

Question & Answer

Q:

What is the correlation between exit time and score on the test?

A:

No correlation. Lesson: don't rush and don't panic, it's okay to take a long time

Q:

How much of the assignment are we able to do now?

A:

Problem 1: Everyone can probably do
Problem 2: Do user level script, LKM based on tutorials, but details of how to get information is a learning exercise
Problem 3: A learning exercise. Straightforward for writing system calls, user level utility to write system call will be in tutorial or on the web, but the hard part is what you have to do in the kernel to start filtering system calls on a per-process basis. It can be straightforward: a data structure that's a system call table that does a system call dispatch
No point in waiting. Learn by yourself. You still get pieces of the assignment in lectures (system call, implementation), but lecture/tutorials are more a safety net, not something you should depend/wait on

Q:

Compared to the first one, how straightforward is this and how long will it take?

A:

Fairly straightforward. Harder than HW1, but this is an activity you're familiar with. Learning semantics and working in an unfamiliar environment will be harder

Q:

How thorough should our man pages be?

A:

Expect to find on the web; doesn't have to look glorious. Just be familiar with how it's documented, as programming involves: 1/3 writing, 1/3 maintaining, 1/3 documenting.

Q:

Can you clarify on the differences between kernel space and paging, and each process which has 4 GB of memory? Does each process have 3 GB, and does the kernel live on each process?

A:

This is just a virtual space, and the model we're talking about
No, processes cannot access the virtual addresses since it belongs to kernel (mismatch of CPL bits)
The kernel isn't replicated in each PAS; it simply is. Each process runs in user space, and the kernel keeps track of it using a data structure (the task list)
One data member -> mm_struct -> describes the layout (memory regions) of a PAS -> page tables (the indexing mechanism describing where in physical memory the parts of the address space are) -> pages (including those of the kernel itself)
Every process's mm_struct includes mappings to the pages that belong to the kernel
When a process makes a system call, the kernel executes on behalf of that process
The kernel needs pages as well (its code and data need to be in physical memory)
Rather than replicating the kernel, its pages are mapped into the mm_struct of every process
When a user process tries to access those addresses, the access is translated by the MMU and the DPL check occurs
The mappings are all there to allow the kernel to execute, but processes don't have access to them

Q:

Why does the process need to keep track of the 1 GB of kernel pointers?

A:

The kernel does this, not processes; the MMU knows how to reference it. The kernel implicitly maps itself into everyone's mm_struct so it doesn't need several copies
Processes don't need it, and if one tries to access that virtual address range, the MMU doesn't allow it

Q:

Is supervisor in real mode?

A:

Supervisor mode is a subset of protected mode. Real mode is code that's executing with ultimate access to hardware

Q:

What is the root user and how does it run in ring 0?

A:

Root is just a user (like any other) but the kernel makes an exception for it - can invoke certain system calls, checks are more relaxed
Root's programs do not generally run in ring 0; the kernel runs in ring 0 on their behalf

Back To Navigation

October 11: User-level Memory Management

  • The API
  • Most systems export to user-level programmers for growing/shrinking memory (i.e. dynamically allocating memory)
  • Involved in creation/deletion of objects
  • Approaches/algorithms and data structures to support memory management

Recall

  • Processes have address space supported by memory regions
  • Supported by virtual memory address translation, then physical memory
  • Stack: (API) -> PAS -(memory regions)-> VM (memory address translation) -> Physical

Example: Modified Exercise

  • Ask for more memory using malloc(3)
  • Give back memory using free(3) (1 free for every malloc)
  • calloc: provides us with cleared (zeroed) memory (otherwise like malloc)
  • realloc: resizes the memory block
  • The OS is willing to adjust properties of the PAS
  • Write a simple program that calls malloc(10KB)
  • Where does the 10k appear in the PAS? What part of the PAS is responsible, by default, for holding dynamically allocated memory?
  • The heap: with address 0 drawn at the top, it grows down the diagram (gets bigger)
  • The heap starts at some position in memory (an address defines the absolute end of the heap); when more heap is needed, call malloc for some number of bytes
  • malloc(1024) -> increases the heap by 1 KB

Example: llist.c

  • Defines a node structure and continuously builds a linked list (runs forever; every second it adds a node onto the list), so memory use is unbounded
  • How many bytes is this structure?
  • 3 fields
  • Ask the machine itself: 12
  • For future reference: on this 32-bit machine a pointer (an address) is 4 bytes
  • data: 1 byte
  • num: 4 bytes
  • link: 4 bytes
  • The fields aren't packed end-to-end as (data | num | link)
  • The compiler pads the struct with 3 extra bytes (so fields are nicely aligned for indexing)
  • Dynamically allocating this thing (12 bytes) should increase the heap by 12 bytes
  • The heap doesn't actually grow that way, though; the OS deals with memory in pages (4 KB)
  • Disconnect between the implementation of malloc and what the OS will do
  • In class, recall commands:
  • pmap PID: gives the address-space structure of a process
  • cat /proc/PID/maps: shows the heap
  • strace -o llist.strace -e trace=mmap2,brk ./llist
  • Look at the behaviour of the program as it executes, and as it calls malloc
  • What does it do in terms of system calls? Sees the relationship between malloc and processes
  • tail -f llist.strace
  • The program starts up and maps in libraries
  • A few calls to brk, then nothing else
  • Suggests you can malloc as much as you want, and the OS doesn't get involved
  • Beyond scheduling this process, the OS is not involved in the execution of llist
  • Where is malloc getting its memory? It's not increasing the heap
  • malloc doesn't directly manipulate the heap on every call
  • Occasionally it invokes brk or mmap to ask for more memory as some number of pages, but glibc's malloc does its own internal memory management (the program interacts with malloc, not the OS)
  • It hasn't reached the point of needing more space (no need to increase the heap)

The Problem

  • Problem: support growing and shrinking memory allocations
(1) Disconnect between the program and the OS: the program shouldn't have to ask the OS to allocate a 12-byte struct, and the OS shouldn't worry about it
  • The OS gives out a few dozen pages, and malloc does the rest
  • The OS passes the job on to a user-level library; the kernel doesn't need to be complicated with fine-grained memory management algorithms
  • The system call remains a clean interface
(2) Want to grow allocations in variable-sized chunks and shrink them
  • Assume the following about what's going on in the memory manager:
  • It has some number of bytes; the memory manager needs to give away 12 bytes starting from 0

Memory Allocation Diagram

vL4BHP2.png
  • Give 12 bytes, return 0 (node is at address 0)
  • Continue to give more memory in 12 byte chunks
  • Ex. P1 requests: 12, 12, 12, 12, 1024 (just give it) -> 1048
  • P1 then frees one block (free(p2)); afterwards another program wants 12 bytes
  • Can give out the next 12 bytes at the end, or reuse the freed spot

Solutions

  • Assume you have some pile of memory and it's at time T. Already given away some blocks of memory (fragmented heap with some free spots)
  • When another process comes along, we can make a decision according to these strategies
  • First-fit: start at 0, take the first open hole that satisfies the request
  • But always scanning from zero - unlikely to find the right hole size near the beginning every time
  • Next-fit: keep track of where you left off last time
  • Next time a process wants memory, start at the saved pointer, as the following space probably has a hole
  • Best fit: looks at how much memory is available in all the available holes
  • Checks the request size against each hole to see whether it fits
  • If it fits, consider it as a candidate (c1)
  • Collect all the holes that fit the request (all the c's), and pick the tightest fit
  • Worst fit: doesn't care about the best fit; just finds the largest hole and puts the allocation there (greedy)
  • glibc might be doing something like this

Fragmentation

  • Problems: wasting time running back and forth
  • Best/worst: takes time looking for holes, comparing, etc.
  • Problem: the allocator listens to the user, who asks for weird sizes of memory; memory ends up arbitrarily chopped into chunks the user wants to deal with, not chunks the OS likes
  • Solution: deal with memory in fixed-size chunks (like the OS does)
  • Problem:
(1) Internal fragmentation
(2) External fragmentation
  • Ex. A tremendously fragmented situation. If you collect all the holes, you find little free space (might be tight or smaller than likely requests are going to be)
  • There might be a GB of memory, but the individual spaces are too small -> problem
  • If continuous in memory, can satisfy future requests
  • Most programs ask for small, medium or large (few 10s of bytes, 1 KB or a few dozen MB) chunks
  • The memory manager maintains indexes - a series of pointers to memory holes of each size - and maintains buckets of chunks of those sizes
  • Problem: a certain bucket may be exhausted; you then have a choice to deny the request, or the harder choice to coalesce some resources and hand them out (which the OS does internally)
  • What if the program requests a very large size?
  • Internal fragmentation: across all the buckets there may be enough memory to satisfy the request, but it's locked up inside already-allocated chunks
  • Make the best trade-off: it is easier for memory managers to deal with fixed-size chunks, translating an external fragmentation problem into an internal one
  • malloc
  • All that memory activity didn't touch or modify anything in the OS
  • What if it did? It would be rapidly talking to the OS about the heap

Example: Find the brk System Call in LXR

  • Where to find brk? Search for it
  • "brk": confusing, since these three characters can appear anywhere in the kernel
  • LXR reports 'brk' found as a local variable, as a structure member, and as free text in several places
  • Not a local variable (since it's a system call)
  • System call implementations are usually named "sys_<NAME>"
  • "sys_brk": found as a function prototype or declaration, and as free text (not that one)
  • Click the first one: a declaration with a signature, but where's the implementation? In the code!
  • This is the entry point for the system call (sys_brk is a symbol that refers to a location in kernel text)
  • The implementation is defined via a macro: SYSCALL_DEFINEn (e.g. SYSCALL_DEFINE3 for a call like write, which has 3 args)
  • "SYSCALL_DEFINE1(brk)"
  • Which of these five hits are we looking for? Not sparc or alpha; not nommu (our machine has an MMU) -> thus mmap.c (the memory-management portion of the kernel)
  • The macro expands to a function signature
  • Can read this code mostly without anyone's help, as it's C code
  • unsigned long: declaring local variables (one may hold the return value)
  • unsigned long newbrk, oldbrk: the break moves from oldbrk to newbrk as the heap grows
  • struct mm_struct *mm = current->mm: describes and maintains the memory regions of the current process
  • down_write(&mm->mmap_sem): semantics come later, but basically a primitive ensuring exclusive access to the memory regions of this process (you may have to do this in HW2)
  • Pre-processor stuff:
  • #ifdef some symbol: do this, otherwise do the other thing (the kernel can make configuration choices; depending on one, min_brk is set to the end of the code or the start of the brk region)
  • if (brk < min_brk): brk is the syscall argument, an address in the PAS; if the requested break is below the minimum, don't do anything
  • goto out: quite disciplined - minimizes code replication, and is particularly useful for a common exit point

Question & Answer

Q:

Can you have a 'bucket' for every power of 2?

A:

Yes, but buckets are a design decision, so you have to consider where do you cut it off?

Q:

Can you be clever with virtual memory and have fragmented virtual memory, but have it physically contiguous? Perhaps it would be more compact

A:

RAM chips don't care where the bits are. It will satisfy more-or-less a consistent window of time for any address
Don't have to worry about keeping things contiguous in memory chips (not worried about fragmentation either)
Fragmentation -> worried about the time it takes the memory manager to scan indexes that it contains

Q:

Does malloc always retrieve memory?

A:

It's a virtualized lie: if the system runs out of memory (i.e. the OS is not willing to give us more), malloc will fail; so test the return value (NULL -> no memory given)

Back To Navigation

Tutorial 7: Introduction to Linux Kernel Modules

  • Base Kernel Code vs Modules
  • User level vs Kernel Level code
  • Exercise on Robin's website

Advantages

  • Can modify kernel without need to recompile base kernel
  • Debugging and maintenance is easier
  • More constrained version of code; adding a function - only need to debug that section isolated from kernel
  • No performance penalties
  • Adding new functionality
  • Base kernel <- LKM (new functions added to the kernel that are isolated from it)
  • Adds new functionality to our OS

Disadvantages

  • Won't be loaded until the system boots
  • Might cause performance penalty
  • Could potentially break your system
  • Even though you're in userspace, you're adding code to kernel (interaction with OS)

LKMs Uses

  • Device drivers
  • Filesystem drivers
  • System calls (add new ones or override existing)
  • Network drivers

Back To Navigation

Tutorial 8: Introduction to Linux Kernel Modules

  • Finish exercise

Back To Navigation

Week 6

October 16: Kernel Memory Management

  • Recall brk() system call in LXR from last lecture
  • Do not be intimidated by the code
  • 246: SYSCALL_DEFINE1(): macro that defines the entry point we can jump to
  • The entry point is held in the system call table, which is accessed by the system call dispatch routine, which is reached via the IDT
  • brk: the end of the dynamic data section (this address keeps track of where valid allocated memory ends in the PAS) - oldbrk and newbrk: expanding the heap means moving and tracking this address
  • 250: mm_struct *mm = current->mm: gets the value of current (the current-process symbol is a macro referring to the currently executing process)
  • 253: down_write(&mm->mmap_sem): taken on a member of mm_struct
  • Controls concurrency: many threads could be executing brk() at the same time, so you want a way to serialize access to these structures
  • Controls concurrent access for the rest of the function
  • 297: up_write(&mm->mmap_sem): releases the lock around this critical section (which does some sensitive work)
  • 255: #ifdef: a configuration choice occurs here - preprocessor language
  • If the symbol is defined, the code does one thing, else the other; the two are basically semantically the same
  • min_brk is set to the end of the code or the start of the brk region (basically the same value)
  • Recall the PAS layout (code -> data -> heap -> stack)
  • 260: if (brk < min_brk)
  • brk here is the parameter declared by SYSCALL_DEFINE1(): it refers to the argument of the syscall
  • man 2 brk: the argument is an address in the PAS; the caller asks the kernel to expand the heap to the address given
  • Ask the heap to expand (the kernel is free to listen or not)
  • If brk is less than min_brk, the caller is asking the heap to "expand" to somewhere it already covers
  • Asking for something you already have
  • 269: rlim: local variable taken from current (the executing task_struct), through its signal member, into an array of resource limits (indexed via a macro referring to a slot in the array), taking the rlim_cur member (how much data the process may have)
  • 270: (brk - mm->start_brk) + (mm->end_data - mm->start_data): the requested heap size plus the size of the data section
  • If either condition is true, goto out
  • 274: newbrk = PAGE_ALIGN(brk): the kernel works in fixed-size chunks, so round to a page boundary
  • Macros are typically capitalized
  • 275: oldbrk = PAGE_ALIGN(mm->brk): page-align the current break
  • If they're equal, goto set_brk (a label)
  • The request is for something already covered


Lines 279 - 299

  • 281: do_munmap for memory unmapping
  • Main work (291): do_brk is called; a check verifies the break actually changed - if the return doesn't match, it's a failure
  • Look at do_brk by clicking it in LXR -> mmap.c
  • Allocates a new memory region or expands the current one
  • 1993: arguments are address and length; it returns a value
  • 1998: struct rb_node: a data structure embedded in the memory region mappings
  • 2002: len = PAGE_ALIGN(len): adjusts the length to ensure a sane value
  • 2003: if (!len): if len = 0, return (if brk is called with 0 you get the current value of brk back)
  • 2063: vma = vma_merge(...)
  • It could be the case that the heap is expanding into or over an existing anonymous memory mapping, so you can just absorb it (merging virtual memory regions)
  • 2071: kmem_cache_zalloc(...): if not, the kernel allocates a new region structure (the kernel's own memory management)
  • Lines 2077-2083
  • Standard manipulation of the new structure
  • On success (2085): update mm->total_vm and return the new address
  • 295: out: retval returns the new brk value
  • The kernel can't depend on user-level programmers to get it right, so lots of checks happen
  • A frequent pattern in the kernel: do a little work, then delegate to other functions

Example: DEFINE3(read)

  • Reached via the system call dispatcher when invoked
  • Userspace: printf -> write (glibc) -> int 0x80 -> kernel (CPU services the trap)
  • 374: struct file is used so the kernel can manipulate open files
  • Based on the file descriptor, get a handle on the file structure
  • 379: if (file): if file is not null (a file structure corresponds to the fd), do the work for the user, else bail
  • 380: file_pos_read(file): first find what offset you are at inside the file
  • 381: vfs_read: in the kernel, the path is sys_read -> vfs_read -> read -> file system -> device
  • If the file structure has an f_op table with a read member, go ahead and call through the function pointer (gets the address)
  • Click read() and you get a lot of results
  • The function pointer's value is only known at run time - you can't resolve it statically
  • Delegating work: sys_read, then vfs_read (generic), then a function pointer which delegates (there may be many different file systems on the machine) to a particular implementation for a particular file system
  • The file system must find the physical data
  • Lesson: don't get frustrated, just keep digging

Example: grab.c

  • In main:
  • Defines struct aircraft a data type with three variables
  • Fields of aircraft are set
  • Prints various sizes
  • Where is aircraft
  • Use glibc_test
  • Receives a pointer
  • malloc size of aircraft
  • memcpy a variable, the parameter
  • Cast a into c, and rewind by 8 bytes (8 bytes before what malloc returned to)
  • Start incrementing pointer byte by byte, and printing what's there
  • Output:
  • Sees various sizes
  • Address of a: local variable on stack (addresses on stack usually 0xbf...)
  • glibc allocates memory off the heap and confirm it's on the heap
  • Rewinded 8 bytes, so beginning of temporarily allocated structure -> malloc returns to a location (the args/variables)
  • Padding happens
  • Shows how glibc malloc works: allocates chunk of memory and returns a location (the data before it refers to other things) - can touch that memory and manipulate it
  • Data is embedded into a data structure that malloc used to maintain
  • Nothing prevents you from modifying it
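What grab.c observes can be reproduced in a few lines. The assumption (true of glibc's malloc, not guaranteed elsewhere) is that a size-plus-flags word sits immediately before the returned pointer:

```c
#include <stdlib.h>

/* Read the word glibc malloc stores just before the pointer it hands
 * out (the chunk size plus flag bits). This is an internal detail of
 * glibc's allocator, so treat this as an observation tool for the
 * lecture's point, not portable code. */
static size_t chunk_header(size_t request) {
    size_t *p = malloc(request);
    if (!p)
        return 0;
    size_t hdr = p[-1];   /* the "8 bytes before what malloc returned" */
    free(p);
    return hdr;
}
```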

Question & Answer

Q:

How to know if it's a macro or system call?

A:

Macros are typically capitalized, and you learn to recognize them with experience; it doesn't really matter much, because although macros are ugly, the effect is the same (control is transferred to code)
The kernel uses macros where it doesn't want the overhead of a full function call, even if the logic is split up

Q:

With constant transferring of code, what's the bottom layer? Where do you end this delegation?

A:

Often find the bottom layer in a device driver. Can't delegate forever, but it's a nice pattern - it chunks up and separates out responsibilities in bite-sized pieces
If you have one function that implements what system read does, plus (...), it's totally unreadable, full of bugs, unmaintainable
Nice software engineering path - sys_read is a nice API -> VFS provides choices for features of what file system you'd like. Doesn't care about how data is organized (CD-ROM, memory, etc.)
Division of responsibility is nicely reflected in the code organization
Eventually grab a bit off the physical reader

October 18: Kernel Memory Management (cont.)

  • Kernel is just an ELF (big program)

Kernel Diagram (with brk and mmap)

ecOSGST.png
  • Kernel calls syscall section, and part of the interface has to do with memory management (brk, mmap)
  • Two pieces of functionality that kernel exposes to user-level programs (i.e. they can invoke syscalls to somehow get more memory)
  • brk and mmap are semantically different:
  • brk is responsible for increasing the heap size (a particular memory area in PAS)
  • mmap is more generic - it maps files into address space (can create anonymous memory regions)
  • mmap manipulates any part of the PAS you want to, and brk is for heap
  • API that kernel exports in terms of memory management
  • Simple interface - giving away memory in terms of pages (want to increase the heap by 1 byte but you get a page instead), because kernel prefers to deal with memory in paged chunks (simplifies implementation for system calls)
  • Internal implementation of brk: change the characteristics/properties of the heap memory region (adjust start, mostly end of heap)
  • Lookup memory region in current PAS, adjust some properties, etc.
  • Relatively simple job - a design decision to keep kernel simple
  • Hand you pages and not worry about memory management and what you do with the pages
  • Simple interface not naturally used by a client
  • Don't use brk because you already have malloc


  • malloc: library routine
  • User-level programs call these
  • No special privileges to invoke brk - hopefully kernel is doing the right amount and in the appropriate fashion
  • brk returns new location of the end of heap (valid address where memory extends to)
  • Don't need to bother with brk and mmap - user doesn't care what's happening in memory as long as it gets the end result - a further abstraction
  • Don't need to deal with looking at memory and error checking
  • One of malloc's aspects: you don't need to know the underlying system (malloc in theory should work on a Windows box) -> abstraction
  • In essence: want to know where in memory you are
  • Signature: void *malloc(size_t n)
  • brk deals with memory in page sized chunks, but user level programs don't want to deal with pages, they deal with pieces of memory that are particular types of sizes
  • malloc offers specialization - interface in memory in arbitrary sized chunks (fine grain manipulation of memory)
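The brk interface can be poked at directly with sbrk(2), the libc wrapper around the same mechanism malloc falls back on (Linux/glibc assumed; grow_heap is a made-up name):

```c
#include <unistd.h>
#include <stdint.h>

/* Move the program break directly, the way malloc does when its free
 * lists are empty. Returns how far the end of the heap actually moved. */
static intptr_t grow_heap(intptr_t n) {
    char *before = sbrk(0);            /* current end of the heap */
    if (sbrk(n) == (void *)-1)         /* ask the kernel for n more bytes */
        return -1;
    return (char *)sbrk(0) - before;   /* new break minus old break */
}
```

The break value itself moves by exactly the requested amount; the kernel rounds its internal page accounting up on your behalf.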

Example: aircraft

  • What's the relationship between malloc and brk?
  • Does malloc ever call brk? When?
  • Running the program, a process loop 'runs forever' which copies fields into a newly created instance in struct aircraft, and iterates over memory byte by byte
  • If you let a program run a loop forever, the program crashes because it will reach the last byte, then try to add one when there isn't one
  • The CPU adds 1 to the current value of c, then goes to fetch the content at that address
  • The MMU looks it up in the page tables and finds no entry there; a missing entry can mean the page was swapped out to disk (but not in this case)
  • The kernel services the page fault, but the address isn't even valid: it's not in the address range that defines the heap
  • SIGSEGV occurs


  • Lessons:
  • malloc would like to think it's allocating an object for us, but it's not. It's sauce on top of brk
  • It gives us some number of bytes at some address, and says that the bytes (beginning at a, goes to n) ranging from a -> a+n is valid and can be safely accessed. But it's just some pointer in somewhere in the middle of the heap and can go back or forward if we want to (also might be other objects in the space)
  • Given any other address space, we can access any address we want to
  • Implication: can access or manipulate content of object in the heap that we want
  • Interesting objects, but the only thing that stops us is if we happen to walk outside the heap address range
  • The CPU is not actively interfering
  • The OS isn't either, because at this level the distinction and classification system does not exist. The compiler checks that you have valid source code and produces x86 assembly
  • The heap and other anonymous memory regions are created via mmap
  • The only time part of the system is not going to let you do something is if you run outside the bounds, or you try to access memory area outside of where the bits actually let you
  • void pointer: address, so a holds 32-bit address in memory region heap (because malloc is accessing the heap, which is default memory area)
  • No explicit resource limit


  • If you invoked malloc enough where every byte in the heap is allocated, malloc notices that it has no more free blocks and asks the OS for help
  • Invokes brk and gets more pages
  • Effect is that the heap end increases
  • The OS doesn't have to worry about keeping track of where individual objects are in heap, it just grows/shrinks it
  • Up to you to provide efficient dynamic memory allocation
  • The kernel doesn't want to implement glibc malloc inside itself. It's buggy, messy and error-prone code

Example: give() and recycle() Pseudo-Code

  • Implementation of malloc: takes some number of bytes and returns a pointer to you
  • Imagine you have the task of implementing your own memory manager:
void *give(size_t n); // malloc
void recycle(void *addr); // free
  • Allocate bytes and return some bytes (malloc and free are similar)
void *give(size_t n)
  // Obtain memory (stores pages so it gives pointers out for programs)
  // Keep track of bounds (brk does this for you as well)
  // Need a data structure that tracks allocated blocks
  // Need a data structure (could be the same) that tracks free blocks (holes)
  brk(); // If uninitialized: get memory from the kernel, keep track of it, etc.
  // Heuristic code goes here which informs the approach of what memory to give away
  • Use the heap or own memory management? No heap - more control, change permissions, etc. and would not interfere with other objects
a = give(4); give(100); give(5); ... recycle(a); give(3);
// Find 4 contiguous 'available' bytes
// return address if there are 'available' bytes
  • What happens when you give(4)? It gives away 4 bytes
  • Deciding what memory to hand out uses the heuristics we talked about (first fit, best fit, next fit, worst fit, fixed)
void recycle(void *a)
  // Error checking, etc...
  add_to_freelist(a,n); // This address for some n bytes is added to a list - perhaps an index with pairs in give()
  • Perhaps a homework question
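A runnable toy version of give()/recycle(), assuming a single static arena and a first-fit free list (no splitting, coalescing, or calls back into brk, all of which a real allocator needs):

```c
#include <stddef.h>

#define ARENA_SIZE 4096

/* Per-block bookkeeping: the data structure that tracks allocated
 * and free blocks (here, one list serves both roles). */
struct block {
    size_t size;
    int    free;
    struct block *next;
};

static char arena[ARENA_SIZE];      /* stands in for pages from brk() */
static struct block *head = NULL;

static void *give(size_t n) {       /* malloc */
    if (!head) {                    /* first call: one big free block */
        head = (struct block *)arena;
        head->size = ARENA_SIZE - sizeof(struct block);
        head->free = 1;
        head->next = NULL;
    }
    for (struct block *b = head; b; b = b->next)
        if (b->free && b->size >= n) {      /* first-fit heuristic */
            b->free = 0;
            return (char *)b + sizeof(struct block);
        }
    return NULL;   /* a real give() would invoke brk() here for more pages */
}

static void recycle(void *addr) {   /* free */
    struct block *b = (struct block *)((char *)addr - sizeof(struct block));
    b->free = 1;                    /* add back to the free list */
}
```

Note that the bookkeeping lives right next to the handed-out bytes, which is exactly what the grab.c example exposed about glibc's malloc.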

Bucket Allocation

  • Deals with free memory chunks
How does the kernel allocate memory for itself?
  • Not malloc since it's for allocating bytes in a PAS, not the kernel
  • The kernel has its own internal memory management
  • Ex. The user fires up a new process
  • Runs a new program (fork and execve) and it allocates space in PCB where kernel does on its own behalf
  • The kernel has different implementations of this functionality: kmalloc
  • Refer to Kernel Diagram
  • Zone allocator - manages the physical organization of memory
  • Talks to buddy algorithm, which maintains the bucket approach to allocate memory
  • Particular memory allocator SLAB, SLOB, SLUB (more details on this blog post)
  • Buckets contain chunks carved from pages; when the kernel needs a small allocation it takes one from the matching size bucket, otherwise it takes whole pages
  • Can borrow from another level for memory
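The bucket idea can be imitated in user space: round each request up to a size class, so every chunk in a bucket has the same size. The smallest class of 8 bytes is an assumption for illustration, not the kernel's actual configuration:

```c
#include <stddef.h>

/* kmalloc-style bucket sizing: round a request up to the next
 * power-of-two size class. All allocations served from one bucket
 * are then interchangeable, which keeps the free lists simple. */
static size_t bucket_size(size_t n) {
    size_t b = 8;          /* smallest class (illustrative choice) */
    while (b < n)
        b <<= 1;
    return b;
}
```

The cost is internal fragmentation (a 9-byte request consumes a 16-byte chunk), traded for fast, simple allocation.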

Question & Answer

Q:

You're using data structures give and recycle. The implication is that not a lot of data structures are used. How about malloc data structures?

A:

There are static data structures, but it's not a solution
There is an API for dynamic memory allocation. So information about what blocks are free/allocated is embedded in the heap
Print out 'around' a (in Example) (a little before, a little after a -> other numbers that glibc uses to find the next block)

Q:

The kernel's memory is set in physical memory. What happens when all dynamic memory is used up?

A:

When it's all gone, you have to pick somebody, but there's nothing to reclaim
Memory allocation is forced on the kernel by user space (which can believe memory runs to infinity). The kernel does as much as it can; if a request is ridiculous, it fails

Back To Navigation

Tutorial 9: Building a Kernel from Source

  • Kernel vs Distribution
  • Distribution is built on top of the kernel
  • The more stable the kernel version is, the less likely your distribution is going to crash
  • Find it on kernel.org
  • Latest version is 3.11 and we are using 2.6.32
  • Download source code -> configure -> compile base kernel code -> compile modules -> [install modules -> install kernel files - do at home] -> modify bootloader (GRUB)
  • Finish before next tutorial as you're going to add more functionality to it by next tutorial
  • sudo yum install <PACKAGE>: installs anything you're missing
  • make menuconfig: configures the kernel with an interface
  • make <SPECIFIC SECTION> (like modules_install): minimizes the work make does

Back To Navigation

Tutorial 10: Building a Kernel from Source

  • Frequently use Snapshots (in case you mess up anything)
  • Continue exercise; don't concern with logic, just follow steps

Back To Navigation

Week 7

October 21: System Startup

Recall

  • What processes are, how they come into being, how they are given address space
  • Utility of a modern OS is predicated on doing many things at once
  • How do you enable multiple processes to execute concurrently on the same hardware on the same CPU?

Test 1 Comments

  • Question 1: system call API and semantics, how it operates compared to regular function calls
  • Question 2: 5 fields of process control block, the metadata
  • Question 3: process life cycle
  • Question 4: fork important to system call API and process life cycle

Homework 2

  • How OS handles system call - when you call, it links to the system call table - does it directly jump to code, or is there an intermediary function?
  • Interrupt happens, system calls dispatch routine (looks up in table) then it goes to entry point to that particular system
  • Can't write in UNIX unless you have a modified kernel
  • Problem of Question 1: not to test x86 assembly, but how system calls are invoked (particular way it happens)

System Startup

  • In bootup: Start VM - exposes virtual CPU, screen blinks, logo appears, then the GNU GRUB starts up
  • GRUB: small program that runs that gives you a menu to select
  • Can select any OS you want to as long as GRUB knows where to find it
  • Hand off between hardware initializing itself and the OS actually being able to run and take over
  • Pick a kernel - screen is blank, then you are watching the details of boot
  • When you see the output, the OS is already running - it's initializing services (i.e. creating and running user level processes)
  • Not booting, it's already booted and is running a number of programs (sets up various services)
  • What happens in between the time of selecting the virtual machine? Why does it happen?
  • How to take hardware from an unpredictable state and get something useful out of it?
  • Between selecting and the process (page 3 and 4 on the Slides) is a lot of work
  • Startup is interesting:
(1) Transition from sequential to concurrent execution
  • Likely not doing two things at once - code does one thing at a time
  • Hardware does one thing at a time, it literally cannot do more than one thing at a time
(2) Bootstrapping
  • If you're executing instructions one after another, how do you interleave them to do different things or switch between them?
  • The CPU is a multiplexed resource - a sequential resource shared among many constructs/virtual CPUs (processes)
(3) Fundamental OS job of hardware initialization (servicing interrupts and exceptions)
  • Multiplex state so it's useful
  • Bridges startup to interrupts, scheduling, concurrency, processes
  • Low level hardware events - OS is an event handler
  • How to take state of hardware to get into position to user level processes
  • Startup takes hardware in unknown state and imposes enough order on it so that the kernel can run (and spawn userspace processes)
  • Before thinking about memory management (PAS), first job is to: turn hardware on and get into a useful, known, predictable state
  • Want to run user programs, but first need to get the hardware into a known state
  • Bootstrapping: relationship between ROM, BIOS, bootloader and kernel startup code
  • Known code sits in read-only memory; when the machine powers on, it transfers control to that code (small size, may know a little about the hardware it's dealing with)
  • Minimal code to full blown multi user OS - chain of handoffs

Startup Overview

  • Initial startup program in ROM: hardware reset, load BIOS
  • BIOS comes up and its job is to make sure the attached devices are in a good state (make sure everything works) and then hand off control
  • BIOS program: initialize devices, transfer control to bootloader
  • Chain of trust where OS doesn't believe what the BIOS does since it was written a long time ago
  • Linux reinitializes all devices
  • Bootloader (LILO, GRUB): choose OS, boot options, transfer control to the kernel load routine
  • Grub: choose OS and maybe pass parameters, and from there kernel is alone
  • Each handoff is minimal work to get to the next stage
  • Main idea:
  • Load OS kernel into memory
  • Initialize kernel data structures
  • Create tasks
  • Transfer control to them

Setting up the Execution Environment

  • Initialize hardware (IRQ lines, etc.), doesn't trust BIOS results
  • Prepare IDT and GDT
  • IDT: intimately related to state of hardware (once you initialize it, you need to map code that handles it to the interrupts)
  • GDT: describes internal memory organization (if you start computing without memory management, not going to do well)
  • Create OS data structures (task list and whatnot) for kernel operations
  • Initializes core OS modules
  • Scheduler, memory manager, etc.
  • Enable interrupts
  • Implication is we've been executing without interrupts (good thing because you don't want to be interrupted when setting up key data structures)
  • Don't have concurrency until this point - no ability to do multiple things at once
  • Willing to receive signals from hardware and software interrupts from processes running to deal with concurrency
  • Create null process (i.e. swapper), create init process
  • Details differ between OS, but a few processes are created
  • init: process 1, and null: process 0 (does nothing)
  • Null: be present and be prepared for scheduling
  • If OS can do nothing else because all processes are sleeping, the OS can switch to this one
  • Two processes that exist as well as the scheduler, where it can switch between them
  • Switching between the two is how you get concurrency out of a sequential procedure
  • System can do a little bit of one job, then stop and so forth. It does it quickly enough that it looks like it does multiple things at once

Timeline (for Linux Boot)

  • Setup function with control flow
  • Start in something sequential in assembly code, and kernel_thread (creates a kernel process) executes init first
  • BIOS (no OS, random RAM)
-> CPU.RESET pin (cs, eip = 0xfffffff0)(ROM)
-> BIOS is here
-> Real address mode
-> no GDT, no IDT
  • Bootloader (LILO, GRUB)
-> MBR holds bootloader loader
-> Main bootloader reads map of available OS kernels
  • Assembly language: setup() "function"
  • Assembly language: startup_32() (decompress)
  • Assembly language: startup_32() (setup PID 0)
  • The start_kernel() function
-> ...kernel_thread() (create PID 1 /sbin/init)
  • Main job: locate and retrieve kernel image
  • Other jobs: POST, copy 1st sector to 0x00007c00

Linux Startup Code Chain

Setup()
startup_32()
startup_32()
start_kernel()
+sched_init()
+kernel_thread()
/sbin/init
/etc/rc.d/init.d/...


  • If you can come away from today (+ reading) and understand it all, it's a good sign
  • Poem describes the control flow and phases of what's happening
  • In the details of the poem is the relationship between hardware and OS as it comes into being
  • Early half of procedure is intimately tied to details of architecture


  • LXR
  • 301: calll main: entry point of where kernel is
  • 128: void main(void)
  • Code you execute in beginning has to be relevant to hardware (you're running x86)
  • 176: go_to_protected_mode()
  • Tied to hardware. An x86 CPU can be put into different modes: startup is in real address mode (no memory protection), and protected mode means you've got paging and protection
  • From real to protected -> separate user and kernel code
  • 104: go_to_protected_mode(void) sets up IDT and GDT
  • 95: static void setup_idt(void) loads IDT
  • startup_32: sets up procedure that uncompresses the kernel
  • start_kernel (an __init function): scheduler is created, most subsystems exist. Kernel is in a state where it can do things
  • rest_init(void): sets up kernel thread flags, and then calls kernel_thread(kernel_init)
  • kernel_init: invokes init_post()
  • kernel_execve: invokes system call from kernel space
  • kernel_thread: implication of do_fork which creates a bootstrap (context for a single process to exist)
  • sys_execve: replaces the current program image with the one being executed

Conclusion

  • Kernel boot code creates control thread and process is responsible for running kernel init function which calls this procedure
  • We have a process living on the machine. Kernel init tries to run programs which tries to create another process so the OS can switch between them
  • Try to get hardware into a state that executes more complicated code
  • Once it sets up its data structures (stop here, OS not very useful as it won't be doing anything) the kernel then creates internal code via call to kernel thread (equivalent to a fork) and creates two processes
  • Uses those to invoke existing startup scripts on machine, and job is to look at all the startup programs
  • Job of init looks into directly and executes it with fork_exec (/etc/rc.d/init.d/)
  • User level functionality - code doesn't care about or need it but makes machine useful to us
  • Now it's their job to start other processes on the machine
  • Create concurrency out of nothing by 'manually' constructing two special tasks after environment setup (and loading of kernel code) has been done
  • The transition happens when (0) these tasks exist and (1) turn on interrupts and (2) invoke the scheduler
  • Memorize major phases of what happens

Question & Answer

Q:

Even without the process, do we have concurrency because we have interrupts?

A:

Between the time of interrupts and creating processes - no, you're switching away to something else. What do you come back to?
By creating processes, by default we have options: Particularly, let's say we're interrupted in the sequential procedure. You return by making sure everything is initialized

Q:

What happens if the OS isn't found?

A:

The BIOS starts running, and transfers to bootloader. GRUB believes it should jump to somewhere, but there's nothing there so the machine won't start (gets no functionality). Can't do anything because there's no code there

Back To Navigation

October 23: HW1 Review and HW2 Hints

  • Perception: somehow material in lectures, tests, assignments and tutorials are not related
  • In Michael's shoes: Taught college-level courses for 10 years, and has a clue of how to teach. If you accept that as a pre-condition, what is Michael thinking?
  • Not OS, giving theory homework. Draw a connection
  • What major concepts have we talked about?
  • What is OS
  • What is process
  • Kernel roles
  • Illusion of unlimited CPU/memory
  • Memory management (userlevel, kernel)
  • Internal, external fragmentation
  • Virtual memory
  • Paging
  • System calls
  • Protection
  • Isolation
  • System architecture
  • Wiki, man pages, professors, textbooks, LXR, Google, etc. for resources
  • Learning process: reflect on what you did, and what Michael expects you to do

Homework 1

  • Problem 1: where to find authoritative information? Important skill to have
  • Does not care if you memorized facts, cares that you know where to find stuff (man pages)
  • Form of this exercise is to get you to look and search
  • Use man pages and the "see also" section
  • Problem 2: learning by observing what the system does
  • /var/log stores important messages
  • System log: what system is doing, and hw2 (output from kernel module to this file)
  • Manage device interaction
  • strace prints out system calls
  • Problem 3: what critical components of an OS are there?
  • OS supports concurrent processes
  • What's important, what's not; what belongs in the kernel
  • Distinction of OpenBSD (default processes) and Linux
  • Problem 4: give real experience with what kinds of functionality there are
  • Mystery: data, tell what's going on
  • Used to writing code, but that's not everyone's thing - use prowess to analyze data
  • Assignments are a tool to assess how much you've absorbed

Homework 2

Problem 1

  • Userlevel program and consulting mmap man page
  • Should not take that long. mmap is what you want
  • Understand how to create memory area at run time where you can put code (implication is that memory area has set of permissions - r-x [read-execute])
  • Invoke 2 system calls
  • Demonstrate understanding of system call calling convention for x86 Linux
  • System calls different from function calls, related to protection
  • Implementation should be straightforward - file with sequence of bytes (10-20 bytes) and they represent x86 machine code (not ELF); just a file of raw bytes
  • Produce in hex editor (Under Programming tab, Bless is installed) or command line (Check Piazza post by Michael)
  • Echo hex bytes to some files, if you know hex values of x86 commands you can fire up NASM and copy program as a basis
  • getpid is wonderful
  • Exit doesn't work, but there are other system calls

Problem 2

  • OS does ASLR, but the point is knowing how OS keeps track of memory regions, virtual memory and memory process on a machine
  • Related to 'where to look for info for processes and PAS?'
  • LXR (code), file for memory regions - /proc/PID/maps
  • Looks like a file system, but it's an image of various pieces of kernel memory so user doesn't have to modify kernel source code
  • Question 0:
  • Write a program that forks and look at process memory regions to see if their addresses are different
  • Terminal and background processes - examine the pid
  • Question 1:
  • Intent is to get eyeballs on LXR
  • Hint: the directories in the base directory of Linux (mm, usr, arch, Documentation, fs - file system code)
  • In this directory, it may talk about process list
  • Code that accesses data structures created to memory regions
  • Need to do something? Has kernel done it already? Look at that code
  • Question 2:
  • Memory regions as Linux defines them; talks about the data structures that keep track of memory regions, namely mm_struct and more precisely vm_area_struct
  • Does the data structure actually contain code?
  • Question 3:
  • proc/pid/maps: user-level script - posted on Piazza
  • Collect info using script and do analysis on it
  • Question 4:
  • LKM: practical skill and contrast for modifying kernel source code
  • Extend kernel functionality, write LKM that walks process list - task structs, PCB in kernel that keeps track of all processes running
  • Attributes of PAS - collect them to provide you can read kernel source code and understand how to access them (concurrency is hard)
  • Hint: observe how other parts of kernel walks across list (OOM killer, proc/pid/maps)
  • LKM tutorials on Wiki
  • Question 5 (6 in the question):
  • From userspace, observe information then compare the approaches - both are accessing same data and data structure. Which is the better approach?
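For example, a process can read its own region list from /proc/self/maps, the same file the problem asks you to analyze (Linux-specific; count_regions is an illustrative helper):

```c
#include <stdio.h>

/* Count this process's memory regions: /proc/self/maps holds one
 * line per vm_area_struct in the kernel's view of our PAS. */
static int count_regions(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    int lines = 0, c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}
```

Every process has at least a text segment, heap, and stack, so the count is always positive on a running Linux system.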

Problem 3

  • Question A:
  • Add a system call to the kernel (doing in tutorial) that has particular signature and particular effect. What's that effect? Simply marks a process in a certain way (in a POD)
  • Add a field to task struct that keeps track of how long you're in POD
  • Question B:
  • System calls - write value in a new attribute of task struct and PCB and have another part of kernel interpret that state (modify behaviour based on this mark)
  • Demonstrates that you're able to access documentation and code related to PCB, general OS concept, that helps keep track of processes

Question & Answer

Q:

Is there any similar system calls for last question?

A:

Use reeducate(2) and there are other system calls that fetch or set values (one way to determine is to look at syscall, unistd.h and read man pages) but in tutorials, you will implement part of reeducate.
System call itself is simple - write a few fields

Back To Navigation

October 25: Process Scheduling

  • Links to various pieces of source code on Wiki
  • Chapter 4 more detailed (add to readings)

System Startup

  • OS can transition from sequential code execution to concurrent execution. It:
  • After sequential procedure of setting up bytes, it initializes scheduler
  • Gives scheduler two processes to operate via do_fork and sys_execve(swapper process 0, and init process 1)
  • Does it through standard system calls that allow you to replicate PAS
  • Enables interrupts
  • Allows newly created scheduler to begin choosing between these two processes - scheduler has work to do
  • What happens immediately is that 1 forks and execs some number of processes
  • One of the processes (init) creates children via do_fork() and sys_execve()

Problem of Scheduling

  • As more processes are created, and some hundreds or thousands of processes come to exist, how does the OS choose which process should be given to the CPU?
  • Ex. Have 5 graduate students and want Michael's attention. They all show up at the same time at the door, so who to pick first?
  • How to make decision? How to fit unlimited tasks in limited amount of resources?
  • Pick one at random
  • First come first serve
  • Priority / Urgency / Favorites
  • Estimate of workload, most work last
  • Alphabetically
  • Save history of service, note progress
  • Work on newly complete
  • Progress on incomplete
  • Equal slots
  • Round robin
  • Clone self and deal with others (physically impossible)
  • Scheduling is a mechanism that creates the illusion of simultaneous execution of multiple processes
  • Important to remember. At a human time scale, say you have 50 minutes free and give each student 10 minutes by starting alphabetically. Then all resources are exhausted
  • Can't service all at once - give them 3 seconds and make them feel like they're being serviced at once
  • Modern CPU operates quickly - it can do roughly a billion cycles per second (gigahertz)
  • Instructions that use a single clock cycle

Considerations for Scheduling

  • Scheduling metrics:
  • Time
  • Work type
  • Responsive (delay)
  • Consistent (delay variance)
  • man 1 nice, man 2 nice


Progress is #1
  • Process that is running actually progresses through its code (i.e. does stuff and gets things done)
  • The scheduler's goal is to take that collection of work and ensure it all makes progress
Fairness
Priority
  • One process is more important than the other (ranking)
  • How to balance fairness and priority?
  • Tightly linked is the notion of efficiency of the scheduler
Efficiency (of scheduler)
  • Scheduler should be fast to decide which process to pick
  • Overhead to making this decision and keeping track of all the data necessary
  • How to balance time it does stuff and allocate to get work done?
Quantum
  • Time scales, slice of time you want to give a process to make progress (should be reasonable)
Pre-emption

Quantum

  • Default amount of time given to processes to accomplish progress
  • Designing a scheduler, you have to pick a value for quantum
  • Implications for too small/short: context switch overhead is high
  • Implications for too long: destroys illusion that scheduler wants to provide (concurrency)
  • From any program's perspective:
  • my_stopped_time = N * quantum.length
  • Where N is number of processes on the machine
  • What else might affect the selection of quantum?
  • Short quantum and long ones are bad, but relatively long quantum should be a good one. A long quantum does not make interactive applications unresponsive, they have a relatively high priority so they can pre-empt other (batch) processes
  • Relatively long quantum: does not make processes slow, because we can fix it
  • We have control over other features that can distinguish between background/non-interactive tasks and interactive (priority should be boosted)
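The formula above as code (a back-of-the-envelope model that ignores priorities and blocked processes):

```c
/* Worst-case time a process sits stopped: every other runnable
 * process consumes one full quantum before we run again. */
static long my_stopped_time_ms(long nprocs, long quantum_ms) {
    return nprocs * quantum_ms;
}
```

With 100 runnable processes and a 10 ms quantum, a process can wait about a second, which is why quantum length matters for the illusion of concurrency.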

Pre-emption

  • Ability for OS to stop a program from running
  • Kernel can interrupt a running process at any point in time or when its quantum expires
  • Enables: "Do something else more useful"
  • Pre-emption does not imply suspension
  • Pre-emption in Linux 2.6 can happen either in user mode or kernel mode execution
  • Algorithm: fetch code, decode and execute driven by the value of instruction pointer
  • Most CPUs fetch, execute, and check for interrupts
  • Ex. Clock - how much time (quantum) happened since the last time the scheduler was actually invoked?
  • OS steps in, stops the current process, invokes scheduling priority, picks the task, and allows it to continue executing
  • Pre-emption: allows it to pre-empt a currently executing process

Process List vs Run Queue

  • Process state, life cycle, etc.
  • Logical conditions for candidates and processes actually on the machine
  • List of all processes is logically distinct from the data structure used to organize processes into groups ready to be scheduled
  • Concepts are different, but implementations may share a data structure or be two completely separate lists
  • Indexing into list is separated in Linux, data structure contains only those running (actually candidates)
  • No point in looking for entire task list (if not runnable) in the decision
  • You make a quicker decision by considering only the processes capable of using the CPU

Priority

  • Most scheduling approaches/algorithms can be boiled down to the generic concept of 'priority'
  • Different approaches calculate different absolute or relative priorities for the collection of runnable processes
  • We have control over this number by nice(1) and nice(2)
  • First come first serve - reasonable
  • Round robin - fair, give attention to everyone
  • Estimate of workload - knew how much time it could take up, so take care of short/long ones first and throughput increases (get rid of processes in minimal time) but work required is unbounded
  • How to know what work is required?
  • Boils down to priority selection - who is most important to run now?
  • Most important may be one of the above
  • Scheduler uses some image of round robin but it's quite hard to arrive at a priority value
  • Establish dynamically calculated priority on every runnable task on system
  • In priority calculation: what kind of task is it? (interactive - more responsive, batch, what bound, etc.), how much attention has been paid to this task already? (hard to know, OS plays guessing game)
nice(1) is the parameter that lets users affect priority
  • Being 'nice' means not being greedy: if a background job can afford to finish in two weeks, lower its priority so long-term computation doesn't interfere with other users' interactive shells
  • Raising priority (a negative nice value) is restricted to root, not ordinary users
  • Nice thus affects how the scheduler treats the process

Types of Processes

  • Traditional classification
  • CPU-bound
  • Many CPU and memory operations: scientific computing, financial software, rendering
  • Is procedure primarily using CPU and nothing else? If you're doing math and using functions then you're CPU bound
  • I/O-bound
  • Much use of devices; may spend time waiting for input
  • Is process interactive? Is it a background procedure?
  • Linux: no deadlines, but real-time processes are labelled as such and given higher priority
  • In a true real-time system, every task comes with tight bounds and a set of deadlines, but that's not what a multi-user OS uses
  • Alternative classification
  • Interactive
  • Batch
  • Real-time
  • A batch process can be I/O-bound or CPU-bound (database, rendering)
  • Real-time programs are explicitly labelled as such

Scheduling Policy and Algorithms

Conflicting Objectives

  • Fast response time (processes seem responsive)
  • Good throughput for background processes
  • Avoid 'starvation'
  • Ensure all processes are getting some attention
  • Interpretation of set of rules to govern this behavior
  • Scheduling policy: rules for balancing these needs and helping decide what task gets CPU

Scheduling algorithms

  • Ex. The one with least attention goes first: high throughput but process with complex problem doesn't get serviced, starves and dies
  • Round robin: everyone gets equal share of time

Round Robin Scheduling Algorithms Diagram

qTTSVHz.png
  • Ex. R, A, B
  • Students have an arrival time with workload estimation (amount to do)
  • Time quantum could be 5 units and context switch overhead is 1 unit
  • Arrive all at same time, with different units of work
  • Timeline with first come first serve: who to pick now? Pick an arbitrary ordering
  • A, B, R - tiebreak order (alphabetical)
  • A runs for 3 units of time (but quantum is 5)
  • FCFS: A only has 3 units of work to be done. Typically we associate quantum only with round robin
  • When A is done, it will be done and invoke exit()
  • It's going to call the schedule function (executing inside itself so it can pick next one)
  • What about context switch? Quantum only means something in the round robin approach, and context switch can be ignored
  • FCFS: shortest job, only run task and complete its work
  • Context switch time: for an average task, a context switch includes calling the scheduler and swapping process state
  • Compare structure of different schedulers because they're radically different approaches

Question & Answer

Q:

Is there a value that looks for priority? If I have a window in the foreground, will this get serviced more?

A:

Yes, Windows uses this approach. The current process that owns the screen gets more relative time slices (distinguishing between a task that's interactive and a background task)

Q:

What's the difference between nice and priority?

A:

Nice: data that goes into a calculation for overall priority. By default it has a nice value of 0, so it's not taken into account

Q:

Can you have two processes that come at the same time? Is this possible because of the bus?

A:

In a single CPU, you can't, because it would be executing fork and it can't do two things at once
In a multi processor system, there is no reason you cannot have it. One part of the OS can be forking and the other part can do it too

Q:

Does the scheduler look at the exact arrival time, or 'since that last time it ran'?

A:

No, processes come and go all the time. Basically you have a set of processes that do not care when they've arrived; it's just a collection
It does matter in a window, how much attention you pay to a process
If A has 50% progress, it's likely not to be scheduled right after

Q:

Is there a context switch time in the FCFS scenario? Since you only need to save the context of the last process

A:

To load the next process, there still needs to be some context switch
In the diagram, the ticks are calls to the schedule function and part of it is the context switch

Q:

Do you need a context switch from 0 to 1

A:

No; we could, but on the test it will explicitly say whether or not to count context switches

Back To Navigation

Tutorial 11 & 12

  • Work on exercise that helps start Assignment 2: Problem 3

Back To Navigation

Week 8

October 28: Clocks, Timing, Interrupts, and Scheduling

  • Policy vs. mechanism
  • A scheduler structure
  • It's easy to look at a memory manager/scheduler and conflate what it's doing with how it's doing it
  • Distinction in both 2.4 Scheduler (earlier) and 2.6
  • Linux version 2.6:
  • Data structures and internal organization, how it picks next task
  • Code that manages what should be next
  • Assigning priority vs picking next task
  • Mechanism: which process should run next?
  • Policy: drives or controls that
  • Different approaches for driving priority
  • Mechanism enforces the decision policy has made
  • Process Life cycle
  • ready <-> running -> sleep/wait -> ready

Popular Scheduling Approaches

  • FCFS: nice, but not much responsiveness: ordering is arbitrary, long jobs clog up the queue, and short jobs may not run for a long time
  • Run to completion
  • Short job next: high responsiveness, long processes get shuffled to end of list
  • Round robin: fair approach, every process gets equal time slice and if we pick quantum well enough there is equal footing in responsiveness

Round Robin Scheduling Algorithms Diagram

MSv3iU5.png
  • Context switch happens in black box
  • How to know it's a decent approach? How to compare?
  • Metrics: empirical (simulation)
  • This is an artificial way of emulating scheduling decisions - doesn't actually matter when they arrive
  • Actual decision OS uses is different than contrived examples, but there are a few metrics we can use
(1) Wait time: note when a process arrives, and when it first gets serviced
  • Look at metric and keep it low - when process arrives, we want to quickly service it
(2) Turnaround time: ability to make progress, and ability to finish the work you have
  • From arrival to completion time, how long does it take for process to get work done?
  • Turnaround is the difference between when a process arrives and when it finishes
  • In general, low values are good

Linux Support for Scheduling

  • Flags for scheduling policy
  • Has mechanisms that support multiple approaches to scheduling
  • Normal scheduling and real-time scheduling
  • Linux can use both by defining flags
  • Lines 35-38: SCHED_NORMAL is normal policy, SCHED_FIFO and SCHED_RR are real-time variants
  • Linux fakes real-time scheduling
  • Processes marked SCHED_FIFO and SCHED_RR are given a drastic priority boost, and the OS scheduler schedules them first
  • However most multiuser OS don't use these real-time approaches. Linux does something else
  • Gut feeling when creating function to prioritize processes:
  • FIFO
  • Sort array efficiently somehow (heap)
  • How to calculate priority?
  • Nice is one input value; how do we calculate a relative priority from it?
  • Just look at it -- but how?
  • Think about for loops

Code: sched.c

  • n tasks runnable, which one goes next? Write a for loop and iterate over it
  • Not in LXR, on Wiki
  • The schedule function - keep the previous and next process pointers in mind
  • Check to use real time
  • Repeat_schedule:
  • Next pointer to task_struct
  • idle_task(this_cpu): if processes are sleeping, who gets to run? Have a process that's always runnable
  • Swapper process from system startup - process 0, if OS can't decide who should run, it runs this
  • Implication: select next to be idle process
  • c = -1000: some counter
  • list_for_each(tmp, &runqueue_head): Linux has internal data structure and macros
  • Pointer to task_struct, runqueue_head: every CPU has a runqueue associated (i.e. set of runnable processes) and it will be iterated over, storing the task we consider in tmp
  • p: iteration pattern
  • if (can_schedule(p, this_cpu)): test if p can be scheduled for the CPU
  • int weight: consider how much goodness task p has
  • Goodness: passes in task_struct
  • Weight is the number goodness returns. If the scheduler iterates over every other process and the current one is still schedulable with the highest weight, it keeps running it
  • goodness(p, this_cpu, prev->active_mm): metadata, history of priority of process
  • if (p->mm == this_mm||!p->mm): meat of the code
  • Don't swap out memory, and take into account weight += 20 - p->nice then return
  • If p is a real-time thing, boost by 1000 and selects for real time purposes
  • If we iterate over task, eventually pick highest priority
  • Order n scheduler considers every task on machine
  • Ex. a server with one thread per client: the loop is long every time you call schedule (there's lots of work to do, and we don't want the context switch to be long)
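The loop walked through above can be sketched as a toy model in C. This is not the actual 2.4 kernel code; the struct fields and function bodies are simplified stand-ins that keep only the shape of the goodness-based scan:

```c
#include <stddef.h>

/* Toy model of the 2.4 schedule() loop: scan every runnable task and keep
 * the one with the highest "goodness".  The real goodness() also considers
 * remaining counter ticks and mm affinity. */
struct task { int nice; int rt_priority; struct task *next; };

static int goodness(const struct task *p) {
    if (p->rt_priority > 0)
        return 1000 + p->rt_priority;   /* drastic boost for real-time tasks */
    return 20 - p->nice;                /* cf. weight += 20 - p->nice */
}

struct task *pick_next(struct task *runqueue) {
    struct task *next = NULL;
    int c = -1000;                      /* cf. the c = -1000 sentinel */
    for (struct task *p = runqueue; p; p = p->next) {
        int weight = goodness(p);       /* list_for_each over the run queue */
        if (weight > c) { c = weight; next = p; }
    }
    return next;                        /* O(n) in the number of runnable tasks */
}
```

The key takeaway is visible in the loop itself: every call to schedule touches every runnable task, which is exactly the O(n) cost the 2.6 scheduler was designed to remove.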

2.4 vs 2.6 Scheduler

  • Linux 2.4 is okay, but how to do better? O(n)
  • Just pick the next task
  • How to avoid iterating over n processes to pick the best one?
  • Needs some sort of priority queue
  • Why use an algorithm when we can use a data structure?
  • 2.6 scheduler
  • O(1) procedure of answering who to run next, is running in constant time
  • I.e. no matter how many processes you have, answer will always be a particular length of time - a known constant
  • The heart is pick_next_task (with put_prev_task requeuing the previous task): literally fetch the pointer to the next task
  • Maintaining multi level feedback
  • Pick next task is straightforward. Making it more complicated is that sometimes it considers other scheduling classes (i.e. other real time processes on machine)
  • Basically delegate to another function that picks the next task of this data structure

Run Queue

  • Every CPU has a run queue
  • There are fields, which point to a couple of arrays
  • One array is active and another is expired
  • Active: processes that haven't had a chance to run in this time slot
  • Expired: processes that have already run
  • Run queue has these two lists. Who should run next? What is 0th entry of the active list?
  • As a process uses up its timeslice (or finishes its work or makes a syscall), it drops off the active list to some position in the expired list
  • The array is 140 priority levels long. Easy to ask who is next - if no one is at position 0, go to the next level
  • For loop: why not O(N)? Because there are only 140 levels! Bounded by a constant
  • Which slot a task lands in is governed by policy, behind the scenes
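The "bounded by a constant" argument can be made concrete with a bitmap over the 140 levels, which is the trick the 2.6 O(1) scheduler used (sched_find_first_bit). A minimal sketch with made-up helper names:

```c
#include <stdio.h>
#include <strings.h>   /* ffs(): find first set bit */

/* Toy O(1) lookup: 140 priority levels, a bitmap marks non-empty levels.
 * Picking the next task = find the first set bit, independent of how many
 * tasks exist on the machine. */
#define NLEVELS 140
#define WORDS ((NLEVELS + 31) / 32)    /* 5 words of 32 bits */

unsigned bitmap[WORDS];

void mark_runnable(int prio) { bitmap[prio / 32] |=  (1u << (prio % 32)); }
void mark_empty(int prio)    { bitmap[prio / 32] &= ~(1u << (prio % 32)); }

int highest_runnable(void) {           /* at most 5 words to test: O(1) */
    for (int w = 0; w < WORDS; w++)
        if (bitmap[w])
            return w * 32 + ffs(bitmap[w]) - 1;
    return -1;                         /* nothing runnable: run the idle task */
}
```

Lower index means higher priority here; the loop is bounded by the constant number of words, never by the number of processes.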

Completely fair scheduler

  • Heuristics that determine which slot to fall into
  • Scheduler takes the round robin idea and flips it on its head: in RR you pick a quantum and that's how long each process runs
  • RR gives the same time slice to every process. But what happens if you own 25 processes and I own just one? You get far more total service than I do. A similar problem happens between individual processes

Completely fair: relaxing the quantum

  • Squeezes some progress in well defined space based on number of processes on machine, relationship to time
  • More processes, their share of CPU goes to a minimum level - cost of scheduler running (cost must be accounted for)
  • Requires some fraction of processing time
  • More processes: set a floor on the minimum amount of time that each process gets to run
  • Go to Scribe 2's notes for more quantum notes

Back To Navigation

October 30: Concurrency, Communication and Synchronization

  • Readings for today are the most important in entire course
  • Concurrency as a problem, not a solution
  • Ex. Michael asks a question and everyone answers at once - can't discern an answer
  • Efficient, but concurrency means all processes are apparently executing at the same time - it's not all it's cracked up to be. Must do it safely
  • Problem: trying to access the same thing

Seeds of a Problem

  • OS wants to convince users that machine is doing as much useful work as possible and that all applications are making reasonable progress and have fair access to the underlying system resources
  • Problem of concurrency is Interprocess Communication (IPC) problem: applications want to communicate with each other and share state. IPC is killer app for techniques that manage concurrent execution
  • If you have two processes, and are communicating through some buffer, then there are problems if synchronization doesn't match up between the three elements

Focus Question

  • How to make sure this works safely and correctly?
  • How can two otherwise independent tasks coordinate their access to a shared resource?
  • Once we leave sequential, serial execution behind, we open up an opportunity for incoherent access and updates to data

Concepts

  • Situations where two or more tasks compete to update some shared or global state; such situations cause the final output to depend on the happenstance ordering of the tasks running rather than a semantically correct computation
  • Essence of interaction is you have a race between process1 and process2 to access the buffer. Results are not predicated on algorithm, it's predicated on interleaving of the two jobs involved in interaction
  • If result relies on arbitrary ordering of processes' race, it's a bad thing
  • Critical sections
  • Code regions or slices that contain instructions that access a shared resource (typically sharing some type of memory or variable)
  • Multiple tasks or threads entering the same critical region without some control or coordination mechanism will interleave with each other in ways that produce race conditions and (probably) incorrect output
  • In some sense a critical section is: (1) a code location or sequence of instructions (2) an environment and set of conditions
  • Concurrency problems: shared state i.e. variable, data structure, some resource that two or more processes are interested in reading/writing or checking state of
  • Instructions in processes that access the shared state make up a critical section
  • Set of instructions that operate on some shared state (Ex. mov [var] in p1, mov[...][var] in p2)
  • Tempted to think of it as code. That's fine, but not so much as instructions that are involved, but by the shared state
  • The code that manipulates the state is part of the critical section
  • "Is this a critical section?" Can't tell by sequence of instructions; is data involved in that sequence potentially manipulated by multiple threads?
  • Mutual exclusion
  • Only a single task should have access to a critical region at a time
  • Enforcing mutual exclusion helps serialize and normalize access to critical regions and the memory, data or state that the code of a critical region reads and writes
  • Don't let simultaneous access to the shared state happen
  • Doesn't let multiple threads manipulate shared state at same time
  • If you allow multiple processes to access one critical section, implication is that they are all at different offsets (executed at different instructions in critical sections)
  • Bad, so need a way to enforce one task
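The classic demonstration of these concepts is a shared counter (a minimal sketch; the function names are made up). `counter++` is the critical section: it compiles to a load, an add, and a store, so two unsynchronized threads interleave and lose updates, while the mutex-protected version enforces mutual exclusion:

```c
#include <pthread.h>

#define N 1000000
long counter = 0;                               /* the shared state */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Race condition: counter++ is load/add/store, so a thread can be
 * pre-empted between them and updates get lost. */
void *unsafe_inc(void *arg) {
    for (int i = 0; i < N; i++) counter++;
    return NULL;
}

/* The same critical section under mutual exclusion. */
void *safe_inc(void *arg) {
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run two threads over the same function and report the final count. */
long run(void *(*fn)(void *)) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, fn, NULL);
    pthread_create(&t2, NULL, fn, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

run(safe_inc) always yields 2*N; run(unsafe_inc) usually yields less, and the exact value depends on the happenstance interleaving - which is precisely the definition of a race condition above.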

Classic IPC Problems

  • Producer-consumer
  • Some shared buffer
  • You have something (program, device, etc.) producing signals, noise or data
  • Do something useful - you are capturing it and it's efficient to buffer the event
  • When process has events to generate, store in buffer big enough for purposes
  • Something else observing the events and reading out of the buffer
  • Two threads p and c (in a global shared int array)
  • p generates random values, writes to b, and c reading
  • If p attempts to write to buffer that is full, it goes to sleep
  • c wakes up and reads from the buffer. Hope that it's able to tell when it's empty (test condition)
  • Testing another shared state - both p and c are reading/writing that value so we need something to do the coordination
  • Dining philosophers
  • n-1 chopsticks, as someone is left without it
  • Readers - writers
  • Producer-consumer is a variation of this
  • Some threads want to read state, some of them want to write state; many threads can read at once but when you want to write, how to give it priority or not?
  • Many parts of kernel can affect the critical condition at once: readers can do their jobs, writers have to wait until the readers drop out

Practical Example

  • OS has done things right. Two processes are executing at once - history and grep
  • Both are sharing a state
  • The pipe: magical utility that is a shared buffer
  • Whatever history produces, it is put into the shared buffer while grep attempts to read from that buffer
  • How does it work?
  • The OS is designed for this, and sets up and controls access to buffer
  • history runs for a bit, then takes off; grep runs for a little and reads, then CPU pre-empts grep and sometime later history runs and produces more output
  • Example of successful concurrency
  • If using OS to communicate, it is hopefully implementing correctly and enforces access to a shared state in appropriate fashion
  • history (write(2)) and grep (read(2)): ask OS to manage access state

Example: object.c

  • Simple program defines a data structure, with 3 fields
  • Job of program is to manipulate these fields and print something about the data structure
  • Sequential single-threaded program
  • Look at code manipulating an instance of this thing
  • do_work: manipulates a few fields of the data structure instance
  • What if you add more to do two threads? Thread 1 and thread 2 executes do_work at same time
  • At user level, hopefully kernel doesn't get affected
  • Executing the same tmp->fields is interesting
  • Look at the assembly code (Ex. X.field1 = 100)
  • Not atomic: one statement cannot be accomplished in the scope of one machine instruction
  • Potential exists for the CPU to start executing a line of code, be interrupted, suspend current processes and executes another sequence of instructions (e.g. like those from the child) that reads or writes the shared state
  • mov the value into memory address that is calculated by [ebp-4] and treating that as address: several cycles
  • Single instruction, kernel can decide that thread1 is no longer permitted to use the CPU; someone might be more important
  • Pointer dereferences, manipulating a field - not one atomic assembly instruction. Translates to 2-3. Uncontrolled interleaving of instructions
  • No guarantee of computational results which is a bad thing
  • Essence of critical section (multiple threads manipulating these memory locations) so there is a risk of inconsistent state or incorrect answers
  • Risking answers depending on the ordering of when things get to run
  • Heart of issue: when many things happen at once and the assembly doesn't execute atomically, you get chaotic results
  • This is wrong!!!!! Concurrency introduces a problem, somehow we have to control/fix it (else you have incorrect program)

Simple Code Example: Family Squabble

  • Say the parent forks off child and they both proceed to try to update contents of 'X' with different and distinct values, each in an infinite loop
  • Imagine you have a parent and child (thread1, thread2) and access shared resource struct foo. How would you actually create two threads and execute code at same time?
  • fork() creates child process and we know that the second process executes code right after fork() returns. Boom, you have concurrency and you run into a critical section
  • You get two threads after fork executes
  • Two processes execute the same code: is there an IPC problem? Concurrency problem? No
  • Situation 1: they both try to update contents of foo with different and distinct values, each in an infinite loop.
  • Semantics of creating two processes is such that there are different PAS (COW - copy on write)
  • Execute the same instructions, but when they attempt to write into a memory address the OS says 'distinct address space, creating it separately' - by default, they're writing into different memory locations
  • fork() has no conflict: no critical section because it's not writing to same section
  • Situation 2: clone() with set of flags has concurrency issue
  • clone creates another thread of control that executes in the same PAS
  • Critical section manipulates shared space
  • fork() has no shared space even if they're executing the same instructions (not in the same state), but clone() has shared state
  • Lack mechanism for coordinating atomic access to a shared state

Potential Solutions

  • Lock variables (recursive problem)
  • Introduce some global shared space that processes can use (indicated writing/reading and stops other states)
  • Global flag variable is a piece of shared space that they're attempting to access - recursive problem
  • Disabling interrupts (not trustworthy, multi-CPU systems unaffected)
  • Don't let machine do more than one thing at once
  • Enter critical section, disable interrupts, do instructions, then enable interrupts
  • One problem: impolite. What if something else not accessing the PAS is doing a system call? It won't be able to if interrupts are disabled
  • Don't trust process to disable interrupts
  • Typically the architecture has a way to disable on one CPU; the second CPU can go in anyways
  • Spinlocks/busy waiting (feasible but wasteful since CPU is busy doing 'nothing')
  • Solution if done correctly
  • Spinlock: Ex. while(not.allowed)
  • Test a variable - in shape, it's the right solution. The implementation may be wrong as it still depends on a shared piece of state
  • How to get semantics of spinlock with tools on OS?
  • Problem: CPU jumps - tests and jumps
  • Also what if process sleeps? It takes up CPU space
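The spinlock shape described above can be made correct with an atomic test-and-set. A minimal sketch using C11's atomic_flag (the helper names are made up): the whole point is that atomic_flag_test_and_set reads and writes the flag in one uninterruptible step, unlike a plain `while (!allowed)` test:

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_flag flag = ATOMIC_FLAG_INIT;
long counter = 0;

/* Correct but wasteful: a waiter burns CPU until the holder releases. */
void spin_lock(void)   { while (atomic_flag_test_and_set(&flag)) /* spin */ ; }
void spin_unlock(void) { atomic_flag_clear(&flag); }

void *spin_worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;                /* critical section */
        spin_unlock();
    }
    return NULL;
}
```

This is exactly busy-waiting: correct in shape, but the CPU is busy doing "nothing" while waiting, which is why kernels prefer sleeping locks for long critical sections.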

Synchronization Primitive: Atomic Instructions

  • CPUs can embed some operations that support the basis of atomic operations e.g. single-instruction-cycle reads or writes cannot be interrupted, and the CPU can support some serializing or locking primitives (such as instructions)
  • Peterson's Solution
  • Test-and-Set (MOS: pg 124)
  • Compare-Exchange (x86)
  • All the above rely on spinlocks/busy-waiting
  • Spinlocks have the potential for priority inversion (i.e. a higher-priority process is spinning while a lower priority process is executing with a lock held)
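Peterson's solution for two threads can be sketched in C. One caveat the textbook version glosses over: on modern CPUs the stores and loads must not be reordered, so this sketch uses sequentially consistent C11 atomics (plain variables would break it):

```c
#include <pthread.h>
#include <stdatomic.h>

/* Peterson's solution: two threads (0 and 1), pure software mutual
 * exclusion via an "interested" flag each and a shared tie-breaker. */
atomic_int interested[2];
atomic_int turn;
long shared = 0;

void enter_region(int self) {
    int other = 1 - self;
    atomic_store(&interested[self], 1);   /* announce interest */
    atomic_store(&turn, self);            /* politely take the tie-break slot */
    while (atomic_load(&interested[other]) && atomic_load(&turn) == self)
        ;                                 /* busy-wait: the other goes first */
}

void leave_region(int self) { atomic_store(&interested[self], 0); }

void *peterson_worker(void *arg) {
    int self = (int)(long)arg;            /* thread id: 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        enter_region(self);
        shared++;                         /* critical section */
        leave_region(self);
    }
    return NULL;
}
```

Like test-and-set, this still relies on busy-waiting, so it inherits the priority-inversion risk noted above.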

Concurrency Controls

  • Methods/approaches/techniques to enforce serialized or safe access to critical sections: locks
  • Mandatory locking
  • Advisory locking
  • See e.g. file-level locking, which we will discuss later in the semester ('man 3 flock')
  • Advisory locks allow cooperating processes to perform consistent operations on files, but do not guarantee consistency (i.e. processes may still access files without using advisory locks, possibly resulting in inconsistencies)

Semaphores

  • Up/down
  • P/V history (Proberen/try) and (Verhogen/raise)
  • Nice abstract idea, but needs to be atomic in a real system
  • Possible implementations:
  • System calls up and down
  • Disable interrupts
  • Protect each semaphore critical section with a lock / CMP-EXCHANGE (so other CPUs don't modify even if interrupts disabled)
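The up/down operations map onto POSIX sem_post/sem_wait. As a minimal sketch (helper names made up), here is the classic producer-consumer pattern they enable: one semaphore counts empty slots, one counts full slots, and a mutex protects the buffer indices:

```c
#include <pthread.h>
#include <semaphore.h>

#define SLOTS 4
#define ITEMS 100

int buffer[SLOTS];
int in = 0, out = 0;
sem_t empty_slots, full_slots;
pthread_mutex_t buf_mutex = PTHREAD_MUTEX_INITIALIZER;
long consumed_sum = 0;

void *producer(void *arg) {
    for (int i = 1; i <= ITEMS; i++) {
        sem_wait(&empty_slots);            /* down(empty): sleep if buffer full */
        pthread_mutex_lock(&buf_mutex);
        buffer[in] = i; in = (in + 1) % SLOTS;
        pthread_mutex_unlock(&buf_mutex);
        sem_post(&full_slots);             /* up(full): wake a consumer */
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < ITEMS; i++) {
        sem_wait(&full_slots);             /* down(full): sleep if buffer empty */
        pthread_mutex_lock(&buf_mutex);
        consumed_sum += buffer[out]; out = (out + 1) % SLOTS;
        pthread_mutex_unlock(&buf_mutex);
        sem_post(&empty_slots);
    }
    return NULL;
}

long run_producer_consumer(void) {
    pthread_t p, c;
    sem_init(&empty_slots, 0, SLOTS);      /* all slots start empty */
    sem_init(&full_slots, 0, 0);
    consumed_sum = 0; in = out = 0;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed_sum;
}
```

The semaphores do the sleeping and waking that the notes describe: a producer facing a full buffer blocks in sem_wait rather than spinning.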

Mutexes

  • "Mutual exclusion": special case of semaphore (no need to maintain a count)
  • Example operations (from pthread library)
  • pthread_mutex_lock
  • pthread_mutex_unlock
  • pthread_mutex_trylock
  • What if multiple threads are blocked on a mutex? How to decide who gets into the critical section next?
  • Random wakeup, FCFS, perhaps a race condition for wakeup

Example Mutex Implementation

  • Code from MOS: pg 131
  • Not a bad idea to have a shared piece of state that indicates whether someone else is in critical section (mutual exclusion)
  • Not only want to check it, but if it's available, you want to seize it - how to do both things in an atomic, uninterruptible fashion?
  • Can't unless Intel has done it for you
  • No piece of code you can write that can do two things at once
  • Need the architecture to provide physical reality: instructions (like test-and-set) that do both things in a single, atomic, uninterruptible step
  • Intel implements this as a single instruction that tests and sets
  • Slice of code that implements correct entry/exit from a critical section (i.e. mutex enforces mutual exclusion)
  • Copy mutex to r, set it at the same time
  • What's mutex? Compare to 0 - if it was, jump to a label OK
  • The OK label returns to wherever you were before - the first instruction of the critical section
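The TSL-based mutex described above can be rendered in C (a sketch in the spirit of the MOS p. 131 code, not a transcription): atomic_exchange plays the role of the TSL instruction, copying the old value and setting the flag in one uninterruptible step, and a busy caller yields the CPU rather than spinning:

```c
#include <sched.h>
#include <stdatomic.h>

atomic_int mutex_state = 0;   /* 0 = free, 1 = held */

void mutex_lock(void) {
    /* atomic_exchange = TSL REGISTER,MUTEX: read old value + set, atomically */
    while (atomic_exchange(&mutex_state, 1) != 0)
        sched_yield();        /* held by someone else: give up the CPU
                                 (MOS uses CALL thread_yield here) */
}

void mutex_unlock(void) { atomic_store(&mutex_state, 0); }
```

Yielding instead of spinning is what distinguishes this mutex from a raw spinlock: a blocked thread stops wasting its quantum.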

Question & Answer

Q:

Wouldn't mutex produce a problem of ordering if processes need to complete before others, and it doesn't run yet but another process comes in and the critical section has other processes?

A:

Not a problem of ordering, but there is another problem: if you have a 'which process is more important' situation at the critical section, that's a different problem. An ordering problem is when statements intervene

Back To Navigation

November 1: Deadlock

  • Deadlock: a meta-problem seeded by our careful programming solutions (locking)
  • See Slide 3 for diagram
  • Deadlock arises from the order in which resources get acquired
  • p1 holds r1, r2 and p2 wants r1
  • Doesn't get r1, and it tries to enter mutual exclusion and keeps testing which is fine
  • p1 finishes the job and releases r1 and r2; p2 gets access to r1 but falls back to a critical section
  • Problem: p1 gets access to r1, scheduler pre-empts p1, takes off the CPU and p2 gets a chance to run; gets r2 because it is free, but p1 comes back and attempts to get r2 but p2 has it...
  • Situation: multiple processes wait on each other but they can't give up on their own resources because no progress is happening
  • Deadlock definition: set of processes waiting for an event that only another process in the set can cause
  • Two or more processes are blocked waiting for access to the same set of resources in such a way that none can make progress

Seeds of the (Meta)-Problem

  • Safe concurrent access requires mutual exclusion
  • Processes typically need more than a single resource
  • Key issue: imposing logical order on a sequence of locking operations

Conditions Necessary for Deadlock

  • Read textbook and Wiki and memorize
(1) Mutual exclusion: each resource is only available via a protected critical section (single task)
  • Precondition - if you don't have this, you don't have deadlock
(2) Hold and wait: tasks holding resources can request others
  • Tasks can acquire resource or a subset and basically holds it while asking for it
(3) Circular waiting: must have two or more tasks depending on each other
  • Many threads and many resources
(4) No pre-emption: tasks must release resources
  • Tasks are responsible for releasing the resources
  • No third party to force to give up a resource
  • All four must hold for there to be potential for deadlock; even if all are present, you're not guaranteed to get deadlock (just likely)
  • Depending when these are scheduled, you may not see it

Deadlock is a Manufactured Problem

  • Try to fix one problem, and another arises
  • Resources are usually locks in this context
  • Textbook discusses resources as physical devices, but in real systems the resources are like locks/etc. (constructs)
  • Resources that cause deadlock are the locks and semaphores protecting entry to the critical regions permitting access to these devices

Solution

  • For analyzing a system that is prone to deadlock
  • Resource graphs (read in textbook) - perhaps on test
  • A modelling approach to understanding deadlock
  • Processes are circles, resources are squares, directed graph and edges between process and resource -> process requesting resource
  • Edge between resource to process -> process holds the resource
  • Sequence of events: you can draw a resource graph
  • Detecting whether or not there's deadlock is easy - if there is a cycle in the graph, you have deadlock (recall Inception slides)

Approaches to Handling Deadlock

  • Ignore it: who cares
  • OK in the short term, but later it's hard to understand the resulting bug (short-term gain, long-term bug)
  • Diagnosing it is hard because you need to reproduce the exact interleaving
  • Avoidance: implement a dynamic avoidance approach (e.g., speculative execution, resource graphs) to inform resource allocation sequencing
  • Takes the resource graph and checks each request:
(1) Carry out a sequence of resource requests
(2) Update the resource graph with the corresponding edge
(3) Check if the graph contains a cycle
  • Detect and recover: let deadlock happen; detect this condition and attempt to recover
  • Real systems are large, and there is overhead to track requests and do DFS, so let deadlock happen, then draw a resource graph and do something to recover
  • Prevent it from happening in the first place (usually by scrupulous coding); attempt to make one of the four deadlock conditions impossible or untrue
  • The real solution: ensure the preconditions can never all hold, so deadlock can't happen at all

Detect Deadlock

  • Detecting cycles
  • Done in many ways, such as if you have a linked list, send a pointer iterating over the list and compare with current node under consideration. If they are equal, there is a loop
  • Detecting cycles in graphs typically done through some form of depth-first search
  • Algorithm in the book describes depth-first approach to exploring the graph to see if any outgoing edges lead back to the "root" node
  • Assume a list of nodes
  • For each node N
  • Let root = N
  • Add root to temp list L
  • if(root is already in L) cycle();
  • Any explored outgoing edges?
  • Yes: pick edge. DST node becomes new root
  • No: back to previous node
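The depth-first search sketched above can be written over a resource graph directly. This is a minimal sketch (adjacency-matrix representation and names are made up): nodes stand for both processes and resources, an edge P→R means "P requests R", and R→P means "P holds R"; a cycle means deadlock:

```c
/* Resource-graph cycle detection by depth-first search. */
#define MAXN 8
int edge[MAXN][MAXN];   /* edge[a][b] = 1: directed edge a -> b */

static int dfs(int node, int n, int *on_path, int *visited) {
    if (on_path[node]) return 1;        /* edge leads back to the path: cycle */
    if (visited[node]) return 0;        /* already fully explored: no cycle here */
    visited[node] = on_path[node] = 1;
    for (int next = 0; next < n; next++)
        if (edge[node][next] && dfs(next, n, on_path, visited))
            return 1;
    on_path[node] = 0;                  /* backtrack to the previous node */
    return 0;
}

int has_cycle(int n) {
    int on_path[MAXN] = {0}, visited[MAXN] = {0};
    for (int start = 0; start < n; start++)   /* "for each node N, let root = N" */
        if (!visited[start] && dfs(start, n, on_path, visited))
            return 1;
    return 0;
}
```

For example, p1 holding r1 while requesting r2, and p2 holding r2 while requesting r1, forms the 4-node cycle p1 → r2 → p2 → r1 → p1.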

Recover from Deadlock

  • Recovery tactics
  • Pre-emption: suspend and steal. Depends on resource semantics
  • Suspend tasks and take away some resources they hold - only workable if the resource's semantics allow the claim to be revoked
  • Rollback: checkpoint and restart. Select a task to rollback, thereby releasing its held resource(s)
  • Periodic check point. Rollback to point of system before deadlock occurs then let it run again. Hopefully another process happens and you avoid deadlock that way
  • Cost: a robust checkpoint/restart mechanism is expensive and inefficient
  • Terminate: (1) break cycle (2) make extra resources available
  • Terminate a few processes (involved in cycle), but resources become available to other processes (blunt approach)
  • Terminated process' work is wasted

Deadlock Avoidance

  • Banker's Algorithm
  • Dijkstra 1965
  • Art of abstraction hinders application to reality
  • Detect deadlock - elegant, but doesn't work
  • Need to know maximum resource requests and the like in advance, but we can't
  • Physical devices break
  • Number of new tasks hard to predict
  • Not tested
  • Safe states: there exists an ordering of resource allocations such that all tasks run to completion even if maximal demand made on the system instantly
  • Unsafe != deadlock, it's risky
  • Giving out credit: but delivered credit is unevenly distributed, and remaining credit may not allow the system to reach a safe path (i.e. allow some sequence of tasks to complete)

Deadlock Prevention

  • Break at least 1 of those 4 conditions
  • Program Structure
  • Introduce sequential processing (to avoid mutual exclusion)
  • Without mutual exclusion, we have unsafe concurrency or sequential processing (safe, no deadlocks but lose the benefits of concurrency)
  • Tough to break hold and wait (prediction again)
  • Can't know if we've done the right thing (taking away resources is bad)
  • No pre-emption: tough to kill or pre-empt because semantics and data correctness may rely on this
  • Wipe out work or put system in inconsistent state (what if they were in the middle of printing? The printer driver state gets affected)
  • Circular wait: sequentially enumerate resources; impose (code) rule of sequential access
  • May be able to break. If there's a dependency (the essence of deadlock), can impose a strict ordering on requests to avoid the possibility of a circular chain of acquisitions
  • Requires good programming

Code: Pthreads to Cause Deadlocks

  • In userland, may be more complicated
  • Userlevel threading (pthreads)
  • Library creates the illusion of threading at user level (no system calls)
  • Example code
  • Starts two threads, and they access two variables with shared resources (do it safely with mutual exclusion)
  • Access resources, then exit
  • Run program as many times as we want
  • Run ./tie 1
  • Forces the first thread to count to 1 billion before acquiring the second lock - and nothing makes progress
  • After first thread counts, it should attempt to get mutex-2, but it's stuck (constantly does a mutex lock)
  • Second thread grabbed the second mutex and tried to get mutex-1, but it's already held by the first thread
  • Have all four conditions of deadlock
  • Minimal delay (0.5) no problem even though conditions exist
  • No deadlock? VM is single CPU system, and pthreads is user level library (no real threads)
  • The threads created are not real processes, they are simulated inside (created anonymous regions that pthreads simulate)
  • Threads share the rest of the process address space
  • lib pthread is a simulated thread, with its own scheduler inside, not relying on OS scheduler
  • Without a delay, the pthread scheduler picks each job quickly
  • Poke both the OS scheduler and the pthread scheduler: a delay in the first thread causes the CPU to take the process off and put it to sleep; other processes run; when the OS runs the process again, pthread wakes up and then the deadlock happens
  • Just because conditions are there, doesn't mean deadlock happens - want to avoid this
  • What does it take to write a threading program?
  • Wouldn't it be nice to have the ability to spawn threads and deal with slightly independent subtasks?
  • If you want real concurrency, depend on clone

Example: tie.c

  • Code available on Wiki
  • Buffers and pthread_t
  • Handles for threads (job1, job2)
  • Mutexes, synchronizing primitives that jobs will use to have safe concurrent access (box1_lock, box2_lock) - no one global lock
  • Associated a particular mutex with a particular resource
  • Appropriate programming paradigm in kernel
  • Many data structures have a mutex or lock embedded in them
  • mm_struct has mmap_sem - controls access so that code must acquire it first
  • Responsibility rests with the programmer
  • Note that there is nothing mandatory in the lock - can create a third one that pays no attention to the locks
  • If you're writing in a team, and a noob needs a variable and doesn't use the lock (nothing prevents them from doing this), you may have issues
  • Retrieves delay from command line, multiplies by 1 billion for the delay count
  • Initialize locks so code can use it appropriately
  • Create threads (order doesn't matter)
  • pthread library calls mmap to keep track of stack and context of the thread - CPU context and stack (record of execution) is important
  • Critical section - data rather than sequence of instructions
  • pthread_join collects the threads
  • Once pthread_create is called, the scheduler inside pthread is free to begin execution of code
  • do_some_job
  • Signature follows the pthread_create syntax
  • mutex_lock, address of mutexes (lock them)
  • do_report prints status whether you locked it
  • lock box1 and now in critical section
  • Difference between job1 and job2 is do_job and do_some_job
  • do_job
  • Delay counter
  • Tries to acquire the second lock after locking the first
  • Not entirely unrealistic - lots of kernel code locks something, calls a routine with work, then locks something else
  • Often do not lock sequentially
  • Hit critical section - box2 and box1 change
  • Where does it go wrong?
  • locked boxes in a different order
  • Users can only control the order in which things are done
  • Lock box1, then box2; but the other do_some_job does the reverse (potential for a cycle)
  • Solution: lock in same order. Must follow this discipline or else kernel will deadlock
  • Download the code, then try to fix it
  • pthread code useful for homework 3
  • MUST know conditions of deadlock
  • Should be able to draw resource graphs if asked

Back To Navigation

Tutorial 13: Patches

  • A patch captures changes to source code - e.g., the modifications made to the custom kernel
  • Use this file to update the original kernel sources so they match the custom kernel

Example: File1

hello
world
robin

Example: File2

hello
world
robin
bye
  • Patch applied from File1 to File2 - patching File1 should add "bye"
  • Execute diff <FILE_TO_BE_PATCHED> <FILE_THAT_WAS_CHANGED>
  • diff File1 File2
  • + signs are additions to the source code
  • - signs are things taken away from the source code
  • Today we create a patch for the custom kernel: follow the exercise
  • Send the diff file to SVN

Back To Navigation

Week 9

November 4: Kernel Synchronization Primitives

Announcements

  • HW3 is released
  • Range of due dates
  • Turn it in earlier, gives you your mark before final exam
  • 5 problems listed, 3 problems to do
  • Choose problem 1 (more work) or 2 (bonus if you do both), and you should do problem 5
  • SVN time stamp to monitor submission
  • Submit when you're confident, so as not to confuse the TA
  • Problem 1 is easy, but lots of user-level programming
  • Implement a page replacement algorithm

Pthreads

  • POSIX threads, i.e. user-level concurrency
  • Recall pthread library, deadlock and part of API to get mutual exclusion
  • Deadlock: stick head in sand, ignore potential for damage in critical sections
  • Benefit: performance as you're not busy locking, checking mutex and so forth

Example: urace.c

  • One thread doing something
  • main has time, creates pthread which points to do_job and has a handle to job1, then prints statuses
  • do_job has for loop that increments a variable (substantial work)
  • See how fast it is

Example: srace.c

  • Imposes some locks, particularly in do_job before modifying global state, it locks the state
  • urace took barely any time (0 seconds)
  • srace took more time (4 seconds) - more overhead
  • Price is probably worth it
  • Design decision: what is a critical section, how large should it be?
  • Locking is "expensive" (4x cost)
  • If you're going to define a critical section, make it as small as possible and minimize number of sections in code
  • Design challenge: defining what's in the critical section and where/when to have them
  • Safe way to do things, but there's performance cost
  • As a designer, you have to decide how much parallelism and concurrency there is; may be dependencies between threads (critical section) so you have to protect those
  • If you're going to incur cost of critical section (modify state) you might as well do as much as you can
  • Critical section is slower than unsafe code, but unsafe code is unsafe!

Correctly Servicing Requests

  • Kernels are naturally concurrent - servicing requests in system calls from processes and interrupts from hardware; does it basically at same time
  • Requests not completely independent. If they were, there would be no need for synchronization
  • Kernel is naturally asynchronous
  • Startup - creates processes, enables interrupts, and kernel is fully concurrent
  • Sequential - well-defined order, easy to understand
  • Event handling system - don't have this. Any function or subset of them can run at any given time
  • Interleave, multiple paths - requires synchronization

Main Ideas

  • Atomic operations in x86
  • When kernel has to do something important, needs to rely on hardware platform to provide primitives
  • Ex. Servicing interrupts
  • Kernel locking/synchronization primitives
  • Mutexes
  • Kernel pre-emption
  • Kernel can pre-empt process - one CPU, n processes, stop any one of them and schedule another
  • Kernel interrupting user level processes, but Linux kernel can pre-empt itself (can stop its own threads)
  • Can have multiple concurrent control of multiple paths
  • "Big kernel lock"
  • In early days of kernel, there was one mutex
  • Corrects concurrency but is inelegant and lets people get lazy about defining critical sections
  • Use the semaphore or mutex appropriate to the task at hand (embedded in the data structure)
  • If there was one big lock, you wouldn't get ideas like kernel pre-emption

Synchronization Primitives

  • Atomic operations
  • CPU can do things in one cycle, and ways to enforce it
  • Linux provides the ability to update atomically
  • Disable interrupts (cli/sti modify IF of eflags)
  • Assembly instructions, but on a per-case basis (not best solution)
  • Lock memory bus (x86 lock prefix)
  • x86 has a prefix for certain instructions that locks the memory bus - enforces synchronization
  • Spin locks
  • Semaphores
  • Graceful version of spin locks (see differences later which is an important skill)
  • Sequence locks
  • Read-copy-update (RCU) (lock free)
  • Barriers
  • Spinlocks or semaphores make sure one task or thread at a time is inside a critical section, whereas a barrier gathers multiple threads
  • Machine instructions
  • Barriers are serializing operations that 'gather' and make operations sequential

Motivating Example: Using Semaphores in Kernel

  • Code that operates mm_struct
  • Uses part of kernel synchronization API (manipulates semaphores)
  • struct mm_struct: fields pgd, vm, etc. that you should be intimately familiar with
  • Fields to help kernel maintain meta data about memory space and pages
  • Other data instances that help kernel code access memory in structure in safe way
  • Slides pg. 8 ->

mm_users: reference counting; how many threads are sharing the memory mapping

  • mm_count: how many control paths are accessing
  • down_read, up_read
  • Definition or boundaries of critical section
  • Marks entry of critical section (down_read(&current->mm->mmap_sem) : do something important to this memory, like read operation)
  • Shouldn't be writing or modifying state because critical section is reading it
  • Signal that you're going to write to a critical structure
  • Indication of kernel control path that you're a reader of the state
  • Enforces concurrency, and is voluntary (most locking is advisory; have to agree to participate in the scheme)
  • Without these, you live with consequences - uncomfortable thing
  • After manipulating these data structures, you understand more
  • If you want to touch a data structure, use proper locking/unlocking - down_read/up_read (or down_write/up_write) for semaphores

Primitive One: Atomic Type and Operations

  • On x86, know that some things are atomic
  • Simple asm instructions that involve 0 or 1 aligned memory access (1 cycle)
  • read-modify-write in 1 clock cycle (e.g. inc, dec)
  • Anything prefixed by IA-32 'lock' prefix (happens in atomic fashion)
  • Kernel can define atomic_t
  • Definition in Slides pg. 13
  • static inline...: adds to atomic type
  • Translates to machine code, LOCK_PREFIX is macro defined to lock prefix for assembly instruction
  • Force memory bus to be locked, to export atomic manipulation to counter or some other variable
  • Declaring type not just atomic, has to invoke hardware primitive to enforce synchronization

Example: Reference Counters

  • Refcounts: atomic_t associated with resources, but keeps count of kernel control paths accessing the resource
  • atomic_t - avoid mutex/spinlock/upwrite/etc. as it's more overhead
  • atomic_t update a single variable instance (no overhead, elegant)

Primitive Two: Spinlocks

  • Implements a mutex whose state is checked constantly (busy waiting)
  • Look at multiple wrappings and layers of definition
  • May be frustrating, but good because there is a generic locking API and semantics eventually have to be transformed for platform
  • spinlock_t: raw_spinlock type with rlock
  • raw_spinlock_t: arch_spinlock_t raw_lock
  • Indicates whether you're a reader or writer
  • Mutex flag indicating whether you go in critical section is wrapped by these things
  • Pass in spin_lock: raw_spin_lock calls and then calls another one and so forth
  • Combination of inline functions and macro defines
  • Not incurring function calls (locking to be as cheap as possible so you can delegate to an appropriate primitive)
  • Slide 18, Line 137: __raw_spin_lock(lock): Pass in lock variable and do three things
(1) Kernel is fully pre-emptable so want to disable it: critical, as you're in middle of acquiring mutex and don't want interruptions
  • Macro with a do-while loop
  • Barrier as been discussed before
(2) Kernel draws resource graphs to detect deadlock
(3) Slide 19, Line 144: LOCK_CONTENDED: macro, and locks
  • Passes first argument as reference to lock, and function pointer do_raw_spin_trylock and do_raw_spin_lock
  • Slide 19, Line 393: Call try to acquire lock and semantics of it; if you don't get it, try again
  • Line 394: lock_contended(): completes book-keeping and requests
  • Line 395: lock(_lock): pointer to do_raw_spin_lock and passes in
  • Slide 20, Line 136: do_raw_spin_trylock
  • arch_spin_trylock(): delegates again
  • 7-8 layers down, and finally delegates a __ticket_spin_trylock() (Slide 21) which does assembly code stuff (closing loop)
  • Generic API to acquire spin lock, specific to architecture being run on (architecture only thing that is foundation for concurrency)
  • Physically need something to complete concurrency in one cycle
  • Slide 21, Line 88: cmpxchgw is a compare-and-exchange (test-and-set style) instruction
  • Bottom line: the semantics are basically the same as in the textbook (try to acquire the mutex; if you get it, enter the critical section; otherwise stay at the top of the critical section, repeatedly trying to acquire)

Primitive Three: Semaphores

  • Primitive 2 is a bit of a waste: not elegant as you don't know difference between readers/writers (may have multiple readers, but do not want to let in a writer)
  • Wasteful to constantly spin and check - better to schedule another process; the task that can't acquire the mutex sleeps
  • Wasted CPU
  • Solution is semaphores. The difference: a list of processes sleeping on the semaphore, so no CPU time is wasted

Important Caveats about Kernel Semaphores

  • Semaphores are not like spinlocks in the sense that the invoking process is put to sleep rather than busy waits
  • As a result, kernel semaphores should only be used by functions that can safely sleep (i.e. not interrupt handlers)
  • Slide 25, Line 79: rw_semaphore
  • Line 80: rw_sem_count_t count: semaphore
  • Semaphores were introduced by Dijkstra and are basically a counting mutex - they count how many threads are currently in the critical section
  • Semantics are acquire (down) and release (up)
  • A counting mutex - a count of things available/how many threads can enter the critical section
  • Line 81: spinlock_t wait_lock: has nothing to do with the semaphore itself, but protects access to wait_list (a spinlock, since the list is shared state)
  • Many processes may arrive to this wait_list, waiting to get it (critical section, so acquire a lock to put self on list)
  • Line 82: struct list_head wait_list: list of processes/task that wish to acquire semaphore so kernel can refer to it
  • Effect of down/up_write is to register yourself to the list
  • Data structure that tells difference between spinlocks, semaphores and what kernel is doing
  • Acquire semaphore for mmap_sem and calls down_read (read a few variables out of the stack)
  • Basically same thing - down_read and up_read has same layers of concurrency
  • Slide 25, Line 21: might_sleep calls schedule and might be put to sleep (kernel knows you may/may not get it)
  • Boils down to assembly code: get to the machine
  • Slide 28: task leaves the critical section, then calls call_rwsem_wake (line 111), which does book-keeping and calls rwsem_wake
  • Who gets to enter critical section next?
  • May have a collection of readers and writers waiting on the wait list - same pattern, where the kernel knows someone left the critical section
  • If there's a writer at the front of the queue, we let them go first (FIFO); otherwise readers race for it (rwsem_do_wake)
  • If writer on queue, get to go first; if not, wake everybody up and let them make progress

Advanced Techniques

  • Sequence locks
  • Solution to the multiple-readers/one-writer problem in that a writer is permitted to advance even if readers are in the critical section
  • Readers must check both an entry and exit flag to see if data has been modified underneath
  • Read-Copy-Update: synchronization primitive
  • Designed to protect data structures accessed by multiple CPUs; allows many readers and writers
  • Readers access data structure via a pointer; writers initially act as readers and create a copy to modify. 'Writing' is just a matter of updating the pointer
  • RCU treats readers and writers the same initially, particularly in circumstances where a pointer refers to a data collection
  • A reader enters the critical section and follows the pointer to the data; a writer also enters with no problem, but copies the data, updates the copy, then updates the pointer (an atomic operation) - a lockless way for readers and writers to manipulate shared data
  • Only for kernel control paths; disables pre-emption
  • Used to protect data structures accessed through a pointer
  • By adding a layer of indirection, we can reduce wholesale writes/updates to a single atomic write/update
  • Heavy restrictions: RCU tasks cannot sleep
  • Readers do little work
  • Writers act as readers, make a copy, then update copy. Finally they rewrite pointer
  • Cleanup is correspondingly complicated

RCU Example: getppid(2)

  • Kernel needs to do synchronization safely so it needs own primitives
  • Semantically no different than locks and semaphores and foundation is based on machine architecture exposes

Question & Answer

Q:

Wait list vs FIFO queue. Why isn't this FIFO queue?

A:

The list is the kernel's generic doubly linked list structure
We will see: as someone exits the critical section and calls up_read, what does the kernel do? It chooses either to implement FIFO or to wake them all at once and let them race to acquire the mutex. More of a scheduling problem

Back To Navigation

November 6: Test Review Material

Back To Navigation

November 8: Threads (User-level Concurrency)

GDB

  • Powerful tool that may be daunting; lets you look deeper than what system calls are doing
  • Lets you recover the contents of memory at the point your program crashes

Example: loader code

  • Successful run, but let's look at gdb
gdb -q ./<program_name>
  • Start gdb:
run
  • Run the program
  • Where to see when program crashes? GDB intercepts segmentation faults and stop program from there
  • Where to set a breakpoint? Line numbers suck
disassemble 
  • Ignore source for now, but this assembly code is what's really going on
  • Look at call *%eax; you can find the address there
break *0x80486a3
  • When program is run in gdb, stops at this point
  • At this point, new memory region should be set up; let's examine valid memory address
x/FMT ADDRESS
x/16i $eip
  • Looks at 16 instructions starting from eip
x/16x $eip
  • This is in hex
stepi
  • Execute next line

Pthreads

  • May be useful for HW3
  • User-level threading libraries are easier to use than calling kernels and stuff

Example pc.c

  • Code available on Wiki
  • Encouraged to download it
  • Shares access to global states: buffer and position
  • int buffer[BUF_SIZE]: Integer buffer of some spots long
  • Ex. 5 spots in buffer
  • void* produce(void* arg): Producer generates an integer at a time to place into end of buffer list
  • void* consume(void* arg): Consumer reads integer from beginning of buffer list and deletes that item
  • Must protect this shared state (pos); must control/synchronize access
  • One thread touching x, no concurrent access - need not protect it with lock
  • Inside critical section, calling a few functions
  • Too much work being done?
  • No need to lock it if you don't need to do it in the next few instructions
  • Manipulating shared state: print_buffer which iterates over the whole thing
  • Even though it's just reading, must ensure values in pos and buffer remain stable
  • Producers and Consumers use this method
  • insert should be in critical section because of all the changes it makes to pos
  • Just because it's a function call, doesn't mean it's the wrong choice for making critical section size
  • As a code designer: does his piece of work belong in critical section (i.e. reading/writing shared state)?
  • What about the if statement?
  • Not shared, no need to put it in critical section

Output

  • Produced/removing elements
  • Insert failed, dropping value #
  • Buffer is size 5
  • Producer runs first
  • Overwriting elements may not be best; design choice is to drop an item when buffer is full
  • Happens in packets - can't force something physically impossible
  • Producer, Consumer execute at different rates; buffer is controlled
  • What would happen if we took the reins off? No sleep
  • Fails to remove any items from the list
  • Execution as fast as possible
  • Remove pthread_mutex_lock and pthread_mutex_unlock in Producer and Consumer
  • Consumer runs first, same pattern
  • Concurrency problems hard to find and reproduce

Challenges

  • Producer, Consumer inserting/removing one element at a time in controlled conditions
  • Try to get them to insert/remove random different elements at the same time
  • Unless you introduce a critical section, things run in parallel (unsafely)

Back To Navigation

Tutorial 14: Time Measurement

  • Many functions to measure computation - time(1), time(2), gettimeofday(2)
  • Programs from UNIX (go to terminal and type time followed by any command you want to measure time of)

How to Use time(2)

  • Returns wall-clock time
  • Used as time_t start = time(NULL)
  • Execute whatever you want to measure
  • To get the starting time and then: time_t end = time(NULL)
  • To get the number of seconds, subtract the two values

How to use gettimeofday(2)

  • Another way to return wall-clock time
  • Used as:
struct timeval start, end;
gettimeofday(&start, NULL);
  • To get the starting time, and then: gettimeofday(&end, NULL);
  • More precise, can get measurement in microseconds, etc.
  • Can measure any operation, need to provide start and end point for getting the time (subtract it)

Back To Navigation

Week 10

November 13: Signals

  • Signals offer a way to send short, pre-defined 'message' to processes
  • OS supports generating and delivering signals
  • Simple API for sending messages
  • Easy to know semantics because it's small
  • A small number delivered by the OS between processes
  • Source of signal: user, another process, kernel, kernel on behalf of hardware
  • Can implement reasonable response to most signals
  • Signals provide a way to:
  • Notify a process that an event has occurred
  • Allow the process to run specific code in response to that event

Example: kill signal

  • If you didn't know what signal to send, look at man pages
  • Tempted to be root for SIGTERM, SIGHUP
  • SIGUSR1
  • Most processes don't have custom signal handlers defined for these signals
  • Signals are exceptional conditions - wouldn't build a network server from these
  • Signals can be sent, see a list, and can catch some signals

Basic Signal Workflow

  • Signal s is generated (may be in response to an illegal memory reference, or processes agreeing to send signals as a way of process communication)
  • Generated through system calls which involves the kernel
  • Looks up the target involved and delivers the signal
  • Attempts signal delivery to current process - may be current process generating signal
  • If target is not current process, the signal is pending
  • Once OS looks at these filters, it delivers signal (writes small number in list to task struct, then when process can execute the OS forcibly jumps to event handler)
  • Else kernel delivers signal to process
  • But process may have blocked this signal
  • Process may also be masking this signal
  • If delivered, execute default action or sig handler


  • Signal generation
  • Kernel data structure is updated to register the presence of this new signal instance
  • Each generated signal is delivered at most once
  • Generated signals may not be delivered (with delay, or at all)
  • Signal delivery
  • A generated signal may be forcibly delivered by the kernel by either changing the task's state (e.g. STOPPED) or executing a specific signal handler
  • Delivery may be delayed; such a signal is termed 'pending.' Additional pending signals are discarded


  • Blocking/masking signals
  • A process may temporarily block or mask a certain set of signals
  • SIGPROCMASK(2): manipulate current signal mask
  • SIGPENDING(2): get pending signals


  • Actions on signal delivery
  • A process can ignore the signal
  • The default action occurs (terminate, dump core and terminate, ignore the signal, stop the process, continue the process)
  • Catch the signal via a registered signal handler
  • Ignoring != blocking


  • Catching a signal
  • A signal is an asynchronous interrupt from the point of view of the user code
  • User process has an opportunity to run a signal handler
  • Signal handler invocation (if present) is prepared by the kernel

Diagram: Figure 11-2

  • Slide pg 15
  • Normal program flow (user mode) -> do_signal / handle_signal() / setup_frame() (kernel mode) -> signal handler -> return code on the stack -> system_call() / sys_sigreturn() / restore_sigcontext() -> normal program flow
  • OS notices initiation of event, which is a call do_signal() - job is to look at target of signal and who to deliver it to; will there be mask of signals to receive?
  • Jumps asynchronously to signal handler, which does small amount of work (it's just a function) and control flow jumps to end position of program
  • Entering, leaving signal handler is complex - OS needs to ensure context is correct and restore as much of original context it can
  • Issue: signal is not a normal message, it's an exceptional condition
  • Default intuition for this mode of computing: on weird conditions, terminate and restart the program. But you lose lots of accumulated state
  • What to put in signal handler?

Using ptrace(2)

  • Enables one process to trace another (examine its state, control its execution, read/write its memory)
  • Relies mostly on signals for STOPPING and CONTINUING the traced child's execution
  • Bad system call because it does a lot of stuff
  • Exceptional because it purposefully breaks the isolation between processes, particularly ptrace'd programs give access (can rewrite return values, change things, etc.)
  • Versatile, but weird and unlike other kernel functions
  • Normal interface has powerful requests - read data from the process address space (PTRACE_PEEKTEXT, PTRACE_PEEKDATA)
  • Same request either way: Linux uses a flat segmentation model, so text and data are basically mapped to the same segment
  • Programming using ptrace is a pain - responsible for driving execution of process, and when you stop, examine or changing process then you are responsible for continuing execution
  • Continues until some other signal happens. Probably avoid - need PTRACE_SYSCALL, PTRACE_SINGLESTEP, which let the process go from one system call to the next

Example: synfer.c

  • Source
  • Mimics strace using ptrace
  • init_attach(char *tpid)
  • Convert char to long and do ptrace(PTRACE_ATTACH, tr_pid)
  • Step to system call - stops for entry and exit
  • Value of eax is system call write
  • Checks flag if entering or exiting
  • Entering: prints data
  • Exiting: check return value

Question & Answer

Q:

Can't you use ptrace for malware? How can you prevent abuse?

A:

Defense of ptrace: the kernel compares the IDs of the issuing and target processes - notably, if I take over Firefox, I can ptrace your other processes (once I'm you, I can do whatever I want)
  • Used for malware, debugging and security models

Back To Navigation

November 15: Files and File I/O

Announcements

  • USRI on Dec. 2
  • Final exam on Dec. 18 at 8 AM

Perspective

  • Traverse layers of abstraction, and begin with familiar place: files
  • User-level tools for manipulating files (content, metadata)
  • See how files are managed and manipulated - instead of clicking them, see how OS supports the data
  • C library functions and structures that represent and manipulate a file stream as a buffered sequence of bytes
  • Use the System Call API to manipulate a file, illustrating three ways to interact with file content:
  • Have our experience mediated by software applications
  • Have our experience mediated by the C library
  • Ask the kernel directly for service
  • Familiar with write(2)
  • Interface: Write to file descriptor, address, how many bytes to write to
  • Generic, doesn't say how OS is locating bytes and putting in the file descriptor - magically appears in standard out
  • Question: how does kernel provide support for managing persistent data?
  • We've seen how kernel supports storage of volatile data in primary memory, how it's addressed, organized in pages, and extended across memory hierarchy as virtual memory
  • Manage non-volatile data
  • Answer: 'file' abstraction, generic wisdom that everything in UNIX is a file
  • Powerful analogy: managing, manipulating most resources (rather than just files alone)

Class Experiment

  • Everyone writes favorite sentence (from MOS!)
  • Individuals pass it to the center
  • Those in the center, pass it down to the front
  • Who encrypted or compressed their message?
  • Piles of paper
  • Find one of the sentence, where is it?
  • Somewhere in one of the piles
  • Look at each paper, trying to find the contents in the paper
  • Take paper out of the pile to look at it
  • Find another sentence: what to do with the current paper?
  • Put it back into the pile - anywhere you want
  • Not easy to find file - no identifiers (different color, etc.) and no names associated to it
  • Time consuming
  • Need for disciplined approach for managing data
  • If we left for today, where would papers go?
  • Challenge for file system
  • Hard disk fail
  • Tower of Backup - do you backup your backups?
  • Need for effective way to manage files
  • If we had to find paper and label it, chaos ensues - not really a solution
  • Label, but how to find specific contents? Must still open each file and sift through it

Questions

  • Some questions were covered above
(1) Tell me how to find your piece of paper
  • Look at contents
  • Need for naming, indexing
(2) Tell me how to retrieve the 3rd word of your sentence. The 75th?
  • Find file (perhaps linearly)
  • Look up some byte in the file (may/may not be there)
  • Scan linearly byte by byte? Wouldn't it be nice if there was an operation where it jumps to that byte?
  • Solving this problem is the entire genesis of why file systems exist
(3) How do I handle duplicate sentences?
  • Compress files into one
  • How to manage representation or storage?
  • Access data at once?
  • Chaos, OS needs to synchronize or serialize some access to files
(7) Who can see your paper? Your sentence? Who can modify it/them? Who can delete them?
  • Concurrency issues
(9) What if we have 1450 students in the class? 14500? 1.5 million?
  • What if data were physically distributed?
  • Questions are system design questions
  • Half of our discipline isn't just inventing the wheel, you look at existing solutions/components and see if they meet requirements of problem
  • See what existing file systems do for you - what guarantees do they provide?
  • Write behind it, maybe data distribution happens, maybe a file system that talks to SSD, RAM, etc.
  • Understand top to bottom what guarantees the file system provides

Problem of Files

  • Why files?
  • Virtual memory isn't sufficient - it's volatile when you power down a machine
  • Not big enough - size and scale of data increased rapidly, need to manipulate in a reasonably efficient fashion
  • Understand the problem of persistent storage by analogy to non-persistent (virtual) memory
  • Useful data needs to persist beyond power cycle
  • How to find what file you need? Indexing into virtual addresses, memory addressing
  • How to assert ownership and control of files? Protection issue; cf. segmentation DPL, page protections
  • How to dynamically grow a file's content and not litter copies of old data? How to give resources back to OS? Memory allocation
  • How to keep 'hot' data close to CPU/registers? Virtual memory
  • How to support performance in efficient fashion? To keep data you're reading/writing a working set of data?
  • Linux tries to use physical memory - not just for processes, but for file contents
  • Linux reads off disk and stores it in memory so access to it is fast

File Content

  • What is a file?
  • The relationship between bytes is worth capturing by delineating one collection of bytes from another
  • Data, bytes grouped together under some context/associated with other data
  • Container
  • Manipulated in its own right. Some set of operations on the contents
  • What is in a file?
  • Student answers:
  • Distinctly different information
  • Bytes (hopefully)
  • Metadata
  • Header and list of allocation blocks
  • Address to pointer/Identifier
  • How to find file, point to bytes inside the file
  • Physical location
  • May find bytes of contents
  • The bytes of a file are not all beside each other - why shouldn't they be contiguous?
  • Virtual memory issue: notion to expand, and need to allocate space, which implies a need to handle collection of bytes in less than file-sized chunks
  • Collection of blocks, which are the bytes of the contents
  • Pointing to a file - point to the physical locations of fixed-size subsets of the file

File Type

  • How to manipulate bytes? Is there a value to see it as a different type?
  • Don't want to just see all files as a sequence of raw bytes - not useful as we manipulate persistent storage meant to interact in certain ways (images, ELF files..)
  • Lots of file types. If given A, B, C, how to know what file it is?
  • Bill Gates added extensions like .123 to say what the file is - crude, but keeping track of the file type is useful
  • Users expected to know which program could read and write a particular file, extensions not used
  • File extensions not necessary, but file type is necessary (what program to run it on)
  • Many developing file formats keep track internally of what the file type is: the first few bytes (along with some other structure) often describe the file format (like the ELF)
  • Don't mess with file name, do something internally in file itself
  • man 1 file: determines file type
  • Ex. File A with 4 MB - what file type is it?
  • To determine it, can look at the sequence of bytes
  • To check if it's an image - parse it and if it works, great! Can try an interpreter
  • Seemingly simple problem is a fundamentally hard problem - problem of language recognition
  • Here is a stream of bytes, recognize the language for me - build a parser that exactly recognizes it
  • Magic bytes at the start of a file tell you what type of file it is
  • Ex. ELF in beginning - OS believes it is an ELF because of beginning of file
  • Binary programs may still contain readable ASCII content
  • UNIX files contain pure data, not markup

Manipulating Files as a User

  • Any program that understands the file format plus commands you should recognize:
  • ls, cd, chmod, lsof, pwd, chown, chattr, lsattr, cat, namei, more, hexdump, touch, df, ln, file, stat, rm, cp, lsusb, unlink, lsmod, lspci, mv, mkdir, dd, chgrp, basename, grep, rmdir and so forth

File Attributes

  • UNIX files typically do not embed metadata about the file ('metadata' is different than file format structure)
  • Metadata is expressed via the type of file system the file resides on, or via the VFS delegation to the device- or file-system-specific kernel functions that actually implement open, read, write, seek...
  • To write programs, C library has facility built on system calls and allows user to manipulate file/attributes associated to it
  • Manipulation of metadata of file as UNIX files do have small amount of metadata associated with them
  • chattr(1): standard set of attributes a file has where files are marked
  • Even though space of metadata is there, requires OS to understand attributes
  • lsattr(1): most files don't use these attributes - they are either not useful to them or simply left unset
  • Can change attributes to be immutable - good to prevent accidental deletion

File Permissions and Ownership

  • Files, like memory, contain data. Users expect to be able to express ownership and access control rules about these collections of data
  • Secure, isolate, protect data in the file. UNIX has two major approaches:
  • Mandatory Access Control (MAC)
  • Discretionary Access Control (DAC) - use for this course
  • The owner of a file has discretion to set its permissions
  • Where do these permissions get stored?
  • Can use chmod(1), chgrp(1), chown(1)
  • C Library: File-related glibc calls
  • FILE* stdin/stdout/stderr;
  • Physical I/O characteristics concealed - man stdio(3)

Example: ll

  • Should be ls -l
  • First column describes the policy for accessing a particular file
  • Third and fourth columns describe ownership of the file - the primary user and the group it belongs to
  • Collection of letters r-w-x (read, write, execute)
  • d (directory) - more letters like s (sticky bit), l (link between files)
  • Ability to manipulate bits - important metadata of machine
  • chmod changes permission
  • Collection of permissions is a number
  • First group for eye, second is cpsc, third is everyone else
  • rwx rwx rwx (user group other)
  • If each of these is a bit, each triad can represent 2^3 = 8 different permission combinations
  • 0 means the permission is absent, 1 means it is present
  • For r-x: 4+1 = 5 - chmod 555 <file_name>

Directories

  • Files are themselves a way of grouping or associating a collection of data. The internal semantics of a file can be almost arbitrary (indeed, you can 'hide' an entire file system or even virtual machine inside a single file)
  • Collecting and grouping files can be done via directories
  • Directories are a way of providing structure to a namespace, and directory names are the elements of that namespace
  • In UNIX, directories are files

Paths

  • A path expresses one way to traverse the directory namespace. A path is composed of all the directories 'along the way' to a file or directory
  • Relative and absolute file pathnames
  • 'Special' directories and path elements: '.' and '..'
  • Two different paths can refer to the same file: ln -s /home/michael/foo /tmp/bar

Conclusion

  • Files are an abstraction that allows programmers, users and the system to manipulate arbitrary collections of persistent data
  • Files go beyond this use case to provide a convenient and flexible interface to manipulating many different types of data
  • You can manipulate files with command-line tools, with C library functions and 'directly' with system calls

Back To Navigation

Week 11

November 18: File Systems

Recall

  • Files are a virtual construct that defines a namespace in terms of a sequence (a range of bytes)
  • Parts of that file are scattered across a physical medium
  • A programmer sees a file as a name, path, and sequence of characters, but the name is really just an attribute of the file
  • We are used to thinking of it as a collection of bytes - logically contiguous but physically scattered across the machine
  • If we can access files via system calls or glibc, the machine does a lot of work to give us the illusion of contiguous storage
  • We know size, data, permissions, etc.
  • File system keeps track of files and their content. In Linux, a deeply layered approach is taken to defining a hierarchy of implementation of various services


Function calls: fopen("...")
-> open("...")
-> SYSCALL layer (sys_open(...))
-> (VFS -> file system -> buffer/block I/O -> device driver)
  • File system is a logical arrangement of data; the kernel has other layers to talk directly to hardware
  • Performance layer to keep file data as close to the CPU as possible (buffer cache)
-> I/O to the device


  • Ideal storage unit: unlimited
  • Within it, might have another construct: file F from position k -> j
  • Can access any byte efficiently and in constant time
  • How large should we allow file F to get? Should we allow it to grow and shrink?

Motivation

  • As OS or file system developer, responsible for managing file content: storing actual parts of the file onto some medium and providing lookup and indexing facilities to support read(2) and write(2) of that file
  • What sort of flexibility and power does a file system need? What are the types of files it is likely to deal with? What are the design requirements?
  • Around 448549 files on a typical desktop box
  • Relatively many small files, only a few large files
  • Most files range from 1KB-5KB
  • Where are the files? (Content not name)
  • How can you keep track of space needed for growing these files?
  • OS system call API (e.g. open, read, write) hides such detail from us
  • Question: how does kernel provide support for a similar translation between some byte offset in a file and actual content on a physical device like a hard disk, compact flash, optical media, or other storage?
  • Answer: file systems. To design and build an effective one, you should try to understand the likely workload and typical environment - these open-ended design requirements lead to quite a number of different file systems

Example: seek.c

  • Open file, write message, seek 1 GB of file and write again
  • Disk space doesn't change
  • Content of new file (moralcodecastinsteel.txt*) - where is it?
  • A hole in the middle
  • When reading some number of bytes, the middle is just a bunch of zeroes - OS generates them on the fly
  • Virtual space in the file (its addresses) ranges up to 1 GB, but the physical space it takes up is much smaller
  • Implies OS is doing some sort of mapping/lookup in the virtual space and particular byte number
  • Design of mapping is heart of a file system - whether or not it's efficient or appropriate

Types of File Systems

  • How to implement lookup mechanism? Each different answer is a different file system
  • Blue directory names are different file systems
  • Directory structure
  • Often think about directory structure. In Linux systems, you have 'root' of tree denoted by "/"
  • "usr", "bin", "etc", "opt", "local", "home"...
  • Tree can be as deep as you want
  • Tree defining set of relative lookup operations
  • Logical lookup, but nothing about arrangement on disk
  • open("/home/eye/457/lectures/files/seek.c", O_RDONLY) might be an absolute path -- uniquely defines this seek.c. Useful human level tool
  • Given this, seek.c is a collection of bytes. OS has task of associating data structures that point to a physical space
  • File system is NOT a directory

Perceptions

  • Files (and their names and the glibc and syscalls that manipulate them) may be the most visible interface to your persistent data (and thus define your experience and intuition)
  • Directories (and their names) may be the most visible facet of a file system, but these concepts are far from the same thing
  • Directory tree namespace is a hierarchy of names that can help: organize your data into related groups and help impose some kind of access control on that data
  • In UNIX, since most resources can be represented as a file and the directory tree offers a unified naming scheme and semantics, it can be difficult to tell what role a file system (or several of them) play behind the scenes


Mounted File Systems

  • Directory tree name space can have many file systems attached to it at the same time
  • Experience: Windows exposes different devices with C:, D:, E:, etc. drive specifier. In a sense, it's multiple directory trees. Can mount a remote disk drive or directory tree via network (NFS, CIFS/Samba). Plugging in a USB disk causes another drive designator to appear
  • UNIX hides this kind of thing: no need to create a separately rooted directory tree for each new device. Instead, physical devices (and file systems present on them) can be joined to almost arbitrary points in the existing directory tree naming structure
  • Can attach file systems to directories - entirely seamless to indicate what file system your directory is stored on
  • Code in kernel designates what the right set of routines and data structures are for a respective collection of files in virtual name space
  • mount: mount a file system
  • Can mount a file system to a (sub)directory of another file system
  • Can mount multiple file systems to a single mount point, effectively stacking multiple file systems on to one point. Previous file system is partly hidden
  • Each process can have its own namespace (i.e. its own directory tree)

What is a File System?

(1) Collection of code in the kernel
  • Collection of data structures and functions that specify how the kernel should perform file-related operations
  • Supporting the 2nd point
(2) File format or network protocol specification
  • Specifies the details of the lookup mechanism
  • Defines structure for distributing files over a physical medium
  • Choosing how you allocate this data, how you index particular bytes in a file, and what optimizations apply implies the choice of this format
  • The format is the essence of the file system; the functions for allocation, indexing, pre-fetching, pre-allocation, defragmentation, etc.
  • are a foregone conclusion of how the files themselves are placed and how the indexing information is represented
  • Same responsibilities as memory management - allocating storage, and talking to actual physical media (same activities)

Roles of a File System

  • Most important job of a file system is keeping track of data
  • Hard job since the 'virtual' file space abstraction needs to be mapped down to a physical device - particularly when asking the device for more free space
  • Roles of file system:
  • Managing the need to allocate more space (and all that goes with this, including performance management, deferring writes, etc.)
  • Understanding and talking to physical media (usually through a device driver)

What is in a File System?

  • While files contain (better word: name) data, file systems contain the indexing structure that point to files and directories (file system data structures also typically contain files' associated meta-data)
  • For many (but not all) file system formats, the in-kernel memory version of these data structures is a direct reflection of the information stored on the disk
  • Much like page tables are stored in physical memory, this metadata must physically reside somewhere
  • File systems typically have more than one superblock, although only one is active at a time

Where is File Metadata Stored?

  • Metadata and attributes of the stat output was split across several things such as the directory and the data structures of the file system. The inode (index node) is the solution as it is a one-to-one mapping of the inode (data structure) and inode number to a file
  • Ability to keep track of size and permissions - kernel storing these attributes
  • stat <file_name>: provides information
  • Storage is combination of places -- all directories are also files
  • On Unix-like system, metadata and file itself - data structure that remembers it
  • inode (index node)


  • Inodes have unique identifier that remembers File F starts at some offset k
  • Lesson: given a key/identifier in directory structure (virtual namespace of keys), how does OS take this info and begin to look up information? The answer is inodes
  • Problem: pathname lookup
  • Starting from current->fs->root or from current->fs->pwd, for each directory entry in the path, get that inode and extract info; the entry matching the next path component is used to derive the next inode
  • Must check permissions along the way at each directory, expand symbolic links along the way, traverse file systems (if path component is a mount point), handle namespaces (same file path string in two different namespaces may resolve to different files)
  • Goal: wind up with file's inode
  • If you misspell it, OS checks if identifier is valid
  • ID valid but unmounted - paths are split into directories and OS may have to seamlessly mount it all
  • Similar to memory management - given a logical address, it does a bunch of things to find physical space
  • Gives data structure to inode of that particular file
  • Doesn't hand it to you, but hands it to struct file that keeps track of metadata
  • Inodes are a useful solution to a number of challenges:
  • Convenient place to store metadata
  • Collections of them are smaller than FAT tables
  • Sharing files (i.e. setting up hard or soft links) and supporting updates is easier

Problem: Which Disk Blocks Belong to File F?

  • Once we have the inode, how to find which disk blocks contain the actual file content? How does the OS go about associating a particular set of blocks with a particular range of bytes for an arbitrary file?
  • Considerations: performance, seek time, simplicity of implementation
  • Most algorithms assume a fixed-size block of data. Reminiscent of paging vs segmentation
  • Contiguous allocation
  • Start at position 0, continuously allocate space - doesn't allow you to do graceful expansion - shuffle/copy whole thing
  • Pros: easy to implement; reads are efficient - a single seek to beginning
  • Con: fragmentation (files grow and shrink and require shuffling around)
  • Linked list allocation
  • At the end of every block of file F is a pointer to the next block of file F - blocks can be anywhere in the storage medium
  • What if you have to index into half a gig? Must be constant time operation; linked list requires you to follow it
  • Pros: conceptually simple: a file is composed of blocks that point to each other
  • Con: random access is slow because we must traverse the list
  • FAT Allocation
  • A table that is a collection of the pointers - a tweaked 'linked list' approach that factors out all the 'list' pointers embedded in each block. Keep the pointers in a single coherent table
  • Can look up block number in table - random access easier as whole block can be dedicated to data
  • Need to spend more RAM to hold the entire table
  • Need to have entire table in memory to be constant time operation; for large disks, you have a large table even if entries in tables aren't used
  • Flat inode
  • The collection of inodes at beginning of the disk each hold a number of pointers to the disk locations of the blocks containing their file data
  • Multi level index scheme
  • At beginning of disk, you have collection of inodes that are pointers of blocks of each file. You have one inode per file, and inside the inode is an array that does the indexing similar to page tables (small amount of memory that indexes a large amount of space)
  • Based on the notion of common (small) file sizes - the first few pointers point directly to data blocks.

Questions & Answers

Q:

Do you do this analysis (of allocation approaches) before building a system? How do you go about it?

A:

The real world is concerned about what we did yesterday. Measure in high fidelity way, and see if there's a disconnect between the way we are doing things and how we should be
Measure how people are using files today to write a file system

Q:

When you do a 'hard link' (way to reference directory from another) is there only one inode?

A:

Want to make /data/seek.c be a shortcut to /home/eye/457/lectures/files/seek.c
OS makes links - hard link and soft links
Do you need two different inodes to refer to the same file?
The name is being duplicated, not another copy of the file
Soft link: a new file whose contents store the path of the file you want to reference
A new inode points to the file name, because you're creating another file (allocating another inode that points at the target's path rather than just adding another entry to the directory)
Different inode
Hard link: ln <file_to_duplicate> <target_name>: links the files
Name is duplicated, same inode

Back To Navigation

November 20: The Linux VFS

  • Specification of timing measurements in Assignment description. For full marks, do it
  • In kernel source code, many file systems to use - some specialized (network, reading/writing format)
  • Disconnect - some file systems are quite specific, yet a single system call interface opens/reads/writes any file we want without worrying about the physical layout or what the file system is
  • If writing to a file on an NFS mount, you just call open/read/write - you don't care which particular file system is mounted
  • Hiding those details takes a huge amount of effort
  • Part of it is Virtual File System

Command Line Utilities

  • echo "Portlandia" > file.txt
  • Part of the shell opens the file name and calls write to put the contents in file.txt
  • cat file.txt
  • Simple program that opens the file, then reads and writes to stdout
  • C code
-> fopen()
-> open("path",flags,mode) = fd ... read(fd, buffer, num_bytes);
  • Location, r/w, permission bit
  • The string "path" would be the "file.txt" name in the current directory, but could have provided an absolute path name like "/home/eye/file.txt"
  • Kernel knows to take file string into rest of what kernel can work
  • As a user, stop caring what happens at open()
System call layer: open, read, write ...
  • Is VFS system call layer? No
VFS: below system call API for file-related system calls


  • These two APIs sit right beside each other, and this works well because the system call interface is quite stable
  • No need to change code to read/write file, which is great
  • Simple, stable API with stacking mechanism
  • Without the VFS underneath the system call layer, we'd need code that worries about how to locate file names/paths on different media in different file systems
  • That would make the API buggy and error-prone, since packing all that code into open/read/write means lots of error-prone code
  • Design need for factoring out different file systems -- it's an internal API for kernel
  • vfs_read() system call delegates to the read function
  • Kernel looks at a data structure that represents the file path and all metadata with it; also will know what file system it's stored in
  • Invokes particular file system -> delegates to a generic I/O layer that caches so file access is fast -> piece of hardware (solid state storage, USB, network)
  • Seven layers of calls that happen, and each layer deals with a specific abstraction that allows us to expose a nice, consistent interface for user level programs and flexibility to access different devices and media at once, in addition to many ways of storing information

Inode Data Structure

  • struct inode
  • How system support open/reading bytes from a file
  • inode is a generic concept and specific kernel data structure
  • Index node that references metadata about a file, allowing an OS kernel to contain metadata and reference the specific locations of it
  • Unifies file system to VFS
  • Couple of versions of inode
  • Generic (looking at here) which is an in-memory representation for general purpose inode information, or when you open a file an inode instance is created in kernel memory
  • ext4_inode: bytes on disk belonging to ext4 file system
  • Mostly fields, but Line 461: an array of single/double/triple indirect indexing blocks that can describe large or sparsely populated files
  • Lines 256-260: the size is EXT4_N_BLOCKS, recursively defined by symbols, which matches 12 directly mapped + 1 (Line 258) + 1 (Line 259) + 1 (Line 260) - single, double, triple indirect

open(2) System Call

  • open(2) system call
  • Line 1053: entry point of table for open(2) system call
  • Some familiar arguments (name of the system call, type of the first argument; the format is type then name)
  • Implementation of open(2) system call is short - almost nothing there
  • Line 1060: ret is most important - needs to delegate job to someone else
  • Entry point to system call that has a helper function (do_sys_open())
  • Line 1031: pass in familiar options
  • Line 1034: turning file name into a fd so that kernel doesn't have to deal with arbitrary strings
  • Resolve fd by opening it
  • Associate string temporarily to a small integer for the rest of the system calls
  • Line 1039: do_filp_open(): internal kernel service routine for loading metadata about the file, i.e. inode and referencing to it as file *f
  • Create mapping of struct file and fd; if it's successful, can call fd_install (Line 1045)
  • Line 1018: fd_install
  • Local variables that set it to current (currently executed process)
  • Line 1022: spin_lock protects data
  • Turns a linked list into something that can be treated as an array (Lines 1023-1024), then associates the task's file, created from the file name, with the fd (Line 1025)
  • Work of actually opening files - is the path legal, does it exist, does it need to be created, does it have permissions?

read(2) System Call

  • Dealing with specific fd, pointer to userspace buffer to read into and number of bytes to return
  • Line 374: the workhorse in this function is struct file, which is the data structure standing behind the fd
  • Line 378: call routine fget_light() to look up fd and return data structure associated to it (to file)
  • Line 379: if successful and no null pointer, do work of vfs_read()
  • Line 381: find the file offset - the semantics are to start reading at the current position for some number of bytes into the buffer, referring to a structure
  • Argument checking: can we access, are we allowed, etc.
  • Lines 289-302: the meat; if you know what file (keeps track of metadata) is not null, then do a ret =..., else do another statement
  • Line 291: file->f_op a member of that is a table of function pointers
  • Line 292: if function pointer not null, do read routine for file - represents delegation to specific file system of where it's stored (where data rests)
  • If file is on ext2 or ext4, that step is where vfs delegates read routine to go to that file system
  • Line 294: else do_sync_read a general purpose read
  • Line 252: leave VFS, and reach block I/O layer (does I/O on logical arrangement of blocks not yet at device driver)
  • Line 262: for(;;) accesses f_op to aio_read - kernel does asynchronous I/O which makes sense as calls to read/write are slow compared to CPU
  • Usually these are put to sleep because scheduler can't know when data might be delivered. Kernel is providing generic operation which eventually goes to a block layer that talks to memory - files are in primary memory/RAM for speed
  • Go back to linux/fs/ext2 and not much appears - see header file
  • A file_operations structure initializes the operations
  • Function pointers are assigned the addresses of kernel functions
  • The implementation of read is do_sync_read
  • VFS gives ability to invoke particular piece of functionality

Back To Navigation

November 22: Extended Attributes in Ext4

  • Feature of many modern OS
  • File systems are a data storage mechanism
  • Major data it stores is files
  • Other attributes of files kept for bookkeeping and other purposes (file permissions, etc.)
  • Opportunity to store important metadata about files and file system
  • Today's topic: ability to store metadata and what that metadata might be used for
  • Hint: Slide deck probably won't be too helpful or useful (in Michael's words)

Goal

  • Permission bits: chmod to modify them
  • Other metadata: extended attributes, access control lists (built on extended attributes)
  • Finer grain notion of who can modify/access rather than you (group) and everybody else

Access Control Lists

  • Traditional security mechanism
  • Security world: access control matrix, which is a table
  • Columns: resources in system - files, anything (way to identify them)
  • Rows: principals, the subjects involved, i.e. actors
  • Table describes who can do what to each resource
  • Interesting because it can be read in two ways. The act of reading it implies how the system is set up, because the matrix expresses how access to resources is restricted
  • Expresses a policy (rule that should be true about the system)
  • Subject's perspective vs object's/resource's perspective (row-wise vs column-wise)
  • Row-wise: capabilities, i.e. every user has a set of capabilities
  • E.g. Alice can rw to /etc/x, etc.
  • Column-wise: access control list, i.e. /var/r has access control list where Mike has n permissions, etc.
  • We take perspective of access control list - need to store the ACL with the resource, because upon opening/reading/writing it, you want to look at the permissions. In practice, take ACL approach to security
  • Is the approach better/worse than Linux permission bits?
  • Better as you have more refined permissions
  • Specific users and permissions vs groups and everyone
  • More general mechanism. Allows totality of permissions - can still create groups of collection of subjects
  • Need to keep the data somewhere, so the original designers split it into 'you', 'your group' and 'everyone else', but the table is general enough to capture all permissions

Example: Setting Access Control Lists

  • Locasto logs into ssh. Goes to /tmp and types password into a file
  • cat locasto.password.txt -> returns secret7
  • Cannot read it - UNIX permissions do not allow
  • chmod 644 locasto.password.txt
  • Retains own permissions and everyone gets read access
  • Want only a few people knowing the password: can let you into the 'profs' groups, but allows them to gain access to everything else
  • How to share files with teammates? Do not use Dropbox!
  • Make a new group, add them to the group, then properly set UNIX permissions
  • groupadd foo
  • Normal users can't create groups
  • UNIX permission bit is lame - doesn't let you do what you want to do, which is composing an arbitrary set of people to get a set of permissions
  • mkdir michael
chmod 400 locasto.password.txt
  • Want to share with group. Who are you?
  • Ex. kkajorin
  • getfacl locasto.password.txt
  • See standard UNIX permissions - how these things are implemented
  • By default, every file has a default set of access control lists, and have user/group/other
  • setfacl -m "u:kkajorin:r" locasto.password.txt
  • Added user when looking at getfacl, but unless you're in that directory, you cannot see the file
  • Permissions on directory is rwx
  • To allow a user to get into the directory, they need execute access on it
  • chmod 701 michael
  • Semantics of r/w/x on directories
  • Directories are files and lists of other files; changing into it executes it, listing just reads it
  • User can read file. Cannot list anything else in the directory though
  • setfacl -m "u:kkajorin:rw" locasto.password.txt


  • How is the kernel doing it? Where is this data stored?
  • File system, as data can't just appear out of nowhere in memory; it is stored alongside files
  • Extended attributes, or ability to associate arbitrary key-value pairs with every file
  • getfattr locasto.password.txt examines this stuff
  • Nothing happened. Need to dump information
  • getfattr -d locasto.password.txt
  • Results are still disappointing. What about -m pattern? The default pattern is user, but everything we want is matched with "-"
  • getfattr -e "hex" -m "-" -d locasto.password.txt
  • Encoded to see how they map to r/w/x for arbitrary groups and users
  • Extended attributes are useful for own purposes
  • How to associate a particular arbitrary key-pair value?
  • setfattr -n "457" -v "is taking too long" locasto.password.txt
  • Permission denied
  • rm locasto.password.txt, echo "secret" > foo.txt
setfattr -n "457" -v "is taking too long" foo.txt
  • Operation not supported. Can go to 'see also' in manual
  • Extended attribute keys are defined within a set of namespaces (e.g. user, security, system, trusted) - different categories for different purposes (ACLs, etc.)
  • What's wrong with asking OS? Not giving valid key namespace
  • setfattr -n "security.457" -v "..." locasto.password.txt
  • Operation not permitted
  • setfattr -n "user.457" -v "..." locasto.password.txt
  • Valid
  • How to retrieve attributes of this file?
  • getfacl only gets access control
  • getfattr -d foo.txt
  • Shows everything
  • Gets you associated arbitrary metadata with any file you want
  • Mark up permissions or key words or indexing words
  • Track field types or default permissions of opening file
  • Powerful, customizable mechanism
  • getfattr -e "Text" -m "-" -d foo.txt
  • Displays everything
  • Can have arbitrary values and name-value pairs that have to do with other systems
  • Metadata is hiding around in these file systems - the point is, the file system must support this. Not only must it implement the getxattr/setxattr system calls, it must also store this metadata
  • Wanted to add attribute to file
  • setfattr -n "user.something" -v "457" hello.txt and now the "user.something="457" appears
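The setfattr/getfattr demo above can be reproduced from Python, which exposes the same Linux xattr system calls as os.setxattr/os.getxattr. A minimal sketch, assuming a file system that supports user extended attributes; the temp file is a hypothetical stand-in for the lecture's foo.txt:

```python
import os
import tempfile

# Scratch file standing in for the lecture's foo.txt (hypothetical name).
fd, path = tempfile.mkstemp(dir=".")
os.close(fd)

# Keys must live in a valid namespace; "user." is the one normal users may write.
os.setxattr(path, "user.457", b"is taking too long")

# Read the value back and list all attribute names, like getfattr -d would.
print(os.getxattr(path, "user.457"))   # b'is taking too long'
print(os.listxattr(path))              # ['user.457']

# A key with no namespace is rejected, matching the
# "Operation not supported" error seen in the lecture.
try:
    os.setxattr(path, "457", b"no namespace")
except OSError as e:
    print("rejected:", e)

os.remove(path)
```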


  • cp hello.txt /media/exthw3
  • Everyone can read "hello.txt"
  • getfattr -m "-" -d hello.txt
  • Where is the label? Extended attributes are not automatically copied over; the metadata is not in the file, it's part of the file system
  • Not preserved, just recreated
  • Similar difficulty when copying files between Windows, Linux, and Mac - trying to write out a file system to CD
  • When transferring data from one file system to another; not just one place to another, it's one format to another
  • May not be preserving attributes, and that will bite you later. E.g. if you transfer files and directories from a case-sensitive file system to one that isn't
  • Bottom line: useful sharing mechanism to temporarily give access to files for some people

Question & Answer

Q:

If you copy to the file system, would it be different?

A:

cp hello.txt /home/eye/h.txt
getfattr -m "-" -d h.txt
No, they can be preserved with cp --preserve=xattr (or cp -a); plain -p only keeps mode, ownership, and timestamps. It's not part of the default since cp has no idea what you're asking it to do. It doesn't know ahead of time that you're copying to the next directory over
Anything that isn't a basic attribute of a file won't be copied. Just as file systems differ over case sensitivity or file-name length, preserving this extra metadata isn't guaranteed across them

Q:

Where are timestamps kept?

A:

Three timestamps: Access, Modify, Change. Where is info contained?
Check the LXR. The inode contains struct timespec fields for these timestamps (recall last lecture)
stat accesses basic metadata of a file - name, timestamps, user, owner, group, how many links, inode number, etc. which is associated to the inode of the file
struct inode and ext4 inode are different - each file system has a specific inode struct that contains metadata of a file system
Fields are replicated in other children of inode - literally written on disc in inode

Q:

Will copying something to a new file system maintain metadata?

A:

Copying something creates a new file of the destination
Timestamps would not be copied over - new ones would be created. Depending on tools, you can preserve old information that overwrites what would have been there ('now')

Back To Navigation

Week 12

November 25: Example File Systems

  • Learning in HW3: seemingly simple problems, like writing one big file or many small files, are not as straightforward as they appear
  • The exercise may look deceptively simple, but people ask whether the performance they see is the performance they should be getting
  • Mounted three different file systems in the lectures folder (/home/eye/457/lectures/filesystems/[ext2]/[ext3]/[fat])
  • OS unifies operations in VFS
  • When you stretch system to its limits, how is it possible to write 1 GB of data? Or a million files?
  • How do we design a minimally viable file system?

Minimum Viable File System

  • Treat disk as an ideal virtual storage medium. File systems can mostly ignore specific hardware properties, but take some common properties into account: virtual model is the common design target
  • Assume you have a well-behaved storage system (rotational disk)
  • Care to provide some number of units of space, 0->M
  • Stick a file system to it - but what is its job?
  • To store files and retrieve chunks of the file content
  • What are we supporting?
  • Operations: open/create, read, write, seek, close (also: unlink but don't consider this yet)
  • Recall that a file names a collection of bytes (block of bytes)
  • API that should be supported: traditional UNIX file I/O syscalls
  • open("path")
  • Implies a way to translate path names to 'files' or 'inodes'
  • open()
  • Lookup file in directory, which may take long for large directories
  • seek(offset)
  • Implies a way to perform 'random access'; O(1) access to any byte offset in a file
  • Note that support for seek implies or requires 'seeking' within something: a virtual file space made up of a range of bytes from zero to max file size
  • seek()
  • This may take a long time for large files; we don't wish to visit every byte along the way
  • write("msg")
  • Lazy is efficient, but may lose data on a power loss; requires journaling
  • read()
  • Cache content in primary RAM
  • close()
  • What data structures?
  • A list of open files per process (belongs at VFS level)
  • For each open file, a current offset
  • A way to map files from where they are in the directory tree to where they might reside on a physical device
  • Key mechanism: map(byte offset) -> logical block number
  • Example:
byte offset = 1253
block_size = 1024
lbn = byte offset / block_size
lbn = 1253 / 1024
lbn = 1
  • This means that the byte offset 1253 is in logical block 1 of the file
  • In ext2, finding this logical block is a matter of following the second direct-mapped entry in the multi-level index array
  • This logical block could be anywhere on the disk
  • A file system really keeps track of which blocks belong to which files
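The map(byte offset) -> logical block number mechanism above can be written as a tiny helper (a sketch; the 1 KB block size matches the worked example):

```python
def map_offset(byte_offset: int, block_size: int = 1024):
    """Map a byte offset within a file to (logical block number, offset in block)."""
    return byte_offset // block_size, byte_offset % block_size

# The example from the notes: byte 1253 with 1 KB blocks lives in
# logical block 1, at offset 229 within that block.
print(map_offset(1253))   # (1, 229)
```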

Minimum Viable System Diagram

4PCnVvL.png
Bigger image
  • One is on the disk
  • The File System: start at beginning of hardware, write whatever you wish, close the file and repeat
  • These files take up space up to some location k (not added to the diagram)
  • Add a new thing to keep track of where the next sequence of bytes are
  • Problem: used i->j for file Two, so where to stick partition?
  • Solution: stick it somewhere else to p - may have different entries like file Two, a position to record and a pointer to somewhere else
  • How to know where file is? Maybe dedicate some portion of beginning of disk to this meta information
  • Need meta data for things like where the starting location of each existing file is
  • Not all files are contiguous; where are all the corresponding parts? File Two and Two' needs a way of indexing to find the next set of bytes of the same file
  • File is a way to name bytes (no need for it to be contiguous)
  • Treat files not as arbitrary-length runs of bytes, but constrain the contents of a file to fixed-size units (blocks)
  • Important so that they are predictable (though they don't have to be)
  • Pick some KB; file system can manage files in terms of sets of blocks rather than bytes
  • If file Two needs a byte, can take n bytes to store an index for next location of file Two'
  • Or take the entire block, copy to another location then add another byte
  • Adding a byte to a file should not require copying, mangling, or adding metadata to it. If you're seeking to a specific position in a file, you don't want to account for metadata scattered and intermingled with the actual data
  • Talked about indexing allocation, FAT allocation which is the foundation of what file systems are
  • How to get data on a disk? File systems do it efficiently; but it also looks up files and tries to find a particular byte in a file
  • How it does this job is what file system it is
(1) Need to know where start of file is
(2) Need to know where subsequent bytes of file is located
  • Design requirements: Delete, expand existing file

File Allocation Table (FAT)

  • Need to solve problem of finding the first byte or position of a file/block of a file, and an offset within it - need ability to do random access
  • Table at beginning of disk that describes this
  • Table allows you to take a particular file and the file system translates the file to a logical position on the disk

FAT Table Diagram

SsI5mxN.png
Bigger image
  • File One starts at position one
  • Have Next "block" and end of file (EOF)
  • Notice that you can map pretty much any file to the start of a position/offset in table/array and follow a set of pointers to find the next location
  • Easy, flexible system
  • Need to store the table in memory, but memory is cheap
  • FAT 12, 16, 32 versions
  • FAT32 uses 32 bits for an address -- what's the most you can address? 4 GB of data
  • This limits file size. 16 is even worse
  • Table gives nice ability to know where files are (required)
  • Choice of size impacts max file size to have, and number of files you can have
  • Why impact number of files? With no space in memory (don't want to go to disk), there's no flexibility
  • The size of the table creates a hard limit K, since the table must fit in memory (RAM)
  • Issues: the table is complete - every entry exists (it may hold EOF, a next-block number, or be unused)
  • Must store the whole thing in memory
  • Solves the problem of finding the start of a file and allows you to move gracefully from one portion of a file to the next; but imposes hard limits on file size and number of files
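The pointer-following described above can be sketched in a few lines (the table contents here are invented for illustration; a real FAT is an array with one entry per disk block, kept in RAM):

```python
# Toy FAT: maps a block number to the next block of the same file,
# with None standing in for the EOF marker. Layout here is invented.
FAT = {
    1: 4, 4: 7, 7: None,   # file One occupies blocks 1 -> 4 -> 7
    2: 3, 3: None,         # file Two occupies blocks 2 -> 3
}

def chain(start_block):
    """Follow FAT pointers from a file's start block to end-of-file."""
    blocks = []
    b = start_block
    while b is not None:
        blocks.append(b)
        b = FAT[b]
    return blocks

print(chain(1))   # [1, 4, 7]
```

Note that random access is linear in the chain length: reaching byte offset n means following offset // block_size pointers, which is one reason the whole table is kept in memory.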

Ext2/Ext3/Ext4

  • Distinction of 2 and 3: 3 has journalling, 4 is a revision of 3
  • Ext2 and its successors have long been the default file system family in Linux
  • How is it different than FAT or the minimum viable file system? What design choices were made?
  • Have to pick sizes when it comes down to it - number of files still matter
  • In ext2, a file is a semi-contiguous series of blocks, with descriptive metadata contained in an inode (index node)
  • Allocation strategy is to give 8 blocks to new files in anticipation of future growth/writes
  • Block size can vary, but is usually 1 KB, 4 KB or 8 KB

Ext2/Ext3/Ext4 Diagram

vvhfjqx.png
Bigger image
  • An ideal storage medium which also splits the disk (typically does it when partitioning disk)
  • swap area helps virtual memory, rest of it is about storage
  • Split up disk into Block Groups
  • Any particular block can belong anywhere on the disk
  • Inside a Block Group, the format of ext2:
  • Physically on the storage unit
  • Not infinite in size
  • In a Block Group:
  • The start is a Superblock
  • Main/primary superblock copy; other copies are scattered over disk
  • Describes properties of a file system stored on the particular location of the disk
  • File system meta data, version, etc.
  • Number of inodes and disk blocks
  • Start of free blocks
  • Group descriptors
  • Has metadata about other parts of the block group
  • Number of directories
  • Pointer to block bitmap, and inode bitmap
  • Bitmaps
  • Indexing structures purely for performance - instead of linearly scanning through inodes, you can use these
  • Bitmap block: 1 block in length (e.g. 4 KB), a bitmap of free blocks
  • Bitmap free inode: 1 block in length (e.g. 4 KB), a bitmap of free inodes
  • Inode table
  • A few blocks are dedicated to list of inodes
  • Set of data structures that allow you to find files - one inode for every file
  • Note these are set at file system creation time, thus limiting the default number of files
  • Data blocks
  • All the data files
  • Inodes contain an array that serves to index into the Data Blocks portion of the disk
  • Index Array
  • 0-11: direct map/contain block number
  • 12: single indirect index (points to another block containing pointers to data blocks)
  • 13: double indirect
  • 14: triple indirect
  • These point to specific disk blocks, but the data is not 'in' the inode itself; the inode contains metadata, not data
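The reach of each entry in the index array can be worked out numerically. This sketch assumes 4 KB blocks and 4-byte block pointers (so 1024 pointers fit in one indirect block), which are typical but not the only possible ext2 parameters:

```python
BLOCK_SIZE = 4096
PTRS = BLOCK_SIZE // 4   # 4-byte pointers per indirect block -> 1024
DIRECT = 12              # entries 0-11 of the inode's index array

def index_level(lbn):
    """Which part of the ext2 index array covers logical block lbn?"""
    if lbn < DIRECT:
        return "direct"
    lbn -= DIRECT
    if lbn < PTRS:
        return "single indirect"
    lbn -= PTRS
    if lbn < PTRS ** 2:
        return "double indirect"
    lbn -= PTRS ** 2
    if lbn < PTRS ** 3:
        return "triple indirect"
    raise ValueError("beyond maximum file size")

print(index_level(0))      # direct
print(index_level(12))     # single indirect
print(index_level(2000))   # double indirect
```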


  • When creating the file system, have to dedicate a certain number of blocks to give space for the inode
  • Max number of inodes, max number of files
  • Real limitation: once you create a file system, the number of files you can store is fixed
  • Implication: decide at creation time how many inodes you want on the disk
  • Ext2/3/4 does this well: given an inode, it recovers the location of the rest of file
  • In an inode structure there is an array of indexes (a bunch of pointers to actual disk blocks), so the file doesn't need to be contiguous on disk
  • Each inode holds pointers into the logical positions on the disk, which allows the file system to traverse array quickly to locate file data
  • If the file data is larger than some k size, can look up in single or double indirect indexes
  • Ext2 gets around some restrictions
  • Can address large files and find them efficiently - finding a particular block is not a linear scan
  • It requires looking up the inode, looking into the array and when given the offset, indexing into the array (multiple pointers may be referenced)
  • Not unlimited: things like ext2 are limited in size of files they can manipulate
  • Metadata sits at the beginning of each block group - it's not all data
  • Having it means you are dedicating a portion of disk space to metadata, so don't have all of it free
  • Although can efficiently locate inodes and data blocks, still have hard limits on number of inodes and files
  • Shortcomings: all files, at minimum, have a block size. Even if you have a small file, it will still take up a block
  • Small price to pay as you're still able to select in fixed size blocks (despite small files taking up an entire block when not needed)


  • Example: opening a file, then writing 10 bytes with a 512-byte block size
  • How many blocks to allocate? At the moment only need 1
  • But consider behaviour of systems - what do they tend to do? Expand! So the file system will grab a free inode, fill its data structure with the location of free block then write some data to it
  • Files want to expand. Even though most files are small, you will start using up entries - not a better performance by randomly selecting a free block
  • When you create files, the ext file system preallocates 8 blocks for you (8 blocks × 512 bytes = 4 KB), which makes sense because most files are relatively small
  • Good strategy, as most underlying hardware is more efficient at reading/writing contiguous blocks - a trade-off between 'wasting up to 7 unused blocks' and 'expanding files cheaply'
  • Managing piles of blocks: when files are allocated, 8 blocks are written - if grown above 8 blocks what happens?
  • There is a pool of data blocks: it can be highly fragmented, particularly on systems that have been running for years
  • Is fragmentation a problem for ext2/3/4?
  • If you read to the end of what's allocated and then need more, you need to do more inode indexing; reading sequentially from start to end within the allocation is fine
  • It doesn't matter much for a single read/write, but when you move to the next chunk of the file (the next block), is there a performance hit in locating it?
  • Issues: underlying file storage and performance of file system and locating the next one
  • Logically speaking, fragmentation is not an issue for indexing as at most you're doing 3 pointer references - physical address to logical block is not an issue
  • Can have highly fragmented blocks. Not concerned about doing a write, and running out of space. Just go back to the free block list, fill it in, and so forth - suboptimal but it works
  • Problem: with a full file system with many files, or when many blocks are in use, a lookup may have to chase up to 3 levels of indirect indexes, which can be expensive
  • Inode table has pointers, like a linked list (table -> block1 -> block2 -> block3)

Ext2 vs Ext3

  • Ext2 tries to keep files and directories relatively close, but it's a fairly simple and elegant file system that allows you to find data quickly and have large files
  • Gives us the basic. If computers are perfect, can walk away
  • Ext3: sometimes computers crash and when they're crashing, CPUs are doing something and some of the code executed is writing data to a file (pending writes)
  • A program is trying to write to some fd, but what if the machine dies? Writes are half-complete; queued writes (that user programs believe have happened) are not flushed
  • Ideally, power cycling the machine would pick up from where it failed - machines like IBM mainframes checkpoint everything, but mostly we just hope for this
  • Hope to start cleanly from the crashed state
  • The problem is not that the data at a particular block/file is gone or corrupted - that's a user problem they can deal with (with appropriate integrity checking or checksums). The problem is, for example, a de-allocated inode still in use
  • Kernel data structures that help manage data I/O might be inconsistent or corrupted - not just the user data
  • Operations are non-atomic, so file system metadata can be left corrupted. With multiple CPUs there are lots of issues - multiple transactions in flight. If the metadata is messed up, the file system is messed up, programs may not work, and thus the machine may not boot
  • Answer: take a database class and learn what they do to ensure operations are successful
  • Keep a 'journal'

Ext3: Journalling File System

  • Ext3 adds this
  • Gives you different options, boiling down to keeping track of log of operations that the system is about to do
  • One option - journal: all file system data and metadata operations are logged
  • Before the system does anything, it writes the transaction into a fixed-size buffer; logging both data and metadata impacts performance, since you capture everything about the data before doing anything
  • Another option - ordered: metadata operations are logged, but real data is written first. Reduces chances of file content corruption (ext3 default)
  • File system looks at state of journal and sees if they're committed, pending, or about to happen
  • Can recover, commit ones that are appropriate and stop and point of corruption
  • Last option - writeback: only metadata is logged, which is fast (e.g. XFS)
  • Keep track of metadata only: write the data to disk, then record what you did. On recovery you replay the journal up to the last successful point; a data write that never completed is simply lost


  • Journal is a small buffer of file system operations/transactions
  • The journal is a small set of disk blocks (e.g. 64K) ordered as a circular buffer of recent disk write operations
  • Journalling: record of file system meta data
  • Goal: avoid constantly running consistency checks on the entire FS. The point is to avoid a minutes-long consistency check when the system comes back up
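The log -> commit -> replay idea above can be illustrated with a toy in-memory journal. This is purely a sketch: a real journal is a fixed run of disk blocks and logs block images, not operation strings.

```python
from collections import deque

class Journal:
    """Toy circular journal of metadata operations."""
    def __init__(self, capacity=64):
        # Bounded like the real thing: oldest entries fall off the front.
        self.entries = deque(maxlen=capacity)

    def log(self, op):
        self.entries.append({"op": op, "committed": False})

    def commit(self, op):
        for e in self.entries:
            if e["op"] == op:
                e["committed"] = True

    def replay(self):
        # After a crash, redo only fully committed operations; pending
        # ones are discarded rather than applied half-way.
        return [e["op"] for e in self.entries if e["committed"]]

j = Journal()
j.log("allocate inode 42")
j.log("add directory entry hello.txt -> 42")
j.commit("allocate inode 42")
# Crash here: only the committed operation survives recovery.
print(j.replay())   # ['allocate inode 42']
```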

Writeback: XFS

  • Design goals for this file system (from Silicon Graphics) centered on supporting intense I/O performance demands, large (media) files, and file systems with many files and many large files
  • Terabytes of disk space (so many files and directories)
  • Huge files
  • Hundreds of MB/s of I/O bandwidth
  • XFS was a clean-slate file system design effort that supports 64-bit file address space
  • Internally manages groups of disk blocks and inodes in Allocation Groups (AG)
  • Also manages groups of blocks called extents rather than individual blocks
  • Part of what makes it efficient is the use of B+ trees for many of its internal data structures rather than the bitmap approach of ext2
  • Example: it uses two B+ trees per AG for keeping track of groups of free blocks (or extents)
  • Entries in the tree are (extent start, length) pairs; one B+ tree is indexed by the start block of free extents and the other is indexed by the length of free extents
  • This pair of data structures makes it quick to find an extent of the right 'size' or an extent at the right 'position' (or to intersect the two sets of possibilities)
  • XFS can easily scale the number of files it supports dynamically, because pools of inodes are created on demand rather than set at file system creation time
  • XFS supports 'large' directories (i.e. directories with many files in them)
  • Many previous file systems performed a linear search through file names in a directory. Some previous approaches used a hashing approach to speed up this lookup
  • XFS also provided support for journalling metadata to improve crash recovery times
  • XFS improves I/O performance by using multiple read-ahead buffers (i.e. if byte b in a file has been read, it is likely that bytes b+1...b+k will be read as well)
  • It bypasses the system buffer caches, and it is multi-threaded to allow many simultaneous readers

Back To Navigation

November 27: Devices and Device I/O

Overview

  • Don't interact directly with system chip
  • Experience computer with keyboards, controllers, media types, etc.
  • A qualitatively different way of interacting with computers - a major role of the OS is to make devices useful
  • The OS's ability to deal with a vast array of devices is an interesting task

Main Design Challenge

  • OS has to offer a consistent way to access such devices and control them, but the wide variety of internal organizations and physical properties of these devices offer a challenging target
  • Device management exemplifies all roles of an OS: multiplexing resources; protection and separation of resources; offering a consistent API to user applications; servicing asynchronous events, and managing concurrent access
  • Has to close the gap between what the user level expects and what the low-level hardware exports
  • Bridging gap between low level physical reality (devices) and user level programs
  • OS has to offer consistent and sufficiently powerful interface to user level applications. On the other hand, it has to deal with a wide variety of hardware

Topics

  • How the OS perceives/exposes devices to users
  • Character, block devices
  • Major, minor numbers
  • mknod(1), mknod(2)
  • mkdir, mkfifo
  • Special devices (pipes, FIFOs, etc.)
  • Pseudo devices (/dev/null, /dev/zero)
  • Garbage can: a sink where you don't care what goes into it
  • Operationally, a sink for output you do not care to keep
  • /dev/null: when written to, takes it and throws it away
  • echo "bye" > /dev/null
  • Program writes to stdout and redirects to file
  • 4 = open("/dev/null") then write(4, "bye", 3) probably happens
  • Can observe: looks like a file, operated on like a file, but does not behave like a file
  • Pseudo devices can do whatever you want - LKM
  • udev; dynamic device registration in /dev
  • I/O transfer / BUS
  • Memory mapped I/O
  • DMA
  • Interrupts vs polling
  • man 3 makedev
  • Devices are nothing more than files
  • Exposes to users to manipulate device through standard system calls
  • Powerful design

The Device Landscape

Device Controller Diagram

idMt4Ix.png
Bigger image
  • A part of the kernel is code dedicated to talking to a particular device (device driver)
  • Typically implemented as LKM
  • Nothing more than a pile of code
  • Purpose: to hold conversations with an actual device (anything you can think of)
  • Device makes physical reality happen - printing device, monitor, etc. - has some physical effect in the real world
  • How does the OS know how to manage that particular device?
  • No native code in kernel to talk to device
  • Public half of the device - the kernel needs help managing the discussion or communication, so devices expose hardware registers/memory buffers on their public side (the device controller)
  • Device manufacturers hide their secrets and export minimal control interface in data storage (device)
  • Device controller accepts commands and data from device driver and does whatever
  • Ex. Hard disk: commands/requests passed from kernel to device driver (buffer layer) eventually asks for block number
  • Not actual location on physical disk, but the request passes through device driver, translated to some request (at particular cylinder/block, given some bytes) and device controller can ignore/rewrite request
  • The internal organization of the device is not necessarily known, nor does it have to match the public API
  • Have kernel driven by process requests, translated through file I/O layer of direction, and goes through code that knows how to talk to specific device; then go through some hardware communication and hopefully get some data back
  • Interesting details, how is communication achieved?
  • CPU's role:
  • Executing kernel code to execute statements to device driver
  • Interrupts
  • Occur when a device finishes work, e.g. reading some number of blocks; devices are slow, so while the request is serviced by the controller, the initiating process is taken off the CPU and something else executes
  • If device controller gathers bytes of communication, device might signal CPU and say 'hey I got this data you asked for'
  • What happens when an interrupt comes from the device to the CPU? Algorithm: check, fetch, execute...
  • How does transfer of control happen? CPU is doing some arbitrary calculation. Asynchronously an interrupt happens, and it looks up in IDT
  • Given a particular device, an interrupt vector given, look up in IDT what happens next
  • Does the CPU job, then goes to receive routine and goes to control registers/state of buffer (fixed number of bytes that become available)
  • Device driver is responsible for reading this data out into some buffer in its own memory space. Acknowledges interrupts, gets data, and leaves data in buffer to call schedule or returns from interrupt handler to pick up
  • Receiving a network packet doesn't mean the process immediately gets the data - the handler runs, stores it in memory, and resumes; eventually the program doing I/O is rescheduled, and the data is copied from kernel space into its user space (control flow)
  • Asynchronous, event-driven, unpredictable. Dance between hardware device, CPU and OS

Device Files

  • /dev contains lots of files - representatives created for devices
  • Every device installed on the machine has a handle in this directory
  • OS can export devices as named file systems
  • Can mount a device (mount /dev/sdb /media/sspace) at any particular point in the file system
  • No need to worry about device; can take context of device and attach anywhere in directory tree - don't care about underlying hardware

Device I/O API

  • Devices usually treated in the same way as file: have a name that appears in the file system and standard system calls like read(2) and write(2) can send and receive data to and from device
  • A thing is not what it is named, but what is done to it
  • Recall type of a program, instance of it is not that type; it's what happens in that sequence of bytes
  • Devices typically viewed, accessed, and controlled as either:
  • Block device - data is accessed a block at a time; random access
  • Character device - a stream of characters (or data values)
  • Generally, OS can view device as either of these: two different semantics
  • Block device: I/O treated as transactions of random access at any particular block (fixed sized collection of bytes)
  • Hard disks (get a whole whop of bytes)
  • Random access, get block number
  • Character device: semantics of sequence of byte values
  • One value, one byte
  • Mouse, many devices in here
  • The categorization is useful in an abstract sense, but may not match reality
  • Rotational media - a stream of bits is read serially, and the controller collects them into a block of data that it exposes to the device driver afterwards
  • Devices don't intrinsically need to be either - it's just a convenience of how the controller exposes the device

Major/Minor Numbers

  • User or OS creates files that represent each device
  • For kernel to know which driver should aid communication with the device, the kernel needs to map drivers to device files
  • Does this by using major/minor numbers, which define a name space of device types and subtypes
  • These are listed somewhere, and even though there are system calls like mknod for creating device files, you have to give the major/minor number
  • The numbers help the OS decide which driver to use when a write system call is invoked
  • 4th column of cat /proc/self/maps is major/minor numbers
  • Files are resident on major 08, minor 3 (hard disk)
  • This is how the OS knows which device driver - which set of routines - talks to which device, even though it is all exposed through the file interface, based on file names and locations
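The major/minor pair of a device file can be read out of its stat metadata; on Linux, /dev/null is conventionally major 1, minor 3.

```python
import os
import stat

st = os.stat("/dev/null")

# st_rdev encodes the device number; the major number selects the driver,
# the minor number selects the particular device instance.
print(os.major(st.st_rdev), os.minor(st.st_rdev))   # 1 3

# The file type bits mark it as a character device.
print(stat.S_ISCHR(st.st_mode))   # True
```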

Example Driver

  • Things get grungy when you drop out of the bottom of the kernel - drivers sit halfway in the kernel, because much of their code is not like kernel code: it must know about the hardware
  • drivers/net/3c501.c
  • Comments: the reality is that hardware is complex and buggy, and the manufacturing process is a tall order
  • Producing these physical devices while making sure other pieces of software can drive them is a hard job
  • People who can write device drivers are strange. Must know what's going on at the hardware level and how to talk to kernel


  • 683: when device receives data from network
  • Observation: low level I/O, hiding behind things like inw and outw (wrappers around assembly instructions that ask the CPU, or another controller, to communicate with the device over a sequence of bytes)
  • A lot of low level interface is documented in device drivers - never a straightforward thing. Must be aware of idiosyncrasies of actual device
  • 508: tulip
  • Interrupt routine - servicing interrupt is about what system call
  • ioread32: next to hardware
  • Acknowledge interrupt, literally asking privileged instructions and has to be in kernel mode - sending magic values to some control device
  • Unlike other kernel code


  • One model: port-mapped I/O, with explicit in/out instructions
  • Another: map device buffers into physical portions of memory, so retrieving data uses standard assembly instructions to move data to the CPU
  • DMA: memory-to-memory communication - what trade-offs to take?
  • Chip complexity, speed, simplicity of architecture
  • How communication happens: medium between devices and CPU, and requests goes across medium

Question & Answer

Q:

Does IDT have unique entries for each device? It doesn't seem plausible because every time it adds a new device it would be recompiling the kernel.

A:

The IDT is a table, and entries can be written dynamically. There are some default entries like 0x80, and some reserved ones that handle software exceptions - where an entry is empty, you can write any handler

Q:

Don't we handle address translation in the CPU?

A:

That's memory. Block numbers for block devices are a different kind of address
One way to view a file is as blocks, but blocks also define an address space of bytes. If you want to find an offset b, divide by the block size and get a block number
It has nothing to do with memory translation. The CPU hands over a block number but doesn't know where the block physically is: devices are not perfect and blocks may be bad, so the device keeps track of this itself and doesn't expose it to the CPU

Q:

How is the IDT created?

A:

IDT is an x86 specific thing. Been around since the beginning - it's a traditional data structure. It's a way to handle interrupts. Digital devices have to handle them. How it gets done, job of system software

Back To Navigation

November 29: Disk and I/O Scheduling

Overview of Mass Storage Structure

  • Magnetic disks provide bulk of secondary storage of modern computers
  • Drives rotate at 60-200 times per second
  • Transfer rate is rate at which data flow between drive and computer
  • Positioning time (random-access time) is time to move disk arm to desired cylinder (seek time) and time for desired sector to rotate under the disk head (rotational latency)
  • Head crash results from disk head making contact with the disk surface - bad
  • Disks can be removable
  • Drive attached to computer via I/O bus
  • Host controller in computer uses bus to talk to disk controller built into drive or storage array
  • Magnetic tape
  • Early secondary-storage medium
  • Relatively permanent and holds large quantities of data
  • Access time slow
  • Random access ~1000 times slower than disk
  • Mainly used for backup, storage of infrequently-used data, transfer medium between systems
  • Kept in spool and wound or rewound past read-write head
  • Once data under head, transfer rates comparable to disk
  • 20-200 GB typical storage

Moving-head Disk Mechanism

  • Check diagram on slides
  • Two motions are needed to get to the right sector to read the content: the head must seek to the right cylinder c, and the platter must rotate until the right sector s passes under the head

Disk Structure

  • Disk drives addressed as large 1-D arrays of logical blocks, where the logical block is the smallest unit of transfer
  • The 1-D array of logical blocks is mapped into the sectors of the disk sequentially
  • Sector 0 is the first sector of the first track on the outermost cylinder
  • Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost

Disk Scheduling

  • OS responsible for using hardware efficiently. For disk drives, this means having a fast access time and disk bandwidth
  • Two components
  • Seek time: going backward/forward
  • Time for disk to move heads to cylinder containing the desired sector
  • Rotational latency: rotation
  • Additional time waiting for the disk to rotate the desired sector to disk head
  • Scheduling algorithms are about seek time - the head's backward/forward travel should be small so you get faster access to the disk
  • Minimize seek time, which is approximately proportional to seek distance
  • Disk bandwidth is the amount of information transferred over the period between requesting it and receiving it
  • Total number of bytes transferred, divided by total time between the first request for service and the completion of the last transfer

Example

  • Several algorithms exist to schedule the servicing of disk I/O requests
  • Request queue (range of arm) goes from 0 - 199
  • Head pointer: 53
  • Initial state of the seek arm
  • Queue: 98, 183, 37, 122, 14, 124, 65, 67


First Come First Serve (FCFS)

  • Start at beginning of queue and go to very next element in queue
  • Worst algorithm
  • Total head movement: 640 cylinders

Shortest Seek Time First (SSTF)

  • Selects the request with the minimum seek time from the current head position
  • Wherever the head is, goes to closest cylinder
  • SSTF scheduling is form of Shortest Job First (SJF) scheduling
  • Downsides of SJF scheduling - might starve some accesses
  • Queue is pre-defined in this example, but in reality it is dynamic
  • Never gives access to far cylinders (starvation)
  • Total head movement: 236 cylinders

SCAN/Elevator

  • Disk arm starts at one end of disk, moves towards the other end and servicing requests until it gets to the other end of the disk. At this point the head movement is reversed and servicing continues
  • Scans the entire area, going from 0-199 and comes back (like a sweep)
  • Downside: consider this as a dynamic system - access to the different cylinders is not uniform (no fair access to the disk)
  • Total head movement: 208 cylinders

Circular Scan (C-SCAN)

  • More fair, more uniform wait time than SCAN
  • Scans the entire disk
  • Head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, it immediately returns to the beginning of the disk, without servicing any requests on the return trip
  • Treats cylinders as circular list that wraps around from the last cylinder to the first

Circular Look (C-LOOK)

  • Version of C-SCAN
  • Arm only goes as far as the last request in each direction, then reverses direction immediately without first going all the way to the end of the disk
  • Doesn't go all the way left (0) or right (199), saves cylinder migration
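The head-movement totals quoted above (640 for FCFS, 236 for SSTF, 208 for the scan) can be reproduced with a short simulation. This is an illustrative sketch, not lecture code: the scan below sweeps toward lower cylinders first and reverses at the last request in each direction (LOOK behaviour), which is what yields the 208 figure.

```python
# Simulate disk-arm scheduling on the example queue (head at cylinder 53).

QUEUE = [98, 183, 37, 122, 14, 124, 65, 67]
HEAD = 53

def fcfs(queue, head):
    """Service requests in arrival order; sum the arm movement."""
    total = 0
    for req in queue:
        total += abs(req - head)
        head = req
    return total

def sstf(queue, head):
    """Always service the pending request closest to the current head."""
    pending, total = list(queue), 0
    while pending:
        nearest = min(pending, key=lambda r: abs(r - head))
        total += abs(nearest - head)
        head = nearest
        pending.remove(nearest)
    return total

def scan_look(queue, head):
    """Sweep toward lower cylinders first, reversing at the last request
    in each direction (LOOK behaviour)."""
    down = sorted(r for r in queue if r <= head)
    up = sorted(r for r in queue if r > head)
    total = 0
    if down:
        total += head - down[0]   # sweep down to the lowest request
        head = down[0]
    if up:
        total += up[-1] - head    # reverse and sweep up to the highest
    return total

print(fcfs(QUEUE, HEAD), sstf(QUEUE, HEAD), scan_look(QUEUE, HEAD))
# 640 236 208
```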

Selecting a Disk-Scheduling Algorithm

  • Depends on what you're doing - should switch between algorithms
  • SSTF is common and has natural appeal
  • SCAN and C-SCAN perform better for systems that place a heavy load on the disk
  • Performance depends on the number and types of request
  • Requests for disk service can be influenced by the file-allocation method
  • Disk-scheduling algorithm should be written as a separate module of the OS, allowing it to be replaced with a different algorithm if necessary
  • Either SSTF or LOOK is reasonable choice for default algorithm

Disk Management

  • Low-level formatting or physical formatting - dividing a disk into sectors that the disk controller can read and write
  • To use a disk to hold files, the OS still needs to record its own data structures on the disk
  • Partition the disk into one or more groups of cylinders
  • Logical formatting or 'making a file system'
  • Boot block initializes system
  • Bootstrap is stored in ROM
  • Bootstrap loader program
  • Methods such as sector sparing used to handle bad blocks
  • The numbers in the example are given, but you can't predict the behaviour of processes accessing the disk. Hundreds of programs may be running, each wanting to access the disk; you don't know how they will behave
  • Seek time is not the only thing that affects disk bandwidth
  • In addition, the disk is rotating, and rotation happens at the same time
  • Reading/writing the content itself is another factor that can add more time
  • The content of one file may be added to different sectors - its distribution across the disk matters

Question & Answer

Q:

What kind of assumptions can we make about distributions?

A:

It is hard to predict how they behave; consider that they are a dynamic part of the system. In some cases - say a good (brand new) disk whose contents are sequential - you can predict better.
If the disk is old and its content is spread everywhere, it looks different. You might still make a guess, but in general access will look random.

Q:

What does OS manage? Where does scheduling happen?

A:

Scheduling should be separate from the OS core - the driver does this. It should not be fixed, since the scheduling algorithm may need to change; the OS might observe behaviour and switch, and there might be multiple OSes.
The controller manages the way things are put on disk - the size of sectors and so on. How data is accessed depends on the processes running in the system, so layout is not a fixed thing: it's dynamic and depends on behaviour. Whatever is inside the disk, there is flexibility to decide at run time, depending on how the system is used, to get better seek time.
The controller considers how to store data on the disk; scheduling - the way data is sought - should be in the drivers.

Back To Navigation

Week 13

December 2: User-level Networking

  • Management of I/O, management of physical devices, timers, interrupts, ability to transfer data to processes, system call layer, etc.
  • Network is a group of machines/nodes that speak the same language
  • Network exists at their pleasure and only as a set of distributed state among their OS network stacks: the data structures and functions that make up the part of the kernel responsible for creating, delivering and receiving packets
  • Information flows between nodes because the nodes have the same delusion in the same distributed fashion
  • Buzzwords: TCP, sockets, MAC address, ports, IP address, HTTP, routers, switches, hubs, servers, latency, bandwidth, caches, OSI, ARQ, DNS, packets, ISP, NAT
  • Protocols are a way to organize information
  • Complex, lots of languages involved


  • OSI network stack
  • Heart: stacking of information
  • Bottom: physical layer, where signals of wires, etc. happen
  • Above: link layer
  • Above: network layer
  • Above: transport layer, then session layer
  • More layers on top embedding/nesting protocols (HTTP, DNS)
  • On top is how human perceives the messages
  • Layers mean a series of encodings happens to information (from a user, device, or program) as it is communicated across the network to another machine (like SSH, or going to a web browser)
  • A piece of information gets wrapped up as the payload of a sequence of packets
  • Each layer has a header (metadata) prepended to the beginning of the packet, and the result is a stream of physical 1s and 0s
  • Handling abstract model of communication
  • Lie! No such thing as a layer -- just a sequence of bits; we draw arbitrary boundaries in the bit stream and give them names
  • Sequence of 1s and 0s
  • Each layer accomplishes a job - maintains a certain type of maintainability and connectivity
  • Set of protocols
  • Link layer: direct link between two machines
  • Network layer gets a packet from one node to another
  • PHY: actual transmission
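The per-layer wrapping described above can be sketched with plain byte strings; the header contents below are invented for illustration and are not real protocol formats:

```python
# Toy illustration of layered encapsulation: each layer prepends its own
# header (metadata) to the payload handed down from the layer above.
# Header contents are made up; real protocols have fixed binary formats.

def wrap(payload: bytes, header: bytes) -> bytes:
    # Prefix the header with its length so the receiver can strip it.
    return bytes([len(header)]) + header + payload

def unwrap(packet: bytes) -> bytes:
    hlen = packet[0]
    return packet[1 + hlen:]

message = b"GET / HTTP/1.0"               # application data
segment = wrap(message, b"TCP:80")        # transport header
packet  = wrap(segment, b"IP:10.0.2.15")  # network header
frame   = wrap(packet,  b"ETH:aa:bb")     # link header

# The receiving side strips one header per layer, in reverse order.
assert unwrap(unwrap(unwrap(frame))) == message
```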

Example: User Level Utilities

  • Know where machine network is. Identifiers: IP and MAC addresses
  • IP - valid in network layer, MAC - valid in link layer
  • Can go to whatsmyipaddress.com
  • ifconfig: tool that enumerates/listing valid network interfaces on machine
  • Pieces of hardware? Some might be. This machine has a wireless interface
  • en0 doesn't exist as a physical device - it's a piece of software that eventually asks the Mac OS kernel to throw bits out
  • Lists out interfaces - virtual construct that represents a piece of network interface card, but don't have to and often don't
  • Ethernet interface: the hardware address (HWaddr) is a hexadecimal MAC address (a 48-bit number supposed to uniquely identify this piece of hardware)
  • OS stores this number - if things work the way they're supposed to, it's burned into the hardware and the OS reads it when it creates the data structure
  • Need hardware that cooperates - software stops at link layer
  • IP addresses are meant to uniquely identify a host, and they're 32 bits
  • How many hosts? 2^32 ≈ 4.3 billion, then we run out of addresses. NAT is useful because it makes many addresses look like one
  • inet addr: addresses beginning with 10 are reserved, non-routable addresses -- not IP addresses for communicating over the internet
  • Information helps identify machine
  • Bcast: 10.0.2.255 - write data to this address and it reaches the local network
  • Average number of machines between you and a destination - traceroute -n www.yahoo.com sends a particular type of packet, and all machines along the way see the packet and respond to it in a certain manner
  • 16 - internet is about 16-18 hops wide
  • Need cooperation of other machines to communicate with anyone else - if someone in the middle goes down, communications can go down
  • Usually another path is picked, but major routing (layer 3) disruptions occur when one machine goes down
  • No security: the internet is a big bubble of organizations that agreed to cooperate with each other, and nothing prevents you from being one of those organizations. They sign contracts to share a connection and forward some traffic, but no one node actually knows whether traffic should be coming from a peer. Real cases exist - Pakistan advertised Yahoo's IP space and all that traffic went there. Nothing prevents people from messing with the algorithm
  • If you provide wrong info, no global authority saying it's wrong; you're just setting information


  • ping www.yahoo.com: sends ICMP packet to exchange control information from other nodes
  • Asks OS kernel to formulate a packet with a certain TTL (time to live) and send it to whatever topologies from you and address
  • Number of things interposing on communication
  • ICMP is a different protocol than traceroute
  • Running on a VM on the CPSC network, which filters heavily and probably re-writes TTLs so as not to expose information
  • TTL is higher than it should be (63 from example)


  • Wireshark: capture at eth0
  • Nothing is happening at first; once you ping www.yahoo.com you can see traffic generated
  • Wireshark pulls apart each packet
  • Record of each packet sent out/coming in
  • Breakdown of all fields of packets
  • Raw information - truth of what's being delivered (sequence of bytes) and programs make use of this by OS breaking it down
  • Example of type of information OS contends with

Back To Navigation

December 4: Kernel Network Support

  • OS support a few major activities
  • Network: data source
  • Kernel mediated, reads/write to space
  • Networking support in OS
  • System call layer and kernel network filtering architecture
  • Communications endpoints, addressing
  • Reading from the network (packets)
  • Writing to the network (to some node either locally or across internet)
  • Kernel has a way to mediate how packets flow through the network - most interesting piece

Netfilter Architecture

  • Parts of OS kernel that deal with and manage how packets traverse kernel
  • Virtual machine: 10.0.15.2 has one defined network interface (eth0)
  • Traffic comes in and out over this interface
  • Other side of interface: network interface card receiving/sending packets to aid kernel
  • Kernel has code to manage mechanism of receiving packets and doing a few different things
  • Does the packet have a destination IP address of 10.0.15.2? If it does, deliver it to some process (Pn) that's running/expecting data
  • Packet comes in, kernel decides if it's for me, if it is then deliver to waiting application (web server, etc.)
  • If not for me, may be a gateway or forwarder - may be in front of a larger network in front of many other nodes
  • Traverse kernel, interfaces and put to a different interface
  • Spot 3: routing decision has been made, and may send information out to interface A vs B
  • To 5: outgoing processes
  • All code tied to one thing: every UNIX box can do all of this, even though you only routinely exercise one or two of the paths
  • Ability to intercept packets (like bad requests)
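The deliver-or-forward decision described above can be caricatured in a few lines; the address set and the dictionary "packet" format are invented for illustration:

```python
# Toy model of the kernel's first routing decision: is this packet
# addressed to me, or should it be forwarded out another interface?

MY_ADDRESSES = {"10.0.15.2"}  # addresses bound to this host (illustrative)

def route(packet: dict) -> str:
    if packet["dst"] in MY_ADDRESSES:
        return "deliver"   # hand the payload to a waiting local process
    return "forward"       # act as a gateway: send out another interface

print(route({"dst": "10.0.15.2"}))  # deliver
print(route({"dst": "8.8.8.8"}))    # forward
```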

Example: User Tools Communications

  • ifconfig
  • IP address: identifies the machine to other computers -- multiple addresses can be associated with a single interface
  • Identifies the machine - here a non-routable IP address
  • On real Internet, like: ssh.locasto@cl.cpsc.ucalgary.ca
  • What does this interface have? Real IP address -- machine on internet can contact this machine (136.159.5.22)
  • Public address; can send a packet to the destination IP address, and routers take the packet, shuffling eventually to csl machine
  • netstat
  • Shows forms of local sockets that behave like files
  • Heavily used folder - information tells you who this server is communicating with
  • Most columns deal with tcp protocol
  • Who am I talking to? Local address is the IP address (0.0.0.0 means any IP), and :902 is the port
  • Port: a destination within the machine - the port number is how the OS kernel decides which process should be given the data
  • LISTEN -- services being run. Can find what a port means in /etc/services
  • ESTABLISHED -- connected to port 22 (ssh) - foreign address is the source IP address of where that client came from
  • OS willing to maintain binding and state of these connections
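The role of ports as destinations within a machine can be demonstrated with a loopback sketch using Python's standard socket module (addresses and variable names here are illustrative):

```python
import socket

# A port identifies a destination *within* a machine: the kernel uses the
# (address, port) pair to decide which process receives incoming data.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
server.listen(1)
host, port = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((host, port))    # kernel routes this to our listening socket
conn, peer = server.accept()

client.sendall(b"hello")
data = b""
while len(data) < 5:            # delivered to us because of the port binding
    data += conn.recv(5 - len(data))

conn.close(); client.close(); server.close()
```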

Example: User Tools Reading

  • Wireshark is a GUI tool
  • Can just read data that's there
  • sudo tcpdump -i eth0
  • Try to resolve various DNS
  • CadmusCo source, packet #9
  • Range of bytes that have arrived to network card
  • Network card should be interested in this message
  • Who sent this message?
  • Layer 2 protocol - MAC addresses
  • tcpdump -i en0 -v -X -A
  • Give me everything (see what driver is controlling these pieces of hardware)
  • If you're not using cryptography, can see what you're typing
  • It is a reality that if you're using a wired interface, there is nothing inherently private about it
  • People like to pretend it's not a public activity

Example: User Tools Writing

  • netcat (nc(1)): allows you to make TCP connections
  • Connects to machine -- what happens first is lookup between name and IP address
  • VM does not know where IP address is - kernel looks up IP address of who to communicate with using DNS request
  • Once connection is established, you speak http to me
  • Wireshark reassembled the web protocol and display packets
  • Note:
  • Initiating a network connection and communicating over it involves a few protocols - implicitly, it often involves DNS requests and may involve the gateway
  • Making a connection and diagnosing network problems is rarely a one-step activity. You may want to ping, but you may not even have a valid IP address, or maybe DNS is down
  • User perspective - "can't surf the web", but the cause may be anywhere in the network
  • more nc.out: system calls related to this activity
  • Establishing connection, put a few bytes together, send over the wire
  • Much of set-up has to do with binding a socket locally, and bind a port
  • A number of sockets are used to satisfy the DNS request - it first has to establish a DNS connection with a known DNS server
  • Once data is received from the socket, you can see the socket treated the same as a file
  • Received a reply and can connect machine, then send bytes
  • Once get a reply, connection terminates
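What nc does by hand can be sketched in a few socket calls; the host name, path, and helper names below are illustrative, not from the lecture:

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    # A minimal HTTP/1.0 request: the same bytes you would type into nc.
    return (f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            f"\r\n").encode("ascii")

def http_get(host: str, port: int = 80) -> bytes:
    # Resolving `host` is the implicit DNS lookup; only then can the
    # kernel open the TCP connection and send the request bytes.
    with socket.create_connection((host, port)) as s:
        s.sendall(build_request(host))
        chunks = []
        while (chunk := s.recv(4096)):
            chunks.append(chunk)
        return b"".join(chunks)

# Example (requires network access):
#   reply = http_get("www.yahoo.com")
```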

Question & Answer

Q:

Are ports PIDs?

A:

No. Much like a process can have a number of open files, a port identifier works like that: one process can open many ports, and there is no need for a port number to equal the PID. Like the mapping of file descriptors, the kernel keeps a map of open connections

Q:

Can two processes belong to one port?

A:

No. A web server binds a port and accepts clients, each of which needs a socket. Any information that comes back has to go to a program - it needs a port so the OS can deliver the data

There are well-known port numbers for certain protocols, but you need not bother with them

Back To Navigation

December 6: Final Exam Review

  • Concept map for topics, structure in three main themes
  • What an OS is, what its responsibilities are, how it turns programs into processes (ELF, binary formats)
  • How system architecture supports OS jobs (protection, isolation) and system call API layer (how it operates)
  • When processes are created, how is concurrency created and managed by OS? What are fundamental blocks of concurrency? What kind of synchronization problems does concurrency introduce?
  • Scheduling algorithms to choose which process to run, ways to control access to resources, deadlock, etc.
  • Management of data: volatile memory, primary memory as a resource, how to allocate it and how to claim it, how to address/find parts of it, looked at volatile storage
  • Persistent storage, same issues allocating/deallocating names and structures
  • Page replacement algorithms, disk scheduling algorithms

Structure

  • 20% of your grade, and 1 sheet of notes (8.5 x 11, front and back, any font size)
  • Allowed a calculator -- some math questions
  • Multiple choice; the intention is that the late exam date means it needs to be processed quickly
  • Variety of problems, some recall/trivia-type questions, others involve work (take a concept and apply it, running scheduling algorithm to completion)
  • Encouraged to answer quickly and not get hung up over-analyzing questions - no trick questions


  • Two major portions
  • 22 T/F
  • Variety of topics. If you apply common sense, should get them right
  • Esoteric - intended to separate the A from the A+
  • Apply logic, get answer
  • 22 MC
  • File systems (reading BOTH textbooks should help)
  • 5 questions on this
  • Concurrency, know scheduling algorithms and how to simulate them
  • Know what turnaround time is
  • Know what problems might arise in deadlock, and how you might break that
  • Be familiar with page replacement algorithms
  • Understand what the internal organization of synchronization primitives is (all differ drastically in semantics and structure)
  • Understand how a synch primitive may/may not apply to a problem
  • Ex. here's synch 1, synch 2, select which one is most appropriate
  • Familiar enough with procedure/mechanism of memory addressing and translation
  • Understand separation of MMU/hardware's/CPU's role and OS role
  • Ex. here's an operation, who's doing it, the CPU or OS?
  • Know how to do memory page replacement
  • Disk scheduling - plenty of resources, descriptions in OS textbook - know them!
  • Fun question for the last question. Here's a modification of a scheduler, what does it do?
  • Idea: not only asking you knowledge of schedulers, but also reality: if you make this modification of a real OS scheduler, what are the implications?


  • 5 questions on file systems, 6/7 with T/F
  • Know it in depth since it's important to know how persistent data is managed. Concentrate on the ones we've actively talked about (ext2/3, XFS)
  • Textbook helps, but can be bogged down by details - intersection of textbook and lectures
  • No content from 1.1-10.1, but emphasize on system call interface
  • Ask about: main OS concepts - page replacement, file systems, scheduling algorithms, etc.; basic skills, the types of problems in a typical OS course


  • Contents of assignment will sort of be on the test, but related to stuff we talked about
  • Not going to ask system call trace, kernel code details, use mmap system call to ...
  • Flavored toward general OS systems, T/F has trivia, but should be able to answer from experience
  • One assembly instruction on the exam, and a snippet of C code with questions on its semantics - straightforward if you understand the machine's computational model
  • What is OS role of this piece of code?
  • No translation for hex, binary, sizes, etc.
  • Calculator usage: track how far a disk has moved, and add hard numbers is tough
  • 4-function calculator should suffice (TI)
  • Something short of a cellphone
  • Even distribution of topics - a good chunk of file systems; stuff that appears in the first half is stuff we didn't really ask about
  • Hardest thing: T/F may be intimidating at first ('wow what a random fact') - but apply logic to it and it should make sense
  • Hardest problem: probably last one since it's conceptual. Here's a change in the OS, tell me what it means
  • A few may be labor intensive (disk scheduling, track numbers)
  • All questions worth 5 points that add up to 200
  • Not cut and paste LXR stuff and its location, but should understand how certain constructs are expressed perhaps in kernel source code. Not necessarily have to remember all fields of task struct, but...:)


  • After class, HW2, HW3 rubrics will be posted and can get idea of mark
  • General advice for the rest of your degree: if you have skills, go get paid. More for CS students: you're going to forget all this stuff (unless you teach or go into research)
  • Should be able to refresh your memory easily - go to the right resources
  • Hope you leave the course with some ability to tell a good resource from a bad one - Stack Overflow is great, but so are good textbooks, reference manuals, etc.
  • Also: not particular knowledge, but not being afraid of digging into a system - you're used to being a user, and if it's not behaving, look at how to fix the problem - you should understand how something works
  • Designing interfaces: requires understanding of what the machine is doing - biologists study living organisms, we study a computer - if we don't know what's inside this thing, probably not a very good computer scientist

Back To Navigation