Courses/Computer Science/CPSC 457.F2013/Lecture Notes/ExampleFileSystems
Example File Systems
In this session, we consider several different designs for file systems.
File Systems
- Ext2, Ext3, Ext4
- FAT
- The XFS file system: http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html
- FUSE
- NFS
- Samba/SMB/CIFS
Lecture Notes
These are my written notes from lecture preparation.
Focus on: Ext2, Ext3, XFS, FAT
Learning outcomes should focus on how each of these file systems expresses different design decisions and supports the storage, naming, lookup, and indexing of file contents.
The main task: store many (N?) files of 0..Max length.
- max files?
- location of meta-data?
- all that space for a directory?
Design Questions
- How long should file names be? Unlimited?
- Should the file system support an unlimited number of files? A limit regardless of disk size?
- How are directory contents organized? Are files that are "close" in the directory tree "close" on disk?
Minimum Viable File System
Treat the disk as an idealized virtual storage medium. File systems can mostly ignore hardware-specific properties while still taking a few common ones into account; this virtual disk model is the common design target.
- need to name files: how many bytes?
- need to find content at file[pos] where 0 <= pos < M
- allocation strategy?
- block size? byte addressable?
- relationship of files to directories? close? on disk?
- is file location meta-data intertwined with file content?
What is a File?
A file names a collection of bytes (really: blocks of bytes).
What API should be Supported?
The traditional Unix file I/O syscalls.
- open("path") // implies a way to translate path names to "files" or "inodes"
- open() //lookup file in directory, which may take long for large directories
- seek(offset) // implies a way to perform "random access"; O(1) access to any byte offset in a file. Note that support for seek implies or requires "seeking" within something: a virtual file space made up of a range of bytes from zero to max file size.
- seek() //this may take a long time for large files; we don't wish to visit every byte along the way
- write("msg") //lazy is efficient, but may lose data on a power loss; requires journaling
- read() //cache content in primary RAM
- close()
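The calls above can be tried out directly. This is a minimal sketch using Python's os module, which wraps these syscalls thinly; the scratch file and its contents are made up for illustration.

```python
import os
import tempfile

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, file system")     # write: advances the file offset
    os.lseek(fd, 7, os.SEEK_SET)            # seek: jump to byte 7 without reading bytes 0..6
    data = os.read(fd, 4)                   # read: returns the 4 bytes starting at offset 7
    offset = os.lseek(fd, 0, os.SEEK_CUR)   # ask the kernel for the current offset
finally:
    os.close(fd)
    os.remove(path)

print(data, offset)
```

Note that the offset lives with the open file, not with the file's contents: the read moved it forward by 4 bytes.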
What data structures?
- A list of open files per process (this belongs at the VFS level)
- For each open file, a current offset
- a way to map files from where they are in the directory tree to where they might reside on a physical device
- key mechanism:
map(byte offset) -> logical block number
e.g.,
byte offset = 1253
block_size = 1024
lbn = 1253 / block_size
lbn = 1253 / 1024
lbn = 1
This means that the byte offset 1253 is in logical block 1 of the file. In ext2, finding this logical block is a matter of following the second direct-mapped entry in the multi-level index array. Of course, this logical block could be anywhere on disk.
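The mapping is just integer division; the remainder gives the position inside the block. A small sketch (the function name locate is made up for these notes):

```python
def locate(byte_offset, block_size=1024):
    """Return (logical block number, offset within that block)."""
    return divmod(byte_offset, block_size)

lbn, within = locate(1253)
print(lbn, within)  # 1 229
```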
A file system really keeps track of which blocks belong to which files.
Ext2
In ext2, a file is a semi-contiguous series of blocks, with descriptive metadata contained in an i-node (index node).
Allocation strategy is to give 8 blocks to new files in anticipation of future growth / writes.
Block size is chosen when the file system is created; common values are 1KB, 2KB, and 4KB (8KB is possible on some architectures)
On disk format is to create some number of Block Groups. Within each Block Group, there exists:
- Superblock (main / primary superblock copy; other copies are scattered over the disk)
- file system meta data, version, etc.
- number of inodes
- number of disk blocks
- start of free blocks
- Group Descriptor
- number of directories
- pointer to block bitmap
- pointer to inode bitmap
- Block Bitmap: 1 Block in length (e.g., 4KB) (bitmap of free blocks)
- inode Bitmap: 1 Block in length (e.g., 4KB) (bitmap of free inodes)
- List of Inodes; note that the number of inodes is set at file system creation time, thus limiting the number of files the file system can hold
- Data Blocks
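Because each bitmap is exactly one block long, it can describe at most 8 * block_size blocks, which bounds the size of a Block Group. The sketch below illustrates the idea of a free-block bitmap; the function names and the bit order within a byte are assumptions for illustration, not the ext2 code.

```python
BLOCK_SIZE = 4096                   # bytes per block, as in the 4KB example above
BLOCKS_PER_GROUP = BLOCK_SIZE * 8   # one bit per block in a one-block bitmap

def allocate(bitmap):
    """Set and return the index of the first free (0) bit, or None if full."""
    for byte_index, byte in enumerate(bitmap):
        if byte != 0xFF:                       # at least one free bit in this byte
            for bit in range(8):
                if not byte & (1 << bit):
                    bitmap[byte_index] |= 1 << bit
                    return byte_index * 8 + bit
    return None

def free(bitmap, block):
    """Clear the bit for the given block number."""
    bitmap[block // 8] &= ~(1 << (block % 8)) & 0xFF

bitmap = bytearray(BLOCK_SIZE)      # all zeros: every block is free
first = allocate(bitmap)
second = allocate(bitmap)
free(bitmap, first)
third = allocate(bitmap)            # reuses the block that was just freed
print(BLOCKS_PER_GROUP, first, second, third)
```

With 4KB blocks this gives 32768 blocks per group, i.e., 128MB of disk per Block Group.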
Note that inodes contain an array that serves to index into the Data Blocks portion of the disk.
Index Array:
- 0-11: direct map / contain block number
- 12: single indirect index (points to another block containing pointers to data blocks)
- 13: double indirect
- 14: triple indirect
Note that these point to specific disk blocks; they are not "in" the inode itself; the inode contains metadata, not data
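To see which slot of the index array covers a given logical block, we can count how many blocks each level can reach. This is a sketch under stated assumptions: 4KB blocks and 4-byte block pointers (so 1024 pointers per index block); the function name classify is made up for these notes.

```python
def classify(lbn, block_size=4096, ptr_size=4, n_direct=12):
    """Return (kind, index-array slot) for the entry that covers logical block lbn."""
    per_block = block_size // ptr_size   # pointers held by one index block
    if lbn < n_direct:
        return ("direct", lbn)
    lbn -= n_direct
    if lbn < per_block:
        return ("single indirect", n_direct)
    lbn -= per_block
    if lbn < per_block ** 2:
        return ("double indirect", n_direct + 1)
    lbn -= per_block ** 2
    if lbn < per_block ** 3:
        return ("triple indirect", n_direct + 2)
    raise ValueError("logical block is beyond what the index array can map")

print(classify(1))          # the lbn from the byte offset 1253 example
print(classify(12))
print(classify(12 + 1024))
```

Small files never leave the direct entries, so reaching their data needs no extra index block reads; only large files pay for the indirect levels.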
Ext3
Consider what happens when the system has pending writes and crashes. Some writes are half-complete, and queued writes (which user programs believe have happened) have not been flushed. Worse yet, the kernel data structures that help manage data I/O, not just the user data, might be inconsistent or corrupt.
The answer is to keep a "journal": a relatively small buffer of file system operations / transactions. The journal is a small set of disk blocks (e.g., 64K) ordered as a circular buffer of recent disk write operations.
- Journaling: a record of file system meta data.
- Goal: avoid constantly running consistency checks on entire FS. The point here is to avoid having a minutes-long procedure occur when the system comes back up.
We can manage this journaling activity in three ways:
- Journal: all file system data and metadata are logged in journal.
- Ordered: only metadata is logged, but the file data is written to disk before the corresponding metadata is committed to the journal. Reduces chances of file content corruption (ext3 default)
- Writeback: only metadata is logged. Fast. (e.g., XFS)
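The core idea common to all three modes is: record the intended change in the journal, mark it committed, and only trust committed records after a crash. The toy model below sketches that idea only; ToyJournal and its methods are invented for these notes, the "disk" is a dictionary, and the circular buffer and checkpointing are left out.

```python
class ToyJournal:
    def __init__(self):
        self.disk = {}      # block number -> contents (the "home" locations)
        self.journal = []   # transaction records, oldest first

    def log(self, writes):
        """Record a transaction in the journal; it is not yet committed."""
        txn = {"writes": dict(writes), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Mark the transaction as complete and safe to replay."""
        txn["committed"] = True

    def recover(self):
        """After a crash: replay committed transactions, discard the rest."""
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["writes"])
        self.journal.clear()

fs = ToyJournal()
t1 = fs.log({5: "inode bitmap v2", 9: "inode 12"})
fs.commit(t1)
t2 = fs.log({5: "inode bitmap v3"})   # the crash happens before this one commits
fs.recover()
print(fs.disk)
```

Recovery only has to scan the journal, not the whole disk, which is why it takes seconds rather than minutes.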
XFS
Design goals for this file system (from Silicon Graphics) centered on supporting intense I/O performance demands, large (media) files, and file systems with many files and many large files.
- terabytes of disk space (so many files and directories)
- huge files
- hundreds of MB/s of I/O bandwidth
XFS was a clean-slate file system design effort that supports a 64-bit file address space. Internally, it manages groups of disk blocks and inodes in what it calls Allocation Groups (AG). It also manages groups of blocks called extents rather than individual blocks. Part of what makes it efficient is the use of B+ trees for many of its internal data structures rather than the bitmap approach of ext2. For example, it uses two B+ trees per Allocation Group for keeping track of groups of free blocks (or extents); the entries in the tree are (start, length) pairs for extents, and one B+ tree is indexed by the start block of free extents and the other is indexed by the length of free extents. This pair of data structures makes it quick to find an extent of the "right" size or to find an extent by the right "position" (or intersect two sets of possibilities).
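The two-index idea can be sketched without real B+ trees. Below, two sorted Python lists stand in for the two trees (so lookups are logarithmic via bisect, but insertions and removals are linear, unlike a B+ tree); the class and method names are invented for these notes and are not XFS code.

```python
import bisect

class FreeExtents:
    def __init__(self, extents):
        self.by_start = sorted(extents)                      # (start, length) pairs
        self.by_length = sorted((l, s) for s, l in extents)  # (length, start) pairs

    def _remove(self, start, length):
        self.by_start.remove((start, length))
        self.by_length.remove((length, start))

    def _add(self, start, length):
        bisect.insort(self.by_start, (start, length))
        bisect.insort(self.by_length, (length, start))

    def alloc_by_size(self, want):
        """Best fit: take `want` blocks from the smallest extent that is large enough."""
        i = bisect.bisect_left(self.by_length, (want, 0))
        if i == len(self.by_length):
            return None
        length, start = self.by_length[i]
        self._remove(start, length)
        if length > want:
            self._add(start + want, length - want)   # return the unused tail to the pool
        return (start, want)

    def find_near(self, block):
        """Return the first free extent starting at or after `block`, or None."""
        i = bisect.bisect_left(self.by_start, (block, 0))
        return self.by_start[i] if i < len(self.by_start) else None

pool = FreeExtents([(0, 4), (10, 100), (200, 16)])
got = pool.alloc_by_size(8)    # the 16-block extent is the smallest that fits
near = pool.find_near(5)       # lookup by position uses the other index
print(got, near, pool.by_start)
```

Both indexes must be updated together on every allocation; that bookkeeping is the price of fast lookup by either size or position.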
XFS can easily scale the number of files it supports because pools of inodes are created on demand rather than fixed at file system creation time.
XFS supports "large" directories (i.e., directories with many many files in them); many previous file systems performed a linear search through file names in a directory. Some previous approaches used a hashing approach to speed up this lookup.
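The difference between the two older lookup strategies is easy to see in a few lines. This is an illustration only: the directory entries are example data (the name notes.md and the inode numbers are made up), and a Python dict stands in for an on-disk hash structure.

```python
# Each directory entry pairs a name with an inode number.
entries = [("foo.txt", 12), ("lost+found", 11), ("notes.md", 13)]

def lookup_linear(entries, name):
    """Compare against each entry in turn: cost grows with directory size."""
    for entry_name, inode in entries:
        if entry_name == name:
            return inode
    return None

index = dict(entries)   # hash table from name to inode number

def lookup_hashed(index, name):
    """Hash the name and jump to its entry: roughly constant cost."""
    return index.get(name)

print(lookup_linear(entries, "notes.md"), lookup_hashed(index, "notes.md"))
```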
XFS also provided support for journaling metadata to improve crash recovery times.
Finally, XFS improves I/O performance in three ways: it uses multiple read-ahead buffers (i.e., if byte b in a file has been read, it is likely that bytes b+1..b+k will be read, too), it bypasses the system buffer caches, and it is multi-threaded to allow many simultaneous readers.
Command Line Utilities
We can mount different devices (disks) to different places in the file directory tree, side by side. Each device can have a different file system written on it, and the OS, via the VFS, will make standard operations like read and write work seamlessly. We can check what file systems (and devices) we have mounted via the mount command:
(eye@mordor filesystems)$ mount
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/sdb on /media/ext2hw3 type ext2 (rw,nosuid,nodev,uhelper=udisks)
/dev/sdd on /home/eye/457/lectures/filesystems/ext2 type ext2 (rw)
/dev/sde on /home/eye/457/lectures/filesystems/ext3 type ext3 (rw)
/dev/sdg on /home/eye/457/lectures/filesystems/fat type vfat (rw)
(eye@mordor filesystems)$
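Each line of that output follows the pattern "device on mountpoint type fstype (options)", so it is easy to pick apart. A small sketch that parses a few of the lines above; the helper name parse_mount is made up, and the pattern assumes mount points without spaces.

```python
import re

MOUNT_LINE = re.compile(
    r"^(?P<device>\S+) on (?P<mountpoint>\S+) type (?P<fstype>\S+) \((?P<options>.*)\)$"
)

def parse_mount(output):
    """Turn the text printed by mount into a list of dictionaries."""
    mounts = []
    for line in output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if m:
            mounts.append(m.groupdict())
    return mounts

sample = """/dev/sda3 on / type ext4 (rw)
/dev/sdd on /home/eye/457/lectures/filesystems/ext2 type ext2 (rw)
/dev/sdg on /home/eye/457/lectures/filesystems/fat type vfat (rw)"""

by_mountpoint = {m["mountpoint"]: m["fstype"] for m in parse_mount(sample)}
print(by_mountpoint)
```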
We can look at the properties of each file system and notice some differences.
(eye@mordor filesystems)$ stat -f ext2
File: "ext2"
ID: 2dcea5e479696fce Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 322518 Free: 322038 Available: 305636
Inodes: Total: 82080 Free: 82069
(eye@mordor filesystems)$ stat -f ext3
File: "ext3"
ID: fc6ff15b457a66cb Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 322528 Free: 313847 Available: 297463
Inodes: Total: 81920 Free: 81909
(eye@mordor filesystems)$ stat -f fat
File: "fat"
ID: 86000000000 Namelen: 255 Type: msdos
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 327036 Free: 327035 Available: 327035
Inodes: Total: 0 Free: 0
(eye@mordor filesystems)$
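Programs can ask for the same per-file-system numbers through the statvfs call. A minimal sketch, assuming a Unix-like system (os.statvfs is not available on Windows); the helper name fs_summary and the dictionary keys are made up for these notes.

```python
import os

def fs_summary(path):
    """Collect the fields that stat -f prints for the file system holding path."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_bsize,
        "fundamental_block_size": st.f_frsize,
        "blocks_total": st.f_blocks,
        "blocks_free": st.f_bfree,
        "blocks_available": st.f_bavail,   # free blocks usable by unprivileged users
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
        "name_max": st.f_namemax,
    }

info = fs_summary(".")
for key, value in info.items():
    print(key, value)
```

Note the FAT volume above reports zero inodes: FAT has no inode table, so the count is meaningless there.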
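We can see that a new, empty file on ext2 is already charged 8 blocks (stat counts these in 512-byte units, so this amounts to 4096 bytes, i.e., one 4KB file system block):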
(eye@mordor filesystems)$ cd ext2
(eye@mordor ext2)$ ls
foo.txt lost+found/
(eye@mordor ext2)$ stat foo.txt
File: `foo.txt'
Size: 0 Blocks: 8 IO Block: 4096 regular empty file
Device: 830h/2096d Inode: 12 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-11-25 12:04:41.000000000 -0700
Modify: 2013-11-25 12:04:41.000000000 -0700
Change: 2013-11-25 12:04:41.000000000 -0700
(eye@mordor ext2)$
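The Size and Blocks fields come straight from the stat syscall, so a program can compare a file's length with the space actually allocated to it. A minimal sketch, assuming a Unix-like system where Python's os.stat exposes st_blocks (counted in 512-byte units); the helper name allocation is made up for these notes, and the result depends on the file system holding the temporary file.

```python
import os
import tempfile

def allocation(path):
    """Return (file size in bytes, bytes of disk space allocated to the file)."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

with tempfile.NamedTemporaryFile() as f:
    size, allocated = allocation(f.name)   # a brand-new, empty file

print(size, allocated)
```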
Readings
- man 5 fs
- man fstab
- man mount