Courses/Computer Science/CPSC 203/CPSC 203 2008Winter L03/CPSC 203 2008Winter L03 Lectures/Lecture 15
Contents
Lecture 15
This is the final database lecture before we switch over to Logic and Circuits next week.
The last few classes focussed on first developing the query trace mechanism, second looking at several common query patterns using that mechanism, and finally combining the query patterns we learned to design and implement data analyses in a relational database. Today we introduce you to a complementary, but rather different skillset: the principles of database design. Like "design" in other fields, Database design is part "science" and part "art". We will introduce some basic principles for database design, the key concept of database normalization, and then go through several design examples.
OBJECTIVES:
- You will be introduced to the basic principles of database design, including Database Normalization.
- You will have the opportunity to experience database design using a realistic dataset (the Who Are We" survey data).
- House Keeping
- Text Reading -- Chapter 11 -- Pay particular attention to the sections: Data Storage, Managing Data , and Data Mining.
- Assignment 1 Extension -- Now Due: Midnight March 24th.
- NOTE: Tutorials the week of March 17th will be devoted to TA's working with you on Assignment 1
- Additional Analysis Examples on BB: Under "Course Documents", "AnalysisExamples for Assignment 1".
- Time Boxed Assignments
- Skillset for TBA3 is now up on Wiki
Database Design Guidelines
Database Design Goals
Database Design is an iterative process. rarely do we get to design a database 'from scratch'. Often some data already exists -- often in spreadsheets and text files. Our job is to figure out how to turn it into a well structured relational database that meets the following Design Goals :
- The "Entities" -- (things in the real world) are clear.
- The Relationships between entities are maintained.
- Storage is efficient (i.e. we want to take up the least space in the computer).
- Redundancy is minimized (i.e. the same information is not showing up in several places).
- No data is lost (i.e. from the relational database -- we can re-create the original data set).
Database Design Principles
Fabian Pascal in "Practical Issues in Database Management" has aptly summarized the essence of database design with respect to normalization as, "The Key, the Whole Key and Nothing but the Key".
Recognizing "base sets" in a database, and tying them to unique keys takes one far along the path of database design. We will not cover "Normalization" proper in this course, but will sample from Normalization theory to offer the following general guidelines in developing a database as a series of related tables:
- Identify Base Sets (the distinct entities you are dealing with). Call these base sets your entities.
- Identify the unique "Primary" key (stands in for the members of the base set)of each entity.
- Identify Primary-Key/ForeignKey relationships amongst entities. These relatinships allow you to join entities together.
- Make sure all non-primary-key attributes in an entity depend on the unique Primary key, and only the primary key. Or in other words: "The Key, the Whole Key and Nothing but the Key".
Database Normalization
Often, our starting point for database design is some pre-existing data, either in a spreadsheet, an existing database, or often in some legacy file format.
In this case, we often go through an iterative process of what could be called "table therapy". Moving the existing data closer to a form that matches our key principles above. This process is called Normalization. Normalization is a way of systematically breaking an existing table of data down into smaller tables to bring the dataset into line with our database design goals.
There are 5 levels of Normalization, called the Normal Forms. Most functional databases are brought to 3rd Normal Form (3NF), so we will only focus on the first 3 Normal forms.
1NF':
Remove repeating groups (they are essentially parts of the same base set). These may take the form of several columns which represent the same kind of data ... or multivalued columns where rows are repeated with essentially the same data.
2NF :
Ensure all columns in an entity are functionally dependant on the WHOLE KEY. This requires you to first define the primary key of an entity, and second determine that columns are dependant on that key. All non-key columns should depend on the primary key.
3NF:
Ensure all columns in an entity are functionally dependant DIRECTLY on the whole key. This requires you to further ensure from 2NF that all columns are dependant on only the primary key, and no other non-key colummns. If further dependancies are found, a table in 2NF is further broken to tables in 3NF.
The end result of a process of normalization is that a few (or single) large tables get broken down into smaller tables which can be more compactly stored, and in which entities are separated, but which also allow the recreation of the original dataset.
Iterative Database Design Exercise
We will use the Survey Data Set.
- We will work in "Groups" for 30 minutes -- to iteratively design a database from the "Who Are We" Survey data.
- There are 2 Sub-Teams per group:
- White Sub-Team are the Designers
- Black Sub-Team are the 'Bug Finders' -- Critique design in terms of the 'Design Goals and Normalization Rules above.
- Rapidly work your way through at least 2 iterations.
- Iteration 1: White Team initial design. Black Team Critique.
- Iteration 2: White Team re-design. Black Team Re-Critique.
- Last Part of Class we will have a couple of Teams present their Designs.
- Tuesday I will present a 'Normalized' Design for the Survey Dataset
TEXT READINGS
TIA 3rd Edn: Chapter 11. 462 -- 503
TIA 4th Edn: Chapter 11. pp 484 -- 525
Resources
- The Essence of SQL. A Guide to Learning the Most SQL in the Least Amount of Time. 1996. By David Rozenshtein
The Query Trace mechanism covered in this lecure comes from this book.
- SQL Visual Quickstart Guide. 2005. By Chris Fehily.
A practical introduction to developing queries.
- "Practical Issues in Database Management. Chapter 5. By Fabian Pascal.
The principles of Database Design are very practically set forth in this book.