Courses/Computer Science/CPSC 203/CPSC 203 2008Winter L03/CPSC 203 2008Winter L03 Lectures/Lecture 14

Lecture 14
Last Class we looked at a number of common query design patterns, focussing on individual queries. This class we introduce some principles for designing a data analysis across a number of queries. We emphasize a visual method of thinking about data analysis from a database using the dots-and-edges approach. We illustrate an example of an extended data analysis across multiple queries.

OBJECTIVES:
 * You will learn to use a visual dots-and-edges method of sketching out the queries you may need to solve a particular problem in a database.


 * House Keeping
 * Text Reading -- Chapter 11 -- Pay particular attention to the sections: Data Storage, Managing Data , and Data Mining.
 * Midterm Results: Class Average: 16.4/20 (82.2 %). An Exam Key is available in CPSC Office: ICT 602. Ask for Susan Lucas (Undergraduate Administrator)
 * Assignment 1 Extension -- Now Due: Midnight March 24th.
 * NOTE: Tutorials the week of March 17th will be devoted to TA's working with you on Assignment 1
 * Additional Analysis Examples on BB: Under "Course Documents", "AnalysisExamples for Assignment 1".
 *  Time Boxed Assignments 
 * TBA2 -- Wednesday/Thursday this week.
 * Skillset for TBA3 is now up on Wiki

Mini-Tutorial: Data Analysis Design
(This material is expanded from the same named section, listed but not covered last lecture, and is the focus of today's lecture).

Dots and Edges Approach to Sketching An Analysis
Up to now, we have looked at 'static' dots-and-edges models of a relational database, which we have called Entity-Relation-Diagrams. These diagrams emphasize the structure of a database, and the relationships between tables that allow us to query.

Now, we will look at a 'dynamic' dots-and-edges approach to defining a data analysis as a series of transformations on data.

In the Entity-Relationship model, the "dots" represented entities (tables) and the edges represented relationships between tables (based on their primary-key/foreign-key structures).

We'll call our new model the "Dataset-Transform" model. In it:
 * the "dots" now represent a particular data-set:  Every Table and Query Result is a node.
 * the edges represent a transformation of the dataset (via a query): The Actual Queries are Edges.
 * The pattern is, given some initial datasets, a series of transformation occur, creating intermediate datasets:The final dataset, is your solution.
 * Name Queries Sequentially (easier to keep track of inputs) and with meaningful descriptions (what is this query doing).

The Analysis Design Process
Assuming the above ""Dataset-Transform" approach, we can develop an analysis combining sequential queries.

The following notes offer a disciplined approach to developing a data analysis, and is particularly suitable to the case where much of the data organization and intermediate data processing is done via queries in a relational database.


 * 1) Begin with the End in mind and work Backwards. What is the final goal of your analysis?
 * 2) What are your data sources and how do they need to be organized to achieve your analytical goal.
 * 3) What statistics and calculated variables do you need to use, including interim calculations.
 * 4) Draw a "path" from source data to final analysis and use it to,
 * 5) Break pieces of the analysis down into small steps (in a relational DB each of these steps could be a query, so one ends up with a series of queries that correspond to each step in the analysis).
 * 6) Check the accuracy of each step of the analysis by
 * 7) Confirming the data coming into that stage of the analysis
 * 8) Confirming the data exiting that stage of the analysis
 * 9) in particular the entering and exiting data in each stage must be in the form, and range that you expect.

A Few Example Questions

 * From a Database on Publishers, Titles, Authors, can we identify the profitability of different book Publishers?
 * From a Database on Student Website preferences, can we identify the most commonly preferred websites?
 * From a Database on Employee Pay and Performance -- can we some standards for pay and performance? (next Tuesday)

TEXT READINGS
TIA 3rd Edn: Chapter 11. 462 -- 503

TIA 4th Edn: Chapter 11. pp 484 -- 525

Resources
The Query Trace mechanism covered in this lecure comes from this book.
 * The Essence of SQL. A Guide to Learning the Most SQL in the Least Amount of Time. 1996. By David Rozenshtein

A practical introduction to developing queries.
 * SQL Visual Quickstart Guide. 2005. By Chris Fehily.

The principles of Database Design are very practically set forth in this book.
 * "Practical Issues in Database Management. Chapter 5. By Fabian Pascal.