Courses/Computer Science/CPSC 203/CPSC 203 Template/Lecture Template/Lecture 12

From wiki.ucalgary.ca
< Courses‎ | Computer Science‎ | CPSC 203‎ | CPSC 203 Template‎ | Lecture Template
Revision as of 19:21, 16 October 2008 by Mishtu.banerjee (talk | contribs) (Summary)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Housekeeping

(none)

Required Reading

as per Required Reading for Lecture 10

Introduction

Last class we emphasized the understanding of individual queries via:

  1. The Query Trace Mechanism (essentially an algorithm for predicting the result of a query)
  2. Eight basic query patterns (generalized patterns you should be able to recognize and apply).

However, in the process of moving up the Information Hierarchy, we need to think in more complex terms than an individual query. We need to examine how we could orchestrate a series of queries to move from our raw data, to specific information we wish to gain. This lecture will examine the process on Data Analysis Via Sequential Queries.

At the end of tdoay's lecture you will have:

  • Been introduced to some basic relational operations -- to generalize "Relational Thinking" from the previous classes
  • Been introduced to the process of creating analysis from sequential sets of queries.

Glossary

These terms are modified from "An Introduction to Database Systems" by C.J. Date.


  • Relational Algebra -- a set of operators that take tables (aka relations) as their input and return tables as their output. The key relational operations are given below (and you can see they closely match the set theoretic operators we've previously encountered, with the restriction that inputs and outputs are tables).


  • Restrict. Returns a subset of rows from a table including all tuples that satisfy a specified condition. It is also called Select (but not to be confused with the SELECT keyword in SQL).
  • Project. Returns a subset of columns from a table.
  • Product. The Cartesian Product of two tables which contains every tuple in one table matched against every tuple in another table.
  • Union. Contains all tuples that appear in EITHER or BOTH of two tables.
  • Intersect. Returns a Table with all tuples that appear in BOTH tables.
  • Difference. Returns a table with all tuples that appear in the First table and not in the Second table.
  • Join. Returns a table with all tuples where an attribute from Table 1 (say the Pkey) matches an attribute from Table 2 say the Fkey).
  • Relational Algebra Operators and SQL:
    • SELECT defines which columns appear in a table, like Project.
    • FROM identifies the tables whose Product is the first step in the Query Trace Mechanism.
    • WHERE defines the conditions that need to be satisfied like Restrict
    • WHERE is used to define Join conditions for two tables.

Concepts

Relational Thinking

We are now at a point where we can summarize the elements of Relational Thinking, looking back to what we already know, and looking forward to what we will learn in this lecture and the next. Relational Thinking views the world as composed of Entities and Relations that can be represented in databases, and views the process of problem solving as one of developing queries on Entities and Relations. The small set of concepts below summarize the "Relational Thinking" world view.

  1. Represent Systems as E-RDs. Viewing Systems in terms of Entity-Relationship Diagrams. (Lectures 10-11)
  2. Queries are Subsets. Understanding the Query Trace Mechanism (Lectures 10-11).
  3. What can be queried is bound by a small set of Relational Operations. Generalizing from the Query Trace Mechanism to a basic understanding of Relational Operations (Lecture 12 --this lecture)
  4. Sequential Queries Define a Path from Raw Data to Solution.Viewing Problem Solving in Relational Databases as a process of moving from Raw Data to Information in a series of linked queries. (Lecture 12 --this lecture)
  5. Design Databases from Base Sets and their Relations. Understanding the general Design Principles behind Relational Databases (Lecture 13 -- next lecture)

The Dataset-Transform Model: Sketching Analyses as Directed Graphs

Up to now, we have looked at 'static' models of a relational database, which we have called Entity-Relation-Diagrams. These diagrams emphasize the structure of a database, and the relationships between tables that allow us to query.

Now, we will look at a 'dynamic' approach to defining a data analysis as a series of transformations on data.

In the Entity-Relationship model, the nodes represented entities (tables) and the directed edges represented relationships between tables (based on their primary-key/foreign-key structures).

We'll call our new model the "'Dataset-Transform"' model. In it:

  • Every Table and Query Result is a node. Nodes now represent a particular data-set:
  • The Actual Queries are Edges. Directed edges represent a transformation of the dataset (via a query): '
  • The final dataset, is your solution. The pattern is, given some initial datasets, a series of transformation occur, creating intermediate datasets:

The Analysis Design Process

Assuming the above ""Dataset-Transform" approach, we can develop an analysis combining sequential queries.

The following notes offer a disciplined approach to developing a data analysis, and is particularly suitable to the case where much of the data organization and intermediate data processing is done via queries in a relational database.

  1. Begin with the End in mind and work Backwards. What is the final goal of your analysis?
  2. What are your data sources and how do they need to be organized to achieve your analytical goal.
  3. What statistics and calculated variables do you need to use, including interim calculations.
  4. Draw a "path" from source data to final analysis and use it to,
  5. Break pieces of the analysis down into small steps (in a relational DB each of these steps could be a query, so one ends up with a series of queries that correspond to each step in the analysis).
  6. Check the accuracy of each step of the analysis by
    1. Confirming the data coming into that stage of the analysis
    2. Confirming the data exiting that stage of the analysis
    3. in particular the entering and exiting data in each stage must be in the form, and range that you expect.


A Hint: As you design and implement an analysis, name queries sequentially (easier to keep track of inputs) and with meaningful descriptions (what is this query doing).

  • for example: qry1_GetData, qry2_SortData, qry3_TransformData,qry5_SummarizeData.

Analysis Via Queries Examples

Examples 1 and 2 are covered in the document below:

Media:DataAnalysisViaSequentialQueries.doc



Example 1: Goal -- Profitability of Publishers

Lect14Diag1.png


Example 2: Goal -- Rank Websites from Most to Least Commonly Preferred

Lect14Diag2.png


Summary

  • There is a small set of concepts that support "Relational Thinking" .
  • Analysis in Databases can be viewed as a directed graph of sequential queries.

Text Readings

(to be added)

Resources

  • Date, C.J. 2000. An Introduction to Database Systems. 7th Edn.
    • Today's glossary is drawn from Chapter 6 of this book.
  • The Essence of SQL. A Guide to Learning the Most SQL in the Least Amount of Time. 1996. By David Rozenshtein
    • The Query Trace mechanism covered in this lecure comes from this book.
  • SQL Visual Quickstart Guide. 2005. By Chris Fehily.
    • A practical introduction to developing queries.
  • "Practical Issues in Database Management. Chapter 5. By Fabian Pascal.
    • The principles of Database Design are very practically set forth in this book.

Homework

Again, explore each of these databases. Copy and modify queries to practice.

Questions