Lecture 6


Housekeeping

Required Reading

The required readings from the textbook Fluency ... are:

  • Chapter 14. Fill-in-the-Blank Computing. pp. 374-406.
  • Chapter 15. "What If" Thinking Helps. pp. 411-437.

Also -- it is assumed you are up to date on the online Lab Manual readings.

Note -- Material from the Required Readings may show up on the mid-term and final exams.

Introduction

Today we examine the transition from the first to the second stage of the Information Hierarchy: from Data to Information. We do this by visualizing a number of statistical ideas in a simple graphical representation, followed by a couple of examples.


At the end of this lecture you should:

  • understand a broad sampling of statistical terminology you are likely to encounter
  • be able to relate the various statistical terms to a simple graphical visualization of data existing in a "grid"
  • have practiced reasoning your way through a data set via a couple of examples.

Glossary

Know both the meaning of these statistics, and how to access them in a spreadsheet (a Python sketch after the glossary mirrors several of them):

  • Mean - the 'centre' of a set of values, aka the 'Average'
    • =AVERAGE(Cell:Cell)
  • Median - the middle value in a set of values
    • =MEDIAN(Cell:Cell)
  • Mode - the most frequently occurring value in a set of values
    • =MODE(Cell:Cell)
  • Standard Deviation - a measure of the spread of a set of values on either side of the mean
    • =STDEV(Cell:Cell)
  • Count - a function that counts the number of data values
    • =COUNT(Cell:Cell)
  • Sum - adds all the numbers in a range of cells
    • =SUM(Cell:Cell)
  • Min -- minimum in a set of values
    • =MIN(Cell:Cell)
  • Max -- maximum in a set of values
    • =MAX(Cell:Cell)
  • Range -- Max - Min
  • Precision -- the limits of our measuring instruments; the "box" within which all observations appear equal.
  • Scattergram -- a 2-dimensional display of data points on an X-Y plane.
  • Cartesian Plane -- a rectangular coordinate system that associates each point with a pair of numbers. The basis of Scattergrams. See http://dl.uncw.edu/digilib/mathematics/algebra/mat111hb/functions/coordinates/coordinates.html
  • Population -- all the observations we are interested in.
  • Sample -- a subset of the observations we are interested in, usually created via some sampling process -- random sampling, stratified sampling, systematic sampling, etc. To make correct inferences from samples, we must assume they reflect the population we are interested in.
  • Correlation -- how strongly related two variables are.
  • Regression -- the degree and form to which a dependent variable (Y) is a function of an independent variable (X). The resulting regression equation expresses Y as a function of X, plus an error term.
  • Classification -- individual cases (data observations) are placed into groups based on one or more variables. Classification is used to break a data set up into groups, often by searching for natural "breaks" in the data (i.e. areas with sparse data) that separate areas with dense data.
  • Outlier -- an observation that is numerically distant from the rest of the data.
  • Residual -- the difference between the value predicted by a model, and the value actually observed.
  • Confidence Interval -- when dealing with univariate data, an interval within which you expect a population parameter to fall. For example, a 95 % confidence interval around the mean would be the interval within which you expect the "true" mean to reside.
  • Confidence Ellipse -- the extension of the confidence interval concept to multivariate data. In bivariate data the interval is extended to an ellipse; in multivariate data, to a hyper-ellipsoid.
  • Skewness -- measures how symmetric a data distribution is around its centre. Negative Skew means there is a longer tail of data to the left, and the majority of data is on the right (mean is lower than median). Positive Skew means there is a longer tail of data to the right, and the majority of data is on the left (mean is higher than median). See visuals at: http://en.wikipedia.org/wiki/Skewness
  • Kurtosis -- measures how "peaked" a data distribution is. The higher the kurtosis, the sharper the peak and the heavier the tails, due to infrequent but extreme deviations from the average value.
  • Cumulative Distribution Function -- for a measured value, x, the cumulative distribution function of x, CDF(x), is equal to the number of observations less than or equal to x, divided by our sample size, n. It is essentially created by counting, for each value x, all observations <= x, resulting in a monotonically increasing curve with an "S" shape. The interesting property of the cumulative distribution function is that it incorporates the moments of a distribution (the mean, variance, skewness and kurtosis). See the sketch after this glossary.
  • Percentile -- for a measured variable, the value below which p percent of the observations fall. For example, if the 95th percentile for height was 6'2", 95 percent of the population would be at or below this height, and 5 % would be above it.
  • Quartile -- the same concept as the Percentile, but based on quarters of the data (each quartile spans 25 % of it). A common calculation is the interquartile range around the median, which represents the middle 50 % of the data (25 % below the median and 25 % above).
  • Event -- a discrete outcome.
  • Probability -- the likelihood of an event occurring. In an event space (say a six-sided die), it is the frequency of an event occurring (say rolling a five) over a very long run of experiments (rolls of the die). The probability of an event A is represented as P(A). See the dice-rolling sketch after this glossary.
  • Conditional Probability -- given that some event A has occurred, how likely is it that some other event B will occur? The conditional probability of B given A is represented as P(B|A).
  • Exploratory Data Analysis -- Exploratory Data Analysis (EDA) was essentially defined in Tukey's 1977 book of the same name. Data are examined and clues uncovered to generate hypotheses about the data. You could consider this akin to detective work. Tukey stressed in particular visual methods for finding groups, outliers, and residuals that show further pattern. He contrasted EDA with "Confirmatory Data Analysis" (see below). The core idea in EDA is to look at the data for suggestive patterns (often visually detectable patterns).
  • Confirmatory Data Analysis -- Confirmatory Data Analysis is basically judicial in nature. We have a hypothesis (which may have originated in an EDA stage). We have some data we believe is a critical test of the hypothesis (not the same data as used in EDA). We then attempt to determine the "degree" to which our hypothesis is supported or refuted.
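
To connect the spreadsheet formulas above with the concepts behind them, here is a minimal Python sketch (standard library only) that computes the same summary statistics, an empirical Cumulative Distribution Function, and the quartiles. The data values are invented purely for illustration.

  import statistics

  # Invented sample of exam scores -- illustration only.
  values = [62, 70, 70, 74, 78, 81, 85, 90, 93]

  print(statistics.mean(values))    # Mean   -- spreadsheet =AVERAGE(Cell:Cell)
  print(statistics.median(values))  # Median -- =MEDIAN(Cell:Cell)
  print(statistics.mode(values))    # Mode   -- =MODE(Cell:Cell)
  print(statistics.stdev(values))   # Standard Deviation (sample) -- =STDEV(Cell:Cell)
  print(len(values))                # Count  -- =COUNT(Cell:Cell)
  print(sum(values))                # Sum    -- =SUM(Cell:Cell)
  print(min(values), max(values))   # Min and Max -- =MIN(...), =MAX(...)
  print(max(values) - min(values))  # Range = Max - Min

  # Skew indicator: a mean above the median suggests positive skew
  # (a longer tail of data to the right).
  print(statistics.mean(values) > statistics.median(values))

  # Empirical Cumulative Distribution Function:
  # CDF(x) = (number of observations <= x) / n.
  def cdf(x, data):
      return sum(1 for v in data if v <= x) / len(data)

  print(cdf(78, values))  # fraction of observations at or below 78

  # Quartiles and the interquartile range around the median
  # (the middle 50 % of the data).
  q1, q2, q3 = statistics.quantiles(values, n=4)
  print(q1, q2, q3)   # q2 is the median
  print(q3 - q1)      # interquartile range

Most spreadsheets offer a similar quartile calculation via =QUARTILE.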
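
Similarly, a minimal sketch of probability as long-run frequency, as described in the Probability entry above; the fixed seed and the number of rolls are arbitrary choices for illustration.

  import random

  # Simulate many rolls of a six-sided die; the frequency of the event
  # "rolled a five" should approach P(A) = 1/6 over a long run.
  random.seed(203)  # fixed seed so the run is repeatable
  rolls = [random.randint(1, 6) for _ in range(100_000)]
  print(rolls.count(5) / len(rolls))  # close to 1/6 ~= 0.1667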


This lecture is focussed on giving a visual introduction to these statistical terms and concepts. Note: the Mean, Median, and Mode are all measures of 'location'.

Concepts

A Visual Introduction to Statistics

  • Our data 'lives' in the Cartesian Plane (or extensions thereof).
  • The data could represent the Population, or a Sample from the Population we are interested in.
  • We can represent this in 2D as a 'Scattergram'.
  • The Min and Max set the boundaries for where data resides in our Scattergram.
  • The 'cells' in the Scattergram reflect the Precision of the data.
  • The 'cell' with the most data is the Mode.
  • The Mean and Median are two ways of estimating where data is most frequent in a Scattergram, i.e. the central location.
  • The Standard Deviation is a measure of how data varies around the Mean. It can be imagined as an ellipse drawn in a Scattergram.
  • We can imagine the Correlation between two variables as an ellipsoid.
  • We can imagine the Regression between two variables as the line through the data that minimizes the (squared) deviations of Y from the line (see the sketch after this list).
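
To make the Correlation and Regression pictures concrete, here is a minimal Python sketch on invented (x, y) data. It computes the Pearson correlation, the least-squares line, and the residuals from the Glossary; spreadsheets expose the same calculations through functions such as =CORREL, =SLOPE and =INTERCEPT.

  # Invented paired data -- illustration only.
  xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
  ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

  n = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n

  # Sums of squared deviations and cross-deviations around the means.
  sxx = sum((x - mean_x) ** 2 for x in xs)
  syy = sum((y - mean_y) ** 2 for y in ys)
  sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

  # Correlation: how strongly related the two variables are (-1 to +1).
  r = sxy / (sxx * syy) ** 0.5

  # Least-squares Regression line Y = a + b*X + error: the line that
  # minimizes the squared deviations of Y from the line.
  b = sxy / sxx            # slope
  a = mean_y - b * mean_x  # intercept

  # Residuals: observed Y minus the Y predicted by the model.
  residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

  print(f"r = {r:.3f}; Y = {a:.2f} + {b:.2f}*X")
  print([round(e, 2) for e in residuals])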


Examples

We will discuss two examples ---> Do You have all the Variables You Need to draw a conclusion?

Milgram's "Authority Experiment"
  • Tests how people react to Authority (Milgram wanted to understand how war crimes occur in authoritarian regimes)
  • 40 males, aged 20-50 years
  • 1 Authority figure
  • 1 actor
  • Results
    • All 40 participants gave shocks up to 300 volts
    • 5 refused to continue past 300 v
    • 4 more gave one further shock, then quit
    • 2 refused to continue past 330 v
    • 1 each refused at 345, 360, and 375 volts
    • 14 in total (14/40 = 35 %) "refused" authority
    • 65 percent (26/40) gave shocks all the way up to 450 volts
    • 35 % stopped before 450 volts
  • Question: Does this study "prove" that "anyone" is likely to obey an authority figure under the right circumstances?


Conditional Probability Example ---> Cause and Effect

Read the document below, which gives a very simple introduction to the notion of Conditional Probabilities, and how they can be used to tease apart whether an "effect" has a single cause or multiple causes.
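
The attached document is not reproduced here, but the following minimal Python sketch shows the underlying idea on an invented cause/effect grid of counts (smoking status vs. disease status -- all numbers made up purely for illustration). The homework below asks you to build the same kind of grid in a spreadsheet.

  # A minimal sketch of conditional probability on an invented grid of counts.
  # Rows: hypothetical "cause" (smoker / non-smoker); columns: "effect"
  # (disease / no disease). All counts are made up for illustration only.
  counts = {
      ("smoker", "disease"): 30,
      ("smoker", "no disease"): 70,
      ("non-smoker", "disease"): 10,
      ("non-smoker", "no disease"): 190,
  }
  total = sum(counts.values())

  # P(disease): frequency of the effect over all observations.
  p_disease = (counts[("smoker", "disease")]
               + counts[("non-smoker", "disease")]) / total

  # P(disease | cause): among observations where the cause occurred,
  # the fraction showing the effect -- the conditional probability P(B|A).
  def p_disease_given(cause):
      cause_total = counts[(cause, "disease")] + counts[(cause, "no disease")]
      return counts[(cause, "disease")] / cause_total

  print(p_disease)                      # 40/300 ~= 0.13
  print(p_disease_given("smoker"))      # 30/100 = 0.30
  print(p_disease_given("non-smoker"))  # 10/200 = 0.05

  # A sharp difference between P(disease | smoker) and P(disease | non-smoker)
  # suggests (but by itself does not prove) a link between cause and effect.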


Summary

  • A number of statistical terms were introduced,
  • in the context of a grid visualization of basic data analysis,
  • followed by two examples of how we can think in the grid (Milgram's "Authority Study"; thinking about Cause and Effect in terms of conditional probabilities).

All of which is geared toward giving us insight into how to move from Data to Information, by understanding how common statistical principles are connected within our grid visualization.

Text Readings

See "Required Readings" above

Resources

Good, P.I. (2005). Introduction to Statistics Through Resampling Methods. Wiley.

Homework

Given the conditional probability example above, create a grid in a blank spreadsheet where the "Cause" is how many cigarettes are smoked daily, and the "Effect" is "Age of Death".

Make up your own numbers (based on intuition or some Internet research).

Then, using this grid, calculate:

  1. Conditional Probabilities.
  2. Cumulative Distribution Functions for "Cigarettes Smoked" and "Age of Death".
  • Chart these.

Make the spreadsheet self-contained on a single page, with good visual design.

Questions