Courses/Computer Science/CPSC 203/CPSC 203 2008Winter L03/CPSC 203 2008Winter L03 Lectures/Lecture 21

  • Housekeeping
    • Final Exam: Monday, April 21st, 12-2 P.M., KN RED.
    • Text Readings -- Problem Solving Associated Topics. Chapter 10. (See TEXT READINGS from last class: know the System Development/Program Life Cycle, Design Techniques, and Building Blocks for Programming.)
    • NOTE: Tutorial 1 the week of March 31st will be devoted to TAs working with you on Group Projects, while Tutorial 2 will be devoted to Assignment 2.
    • Thursday Lecture this week: I will review Assignment 2 and prep for next week's group project presentations at end of lecture.
    • Group Project Presentations Begin Next Week -- schedule is at the top of: http://wiki.ucalgary.ca/page/Courses/Computer_Science/CPSC_203/CPSC_203_2008Winter_L03/CPSC_203_2008WinterL03_TermProjects



Lecture 21

Last lecture we looked at some approaches to problem solving, and used the illustrative example of finding "the keystone website" for the Internet. In this lecture, we will look at a particular solution to that problem, upon which Google is founded: the notion of Page Rank. Effectively, the sites with the highest Page Rank would be the keystone species of the World Wide Web -- the major hubs that informationally link the World Wide Web together.

We will illustrate how a web search engine works from a Top-Down-Design perspective, home in on how Page Rank technology works, and finally relate the consequences of that technology to issues such as:

  • economics
  • privacy
  • freedom


Having begun with a big-picture view of the Internet in our early lectures, we will end there with some speculations about the evolving structure of the Internet.



OBJECTIVES:

  • You will be able to view Search Engines from a Top-Down-Design Perspective.
  • You will understand Page Rank, and be able to calculate it for simple examples.
  • You will be able to connect a technology, Page Rank, to related social issues created by the technology.



How Search Engines Work

  • Top Down Design Example
    • Get Information ---> Spiders/Web Crawlers
    • Organize Information ---> Index
    • User Queries Information ---> User Interface and Query Engine
  • The Spider or Web Crawler -- this is a program that "crawls" across web pages, and retrieves tag headings and text content, the basis of a database.
    • Searches pages
    • Follows Links
  • The Index -- this is a form of "Inverted File" that takes the database, and re-represents it in terms of an index of keywords, linked to web pages (similar to the index at the back of a textbook)
    • Transforms the results of the Spiders' forays into a single index
  • The Query Interface -- this is what you, the user, see when you wish to do a query, as well as what goes on "beneath the hood" to process the query.
    • How to make Search Easy For Humans
    • Link between questions you can ask and the Index
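
The three stages above can be sketched in a few lines of Python. This is only a toy illustration, not how a production search engine is built: the "web" here is a hypothetical in-memory dictionary (TOY_WEB), and the function names crawl, build_index and query are made up for this example.

```python
from collections import defaultdict, deque

# Hypothetical miniature "web": page name -> (text content, outgoing links).
TOY_WEB = {
    "home": ("welcome to the calgary computer science page", ["courses", "news"]),
    "courses": ("computer science courses and lecture notes", ["home", "cpsc203"]),
    "news": ("department news and events", ["home"]),
    "cpsc203": ("lecture notes on search engines and page rank", ["courses"]),
}


def crawl(web, start):
    """Spider: visit pages breadth-first by following links from `start`."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        page = queue.popleft()
        text, links = web[page]
        pages[page] = text  # store the retrieved text (the "database")
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Index: an inverted file mapping each keyword to the pages containing it."""
    index = defaultdict(set)
    for page, text in pages.items():
        for word in text.lower().split():
            index[word].add(page)
    return index


def query(index, terms):
    """Query engine: return the pages that contain every word in `terms`."""
    words = terms.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


pages = crawl(TOY_WEB, "home")
index = build_index(pages)
print(query(index, "lecture notes"))  # prints ['courses', 'cpsc203']
```

Note that the query returns matching pages in alphabetical order only. Deciding which match should come first is the job of a ranking algorithm.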

Page Rank

Once we have an Index, page rank could be said to be the algorithm that determines how we sort items in the Index, i.e. what to return at the top of a search list.


As a formula: PR(u) = SUM over all pages v that link to u of ( PR(v) / L(v) )

In English: The Page Rank of a page "u" is found by taking every page "v" that links into "u", dividing the Page Rank of "v" by L(v), the number of links going out of "v", and summing the results.

Technical Difficulty: This algorithm is iterative. To find the Page Ranks for "v", you have to apply the algorithm to the pages that link into "v", and so on. One begins by assigning each page an initial Page Rank of 1/#pages. So if your "universe" contained 10 web pages, each page would initially be assigned a rank of 1/10. Then, using the formula above, you calculate a first-iteration Page Rank for each page. You repeat the calculation with those values to get a second-iteration Page Rank, and continue until the Page Ranks (hopefully) stabilize. While this kind of calculation can be done by hand for a very small system, it rapidly becomes extremely computationally intensive as the number of pages grows.

  • illustrative example using Dots-and-Edges graph.
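
As a minimal sketch of the iteration described above, the Python below applies the simplified formula to a hypothetical three-page graph (A links to B and C, B links to C, C links to A). The function name page_rank, the stopping tolerance and the example graph are assumptions made for illustration. The sketch also assumes every page has at least one outgoing link, and it omits the damping factor that the published PageRank algorithm adds.

```python
def page_rank(links, tolerance=1e-9, max_iterations=200):
    """Iterate PR(u) = SUM(PR(v) / L(v)) until the ranks stop changing.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    # Start every page at 1 / #pages.
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(max_iterations):
        new_ranks = {page: 0.0 for page in pages}
        for v, outgoing in links.items():
            # Page v shares its current rank equally among its L(v) links.
            for u in outgoing:
                new_ranks[u] += ranks[v] / len(outgoing)
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks


# Hypothetical dots-and-edges graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

first = page_rank(links, max_iterations=1)
# By hand, after one iteration: A = 1/3, B = 1/6, C = 1/2.

final = page_rank(links)
print({page: round(rank, 3) for page, rank in final.items()})
# prints {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

The first iteration can be checked by hand: A receives all of C's rank (1/3), B receives half of A's rank (1/6), and C receives half of A's rank plus all of B's (1/6 + 1/3 = 1/2). In this small example the ranks stabilize at 0.4, 0.2 and 0.4.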


For more technical details go to: http://en.wikipedia.org/wiki/PageRank



Control Issues

  • Economics
    • Are the rankings objective?
    • How do paid ads show up?
    • What happens if a page's rank suddenly changes due to changes in Google's algorithm?
    • Click Fraud
    • Link Spamming
  • Privacy
    • "I took it off the net, but now it's cached by Google."
    • Blogs and Emails often searchable
    • Moving Search Technology to the Desktop: Google office, Google mail, etc.
    • Mining Search Data (see Searching the Searches)
  • Freedom
    • Political Control of Website Search Results
    • Searching the Searches
  • Big Picture -- Back to Scale Free Networks
    • A new site is more likely to link to an existing site that already has many links, and such sites have a higher Page Rank.
    • So, does the Page Rank algorithm promote the Internet as a Scale Free Network?
    • And do the "issues" described above affect the structure of a scale free network?

I will leave you with that question to ponder.
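
One way to start exploring that question is to simulate the "new sites prefer well-linked sites" rule directly. The sketch below is a toy model under stated assumptions, not a model of the real Web: each new page makes exactly one link, and it picks its target with probability proportional to (1 + number of incoming links), where the extra 1 is an assumption added so that brand-new pages can be chosen at all. The function name grow_network and the seed parameter are hypothetical.

```python
import random


def grow_network(num_pages, seed=0):
    """Grow a network one page at a time using preferential linking.

    Returns a list where entry i is the number of incoming links of page i.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    in_links = [0]  # page 0 starts with no incoming links
    # Each page appears in the lottery (1 + its incoming links) times,
    # so well-linked pages are proportionally more likely to be drawn.
    lottery = [0]
    for new_page in range(1, num_pages):
        target = rng.choice(lottery)
        in_links[target] += 1
        lottery.append(target)    # target gained a link: one more ticket
        in_links.append(0)
        lottery.append(new_page)  # the new page enters with one ticket
    return in_links


in_links = grow_network(1000)
print("most linked page has", max(in_links), "incoming links")
print("pages with no incoming links:", in_links.count(0))
```

Comparing the largest link count with the number of pages that nobody links to gives a rough feel for how uneven the resulting network is.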


Resources

Google's PageRank and Beyond: The Science of Search Engine Rankings. 2006. By Amy N. Langville and Carl D. Meyer. Princeton University Press.

The Google Story. 2005. By David A. Vise. Delacorte Press.

The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. 2005. By John Battelle. Portfolio Books.

Information Politics on the Web. 2004. By Richard Rogers. MIT Press.