Princeton University
Computer Science Dept.

Richard L. Smith '70 Freshman Seminar

Google and Ye Shall Find???


Andrea LaPaugh

FRS 117

Fall 2007



   Weeks 1 through 3


Week 1, Sept. 19:

Topics: 
What is search?
Ideal search?
Representing text digitally.
POSTPONED: Methodology of computer search before the Web.

Questions for class discussion ( our discussion is not limited to these, but they will help you prepare):
In "As We May Think," Vannevar Bush clearly got the technology wrong: he could not have known about the coming digital technology revolution. But ignoring the technology used,
*What of Vannevar Bush's vision have we achieved?
*What of Vannevar Bush's vision do you expect we will eventually achieve?
*What do you think Vannevar Bush "got wrong" in terms of his vision?
*Are there any parts of his vision that you think are impossible?

What is your concept of "ideal search"?

In The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture, one of Battelle's overarching themes is trust.  In Chapter 1, he discusses several aspects of trust in the context of search.  Do you agree with his assessment?   Are there aspects of trust that he does not discuss?

Written assignment due this week:  NONE

Reading for discussion today:
     *Bush, Vannevar, "As We May Think," Atlantic Monthly, July 1945.
     *Battelle, The Search:  Chapter 1

References for technical material:
     *Howstuffworks "Computer Memory Basics"
     *Howstuffworks "Types of Computer Memory"
     *Howstuffworks "How Bits and Bytes Work"
     *American Standard Code for Information Interchange - Wikipedia, the free encyclopedia:  "Overview" and "ASCII printable characters"
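
To make "representing text digitally" concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the readings use). It assumes plain ASCII text and shows each character stored as a number, and each number as 8 bits. The function name to_bits is hypothetical.

```python
def to_bits(text):
    """Return each character's ASCII code as an 8-bit binary string."""
    # encode("ascii") raises an error for characters outside the ASCII range
    return [format(code, "08b") for code in text.encode("ascii")]

# Each character is stored as a small number (its ASCII code)...
codes = list("Hi".encode("ascii"))
print(codes)           # [72, 105]

# ...and each number is stored as a pattern of bits.
print(to_bits("Hi"))   # ['01001000', '01101001']
```

The references above explain why 8 bits (one byte) per character became the usual unit.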
    

Week 2, Sept. 26:

Topics: 
Methodology of computer search before the Web
Model of the Web - Graph structures
Web pages
Information in HTML
Using the Web in search, Part  I

Class discussion:  This week we will begin by looking at the methods that search engines use to retrieve and rank text documents (anything consisting primarily of written words).  We will then examine how things change when documents go on the Web.  Think about how you decide whether a document is relevant and how that decision might be turned into an automated method.  Also bring your questions about why documents get ranked the way they do.  Since Google and the other search engines use "secret formulas", we won't know for certain, but we can make an educated guess at what is going on.

Written assignment due this week:  Please visit the Assignment 1 page.

Non-technical reading for today:
     *Battelle, Chapters 2, 3, and 4.  Battelle covers a lot of ground quickly because he concentrates on the history and only mentions the technical aspects.  We'll spend more time understanding the key ideas behind the technical aspects.  The history is fun, and we will certainly include some in our discussion (but little of the history of the Web search engines before Google).

References for technical material:  
     *(Originally for week 1) Information retrieval - Wikipedia, the free encyclopedia: see the timeline.  We will not discuss the technical development in this entry.
     *(Originally for week 1) Amit Singhal, Modern Information Retrieval: A Brief Overview, In Bulletin of the  IEEE Computer Society Technical Committee on Data Engineering,   2001, pp. 35-43.   (pdf; access limited to Princeton University.)   The mathematical development in this article is more sophisticated than that which we will use, especially Section 2.2 on Probabilistic Models.   Read for the main ideas.  Read the math if you are interested.
     *Inverted index - Wikipedia, the free encyclopedia
     *Graph (mathematics) - Wikipedia, the free encyclopedia
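
As a companion to the inverted index reference, here is a minimal sketch in Python of the basic idea: for each word, record which documents contain it, then answer a query by intersecting those sets. The three tiny documents and the function names are made up for illustration. Real search engines also rank the matching documents, which this sketch does not attempt.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the documents for the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, keyed by a document id.
docs = {
    1: "web search engines rank pages",
    2: "library search before the web",
    3: "graph structures model links",
}
index = build_inverted_index(docs)
print(sorted(search(index, "web search")))  # [1, 2]
```

Looking up a word in the index avoids rereading every document for every query, which is the point of building the index ahead of time.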
    


Week 3, Oct.  3:

Topics: 
Using the Web in search, Part  II
"Gaming" Web page ranking (Spam)
Temporal consistency of ranking
POSTPONED: Finding content (Web crawlers); Building an index; Invisible Web

Class discussion:  During our last class we discussed a lot of technical material.  Your first task is to bring to class your questions on that material.
Also think about the issues Battelle raises in the part of Chapter 7 that I have assigned below.  Does Google have any obligation to have search results change "gracefully" as its ranking algorithm changes?  Where is the line between spamming search engine results (bad) and presenting your page effectively to obtain a good ranking (good)?

Written assignment due this week:  Please visit the Assignment 2 page (pdf).


Reading for discussion today:
*Battelle, Chapter 7 through "The AdWords Connection."  You don't need to have read Chapters 5 and 6 to read this part of Chapter 7, which deals with consistency of Google ranking and "Search Engine Optimization" (SEO).  The rest of the chapter deals more generally with the economics of search.  We are not yet ready to discuss the economics, but you are welcome to read ahead.  You'll want Chapters 5 and 6 to read beyond "The AdWords Connection" in Chapter 7.

There are many, many articles on how to (or not to) improve the visibility of your site in Google (and other search engines).  Here is a sampling, including some from Google itself:

* Google Loves Transparent Links & Hit Counter Spam, an interesting article in Search Engine Journal.  I only just discovered this site, but it seems useful.
* Unnatural Linking Patterns And Search Engine Rankings  From the site searchenginepromotionhelp.com, an SEO site.
* Google's Webmaster Help Center - Webmaster Guidelines   Discusses good versus bad behavior with respect to improving your site's ranking.  See in particular the section "Quality Guidelines".
* Google's Webmaster Help Center - What's an SEO? Does Google recommend working with companies that offer to make my site Google-friendly? 

References for technical material:
*(originally for week 2) Sergey Brin and Lawrence Page, Anatomy of a search engine, Proc. Intern. World-Wide Web Conference (WWW7), 1998.  This is the original public description of Google.  Amended guidance on reading this article (update from week 2 posting):  It is a technical article, and there are many details I expect you to skip; in particular, you can skip Sections 4.1 - 4.4 and the appendices.  We are scheduled to start discussing crawling and index building today (although we may not get to them), and Sections 4.1 - 4.4 are relevant to those topics.  However, these sections are terse and filled with technical jargon.  We will discuss the important points in class.  The article is of historical interest and provides a good outline of the issues of Web search even if you skip all the technical details.
*(originally for week 2) Scientific American: Feature Article: Hypersearching the Web: June 1999




last revision of content Fri Oct 12 10:26 EDT 2007
Copyright  2007,  Andrea S. LaPaugh