CS 435 schedule and assignments 09

Princeton University
Computer Science Dept.

Computer Science 435
Information Retrieval, Discovery, and Delivery

Andrea LaPaugh

Spring 2009


CHECK BACK FOR UPDATES AND ADDITIONAL READING

WEEK 1
Tues. Feb. 3: Overview of course topics and organization. Inspiration: discussion of As We May Think.
Begin classic information retrieval of text if time.

Class presentation slides (pdf): introduction; beginning of classic IR
Reading:
- Bush, Vannevar, "As We May Think," Atlantic Monthly, July 1945 (the famous memex paper): link to The Atlantic html version or ACM page containing link to pdf version.
- Introduction to Information Retrieval, Preface and Chapter 1, Introduction and sections 1.1, 1.4, 1.5.

Thurs. Feb. 5: Foundations: classic information retrieval of text.

Class presentation slides (pdf): rest of classic IR models
Reading, either of:

Introduction to Information Retrieval, Chapter 6: Introduction and sections 6.2, 6.3 and 6.4 except 6.4.4.

Note: Introduction to Information Retrieval discusses building indexes for document collections before it discusses models of documents and queries. Therefore, we are reading later chapters first. Please don't worry about the exact content of the index for now. If you prefer to read a text that treats the topic in the same order as we are going to, you can read the chapters assigned below in Modern Information Retrieval instead.

*Modern Information Retrieval Chapter 1, section 1.4; Chapter 2 sections 2.1-2.4 and 2.5.1 - 2.5.3.

Assignment 1 is now available.

WEEK 2
Tues. Feb. 10: Latent Semantic Indexing.

UPDATED class presentation slides (pdf): LSI
Reading:

Introduction to Information Retrieval Chapter 18 (note that section 18.1 is helpful background but not absolutely necessary).

Of interest:

References to Papers on LSI from Telcordia Technologies, where LSI was first developed. Includes link to Deerwester, Dumais et al.

Thurs. Feb. 12: Extended model.

Assignment 1 due today.
Class presentation slides (pdf): extended model
Assignment 2 (pdf) is now available.

WEEK 3
Tues. Feb. 17: Ranking Web pages.

Class presentation slides (pdf): link-based ranking
Reading:

Introduction to Information Retrieval, Chapter 19, Introduction, Sections 1-4.
Introduction to Information Retrieval, Chapter 21.

Also of interest, original papers:

(HITS algorithm) Kleinberg, Jon, Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, No. 5 (Sept. 1999), pp. 604-632. (Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998 and as IBM Research Report RJ 10076, May 1997.)
(PageRank algorithm) Page, Larry, Sergey Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technologies Project TR, Jan. 1998. (Early version: L. Page, PageRank: Bringing order to the web, Stanford Digital Libraries Working Paper 1997-0072, Stanford University, 1997.)

Thurs. Feb. 19: Evaluation of retrieval systems; spamming search engines

Assignment 2 (pdf) due today.
Class presentation slides (pdf) - evaluation (FINAL)
Reading:

Introduction to Information Retrieval, Chapter 8

WEEK 4
Tues. Feb. 24: Evaluation continued; indexing

Class presentation slides (pdf): index use and structure (FINAL)
Reading:

Introduction to Information Retrieval, Chapter 1, Sections 2 and 3.
Introduction to Information Retrieval, Chapter 2
Introduction to Information Retrieval, Chapter 3, Introduction and Section 1.

Assignment 3 is now available.

Thurs. Feb. 26: Indexing continued.

Project proposal due today.
Class presentation slides: see Feb. 24.
Reading

Introduction to Information Retrieval, Chapter 7, Section 7.1, except 7.1.6, and 7.2.

WEEK 5
Tues. March 3: Index construction

Assignment 3 due today.
Class presentation slides (pdf): index construction, midterm exam topics
Reading

Introduction to Information Retrieval, Chapter 4

Thurs. March 5: Remarks on index construction and query evaluation. Index compression. Exam review.

Class presentation: summary of board presentation on compression (pdf)
Reading

Introduction to Information Retrieval, Chapter 5: Sections 1 and 2.

Also of interest

Chris Anderson, The Long Tail, Wired, October 2004 (link is to updated version - Dec. 14, 2004).

WEEK 6

Take-home EXAM 1: DISTRIBUTED end of class Tuesday, March 10. DUE beginning of class Thursday, March 12.

Tues. March 10: Index compression, continued.

Class presentation: summary posted 3/5/09.
Reading

Introduction to Information Retrieval, Chapter 5: Section 3

Also of interest

Description of bit-level variable-length encodings of positive integers (Elias gamma-code and delta-code and Golomb code covered in class) in *Modern Information Retrieval Section 7.4.5.
A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems, Vol. 14, No. 4 (Oct. 1996), pp. 349-379.
The Anatomy of a Large-Scale Hypertextual Web Search Engine, (pdf from Stanford publications collection) Brin, Sergey and Page, Lawrence, Proceedings of the Seventh International WWW Conference (WWW 7), 1998.

Thurs. March 12: Final remarks on index compression. Distributed computation for index building and query execution.

Class presentation slides (pdf): distributed query execution and index building
Reading:

Skip pointers are covered in Section 2.3, assigned earlier.
Distributed indexing is covered briefly in Section 4.4, assigned earlier.

Also of interest

Web Search for a Planet: The Google Cluster Architecture, Luiz Barroso, Jeffrey Dean, and Urs Hölzle, In IEEE Micro, Vol. 23, No. 2, pages 22-28, March, 2003.
MapReduce: simplified data processing on large clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, 51(1), Jan. 2008. (Special 50th Anniversary issue: Breakthrough research: a preview of things to come.)

Spring break

WEEK 7
Tues. March 24: Search refinement; using user behavior

Class presentation slides (pdf): search refinement and recommendation - final.
Reading

Introduction to Information Retrieval, Chapter 9

Also of interest:

Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Gediminas Adomavicius, Alexander Tuzhilin; IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, June 2005.
The following book contains several relevant chapters. The chapters are available as pdf files to members of the Princeton University community by accessing them from the princeton.edu domain:

The Adaptive Web; P. Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer Science book series Vol 4321, Springer, 2007.

The following chapter in The Adaptive Web is of particular interest:

Personalized Search on the World Wide Web; A. Micarelli, F. Gasparetti, F. Sciarrone and S. Gauch, Chapter 6.

Thurs. March 26: Clustering

Class presentation slides (pdf): clustering: intro and K-means algorithm
Reading

Introduction to Information Retrieval, Chapter 16, Introduction and Sections 16.1, 16.2, and 16.4.
Introduction to Information Retrieval, Chapter 17, Introduction and Sections 17.1, 17.2, and 17.6.

Assignment 4 is now available.

WEEK 8
Tues. March 31: Clustering continued

Class presentation slides (pdf): general clustering algorithms
Reading

Introduction to Information Retrieval, Chapter 17, Sections 17.3, 17.4, and 17.8.

Also of interest:

Introduction to Information Retrieval, Chapter 16, Section 16.3 is recommended if you are going to read research papers on clustering. We will touch on external evaluation criteria very briefly.
Introduction to Information Retrieval, Chapter 17, Sections 17.5 and 17.7 are recommended but not required.

Thurs. April 2: Semi-structured information and XML

Assignment 4 due today.
Class presentation slides (pdf): XML
The XML mark-up of Hamlet was done by Jon Bosak and can be found at http://www.cafeconleche.org/examples/shakespeare.
Reading

Introduction to Information Retrieval, Chapter 10

Also of interest

An XQuery Sandbox example tool can be found on the eXist Project Web site. The eXist Project is centered around eXist-db, which is (in their words) "an open source database management system entirely built on XML technology."

WEEK 9

Assignment 5 (pdf) is now available.

Tues. April 7: Detecting near-duplicate documents

Class presentation slides (pdf): detecting near-duplicate documents
Reading

Introduction to Information Retrieval, Chapter 19, Section 19.6

Thurs. April 9: Crawling the Web

Class presentation slides (pdf): crawling the Web
Reading

Introduction to Information Retrieval, Chapter 20, Sections 20.1-20.3.

WEEK 10

Tues. April 14: Privacy in Information Retrieval. Guest speaker Joe Calandrino

Assignment 5 (pdf) due today.

Thurs. April 16: Student presentations; Characteristics of the changing Web

Class presentation slides (pdf): Web characteristics
Also of interest - papers summarized in the presentation on characteristics of the Web:

What's New on the Web? The Evolution of the Web from a Search Engine Perspective, A. Ntoulas, J. Cho, and C. Olston, Intern. World Wide Web Conf. (WWW), ACM, 2004.
Estimating the Change of Web Pages, Sung Jin Kim and Sang Ho Lee, Intern. Conf. Computational Science (ICCS), Springer, 2007.
A large-scale study of the evolution of Web pages, D. Fetterly, M. Manasse, M. Najork and J. L. Wiener, Software: Practice and Experience, 34:213–237 (2004), Wiley.
Changes in Webpage Structure over Time (pdf via ftp), M. Dontcheva, S. M. Drucker, D. Salesin, M. F. Cohen, U. Washington CSE Technical Report (TR2007-04-02), April 2007.
Recrawl Scheduling Based on Information Longevity, Christopher Olston and Sandeep Pandey, Intern. World Wide Web Conf. (WWW), 2008. (pdf here)

WEEK 11
Tues. April 21: Student presentations

Thurs. April 23: Student presentations

WEEK 12
Take-home EXAM 2: DISTRIBUTED end of class Tuesday, April 28. DUE beginning of class Thursday, April 30.

Tues. April 28: Non-text retrieval: image retrieval

Also of interest - references for today:

Multimedia IR: Indexing and Searching, Christos Faloutsos, Chapter 12 in *Modern Information Retrieval. Includes a discussion of characterizing images using color histograms.
Image Similarity Search with Compact Data Structures, Qin Lv, Moses Charikar, and Kai Li. 13th Conf. on Information and Knowledge Management (CIKM), ACM, Nov. 2004. (Reports part of Princeton CASS project.)
PageRank for Product Image Search, Yushi Jing and Shumeet Baluja, Intern. World Wide Web Conf. (WWW), 2008. (pdf here)
Web site: Princeton CASS: Content-Aware Search Systems - Demo of image "search by example"
Web site: Retrievr search images by sketch (or example).
Web site: Google Labs demo site for Similar Images
Web site: Marsyas music genre meter demo by George Tzanetakis, 2001, on YouTube

Thurs. April 30: Searching the Deep Web; wrap-up

Class presentation slides (pdf): Overview of Deep Web search and wrap-up
Also of interest - references for today:

Harnessing the Deep Web: Present and Future (pdf), Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Halevy, 4th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2009.
Searching the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10 (Oct. 2008), pages 14-15.
Google's Deep-Web Crawl (pdf), Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y. Halevy, 34th Intern. Conf. on Very Large Data Bases, VLDB Endowment, Aug. 2008.
Accessing the deep web, Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May 2007), pages 94-101.

* on reserve in the Engineering Library
