Computer Science 435
Information Retrieval, Discovery, and Delivery
Andrea LaPaugh
Spring 2012
EVOLVING: CHECK BACK FOR UPDATES
WEEK 1
Mon. Feb. 6: Overview of
course topics and organization. Models of information.
- Class presentation slides (pdf): Introduction
- Reading:
- Also of interest:
Wed. Feb. 8: Foundations: classic information
retrieval of text. Extending the models.
WEEK 2
Mon. Feb. 13: Ranking, classical
- Class presentation slides: continuation of slides for Feb 8
- Optional reading:
Wed. Feb. 15: Ranking, Web
- Problem Set 1 due
- Class presentation slides (pdf): link-based ranking
- Reading:
- Also of interest, original papers:
- (HITS algorithm) Kleinberg, Jon, Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, No. 5 (Sept. 1999), pp. 604-632. (Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998 and as IBM Research Report RJ 10076, May 1997.)
- (PageRank algorithm) Page, Larry, Sergey Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technologies Project TR, Jan. 1998. (Early version: L. Page, PageRank: Bringing order to the web, Stanford Digital Libraries Working Paper 1997-0072, Stanford University, 1997.)
WEEK 3
Sunday Feb. 19: Problem Set 2 is now available
Mon. Feb. 20: Evaluation
of retrieval systems
- Class presentation slides (pdf): evaluation
- Reading:
- Also of interest:
Wed. Feb. 22: Index
structure and use.
WEEK 4
Mon. Feb. 27: Index construction
Wed. Feb. 29: Index
compression
- Also of interest
- Chris Anderson, The Long Tail, Wired, October 2004 (link is to updated version - Dec. 14, 2004).
Thurs. March 1: Problem Set 3 (pdf) is now available.
WEEK 5
Mon. Mar. 5: Index compression continued
- Project proposal due today.
- Notes on compression lecture - Part 2 (pdf)
- Reading:
- Also of interest
- A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems, Vol. 14, No. 4 (Oct. 1996), pp. 349-379.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf from Stanford publications collection), Brin, Sergey and Page, Lawrence, Proceedings of the Seventh International WWW Conference (WWW 7), 1998.
Wed. Mar. 7: Distributed computation for index building and query execution.
- Problem Set 3 (pdf) due today!
- Brief outline of topics for midterm exam (pdf)
- 2011 Exam 1 (pdf)
- 2011 Exam 1 solutions (pdf)
- Class presentation slides (pdf): distributed computing for indexes
- Reading:
- Also of interest
- Web Search for a Planet: The Google Cluster Architecture, Luiz Barroso, Jeffrey Dean, and Urs Hölzle, IEEE Micro, Vol. 23, No. 2, pp. 22-28, March 2003.
- Bigtable: A Distributed Storage System for Structured Data, Fay Chang, et al., 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), 2006.
- MapReduce: simplified data processing on large clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, 51(1), Jan. 2008. (Special 50th Anniversary issue: Breakthrough research: a preview of things to come.)
- MapReduce: The programming model and practice (pdf of slides), Jerry Zhao, Jelena Pjesivac-Grbovic, SIGMETRICS'09 Tutorial, 2009.
- The Apache Hadoop project
Thurs. Mar. 8: A report on the Bing and Google
analysis for problem set 2 is now available.
Fri. Mar. 9: Solutions to problem set 3 (pdf) are now available.
WEEK 6
Mon. Mar. 12: Crawling the Web
- Class presentation slides (pdf): crawling the web
- Reading:
- Also of interest:
- *Chakrabarti, Soumen, Mining the Web: Discovering Knowledge from Hypertext Data, Chapter 2 and Chapter 8, section 8.3.1
- Intelligent Crawling on the World Wide Web with Arbitrary Predicates, Aggarwal, Al-Garawi, and Yu, Tenth International World Wide Web Conference (WWW10), 2001.
- Evaluating topic-driven web crawlers, Filippo Menczer, Gautam Pant, Padmini Srinivasan, Miguel E. Ruiz, Proc. Intern. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR Conf.), ACM, 2001, pp. 241-249.
Wed. Mar. 14: Characteristics
of the changing Web
- Class presentation slides (pdf): Web characteristics
- Also of interest - papers
summarized in the presentation on characteristics of the Web:
- The Web Changes Everything: Understanding the Dynamics of Web Content, E. Adar, J. Teevan, S. T. Dumais and J. L. Elsas, Intern. Conf. on Web Search and Data Mining (WSDM), ACM, 2009, pp. 282-291.
- Recrawl Scheduling Based on Information Longevity, Christopher Olston and Sandeep Pandey, Intern. World Wide Web Conf. (WWW), 2008. (pdf here)
- Changes in Webpage Structure over Time (pdf via ftp), M. Dontcheva, S. M. Drucker, D. Salesin, M. F. Cohen, U. Washington CSE Technical Report (TR2007-04-02), April 2007.
- What's New on the Web? The Evolution of the Web from a Search Engine Perspective, A. Ntoulas, J. Cho, and C. Olston, Intern. World Wide Web Conf. (WWW), ACM, 2004.
- A large-scale study of the evolution of Web pages, D. Fetterly, M. Manasse, M. Najork and J. L. Wiener, Software: Practice and Experience, 34:213–237 (2004), Wiley.
- Estimating the Change of Web Pages, Sung Jin Kim and Sang Ho Lee, Intern. Conf. Computational Science (ICCS), Springer, 2007.
Take-home EXAM 1: DISTRIBUTED end of class Wednesday March 14. DUE 3:00 PM Friday, March 16.
Spring break
WEEK 7
Mon. March 26: Search refinement; using user behavior
- Class presentation slides (pdf): search refinement
and recommendation methods
- Reading:
- Also of interest:
- Personalizing Web Search using Long Term Browsing History, Matthijs and Radlinski, International Conf. on Web Search and Data Mining (WSDM), ACM, 2011.
- A Large-scale Evaluation and Analysis of Personalized Search Strategies (pdf), Song and Wen, Sixteenth Intern. World Wide Web Conference (WWW2007), 2007.
- Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Gediminas Adomavicius, Alexander Tuzhilin, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, June 2005.
- The Adaptive Web, P. Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer Science book series Vol. 4321, Springer, 2007. This book contains several relevant chapters. Chapter 6: Personalized Search on the World Wide Web by A. Micarelli, F. Gasparetti, F. Sciarrone and S. Gauch is of particular interest. The chapters are available as pdf files to members of the Princeton University community by accessing them from the princeton.edu domain.
- Google: Bing Is Cheating, Copying Our Search Results by Danny Sullivan, Feb 1, 2011 at 8:45am ET and Bing Admits Using Customer Search Data, Says Google Pulled ‘Spy-Novelesque Stunt’ by Matt McGee, Feb 1, 2011 at 1:56pm ET, both on Search Engine Land.
- Netflix Awards $1 Million Prize and Starts a New Contest by Steve Lohr, New York Times' Bits Blog, Sept. 21, 2009. (The new contest was canceled due to privacy concerns.)
Wed. March 28: continuation of March 26: collaborative filtering
- Additional "also of interest" reading:
Thurs. March 29: Problem set 4 is now available
WEEK 8
Mon. April 2: Clustering
- Class presentation slides (pdf): clustering
- Reading:
Wed. April 4: Clustering continued
- Problem Set 4 due today!
- Class presentation slides (pdf): see posting under April 2
- Reading:
- Also of interest:
Thurs. April 5: Problem set 5 (pdf) is now available
WEEK 9
Project progress reports this week - meet with Professor LaPaugh.
Mon. April 9: Emergency preparedness experience
Wed. April 11: Detecting near-duplicate documents
WEEK 10
Mon. April 16: Latent Semantic Indexing
Wed. April 18: Semi-structured information and XML
WEEK 11
Mon. April 23: Non-text retrieval: image retrieval
- Problem Set 6 solutions (pdf) now available
- Class presentation slides (pdf): non-text retrieval
- Also of interest - papers
summarized in the presentation on image retrieval:
- Query by image and video content: the QBIC system, Flickner, M., et al., IEEE Computer, IEEE Computer Society, 28(9), pp. 23-32, Sept. 1995.
- Image Similarity Search with Compact Data Structures, Qin Lv, Moses Charikar, and Kai Li, 13th Conf. on Information and Knowledge Management (CIKM), ACM, Nov. 2004. (Reports part of Princeton CASS project.)
- Integrating wavelets with clustering and indexing for effective content-based image retrieval, E Yildizer, AM Balci, and TN Jarada, Knowledge-Based Systems, Vol. 13, July 2012, Elsevier, pp. 55-66.
- VisualRank: Applying PageRank to Large-Scale Image Search, Yushi Jing and Shumeet Baluja, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), pp. 1877-1890, IEEE, 2008.
- Also of interest - sites visited in demo
- Princeton CASS: Content-Aware Search Systems Demos; click on VARY image
- Tiltomo
- Google images "similar" option after doing term-based search
Wed. April 25: Issues in searching the modern Web: deep Web
- Corrected problem set 5 solutions (pdf) are now available. The original version of Problem 1 missed an optimization in the expression of consumer similarity, which affected the time to update consumer similarity.
- Brief outline of topics for second exam (pdf)
- 2011 Exam 2 (pdf)
- 2011 Exam 2 solutions (pdf)
- Class presentation slides (pdf): deep Web search
- Also of interest - references for today:
- Structured Data on the Web, Michael J. Cafarella, Alon Halevy, and Jayant Madhavan, Communications of the ACM (CACM), Vol. 54 (2), February 2011, pp. 72-79.
- Harnessing the Deep Web: Present and Future (pdf), Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Halevy, 4th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2009.
- Web-scale extraction of structured data, Michael J. Cafarella, Jayant Madhavan, and Alon Halevy, ACM SIGMOD Record, Vol. 37 (4), December 2008.
- Searching the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10 (Oct. 2008), pp. 14-15.
- Google's Deep-Web Crawl (pdf), Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y. Halevy, 34th Intern. Conf. on Very Large Data Bases, VLDB Endowment, Aug. 2008.
- Accessing the deep web, Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May 2007), pp. 94-101.
- Searching for Hidden-Web Databases, Luciano Barbosa and Juliana Freire, Proceedings of the 8th ACM SIGMOD International Workshop on Web and Databases (WebDB), pp. 1-6, ACM, 2005. (A more recent, more complicated version of the crawler is described at the 2007 WWW conf.)
Sat. April 28: Solutions to the first exam (pdf) are now available.
WEEK 12
Mon. April 30: Privacy Issues in Information Systems
- Class presentation slides (pdf): privacy
- Also of interest
- Survey Finds Facebook and Google Privacy Policies Even More Confusing Than Credit Card Bills and Government Notices, MarketWatch.com, April 24, 2012.
- Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims, Ryan Singel, Wired, Dec. 17, 2009.
- A Face Is Exposed for AOL Searcher No. 4417749, Michael Barbaro and Tom Zeller Jr., The New York Times, August 9, 2006.
- Engineering Privacy, Sarah Spiekermann and Lorrie Faith Cranor, IEEE Transactions on Software Engineering 35(1), IEEE, pp. 67-82, Jan./Feb. 2009.
- You Might Also Like: Privacy Risks of Collaborative Filtering, Calandrino, J.A., Kilzer, A., Narayanan, A., Felten, E.W., and Shmatikov, V., IEEE Sym. on Security and Privacy (SP), 2011, pp. 231-246.
- Personalization and privacy: a survey of privacy risks and remedies in personalization-based systems, Eran Toch, Yang Wang and Lorrie Faith Cranor, User Modeling and User-Adapted Interaction, Vol. 22 (1-2), Springer, 2012.
Wed. May 2: Wrap-up
- Class presentation slides (pdf): themes and future
- Also of interest
- The Semantic Web, Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American 284(5), May 2001, pp. 34-43. (Scientific American is available online through the Princeton University Library.)
Take-home EXAM 2: DISTRIBUTED end of class Wednesday May 2. DUE 5:00 PM Friday, May 4.
Project Report due 5:00 pm Dean's Date, Tuesday May 15, 2012
Project Demonstration between May 16 and May 21
* on reserve in the Engineering Library
last revised Mon Jun 11 16:00:20 EDT 2012
Copyright 2010, 2011, 2012 Andrea S. LaPaugh