Computer Science 435
Information Retrieval, Discovery, and Delivery
Andrea LaPaugh
Spring 2017
Schedule and Assignments
EVOLVING: CHECK BACK FOR UPDATES
The reading for a class should be completed before class.
WEEK 1
Mon. Feb. 6: Overview of
course topics and organization. Models of information.
Wed. Feb 8: Foundations: classic information
retrieval of text.
- Assignment 0: register on Piazza if you aren't already, and add yourself to COS 435.
- Assignment 1, Part 1
- due 2/15
- Class presentation slides (pdf): continuation of
slides posted for Feb 6
- Reading:
- Optional
Reading:
WEEK 2
Mon. Feb. 13: Extending
the models. Using links.
- Class presentation slides: link analysis for ranking
- Reading
- Also of interest, original papers:
- (HITS algorithm) Kleinberg, Jon, Authoritative
sources in a hyperlinked environment, Journal of the
ACM, Vol. 46, No. 5(Sept. 1999), pp.604-632.
(Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998 and as IBM Research Report RJ 10076, May 1997.)
- (PageRank algorithm) Page, Larry and Sergey Brin, R.
Motwani, T. Winograd, The PageRank Citation Ranking:
Bringing Order to the Web, Stanford
Digital Library Technologies Project TR, Jan. 1998.
(Early version: L. Page. PageRank: Bringing order to the web.
Stanford Digital Libraries Working Paper 1997-0072, Stanford
University, 1997. )
Wed. Feb. 15: Continuation of link analysis.
- Assignment 1, Part 2 - due Wed. 2/22
- Class presentation slides (pdf): no new slides
- Reading: the evaluation reading originally listed here is now under Feb. 22.
WEEK 3
Mon. Feb 20 class canceled
Wed. Feb. 22: Evaluation
of retrieval systems
- Assignment 2 due Wed. 3/1
- Class presentation slides (pdf): evaluation
- Reading:
- Optional Reading:
- Also of interest:
WEEK 4
Mon. Feb. 27: Index structure and use.
Wed. Mar. 1: Index
construction
- Assignment 3, due Wed. 3/8.
- Class presentation slides (pdf):
continuation of slides used Monday (2/27)
- Reading:
- Optional Reading:
- Introduction
to Information Retrieval: sections 2.1.1, 2.2.1,
2.2.3
- Details on B+ trees in Database Management Systems by Raghu Ramakrishnan and Johannes
Gehrke (Third Edition, McGraw-Hill, 2003):
Chapter 10, Sections 3-6 (pp. 344-356). Book on reserve
in Engineering Library.
WEEK 5
Mon. Mar. 6: class canceled
Wed. Mar. 8: Index compression
WEEK 6
Mon. Mar. 13: Index
compression continued
- Short project proposal due today, March 13, at 11:55 pm by Dropbox
submission. See Project page for details.
- Take-home exam out Wednesday March 15
- class notes: Compression,
Part 2 (pdf)
- Reading:
- Also of interest
- A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM
Transactions on Information Systems, Vol. 14, No. 4
(Oct. 1996), pp. 349-379.
- The Anatomy of
a Large-Scale Hypertextual Web Search Engine, (pdf from
Stanford
publications collection) Brin, Sergey and Page,
Lawrence, Proceedings of the Seventh International WWW
Conference (WWW 7), 1998.
Wed. Mar 15: Zipf's and Heaps' Laws; Web crawling overview
- class notes: Zipf's and Heaps' Laws can be found in Part 2 of
the compression summary; Web crawling
- Reading:
- Also of interest:
- Chakrabarti, Soumen, Mining
the Web: Discovering Knowledge from Hypertext Data,
Elsevier (Morgan Kaufmann Division), 2003. Chapter 2 and
Chapter 8, section 8.3.1
- Take-home midterm exam: Wednesday March 15, 2017 to Friday March 17, 2017
- The exam will look much like a
problem set. You can work on it at any time during
the two days. It is not meant to take two days.
Spring break
WEEK 7
Mon. Mar 27: Detecting
near-duplicate documents
- Class presentation slides (pdf): near-duplicate
documents
- Reading:
- Also of interest:
- Henzinger, M., Finding
Near-Duplicate Web Pages: A Large-Scale Evaluation of
Algorithms, Conf. on Research and Development
in Information Retrieval (SIGIR), 2006.
- Manku, G. S., Jain, A., Das Sarma, A., Detecting
Near-Duplicates for Web Crawling, Intern. World
Wide Web Conf. (WWW), 2007.
Wed. Mar 29: Latent Semantic Indexing; Clustering,
Part 1
WEEK 8
Mon. April 3: Clustering continued
Wed. April 5: Social Network analysis
- Recommended reading:
- Also of interest:
- An
Experimental Study of the Small World Problem, Jeffrey
Travers and Stanley Milgram, Sociometry, Vol. 32, No.
4, American Sociological Assoc. (Dec., 1969), pp. 425-443.
- Planetary-scale
views on a large instant-messaging network, Jure
Leskovec and Eric Horvitz, Proc. Intern. Conf. on World Wide
Web (WWW), ACM, 2008, pp. 915-924.
WEEK 9
Mon. April 10: Social Network
analysis cont.; Introduction to Search Refinement and
Personalization
- Class presentation slides: continuation of slides posted
for April 5.
- Reading
- Mining of Massive Data Sets
(Rajaraman, Anand; Leskovec, Jure; Ullman,
Jeffrey D.; Cambridge University Press,
2011), Chapter 10, Sections 2 and 4.
Wed. April 12: Search Refinement and Personalization cont.,
Recommender Systems
- Assignment 6 (last
assignment) - due Wed. 4/19
- Class presentation slides: search
refinement, personalization, content-based recommendations
- Reading
- Also of interest:
- Personalizing
Web Search using Long Term Browsing History, Matthijs
and Radlinski, International Conf. on Web Search and
Data Mining
(WSDM), ACM, 2011.
- Time
is of the Essence: Improving Recency Ranking Using Twitter
Data, Anlei Dong et al., Proc. Intern. Conf. on
World Wide Web (WWW), ACM, 2010, pp. 331-340.
- The Adaptive Web, P.
Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer
Science book series Vol 4321, Springer, 2007.
This book contains several relevant chapters. Chapter 6: Personalized Search on the World
Wide Web by A. Micarelli, F. Gasparetti, F. Sciarrone
and S. Gauch is of particular interest. The chapters are
available as pdf files to members of the Princeton University
community by accessing them from the princeton.edu domain.
- Google: Bing Is Cheating, Copying Our Search Results by Danny
Sullivan, Feb 1, 2011 at 8:45am ET and
Bing Admits Using Customer Search Data, Says Google Pulled
‘Spy-Novelesque Stunt’ by Matt McGee, Feb 1, 2011 at 1:56pm ET,
both on Search Engine Land.
WEEK 10
Project Progress report April 17, April 19, or April 20, 2017: See Project
page for details.
Mon. April 17: Recommender Systems continued
- Also of interest:
- An Analytical
Comparison of Approaches to Personalizing PageRank, Sep
Kamvar, Taher Haveliwala and Glen Jeh, Stanford
University Technical Report. June, 2003.
- Matrix
Factorization Techniques for Recommender Systems, Koren,
Bell and Volinsky, IEEE Computer, 42(8), August 2009,
pp. 42-49.
- Modeling
Relationships at Multiple Scales to Improve Accuracy of
Large Recommender Systems, Bell, Koren and
Volinsky, International Conf. on Knowledge Discovery and Data Mining
(KDD), ACM, 2007.
- Scalable
Collaborative Filtering with Jointly Derived
Neighborhood Interpolation Weights, Bell and
Koren, IEEE International
Conference on Data Mining, 2007.
- Netflix
Awards $1 Million Prize and Starts a New Contest by
Steve Lohr, New York Times'
Bits Blog, Sept. 21 2009. (The new contest
was canceled due to privacy concerns.)
Wed. April 19: Non-text retrieval: image retrieval
- Class presentation slides (pdf): non-text retrieval
- No required reading
- Also of interest - today's
material drawn from these references:
- Query by image and video content: the
QBIC system, Flickner, M., et al., IEEE Computer, IEEE
Computer Society, 28(9), pp. 23-32, Sept 1995.
- Image
Similarity Search with Compact Data Structures,
Qin Lv, Moses Charikar, and Kai Li. 13th Conf. on
Information and Knowledge Management (CIKM), ACM, Nov.
2004. (Reports part of Princeton CASS project.)
- Integrating wavelets with clustering
and indexing for effective content-based image
retrieval, E. Yildizer, A. M. Balci, and T. N. Jarada, Knowledge-Based Systems, Vol. 13, July
2012, Elsevier, pp. 55-66.
- VisualRank:
Applying PageRank to Large-Scale Image Search, Yushi
Jing and Shumeet Baluja, IEEE
Transactions on Pattern Analysis and Machine Intelligence,
30(11), p 1877 - 1890, IEEE, 2008.
- Also of interest - sites to
visit in demo
- Princeton CASS: Content-Aware Search Systems Demos
click on VARY image
WEEK 11
Mon. April 24: Deep Web Search
- Class presentation slides (pdf): deep web search
- No required reading
- Also of interest - today's
material drawn from these references:
- Structured
Data on the Web, Michael J. Cafarella, Alon
Halevy, and Jayant Madhavan, Communications of the ACM (CACM),
Vol 54 (2) February 2011, pp 72-79.
- Harnessing the Deep
Web: Present and Future (pdf), Jayant Madhavan,
Loredana Afanasiev, Lyublena Antova, and Alon
Halevy, 4th Biennial
Conference on Innovative Data Systems Research (CIDR),
Jan. 2009.
- Web-scale
extraction
of structured data, Michael J. Cafarella, Jayant
Madhavan, and Alon Halevy, ACM SIGMOD Record, Vol. 37
(4) December 2008.
- Searching
the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10
(Oct. 2008), pages 14-15.
- Google's
Deep-Web Crawl (pdf), Jayant Madhavan, David Ko,
Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y.
Halevy, 34th Intern. Conf. on Very Large Data
Bases, VLDB Endowment, Aug. 2008.
- Crawling deep web entity pages, Yeye He, Dong Xin, Venkatesh Ganti, Sriram
Rajaraman, and Nirav Shah, Proc. Intern. Conf. on Web Search and Data
Mining (WSDM), ACM, 2013, pp. 355-364.
- Accessing
the deep web, Bin He, Mitesh Patel, Zhen Zhang,
and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May
2007), pages 94-101.
- Searching
for Hidden-Web Databases (pdf), Luciano
Barbosa and Juliana Freire, Proceedings of the 8th ACM SIGMOD International
Workshop on Web and Databases (WebDB), pp. 1-6, ACM
2005. (A more recent, more complicated version of the
crawler is described at the 2007 WWW conf.)
- Towards web-scale structured web data extraction, Tomas Grigalis, Proc. Intern. Conf. on Web
Search and Data Mining (WSDM), ACM,
2013, pp. 753-757.
Wed. April 26: Deep Web Search continued; Semi-structured information
and XML
Second take-home exam out Wednesday April 26, 2017, due Friday April
28, 2017
WEEK 12
Mon. May 1: Distributed computation for index building and
query execution.
- Class presentation slides (pdf): distributed computing
- Reading:
- Optional reading:
- Also of interest
- Web
Search for a Planet: The Google Cluster Architecture,
Luiz Barroso, Jeffrey Dean, and Urs Hölzle, In IEEE
Micro, Vol. 23, No. 2, pages 22-28, March, 2003.
- Bigtable: A Distributed Storage System for Structured Data,
Fay Chang, et al., In 7th USENIX
Symposium on Operating Systems Design and Implementation
(OSDI '06), 2006.
- MapReduce: simplified
data processing on large clusters, Jeffrey Dean and Sanjay
Ghemawat, Communications
of the ACM, 51(1),
Jan. 2008. (Special 50th Anniversary
issue: Breakthrough
research: a preview of things to come.)
- The Apache Hadoop project
Wed. May 3: Continuation of
Distributed Computing; Closing
remarks
- Class presentation slides (pdf): see slides for May 1 for
distributed computing; closing remarks
- No required reading
- Reading of interest
- Trains of
Thought: Generating Information Maps, Dafna Shahaf,
Carlos Guestrin, and Eric Horvitz, Intern. World Wide Web
Conf. (WWW), ACM, 2012, pp. 899-908.
- We Feel Fine and Searching the Emotional
Web, Sepandar D. Kamvar and Jonathan Harris, Proc.
of the Intern. Conf. on Web Search and Data Mining
(WSDM), ACM, 2011, pp. 117-126.
- We Feel
Fine: An Almanac of Human Emotion by Sep Kamvar and
Jonathan Harris, video on YouTube. See also the book We Feel Fine: An Almanac of
Human Emotion by Sep Kamvar & Jonathan Harris,
Scribner, Dec. 2009.
- Netflix Spilled Your Brokeback Mountain Secret, Lawsuit
Claims, Ryan Singel, Wired,
Dec. 17, 2009.
- A
Face Is Exposed for AOL Searcher No. 4417749, Michael
Barbaro and Tom Zeller Jr., The
New York Times, August 9, 2006
- '10 Concerts' Facebook Meme May Reveal More Than Musical Tastes,
Christopher Mele and Daniel Victor, The New York Times, April 28, 2017
- The
Internet of mess things, opinion by Steven J.
Vaughan-Nichols, Computerworld, May 3, 2017
- Google
says the internet of things' smarts will save energy,
Cade Metz, Wired, April 29, 2015.
- Internet
Usage Statistics, Internet World Stats
READING PERIOD
Project Report due 5:00 pm Dean's Date, Tuesday May 16, 2017
by CS Dropbox submission.
EXAM PERIOD
Project Demonstration between Wednesday May 17 and Friday May 19.
Individual team appointments. See Project
page for details.
last revised Mon May 8 17:10:07 EDT 2017
Copyright 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Andrea S. LaPaugh