Please help keep this list
useful! Suggest free, stable, preferably open-source
software and free, open data sets. It is best if you can
give a recommendation from personal experience, but we will take
other suggestions subject to further exploration. Thanks!
Here are some data sets of possible value. Some have been used
in past COS435 projects.
IMPORTANT NOTE: Other
members of the faculty have data sets that they are willing to
share. If you need something - either a specific data set or a
specific kind of data - ask, and we'll see if it is available in the
Learning Archive - University of California at
Irvine data sets, primarily for data mining tasks, but also
useful for other information analysis/search tasks. 235
data sets as of February 2013. Some example data sets:
NSF Research Awards Abstracts 1990-2003
OpinRank Review Data set including car reviews for model
years 2007-2009 and hotel reviews for 10 cities.
LETOR data set: From Microsoft. The site says "a package
of benchmark data sets for research on LEarning TO Rank. This
data set contains standard features, relevance judgments, data
partitioning, evaluation tools, and several baselines, for the
OHSUMED data collection and the '.gov' data collection."
"WordNet® is a large lexical database of English. Nouns, verbs,
adjectives and adverbs are grouped into sets of cognitive
synonyms (synsets), each expressing a distinct concept."
Amazon Web Services (AWS) Public Data Sets: "a centralized
repository of public data sets that can be
seamlessly integrated into AWS cloud-based applications".
Examples: Sloan Digital Sky Survey, Google
Books Ngrams, Million Song Dataset
and many others. Of special interest:
The Common Crawl
corpus - a corpus of "over 5 billion web pages"
that is updated regularly. (Common Crawl is
self-described as "a non-profit foundation dedicated to
providing an open repository of web crawl data that can be
accessed and analyzed by everyone".)
data for "Measuring User Influence in Twitter: The Million Follower
Fallacy" by Cha, Haddadi, Benevenuto, and Gummadi, International
AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
utility "converts HTML documents to simple text files, by
removing all HTML tags and formatting the text according to your
preferences." (copyright Nir Sofer).
Natural Language Toolkit (NLTK):
"Python modules, linguistic data and documentation for research
and development in natural language processing, supporting
dozens of NLP tasks, with distributions for Windows, Mac OSX and
Linux." Includes access to WordNet.
"an easy to install Apache distribution containing MySQL, PHP
The Lemur Toolkit for Language
Modeling and Information Retrieval. Includes the Galago toolkit
for experimenting with text search - used in Search Engines: Information
Retrieval in Practice by Croft, Metzler and Strohman.
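
For projects that need plain text from crawled pages, the kind of conversion described for the HTML-to-text utility listed above (removing all HTML tags) can be sketched with the Python standard library alone. This is only a minimal illustration, not the utility's own implementation: the names TextExtractor and html_to_text are hypothetical, and the choice of which tags start a new line is an assumption made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; a hypothetical helper for illustration only."""

    # Contents of these elements are dropped entirely.
    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello &amp; <b>world</b></p>"))
```

A sketch like this handles simple pages; real web data often contains malformed markup, so a dedicated tool or library may still be the better choice for a large crawl.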