Please help keep this list
useful! Suggest free, stable, preferably open-source
software and free, open data sets. It is best if you can
give a recommendation from personal experience, but we will take
other suggestions subject to further exploration. Thanks!
Data Sets
Here are some data sets of possible value. Some have been used
in past COS435 projects.
IMPORTANT NOTE: Other
members of the faculty have data sets that they are willing to
share. If you need something - either a specific data set or a
specific kind of data - ask, and we'll see if it is available in the
department.
UCI Machine
Learning Archive - University of California at
Irvine data sets, primarily for data mining tasks, but also
useful for other information analysis/search tasks. 235
data sets as of February 2013. Some example data sets:
NSF Research Awards Abstracts 1990-2003
OpinRank Review Dataset including car reviews for model
years 2007-2009 and hotel reviews for 10 cities.
LETOR
data set: From Microsoft. The site says "a package
of benchmark data sets for research on LEarning TO Rank. This
dataset contains standard features, relevance judgments, data
partitioning, evaluation tools, and several baselines, for the
OHSUMED data collection and the '.gov' data collection."
Amazon
Web Services (AWS) Public Data Sets: "a centralized
repository of public data sets that can be
seamlessly integrated into AWS cloud-based applications."
Examples: Sloan Digital Sky Survey, a crawl of 5 billion
Web pages (60 TB!). Many others.
4
universities data set: from CMU. CS
department Web pages from various universities, hand-classified
into 7 categories.
Facebook social graph data from the
Online Social Networks Project at UC Irvine.
(No experience with these datasets.)
Two sources of the social graph for a sample of Twitter users:
Data for "What is Twitter, a Social Network or a News Media?"
by Kwak, Lee, Park, and Moon, International World Wide Web
Conference (WWW), 2010. See
http://an.kaist.ac.kr/traces/WWW2010.html
Data for "Measuring User Influence in Twitter: The Million Follower
Fallacy" by Cha, Haddadi, Benevenuto, and Gummadi, International
AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
See http://twitter.mpi-sws.org/
Dr. Kevin Chai,
research fellow for the Centre for
Health Informatics at the Australian Institute of Health
Innovation, University of New South Wales, has compiled a
list of data sets and data set directories. The list is general
- not specific to health informatics. The list overlaps
with ours above, but includes many not listed above. We
have not checked every data set in this list.
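For anyone planning to use the LETOR data listed above, here is a minimal sketch of reading ranking data into Python and grouping it by query. It assumes the SVMlight-style line layout commonly used for learning-to-rank files (a relevance label, a "qid:" field, then "index:value" feature pairs and an optional "#" comment); check the documentation in the LETOR package before relying on this. The names RankingExample, parse_line and group_by_query are made up for illustration, and the sample lines are invented, not taken from the data set.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class RankingExample(NamedTuple):
    relevance: int              # relevance judgment for one query-document pair
    query_id: str               # groups documents that belong to the same query
    features: Dict[int, float]  # feature index -> feature value


def parse_line(line: str) -> RankingExample:
    """Parse one line of the ASSUMED form '<rel> qid:<id> <k>:<v> ... # comment'."""
    tokens = line.split("#", 1)[0].split()  # drop trailing comment, then tokenize
    relevance = int(tokens[0])
    query_id = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return RankingExample(relevance, query_id, features)


def group_by_query(lines: Iterable[str]) -> Dict[str, List[RankingExample]]:
    """Collect parsed examples under their query id, skipping blank/comment lines."""
    groups = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        example = parse_line(stripped)
        groups[example.query_id].append(example)
    return dict(groups)


# Invented sample lines for illustration only.
sample = [
    "2 qid:1 1:0.5 2:0.0 3:1.0 # docid = A",
    "0 qid:1 1:0.1 2:0.3 3:0.0 # docid = B",
    "1 qid:2 1:0.9 2:0.2 3:0.4 # docid = C",
]
groups = group_by_query(sample)
for qid, examples in groups.items():
    print(qid, [ex.relevance for ex in examples])
```

To read a real file, pass an open file object to group_by_query in place of the sample list. Grouping by query matters because ranking measures are computed per query and then averaged.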
Software
Here are some free (as far as I know) software tools of possible
value. Some have been used in past COS435 projects.
Natural Language Toolkit (NLTK):
From the site: "Open source
Python modules, linguistic data and documentation for research
and development in natural language processing, supporting
dozens of NLP tasks, with distributions for Windows, Mac OSX and
Linux." Includes access to WordNet.