====== Homework 2 ======

**Due** Tuesday, March 9th Tuesday, March 23rd

==== Questions ====

Download {{hw2.pdf|the questions here}}.

==== Data files ====

  * The [[cos424>data/reuters21578.tar.gz|Reuters21578]] dataset. Also available from the [[http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection|UCI Machine Learning Repository]].
  * The C source code of a [[cos424>data/porter_stemmer.c|Porter Stemmer]]. You can find implementations in other languages [[http://tartarus.org/~martin/PorterStemmer/|here]].
  * A [[cos424>data/stopwords.txt|list of English stop words]].

==== Software ====

  * Question 3 uses [[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear]], a good package for classification with linear models. The LibLinear web site has a lot of information, and a local copy of the LibLinear source code is available [[cos424>data/liblinear-1.51.tar.gz|here]]. The archive contains a useful ''README'' file.
  * Important new information (3/17): The homework asks for ROC curves for models trained with options ''-s 0'' and ''-s 3''. However, the LibLinear software only outputs probabilities for the log loss (''-s 0''). To obtain scores for the hinge loss (''-s 3''), you need to use this {{predict.c.txt|modified version of the file "predict.c"}}. Note that these scores are not probabilities: they are simply the output of the discriminant function, positive for one class and negative for the other.

==== Solutions ====

Download {{hw2solutions.pdf|the solutions here}}.
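
A note on the ROC curves mentioned under Software: an ROC curve depends only on how the scores rank the examples, so the raw discriminant values from the hinge-loss model can be used in the same way as the probabilities from the log-loss model. The sketch below illustrates this in Python. It is not part of the homework or of LibLinear; the function names ''roc_points'' and ''auc'' are hypothetical, and it assumes that labels are coded as +1 and -1 and that a larger score means "more likely +1". Check which class your scores are positive for before using it.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (false positive rate, true positive rate).

    scores: real-valued outputs (probabilities or discriminant values).
    labels: +1 / -1 class labels (an assumed coding, not LibLinear's).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = [2.1, 0.7, -0.3, -1.5]
    labels = [1, -1, 1, -1]
    curve = roc_points(scores, labels)
    print(curve)
    print(auc(curve))
```

Because only the ranking matters, applying any strictly increasing transformation to the scores (for example, squashing them through a sigmoid) leaves the curve unchanged.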