% No 'submit' option for the problems by themselves.
\documentclass{cos302}
% Use the 'submit' option when you submit your solutions.
%\documentclass[submit]{cos302}
% Put in your full name and email address.
\name{Your Name}
\email{email@princeton.edu}
\discussants{Marie Curie \\ Leonhard Euler}
% You don't need to change these.
\course{COS302-S21}
\assignment{Assignment \#5}
\duedate{6:00pm March 19, 2021}
\dropboxurl{https://www.gradescope.com/courses/214271/assignments/1063589}
\usepackage{bm} % for bold math
\usepackage{mathtools} % colon-equals
\begin{document}
% IMPORTANT: Uncomment the \documentclass command above with the [submit] option and put your name where it says ``your name''.
\begin{center}
Assignments in COS 302 should be done individually. See the \href{https://www.cs.princeton.edu/courses/archive/spring21/cos302/files/syllabus.pdf}{course syllabus} for the collaboration policy.
\vspace{\baselineskip}
\textcolor{red}{%
Remember to append your Colab PDF as explained in the first homework, with all outputs visible.\\
When you print to PDF it may be helpful to scale at 95\% or so to get everything on the page.}
\end{center}
\begin{problem}[10pts]
We sometimes need to introduce norms for matrices.
This comes up when we want to talk about the magnitude of a matrix, or when we need a notion of distance between matrices.
One important matrix norm is the Frobenius norm, which can be written in several ways for a matrix~$\bm{A}\in\mathbb{R}^{m \times n}$:
\begin{align*}
\|\bm{A}\|_F &= \sqrt{\sum_{i=1}^m\sum_{j=1}^n A_{i,j}^2}
= \sqrt{\text{trace}(\bm{A}^{T}\bm{A})}
\end{align*}
\begin{enumerate}[(A)]
\item Show that the Frobenius norm is also the square root of the sum of the squared singular values.
\item Assume that~$\bm{A}$ is square and invertible, with a very small Frobenius norm.
What kind of value would you expect to get for the Frobenius norm of~$\bm{A}^{-1}$?
\end{enumerate}
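As a quick aside, the identity in part (A) is easy to sanity-check numerically. This sketch (not part of what you submit) assumes NumPy and uses a random test matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

# Frobenius norm computed directly from the entries.
fro = np.sqrt((A**2).sum())

# Singular values of A.
s = np.linalg.svd(A, compute_uv=False)

# The entry-wise definition, NumPy's built-in 'fro' norm, and the
# square root of the sum of squared singular values should all agree.
fro_from_svd = np.sqrt((s**2).sum())
fro_builtin = np.linalg.norm(A, 'fro')
```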
\end{problem}
\newpage
\begin{problem}[28pts]
In this problem, you'll look at singular value decomposition as a way to find a low-rank approximation to a matrix.
\begin{enumerate}[(A)]
\item Upload any (tasteful) image you want to Colab.
Note that life is a little easier if you put it in your Google drive and access it from there so you don't have to constantly re-upload it every time you come back after a break.
Load the image as a NumPy array.
There are various ways to do this, but the easiest may be to use \href{https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imread.html}{\texttt{imread}} from matplotlib.
There is a useful tutorial \href{https://matplotlib.org/tutorials/introductory/images.html}{here} also.
Once you have a NumPy array, look at its shape: it is probably $\textit{height} \times \textit{width} \times 3$ or something to that effect, with the third dimension usually reflecting red, green, and blue channels.
Make the image grayscale by taking the mean across this third dimension.
This should make it into a $\textit{height} \times \textit{width}$ array.
Use \texttt{imshow} as in previous homeworks to render the grayscale image.
\item Import \texttt{numpy.linalg} and use the \texttt{svd} function to compute the singular value decomposition of the image matrix.
This will return three things: a~$\bm{U}$ matrix, a vector containing the singular values, and a~$\bm{V}^T$ matrix.
Print the shapes of these arrays, and then figure out how to ``reassemble'' these three arrays with multiplication to reconstruct the image.
Render the reconstruction and verify that it looks like the original image.
\item Plot the \href{https://docs.scipy.org/doc/numpy/reference/generated/numpy.cumsum.html}{cumulative sum} of the singular values.
Sum them from largest to smallest.
\item Create three different low-rank approximations to your image.
Create one of rank 5, one of rank 10, and one of rank 25.
(Hint: you can do this by reconstructing the image matrix as above, but with zeros for all the singular values with index larger than the rank of your approximation.)
Render each of the three images.
\item Create three new matrices, each with the pixel-wise squared difference between the original image and the low-rank approximation.
Render these as images and qualitatively describe what kind of visual structure seems to be lost in the low-rank approximations.
\end{enumerate}
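A minimal sketch of the reconstruction steps in parts (B) and (D), using a random array as a stand-in for a real loaded image (the array shape and variable names here are illustrative, not prescribed):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((60, 80, 3))   # stand-in for an RGB image from imread

# Grayscale: average over the third (channel) dimension.
gray = img.mean(axis=2)

# SVD returns U, the singular values, and V^T.
U, s, Vt = np.linalg.svd(gray, full_matrices=False)

# Full reconstruction: U diag(s) V^T should recover the image.
recon = U @ np.diag(s) @ Vt

# Rank-k approximation: keep only the first k singular values.
k = 5
low_rank = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```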
\end{problem}
% \newpage
\begin{problem}[30pts]
In this problem you will use SVD to model a text corpus, a small subset of New York Times articles.
You will use a ``bag of words'' representation in which documents are represented by the counts of words, usually excluding very common ``stop words'' like \emph{the} and \emph{and}.
Download \href{https://www.cs.princeton.edu/courses/archive/spring21/cos302/files/nyt.pkl.gz}{\texttt{nyt.pkl.gz}} and upload it to your Google drive so you don't have to upload it every time you open the Colab; it's a fairly big file.
Here's some code to get you going:
\begin{lstlisting}[style=python]
import pickle as pkl
import numpy as np
import gzip
filename = 'drive/My Drive/COS 302/nyt.pkl.gz'
with gzip.open(filename, 'rb') as fh:
    nyt = pkl.load(fh)
documents = nyt['docs']
vocab = nyt['vocab']
# Create reverse lookup table.
vocab_indices = dict([(w, i) for (i, w) in enumerate(vocab)])
M = len(documents)
N = len(vocab)
print('%d documents, %d words' % (M,N))
count_mat = np.zeros((M,N))
for mm, doc in enumerate(documents):
    for word, count in doc.items():
        count_mat[mm, vocab_indices[word]] = count
\end{lstlisting}
\begin{enumerate}[(A)]
\item Typically, raw counts don't lead to discovery of interesting structure.
Instead, it is common to use something like \href{https://en.wikipedia.org/wiki/Tf-idf}{TF-IDF}, which stands for \emph{term frequency-inverse document frequency}.
Term frequency is the number of times word~$n$ appeared in document~$m$, divided by the total number of words in document~$m$.
\begin{align*}
{\rm tf}_{m,n} &= \frac{\text{\# times word $n$ appears in doc $m$}}{\text{total \# of words in doc $m$}}
\end{align*}
Transform \texttt{count\_mat} from the code above into a term frequency matrix.
\item Inverse document frequency is typically the natural log of the number of documents, divided by the number of documents in which word~$m$ appears.
(The plus-one ensures that you're not dividing by zero.)
\begin{align*}
{\rm idf}_{n} &= \log\frac{ \text{total \# of documents }}{ 1 + \text{\# of documents with word $n$}}
\end{align*}
Compute the idf vector from \texttt{count\_mat}.
\item Now compute the TF-IDF matrix by multiplying (broadcasting) ${\rm tf}_{m,n}$ and~${\rm idf}_n$.
Use \href{https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html}{\texttt{numpy.linalg.svd}} to take the SVD of this TF-IDF matrix; it may take a minute since the matrix is relatively large.
Plot the singular values in decreasing order.
\item The right singular vectors (columns of the~$\bm{V}$ matrix) will ideally represent interesting topical dimensions.
For each of the top 20 right singular vectors: identify the words in the vocabulary that have the largest entries in the vector and print them.
That will probably mean looping over the first 20 right singular vectors, doing an appropriate \texttt{argsort} and then finding that entry in the \texttt{vocab} variable.
Do you see any interesting qualitative structure in these groups of words?
\end{enumerate}
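The transformations in parts (A)--(C) can be sketched on a toy count matrix; the tiny array below and the names \texttt{tf}, \texttt{idf}, and \texttt{tfidf} are illustrative assumptions, not part of the dataset:

```python
import numpy as np

# Toy counts: 2 documents, 3 vocabulary words.
count_mat = np.array([[2., 0., 1.],
                      [0., 3., 1.]])
M, N = count_mat.shape

# Term frequency: divide each row by that document's total word count.
tf = count_mat / count_mat.sum(axis=1, keepdims=True)

# Inverse document frequency: log of (total docs) over
# (1 + number of docs containing each word).
idf = np.log(M / (1.0 + (count_mat > 0).sum(axis=0)))

# TF-IDF via broadcasting an (M, N) matrix against an (N,) vector.
tfidf = tf * idf
```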
\end{problem}
\newpage
\begin{problem}[30pts]
In various data analysis problems, it is often easier to reason about the pairwise distances between data points than about features of the data directly.
This comes up particularly when dealing with discrete data for which there are no natural feature vectors, but perhaps there is a sensible concept of distance.
A prime example is strings, in which there are various sensible \href{https://en.wikipedia.org/wiki/Edit_distance}{edit distances}.
We previously examined principal component analysis (PCA) as a way to find low-dimensional representations of data, but if you only have distances, vanilla PCA doesn't apply.
Instead, the classic approach is to use principal \emph{coordinates} analysis, also called \href{https://en.wikipedia.org/wiki/Multidimensional_scaling}{multidimensional scaling} (MDS) to map the data to~$\mathbb{R}^d$ in such a way that the pairwise distances are approximately preserved.
That is, you are given a matrix of squared distances~$\bm{D}\in\mathbb{R}^{n\times n}$, and your goal is to discover reasonable locations~$\{\bm{x}_i\}^n_{i=1}$, say with~$\bm{x}_i\in\mathbb{R}^2$, so that~${D_{ij}\approx ||\bm{x}_i-\bm{x}_j||^2_2}$.
The details of MDS are beyond the scope of this course, but the steps are straightforward applications of tools that you've been learning to use in COS 302.
\begin{enumerate}[(1)]
\item Compute a squared distance matrix~$\bm{D}$.
\item Subtract off the column-wise means and then the row-wise means of~$\bm{D}$, i.e., perform ``double centering''. Be sure to compute the row-wise means \emph{after} you've subtracted the column-wise means.
\item Compute the largest eigenvalues~$\lambda_1, \lambda_2, \ldots$ and associated eigenvectors~$\bm{v}_1, \bm{v}_2,\ldots$ of~$-\frac{1}{2}\bm{D}$.
\item The~$j$th entry of the vector~$\sqrt{\lambda_i}\bm{v}_i$ can now be used as the~$i$th coordinate of~$\bm{x}_j$.
That is, for mapping data into~$\mathbb{R}^2$, you would form an~$n \times 2$ matrix
$\bm{X} = \begin{bmatrix}
\sqrt{\lambda_1}\bm{v}_1 & \sqrt{\lambda_2}\bm{v}_2
\end{bmatrix}$ whose rows are locations to embed each datum.
\end{enumerate}
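On synthetic data, the four steps above can be sketched as follows (the random planar points are an assumption of this sketch; with exact squared Euclidean distances from 2-D points, classical MDS recovers the configuration up to rotation):

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((10, 2))   # hidden 2-D locations (for the sanity check)

# Step 1: squared distance matrix D.
diff = pts[:, None, :] - pts[None, :, :]
D = (diff**2).sum(axis=2)

# Step 2: double centering -- column-wise means first, then row-wise.
B = D - D.mean(axis=0, keepdims=True)
B = B - B.mean(axis=1, keepdims=True)

# Step 3: eigendecomposition of -(1/2) of the centered matrix.
vals, vecs = np.linalg.eigh(-0.5 * B)

# Step 4: top-2 eigenpairs give the n x 2 embedding.
top = np.argsort(vals)[::-1][:2]
X = vecs[:, top] * np.sqrt(vals[top])

# Pairwise squared distances of the embedding.
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
```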
In this problem, you're going to take a list of strings, find edit distances between them all, and then make a visualization of them in~$\mathbb{R}^2$.
By default, you can use \href{https://www.cs.princeton.edu/courses/archive/spring21/cos302/files/dog_names1000.txt}{\texttt{dog\_names1000.txt}}, a list of dog names taken from a random subset of \href{https://catalog.data.gov/dataset/dog-names}{registered dogs in Anchorage, Alaska}; you can use any list of strings you want, however, as long as it has at least 500 or so entries: lists of cities, street names, \href{https://github.com/OpenJarbas/metal_dataset}{metal bands}, etc.
You'll need to figure out how to load the strings from the file into a list, but then the function below will compute a squared distance matrix for you.
\begin{lstlisting}[style=python]
import numpy as np
import editdistance
def sq_distances(names):
    '''Takes a list of strings. Returns squared edit distances.'''
    N = len(names)
    sq_dists = np.zeros((N,N))
    for ii, name1 in enumerate(names):
        for jj, name2 in enumerate(names[:ii]):
            sq_dists[ii,jj] = (editdistance.eval(name1, name2) \
                               / np.maximum(len(name1), len(name2)))**2
            sq_dists[jj,ii] = sq_dists[ii,jj]
    return sq_dists
\end{lstlisting}
Implement the MDS procedure above to put your strings into locations in~$\mathbb{R}^2$ and then plot them.
Assuming you have an~$n \times 2$ matrix \texttt{X}, the code below should get you started making the figure.
\begin{lstlisting}[style=python]
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plt.plot(X[:,0], X[:,1], '.')
for ii in range(n):
    plt.text(X[ii,0], X[ii,1], names[ii], fontsize=10)
plt.show()
\end{lstlisting}
\end{problem}
\newpage
\begin{problem}[2pts]
Approximately how many hours did this assignment take you to complete?
\end{problem}
My notebook URL:
{\small \url{https://colab.research.google.com/XXXXXXXXXXXXXXXXXXXXXXX}}
\subsection*{Changelog}
\begin{itemize}
\item 1 Mar 2021 -- S21 initial version.
\end{itemize}
% \includepdf[pages=-]{mynotebook.pdf}
\end{document}