\documentclass[11pt]{article}
\usepackage{amssymb,amsmath,amsthm,url}
%uncomment to get hyperlinks
%\usepackage{hyperref}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Some macros (you can ignore everything until "end of macros")
\topmargin 0pt \advance \topmargin by -\headheight \advance
\topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt \evensidemargin \oddsidemargin \marginparwidth
0.5in
\textwidth 6.5in
%%%%%%
\newcommand{\getsr}{\gets_{\mbox{\tiny R}}}
\newcommand{\bits}{\{0,1\}}
\newcommand{\bit}{\{0,1\}}
\newcommand{\Ex}{\mathbb{E}}
\newcommand{\eqdef}{\stackrel{def}{=}}
\newcommand{\To}{\rightarrow}
\newcommand{\e}{\epsilon}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newtheorem{theorem}{Theorem}
\theoremstyle{definition}
\newtheorem{exercise}{Exercise}
% end of macros
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{Handout 1: Mathematical Background}
\author{Boaz Barak}
\begin{document}
\maketitle
This is a brief review of some mathematical tools, especially
probability theory, that we will use. This material is mostly
covered in discrete math (COS 340/341) but is also taught in many
other courses.
Some good sources for this material are the lecture notes by
Papadimitriou and Vazirani (see the home page of Umesh Vazirani) and by
Lehman and Leighton (see the home page of Eric Lehman; Chapters 18
to 24 are particularly relevant). The mathematical tool we use
most often is discrete probability. The ``Probabilistic Method''
book by Alon and Spencer is a great resource in this area. Also,
the books of Mitzenmacher and Upfal and of Motwani and Raghavan
cover probability from a more algorithmic perspective.
Although knowledge of algorithms is not strictly necessary, it
would be quite useful. Good books are (1) Cormen, Leiserson,
Rivest and Stein, (2) Dasgupta, Papadimitriou and Vazirani, and (3)
Kleinberg and Tardos. We do not require prior knowledge of
computability, but some basic familiarity could be useful: Sipser
(Introduction to the Theory of Computation) is a great source. Victor
Shoup's book (A Computational Introduction to Number Theory and
Algebra) is a great source for the number theory we'll need (and
much more!).
\section{Mathematical Proofs} \label{app:sec:proofs}
Perhaps \emph{the} mathematical prerequisite needed for this
course is a certain level of comfort with mathematical proofs.
While in everyday life we might use ``proof'' to describe a
fairly convincing argument, in mathematics a proof is an
argument that is convincing \emph{beyond any shadow of a
doubt}.\footnote{In a famous joke, as a mathematician and an
engineer drive in Scotland they see a white sheep on their left
side. The engineer says ``you see: all the sheep in Scotland are
white''. The mathematician replies ``All I see is that there
exists a sheep in Scotland whose right side is white''.}
For example, consider the following mathematical statement:
\begin{quote}
\emph{Every even number greater than $2$ is equal to the sum of
two primes.}
\end{quote}
This statement, known as ``Goldbach's Conjecture'', was
conjectured to be true by Christian Goldbach in 1742 ($4$ years
before Princeton was founded). In the more than $250$ years that
have passed since, no one has ever found a counterexample to
this statement. In fact, it has been verified to be true for all
even numbers from $4$ till $100,000,000,000,000,000$. Yet still
it is not considered proven, since we have not ruled out the
possibility that there is some (very large) even number that
cannot be expressed as the sum of two primes.
The fact that a mathematical proof has to be absolutely
convincing does not mean that it has to be overly formal and
tedious. It just has to be clearly written, and contain no
logical gaps. When you write proofs try to be clear and concise,
rather than using too much formal notation. When you read
proofs, try to ask yourself at every statement ``am I really
convinced that this statement is true?''.
Of course, to be absolutely convinced that some statement is
true, we need to be certain of what that statement means. This is
why there is a special emphasis in mathematics on very precise
\emph{definitions}. Whenever you read a definition, try to make
sure you completely understand it, perhaps by working through
some simple examples. Oftentimes, understanding the meaning of a
mathematical statement is more than half the work to prove that
it is true.
\vspace{2ex}\noindent\textbf{Example:} Here is an example of a
classical mathematical proof, written by Euclid around 300 B.C.
Recall that a \emph{prime number} is an integer $p>1$ whose only
divisors are $p$ and $1$, and that every integer $n>1$ is a product
of prime numbers. Euclid's Theorem is the following:
\begin{theorem} \label{app:thm:primes} There exist infinitely many primes.
\end{theorem}
Before proving it, let's see that we understand what this
statement means. It simply means that for every natural number
$k$, there are more than $k$ primes, and hence the number of
primes is not finite.
At first, one might think it's obvious that there are infinitely
many primes because there are infinitely many natural numbers,
and each natural number is a product of primes. However, this
is faulty reasoning: for example, the set of numbers of the form
$3^n$ is infinite, even though their only factor is the single
prime $3$.
To prove Theorem~\ref{app:thm:primes}, we use the technique of
\emph{proof by contradiction}. That is, we assume it is false
and try to derive a contradiction from that assumption. Indeed,
assume that all the primes can be enumerated as
$p_1,p_2,\ldots,p_k$ for some number $k$. Define the number $n =
p_1p_2\cdots p_k + 1$. Since we assume that the numbers
$p_1,\ldots,p_k$ are \emph{all} the primes, all of $n$'s prime
factors must come from this set, and in particular there is some
$i$ between $1$ and $k$ such that $p_i$ divides $n$. That is, $n
= p_i m$ for some number $m$. Thus,
\[
p_i m = p_1 p_2 \cdots p_k + 1
\]
or equivalently,
\[
p_i m - p_1 p_2 \cdots p_k = 1 \;\;\;.
\]
But dividing both sides of this equation by $p_i$, we will get a
whole number on the left hand side (as $p_i$ is a factor of
$p_1p_2\cdots p_k$) and the fraction $1/p_i$ on the right hand
side, deriving a contradiction. This allows us to rightfully
place the QED symbol {\ensuremath{\Box}} and consider
Theorem~\ref{app:thm:primes} as proven.
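The construction at the heart of the proof can be checked computationally for any particular finite list of primes. Below is a short Python sketch (my illustration, not part of the handout; the helper name \texttt{smallest\_prime\_factor} is my own) showing that $p_1 p_2 \cdots p_k + 1$ always has a prime factor outside the list:

```python
def smallest_prime_factor(n):
    """Return the smallest prime dividing n (for n > 1), by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

# Euclid's construction: if p1,...,pk were all the primes, then
# n = p1*p2*...*pk + 1 would still need a prime factor -- and that
# factor cannot be any of the pi, since each pi leaves remainder 1.
primes = [2, 3, 5, 7, 11, 13]
n = 1
for p in primes:
    n *= p
n += 1  # n = 30031
assert all(n % p == 1 for p in primes)  # no pi divides n
q = smallest_prime_factor(n)
assert q not in primes  # a prime outside our supposedly complete list
```

Of course, running this for one list proves nothing by itself; the point of the proof is that the same argument works for \emph{every} finite list.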
\section{Preliminaries}
I assume familiarity with basic notions of sets and operations on
sets such as union (denoted $\cup$), intersection (denoted $\cap$),
and set subtraction (denoted $\setminus$). We denote by $|A|$ the
size of the set $A$. I also assume familiarity with functions, and
notions such as one-to-one (injective) functions and onto (surjective)
functions. If $f$ is a function from a set $A$ to a set $B$, we
denote this by $f:A\To B$. If $f$ is one-to-one then this implies
that $|A| \leq |B|$. If $f$ is onto then $|A| \geq |B|$. If $f$ is
a permutation/bijection (i.e., one-to-one \emph{and} onto) then this
implies that $|A|=|B|$.
I also assume familiarity with \emph{big-Oh notation}: If $f,g$
are two functions from $\N$ to $\N$, then \textbf{(1)} $f =
O(g)$ if there exists a constant $c$ such that $f(n) \leq c\cdot
g(n)$ for every sufficiently large $n$, \textbf{(2)} $f =
\Omega(g)$ if $g=O(f)$, \textbf{(3)} $f = \Theta(g)$ if
$f=O(g)$ and $g=O(f)$, \textbf{(4)} $f = o(g)$ if for every
$\e>0$, $f(n) \leq \e \cdot g(n)$ for every sufficiently large
$n$, and \textbf{(5)} $f = \omega(g)$ if $g = o(f)$.
To emphasize the input parameter, I often write $f(n) = O(g(n))$
instead of $f = O(g)$, and use similar notation for
$o,\Omega,\omega,\Theta$.
\section{Sample Spaces}
For every probabilistic experiment (for example, tossing a coin or
throwing 3 dice) the set of all possible results of the experiment
is called a \emph{sample space}. For example, if the experiment is
to toss a coin and see if the result is ``heads'' or ``tails'' then
the sample space is the set $\{ H , T \}$, or equivalently (if we
denote heads by $1$ and tails by $0$) the set $\{ 0 , 1 \}$. With
every element $x$ of the sample space we associate the probability
$p_x$ that the result of the experiment will be $x$. The number
$p_x$ is between $0$ and $1$ and the sum of all the $p_x$'s is equal
to $1$. We sometimes denote the sample space by $\Omega$, but many
times it will be clear from the context. \textbf{Hint:} Whenever a
statement about probability is made, it is a good habit to ask
yourself what sample space the statement refers to.
As another example, consider the experiment of tossing three coins.
In this case there are $8$ possible results and hence the sample
space is $\{ 000 , 001, 010, 011 , 100 , 101 , 110 , 111 \}$. Each
element in the sample space gets chosen with probability
$\tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} =
\tfrac{1}{2^3} = \tfrac{1}{8}$. An equivalent way to state the
experiment of tossing $n$ coins is to say that we choose a random
$n$-long binary string. We call this the \emph{uniform} distribution
over $\bits^{n}$ (because every string gets chosen with the same
probability). If we want to say that we let $x$ record the result of
this experiment then we use the notation $x \getsr \bits^n$.
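For concreteness, here is a small Python sketch (my illustration, not part of the handout) of the sample space of $n=3$ coin tosses and of one draw $x \getsr \bits^n$:

```python
import itertools
import random

n = 3
# Sample space of tossing n coins: all 2^n binary strings,
# each chosen with probability (1/2)^n.
omega = [''.join(bits) for bits in itertools.product('01', repeat=n)]
prob = {x: (1 / 2) ** n for x in omega}

assert len(omega) == 2 ** n
assert abs(sum(prob.values()) - 1.0) < 1e-12  # probabilities sum to 1

# x <-_R {0,1}^n : sample from the uniform distribution over {0,1}^n
random.seed(0)  # fixed seed only so the sketch is reproducible
x = ''.join(random.choice('01') for _ in range(n))
assert x in omega
```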
\section{Events}
An \emph{event} is a subset of the sample space. The probability
that an event happens is the probability that the result of the
experiment will fall inside that subset. For example, if we consider
the sample space of tossing $101$ coins, then we can denote by $E$
the event that most of the coins come up tails --- that is, at most $50$ of
the coins come up ``heads''. In other words, $E$ is the set of all
length-$101$ strings with at most $50$ ones. We denote the
probability that an event $E$ occurs by $\Pr[ E ]$. For example, in
this case we can write
\[
\Pr_{x\getsr \bits^{101}} [ \text{\# of $1$'s in $x$} \leq 50 ] =
\frac{1}{2}
\]
\noindent\textbf{Proof:} Let $S = \{ x : \text{\# of $1$'s in
$x$} \leq 50 \}$. Let $f$ be the function that flips all the
bits of $x$ from $1$ to $0$ and vice versa. Then $f$ is a
one-to-one and onto function from $S$ to
$\overline{S}=\bits^{101}\setminus S$, meaning that
$|S|=|\overline{S}| = \tfrac{2^{101}}{2} = 2^{100}$, and hence
$\Pr[S] = \tfrac{|S|}{2^{101}} = \tfrac{1}{2}$. \qed
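The same bijection argument can be verified exhaustively for a small odd number of coins. The following Python sketch (mine, with $n=5$ and threshold $2$ playing the roles of $101$ and $50$) checks it:

```python
from itertools import product

n, k = 5, 2  # small analogue of 101 coins and threshold 50
strings = list(product((0, 1), repeat=n))
S = [x for x in strings if sum(x) <= k]     # at most k ones
S_bar = [x for x in strings if sum(x) > k]  # the complement

flip = lambda x: tuple(1 - b for b in x)    # flip every bit
# flip is one-to-one and onto from S to its complement:
assert sorted(flip(x) for x in S) == sorted(S_bar)
# ...hence |S| = 2^n / 2, i.e. Pr[S] = 1/2:
assert len(S) == 2 ** n // 2
```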
\section{Union Bound}
If $E$ and $E'$ are events over the same sample space then another
way to look at the probability that \emph{either} $E$ or $E'$ occurs
is to say that this is the probability that the event $E \cup E'$
(the union of $E$ and $E'$) occurs. A very simple but useful bound
is that this probability is \emph{at most} the sum of the
probability of $E$ and the probability of $E'$. This is called the
\emph{union bound}.
\begin{theorem}[Union bound] \label{thm:unionBound}
Let $\Omega$ be a sample space and let $E,E' \subseteq \Omega$ be
two events over $\Omega$. Then,
\[
\Pr_{\Omega}[ E \cup E' ] \leq \Pr_{\Omega}[ E ] + \Pr_{\Omega}[E']
\]
\end{theorem}
Note that there are examples of $E$ and $E'$ such that $\Pr[ E
\cup E']$ is strictly less than $\Pr[ E ] + \Pr[ E' ]$. For example,
this can be the case if $E$ and $E'$ are the same set (and hence $E
\cup E' = E$). If $E$ and $E'$ are \emph{disjoint} (i.e., mutually
exclusive) then $\Pr[ E \cup E'] = \Pr[E] + \Pr[E']$.
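Both the inequality and the disjoint case can be checked by direct enumeration. Here is a small Python sketch, with two events of my own choosing over four coin tosses:

```python
from itertools import product

# Check Pr[E u E'] <= Pr[E] + Pr[E'] by enumeration over {0,1}^4.
omega = list(product((0, 1), repeat=4))
pr = lambda event: len(event) / len(omega)  # uniform distribution

E  = {x for x in omega if x[0] == 1}    # first coin is heads
E2 = {x for x in omega if sum(x) >= 3}  # at least 3 heads
assert pr(E | E2) <= pr(E) + pr(E2)     # the union bound

# Disjoint events achieve equality:
A = {x for x in omega if sum(x) == 0}   # all tails
B = {x for x in omega if sum(x) == 4}   # all heads
assert pr(A | B) == pr(A) + pr(B)
```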
\section{Random Variables}
A random variable is a function that maps elements of the sample
space to another set (often, but not always, to the set $\R$ of real
numbers). For example, in the case of the uniform distribution over
$\bits^{101}$, we can define the random variable $N$ to denote the
number of ones in the string chosen. That is, for every $x \in
\bits^{101}$, $N(x)$ is equal to the number of ones in $x$. Thus,
the event $E$ we considered before can be phrased as the event that
$N \leq 50$ and the formula above can be phrased as
\[
\Pr_{x \getsr \bits^{101}} [ N(x) \leq 50 ] = \frac{1}{2}
\]
For the remainder of this handout, we will only consider \emph{real}
random variables (that is, random variables whose output is a
\emph{real number}).
\section{Expectation}
The \emph{expectation} of a random variable is its weighted average.
That is, it is the average value it takes, when the average is
weighted according to the probability measure on the sample space.
Formally, if $N$ is a random variable on a sample space $\Omega$
(where for every $x \in \Omega$, the probability that $x$ is
obtained is given by $p_x$) then the expectation of $N$, denoted by
$\Ex[N]$ is defined as follows:
\[
\Ex [ N ] \eqdef \sum_{x \in \Omega} N(x) \cdot p_x
\]
For example, if the experiment was to choose a random U.S. citizen
(and hence the sample space is the set of all U.S. citizens) and we
defined the random variable $H$ to be the height of the person
chosen, then the expectation of $H$ (denoted by $\Ex[H]$) is simply
the average height of a U.S. citizen.
There can be two different random variables with the same
expectation. For example, consider the sample space $\bits^{101}$
with the uniform distribution, and the following two random
variables:
\begin{itemize}
\item $N$ is the random variable defined above: $N(x)$ is the number of ones in $x$.
\item $M$ is defined as follows: if $x$ is the all ones string (that
is $x=1^{101}$) then $M(x) = 50.5 \cdot 2^{101}$. Otherwise (if
$x\neq 1^{101}$) then $M(x)=0$.
\end{itemize}
The expectation of $N$ equals $50.5$ (this follows from the
linearity of expectation, see below).
The expectation of $M$ is also $50.5$: with probability $2^{-101}$
it will be $50.5 \cdot 2^{101}$ and with probability $1-2^{-101}$ it
will be $0$.
Note that even though the average of $M$ is $50.5$, the probability
that for a random $x$, $M(x)$ will be close to $50.5$ or even bigger
than zero is very very small. This is similar to the fact that if
Bill Gates is in a room with $99$ poor people (e.g. theoretical
computer scientists), then the average worth of a random person in
this room is more than \$100M even though with probability $0.99$ a
random person in the room will be worth much less than that amount.
Hence the name ``expectation'' is somewhat misleading.
In contrast, it will follow from Theorem~\ref{thm:chernoff}, that
for a random string $x$, even though it will never have $N(x)$ equal
to exactly $50.5$ (after all, $N(x)$ is always a whole number), with
high probability $N(x)$ will be close to $50.5$.
The fact that two different variables can have the same expectation
means that if we know the expectation it does not give us \emph{all}
the information about the random variable but only \emph{partial}
information.
\paragraph{Linearity of expectation.} The expectation has a very
useful property which is that it is a \emph{linear function}. That
is, if $N$ and $M$ are random variables over the same sample space
$\Omega$, then we can define the random variable $N+M$ in the
natural way: for every $x\in \Omega$, $(N+M)(x) = N(x)+M(x)$. It
turns out that $\Ex[N+M] = \Ex[N]+\Ex[M]$. For every fixed number
$c$ and random variable $N$ we define the random variable $cN$ in
the natural way: $(cN)(x) = c \cdot N(x)$. It turns out that $\Ex[cN]
= c\Ex[N]$.
For example, the random variable $N$ above is equal to
$X_1+\cdots+X_{101}$ with $X_i$ equalling the $i^{th}$ bit of
the chosen string. Since $\Ex[X_i] = (1/2)\cdot 0 + (1/2)\cdot 1
= 1/2$, $\Ex[N] = 101\cdot(1/2) = 50.5$.
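The computation of $\Ex[N]$ can be checked both ways (directly from the definition, and via linearity) for a small number of coins. A Python sketch, using $n=7$ as my own small stand-in for $101$:

```python
from itertools import product

# E[N] computed two ways for n = 7 coins: directly from the
# definition, and via linearity as n * E[X_i] = n * (1/2).
n = 7
omega = list(product((0, 1), repeat=n))
p = 1 / len(omega)  # uniform: each string has probability 2^-n
direct = sum(sum(x) * p for x in omega)  # sum_x N(x) p_x
by_linearity = n * 0.5                   # n * E[X_i]
assert abs(direct - by_linearity) < 1e-12  # both equal 3.5
```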
\section{Markov Inequality}
As we saw above, sometimes we want to know not just the expectation
of a random variable but also the probability that the variable is
close to (or at least not too far from) its expectation. Bounds on
this probability are often called ``tail bounds''. The simplest one
of them is \emph{Markov's inequality}, which is a one-sided
bound: it says that a non-negative random
variable is unlikely to be much larger than its expectation. (Note that the
random variable $M$ defined above was an example of a non-negative
random variable that with high probability is much \emph{smaller}
than its expectation.) Formally, it is the following theorem:
\begin{theorem}[Markov Inequality] \label{thm:markov} Let $X$ be a random variable over
a sample space $\Omega$ such that for all $x\in\Omega$, $X(x)\geq
0$. Let $k\geq 1$. Then,
\[ \Pr [ X \geq k \Ex[ X ] ] \leq \frac{1}{k}
\]
\end{theorem}
\begin{proof} Denote $\mu = \Ex[ X ]$. Suppose for the sake of
contradiction that $\Pr[ X \geq k\mu ] > 1/k$. Let $S = \{ x \in
\Omega \;|\; X(x) \geq k \mu \}$ and $\overline{S} = \Omega
\setminus S$. By the definition of expectation
\[
\Ex[ X ] = \sum_{x\in\Omega} X(x)p_x = \sum_{x\in S} X(x)p_x +
\sum_{x\in \overline{S}} X(x)p_x
\]
However, we know that for each $x\in S$, $X(x) \geq k \mu$ and hence
\[
\sum_{x \in S} X(x)p_x \geq \sum_{x\in S}k\mu p_x = k\mu \sum_{x\in
S} p _x
\]
Yet $\sum_{x\in S} p_x = \Pr[ S ] > \tfrac{1}{k}$ under our
assumption and hence $\sum_{x \in S} X(x)p_x > k\mu \tfrac{1}{k} =
\mu$.
Since $X(x)\geq 0$ for all $x\in \Omega$ we get that $\sum_{x\in
\overline{S}} X(x)p_x \geq 0$ and hence $\Ex[X] > \mu$, yielding a
contradiction.
\end{proof}
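Markov's inequality can be checked exactly by enumeration over a small sample space. A Python sketch (my example: $X$ is the number of ones in a uniform $10$-bit string, so $\Ex[X]=5$):

```python
from itertools import product

# Markov's inequality checked by enumeration: X = number of ones
# in a uniform 10-bit string, so E[X] = 5.
n = 10
omega = list(product((0, 1), repeat=n))
p = 1 / len(omega)
mu = sum(sum(x) * p for x in omega)  # expectation, = 5.0
for k in (1, 1.5, 2):
    lhs = sum(p for x in omega if sum(x) >= k * mu)  # Pr[X >= k*mu]
    assert lhs <= 1 / k + 1e-12
```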
\section{Variance and Chebyshev's inequality}
We already noted that the distance from the expectation is an
interesting parameter. Thus, for a random variable $X$ with
expectation $\mu$ we can define a new random variable $\Tilde{X}$
which is the distance of $X$ from its expectation. That is, for
every $x\in \Omega$, we define $\Tilde{X}(x) = |X(x)-\mu|$. (Recall
that $|\cdot|$ denotes the absolute value.) It turns out that it is
hard to work with $\Tilde{X}$ directly, and so we look at the variable
$\Tilde{X}^2$, which is equal to $(X-\mu)^2$. We define the
\emph{variance} of a random variable $X$ to be equal to the
expectation of $\Tilde{X}^2$. That is, for $X$ with $\Ex[X] = \mu$,
\[
Var[X] \eqdef \Ex[ \Tilde{X}^2 ] = \Ex[ (X - \mu)^2 ]
\]
In other words $Var[X]$ is defined to be $\Ex[ (X - \Ex[X])^2 ]$.
We define the \emph{standard deviation} of $X$ to be the square root
of $Var[X]$.
If we have a bound on the variance then we can obtain a better tail
bound on the variable:
\begin{theorem}[Chebyshev's inequality] \label{thm:chebyshev}
Let $X$ be a random variable over $\Omega$ with expectation $\mu$
and standard deviation $\sigma$. Let $k\geq 1$. Then,
\[
\Pr[ | X - \mu | \geq k \sigma ] \leq 1/k^2
\]
\end{theorem}
\begin{proof} The variable $Y = (X-\mu)^2$ is non-negative and has
expectation $Var[X]=\sigma^2$. Therefore, by Markov's inequality,
$\Pr[ Y \geq k^2 \sigma^2 ] \leq 1/k^2$.
However, whenever $|X-\mu| \geq k\sigma$ it holds that $|X-\mu|^2$
(which is equal to $Y$) is at least $k^2\sigma^2$. This means that
\[
\Pr[ |X-\mu| \geq k \sigma ] \leq \Pr[ Y \geq k^2\sigma^2 ] \leq
1/k^2
\]
\end{proof}
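Chebyshev's inequality can likewise be checked exactly on a small sample space. A Python sketch (same example as before, $X$ = number of ones in a uniform $10$-bit string, with $\mu=5$ and $Var[X]=10/4$):

```python
from itertools import product

# Chebyshev's inequality checked by enumeration for X = number of
# ones in a uniform 10-bit string: mu = 5, Var[X] = 10 * (1/4).
n = 10
omega = list(product((0, 1), repeat=n))
p = 1 / len(omega)
mu = sum(sum(x) * p for x in omega)
var = sum((sum(x) - mu) ** 2 * p for x in omega)
sigma = var ** 0.5
assert abs(var - n / 4) < 1e-9  # sum of n Bernoulli variances of 1/4
for k in (1, 2, 3):
    lhs = sum(p for x in omega if abs(sum(x) - mu) >= k * sigma)
    assert lhs <= 1 / k ** 2 + 1e-12  # Pr[|X - mu| >= k*sigma] <= 1/k^2
```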
\section{Conditional probabilities and independence}
Let $A$ be some event over a sample space $\Omega$ (with
$\Pr[A]>0$). By a probability \emph{conditioned on $A$} we mean the
probability of some event, assuming that we already know that $A$
happened. For example, if $\Omega$ is our usual sample space of
uniform choices over $\bits^{101}$ and $A$ is the event that the
first coin came up heads, then the conditional space is the space
of all length-$101$ strings whose first bit is $1$.
Formally this is defined in the natural way: we consider $A$ as a
sample space by inheriting the probabilities from $\Omega$ (and
normalizing so the probabilities will sum up to one). That is, for
every $x \in A$ we define $p_{x|A}$ (the probability that $x$ is
chosen conditioned on $A$) to be $\tfrac{p_x}{\Pr[A]}$. For an event
$B$ we define $\Pr[ B| A]$ (the probability that $B$ happens
conditioned on $A$) to be $\sum_{x\in A \cap B} p_{x|A} =
\tfrac{\Pr[A \cap B]}{\Pr[A]}$.
\paragraph{Independent events.} We say that $B$ is independent from $A$ if $\Pr[B|A]=\Pr[B]$.
That is, knowing that $A$ happened does not give us any new
information on the probability that $B$ will happen. By plugging in the
formula for $\Pr[B|A]$ we see that $B$ is independent from $A$ if
and only if
\[ \Pr[B \cap A] = \Pr[A]\Pr[B] \]
This means that $B$ is independent from $A$ iff $A$ is independent
from $B$ and hence we simply say that $A$ and $B$ are independent
events.
For example, if, as above, $A$ is the event that the first coin toss
is heads and $B$ is the event that the second coin toss is heads,
then these are independent events. In contrast, if $C$ is the event
that the number of heads is at most $50$ then $C$ and $A$ are
\emph{not} independent (since knowing that $A$ happened decreases
somewhat the chances for $C$).
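A small analogue of this example can be checked exhaustively. The following Python sketch (mine, with $5$ coins and ``at most $2$ heads'' playing the role of ``at most $50$ out of $101$'') verifies that $A$ and $B$ are independent while $A$ and $C$ are not:

```python
from itertools import product

# 5-coin analogue: A = "first coin is heads", B = "second coin is
# heads", C = "at most 2 heads" (the analogue of "at most 50").
omega = list(product((0, 1), repeat=5))
pr = lambda ev: sum(1 for x in omega if ev(x)) / len(omega)
A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
C = lambda x: sum(x) <= 2
both = lambda e, f: (lambda x: e(x) and f(x))

assert pr(both(A, B)) == pr(A) * pr(B)  # A and B are independent
assert pr(both(A, C)) != pr(A) * pr(C)  # A and C are not
assert pr(both(A, C)) < pr(A) * pr(C)   # knowing A decreases C's chances
```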
If we have more than two events then things are a bit more subtle: we say
that the events $E_1,\ldots,E_n$ are \emph{mutually independent} if not
only $\Pr[E_1 \cap E_2 \cap \cdots \cap E_n] = \Pr[E_1]\cdots
\Pr[E_n]$, but this product formula also holds for every subset of
$E_1,\ldots,E_n$. That is, for every subset $I$ of the numbers
$\{1,\ldots,n\}$,
\[
\Pr[ \cap_{i\in I} E_i ] = \prod_{i \in I} \Pr[ E_i ]
\]
\paragraph{Independent random variables.} We say that $U$ and $V$
are \emph{independent random variables} if for all possible values
$u$ and $v$, the events $U=u$ and $V=v$ are independent events, or in
other words $\Pr[ U=u \text{ and } V=v]=\Pr[U=u]\Pr[V=v]$. We say
that $U_1,\ldots,U_n$ are a collection of independent random
variables if for all values $u_1,\ldots,u_n$, the events
$U_1=u_1,\ldots,U_n=u_n$ are mutually independent.
\section{The Chernoff Bound}
Suppose that 60\% of a country's citizens prefer the color blue over
red. A poll is the process of choosing a random citizen and finding
his or her favorite color. Suppose that we do this $n$ times and we
define the random variable $X_i$ to be $0$ if the color of the
$i^{th}$ person chosen is red and $1$ if it is blue. Then, for each
$i$ the expectation $\Ex[X_i]$ is $0.6$, and by linearity of
expectation $\Ex[ \sum_{i=1}^n X_i ] = 0.6n$. The estimate we get
out of this poll for the fraction of blue-preferrers is $\tfrac{\sum
X_i}{n}$ and we would like to know how close this is to the real
fraction of the population (i.e., $0.6$). In other words, for any
$\e>0$, we would like to know what is the probability that our
estimate will be $\e$ off from the real value, i.e., that
$|\tfrac{\sum X_i}{n} - 0.6| > \e$.
It turns out that in this case we have a very good bound on the
deviation of $\sum X_i$ from its expectation, and this is because
all of the $X_i$'s are independent random variables (since in each
experiment we draw a new random person independently of the results
of previous experiments). This is the Chernoff bound, which we state
here in a simplified form:
\begin{theorem}[Chernoff bound] \label{thm:chernoff} Let $X_1,\ldots,X_n$ be independent
random variables with $0 \leq X_i \leq 1$ and $\Ex[ X_i ] = \mu$.
Then,
\[
\Pr\left[ \left| \frac{\sum X_i}{n} - \mu \right| > \e \right] <
2^{-\e^2n/4}
\]
\end{theorem}
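The polling scenario can be simulated to see the bound in action. A Python sketch (my illustration; $\mu=0.6$ and $\e=0.1$ are just the running example's numbers, and the trial counts are arbitrary):

```python
import random

# Polling simulation: each X_i is 1 with probability mu = 0.6.
# We compare the empirical deviation probability with the stated
# Chernoff bound 2^(-eps^2 * n / 4).
random.seed(1)  # fixed seed only so the sketch is reproducible
mu, n, eps, trials = 0.6, 1000, 0.1, 1000
bound = 2 ** (-eps ** 2 * n / 4)  # = 2^(-2.5), about 0.177

bad = 0
for _ in range(trials):
    est = sum(random.random() < mu for _ in range(n)) / n
    if abs(est - mu) > eps:  # poll is off by more than eps
        bad += 1
assert bad / trials < bound
```

In fact, with $n=1000$ a deviation of $0.1$ is so unlikely that the empirical count is essentially always zero, illustrating that the bound, while exponentially small in $n$, is far from tight.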
\section*{Exercises}
(The following exercises will be part of Homework 1 of the course. I prefer you submit the
exercises typed, ideally using \LaTeX, and \LaTeX\ sources for all homeworks are available on the
homepage.)
\begin{exercise}\label{ex:var} In the following exercise $X,Y$
denote finite random variables. That is, there are finite sets of real numbers
$\mathcal{X},\mathcal{Y}$ such that $\Pr[ X=x]=0$ and $\Pr[ Y=y]=0$ for every $x \not\in
\mathcal{X}$ and $y\not\in\mathcal{Y}$. We denote by $\Ex[X]$ the expectation of $X$ (i.e.,
$\sum_{x\in\mathcal{X}} x\Pr[X=x]$), and by $Var[X]$ the variance of $X$ (i.e., $\Ex[(X-\mu)^2]$
where $\mu=\Ex[X]$). The standard deviation of $X$ is defined to be $\sqrt{Var[X]}$.
\begin{enumerate}
\item Prove that $Var[X]$ is always non-negative.
\item Prove that $Var[X] = \Ex[X^2] - \Ex[X]^2$.
\item Prove that always $\Ex[X^2] \geq \Ex[X]^2$.
\item Give an example for a random variable $X$ such that $\Ex[X^2] \neq \Ex[X]^2$.
\item Give an example for a random variable $X$ such that its standard deviation is \emph{not
equal} to $\Ex[ | X - \Ex[X] | ]$.
\item Give an example for two random variables $X,Y$ such that $\Ex[XY] = \Ex[X]\Ex[Y]$.
\item Give an example for two random variables $X,Y$ such that $\Ex[XY] \neq \Ex[X]\Ex[Y]$.
\item Prove that if $X$ and $Y$ are independent random variables (i.e., for every $x \in
    \mathcal{X},y\in\mathcal{Y}$, $\Pr[X=x \wedge Y=y]=\Pr[X=x]\Pr[Y=y]$) then
    $\Ex[XY]=\Ex[X]\Ex[Y]$ and $Var[X+Y]=Var[X]+Var[Y]$.
\end{enumerate}
\end{exercise}
\begin{exercise} Recall that two distributions $X$ and $Y$ that range over some set $S$ are
\emph{identical} if for every $s$ in $S$, $\Pr[ X = s] = \Pr[ Y = s ]$. Below $n$ is some integer
$n>1$. (You can get partial credit for solving the questions below for the special case that $n=3$
and $z$ (in Question~2) is the string $111$.)
\begin{enumerate}
\item Let $X_1,...,X_n$ be random variables where each $X_i \in \bit$ is
    chosen to equal $0$ with probability $1/2$ and $1$ with probability $1/2$, and all of
    the $X_i$'s are independent. Let $Y_1,...,Y_n$ be random variables where each $Y_i \in \bit$ is
    chosen as follows: first an $n$-bit $0/1$ string $y$ is chosen uniformly at random from the
    set $\bit^n$ of all possible $n$-bit $0/1$ strings, and then $Y_i$ is set to be the
    $i^{th}$ coordinate of $y$. Prove that the distributions $(X_1,...,X_n)$ and
    $(Y_1,...,Y_n)$ are identical.
\item Let $z$ be a fixed string in $\bit^n$, and let $Z_1,...,Z_n$ be random variables chosen
as follows: first a string $w \in \bits^n$ is chosen uniformly from $\bit^n$, and then
$Z_i$ is set to $z_i \oplus w_i$, where $\oplus$ is the XOR operation (i.e., $0\oplus 1 = 1
\oplus 0 = 1$ and $0 \oplus 0 = 1 \oplus 1 = 0$). Prove that the distribution
$(Z_1,...,Z_n)$ is identically distributed to $(X_1,...,X_n)$ and $(Y_1,...,Y_n)$ above.
\item Let $W_1,...,W_n$ be random variables where each $W_i \in \bit$ is chosen as follows: first a
    string $w$ is chosen uniformly at random from the set of all $n$-bit $0/1$ strings
    satisfying $w_1 \oplus w_2 \oplus \cdots \oplus w_n = 0$, and then $W_i$ is set to be
    $w_i$. \textbf{(a)} Prove that $W_1$ and $W_2$ are independent. \textbf{(b)} Prove or
    disprove that the random variables $W_1,...,W_n$ are mutually independent.
\end{enumerate}
\end{exercise}
\end{document}