

Probability theory talks about the construction and implications of probability models. For example, given a probability distribution, what are the mean and variance? What is the distribution of a transformed random variable? In computer simulations, probability theory tells us what will happen to the generated realizations, in particular when the experiments can be repeated as many times as the researcher wishes. This is a real-world analogue of the frequentist’s interpretation of probability.
A sample space \(\Omega\) is a collection of all possible outcomes. It is a set of things.
An event \(A\) is a subset of \(\Omega\). It is something of interest on the sample space.
A \(\sigma\)-field is a collection of events that is closed under countable unions, intersections, and complements. It is a well-organized structure built on the sample space.
A probability measure satisfies
(non-negativity) \(P\left(A\right)\geq0\) for all events;
(countable additivity) If \(A_{i}\), \(i\in\mathbb{N}\), are mutually disjoint, then \(P\left(\bigcup_{i\in\mathbb{N}}A_{i}\right)=\sum_{i\in\mathbb{N}} P \left(A_{i}\right).\)
\(P(\Omega) = 1\).
The above construction gives a mathematically well-defined probability measure, but we have not yet answered “How to assign the probability?”
There are two major schools of thinking on probability assignment. One is the frequentist, who considers probability as the average chance of occurrence if a large number of experiments are carried out. The other is the Bayesian, who views probability as a subjective belief. The principles of these two schools are largely incompatible, and each has its own pros and cons in different real-world contexts.
A random variable maps an outcome to a real number. If the outcome is multivariate, we call it a random vector.
We now revisit some terminology from an undergraduate probability course. A (cumulative) distribution function \(F:\mathbb{R}\to [0,1]\) is defined as
\[ F\left(x\right)=P\left(X\leq x\right). \]
It is often abbreviated as CDF, and it has the following properties.
\(\lim_{x\to-\infty}F\left(x\right)=0\),
\(\lim_{x\to\infty}F\left(x\right)=1\),
non-decreasing,
right-continuous: \(\lim_{y\to x^{+}}F\left(y\right)=F\left(x\right).\)
The \(q\)-th quantile of a random variable is \(\min\{x\in \mathbb{R} : P(X \leq x) \geq q\}\), the smallest value at which the CDF reaches \(q\).
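As a concrete illustration of this definition, the Python sketch below computes quantiles of a small discrete distribution (the support and probabilities are made up for illustration):

```python
import bisect

# A hypothetical discrete distribution: P(X=1)=0.2, P(X=2)=0.5, P(X=3)=0.3
support = [1, 2, 3]
cdf = [0.2, 0.7, 1.0]  # cumulative probabilities P(X <= x)

def quantile(q):
    """Smallest x in the support with P(X <= x) >= q."""
    return support[bisect.bisect_left(cdf, q)]

print(quantile(0.5))  # 2, since F(1) = 0.2 < 0.5 <= F(2) = 0.7
```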
For a continuous distribution, if its CDF is differentiable, then
\[ f(x) = d F\left(x\right) / d x \]
is called the probability density function of \(X\), often abbreviated as PDF. It is easy to show that \(f\left(x\right)\geq0\), and by the Leibniz integral rule \(\int_{a}^{b}f\left(x\right)dx=F\left(b\right)-F\left(a\right)\).
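This relation between the PDF and the CDF can be checked numerically. The sketch below uses the standard normal distribution, whose CDF has a closed form via the error function \(\mathrm{erf}\); a midpoint Riemann sum of the density over \([a,b]\) should reproduce \(F(b)-F(a)\):

```python
import math

def norm_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Midpoint Riemann sum of the density over [a, b]
a, b, n = -1.0, 1.0, 100_000
h = (b - a) / n
integral = sum(norm_pdf(a + (i + 0.5) * h) for i in range(n)) * h

print(abs(integral - (norm_cdf(b) - norm_cdf(a))) < 1e-8)  # True
```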
For a discrete random variable, its CDF is non-differentiable at the jump points. In this case, we define the probability mass function as \(f(x) = F(x) - \lim_{y \to x^{-}} F(y)\).
We have learned many parametric distributions. A parametric distribution can be completely characterized by a few parameters.
Examples:
Binomial distribution.
\[f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}\]
Poisson distribution.
\[f(k;\lambda) = \frac{\lambda^k \exp(-\lambda)}{k!}\]
Uniform distribution.
\[f(x; a, b) = \frac{1}{b-a} \cdot \mathbf{1}\{a\leq x \leq b\}\]
Normal distribution.
\[ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left( - \frac{(x-\mu)^2}{2\sigma^2}\right) \]
Its mean is \(\mu\) and variance \(\sigma^2\).
Log-normal distribution.
\[ f(x; \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi} \sigma} \exp\left( - \frac{(\log(x)-\mu)^2}{2\sigma^2}\right), \quad x > 0. \]
Its mean is \(\exp(\mu + 0.5 \sigma^2)\) and variance \([\exp(\sigma^2) - 1] \exp(2\mu+ \sigma^2)\).
\(\chi^{2}\), \(t\), and \(F\) distributions.
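The pmf formulas above can be evaluated directly. A minimal Python sketch (parameter values chosen only for illustration) confirms that the binomial and Poisson pmfs sum to one over their support:

```python
import math

# Binomial pmf f(k; n, p) and Poisson pmf f(k; lambda), as given above
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

# Both pmfs sum to (essentially) one over their support
print(abs(sum(binom_pmf(k, 10, 0.3) for k in range(11)) - 1) < 1e-12)
print(abs(sum(poisson_pmf(k, 4.0) for k in range(100)) - 1) < 1e-12)
```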
Example
R has a rich collection of distributions implemented in a unified pattern: d for the density, p for the distribution function (CDF), q for the quantile function, and r for random number generation. For instance, dnorm, pnorm, qnorm, and rnorm are the corresponding functions of the normal distribution, and the parameters \(\mu\) and \(\sigma\) can be specified in the arguments of the functions.
In probability theory, an integral \(\int X\mathrm{d}P\) is called the expected value, or expectation, of \(X\). We often use the notation \(E\left[X\right]\), instead of \(\int X\mathrm{d}P\), for convenience.
The expectation is the average of a random variable, even though we cannot foresee its realization in any particular trial (otherwise there would be no uncertainty). In the frequentist’s view, the expectation is the average outcome if we carry out a large number of independent trials.
If we know the probability mass function of a discrete random variable, its expectation is calculated as \(E\left[X\right]=\sum_{x}xP\left(X=x\right)\). If a continuous random variable has a PDF \(f(x)\), its expectation can be computed as \(E\left[X\right]=\int xf\left(x\right)\mathrm{d}x\).
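A one-line check of the discrete formula, using a fair six-sided die as the example:

```python
# E[X] = sum_x x * P(X = x); for a fair die the mean is 21/6 = 3.5
pmf = {x: 1 / 6 for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())
print(mean)
```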
Here are some properties of the expectation.
\(E\left[X^{r}\right]\) is called the \(r\)-th moment of \(X\). The mean of a random variable is the first moment \(\mu=E\left[X\right]\), and the second centered moment is called the variance \(\mathrm{var}\left[X\right]=E [(X-\mu)^{2}]\).
The third centered moment \(E\left[\left(X-\mu\right)^{3}\right]\), called skewness, is a measurement of the symmetry of a random variable, and the fourth centered moment \(E\left[\left(X-\mu\right)^{4}\right]\), called kurtosis, is a measurement of the tail thickness.
We call \(E\left[\left(X-\mu\right)^{3}\right]/\sigma^{3}\) the skewness coefficient, and \(E\left[\left(X-\mu\right)^{4}\right]/\sigma^{4}-3\) degree of excess. A normal distribution’s skewness and degree of excess are both zero.
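These standardized moments are easy to compute for a small discrete distribution. In the sketch below (a made-up distribution, uniform on \(\{-1,0,1\}\)), symmetry forces the skewness coefficient to zero, while the degree of excess is negative, reflecting thin tails:

```python
# Centered moments for a hypothetical discrete distribution, uniform on {-1, 0, 1}
pmf = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

mu = sum(x * p for x, p in pmf.items())

def cmoment(r):
    """r-th centered moment E[(X - mu)^r]."""
    return sum((x - mu) ** r * p for x, p in pmf.items())

var = cmoment(2)
skew_coef = cmoment(3) / var ** 1.5   # zero for a symmetric distribution
excess = cmoment(4) / var ** 2 - 3    # -1.5 here, up to rounding
print(mu, skew_coef, excess)
```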
Moments do not always exist. For example, the mean of the Cauchy distribution does not exist, and the variance of the \(t(2)\) distribution does not exist.
\(E[\cdot]\) is a linear operation. \(E[a X_1 + b X_2] = a E[X_1] + b E[X_2].\)
Jensen’s inequality is an important fact. A function \(\varphi(\cdot)\) is convex if \(\varphi( a x_1 + (1-a) x_2 ) \leq a \varphi(x_1) + (1-a) \varphi(x_2)\) for all \(x_1,x_2\) in the domain and \(a\in[0,1]\). For instance, \(x^2\) is a convex function. Jensen’s inequality says that if \(\varphi\left(\cdot\right)\) is a convex function, then
\[ \varphi\left(E\left[X\right]\right)\leq E\left[\varphi\left(X\right)\right]. \]
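A quick Monte Carlo illustration with \(\varphi(x)=x^2\): the sample analogue of \((E[X])^2 \le E[X^2]\) always holds, since the gap is exactly the (biased) sample variance. The normal draws below are purely illustrative:

```python
import random

random.seed(1)  # reproducibility
xs = [random.gauss(2, 1) for _ in range(10_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)
print(mean ** 2 <= mean_sq)  # True; the gap equals the sample variance
```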
Markov inequality is another simple but important fact. If \(E\left[\left|X\right|^{r}\right]\) exists, then
\[ P\left(\left|X\right|>\epsilon\right)\leq E\left[\left|X\right|^{r}\right]/\epsilon^{r} \]
for all \(r\geq1\). Chebyshev inequality \(P\left(\left|X\right|>\epsilon\right)\leq E\left[X^{2}\right]/\epsilon^{2}\) is a special case of the Markov inequality when \(r=2\).
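The Chebyshev bound can be verified exactly on a small discrete distribution (the pmf below is made up for illustration):

```python
# Chebyshev: P(|X| > eps) <= E[X^2] / eps^2, checked on a discrete X
pmf = {-2: 0.1, 0: 0.8, 2: 0.1}
second_moment = sum(x * x * p for x, p in pmf.items())  # E[X^2] = 0.8

for eps in (1.0, 1.5, 3.0):
    tail = sum(p for x, p in pmf.items() if abs(x) > eps)
    assert tail <= second_moment / eps ** 2
    print(eps, tail, second_moment / eps ** 2)
```

Note how loose the bound is for small \(\epsilon\): the true tail at \(\epsilon=1\) is 0.2, while the bound is 0.8.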
A bivariate random variable is a vector of two scalar random variables. More generally, a multivariate random variable has the joint CDF as
\[ F\left(x_{1},\ldots,x_{n}\right)=P\left(X_{1}\leq x_{1},\ldots,X_{n}\leq x_{n}\right). \]
Joint PDF is defined similarly.
It is instructive to introduce the joint distribution, conditional distribution, and marginal distribution in the simple bivariate case; these definitions can be extended to multivariate distributions. Suppose a bivariate random variable \((X,Y)\) has a joint density \(f(\cdot,\cdot)\). The marginal density \(f\left(y\right)=\int f\left(x,y\right)dx\) integrates out the coordinate that is not of interest. The conditional density can be written as \(f\left(y|x\right)=f\left(x,y\right)/f\left(x\right)\) for \(f(x) \neq 0\).
For two events \(A_1\) and \(A_2\), the conditional probability is
\[ P\left(A_1|A_2\right) = \frac{P\left(A_1 A_2\right)}{ P\left(A_2\right) } \]
if \(P(A_2) \neq 0\). In this definition of conditional probability, \(A_2\) plays the role of the outcome space so that \(P(A_1 A_2)\) is standardized by the total mass \(P(A_2)\).
Since the roles of \(A_1\) and \(A_2\) are symmetric, we also have \(P(A_1 A_2) = P(A_2|A_1)P(A_1)\). It implies
\[ P(A_1 | A_2)=\frac{P\left(A_2| A_1\right)P\left(A_1\right)}{P\left(A_2\right)}. \]
This formula is the well-known Bayes’ Theorem, a cornerstone of decision theory.
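To make the formula concrete, here is a sketch with made-up numbers: a prior \(P(A_1)=0.6\) and likelihoods \(P(A_2|A_1)=0.9\) and \(P(A_2|A_1^c)=0.3\). The denominator \(P(A_2)\) comes from the law of total probability:

```python
# Bayes' theorem with hypothetical numbers
p_a1 = 0.6              # prior P(A1)
p_a2_given_a1 = 0.9     # likelihood P(A2 | A1)
p_a2_given_not_a1 = 0.3 # likelihood P(A2 | not A1)

# Law of total probability: P(A2) = P(A2|A1)P(A1) + P(A2|~A1)P(~A1)
p_a2 = p_a2_given_a1 * p_a1 + p_a2_given_not_a1 * (1 - p_a1)

posterior = p_a2_given_a1 * p_a1 / p_a2
print(posterior)  # 9/11, approximately 0.818
```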
Example
\(A_1\) is the event “a student can survive CUHK’s MSc program”, and \(A_2\) is the event that a particular application profile is observed.
We say two events \(A_1\) and \(A_2\) are independent if \(P(A_1A_2) = P(A_1)P(A_2)\). If \(P(A_2) \neq 0\), it is equivalent to \(P(A_1 | A_2 ) = P(A_1)\). In words, knowing \(A_2\) does not change the probability of \(A_1\).
If \(X\) and \(Y\) are independent, \(E[XY] = E[X]E[Y]\).
In the bivariate case, if the conditional density exists, the conditional expectation can be computed as \(E\left[Y|X\right]=\int yf\left(y|X\right)dy\). The law of iterated expectation implies \(E\left[E\left[Y|X\right]\right]=E\left[Y\right]\).
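The law of iterated expectations can be verified by direct enumeration on a small discrete joint distribution (the joint pmf below is made up for illustration):

```python
# Verify E[E[Y|X]] = E[Y] on a discrete joint pmf f(x, y)
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# marginal of X and the unconditional mean of Y
fx = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
ey = sum(y * p for (x, y), p in joint.items())

# inner expectation E[Y|X=x], then average over the marginal of X
e_y_given_x = {
    x: sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x]
    for x in (0, 1)
}
lie = sum(e_y_given_x[x] * fx[x] for x in (0, 1))

print(abs(lie - ey) < 1e-12)  # True
```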
Below are some properties of conditional expectations:
\(E\left[E\left[Y|X_{1},X_{2}\right]|X_{1}\right]=E\left[Y|X_{1}\right];\)
\(E\left[E\left[Y|X_{1}\right]|X_{1},X_{2}\right]=E\left[Y|X_{1}\right];\)
\(E\left[h\left(X\right)Y|X\right]=h\left(X\right)E\left[Y|X\right].\)
Application
Regression is a technique that decomposes a random variable \(Y\) into two parts, a conditional mean and a residual. Write \(Y=E\left[Y|X\right]+\epsilon\), where \(\epsilon=Y-E\left[Y|X\right]\). Show that \(E[\epsilon] = 0\) and \(E[\epsilon E[Y|X] ] = 0\).
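Both claims can be checked numerically on a small discrete joint distribution (the joint pmf below is made up for illustration; the argument in the exercise is the law of iterated expectations):

```python
# Check E[eps] = 0 and E[eps * E[Y|X]] = 0 for Y = E[Y|X] + eps
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

fx = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
m = {
    x: sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x]
    for x in (0, 1)
}  # m(x) = E[Y|X=x]

e_eps = sum((y - m[x]) * p for (x, y), p in joint.items())
e_eps_m = sum((y - m[x]) * m[x] * p for (x, y), p in joint.items())

print(abs(e_eps) < 1e-12, abs(e_eps_m) < 1e-12)  # True True
```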