Describes random variables in terms of Hilbert spaces, defining inner products, norms, and minimum mean square error estimation.
Random variable spaces
Probability – notation primer
Definition 1 A random variable $X$ is defined by a distribution function
$$P\left(x\right)={F}_{X}\left(x\right)=\mathrm{Prob}(X\le x).$$
The
density function is given by
$$\frac{\partial P\left(x\right)}{\partial x}={f}_{X}\left(x\right)=\frac{\partial \mathrm{Prob}(X\le x)}{\partial x}$$
Definition 2 The expectation of a function $g\left(x\right)$ over the random variable $X$ is
$${E}_{X}\left[g\left(x\right)\right]={\int}_{-\infty}^{\infty}g\left(x\right){f}_{X}\left(x\right)\,dx.$$
Definition 3 Pairs of random variables $X,Y$ are defined by the joint distribution function
$$P(x,y)={F}_{XY}(x,y)=\mathrm{Prob}(X\le x,Y\le y).$$
The joint density function is given by
$$\frac{{\partial}^{2}P(x,y)}{\partial x\,\partial y}={f}_{XY}(x,y)=\frac{{\partial}^{2}\mathrm{Prob}(X\le x,Y\le y)}{\partial x\,\partial y}.$$
The expectation of a function $g(x,y)$ is given by
$${E}_{X,Y}\left[g(x,y)\right]={\int}_{-\infty}^{\infty}{\int}_{-\infty}^{\infty}g(x,y){f}_{XY}(x,y)\,dx\,dy.$$
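As a quick numerical sanity check on the expectation integral in Definition 2, one can average $g$ over samples drawn from $f_X$. The sketch below uses an illustrative choice of distribution and function, not one taken from the text: $X\sim\mathcal{N}(0,1)$ and $g(x)=x^2$, for which $E_X[g(X)]=1$.

```python
import numpy as np

# Monte Carlo sketch of E_X[g(X)] = integral of g(x) f_X(x) dx.
# Illustrative assumption: X ~ N(0, 1) and g(x) = x^2, so E[g(X)] = Var(X) = 1.
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)  # draws from f_X

estimate = np.mean(samples**2)          # sample mean approximates the integral
```

With 100,000 samples the estimate lands within a few thousandths of the true value 1.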
A Hilbert space of random variables
Definition 4 Let $\{{Y}_{1},\cdots ,{Y}_{n}\}$ be a collection of zero-mean ($E\left[{Y}_{i}\right]=0$) random variables. The space $H$ of all random variables that are linear combinations of those $n$ random variables is a Hilbert space with inner product
$$\langle X,Y\rangle =E\left[X\overline{Y}\right]={\int}_{-\infty}^{\infty}{\int}_{-\infty}^{\infty}x\overline{y}\,{f}_{XY}(x,y)\,\text{d}x\,\text{d}y.$$
We can easily check that this is a valid inner product:

- $\langle X,X\rangle =E\left[X\overline{X}\right]={\int}_{-\infty}^{\infty}{\left|x\right|}^{2}{f}_{X}\left(x\right)\,\text{d}x=E\left[{\left|X\right|}^{2}\right]\ge 0$;
- $\langle X,X\rangle =0$ if and only if ${f}_{X}\left(x\right)=\delta \left(x\right)$, i.e., if $X$ is a random variable that is deterministically zero (and this random variable is the “zero” of this Hilbert space);
- $\langle X,Y\rangle =\overline{\langle Y,X\rangle}$;
- $\langle X+Y,Z\rangle =E\left[(X+Y)\overline{Z}\right]=E\left[X\overline{Z}\right]+E\left[Y\overline{Z}\right]=\langle X,Z\rangle +\langle Y,Z\rangle$.
Note in particular that orthogonality, i.e., $\langle X,Y\rangle =0$, implies $E\left[X\overline{Y}\right]=0$, i.e., $X$ and $Y$ are uncorrelated random variables (independence is the stronger condition: independent random variables are uncorrelated, but not conversely). Additionally, the induced norm $\parallel X\parallel =\sqrt{\langle X,X\rangle}=\sqrt{E\left[{\left|X\right|}^{2}\right]}$ is the standard deviation of the zero-mean random variable $X$.
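The inner product and its induced norm can be estimated from samples. The model below is an illustrative assumption: $X$ zero-mean with standard deviation 2 and $Y = 0.5X + $ noise, so that $\langle X,Y\rangle = 0.5\,\mathrm{Var}(X) = 2$ and $\|X\| = 2$.

```python
import numpy as np

# Sketch: estimate <X, Y> = E[X conj(Y)] and ||X|| = sqrt(E[|X|^2]) by sampling.
# Illustrative model: X ~ N(0, 4), Y = 0.5 X + N(0, 1) noise.
rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(0.0, 2.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)

inner_xy = np.mean(x * np.conj(y))       # <X, Y>, expected to be near 2
norm_x = np.sqrt(np.mean(np.abs(x)**2))  # induced norm = standard deviation, near 2
```

This also illustrates the last remark: for a zero-mean random variable, the induced norm computed this way is exactly the sample standard deviation.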
A Hilbert space of random vectors
One can define random vectors $X$, $Y$ whose entries are random variables:
$$X=\left[\begin{array}{c}{X}_{1}\\ \vdots \\ {X}_{N}\end{array}\right],\quad Y=\left[\begin{array}{c}{Y}_{1}\\ \vdots \\ {Y}_{N}\end{array}\right].$$
For these, the following inner product is an extension of that given above:
$$\langle X,Y\rangle =E\left[{Y}^{H}X\right]=E\left[\sum _{i=1}^{N}\overline{{Y}_{i}}{X}_{i}\right]=E\left[\mathrm{trace}\left(X{Y}^{H}\right)\right].$$
The induced norm is
$$\parallel X\parallel =\sqrt{\langle X,X\rangle}=\sqrt{E\left[{X}^{H}X\right]}=\sqrt{E\left[{\sum}_{i=1}^{N}{\left|{X}_{i}\right|}^{2}\right]},$$
the square root of the expected squared norm of the vector $X$.
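The equality between the two forms of the vector inner product rests on the per-realization algebraic identity $y^{H}x=\mathrm{trace}(xy^{H})$, which holds for any fixed vectors and hence survives the expectation; a quick numerical check (on arbitrary vectors) confirms it:

```python
import numpy as np

# For any fixed vectors x, y: y^H x = trace(x y^H), so the two
# expectations in the inner-product definition agree realization by realization.
rng = np.random.default_rng(2)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

lhs = np.conj(y) @ x                      # y^H x
rhs = np.trace(np.outer(x, np.conj(y)))  # trace(x y^H)
```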
Minimum mean square error estimation
In an MMSE estimation problem, we consider
$Y=AX+N$ , where
$X,Y$ are two random vectors and
$N$ is usually additive white Gaussian noise
(
$Y$ is
$m\times 1$ ,
$A$ is
$m\times n$ , X is
$n\times 1$ , and
$N\sim \mathcal{N}(0,{\sigma}^{2}I)$ is
$m\times 1$ ).
Due to this noise model, we want an estimate
$\widehat{X}$ of
$X$ that minimizes
$E\left[\parallel X,\widehat{X},{\parallel}^{2}\right]$ ; such an estimate has highest likelihood under an additive white Gaussian noise model. For computational simplicity, we often want to restrict the estimator to be linear, i.e.
$$\widehat{X}=KY=\left[\begin{array}{c}{K}_{1}^{H}\\ \vdots \\ {K}_{n}^{H}\end{array}\right]Y,$$
where
${K}_{i}^{H}$ denotes the
${i}^{th}$ row of the estimation matrix
$K$ and
${\widehat{X}}_{i}={K}_{i}^{H}Y$ . We use the definition of the
${\ell}_{2}$ norm to simplify the equation:
$$\underset{K}{min}\,E\left[{\parallel X-\widehat{X}\parallel}_{2}^{2}\right]=\underset{K}{min}\,E\left[{\parallel X-KY\parallel}_{2}^{2}\right]=\underset{K}{min}\,E\left[\sum _{i=1}^{n}{\left|{X}_{i}-{K}_{i}^{H}Y\right|}^{2}\right]$$
Since each term in the sum depends on a different row ${K}_{i}$ of $K$ and each term is nonnegative, this minimization can be posed in terms of $n$ individual minimizations: for $i=1,2,\ldots,n$, we solve
$$\underset{{K}_{i}}{min}\,E\left[{\left|{X}_{i}-{K}_{i}^{H}Y\right|}^{2}\right]=\underset{{K}_{i}}{min}\,E\left[{\left|{X}_{i}-\sum _{j=1}^{m}\overline{{K}_{ij}}{Y}_{j}\right|}^{2}\right]=\underset{{K}_{i}}{min}\,{\left\Vert {X}_{i}-\sum _{j=1}^{m}\overline{{K}_{ij}}{Y}_{j}\right\Vert}^{2},$$
where the norm is the induced norm for the Hilbert space of random variables. Note at this point that the set of random variables ${\sum}_{j=1}^{m}\overline{{K}_{ij}}{Y}_{j}$ over the choices of ${K}_{i}$ can be written as $\mathrm{span}\left({\left\{{Y}_{j}\right\}}_{j=1}^{m}\right)$. Thus, the optimal ${K}_{i}$ is given by the coefficients of the closest point in $\mathrm{span}\left({\left\{{Y}_{j}\right\}}_{j=1}^{m}\right)$ to the random variable ${X}_{i}$ according to the induced norm for the Hilbert space of random variables. Therefore, we solve for
${K}_{i}$ using results from the projection theorem with the corresponding inner product. Recall that given a basis ${Y}_{j}$ for the subspace of interest, we obtain the equation ${\beta}_{i}=G{\left({K}_{i}^{H}\right)}^{T}=G\overline{{K}_{i}}$, where ${\beta}_{i,j}=\langle {X}_{i},{Y}_{j}\rangle$ and $G$ is the Gramian matrix. More specifically, we have
$$\underset{{\beta}_{i}}{\underbrace{\left[\begin{array}{c}\langle {X}_{i},{Y}_{1}\rangle\\ \langle {X}_{i},{Y}_{2}\rangle\\ \vdots \\ \langle {X}_{i},{Y}_{m}\rangle\end{array}\right]}}=\underset{G}{\underbrace{\left[\begin{array}{cccc}\langle {Y}_{1},{Y}_{1}\rangle& \langle {Y}_{2},{Y}_{1}\rangle& \cdots & \langle {Y}_{m},{Y}_{1}\rangle\\ \langle {Y}_{1},{Y}_{2}\rangle& \langle {Y}_{2},{Y}_{2}\rangle& \cdots & \langle {Y}_{m},{Y}_{2}\rangle\\ \vdots & \vdots & \ddots & \vdots \\ \langle {Y}_{1},{Y}_{m}\rangle& \langle {Y}_{2},{Y}_{m}\rangle& \cdots & \langle {Y}_{m},{Y}_{m}\rangle\end{array}\right]}}\underset{\overline{{K}_{i}}}{\underbrace{\left[\begin{array}{c}\overline{{K}_{i1}}\\ \overline{{K}_{i2}}\\ \vdots \\ \overline{{K}_{im}}\end{array}\right]}}.$$
Thus, one can solve for $\overline{{K}_{i}}={G}^{-1}{\beta}_{i}$. In the Hilbert space of random variables, we have
$$\begin{array}{cc}\hfill G& =\left[\begin{array}{cccc}E\left[{Y}_{1}\overline{{Y}_{1}}\right]& E\left[{Y}_{2}\overline{{Y}_{1}}\right]& \cdots & E\left[{Y}_{m}\overline{{Y}_{1}}\right]\\ E\left[{Y}_{1}\overline{{Y}_{2}}\right]& E\left[{Y}_{2}\overline{{Y}_{2}}\right]& \cdots & E\left[{Y}_{m}\overline{{Y}_{2}}\right]\\ \vdots & \vdots & \ddots & \vdots \\ E\left[{Y}_{1}\overline{{Y}_{m}}\right]& E\left[{Y}_{2}\overline{{Y}_{m}}\right]& \cdots & E\left[{Y}_{m}\overline{{Y}_{m}}\right]\end{array}\right]={R}_{Y},\hfill \\ \hfill {\beta}_{i}& =\left[\begin{array}{c}E\left[{X}_{i}\overline{{Y}_{1}}\right]\\ E\left[{X}_{i}\overline{{Y}_{2}}\right]\\ \vdots \\ E\left[{X}_{i}\overline{{Y}_{m}}\right]\end{array}\right]={\rho}_{{X}_{i}Y}.\hfill \end{array}$$
Here ${R}_{Y}$ is the correlation matrix of the random vector $Y$ and ${\rho}_{{X}_{i}Y}$ is the cross-correlation vector of the random variable ${X}_{i}$ and the vector $Y$. Thus, we have $\overline{{K}_{i}}={G}^{-1}{\beta}_{i}={R}_{Y}^{-1}{\rho}_{{X}_{i}Y}$, and so ${K}_{i}^{H}={\rho}_{{X}_{i}Y}^{T}{R}_{Y}^{-1}$. Concatenating all the rows of $K$ together, we get $K={R}_{X,Y}{R}_{Y}^{-1}$, where ${R}_{X,Y}$ is the cross-correlation matrix for the random vectors $X$ and $Y$. We therefore obtain the optimal linear estimator $\widehat{X}=KY={R}_{X,Y}{R}_{Y}^{-1}Y$.
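The closed form $K={R}_{X,Y}{R}_{Y}^{-1}$ can be checked numerically by replacing the correlation matrices with sample averages over many draws of $Y=AX+N$. The dimensions, noise level, and seed below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Sketch: linear MMSE estimator Xhat = K Y with K = R_XY R_Y^{-1},
# where the correlations are estimated from samples of Y = A X + N.
rng = np.random.default_rng(3)
n, m, trials = 2, 4, 200_000   # illustrative dimensions
A = rng.standard_normal((m, n))
sigma = 0.1                    # illustrative noise level

X = rng.standard_normal((n, trials))          # zero-mean input vectors
N = sigma * rng.standard_normal((m, trials))  # white Gaussian noise
Y = A @ X + N

R_Y = (Y @ Y.T) / trials       # sample correlation matrix of Y
R_XY = (X @ Y.T) / trials      # sample cross-correlation of X and Y
K = R_XY @ np.linalg.inv(R_Y)  # optimal linear estimator

# Residual error; for a well-conditioned A this is far below E[||X||^2] = n.
mse = np.mean(np.sum((X - K @ Y) ** 2, axis=0))
```

The estimator recovers $X$ up to the noise floor: the mean squared error is a small fraction of the prior energy $E[\|X\|^2]=n$, which the trivial estimator $\widehat{X}=0$ would incur.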
At first, there may be some confusion about the difference between least squares and minimum mean-square error estimation. To summarize:

- Least squares is applied when the quantities observed are deterministic (i.e., a “single draw” of data or observations).
- Minimum mean square error estimation is applied when random variables are observed under Gaussian noise; one must know a distribution over the inputs, and the error is measured in expectation.