Review: pac bounds
Consider a finite collection of models
$\mathcal{F}$ , and
recall the basic PAC bound: for any
$\delta >0$ , with probability at least
$1-\delta $
$$\begin{array}{c}\hfill R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+\sqrt{\frac{log\left|\mathcal{F}\right|+log(1/\delta )}{2n}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}},\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\forall f\in \mathcal{F}\end{array}$$
where
$$\begin{array}{ccc}\hfill {\widehat{R}}_{n}\left(f\right)& =& \frac{1}{n}\sum _{i=1}^{n}\ell (f\left({X}_{i}\right),{Y}_{i})\hfill \\ \hfill R\left(f\right)& =& E\left[\ell ,\left(,f,\right(,X,),,,Y,)\right]\hfill \end{array}$$
and the loss
$\ell $ is assumed to be bounded between 0 and 1.
Note that we can write the inequality above as:
$$\begin{array}{c}\hfill R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+\sqrt{\frac{log\left(\frac{\left|\mathcal{F}\right|}{\delta}\right)}{2n}}\end{array}$$
Letting
${\delta}_{f}=\frac{\delta}{\left|\mathcal{F}\right|}$ , we have:
$$\begin{array}{c}\hfill R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+\sqrt{\frac{log(1/{\delta}_{f})}{2n}}\end{array}$$
This is precisely the form of Hoeffding's inequality, with
${\delta}_{f}$ in place of the usual
$\delta $ . In effect, in order to
have Hoeffding's inequality hold with probability
$1-\delta $ for all
$f\in \mathcal{F}$ , we must distribute the “
$\delta $ -budget” or “confidence-budget”
over all
$f\in \mathcal{F}$ (in this case, evenly distributed):
$$\begin{array}{ccc}\hfill \sum _{f\in \mathcal{F}}{\delta}_{f}& =& \sum _{f\in \mathcal{F}}\frac{\delta}{\left|\mathcal{F}\right|}\hfill \\ & =& \delta \hfill \end{array}$$
However, to apply the union bound, we do not need to
distribute
$\delta $ evenly among the candidate models. We only
require:
$$\begin{array}{ccc}\hfill \sum _{f\in \mathcal{F}}{\delta}_{f}& =& \delta \hfill \end{array}$$
So, if
$p\left(f\right)$ are positive numbers satisfying
${\sum}_{f\in \mathcal{F}}p\left(f\right)=1$ , then we can take
${\delta}_{f}=p\left(f\right)\delta $ . This
provides two advantages:
- By choosing
$p\left(f\right)$ larger for certain
$f$ , we can
preferentially treat those candidates
- We do not need
$\mathcal{F}$ to be finite and we only require
${\sum}_{f\in \mathcal{F}}p\left(f\right)=1$
Prefix codes are one way to achieve this. If we assign a binary
prefix code of length
$c\left(f\right)$ to each
$f\in \mathcal{F}$ , then the values
$p\left(f\right)={2}^{-c\left(f\right)}$ satisfy
${\sum}_{f\in \mathcal{F}}p\left(f\right)\le 1$ according
to the Kraft inequality.
The main point of this lecture is to examine how PAC bounds of the
form w.p.
$\ge 1-\delta $
$$\begin{array}{c}\hfill R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+\sqrt{\frac{c\left(f\right)log2+log(1/\delta )}{2n}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}},\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\forall f\in \mathcal{F}\end{array}$$
can be used to select a model that comes close to achieving the
best possible performace
$$\begin{array}{c}\hfill \underset{f\in \mathcal{F}}{inf}R\left(f\right)\end{array}$$
Let
${\widehat{f}}_{n}$ be the model selected from
$\mathcal{F}$ using the training data
${\{{X}_{i},{Y}_{i}\}}_{i=1}^{n}$ . We will specify
this model in a moment, but keep in mind that it is notnecessarily the model with minimum empirical risk as before. We
would like to have
$$\begin{array}{c}\hfill E\left[R\left({\widehat{f}}_{n}\right)\right]-\underset{f\in \mathcal{F}}{inf}R\left(f\right)\end{array}$$
as small as possible. First, for any
$\delta >0$ , define
$$\begin{array}{ccc}\hfill {\widehat{f}}_{n}^{\delta}& =& arg\underset{f\in \mathcal{F}}{min}\left\{{\widehat{R}}_{n},\left(f\right),+,C,(f,n,\delta )\right\}\hfill \end{array}$$
where
$$\begin{array}{c}\hfill C(f,n,\delta )\equiv \sqrt{\frac{c\left(f\right)log2+log(1/\delta )}{2n}}\end{array}$$
Then w.p.
$\ge 1-\delta $
$$\begin{array}{c}\hfill R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+C(f,n,\delta )\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}},\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\phantom{\rule{4pt}{0ex}}\forall f\in \mathcal{F}\end{array}$$
and in particular,
$$\begin{array}{c}\hfill R\left({\widehat{f}}_{n}^{\delta}\right)\le {\widehat{R}}_{n}\left({\widehat{f}}_{n}^{\delta}\right)+C({\widehat{f}}_{n}^{\delta},n,\delta )\phantom{\rule{4pt}{0ex}},\end{array}$$
so, by the definition of
${\widehat{f}}_{n}^{\delta}$ ,
$\forall f\in \mathcal{F}$
$$\begin{array}{c}\hfill R\left({\widehat{f}}_{n}^{\delta}\right)\le {\widehat{R}}_{n}\left(f\right)+C(f,n,\delta )\phantom{\rule{4pt}{0ex}}.\end{array}$$
We will make use of the inequality above in a moment. First note
that
$\forall f\phantom{\rule{4pt}{0ex}}\in \mathcal{F}$
$$\begin{array}{ccc}\hfill E\left[R\left({\widehat{f}}_{n}^{\delta}\right)\right]-R\left(f\right)& =& E[R\left({\widehat{f}}_{n}^{\delta}\right)-{\widehat{R}}_{n}\left(f\right)]+E[{\widehat{R}}_{n}\left(f\right)-R\left(f\right)]\hfill \end{array}$$
The second term is exactly 0, since
$E\left[{\widehat{R}}_{n}\left(f\right)\right]=R\left(f\right)$ .
Now consider the first term
$E[R\left({\widehat{f}}_{n}^{\delta}\right)-{\widehat{R}}_{n}\left(f\right)]$ . Let
$\Omega $ be the set of events on which
$$\begin{array}{c}\hfill R\left({\widehat{f}}_{n}^{\delta}\right)\le {\widehat{R}}_{n}\left(f\right)-C(f,n,\delta ),\phantom{\rule{3.33333pt}{0ex}}\forall \phantom{\rule{3.33333pt}{0ex}}f\in \mathcal{F}\end{array}$$
From the bound above, we know that
$P\left(\Omega \right)\ge 1-\delta $ .
Thus,
$$\begin{array}{ccc}\hfill E[R\left({\widehat{f}}_{n}^{\delta}\right)-{\widehat{R}}_{n}\left(f\right)]& =& E[R\left({\widehat{f}}_{n}^{\delta}\right)-{\widehat{R}}_{n}\left(f\right)|\Omega ]P\left(\Omega \right)+E[R\left({\widehat{f}}_{n}^{\delta}\right)-{\widehat{R}}_{n}\left(f\right)|{\Omega}^{c}](1-P\left(\Omega \right))\hfill \\ & \le & C(f,n,\delta )+\delta \phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{3.33333pt}{0ex}}(\text{since}\phantom{\rule{3.33333pt}{0ex}}0\le R,\phantom{\rule{3.33333pt}{0ex}}\widehat{R}\le 1,\phantom{\rule{3.33333pt}{0ex}}P\left(\Omega \right)\le 1\phantom{\rule{3.33333pt}{0ex}}\text{and}\phantom{\rule{3.33333pt}{0ex}}1-P\left(\Omega \right)\le \delta )\hfill \\ & =& \sqrt{\frac{c\left(f\right)log2+log(1/\delta )}{2n}}+\delta \hfill \\ & =& \sqrt{\frac{c\left(f\right)log2+\frac{1}{2}logn}{2n}}+\frac{1}{\sqrt{n}}\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{3.33333pt}{0ex}}(\text{by}\phantom{\rule{4.pt}{0ex}}\phantom{\rule{4.pt}{0ex}}\text{setting}\phantom{\rule{3.33333pt}{0ex}}\delta =\frac{1}{\sqrt{n}})\hfill \end{array}$$
We can summarize our analysis with the following theorem.
Theorem
Complexity regularized model selection
Let
$\mathcal{F}$ be a countable
collection of models, and assign a positive number
$c\left(f\right)$ to
each
$f\in \mathcal{F}$ such that
${\sum}_{f\in \mathcal{F}}{2}^{-c\left(f\right)}\le 1$ .
Define the minimum complexity regularized risk model
$$\begin{array}{ccc}\hfill {\widehat{f}}_{n}& =& arg\underset{f\in \mathcal{F}}{min}\left\{{\widehat{R}}_{n},\left(f\right),+,\sqrt{\frac{c\left(f\right)log2+\frac{1}{2}logn}{2n}}\right\}\hfill \end{array}$$
Then,
$$\begin{array}{c}\hfill E\left[R\left({\widehat{f}}_{n}\right)\right]\phantom{\rule{4pt}{0ex}}\le \phantom{\rule{4pt}{0ex}}\underset{f\in \mathcal{F}}{inf}\left\{R,\left(f\right),+,\sqrt{\frac{c\left(f\right)log2+\frac{1}{2}logn}{2n}},+,\frac{1}{\sqrt{n}}\right\}\end{array}$$
This shows that
$$\begin{array}{c}\hfill {\widehat{R}}_{n}\left(f\right)+\sqrt{\frac{c\left(f\right)log2+\frac{1}{2}logn}{2n}}\end{array}$$
is a reasonable surrogate for
$$\begin{array}{c}\hfill R\left(f\right)+\sqrt{\frac{c\left(f\right)log2+\frac{1}{2}logn}{2n}}\end{array}$$
Histogram classifiers
Let
$\mathcal{X}={[0,1]}^{d}$ be the input
space and
$\mathcal{Y}=\{0,1\}$ be the output space. Let
${\mathcal{F}}_{k}$ , k = 1, 2,
... denotes the collection of histogram classification rules withk equal volume bins. One choice of prefix code for this example is:
$k=1\Rightarrow \text{code}=0,k=3\Rightarrow \text{code}=10,k=3\Rightarrow \text{code}=110$ and so on .... Then, if
first code is corresponding to k
$\Rightarrow f\in {\mathcal{F}}_{k}$ , followed
by
$k={log}_{2}\left|{\mathcal{F}}_{k}\right|$ bits to indicate which of the
${2}^{k}$ histogram
rules in
${\mathcal{F}}_{k}$ is under consideration, we have
$$\begin{array}{c}\hfill f\in {\mathcal{F}}_{k}\Rightarrow c\left(f\right)=2k\phantom{\rule{3.33333pt}{0ex}}bits\end{array}$$
Let
${\widehat{f}}_{n}$ be the model that solves the minimization
i.e.,
$$\begin{array}{c}\hfill \underset{k\ge 1}{min}\left\{\underset{f\in {\mathcal{F}}_{k}}{min},{\widehat{R}}_{n},\left(f\right),+,\sqrt{\frac{2klog2+\frac{1}{2}logn}{2n}}\right\}\end{array}$$
That is, for each k, let
$$\begin{array}{ccc}\hfill {\widehat{f}}_{n}^{\left(k\right)}& =& arg\underset{f\in {\mathcal{F}}_{k}}{min}{\widehat{R}}_{n}\left(f\right)\hfill \end{array}$$
Then select the best
$k$ according to
$$\begin{array}{ccc}\hfill \widehat{k}& =& arg\underset{k\ge 1}{min}\left\{{\widehat{R}}_{n},\left({\widehat{f}}_{n}^{\left(k\right)}\right),+,\sqrt{\frac{2klog2+\frac{1}{2}logn}{2n}}\right\}\hfill \end{array}$$
and set
$$\begin{array}{ccc}\hfill {\widehat{f}}_{n}& =& {\widehat{f}}_{n}^{\left(\widehat{k}\right)}\hfill \end{array}$$
Then,
$$\begin{array}{c}\hfill E\left[R\left({\widehat{f}}_{n}\right)\right]\phantom{\rule{4pt}{0ex}}\le \phantom{\rule{4pt}{0ex}}\underset{k\ge 1}{inf}\left\{\underset{f\in {\mathcal{F}}_{k}}{min},R,\left(f\right),+,\sqrt{\frac{2klog2+\frac{1}{2}logn}{2n}},+,\frac{1}{\sqrt{n}}\right\}\end{array}$$
It is a simple exercise to show that if
$d=2$ and the
Bayes decision boundary is a 1-d curve, then by setting
$k=\sqrt{n}$ and selecting the best
$f$ from
${\mathcal{F}}_{\sqrt{n}}$ we have
$$\begin{array}{c}\hfill E\left[R\left({\widehat{f}}_{n}\right)\right]\phantom{\rule{4pt}{0ex}}=\phantom{\rule{4pt}{0ex}}O\left({n}^{-1/4}\right)\end{array}$$
The complexity regularized classifier
${\widehat{f}}_{n}$ adaptively
achieves this rate, without user intervention.