Recap: classifier design
Given a set of training data $\{X_i, Y_i\}_{i=1}^{n}$ and a finite collection of candidate functions $\mathcal{F}$, select $\widehat{f}_n \in \mathcal{F}$ that (hopefully) is a good predictor for future cases.
That is,
$$\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$$
where $\widehat{R}_n(f)$ is the empirical risk. For any particular $f \in \mathcal{F}$, the corresponding empirical risk is defined as
$$\widehat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}}.$$
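The empirical risk is just the fraction of training points a classifier mislabels. A minimal sketch in Python (the threshold classifier `f` and the toy data are illustrative assumptions, not from the text):

```python
import numpy as np

def empirical_risk(f, X, Y):
    """(1/n) * sum of indicators {f(X_i) != Y_i}: the average
    number of training mistakes made by classifier f."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != Y)

# Toy data and a hypothetical threshold rule for illustration.
X = np.array([0.1, 0.4, 0.6, 0.9])
Y = np.array([0, 0, 1, 1])
f = lambda x: int(x > 0.5)
print(empirical_risk(f, X, Y))  # 0.0: f makes no mistakes on this sample
```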
Hoeffding's inequality
Hoeffding's inequality (Chernoff's bound in this case) allows us to gauge how close $\widehat{R}_n(f)$ is to the true risk of $f$, $R(f)$, in probability:
$$P\left(|\widehat{R}_n(f) - R(f)| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}.$$
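A quick Monte Carlo check of this bound (a sketch, not part of the derivation): for a fixed classifier, $\widehat{R}_n(f)$ is an average of $n$ i.i.d. Bernoulli losses with mean $R(f)$, so we can simulate the deviation probability and compare it to $2e^{-2n\epsilon^2}$. The values of $n$, $R(f)$, and $\epsilon$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, eps, trials = 100, 0.3, 0.1, 20000

# Each trial draws n Bernoulli(R) losses; empirical risk = their mean.
emp_risks = rng.binomial(n, R, size=trials) / n
deviation_prob = np.mean(np.abs(emp_risks - R) >= eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)

# The bound 2*exp(-2) ~ 0.271 holds for any distribution; the observed
# frequency is typically much smaller, since Hoeffding is not tight.
print(deviation_prob, hoeffding_bound)
```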
Since our selection process involves deciding among all $f \in \mathcal{F}$, we would like to gauge how close the empirical risks are to their expected values. We can do this by studying the probability that one or more of the empirical risks deviates significantly from its expected value. This is captured by the probability
$$P\left(\max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \epsilon\right).$$
Note that the event
$$\max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \epsilon$$
is equivalent to the union of the events
$$\bigcup_{f \in \mathcal{F}} \left\{ |\widehat{R}_n(f) - R(f)| \ge \epsilon \right\}.$$
Therefore, we can use Bonferroni's bound (aka the “union of events” or “union” bound) to obtain
$$\begin{array}{rcl} P\left(\max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \epsilon\right) & = & P\left(\bigcup_{f \in \mathcal{F}} \left\{ |\widehat{R}_n(f) - R(f)| \ge \epsilon \right\}\right) \\ & \le & \displaystyle\sum_{f \in \mathcal{F}} P\left(|\widehat{R}_n(f) - R(f)| \ge \epsilon\right) \\ & \le & \displaystyle\sum_{f \in \mathcal{F}} 2e^{-2n\epsilon^2} \\ & = & 2|\mathcal{F}| e^{-2n\epsilon^2} \end{array}$$
where
$\left|\mathcal{F}\right|$ is the number of classifiers in
$\mathcal{F}$ . In the proof of
Hoeffding's inequality we also obtained a one-sided inequality that implied
$$P\left(R(f) - \widehat{R}_n(f) \ge \epsilon\right) \le e^{-2n\epsilon^2}$$
and hence
$$P\left(\max_{f \in \mathcal{F}} \, R(f) - \widehat{R}_n(f) \ge \epsilon\right) \le |\mathcal{F}| e^{-2n\epsilon^2}.$$
We can restate the inequality above as follows. For all $f \in \mathcal{F}$ and for all $\delta > 0$, with probability at least $1 - \delta$,
$$R(f) \le \widehat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}.$$
This follows by setting $\delta = |\mathcal{F}| e^{-2n\epsilon^2}$ and solving for $\epsilon$. Thus with high probability $(1 - \delta)$,
the true risk for all $f \in \mathcal{F}$ is bounded by the empirical risk of $f$ plus a constant that depends on $\delta > 0$, the number of training samples $n$, and the size of $\mathcal{F}$. Most importantly, the bound does not depend on the unknown distribution $P_{XY}$. Therefore, we can call this a distribution-free bound.
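Because the bound is distribution-free, the penalty term can be computed from $|\mathcal{F}|$, $n$, and $\delta$ alone. A minimal sketch (the particular values of $|\mathcal{F}|$, $n$, and $\delta$ are illustrative assumptions):

```python
import math

def penalty(F_size, n, delta):
    """sqrt((log|F| + log(1/delta)) / (2n)): the amount added to the
    empirical risk to bound the true risk of every f in F simultaneously,
    with probability at least 1 - delta."""
    return math.sqrt((math.log(F_size) + math.log(1 / delta)) / (2 * n))

# E.g. |F| = 2**10 classifiers, n = 1000 samples, 95% confidence.
print(penalty(2**10, 1000, 0.05))  # about 0.07
```

Note how the penalty grows only logarithmically in $|\mathcal{F}|$ but shrinks like $1/\sqrt{n}$.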
Error bounds
We can use the distribution-free bound above to obtain a bound on the expected performance of the minimum empirical risk classifier
$$\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f).$$
We are interested in bounding
$$E\left[R(\widehat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f),$$
the expected risk of
${\widehat{f}}_{n}$ minus the minimum risk for all
$f \in \mathcal{F}$. Note that this difference is always non-negative since $\widehat{f}_n$ is at best as good as
$$f^* = \arg\min_{f \in \mathcal{F}} R(f).$$
Recall that
$\forall f\in \mathcal{F}$ and
$\forall \delta >0$ , with
probability at least
$1-\delta $
$$R\left(f\right)\le {\widehat{R}}_{n}\left(f\right)+C(\mathcal{F},n,\delta )$$
where
$$C(\mathcal{F}, n, \delta) = \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}.$$
In particular, since this holds for all
$f\in \mathcal{F}$ including
${\widehat{f}}_{n}$ ,
$$R\left({\widehat{f}}_{n}\right)\le {\widehat{R}}_{n}\left({\widehat{f}}_{n}\right)+C(\mathcal{F},n,\delta )$$
and for any other
$f\in \mathcal{F}$
$$R\left({\widehat{f}}_{n}\right)\le {\widehat{R}}_{n}\left(f\right)+C(\mathcal{F},n,\delta )$$
since
${\widehat{R}}_{n}\left({\widehat{f}}_{n}\right)\le {\widehat{R}}_{n}\left(f\right)$
$\forall f\in \mathcal{F}$ .
In particular,
$$R(\widehat{f}_n) \le \widehat{R}_n(f^*) + C(\mathcal{F}, n, \delta)$$
where $f^* = \arg\min_{f \in \mathcal{F}} R(f)$.
Let
$\Omega $ denote the set of events on which the above
inequality holds. Then by definition
$$P\left(\Omega \right)\ge 1-\delta .$$
We can now bound $E\left[R(\widehat{f}_n)\right] - R(f^*)$ as follows:
$$\begin{array}{rcl} E\left[R(\widehat{f}_n)\right] - R(f^*) & = & E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*) + \widehat{R}_n(f^*) - R(f^*)\right] \\ & = & E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*)\right] \end{array}$$
since $E\left[\widehat{R}_n(f^*)\right] = R(f^*)$. The quantity above is bounded as follows:
$$\begin{array}{rcl} E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*)\right] & = & E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*) \,\middle|\, \Omega\right] P(\Omega) + E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*) \,\middle|\, \overline{\Omega}\right] P(\overline{\Omega}) \\ & \le & E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*) \,\middle|\, \Omega\right] + \delta \end{array}$$
since $P(\Omega) \le 1$, $1 - P(\Omega) \le \delta$, and $R(\widehat{f}_n) - \widehat{R}_n(f^*) \le 1$. Moreover, since $\widehat{R}_n(\widehat{f}_n) \le \widehat{R}_n(f^*)$,
$$\begin{array}{rcl} E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*) \,\middle|\, \Omega\right] & \le & E\left[R(\widehat{f}_n) - \widehat{R}_n(\widehat{f}_n) \,\middle|\, \Omega\right] \\ & \le & C(\mathcal{F}, n, \delta). \end{array}$$
Thus
$$E\left[R(\widehat{f}_n) - \widehat{R}_n(f^*)\right] \le C(\mathcal{F}, n, \delta) + \delta.$$
So we have
$$E\left[R(\widehat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} + \delta, \quad \forall \delta > 0.$$
In particular, for $\delta = \sqrt{1/n}$, we have
$$\begin{array}{rcl} E\left[R(\widehat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) & \le & \sqrt{\dfrac{\log|\mathcal{F}| + \log n}{2n}} + \dfrac{1}{\sqrt{n}} \\ & \le & \sqrt{\dfrac{\log|\mathcal{F}| + \log n + 2}{n}}, \quad \text{since } \sqrt{x} + \sqrt{y} \le \sqrt{2}\sqrt{x + y}, \ \forall x, y > 0. \end{array}$$
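To see how this excess-risk bound shrinks with the sample size, we can evaluate $\sqrt{(\log|\mathcal{F}| + \log n + 2)/n}$ directly (a sketch; the choice $|\mathcal{F}| = 2^{10}$ and the sample sizes are illustrative):

```python
import math

def excess_risk_bound(F_size, n):
    """Bound on E[R(f_hat_n)] - min_f R(f) under the choice delta = 1/sqrt(n)."""
    return math.sqrt((math.log(F_size) + math.log(n) + 2) / n)

# The bound decays roughly like sqrt(log(n)/n) as n grows.
for n in (100, 1000, 10000):
    print(n, excess_risk_bound(2**10, n))
```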
Application: histogram classifier
Let $\mathcal{F}$ be the collection of all histogram classifiers with $M$ equal-volume cells. Then $|\mathcal{F}| = 2^M$, and the histogram classification rule
$$\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}}$$
satisfies
$$E\left[R(\widehat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{M \log 2 + 2 + \log n}{n}},$$
which suggests the choice $M = \log_2 n$ (balancing $M \log 2$ with $\log n$), resulting in
$$E\left[R(\widehat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) = O\left(\sqrt{\frac{\log n}{n}}\right).$$
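A minimal histogram-classifier sketch on $[0, 1]$ with $M$ equal-width cells (the synthetic data are an assumption for illustration). Labeling each cell by majority vote minimizes the number of training mistakes cell by cell, so it is exactly the empirical risk minimizer over this $\mathcal{F}$:

```python
import numpy as np

def fit_histogram_classifier(X, Y, M):
    """ERM over histogram rules with M equal-width cells on [0, 1]:
    each cell gets the majority label of the training points it contains."""
    cells = np.minimum((X * M).astype(int), M - 1)  # cell index of each X_i
    labels = np.zeros(M, dtype=int)
    for m in range(M):
        in_cell = Y[cells == m]
        if in_cell.size > 0 and in_cell.mean() > 0.5:
            labels[m] = 1
    return lambda x: labels[min(int(x * M), M - 1)]

# Synthetic data: label 1 iff x > 0.5, with M chosen as log2(n).
rng = np.random.default_rng(0)
n = 512
X = rng.random(n)
Y = (X > 0.5).astype(int)
M = int(np.log2(n))  # 9 cells here
f_hat = fit_histogram_classifier(X, Y, M)
train_error = np.mean(np.array([f_hat(x) for x in X]) != Y)
print(train_error)  # small but nonzero: 0.5 falls inside one mixed cell
```

The only training mistakes come from the cell straddling the true boundary at $0.5$, which is the approximation-error side of the trade-off that choosing $M$ balances against the $\sqrt{M \log 2 / n}$ estimation term.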