Theorem

Consistency of histogram classifiers

If $M \to \infty$ and $n/M \to \infty$ as $n \to \infty$, then the histogram classifier risk converges to the Bayes risk for every distribution $P_{XY}$ with marginal density $p_X(x) \ge c$, for some constant $c > 0$. Actually, the result holds for every distribution $P_{XY}$; for the more general theorem, refer to Theorem 6.1 in A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi.

What the theorem tells us is that we need the number of partition cells to tend to infinity (to ensure that the bias tends to zero), but it cannot grow faster than the number of samples (i.e., we want the number of samples per cell to tend to infinity, to drive the variance to zero).
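
To make the roles of $n$ and $M$ concrete, here is a minimal sketch (not part of the original notes) of a histogram classifier for a one-dimensional feature on $[0,1]$ with $M$ equal-width bins; the function names and the simulated distribution are illustrative assumptions only.

```python
import numpy as np

def fit_histogram_classifier(X, Y, M):
    """Estimate eta(x) = P(Y=1 | X=x) on [0, 1] with M equal-width bins.

    Returns the per-bin estimates (# samples in bin j with Y = 1) / (# samples in bin j);
    empty bins default to 0.
    """
    bins = np.minimum((X * M).astype(int), M - 1)     # bin index Q(x) of each sample
    counts = np.bincount(bins, minlength=M)           # N_j: number of samples per bin
    ones = np.bincount(bins, weights=Y, minlength=M)  # B_j: number of label-1 samples per bin
    return np.divide(ones, counts, out=np.zeros(M), where=counts > 0)

def predict(eta_hat, x):
    """Plug-in rule: predict 1 when the estimated eta in x's bin is at least 1/2."""
    M = len(eta_hat)
    j = min(int(x * M), M - 1)
    return int(eta_hat[j] >= 0.5)

# Illustrative use: eta(x) = x, so the Bayes rule is 1{x >= 1/2}.
rng = np.random.default_rng(0)
n, M = 10_000, 50          # consistency requires M -> infinity and n/M -> infinity
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < X).astype(float)
eta_hat = fit_histogram_classifier(X, Y, M)
print(predict(eta_hat, 0.3), predict(eta_hat, 0.8))  # typically prints 0 1
```

Growing $M$ with $n$ so that $n/M$ still diverges mirrors the two conditions in the theorem: many cells for low bias, many samples per cell for low variance.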

Let $P_j \equiv \dfrac{\int_{Q_j} \eta(x)\, p_X(x)\, dx}{\int_{Q_j} p_X(x)\, dx}$ (the theoretical analog of $\hat{P}_j$) and define

$$\bar{\eta}(x) = \sum_{j=1}^{M} P_j\, \mathbf{1}\{x \in Q_j\}$$

The function $\bar{\eta}$ is the theoretical analog of $\hat{\eta}_n$ (i.e., the function obtained by averaging $\eta$ over the partition cells). By the triangle inequality,

$$E\big|\hat{\eta}_n(X) - \eta(X)\big| \;\le\; \underbrace{E\big[|\hat{\eta}_n(X) - \bar{\eta}(X)|\big]}_{\text{estimation error}} \;+\; \underbrace{E\big[|\bar{\eta}(X) - \eta(X)|\big]}_{\text{approximation error}}$$

Let's first bound the estimation error. For any $x \in [0,1]^d$, let $Q(x)$ denote the histogram bin in which $x$ falls. Define the random variable

$$N(x) = \sum_{i=1}^{n} \mathbf{1}\{X_i \in Q(x)\}$$

If $Q(x) = Q_j$, then this random variable is simply $n\hat{P}_j$. Note that

$$\hat{\eta}_n(x) = \frac{1}{N(x)}\, B(x)$$

where $B(x) = \sum_{i=1}^{n} \mathbf{1}\{X_i \in Q(x),\, Y_i = 1\} = \sum_{i:\, X_i \in Q(x)} Y_i$; that is, $B(x)$ is simply the number of samples in cell $Q(x)$ labelled 1. Now $\hat{\eta}_n(x)$ is a fairly complicated random variable, but the conditional distribution of $B(x)$ given $N(x)$ is relatively simple. Note that

$$B(x) \,\big|\, N(x) = k \;\sim\; \text{Binomial}\big(k,\, \bar{\eta}(x)\big)$$

since $\bar{\eta}(x)$ is the probability that a sample in $Q(x)$ has the label 1, and we are conditioning on the event of observing $k$ samples in $Q(x)$.
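
To spell this out (a short justification added here for completeness): given that a sample lands in $Q(x)$, its label equals 1 with probability

$$P\big(Y = 1 \,\big|\, X \in Q(x)\big) \;=\; \frac{\int_{Q(x)} \eta(x')\, p_X(x')\, dx'}{\int_{Q(x)} p_X(x')\, dx'} \;=\; \bar{\eta}(x),$$

and the labels of the $k$ samples falling in $Q(x)$ are conditionally independent, so their sum $B(x)$ is $\text{Binomial}(k, \bar{\eta}(x))$.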

Now consider the conditional expectation

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big] \;\le\;
\begin{cases}
E\Big[\big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big], & k > 0 \\[4pt]
1, & k = 0 \quad (\text{since } 0 \le \bar{\eta}(x) \le 1)
\end{cases}$$

Next note that

$$E\Big[\Big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big]
= E\Big[\Big|\tfrac{B(x)}{k} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big]
= E\Big[\tfrac{1}{k}\,\big|B(x) - \underbrace{k\,\bar{\eta}(x)}_{E[B(x)]}\big| \;\Big|\; N(x) = k\Big]
\;\le\; \frac{1}{k}\Big(\underbrace{E\big[\,|B(x) - k\,\bar{\eta}(x)|^2 \;\big|\; N(x) = k\,\big]}_{\text{conditional variance of } B(x)}\Big)^{1/2}$$

by Jensen's inequality, $E[|Z|] \le \big(E[|Z|^2]\big)^{1/2}$.
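
For completeness (this step is not spelled out above), the inequality follows by applying Jensen's inequality to the concave function $\sqrt{\cdot}$:

$$E[|Z|] \;=\; E\big[\sqrt{Z^2}\big] \;\le\; \sqrt{E[Z^2]} \;=\; \big(E[|Z|^2]\big)^{1/2},$$

since $E[\varphi(W)] \le \varphi(E[W])$ for any concave function $\varphi$.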

Therefore,

$$E\Big[\Big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big] \;\le\; \frac{1}{k}\,\big(k\,\bar{\eta}(x)(1 - \bar{\eta}(x))\big)^{1/2} \;=\; \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}}$$

since the conditional variance of $B(x)$, a $\text{Binomial}(k, \bar{\eta}(x))$ random variable, is $k\,\bar{\eta}(x)(1 - \bar{\eta}(x))$,

and

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big] \;\le\;
\begin{cases}
\sqrt{\dfrac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}}, & k > 0 \\[6pt]
1, & k = 0
\end{cases}$$

or in other words,

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x)\Big] \;\le\; \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}}\;\mathbf{1}\{N(x) > 0\} \;+\; \mathbf{1}\{N(x) = 0\}$$

Now taking the expectation with respect to $N(x)$, and using the fact that $\bar{\eta}(x)(1 - \bar{\eta}(x)) \le 1/4$,

$$E_N\Big[E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)| \;\big|\; N(x) = k\,\big]\Big] \;\le\; E_N\left[\sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}}\;\mathbf{1}\{N(x) > 0\}\right] + P(N(x) = 0)$$

$$\le\; E\left[\frac{1}{2\sqrt{N(x)}}\,\mathbf{1}\{N(x) > 0\}\right] + P(N(x) = 0) \;\le\; \frac{1}{2}\,P(N(x) \le k) \;+\; \frac{1}{2\sqrt{k}}\,\underbrace{P(N(x) > k)}_{\le\, 1} \;+\; P(N(x) = 0)$$

for any fixed $k > 0$, where the last step splits the expectation according to whether $N(x) \le k$ or $N(x) > k$.

Now a key fact is that for any $k > 0$, $P(N(x) \le k) \to 0$ as $n \to \infty$. This follows from the assumption that the marginal density satisfies $p_X(x) \ge c$, for some constant $c > 0$, and that $n/M \to \infty$ as $n \to \infty$. The result is easily verified by contradiction: if $P(N(x) \le k) \to q > 0$ as $n \to \infty$, then the assumption $p_X(x) \ge c > 0$ would be contradicted (a quantitative version of this argument is sketched below, after the estimation error bound is completed). Thus, for any $\epsilon > 0$ there exists a $k > 0$ such that $\frac{1}{2\sqrt{k}} < \epsilon$ and, for $n$ sufficiently large, $P(N(x) \le k) < \epsilon$. Therefore, for $n$ sufficiently large and every $x \in [0,1]^d$,

$$E\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;<\; 3\epsilon$$

where the expectation is with respect to the distribution of the sample $\{X_i, Y_i\}_{i=1}^{n}$. Thus,

$$E\big|\hat{\eta}_n(X) - \bar{\eta}(X)\big| \;<\; 3\epsilon$$

where the expectation is now with respect to the distribution of the sample and the marginal distribution of $X$.
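
Returning to the key fact used above, that $P(N(x) \le k) \to 0$ for any fixed $k$, here is one way to make it quantitative (a sketch added here, assuming the $M$ cells are congruent cubes of volume $1/M$). Since $p_X(x) \ge c$, the count $N(x)$ is $\text{Binomial}(n, p)$ with $p = P(X \in Q(x)) \ge c/M$, so $E[N(x)] = np \ge cn/M \to \infty$, and by Chebyshev's inequality, once $np > k$,

$$P\big(N(x) \le k\big) \;\le\; P\big(|N(x) - np| \ge np - k\big) \;\le\; \frac{np(1 - p)}{(np - k)^2} \;\longrightarrow\; 0 \quad \text{as } n \to \infty.$$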

Next consider the approximation error $E|\bar{\eta}(X) - \eta(X)|$, where the expectation is over $X$ alone. The function $\eta$ may not itself be continuous, but there is another function $\eta_\epsilon$ that is uniformly continuous and such that $E|\eta_\epsilon(X) - \eta(X)| < \epsilon$ (such an $\eta_\epsilon$ exists because continuous functions are dense in $L^1(P_X)$, and a continuous function on the compact set $[0,1]^d$ is automatically uniformly continuous). Recall that uniformly continuous functions can be well approximated by piecewise constant functions.

By the triangle inequality,

$$E\big|\bar{\eta} - \eta\big| \;\le\; \underbrace{E\big|\bar{\eta} - \bar{\eta}_\epsilon\big|}_{\le\,\epsilon} \;+\; E\big|\bar{\eta}_\epsilon - \eta_\epsilon\big| \;+\; \underbrace{E\big|\eta_\epsilon - \eta\big|}_{\le\,\epsilon \text{ by design}}$$

where $\bar{\eta}_\epsilon(x) = \sum_{j=1}^{M} \dfrac{\int_{Q_j} \eta_\epsilon(x')\, p_X(x')\, dx'}{\int_{Q_j} p_X(x')\, dx'}\; \mathbf{1}\{x \in Q_j\}$, i.e., the analog of $\bar{\eta}$ with $\eta$ replaced by $\eta_\epsilon$. The first term is bounded as

$$E\big|\bar{\eta}(X) - \bar{\eta}_\epsilon(X)\big| \;\le\; \sum_{j=1}^{M} \int_{Q_j} \big|\eta(x) - \eta_\epsilon(x)\big|\, p_X(x)\, dx \;=\; E\big|\eta(X) - \eta_\epsilon(X)\big| \;\le\; \epsilon$$

and since $\eta_\epsilon$ is uniformly continuous,

$$E\big|\bar{\eta}_\epsilon(X) - \eta_\epsilon(X)\big| \;=\; \sum_{j=1}^{M} \int_{Q_j} \big|\bar{\eta}_\epsilon(x) - \eta_\epsilon(x)\big|\, p_X(x)\, dx \;\le\; \sum_{j=1}^{M} \delta\, P(X \in Q_j) \;=\; \delta,$$

where $\delta$ depends on $M$, since $\sum_{j=1}^{M} P(X \in Q_j) = 1$.

By taking $M$ sufficiently large, $\delta$ can be made arbitrarily small. So for $M$ large enough, $\delta \le \epsilon$.
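
To see how $\delta$ depends on $M$ (a sketch added here, assuming the cells $Q_j$ are cubes of side length $M^{-1/d}$): uniform continuity of $\eta_\epsilon$ means that for every $\delta > 0$ there is a $\tau > 0$ such that $|\eta_\epsilon(x) - \eta_\epsilon(x')| \le \delta$ whenever $\|x - x'\| \le \tau$. Each cell has diameter $\sqrt{d}\, M^{-1/d}$, which tends to $0$ as $M \to \infty$, and $\bar{\eta}_\epsilon(x)$ is a weighted average of $\eta_\epsilon$ over the cell containing $x$, so

$$\big|\bar{\eta}_\epsilon(x) - \eta_\epsilon(x)\big| \;\le\; \sup_{x' \in Q(x)} \big|\eta_\epsilon(x') - \eta_\epsilon(x)\big| \;\le\; \delta \quad \text{once } \sqrt{d}\, M^{-1/d} \le \tau.$$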

Thus, we have shown

$$E\big|\bar{\eta}(X) - \eta(X)\big| \;<\; 3\epsilon$$

for $M$ sufficiently large. Since $\epsilon > 0$ was arbitrary, we have shown that taking

$$\hat{f}_n(x) = \begin{cases} 1, & \hat{\eta}_n(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases}$$

satisfies

$$P\big(\hat{f}_n(X) \ne Y\big) - P\big(f^*(X) \ne Y\big) \;\le\; 2\, E\big|\hat{\eta}_n(X) - \eta(X)\big| \;\longrightarrow\; 0$$

if

$M \to \infty$ and $n/M \to \infty$ as $n \to \infty$. Here $P(\hat{f}_n(X) \ne Y) = E\big[\mathbf{1}\{\hat{f}_n(X) \ne Y\}\big]$ is the expected risk of $\hat{f}_n$, with the expectation taken over the distributions of $(X, Y)$ and $\{X_i, Y_i\}_{i=1}^{n}$.
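
The inequality $P(\hat{f}_n(X) \ne Y) - P(f^*(X) \ne Y) \le 2\,E|\hat{\eta}_n(X) - \eta(X)|$ is the standard excess-risk bound for plug-in classifiers; a short sketch of the argument (added here) is as follows. Conditioning on the training sample and on $X = x$,

$$P\big(\hat{f}_n(x) \ne Y \mid X = x\big) - P\big(f^*(x) \ne Y \mid X = x\big) \;=\; |2\eta(x) - 1|\; \mathbf{1}\{\hat{f}_n(x) \ne f^*(x)\},$$

and whenever $\hat{f}_n(x) \ne f^*(x)$, the value $1/2$ lies between $\eta(x)$ and $\hat{\eta}_n(x)$, so $|\eta(x) - 1/2| \le |\hat{\eta}_n(x) - \eta(x)|$. Hence the right-hand side is at most $2\,|\hat{\eta}_n(x) - \eta(x)|$, and taking expectations gives the bound.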

Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3