0.11 Decision trees


Minimum complexity penalized function

Recall the basic results of the last lectures. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively, and let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables with unknown joint probability distribution $P_{XY}$. We would like to use $X$ to “predict” $Y$. Consider a loss function $0 \leq \ell(y_1, y_2) \leq 1$, $\forall y_1, y_2 \in \mathcal{Y}$, which measures the accuracy of our prediction. Let $F$ be a collection of candidate functions (models) $f : \mathcal{X} \to \mathcal{Y}$. The expected risk we incur is $R(f) \equiv E_{XY}[\ell(f(X), Y)]$. We have access only to $n$ i.i.d. samples $\{X_i, Y_i\}_{i=1}^{n}$, which allow us to compute the empirical risk $\hat{R}_n(f) \equiv \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i)$.

Assume in the following that $F$ is countable. Assign a positive number $c(f)$ to each $f \in F$ such that $\sum_{f \in F} 2^{-c(f)} \leq 1$. If we use a prefix code to describe each element of $F$ and define $c(f)$ to be the codeword length (in bits) of $f$, this inequality is automatically satisfied (it is the Kraft inequality).
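
As a small illustration of the condition on $c(f)$, the sketch below checks the inequality for a toy prefix code. The four-element class and its codewords are hypothetical and chosen only for the example; they do not come from the lecture.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Return the sum of 2^(-c) over the codeword lengths c, as an exact fraction."""
    return sum(Fraction(1, 2 ** c) for c in lengths)


def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another codeword."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )


# Hypothetical prefix code for a four-element class F = {f1, f2, f3, f4}.
code = {"f1": "0", "f2": "10", "f3": "110", "f4": "111"}

# c(f) is the codeword length in bits.
c = {name: len(word) for name, word in code.items()}

print(is_prefix_free(list(code.values())))  # the code is prefix-free
print(kraft_sum(c.values()) <= 1)           # so the inequality on c(f) holds
```

Lengths that cannot come from any prefix code, such as three codewords of length 1, violate the inequality: `kraft_sum([1, 1, 1])` exceeds 1.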

We define the minimum complexity penalized estimator as

$$\hat{f}_n \equiv \arg\min_{f \in F} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f) \log 2 + \frac{1}{2} \log n}{2n}} \right\} .$$
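
The definition above can be sketched in a few lines for a finite class. This is a minimal sketch and not the lecture's implementation: it assumes natural logarithms and the 0/1 loss, and the data, the three candidate classifiers and their code lengths $c(f)$ are hypothetical.

```python
import math


def penalty(c, n):
    """Complexity penalty sqrt((c*log 2 + 0.5*log n) / (2n)), natural logs assumed."""
    return math.sqrt((c * math.log(2) + 0.5 * math.log(n)) / (2 * n))


def empirical_risk(f, xs, ys):
    """Average 0/1 loss of classifier f on the sample (xs, ys)."""
    return sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)


def min_complexity_estimator(candidates, xs, ys):
    """Return the (name, f, c) triple minimizing empirical risk plus penalty."""
    n = len(xs)
    return min(
        candidates,
        key=lambda cand: empirical_risk(cand[1], xs, ys) + penalty(cand[2], n),
    )


# Hypothetical sample: labels switch from 0 to 1 at x = 0.5.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical candidates with code lengths satisfying 2^-1 + 2^-2 + 2^-3 <= 1.
candidates = [
    ("constant_0", lambda x: 0, 1),
    ("threshold_0.5", lambda x: int(x >= 0.5), 2),
    ("threshold_0.25", lambda x: int(x >= 0.25), 3),
]

best_name, best_f, best_c = min_complexity_estimator(candidates, xs, ys)
print(best_name)
```

On this sample the threshold at 0.5 has zero empirical risk and only a slightly larger penalty than the constant classifier, so it attains the minimum.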

As we showed previously, we have the bound

$$E[R(\hat{f}_n)] \leq \min_{f \in F} \left\{ R(f) + \sqrt{\frac{c(f) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}} \right\} .$$

On average, the risk of $\hat{f}_n$ is therefore no larger than

$$R(f_n^*) + \sqrt{\frac{c(f_n^*) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}},$$

where

$$f_n^* = \arg\min_{f \in F} \left\{ R(f) + \sqrt{\frac{c(f) \log 2 + \frac{1}{2} \log n}{2n}} \right\} .$$

If it happens that the optimal function, that is

$$f^* = \arg\min_{f \text{ measurable}} R(f),$$

is close to an $f \in F$ with a small $c (f)$ , then ${\hat{f}}_{n}$ will perform almost as well as the optimal function.

Suppose $f^* \in F$. Then

$$E[R(\hat{f}_n)] \leq R(f^*) + \sqrt{\frac{c(f^*) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}} .$$

Furthermore, if $c(f^*) = O(\log n)$, then

$$E[R(\hat{f}_n)] \leq R(f^*) + O\left( \sqrt{\frac{\log n}{n}} \right),$$

that is, only within a small $O\left( \sqrt{\frac{\log n}{n}} \right)$ offset of the optimal risk.
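
To make this last step explicit, here is a short derivation under the assumption (made here only for concreteness) that $c(f^*) \leq C \log n$ for some constant $C > 0$ and that $n \geq 3$, so that $\log n \geq 1$ and hence $\frac{1}{\sqrt{n}} \leq \sqrt{\frac{\log n}{n}}$:

$$\begin{aligned}
\sqrt{\frac{c(f^*) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}}
&\leq \sqrt{\frac{\left( C \log 2 + \frac{1}{2} \right) \log n}{2n}} + \frac{1}{\sqrt{n}} \\
&\leq \left( \sqrt{\frac{C \log 2 + \frac{1}{2}}{2}} + 1 \right) \sqrt{\frac{\log n}{n}} .
\end{aligned}$$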

In general, we can also bound the excess risk $E [R ({\hat{f}}_{n})] - R^{*}$ , where $R^{*}$ is the Bayes risk,

$$R^* = \inf_{f \text{ measurable}} R(f) .$$

By subtracting $R^{*}$ (a constant) from both sides of the inequality

$$E[R(\hat{f}_n)] \leq \min_{f \in F} \left\{ R(f) + \sqrt{\frac{c(f) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}} \right\}$$

we obtain

$$E[R(\hat{f}_n)] - R^* \leq \min_{f \in F} \left\{ R(f) - R^* + \sqrt{\frac{c(f) \log 2 + \frac{1}{2} \log n}{2n}} + \frac{1}{\sqrt{n}} \right\} .$$

Note the two terms in this upper bound: $R(f) - R^*$ is a bound on the approximation error of a model $f$, and the remainder is a bound on the estimation error associated with $f$. Thus, complexity regularization automatically optimizes a balance between approximation and estimation errors. In other words, complexity regularization is adaptive to the unknown tradeoff between approximation and estimation.
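
A toy numerical illustration of this balance follows. It is only a sketch under stated assumptions: the class consists of hypothetical models $f_k$ with $c(f_k) = k$ bits, and the approximation error $R(f_k) - R^* = 0.4 / k$ is invented for the example. In practice this quantity is unknown, which is exactly why the adaptivity matters.

```python
import math


def penalty(c, n):
    """Estimation-error term sqrt((c*log 2 + 0.5*log n) / (2n)), natural logs assumed."""
    return math.sqrt((c * math.log(2) + 0.5 * math.log(n)) / (2 * n))


def approx_error(k):
    """HYPOTHETICAL approximation error R(f_k) - R* of the k-th model."""
    return 0.4 / k


def bound(k, n):
    """Excess-risk bound for model k: approximation + estimation + 1/sqrt(n)."""
    return approx_error(k) + penalty(k, n) + 1 / math.sqrt(n)


def best_model(n, max_k=60):
    """Index k in 1..max_k minimizing the bound; c(f_k) = k keeps sum 2^-k <= 1."""
    return min(range(1, max_k + 1), key=lambda k: bound(k, n))


for n in (100, 1_000, 10_000):
    k = best_model(n)
    print(n, k, round(bound(k, n), 4))
```

With more data the estimation term shrinks, so the minimizer moves to more complex models with smaller approximation error, and the bound itself decreases.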

Classification

Consider the specialization of the above to a classification scenario. Let $\mathcal{X} = [0, 1]^d$, $\mathcal{Y} = \{0, 1\}$ and $\ell(\hat{y}, y) \equiv 1_{\{\hat{y} \neq y\}}$. Then $R(f) = E_{XY}[1_{\{f(X) \neq Y\}}] = P(f(X) \neq Y)$. The Bayes risk is given by

$$R^* = \inf_{f \text{ measurable}} R(f) .$$

As observed before, the Bayes classifier (i.e., a classifier that achieves the Bayes risk) is given by

$$f^*(x) = \begin{cases} 1, & P(Y = 1 | X = x) \geq \frac{1}{2} \\ 0, & P(Y = 1 | X = x) < \frac{1}{2} \end{cases} .$$

This classifier can be expressed in a different way. Consider the set $G^* = \{x : P(Y = 1 | X = x) \geq 1/2\}$. The Bayes classifier can be written as $f^*(x) = 1_{\{x \in G^*\}}$. The classifier is therefore characterized entirely by the set $G^*$: if $X \in G^*$ then the “best” guess is $Y = 1$, and otherwise it is $Y = 0$. The boundary of this set corresponds to the points where the decision is hardest, and is called the Bayes decision boundary. In [link] (a) this concept is illustrated. If $\eta(x) = P(Y = 1 | X = x)$ is a continuous function, then the Bayes decision boundary is simply given by $\{x : \eta(x) = 1/2\}$. Clearly, the structure of the decision boundary provides important information on the difficulty of the problem.
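
The set-based description can be made concrete with a one-dimensional toy example. Everything specific in it is an assumption made for illustration: $d = 1$, $X$ uniform on $[0, 1]$ and $\eta(x) = x$, so that $G^* = [1/2, 1]$ and the Bayes decision boundary is the single point $x = 1/2$. The sketch also uses the standard identity $R^* = E[\min(\eta(X), 1 - \eta(X))]$, which is not derived in this section.

```python
def eta(x):
    """HYPOTHETICAL regression function eta(x) = P(Y = 1 | X = x) on [0, 1]."""
    return x


def bayes_classifier(x):
    """Indicator of G* = {x : eta(x) >= 1/2}."""
    return 1 if eta(x) >= 0.5 else 0


def bayes_risk(num_cells=1000):
    """Midpoint-rule approximation of E[min(eta(X), 1 - eta(X))], X ~ Uniform[0, 1]."""
    total = 0.0
    for i in range(num_cells):
        x = (i + 0.5) / num_cells  # midpoint of the i-th cell
        total += min(eta(x), 1 - eta(x))
    return total / num_cells


print(bayes_classifier(0.2), bayes_classifier(0.8))
print(bayes_risk())
```

For this choice of $\eta$ the integral can be done by hand and equals $1/4$, which the numerical value matches up to floating-point error.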

Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
