
Statistical learning methods develop decision rules or estimators based only on a collection of training examples, rather than on predetermined probability models. Statistical learning methods are often said to be distribution-free, since they do not assume particular probability models. The canonical set-up for statistical learning is as follows. We begin with a collection of training examples, $\{(X_i, Y_i)\}_{i=1}^n$, which are assumed to be independently and identically distributed according to an unknown probability distribution $P_{X,Y}(x,y)$. If we knew $P_{X,Y}(x,y)$, then we could compute a desired risk function and design an optimal decision rule using the methods described above. In essence, the training examples give us a glimpse of the underlying distribution, but our knowledge of it is far from complete. We cannot exactly compute a risk function, and therefore we cannot derive a corresponding optimal decision rule.

There are at least two ways to proceed at this point. One possibility is to use the training examples to estimate the joint probability distribution, and then use this estimate to derive a decision rule. Unfortunately, the (general-purpose) problem of estimating a distribution from a limited pool of data is often more difficult than the problem of designing a special-purpose decision rule. For this reason, a second possibility is more commonly favored in practice. Rather than estimating the complete distribution, one can use the training examples to directly design a decision rule. More precisely, perhaps the most common approach is to use the training examples to compute an estimate of the desired risk function.

Suppose that we are interested in minimizing a particular risk function. Recall that the risk is the expected value of a chosen loss function. Let $\ell(\hat{Y}, Y)$ denote the loss, and let $f(X)$ denote a candidate decision function, mapping observations to predictions about $Y$ (i.e., $\hat{Y} = f(X)$). The empirical risk function is constructed from the training examples as follows:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i).$$
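As a concrete illustration, the empirical risk is just an average of per-example losses. The sketch below uses a hypothetical binary-classification setup (my construction, not from the text) with the 0-1 loss to compute $\hat{R}(f)$ for one candidate rule.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """Average loss of decision rule f over the n training examples."""
    return float(np.mean([loss(f(x), y) for x, y in zip(X, Y)]))

# 0-1 loss: 1 if the prediction is wrong, 0 if it is right.
zero_one = lambda y_hat, y: float(y_hat != y)

# Hypothetical training sample: X uniform on [-1, 1], label Y = 1{X > 0}.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=50)
Y = (X > 0).astype(int)

f = lambda x: int(x > 0.2)  # a candidate threshold rule
print(empirical_risk(f, X, Y, zero_one))  # fraction of training errors
```

Any loss function with the signature `loss(y_hat, y)` can be substituted, e.g. squared loss for regression.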

This is simply the average loss of the decision rule $f$ over the set of training examples. Note that since the training examples are independent and identically distributed, the expected value of the empirical risk is equal to the true risk $R(f) = \mathbb{E}[\ell(f(X), Y)]$. Moreover, we know (by the law of large numbers) that the empirical risk tends to the true risk as the size of the training sample increases. These facts lend support to the idea of choosing a decision rule to minimize the empirical risk.
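The law-of-large-numbers behavior can be checked numerically. In the toy distribution below (an assumption for illustration), the true risk of a fixed threshold rule is known in closed form, and the empirical risk computed from ever larger samples approaches it.

```python
import numpy as np

# Toy distribution with a known answer: X ~ Uniform(-1, 1) and Y = 1{X > 0}.
# The rule f(x) = 1{x > 0.2} errs exactly when 0 < X <= 0.2, so its true
# 0-1 risk is R(f) = P(0 < X <= 0.2) = 0.1.
rng = np.random.default_rng(1)

def empirical_risk(n):
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = (X > 0).astype(int)
    Y_hat = (X > 0.2).astype(int)
    return float(np.mean(Y_hat != Y))

for n in (100, 10_000, 1_000_000):
    print(n, empirical_risk(n))  # approaches the true risk 0.1 as n grows
```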

Empirical risk minimization (ERM) is just this process. Given a collection of possible decision rules, say $\mathcal{F}$, ERM selects a decision rule according to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}(f).$$
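For a finite class of rules, ERM is a direct search: evaluate the empirical risk of every rule and keep the minimizer. A minimal sketch, assuming a hypothetical class of threshold classifiers $f_t(x) = 1\{x > t\}$ (my choice of class, not the text's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(-1.0, 1.0, size=n)
Y = (X > 0).astype(int)  # true labels: sign of X

# Finite class F: threshold classifiers f_t(x) = 1{x > t}.
thresholds = np.linspace(-1.0, 1.0, 41)

def emp_risk(t):
    """0-1 empirical risk of the threshold rule f_t on the training set."""
    return float(np.mean((X > t).astype(int) != Y))

# ERM: pick the rule in F with the smallest empirical risk.
t_hat = thresholds[int(np.argmin([emp_risk(t) for t in thresholds]))]
print(t_hat, emp_risk(t_hat))
```

Since the labels here are noiseless, the selected threshold attains zero empirical risk; with a richer class one would simply enumerate (or optimize over) its members the same way.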

The selected rule, $\hat{f}_n$, obviously depends on the given set of training examples, and therefore it is itself a random quantity. The theoretically optimal counterpart to $\hat{f}_n$ is the decision rule that minimizes the true risk

$$f^* = \arg\min_{f \in \mathcal{F}} R(f).$$

The central problem in statistical learning is to quantify how well $\hat{f}_n$ performs relative to $f^*$. Note that $R(f^*) \le R(\hat{f}_n)$, since $f^*$ minimizes the true risk. Thus, one way to gauge the performance of $\hat{f}_n$ relative to $f^*$ is to show that there exist small positive values $\epsilon$ and $\delta$ such that, with probability at least $1 - \delta$, we have $R(\hat{f}_n) - R(f^*) \le \epsilon$.
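Guarantees of this kind can be probed by simulation. In the hypothetical noisy-threshold setup below (all of it my construction for illustration), $R(f_t)$ is known in closed form, so one can repeatedly draw training sets, run ERM, and estimate $\delta = P(R(\hat{f}_n) - R(f^*) > \epsilon)$ for a chosen $\epsilon$.

```python
import numpy as np

# Toy setup: X ~ Uniform(-1, 1); Y = 1{X > 0}, flipped with probability 0.1.
# Over threshold rules f_t(x) = 1{x > t}, the best rule is f* = f_0 with
# true risk R(f*) = 0.1, and R(f_t) = 0.1 + 0.8 * |t| / 2 for t in [-1, 1].
rng = np.random.default_rng(3)
thresholds = np.linspace(-1.0, 1.0, 81)

def true_risk(t):
    # f_t disagrees with the noiseless label on a region of probability |t|/2,
    # where its error rate is 0.9 instead of 0.1.
    return 0.1 + 0.8 * abs(t) / 2.0

def erm(n):
    """Draw a fresh training set of size n and return the ERM threshold."""
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = ((X > 0).astype(int) ^ (rng.random(n) < 0.1)).astype(int)
    risks = [np.mean((X > t).astype(int) != Y) for t in thresholds]
    return thresholds[int(np.argmin(risks))]

# Estimate delta = P(R(f_hat_n) - R(f*) > eps) over repeated training sets.
n, eps, trials = 500, 0.05, 200
excess = [true_risk(erm(n)) - 0.1 for _ in range(trials)]
delta_hat = float(np.mean([e > eps for e in excess]))
print(delta_hat)
```

Rerunning with larger $n$ drives the estimated $\delta$ toward zero for any fixed $\epsilon$, which is exactly the behavior the bounds in this theory quantify.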





Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
