This paper reviews and contrasts the basic elements of statistical decision theory and statistical learning theory. It is not intended to be a comprehensive treatment of either subject, but rather just enough to draw comparisons between the two.

Throughout this module, let $X$ denote the input to a decision-making process and $Y$ denote the correct response or output (e.g., the value of a parameter, the label of a class, the signal of interest). We assume that $X$ and $Y$ are random variables or random vectors with joint distribution $P_{X,Y}(x,y)$, where $x$ and $y$ denote specific values that may be taken by the random variables $X$ and $Y$, respectively. The observation $X$ is used to make decisions pertaining to the quantity of interest. For the purposes of illustration, we will focus on the task of determining the value of the quantity of interest. A decision rule for this task is a function $f$ that takes the observation $X$ as input and outputs a prediction of the quantity $Y$. We denote a decision rule by $\hat{Y}$, or by $f(X)$ when we wish to indicate explicitly the dependence of the decision rule on the observation. We will examine techniques for designing decision rules and for analyzing their performance.

Measuring decision accuracy: loss and risk functions

The accuracy of a decision is measured with a loss function. For example, if our goal is to determine the value of $Y$, then a loss function takes as inputs the true value $Y$ and the predicted value (the decision) $\hat{Y} = f(X)$ and outputs a non-negative real number (the “loss”) reflecting the accuracy of the decision. Two of the most commonly encountered loss functions are:

  1. 0/1 loss: $\ell_{0/1}(\hat{Y}, Y) = I_{\hat{Y} \neq Y}$, which is the indicator function taking the value 1 when $\hat{Y} \neq Y$ and the value 0 when $\hat{Y} = Y$.
  2. squared error loss: $\ell_2(\hat{Y}, Y) = \|\hat{Y} - Y\|_2^2$, which is simply the sum of squared differences between the elements of $\hat{Y}$ and $Y$.

The 0/1 loss is commonly used in detection and classification problems, and the squared error loss is more appropriate for problems involving the estimation of a continuous parameter. Note that since the inputs to the loss function may be random variables, so is the loss.
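As an illustration, the two loss functions above can be written as short Python functions. This is only a minimal sketch: the function names `zero_one_loss` and `squared_error_loss` are our own, and vectors are assumed to be plain Python sequences of numbers.

```python
from typing import Hashable, Sequence


def zero_one_loss(y_hat: Hashable, y: Hashable) -> int:
    """0/1 loss: 1 when the decision differs from the truth, 0 otherwise."""
    return 0 if y_hat == y else 1


def squared_error_loss(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Squared error loss: sum of squared elementwise differences."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))


# A class label compared with the true label.
label_loss = zero_one_loss("cat", "dog")

# A two-element estimate compared with the true vector.
vector_loss = squared_error_loss([1.0, 2.0], [0.0, 4.0])
```

Because each function is evaluated on a particular pair of values, the result is one realization of the random loss described above.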

A risk $R(f)$ is a function of the decision rule $f$, and is defined to be the expectation of a loss with respect to the joint distribution $P_{X,Y}(x,y)$. For example, the expected 0/1 loss produces the probability of error risk function; i.e., a simple calculation shows that $R_{0/1}(f) = E[I_{f(X) \neq Y}] = \Pr(f(X) \neq Y)$. The expected squared error loss produces the mean squared error (MSE) risk function, $R_2(f) = E[\|f(X) - Y\|_2^2]$.
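To make the definition of risk concrete, the following sketch approximates the probability of error risk by averaging the 0/1 loss over simulated pairs. The model is a hypothetical one chosen only for illustration: the label is 0 or 1 with equal probability, the observation is the label plus standard Gaussian noise, and the decision rule thresholds the observation at 0.5. None of these choices come from the text above.

```python
import math
import random


def sample_xy(rng: random.Random) -> tuple:
    """Draw (x, y) from the assumed joint distribution."""
    y = rng.randint(0, 1)            # label: 0 or 1, equally likely
    x = y + rng.gauss(0.0, 1.0)      # observation: label plus N(0, 1) noise
    return x, y


def threshold_rule(x: float) -> int:
    """Hypothetical decision rule f(x): decide 1 when x exceeds 0.5."""
    return 1 if x > 0.5 else 0


def zero_one_loss(y_hat: int, y: int) -> float:
    return 0.0 if y_hat == y else 1.0


def monte_carlo_risk(rule, loss, n: int, seed: int = 0) -> float:
    """Average loss over n simulated pairs, approximating E[loss(f(X), Y)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = sample_xy(rng)
        total += loss(rule(x), y)
    return total / n


estimated_risk = monte_carlo_risk(threshold_rule, zero_one_loss, n=20000)

# For this model the error probability is Pr(N(0, 1) > 0.5),
# which can be written with the complementary error function.
exact_risk = 0.5 * math.erfc(0.5 / math.sqrt(2.0))
```

Under these assumptions the simulated average should land close to `exact_risk`, with the gap shrinking as `n` grows.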

Optimal decisions are obtained by choosing a decision rule $f$ that minimizes the desired risk function. Given complete knowledge of the probability distributions involved (e.g., $P_{X,Y}(x,y)$), one can explicitly or numerically design an optimal decision rule, denoted $f^*$, that minimizes the risk function.
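When the observation and the quantity of interest each take only a few values, the numerical design of an optimal rule can be sketched by brute force: enumerate every decision rule, compute its exact risk from the joint distribution, and keep the minimizer. The joint pmf `JOINT` below is a made-up example, and exhaustive search is used only because the problem is tiny; it is not meant as a general-purpose method.

```python
from itertools import product

# Hypothetical joint pmf P_{X,Y}(x, y); the six entries sum to 1.
JOINT = {
    ("a", 0): 0.30, ("a", 1): 0.10,
    ("b", 0): 0.05, ("b", 1): 0.25,
    ("c", 0): 0.10, ("c", 1): 0.20,
}
XS = sorted({x for x, _ in JOINT})   # possible observations
YS = sorted({y for _, y in JOINT})   # possible values of the quantity of interest


def risk_01(rule: dict) -> float:
    """Exact probability of error: total mass of pairs where rule[x] != y."""
    return sum(p for (x, y), p in JOINT.items() if rule[x] != y)


def best_rule() -> dict:
    """Search all len(YS) ** len(XS) decision rules for the minimum risk."""
    candidates = (
        dict(zip(XS, outputs)) for outputs in product(YS, repeat=len(XS))
    )
    return min(candidates, key=risk_01)


f_star = best_rule()
min_risk = risk_01(f_star)
```

For this table the search selects, for each observation, the value of `y` carrying the larger joint probability, which is what one would expect when minimizing the probability of error.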

The maximum likelihood principle

The conditional distribution of the observation $X$ given the quantity of interest $Y$ is denoted by $P_{X|Y}(x|y)$. The conditional distribution $P_{X|Y}(x|y)$ can be viewed as a generative model, probabilistically describing the observations resulting from a given value, $y$, of the quantity of interest. For example, if $y$ is the value of a parameter, then $P_{X|Y}(x|y)$ is the probability distribution of the observation $X$ when the parameter value is set to $y$. If $X$ is a continuous random variable with conditional density $p_{X|Y}(x|y)$ or a discrete random variable with conditional probability mass function (pmf) $p_{X|Y}(x|y)$, then given a value $y$ we can assess the probability of a particular measurement value $x$ by the magnitude of either the conditional density or pmf.
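As a small illustration of reading a conditional density as a generative model, the sketch below assumes (purely for the example) that the observation is Gaussian with mean equal to the parameter value $y$ and unit standard deviation. The function name `gaussian_conditional_density` and the numerical values are our own choices.

```python
import math


def gaussian_conditional_density(x: float, y: float, sigma: float = 1.0) -> float:
    """Assumed model p(x | y): Gaussian density with mean y and std sigma."""
    z = (x - y) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


y_value = 1.0  # the parameter value is held fixed

# Density of two candidate measurements under that parameter value.
density_near = gaussian_conditional_density(1.0, y_value)  # x equal to the mean
density_far = gaussian_conditional_density(3.0, y_value)   # x two sigmas away
```

With $y$ fixed, the measurement near the mean receives a larger density value than the distant one, so it is judged the more probable observation under this model.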

