# 12.6 Outliers

## Try it

Identify the potential outlier in the scatter plot. The standard deviation of the residuals or errors is approximately 8.6.

The outlier appears to be at (6, 58). The expected y value on the line at x = 6 is approximately 82. Fifty-eight is 24 units from 82. Twenty-four is more than two standard deviations (2s = (2)(8.6) = 17.2). So 58 is more than two standard deviations from 82, which makes (6, 58) a potential outlier.
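The same check can be written as a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative, and the predicted value 82 is the approximate reading from the line, not a computed one.

```python
# Two-standard-deviation check for the point (6, 58).
s = 8.6            # standard deviation of the residuals (given)
observed_y = 58    # y value of the suspect point
predicted_y = 82   # approximate y value on the line at x = 6 (read from the graph)

residual = observed_y - predicted_y           # observed minus predicted
is_potential_outlier = abs(residual) > 2 * s  # compare |residual| with 2s = 17.2

print(residual, is_potential_outlier)
```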

## Numerical identification of outliers

In [link], the first two columns are the third-exam and final-exam data. The third column shows the predicted ŷ values calculated from the line of best fit: ŷ = –173.5 + 4.83x. The residuals, or errors, have been calculated in the fourth column of the table: observed y value – predicted y value = y – ŷ.

s is the standard deviation of all the y – ŷ = ε values, where n is the total number of data points. If each residual is calculated and squared, and the results are added, we get the SSE (the sum of squared errors). The standard deviation of the residuals is calculated from the SSE as:

$s=\sqrt{\frac{SSE}{n-2}}$

## Note

We divide by (n – 2) because the regression model involves two estimates: the slope and the y-intercept of the line.

Rather than calculate the value of s ourselves, we can find s using a computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals 35; –17; 16; –6; –19; 9; 3; –1; –10; –9; –1.
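As a rough illustration of the formula, the sketch below computes the SSE and s directly from the eleven residuals listed above. Because those residuals are rounded to whole numbers, the result comes out near 16.5 rather than the calculator's 16.4; the small gap is presumably due to that rounding. The variable names are illustrative.

```python
from math import sqrt

# Residuals (y - y_hat) as listed in the text, rounded to whole numbers.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]

n = len(residuals)                   # number of data points
sse = sum(e ** 2 for e in residuals) # sum of squared errors
s = sqrt(sse / (n - 2))              # divide by n - 2, not n

print(n, sse, round(s, 2))
```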

| x | y | ŷ | y – ŷ |
|---|---|---|---|
| 65 | 175 | 140 | 175 – 140 = 35 |
| 67 | 133 | 150 | 133 – 150 = –17 |
| 71 | 185 | 169 | 185 – 169 = 16 |
| 71 | 163 | 169 | 163 – 169 = –6 |
| 66 | 126 | 145 | 126 – 145 = –19 |
| 75 | 198 | 189 | 198 – 189 = 9 |
| 67 | 153 | 150 | 153 – 150 = 3 |
| 70 | 163 | 164 | 163 – 164 = –1 |
| 71 | 159 | 169 | 159 – 169 = –10 |
| 69 | 151 | 160 | 151 – 160 = –9 |
| 69 | 159 | 160 | 159 – 160 = –1 |

We are looking for all data points for which the residual is greater than 2 s = 2(16.4) = 32.8 or less than –32.8. Compare these values to the residuals in column four of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
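The comparison can also be done mechanically. The following sketch takes the residuals from the table and the calculator's s = 16.4, then keeps every point whose residual lies outside ±2s. The `rows` layout and the name `flagged` are choices made for this illustration.

```python
s = 16.4           # standard deviation of the residuals, from LinRegTTest
threshold = 2 * s  # 2s = 32.8

# (third-exam score x, final-exam score y, residual y - y_hat) from the table
rows = [
    (65, 175, 35), (67, 133, -17), (71, 185, 16), (71, 163, -6),
    (66, 126, -19), (75, 198, 9), (67, 153, 3), (70, 163, -1),
    (71, 159, -10), (69, 151, -9), (69, 159, -1),
]

# A point is a potential outlier if its residual is above 2s or below -2s.
flagged = [(x, y) for x, y, resid in rows if abs(resid) > threshold]

print(flagged)
```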

## How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with it. If there is an error, we should fix the error if possible, or delete the data point. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier was an error. We will therefore delete the outlier so that we can explore how it affects the results, as a learning experience.

## Compute a new best-fit line and correlation coefficient using the ten remaining points:

On the TI-83, TI-83+, and TI-84+ calculators, delete the outlier from L1 and L2. Using LinRegTTest, the new line of best fit and the correlation coefficient are:

ŷ = –355.19 + 7.39 x and r = 0.9121

The new correlation, r = 0.9121, is stronger than the original (r = 0.6631) because it is closer to one. This means that the new line is a better fit to the ten remaining data values, so it can better predict the final exam score given the third exam score.
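For readers without a TI calculator, the refit can be reproduced with a short least-squares routine. This is a minimal sketch using the standard textbook formulas for slope, intercept, and r; the helper `fit_line` is a hypothetical name introduced here and is not the calculator's LinRegTTest.

```python
from math import sqrt

def fit_line(points):
    """Return (intercept, slope, r) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return intercept, slope, r

# (third-exam score, final-exam score) for all eleven students
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
without_outlier = [p for p in data if p != (65, 175)]  # drop the outlier

a_all, b_all, r_all = fit_line(data)
a_new, b_new, r_new = fit_line(without_outlier)

print(f"all 11 points: y_hat = {a_all:.2f} + {b_all:.2f}x, r = {r_all:.4f}")
print(f"10 points:     y_hat = {a_new:.2f} + {b_new:.2f}x, r = {r_new:.4f}")
```

Running both fits side by side shows how much a single point pulls on the slope and weakens the correlation.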
