# 12.6 Outliers

## Try it

Identify the potential outlier in the scatter plot. The standard deviation of the residuals or errors is approximately 8.6.

The outlier appears to be at (6, 58). The expected y value on the line at x = 6 is approximately 82. Fifty-eight is 24 units from 82, and 24 is more than two standard deviations (2s = (2)(8.6) = 17.2). So 58 is more than two standard deviations from the predicted value of 82, which makes (6, 58) a potential outlier.
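The arithmetic in this check can be sketched in a few lines of Python. The variable names are illustrative only, and the predicted value of 82 is the approximate reading from the line quoted above, not a computed value.

```python
# Values quoted in the Try It example; 82 is an approximate reading from the line.
observed_y = 58      # y-coordinate of the suspect point (6, 58)
predicted_y = 82     # approximate y value on the line at x = 6
s = 8.6              # standard deviation of the residuals

residual = observed_y - predicted_y    # observed minus predicted
threshold = 2 * s                      # two standard deviations
is_potential_outlier = abs(residual) > threshold

print(residual, threshold, is_potential_outlier)
```

The comparison uses the absolute value of the residual, so a point far below the line is treated the same way as a point far above it.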

## Numerical identification of outliers

In the table below, the first two columns are the third-exam and final-exam data. The third column shows the predicted ŷ values calculated from the line of best fit: ŷ = –173.5 + 4.83x. The residuals, or errors, have been calculated in the fourth column of the table: observed y value − predicted y value = y − ŷ.

s is the standard deviation of all the y − ŷ = ε values, where n is the total number of data points. If each residual is calculated and squared, and the results are added, we get the SSE (sum of squared errors). The standard deviation of the residuals is calculated from the SSE as:

$s=\sqrt{\frac{SSE}{n-2}}$

## Note

We divide by ( n – 2) because the regression model involves two estimates.
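As a rough check on the formula, the sketch below applies it to the residuals listed in this section. Those residuals are rounded to whole numbers, so the result comes out near 16.5 rather than exactly matching the calculator's s = 16.4, which is based on unrounded values. The variable names are illustrative.

```python
import math

# Residuals y - ŷ for the eleven students, rounded to whole numbers.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]

n = len(residuals)
sse = sum(e ** 2 for e in residuals)   # sum of squared errors
s = math.sqrt(sse / (n - 2))           # divide by n - 2, not n

print(f"SSE = {sse}, s = {s:.1f}")
```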

Rather than calculate the value of s ourselves, we can find s using a computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals 35, –17, 16, –6, –19, 9, 3, –1, –10, –9, –1.

| x | y | ŷ | y − ŷ |
|---|---|---|---|
| 65 | 175 | 140 | 175 – 140 = 35 |
| 67 | 133 | 150 | 133 – 150 = –17 |
| 71 | 185 | 169 | 185 – 169 = 16 |
| 71 | 163 | 169 | 163 – 169 = –6 |
| 66 | 126 | 145 | 126 – 145 = –19 |
| 75 | 198 | 189 | 198 – 189 = 9 |
| 67 | 153 | 150 | 153 – 150 = 3 |
| 70 | 163 | 164 | 163 – 164 = –1 |
| 71 | 159 | 169 | 159 – 169 = –10 |
| 69 | 151 | 160 | 151 – 160 = –9 |
| 69 | 159 | 160 | 159 – 160 = –1 |

We are looking for all data points for which the residual is greater than 2 s = 2(16.4) = 32.8 or less than –32.8. Compare these values to the residuals in column four of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
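The two-standard-deviation rule described above can be sketched as a small Python function. The function name `flag_outliers` and its parameters are hypothetical; the data, the line ŷ = –173.5 + 4.83x, and s = 16.4 are taken from this section. Because the residuals are computed from the rounded coefficients, they differ slightly from the whole-number values in the table.

```python
def flag_outliers(points, intercept, slope, s):
    """Return (x, y, residual) for points whose residual exceeds 2s in size."""
    flagged = []
    for x, y in points:
        predicted = intercept + slope * x
        residual = y - predicted          # observed minus predicted
        if abs(residual) > 2 * s:
            flagged.append((x, y, residual))
    return flagged

# (third-exam score, final-exam score) pairs from the table above
exam_scores = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
               (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

flagged = flag_outliers(exam_scores, intercept=-173.5, slope=4.83, s=16.4)
for x, y, residual in flagged:
    print(f"({x}, {y}) has residual {residual:.1f}")
```

With these inputs only the point (65, 175) is flagged, which agrees with the comparison against column four of the table.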

## How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with it. If there is an error, we should fix it if possible, or delete the data point. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier was an error. We will therefore delete the outlier so that we can explore how it affects the results, as a learning experience.

## Compute a new best-fit line and correlation coefficient using the ten remaining points:

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new line of best fit and the correlation coefficient are:

ŷ = –355.19 + 7.39 x and r = 0.9121

The new correlation coefficient, r = 0.9121, indicates a stronger linear correlation than the original (r = 0.6631) because it is closer to one. This means that the new line is a better fit to the ten remaining data values. The line can better predict the final exam score given the third exam score.
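For readers without a TI calculator, the comparison can be reproduced with a short Python sketch. This is not the LinRegTTest routine itself; it applies the standard least-squares formulas for slope, intercept, and r, and the helper name `fit_line` is hypothetical.

```python
import math

def fit_line(points):
    """Least-squares intercept, slope, and correlation coefficient r."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# (third-exam score, final-exam score) pairs for all eleven students
exam_scores = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
               (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

# Same data with the outlier (65, 175) removed
without_outlier = [p for p in exam_scores if p != (65, 175)]

for label, data in [("all 11 points", exam_scores), ("outlier removed", without_outlier)]:
    a, b, r = fit_line(data)
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
```

Removing the single point changes the slope from about 4.83 to about 7.39, which shows how strongly one outlier can pull a least-squares line when the data set is small.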



Source:  OpenStax, Introductory statistics. OpenStax CNX. May 06, 2016 Download for free at http://legacy.cnx.org/content/col11562/1.18