# 12.6 Outliers

## Try it

Identify the potential outlier in the scatter plot. The standard deviation of the residuals or errors is approximately 8.6.

The outlier appears to be at (6, 58). The expected y value on the line at x = 6 is approximately 82. Fifty-eight is 24 units from 82, and 24 is more than two standard deviations (2s = (2)(8.6) = 17.2). So the observed value 58 lies more than two standard deviations from the predicted value 82, which makes (6, 58) a potential outlier.
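
The arithmetic in this check can be sketched in a few lines of Python. The variable names are illustrative, and the predicted value 82 is the approximate reading from the line quoted above, not a fitted value.

```python
# Two-standard-deviation check for the point (6, 58).
observed_y = 58.0
predicted_y = 82.0   # approximate value read from the line at x = 6
s = 8.6              # standard deviation of the residuals, as given

residual = observed_y - predicted_y   # observed minus predicted
threshold = 2 * s                     # two standard deviations

# A point is a potential outlier when |residual| exceeds 2s.
is_potential_outlier = abs(residual) > threshold

print(f"residual = {residual}, 2s = {threshold:.1f}, "
      f"potential outlier: {is_potential_outlier}")
```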

## Numerical identification of outliers

In [link], the first two columns are the third-exam and final-exam data. The third column shows the predicted ŷ values calculated from the line of best fit: ŷ = –173.5 + 4.83x. The residuals, or errors, have been calculated in the fourth column of the table: observed y value − predicted y value = y − ŷ.

s is the standard deviation of all the y − ŷ = ε values, where n is the total number of data points. If each residual is calculated and squared, and the results are added, we get the SSE. The standard deviation of the residuals is calculated from the SSE as:

$s=\sqrt{\frac{SSE}{n-2}}$

## Note

We divide by ( n – 2) because the regression model involves two estimates.

Rather than calculate the value of s ourselves, we can find s using the computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals: 35; –17; 16; –6; –19; 9; 3; –1; –10; –9; –1.

| x | y | ŷ | y – ŷ |
|---|---|---|---|
| 65 | 175 | 140 | 175 – 140 = 35 |
| 67 | 133 | 150 | 133 – 150 = –17 |
| 71 | 185 | 169 | 185 – 169 = 16 |
| 71 | 163 | 169 | 163 – 169 = –6 |
| 66 | 126 | 145 | 126 – 145 = –19 |
| 75 | 198 | 189 | 198 – 189 = 9 |
| 67 | 153 | 150 | 153 – 150 = 3 |
| 70 | 163 | 164 | 163 – 164 = –1 |
| 71 | 159 | 169 | 159 – 169 = –10 |
| 69 | 151 | 160 | 151 – 160 = –9 |
| 69 | 159 | 160 | 159 – 160 = –1 |
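
As a rough check on the table, the following Python sketch refits the least-squares line to the eleven (x, y) pairs and recomputes the residuals, the SSE, and s. It uses the textbook least-squares formulas, not the calculator's LinRegTTest routine; the function and variable names are illustrative. Because it works with unrounded coefficients, individual values may differ slightly from the rounded table entries.

```python
import math

# Third-exam (x) and final-exam (y) scores from the table.
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(third_exam, final_exam)

# Residual = observed y minus predicted y-hat.
predicted = [intercept + slope * x for x in third_exam]
residuals = [y - y_hat for y, y_hat in zip(final_exam, predicted)]

# SSE is the sum of squared residuals; s divides by n - 2.
n = len(third_exam)
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

print(f"y-hat = {intercept:.1f} + {slope:.2f} x")
for x, y, y_hat, e in zip(third_exam, final_exam, predicted, residuals):
    print(f"{x:3d} {y:4d} {y_hat:6.0f} {e:6.0f}")
print(f"SSE = {sse:.1f}, s = {s:.1f}")
```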

We are looking for all data points for which the residual is greater than 2 s = 2(16.4) = 32.8 or less than –32.8. Compare these values to the residuals in column four of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
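
The comparison against ±2s can be sketched as a simple filter over the table rows. The rounded residuals and s = 16.4 are taken from the text above; the list layout and names are illustrative assumptions.

```python
# (x, y, residual) rows, using the rounded residuals from the table.
rows = [
    (65, 175, 35), (67, 133, -17), (71, 185, 16), (71, 163, -6),
    (66, 126, -19), (75, 198, 9), (67, 153, 3), (70, 163, -1),
    (71, 159, -10), (69, 151, -9), (69, 159, -1),
]

s = 16.4           # standard deviation of the residuals, from LinRegTTest
cutoff = 2 * s     # 32.8

# Keep the points whose residual is above 2s or below -2s.
outliers = [(x, y) for x, y, residual in rows if abs(residual) > cutoff]

print(outliers)  # [(65, 175)]
```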

## How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with the data. If there is an error, we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier was an error. Therefore, we will delete the outlier so that we can explore how it affects the results, as a learning experience.

## Compute a new best-fit line and correlation coefficient using the ten remaining points

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new line of best fit and the correlation coefficient are:

ŷ = –355.19 + 7.39 x and r = 0.9121

The new correlation, r = 0.9121, is stronger than the original (r = 0.6631) because it is closer to one. This means that the new line is a better fit to the ten remaining data values. The line can better predict the final exam score given the third exam score.
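
For readers without a TI calculator, the effect of removing the outlier can be reproduced with a short Python sketch. It applies the standard least-squares and correlation formulas directly and is not the LinRegTTest routine itself; the helper name `fit_and_correlate` is hypothetical.

```python
import math


def fit_and_correlate(points):
    """Return (intercept, slope, r) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# All eleven (third exam, final exam) pairs from the table.
points = [
    (65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
    (67, 153), (70, 163), (71, 159), (69, 151), (69, 159),
]

# Drop the point identified as an outlier.
outlier = (65, 175)
remaining = [p for p in points if p != outlier]

a_all, b_all, r_all = fit_and_correlate(points)
a_new, b_new, r_new = fit_and_correlate(remaining)

print(f"all 11 points: y-hat = {a_all:.2f} + {b_all:.2f} x, r = {r_all:.4f}")
print(f"outlier removed: y-hat = {a_new:.2f} + {b_new:.2f} x, r = {r_new:.4f}")
```

Comparing the two printed lines shows how much a single point can change both the slope and the correlation coefficient.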
