12.6 Outliers


In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data. The key is to examine carefully what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.
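As a rough illustration of the remove-and-refit check described above, the following Python sketch compares the least-squares slope with and without one point that lies far to the right of the others. The data values and the helper name `slope` are hypothetical and chosen only for illustration; they are not taken from the third exam/final exam example.

```python
# Hypothetical data: five points close to y = 2x, plus one point
# that is far from the others in the horizontal direction (x = 20).
def slope(xs, ys):
    """Return the least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

slope_with = slope(xs, ys)                 # all six points
slope_without = slope(xs[:-1], ys[:-1])    # candidate point removed

print("slope with the point:   ", round(slope_with, 3))
print("slope without the point:", round(slope_without, 3))
```

In this made-up data set the slope changes substantially when the last point is dropped, which is the kind of change that suggests the point is influential.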

Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.

Identifying outliers

We could guess at outliers by looking at the scatterplot and best-fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located further than two standard deviations above or below the best-fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatter plot by drawing an extra pair of lines that are two standard deviations above and below the best-fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You would generally need to use only one of these methods.
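The numerical comparison can also be sketched in a few lines of Python. The data below are hypothetical, and the sketch assumes that the standard deviation of the residuals is computed as s = sqrt(SSE / (n – 2)); the helper name `fit_line` is likewise an assumption made for this illustration.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: roughly y = 2x + 1, with one unusual value at x = 5.
xs = list(range(1, 11))
ys = [3.1, 4.9, 7.2, 8.8, 21.0, 13.1, 14.8, 17.2, 18.9, 21.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals (assumed formula: sqrt(SSE / (n - 2))).
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (len(xs) - 2))

# Flag any point whose residual is more than 2s in absolute value.
flagged = [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) > 2 * s]
print("s =", round(s, 3))
print("potential outliers:", flagged)
```

Only the point at x = 5 is flagged for this made-up data set; every other residual is well inside the 2s guideline.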

In the third exam/final exam example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the SSE should be smaller and the correlation coefficient ought to be closer to 1 or –1.
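The delete-and-refit exercise might look like the following sketch, which compares the SSE and the correlation coefficient before and after removing a point. The data are hypothetical (they are not the third exam/final exam scores), and the outlier's position is assumed to have been identified already.

```python
import math

def sse_and_r(xs, ys):
    """Fit the least-squares line and return (SSE, correlation coefficient)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    return sse, r

# Hypothetical data with one unusual value at x = 5 (list index 4).
xs = list(range(1, 11))
ys = [3.1, 4.9, 7.2, 8.8, 21.0, 13.1, 14.8, 17.2, 18.9, 21.1]
outlier_index = 4  # assumed to have been flagged beforehand

xs_kept = xs[:outlier_index] + xs[outlier_index + 1:]
ys_kept = ys[:outlier_index] + ys[outlier_index + 1:]

sse_before, r_before = sse_and_r(xs, ys)
sse_after, r_after = sse_and_r(xs_kept, ys_kept)

print("SSE before:", round(sse_before, 3), " r before:", round(r_before, 4))
print("SSE after: ", round(sse_after, 3), " r after: ", round(r_after, 4))
```

For this made-up data set the SSE drops and r moves closer to 1 once the flagged point is removed, which is the pattern the exercise asks you to look for.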

Graphical identification of outliers

With the TI-83, 83+, 84+ graphing calculators, it is easy to identify the outliers graphically. If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance were equal to 2s or more, where s is the standard deviation of the residuals, then we would consider the data point to be "too far" from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. We will call these lines Y2 and Y3.

As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412.

Line Y2 = –173.5 + 4.83x – 2(16.4) and line Y3 = –173.5 + 4.83x + 2(16.4)

where ŷ = –173.5 + 4.83x is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.

Graph the scatterplot with the best-fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x = 65, y = 175. On the calculator screen it is just barely outside these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam; this point is further than two standard deviations away from the best-fit line.
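The check for the point x = 65, y = 175 can be reproduced with a short Python sketch that evaluates the two flagging lines at a given x. The intercept, slope, and s are the values quoted in this section; the function names `band` and `is_flagged` are hypothetical.

```python
# Values quoted in this section: y-hat = -173.5 + 4.83x and s = 16.412.
A = -173.5   # intercept of the line of best fit
B = 4.83     # slope of the line of best fit
S = 16.412   # standard deviation of the residuals

def band(x):
    """Return (Y2, Y3): the lines 2s below and 2s above the best-fit line at x."""
    y_hat = A + B * x
    return y_hat - 2 * S, y_hat + 2 * S

def is_flagged(x, y):
    """True if (x, y) lies outside the two flagging lines."""
    lower, upper = band(x)
    return y < lower or y > upper

lower, upper = band(65)
flagged = is_flagged(65, 175)
print("Y2 at x = 65:", round(lower, 3))
print("Y3 at x = 65:", round(upper, 3))
print("point (65, 175) flagged:", flagged)
```

At x = 65 the upper line is at about 173.27, so a final exam score of 175 falls just above it, which agrees with the point appearing barely outside the lines on the calculator screen.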

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph clearer. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers.
