Monday, May 27, 2019

Simple Linear Regression

Simple linear regression is a statistical method used to summarise and study the association between two continuous, quantitative variables. It deals with two measures that describe how strong the linear relationship in the data is. Simple linear regression involves one variable, denoted x, known as the predictor variable, and another, denoted y, known as the response variable. Any discussion of simple linear regression is expected to touch on deterministic versus statistical relationships, the concept of least squares, and the interpretation of b0 and b1, which define the estimated regression line. There is also a distinction between the population regression line and the estimated regression line. Linearity is measured using the correlation coefficient r, which ranges from -1 to 1; the strength of the association is determined from the value of r (https://onlinecourses.science.psu.edu/stat501/node/250).

History of simple linear regression

Karl Pearson established a rigorous treatment of the applied statistical measure known as the Pearson Product Moment Correlation. It grew out of the thinking of Sir Francis Galton, who originated the modern notions of correlation and regression. Galton contributed to biology, psychology and applied statistics. His fascination with genetics and heredity provided the initial inspiration that led to regression and the Pearson Product Moment Correlation. The thinking that drove the development of the Pearson Product Moment Correlation began with a vexing problem of heredity: understanding how strongly the characteristics of one generation of living things are exhibited in the next. Galton approached the problem by using the sweet pea plant to check for similarities in characteristics (Bravais, A. (1846)).

The sweet pea was chosen because it is self-fertilising: daughter plants show genetic variation from the mother plant without the contribution of a second parent, which would raise the statistical problem of assessing the genetic contribution of both parents. The first insight about regression came from a two-dimensional diagram plotting the sizes of the mother peas (the independent variable) against the sizes of the daughter peas (the dependent variable). Galton used this representation of the data to show what statisticians today call regression. From his plot he realised that the median weight of daughter seeds from a particular size of mother seed approximately described a straight line with a positive slope less than 1. Thus he naturally reached a straight regression line and constant variability for all arrays of one character for a given character of a second. It was perhaps best for the progress of the correlational calculus that this simple special case should be promulgated first, as it is so easily grasped by the beginner (Pearson 1930, p. 5). The approach was later generalised to the more complex method called multiple regression (Galton, F. (1894)).

Importance of linear regression

Statisticians usually use the term linear regression when interpreting the association in data from a particular survey, research study or experiment. The linear relationship is used in modelling: modelling one explanatory variable x and one response variable y calls for the simple linear regression approach. Simple linear regression is said to be broadly useful in both methodology and practical application. The simple linear regression model is not used in statistics alone; it is applied in much biological, social science and environmental research.
Simple linear regression is important because it gives an indication of what is to be expected, mostly for the monitoring and corrective purposes involved in some disciplines (April 20, 2011, plaza).

Description of linear regression

The simple linear regression model is described by \(Y = \beta_0 + \beta_1 x + \varepsilon\), the mathematical way of expressing simple linear regression in terms of x and y. This equation gives a clear idea of how x is associated with y. The error term \(\varepsilon\) accounts for the variability in y that the linear relationship with x does not explain. The population parameters appear in \(\beta_0 + \beta_1 x\), and the model is given by \(E(y) = \beta_0 + \beta_1 x\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(E(y)\) is the mean of y at a given value of x. The hypotheses tested are H0, that there is no linear relationship between the two variables (\(\beta_1 = 0\)), and H1, that there is a linear association between them (\(\beta_1 \neq 0\)).

Background of simple linear regression

Galton used descriptive statistics in order to generalise his work to different heredity problems. A breakthrough in analysing these data came when he realised that if the degree of association between variables was held constant, then the slope of the regression line could be described if the variability of the two measures was known.
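Galton's observation, that with a fixed degree of association the slope depends only on the variability of the two measures, can be checked numerically. The sketch below is illustrative only: the seed sizes are hypothetical values (not Galton's measurements), and the helper names `pearson_r` and `slope_from_r` are made up for this example. It relies on the standard identity that the least squares slope equals r times the ratio of the standard deviations.

```python
import statistics


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def slope_from_r(x, y):
    """Regression slope written as r * (S_y / S_x)."""
    return pearson_r(x, y) * statistics.stdev(y) / statistics.stdev(x)


# Hypothetical seed sizes (assumed for illustration, not historical data).
mother = [15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0]
daughter = [15.8, 15.9, 16.6, 16.4, 17.1, 17.0, 17.5]

# Rescaling y changes its standard deviation but not the correlation,
# so only the slope should change from one row to the next.
for scale in (0.5, 1.0, 2.0):
    rescaled = [scale * d for d in daughter]
    r = pearson_r(mother, rescaled)
    slope = slope_from_r(mother, rescaled)
    print(f"scale={scale}: r={r:.4f}, slope={slope:.4f}")
```

Running the loop should show the same r on every row while the slope grows in proportion to the scale factor, which is the pattern Galton noticed across his data sets.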
Galton assumed he had estimated a single heredity constant that could be generalised to multiple inherited characteristics. He wondered why, if such a constant existed, the observed slopes in the plots of parent against child varied so much across these characteristics. Noticing that variability differed between the generations, he arrived at the idea that the variation in regression slope he observed was due solely to variation in variability between the various sets of measurements. In modern terms, this principle can be illustrated by assuming a constant correlation coefficient but varying the standard deviations of the two variables involved. Consider three data sets with the same correlation: in the first, the standard deviation of Y is the same as that of X; in the second, it is less than that of X; in the third, it is greater than that of X. The correlation remains constant across the three sets even though the slope of the line changes as a result of the differences in variability between the two variables. Galton used the rudimentary regression equation \(y = r(S_y / S_x)x\) to describe the relationship between his paired variables. He used an estimated value of r because he had no way of calculating it. The \((S_y / S_x)\) expression was a correction factor that adjusted the slope according to the variability of the measures. He also realised that the ratio of the variability of the two measures was the key factor in determining the slope of the regression line.

The uses of simple linear regression

Simple linear regression is a common statistical data analysis technique. It is used to determine the extent to which there is a linear relationship between a dependent variable and an independent variable. The dependent variable is measured on a continuous scale (e.g. a 0-100 quiz score), and the independent variable can be measured on either a categorical (e.g. male versus female) or a continuous measurement scale. There are a few other assumptions that the data must satisfy in order to qualify for simple linear regression. Simple linear regression is similar to correlation in that its purpose is to measure the extent to which there is a linear relationship between two variables. The major difference between the two is that correlation makes no distinction between the two variables. Specifically, the purpose of simple linear regression is to predict the value of the dependent variable based on the value of the independent variable (https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm).

References

Bravais, A. (1846), Analyse Mathematique sur les Probabilites des Erreurs de Situation d'un Point, Memoires par divers Savans, 9, 255-332.
Duke, J. D. (1978), Tables to Help Students Grasp Size Differences in Simple Correlations, Teaching of Psychology, 5, 219-221.
FitzPatrick, P. J. (1960), Leading British Statisticians of the Nineteenth Century, Journal of the American Statistical Association, 55, 38-70.
Galton, F. (1894), Natural Inheritance (5th ed.), New York: Macmillan and Company.
Ghiselli, E. E. (1981), Measurement Theory for the Behavioral Sciences, San Francisco: W. H. Freeman.
Goldstein, M. D., and Strube, M. J. (1995), Understanding Correlations: Two Computer Exercises, Teaching of Psychology, 22, 205-206.
Karylowski, J. (1985), Regression Toward the Mean Effect: No Statistical Background Required, Teaching of Psychology, 12, 229-230.
Paul, D. B. (1995), Controlling Human Heredity, 1865 to the Present, Atlantic Highlands, N.J.: Humanities Press.
Pearson, E. S. (1938), Mathematical Statistics and Data Analysis (2nd ed.), Belmont, CA: Duxbury.
Pearson, K. (1896), Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia, Philosophical Transactions of the Royal Society of London, 187, 253-318.
Pearson, K. (1922), Francis Galton: A Centenary Appreciation, Cambridge University Press.
Pearson, K. (1930), The Life, Letters and Labors of Francis Galton, Cambridge University Press.
Williams, R. H. (1975), A New Method for Teaching Multiple Regression to Behavioral Science Students, Teaching of Psychology, 2, 76-78.
https://onlinecourses.science.psu.edu/stat501/node/250
https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm

Simple Linear Regression
Stat 326: Introduction to Business Statistics II (Spring 2013)
Review of Stat 226

Review: Inference for Regression

Example: Real Estate, Tampa Palms, Florida
Goal: predict the sale price of residential property based on the appraised value of the property.
Data: sale price and total appraised value of 92 residential properties in Tampa Palms, Florida.
[Figure: scatterplot of Sale Price (in Thousands of Dollars) against Appraised Value (in Thousands of Dollars)]

We can describe the relationship between x and y using a simple linear regression model of the form \(\mu_y = \beta_0 + \beta_1 x\).
Response variable y: sale price.
Explanatory variable x: appraised value.
Relationship between x and y: linear, strong, positive.

We can estimate the simple linear regression model using Least Squares (LS), yielding the following LS regression line: \(\hat{y} = 20.94 + 1.069x\).

Interpretation of the estimated intercept b0: it corresponds to the predicted value of y, i.e. \(\hat{y}\), when x = 0. The interpretation of b0 is not always meaningful (when x cannot take values close to or equal to zero). Here b0 = 20.94: when a property is appraised at zero value, the predicted sale price is $20,940. Is that meaningful?

Interpretation of the estimated slope b1: it corresponds to the change in y for a unit increase in x. When x increases by 1 unit, y will increase by the value of b1.
b1 < 0: y decreases as x increases (negative association).
b1 > 0: y increases as x increases (positive association).
Here b1 = 1.069: when the appraised value of a property increases by 1 unit, i.e. by $1,000, the predicted sale price will increase by $1,069.

Measuring strength and adequacy of a linear relationship
Correlation coefficient r: a measure of the strength of the linear relationship, \(-1 \le r \le 1\). Here r = 0.9723.
Coefficient of determination \(r^2\): the amount of variation in y explained by the fitted linear model, \(0 \le r^2 \le 1\). Here \(r^2 = (0.9723)^2 = 0.9453\), so 94.53% of the variation in the sale price can be explained through the linear relationship between the appraised value (x) and the sale price (y).

Population regression line
Recall from Stat 226: the regression model that we assume to hold true for the entire population is the so-called population regression line \(\mu_y = \beta_0 + \beta_1 x\), where
\(\mu_y\) is the average (mean) value of y in the population for a fixed value of x,
\(\beta_0\) is the population intercept,
\(\beta_1\) is the population slope.
The population regression line could only be obtained if we had information on all individuals in the population.
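The least squares line, the slope interpretation and \(r^2\) described in the review above can be sketched in a few lines. This is a minimal illustration, not the course's JMP workflow: the five data points are hypothetical (the 92 Tampa Palms observations are not reproduced in this post), and the function names are made up for the example. Only the coefficients 20.94 and 1.069 come from the slides.

```python
def fit_least_squares(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


def tampa_predicted_price(appraised_value):
    """Predicted sale price (thousands of dollars) from the reported LS line."""
    return 20.94 + 1.069 * appraised_value


# Hypothetical data, assumed purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_least_squares(x, y)
r2 = r_squared(x, y, b0, b1)
print(f"b0={b0:.3f}, b1={b1:.3f}, r^2={r2:.4f}")

# One extra unit of appraised value (1,000 dollars) moves the prediction by b1.
print(tampa_predicted_price(224.0) - tampa_predicted_price(223.0))
```

The last line illustrates the slope interpretation: the difference between two predictions one unit apart is the slope itself, up to floating-point rounding.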
Based on the population regression line we can fully describe the relationship between x and y up to a random error term \(\varepsilon\): \(y = \beta_0 + \beta_1 x + \varepsilon\), where \(\varepsilon \sim N(0, \sigma)\).

In summary, these are the important notations used for SLR:
Variables: x, y
Parameters: \(\beta_0\), \(\beta_1\), \(\mu_y\), \(\varepsilon\)
Estimates: b0, b1, \(\hat{y}\), e

Why fit a LS regression model?
A good model allows us to make predictions about the behavior of the response variable y for different values of x. For example, to estimate the average sale price (\(\mu_y\)) for a property appraised at $223,000, set x = 223: \(\hat{y} = 20.94 + 1.069 \cdot 223 = 259.327\), so the average sale price for a property appraised at $223,000 is estimated to be about $259,327.

Validity of predictions
Assuming we have a good model, predictions are only valid within the range of x-values used to fit the LS regression model. Predicting outside the range of x is called extrapolation and should be avoided at all costs, as predictions can become unreliable.

What is a good model?
The answer to this question is not straightforward. We can visually check the validity of the fitted linear model (through residual plots) as well as make use of numerical values such as \(r^2\). More on assessing the validity of a regression model will follow.
What to look for: the residual plot and \(r^2\).

Regression Assumptions
In order to do inference (confidence intervals and hypothesis tests), we need the following 4 assumptions to hold:
SRS (independence of y-values);
a linear relationship between x and y;
for each value of x, the population of y-values is normally distributed (equivalently, \(\varepsilon\) is normally distributed);
for each value of x, the standard deviation of the y-values (and of \(\varepsilon\)) is \(\sigma\).

The SRS Assumption is the hardest to check. The Linearity Assumption and the Constant SD Assumption are typically checked visually through a residual plot: plot x versus the residuals; any pattern indicates a violation. Recall that residual \(= y - \hat{y} = y - (b_0 + b_1 x)\). The Normality Assumption is checked by assessing whether the residuals are approximately normally distributed (use a normal quantile plot).

Returning to the Tampa Palms, Florida example:
[Figure: residuals plotted against Appraised Value (in Thousands of Dollars)]
Note: non-constant variance can often be stabilized by transforming x, or y, or both.
[Figure: residuals plotted against log Appraised]
Going one step further, excluding the outlier yields
[Figure: residuals plotted against log Appraised, outlier excluded]
Outliers/influential points in general should only be excluded from an analysis if they can be explained and their exclusion can be justified, e.g. a typo or faulty measurement.
Excluding outliers always means a loss of information. Handle outliers with caution; you may want to compare analyses with and without the outliers.

Normal quantile plots (Tampa Palms example)
[Figure: normal quantile plots of the residuals for Sale Price (in Thousands of Dollars), for log Sale, and for log Sale without the outlier]

Regression Inference: confidence intervals and hypothesis tests
We need to assess whether the linear relationship between x and y holds true for the entire population. This can be accomplished through testing \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\) based on the estimated slope b1. For simplicity we will work with the untransformed Tampa Palms data.

Confidence intervals
We can construct confidence intervals (CIs) for \(\beta_1\) and \(\beta_0\). The general form of a confidence interval is estimate \(\pm\, t^* \cdot SE_{estimate}\), where \(t^*\) is the critical value corresponding to the chosen level of confidence C. \(t^*\) is based on the t-distribution with \(n - 2\) degrees of freedom (df).
Example: find a 95% CI for \(\beta_1\) for the Tampa Palms data set.

Testing for a linear relationship between x and y
If we wish to test whether there exists a significant linear relationship between x and y, we need to test \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\). Why?
If we fail to reject the null hypothesis (i.e. stick with \(H_0: \beta_1 = 0\) rather than \(H_a: \beta_1 \neq 0\)), the LS regression model reduces to \(\mu_y = \beta_0 + \beta_1 x = \beta_0 + 0 \cdot x = \beta_0\) (a constant), implying that \(\mu_y\) (and hence y) is not linearly dependent on x.

Example (Tampa Palms data set): test at the \(\alpha = 0.05\) level of significance for a linear relationship between the appraised value of a property and the sale price.

Inference about Prediction

Why fit a LS regression model? The purpose of a LS regression model is to
1. estimate \(\mu_y\): the average/mean value of y for a given value of x, say \(x^*\), e.g. estimate the average sale price \(\mu_y\) for all residential property in Tampa Palms appraised at \(x^*\) = $223,000;
2. predict y: an individual/single future value of the response variable y for a given value of x, say \(x^*\), e.g. predict a future sale price of an individual residential property appraised at \(x^*\) = $223,000.
Keep in mind that we consider predictions for only one value of x at a time. Note that these two tasks are VERY different. Carefully think about the difference.

To estimate \(\mu_y\) and to predict a single future y value for a given level of \(x = x^*\), we can use the LS regression line \(\hat{y} = b_0 + b_1 x\). Simply substitute the desired value of x, say \(x^*\), for x: \(\hat{y} = b_0 + b_1 x^*\).

In addition we need to know how much variability is associated with the point estimator.
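The test for a linear relationship described above (H0: slope is zero, against a two-sided alternative) can be sketched as follows. This is an assumption-laden illustration: the six data points are hypothetical, the function name `slope_t_test` is made up, and the critical value 2.776 is the tabulated two-sided 5% value of the t-distribution with 4 degrees of freedom, hard-coded to keep the example free of third-party libraries.

```python
import math


def slope_t_test(x, y):
    """Return (b1, se_b1, t) for testing H0: beta1 = 0 in simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual standard deviation uses n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    return b1, se_b1, b1 / se_b1


# Hypothetical data, assumed for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 2.9, 4.2, 4.8, 6.1, 7.0]

b1, se_b1, t_stat = slope_t_test(x, y)

# Tabulated two-sided critical value for alpha = 0.05 and df = n - 2 = 4.
T_CRITICAL_DF4 = 2.776
reject_h0 = abs(t_stat) > T_CRITICAL_DF4
print(f"b1={b1:.4f}, SE(b1)={se_b1:.4f}, t={t_stat:.2f}, reject H0: {reject_h0}")
```

The same quantities give a confidence interval for the slope in the general form stated earlier: b1 plus or minus the critical value times SE(b1).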
Taking the variability into account provides information about how good and reliable the point estimator really is. That is, which range potentially captures the true (but unknown) parameter value? Recall from Stat 226: the construction of confidence intervals.

Much more variability is associated with estimating a single observation than with estimating an average: individual observations always vary more than averages. Therefore we distinguish a confidence interval for the average/mean response \(\mu_y\) and a prediction interval for a single future observation y. Both intervals use a \(t^*\) critical value from a t-distribution with df \(= n - 2\). The standard error will be different for each interval. While the point estimators for the average \(\mu_y\) and for the future individual value y are the same (namely \(\hat{y} = b_0 + b_1 x^*\)), the widths of the two intervals differ.

Confidence interval for the average/mean response \(\mu_y\)
The width of the confidence interval is determined using the standard error \(SE_{\hat{\mu}}\) (from estimating the mean response). \(SE_{\hat{\mu}}\) can be obtained in JMP. Keep in mind that every confidence interval is always constructed for one specific given value \(x^*\). A level C confidence interval for the average/mean response \(\mu_y\), when x takes the value \(x^*\), is given by \(\hat{y} \pm t^* \cdot SE_{\hat{\mu}}\), where \(SE_{\hat{\mu}}\) is the standard error for estimating a mean response.

Prediction interval for a single (future) value y
Again, the width of the interval is determined using a standard error, here \(SE_{\hat{y}}\) (from predicting a single response). \(SE_{\hat{y}}\) can be obtained in JMP. Keep in mind that every prediction interval is always constructed for one specific given value \(x^*\). A level C prediction interval for a single observation y, when x takes the value \(x^*\), is given by \(\hat{y} \pm t^* \cdot SE_{\hat{y}}\), where \(SE_{\hat{y}}\) is the standard error for predicting a single response.

Example: an appliance store runs a 5-month experiment to determine the effect of advertising on sales revenue. There are only 5 observations. The scatterplot of the advertising expenditures versus the sales revenues is shown below.
[Figure: Bivariate Fit of Sales Revenues (in Dollars) By Advertising expenditure (in Dollars)]

JMP can draw the confidence intervals for the mean responses as well as for the predicted values for future observations (prediction intervals). These are called confidence bands.
[Figure: Bivariate Fit of Sales Revenues (in Dollars) By Advertising expenditure (in Dollars), with confidence bands. Linear Fit: Sales Revenues (in Dollars) = -100 + 7 Advertising expenditure (in Dollars)]

Estimation and prediction (for the appliance store data)
We wish to estimate the mean/average revenue of the subpopulation of stores that spent \(x^* = 200\) on advertising. Suppose that we also wish to predict the revenue in a future month when our store spends \(x^* = 200\) on advertising. The point estimate in both situations is the same: \(\hat{y} = -100 + 7 \cdot 200 = 1300\). The corresponding standard errors of the mean and of the prediction, however, are different: \(SE_{\hat{\mu}} \approx 331.663\) and \(SE_{\hat{y}} \approx 690.411\).

Estimation and prediction using JMP
For each observation in a data set we can get \(\hat{y}\), \(SE_{\hat{y}}\) and also \(SE_{\hat{\mu}}\) from JMP. In JMP do:
1. Choose Fit Model.
2. From the response icon, choose Save Columns and then choose Predicted Values, Std Error of Predicted, and Std Error of Individual.

Note that in the appliance store example, \(SE_{\hat{y}} > SE_{\hat{\mu}}\) (690.411 versus 331.663). This is always true: we can estimate a mean value of y for a given \(x^*\) much more precisely than we can predict the value of a single y for \(x = x^*\). In estimating a mean y for \(x = x^*\), the only uncertainty arises because we do not know the true regression line. In predicting a single y for \(x = x^*\), we have two uncertainties: the true regression line plus the expected variability of y-values around the true line.

It always holds that \(SE_{\hat{\mu}} < SE_{\hat{y}}\). Therefore a prediction interval for a single future observation y will always be wider than a confidence interval for the mean response \(\mu_y\), as there is simply more uncertainty in predicting a single value.

Example (contd.)
JMP also calculates confidence intervals for the mean response \(\mu_y\) as well as prediction intervals for single future observations y. (For instructions follow the handout on JMP commands related to regression CIs and PIs.) To construct a confidence and/or prediction interval, we need to obtain \(SE_{\hat{\mu}}\) and \(SE_{\hat{y}}\) in JMP for the value \(x^*\) that we are interested in. The saved JMP table has the columns Month, Ad. Expend., Sales Rev., Pred. Sales Rev., StdErr Pred Sales Revenues and StdErr Indiv Sales Revenues.

Let's construct one 95% CI and PI by hand and see if we can come up with the same results as JMP. In the second month the appliance store spent x = $200 on advertising and observed $1000 in sales revenue, so x = 200 and y = 1000. Using the estimated LS regression line, we predict \(\hat{y} = -100 + 7 \cdot 200 = 1300\). We need to find \(t^*\) first: with n = 5 observations, df \(= n - 2 = 3\), and the critical value for 95% confidence is \(t^* = 3.182\).

A 95% CI for the mean response \(\mu_y\), when \(x^* = 200\): \(1300 \pm 3.182 \cdot 331.663 = 1300 \pm 1055.35\), i.e. roughly 244.65 to 2355.35.

A 95% PI for a single future observation of y, when \(x^* = 200\): \(1300 \pm 3.182 \cdot 690.411 = 1300 \pm 2196.89\), i.e. roughly -896.89 to 3496.89.

The corresponding JMP output table has the columns Month, Advertising exp., Sales Rev., Lower 95% Mean Sales Rev., Upper 95% Mean Sales Rev., Lower 95% Indiv Sales Rev. and Upper 95% Indiv Sales Rev.
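The by-hand confidence and prediction intervals for the appliance store can be reproduced with a short script. A strong caveat applies: the five data points below are an assumption. The original data table did not survive in this post; only the second month (x = 200, y = 1000), the fitted line and the two standard errors are stated. The values used here were chosen because they reproduce those reported numbers, not because the source confirms them. The critical value 3.182 is the tabulated t value for 95% confidence with 3 degrees of freedom.

```python
import math

# ASSUMED data (not given in the source); chosen to reproduce the reported
# fit (-100 + 7x) and standard errors (331.663 and 690.411).
ad_spend = [100.0, 200.0, 300.0, 400.0, 500.0]
sales = [1000.0, 1000.0, 2000.0, 2000.0, 4000.0]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in ad_spend)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales))

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual standard deviation with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))

x_star = 200.0
y_hat = b0 + b1 * x_star

# Standard error for estimating the mean response at x_star.
se_mean = s * math.sqrt(1.0 / n + (x_star - mean_x) ** 2 / sxx)
# Standard error for predicting a single new response at x_star.
se_indiv = s * math.sqrt(1.0 + 1.0 / n + (x_star - mean_x) ** 2 / sxx)

T_STAR_DF3 = 3.182  # tabulated 95% critical value, df = n - 2 = 3

ci = (y_hat - T_STAR_DF3 * se_mean, y_hat + T_STAR_DF3 * se_mean)
pi = (y_hat - T_STAR_DF3 * se_indiv, y_hat + T_STAR_DF3 * se_indiv)

print(f"fit: y_hat = {b0:.0f} + {b1:.0f} x, y_hat({x_star:.0f}) = {y_hat:.0f}")
print(f"SE mean = {se_mean:.3f}, SE indiv = {se_indiv:.3f}")
print(f"95% CI for mean response: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for single response: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Under these assumed data the prediction interval is roughly twice as wide as the confidence interval, which matches the point made above that a single future value is harder to pin down than a mean.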
