# A series of basic statistics by Tom Lang

## 6. Correlation and Linear Regression Analysis

### Introduction

### Correlation Analysis

Figure 1 shows three patterns: A) positively correlated, B) negatively correlated, and C) not correlated.

The fact that correlation is not established as present or absent based on the P value also means that the result has to be interpreted. Describing the correlation as weak, moderate, or strong depends on the medicine involved, not on the size of the correlation coefficient itself, despite what some authors have proposed (Figure 2). For example, the concentration of a substance in an IV infusion should be highly correlated with its concentration in the blood. In such a case, a correlation coefficient of 0.85, often described as a high correlation in many instances, may be unacceptably low.
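As a concrete illustration, the Pearson correlation coefficient can be computed directly from its definition; the infusion and serum numbers below are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical IV-infusion example: concentration in the bag vs. in serum.
infusion = [10, 20, 30, 40, 50]
serum = [1.1, 2.0, 3.2, 3.9, 5.1]
print(round(pearson_r(infusion, serum), 3))  # → 0.997
```

A coefficient this close to 1 reflects the near-proportional relationship expected between infusion and serum concentrations.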

### Simple Linear Regression Analysis

Linear regression analysis extends correlation analysis by "fitting" a "least-squares regression line" to the scatter plot. Basically, this line is the line that passes as close as possible to all the data points. (Actually, the distance from each data point to the line is squared, and the line with the smallest "sum of squares," the "least-squares line," is the line that best summarizes the data; Figure 3.)
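The least-squares slope and intercept have simple closed-form solutions, which a short sketch can make concrete; the data points below are hypothetical:

```python
def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared vertical
    distances from the data points to the line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover slope 2 and intercept 1.
slope, intercept = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
```

With real, noisy data the fitted line will not pass through every point; it is simply the line with the smallest sum of squared residuals.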

A) The differences are small and close to the line of zero differences throughout the range of values on the X-axis. B) A graph showing that the underlying relationship is linear, but there is much more variability in the data, meaning the model will probably not predict as well as the model shown in A.

In addition, the coefficient of determination, r^{2}, should be reported. The importance of r^{2} is that it indicates how good the model is: how much of the variability in Y is explained by knowing X. Values closer to zero mean that the model does not predict well, and values closer to 1 indicate that it predicts better.
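One way to see what r^{2} measures is to compute it from the residual and total sums of squares; a minimal sketch with made-up observations and predictions:

```python
def r_squared(y, y_pred):
    """Coefficient of determination: the share of the variability in y
    that is explained by the model's predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, y_pred))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot

# Nearly perfect predictions give a value close to 1.
print(round(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]), 2))  # → 0.98
```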

Finally, the model needs to be "validated," or tested to determine whether it is really modeling the data well. One form of validation is to develop the model on, say, 80% of the data and then to see how well it predicts in the remaining 20%. If the values of r^{2} are similar, the model is considered to be validated. Another way to validate the model is to compare it with an existing model developed by someone else on a similar set of data. Again, if the values of r^{2} are similar, the model is considered to be validated.
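The 80%/20% validation idea can be sketched as follows, using simulated data with an assumed linear trend plus noise (all numbers here are invented for illustration):

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def r_squared(y, y_pred):
    my = sum(y) / len(y)
    return 1 - (sum((o - p) ** 2 for o, p in zip(y, y_pred))
                / sum((o - my) ** 2 for o in y))

random.seed(0)
# Simulated weights and concentrations with an assumed linear trend plus noise.
x = [random.uniform(40, 100) for _ in range(200)]
y = [12.6 + 0.25 * a + random.gauss(0, 2) for a in x]

# Develop the model on 80% of the data...
idx = list(range(200))
random.shuffle(idx)
train, test = idx[:160], idx[160:]
intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])

# ...then check that r-squared is similar in the held-out 20%.
r2_train = r_squared([y[i] for i in train],
                     [intercept + slope * x[i] for i in train])
r2_test = r_squared([y[i] for i in test],
                    [intercept + slope * x[i] for i in test])
```

If the two r-squared values are close, the model's predictive ability generalizes beyond the data it was fitted on.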

The simple linear regression model we developed for predicting serum drug concentrations from weight was: Y = 12.6 + 0.25X. The slope of the regression line was significantly greater than zero, indicating that serum levels tend to increase as weight increases (slope = 0.25; 95% CI = 0.19 to 0.31; t_{451} = 8.3; P < 0.001; r^{2} = 0.67).

Where:

- Y is the drug concentration in mg/dL
- 12.6 is the Y-axis intercept point
- X is weight in kg
- 0.25 is the slope of the regression line (also called the regression coefficient or beta weight); for each additional kilogram of weight, drug concentration increased by 0.25 mg/dL
- 0.19 to 0.31 is the 95% confidence interval for the slope of the line: if the study were repeated 100 times with data from the same population, we would expect the slope of the regression line to fall between 0.19 and 0.31 in 95 of the studies
- t_{451} = 8.3 is the value of the t statistic with 451 "degrees of freedom," numbers that are an intermediate step in determining the P value
- P < 0.001 is the probability that the slope of the line would differ from a slope of zero (a flat, horizontal line) if there were no relationship between X and Y
- r^{2} = 0.67 is the coefficient of determination; a patient's weight explains 67% of the variation in drug concentrations
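Plugging a weight into the fitted equation gives the predicted concentration; for example, for a hypothetical 70-kg patient:

```python
def predicted_concentration(weight_kg):
    """Apply the fitted model Y = 12.6 + 0.25 * X (X in kg, Y in mg/dL)."""
    return 12.6 + 0.25 * weight_kg

# A hypothetical 70-kg patient: 12.6 + 0.25 * 70 = 30.1 mg/dL.
concentration = predicted_concentration(70)
```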

### Multiple Linear Regression Analysis

Below is an example of a multiple linear regression model with four variables, X_{1} through X_{4}. The number before each variable is the regression coefficient or beta weight that indicates how much the value of Y will change for each unit change in X.

Y = 12.6 + 0.25X_{1} + 13X_{2} - 2X_{3} + 0.9X_{4}
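The model is applied by substituting values for the four predictors; the predictor values below are hypothetical, chosen only to show the arithmetic:

```python
def predict_y(x1, x2, x3, x4):
    """Evaluate the example model Y = 12.6 + 0.25*X1 + 13*X2 - 2*X3 + 0.9*X4."""
    return 12.6 + 0.25 * x1 + 13 * x2 - 2 * x3 + 0.9 * x4

# Hypothetical predictor values, purely for illustration:
# 12.6 + 17.5 + 13 - 4 + 4.5 = 43.6
predicted = predict_y(70, 1, 2, 5)
```

Note the sign of each coefficient: X_{3} reduces the predicted value of Y, while the other three predictors increase it.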

Individual predictor variables that are significantly related to the outcome variable are called "candidate variables" because they will be considered for inclusion in the final model. Often, the threshold of statistical significance will be set higher than the typical 0.05, such as at 0.2, to make sure that all predictors that might be even remotely related to the outcome will be considered.

Once the candidate variables have been identified, they should be evaluated for "collinearity" and "interaction." Collinear variables are highly correlated, such that they add the same information to the model. Height and stride length (the distance a person moves with each step) are highly correlated, for example, so only one needs to be included in a model.
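A simple collinearity screen computes the pairwise correlations among candidate predictors and flags highly correlated pairs; the data and the 0.8 cutoff below are assumptions for illustration, not a standard rule:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

# Hypothetical candidate predictors; height and stride length move together.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "stride_cm": [60, 65, 69, 74, 79],
    "age_yr":    [34, 22, 51, 40, 28],
}

# Flag pairs whose correlation exceeds an assumed 0.8 cutoff.
names = list(predictors)
flagged = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
           if abs(pearson_r(predictors[a], predictors[b])) > 0.8]
```

Only one variable from each flagged pair needs to be carried forward into the model.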

Two variables are said to interact if, in combination, they produce an effect greater than the sum of their individual effects. For example, taking barbiturates and drinking alcohol at the same time can be fatal, even if the dose of each drug by itself is not. In this case, an "interaction term" that models the interaction between the two variables is created and added to the model.
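An interaction term is commonly constructed as the product of the two predictors, so the model can fit an effect for the combination beyond the two individual effects; the dose values below are hypothetical:

```python
# Hypothetical rows of (barbiturate_dose, alcohol_dose); the interaction
# term is the product of the two predictors and is zero unless both are present.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.0, 2.0)]
with_interaction = [(x1, x2, x1 * x2) for x1, x2 in rows]
```

The interaction column is nonzero only in the last row, where both drugs are taken together, which is exactly the situation the model needs to capture.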

When all the candidate variables have been identified, collinear variables have been eliminated, and interaction terms have been added, the variables are put into a "variable selection process," in which they are combined in various ways to create several regression models. Each model has a "coefficient of multiple determination," R^{2}, which is the same as the coefficient of determination, r^{2}, in simple linear regression, except that it applies to multiple regression. The model with the largest R^{2} will predict the outcome best and is chosen as the final model.
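Choosing the final model by the largest R^{2}, as described above, amounts to a simple maximization over the candidate models; the model names and R^{2} values below are invented for illustration:

```python
# Hypothetical candidate models and their R-squared values (assumed numbers).
candidate_models = {
    "weight":              0.67,
    "weight + age":        0.71,
    "weight + age + dose": 0.74,
}

# Pick the model whose R-squared is largest.
best = max(candidate_models, key=candidate_models.get)
```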

An example of a correctly reported multiple linear regression analysis is shown in the Table.