A series of basic statistics by Tom Lang
2. Estimates and Confidence Intervals
Introduction
In the first article, we talked about "descriptive statistics," an area of statistics concerned with describing and summarizing data with percentages, means, interquartile ranges, and so on. When we talk about estimates and confidence intervals, we move to an area called "inferential statistics," in which we measure the characteristics of a sample and guess or "infer" the value of these characteristics for the population from which the sample was taken.
Most medical research is done on samples, but the findings are applied to "populations." For example, we can't study all patients with, say, epilepsy; we can only study a sample of such patients. When we're done, we hope that what we have learned from the sample will also be true for all patients who have epilepsy.
In other words, we have to estimate the population values from the measured values of the sample. Then, because our sample is seldom more than a tiny fraction of the population, we need to know how accurate our estimate is. In medicine, this measure of accuracy is most often expressed as a confidence interval (CI), usually a 95% CI, although the "confidence coefficient" (the 95%) may be 90% for smaller samples and theoretically can be any number. Thus, understanding estimates and confidence intervals is important to understanding the medical literature.
Estimates
An estimate is a probable value for a population that is inferred from a measured value of a sample. In medicine, we sometimes want to estimate the value of a physical trait in a population, such as the average height of 7-year-olds. We might also want to estimate the response to an intervention, such as differences between groups ("between-group comparisons") or in the same group before and after treatment ("within-group comparisons").
I'll use the same example to talk about estimates and confidence intervals. In Western folklore, there is a mythical being called a "gnome" (pronounced "nohm"; Figure 1). Gnomes have only been seen in small groups, however, so no one knows how tall the average gnome is. Thus, our research question is "How do we estimate the average height of all gnomes if we can only measure a few of them?"
Suppose that a gnome magically appears on your desk. You measure it and find that it is exactly 10 cm tall. What is our best guess about the average height of all gnomes? The answer is: 10 cm, because it's all the information we have in a sample size of 1.
Now suppose a second gnome appears beside the first one. This gnome could be 10 cm, but probably it will be a little bigger or a little smaller. Suppose it measures 11 cm. Now, what is our best guess about the average height of all gnomes? The answer is: 10.5 cm, the average of the heights in our sample of 2, because that is all the information we have.
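The running estimate in this example is simply the mean of whatever has been measured so far. Here is a minimal Python sketch using the two heights from the text; the variable names are mine and are only illustrative.

```python
from statistics import mean

# Heights (cm) of the gnomes measured so far, in order of appearance.
heights_cm = [10, 11]

# Best estimate of the population mean after each new measurement.
estimates = [mean(heights_cm[:k]) for k in range(1, len(heights_cm) + 1)]

for k, estimate in enumerate(estimates, start=1):
    print(f"after {k} gnome(s): best estimate = {estimate} cm")
```

Each new gnome simply extends the list, and the estimate is recalculated from all the measurements in hand.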
We could repeat this process if, say, 20 gnomes were to appear: measure the height of each gnome and then calculate their mean height. This sample mean will, again, be the best estimate of the mean height of the gnome population. In other words, the sample mean is the best estimate of the population mean. The same is true for other characteristics of the sample as well: for medians, ranges, and standard deviations, for example.
Notice that our sample was small: 20 gnomes out of a population of several thousand gnomes (or so I've been told). With so many gnomes, how likely is it that our estimate is accurate? If we happened to get a sample containing the smallest gnomes, we would underestimate the average height in the population, and if we happened to get a sample containing the largest gnomes, we would overestimate it. What we need is a way to determine how accurate our estimate might be. This measure is the confidence interval.
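To see how much an estimate can vary from one sample to the next, here is a small simulation sketch in Python. Everything about the population is an assumption made up for illustration: 3000 gnomes with heights scattered around 10 cm. The samples are drawn at random rather than selected by hand.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 3000 gnome heights (cm), invented for illustration.
population = [random.gauss(10, 2) for _ in range(3000)]
population_mean = mean(population)

# Draw many random samples of 20 gnomes and record each sample mean.
sample_means = [mean(random.sample(population, 20)) for _ in range(1000)]

lowest, highest = min(sample_means), max(sample_means)
print(f"population mean: {population_mean:.2f} cm")
print(f"sample means range from {lowest:.2f} to {highest:.2f} cm")
```

Some samples land below the true mean and some above it, which is exactly why a single estimate needs a measure of its precision.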
Figure 1
The gnome is a mythological character in Western culture.
This one is 10 cm tall. If this gnome is the only one we have measured, our best estimate of the average height of all gnomes is 10 cm because that is all the information we have.
Confidence Intervals
To better describe the concept of a confidence interval, I will first use a pretend but easily understood example. Then, once you know the concepts, I will show how confidence intervals are actually determined. I'll use the gnomes again. Keep in mind that the steps described below are not what actually happens in determining a confidence interval; they just explain the concepts.
Suppose we have unlimited resources and unlimited cooperation of all the gnomes, such that we can take all possible random samples of, say, 10 gnomes. In other words, we draw a sample of 10 gnomes, measure the height of each, calculate the mean, graph the mean, and then return the gnomes to the population. We then draw another sample of 10 gnomes and repeat the process: measure each one, calculate the mean of the sample, graph the mean, and return the gnomes to the population. We repeat this process until we have taken samples of every possible combination of 10 gnomes (Table).
When we graph the means of all our samples (Figure 1), we find that they are normally distributed. (This result is explained by what is called "the central limit theorem," which I won't address here.) From the first article in this series, we know that the "area under the curve" can be expressed in units of standard deviation. More importantly, the mean of this graph of sample means is, again, our best estimate of the population mean. Now, however, instead of a single sample mean, we have a distribution of sample means. When we had a sample of Data, we called this unit the standard Deviation (SD). Now we have a distribution of sample Means, so we are going to call the standard deviation the standard error of the Mean (SEM).
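The pretend process can be imitated on a computer. The Python sketch below is only an approximation of it: instead of taking every possible sample of 10, it takes 2000 random samples, and the population itself (3000 heights centered on 10 cm) is invented for illustration.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of gnome heights (cm), invented for illustration.
population = [random.gauss(10, 2) for _ in range(3000)]
population_mean = mean(population)

# Repeated random samples of 10 stand in for "all possible samples".
sample_means = [mean(random.sample(population, 10)) for _ in range(2000)]
mean_of_means = mean(sample_means)

sd_of_data = pstdev(population)       # SD: variability of the data
sem_empirical = stdev(sample_means)   # SEM: variability of the sample means

print(f"population mean:      {population_mean:.2f} cm")
print(f"mean of sample means: {mean_of_means:.2f} cm")
print(f"SD of the data:       {sd_of_data:.2f} cm")
print(f"SD of sample means:   {sem_empirical:.2f} cm")
```

The mean of the sample means sits close to the population mean, and the sample means vary much less than the individual heights do.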
The SD and SEM represent the same concept and have the same mathematical properties: both can be used to indicate the area under a normal curve. The only difference is that the standard Deviation is a descriptive statistic that indicates the variability of a distribution of Data, whereas the standard error of the Mean is an inferential statistic that indicates the variability of an estimate; that is, the variability of the distribution of the Means of all possible samples of the same size.
Remember that about 68% of the data will be included in the range from 1 SD below the mean to 1 SD above the mean and that about 95% will be included between −2 SD and +2 SD. These relationships are the same for the SEM: about 68% of the sample means will be included in the range from 1 SEM below the mean of the sample means to 1 SEM above it, and about 95% will be included between −2 SEM and +2 SEM (Figure 2).
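These percentages can be checked with a short simulation. The sketch below, in Python, counts how many sample means fall within 1 and 2 SEMs of their overall mean. The population and the number of samples are assumptions chosen for illustration, and the function name share_within is my own.

```python
import random
from statistics import mean, stdev

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population of gnome heights (cm), invented for illustration.
population = [random.gauss(10, 2) for _ in range(3000)]
sample_means = [mean(random.sample(population, 10)) for _ in range(5000)]

center = mean(sample_means)   # mean of the sample means
sem = stdev(sample_means)     # SD of the sample means, i.e. the SEM

def share_within(k):
    """Fraction of sample means lying within k SEMs of the center."""
    inside = [m for m in sample_means if abs(m - center) <= k * sem]
    return len(inside) / len(sample_means)

within_1 = share_within(1)
within_2 = share_within(2)
print(f"within 1 SEM: {within_1:.1%}")
print(f"within 2 SEM: {within_2:.1%}")
```

The two shares should come out near 68% and 95%, although the exact figures will shift a little with the random seed.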
The mean of this distribution of sample means is the best estimate of the population mean, and the range given by plus or minus 2 SEMs is a 95% CI. In other words, we measured only samples of gnomes, and the estimate varied from sample to sample. However, we are fairly sure that the mean in 95 of 100 samples of the same size will fall within the range defined by 2 SEMs above and below the mean of the sample means.
Table
The heights of 100 gnomes as collected in 10 samples.
The overall mean of the 10 sample means is 9.3 cm, which is the best estimate of the average height of the gnome population. The SEM is 1.9 cm. Because the mean − 2 SEM = 5.5 cm and the mean + 2 SEM = 13.1 cm, the 95% CI is 5.5 to 13.1 cm. (See text for details.)
Figure 2
The pretend process of creating a 95% confidence interval around the estimated mean height.
Upper panel: we take all possible samples of the same size from the population of interest, compute the mean height of each sample, and graph the means. Lower panel: the means will be normally distributed, so 95% of the samples we drew had means that ranged between two SEMs above and below the overall mean of the new distribution. The overall mean is the estimated height, and the range between the mean plus and minus 2 SEMs is the 95% confidence interval for the estimate.
In reality, we generally measure only a single sample. The (measured) sample mean is the best estimate of the population mean, and the 95% CI is calculated from the SEM with the simple formula:

$$\mathrm{SEM} = \frac{\mathrm{SD}}{\sqrt{n}}$$

where SD is the standard deviation of the sample and n is the number of measurements in the sample.
One SEM on either side of our mean of sample means is about a 68% CI. To get the 95% CI, we essentially double it, which gives the range of values in which we expect the mean height to fall in 95 of 100 similar samples.
Using the data from the example in the figures, the mean of the distribution of all possible samples of the same size (although only 10 are shown here) is 9.3 cm. The SEM is 1.9, and 2 SEMs equal 3.8. Adding and subtracting the 3.8 to the mean of 9.3 gives us an estimated height of 9.3 cm with a 95% CI of 5.5 to 13.1 cm.
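The arithmetic above can be written out in a few lines of Python. This is a sketch under stated assumptions. The first part uses the values from the example (estimate 9.3 cm, SEM 1.9) with the rounded multiplier of 2. The second part computes the SEM for a single sample as the sample SD divided by the square root of the sample size. The 10 heights in that sample are invented for illustration, and the function names are my own.

```python
from math import sqrt
from statistics import mean, stdev

def ci95_from_sem(estimate, sem):
    """Approximate 95% CI as the estimate plus or minus 2 SEMs."""
    return estimate - 2 * sem, estimate + 2 * sem

# Values from the example in the text: mean 9.3 cm, SEM 1.9.
low, high = ci95_from_sem(9.3, 1.9)
print(f"95% CI: {low:.1f} to {high:.1f} cm")

def sem_from_sample(values):
    """SEM of a single sample: sample SD divided by the square root of n."""
    return stdev(values) / sqrt(len(values))

# Hypothetical single sample of 10 heights (cm), invented for illustration.
sample = [7, 12, 9, 10, 8, 11, 9, 10, 6, 11]
estimate = mean(sample)
sample_sem = sem_from_sample(sample)
sample_low, sample_high = ci95_from_sem(estimate, sample_sem)
print(f"estimate {estimate:.1f} cm, SEM {sample_sem:.2f}, "
      f"95% CI {sample_low:.1f} to {sample_high:.1f} cm")
```

Statistical software typically uses a slightly different multiplier than 2, depending on the sample size, so intervals from a package may differ a little from this rule of thumb.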
Figure 3
The distribution of sample means in the example of estimating the average height of gnomes.
The estimate is 9.3 cm, 1 SEM = 1.9 cm, and 2 SEMs = 3.8 cm, so the 95% CI is 5.5 to 13.1 cm.
The Misuse of the Standard Error of the Mean
The SEM is often used incorrectly as a descriptive statistic. Especially in the basic life sciences, measurements are routinely reported as means and SEMs. This practice is established and poses no problem to those who are used to seeing measurements presented this way. However, because the SEM is always smaller than the SD, it makes measurements look more precise than they would look if they were reported with SDs, so this distortion needs to be kept in mind when interpreting the SEM. My research indicates that the SEM is appropriately reported in only a few circumstances, such as in tables reporting regression analysis. The SD is preferred to describe a distribution of data, and the 95% CI is the preferred measure of precision for an estimate.
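A quick calculation shows why the SEM makes the same data look tidier than the SD does. The Python sketch below assumes the usual relationship, SEM = SD divided by the square root of the sample size, and uses an invented SD of 10 units with three invented sample sizes.

```python
from math import sqrt

def sem_from_sd(sd, n):
    """Standard error of the mean for a sample of size n with standard deviation sd."""
    return sd / sqrt(n)

sd = 10.0  # hypothetical SD, in whatever units the measurement uses
sems = {n: sem_from_sd(sd, n) for n in (4, 25, 100)}

for n, sem in sems.items():
    print(f"n = {n:>3}: SD = {sd:.1f}, SEM = {sem:.1f}")
```

The spread of the data is the same in every row, yet the SEM shrinks as the sample grows, so a mean reported with its SEM says little about how variable the measurements themselves were.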
The Value of Confidence Intervals
Confidence intervals have enormous value in reporting the results of medical research. The results of most biomedical studies (called the "effect size") are actually estimates and so should be accompanied by CIs. In addition, CIs are increasingly preferred to P values when reporting results. The P value is a measure of chance as an explanation and has no biological interpretation. (More about P values in the next article.) On the other hand, the CIs keep the interpretation focused on the biological meaning of the effect size.
Here's an example of the value of confidence intervals. Suppose a study finds that a drug reduces diastolic blood pressure (DBP):
"The drug reduced diastolic blood pressure by a mean of 15 mm Hg (95% CI = 3.5 to 26.5 mm Hg; P = 0.01)."
Here, the effect size was a drop of 15 mm Hg, and the drop in this one study was statistically significant. That is, if the drug did nothing, we would get a drop in DBP of 15 mm Hg or more by chance in only 1 of 100 similar studies. Given that low probability, we decide that the drug was probably responsible for the drop.
Let's assume that the 15 mm Hg reduction in DBP is clinically important. Even though this result is statistically significant here, the 95% CI tells us that the drop in DBP would probably range from 3.5 to 26.5 mm Hg in 95 of 100 similar studies. A drop of 26.5 mm Hg is clinically important, but a drop of only 3.5 mm Hg is not. So, we can't really say for sure that the drug is effective; our 15 mm Hg estimate is not precise enough. We need to do the study again, probably with a larger sample, to improve the precision of the estimate. When all the values in the CI are clinically important (or when all are not clinically important), we have a more definitive conclusion about the efficacy of the drug.
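The reasoning in this example can be sketched as a simple rule in Python: compare both ends of the CI with the smallest effect that would matter clinically. The function name interpret_ci and the 10 mm Hg threshold are assumptions of mine, chosen only for illustration. The text says only that a 3.5 mm Hg drop is not clinically important and that 15 and 26.5 mm Hg drops are.

```python
def interpret_ci(lower, upper, threshold):
    """Compare a CI for a beneficial effect with a minimum clinically important value."""
    if lower >= threshold:
        return "all values clinically important"
    if upper < threshold:
        return "no values clinically important"
    return "inconclusive: CI includes important and unimportant values"

# CI from the example in the text. The 10 mm Hg threshold is an assumption
# chosen only for illustration, not a clinical recommendation.
verdict = interpret_ci(3.5, 26.5, threshold=10.0)
print(verdict)
```

With this CI the rule returns the inconclusive verdict, which matches the conclusion that the study should be repeated with a larger sample.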
Bibliography
Rowntree D. Statistics Without Tears: An Introduction for Non-Mathematicians. London: Penguin Books, 2000.
Lang TA, Secic M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers. Philadelphia: American College of Physicians, 1997. Reprinted in English for distribution within China, 1998. Chinese translation, 2001. Second edition, 2006. Japanese translation, 2011; Russian translation, 2013.