Statistics

An ongoing discussion in my Facebook group about the reliability of specific scientific papers purporting to show the benefits of natural childbirth reminds me yet again that most people are deeply confused about statistics.

For laypeople, part of the problem is the fact that the word “statistics” has more than one meaning. Used colloquially, statistics can refer to data. The technical meaning of the term is different. According to the American Statistical Association:

Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …

Statistics can be divided into descriptive statistics and inferential statistics. Descriptive statistics are, as the name implies, descriptive. A classic example of descriptive statistics is an average. If someone drives 10 miles an hour for 1 hour, 20 miles an hour for 1 hour and 30 miles an hour for 1 hour, the average speed is 20 miles an hour. Descriptive statistics are extremely useful, yielding everything from RBIs in baseball, to class medians in college courses, to standard deviations that allow us to determine how much a specific result differs from the majority of results.
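For readers who like to check the arithmetic, here is a minimal sketch of the driving example using Python's standard library. The variable names are mine, and the simple average is only the right answer because the same amount of time (1 hour) is spent at each speed.

```python
from statistics import mean, median, pstdev

# One hour driven at each speed, as in the example above.
speeds_mph = [10, 20, 30]

average_speed = mean(speeds_mph)    # the average: (10 + 20 + 30) / 3
middle_speed = median(speeds_mph)   # the median: the middle value
spread = pstdev(speeds_mph)         # population standard deviation around the average

print(f"average {average_speed} mph, median {middle_speed} mph, "
      f"standard deviation {spread:.2f} mph")
```

All three numbers simply describe the three observations we have. None of them, on its own, says anything about other drivers or other trips.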

When we talk about scientific papers, however, we are referring to inferential statistics. Once again the name is apt. Inferential statistics allows us to make predictions about large groups or populations by looking at a small subset. For example, we might look at the long term outcomes of smoking among 10,000 people in order to make predictions about the impact of smoking on the tens of millions of people who smoke. The key value of inferential statistics is that it tells us which observations can be extrapolated to large populations and which cannot.

Most individual observations cannot be extrapolated to large populations because they represent a random result. For example, you might meet a woman named Esmerelda who has five daughters. You cannot infer from that that all women named Esmerelda will have five daughters, will have only daughters or will have children at all. And that brings us to the first and most basic element of inferential statistics. You must have collected a large group of individual observations before you can extrapolate to the general population. Why? Because inferential statistics tell us that unless we prove otherwise, any set of observations is likely to reflect random variation, and will not be reproduced in another set of similar observations.

How many individual observations do you need before you can draw an inference from a data set? The answer can be determined by statistical power, which is a shorthand way of describing the ability of a set of observations to yield an accurate prediction about a population. Consider coin tossing. Imagine that you toss a coin 6 times and it lands heads 2 times and tails 4 times. Can you infer that tossing a coin results in heads 1/3 of the time? No, you cannot, because with so few tosses a result like that is easily produced by chance. You need many more observations for the study to have enough power to yield an accurate inference. If you tossed a coin 2 million times and got heads 999,999 times and tails 1,000,001 times, you would be entitled to infer that coin tosses yield heads half the time and tails the other half.
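The coin example can be made concrete with a short sketch. It assumes a perfectly fair coin, and the function names are my own. It computes how often a fair coin gives exactly 2 heads in 6 tosses, and how much the observed fraction of heads is expected to wobble by chance (the standard error) for 6 tosses versus 2 million.

```python
from math import comb, sqrt

def prob_exact_heads(tosses, heads, p_heads=0.5):
    """Binomial probability of seeing exactly `heads` heads in `tosses` tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads)**(tosses - heads)

def standard_error(tosses, p_heads=0.5):
    """Typical chance wobble in the observed fraction of heads."""
    return sqrt(p_heads * (1 - p_heads) / tosses)

# A fair coin gives exactly 2 heads in 6 tosses in 15 of the 64 possible sequences.
p_two_of_six = prob_exact_heads(6, 2)

se_small = standard_error(6)
se_large = standard_error(2_000_000)

print(f"P(2 heads in 6 tosses of a fair coin) = {p_two_of_six:.3f}")
print(f"chance wobble with 6 tosses:         {se_small:.3f}")
print(f"chance wobble with 2 million tosses: {se_large:.5f}")
```

A fair coin produces the "1/3 heads" result nearly a quarter of the time with 6 tosses, so that result cannot tell a fair coin from a biased one. With 2 million tosses the chance wobble is a tiny fraction of a percentage point, which is why the second result supports an inference.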

Determining whether a study includes enough observations to draw inferences is beyond the scope of this post, but, as a general matter, the less likely you are to see a specific result, the more observations you need to reach valid conclusions. If a study looks at neonatal deaths at homebirth, and neonatal deaths are typically measured per 1,000, you are going to need several thousand observations or more to draw valid conclusions. There are exceptions to this rule, since it isn’t the number of observations alone that determines whether a study is adequately powered. However, as a general matter, if a study contains only a few dozen observations, it is underpowered and you cannot make ANY inferences regarding the results.

Most midwifery studies are grossly underpowered. If you look at the C-section rate among 20 women who employed doulas during labor and compare that to the C-section rate among 20 women who did not employ doulas, you can generate descriptive statistics such as the average C-section rate in each group. However, you cannot draw ANY conclusions about the impact of doulas on C-section rates, because the few observations you have generated are not necessarily representative of a real difference in C-section rates caused by employing a doula. When a study is underpowered, the results are simply random and tell us nothing.
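To see how little a 20 versus 20 comparison can detect, here is a rough sketch of a power calculation. The true C-section rates used (30% without a doula, 20% with one) are purely hypothetical numbers chosen for illustration, not figures from any study, and Fisher's exact test is my choice of significance test. The code walks through every possible outcome of the study and adds up the probability of the outcomes that would reach p < 0.05.

```python
from math import comb

def fisher_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value for a events out of n1 versus b out of n2."""
    total = a + b
    lo, hi = max(0, total - n2), min(n1, total)
    # Unnormalised hypergeometric weights for every table with the same margins.
    weights = {k: comb(n1, k) * comb(n2, total - k) for k in range(lo, hi + 1)}
    observed = weights[a]
    # Tables at least as unlikely as the observed one count as "as extreme".
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / sum(weights.values())

def binom_pmf(k, n, p):
    """Probability of exactly k events among n women when the true rate is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(n, rate_without, rate_with, alpha=0.05):
    """Chance that a study with n women per group reaches p < alpha."""
    return sum(
        binom_pmf(a, n, rate_without) * binom_pmf(b, n, rate_with)
        for a in range(n + 1)
        for b in range(n + 1)
        if fisher_two_sided(a, n, b, n) < alpha
    )

# Hypothetical true rates: 30% without a doula, 20% with a doula.
power_20 = power(20, 0.30, 0.20)
power_60 = power(60, 0.30, 0.20)

print(f"power with 20 women per group: {power_20:.2f}")
print(f"power with 60 women per group: {power_60:.2f}")
```

Under these assumptions, even a real ten-point drop in the C-section rate would usually go undetected with 20 women per group. More women per group raises the chance of detecting it.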

A second basic element of inferential statistics is statistical significance. A study may be adequately powered, yet the results may still be insufficient for us to draw conclusions. Consider a study that compares the neonatal death rate of 10,000 women who had homebirths with 10,000 women who had hospital births. Suppose that 11 babies in the homebirth group died and 10 babies in the hospital group died. Does that mean that homebirth is 10% more dangerous than hospital birth? No, it does not, because although the results of the two groups are different, performing the appropriate statistical test will show that the difference is not statistically significant. Determining which is the correct test of statistical significance and performing it can be complicated, but the underlying concept is simple. When a result is not statistically significant, it means that a difference of that size could easily have arisen by chance, so it cannot be taken as evidence of a true difference. If a result is not statistically significant, the result has no predictive value. In other words, it is meaningless.
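Here is one way the 11 versus 10 comparison could be checked. This is a sketch using Fisher's exact test, which is my choice for illustration. It is one reasonable test for a comparison of two small counts, not necessarily the one a given paper would use.

```python
from math import comb

def fisher_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value for a events out of n1 versus b out of n2."""
    total = a + b
    lo, hi = max(0, total - n2), min(n1, total)
    # Unnormalised hypergeometric weights for every table with the same margins.
    weights = {k: comb(n1, k) * comb(n2, total - k) for k in range(lo, hi + 1)}
    observed = weights[a]
    # Tables at least as unlikely as the observed one count as "as extreme".
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / sum(weights.values())

homebirth_deaths, homebirths = 11, 10_000
hospital_deaths, hospital_births = 10, 10_000

relative_risk = (homebirth_deaths / homebirths) / (hospital_deaths / hospital_births)
p_value = fisher_two_sided(homebirth_deaths, homebirths, hospital_deaths, hospital_births)

print(f"relative risk: {relative_risk:.2f}")
print(f"two-sided p-value: {p_value:.3f}")
```

The relative risk is 1.10, which is where the "10% more dangerous" figure comes from. The p-value is far above the conventional 0.05 cutoff, because a split of 11 and 10 is about the most ordinary way for 21 deaths to fall between two equal groups if the true risk is identical.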

A third basic concept of inferential statistics is that you must be sure that you are comparing like with like. Suppose you are comparing breastfeeding rates at a hospital with the “baby friendly” designation to one that lacks the designation. In order to draw a valid conclusion, you must be sure that the women who give birth at the baby friendly hospital are similar to the women who give birth in the other hospital. Suppose it turns out that the baby friendly hospital is located in a wealthy suburb, where almost all the women are married, well educated and relatively well off, while the hospital that is not baby friendly is in the inner city and serves a population of teen mothers who are uneducated and impoverished. In that case you are not going to be able to draw any conclusions about whether the baby friendly designation improves breastfeeding rates, even if the baby friendly hospital has a statistically significant increase in rates and even if the study is adequately powered. That’s because there are other factors, known as confounding factors, that may be responsible for the observed difference, and the baby friendly designation may have nothing to do with it at all.
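A small sketch shows how a confounding factor can manufacture a difference. Every count below is invented purely for illustration and comes from no real hospital. In this made-up data, well-off mothers breastfeed at 80% and impoverished mothers at 40% at BOTH hospitals, so the designation has no effect at all, yet the overall rates look very different because the two hospitals serve different populations.

```python
# Hypothetical counts, invented for illustration only:
# hospital -> group of mothers -> (number of mothers, number who breastfeed)
hospitals = {
    "baby friendly": {"well off": (900, 720), "impoverished": (100, 40)},
    "not baby friendly": {"well off": (100, 80), "impoverished": (900, 360)},
}

def crude_rate(groups):
    """Overall breastfeeding rate, ignoring who the mothers are."""
    mothers = sum(n for n, _ in groups.values())
    breastfeeding = sum(k for _, k in groups.values())
    return breastfeeding / mothers

def stratified_rates(groups):
    """Breastfeeding rate within each group of similar mothers."""
    return {name: k / n for name, (n, k) in groups.items()}

crude = {name: crude_rate(groups) for name, groups in hospitals.items()}
stratified = {name: stratified_rates(groups) for name, groups in hospitals.items()}

for name in hospitals:
    print(f"{name}: overall {crude[name]:.0%}, by group {stratified[name]}")
```

Comparing overall rates makes the baby friendly hospital look far better (76% versus 44%). Comparing like with like, group by group, shows no difference at all. The entire gap comes from who gives birth at each hospital.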

Obviously this is a grossly oversimplified view of inferential statistics, but it does suggest several things that lay people can look for when trying to determine if the conclusions of a study are valid.

Does the study involve lots of people in each group? If not, the study is underpowered and the results are meaningless.

Are the differences between the two groups statistically significant? If not, the results are meaningless.

Did the authors compare groups that are similar except for the one variable under investigation? If the two groups differ appreciably, the results are meaningless.

Just by looking at these three factors, lay people can easily discard much of the homebirth and natural childbirth literature as invalid.

The Skeptical OB

Dr. Amy