Female doctors provide better care than male doctors? New paper grossly exaggerates findings.


A new study claiming to show that female doctors provide better care than male doctors is a paradigmatic example of the way that scientific research has been perverted by contemporary journalism. In an effort to provide click-bait headlines, scientists are grossly exaggerating the clinical significance of their findings and thereby misleading the public.

Is a difference in mortality rate of 0.43% clinically meaningful?

The paper is Comparison of Hospital Mortality and Readmission Rates for Medicare Patients Treated by Male vs Female Physicians, published this week in JAMA Internal Medicine.

NPR is typical of news outlets hyping the findings:

In a study that is sure to rile male doctors, Harvard researchers have found that female doctors who care for elderly hospitalized patients get better results. Patients cared for by women were less likely to die or return to the hospital after discharge.

Previous research has shown that female doctors are more likely to follow recommendations about prevention counseling and to order preventive tests like Pap smears and mammograms.

But the latest work, published Monday in JAMA Internal Medicine, is the first to show a big difference in the result that matters most to patients: life or death.

According to the paper:

Elderly hospitalized patients treated by female internists have lower mortality and readmissions compared with those cared for by male internists. These findings suggest that the differences in practice patterns between male and female physicians, as suggested in previous studies, may have important clinical implications for patient outcomes.

What did the researchers do?

We analyzed a 20% random sample of Medicare fee-for-service beneficiaries 65 years or older hospitalized with a medical condition and treated by general internists from January 1, 2011, to December 31, 2014. We examined the association between physician sex and 30-day mortality and readmission rates, adjusted for patient and physician characteristics and hospital fixed effects …

So far, so good.

What did they find?

… Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233) and lower 30-day readmissions (adjusted readmissions, 15.02% vs 15.57%; adjusted risk difference, –0.55%; 95% CI, –0.71% to –0.39%; P < .001; number needed to treat to prevent 1 readmission, 182) than patients cared for by male physicians, after accounting for potential confounders…
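The number-needed-to-treat figures in that passage follow directly from the adjusted risk differences: NNT is the reciprocal of the absolute risk difference, rounded up. A minimal sketch of the arithmetic, using only the figures quoted above (the function name is made up for illustration):

```python
import math


def number_needed_to_treat(risk_difference_percent: float) -> int:
    """Return the NNT for an absolute risk difference given in percent.

    NNT = 1 / |absolute risk difference|, rounded up to a whole patient.
    """
    absolute_risk_difference = abs(risk_difference_percent) / 100.0
    return math.ceil(1.0 / absolute_risk_difference)


# Adjusted risk differences as quoted from the paper's abstract.
mortality_nnt = number_needed_to_treat(-0.43)
readmission_nnt = number_needed_to_treat(-0.55)

print(mortality_nnt, readmission_nnt)
```

Both values match the 233 and 182 reported in the abstract.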

The study is methodologically sound. The authors carefully chose the subjects and carefully corrected for a large number of confounding variables. The differences are statistically significant but the authors fail to pay careful attention to the most important question: Are the results clinically significant?

Put another way, is a difference in adjusted mortality rates of 0.43% clinically meaningful?

I doubt it.

What’s the difference between statistical significance and clinical significance?

A biostatistician explains:

In clinical research is not only important to assess the significance of the differences between the evaluated groups but also it is recommended, if possible, to measure how meaningful the outcome is (for instance, to evaluate the effectiveness and efficacy of an intervention). Statistical significance does not provide information about the effect size or the clinical relevance. Because of that, researchers often misinterpret statistically significance as clinical one…

That is what has happened here. The authors have claimed that there is a statistically significant difference in death rates and readmission rates between female and male physicians; that’s true. But they’ve gone on to imply that this finding is the same as a clinically significant difference and that’s false.
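To see why a small P value says little about the size of an effect, consider a rough sketch that holds the two mortality rates fixed and changes only the number of patients. It uses a plain unadjusted two-proportion z-test, which is not the regression model the authors used, and the group sizes are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(p1: float, p2: float, n_per_group: int) -> float:
    """Two-sided p-value for a two-proportion z-test with equal group sizes."""
    pooled = (p1 + p2) / 2  # pooled rate when both groups have the same n
    standard_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (p1 - p2) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Adjusted 30-day mortality rates quoted from the paper.
male_rate, female_rate = 0.1149, 0.1107

# Hypothetical group sizes (assumptions, not the study's actual counts).
p_values = {
    n: two_proportion_p_value(male_rate, female_rate, n)
    for n in (1_000, 10_000, 100_000, 500_000)
}

for n, p in p_values.items():
    print(f"n per group = {n:>7}: p = {p:.3g}")
```

The difference between the groups is identical in every row; only the sample size changes, and with it the P value. A very large data set can make almost any difference statistically significant.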

What is the clinical significance of these findings?

The authors haven’t done all the appropriate calculations to determine clinical significance and have not provided the information we need to calculate them.

They do provide the number needed to treat (i.e. 233 to prevent one death), but they don’t provide the Cohen’s effect size, nor the standard deviations that we could use to calculate the Cohen’s effect size for ourselves.
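One rough workaround, offered here as an assumption rather than anything the authors report, is Cohen's h, an effect-size measure for the difference between two proportions that needs only the two rates. The sketch below treats the adjusted mortality rates as if they were simple proportions, which is a simplification.

```python
from math import asin, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# Adjusted 30-day mortality rates quoted from the paper,
# treated here as plain proportions (an assumption).
h = cohens_h(0.1149, 0.1107)

print(f"Cohen's h = {h:.4f}")
```

By Cohen's conventional benchmarks, 0.2 counts as a small effect, 0.5 as medium and 0.8 as large, so the value can be read against those thresholds.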

There’s another aspect of clinical significance that hasn’t been considered: are 30 day mortality and readmission rates a good measure of effect? Most people do not choose their doctor by calculating the odds of surviving 30 days after admission to the hospital, nor should they.

The authors hype their findings by extrapolating them. According to NPR:

The study’s authors estimate “that approximately 32,000 fewer patients would die if male physicians could achieve the same outcomes as female physicians every year.”

But the authors provide no evidence that their findings in Medicare patients admitted to the hospital could be extrapolated to other elderly, hospitalized patients, let alone younger patients, outpatients or doctors from specialities other than internal medicine.

The claim is an excellent example of the way that scientific research is perverted by the need to publish and to publicize.

The research paper yielded an interesting finding whose clinical significance is uncertain, has yet to be replicated, and should not be extrapolated. Then the authors proceeded to hype their findings in order to get attention for them. In the process, they have made claims that don’t bear scrutiny and essentially mislead the American public.

That behavior may be good for getting tenure, but it’s not good for patients.

  • omdg

    A NNT of 233 is clinically significant if you compare it to what we use to make medical decisions for other therapies. For instance, for common treatments such as statin or aspirin use, commonly cited NNTs are 2000 or 10000 for aspirin to prevent non-fatal MI or stroke over a year, or 268 for a statin to prevent stroke over a year. This result is a NNT of 233 for DEATH within 30 days. Now, one could argue that 30 day mortality is a suboptimal outcome measure in this population of elderly adults, or that we really shouldn’t be pushing aspirin or statin use either…. but I think many people would argue that 233 patients is actually quite clinically significant.

  • Mom of 3

    Off topic but I was really disappointed to see Dr Amy post a racist meme on the SOB facebook page comparing obnoxious lactivism with white supremacists using the confederate flag. I only saw it when someone posted a SS on another fb page I’m on so I don’t know if Dr Amy attempted to defend it before taking it down. The apology was disappointing, as it was of the “I’m sorry people were upset” variety. I hope to see more understanding from you, Dr Amy, about why that meme was so offensive. I hate obnoxious lactivists as much as anyone here, but it is simply wrong and shows a lack of understanding of the plague of systemic racism in this country to compare it with the white supremacy associated with the confederate flag.

    I’ve been a fan of this blog for years and have really enjoyed your recent posts about politics but this was completely unacceptable.

    • Mom of 3

      Thank you for the updated apology on your facebook page.

  • Sue

    All good points.

    The headlines could be re-written “Male and female internists achieve almost precisely the same 30-day mortality”.

    But 30-day mortality is just one point in time. What is the 5-day, 20-day or 60-day mortality? The 5 yr mortality?

    Not to mention that “32,000 fewer patients would die” is just a nonsense – as we all will die.

  • CSN0116

    And where have we seen decades worth of perverted clinical vs statistical significance over and over and over?!

    I swear, between this and Bartick’s latest “simulation” findings, I’m about to throw my CV at Harvard and see how the cards fall. Shit, I might just land a job.

  • Concerned Anon

    Alma Midwifery Birth Center in Portland, Oregon has a GoFund Me – https://www.gofundme.com/alma-midwifery-what-love-can-do

  • MaineJen

    I thought this sounded suspect when I heard it this morning. I also wondered: could this difference be due to the tendency of male and female physicians to gravitate toward different specialties? Or did they control for that?

    • fiftyfifty1

      They only looked at one specialty: general internal medicine.

      • Prudent planner

        Yes.
        Perhaps internal medicine has more of the male ‘B team’, if the male ‘A team’ tends toward specialties like brain surgery? … But ‘A team’ female doctors may remain in internal medicine as a quality-of-life trade-off.

        • fiftyfifty1

          Yes, it’s possible that general internal medicine men are more from the “B team” while general internal medicine women are more from the “A team”. But of course that doesn’t change things from the patient’s point of view. When you are in the hospital with a sickness you are going to be cared for by an internal medicine doc, not a brain surgeon. The women docs may be a tiny bit better because they are “A team” or maybe because they are “so nurturing” or “so conscientious” or “such good communicators”. Who knows.

          • Azuran

            Or also because patients and even coworkers are more likely to doubt or question a female doctor than a male one. At least openly and to their faces.
            I know that regardless of our skills, experience or aptitude, the male vets I work with are much less likely (as in never) to get complaints against them, while people openly bitch about the female vets.
            Receiving criticism or being openly doubted more often probably helps in making sure you are always at the top of your game and as up to date as you can be.

          • StephanieJR

            People really complain about the gender of their vet? WTF? How does that matter? All I want is someone that knows exotics so they can treat my rabbit effectively, and who can give me clear instructions on what to do.

          • Azuran

            It’s not really complaining about the gender. It’s more like, people having inherently more respect for a male doctor than a woman one. It’s more a bias that people have. If something goes wrong (whether it’s a real complication, or an imaginary one) the owners will be more reluctant to complain against the male vets.

          • shay simmons

            I don’t know if it was a business decision or personal preference on their parts — the husband/wife vet clinic we use is in a country/agricultural town with a fairly significant % of Apostolic Christians (like the Amish, but with electricity). She does the small animal work, he treats livestock.

          • Azuran

            Probably a personal preference.
            Men are indeed more likely to end up working with livestock than women. Generally it’s mainly a personal preference. But there are indeed a few biases and stereotypes that might influence your decision. Working with livestock is seen as more manly: it’s dirty, more physically demanding etc. And although it’s not the majority, the sexist farmer stereotype didn’t come out of nowhere. Some farmers will be vocal about their preference for male vets and their distrust of women. And usually, male vets will be more likely to get hired than women in vet clinics treating large animals.
            Those are things any female vet knows and has to be ready to face if she wants to work with large animals, so it probably influences the decisions of a few of them.

          • shay simmons

            I know that regardless of our skills, experience or aptitude,

            An experience, alas, that every professional woman faces at some time in her career.

          • Sue

            Or they may be almost exactly the same, as in “11.07% vs 11.49%” 30-day mortality.

        • Sue

          Did they adjust/stratify for any characteristic other than physician gender?

          How about age, experience, hours of work, duration of training – there could be a multitude of factors that are different between men and women in internal medicine.

  • Sheven

    I don’t understand the difference between statistical and clinical significance. Could you explain a little more about it? Maybe an example of a circumstance in which something might be statistically significant but make no difference when it comes to clinical significance?

    • Azuran

      Clinical significance is more of a measure of the visible clinical effect, to see if it’s something that actually matters in real life.

      There was a study a few months ago about how c-section raised the risk of obesity by 15%. That was statistically significant.
      However, in real life, that basically meant a difference of 1 pound. Gaining or losing 1 pound makes no difference whatsoever; your weight varies by more than that every day. So it wasn’t really clinically significant.

      Or another example would be people losing their minds over a study showing that breastfed babies have a statistically significant increase of 2 IQ points when compared to formula fed babies. But they forget that clinically, 2 IQ points is nothing.

      • Dr Kitty

        Saying “extended breastfeeding has as much effect on IQ as drinking a double espresso as opposed to a green tea before the IQ test” is true, but probably not something that will win you friends in lactivist circles.

    • Ceridwen

      Even in academic circles we tend to talk about “real world significance vs. statistical significance”. In particular, while large data sets are awesome, they have a tendency to produce results that are clearly statistically significant but of a scale that makes them highly unlikely to make a difference in the real world, where noise from other areas will swamp that difference.

      As an example I recently published a paper where distance between sites over water was highly significantly associated with genetic distance between organisms at those sites. But the size of the effect was tiny, with only ~1% of the genetic variation explained by distance over water. In contrast, a composite variable that included other measures of habitat suitability besides water explained >30% of the variation in genetic distance and was similarly statistically significant. In the real world the distance over water provides effectively no information about how related two populations would be, while the composite habitat suitability variable provides quite a bit of information even though both are equally statistically significant.

      Statistical significance is useful to know about but effect sizes matter too. Especially if you are trying to use the results to make changes in practice.

      • Sheven

        So, if I’m getting this right, statistical significance means that it probably exists if everything else is kept equal. Clinical significance means that, in the real world where everything else is rarely kept equal, it has a noticeable effect. Yes/no?

        • Amy Tuteur, MD

          No, it’s that the difference is real but it doesn’t matter in a real world setting.

        • Sue

          Statistical significance means that the two numerical values are unlikely to be different just by chance. This is a numerical calculation.

          Clinical significance means that it makes a difference to the health outcomes between the two groups.

          So, the numerical difference is real, but it doesn’t make any difference to the health of the people it is measured on.

          This is sometimes known as the difference between test-based outcomes vs patient-based outcomes.

          Here’s a relevant article:
          https://freshbiostats.wordpress.com/2014/03/28/statistical-or-clinical-significance-that-is-the-question/

        • Roadstergal

          The tricky part of all of this is, of course, p-hacking and other conscious and unconscious tricks.
          If you retrospectively analyze a bunch of things after getting a dataset, some of them will come up positive purely by chance. The most common pitfall is looking at too many things retrospectively in order to make sure something falls out by chance, but there are other statistical pitfalls, as well.

          Ideally, you take a dataset and split it in half. Half is designated as a ‘training set,’ where you look at the things you’re hypothesizing and see which ones fall out as significant. Half is your ‘test dataset,’ where you take your short list of things to look at from the training set, and see if that significance holds. I generally work with a stats person to give me an alpha before the test set, so I have some context for the p-value. And you have to run the right test – some statistical tests are commonly misused (like a paired t-test for multiple comparisons, or using a test that works best on a normal dataset on a non-normal dataset).

          This is a long way of saying – just because it came out as ‘significant’ doesn’t even necessarily mean it’s numerically significant. Someone just ran a program that might or might not have been the right one, and got a number, and decided that number met some arbitrary threshold.

    • Kristi Berry Pedler

      Pretend I did a paper showing drug Loozalot is statistically significant in causing weight loss over intervention X. Digging into the paper, Loozalot takers lost 0.6 lbs more over a 16 week study time.

      Can YOU tell the difference clinically? If not, the result may be statistically significant but not clinically significant.

    • Sue

      Hi, Sheven

      The easiest way to understand the difference is for research where the measured results are numerical values – like the value of some blood test, or blood pressure.

      Let’s say that treatment A results in a blood pressure of 3 points lower than treatment B. The study might be analysed in a way that shows that the difference between A and B is statistically significant (ie unlikely to have occurred by chance) but that difference in blood pressure makes no difference to the patients’ outcomes.

    • AnnaPDE

      The trick is that statisticians and normal people use the word “significance” for two different things.
      Normal person: Significance = Importance. “This finding is significant” means that there’s a noticeable, important effect. In other words, the effect size is big enough to make the thing matter in the real world.
      Statistician: Significance = Statistical reliability. “This finding is significant” means that the data allow us to draw conclusions that are probably due to there being an actual effect, not just random variation that happened to look like there’s something going on. It might be a tiny effect that really doesn’t matter for practical purposes, but the likelihood of the finding being just a fluke in the data is low. (How low? That’s what the p-value measures.)

  • Roadstergal

    I saw this being shared on FB, and I thought, this has got to be yet another batch of clickbait bullshit. Thanks for putting this up!