Book Review

  • John Maindonald
  • Australian & New Zealand Journal of Statistics, December 2019, Wiley
  • DOI: 10.1111/anzs.12281

Statistical Inference as Severe Testing, by Deborah Mayo. Statistics wars, P-values, and more.

What is it about?

Overriding themes that this reviewer identifies are:

1) Warnings against practices that lead to BENT science (Bad Evidence, No Test). These include cherry picking, multiple testing, artificiality of experiments, publication bias, and so forth.
2) An elaboration of Karl Popper's ideas of severe testing, as they relate to scientific hypotheses, here within a null hypothesis significance testing (NHST) context.
3) A distinction between the philosophies of 'Probabilism' (primarily Bayesian), 'Performance' (where the sampling distribution has a central role), and her own philosophy, which she terms 'Probativeness' (identified as performance within a severe-testing context).
4) A defense of the role of P-values that leaves little or no room for investigating the implications of the choice of prior. In one class of examples that is discussed, the author is unwilling to allow a data-based choice.
5) Extensive discussion of the history, extending through to recently published work, from the author's own NHST perspective.

Why is it important?

How, based on available evidence, are defensible scientific judgments properly made? There can be no doubt about the importance, for scientific work, of the issues that this book addresses. This review takes up a number of points that have not attracted the attention they deserve in other reviews that I have seen. I argue that replication studies, to date mainly those mounted by the University of Virginia Center for Open Science (COS), have a much greater importance than Mayo allows them. They offer insights, otherwise unavailable, into the way that research and publication processes function. Those insights are of central importance in any discussion of scientific processes.


Mayo's emphasis on 'severe testing' is very welcome. It, or some equivalent, has to be part of any defensible philosophy of statistics. By contrast, the limits that she places on the data and models that may be used to inform statistical analysis are retrogressive. One effect is to limit the scope of severe testing. Considerations that are not relevant to assessing the quality and relevance of the evidence are presented against allowing the results of large-scale replication studies to influence judgments about the credibility to be placed on individual papers that are broadly comparable to those that have fallen within the scope of the study. Where 40% of relevant results could be reproduced, rather than perhaps 80%, readers have a much stronger incentive to look very critically at the claims made and at the supporting documentation.

Severe testing certainly requires the calling out of what Mayo identifies as BENT science. More than that, it requires putting the statistical tools that are used through the wringer. For Mayo, P-values have a central role. Here, severe critique is notably absent. Mayo resists attempts to express the false positive risk as a function of the P-value (or alpha level) and the prior probability that the null is false. Her argument appears to be that the models used are too simplistic to provide useful insight. In this, they reflect the simplistic dichotomies that NHST imposes on the assessment of scientific results.
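The false-positive-risk calculation at issue can be sketched in a few lines. This is an illustrative sketch only, not code from the review or the book: it assumes the standard two-hypothesis setup in which a fraction of tested hypotheses are real effects, with the function name and parameter names (`alpha`, `power`, `prior_real`) chosen here for exposition.

```python
def false_positive_risk(alpha, power, prior_real):
    """Probability that a 'significant' result is a false positive,
    given the significance level, the test's power, and the prior
    probability that the null is false (a real effect exists)."""
    prior_null = 1.0 - prior_real
    false_positives = alpha * prior_null   # null true, test rejects anyway
    true_positives = power * prior_real    # effect real, test detects it
    return false_positives / (false_positives + true_positives)

# With alpha = 0.05, power = 0.8, and a 10% prior chance of a real effect,
# the risk that a 'significant' finding is spurious is well over a third.
fpr = false_positive_risk(alpha=0.05, power=0.8, prior_real=0.1)
print(round(fpr, 3))
```

The point of such models is not precision but orders of magnitude: when real effects are rare among the hypotheses tested, a P-value just below 0.05 provides far weaker evidence than the nominal 5% error rate suggests. Mayo's objection, as the review notes, is that the setup is too simplistic to be informative.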


