In the context of problems with replicability in psychology and other empirical fields, statistical significance testing and p-values have received a lot of criticism. And without question, much of the criticism has merit. There certainly are problems with how significance tests are used and how p-values are interpreted.[1]
However, when we talk about “p-hacking”, I feel that the blame is placed unfairly on p-values and significance testing alone, without acknowledging the general consequences of such behaviour for the analysis.[2] In short: selective reporting of measures and cases[3] invalidates any statistical method for inference. When I report variables and studies only selectively, it doesn’t matter whether I use p-values or Bayes factors: both results will be useless in practice.
Questionable research practices change the sample and its representation of an underlying population. Consider classical significance tests: the common model assumptions, such as random sampling, are then gravely violated. This changes the effect size estimates and the test statistics in a systematic but unknown way. Hence, any inference (both statistical and theoretical) based on this sample will also be biased in an undetermined way. And this translates to Bayes factors (and statistical modelling in general) as well: the models no longer represent the generative process. While this is true in general (“all models are wrong, but some are useful”), under questionable research practices (i.e. selective reporting of measures, covariates or cases) the inferences about the real world cannot be valid.[4]
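A small simulation can make this concrete. The following is only a sketch under assumptions of my own choosing, not a reproduction of any published simulation: every study records several independent pure-noise measures with known unit variance, the test is a simple z-test, and the Bayes factor compares H0: mu = 0 against H1: mu ~ N(0, tau^2) with tau = 1. The cut-offs (p < .05, BF10 > 3) and all function names are illustrative.

```python
import math
import numpy as np


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bf10(xbar, n, tau=1.0):
    """Bayes factor for H1: mu ~ N(0, tau^2) against H0: mu = 0,
    given the mean of n observations with known unit variance
    (ratio of the two marginal densities of the sample mean)."""
    se2 = 1.0 / n
    scale = math.sqrt(se2 / (tau**2 + se2))
    return scale * math.exp(0.5 * xbar**2 * (1.0 / se2 - 1.0 / (tau**2 + se2)))


def simulate(n_studies=5000, n=30, n_measures=5, seed=1):
    """Compare honest reporting with reporting only the most extreme measure."""
    rng = np.random.default_rng(seed)
    # The null is true for every measure in every study.
    means = rng.standard_normal((n_studies, n_measures, n)).mean(axis=2)
    honest = means[:, 0]  # the single measure fixed in advance
    idx = np.abs(means).argmax(axis=1)
    selective = means[np.arange(n_studies), idx]  # most extreme measure only
    out = {}
    for name, xbars in (("honest", honest), ("selective", selective)):
        p = np.array([two_sided_p(x * math.sqrt(n)) for x in xbars])
        bf = np.array([bf10(x, n) for x in xbars])
        out[name] = {
            "p<.05": float((p < 0.05).mean()),
            "BF10>3": float((bf > 3).mean()),
        }
    return out


if __name__ == "__main__":
    for name, rates in simulate().items():
        print(name, rates)
```

With five independent measures, the share of “significant” results under selective reporting should be close to 1 - 0.95^5, i.e. about 23% instead of 5%, and the share of Bayes factors above the cut-off rises as well, although from a lower baseline in this setup.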
A common argument in favour of Bayes factors is that the value of the Bayes factor can still be interpreted in a valid way even under QRPs: as the relative evidence in favour of one model when compared to another model. I cannot refute this, but it is also true for p-values (and for interval and point estimates). The p-value still measures the inconsistency of the data with the null hypothesis of random sampling from a central distribution (in the usual case).
The “if not taken into account” in footnote 3 above alludes to the fact that some of what is considered a questionable research practice can be taken into account when setting up a statistical procedure:
- Data peeking and sequential tests can be considered in both significance tests (Lakens, 2014) and Bayes factors (Schönbrodt et al., 2017; but also see Sanborn & Hills, 2014).
- Removing outliers or unexpected results can be incorporated in a statistical model in order to minimise their effect on inferences, e.g. by imputing only single values instead of removing whole cases. (It would actually be better not to remove extreme responses or outliers at all, but to use regularisation, partial pooling or cross-validation to make inferences more robust.)
- If the data were sampled non-randomly, a statistical model might be able to account for that (e.g. through post-stratification or weighting procedures). However, the theoretical conclusions cannot be as strong as for random samples. Further, one needs large enough samples across all sub-groups to use post-stratification or weighting properly: convenience samples of some undergraduate students cannot be “corrected” to become representative samples of the general population.
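The last point can be sketched in a few lines. This is a toy example with made-up numbers: two hypothetical strata (“students” and “others”), assumed known population shares, and a convenience sample that heavily over-represents students. The function `poststratify` is my own illustrative helper, not a library routine.

```python
import numpy as np


def poststratify(y, stratum, pop_share):
    """Weight each stratum's sample mean by its known population share."""
    estimate = 0.0
    for s, share in pop_share.items():
        cell = y[stratum == s]
        if cell.size == 0:
            # An empty cell cannot be "corrected" by any weight.
            raise ValueError(f"no observations in stratum {s!r}")
        estimate += share * cell.mean()
    return estimate


def demo(seed=7):
    rng = np.random.default_rng(seed)
    pop_share = {"students": 0.1, "others": 0.9}  # assumed to be known
    true_mean = {"students": 1.0, "others": 0.0}  # hypothetical effects
    # Convenience sample: 180 students but only 20 others.
    stratum = np.array(["students"] * 180 + ["others"] * 20)
    noise = rng.standard_normal(stratum.size)
    y = np.array([true_mean[s] for s in stratum]) + noise
    pop_mean = sum(pop_share[s] * true_mean[s] for s in pop_share)
    return y.mean(), poststratify(y, stratum, pop_share), pop_mean


if __name__ == "__main__":
    raw, adjusted, target = demo()
    print(f"raw={raw:.2f} adjusted={adjusted:.2f} population={target:.2f}")
```

The adjusted estimate lands much closer to the population mean than the raw sample mean, but it leans almost entirely on the 20 non-students, so it is noisy. With no observations at all in a stratum, the correction is simply impossible.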
That is to say, what is considered a “questionable research practice” is not questionable per se. It becomes questionable when it is done without being reported transparently (as is the case when we talk about QRPs) and when researchers engage in it in order to achieve statistical significance, a Bayes factor larger than some arbitrary boundary, or an effect of interesting size. And then the statistical procedure hardly matters, as any inference will be off.
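Data peeking is a good example of this difference. The sketch below simulates studies in which the null hypothesis is true and the researcher tests after every batch of new participants. It assumes a z-test with known unit variance, and for the planned version it uses a plain Bonferroni split of alpha across the looks. That is a deliberately simple and conservative stand-in of my own choosing; the sequential procedures discussed by Lakens (2014) use less conservative boundaries.

```python
from statistics import NormalDist

import numpy as np


def false_positive_rate(alpha_per_look, n_looks=5, batch=20,
                        n_sims=4000, seed=3):
    """Share of null studies declared significant at any of the looks."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_sims, n_looks * batch))  # H0 is true
    ns = batch * np.arange(1, n_looks + 1)   # sample size at each look
    sums = data.cumsum(axis=1)[:, ns - 1]    # running sums at each look
    z = sums / np.sqrt(ns)                   # z statistic at each look
    crit = NormalDist().inv_cdf(1 - alpha_per_look / 2)  # two-sided cut-off
    return float((np.abs(z) > crit).any(axis=1).mean())


if __name__ == "__main__":
    looks = 5
    print("undisclosed peeking:", false_positive_rate(0.05, n_looks=looks))
    print("planned, corrected: ", false_positive_rate(0.05 / looks, n_looks=looks))
```

Testing at .05 on every look pushes the overall false-positive rate well above the nominal 5%, while the planned and corrected version stays below it. The data and the stopping behaviour are identical in both runs; only whether the looks were accounted for differs.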
The relevant and interesting question we might want to ask is: What statistical tools should we use to guard ourselves against the effects of questionable research practices? Or, put another way: Can we find statistical tools that are more robust against such practices?
And one can indeed make a case for Bayes factors here, as well as for Bayesian multilevel modelling, which regularises estimates and makes use of partial pooling, or for focusing on predictive accuracy by means of cross-validation. But this does not change the fact that the statistical outcome is biased by any questionable research practice used.
We need to prevent questionable research practices and find a way to make the “garden of forking paths” (Gelman & Loken, 2013) transparent. Pre-registration and registered reports are, in my view, the two most effective policies to tackle the problems arising from questionable research practices and HARKing (hypothesising after the results are known) in particular. They still allow researchers to propose strange theories and inadequate research designs, but they make the common post-hoc practices more visible. And researchers can still report exploratory results and propose explanations for an observed pattern; every reader will simply be made aware of what is confirmatory and what is exploratory.
Finally, I want to stress again that the answers to statistical questions do not map bijectively onto theoretical questions.[5] That means we cannot use the outcome of a statistical procedure directly to evaluate a theory or answer a real-world question. Bayes factors and p-values ask two very different statistical questions, and each can provide an answer to its own question. This answer, however, has to be translated into an informed judgement about the question we actually ask (not in the realm of statistics, but in the realm of theory or practical implications). In order to do this, we as researchers or users of statistics need to consider implicit and explicit auxiliary hypotheses, the assumptions involved, and a critical evaluation of the method (e.g. experimental design, statistical model used, …).
Statistics is not a magical tool.
- Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Retrieved from http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
- Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
- Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301. http://doi.org/10.1037/1082-989X.5.2.241
- Sanborn, A. N., & Hills, T. T. (2014). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review, 21(2), 283–300. http://doi.org/10.3758/s13423-013-0518-9
- Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2), 322–339. http://doi.org/10.1037/met0000061
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. http://doi.org/10.1177/0956797611417632
1. There are several articles debating the pros and cons. For a very balanced and nuanced overview, see Nickerson (2000).
2. The workshop “Reflections on Replication” at Utrecht University last week motivated me to write this post, and I made a comment along these lines after Christopher Green’s introductory keynote. The issue came up on Twitter recently in a debate between Joachim Vandekerckhove and Tal Yarkoni, and Valentin Amrhein wrote a blog post on a similar note.
3. If not taken into account when setting up a statistical model or test, that is.
4. The paper “False-Positive Psychology” is famous in its own right (Simmons et al., 2011). Uri Simonsohn showed the effect on Bayes factors through a series of simulations in a blog post at Data Colada and a related manuscript.
5. At least as long as we do not reduce it to “theories are statistical models and statistical models are theories”.