The perverse incentives that stand as a roadblock to scientific reform

The efforts in psychology to improve the believability of our science can be boiled down to simple and easy changes to our standard research practices.  As a field we should:

    1. Provide more information with each paper, such as the study materials, hypotheses, data, and analysis syntax, so others can double-check our work (through the OSF or journal reporting practices).
    2. Design our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
    3. Provide more information about effect sizes in each report, namely the effect size for each analysis and its confidence interval (a brief sketch follows this list).
    4. Reward the production and publication of direct replications.
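To make recommendation 3 concrete, here is a minimal sketch of reporting a standardized effect size with its confidence interval. This is my own illustration, not from the original piece: it uses Cohen's d with a common large-sample approximation for the standard error, and the group data are simulated placeholders.

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(group1, group2, confidence=0.95):
    """Cohen's d for two independent groups, with an approximate CI based on the
    large-sample standard error sqrt((n1+n2)/(n1*n2) + d^2/(2*(n1+n2)))."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                         (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(group1) - np.mean(group2)) / pooled_sd
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return d, (d - z * se, d + z * se)

# Simulated placeholder data: a treatment group with a modest shift
rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.4, scale=1.0, size=60)
control = rng.normal(loc=0.0, scale=1.0, size=60)

d, (lo, hi) = cohens_d_with_ci(treatment, control)
print(f"Cohen's d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```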


When confronted with these recommendations, it seems many researchers balk. This is surprising to me because many of them seem quite mundane and easy to implement. Why would researchers choose not to embrace these recommendations as a means to improve the quality of their work?

I believe the reason for the passion behind the protests is that the proposed changes undermine the incentive structure on which the field is built. What is that incentive structure? 

In my opinion our incentive system rewards four qualities: 1) finding p-values less than .05, 2) running small-N conceptual replications, 3) discovering counter-intuitive findings, and 4) producing a clean narrative.

P < .05.

The first, seemingly most valued component of psychological science is that your findings must be “statistically significant,” indicated concretely by results where the probability of obtaining data at least as extreme as yours, assuming the null hypothesis is true, is less than 5%. Researchers must attain a p-value less than .05 to be a success in psychological science. If your p-value is greater than .05, you have no finding and nothing to say, because your work will not be published or discussed in the press, nor will it net you a TED talk.

Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find the desired small p-value. We root around in our data, digging up p-values by cherry-picking studies, selectively reporting outcomes, or resorting to some arcane statistical modeling. It is clear from reviews of psychological science that we not only value p-values less than .05 but also have been remarkably successful in limiting the publication of alternative p-values. In our published literature, psychology confirms 95% of its hypotheses (Fanelli, 2012).
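As a rough illustration of why digging for p-values works so reliably, the following simulation is my own sketch, not from the original piece. It assumes a researcher measures several outcomes that are all truly null and calls the study a success if any one of them crosses p < .05; the number of outcomes and the cell size are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 10_000      # number of simulated "studies"
n_per_group = 30     # arbitrary small cell size (assumption)
n_outcomes = 5       # outcomes measured per study, all truly null (assumption)

significant_studies = 0
for _ in range(n_sims):
    p_values = []
    for _ in range(n_outcomes):
        control = rng.normal(size=n_per_group)    # no true effect
        treatment = rng.normal(size=n_per_group)  # no true effect
        p_values.append(stats.ttest_ind(treatment, control).pvalue)
    # Count the study as a "success" if any outcome crosses p < .05
    if min(p_values) < 0.05:
        significant_studies += 1

print("Nominal false-positive rate: 0.05")
print(f"Rate of studies reporting a 'significant' result: {significant_studies / n_sims:.3f}")
# With 5 independent null outcomes, roughly 1 - 0.95**5 (about 0.23) of studies
# can report at least one p < .05 effect.
```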

Even worse, we punish those who try to publish null effects by considering them “second-stringers” or incompetent, especially if they fail to replicate already published (and therefore, by default, statistically significant) effects. Of course, if you successfully emerge from your graduate training still possessing the view that the ideal of psychological science is the pursuit of truth, maybe you deserve to be punished. The successful, eminent scientists in our field know better. They know that “the game” is not to pursue truth but to produce something with p < .05. If you don’t figure that out early, you are destined to be unsuccessful, because the people in control of resources are the ones who’ve succeeded at the game as it is played now.

Small N, Conceptual Replications

Under the right circumstances, conceptual replications are an excellent device in the researcher’s tool kit. The problem, of course, is that the “right circumstances” are those in which an effect is reproducible—as in directly replicable. In the absence of evidence that an effect can be directly replicated, conceptual replications might as well be a billboard screaming that the effect cannot be directly reproduced and that the author was left sifting through either multiple studies or multiple outcomes across studies to find a statistically significant effect.

And, seemingly for many good reasons, the ideal conceptual replication appears to be a small-N replication. Despite decades of smart methodologists pointing out that our research is poorly designed to detect such amazingly subtle things as between-subjects 2x2 interaction effects, researchers continue to plug away at sample sizes well south of 100 when they should be using samples in excess of 400 (Cohen, 1990; Simonsohn, 2014).
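To show the arithmetic behind that sample-size claim, here is a minimal power sketch of my own, not from the original post. It uses the standard normal approximation for a two-sided, two-group comparison and assumes an illustrative effect size of d = 0.4; the comment about 2x2 interactions reflects the familiar result that fully attenuated between-subjects interactions need roughly twice as many participants per cell as the original simple effect.

```python
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality of the test statistic
    return stats.norm.cdf(ncp - z_alpha) + stats.norm.cdf(-ncp - z_alpha)

def n_for_power(d, alpha=0.05, power=0.80):
    """Per-group N for the target power: n ~ 2 * ((z_{1-a/2} + z_power) / d)^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

d = 0.4  # illustrative small-to-medium effect size (assumption, not from the post)
print(f"Power with 50 per group: {power_two_sample(d, 50):.2f}")   # roughly 0.51
print(f"Per-group N for 80% power: {n_for_power(d):.0f}")          # roughly 98, i.e. ~200 total
# A between-subjects 2x2 design testing whether this effect is fully attenuated
# needs about twice as many participants per cell, pushing the total well past 400.
```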
