The year the journals changed

Where do all the studies come from? Behind every headline trumpeting a new finding in psychology, you can usually find an article in a peer-reviewed psychology journal. But how reliable are these findings? That is what many scientists have recently started to wonder, and in response, journals in psychology are starting to insist on better reporting of research studies. In this first post of a two-part series, I will explain some of the standards that have typically been used to judge whether a study deserves publication.

When success depends on meeting rigid standards, the danger is that people will focus on satisfying those standards at the expense of the wider goal of quality. This is like schools “teaching to the test” at the expense of giving their students a more rounded education. In psychology research, it is the significance test that has come under fire for this reason.

A matter of significance

First, a little background. Most research in psychology takes a sample of people and, from their answers or behavior, tries to figure out what is likely to be going on in the larger population. But there’s always the chance that you’ve chosen a sample of people whose answers don’t reflect how everyone else would answer.

So imagine you ask 100 people whether they prefer goldfish or hamsters as pets and you get 75% preferring hamsters. Now imagine a weird population whose preferences are the opposite of what your sample suggests: in fact, the majority of people prefer goldfish. It’s possible that you just happened to sample a whole bunch of hamster-lovers by chance, and your sample doesn’t represent the goldfish-loving majority. Statistics gives us a way to estimate how likely it is that you’d get this deceptive result in a random sample. The technique of pinning a percentage number on that risk is known as null-hypothesis significance testing.

Now, let’s say that a significance test gives you a value of p = .23 (23%). This means that a weird population with results opposite to your sample would produce a result at least as extreme as yours 23% of the time. That makes it hard to conclude that your sample tells you anything definite about the population. If your p value equals .001 (1/10th of a percent), that gives you more confidence: only once in a thousand times would an opposite-looking population produce the results you got.
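
For readers who want to see where such numbers come from, here is a minimal sketch in Python of a significance test for the hypothetical pet survey above. It tests the sample against the conventional null hypothesis of no overall preference (a 50/50 split), a slightly simpler setup than the “opposite majority” framing, and the counts are made up for illustration.

# A minimal sketch of a null-hypothesis significance test for the
# hypothetical pet survey, tested against a 50/50 "no preference" null.
# Requires SciPy 1.7+ for binomtest.
from scipy.stats import binomtest

# 75 of 100 respondents preferred hamsters (made-up numbers).
big_sample = binomtest(k=75, n=100, p=0.5, alternative="greater")
print(f"p = {big_sample.pvalue:.7f}")    # tiny: hard to blame chance alone

# A smaller, less lopsided sample: 6 of 10 prefer hamsters.
small_sample = binomtest(k=6, n=10, p=0.5, alternative="greater")
print(f"p = {small_sample.pvalue:.2f}")  # about .38: could easily be chance

The exact numbers matter less than the logic: the p value answers the question “how often would blind chance hand me a sample this lopsided?”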

In psychology, the statistical standard typically used to accept a finding as true is a significance level of p < .05. This means that if, in reality, the researchers’ idea was wrong, there should be less than a 5% chance of getting the finding that was actually found. But that chance is never zero, so some findings accepted as true won’t reflect reality. Statisticians call this kind of finding a false positive. Even with the best research practices, a study testing an idea that is actually wrong can still come out significant up to 5% of the time.

This risk goes down, a lot, if researchers run two or three or four studies and find the same kind of significant result in each of them. But even after running multiple studies, it is still possible to get things wrong.
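
To put rough numbers on that intuition, here is a back-of-the-envelope sketch. It assumes the studies are independent and each one is run honestly at the 5% level; real series of studies rarely meet those assumptions perfectly, so treat the figures as illustrations.

# Chance that EVERY study in a series comes out significant by luck alone,
# assuming independent studies each run at the 5% level on a wrong idea.
alpha = 0.05
for n_studies in range(1, 5):
    print(f"{n_studies} significant studies by luck: {alpha ** n_studies:.8f}")
# 1 -> 0.05, 2 -> 0.0025, 3 -> 0.000125, 4 -> 0.00000625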

Beauty vs. truth

First off, there might be some flaw in the methodology that makes the researchers’ conclusions invalid. For example, you might want to show the benefits of pet ownership in an experiment where you put some people in a room with a warm puppy for five minutes, and others in a room alone. If the puppy people score better on an intelligence test than the empty-roomers, though, it doesn’t mean there is anything special about the puppies. Instead, it could just be that the boring experience of an empty room made the other group do worse than they would have. Journal editors and reviewers are usually quick to point out these kinds of flaws.

But another, less visible threat to accuracy comes from the fact that there are many different ways to look at a study’s data. Until recently, some researchers found it acceptable to pick and choose. If they had three outcomes that measured prejudice, for example, and their intervention only reduced prejudice on two of them, some thought it was OK to report just those two. What’s more, reviewers and editors would sometimes lend a hand, telling authors to “streamline” their results by leaving out measures and experimental conditions that didn’t work, or to get their results to an acceptable level of significance by running a few more participants.

The problem? This leads to only the best-looking results being presented, overstating the strength of the evidence. Think about what would happen if people entered an archery contest but were allowed to send in an edited video of their best shots ever, rather than an honest video of a single session of archery. That contest wouldn’t really be able to distinguish between people who were really good at archery and people who just tried and tried until they had enough impressive-looking shots.

In fact, one critique estimated that if an author is allowed to do a selective analysis using all the commonly accepted techniques, the data can be made to squeak out “p < .05!” up to 60% of the time even when no real effect exists (Simmons, Nelson & Simonsohn, 2011). More generally, presenting analyses without telling the reader what you did to get there looks bad by common-sense standards of honesty, and it is bad for the accuracy of science.
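
To make the inflation concrete, here is a minimal simulation of just one of those selective techniques: measuring three outcomes and reporting whichever one comes out significant. It uses simplified assumptions (independent, normally distributed outcomes and a two-group t test), and it is not a reproduction of the Simmons et al. analysis, which combined several techniques to reach the 60% figure.

# Simulate experiments where there is NO real effect, but the researcher
# measures three outcomes and counts the study a "success" if any one of
# them reaches p < .05. Every success here is a false positive.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_per_group, n_outcomes = 5000, 20, 3

false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(size=(n_per_group, n_outcomes))    # same population...
    treatment = rng.normal(size=(n_per_group, n_outcomes))  # ...no true difference
    pvalues = [ttest_ind(treatment[:, j], control[:, j]).pvalue
               for j in range(n_outcomes)]
    if min(pvalues) < 0.05:  # report only the "best" outcome
        false_positives += 1

print(f"False-positive rate: {false_positives / n_experiments:.1%}")
# Roughly 14% instead of the nominal 5% -- and that is just one trick.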

In some ways, the false-positive problem I have presented here is worse than the problem of outright scientific fraud. Dishonest researchers who simply made up data knew what they were doing. Most scientists take pains to avoid that kind of fraud. But what happens when an entire field – including the gatekeepers at the journals – convinces itself that science is well served by a selective, pretty story instead of an exhaustive, complete one (Giner-Sorolla, 2012)? Then it’s possible for even well-meaning people to do things that increase the chance of false positives, in order to get published.

But there is hope. In the next post I’ll review some of the changes in journals’ standards, starting this year, to help make sure that research reports are complete and honest.

References:

Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspectives on Psychological Science, 7, 562–571.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.