Reviewing Pham’s “feeling the future” paper to illustrate how easy it is to unintentionally p-hack

Context: Michel Pham has responded to Krefeld-Schwalb and Scheibehenne’s (2022) “Tighter Nets for Smaller Fishes” paper in a way that I believe fails to acknowledge how widespread p-hacking is and how easy it is to fall into that trap. In response, I’ve attempted to show how he may have p-hacked his own 2012 paper, which falsely claims that students can be made to predict stock market performance and the weather through the use of a silly prime. My hope is that this will help Pham and others see that the problem is widespread and easy to commit, and that addressing it should be priority #1, 2 and 3. We shouldn’t care about relevance, impact, generalizability or anything else if the finding is false due to p-hacking.

Pham’s paper

Many people don’t know that the Journal of Consumer Research published its own “Feeling the Future” paper, one that cited the infamous Bem (2011) precognition paper (also titled “Feeling the Future”) numerous times, uncritically and unironically. The paper has an entire section dedicated to Bem (2011) in its conceptual development. In “Feeling the Future: The Emotional Oracle Effect,” Pham, Lee and Stephen (2012) claimed, and presented evidence, that a simple prime could make people much better at predicting future stock market performance and the weather, among other things.

What was this prime?

They just asked people to describe “situations in which you trusted your feelings to make a judgment or a decision and it was the right thing to do.” They primed higher “trust-in-feelings” by asking them to think of two times, and lower by asking them to think of 10 times. The idea is that people in the 10x condition wouldn’t be able to think of 10 times they trusted in their feelings so they would lose confidence. It’s unclear to me how you get from this confidence prime to actual precognition but I’m sure it’s in that intro somewhere for those willing to dig.

First of all, let’s reason our way through this with a series of questions:

  1. Do marketing professors have the ability to manipulate people into more accurately predicting stock market performance? No. If they did, they’d surely be rich.
  2. Has any priming effect ever been replicated by a neutral 3rd party? No. Never. I discussed a lot more details about the problems with priming in the 2022 Marketing Metascience Year in Review. See that for more information on why priming studies can’t be trusted.
  3. Have other studies linked confidence with predicting the future? No, I couldn’t find any other examples.

Given the negative answers to the questions above, we can assume that the findings are probably false and work backwards to figure out what went wrong. So how did the researchers do it? There are four main ways to obtain false results: fraud, errors, confounds and p-hacking. Errors and confounds do happen, but they are unlikely to produce a whole eight-study paper with these kinds of results, so it makes more sense to look specifically at p-hacking and fraud. The first step is to check for p-hacking; I would only look for fraud if I couldn’t find evidence of p-hacking.

Luckily, p-hacking can often be detected using readily available forensic techniques such as p-curve. First, let’s look at the paper itself to see how much room it leaves for extra “researcher degrees of freedom.” In 2012 and before, preregistration wasn’t really a thing, so it’s not surprising that the paper isn’t preregistered, but the lack of preregistration leaves open A LOT of room for p-hacking.

Potential researcher degrees of freedom

Although p-hacking activities are typically not disclosed, I will outline some potential researcher degrees of freedom I’ve identified in the paper. These are not meant as accusations, just attempts to forensically determine how the researchers obtained this (likely) false result using the limited reporting that was, and is, customary in consumer research. The list that follows identifies some potential p-hacking strategies based on the described studies. These techniques, if used, can massively inflate Type I error rates.

  1. Possible iterative data collection. The sample sizes are not round numbers, vary greatly from study to study and don’t seem to follow any logic. This indicates possible iterative data collection (collect, analyze, collect, analyze, etc.). In one case, the authors indicate that the data for a study was collected in two different periods (iterative).
  2. Swapping DVs. In some cases the DV is manipulated through a prime. In other cases it is measured through something that looks a heck of a lot like a manipulation check repurposed as a DV.
  3. Covariates and filters. In some studies people are filtered out based on certain criteria, while in others a potential filtering criterion is used as a covariate instead. Collecting a few such measures gives extensive researcher degrees of freedom. These are used inconsistently: not present in all studies and never used the same way twice.
  4. Competing hypotheses. Framing two opposite outcomes as competing hypotheses means either result counts as a success, which again inflates Type I error rates.
  5. Inconsistent conditions. Sometimes the Google conditions are used, sometimes not. They are also pooled together when they failed to yield an interesting effect. I’d not be surprised if some conditions were dropped just because of the inconsistency that I’m seeing.

Again, this evaluation of potential researcher degrees of freedom is not meant to be accusatory; I’m just trying to figure out what might have gone wrong. Keep in mind that inferential statistics are meant to help you test an idea, not find one. The more studies and tests you run, the greater the potential for Type I error.
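To see how badly just one of these degrees of freedom inflates the error rate, consider iterative data collection (item 1 above). The simulation below is a generic illustration, not anything from the paper: both groups are drawn from the same distribution, so every “significant” result is a false positive by construction. A researcher who peeks after every batch and stops at p < .05 roughly doubles or triples the nominal 5% error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def optional_stopping_fpr(n_sims=2000, batch=20, max_peeks=5, alpha=0.05):
    """Estimate the false-positive rate when data are collected in batches
    and analysis stops as soon as p < alpha -- even though both groups are
    drawn from the SAME distribution (the null is true by construction)."""
    false_positives = 0
    for _ in range(n_sims):
        a, b = np.empty(0), np.empty(0)
        for _ in range(max_peeks):
            a = np.append(a, rng.standard_normal(batch))
            b = np.append(b, rng.standard_normal(batch))
            if stats.ttest_ind(a, b).pvalue < alpha:
                false_positives += 1
                break  # "significant" result found; stop collecting
    return false_positives / n_sims

print(optional_stopping_fpr())  # well above the nominal 5% rate
```

The batch size and number of peeks here are arbitrary; more peeks only make the inflation worse.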

Looking at the numbers for evidence of p-hacking

P-hacking is usually detected forensically by looking at the p-values of the core hypothesis tests. A set of true effects will mostly land at p < .001, while p-hacked effects like to hang out between .01 and .05. Below I’ve collected the test statistics for the core hypothesis test in each study. Notice that 7 out of 8 significant and marginally significant p-values are higher than .01.

Just from a quick eyeball of this table, Study 8 seems fairly unlikely to be p-hacked, but Studies 1 through 7 all map to p-hacking very well. That doesn’t mean Study 8 is a true effect, though. One low p-value in a set this large (and, in this case, an unbelievable effect) doesn’t indicate that the effect is real. The remaining explanations are confounds, errors or fraud; there is technically a small chance it’s a Type I error, but that’s unlikely, because p-values are uniformly distributed between 0 and 1 under the null, so a Type I error that close to p = 0 is improbable by chance. I won’t be able to determine what happened in Study 8, but overall this set of studies maps very well to p-hacking. When just one effect in a set doesn’t look p-hacked, a confound in how the experiment was administered is the most likely explanation.

| Study | Prediction context | Sample size | Effect size (Cohen’s d) | High TIF mean | Low TIF mean | Test statistic | p-value |
|---|---|---|---|---|---|---|---|
| 1 | Democratic nomination (2008) | 229 | 0.24 | 71.90% | 63.90% | Z = 1.84 | 0.06577 |
| 2 | Movie box office | 171 | 0.57 | 47.50% | 24.40% | chi2(1) = 4.14 | 0.04188 |
| 3 | American Idol (2009) | 104 | 0.30 | | | chi2(1) = 3.96 | 0.04659 |
| 4 | Dow Jones Index | 135 | 0.36 | | | F(1,132) = 4.90 | 0.02857 |
| 5 | NCAA football champs | 306 | 0.30 | | | chi2(1) = 3.71 | 0.05409 |
| 6 | Weather | 52 | 0.60 | 47.10% | 27.80% | Z = 2.51 | 0.01207 |
| 7 | Weather | 116 | 0.28 | 35.50% | 17.10% | chi2(1) = 3.22 | 0.07274 |
| 8 | Weather | 175 | 0.48 | 53.90% | 21.40% | chi2(1) = 9.66 | 0.00188 |

Pham, Lee and Stephen (2012)
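If you want to check my arithmetic, the p-values in the table can be recovered from the reported test statistics with scipy (two-tailed for the Z tests; a chi-square with 1 df is just a squared Z). A spot check on a few rows:

```python
from scipy import stats

# Test statistics taken from the table above
print(2 * stats.norm.sf(1.84))           # Study 1: Z = 1.84       -> ~0.0658
print(stats.chi2.sf(4.14, df=1))         # Study 2: chi2(1) = 4.14 -> ~0.0419
print(stats.f.sf(4.90, dfn=1, dfd=132))  # Study 4: F(1,132) = 4.90 -> ~0.0286
print(stats.chi2.sf(9.66, df=1))         # Study 8: chi2(1) = 9.66 -> ~0.0019
```

The survival function (`sf`) is the upper-tail probability, which is exactly what a chi-square or F test reports.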

Rather than just eyeballing it, we can formalize the analysis by running the p-values through p-curve. P-curve is a tool that compares the distribution of the reported significant p-values against two benchmarks: the flat distribution expected if there is no effect, and the right-skewed distribution expected for a true effect studied with 33% power.

In the case of this data, the three marginal p-values (p > .05) are dropped from the p-curve, which actually understates the problem of p-hacking substantially. But even without the marginal ones, the p-values that remain don’t look good at all. I wouldn’t call this a clear case of p-hacking only because p-hacking is forensically indiscernible from related problems such as low power, publication bias and HARKing (Hypothesizing After Results are Known, which is basically p-hacking). We don’t know for sure that it’s p-hacking, but we do know the set lacks evidential value: the observed p-curve is flat (uniform), whereas the p-curve of a true effect should be right-skewed.
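A rough back-of-the-envelope version of this logic (the actual p-curve app uses Stouffer’s method on pp-values, not this simplified binomial test) just asks whether the significant p-values pile up below .025, as a true effect would make them do:

```python
from scipy.stats import binomtest

# The five significant (p < .05) p-values from the table above
p_sig = [0.04188, 0.04659, 0.02857, 0.01207, 0.00188]

# Under a flat (null) p-curve, each significant p is equally likely to
# fall below or above .025; true effects push most of them below .025.
n_low = sum(p < 0.025 for p in p_sig)
result = binomtest(n_low, n=len(p_sig), p=0.5, alternative="greater")
print(n_low, result.pvalue)  # 2 of 5 below .025 -> no sign of right skew
```

With only 2 of 5 significant p-values below .025, there is no hint of the right skew that a true effect produces.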

The takeaway

What do we learn from this? Just as in Bem (2011), too many researcher degrees of freedom can lead to false results, even in outrageous studies like this one. Just because a result is outrageous doesn’t mean it’s fraud: there is no effect that can’t be obtained through p-hacking. Even precognition can be falsely “proven” when scientists misuse inferential statistics. It’s also a cautionary tale for early-career researchers because it shows how easy it is to fool yourself. I assume the researchers had good intentions and were trying to do good research; they probably had no intention of fudging the numbers. But if you make analysis decisions based on the data and run studies and analyses until p < .05, you will just be capturing noise. A p-value is a very limited tool: it can only be used to test an idea, not to find one. You must carefully preplan all studies and tests and stick to your plan, and any deviation must be declared so that readers can judge for themselves how tight the hypothesis test actually is. The more tests you run, the more you’re fooling yourself.
