This was cross-posted on Dan Kahan’s blog.
Ok. At this point, I think most people know that replications are important and necessary for science to proceed. This is what tells us if a finding is robust to different samples, different lab groups, and minor differences in procedure. If a finding is found but never replicated, is it really a finding? Most working scientists would say no (I hope).
But not all replications are created equal. What makes a convincing replication? A few years ago, with a lot of help from collaborators, we sat down to figure it out (at least for now; see the open access paper). A convincing replication is rigorously conducted by independent researchers, but there are also five other ingredients.
- Carefully defining the effects and methods that the researcher intends to replicate: If you don’t know what effect you are exactly trying to replicate, it is difficult to carefully plan the study and evaluate the replication attempt. This ingredient determines nearly all that follow.
- Following as exactly as possible the methods of the original study (including participant recruitment, instructions, stimuli, measures, procedures, and analyses): The closer the replication is to the original attempt, the easier it is to infer if the original finding is confirmed (or not). Although replications that are less close or even just conceptually similar help establish the generalizability of an effect (see this nice paper), the differences make it impossible to tell if differences in results are due to the instability of the underlying effect or to differences in the design.
- Having high statistical power: Statistical power is basically an indicator of whether your study has a chance of detecting the effect you plan to study. Statisticians will give you more precise definitions and some branches of statistics (e.g., Bayesian) don’t really have the concept. Putting these things aside, the general idea is that you should be able to collect enough data to have precise enough estimates to make strong conclusions about the effect you’re interested in. In most of the domains I work in, power is most easily increased by including more people in the sample; however, it’s also possible to increase power by increasing the number of observations in other ways (e.g., using a within-subjects design with multiple observations per person). The best way to ensure high statistical power in a replication will depend on the precise design of the original study.
- Making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves): To best evaluate whether a replication is a close replication attempt, it is useful to make all of the details available for external evaluation. This transparency can illuminate potential problems with either the replication attempt or the original study (or both). It is also beneficial to pre-register the replication study, including the criteria that will be used to evaluate the replication attempt.
- Evaluating replication results, and comparing them critically to the results of the original study: Don’t just put the results out there. Interpret them too! How are the results similar to the original study and how are they different? Are they statistically similar or different? And what could possibly explain the differences? How to evaluate replication results has become its own industry, with a lot of food for thought (see this paper).
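To make the power ingredient a bit more concrete, here is a minimal sketch of how power grows with sample size for a simple two-group comparison. It uses a normal approximation rather than an exact t-test calculation, and the effect size (d = 0.5), the sample sizes, and the function name `two_group_power` are all illustrative assumptions, not values from any study discussed here.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    d is the assumed standardized effect size (a hypothetical input),
    n_per_group is the number of participants in each group.
    Uses a normal approximation, so it is slightly optimistic for small n.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    shift = d * sqrt(n_per_group / 2)   # expected test statistic under the assumed effect
    # Probability of landing beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative sample sizes only
for n in (20, 64, 200):
    print(n, round(two_group_power(0.5, n), 2))
```

The point is only that, for a fixed effect, more people per group means a better chance of detecting it. A real power analysis for a replication should be based on the original study's design and effect size.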
This is all fine, you might say. But how does this work in practice? Well, for one thing we’ve developed a form to help people plan and pre-register replications. It’s available in our paper, it’s available here (and in French!), and it’s built into the Open Science Framework. It’s also useful to examine how it doesn’t work in practice.
Here we turn to a paper that Ballarini and Sloman (B&S) presented at the meeting of the Cognitive Science Society (paper is here). B&S were testing out a debiasing strategy and in that context state that they “failed to replicate Kahan et al.’s ‘motivated numeracy effect’.” To evaluate this claim we need to know what the motivated numeracy effect is and if the B&S study is a convincing replication of it.
A quick summary of the original Kahan et al. paper (paper is here): a large, representative sample of Americans evaluated a math problem incorrectly when it conflicted with their prior beliefs, and this was the case primarily for people high in numeracy (the people who are good at math). The design is entirely between subjects, with participants completing a scale of political beliefs, a numeracy scale, and a word problem that did or did not conflict with their beliefs. There is more to the paper; go read it.
B&S wanted to see how they could debias people within the context of the Kahan paradigm by presenting people with competing interpretations of the data in the math problem. They found that highly numerate people were more likely to adjust their interpretation based on this competing information. This is interesting. They also did not find any evidence that highly numerate people are more likely to misinterpret a belief-contradicting math problem.
It is important to state that this study was conducted by independent scholars and appears to be conducted rigorously. This is a step in the right direction as it provides evidence relevant to the motivated numeracy effect that is independent of the Kahan et al group. But did they fail to replicate?
It is actually hard to say. The first problem is that B&S used a within-subjects paradigm where participants repeatedly received math problems of the sorts used by Kahan (and a few other types). This is different from the between-subjects design of the original study and so is a problem with Ingredient #2. Although within- and between-subjects designs can tap into similar processes, it is up to the replication authors to show that this procedural change does not affect the psychological processes at work.
But I do not think this is the biggest problem; if it’s powerful then the motivated numeracy effect should be able to overcome some of these design changes.
The second and more consequential problem is that whereas the original study used a very large sample (N = 1111) representative of Americans, B&S used a small sample (N = 66) of students (that is further reduced for procedural reasons). This smaller sample of students makes it less likely that they will have participants with diverse political views (1% were conservative) and a range of numeracy scores. In designs with measured predictors it is necessary to have adequate range or else there won’t be enough people who are truly low numerate or conservative to test hypotheses about these subpopulations.
The small sample size also makes it impossible to confidently estimate the size and the direction of these effects (a problem with Ingredient #3). B&S point to the within-subjects part of their design as evidence of its statistical power, but that part of the design does not address the low power for the between-subjects part of the design. That is, although they might have the necessary power to detect differences between the math problems (the within part of the design), they do not have enough people to make strong inferences about the between part of the design (numeracy and politics).
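As a rough illustration of what the difference in sample size means for precision, the sketch below compares the 95% margin of error for a simple proportion at the two sample sizes. This is a simplification under stated assumptions: it treats the outcome as a single proportion of correct answers with the worst-case value p = 0.5, it ignores that the B&S sample was further reduced, and it ignores that the motivated numeracy effect is an interaction between measured predictors, which is harder still to estimate. The function name `ci_half_width` is made up for this example.

```python
from math import sqrt


def ci_half_width(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n people.

    p = 0.5 is an assumed worst case, not a value from either study.
    """
    return z * sqrt(p * (1 - p) / n)


original = ci_half_width(1111)    # sample size of the original study
replication = ci_half_width(66)   # B&S sample size before exclusions

print(f"N = 1111: +/- {original:.3f}")
print(f"N = 66:   +/- {replication:.3f}")
print(f"ratio:    {replication / original:.1f}")
```

Because the margin of error shrinks with the square root of the sample size, the smaller sample gives estimates that are roughly four times less precise, even before looking at the narrow range of political views and numeracy scores.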
So, at the end of this, what does the B&S study tell us about the motivated numeracy effect? Not much. The sample isn’t big enough or diverse enough for these research questions (and the difference in design is an additional complication). If B&S are just interested in the debiasing aspect, then I think that these data are useful, but they should not be framed as a replication of Kahan et al.; the study is not set up to convincingly replicate the motivated numeracy effect. To their credit, B&S are more circumspect in interpreting the replication aspect of their study in the discussion (in contrast to their summary in the abstract). Hopefully most readers will go beyond the abstract…
Why do I care and why should you? Replications are important, but poor replications, just like poor original studies, pollute the literature. I don’t want to discourage people from replicating Kahan et al.’s work, but when it is replicated it is important for researchers to carefully recreate the conditions of the original study so that we can be confident in the evidence the replication provides. A representative sample of Americans is expensive, but there are other ways of recruiting participants with diverse political backgrounds (e.g., collecting data from other university campuses). We need a literature of high quality studies so that we can make informed theoretical and practical decisions. Without this it will be difficult to know where to begin.