To the child, Nature gives various means of rectifying any mistakes he may commit... In the study and practice of the sciences it is quite different; the false judgments we form neither affect our existence nor our welfare; and we are not forced by any physical necessity to correct them. Imagination, on the contrary, which is ever wandering beyond the bounds of truth, joined to self-love and that self-confidence we are so apt to indulge, prompt us to draw conclusions that are not immediately derived from facts, so that we become in some measure interested in deceiving ourselves.

 -Lavoisier, Preface to Elements of Chemistry (1790)

Figure 5 from Danchin et al. Science 2018 purports to show that female flies can remember the ratio of previous green to pink males that they’ve seen mate, and that they use this information to select their own mate. This would be evidence of cultural transmission in Drosophila.

What do you do if your PCR doesn’t work? First: run it again. Second: change the temperatures and run it again. Third: swap out some reagent and run it again. It finally worked! Why didn’t it work on the first three attempts? Who cares? We can sequence the band and make sure it’s correct.

Now what if your behavioral experiment doesn’t work? Run it again? Didn’t work again. Run it yourself (instead of having an undergrad do it). Still didn’t work. Make some small parameter change. It worked! Now…what do you do with the first three runs? Discard the data? You’re not going to include them and make your result look weaker or non-existent, are you? But are you sure that it was the tweak that gave you the result you wanted? There’s no confirmatory sequencing here. Like the evolutionary processes that have generated so many amazing things, your result may be due to chance and selection. Unlike evolution by natural selection, though, your finding may be false.
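How much does the retry-until-it-works protocol inflate false positives? Here is a minimal simulation (our own sketch, not anyone’s actual protocol — the run counts and sample size are illustrative): each "experiment" has a true effect of zero, and we compare reporting every run against reporting only the run that finally "worked."

```python
import math
import random

random.seed(1)

def null_p_value(n=30):
    """Two-sided z-test p-value for one run of a null experiment (true effect = 0)."""
    z = sum(random.gauss(0, 1) for _ in range(n)) / math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

runs = 5000

# Honest protocol: one attempt per study, reported no matter what.
honest = sum(null_p_value() < 0.05 for _ in range(runs)) / runs

# Retry protocol: up to 4 attempts (each with a "tweak"); a study counts as
# a finding if ANY attempt crosses p < 0.05.
retry = sum(any(null_p_value() < 0.05 for _ in range(4)) for _ in range(runs)) / runs

print(honest, retry)
```

With four attempts allowed, the nominal 5% false-positive rate climbs to roughly 1 − 0.95⁴ ≈ 19%, even though nothing real is being measured.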

Take a look at Figure 5 in a Research Article published in Science. The female fly has watched five or six other females mate with males that were either covered in pink or green dust. Now it’s her turn to decide: who does she want to mate with, a pink male or a green one? It seems that she is heavily biased by the previous selections made by other females. Incredibly, if she watches 3 females mate with green males and 2 with pink males, her own bias towards green males is as severe as if she saw only 6 matings with green males. The inverse is true for pink males. Can the female fly really remember all of the matings she has seen and use this information to select a mate?

This struck us as unbelievable. Look at this video to see what happens when a female is placed with two interested males. Think she’d be able to tell which is green and which is pink? Can she even see them? How much control does she even have over which male mates with her?

So we got the pink and green dust and tried to reproduce the results. We quickly realized that the extent to which the males are dusted strongly influences their willingness to court (they hate being dusted) and we suspected this effect would be stronger than any female choice. So we contacted the authors. They agreed that dust level was an issue and told us how they dust the males, but they also told us this: “it is clear that one has to play with all the different phases of the experiments to master all the steps. No long talk can replace this. Different students show different skills.” Sounds like the PCR thought experiment from above. So we looked at their results more closely.

Figure 1 from Thornquist and Crickmore Science 2019. The plot on the left shows the expected distributions of p-values for hypothesis-confirming experiments in white, compared to the actual p-values of Danchin et al. in pink. The pink values are much closer to 0.05 (blue dashed line) than you would expect, which could indicate that just-good-enough trials were selected for. The panel on the right shows the expected distribution of p-values when comparing effects across experiments (e.g. the green bars in their Figure 5). Here you would expect a uniform distribution of p-values, since you are effectively drawing from the same population. Instead, the actual values in pink are highly skewed toward 1, indicative of extreme consistency across experiments and pointing to some factor other than the female simply choosing pink vs. green males.
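To build intuition for the right-hand panel, here is a minimal simulation (our own sketch, not the paper’s analysis code; sample sizes are illustrative): when two experiments sample the same underlying population, a two-sample z-test comparing them yields p-values uniform on [0, 1], so p-values piling up near 1 are themselves anomalous.

```python
import math
import random

random.seed(0)

def same_population_p(n=50):
    """Two-sided z-test p-value comparing two samples drawn from ONE population."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # Under the null, (mean_a - mean_b) * sqrt(n/2) is standard normal.
    z = (sum(a) - sum(b)) / math.sqrt(2 * n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

pvals = [same_population_p() for _ in range(2000)]

# Uniform p-values: each decile should hold about 10% of the values,
# with no pile-up near 1.
deciles = [sum(d / 10 <= p < (d + 1) / 10 for p in pvals) / len(pvals)
           for d in range(10)]
print([round(f, 2) for f in deciles])
```

Every decile comes out near 0.10; a histogram heavily loaded onto the highest decile, as in the pink distribution, is not what repeated draws from one population produce.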

We found that their p-values for experiments that “worked” (the female chose based on color history) were too often just on the “working” side of 0.05. We also found that the extent to which the experiment “worked” was far more consistent than you would expect by chance. You can see this because their p-values comparing experiments in which the female chose based on color history are strongly skewed toward 1 instead of being evenly distributed, as you would expect from independent trials with the same underlying mean and variance. How could you get these kinds of data? One way is to find reasons to exclude data that doesn’t fit the hypothesis. The authors have responded, criticizing some of our numbers, but agreeing that the effect is much more consistent than you would expect. How often would you expect to get nearly exactly 30 “heads” from 60 coin flips? What does it mean if that keeps happening (look at the green bars in Figure 5)? It means that something else is going on. We believe it is experimental selection: discarding experiments because the student didn’t have the right skills or all of the phases of the experiments weren’t exactly right. The authors counter that it can be explained by some unknown but extremely reproducible variable, for example the female is very consistently stressed on a very reproducible fraction of the trials.
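The coin-flip question has a concrete answer (our own back-of-the-envelope calculation; the "8 experiments" figure is illustrative): even the single most likely outcome, exactly 30 heads in 60 fair flips, occurs only about 10% of the time, and landing within one head of 30 happens about 30% of the time, so hitting that narrow window in every one of a series of independent experiments is vanishingly unlikely.

```python
import math

n = 60  # coin flips per "experiment"

# Probability of exactly 30 heads in 60 fair flips
p_exact = math.comb(n, 30) / 2**n
print(round(p_exact, 3))  # ≈ 0.103

# Probability of "nearly exactly" 30: i.e. 29, 30, or 31 heads
p_near = sum(math.comb(n, k) for k in (29, 30, 31)) / 2**n
print(round(p_near, 3))  # ≈ 0.301

# Chance of landing that close in, say, 8 independent experiments in a row
print(f"{p_near**8:.1e}")  # ≈ 6.8e-05
```

So repeated near-exact 50/50 splits across experiments point not to chance but to something constraining the outcomes.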

How important is this? Think about the claims: flies pay attention to, remember, count, and reason mathematically about the types of matings they’ve seen; flies have transmissible culture. Now think about the experiments and data that support these ideas.

How common a problem is this in behavioral neuroscience? It happens in our lab all of the time. We suspect that an undergrad’s weird results are because they weren’t careful and used the wrong genotype or protocol. What should we do? Use that data and contaminate our trustworthy data? Throw away those data? Throw away all of that person’s data? Would we exclude the same data if it supported our hypothesis? It’s scary to think about, but it’s scarier to not think about.