
Sapienza et al. (2009): Critiques and Sociocultural Implications

By AmyMay

I had a lot of issues with the study we read on gender differences in testosterone and financial risk aversion.  One of the bigger issues I saw was the sample population they used.  The authors defend their choice of participants as ideal for this study, since the subjects were already familiar with financial risk, were fairly demographically homogeneous, and provide some measure of risk-taking among professional financial decision makers.  However, this group may also be overly homogeneous in testosterone levels, offering only a sliver of the possible data.  They report that other studies have found correlations between testosterone and career choice, and concede that greater testosterone among their subjects may reflect the greater risk-taking in that industry.  This selectivity makes it deeply problematic to generalize their findings: a sample this narrow cannot speak for the general population.  The same caution applies to the negative correlations they found.  Though there may be a negative correlation in the part of the population distribution represented by the sample, the relationship between the variables may not be the same elsewhere in the distribution.  For example, in the figure below, if you sample only the part of the population with X values between 70 and 80, the X-Y relationship will appear to be a negative, linear correlation.  However, if the whole population distribution of X is represented in the sample, it becomes clear that this is not the case.

[Figure: scatterplot of non-linear data]
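This range-restriction problem is easy to demonstrate with a quick simulation. The sketch below (hypothetical data, not from the study) draws an inverted-U relationship between X and Y: over the full range of X the linear correlation is near zero, but a sample restricted to X between 70 and 80 shows a strong negative correlation, just as in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inverted-U relationship: Y rises with X, peaks at X = 50, then falls.
x_full = rng.uniform(0, 100, 5000)
y_full = -((x_full - 50) ** 2) / 100 + rng.normal(0, 1, x_full.size)

# Correlation over the full population range: near zero, because the
# relationship is non-linear and symmetric around X = 50.
r_full = np.corrcoef(x_full, y_full)[0, 1]

# Now restrict the sample to X between 70 and 80, as in the figure's example.
mask = (x_full >= 70) & (x_full <= 80)
r_restricted = np.corrcoef(x_full[mask], y_full[mask])[0, 1]

print(f"full-range r  = {r_full:.2f}")    # close to 0
print(f"restricted r  = {r_restricted:.2f}")  # strongly negative
```

The restricted sample "sees" only the descending arm of the curve, so it reports a clean negative linear relationship that simply does not describe the rest of the distribution.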

            I also take issue with their implication of biological destiny.  Yes, as they say in their rebuttal, they use plenty of statistician lingo to dance around the subject (e.g., using "suggest" instead of "demonstrate," as they mention).  However, a responsible scientist (in my opinion) should be aware of the sociopolitical context of their research (shout out to Barad here).  Scientists and statisticians with the know-how to parse the distinction between "demonstrate" and "suggest" are not the only people who will pick up this article.  The popular press often reports scientific findings, with varying levels of success and accuracy.  Sapienza et al. fail to fully address the problems inherent in a correlational design (the inconclusive direction of causality, the possibility of a third variable causing variation in both X and Y, etc.).  Even the title of the article, "Gender differences in financial risk aversion are affected by testosterone," implies that testosterone is causing these differences, when there is absolutely no evidence in their study to support this.  (No experimental manipulation, no causation.  Period.)  Though it seems intuitive for hormonal, biological, and brain factors to influence behavior, a growing body of research shows that not only does the brain affect behavior, but behavior also affects the brain (and the endocrine system, which the brain controls).
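The third-variable problem can also be simulated. In the hypothetical sketch below (the variables and coefficients are invented for illustration), an unmeasured confound Z drives both X and Y; X and Y end up substantially correlated even though neither causes the other, and statistically removing Z makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical confound: Z influences both X and Y; X has no direct effect on Y.
z = rng.normal(0, 1, n)              # the unmeasured third variable
x = 0.8 * z + rng.normal(0, 0.6, n)
y = 0.8 * z + rng.normal(0, 0.6, n)

# X and Y are clearly correlated despite having no causal link between them.
r_xy = np.corrcoef(x, y)[0, 1]

# Regress Z out of each variable; the residuals are essentially uncorrelated.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial = np.corrcoef(x_resid, y_resid)[0, 1]

print(f"raw r(X, Y)       = {r_xy:.2f}")     # substantial
print(f"partial r given Z = {r_partial:.2f}")  # near zero
```

A correlational design alone cannot distinguish this scenario from a genuinely causal one, which is exactly why the article's causal-sounding title overreaches.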

            My last big issue with this paper is fairly common in research on sex differences (based on my own readings of psych research on gender dimorphisms): they do not include a measure of effect size.  Effect size is different from statistical significance.  While statistical significance measures the probability that the difference between conditions occurred by chance, effect size addresses whether that significant difference is substantial enough to be of concern.  This may seem redundant, but in the study of gender dimorphisms it is essential.  Sample size is included in the calculation of statistical significance: the bigger the sample, the less likely the result occurred by chance (since a larger sample is presumably more representative of the whole population), and the easier it is to achieve statistical significance for a given effect.  For example, men and women may differ in mathematical ability by a very, very small amount (maybe the female population average is 70 while the male population average is 71 out of 100).  It would take a HUGE sample of participants to detect this effect, but if you sample enough people, you will find significant effects.  Effect size puts this into perspective, basically asking, "Is this effect big enough for us to care, or did the researchers have to sample everyone and their mother to reach significance?"  In the field of psychology, it is becoming more common (especially in research on gender dimorphisms) to report effect size alongside the significance level (p value).  Given that the study at hand had over 500 participants (that is A LOT—for my thesis I'm shooting for about 40 participants), I am left questioning whether the effect sizes of the relationships they describe were actually substantial.  Their effects are statistically significant, but are they meaningful?
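The 70-versus-71 example above can be made concrete. In the sketch below (hypothetical scores with an assumed standard deviation of 10, chosen just for illustration), a huge sample makes the 1-point gap wildly "significant" by a t-test, while Cohen's d, a standard effect-size measure, reveals the difference to be about a tenth of a standard deviation, i.e. trivially small.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical math scores: population means 70 vs 71, common SD of 10.
n = 50_000  # a huge sample per group
a = rng.normal(70, 10, n)
b = rng.normal(71, 10, n)

# Welch's t statistic, computed by hand to stay dependency-free.
se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
t = (b.mean() - a.mean()) / se
# With n this large, |t| is far beyond any conventional significance cutoff.

# Cohen's d: the mean difference in pooled-SD units -- the effect size.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

print(f"t = {t:.1f}")  # enormous: "significant" at any alpha
print(f"d = {d:.2f}")  # about 0.1: a negligible effect
```

The p value answers "could this difference be chance?"; d answers "is this difference big enough to matter?" A study with 500+ participants owes its readers the second answer too.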

Given what we have been talking about recently with respect to the God-effect and the hegemony of scientific fact, it seems irresponsible for researchers to leave such points unaddressed (or insufficiently addressed) in their research.  Like an economist whose report on car sales affects future car sales, psychologists, endocrinologists, biologists, and behavioral neuroscientists have an impact on the very things they study (again, shout out to Barad here).  I think there is a necessary place in research ethics for such considerations.