IQ gender differences research has long been one of the most debated topics in psychology — and a landmark meta-analysis published in the journal Educational Psychology Review may finally offer some clarity. Analyzing data from 79 studies and more than 46,000 participants, researchers found that the overall difference in full scale IQ scores between males and females is remarkably small — so small, in fact, that it carries virtually no practical meaning in everyday life.
Still, the story is more nuanced than a simple “no difference” headline. When you break intelligence down into specific cognitive domains, look at how test design has evolved over decades, and factor in the concept of score variability rather than just averages, a far richer picture emerges. This article walks through the key findings, what they mean, and why the way we ask the question matters just as much as the answer.
Once again, personality researcher and author of Villain Encyclopedia, Tokiwa (@etokiwa999), will provide the explanation.

Table of Contents
- 1 What the IQ Gender Differences Research Actually Studied
- 2 The Full Scale IQ Gender Gap: What the Numbers Actually Show
- 3 Beyond Averages: Why IQ Gender Differences Research Must Include Variability
- 4 How Test Design Shapes the Results: The Role of IQ Test Bias
- 5 What These Findings Mean in Practice: Moving Beyond Gender Stereotypes in Intelligence
- 6 Frequently Asked Questions
- 6.1 Is the gender difference in IQ caused by genetics or environment?
- 6.2 How small is an effect size of d = 0.09 in real-world terms?
- 6.3 Why do newer versions of the WISC show smaller gender differences?
- 6.4 What is the variability hypothesis and why does it matter?
- 6.5 Is a sample of 46,000 participants large enough to trust these results?
- 6.6 Do men or women score higher on specific cognitive tasks?
- 6.7 Does this research mean gender plays no role in intelligence at all?
- 7 Summary: What IQ Gender Differences Research Really Tells Us
What the IQ Gender Differences Research Actually Studied
A Meta-Analysis of 79 Studies and Over 46,000 Participants
The study in question is a large-scale meta-analysis — one of the most reliable research designs available in the social sciences. A meta-analysis works by pooling data from many independent studies to produce a more statistically robust conclusion than any single study could offer on its own. This particular analysis drew on 79 papers published between 1961 and 2019, with a combined sample of 46,605 participants — 23,404 males and 23,201 females — and extracted 640 separate effect size estimates across 134 independent samples.
The key figures at a glance:
- Number of studies included: 79 (published 1961–2019)
- Total participants: 46,605
- Independent samples: 134
- Effect size estimates extracted: 640
Because individual studies can be swayed by small sample sizes, population-specific factors, or slight differences in testing conditions, combining them in this way dramatically increases confidence in the results. The sheer scale of this analysis places it among the largest ever conducted on intelligence and sex differences. The more data points you have, the harder it becomes for random noise to distort the conclusions.
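To make the pooling idea concrete, here is a minimal sketch of inverse-variance weighting, one of the simplest ways to combine effect sizes. The article does not say which pooling model the authors used, and an analysis with 640 effect sizes nested in 134 samples would typically need a more elaborate multilevel or random-effects model. The function name and the per-study numbers below are hypothetical.

```python
import math

def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    # More precise studies (smaller sampling variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    return pooled, se

# Hypothetical per-study effect sizes (Cohen's d) and sampling variances.
study_d = [0.20, 0.05, -0.02, 0.12]
study_var = [0.04, 0.01, 0.02, 0.005]

pooled, se = pooled_effect(study_d, study_var)
print(f"pooled d = {pooled:.3f}, standard error = {se:.3f}")
```

Notice that the pooled standard error is smaller than that of any single study in the list, which is the sense in which combining studies makes it harder for random noise to distort the conclusion.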
Why Researchers Focused Exclusively on the WISC
One of the most methodologically sound decisions in this research was limiting the analysis to a single test battery: the Wechsler Intelligence Scale for Children (WISC). Mixing results from different intelligence tests is a major source of inconsistency in this field, because each test measures slightly different things and scores are not always directly comparable. By standardizing on the WISC — the most widely used child intelligence test in the world — the researchers were able to make clean, apples-to-apples comparisons across decades and countries.
The WISC has been revised several times since its creation:
- Original WISC: 1949
- WISC-R: 1974
- WISC-III: 1991
- WISC-IV: 2003
- WISC-V: 2014
Each revision has refined how cognitive ability is measured. Importantly, the researchers did not just lump all versions together — they also analyzed older and newer editions separately to see whether test design itself influenced the apparent size of gender differences. This decision turned out to be highly revealing, as you will see in later sections. Focusing on one well-validated instrument is a key strength that distinguishes this study from earlier, more eclectic reviews.
Six Cognitive Domains Examined Separately
Rather than treating intelligence as a single, monolithic number, the researchers analyzed 6 separate scores: five broad abilities drawn from the CHC (Cattell-Horn-Carroll) theoretical framework, plus the overall Full Scale IQ composite. CHC theory is one of the most widely accepted models of human intelligence. It organizes cognitive abilities into a hierarchy, allowing researchers to identify where, specifically, any gender differences might appear, rather than averaging everything into one score that could mask important distinctions.
The 6 domains analyzed were:
- Fluid Intelligence (Gf): The ability to reason through novel problems using logic — without relying on prior knowledge.
- Visual Processing (Gv): The ability to perceive, analyze, and mentally manipulate visual and spatial information.
- Crystallized Intelligence (Gc): Knowledge and vocabulary accumulated through education and experience.
- Short-Term and Working Memory (Gsm): The capacity to hold information in mind briefly and manipulate it.
- Processing Speed (Gs): The speed and accuracy with which simple cognitive tasks are performed.
- Full Scale IQ (FSIQ): An overall composite score representing general cognitive ability.
Separating these domains was crucial because the overall picture can look very different from what is happening in any one area. For example, it is entirely possible for two groups to show the same full scale IQ while differing meaningfully on spatial reasoning or processing speed. This level of granularity is what makes this meta-analysis particularly informative compared to studies that only report a single composite score.
Older vs. Newer Test Versions: A Critical Comparison
The researchers made a particularly important analytical choice by separating results from older WISC versions and newer ones. The older group included the original WISC and WISC-R (up to 1974), while the newer group covered WISC-III, WISC-IV, and WISC-V (from 1991 onward). The rationale was straightforward: if the apparent gender gap in cognitive ability testing changes depending on which version of the test is used, that tells us something important about the test itself — not just about the people taking it.
- Older versions (WISC, WISC-R): Tended to show larger gender differences
- Newer versions (WISC-III, IV, V): Showed smaller or statistically non-significant differences
This pattern suggests that what earlier research interpreted as a “real” gender gap in intelligence may have partly been an artifact of how those tests were designed. As test construction became more rigorous — with greater attention to eliminating culturally biased items and ensuring that tasks measure pure cognitive ability — the apparent differences narrowed considerably. This is a powerful reminder that results in cognitive ability research are never entirely separable from the measurement tools used to produce them.
The Full Scale IQ Gender Gap: What the Numbers Actually Show
Overall Difference: About 1.4 IQ Points in Favor of Males
Across all 79 studies combined, males scored slightly higher than females on full scale IQ, but the difference was extremely small: an effect size of d = 0.09. On a standard scale where the average is 100 and one standard deviation equals 15 points, this translates to approximately 1.4 IQ points (0.09 × 15 = 1.35). Cohen’s d is the standard way researchers express the size of a difference between two groups, measured in standard deviation units. By the usual rules of thumb, values around 0.20 are “small,” values around 0.50 are “moderate,” values of 0.80 or more are “large,” and anything well below 0.20 is generally treated as negligible.
- Overall effect size (d): 0.09 (males slightly higher)
- IQ points difference: approximately 1.4 points
- Statistical significance: Yes (p < 0.001) — but statistical significance does not equal practical importance
- Practical meaning: Negligible
It is important to understand why something can be statistically significant yet practically meaningless. With a sample of over 46,000 people, even tiny differences will appear “significant” by conventional statistical thresholds. What matters more is the effect size, and 0.09 is less than half of the 0.20 that is conventionally labeled “small.” The difference between any two individual people chosen at random will typically dwarf this group-level difference. Framing a 1.4-point gap as evidence of a meaningful gender advantage in general intelligence simply does not hold up to scrutiny.
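The arithmetic behind these figures can be sketched in a few lines. The conversion from d to IQ points uses the 15-point standard deviation stated above. The comparison figure, the expected gap between two randomly chosen people, assumes IQ scores are normally distributed, which is an assumption of this sketch rather than something the article reports. Function names are hypothetical.

```python
import math

IQ_SD = 15.0  # standard deviation of the IQ scale described above

def d_to_iq_points(d, sd=IQ_SD):
    """Convert a standardized mean difference into IQ points."""
    return d * sd

def mean_abs_difference(sd=IQ_SD):
    """Expected |X - Y| for two independent draws from one normal distribution."""
    # X - Y has standard deviation sd * sqrt(2), and E|Z| = sqrt(2 / pi) for a
    # standard normal Z, so the product simplifies to 2 * sd / sqrt(pi).
    return 2.0 * sd / math.sqrt(math.pi)

group_gap = d_to_iq_points(0.09)
typical_pair_gap = mean_abs_difference()
print(f"group-level gap: {group_gap:.2f} IQ points")
print(f"expected gap between two random individuals: {typical_pair_gap:.1f} IQ points")
```

Under these assumptions, two people picked at random differ by roughly 17 points on average, more than ten times the group-level gap.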
Newer Test Versions Show No Statistically Significant Gender Gap at All
When the analysis was restricted to newer WISC editions (WISC-III, IV, and V), the already-small gap shrank further — and disappeared as a statistically meaningful finding. The effect size dropped to d = 0.054, equivalent to roughly 0.81 IQ points. Crucially, this result was not statistically significant (p = 0.13), meaning the observed difference could plausibly be explained by random chance alone.
- Effect size for newer versions only: d = 0.054
- IQ points difference: approximately 0.81 points
- Statistical significance: None (p = 0.13)
Research suggests the likely explanation is that newer versions of the WISC were designed with greater care to eliminate items that inadvertently favored one gender over another. Earlier tests may have included content that reflected cultural expectations or knowledge domains more familiar to boys than girls (or vice versa). As test construction methodology improved, those incidental biases were reduced — and so was the apparent IQ gap. This finding is a strong argument that at least some of the historically reported gender gap in cognitive testing was a product of IQ test bias rather than a reflection of true differences in underlying ability.
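One way to see what “not statistically significant” means here is to work backward from the two reported numbers (d = 0.054, p = 0.13) to an approximate confidence interval. The sketch below assumes the p-value came from a two-sided normal-approximation test, which the article does not confirm, so treat the output as illustrative. The function name is hypothetical.

```python
from statistics import NormalDist

def ci_from_d_and_p(d, p_two_sided, level=0.95):
    """Approximate confidence interval implied by an effect size and its p-value."""
    std_normal = NormalDist()
    # The z statistic that would yield this two-sided p-value.
    z_observed = std_normal.inv_cdf(1.0 - p_two_sided / 2.0)
    standard_error = abs(d) / z_observed
    z_critical = std_normal.inv_cdf(0.5 + level / 2.0)
    return d - z_critical * standard_error, d + z_critical * standard_error

low, high = ci_from_d_and_p(0.054, 0.13)
print(f"approximate 95% interval for d: {low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from slightly below zero to a little above 0.12, so the data are compatible both with no difference at all and with a small male advantage.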
Domain-by-Domain Results: Where Small Differences Do Appear
While the full scale IQ difference is negligible, breaking results down by cognitive domain reveals a more textured picture — with small advantages appearing in different directions for different abilities. None of the domain-level differences were large by any scientific standard, but the pattern is worth understanding because it challenges the oversimplified narrative that men and women are cognitively identical in every respect, while also refuting the equally oversimplified claim that men are broadly smarter.
- Fluid Intelligence (Gf): d = 0.09 overall (males slightly higher); drops to d = 0.05 and becomes non-significant in newer test versions. Matrix reasoning subtasks actually show d = −0.04, slightly favoring females.
- Visual Processing / Spatial Reasoning (Gv): This domain shows the most consistent male advantage across versions, though still in the small range.
- Crystallized Intelligence (Gc): Near zero difference; neither sex shows a consistent advantage in accumulated knowledge and vocabulary as measured by the WISC.
- Working Memory (Gsm): Minimal difference overall.
- Processing Speed (Gs): Females tend to score slightly higher, consistent with findings from other research areas suggesting women process certain routine tasks more quickly.
- Full Scale IQ (FSIQ): d = 0.09 overall; d = 0.054 for newer versions (non-significant).
The takeaway is not “males are better at spatial tasks and females are faster processors” as a sweeping generalization — the differences are too small for that framing to be useful. Rather, the domain-level data shows that intelligence is not a single thing, and that gender-related patterns, to the extent they exist at all, are highly specific, context-dependent, and dwarfed by the enormous overlap between the two groups.
Beyond Averages: Why IQ Gender Differences Research Must Include Variability
The Same Average Can Hide Very Different Distributions
One of the most important — and most commonly overlooked — aspects of comparing two groups is that identical averages can coexist with very different distributions of scores. Imagine two classes where both have an average exam score of 75 out of 100. In Class A, scores range from 65 to 85 — most students clustered tightly around the average. In Class B, scores range from 40 to 100 — far more students at the extremes. The averages look the same, but the classes are not the same at all. The same logic applies directly to IQ score comparisons between males and females.
- High variability group: More individuals at both the top and bottom of the distribution
- Low variability group: More individuals clustered near the average, fewer at the extremes
This means that even if males and females have virtually identical average IQ scores, differences in how much scores spread out could produce real differences in the proportion of each group at very high or very low IQ levels. Concluding “there is no gender difference in intelligence” based solely on average scores ignores this dimension entirely — and doing so can lead to misleading conclusions about representation at the extremes of intellectual ability.
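The two-class example can be sketched numerically. The snippet below models each class as a normal distribution with the same mean of 75 but a different spread. The standard deviations (5 and 12) and the cutoff of 90 are hypothetical values chosen only to mimic the tight and wide ranges described above.

```python
from statistics import NormalDist

# Same average, different spread (standard deviations are hypothetical).
class_a = NormalDist(mu=75, sigma=5)
class_b = NormalDist(mu=75, sigma=12)

def share_above(dist, cutoff):
    """Fraction of a distribution lying above the cutoff."""
    return 1.0 - dist.cdf(cutoff)

a_top = share_above(class_a, 90)
b_top = share_above(class_b, 90)
print(f"Class A above 90: {a_top:.1%}")
print(f"Class B above 90: {b_top:.1%}")
```

With these numbers, well under 1% of Class A scores above 90 compared with roughly 10% of Class B, even though the two averages are identical.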
The Variability Hypothesis: Males Show Greater Spread at Both Extremes
The variability hypothesis is the scientific idea that males tend to show greater variance in cognitive test scores than females — meaning more males score at both the very high end and the very low end of the IQ distribution. This hypothesis, which has been discussed in the scientific literature for well over a century, does not predict that males will have higher average scores. Instead, it predicts that the male score distribution will be wider — more “spread out” — than the female distribution, even if the centers of the two distributions are nearly identical.

Key predictions of the variability hypothesis include:
- More males than females at very high IQ levels (e.g., above 130)
- More males than females at very low IQ levels (e.g., below 70)
- Females more concentrated around the average IQ range
- Group-level average scores may be nearly identical despite these distributional differences
If this hypothesis is correct — and research suggests it has at least partial support — it has real-world implications. For example, it could partly explain why males are overrepresented both in programs for gifted students and in diagnoses of intellectual disabilities. Importantly, this does not mean males are “smarter” — it means the distribution of scores is shaped differently. The meta-analysis highlighted the variability question as an area requiring more dedicated future research, since most existing studies, including this one, primarily analyzed mean differences rather than variance differences.
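To show how a modest difference in spread could translate into the tails, here is a sketch that assumes normal distributions with equal means. The variance ratio of 1.10 is purely hypothetical: the meta-analysis discussed here did not estimate one. The cutoffs of 130 and 70 follow the examples in the list above, and the function name is an assumption of this sketch.

```python
import math
from statistics import NormalDist

def tail_ratios(variance_ratio, mean=100.0, sd=15.0, high=130.0, low=70.0):
    """Male-to-female ratio of the shares above `high` and below `low`.

    Assumes equal means and normal distributions; the female SD is fixed
    at `sd` and the male SD is scaled by sqrt(variance_ratio).
    """
    female = NormalDist(mean, sd)
    male = NormalDist(mean, sd * math.sqrt(variance_ratio))
    upper = (1.0 - male.cdf(high)) / (1.0 - female.cdf(high))
    lower = male.cdf(low) / female.cdf(low)
    return upper, lower

upper, lower = tail_ratios(1.10)  # hypothetical variance ratio
print(f"ratio above 130: {upper:.2f}, ratio below 70: {lower:.2f}")
```

Under these assumptions, a 10% difference in variance produces roughly 24% more males than females beyond either cutoff, while the averages stay exactly equal.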
Why Most Studies Have Missed This — And What It Means for the Field
The majority of research on intelligence and sex differences has focused almost exclusively on comparing average scores, while largely ignoring variance — a methodological blind spot with serious consequences for how findings get interpreted. The authors of this meta-analysis explicitly noted this gap, pointing out that focusing only on means provides an incomplete and potentially misleading picture of how cognitive ability is distributed across genders.
- What most studies have done: Compared average IQ scores between males and females
- What most studies have NOT done: Examined whether the spread (variance) of scores differs between groups
- The consequence: Conclusions like “no gender difference exists” are only accurate for averages — they may not apply to the shape of the full distribution
Beyond the variance issue, many earlier studies also suffered from other limitations: small sample sizes, samples drawn from single countries or non-representative populations, and the use of multiple different intelligence tests whose results were pooled without proper controls. These factors likely contributed to the inconsistent findings that have characterized this research field for decades. The meta-analysis under discussion represents a significant step forward precisely because it addressed several of these limitations simultaneously — but it also openly acknowledged that the variability question remains an important open problem for future investigation.
How Test Design Shapes the Results: The Role of IQ Test Bias
Older Tests Showed Larger Gaps — And Here Is Why That Matters
One of the most striking findings in this meta-analysis is the consistent pattern that older versions of the WISC produced larger apparent gender gaps than newer versions. This is not a minor statistical quirk — it is a finding with significant implications for how we interpret decades of earlier research on cognitive ability in men versus women. If the size of the measured gap changes substantially depending on which version of the test is used, then at least part of what previous researchers reported as a “gender difference in intelligence” may actually be a “gender difference in how well a particular test was designed.”
Why might older tests have been more biased?
- Older tests were developed in cultural contexts where gender roles were more rigidly defined, and test content may have unintentionally reflected those norms
- Certain knowledge-based questions (crystallized intelligence items) may have drawn on domains more familiar to boys in mid-20th-century Western societies
- Differential Item Functioning (DIF) analysis — a statistical technique for identifying items that perform differently for different demographic groups — was not widely applied to early test versions
- Newer tests have undergone more rigorous psychometric review to ensure items measure the targeted ability rather than gender-related background knowledge
The implication is not that all previously reported gender differences were fabricated. Rather, research suggests that the measurement instrument itself was a variable, and that as instruments improved, apparent differences shrank. This is actually a success story for psychometrics: the field has become better at measuring what it claims to measure. But it also means that older studies likely overstated the gender gap in cognitive testing, and those overstated findings have had an outsized influence on popular beliefs about intelligence differences between men and women.
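For readers curious what a DIF check looks like in practice, one commonly used approach is the Mantel-Haenszel procedure: test-takers are grouped into bands of similar total score, and within each band the odds of answering a single item correctly are compared between two groups. The article does not say which DIF method was applied to any WISC edition, so the sketch below is only an illustration of the general idea, and all counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across ability strata.

    Each stratum is a tuple
    (ref_correct, ref_wrong, focal_correct, focal_wrong)
    for test-takers with similar total scores.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_correct, ref_wrong, focal_correct, focal_wrong in strata:
        n = ref_correct + ref_wrong + focal_correct + focal_wrong
        if n == 0:
            continue  # skip empty score bands
        numerator += ref_correct * focal_wrong / n
        denominator += ref_wrong * focal_correct / n
    return numerator / denominator

# Hypothetical counts for a single item, in three total-score bands.
item_counts = [
    (30, 70, 20, 80),  # low scorers
    (60, 40, 45, 55),  # middle scorers
    (85, 15, 75, 25),  # high scorers
]
alpha_mh = mantel_haenszel_odds_ratio(item_counts)
print(f"Mantel-Haenszel odds ratio: {alpha_mh:.2f}")
```

A ratio near 1 means the item behaves the same for both groups once overall ability is matched. A ratio well above or below 1, as in this made-up example, would flag the item for review.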
What “Statistically Significant But Practically Meaningless” Really Means
A common source of confusion in intelligence and sex differences research, and in science reporting more broadly, is the difference between statistical significance and practical significance. Statistical significance tells you whether an observed difference is larger than random sampling variation alone would be expected to produce. Practical significance (measured by effect size) tells you whether that difference is large enough to actually matter in real life. These two things are not the same, and conflating them leads to a great deal of unnecessary controversy.
Here is a concrete way to think about effect size d = 0.09:
- In IQ points: approximately 1.4 points on a scale where the average is 100 and one standard deviation is 15
- In height terms (as an analogy): roughly equivalent to a height difference of about 0.6 cm between two groups of adults, assuming heights vary with a standard deviation of about 7 cm. That is detectable statistically with a large enough sample, but imperceptible in daily life
- In overlap terms: if both score distributions are roughly normal with similar spread, they overlap by about 96%, meaning the “average male” and “average female” are cognitively nearly indistinguishable
With 46,605 participants, even a difference of d = 0.09 crosses the threshold for statistical significance — but this says nothing about whether the difference is meaningful. The far larger source of variance in any individual’s IQ is not their gender; it is the unique combination of genetic endowment, educational opportunity, socioeconomic background, health, and the many other factors that shape cognitive development across a lifetime. Gender, by comparison, explains a vanishingly small fraction of the total variation in intelligence scores.
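These three ideas (overlap, variance explained, and the role of sample size) can each be computed from d with standard formulas. The sketch below assumes normal distributions with equal spread and treats the data as a single two-group comparison, which is a simplification: the actual meta-analysis pooled many nested effect sizes, so its reported p-value was not computed this way. Function names are hypothetical.

```python
import math
from statistics import NormalDist

def overlap_coefficient(d):
    """Shared area of two equal-variance normal curves whose means differ by d SDs."""
    return 2.0 * NormalDist().cdf(-abs(d) / 2.0)

def variance_explained(d):
    """Share of score variance tied to group membership, for two equal-sized groups."""
    r = d / math.sqrt(d * d + 4.0)
    return r * r

def z_statistic(d, n1, n2):
    """Large-sample z for a standardized mean difference (small-d approximation)."""
    return d / math.sqrt(1.0 / n1 + 1.0 / n2)

overlap = overlap_coefficient(0.09)
explained = variance_explained(0.09)
z_large = z_statistic(0.09, 23404, 23201)  # group sizes reported in the article
z_small = z_statistic(0.09, 100, 100)      # hypothetical small study
print(f"overlap: {overlap:.1%}, variance explained: {explained:.2%}")
print(f"z with 46,605 participants: {z_large:.1f}; z with 200 participants: {z_small:.2f}")
```

The same d = 0.09 clears the conventional 1.96 threshold easily with tens of thousands of participants and falls far short of it with 200, while the share of variance it explains stays at about 0.2% either way.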
What These Findings Mean in Practice: Moving Beyond Gender Stereotypes in Intelligence
For Educators: Stop Sorting Students by Gender-Based Cognitive Assumptions
Perhaps the most direct practical implication of this research is that educational systems, teachers, and parents should not use gender as a meaningful predictor of a child’s intellectual strengths or weaknesses. The stereotype that “boys are better at math and spatial reasoning while girls are better at language” is not well supported by full scale IQ data, and even domain-specific differences — where they exist — are too small to justify sorting individual children by gender-based expectations.
- Why it matters: Stereotype threat research shows that when students believe a stereotype applies to them (e.g., “girls aren’t as good at math”), it can actually impair their performance — creating a self-fulfilling prophecy that has nothing to do with underlying ability
- What to do instead: Assess individual children’s strengths and challenges on their own terms, using objective data, rather than filtering expectations through gender lenses
- How to apply it: Encourage all students to engage with all subjects equally in early education, before cultural messages about “gendered” subjects take hold
The individual variation within each gender group is enormously larger than the average difference between groups. Because the groups overlap so extensively, a randomly chosen girl will outperform a randomly chosen boy on spatial reasoning tasks almost as often as the reverse. Treating children as individuals rather than as representatives of their gender is not just philosophically sound; it is what the science actually supports.
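The claim about overlap can be put in terms of a head-to-head comparison between two randomly chosen children. Assuming normal distributions with equal spread, the chance that a random member of the higher-scoring group outscores a random member of the other group follows directly from d. The article does not report an exact d for spatial reasoning, so the value 0.20 below is hypothetical, and 0.09 is the full scale figure.

```python
import math
from statistics import NormalDist

def probability_of_superiority(d):
    """Chance that a random member of the higher-scoring group outscores
    a random member of the other group (equal-variance normal model)."""
    return NormalDist().cdf(d / math.sqrt(2.0))

p_full_scale = probability_of_superiority(0.09)    # d reported for full scale IQ
p_hypothetical = probability_of_superiority(0.20)  # hypothetical "small" effect
print(f"d = 0.09: {p_full_scale:.1%}")
print(f"d = 0.20: {p_hypothetical:.1%}")
```

Even at d = 0.20, the lower-scoring group's member wins the comparison about 44 times out of 100, which is why group averages say so little about any one child.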
For Researchers: Average Scores Are Only Part of the Story
This meta-analysis serves as a methodological call to action for future researchers in the field of general intelligence research: variance matters as much as means, and failing to analyze both gives an incomplete picture. Future studies on cognitive ability differences should routinely report not only average score differences but also variance ratios — the ratio of male-to-female score variance — to enable a full assessment of how the distributions compare.
- Report variance, not just means: Variance ratio analysis reveals whether one group has a wider spread of scores, which affects the proportion of each group at the extremes
- Use updated, bias-reviewed instruments: Results from older, less-refined tests should be interpreted cautiously and not generalized to current populations
- Control for socioeconomic and cultural variables: Many apparent cognitive gender differences may be confounded by educational access, cultural expectations, and socioeconomic status
- Conduct multi-national analyses: Cross-cultural replication is essential to determining whether findings reflect biological tendencies or culturally specific patterns
The variability hypothesis in particular deserves dedicated meta-analytic attention of the same scale and rigor as this study applied to mean differences. Until that work is done, the question of whether males truly show greater variance in cognitive ability — and what that would imply — remains genuinely open. Good science demands that we follow the evidence wherever it leads, without being driven by either the desire to find differences or the desire to deny them.
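Reporting a variance ratio alongside a mean difference takes very little extra work. The sketch below computes the ratio of sample variances for two small sets of scores. The scores are invented for illustration, and a real analysis would also need a confidence interval for the ratio, which this sketch omits.

```python
from statistics import mean, variance

def variance_ratio(male_scores, female_scores):
    """Ratio of male to female sample variance (values above 1 mean wider male spread)."""
    return variance(male_scores) / variance(female_scores)

# Hypothetical scores, for illustration only; both groups average exactly 100.
male = [85, 94, 100, 106, 115]
female = [86, 95, 100, 105, 114]

vr = variance_ratio(male, female)
print(f"mean difference: {mean(male) - mean(female):.1f}")
print(f"variance ratio: {vr:.2f}")
```

Here the mean difference is zero while the variance ratio is about 1.18, which is exactly the kind of pattern a means-only report would miss.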
For the General Public: What “A 1.4-Point Difference” Should and Shouldn’t Change
For anyone who has ever wondered whether men or women are “smarter,” the honest and evidence-based answer is: they are, on average, essentially the same — and that average difference of about 1.4 IQ points in full scale IQ scores is far too small to justify any conclusions about individuals. Knowing someone’s gender tells you almost nothing about their intelligence. Knowing their educational history, the quality of their early environment, their specific interests and practice history, and many other factors tells you far more.
- What the data does NOT support: Broad claims that one gender is more intelligent, better at logical reasoning, or naturally suited to cognitively demanding fields
- What the data DOES suggest: Very small average differences in specific domains (slightly higher spatial scores for males; slightly faster processing speed for females), dwarfed by individual variation
- The bottom line: Gender explains only a tiny fraction of variance in IQ — other factors matter far more in determining any individual’s cognitive profile
Cultural narratives about gendered intelligence — the “boys are naturally better at science” or “girls are naturally more verbal” stories — tend to take small statistical signals and amplify them into sweeping generalizations that then shape educational and career trajectories in self-reinforcing ways. The research suggests it is time to retire those narratives in favor of a more accurate, individual-centered view of cognitive ability.
Frequently Asked Questions
Is the gender difference in IQ caused by genetics or environment?
Current research cannot definitively attribute the tiny observed gap in full scale IQ scores to either genetics or environment. The fact that the gap shrinks further, and becomes statistically non-significant, when newer, more carefully designed tests are used suggests that environmental factors and test design play a major role. Cultural expectations, educational access, and socioeconomic conditions all influence cognitive test performance, making it very difficult to isolate any purely biological contribution to the findings of IQ gender differences research.
How small is an effect size of d = 0.09 in real-world terms?
An effect size of d = 0.09 is classified as negligible by standard scientific benchmarks. In IQ terms, it corresponds to approximately 1.4 points on a scale where the population average is 100 and the standard deviation is 15. As a rough analogy, it is similar to a height difference of about 0.6 cm between two groups of adults (assuming a height standard deviation of about 7 cm): statistically detectable with a very large sample, but completely imperceptible in everyday life. The individual variation within each gender group is vastly larger than this group-level difference.
Why do newer versions of the WISC show smaller gender differences?
Newer WISC editions have been developed with more rigorous psychometric methods, including Differential Item Functioning (DIF) analysis, which identifies and removes test items that perform differently for different demographic groups. Older versions were created in cultural contexts where gender norms were more rigid, and some items may have inadvertently favored male or female test-takers. As test construction improved, these incidental biases were reduced — and so was the apparent gender gap in cognitive ability testing.
What is the variability hypothesis and why does it matter?
The variability hypothesis is the scientific proposition that males tend to show greater variance (spread) in cognitive test scores than females. If true, this means that even when average scores are nearly identical, more males than females would appear at both the very high and very low ends of the IQ distribution. This could partly explain why males are overrepresented in both gifted programs and intellectual disability diagnoses. It is an important complement to average-score comparisons, and researchers have called for more dedicated study of this aspect of IQ gender differences research.
Is a sample of 46,000 participants large enough to trust these results?
Yes — a combined sample of 46,605 participants drawn from 79 independent studies is exceptionally large for this research area. Larger samples reduce the risk that results are distorted by random variation or sample-specific characteristics. The meta-analytic design also means the findings represent diverse populations across multiple countries and decades. Together, these features make the conclusions more reliable than any single study could produce, though the authors appropriately acknowledged areas — particularly variance analysis — where further research is still needed.
Do men or women score higher on specific cognitive tasks?
Research suggests small differences in specific domains. Males tend to score slightly higher on visual processing and spatial reasoning tasks, while females tend to score slightly higher on processing speed measures. However, all of these domain-level differences are small in magnitude — generally in the negligible range by scientific standards — and the overlap between male and female score distributions is extremely large. Individual variation within each gender is far greater than average differences between genders, making gender a poor predictor of any individual’s cognitive strengths.
Does this research mean gender plays no role in intelligence at all?
Not quite. The research indicates that gender plays a very small role in determining average full scale IQ scores — so small that it has virtually no practical significance. However, the question of whether males show greater variance in score distribution (the variability hypothesis) remains an open research question. Small domain-specific tendencies also appear to exist. A more accurate statement is: gender is a very weak predictor of individual cognitive ability, and other factors — education, environment, socioeconomic status — matter far more in shaping any person’s intelligence profile.
Summary: What IQ Gender Differences Research Really Tells Us
After analyzing data from 79 studies and more than 46,000 children and adolescents, the conclusion from this landmark meta-analysis is clear: the overall difference in full scale IQ scores between males and females is negligible — approximately 1.4 IQ points, with an effect size of d = 0.09. When analysis is restricted to newer, more carefully designed versions of the WISC, even this tiny difference becomes statistically non-significant. Small domain-specific patterns do emerge — slightly higher spatial processing in males, slightly faster processing speed in females — but none of these are large enough to support sweeping generalizations about gendered intelligence.
What the data does highlight is the importance of how questions are asked. The version of the test used, the domains measured, the populations sampled, and whether variance as well as means are analyzed all shape what researchers find. IQ gender differences research has historically been muddied by methodological limitations, including reliance on older tests that may have contained inadvertent biases and a near-universal focus on average scores while ignoring distributional variance. Future research that addresses both of these gaps will give us a much richer — and much more accurate — understanding of how cognitive ability is distributed across human populations.
The most important takeaway for anyone reading this — student, parent, educator, or researcher — is that gender tells you very little about the intellectual potential of the person in front of you. Individual variation is the dominant story in cognitive ability, not group averages. If this topic has made you curious about your own cognitive profile and how it compares across different domains, exploring a well-validated intelligence assessment can be a fascinating way to see where your individual strengths genuinely lie — independent of any gender-based expectation.
