Allele frequency calculator — Hardy-Weinberg population genetics
Allele frequency is the proportion of a specific allele in the gene pool. For a two-allele locus, p is the frequency of the dominant allele and q is the frequency of the recessive allele, with p + q = 1. From genotype counts, p = (2 × n_AA + n_Aa) / 2N, where N is the total individual count. Under Hardy-Weinberg equilibrium, genotype frequencies follow p² + 2pq + q² = 1.
The Hardy-Weinberg principle is the null model of population genetics. It predicts genotype frequencies in a population where mutation, selection, migration, drift, and non-random mating are all absent. When a real population deviates significantly from Hardy-Weinberg expectations, one or more of those five forces is usually at work (once genotyping and sampling errors have been ruled out), which makes Hardy-Weinberg the most useful baseline in the field.
What is allele frequency?
An allele is one of two or more alternative forms of a gene at a single locus. Allele frequency is the fraction of all copies of that gene in a population that are a particular allele. If 60 percent of all gene copies at a locus are allele A and 40 percent are allele a, then p = 0.60 and q = 0.40.
Counting allele frequency requires counting alleles, not individuals. A diploid organism has two copies at each locus, so a population of 100 individuals has 200 alleles at any given locus. A homozygous AA individual contributes two A alleles; a heterozygous Aa contributes one of each; a homozygous aa contributes two a alleles.
The Hardy-Weinberg principle was published in 1908 by two scientists working independently: G.H. Hardy, a Cambridge mathematician who regarded the result as mathematically simple, and Wilhelm Weinberg, a German physician who approached it from clinical work. The relationship their separate papers described became the founding equation of population genetics.
The allele frequency formula
From observed genotype counts, the formulas for p and q are short. Each homozygous AA individual contributes 2 A alleles; each heterozygote contributes 1 A and 1 a; each homozygous aa contributes 2 a alleles. Sum up and divide by the total allele count (2N).
p = (2·n_AA + n_Aa) / 2N
q = (2·n_aa + n_Aa) / 2N
p + q = 1
Expected AA = p² × N
Expected Aa = 2pq × N
Expected aa = q² × N
For a sample of 200 individuals with 80 AA, 96 Aa, 24 aa: p = (160 + 96) / 400 = 0.64. q = 1 − p = 0.36. Cross-check by the recessive formula: q = (48 + 96) / 400 = 0.36.
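The counting formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name allele_frequencies is a hypothetical choice made for this example.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q) from genotype counts at a two-allele locus."""
    n_individuals = n_AA + n_Aa + n_aa
    if n_individuals == 0:
        raise ValueError("need at least one genotyped individual")
    total_alleles = 2 * n_individuals  # diploid: two gene copies per individual
    p = (2 * n_AA + n_Aa) / total_alleles  # AA carries two A, Aa carries one
    q = (2 * n_aa + n_Aa) / total_alleles  # aa carries two a, Aa carries one
    return p, q


# Worked example from the text: 80 AA, 96 Aa, 24 aa
p, q = allele_frequencies(80, 96, 24)
print(f"p = {p:.2f}, q = {q:.2f}")  # p = 0.64, q = 0.36
```

Computing q from its own count rather than as 1 − p gives a built-in cross-check: the two values should sum to 1.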
Hardy-Weinberg expected genotype frequencies
Once you know p and q, the Hardy-Weinberg expectation gives the genotype frequencies that the population should show under five conditions: no mutation, random mating, no selection at the locus, no migration, and infinite (or very large) population size. Expected frequencies are p² for AA, 2pq for Aa, and q² for aa.
The 2 in the heterozygote term comes from two pathways. An Aa individual can be made by an A egg meeting an a sperm, or an a egg meeting an A sperm. Both are equally likely under random mating, so heterozygote frequency is 2 × p × q rather than p × q.
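As a rough sketch, the expected genotype counts follow directly from p and the sample size. The function name hw_expected_counts is hypothetical, and the inputs (p = 0.64, N = 200) are taken from the worked example earlier in this article.

```python
def hw_expected_counts(p, n_individuals):
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return {
        "AA": p * p * n_individuals,
        "Aa": 2 * p * q * n_individuals,  # two pathways: A egg + a sperm, a egg + A sperm
        "aa": q * q * n_individuals,
    }


expected = hw_expected_counts(0.64, 200)
for genotype, count in expected.items():
    print(f"{genotype}: {count:.2f}")
```

The three expected counts always sum to N, because p² + 2pq + q² = (p + q)² = 1.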
Chi-square test for allele frequency equilibrium
Real populations rarely match Hardy-Weinberg expectations exactly. The chi-square test asks whether the deviation between observed and expected counts is large enough to reject the null. The formula sums (observed − expected)² ÷ expected across all three genotypes. The result is compared against the critical value for one degree of freedom at the chosen significance level (3.841 at α = 0.05). There is one degree of freedom because there are three genotype classes, minus one for the fixed total, minus one for the allele frequency estimated from the same data.
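The test can be sketched as follows, under the assumption of a two-allele locus with counts for all three genotypes. The function name hw_chi_square and the constant name are hypothetical; the critical value 3.841 is the one quoted above.

```python
def hw_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic for deviation from Hardy-Weinberg expectations."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # A monomorphic sample has zero expected counts, so the statistic is undefined.
        raise ValueError("locus is monomorphic in this sample")
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_1DF_ALPHA_05 = 3.841  # one degree of freedom, alpha = 0.05

stat = hw_chi_square(80, 96, 24)
reject = stat > CRITICAL_1DF_ALPHA_05
print(f"chi-square = {stat:.3f}, reject HWE: {reject}")
```

For the 80 / 96 / 24 sample used earlier, the statistic is about 0.35, well below 3.841, so the sample is consistent with Hardy-Weinberg equilibrium.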
Allele frequencies are expressed as decimal fractions between 0 and 1, not percentages. A frequency of 0.60 means 60 percent of alleles are the dominant form, but writing it as "60 %" in the formula breaks the math: p × q = 60 × 40 = 2,400 instead of 0.24.
From allele frequency to carrier rate
For a rare recessive disease, the affected (homozygous recessive) frequency equals q². The carrier (heterozygous) frequency is 2pq. When q is small, p is close to 1, so the carrier frequency simplifies to roughly 2q. This is the standard back-calculation in genetic counseling.
- Cystic fibrosis: affected rate 1 in 2,500 (Europeans). q = √(1/2500) = 0.02. Carrier rate ≈ 2q = 0.04 (1 in 25).
- Sickle cell disease: affected rate 1 in 400 (African Americans in the US). q = 0.05. Carrier rate ≈ 0.10 (1 in 10).
- Tay-Sachs: affected rate 1 in 3,500 (Ashkenazi Jewish). q = 0.017. Carrier rate ≈ 0.034 (1 in 30).
- Phenylketonuria (PKU): affected rate 1 in 10,000. q = 0.01. Carrier rate ≈ 0.02 (1 in 50).
- Albinism: affected rate 1 in 17,000 worldwide. q = 0.0077. Carrier rate ≈ 0.015 (1 in 65).
- Galactosemia: affected rate 1 in 60,000. q = 0.0041. Carrier rate ≈ 0.0082 (1 in 122).
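The back-calculation behind the list above can be sketched like this, assuming the locus is in Hardy-Weinberg equilibrium so that the affected rate equals q². The function name carrier_rate is hypothetical; it returns both the exact 2pq and the 2q shortcut so the two can be compared.

```python
import math


def carrier_rate(affected_one_in):
    """Return (q, exact 2pq, approximate 2q) for an affected rate of 1 in `affected_one_in`.

    Assumes Hardy-Weinberg equilibrium, so affected frequency = q**2.
    """
    q = math.sqrt(1.0 / affected_one_in)
    p = 1.0 - q
    return q, 2 * p * q, 2 * q


# Cystic fibrosis figure from the list above: 1 in 2,500 affected
q, exact, approx = carrier_rate(2500)
print(f"q = {q:.3f}, 2pq = {exact:.4f}, 2q = {approx:.4f}")
print(f"carriers: about 1 in {1 / approx:.0f}")
```

The gap between 2pq and 2q shrinks as q gets smaller, which is why the shortcut is acceptable for rare diseases and less so for common ones.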
Allele frequencies in human populations
Some alleles vary dramatically by population. The lactase persistence allele LCT-13910T reaches a frequency of 0.7–0.9 in Northern Europeans, 0.2–0.4 in Middle Eastern populations, and below 0.05 in East Asians. Lactase persistence is dominant, so under the convention used in this article that frequency is p rather than q. This is one of the strongest signatures of recent natural selection in the human genome, the result of roughly 7,000 years of dairy farming.
When comparing allele frequencies between populations, a common approach is to run a principal component analysis (PCA) on genome-wide variants and plot the first two components. Populations tend to cluster by geographic ancestry, with frequency differences that map closely to migration history. The 1000 Genomes Project provides allele frequency data for 2,500+ individuals from 26 populations and is a widely used reference panel.
What changes allele frequency over time
Five evolutionary forces move a population away from Hardy-Weinberg expectations, and four of them shift allele frequencies directly. Mutation introduces new alleles at low rates (~10⁻⁸ per base per generation in humans). Natural selection raises beneficial alleles and lowers harmful ones, sometimes by 10–50 percent per generation under strong pressure. Migration averages frequencies between populations. Genetic drift causes random fluctuations, with stronger effects in small populations. The fifth, non-random mating (assortative mating or inbreeding), shifts genotype frequencies without changing allele frequencies.
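To make the drift claim concrete, here is a small simulation sketch. It assumes a Wright-Fisher model (each generation's 2N gene copies are drawn at random from the previous generation's allele frequency), which is a standard textbook idealization and not something this article specifies. The function name, population sizes, and seed are all hypothetical choices for illustration.

```python
import random


def wright_fisher(p0, n_individuals, generations, seed=0):
    """Simulate drift at one locus; return the allele frequency trajectory."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each gene copy in the next generation is A with probability p.
        n_A = sum(rng.random() < p for _ in range(n_copies))
        p = n_A / n_copies
        trajectory.append(p)
    return trajectory


small = wright_fisher(p0=0.5, n_individuals=20, generations=50)
large = wright_fisher(p0=0.5, n_individuals=2000, generations=50)
print(f"N = 20:   final p = {small[-1]:.3f}")
print(f"N = 2000: final p = {large[-1]:.3f}")
```

Across repeated runs with different seeds, the small population typically wanders much further from 0.5 than the large one, and it can lose one allele entirely. Once p reaches 0 or 1 it stays there, because no mutation or migration is modeled.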
Common allele frequency mistakes
Three errors repeat in introductory genetics. First, confusing allele frequency with genotype frequency: p is per allele, not per individual. Second, assuming the dominant allele must be the more common one; dominance is about phenotype, not frequency. Many recessive alleles are more common than the dominant form. Lactase non-persistence in adults, for example, is recessive yet is the majority state worldwide. Third, applying Hardy-Weinberg to admixed populations: if your sample mixes two ancestry groups with different allele frequencies, the combined sample shows excess homozygotes (the Wahlund effect) even when each subgroup is in HWE separately. One practical fix is to stratify by self-reported ancestry before running the test, then combine the per-stratum results using a Mantel-Haenszel approach. Small sample sizes also bite: with N below about 50, or whenever an expected genotype count is very low, the chi-square approximation becomes unreliable, and an exact test (Guo and Thompson 1992) replaces it as the standard.
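The Wahlund effect is easy to see numerically. The sketch below pools two subpopulations that are each in Hardy-Weinberg equilibrium; the allele frequencies (0.2 and 0.8) and the equal sample weights are made-up values chosen only to make the effect obvious, and the function names are hypothetical.

```python
def hw_genotype_freqs(p):
    """Genotype frequencies (AA, Aa, aa) under Hardy-Weinberg for allele frequency p."""
    q = 1.0 - p
    return (p * p, 2 * p * q, q * q)


def pooled_genotype_freqs(subpop_ps, weights):
    """Mix subpopulations that are each in HWE; weights are sample fractions summing to 1."""
    pooled = [0.0, 0.0, 0.0]
    for p, w in zip(subpop_ps, weights):
        for i, freq in enumerate(hw_genotype_freqs(p)):
            pooled[i] += w * freq
    return tuple(pooled)


subpop_ps = [0.2, 0.8]  # hypothetical allele frequencies in two ancestry groups
weights = [0.5, 0.5]    # equal-sized samples from each group

observed = pooled_genotype_freqs(subpop_ps, weights)
pooled_p = sum(w * p for p, w in zip(subpop_ps, weights))
expected = hw_genotype_freqs(pooled_p)  # what HWE predicts from the pooled p

print("pooled (AA, Aa, aa):  ", tuple(round(x, 3) for x in observed))
print("HWE at pooled p:      ", tuple(round(x, 3) for x in expected))
print("heterozygote deficit: ", round(expected[1] - observed[1], 3))
```

With these inputs the pooled sample has heterozygote frequency 0.32 against an expectation of 0.50, so a Hardy-Weinberg test on the combined sample would flag a deviation even though neither subgroup deviates on its own.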