Article — T-Test Calculator (Two-Sample, Welch and Pooled)
T-test calculator — two-sample comparison
A two-sample t-test compares the means of two groups and returns a t-statistic plus a p-value. If p is below your significance threshold (commonly 0.05), a difference this large would be unlikely if the population means were equal, so you reject the null hypothesis. This calculator runs both Welch's t-test and the classical pooled version from summary statistics.
The t-test was published by William Sealy Gosset in 1908 under the pseudonym Student. He invented it while working at Guinness brewery in Dublin, where he needed to compare small samples of barley and yeast. Guinness considered statistics a trade secret, so Gosset published under a pseudonym to bypass the company's policy. The Student's t-test got its name from that workaround.
What is a t-test?
A t-test answers one question: given the means and variability of two samples, how likely is it that we would see this difference (or larger) if the underlying populations were actually the same? The output is the t-statistic — how many standard errors separate the two sample means — and the p-value that converts that into a probability.
Three flavors exist: one-sample (compare to a fixed number), two-sample independent (compare two unrelated groups), and paired (compare matched pairs, like before-and-after measurements). This calculator handles the two-sample independent case from summary statistics.
Gosset developed the t-distribution because methods based on the normal distribution were reliable only for large samples, and he had only small samples to work with. The t-distribution accounts for the extra uncertainty of estimating the standard deviation from limited data, which makes it possible to test means rigorously with sample sizes as small as 5.
The two-sample t-test formula
The core formula is t = (mean difference) / (standard error of the difference). The standard error depends on whether you assume equal or unequal variances.
- t (Welch): (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
- t (pooled): (x̄₁ − x̄₂) / [sₚ · √(1/n₁ + 1/n₂)]
- df (pooled): n₁ + n₂ − 2
- df (Welch): Welch-Satterthwaite formula
- Cohen's d: (x̄₁ − x̄₂) / sₚ
The Welch-Satterthwaite degrees of freedom formula is uglier but more accurate when variances differ:
df = (s₁²/n₁ + s₂²/n₂)² ÷ [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]
The result is usually fractional and falls between min(n₁, n₂) − 1 and n₁ + n₂ − 2.
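The formulas above can be sketched in a few lines of Python using only the standard library. The function names `welch_t` and `pooled_t` are illustrative and not part of the calculator. The pooled standard deviation is taken to be the usual sₚ = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁ + n₂ − 2)], which the text does not spell out.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t and Welch-Satterthwaite df from summary statistics."""
    v1 = sd1 ** 2 / n1          # squared standard error of mean 1
    v2 = sd2 ** 2 / n2          # squared standard error of mean 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Classical (Student) t and df, assuming equal population variances."""
    df = n1 + n2 - 2
    # Assumed standard definition of the pooled standard deviation.
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, df

t_w, df_w = welch_t(50, 10, 30, 55, 12, 30)
t_p, df_p = pooled_t(50, 10, 30, 55, 12, 30)
print(round(t_w, 3), round(df_w, 2))  # -1.753 56.17
print(round(t_p, 3), df_p)            # -1.753 58
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom differ, which is why the choice between Welch and pooled matters most when n₁ and n₂ are unequal.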
Welch's t-test vs pooled t-test
The pooled version, sometimes called Student's t-test, assumes both populations have the same variance. When that assumption holds, it has slightly more statistical power. The Welch t-test makes no equal-variance assumption and remains valid even when sample sizes and variances differ.
Modern statistical practice favors Welch by default. R uses it as the default for t.test(). Several published recommendations (Ruxton 2006, Delacre, Lakens & Leys 2017) argue Welch should be the standard choice unless equal variance is verified or theoretically guaranteed.
T-test worked example
Suppose group A has mean 50, standard deviation 10, n = 30. Group B has mean 55, standard deviation 12, n = 30. Mean difference is −5. Standard error (Welch) = √(100/30 + 144/30) = √8.133 = 2.852. t = −5 / 2.852 = −1.753. Welch-Satterthwaite df works out to about 56.2. Two-tailed p ≈ 0.085.
At α = 0.05 you fail to reject the null. The samples are consistent with the population means being equal. Cohen's d = −5 / √((100+144)/2) = −0.453, a small-to-medium effect. If the true effect is this size, larger samples would likely make it statistically significant.
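The p-value in this example can be checked without a statistics library. The sketch below integrates the Student t density numerically with Simpson's rule. This is one possible approach, not necessarily how the calculator computes it, and the names `t_pdf` and `two_tailed_p` are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution; df may be fractional."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # area under the density from 0 to |t|
    return max(0.0, 1 - 2 * central)

# Worked example: t is about -1.753 with roughly 56 degrees of freedom.
print(round(two_tailed_p(-1.753, 56.2), 3))  # 0.085
```

For very large |t| the subtraction loses relative precision, so a dedicated library routine is the better choice when extremely small p-values matter.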
P-value interpretation
The p-value is the probability of observing a result at least as extreme as the data, assuming the null hypothesis is true. It is NOT the probability that the null hypothesis is true — that is one of the most common misinterpretations in science.
A tiny effect with a huge sample can produce a p-value of 0.001 while being practically meaningless. A massive effect with a small sample can produce p = 0.10 and look "non-significant". Always report effect size alongside p-values.
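A short sketch makes the point concrete. When both groups have the same size n, the pooled t-statistic equals d · √(n/2), so for a fixed effect size t grows with the square root of the sample size. The function name `t_from_d` and the example numbers are illustrative assumptions.

```python
import math

def t_from_d(d, n_per_group):
    """Pooled t implied by Cohen's d when both groups have n_per_group observations."""
    return d * math.sqrt(n_per_group / 2)

# Tiny effect, huge sample: t is far beyond any conventional cutoff.
print(round(t_from_d(0.02, 200_000), 2))  # 6.32

# Large effect, small sample: t falls short of the two-tailed 5% cutoff,
# which is about 2.145 for df = 14.
print(round(t_from_d(0.8, 8), 2))         # 1.6
```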
Cohen's d and effect size
Cohen's d standardizes the mean difference by the pooled standard deviation. It is independent of sample size and gives the practical magnitude of the effect. Jacob Cohen's 1988 conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large.
The American Psychological Association's Publication Manual and many peer-reviewed journals now expect effect sizes alongside p-values. A statistically significant result with d = 0.05 may not be worth pursuing. A non-significant result with d = 0.7 in a small sample is worth replicating with more participants.
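As a minimal sketch, Cohen's d can be computed from the same summary statistics the calculator takes. The pooled standard deviation below is the usual weighted form, and the helper `benchmark_label` is a hypothetical convenience that bins |d| at Cohen's 0.2, 0.5 and 0.8 benchmarks. The bins are a rough convention, not sharp boundaries.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference divided by the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sp

def benchmark_label(d):
    """Hypothetical helper: crude label based on Cohen's 1988 benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

d = cohens_d(50, 10, 30, 55, 12, 30)
print(round(d, 3), benchmark_label(d))  # -0.453 small
```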
T-test assumptions and limits
The two-sample t-test rests on three assumptions: each sample is drawn from a normally distributed population (or n is large enough for the Central Limit Theorem to apply), observations are independent, and variances are equal (for the pooled version only).
- Normality matters less as n grows. With n ≥ 30 per group, modest skew is tolerable.
- Independence is critical. Repeated measurements on the same subjects need a paired test, not an independent-samples test.
- Outliers distort the mean and inflate the standard deviation. Inspect data first.
- Variance equality is only a pooled-test concern; Welch handles unequal variance.
- Very small samples (n < 5) give the test little power and leave no practical way to check normality, so results are fragile.
If normality is doubtful and samples are small, use a Mann-Whitney U test (also called Wilcoxon rank-sum) instead of a t-test. It does not assume normality and compares the rank distributions rather than the means.
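For readers who want to see what the rank-based alternative computes, here is a minimal sketch of the Mann-Whitney U statistic from raw observations, with tied values given their average rank. It returns only the U values and does not compute a p-value. The function name `mann_whitney_u` is illustrative.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two samples, using average ranks for ties."""
    # Pool the observations, tagging each with its group (0 for a, 1 for b).
    pooled = sorted((value, group)
                    for group, sample in enumerate((a, b))
                    for value in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2   # mean of ranks i+1 .. j for this tie block
        count_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += avg_rank * count_a
        i = j
    n1, n2 = len(a), len(b)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    return u_a, u_b

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 9.0)
print(mann_whitney_u([1, 2], [2, 3]))        # (0.5, 3.5)
```

A U of 0 for one group means every one of its observations ranks below every observation in the other group, while values near n₁n₂/2 indicate heavily overlapping samples.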
Common t-test mistakes
Four traps catch students and researchers. First, choosing a one-tailed test after seeing the data, which inflates Type I error. Second, ignoring multiple comparisons: running 20 t-tests at α = 0.05 yields about one false positive on average even when every null hypothesis is true. Third, conflating "fail to reject" with "evidence of equality". Fourth, reporting only the p-value without confidence intervals or effect sizes.
A fifth, increasingly recognized error is p-hacking — running tests with different subgroups, transformations, or exclusion rules until a significant p emerges. The 2016 ASA statement on p-values explicitly called out this practice. Pre-register your analysis plan when stakes are high, or use multiple-comparison corrections like Bonferroni or false discovery rate control.
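The corrections mentioned above can be sketched as follows. The familywise error formula assumes independent tests, the Benjamini-Hochberg step-up procedure is used here as one common way to control the false discovery rate, and the p-values in the example are made up for illustration.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent true-null tests."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Step-up FDR control: reject the k smallest p-values, where k is the
    largest rank whose p-value is at or below rank / m * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

print(round(familywise_error(0.05, 20), 3))  # 0.642

p_values = [0.001, 0.012, 0.025, 0.041, 0.20]  # illustrative values only
print(bonferroni(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, False, False]
```

Bonferroni is the stricter of the two. False discovery rate control tolerates some false positives in exchange for more power when many hypotheses are tested.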
Finally, reverse-engineering a t-test from a published article when only some statistics are reported can be tricky. If you have the t-statistic and total sample size, you can derive a usable estimate of effect size. If only the p-value is given, the test type and degrees of freedom are needed to back-calculate the t-statistic. Always demand sufficient detail in research reports to allow such verification.
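As a rough sketch of the first case, a pooled t-statistic converts to Cohen's d through d = t · √(1/n₁ + 1/n₂). When only the total sample size N is reported, assuming two equal groups gives d ≈ 2t / √N. Both function names are illustrative, and the equal-groups shortcut is an assumption that should be stated whenever it is used.

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d recovered from a pooled t-statistic and both group sizes."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def d_from_t_total(t, total_n):
    """Approximate d when only total N is known, ASSUMING two equal groups."""
    return 2 * t / math.sqrt(total_n)

print(round(d_from_t(-1.7532, 30, 30), 3))    # -0.453
print(round(d_from_t_total(-1.7532, 60), 3))  # -0.453
```

If the groups were in fact unequal, the equal-groups shortcut understates |d|, so treat the result as an estimate.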