T-Test Calculator (Two-Sample, Welch and Pooled)

Two-sample t-test calculator.

Instructions — T-Test Calculator (Two-Sample, Welch and Pooled)

Pick the variance assumption: Welch (recommended, no equal-variance assumption) or Pooled (classical, assumes equal variances).
Enter mean, standard deviation, and sample size for each of the two samples.
Set the significance level α (0.05 is the default for most fields).

Results: t-statistic, degrees of freedom, two-tailed p-value, Cohen's d, and whether to reject the null hypothesis H₀: μ₁ = μ₂.

Formulas

Welch's t-test (unequal variances)

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

Welch-Satterthwaite degrees of freedom

df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]
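As a minimal sketch, the two Welch formulas above can be evaluated from summary statistics in a few lines of Python. The function name welch_t and its argument names are illustrative choices, not part of the calculator.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's t-test from summary statistics."""
    v1 = sd1 ** 2 / n1   # s1^2 / n1
    v2 = sd2 ** 2 / n2   # s2^2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Example inputs: group 1 = (50, 10, 30), group 2 = (55, 12, 30)
t, df = welch_t(50, 10, 30, 55, 12, 30)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The degrees of freedom come back as a fraction, which is expected for Welch's test.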

Pooled t-test (equal variances)

t = (x̄₁ − x̄₂) / [sₚ × √(1/n₁ + 1/n₂)]
sₚ² = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁ + n₂ − 2)

df = n₁ + n₂ − 2.
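The pooled version can be sketched the same way. The helper name pooled_t is hypothetical; it returns the pooled standard deviation sₚ as well, because Cohen's d reuses it.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df, sp) for the pooled (Student's) t-test."""
    df = n1 + n2 - 2
    # Pooled variance: each sample variance weighted by its own degrees of freedom
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    sp = math.sqrt(sp2)
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, df, sp

t, df, sp = pooled_t(50, 10, 30, 55, 12, 30)
print(f"t = {t:.3f}, df = {df}, sp = {sp:.3f}")
```

When the two groups have the same size, the pooled and Welch t-statistics coincide and only the degrees of freedom differ.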

Cohen's d (effect size)

d = (x̄₁ − x̄₂) / sₚ

Conventional thresholds: |d| < 0.2 = negligible, 0.2 = small, 0.5 = medium, 0.8 = large.
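A small sketch of the effect-size calculation and its labels follows. The function names are illustrative, and reading each threshold as the lower edge of a band (for example, 0.2 ≤ |d| < 0.5 is "small") is an assumption about how the convention is applied.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sp

def effect_label(d):
    """Map |d| to a conventional label (band edges are an assumption)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

d = cohens_d(50, 10, 30, 55, 12, 30)
print(f"d = {d:.3f} ({effect_label(d)})")
```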

Reference

P-value interpretation

p	Decision (α = 0.05)	Strength
p < 0.001	Reject H₀	Very strong evidence
0.001 ≤ p < 0.01	Reject H₀	Strong evidence
0.01 ≤ p < 0.05	Reject H₀	Moderate evidence
0.05 ≤ p < 0.10	Fail to reject	Suggestive (marginal)
p ≥ 0.10	Fail to reject	No evidence

When to choose Welch vs Pooled

Welch's t-test: default in R and in modern statistics texts. Robust when variances differ. No equal-variance assumption to check.
Pooled t-test: historical default. Slightly more powerful when variances are truly equal. The equal-variance assumption should be checked first, for example with Levene's test.

Article — T-Test Calculator (Two-Sample, Welch and Pooled)

T-test calculator — two-sample comparison

In this article

What is a t-test?
The two-sample t-test formula
Welch's t-test vs pooled t-test
T-test worked example
P-value interpretation
Cohen's d and effect size
T-test assumptions and limits
Common t-test mistakes

A two-sample t-test compares the means of two groups and returns a t-statistic plus a p-value. If p is below your significance threshold (commonly 0.05), a difference this large would be unlikely if the population means were equal, so you reject the null hypothesis. This calculator runs both Welch's t-test and the classical pooled version from summary statistics.

The t-test was published by William Sealy Gosset in 1908 under the pseudonym Student. He developed it while working at the Guinness brewery in Dublin, where he needed to compare small samples of barley and yeast. Guinness treated its methods as trade secrets and did not let staff publish under their own names, so Gosset published under a pseudonym. Student's t-test takes its name from that pseudonym.

What is a t-test?

A t-test answers one question: given the means and variability of two samples, how likely is it that we would see this difference (or larger) if the underlying populations were actually the same? The output is the t-statistic — how many standard errors separate the two sample means — and the p-value that converts that into a probability.

Three flavors exist: one-sample (compare to a fixed number), two-sample independent (compare two unrelated groups), and paired (compare matched pairs, like before-and-after measurements). This calculator handles the two-sample independent case from summary statistics.

Did you know

William Sealy Gosset invented the t-distribution in 1908 because brewers needed to test small samples of grain. The normal distribution required large samples, and Gosset only had a few dozen plants to work with. His Student's t-distribution gave a way to test means rigorously with sample sizes as small as 5.

The two-sample t-test formula

The core formula is t = (mean difference) / (standard error of the difference). The standard error depends on whether you assume equal or unequal variances.

T-test essentials

t (Welch): (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
t (pooled): (x̄₁ − x̄₂) / [sₚ · √(1/n₁ + 1/n₂)]
df (pooled): n₁ + n₂ − 2
df (Welch): Welch-Satterthwaite formula
Cohen's d: (x̄₁ − x̄₂) / sₚ

The Welch-Satterthwaite degrees of freedom formula is uglier but more accurate when variances differ: df = (s₁²/n₁ + s₂²/n₂)² ÷ [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. The result is usually fractional and falls between min(n₁, n₂) − 1 and n₁ + n₂ − 2.
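The bounds on the Welch degrees of freedom can be checked numerically. The sketch below uses made-up group sizes and standard deviations, chosen only to show the df moving toward the lower bound as one group's variance dominates.

```python
def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite degrees of freedom from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

n1, n2 = 30, 12                     # hypothetical group sizes
lower = min(n1, n2) - 1             # 11
upper = n1 + n2 - 2                 # 40

results = []
for sd1, sd2 in [(10, 10), (10, 30), (10, 1000)]:
    df = welch_df(sd1, n1, sd2, n2)
    results.append(df)
    print(f"sd1 = {sd1}, sd2 = {sd2}: df = {df:.2f} (bounds {lower} to {upper})")
```

As the second group's standard deviation grows, its term dominates the standard error and the df approaches n₂ − 1.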

Welch's t-test vs pooled t-test

The pooled version, sometimes called Student's t-test, assumes both populations have the same variance. When that assumption holds, it has slightly more statistical power. The Welch t-test makes no equal-variance assumption and remains valid even when sample sizes and variances differ.

Welch's t-test: default in R; unequal variances OK.

Pooled t-test: classical Student's test; needs equal variances.

Modern statistical practice favors Welch by default. R uses it as the default for t.test(). Several published recommendations (Ruxton 2006, Delacre, Lakens & Leys 2017) argue Welch should be the standard choice unless equal variance is verified or theoretically guaranteed.

T-test worked example

Suppose group A has mean 50, standard deviation 10, n = 30, and group B has mean 55, standard deviation 12, n = 30. The mean difference is −5. Standard error (Welch) = √(100/30 + 144/30) = √8.133 = 2.852, so t = −5 / 2.852 = −1.753. The Welch-Satterthwaite df works out to about 56.2, and the two-tailed p ≈ 0.085.

At α = 0.05 you fail to reject the null: the samples are consistent with the population means being equal. Cohen's d = −5 / √((100+144)/2) = −0.453, a small-to-medium effect. If the true effect is this size, larger samples would likely reach statistical significance.
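The example can be recomputed end to end with the standard library alone. The sketch below is one possible approach, not the calculator's own method: it gets the two-tailed p-value by integrating the Student's t density numerically with Simpson's rule, and the function names t_pdf and two_tailed_p are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3      # P(-|t| < T < |t|), using symmetry
    return 1 - central

m1, s1, n1 = 50, 10, 30
m2, s2, n2 = 55, 12, 30
v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2

t = (m1 - m2) / math.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p = two_tailed_p(t, df)
sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
d = (m1 - m2) / sp

print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.3f}, d = {d:.3f}")
```

A statistics library would normally supply the t distribution's tail probability directly; the numerical integration here only keeps the sketch dependency-free.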

P-value interpretation

The p-value is the probability of observing a result at least as extreme as the data, assuming the null hypothesis is true. It is NOT the probability that the null hypothesis is true — that is one of the most common misinterpretations in science.

P-value is not effect size

A tiny effect with a huge sample can produce a p-value of 0.001 while being practically meaningless. A massive effect with a small sample can produce p = 0.10 and look "non-significant". Always report effect size alongside p-values.
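To see why, note that with equal group sizes the pooled t-statistic is t = d · √(n/2), so a fixed effect size yields an ever larger t as n grows. The sketch below holds d at 0.05 and uses a normal approximation for the p-value, which is an assumption that is only reasonable for large samples; the function names are illustrative.

```python
import math

def t_from_d(d, n_per_group):
    """t-statistic implied by effect size d with equal group sizes (pooled test)."""
    return d * math.sqrt(n_per_group / 2)

def approx_two_tailed_p(t):
    """Normal approximation to the two-tailed p-value (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))

d = 0.05  # a very small effect
for n in (100, 1000, 10000):
    t = t_from_d(d, n)
    print(f"n = {n:>5} per group: t = {t:.2f}, approx p = {approx_two_tailed_p(t):.4f}")
```

The effect size never changes in this loop; only the sample size does, yet the p-value moves from clearly non-significant to well below 0.001.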

Cohen's d and effect size

Cohen's d standardizes the mean difference by the pooled standard deviation. It is independent of sample size and gives the practical magnitude of the effect. Jacob Cohen's 1988 conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large.

The American Psychological Association's reporting standards and many peer-reviewed journals now call for effect sizes alongside p-values. A statistically significant result with d = 0.05 may not be worth pursuing. A non-significant result with d = 0.7 in a small sample is worth replicating with more participants.

T-test assumptions and limits

The two-sample t-test rests on three assumptions: each sample is drawn from a normally distributed population (or n is large enough for the Central Limit Theorem to apply), observations are independent, and variances are equal (for the pooled version only).

Normality matters less as n grows. With n ≥ 30 per group, modest skew is tolerable.
Independence is critical. Repeated measurements on the same subjects need a paired test, not an independent-samples test.
Outliers distort the mean and inflate the standard deviation. Inspect data first.
Variance equality is only a pooled-test concern; Welch handles unequal variance.
Very small samples (n < 5 per group) leave the test with little power and give no realistic way to check normality, so results are fragile even when the populations really are normal.

Tip

If normality is doubtful and samples are small, use a Mann-Whitney U test (also called Wilcoxon rank-sum) instead of a t-test. It does not assume normality and compares the rank distributions rather than the means.

Common t-test mistakes

Four traps catch students and researchers. First, switching to a one-tailed test after seeing the data, which inflates Type I error. Second, ignoring multiple comparisons: if you run 20 t-tests at α = 0.05 and every null hypothesis is true, you should expect about one false positive by chance. Third, conflating "fail to reject" with "evidence of equality". Fourth, reporting only the p-value without confidence intervals or effect sizes.

A fifth, increasingly recognized error is p-hacking — running tests with different subgroups, transformations, or exclusion rules until a significant p emerges. The 2016 ASA statement on p-values explicitly called out this practice. Pre-register your analysis plan when stakes are high, or use multiple-comparison corrections like Bonferroni or false discovery rate control.
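The arithmetic behind the multiple-comparison warning can be sketched briefly. The sketch assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def expected_false_positives(alpha, m):
    """Expected number of false positives across m tests when all nulls are true."""
    return alpha * m

def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / m

alpha, m = 0.05, 20
print(f"expected false positives: {expected_false_positives(alpha, m):.1f}")
print(f"chance of at least one:   {familywise_error_rate(alpha, m):.3f}")
print(f"Bonferroni threshold:     {bonferroni_alpha(alpha, m):.4f}")
```

With 20 tests the chance of at least one false positive is roughly 64 percent, which is why an uncorrected "significant" result among many tests carries little weight.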

Finally, reverse-engineering a t-test from a published article when only some statistics are reported can be tricky. If you have the t-statistic and total sample size, you can derive a usable estimate of effect size. If only the p-value is given, the test type and degrees of freedom are needed to back-calculate the t-statistic. Always demand sufficient detail in research reports to allow such verification.
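One way to recover an effect size from a reported t-statistic is to invert the pooled formula: d = t · √(1/n₁ + 1/n₂). The sketch below assumes the reported t comes from a pooled test (or from equal-sized groups, where Welch and pooled t coincide); the shortcut for total N additionally assumes the two groups are the same size. Function names are hypothetical.

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d implied by a pooled t-statistic and the two group sizes."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def d_from_t_total(t, total_n):
    """Shortcut when only total N is reported; assumes equal group sizes."""
    return 2 * t / math.sqrt(total_n)

# Using t = -1.753 with 30 observations per group
print(f"{d_from_t(-1.753, 30, 30):.3f}")
print(f"{d_from_t_total(-1.753, 60):.3f}")
```

If the groups were actually unequal in size, the total-N shortcut overstates |d|, so treat it as an estimate.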

FAQ

Q1 What does a t-test tell you?

A t-test compares two means and reports the probability of seeing a difference at least as large as the one observed, assuming the population means are actually equal. If that probability (the p-value) is below your significance threshold α (commonly 0.05), such a difference would be unlikely under the null hypothesis, so you reject it.

Q2 When should I use Welch's t-test vs pooled?

Use Welch's t-test by default. It does not assume equal variances and handles unequal sample sizes well. R uses Welch by default for that reason. Choose the pooled (Student's) t-test only when you have prior evidence that variance is equal.

Q3 What is a good sample size for a t-test?

For detecting a medium effect (Cohen's d = 0.5) at α = 0.05 with 80 percent power, each group needs about 64 observations. For a large effect (d = 0.8), about 26 per group. For tiny effects, hundreds or thousands per group.
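A rough version of this calculation uses the normal approximation n ≈ 2(z₁₋α/₂ + z_power)² / d² per group. This is a sketch under that approximation, not the exact t-based power calculation, and it comes out about one observation lower than the figures quoted above.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-tailed, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_power = NormalDist().inv_cdf(power)           # quantile for desired power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {math.ceil(n_per_group(d))} per group")
```

Because the required n scales with 1/d², halving the effect size roughly quadruples the sample needed.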

Q4 What is Cohen's d?

Cohen's d is the standardized mean difference — how many pooled standard deviations separate the group means. A d of 0.5 means the means differ by half a standard deviation, considered a medium effect. Effect size is independent of sample size and complements the p-value.

Q5 What is the difference between one-tailed and two-tailed p-values?

A two-tailed test asks whether the means differ in either direction. A one-tailed test only counts deviations in the predicted direction, which halves the p-value when the observed difference falls in that direction. Use a one-tailed test only when the direction of the effect is specified before seeing the data.
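For a symmetric distribution such as Student's t, the conversion between the two can be sketched as follows. The function name and the predicted_sign argument are illustrative; the inputs are a t-statistic of −1.753 with a two-tailed p of about 0.085.

```python
def one_tailed_p(p_two_tailed, t, predicted_sign):
    """Convert a two-tailed p-value to a one-tailed one.

    predicted_sign is +1 if the hypothesis predicts mean1 > mean2,
    and -1 if it predicts mean1 < mean2.
    """
    if t * predicted_sign > 0:
        return p_two_tailed / 2          # observed effect is in the predicted direction
    return 1 - p_two_tailed / 2          # observed effect is in the opposite direction

print(round(one_tailed_p(0.085, -1.753, -1), 4))   # prediction matches the data
print(round(one_tailed_p(0.085, -1.753, +1), 4))   # prediction contradicts the data
```

The same data that miss α = 0.05 two-tailed fall below it one-tailed, which is why the direction has to be fixed in advance rather than chosen after looking at the result.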

Q6 What does failing to reject the null hypothesis mean?

It means the data do not provide enough evidence to conclude the means differ. It does NOT mean the means are equal. Absence of evidence is not evidence of absence. To support a no-difference conclusion you need equivalence testing, not a non-significant t-test.
