Z-test calculator
A z-test compares a sample mean to a known population mean using the standard normal distribution. The test statistic is z = (x̄ − μ₀) / (σ/√n). With a sample mean of 102, population mean 100, σ = 15, and n = 36, the z-statistic is 0.80, giving a two-tailed p-value of 0.4237 — not significant at any conventional α level.
The z-test belongs to the family of parametric hypothesis tests. It assumes you know the population standard deviation in advance, which is rare in practice but common in textbook problems, quality control with established process variation, and large-sample survey work where σ is well-estimated. When σ is unknown, the t-test takes over — and for n ≥ 30 the two tests give nearly identical results.
What is a z-test?
A one-sample z-test is a statistical procedure for deciding whether a sample mean is consistent with a hypothesised population mean. The null hypothesis H₀ states that the true population mean equals μ₀. The alternative H₁ states that it differs, either in one specified direction (one-tailed) or in either direction (two-tailed). The z-statistic measures how many standard errors the sample mean lies from μ₀.
The standard normal distribution provides the reference. A z of ±1.96 corresponds to the two-tailed 5% significance level; ±2.576 corresponds to 1%. Any value of |z| larger than the critical value at your chosen α leads to rejecting H₀. The equivalent p-value approach computes the probability of seeing |z| or larger under H₀ and rejects when p < α.
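The critical values and the p-value rule described here can be reproduced with Python's standard-library `statistics.NormalDist`. This is a minimal sketch: the helper names `critical_z` and `two_tailed_p` and the statistic z = 2.10 are illustrative assumptions, not part of the article.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha: float, two_tailed: bool = True) -> float:
    """Critical value that |z| (two-tailed) or z (one-tailed) must exceed."""
    tail_area = alpha / 2 if two_tailed else alpha
    return STD_NORMAL.inv_cdf(1 - tail_area)


def two_tailed_p(z: float) -> float:
    """Probability of a statistic at least as extreme as |z| under H0."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: reject H0 when |z| > {critical_z(alpha):.3f}")

# The two decision rules agree: |z| exceeds the critical value
# exactly when the p-value falls below alpha.
z = 2.10  # hypothetical statistic, chosen only for illustration
print(abs(z) > critical_z(0.05), two_tailed_p(z) < 0.05)
```

Both rules are the same test seen from two sides, so they should always produce the same decision for a given α.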
Z-test formula and procedure
The full procedure is five steps. Step three is the formula; steps one and two are setup; steps four and five are interpretation.
1. State H₀ and H₁: H₀: μ = μ₀; H₁: μ ≠ μ₀ (or a one-tailed alternative).
2. Choose α: typically 0.05, sometimes 0.01 or 0.10.
3. Compute z: z = (x̄ − μ₀) / (σ/√n).
4. Find the p-value: for a two-tailed test, p = 2 × [1 − Φ(|z|)].
5. Decide: reject H₀ if p < α.
Critical z values for two-tailed tests are 1.6449 (α=0.10), 1.9600 (α=0.05), and 2.5758 (α=0.01). For one-tailed tests, all of α sits in a single tail instead of being split in two, so the critical values are smaller: 1.2816 (α=0.10), 1.6449 (α=0.05), 2.3263 (α=0.01). These are the standard normal quantiles you compare |z| against (or z itself, with the appropriate sign, in a one-tailed test).
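The five steps can be sketched as one small function. This is an illustration under stated assumptions rather than a reference implementation: the function name, the `alternative` labels ("two-sided", "greater", "less") and the input numbers in the usage line are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05,
                      alternative="two-sided"):
    """Steps 3 to 5 of the procedure; steps 1 and 2 (the hypotheses
    and alpha) are supplied by the caller as arguments."""
    phi = NormalDist().cdf
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # step 3
    if alternative == "two-sided":                # step 4
        p = 2 * (1 - phi(abs(z)))
    elif alternative == "greater":                # H1: mu > mu0
        p = 1 - phi(z)
    elif alternative == "less":                   # H1: mu < mu0
        p = phi(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p, p < alpha                        # step 5


# Hypothetical inputs: sample mean 52.5, mu0 = 50, sigma = 10, n = 64.
z, p, reject = one_sample_z_test(52.5, 50, 10, 64)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Passing the alternative as an argument keeps the choice of tail explicit and fixed before the data are examined.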
Z-test example, step by step
A factory produces metal rods with target length 100 cm and known historical standard deviation 15 cm. A quality control sample of 36 rods averages 102 cm. Is the production line drifting?
- H₀: μ = 100 cm (process on target).
- H₁: μ ≠ 100 cm (process drifted, two-tailed).
- α: 0.05.
- SE: 15 / √36 = 2.50 cm.
- z: (102 − 100) / 2.50 = 0.80.
- p (two-tailed): 2 × (1 − Φ(0.80)) = 2 × 0.2119 = 0.4237.
- Decision: 0.4237 > 0.05, so fail to reject H₀.
- 95% CI: 102 ± 1.96 × 2.50 = [97.1, 106.9] cm. Contains 100, consistent with H₀.
The 2 cm observed deviation is plausible random variation given the process variability and sample size. A 2 cm drift corresponds to a standardised effect of d = 2/15 ≈ 0.13, so detecting it at α = 0.05 with 80% power would need n ≈ 7.85/d² ≈ 441.5, or about 442 rods, far more than the 36 sampled.
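The arithmetic in the rod example can be checked with a few lines of standard-library Python. The numbers are the ones given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the rod example.
mu0, sigma, n, sample_mean, alpha = 100.0, 15.0, 36, 102.0, 0.05
std_normal = NormalDist()

se = sigma / sqrt(n)                          # standard error of the mean
z = (sample_mean - mu0) / se                  # test statistic
p = 2 * (1 - std_normal.cdf(abs(z)))          # two-tailed p-value
z_crit = std_normal.inv_cdf(1 - alpha / 2)    # about 1.96
ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)

print(f"SE = {se:.2f} cm, z = {z:.2f}, p = {p:.4f}")
print(f"95% CI = [{ci[0]:.1f}, {ci[1]:.1f}] cm")
print("reject H0" if p < alpha else "fail to reject H0")
```

The interval and the test agree by construction: the 95% interval contains μ₀ exactly when the two-tailed test at α = 0.05 fails to reject.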
Z-test vs t-test
The choice between z-test and t-test hinges on whether σ is known. The z-test uses the known population σ in the denominator. The t-test substitutes the sample standard deviation s and uses the Student t-distribution, which has heavier tails to compensate for the extra uncertainty.
For n ≥ 30, t and z give nearly identical p-values because the t-distribution converges to the normal. For n < 30 with unknown σ, the t-test is strictly preferable: using the z-test in that regime underestimates p-values and inflates the false-positive rate. In practice, modern statistical software defaults to the t-test for one-sample mean comparisons.
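The inflation of the false-positive rate can be seen in a small simulation. The sketch below draws samples for which H₀ is true, plugs the sample standard deviation s into the z formula, and counts how often the normal critical value is exceeded. The function name, the seed, the trial count and the sample sizes are arbitrary assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n, trials=10_000, alpha=0.05, seed=0):
    """Share of true-H0 samples rejected when s replaces sigma in a z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 true: mu = 0
        m = sum(sample) / n
        s = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        z = m / (s / sqrt(n))       # s used where the z-test expects sigma
        rejections += abs(z) > z_crit
    return rejections / trials


for n in (5, 10, 30):
    print(f"n = {n:>2}: observed false-positive rate {false_positive_rate(n):.3f}")
```

With very small samples the observed rate should sit well above the nominal 5% and drift back toward it as n grows, which is the gap the heavier tails of the t-distribution are designed to close.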
Z-test p-value and significance
The p-value is the probability of observing a z-statistic at least as extreme as the calculated one, assuming H₀ is true. Small p-values (p < α) lead to rejecting H₀. The conventional thresholds are α = 0.05 in most fields, α = 0.01 in stricter contexts such as some medical trials, and α = 0.10 in exploratory or social-science work. Particle physics is far stricter still: a discovery claim conventionally requires 5σ, a one-tailed p of about 3 × 10⁻⁷.
The 5% significance threshold has no theoretical basis. Ronald Fisher adopted it as a convenient round number in 1925's Statistical Methods for Research Workers, writing of the value for which P = 0.05 that "it is convenient to take this point as a limit in judging whether a deviation is to be considered significant." Almost a century later, replication crises in psychology and biomedicine have prompted calls to lower the default to 0.005 or to abandon fixed thresholds entirely.
Z-test effect size and power
Statistical significance and practical significance are not the same thing. With a sufficiently large sample, even tiny differences become statistically significant — but the effect may be too small to care about. Always report effect size alongside the p-value.
Cohen's d for a one-sample test is (x̄ − μ₀)/σ. Conventional benchmarks are d = 0.2 (small), 0.5 (medium), 0.8 (large). Sample size for 80% power at α = 0.05 two-tailed is approximately n = (1.960 + 0.842)² / d² ≈ 7.85 / d². Rounding up to the next whole observation, d = 0.5 needs n = 32, d = 0.2 needs n = 197, and d = 0.8 needs n = 13.
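The sample-size rule can be sketched directly from the two normal quantiles it uses. The function name `required_n` is an assumption for illustration, and the formula is the usual normal approximation, which ignores the negligible chance of rejecting in the wrong tail.

```python
from math import ceil
from statistics import NormalDist


def required_n(d, alpha=0.05, power=0.80):
    """Approximate n for a two-tailed one-sample z-test: ((z_a + z_b) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.960
    z_beta = NormalDist().inv_cdf(power)            # about 0.842
    return ((z_alpha + z_beta) / d) ** 2


for d in (0.2, 0.5, 0.8):
    n = required_n(d)
    print(f"d = {d}: n = {n:.1f}, rounded up to {ceil(n)}")

# Rod example: a 2 cm drift with sigma = 15 cm gives d = 2/15.
print(f"2 cm drift: n = {ceil(required_n(2 / 15))}")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.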
A clinical trial with n = 50,000 may flag a 0.1 mm drug-induced height change as "highly significant" (p < 0.001) even though no one cares about a 0.1 mm effect. Conversely, a small pilot study (n = 10) may miss a 30% effect (p = 0.20) and report no significant difference even though the effect is real and important. Always interpret p-values in light of effect size and sample size.
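Because z = d × √n, the same relationship can be shown numerically. In this sketch the standardised effects 0.02 and 0.40 are hypothetical stand-ins for "negligible" and "substantial"; the article gives no σ for its two scenarios, so these are not its actual numbers.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(d, n):
    """Two-tailed p-value for an observed standardised difference d."""
    z = d * sqrt(n)   # (sample mean - mu0) / (sigma / sqrt(n)) equals d * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_large_study = two_tailed_p(0.02, 50_000)   # negligible effect, huge sample
p_small_pilot = two_tailed_p(0.40, 10)       # sizeable effect, tiny sample

print(f"d = 0.02, n = 50,000: p = {p_large_study:.2g}")
print(f"d = 0.40, n = 10:     p = {p_small_pilot:.2f}")
```

The first p-value falls below 0.001 although the effect is tiny, while the second stays above 0.05 although the effect is twenty times larger, which is why the p-value alone says little about importance.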
Common z-test mistakes
Decide one-tailed vs two-tailed before looking at the data. Switching to one-tailed after seeing the direction of your sample is a form of p-hacking that doubles your false-positive rate. The choice should follow from the hypothesis, not the result.
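The doubling follows from the tail areas, as this short sketch shows. It assumes α = 0.05 and compares a one-tailed test whose direction was fixed in advance with one whose direction is chosen to match the sign of the observed z.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05
one_tailed_crit = std_normal.inv_cdf(1 - alpha)   # about 1.645

# Direction fixed in advance: only z > crit leads to rejection.
planned_rate = 1 - std_normal.cdf(one_tailed_crit)

# Direction picked after seeing the data: the tail always matches the
# sign of z, so z > crit and z < -crit both lead to rejection.
post_hoc_rate = 2 * (1 - std_normal.cdf(one_tailed_crit))

print(f"planned: {planned_rate:.3f}, post hoc: {post_hoc_rate:.3f}")
```

Under H₀ the post hoc version rejects twice as often as its nominal α claims.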
The most common mistake is treating an unknown sample SD as if it were the known population σ. Plugging s in place of σ and running a z-test gives slightly anti-conservative p-values, especially with small samples. The correct move is the t-test. Most textbook problems pretend σ is known to keep things simple, but real data rarely cooperate.
The second common mistake is multiple testing without correction. Running 20 z-tests on the same data at α = 0.05 yields one false positive on average even if every H₀ is true, and if the tests were independent the chance of at least one would be 1 − 0.95²⁰ ≈ 64%. The Bonferroni correction (divide α by the number of tests) is conservative but easy; the Benjamini–Hochberg false-discovery-rate procedure is more powerful for large numbers of tests.
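Both corrections are short enough to sketch in plain Python. The function names and the six p-values below are made up for illustration, and the Benjamini–Hochberg version shown is the basic step-up procedure, which assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR procedure: find the largest rank k with
    p_(k) <= (k / m) * alpha and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


p_values = [0.001, 0.008, 0.012, 0.03, 0.2, 0.6]   # hypothetical results
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these made-up values Bonferroni rejects two hypotheses and Benjamini–Hochberg rejects four. The two procedures control different error rates (family-wise error versus false discovery rate), which is where the extra power comes from.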
A subtler trap is interpreting a non-significant p as "no effect." Failing to reject H₀ is not evidence for H₀ — it is consistent with no effect, a small effect, or insufficient power to detect a real effect. To support H₀ you need a different framework, such as a confidence interval that excludes meaningful effect sizes or an equivalence test.
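One common form of equivalence test is the two one-sided tests (TOST) procedure, sketched here for the z-test setting. The function name and the ±5 cm equivalence margin are assumptions chosen for illustration; the other numbers come from the rod example.

```python
from math import sqrt
from statistics import NormalDist


def tost_z(sample_mean, mu0, sigma, n, margin):
    """Two one-sided z-tests of H0: |mu - mu0| >= margin.
    Returns the TOST p-value, the larger of the two one-sided p-values."""
    phi = NormalDist().cdf
    se = sigma / sqrt(n)
    # H0: mu <= mu0 - margin, rejected when the mean is well above that bound.
    p_lower = 1 - phi((sample_mean - (mu0 - margin)) / se)
    # H0: mu >= mu0 + margin, rejected when the mean is well below that bound.
    p_upper = phi((sample_mean - (mu0 + margin)) / se)
    return max(p_lower, p_upper)


# Rod example with an assumed equivalence margin of 5 cm.
p_equiv = tost_z(102, 100, 15, 36, margin=5)
print(f"TOST p = {p_equiv:.4f}")
```

Under this assumed margin the TOST p-value is above 0.05, so the sample neither shows a drift nor demonstrates that the process is within 5 cm of target, which is exactly the inconclusive middle ground described above.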