Article — P-Value Calculator
The p-value calculator, with the math that runs underneath
A p-value is the probability of observing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true. For a two-tailed z-test, p = 2 × (1 − Φ(|z|)). For a two-tailed t-test, p = 2 × (1 − F_t(|t|, df)). The smaller the value, the stronger the evidence against the null hypothesis — but the p-value alone is never the whole story.
The calculator above evaluates both formulas with standard numerical methods: the Hastings approximation from Abramowitz & Stegun for the normal CDF, and the regularized incomplete beta function for the Student t CDF, which is also the route R and scipy.stats take for the t distribution. Results match R's pnorm() and pt() to at least four decimal places across the working range.
What a p-value actually is
Imagine the null hypothesis is true. Under that assumption, your test statistic follows a known distribution — the standard normal for a z-test, the Student t for a t-test. The p-value is the probability that this random variable lands at least as far out as your observed statistic, in the direction your alternative hypothesis specifies.
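The definition above can be checked by brute force. The sketch below is illustrative only and is not how the calculator works: it draws many values from the null distribution (standard normal here) and counts how often they land at least as far out as the observed statistic. The function name, the number of draws, and the seed are arbitrary choices made for this example.

```python
import random
from statistics import NormalDist

def simulated_two_tailed_p(z_obs, draws=200_000, seed=0):
    """Estimate P(|Z| >= |z_obs|) by sampling Z from the null distribution."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    threshold = abs(z_obs)
    hits = sum(1 for _ in range(draws) if abs(rng.gauss(0.0, 1.0)) >= threshold)
    return hits / draws

z_obs = 1.96
exact = 2 * (1 - NormalDist().cdf(abs(z_obs)))   # closed-form two-tailed p
print(f"simulated: {simulated_two_tailed_p(z_obs):.4f}  exact: {exact:.4f}")
```

The simulated fraction should settle near the closed-form value as the number of draws grows, which is all the p-value is: a tail area under the null distribution.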
Ronald Fisher popularized the concept in 1925 in Statistical Methods for Research Workers. His logic still underpins the definition used today: if P is small, either an exceptionally rare event has occurred or the null hypothesis is false. The smaller the p-value, the more uncomfortable it becomes to attribute the result to chance.
Fisher chose the 0.05 threshold because it corresponded to "about two standard deviations" on the normal distribution. He never claimed it was a fundamental cutoff — only that "a value of P = 0.05 will be regarded as a convenient point." Decades of journal practice turned the convenience into a rule.
How to calculate a p-value
Every p-value calculation has three ingredients: a test statistic, a reference distribution, and a tail specification.
- Test statistic: the number that summarizes your data under H₀. Common forms: z, t, χ², F.
- Reference distribution: what the statistic looks like if H₀ is true. Standard normal for z, Student t for t-tests, chi-square for goodness-of-fit, F for variance ratios.
- Tail: one-sided (H₁ predicts a direction) or two-sided (H₁ says "different in either direction").
- The p-value: the integral of the reference distribution past the observed statistic, in the chosen tail(s).
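For a two-tailed test on a symmetric distribution (z or t), the p-value is twice the upper-tail probability of the absolute statistic. For a one-tailed test it is just the relevant tail: the upper tail if H₁ predicts a larger value, the lower tail if it predicts a smaller one. The calculator handles all three options: two-tailed, upper-tailed, and lower-tailed.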
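The three ingredients map onto a very small function. The sketch below is a minimal illustration, not the calculator's source: the name p_value and the tail labels "two-sided", "greater", and "less" are invented for this example, and the two-sided branch assumes a reference distribution that is symmetric about zero (z or t, not χ² or F).

```python
from statistics import NormalDist

def p_value(statistic, cdf, tail="two-sided"):
    """Tail area past `statistic`, where cdf(x) = P(T <= x) under H0."""
    if tail == "two-sided":            # symmetric distributions only
        return 2 * (1 - cdf(abs(statistic)))
    if tail == "greater":              # H1 predicts a larger statistic
        return 1 - cdf(statistic)
    if tail == "less":                 # H1 predicts a smaller statistic
        return cdf(statistic)
    raise ValueError(f"unknown tail: {tail!r}")

phi = NormalDist().cdf                 # reference distribution for a z-test
print(p_value(1.96, phi))              # two-sided
print(p_value(1.645, phi, "greater"))  # upper tail
print(p_value(1.645, phi, "less"))     # lower tail
```

Passing the CDF as an argument keeps the tail logic separate from the choice of reference distribution, so the same function would work with a Student t CDF.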
P-value from a z-test
The z-test uses the standard normal distribution. The two-tailed p-value is p = 2 × (1 − Φ(|z|)), where Φ is the cumulative distribution function. For the canonical critical value z = 1.96, this gives p = 0.05, and that is why 1.96 appears in every statistics textbook.
- z = 1.645 ⇒ one-tailed p = 0.05
- z = 1.96 ⇒ two-tailed p = 0.05
- z = 2.576 ⇒ two-tailed p = 0.01
- z = 3.291 ⇒ two-tailed p = 0.001
- z = 5.000 ⇒ two-tailed p ≈ 5.7 × 10⁻⁷ (5σ, particle physics)
The numerical method for Φ in this calculator is the Hastings approximation, equation 26.2.17 in Abramowitz & Stegun. The maximum error is 7.5 × 10⁻⁸, well below the precision you would ever report in a manuscript.
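For readers who want to see the approximation itself, here is a minimal sketch of Abramowitz & Stegun 26.2.17 in Python. It is a reconstruction from the published formula, not the calculator's actual code; the function name phi_hastings is made up for this example, and math.erfc serves as the reference value.

```python
import math

# Constants from Abramowitz & Stegun, equation 26.2.17
_P = 0.2316419
_B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def phi_hastings(x):
    """Approximate the standard normal CDF with the 26.2.17 polynomial."""
    t = 1.0 / (1.0 + _P * abs(x))
    poly = 0.0
    for b in reversed(_B):             # Horner form of b1*t + b2*t^2 + ... + b5*t^5
        poly = (poly + b) * t
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    upper_tail = density * poly        # approximates 1 - Phi(|x|)
    return 1.0 - upper_tail if x >= 0 else upper_tail

def phi_reference(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

for z in (1.645, 1.96, 2.576, 3.291):
    print(z, 2 * (1 - phi_hastings(z)), abs(phi_hastings(z) - phi_reference(z)))
```

The formula is stated for x ≥ 0; the sketch handles negative inputs through the symmetry Φ(−x) = 1 − Φ(x).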
P-value from a t-test
The t-test uses Student's t-distribution, which has heavier tails than the normal. The shape depends on the degrees of freedom (df). For a one-sample t-test, df = n - 1. For a Welch two-sample t-test on samples of size n₁ and n₂, df is computed by the Welch-Satterthwaite formula. For a paired t-test, df = n - 1 where n is the number of pairs.
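The Welch-Satterthwaite degrees of freedom are easy to get wrong by hand, so a short sketch may help. The function name welch_df and its argument order are chosen for this example; s1 and s2 are the sample standard deviations of the two groups.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom for two samples."""
    a = s1 ** 2 / n1                   # squared standard error, sample 1
    b = s2 ** 2 / n2                   # squared standard error, sample 2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

print(welch_df(1.0, 10, 1.0, 10))      # equal variances and equal sizes
print(welch_df(1.0, 10, 5.0, 40))      # unequal variances and sizes
```

The result is generally not an integer, and it always falls between min(n₁, n₂) − 1 and n₁ + n₂ − 2.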
The two-tailed p-value is p = 2 × (1 − F_t(|t|, df)). The calculator computes F_t through the regularized incomplete beta function, the method used in R, Python's scipy.stats, and Press et al.'s Numerical Recipes chapter 6.4. For df = 30, the critical t-value at α = 0.05 (two-tailed) is t = 2.042 — already close to z = 1.96, and the gap shrinks further as df grows.
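The incomplete beta route can be sketched with the standard library alone. The code below follows the continued-fraction approach described in Numerical Recipes (modified Lentz iteration) and uses the identity p = I_x(df/2, 1/2) with x = df / (df + t²) for the two-tailed p-value. It is an illustrative reconstruction, not the calculator's code; the function names, iteration cap, and tolerance are assumptions, and very large df would need a higher iteration cap.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step, then odd step, of the recurrence
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:     # converged after the odd step
            break
    return h

def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(ln_front)
    if x < (a + 1.0) / (a + b + 2.0):  # fraction converges quickly here
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b   # symmetry relation

def t_two_tailed_p(t, df):
    """Two-tailed p-value for a Student t statistic."""
    return betainc_reg(df / 2.0, 0.5, df / (df + t * t))

print(t_two_tailed_p(2.042, 30))       # close to 0.05
print(t_two_tailed_p(1.96, 1000))      # approaches the z-test value
```

With df = 1 the t-distribution is the Cauchy distribution, which gives a handy exact check: t = 1 must return a two-tailed p of 0.5.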
One-tailed versus two-tailed p-values
Two-tailed is the safe default. The alternative hypothesis is "different from H₀," with no direction specified. A two-tailed test treats positive and negative deviations equally.
One-tailed tests have more statistical power but only when the predicted direction is correct. You must specify the direction before collecting data. If the result lies in the predicted direction, the one-tailed p-value is half the two-tailed value. If it lies in the wrong direction, the one-tailed p-value is greater than 0.5 — the test cannot find significance no matter how extreme the wrong-direction effect is.
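The halving rule and the wrong-direction penalty are easy to see numerically. This sketch uses a z-test and a hypothetical helper, z_p_values, written for this example; the direction labels "greater" and "less" are likewise assumptions.

```python
from statistics import NormalDist

def z_p_values(z, predicted_direction="greater"):
    """Return (one_tailed_p, two_tailed_p) for a z statistic."""
    phi = NormalDist().cdf
    two_tailed = 2 * (1 - phi(abs(z)))
    if predicted_direction == "greater":
        one_tailed = 1 - phi(z)        # upper tail
    else:
        one_tailed = phi(z)            # lower tail
    return one_tailed, two_tailed

print(z_p_values(2.1))                 # effect in the predicted direction
print(z_p_values(-2.1))                # same size, wrong direction
```

In the first call the one-tailed value is exactly half the two-tailed value. In the second it equals one minus that half, so it sits far above 0.5 however large the effect is.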
Choosing a one-tailed test after looking at the data is the classic p-hacking move: it cuts the reported p-value in half without any new information. Pre-register the test direction (in OSF, AsPredicted, or a journal protocol) and stick with it. The 2016 ASA statement on p-values explicitly warns against this practice.
P-value thresholds and what they mean
Different fields use different significance thresholds because the cost of a false positive differs. Social science accepts α = 0.05 in part on the expectation that subsequent replications will weed out false discoveries. Medical research often runs at α = 0.01 because a false-positive treatment claim can be life-threatening. Particle physics demands p < 3 × 10⁻⁷ (the 5σ standard, counted as a one-sided tail) because a single analysis effectively tests a very large number of hypotheses, so even a strict-looking threshold becomes loose once the multiple-testing burden is included.
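To make the thresholds concrete, the sketch below converts a significance level in σ to a one-sided tail probability and shows how many false positives a threshold would be expected to produce across many independent tests of true null hypotheses. The function names and the figure of one million tests are illustrative assumptions, not numbers from any real analysis.

```python
import math

def sigma_to_one_sided_p(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null is true."""
    return alpha * n_tests

p5 = sigma_to_one_sided_p(5.0)
print(f"5 sigma one-sided p = {p5:.3e}, about 1 in {1 / p5:,.0f}")

n_tests = 1_000_000                    # hypothetical number of tests
for alpha in (0.05, 0.01, p5):
    print(f"alpha = {alpha:.2e}: {expected_false_positives(alpha, n_tests):,.2f}")
```

At α = 0.05 a million true-null tests would be expected to yield about fifty thousand false positives, while the 5σ threshold keeps the expected count well below one.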
Common p-value misinterpretations
The 2016 American Statistical Association statement sets out six principles, several of which address common misreadings. Three matter for almost every reader:
- P is not the probability that H₀ is true. P is computed under H₀. To get P(H₀ | data), you need Bayes' theorem and a prior — which p-values do not provide.
- P does not measure effect size. A tiny effect in a huge sample gives a small p-value. Always report the effect size (Cohen's d, η², odds ratio) alongside.
- P > 0.05 is not evidence for H₀. It is absence of evidence against H₀, which is not the same thing. Low-power studies miss real effects often.
If you find yourself reporting a p-value, also report the confidence interval and the effect size. The CI tells readers the plausible range of the true effect; the effect size tells them whether it matters in practice. A 95% CI that excludes 0 conveys the same significance verdict as p < 0.05, but with much more information.
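A small worked example shows why the three numbers belong together. The sketch assumes a one-sample z-test with a known population standard deviation; the function name and all of the input figures (a 0.1-point shift on a scale with σ = 15) are hypothetical and chosen only to illustrate the point.

```python
import math
from statistics import NormalDist

def one_sample_z_summary(mean, mu0, sigma, n, level=0.95):
    """Two-tailed p, confidence interval for the shift, and Cohen's d."""
    diff = mean - mu0
    se = sigma / math.sqrt(n)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # equals 2 * (1 - Phi(|z|))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    d = diff / sigma                               # standardized effect size
    return {"z": z, "p": p, "ci": ci, "d": d}

# Same tiny effect, two very different sample sizes (hypothetical data)
huge = one_sample_z_summary(mean=100.1, mu0=100.0, sigma=15.0, n=1_000_000)
small = one_sample_z_summary(mean=100.1, mu0=100.0, sigma=15.0, n=100)
print(huge)
print(small)
```

The effect size is identical in both runs and negligible in practical terms, yet the large sample gives a minuscule p-value and a CI that excludes 0, while the small sample gives neither.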
A short history of the p-value
Karl Pearson invented the chi-square p-value in 1900 for testing goodness of fit. Fisher generalized the concept in 1925 with z and t. Jerzy Neyman and Egon Pearson reformulated hypothesis testing in 1933 as a decision-theoretic framework with fixed α and β, the version most statistics textbooks teach. The two camps disagreed about almost everything: Fisher treated p-values as continuous evidence; Neyman and Pearson treated the test as a bright-line decision. The hybrid "p < 0.05 cutoff" that dominates published research today is not what either camp actually proposed.
The American Statistical Association published its formal statement on p-values in 2016, followed by a special supplement of The American Statistician in 2019 titled "Moving to a World Beyond p < 0.05." The discussion is ongoing. The p-value is not going away, but the binary significant/non-significant verdict is increasingly seen as a relic of a less sophisticated era of statistical practice.
The 2012 discovery of the Higgs boson at CERN was reported at 5σ significance — a one-in-3.5 million chance of seeing the bump at the right mass by accident. Both the ATLAS and CMS experiments hit the threshold independently. The conservative 5σ standard exists in particle physics specifically because the number of hypotheses tested in a typical analysis is enormous, and weaker cutoffs would generate constant false discoveries.
Sources
- Fisher: Statistical Methods for Research Workers (1925)
- American Statistical Association: Statement on p-Values (2016)
- R Core Team: pt() and pnorm() documentation
- SciPy: scipy.stats.t and scipy.stats.norm
- Press et al.: Numerical Recipes 3rd ed. (Cambridge University Press)
- NIST/SEMATECH e-Handbook of Statistical Methods