A Sensible Formulation of the Significance Test
Lyle V. Jones
The University of North Carolina at Chapel Hill
and
John W. Tukey
Princeton University
Abstract
The conventional procedure for null hypothesis significance testing long
has been the target of appropriate criticism. Here, a more reasonable alternative is
proposed, one that not only avoids the unrealistic postulation of a null
hypothesis, but also, for a given parametric difference and a given error
probability, is more likely to report the detection of that
difference.
Procedures
for testing the significance of the difference between two sample means have
been widely utilized, and also widely criticized, throughout most of the
20th century. To better
codify the procedures, the Neyman-Pearson approach to choosing between two point
hypotheses emerged. While
attractive mathematically, that approach is troublesome in practice, because
investigators are often unwilling to specify a value for the alternative to the
null hypothesis and, of equal importance, both hypotheses, like all point
hypotheses about parameters in real-world populations that are subjected to
different treatments, are always untrue when calculations are carried to enough
decimal places.
The
introduction of hypothesis testing did lead to an attempt to reformulate the
significance test as a test of a point null hypothesis against all possible
alternatives. However, that
reformulation failed to silence the critics, among them Bakan (1966), Bechhofer
(1954), Berkson (1938, 1942), Cohen (1994), Lykken (1968), Meehl (1967),
Rozeboom (1960), Schmidt (1996), and Schmidt & Hunter (1997). Jones (1955, p. 407)
suggested that “an investigator would be misled less frequently and would be
more likely to obtain the information he seeks were he to formulate his
experimental problems in terms of the estimation of population parameters, with
the establishment of confidence intervals about the estimated values, rather
than in terms of a null hypothesis against all possible alternatives.” Many other critics have echoed that
advice, to which we also subscribe, especially when the outcome measure is well
defined.
However,
when faced with several or many mean differences, i.e., with multiple
comparisons, straightforward procedures for establishing confidence intervals
may not be available. Also, in some
situations, the scale of the outcome measure may be so untrustworthy that the
primary interest then may appropriately reside in just the direction of a
treatment effect rather than in its size.
Whenever for these or for other reasons a test of significance is to be
performed, we propose an alternative to the conventional formulation. (Additional arguments in support of
significance testing are set forth by Abelson, 1997; Hagen, 1997; Mulaik, Raju,
& Harshman, 1997; and Wainer, 1999, among others.)
A
common formulation for the conventional test-of-hypothesis version of the test
of significance is the traditional test for the equality of two population
means, using Student’s t distribution.
A null hypothesis is set forth,
H0: µA - µB = 0
vs. an omnibus alternative,
H1: µA - µB ≠ 0.
From samples of sizes nA and nB, an estimate of µA - µB, yA - yB, is
obtained. Usually, yA and yB are sample means, but they might be sample
medians, sample midmeans, or other estimates of µA and µB. An estimated
standard error of yA - yB, sd, is calculated and a statistic,
(yA - yB)/sd, is formed.
A two-tailed rejection region of the sampling distribution of this statistic,
typically Student's t with df = nA + nB - 2, is established by setting a
value for α, the probability of rejecting H0 when it is true, often set as
.05. When the t statistic falls in either tail of the rejection region, each
with area α/2, H0 is rejected in favor of H1. The only allowable conclusions
are (1) reject H0, and (2) fail to reject that null hypothesis, thereby
withholding judgment.
Cohen (1994, p. 1001) advised, "don't look for a magic alternative to" this formulation
of the null hypothesis significance test. "It doesn't exist." In the
same paper, Cohen cited Tukey (1991), but seems to have overlooked the
alternative suggested there (see also Tukey, 1993). As applied to multiple comparisons, that
alternative has been more explicitly developed and illustrated in Williams,
Jones, & Tukey (1999). As
shown below, the formulation is entirely suitable for use with a single mean
difference as well as with multiple comparisons.
When A and B are different treatments, µA and µB are certain to differ in
some decimal place so that µA - µB = 0 is known in advance to be false and
µA - µB ≠ 0 is known to be true (Cohen, 1990; Tukey, 1991). An extensive rebuttal to this
claim is provided by Hagen (1997), who states that “I agree that A and B will
always produce differential effects on some variable or variables that
theoretically could be measured.
But I do not agree that A and B will always produce an effect on the
dependent variable … ” (p. 20).
We simply do not accept that view.
For large finite treatment populations, a total census is at least
conceivable, and we cannot imagine an outcome for which µA -
µB = 0 when the dependent variable (or any other variable) is
measured to an indefinitely large number of decimal places. (We come to a similar conclusion with
respect to the mean of a single population when H0: µ = k, whether k
= 0 or any other definite value.)
For hypothetical treatment populations, µA - µB may
approach zero as a limit, but, as with the approach of population sizes to
infinity, the limit is never reached.
The population mean difference may be trivially small, but will always be
positive or negative. As a
consequence, we should not set forth a null hypothesis because to do so is
unrealistic and misleading.
Instead, we should assess the sample data and entertain one of three
conclusions:
(1) act as if µA - µB > 0;
(2) act as if µA - µB < 0; or
(3) act as if the sign of µA - µB is indefinite, i.e., is not (yet) determined.
This
specification is similar to “the
three alternative decisions” proposed by Tukey (1960, p. 425).
With
this formulation, a conclusion is in error only when it is “a reversal,” when it
asserts one direction while the (unknown) truth is the other direction. Asserting that the direction is not yet
established may constitute a wasted opportunity, but is not an error. We want to control the rate of error,
the reversal rate, while minimizing wasted opportunity, i.e., while minimizing
indefinite results.
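The three-outcome rule can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the authors' implementation: it takes yA and yB to be sample means, takes sd to be the usual pooled-variance standard error, and expects the reader to supply the tabled critical value (2.306 is the two-tailed .05 value of t for df = 8). The function names and the sample data are hypothetical.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Return (t, df) for the difference between two sample means.

    Assumes the pooled-variance standard error; other estimates of
    location and of the standard error would need a different formula.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    s_d = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / s_d
    return t, df


def three_way_conclusion(t, critical_value):
    """Map a t statistic onto one of the three conclusions."""
    if t > critical_value:
        return "act as if muA - muB > 0"
    if t < -critical_value:
        return "act as if muA - muB < 0"
    return "direction not (yet) determined"


# Hypothetical data, five observations per treatment (df = 8).
sample_a = [5, 6, 7, 8, 9]
sample_b = [1, 2, 3, 4, 5]
t, df = pooled_t(sample_a, sample_b)
print(three_way_conclusion(t, critical_value=2.306))
```

Note that no branch of the rule reports "no difference"; the third outcome only withholds a statement about direction.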
Consider the probability of erroneously concluding that the population mean
difference is in one direction when in truth it is in the other. If µA - µB
is positive, an error will occur only when the value of t falls in the
lower tail of the distribution, with area α/2, which also is the limiting
probability of error for that case. Likewise, if µA - µB is negative, an
error occurs only when the t statistic falls in the upper tail of the
distribution, also of area α/2. The reversal rate, also the overall
probability of error, then is
P(error) ≤ (α/2) P[(µA - µB) > 0] + (α/2) P[(µA - µB) < 0] ≤ (α/2) P[(µA - µB) ≠ 0] ≤ α/2.
This
P(error) is just one-half the probability of Type I
error for the traditional null hypothesis test and is the same as that for a
one-tailed test (see Jones, 1952).
Yet, the procedure is symmetric, equally sensitive to a mean difference
in either direction.
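The bound on the reversal rate can be checked numerically. The sketch below is a simplification, not part of the original argument: it replaces Student's t with a normal approximation (a known standard error), and the function name and the example values are hypothetical. It computes the probability that the statistic falls in the wrong tail for a given true difference.

```python
from statistics import NormalDist


def reversal_probability(true_diff, std_error, alpha=0.05):
    """Probability of asserting the wrong direction (normal approximation).

    Each tail of the rejection region has area alpha/2 when the shift is
    zero; a nonzero true difference moves the statistic away from the
    wrong tail, so the result never exceeds alpha/2.
    """
    if true_diff == 0:
        raise ValueError("the formulation assumes a nonzero true difference")
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = true_diff / std_error
    if true_diff > 0:
        return z.cdf(-crit - shift)      # statistic lands in the lower tail
    return 1 - z.cdf(crit - shift)       # statistic lands in the upper tail


# A vanishingly small difference approaches the limit alpha/2;
# a larger difference makes a reversal much less likely.
print(reversal_probability(1e-9, 1.0))
print(reversal_probability(1.0, 1.0))
```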
A
task force on statistical inference is currently preparing guidelines for
standards to be adopted in psychology journals. In an earlier article, Wilkinson &
the Task Force on Statistical Inference (1999) recommended that, when hypothesis
tests have been performed, effect sizes and p values always should be
presented. The adoption of this
proposed three-alternative conclusion procedure lends itself to reporting the
effect size as the estimated value of
µA - µB
(either standardized or not).
Regardless of the size of that estimate, and regardless of whether or not
the calculated value of t falls in the rejection region, it seems appropriate to
report the p value as the area of the t distribution more positive or more
negative (but not both) than the value of t obtained from (yA - yB)/sd. (The limiting values of p then are 0, as
the absolute value of t becomes indefinitely large, and 1/2, as the value of t
approaches zero.)
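A one-sided p value of this kind can be computed without statistical tables. The following is a minimal sketch that assumes nothing beyond the Python standard library: it integrates the Student t density from 0 to |t| with Simpson's rule and subtracts the result from 1/2. The function names and the number of integration steps are illustrative choices, not part of the original text.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom at x."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def one_sided_p(t, df, steps=2000):
    """Area of the t distribution beyond t, in the tail on t's own side."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 0.5 - area_zero_to_t)


print(round(one_sided_p(2.306, 8), 3))  # 0.025
print(one_sided_p(0.0, 8))              # 0.5
```

The two printed values illustrate the range described above: a statistic at the tabled two-tailed .05 critical value gives p = .025, and a statistic of zero gives the upper limit of 1/2.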
For
any specified positive or negative population mean difference, there may be
found in the usual way the probability of a Type II error, of withholding
judgment when the parametric difference is as specified. For each specified difference, the
probability of a Type II error is smaller than that for the conventional
two-tailed test of significance.
Thus, the proposed procedure is uniformly more powerful than the
conventional procedure.
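The gain in power can be illustrated by holding the error rate at .05 under each formulation. The sketch below again uses a normal approximation in place of Student's t, and the function name and the grid of standardized differences are hypothetical. With the conventional test, a Type I error rate of .05 puts area .025 in each tail; with the three-outcome formulation, a reversal rate of at most .05 allows area .05 in each tail.

```python
from statistics import NormalDist

Z = NormalDist()


def prob_definite(shift, tail_area):
    """Probability that the statistic falls in either tail.

    `shift` is the true difference divided by its standard error and
    `tail_area` is the area of each tail when the shift is zero.
    One minus this value is the probability of withholding judgment.
    """
    crit = Z.inv_cdf(1 - tail_area)
    return Z.cdf(shift - crit) + Z.cdf(-shift - crit)


error_rate = 0.05
for shift in (0.5, 1.0, 2.0, 3.0):
    conventional = prob_definite(shift, error_rate / 2)  # .025 per tail
    proposed = prob_definite(shift, error_rate)          # .05 per tail
    print(f"shift={shift}: conventional={conventional:.3f}, "
          f"proposed={proposed:.3f}")
```

For every shift in the grid, the probability of withholding judgment is smaller under the three-outcome formulation, which is the sense in which it is more powerful at a given error rate.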
Hodges
& Lehmann (1954) proposed a modification of the traditional Student test,
converting it from a two-sided test of the null hypothesis to two one-sided
tests. Kaiser (1960) proposed
combining the two one-sided tests into a single test, but one with two
directional alternative hypotheses, µA < µB and µA
> µB. (For further discussion, see Bohrer,
1979, Bulmer, 1957, and Harris, 1997.)
However, the unrealistic null hypothesis of zero mean difference in the
population is included in these proposals, in contrast to the formulation
above. By acknowledging the
fiction of the null hypothesis, and following the implications from “every null
hypothesis is false,” our formulation yields, for any sample size and any value
of α,
a test with greater reported sensitivity to detect the direction of a difference
between two population means.
Note that those accustomed to making "tests of hypotheses" at .05 would, using the
procedures set forth here, do the same arithmetic but would describe their
results as a "test of significance" at one-half of .05, i.e., at .025. Alternatively, to maintain at .05 the
probability of acting as if the parametric difference is in one direction when,
in fact, it is in the other, the investigator would employ the .10 tabled value
of α.
In summary, then:
· Prefer confidence intervals when they are available.
· Recognize that point hypotheses, while mathematically convenient, are never fulfilled in practice.
· When performing a simple test of significance, seek one of three outcomes, as described above.
References
Abelson,
R. P. (1997). A retrospective on
the significance test ban of 1999
(If there were no significance tests, they would be invented). In L. A. Harlow, S. A. Mulaik, and J. H.
Steiger (Eds.), What if there were no significance
tests? (pp. 117-144). Mahwah, NJ: Erlbaum.
Bakan,
D. (1966). The test of significance
in psychological research. Psychological Bulletin, 66, 1-29.
Bechhofer,
R. E. (1954). A single-sample
multiple decision procedure for ranking means of normal populations with known
variances. Annals of
Mathematical Statistics, 25, 16-39.
Berkson,
J. (1938). Some difficulties of
interpretation encountered in the application of the chi-square test. Journal of the American Statistical
Association, 33,
526-536.
Berkson,
J. (1942). Tests of significance
considered as evidence. Journal
of the American Statistical Association, 37, 325-335.
Bohrer, R. (1979). Multiple three-decision rules for parametric signs. Journal of the American Statistical Association, 74, 432-437.
Bulmer,
M. G. (1957). Confirming
statistical hypotheses. Journal
of the Royal Statistical Society, Series B, 19, 125-132.
Cohen,
J. (1990). Things I have learned
(so far). American
Psychologist, 45,
1304-1312.
Cohen,
J. (1994). The earth is round (p
< .05). American
Psychologist, 49,
997-1003.
Hagen,
R. L. (1997). In praise of the null
hypothesis statistical test. American Psychologist, 52, 15-24.
Harris, R. J. (1997). Reforming significance testing via three-valued logic. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), What if there were no significance tests? (pp. 145-174). Mahwah, NJ: Erlbaum.
Hodges, J. L. & Lehmann, E. L. (1954). Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society, Series B, 16, 261-268.
Jones,
L. V. (1952). Tests of hypotheses:
One-sided vs. two-sided alternatives.
Psychological Bulletin, 49, 43-46.
Jones,
L. V. (1955). Statistics and
research design. Annual Review
of Psychology, 6,
405-430.
Kaiser,
H. F. (1960). Directional
statistical decisions.
Psychological Review, 67,
160-167.
Lykken,
D. (1968). Statistical significance
in psychological research.
Psychological Bulletin, 70,
151-159.
Meehl,
P. E. (1967). Theory testing in psychology and physics: A methodological
paradox. Philosophy of
Science, 34,
103-115.
Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), What if there were no significance tests? (pp. 65-116). Mahwah, NJ: Erlbaum.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis
significance test. Psychological
Bulletin, 57,
416-428.
Schmidt, F. L. (1996). Statistical significance testing and
cumulative knowledge in psychology: Implications for training of
researchers. Psychological
Methods, 1,
115-129.
Schmidt, F. L. & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Erlbaum.
Tukey,
J. W. (1960). Conclusions vs. decisions. Technometrics, 2,
423-433. Also in L. V. Jones (Ed.),
(1986). The collected works of John W. Tukey, Volume III, Philosophy and
principles of data analysis:
1949-1964 (pp. 127-142).
Monterey, CA: Wadsworth & Brooks/Cole.
Tukey, J. W. (1991). The philosophy of multiple
comparisons. Statistical
Science, 6,
100-116.
Tukey, J. W. (1993). Where should multiple comparisons go next? In F. M. Hoppe (Ed.), Multiple comparisons, selection, and applications in biometry (pp. 187-208). New York: Dekker.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212-213.
Wilkinson, L., & the Task Force on
Statistical Inference. (1999).
Statistical methods in psychology journals: Guidelines and
explanations. American
Psychologist, 54,
594-604.
Williams,
V. S. L., Jones, L. V., &
Tukey, J. W. (1999). Controlling
error in multiple comparisons, with examples from state-to-state differences in
educational achievement. Journal
of Educational and Behavioral Statistics, 24, 42-69.
Author Note
Lyle
V. Jones, L. L. Thurstone Psychometric Laboratory, Department of Psychology, The
University of North Carolina at Chapel Hill; John W. Tukey, Princeton
University.
This
project was supported in part by the National Institute of Statistical Sciences
with a grant from the National Science Foundation (No. RED-9350005). The authors thank Thomas S. Wallsten
and Forrest W. Young for constructive criticisms of an earlier
draft.
Correspondence
concerning this article should be addressed to Lyle V. Jones, CB 3270, The
University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3270. Electronic mail may be sent to
lvjones@email.unc.edu.