A Sensible Formulation of the Significance Test

Lyle V. Jones
The University of North Carolina at Chapel Hill

and

John W. Tukey
Princeton University


Abstract

The conventional procedure for null hypothesis significance testing has long been the target of appropriate criticism. Here, a more reasonable alternative is proposed, one that not only avoids the unrealistic postulation of a null hypothesis but also, for a given parametric difference and a given error probability, is more likely to report the detection of that difference.


Procedures for testing the significance of the difference between two sample means have been widely utilized, and also widely criticized, throughout most of the 20th century. To better codify the procedures, the Neyman-Pearson approach to choosing between two point hypotheses emerged. While attractive mathematically, that approach is troublesome in practice, because investigators are often unwilling to specify a value for the alternative to the null hypothesis and, of equal importance, both hypotheses, like all point hypotheses about parameters in real-world populations that are subjected to different treatments, are always untrue when calculations are carried to enough decimal places.

The introduction of hypothesis testing did lead to an attempt to reformulate the significance test as a test of a point null hypothesis against all possible alternatives. However, that reformulation failed to silence the critics, e.g., Bakan (1966), Bechhofer (1954), Berkson (1938, 1942), Cohen (1994), Lykken (1968), Meehl (1967), Rozeboom (1960), Schmidt (1996), and Schmidt & Hunter (1997), among many others. Jones (1955, p. 407) suggested that "an investigator would be misled less frequently and would be more likely to obtain the information he seeks were he to formulate his experimental problems in terms of the estimation of population parameters, with the establishment of confidence intervals about the estimated values, rather than in terms of a null hypothesis against all possible alternatives." Many other critics have echoed that advice, to which we also subscribe, especially when the outcome measure is well defined.

However, when faced with several or many mean differences, i.e., with multiple comparisons, straightforward procedures for establishing confidence intervals may not be available. Also, in some situations, the scale of the outcome measure may be so untrustworthy that the primary interest may appropriately reside in just the direction of a treatment effect rather than in its size. Whenever, for these or for other reasons, a test of significance is to be performed, we propose an alternative to the conventional formulation. (Additional arguments in support of significance testing are set forth by Abelson, 1997, Hagen, 1997, Mulaik, Raju, & Harshman, 1997, and Wainer, 1999, among others.)

A common formulation for the conventional test-of-hypothesis version of the test of significance is the traditional test for the equality of two population means, using Student's t distribution. A null hypothesis is set forth,

H_0: µ_A - µ_B = 0

vs. an omnibus alternative,

H_1: µ_A - µ_B ≠ 0.

From samples of sizes n_A and n_B, an estimate of µ_A - µ_B, y_A - y_B, is obtained. Usually, y_A and y_B are sample means, but they might be sample medians, sample midmeans, or other estimates of µ_A and µ_B. An estimated standard error of y_A - y_B, s_d, is calculated and a statistic, (y_A - y_B)/s_d, is formed.

A two-tailed rejection region of the sampling distribution of this statistic, typically Student's t with df = n_A + n_B - 2, is established by setting a value for α, the probability of rejecting H_0 when it is true, often set at .05. When the t statistic falls in either tail of the rejection region, each with area α/2, H_0 is rejected in favor of H_1. The only allowable conclusions are (1) reject H_0, and (2) fail to reject that null hypothesis, thereby withholding judgment.
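As a concrete sketch, the conventional procedure amounts to the following (the samples and the helper name `pooled_t` are our illustrations, not part of the original; 2.101 is the tabled two-tailed critical value of Student's t for α = .05 with df = 18):

```python
import statistics as st

def pooled_t(a, b):
    """Two-sample Student t statistic with pooled (equal-variance) standard error."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    s_d = (sp2 * (1 / na + 1 / nb)) ** 0.5  # estimated standard error of y_A - y_B
    return (st.mean(a) - st.mean(b)) / s_d

# Hypothetical samples, n_A = n_B = 10, so df = 18.
A = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2, 5.5, 5.0]
B = [4.6, 4.9, 4.5, 4.7, 5.0, 4.4, 4.8, 4.6, 4.9, 4.7]
t = pooled_t(A, B)
crit = 2.101  # tabled two-tailed critical value, alpha = .05, df = 18
print(abs(t) > crit)  # conventional rule: reject H_0 when |t| exceeds the critical value
```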

Cohen (1994, p. 1001) advised, "don't look for a magic alternative to" this formulation of the null hypothesis significance test. "It doesn't exist." In the same paper, Cohen cited Tukey (1991), but seems to have overlooked the alternative suggested there (see also Tukey, 1993). As applied to multiple comparisons, that alternative has been more explicitly developed and illustrated in Williams, Jones, & Tukey (1999). As shown below, the formulation is entirely suitable for use with a single mean difference as well as with multiple comparisons.

When A and B are different treatments, µ_A and µ_B are certain to differ in some decimal place, so that µ_A - µ_B = 0 is known in advance to be false and µ_A - µ_B ≠ 0 is known to be true (Cohen, 1990; Tukey, 1991). An extensive rebuttal to this claim is provided by Hagen (1997), who states that "I agree that A and B will always produce differential effects on some variable or variables that theoretically could be measured. But I do not agree that A and B will always produce an effect on the dependent variable … " (p. 20). We simply do not accept that view. For large finite treatment populations, a total census is at least conceivable, and we cannot imagine an outcome for which µ_A - µ_B = 0 when the dependent variable (or any other variable) is measured to an indefinitely large number of decimal places. (We come to a similar conclusion with respect to the mean of a single population when H_0: µ = k, whether k = 0 or any other definite value.) For hypothetical treatment populations, µ_A - µ_B may approach zero as a limit, but, as for the approach of population sizes to infinity, the limit never is reached. The population mean difference may be trivially small, but will always be positive or negative. As a consequence, we should not set forth a null hypothesis, because to do so is unrealistic and misleading. Instead, we should assess the sample data and entertain one of three conclusions:

(1) act as if µ_A - µ_B > 0;

(2) act as if µ_A - µ_B < 0;

or

(3) act as if the sign of µ_A - µ_B is indefinite, i.e., is not (yet) determined.

This specification is similar to "the three alternative decisions" proposed by Tukey (1960, p. 425).

With this formulation, a conclusion is in error only when it is "a reversal," when it asserts one direction while the (unknown) truth is the other direction. Asserting that the direction is not yet established may constitute a wasted opportunity, but it is not an error. We want to control the rate of error, the reversal rate, while minimizing wasted opportunity, i.e., while minimizing indefinite results.
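In code, the three-conclusion rule is a three-way branch on the t statistic (a sketch; the function name and return labels are ours, and `crit` stands for the two-tailed critical value of Student's t at the chosen α):

```python
def three_way_conclusion(t, crit):
    """Three-conclusion significance test: no point null hypothesis is posited.

    `crit` is the two-tailed critical value of Student's t for the chosen alpha.
    """
    if t > crit:
        return "act as if mu_A - mu_B > 0"
    if t < -crit:
        return "act as if mu_A - mu_B < 0"
    # Not an error, merely a (possibly wasted) opportunity: judgment is withheld.
    return "sign indefinite (not yet determined)"

print(three_way_conclusion(4.54, 2.101))   # direction judged positive
print(three_way_conclusion(-0.80, 2.101))  # judgment withheld
```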

Consider the probability of erroneously concluding that the population mean difference is in one direction when in truth it is in the other. If µ_A - µ_B is positive, an error will occur only when the value of t falls in the lower tail of the distribution, with area α/2, which also is the limiting probability of error for that case. Likewise, if µ_A - µ_B is negative, an error occurs only when the t statistic falls in the upper tail of the distribution, also of area α/2. The reversal rate, also the overall probability of error, then is

P(error) ≤ (α/2) P[(µ_A - µ_B) > 0] + (α/2) P[(µ_A - µ_B) < 0]

≤ (α/2) P[(µ_A - µ_B) ≠ 0] ≤ α/2.

This P(error) is just one-half the probability of Type I error for the traditional null hypothesis test and is the same as that for a one-tailed test (see Jones, 1952). Yet the procedure is symmetric, equally sensitive to a mean difference in either direction.
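A small simulation can illustrate the bound (the sample sizes, the tiny positive true difference, and the repetition count are arbitrary choices of ours; 2.101 is the tabled two-tailed t for α = .05, df = 18):

```python
import random

random.seed(1)

def t_stat(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = ((((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) * (1 / na + 1 / nb)) ** 0.5
    return (ma - mb) / se

CRIT = 2.101       # tabled two-tailed Student's t, alpha = .05, df = 18
TRUE_DIFF = 0.001  # tiny but strictly positive true mean difference
REPS = 20000

# A reversal occurs when the truth is positive but t lands in the lower tail.
reversals = sum(
    t_stat([random.gauss(TRUE_DIFF, 1) for _ in range(10)],
           [random.gauss(0, 1) for _ in range(10)]) < -CRIT
    for _ in range(REPS)
)
rate = reversals / REPS
print(rate)  # near alpha/2 = .025, the limiting reversal rate
```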

A task force on statistical inference is currently preparing guidelines for standards to be adopted in psychology journals. In an earlier article, Wilkinson and the Task Force on Statistical Inference (1999) recommended that, when hypothesis tests have been performed, effect sizes and p values always should be presented. The adoption of this proposed three-alternative conclusion procedure lends itself to reporting the effect size as the estimated value of µ_A - µ_B (either standardized or not). Regardless of the size of that estimate, and regardless of whether or not the calculated value of t falls in the rejection region, it seems appropriate to report the p value as the area of the t distribution more positive or more negative (but not both) than the value of t obtained from (y_A - y_B)/s_d. (The limiting values of p then are 0, as the absolute value of t becomes indefinitely large, and 1/2, as the value of t approaches zero.)
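Such a directional p value might be computed as below (a sketch using the normal approximation to Student's t, which is adequate for large df; the exact t tail area, which depends on df, would be used in practice, and `directional_p` is our name):

```python
from statistics import NormalDist

def directional_p(t):
    """One-tailed 'directional' p: the area beyond t in the tail it points to.

    Uses the standard normal as a large-df approximation to Student's t.
    """
    return NormalDist().cdf(-abs(t))

print(directional_p(0.0))   # 0.5: sign completely indefinite
print(directional_p(4.54))  # small: strong evidence for the observed direction
```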

For any specified positive or negative population mean difference, the probability of a Type II error, of withholding judgment when the parametric difference is as specified, may be found in the usual way. For each specified difference, the probability of a Type II error is smaller than that for the conventional two-tailed test of significance. Thus, the proposed procedure is uniformly more powerful than the conventional procedure.

Hodges & Lehmann (1954) proposed a modification of the traditional Student test, converting it from a two-sided test of the null hypothesis to two one-sided tests. Kaiser (1960) proposed combining the two one-sided tests into a single test, but one with two directional alternative hypotheses, µ_A < µ_B and µ_A > µ_B. (For further discussion, see Bohrer, 1979, Bulmer, 1957, and Harris, 1997.) However, the unrealistic null hypothesis of zero mean difference in the population is included in these proposals, in contrast to the formulation above. By acknowledging the fiction of the null hypothesis, and following the implications of "every null hypothesis is false," our formulation yields, for any sample size and any value of α, a test with greater reported sensitivity to detect the direction of a difference between two population means.

Note that those accustomed to making "tests of hypotheses" at .05 would, using the procedures set forth here, do the same arithmetic but would describe their results as a "test of significance" at one-half of .05, i.e., at .025. Alternatively, to maintain at .05 the probability of acting as if the parametric difference is in one direction when, in fact, it is in the other, the investigator would employ the .10 tabled value of α.
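The gain in sensitivity from that relabeling can be illustrated by simulation (the effect size, sample sizes, and repetition count are our arbitrary choices; 1.734 and 2.101 are the tabled t values for two-tailed α = .10 and α = .05 with df = 18):

```python
import random

random.seed(2)

def t_stat(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = ((((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) * (1 / na + 1 / nb)) ** 0.5
    return (ma - mb) / se

def directional_rate(crit, diff=0.5, n=10, reps=20000):
    """Proportion of samples yielding a directional conclusion (|t| > crit)."""
    return sum(
        abs(t_stat([random.gauss(diff, 1) for _ in range(n)],
                   [random.gauss(0, 1) for _ in range(n)])) > crit
        for _ in range(reps)
    ) / reps

# Same data-generating process, two cutoffs with df = 18:
r_10 = directional_rate(1.734)  # tabled t for two-tailed .10 (reversal rate .05)
r_05 = directional_rate(2.101)  # conventional two-tailed .05
print(r_10, r_05)  # the .10 tabled value yields more directional conclusions
```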

In summary, then:

· Prefer confidence intervals when they are available.

· Recognize that point hypotheses, while mathematically convenient, are never fulfilled in practice.

· When performing a simple test of significance, seek one of three outcomes, as described above.

References

Abelson, R. P. (1997). A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented). In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), __What if there were no significance tests?__ (pp. 117-144). Mahwah, NJ: Erlbaum.

Bakan, D. (1966). The test of significance in psychological research. __Psychological Bulletin__, __66__, 1-29.

Bechhofer, R. E. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances. __Annals of Mathematical Statistics__, __25__, 16-39.

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. __Journal of the American Statistical Association__, __33__, 526-536.

Berkson, J. (1942). Tests of significance considered as evidence. __Journal of the American Statistical Association__, __37__, 325-335.

Bohrer, R. (1979). Multiple three-decision rules for parametric signs. __Journal of the American Statistical Association__, __74__, 432-437.

Bulmer, M. G. (1957). Confirming statistical hypotheses. __Journal of the Royal Statistical Society, Series B__, __19__, 125-132.

Cohen, J. (1990). Things I have learned (so far). __American Psychologist__, __45__, 1304-1312.

Cohen, J. (1994). The earth is round (p < .05). __American Psychologist__, __49__, 997-1003.

Hagen, R. L. (1997). In praise of the null hypothesis statistical test. __American Psychologist__, __52__, 15-24.

Harris, R. J. (1997). Reforming significance testing via three-valued logic. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), __What if there were no significance tests?__ (pp. 145-174). Mahwah, NJ: Erlbaum.

Hodges, J. L., & Lehmann, E. L. (1954). Testing the approximate validity of statistical hypotheses. __Journal of the Royal Statistical Society, Series B__, __16__, 261-268.

Jones, L. V. (1952). Tests of hypotheses: One-sided vs. two-sided alternatives. __Psychological Bulletin__, __49__, 43-46.

Jones, L. V. (1955). Statistics and research design. __Annual Review of Psychology__, __6__, 405-430.

Kaiser, H. F. (1960). Directional statistical decisions. __Psychological Review__, __67__, 160-167.

Lykken, D. (1968). Statistical significance in psychological research. __Psychological Bulletin__, __70__, 151-159.

Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. __Philosophy of Science__, __34__, 103-115.

Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), __What if there were no significance tests?__ (pp. 65-116). Mahwah, NJ: Erlbaum.

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. __Psychological Bulletin__, __57__, 416-428.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. __Psychological Methods__, __1__, 115-129.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. A. Harlow, S. A. Mulaik, and J. H. Steiger (Eds.), __What if there were no significance tests?__ (pp. 37-64). Mahwah, NJ: Erlbaum.

Tukey, J. W. (1960). Conclusions vs. decisions. __Technometrics__, __2__, 423-433. Also in L. V. Jones (Ed.), (1986). __The collected works of John W. Tukey, Volume III, Philosophy and principles of data analysis: 1949-1964__ (pp. 127-142). Monterey, CA: Wadsworth & Brooks/Cole.

Tukey, J. W. (1991). The philosophy of multiple comparisons. __Statistical Science__, __6__, 100-116.

Tukey, J. W. (1993). Where should multiple comparisons go next? In F. M. Hoppe (Ed.), __Multiple comparisons, selection, and applications in biometry__ (pp. 187-208). New York: Dekker.

Wainer, H. (1999). One cheer for null hypothesis significance testing. __Psychological Methods__, __4__, 212-213.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. __American Psychologist__, __54__, 594-604.

Williams, V. S. L., Jones, L. V., & Tukey, J. W. (1999). Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. __Journal of Educational and Behavioral Statistics__, __24__, 42-69.

Author Note

Lyle V. Jones, L. L. Thurstone Psychometric Laboratory, Department of Psychology, The University of North Carolina at Chapel Hill; John W. Tukey, Princeton University.

This project was supported in part by the National Institute of Statistical Sciences with a grant from the National Science Foundation (No. RED-9350005). The authors thank Thomas S. Wallsten and Forrest W. Young for constructive criticisms of an earlier draft.

Correspondence concerning this article should be addressed to Lyle V. Jones, CB 3270, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3270. Electronic mail may be sent to lvjones@email.unc.edu.