Provided by: libgenome-model-tools-music-perl_0.04-3_all

NAME

       CombinePvals - combining probabilities from independent tests of significance into a
       single aggregate figure

SYNOPSIS

               use CombinePvals;

               my $obj = CombinePvals->new ($reference_to_list_of_pvals);

               my $pval = $obj->method_name;

               my $pval = $obj->method_name (@arguments);

DESCRIPTION

       There are a variety of circumstances under which one might have a number of different
       kinds of tests and/or separate instances of the same kind of test for one particular null
       hypothesis, where each of these tests returns a p-value.  The problem is how to properly
       condense this list of probabilities into a single value so as to be able to make a
       statistical inference, e.g. whether to reject the null hypothesis.  This problem was
       examined heavily beginning around the 1930s, during which time numerous mathematical
       contingencies were treated, e.g. dependence vs. independence of tests, optimality, inter-
       test weighting, computational efficiency, continuous vs. discrete tests and combinations
       thereof, etc.  There is quite a large mathematical literature on this topic (see
       "REFERENCES" below) and any one particular situation might incur some of the above
       subtleties.  This package concentrates on some of the more straightforward scenarios,
       furnishing various methods for combining p-vals.  The main consideration will usually be
       the trade-off between the exactness of the p-value (according to strict frequentist
       modeling) and the computational efficiency, or even its actual feasibility.  Tests should
       be chosen with this factor in mind.

       Note also that this scenario of combining p-values (many tests of a single hypothesis) is
       fundamentally different from that in which many separate hypotheses are each tested.  The
       latter instance usually calls for some method of multiple testing correction.

REFERENCES

       Here is an abbreviated list of the substantive works on the topic of combining
       probabilities.

       •   Birnbaum, A. (1954) Combining Independent Tests of Significance, Journal of the
           American Statistical Association 49(267), 559-574.

       •   David, F. N. and Johnson, N. L. (1950) The Probability Integral Transformation When
           the Variable is Discontinuous, Biometrika 37(1/2), 42-49.

       •   Fisher, R. A. (1958) Statistical Methods for Research Workers, 13th Ed. Revised,
           Hafner Publishing Co., New York.

       •   Lancaster, H. O. (1949) The Combination of Probabilities Arising from Data in Discrete
           Distributions, Biometrika 36(3/4), 370-382.

       •   Littell, R. C. and Folks, J. L. (1971) Asymptotic Optimality of Fisher's Method of
           Combining Independent Tests, Journal of the American Statistical Association 66(336),
           802-806.

       •   Pearson, E. S. (1938) The Probability Integral Transformation for Testing Goodness of
           Fit and Combining Independent Tests of Significance, Biometrika 30(1/2), 134-148.

       •   Pearson, E. S. (1950) On Questions Raised by the Combination of Tests Based on
           Discontinuous Distributions, Biometrika 37(3/4), 383-398.

       •   Pearson, K. (1933) On a Method of Determining Whether a Sample Of Size N Supposed to
           Have Been Drawn From a Parent Population Having a Known Probability Integral Has
           Probably Been Drawn at Random, Biometrika 25(3/4), 379-410.

       •   Van Valen, L. (1964) Combining the Probabilities from Significance Tests, Nature
           201(4919), 642.

       •   Wallis, W. A. (1942) Compounding Probabilities from Independent Significance Tests,
           Econometrica 10(3/4), 229-248.

       •   Zelen, M. and Joel, L. S. (1959) The Weighted Compounding of Two Independent
           Significance Tests, Annals of Mathematical Statistics 30(4), 885-895.

AUTHOR

       Michael C. Wendl

       mwendl@wustl.edu

       Copyright (C) 2009 Washington University

       This program is free software; you can redistribute it and/or modify it under the terms of
       the GNU General Public License as published by the Free Software Foundation; either
       version 2 of the License, or (at your option) any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
       without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
       See the GNU General Public License for more details.

       You should have received a copy of the GNU General Public License along with this program;
       if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
       MA 02111-1307, USA.

GENERAL REMARKS ON METHODS

       The available methods are listed below.  Each of the computational techniques assumes
       that the tests, and therefore their associated p-values, are independent of one another,
       and none considers any form of differential weighting.

CONSTRUCTOR METHODS

       These methods return an object in the CombinePvals class.

   new
       This is the usual object constructor, which takes a mandatory reference to a list (in no
       particular order) of the p-values obtained by a set of independent tests.

               my $obj = CombinePvals->new ([0.103, 0.078, 0.03, 0.2,...]);

       The method checks that all elements are actual p-values, i.e. real numbers whose values
       lie between 0 and 1.
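
       For illustration, that check is roughly of the following form; this is a sketch only
       (using $reference_to_list_of_pvals from the synopsis), not necessarily the module's
       actual internal code.

               use Scalar::Util qw(looks_like_number);

               foreach my $p (@$reference_to_list_of_pvals) {
                  die "not a valid p-value: $p"
                     unless looks_like_number ($p) && $p >= 0 && $p <= 1;
               }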

EXACT ENUMERATIVE PROCEDURES FOR STRICTLY DISCRETE DISTRIBUTIONS

       When all the individual p-vals are derived from tests based on discrete distributions, the
       "standard" continuum methods cannot be used in the strictest sense.  Both Wallis (1942)
       and Lancaster (1949) discuss the option of full enumeration, which is only feasible when
       there are a limited number of p-values and the support of each test (the number of
       possible outcomes) is not too large.  Feasibility experiments are suggested, depending
       upon the type of hardware and the size of the calculation.
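
       One rough gauge, before committing to the exact routines below, is the number of outcome
       combinations a full enumeration must visit, i.e. the product of the support sizes of the
       individual tests.  The following sketch (not part of the module's interface) uses the two
       binomial tail lists from the Wallis example described below.

               # number of combinations a full enumeration must visit
               my @tail_lists = ([0.0625, 0.3125, 0.6875, 0.9375, 1], [0.125, 0.625, 1]);
               my $terms = 1;
               $terms *= scalar (@$_) foreach @tail_lists;
               print "full enumeration would visit $terms combinations\n";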

   exact_enum_arbitrary
       This routine is designed for combining p-values from completely arbitrary discrete
       probability distributions.  It takes a list-of-lists data structure, each list being the
       tail probabilities of one test ordered from most extreme to least extreme (i.e. as a
       cumulative distribution function).  However, the ordering
       of the lists themselves is not important.  For instance, Wallis (1942) gives the example
       of two binomials, a one-tailed test having tail values of 0.0625, 0.3125, 0.6875, 0.9375,
       and 1, and a two-tailed test having tail values 0.125, 0.625, and 1.  We would then call
       this method using

               my $pval = $obj->exact_enum_arbitrary (
                  [0.0625, 0.3125, 0.6875, 0.9375, 1],
                  [0.125, 0.625, 1]
               );

       The internal computational method is relatively straightforward and described in detail by
       Wallis (1942).  Note that this method does "all-by-all" multiplication, so it is the least
       efficient, although entirely exact.
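
       For reference, the enumeration itself can be sketched in a few lines of standalone code.
       The helper name, its arguments, and the tolerance below are illustrative assumptions, not
       the module's internals: following Wallis (1942), the combined p-value is taken as the
       total probability of every combination of outcomes whose product of tail values is at
       least as extreme as (i.e. no larger than) the observed product.

               # $tails    - reference to the list-of-lists of tail values, each list
               #             ordered from most extreme to least extreme (last value 1)
               # $observed - reference to the p-value actually obtained by each test
               sub wallis_enumeration {
                  my ($tails, $observed) = @_;
                  my $cutoff = 1;
                  $cutoff *= $_ foreach @$observed;

                  my ($pval, @stack) = (0, [1, 1, 0]);
                  while (my $node = pop @stack) {
                     my ($prob, $prod, $i) = @$node;
                     if ($i == @$tails) {
                        # leaf: count this combination if it is at least as extreme
                        $pval += $prob if $prod <= $cutoff + 1e-12;
                        next;
                     }
                     my $prev = 0;
                     foreach my $t (@{$tails->[$i]}) {
                        # point mass of an outcome = difference of consecutive tails
                        push @stack, [$prob * ($t - $prev), $prod * $t, $i + 1];
                        $prev = $t;
                     }
                  }
                  return $pval;
               }

       Here $observed would hold the p-values handed to the constructor, one per test.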

   exact_enum_identical
       This routine is designed for combining a set of p-values that all come from a single
       probability distribution.

               NOT IMPLEMENTED YET

TRANSFORMS FOR CONTINUOUS DISTRIBUTIONS

       The mathematical literature furnishes several straightforward options for combining p-vals
       if all of the distributions underlying all of the individual tests are continuous.

   fisher_chisq_transform
       This routine implements R.A. Fisher's (1958, originally 1932) chi-square transform method
       for combining p-vals from continuous distributions, which is essentially a CPU-efficient
       approximation of K. Pearson's log-based result (see e.g. Wallis (1942), p. 232).  Note that
       the underlying distributions are not actually relevant, so no arguments are passed.

               my $pval = $obj->fisher_chisq_transform;

       This is certainly the fastest and easiest method for combining p-vals, but its accuracy
       for discrete distributions will not usually be very good.  For such cases, an exact or a
       corrected method is a better choice.
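
       Internally, Fisher's transform amounts to summing -2 ln p over the k tests and referring
       the total to a chi-square distribution with 2k degrees of freedom.  A minimal standalone
       sketch follows, using the synopsis p-values and the CPAN module Statistics::Distributions
       for the chi-square tail; the package itself may obtain that tail differently.

               use Statistics::Distributions;

               my @pvals = (0.103, 0.078, 0.03, 0.2);
               my $chisq = 0;
               $chisq += -2 * log ($_) foreach @pvals;

               # upper tail of chi-square with 2k degrees of freedom
               my $combined =
                  Statistics::Distributions::chisqrprob (2 * scalar (@pvals), $chisq);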

CORRECTION PROCEDURES FOR DISCRETE DISTRIBUTIONS: LANCASTER'S MODELS

       Enumerative procedures quickly become infeasible if the number of tests and/or the support
       of each test grow large.  A number of procedures have been described for correcting the
       methodologies designed for continuum testing, mostly in the context of applying so-called
       continuity corrections.  Essentially, these seek to "spread" discrete data out into a
       pseudo-continuous configuration as appropriately as possible, and then apply standard
       transforms.  Accuracy varies and should be suitably established in each case.

       The methods in this section are due to H.O. Lancaster (1949), who discussed two
       corrections based upon the idea of describing how a chi-square transformed statistic
       varies between the points of a discrete distribution.  Unfortunately, these methods
       require one to pass some extra information to the routines, i.e. not only the CDF (the
       p-val of each test), but the CDF value associated with the next-most-extreme statistic.
       These two pieces of information are the basis of the interpolation.  For example, if an
       underlying distribution has the possible tail values of 0.0625, 0.3125, 0.6875, 0.9375, 1
       and the test itself has a value of 0.6875, then you would pass both 0.3125 and 0.6875 to
       the routine.  In all cases, the lower value, i.e. the more extreme one, precedes the
       higher value in the argument list.  While there generally will be some extra inconvenience in
       obtaining this information, the accuracy is much improved over Fisher's method.
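
       As a concrete illustration of the interpolation, the sketch below computes the per-test
       chi-square contribution for the (0.3125, 0.6875) pair mentioned above under both
       corrections: the mean of -2 ln u over the interval between the two tail values, and
       -2 ln of the interval's midpoint.  This is one reading of Lancaster's (1949) corrections,
       given for orientation only; the module's own arithmetic may differ in detail.

               my ($lower, $upper) = (0.3125, 0.6875);    # pair from the example above

               # mean correction: mean of -2 ln u over (lower, upper), i.e.
               #   2 - 2 (upper ln upper - lower ln lower) / (upper - lower)
               my $chisq_mean = 2 - 2 * ($upper * log ($upper) - $lower * log ($lower))
                                      / ($upper - $lower);

               # median correction: -2 ln of the interval's midpoint
               my $chisq_median = -2 * log (($lower + $upper) / 2);

       Each test's contribution would then be summed and referred to a chi-square distribution
       with 2k degrees of freedom, as in Fisher's transform above.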

   lancaster_mean_corrected_transform
       This method is based on the mean value of the chi-squared transformed statistic.

               my $pval = $obj->lancaster_mean_corrected_transform (@cdf_pairs);

       Its accuracy is good, but the method is not strictly defined if one of the tests has
       either the most extreme or second-to-most-extreme statistic.

   lancaster_median_corrected_transform
       This method is based on the median value of the chi-squared transformed statistic.

               my $pval = $obj->lancaster_median_corrected_transform (@cdf_pairs);

       Its accuracy may sometimes not be quite as good as that of the mean correction, but the
       method is strictly defined for all values of the statistic.

   lancaster_mixed_corrected_transform
       This method is a mixture of both the mean and median methods.  Specifically, mean
       correction is used wherever it is well-defined, otherwise median correction is used.

               my $pval = $obj->lancaster_mixed_corrected_transform (@cdf_pairs);

       This is generally a good compromise: it retains the accuracy of the mean correction
       wherever that correction is defined, while remaining usable when a test takes one of the
       extreme values for which the mean correction is not.

   additional methods
       The basic functionality of this package is encompassed in the methods described above.
       However, some lower-level functions can also sometimes be useful.

       exact_enum_arbitrary_2

       Hard-wired precursor of exact_enum_arbitrary for 2 distributions.  Does no pre-checking,
       but may be useful for comparing to the output of the general program.

       exact_enum_arbitrary_3

       Hard-wired precursor of exact_enum_arbitrary for 3 distributions.  Does no pre-checking,
       but may be useful for comparing to the output of the general program.

       binom_coeffs

       Calculates the binomial coefficients needed in the binomial (convolution) approximate
       solution.

               $pmobj->binom_coeffs;

       The internal data structure is essentially the symmetric half of the appropriately-sized
       Pascal triangle.  Considerable memory is saved by not storing the full triangle.
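
       The half-triangle storage can be sketched as follows; the size and the lookup helper are
       illustrative assumptions, not the module's actual layout.

               # keep only the left half (indices 0 .. int($row/2)) of each row of
               # Pascal's triangle; the right half follows from C(n,k) = C(n,n-k)
               my $n_max = 10;
               my @half  = ([1]);                      # row 0
               for my $row (1 .. $n_max) {
                  my $prev = $half[$row - 1];
                  my @cur  = (1);                      # C($row, 0)
                  for my $k (1 .. int ($row / 2)) {
                     my $ri = $k <= ($row - 1) - $k ? $k : ($row - 1) - $k;
                     push @cur, $prev->[$k - 1] + $prev->[$ri];
                  }
                  $half[$row] = \@cur;
               }

               sub choose {                            # look up C($n, $k)
                  my ($n, $k) = @_;
                  $k = $n - $k if $k > $n - $k;        # exploit symmetry
                  return $half[$n][$k];
               }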
