Provided by: libgenome-model-tools-music-perl_0.04-4_all


       PopulationPathScan - apply PathScan test to populations rather than just single
       individuals


               use PopulationPathScan;

               my $obj = PopulationPathScan->new ($ref_to_list_of_gene_lengths);

               $obj->assign ($number_of_compartments);

               $obj->preprocess ($background_mutation_rate);

               $pval = $obj->population_pval_approx ($ref_to_list_of_hits_per_sample);
               $pval = $obj->population_pval_exact ($ref_to_list_of_hits_per_sample);


       The "PathScan" package is implemented strictly as a test of a set of genes, e.g. a
       pathway, for a single individual.  Specifically, knowing the gene lengths in the pathway,
       the number of genes that have at least one mutation, and the estimated background mutation
       rate, one can test the null hypothesis that these observed mutations are well-explained
       simply by the mechanism of random background mutation.  However, it will often be the case
       that data for a pathway will be available for many individuals, meaning that we now have
       many tests of the given (single) hypothesis.  (This should not be confused with the
       scenario of multiple hypothesis testing.)  The set of p-values contains much more
       information than any single value, so significance should be judged on the basis of
       the collective result.  For example, even if no single p-value by itself reaches the
       chosen significance threshold, the overall set of probabilities may still indicate
       significance.  Properly combining such numbers is necessary, but not entirely trivial.
       This package serves as a high-level interface that first performs the individual tests
       using the methods of "PathScan", and then combines the resulting p-values using the
       methods of "CombinePvals".


       Michael C. Wendl

       Copyright (C) 2009 Washington University

       This program is free software; you can redistribute it and/or modify it under the terms of
       the GNU General Public License as published by the Free Software Foundation; either
       version 2 of the License, or (at your option) any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
       without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
       See the GNU General Public License for more details.

       You should have received a copy of the GNU General Public License along with this program;
       if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
       MA 02111-1307, USA.


       The available methods are listed below.

       The object constructor, "new", takes a mandatory reference to a list of gene lengths,
       in any order, comprising the biological group (e.g. a pathway) whose mutation
       significance is to be analyzed using the PathScan paradigm.

               my $obj = PopulationPathScan->new ([474, 1038, 285, ...]);

       The method checks to make sure that all elements are legitimate lengths, i.e. integers
       exceeding 3.
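
       The length check described above can be sketched in a few lines.  This is only an
       illustration of the stated rule (integers exceeding 3); the helper name
       "invalid_gene_lengths" is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PopulationPathScan): returns the entries
# that are NOT legitimate lengths, i.e. not integers exceeding 3.
sub invalid_gene_lengths {
    my ($lengths) = @_;
    return grep { !defined $_ || $_ !~ /^\d+\z/ || $_ <= 3 } @$lengths;
}

my @rejected = invalid_gene_lengths([474, 1038, 285, 3, 12.5, 'abc']);
print "rejected: @rejected\n";
```

       Running such a check before calling the constructor makes it easier to report
       exactly which entries of a long gene list are malformed.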

       The "assign" method sets how genes will be organized internally before they are passed
       to the PathScan calculation component.  The main consideration is how the list may be
       compartmentalized for greater computational efficiency in the PathScan calculation, at
       some loss of accuracy.  If the gene list is long, exact calculation is generally
       infeasible.  The method takes a single argument giving the number of compartments
       (or sub-lists) into which the lengths will be divided: 1 means a single list, i.e.
       exact computation, 2 means two lists, 3 three lists, etc.

               $obj->assign (3);

       The values are then organized internally such that the smallest genes are grouped
       together, then the slightly larger ones, and so forth.  Generally, 3 or 4 lists give
       a reasonable balance between accuracy and computation time (Wendl et al., in progress).
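
       The grouping by size can be pictured with a minimal sketch.  It assumes that the
       sorted lengths are cut into contiguous groups of nearly equal size; the module may
       balance its compartments differently, and the helper name "group_by_length" is
       hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: sort lengths ascending and cut them into $n contiguous
# groups whose sizes differ by at most one (an assumed balancing rule).
sub group_by_length {
    my ($lengths, $n) = @_;
    my @sorted = sort { $a <=> $b } @$lengths;
    my $count  = scalar @sorted;
    my $base   = int($count / $n);    # minimum group size
    my $extra  = $count % $n;         # the first $extra groups get one more
    my @groups;
    for my $i (0 .. $n - 1) {
        my $size = $base + ($i < $extra ? 1 : 0);
        push @groups, [ splice(@sorted, 0, $size) ];
    }
    return \@groups;
}

my $groups = group_by_length([474, 1038, 285, 912, 1500, 360, 2210], 3);
print join(' | ', map { "@$_" } @$groups), "\n";
# prints "285 360 474 | 912 1038 | 1500 2210"
```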

       The "preprocess" method prepares the population-level calculation: it sets up and
       executes the PathScan module to obtain the CDF associated with the given gene set and
       background mutation rate.  It takes the background mutation rate as its argument.

               $obj->preprocess (0.0000027);

       The CPU time required by this method varies with the level of accuracy and the number
       of genes in the calculation.

       The method optionally takes the list of the number of mutated genes in the group for each
       sample as a second argument, if this information is known at this point

               $obj->preprocess (0.0000027, [4, 5, 7, 3, 0, ...]);

       and it is usually better to use this form because the internals will compute only a
       truncated CDF that is just sufficient to process this list, rather than computing the full
       CDF.  Not only is speed improved, but this helps avoid overflow errors for large pathways.
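
       Why a truncated CDF suffices can be seen in a small sketch.  It rests on assumptions
       that are not taken from the module's code: genes are treated as independent, a gene of
       length L carries at least one mutation with probability 1 - (1 - rate)^L, and the
       p-value for k mutated genes is the upper tail P(X >= k) = 1 - P(X <= k - 1).  Under
       those assumptions nothing beyond the largest observed hit count is ever needed.  The
       names "truncated_pmf" and "tail_pval" are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Sketch: probabilities P(X = j) for j = 0 .. $kmax only, where X counts the
# genes with at least one mutation. Assumes independent genes, each mutated
# with probability 1 - (1 - rate)^length. Not the module's actual internals.
sub truncated_pmf {
    my ($lengths, $rate, $kmax) = @_;
    my @pmf = (1, (0) x $kmax);
    for my $len (@$lengths) {
        my $p = 1 - (1 - $rate) ** $len;
        # Update in place from the top so each gene is counted once.
        for (my $j = $kmax; $j >= 1; $j--) {
            $pmf[$j] = $pmf[$j] * (1 - $p) + $pmf[$j - 1] * $p;
        }
        $pmf[0] *= (1 - $p);
    }
    return \@pmf;
}

# Upper tail P(X >= k) computed from the truncated probabilities.
sub tail_pval {
    my ($pmf, $k) = @_;
    my $below = 0;
    $below += $pmf->[$_] for 0 .. $k - 1;
    return 1 - $below;
}

my @hits = (2, 0, 1);
my $pmf  = truncated_pmf([474, 1038, 285, 912], 0.0000027, max(@hits));
printf "%d hit(s): p = %.3g\n", $_, tail_pval($pmf, $_) for @hits;
```

       The work per gene is proportional to the largest hit count rather than to the number
       of genes, which is where the saving comes from in this sketch.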

       The "population_pval_exact" method performs the population-level calculation using
       exact enumeration.  It takes a reference to the list of the number of mutated genes in
       the group for each sample (e.g. each patient's whole genome sequence), for example

               patient 1: 4 genes in the pathway are mutated
               patient 2: 5 genes in the pathway are mutated
               patient 3: 7 genes in the pathway are mutated
               patient 4: 3 genes in the pathway are mutated
               patient 5: 0 genes in the pathway are mutated
                 :     :  :   :    :  :     :     :     :

       which is invoked as

               $pval = $obj->population_pval_exact ([4, 5, 7, 3, 0, ...]);

       Most scenarios cannot use this method because enumerating all possible cases is rarely
       computationally feasible.  It is mostly useful for examining small test cases.
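
       The cost of enumeration is easy to see in a sketch.  The code below assumes that the
       test statistic is the product of the per-sample tail p-values and that the population
       p-value is the total null probability of all outcomes whose product is no larger than
       the observed one.  This is an illustration under those assumptions, not the module's
       code, and the name "exact_population_pval" is hypothetical.

```perl
use strict;
use warnings;

# Sketch: $pmf->[k] is the null probability that one sample has exactly k
# mutated genes; $hits is the list of observed counts, one per sample.
sub exact_population_pval {
    my ($pmf, $hits) = @_;
    my @tail;                              # $tail[k] = P(X >= k)
    my $acc = 0;
    for my $k (reverse 0 .. $#{$pmf}) {
        $acc += $pmf->[$k];
        $tail[$k] = $acc;
    }
    my $observed = 1;
    $observed *= $tail[$_] for @$hits;

    my $total = 0;
    my $walk;                              # recursive walk over all outcomes
    $walk = sub {
        my ($depth, $prob, $stat) = @_;
        if ($depth == @$hits) {
            # Small tolerance guards against floating-point ties.
            $total += $prob if $stat <= $observed * (1 + 1e-12);
            return;
        }
        for my $k (0 .. $#{$pmf}) {
            $walk->($depth + 1, $prob * $pmf->[$k], $stat * $tail[$k]);
        }
    };
    $walk->(0, 1, 1);
    undef $walk;                           # break the closure's self-reference
    return $total;
}

printf "p = %.4f\n", exact_population_pval([0.6, 0.3, 0.1], [2, 1, 0]);
```

       With n genes and m samples the walk visits (n + 1)^m outcomes, so even modest pathway
       and cohort sizes put this approach out of reach.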

       The "population_pval_approx" method performs the population-level calculation using
       Lancaster's approximate transform correction.  It takes, as a mandatory argument, a
       reference to the list of the number of mutated genes in the group for each sample,
       e.g. each patient's whole genome sequence.

               $pval = $obj->population_pval_approx ([4, 5, 7, 3, 0, ...]);

       You must pass the list of hits, even if you already passed it earlier to the
       "preprocess" method.  Most cases will use this method because exact combination of
       individual probability values is rarely computationally feasible.  Note that Lancaster's
       method typically gives much more accurate results than Fisher's "standard" chi-square
       transform.
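
       For comparison, Fisher's "standard" transform is short enough to sketch.  It combines
       k independent p-values through X = -2 * sum(ln p_i), which follows a chi-square
       distribution with 2k degrees of freedom when the p-values are continuous.  The code
       below implements Fisher's transform only, not Lancaster's correction and not what
       "population_pval_approx" computes; the name "fisher_combined_pval" is hypothetical.

```perl
use strict;
use warnings;

# Fisher's chi-square transform for k independent p-values (each must be > 0).
sub fisher_combined_pval {
    my (@pvals) = @_;
    my $half_x = 0;                        # X / 2 = -sum(ln p_i)
    $half_x -= log($_) for @pvals;
    # For even degrees of freedom 2k the upper tail has a closed form:
    #   exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    my ($term, $sum) = (1, 0);
    for my $j (0 .. $#pvals) {
        $term *= $half_x / $j if $j > 0;
        $sum  += $term;
    }
    return exp(-$half_x) * $sum;
}

printf "combined p = %.4f\n", fisher_combined_pval(0.08, 0.11, 0.06, 0.20);
```

       PathScan p-values are discrete, because hit counts are integers, so the chi-square
       reference distribution holds only approximately for them.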

       ·   Fisher, R. A. (1958) Statistical Methods for Research Workers, 13th ed., revised,
           Hafner Publishing Co., New York.

       ·   Lancaster, H. O. (1949) The Combination of Probabilities Arising from Data in Discrete
           Distributions, Biometrika 36(3/4), 370-382.

perl v5.26.2                       Genome::Model::Tools::Music::PathScan::PopulationPathScan(3pm)