Provided by: libpdl-stats-perl_0.82-1_amd64

NAME

       PDL::Stats::Basic -- basic statistics and related utilities such as standard deviation,
       Pearson correlation, and t-tests.

DESCRIPTION

       The terms FUNCTIONS and METHODS are used here, somewhat arbitrarily, to distinguish
       routines that are threadable (FUNCTIONS) from routines that are NOT threadable (METHODS).

       This module does not provide mean or median functions; see SEE ALSO.

SYNOPSIS

           use PDL::LiteF;
           use PDL::NiceSlice;
           use PDL::Stats::Basic;

           my $stdv = $data->stdv;

       or

           my $stdv = stdv( $data );

FUNCTIONS

   stdv
         Signature: (a(n); float+ [o]b())

       Sample standard deviation.

       stdv processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   stdv_unbiased
         Signature: (a(n); float+ [o]b())

       Unbiased estimate of population standard deviation.

       stdv_unbiased processes bad values.  It will set the bad-value flag of all output ndarrays
       if the flag is set for any of the input ndarrays.

   var
         Signature: (a(n); float+ [o]b())

       Sample variance.

       var processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   var_unbiased
         Signature: (a(n); float+ [o]b())

       Unbiased estimate of population variance.

       var_unbiased processes bad values.  It will set the bad-value flag of all output ndarrays
       if the flag is set for any of the input ndarrays.
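
       The difference between the "sample" and "unbiased" variants above can be sketched in
       plain Perl. This is an illustration only, not the library's code: it assumes the sample
       versions divide the sum of squared deviations by N and the unbiased versions by N - 1,
       and the helper name ss is reused here purely for readability.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of squared deviations from the mean (plain-Perl stand-in, not PDL::Stats::ss)
sub ss {
    my @x    = @_;
    my $mean = sum(@x) / @x;
    return sum( map { ($_ - $mean) ** 2 } @x );
}

my @data = (1, 2, 3, 4, 5);
my $n    = @data;

my $var           = ss(@data) / $n;          # assumed: sample variance divides by N
my $var_unbiased  = ss(@data) / ($n - 1);    # assumed: unbiased estimate divides by N - 1
my $stdv          = sqrt $var;
my $stdv_unbiased = sqrt $var_unbiased;

printf "var=%.4f var_unbiased=%.4f stdv=%.4f stdv_unbiased=%.4f\n",
    $var, $var_unbiased, $stdv, $stdv_unbiased;
```

       The two estimates converge as N grows; the distinction matters mainly for small samples.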

   se
         Signature: (a(n); float+ [o]b())

       Standard error of the mean. Useful for calculating confidence intervals.

           # 95% confidence interval for samples with large N

           $ci_95_upper = $data->average + 1.96 * $data->se;
           $ci_95_lower = $data->average - 1.96 * $data->se;

       se processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.
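
       As a worked illustration of the confidence interval above, the following plain-Perl
       sketch assumes the usual definition of the standard error, the unbiased standard
       deviation divided by sqrt(N). That definition is an assumption here, not taken from the
       library source.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @data = (1, 2, 3, 4, 5);
my $n    = @data;
my $mean = sum(@data) / $n;

# Assumed definition: unbiased standard deviation divided by sqrt(N)
my $sd_unbiased = sqrt( sum( map { ($_ - $mean) ** 2 } @data ) / ($n - 1) );
my $se          = $sd_unbiased / sqrt($n);

# 1.96 is the normal quantile; for an N this small a t quantile would be more appropriate
my $z = 1.96;
my ($lower, $upper) = ( $mean - $z * $se, $mean + $z * $se );

printf "mean=%.4f se=%.4f ci95=[%.4f, %.4f]\n", $mean, $se, $lower, $upper;
```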

   ss
         Signature: (a(n); float+ [o]b())

       Sum of squared deviations from the mean.

       ss processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   skew
         Signature: (a(n); float+ [o]b())

       Sample skewness, a measure of asymmetry in the data. Skewness is 0 for a normal
       distribution.

       skew processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   skew_unbiased
         Signature: (a(n); float+ [o]b())

       Unbiased estimate of population skewness. This is the value reported by Gnumeric's
       Descriptive Statistics tool.

       skew_unbiased processes bad values.  It will set the bad-value flag of all output ndarrays
       if the flag is set for any of the input ndarrays.

   kurt
         Signature: (a(n); float+ [o]b())

       Sample kurtosis, a measure of the "peakedness" of the data. Kurtosis is 0 for a normal
       distribution, i.e. this is excess kurtosis.

       kurt processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   kurt_unbiased
         Signature: (a(n); float+ [o]b())

       Unbiased estimate of population kurtosis. This is the value reported by Gnumeric's
       Descriptive Statistics tool.

       kurt_unbiased processes bad values.  It will set the bad-value flag of all output ndarrays
       if the flag is set for any of the input ndarrays.
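
       The sample skewness and kurtosis can be illustrated with central moments. The sketch
       below is plain Perl and assumes the moment-based definitions with 1/N denominators
       (third moment over the 1.5 power of the second, and fourth moment over the square of the
       second, minus 3). The helper central_moment is hypothetical and the unbiased variants
       are not shown.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# k-th central moment with a 1/N denominator (hypothetical helper)
sub central_moment {
    my ($k, @x) = @_;
    my $mean = sum(@x) / @x;
    return sum( map { ($_ - $mean) ** $k } @x ) / @x;
}

my @data = (1, 2, 3, 4, 5);    # symmetric around its mean of 3
my $m2   = central_moment(2, @data);

my $skew = central_moment(3, @data) / $m2 ** 1.5;
my $kurt = central_moment(4, @data) / $m2 ** 2 - 3;    # excess kurtosis

printf "skew=%.4f kurt=%.4f\n", $skew, $kurt;
```

       The symmetric data give a skewness of 0, and the flat, evenly spaced values give a
       negative excess kurtosis.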

   cov
         Signature: (a(n); b(n); float+ [o]c())

       Sample covariance. See corr for ways to call it.

       cov processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.

   cov_table
         Signature: (a(n,m); float+ [o]c(m,m))

       Square covariance table. Gives the same result as threading using cov, but it calculates
       only half the square and is therefore much faster. It is also easier to use with
       higher-dimensional pdls.

       Usage:

           # 5 obs x 3 var, 2 such data tables

           perldl> $a = random 5, 3, 2

           perldl> p $cov = $a->cov_table
           [
            [
             [ 8.9636438 -1.8624472 -1.2416588]
             [-1.8624472  14.341514 -1.4245366]
             [-1.2416588 -1.4245366  9.8690655]
            ]
            [
             [   10.32644 -0.31311789 -0.95643674]
             [-0.31311789   15.051779  -7.2759577]
             [-0.95643674  -7.2759577   5.4465141]
            ]
           ]
           # diagonal elements of the cov table are the variances
           perldl> p $a->var
           [
            [ 8.9636438  14.341514  9.8690655]
            [  10.32644  15.051779  5.4465141]
           ]

       To get the same covariance table using cov:

           perldl> p $a->dummy(2)->cov($a->dummy(1))

       cov_table processes bad values.  It will set the bad-value flag of all output ndarrays if
       the flag is set for any of the input ndarrays.

   corr
         Signature: (a(n); b(n); float+ [o]c())

       Pearson correlation coefficient. r = cov(X,Y) / (stdv(X) * stdv(Y)).

       Usage:

           perldl> $a = random 5, 3
           perldl> $b = sequence 5,3
           perldl> p $a->corr($b)

           [0.20934208 0.30949881 0.26713007]

       For a square correlation table:

           perldl> p $a->corr($a->dummy(1))

           [
            [           1  -0.41995259 -0.029301192]
            [ -0.41995259            1  -0.61927619]
            [-0.029301192  -0.61927619            1]
           ]

       However, it is easier and faster to use corr_table.

       corr processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.
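
       The formula r = cov(X,Y) / (stdv(X) * stdv(Y)) can be checked by hand with a small
       plain-Perl sketch. The helpers mean, cov_n and pearson_r are hypothetical names used
       only for this illustration. Covariance and variance both use a 1/N denominator here, and
       the denominators cancel in r.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { return sum(@_) / @_ }

# Covariance with a 1/N denominator; cov_n($x, $x) is then the variance of $x
sub cov_n {
    my ($x, $y) = @_;
    my ($mean_x, $mean_y) = ( mean(@$x), mean(@$y) );
    return sum( map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $#$x ) / @$x;
}

sub pearson_r {
    my ($x, $y) = @_;
    return cov_n($x, $y) / sqrt( cov_n($x, $x) * cov_n($y, $y) );
}

my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $r = pearson_r(\@x, \@y);

printf "r = %.6f\n", $r;    # 6 / sqrt(60)
```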

   corr_table
         Signature: (a(n,m); float+ [o]c(m,m))

       Square Pearson correlation table. Gives the same result as threading using corr, but it
       calculates only half the square and is therefore much faster. It is also easier to use
       with higher-dimensional pdls.

       Usage:

           # 5 obs x 3 var, 2 such data tables

           perldl> $a = random 5, 3, 2

           perldl> p $a->corr_table
           [
            [
             [          1 -0.69835951 -0.18549048]
             [-0.69835951           1  0.72481605]
             [-0.18549048  0.72481605           1]
            ]
            [
             [          1  0.82722569 -0.71779883]
             [ 0.82722569           1 -0.63938828]
             [-0.71779883 -0.63938828           1]
            ]
           ]

       To get the same result using corr:

           perldl> p $a->dummy(2)->corr($a->dummy(1))

       This is also how to use t_corr and n_pair with such a table.

       corr_table processes bad values.  It will set the bad-value flag of all output ndarrays if
       the flag is set for any of the input ndarrays.

   t_corr
         Signature: (r(); n(); [o]t())

           $corr   = $data->corr( $data->dummy(1) );
           $n      = $data->n_pair( $data->dummy(1) );
           $t_corr = $corr->t_corr( $n );

           use PDL::GSL::CDF;

           $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t_corr->abs, $n-2 ));

       t significance test for Pearson correlations.

       t_corr processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.
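
       For reference, the standard t statistic for a Pearson correlation r computed from n
       pairs is t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom. The plain-Perl
       sketch below illustrates that textbook formula; t_from_r is a hypothetical helper and
       not the library routine.

```perl
use strict;
use warnings;

# Textbook t statistic for a Pearson r from n pairs; df = n - 2
sub t_from_r {
    my ($r, $n) = @_;
    return $r * sqrt( ($n - 2) / (1 - $r ** 2) );
}

my ($r, $n) = (0.6, 27);
my $t  = t_from_r($r, $n);
my $df = $n - 2;

printf "t = %.4f, df = %d\n", $t, $df;
```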

   n_pair
         Signature: (a(n); b(n); indx [o]c())

       Returns the number of good pairs between two lists, i.e. positions where neither value is
       bad. Useful with corr, especially when bad values are involved.

       n_pair processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.
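
       The idea of a "good pair" can be sketched in plain Perl, using undef as a stand-in for a
       BAD value. The helper count_good_pairs is hypothetical and only illustrates the
       counting rule: a position counts when both lists hold a usable value there.

```perl
use strict;
use warnings;

# undef stands in for a BAD value in this sketch
sub count_good_pairs {
    my ($x, $y) = @_;
    return scalar grep { defined( $x->[$_] ) && defined( $y->[$_] ) } 0 .. $#$x;
}

my @x = (1, undef, 3, 4,     5);
my @y = (2, 4,     6, undef, 10);

my $n_good = count_good_pairs(\@x, \@y);    # positions 0, 2 and 4 qualify
print "$n_good\n";
```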

   corr_dev
         Signature: (a(n); b(n); float+ [o]c())

           $corr = $a->dev_m->corr_dev($b->dev_m);

       Calculates correlations from dev_m values (deviations from the mean). This appears to be
       faster than computing corr from the original values when the data pdl is large.

       corr_dev processes bad values.  It will set the bad-value flag of all output ndarrays if
       the flag is set for any of the input ndarrays.

   t_test
         Signature: (a(n); b(m); float+ [o]t(); [o]d())

           my ($t, $df) = t_test( $pdl1, $pdl2 );

           use PDL::GSL::CDF;

           my $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t->abs, $df ));

       Independent sample t-test, assuming equal variances.

       t_test processes bad values.  It will set the bad-value flag of all output ndarrays if the
       flag is set for any of the input ndarrays.
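
       The equal-variance test can be illustrated with the textbook pooled-variance formula.
       The plain-Perl sketch below is not the library's code; the helper names and the sign
       convention (first mean minus second mean) are assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { return sum(@_) / @_ }
sub ss   { my $m = mean(@_); return sum( map { ($_ - $m) ** 2 } @_ ) }

# Pooled-variance (equal-variance) two-sample t statistic and its df
sub t_pooled {
    my ($x, $y) = @_;
    my ($nx, $ny) = ( scalar @$x, scalar @$y );
    my $df     = $nx + $ny - 2;
    my $pooled = ( ss(@$x) + ss(@$y) ) / $df;
    my $t      = ( mean(@$x) - mean(@$y) ) / sqrt( $pooled * (1 / $nx + 1 / $ny) );
    return ($t, $df);
}

my ($t, $df) = t_pooled( [1, 2, 3, 4, 5], [2, 4, 6, 8, 10] );
printf "t = %.4f, df = %d\n", $t, $df;
```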

   t_test_nev
         Signature: (a(n); b(m); float+ [o]t(); [o]d())

       Independent sample t-test, NOT assuming equal variances, i.e. Welch's two-sample t-test.
       Df follows the Welch-Satterthwaite equation instead of Satterthwaite (1946, as cited by
       Hays, 1994, 5th ed.). It matches Gnumeric, which matches R.

           my ($t, $df) = $pdl1->t_test_nev( $pdl2 );

       t_test_nev processes bad values.  It will set the bad-value flag of all output ndarrays if
       the flag is set for any of the input ndarrays.
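
       The Welch statistic and the Welch-Satterthwaite degrees of freedom can be written out in
       plain Perl as follows. This is a sketch of the textbook formulas, with hypothetical
       helper names. It is not taken from the library source.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean  { return sum(@_) / @_ }
sub var_u { my $m = mean(@_); return sum( map { ($_ - $m) ** 2 } @_ ) / (@_ - 1) }

# Welch's t and the Welch-Satterthwaite approximation for the degrees of freedom
sub t_welch {
    my ($x, $y) = @_;
    my $vx = var_u(@$x) / @$x;    # squared standard error of each mean
    my $vy = var_u(@$y) / @$y;
    my $t  = ( mean(@$x) - mean(@$y) ) / sqrt( $vx + $vy );
    my $df = ( $vx + $vy ) ** 2
           / ( $vx ** 2 / (@$x - 1) + $vy ** 2 / (@$y - 1) );
    return ($t, $df);
}

my ($t, $df) = t_welch( [1, 2, 3, 4, 5], [2, 4, 6, 8, 10] );
printf "t = %.4f, df = %.4f\n", $t, $df;
```

       With equal group sizes the t value equals the pooled-variance one, but the degrees of
       freedom drop below n1 + n2 - 2 when the variances differ.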

   t_test_paired
         Signature: (a(n); b(n); float+ [o]t(); [o]d())

       Paired sample t-test.

       t_test_paired processes bad values.  It will set the bad-value flag of all output ndarrays
       if the flag is set for any of the input ndarrays.
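
       A paired test reduces to a one-sample test on the pairwise differences. The plain-Perl
       sketch below shows the textbook computation with n - 1 degrees of freedom; t_paired is a
       hypothetical helper, and the sign convention (first minus second) is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Paired t: mean difference divided by its standard error; df = n - 1
sub t_paired {
    my ($x, $y) = @_;
    my @d      = map { $x->[$_] - $y->[$_] } 0 .. $#$x;
    my $n      = @d;
    my $mean_d = sum(@d) / $n;
    my $sd_d   = sqrt( sum( map { ($_ - $mean_d) ** 2 } @d ) / ($n - 1) );
    return ( $mean_d / ( $sd_d / sqrt($n) ), $n - 1 );
}

my ($t, $df) = t_paired( [2, 4, 6, 8, 10], [1, 2, 3, 4, 5] );
printf "t = %.4f, df = %d\n", $t, $df;
```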

   binomial_test
         Signature: (x(); n(); p_expected(); [o]p())

       Binomial test. One-tailed significance test for two-outcome distribution. Given the number
       of successes, the number of trials, and the expected probability of success, returns the
       probability of getting this many or more successes.

       This function does NOT currently support bad values in the number of successes.

       Usage:

         # assume a fair coin, i.e. 0.5 probability of getting heads
         # test whether getting 8 heads out of 10 coin flips is unusual
         # by the definition above, P(8 or more heads) = 56/1024 = 0.0546875,
         # which is just above the conventional 0.05 cutoff

         my $p = binomial_test( 8, 10, 0.5 );
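
       As a cross-check of the definition above ("this many or more successes"), the tail
       probability can be summed directly from the binomial probability mass function. The
       plain-Perl helper binomial_upper_tail below is hypothetical and not part of PDL::Stats.
       For 10 flips of a fair coin, 8 or more heads has probability 56/1024 = 0.0546875, while 9
       or more heads has probability 11/1024 = 0.0107421875.

```perl
use strict;
use warnings;

# P(X >= $x) for X ~ Binomial($n, $p), summing the probability mass function
sub binomial_upper_tail {
    my ($x, $n, $p) = @_;
    my $tail  = 0;
    my $coeff = 1;    # running value of C(n, k), starting at k = 0
    for my $k (0 .. $n) {
        $tail += $coeff * $p ** $k * (1 - $p) ** ($n - $k) if $k >= $x;
        $coeff = $coeff * ($n - $k) / ($k + 1);    # C(n, k+1) from C(n, k)
    }
    return $tail;
}

my $at_least_8 = binomial_upper_tail(8, 10, 0.5);    # 8, 9 or 10 heads
my $at_least_9 = binomial_upper_tail(9, 10, 0.5);    # 9 or 10 heads

printf "P(X >= 8) = %.10f\nP(X >= 9) = %.10f\n", $at_least_8, $at_least_9;
```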

METHODS

   rtable
       Reads a data table from either a file or a file handle*. Returns an observation x
       variable pdl and, if specified, the var and obs ids. The ids are returned as Perl array
       refs to allow for non-numeric ids. Other non-numeric entries are treated as missing: they
       are filled with $opt{MISSN} and then set to BAD*. The number of data rows to read from
       the top can be specified, but not an arbitrary range.

       *If a file handle is passed, it is not closed here.

       Default options (case insensitive):

           V       => 1,        # verbose. prints simple status
           TYPE    => double,
           C_ID    => 1,        # boolean. file has col id.
           R_ID    => 1,        # boolean. file has row id.
           R_VAR   => 0,        # boolean. set to 1 if var in rows
           SEP     => "\t",     # can take regex qr//
           MISSN   => -999,     # this value treated as missing and set to BAD
           NROW    => '',       # set to read specified num of data rows

       Usage:

       Sample file diet.txt:

           uid height  weight  diet
           akw 72      320     1
           bcm 68      268     1
           clq 67      180     2
           dwm 70      200     2

           ($data, $idv, $ido) = rtable 'diet.txt';

           # By default prints out data info and @$idv index and element

           reading diet.txt for data and id... OK.
           data table as PDL dim o x v: PDL: Double D [4,3]
           0   height
           1   weight
           2   diet

       Another way of using it,

           $data = rtable( \*STDIN, {TYPE=>long} );

   group_by
       Returns a pdl reshaped according to the specified factor variable. Most useful when used
       in conjunction with other threading calculations such as average, stdv, etc. When the
       factor variable contains unequal numbers of cases in each level, the returned pdl is
       padded with bad values to fit the level with the largest number of cases. This allows the
       subsequent calculation (average, stdv, etc.) to return the correct results for each level.

       Usage:

           # simple case with 1d pdl and equal number of n in each level of the factor

               pdl> p $a = sequence 10
               [0 1 2 3 4 5 6 7 8 9]

               pdl> p $factor = $a > 4
               [0 0 0 0 0 1 1 1 1 1]

               pdl> p $a->group_by( $factor )->average
               [2 7]

           # more complex case with threading and unequal number of n across levels in the factor

               pdl> p $a = sequence 10,2
               [
                [ 0  1  2  3  4  5  6  7  8  9]
                [10 11 12 13 14 15 16 17 18 19]
               ]

               pdl> p $factor = qsort $a( ,0) % 3
               [
                [0 0 0 0 1 1 1 2 2 2]
               ]

               pdl> p $a->group_by( $factor )
               [
                [
                 [ 0  1  2  3]
                 [10 11 12 13]
                ]
                [
                 [  4   5   6 BAD]
                 [ 14  15  16 BAD]
                ]
                [
                 [  7   8   9 BAD]
                 [ 17  18  19 BAD]
                ]
               ]
            ARRAY(0xa2a4e40)

           # group_by supports perl factors, multiple factors
           # returns factor labels in addition to pdl in array context

           pdl> p $a = sequence 12
           [0 1 2 3 4 5 6 7 8 9 10 11]

           pdl> $odd_even = [qw( e o e o e o e o e o e o )]

           pdl> $magnitude = [qw( l l l l l l h h h h h h )]

           pdl> ($a_grouped, $label) = $a->group_by( $odd_even, $magnitude )

           pdl> p $a_grouped
           [
            [
             [0 2 4]
             [1 3 5]
            ]
            [
             [ 6  8 10]
             [ 7  9 11]
            ]
           ]

           pdl> p Dumper $label
           $VAR1 = [
                     [
                       'e_l',
                       'o_l'
                     ],
                     [
                       'e_h',
                       'o_h'
                     ]
                   ];
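
       The padding behaviour described above can be mimicked in plain Perl for a single 1d
       factor. In this sketch undef stands in for BAD, group_by_factor is a hypothetical helper,
       and levels are kept in order of first appearance, which is an assumption and may differ
       from how the library orders them.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Split values by factor level, padding short groups with undef (stand-in for BAD)
sub group_by_factor {
    my ($values, $factor) = @_;
    my (%group, @levels);
    for my $i (0 .. $#$values) {
        my $level = $factor->[$i];
        push @levels, $level unless exists $group{$level};
        push @{ $group{$level} }, $values->[$i];
    }
    my $longest = max( map { scalar @{ $group{$_} } } @levels );
    for my $level (@levels) {
        push @{ $group{$level} }, undef while @{ $group{$level} } < $longest;
    }
    return ( [ map { $group{$_} } @levels ], \@levels );
}

my ($grouped, $labels) = group_by_factor( [0 .. 9], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2] );

for my $i (0 .. $#$labels) {
    print "$labels->[$i]: ", join( ' ', map { defined $_ ? $_ : 'BAD' } @{ $grouped->[$i] } ), "\n";
}
```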

   which_id
       Looks up the specified var (obs) ids in $idv ($ido) (see rtable) and returns their
       indices in $idv ($ido) as a pdl if found. The indices are ordered as in the specified
       subset. Useful for selecting data by var (obs) id.

           my $ind = which_id $ido, ['smith', 'summers', 'tesla'];

           my $data_subset = $data( $ind, );

           # take advantage of perl pattern matching
           # e.g. use data from people whose last name starts with s

           my $i = which_id $ido, [ grep { /^s/ } @$ido ];

           my $data_s = $data($i, );
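
       The lookup itself amounts to mapping each requested id to its position in the id list. A
       plain-Perl sketch, with the hypothetical helper find_ids returning an array ref rather
       than a pdl. Skipping ids that are not present is an assumption based on the "if found"
       wording above.

```perl
use strict;
use warnings;

# Positions of the wanted ids, in the order requested; absent ids are skipped
sub find_ids {
    my ($ids, $wanted) = @_;
    my %pos;
    $pos{ $ids->[$_] } //= $_ for 0 .. $#$ids;    # first occurrence wins
    return [ map { $pos{$_} } grep { exists $pos{$_} } @$wanted ];
}

my @ido = qw( tesla jones smith summers );
my $ind = find_ids( \@ido, [qw( smith summers tesla )] );

print "@$ind\n";
```

       The result follows the order of the requested subset, not the order of the ids in the
       data.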

SEE ALSO

       PDL::Basic (hist for frequency counts)

       PDL::Ufunc (sum, avg, median, min, max, etc.)

       PDL::GSL::CDF (various cumulative distribution functions)

REFERENCES

       Hays, W.L. (1994). Statistics (5th ed.). Fort Worth, TX: Harcourt Brace College
       Publishers.

AUTHOR

       Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>

       All rights reserved. There is no warranty. You are allowed to redistribute this software /
       documentation as described in the file COPYING in the PDL distribution.