lunar (1) QTLtools-cis.1.gz

Provided by: qtltools_1.3.1+dfsg-4_amd64 bug

NAME

       QTLtools cis - cis QTL analysis

SYNOPSIS

       QTLtools   cis   --vcf  [in.vcf|in.vcf.gz|in.bcf|in.bed.gz]  --bed  quantifications.bed.gz
       [--nominal float | --permute integer | --mapping in.txt] --out output.txt [OPTIONS]

DESCRIPTION

       This mode maps cis (proximal) quantitative trait loci (QTLs) that  affect  the  phenotype,
       using       linear       regression.        The       method      is      detailed      in
       <https://www.nature.com/articles/ncomms15452>.   We  first  regress   out   the   provided
       covariates  from the phenotype data, followed by running the linear regression between the
       phenotype residuals and the genotype.  If --normal and --cov  are  provided  at  the  same
       time,  then  the residuals after the covariate correction are rank normal transformed.  It
       incorporates an efficient permutation scheme to control for differential multiple  testing
       burden  of  each  phenotype.  You can run a nominal pass (--nominal threshold) listing all
       genotype-phenotype associations below a certain threshold, a permutation  pass  (--permute
       no_of_permutations)  to empirically characterize the null distribution of associations for
       each phenotype separately, thus adjusting the nominal p-value of the best association  for
       a  phenotype,  or  a  conditional  analysis pass (--mapping filename) to discover multiple
       proximal QTLs with independent effects on a phenotype.

       As multiple molecular phenotypes can belong to  higher  order  biological  entities,  e.g.
       exons  of genes, QTLtools cis allows grouping of phenotypes to maximize the discoveries in
       such particular cases.  Specifically, QTLtools can either aggregate multiple phenotypes in
       a given group into a single phenotype via PCA (--grp-pca1) or by taking their mean (--grp-
       mean), or directly use all individual phenotypes in an extended  permutation  scheme  that
       accounts  for  their  number  and  correlation structure (--grp-best).  In our experience,
       --grp-best outperforms the other options for expression QTLs (eQTLs).

       The conditional analysis  pass  first  uses  permutations  to  derive  a  nominal  p-value
       threshold  per  phenotype  that  varies  and  reflects the number of independent tests per
       cis-window.  Then, it uses a forward-backward stepwise regression to learn the  number  of
       independent  signals  per  phenotype,  determine the best candidate variant per signal and
       assign all significant hits to the independent signal they relate to.

       Since linear regressions assumes normally distributed data, we highly recommend using  the
       --normal  option  to rank normal transform the phenotype quantifications in order to avoid
       false positive associations due to outliers.

OPTIONS

       --vcf [in.vcf|in.bcf|in.vcf.gz|in.bed.gz]
              Genotypes in VCF/BCF format, or another molecular  phenotype  in  BED  format.   If
              there  is  a  DS  field in the genotype FORMAT of a variant (dosage of the genotype
              calculated from genotype probabilities, e.g. after imputation), then this  is  used
              as the genotype.  If there is only the GT field in the genotype FORMAT then this is
              used and it is converted to a dosage.  REQUIRED.

       --bed quantifications.bed.gz
              Molecular phenotype quantifications in BED format.  REQUIRED.

       --out output.txt
              Output file.  REQUIRED.

       --cov covariates.txt
              Covariates to correct the phenotype data with.

       --normal
              Rank normal transform the  phenotype  data  so  that  each  phenotype  is  normally
              distributed.  RECOMMENDED.

       --std-err
              Calculate and output the standard error of the beta (slope).

       --window integer
              Size  of the cis window flanking each phenotype's start position.  DEFAULT=1000000.
              RECOMMENDED=1000000.

       --nominal float
              Calculate the nominal p-value for the genotype-phenotype associations and print out
              the ones that pass the provided threshold.  Give 1.0 to print everything.  Mutually
              exclusive with --permute and --mapping.

       --permute integer
              Adjust the best nominal p-value for this phenotype accounting  for  the  number  of
              variants  and  the linkage disequilibrium in its cis-window.  We recommend at least
              1000 permutation for the final analysis, and in most cases you will see diminishing
              returns  when going over 5000.  However, if you are doing exploratory analyses like
              which/how many covariates to include, you can go as low as 100.  Mutually exclusive
              with --nominal and --mapping.  RECOMMENDED=1000.

       --mapping thresholds_filename
              The  conditional  analysis.   First  you  need  to run a permutation analysis, then
              generate a thresholds file using the runFDR_cis.R script in the  script  directory.
              Mutually exclusive with --nominal and --permute.

       --grp-best
              Correct  for multiple phenotypes within a phenotype group.  Mutually exclusive with
              --grp-pca1 and --grp-mean.

       --grp-pca1
              Run PCA on phenotypes within a phenotype group and use PC1 for association testing.
              Mutually exclusive with --grp-best and --grp-mean.

       --grp-mean
              Average  phenotypes  within  a  group  and use the results for association testing.
              Mutually exclusive with --grp-best and --grp-pca1.

       --chunk integer1 integer2
              For parallelization.  Divide the data into integer2 number of  chunks  and  process
              chunk  number  integer1.   Chunk  0  will  print a header.  Mutually exclusive with
              --region and --region-pair.  Minimum number of chunks has to be at least  the  same
              number of chromosomes in the --bed file.

       --region chr:start-end
              Genomic  region  to  be  processed.   E.g. chr4:12334456-16334456, or chr5 Mutually
              exclusive with --chunk and --region-pair.

       --region-pair chr:start-end chr:start-end
              Genomic region for genotypes followed by region for  phenotypes  to  be  processed.
              Mutually exclusive with --chunk and --region.

OUTPUT FILES

       --nominal output file
        Space  separated output file with the following columns (certain columns are only printed
        based on options).  We recommend including chunk 0 to print out  a  header  in  order  to
        avoid confusion.

         1     phe_id | grp_id                     The  phenotype  ID  or  if one of the grouping
                                                   options is provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only printed if --group-best | --group-pca1  |
                                                   --group-mean.    The  phenotype  ID,  variance
                                                   explained by PC1, or number of  phenotypes  in
                                                   the phenotype group for --group-best, --group-
                                                   pca1, and --group-mean, respectively.

         5.2   n_phe_in_grp                        Only printed if --group-pca1  |  --group-mean.
                                                   The  number  of  phenotypes  in  the phenotype
                                                   group.
         6     n_var_in_cis                        The number variants in the cis window for this
                                                   phenotype.
         7     dist_phe_var                        The  distance  between  the  variant  and  the
                                                   phenotype start positions.
         8     var_id                              The variant ID.
         9     var_chr                             The variant chromosome.
        10     var_from                            The start position of the variant.
        11     var_to                              The end position of the variant.
        12     nom_pval                            The nominal p-value of the association between
                                                   the variant and the phenotype.
        13     r_squared                           The r squared of the linear regression.
        14     slope                               The beta (slope) of the linear regression.
        14.1   slope_se                            The  standard error of the beta.  Only printed
                                                   if --std-err is provided.
        15     best_hit                            Whether this varint was the best hit for  this
                                                   phenotype.

       --permute output file
        Space  separated output file with the following columns (certain columns are only printed
        based on options).  We recommend including chunk 0 to print out  a  header  in  order  to
        avoid confusion.

         1     phe_id | grp_id                     The  phenotype  ID  or  if one of the grouping
                                                   options is provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only printed if --group-best | --group-pca1  |
                                                   --group-mean.    The  phenotype  ID,  variance
                                                   explained by PC1, or number of  phenotypes  in
                                                   the phenotype group for --group-best, --group-
                                                   pca1, and --group-mean, respectively.
         5.2   n_phe_in_grp                        Only printed if --group-pca1  |  --group-mean.
                                                   The  number  of  phenotypes  in  the phenotype
                                                   group.
         6     n_var_in_cis                        The number variants in the cis window for this
                                                   phenotype.
         7     dist_phe_var                        The  distance  between  the  variant  and  the
                                                   phenotype start positions.
         8     var_id                              The most significant variant ID.
         9     var_chr                             The most significant variant's chromosome.
        10     var_from                            The start position  of  the  most  significant
                                                   variant.
        11     var_to                              The  end  position  of  the  most  significant
                                                   variant.
        12     dof1                                The number  of  degrees  of  freedom  used  to
                                                   compute the p-values.
        13     dof2                                Estimated number of degrees of freedom used in
                                                   beta approximation p-value calculations.
        14     bml1                                The first shape parameter of the  fitted  beta
                                                   distribution  (alpha parameter).  These should
                                                   be close to 1.
        15     bml2                                The second shape parameter of the fitted  beta
                                                   distribution     (beta    parameter).     This
                                                   corresponds  to  the   effective   number   of
                                                   independent tests in the region.
        16     nom_pval                            The nominal p-value of the association between
                                                   the   most   significant   variant   and   the
                                                   phenotype.
        17     r_squared                           The r squared of the linear regression.
        18     slope                               The beta (slope) of the linear regression.

        18.1   slope_se                            The  standard error of the beta.  Only printed
                                                   if --std-err is provided.
        19     adj_emp_pval                        Adjusted empirical p-value from  permutations.
                                                   This  is  the  adjusted  p-value not using the
                                                   beta  approximation.   Simply  calculated  as:
                                                   (number    of    p-values    observed   during
                                                   permutations that were smaller than  or  equal
                                                   to  the  nominal  p-value  +  1)  / (number of
                                                   permutations + 1).  The  most  significant  p-
                                                   value  achievable  would  be  1  /  (number of
                                                   permutations + 1).
        20     adj_beta_pval                       Adjusted empirical p-value given by the fitted
                                                   beta   distribution.   We  strongly  recommend
                                                   using this adjusted p-value in any  downstream
                                                   analysis.

       --mapping output file
        Space  separated output file with the following columns (certain columns are only printed
        based on options).  We recommend including chunk 0 to print out  a  header  in  order  to
        avoid confusion.

         1     phe_id | grp_id                     The  phenotype  ID  or  if one of the grouping
                                                   options is provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only printed if --group-best | --group-pca1  |
                                                   --group-mean.    The  phenotype  ID,  variance
                                                   explained by PC1, or number of  phenotypes  in
                                                   the phenotype group for --group-best, --group-
                                                   pca1, and --group-mean, respectively.
         5.2   n_phe_in_grp                        Only printed if --group-pca1  |  --group-mean.
                                                   The  number  of  phenotypes  in  the phenotype
                                                   group.
         6     n_var_in_cis                        The number variants in the cis window for this
                                                   phenotype.
         7     dist_phe_var                        The  distance  between  the  variant  and  the
                                                   phenotype start positions.
         8     var_id                              The most significant variant ID.
         9     var_chr                             The most significant variant's chromosome.
        10     var_from                            The start position  of  the  most  significant
                                                   variant.
        11     var_to                              The  end  position  of  the  most  significant
                                                   variant.
        12     rank                                The rank of the association.  This  tells  you
                                                   if the variant has been mapped as belonging to
                                                   the best  signal  (rank=0),  the  second  best
                                                   (rank=1),  etc  ...   As  a  consequence,  the
                                                   maximum rank value for a given phenotype tells
                                                   you  how  many  independent  signals there are
                                                   (e.g. rank=2 means 3 independent signals).
        13     fwd_pval                            The nominal forward p-value of the association
                                                   between  the  most significant variant and the
                                                   phenotype.
        14     fwd_r_squared                       The  r   squared   of   the   forward   linear
                                                   regression.
        15     fwd_slope                           The   beta   (slope)  of  the  forward  linear
                                                   regression.
        15.1   fwd_slope_se                        The standard error of the forward beta.   Only
                                                   printed if --std-err is provided.
        16     fwd_best_hit                        Whether  or  not  this variant was the forward
                                                   most significant variant.
        17     fwd_sig                             Whether   this   variant   was    significant.
                                                   Currently all variants are significant so this
                                                   is redundant.

        18     bwd_pval                            The   nominal   backward   p-value   of    the
                                                   association   between   the  most  significant
                                                   variant and the phenotype.
        19     bwd_r_squared                       The  r  squared   of   the   backward   linear
                                                   regression.
        20     bwd_slope                           The   beta  (slope)  of  the  backward  linear
                                                   regression.
        20.1   bwd_slope_se                        The standard error of the backward beta.  Only
                                                   printed if --std-err is provided.
        21     bwd_best_hit                        Whether  or  not this variant was the backward
                                                   most significant variant.
        22     bwd_sig                             Whether   this   variant   was    significant.
                                                   Currently all variants are significant so this
                                                   is redundant.

NOMINAL ANALYSIS EXAMPLES

       o Nominal  pass,  correcting  for  technical  covariates,  rank  normal  transforming  the
         phenotype, and printing out associations with a p-value <= 0.01 on chromosome 22 between
         17000000 and 18000000 bp, and excluding some samples (see QTLtools (1)):

         QTLtools  cis  --vcf  genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
         genes.covariates.pc50.txt.gz  --nominal  0.01  --region chr22:17000000-18000000 --normal
         --out nominals.txt --exclude-samples sample_names_to_exclude.txt

       o Nominal pass with parallelization  correcting  for  technical  covariates,  rank  normal
         transforming  the  phenotype,  and printing out associations with a p-value <= 0.01.  To
         facilitate parallelization on compute  cluster,  we  developed  an  option  to  run  the
         analysis into chunks of molecular phenotypes.  For instance, to run analysis on chunk 12
         when splitting the example data set into 20 chunks, run:

         QTLtools  cis  --vcf  genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
         genes.covariates.pc50.txt.gz    --nominal    0.01   --chunk   12   20   --normal   --out
         nominals_12_20.txt

       o If you want to submit the whole analysis with 20 jobs on a  compute  cluster,  just  run
         (qsub needs to be changed to the job submission system used [bsub, psub, etc...]):

         for j in $(seq 0 20); do
             echo  "QTLtools  cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz
             --cov genes.covariates.pc50.txt.gz --nominal  0.01  --chunk  $j  20  --normal  --out
             nominals_$j\_20.txt" | qsub
         done

PERMUTATION ANALYSIS EXAMPLES

       o Permutation  pass,  correcting  for  technical  covariates, rank normal transforming the
         phenotype, and running 1000 permutations with a specific random seed  on  chromosome  22
         between 17000000 and 18000000 bp:

         QTLtools  cis  --vcf  genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
         genes.covariates.pc50.txt.gz --permute 1000  --region  chr22:17000000-18000000  --normal
         --seed 1354145 --out permutation.txt

       o Permutation  pass  with parallelization correcting for technical covariates, rank normal
         transforming  the   phenotype,   and   running   5000   permutations.    To   facilitate
         parallelization  on  compute  cluster,  we  developed an option to run the analysis into
         chunks of molecular phenotypes.   For  instance,  to  run  analysis  on  chunk  12  when
         splitting the example data set into 20 chunks, run:

         QTLtools  cis  --vcf  genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
         genes.covariates.pc50.txt.gz   --permute   5000   --chunk   12   20    --normal    --out
         permutations_12_20.txt

       o If  you  want  to  submit the whole analysis with 20 jobs on a compute cluster, just run
         (qsub needs to be changed to the job submission system used [bsub, psub, etc...]):

         for j in $(seq 0 20); do
             echo "QTLtools cis --vcf genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz
             --cov  genes.covariates.pc50.txt.gz  --permute  5000  --chunk  $j  20 --normal --out
             permutations_$j\_20.txt" | qsub
         done

       o When your phenotypes in the BED file are grouped, you can perform a permutation pass  at
         the phenotype group level in order to discover group-level QTLs:

         QTLtools  cis  --vcf  genotypes.chr22.vcf.gz  --bed  exons.50percent.chr22.bed.gz  --cov
         genes.covariates.pc50.txt.gz    --permute     1000     --normal     --grp-best     --out
         permutation.group.txt

CONDITIONAL ANALYSIS EXAMPLE

       Conditional analysis to discover independent signals.

       1 First  we  need  to  run  a  permutation analysis (see previous section), then calculate
         nominal p-value threshold for each gene.  Here an FDR of 5% is given as an example:

         cat permutations_*.txt | gzip -c > permutations_all.txt.gz
         Rscript ./script/qtltools_runFDR_cis.R permutations_all.txt.gz 0.05 permutations_all

       2 Now you can proceed with the  actual  conditional  analysis.   Here  splitting  into  20
         chunks, and when all complete concatenate the results:

         for j in $(seq 0 20); do
             echo  "QTLtools  cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz
             --cov genes.covariates.pc50.txt.gz --mapping permutations_all.thresholds.txt --chunk
             $j 20 --normal --out conditional_$j\_20.txt" | qsub
         done
         cat conditional_*.txt > conditional_all.txt

       3 If  you  are interested in the most significant variants per independent signal, you can
         filter the results, using the backward p-value:

         awk '{ if ($21 == 1) print $0 }' conditional_all.txt > conditional_top_variants.txt

SEE ALSO

       QTLtools(1)

       QTLtools website: <https://qtltools.github.io/qtltools>

BUGS

       o Versions up to and including 1.2, suffer from a bug  in  reading  missing  genotypes  in
         VCF/BCF files.  This bug affects variants with a DS field in their genotype's FORMAT and
         have a missing genotype (DS field is .) in one of the samples, in which  case  genotypes
         for  all  the  samples  are  set  to missing, effectively removing this variant from the
         analyses.

       Please submit bugs to <https://github.com/qtltools/qtltools>

CITATION

       Delaneau, O., Ongen, H., Brown, A. et al. A complete tool set for molecular QTL  discovery
       and analysis. Nat Commun 8, 15452 (2017).  <https://doi.org/10.1038/ncomms15452>

AUTHORS

       Olivier Delaneau (olivier.delaneau@gmail.com), Halit Ongen (halitongen@gmail.com)