Provided by: vcftools_0.1.14+dfsg-2ubuntu0.1_amd64

NAME

       vcftools - analyse VCF files

SYNOPSIS

       vcftools [OPTIONS]

DESCRIPTION

       The vcftools program is run from the command line. The interface is inspired by PLINK, and
       so should be largely familiar to users of that package. Commands take the following form:

         vcftools --vcf file1.vcf --chr 20 --freq

       The above command tells vcftools to read in the file file1.vcf, extract sites on
       chromosome 20, and calculate the allele frequency at each site. The resulting allele
       frequency estimates are stored in the output file, out.frq. As in the above example,
       output from vcftools is mainly sent to output files, as opposed to being shown on the
       screen.

       Note that some commands may only be available in the latest version of vcftools. To obtain
       the latest version, use SVN to check out the latest code, as described on the home page.

       Also note that polyploid genotypes are not currently supported.

   Basic Options
       --vcf <filename>
              This option defines the VCF file to be processed. The file needs to be decompressed
              prior to use with vcftools. vcftools expects files in VCF format v4.0, as
              defined in the VCF specification.

       --gzvcf <filename>
              This option can be used in place of the --vcf option to read  compressed  (gzipped)
              VCF  files  directly.  Note that this option can be quite slow when used with large
              files.

       --out <prefix>
              This option defines the output filename prefix for all files generated by vcftools.
              For example, if <prefix> is set to output_filename, then all output files will be
              of the form output_filename.***. If this option is omitted, all output files will
              have the prefix 'out.'.
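
              As a brief sketch of how the prefix shapes output names, the snippet below derives
              the file that --freq would be expected to write for a hypothetical prefix
              chr20_freq, using the '.frq' suffix documented for --freq. The prefix and the
              input name file1.vcf are placeholders, and the vcftools call only runs when
              vcftools and the input file are present.

```shell
# Hypothetical prefix and input file; adjust to your data.
prefix="chr20_freq"
input="file1.vcf"

# --freq writes a file with the suffix '.frq', so the expected name is:
expected="${prefix}.frq"
echo "expecting output in ${expected}"

# Run only when vcftools and the input file are actually available.
if command -v vcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    vcftools --vcf "$input" --chr 20 --freq --out "$prefix"
fi
```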

   Site Filter Options
       --chr <chromosome>
              Only process sites with a chromosome identifier matching <chromosome>.

       --from-bp <integer>

       --to-bp <integer>
              These options define the physical range of sites that will be processed. Sites
              outside of this range will be excluded. These options can only be used in
              conjunction with --chr.

       --snp <string>
              Include SNP(s) with matching ID. This option can be used multiple times in order
              to include more than one SNP.

       --snps <filename>
              Include a list of SNPs given in a file. The file should contain a list of SNP  IDs,
              with one ID per line.

       --exclude <filename>
              Exclude  a list of SNPs given in a file. The file should contain a list of SNP IDs,
              with one ID per line.
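
              The ID-list format shared by --snps and --exclude can be sketched as follows. The
              rs numbers, the file name snp_ids.txt and the output prefix are placeholders
              chosen for illustration; the vcftools call is skipped unless vcftools and
              file1.vcf are available.

```shell
# Write a list of SNP IDs, one per line (the rs numbers are placeholders).
printf '%s\n' rs6054257 rs6040355 > snp_ids.txt

# Keep only the listed SNPs; runs only if vcftools and file1.vcf exist.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --snps snp_ids.txt --freq --out listed_snps
fi
```

              Passing the same file to --exclude instead of --snps would drop those SNPs.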

       --positions <filename>
              Include a set of sites on the basis of a list of positions. Each line of the  input
              file  should  contain  a  (tab-separated) chromosome and position.  The file should
              have a header line. Sites not included in the list are excluded.
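
              A minimal positions file, assuming the layout described above (a header line
              followed by tab-separated chromosome and position), might be built like this.
              The header names CHROM and POS and the coordinates are illustrative assumptions,
              not values required by vcftools.

```shell
# Header line first, then tab-separated chromosome and position.
# The column names and coordinates are illustrative assumptions.
printf 'CHROM\tPOS\n' > positions.txt
printf '20\t14370\n20\t17330\n' >> positions.txt

# Restrict the analysis to the listed positions, if vcftools and the input exist.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --positions positions.txt --freq --out at_positions
fi
```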

       --bed <filename>

       --exclude-bed <filename>
              Include or exclude a set of sites on the basis of a BED file. Only the first  three
              columns  (chrom,  chromStart and chromEnd) are required. The BED file should have a
              header line.
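
              The sketch below writes a small BED file with the header line and the three
              required columns, then uses it with --bed. The intervals and file names are made
              up for illustration; swapping --bed for --exclude-bed would remove, rather than
              keep, sites in these intervals.

```shell
# BED file with a header line and the three required columns.
# The intervals are made-up examples.
printf 'chrom\tchromStart\tchromEnd\n' > regions.bed
printf '20\t14000\t15000\n20\t1110000\t1120000\n' >> regions.bed

# Keep only sites inside the intervals, if vcftools and the input exist.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --bed regions.bed --freq --out in_regions
fi
```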

       --remove-filtered-all

       --remove-filtered <string>

       --keep-filtered <string>
              These options are used to filter sites on the basis  of  their  FILTER  flag.   The
              first option removes all sites with a FILTER flag. The second option can be used to
              exclude sites with a specific filter flag. The third option can be used  to  select
              sites  on  the basis of specific filter flags.  The second and third options can be
              used multiple times to specify multiple  FILTERs.  The  --keep-filtered  option  is
              applied before the --remove-filtered option.

       --minQ <float>
              Include only sites with Quality above this threshold.

       --min-meanDP <float>

       --max-meanDP <float>
              Include sites with mean Depth within the thresholds defined by these options.

       --maf <float>

       --max-maf <float>
              Include only sites with Minor Allele Frequency within the specified range.

       --non-ref-af <float>

       --max-non-ref-af <float>
              Include only sites with Non-Reference Allele Frequency within the specified range.

       --hwe <float>
              Assesses  sites  for  Hardy-Weinberg Equilibrium using an exact test, as defined by
              Wigginton, Cutler and Abecasis (2005). Sites with a  p-value  below  the  threshold
              defined by this option are taken to be out of HWE, and therefore excluded.

       --geno <float>
              Exclude sites on the basis of the proportion of missing data (defined to be between
              0 and 1).

       --min-alleles <int>

       --max-alleles <int>
              Include only sites with a number  of  alleles  within  the  specified  range.   For
              example, to include only bi-allelic sites, one could use:

                    vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

       --mask <filename>

       --invert-mask <filename>

       --mask-min <integer>
              Include  sites  on  the  basis  of  a FASTA-like file. The provided file contains a
              sequence of integer digits (between 0 and 9) for each position on a chromosome that
              specify if a site at that position should be filtered or not.  An example mask file
              would look like:

                    >1
                    0000011111222...

              In this example, sites in the VCF file located within the  first  5  bases  of  the
              start  of  chromosome 1 would be kept, whereas sites at position 6 onwards would be
              filtered out. The threshold integer that determines if sites are filtered or not is
              set using the --mask-min option, which defaults to 0.  The chromosomes contained in
              the mask file must be sorted in the same order as the VCF file. The  --mask  option
              is  used  to specify the mask file to be used, whereas the --invert-mask option can
              be used to specify a mask file that will be inverted before being applied.
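
              The thresholding rule described above can be sketched with awk. This is an
              illustrative re-implementation under the assumption that a position is kept when
              its mask digit does not exceed the --mask-min threshold; it is not vcftools' own
              code, and the file name mask.txt is a placeholder.

```shell
# Example mask from the text: one FASTA-like record for chromosome 1.
printf '>1\n0000011111222\n' > mask.txt

# Threshold corresponding to --mask-min (documented default: 0).
mask_min=0

# List 1-based positions whose mask digit does not exceed the threshold.
kept=$(awk -v min="$mask_min" '
    /^>/ { pos = 0; next }              # new chromosome: restart the position counter
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            pos++
            if (substr($0, i, 1) + 0 <= min)
                printf "%d ", pos
        }
    }' mask.txt)
echo "kept positions: $kept"
```

              With the default threshold only positions 1 to 5 are listed, matching the example
              in the text. Setting mask_min=1 in this sketch would also list positions 6 to 10.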

   Individual Filters
       --indv <string>
              Specify an individual to be kept in the analysis. This option can be used  multiple
              times to specify multiple individuals.

       --keep <filename>
              Provide a file containing a list of individuals to include in subsequent analysis.
              Each individual ID (as defined in the VCF header line) should be included on a
              separate line.
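
              One way to build such a list is to read the sample IDs from the VCF header line,
              where they follow the nine fixed columns. The sketch below does this on a tiny
              stand-in header; the sample names NA00001 to NA00003 and the file names are
              placeholders.

```shell
# Tiny stand-in VCF header line; the sample names are placeholders.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003\n' > header_only.vcf

# Sample IDs start at column 10 of the #CHROM line; keep the first two.
awk -F '\t' '/^#CHROM/ { for (i = 10; i <= 11 && i <= NF; i++) print $i }' \
    header_only.vcf > keep.txt

# Restrict the analysis to the listed individuals, if vcftools and the input exist.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --keep keep.txt --freq --out kept_indv
fi
```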

       --remove-indv <string>
              Specify  an  individual  to  be  removed from the analysis. This option can be used
              multiple times to specify multiple  individuals.  If  the  --indv  option  is  also
              specified, then the --indv option is executed before the --remove-indv option.

       --remove <filename>
              Provide a file containing a list of individuals to exclude in subsequent analysis.
              Each individual ID (as defined in the VCF header line) should be included on a
              separate line. If both the --keep and the --remove options are used, then the
              --keep option is executed before the --remove option.

       --min-indv-meanDP <float>

       --max-indv-meanDP <float>
              Calculate the mean coverage  on  a  per-individual  basis.  Only  individuals  with
              coverage  within  the  range  specified by these options are included in subsequent
              analyses.

       --mind <float>
              Specify the minimum call rate threshold for each individual.

       --phased
              First excludes all individuals having  all  genotypes  unphased,  and  subsequently
              excludes  all  sites with unphased genotypes. The remaining data therefore consists
              of phased data only.

   Genotype Filters
       --remove-filtered-geno-all

       --remove-filtered-geno <string>
              The first option removes all genotypes with a FILTER flag. The second option can be
              used to exclude genotypes with a specific filter flag.

       --minGQ <float>
              Exclude  all  genotypes with a quality below the threshold specified by this option
              (GQ).

       --minDP <float>
              Exclude all genotypes with a sequencing depth below that specified by this option
              (DP).

   Output Statistics
       --freq

       --counts

       --freq2

       --counts2
              Output per-site frequency information. The --freq option outputs the allele
              frequency in a file with the suffix '.frq'. The --counts option outputs a similar
              file with the suffix '.frq.count', that contains the raw allele counts at each
              site. The --freq2 and --counts2 options are used to suppress allele information in
              the output file. In this case, the order of the freqs/counts depends on the
              numbering in the VCF file.
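
              To make the relationship between counts and frequencies concrete, the sketch
              below tallies alleles for one made-up bi-allelic site and derives the
              non-reference allele frequency. It illustrates the arithmetic only; the output
              layout is an assumption and does not reproduce the '.frq' or '.frq.count' files.

```shell
# One bi-allelic site with three diploid genotypes (made-up data).
printf '20\t14370\trs6054257\tG\tA\t29\tPASS\t.\tGT\t0|0\t1|0\t1/1\n' > site.txt

# Count allele indices across samples (columns 10 onwards), skipping missing calls.
LC_ALL=C awk -F '\t' '{
    ref = 0; alt = 0
    for (i = 10; i <= NF; i++) {
        split($i, f, ":")                 # GT is the first FORMAT subfield
        n = split(f[1], a, "[|/]")        # phased (|) or unphased (/) separator
        for (j = 1; j <= n; j++) {
            if (a[j] == "0") ref++
            else if (a[j] == "1") alt++
        }
    }
    total = ref + alt
    if (total > 0)
        printf "%s\t%s\t%d\t%d\t%.3f\n", $1, $2, ref, alt, alt / total
}' site.txt > site_freq.txt
cat site_freq.txt
```

              Here three reference and three alternate alleles are counted, so the frequency of
              the alternate allele is 0.500.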

       --depth
              Generates a file containing the mean depth per individual. This file has the suffix
              '.idepth'.

       --site-depth

       --site-mean-depth
              Generates a file containing the depth per site. The --site-depth option outputs the
              depth  for each site summed across individuals. This file has the suffix '.ldepth'.
              Likewise, the --site-mean-depth option outputs the mean depth for each site, and
              the output file has the suffix '.ldepth.mean'.

       --geno-depth
              Generates  a  (possibly  very large) file containing the depth for each genotype in
              the VCF file. Missing entries are given the value  -1.  The  file  has  the  suffix
              '.gdepth'.

       --site-quality
              Generates  a  file containing the per-site SNP quality, as found in the QUAL column
              of the VCF file. This file has the suffix '.lqual'.

       --het  Calculates a measure of heterozygosity on a per-individual basis. Specifically, the
              inbreeding coefficient, F, is estimated for each individual using a method of
              moments. The resulting file has the suffix '.het'.
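
              A common form of the method-of-moments estimator is F = (O - E) / (N - E), where O
              is the observed number of homozygous genotypes, E the number expected from allele
              frequencies, and N the number of genotyped sites. The worked example below assumes
              that form and uses made-up counts; it is not taken from the vcftools source.

```shell
# Made-up counts for one individual.
obs_hom=80      # observed homozygous genotypes (O)
exp_hom=70      # homozygous genotypes expected from allele frequencies (E)
n_sites=100     # genotyped sites (N)

# F = (O - E) / (N - E)
f=$(LC_ALL=C awk -v o="$obs_hom" -v e="$exp_hom" -v n="$n_sites" \
    'BEGIN { printf "%.4f", (o - e) / (n - e) }')
echo "F = $f"
```

              With these counts F is 10 / 30, or about 0.3333; an individual with exactly the
              expected number of homozygous genotypes would give F = 0.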

       --hardy
              Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as  defined
              by  Wigginton, Cutler and Abecasis (2005)). The resulting file (with suffix '.hwe')
              also contains the  Observed  numbers  of  Homozygotes  and  Heterozygotes  and  the
              corresponding Expected numbers under HWE.

       --missing
              Generates  two  files  reporting  the  missingness on a per-individual and per-site
              basis. The two files have suffixes '.imiss' and '.lmiss' respectively.

       --hap-r2

       --geno-r2

       --ld-window <int>

       --ld-window-bp <int>

       --min-r2 <float>
              These options  are  used  to  report  Linkage  Disequilibrium  (LD)  statistics  as
              summarised by the r2 statistic. The --hap-r2 option instructs vcftools to output a
              file reporting the r2 statistic using phased haplotypes. This  is  the  traditional
              measure  of  LD  often  reported  in  the population genetics literature. If phased
              haplotypes are unavailable then the --geno-r2 option may be used, which  calculates
              the  squared  correlation  coefficient  between  genotypes encoded as 0, 1 and 2 to
              represent the number of non-reference alleles in each individual. This is the  same
              as  the LD measure reported by PLINK. The haplotype version outputs a file with the
              suffix '.hap.ld', whereas the genotype version  outputs  a  file  with  the  suffix
              '.geno.ld'.  The haplotype version implies the option --phased.

              The  --ld-window  option  defines the maximum SNP separation for the calculation of
              LD. Likewise, the --ld-window-bp option can be used to define the maximum  physical
              separation of SNPs included in the LD calculation. Finally, the --min-r2 option
              sets a minimum value for r2 below which the LD statistic is not reported.
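
              The genotype-based statistic described for --geno-r2 is a squared correlation
              coefficient, which can be sketched for a single pair of sites as follows. The
              genotype vectors are made up, and the snippet illustrates the calculation only;
              it does not apply any window or --min-r2 threshold.

```shell
# Genotypes at two sites for five individuals, coded as the number of
# non-reference alleles (0, 1 or 2). Values are made up for illustration.
printf '0 1 2 1 0\n0 1 2 2 0\n' > genotypes.txt

# Squared Pearson correlation between the two genotype vectors.
r2=$(LC_ALL=C awk '
    NR == 1 { n = NF; for (i = 1; i <= n; i++) x[i] = $i }
    NR == 2 { for (i = 1; i <= n; i++) y[i] = $i }
    END {
        for (i = 1; i <= n; i++) {
            sx += x[i]; sy += y[i]
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i]
        }
        cov = n * sxy - sx * sy     # n^2 times the covariance
        vx  = n * sxx - sx * sx     # n^2 times the variance of x
        vy  = n * syy - sy * sy     # n^2 times the variance of y
        printf "%.3f", (cov * cov) / (vx * vy)
    }' genotypes.txt)
echo "r2 = $r2"
```

              For these two vectors the result is 225 / 280, printed as 0.804.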

       --SNPdensity <int>
              Calculates the number and density of SNPs in bins of size defined by  this  option.
              The resulting output file has the suffix '.snpden'.

       --TsTv <int>
              Calculates  the  Transition  /  Transversion  ratio in bins of size defined by this
              option. The resulting output file  has  the  suffix  '.TsTv'.  A  summary  is  also
              supplied in a file with the suffix '.TsTv.summary'.
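
              As a reminder of what is being counted, transitions are changes between A and G or
              between C and T, and all other single-base changes are transversions. The sketch
              below classifies five made-up REF/ALT pairs; it ignores binning and does not
              reproduce the layout of the '.TsTv' files.

```shell
# REF/ALT pairs for five made-up SNPs.
printf 'A G\nC T\nA C\nG T\nT C\n' > snps.txt

# Transitions: A<->G or C<->T. Everything else is a transversion.
LC_ALL=C awk '
    {
        pair = $1 $2
        if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
        else tv++
    }
    END { printf "Ts=%d Tv=%d Ts/Tv=%.2f\n", ts, tv, ts / tv }' snps.txt > tstv.txt
cat tstv.txt
```

              Three of the five pairs are transitions, giving a ratio of 1.50.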

       --FILTER-summary
              Generates a summary of the number of SNPs and Ts/Tv ratio for each FILTER category.
              The output file has the suffix '.FILTER.summary'.

       --filtered-sites
              Creates two files listing sites that have been kept or removed after filtering. The
              first  file,  with suffix '.kept.sites', lists sites kept by vcftools after filters
              have been applied. The second file, with the suffix '.removed.sites', lists sites
              removed by the applied filters.

       --singletons
              This  option  will  generate  a  file detailing the location of singletons, and the
              individual they occur in. The  file  reports  both  true  singletons,  and  private
              doubletons (i.e. SNPs where the minor allele only occurs in a single individual and
              that individual is homozygous for that allele). The output file has the suffix
              '.singletons'.

       --site-pi

       --window-pi <int>
              These options are used to estimate levels of nucleotide diversity. The first option
              does this on a per-site basis, and the output file has the suffix '.sites.pi'.  The
              second  option calculates the nucleotide diversity in windows, with the window size
              defined  in  the  option  argument.  Output  for  this  option   has   the   suffix
              '.windowed.pi'.  The  windowed  version requires phased data, and hence use of this
              option implies the --phased option.

   Output in Other Formats
       --012  This option outputs the genotypes as a large matrix. Three files are produced. The
              first, with suffix '.012', contains the genotypes of each individual on a separate
              line. Genotypes are represented as 0, 1 and 2, where the number represents the
              number of non-reference alleles. Missing genotypes are represented by -1. The
              second file, with suffix '.012.indv' details the individuals included in  the  main
              file. The third file, with suffix '.012.pos' details the site locations included in
              the main file.
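
              The genotype encoding can be sketched as follows: each genotype is replaced by its
              count of non-reference alleles, missing calls become -1, and the matrix is
              transposed so that each individual occupies one line. The input data and file
              names are made up, and the real '.012' file may contain additional columns that
              this sketch omits.

```shell
# Genotypes of three individuals at four sites (rows = sites, made-up data).
printf '0/0\t0/1\t1/1\n0|1\t./.\t0|0\n1/1\t1/0\t0/0\n0/0\t0/0\t./.\n' > gt.txt

# Emit one line per individual: count of non-reference alleles, -1 if missing.
LC_ALL=C awk -F '\t' '
    {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\./) code = -1
            else { split($i, a, "[|/]"); code = (a[1] != "0") + (a[2] != "0") }
            row[i] = (NR == 1 ? "" : row[i] "\t") code
        }
        n = NF
    }
    END { for (i = 1; i <= n; i++) print row[i] }' gt.txt > matrix.012
cat matrix.012
```

              The second individual, for example, is written as 1, -1, 1, 0 across the four
              sites.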

       --IMPUTE
              This option outputs phased haplotypes in IMPUTE reference-panel format.  As  IMPUTE
              requires   phased   data,  using  this  option  also  implies  --phased.   Unphased
              individuals and  genotypes  are  therefore  excluded.  Only  bi-allelic  sites  are
              included  in  the  output.  Using  this  option  generates three files.  The IMPUTE
              haplotype file has the suffix '.impute.hap', and the IMPUTE  legend  file  has  the
              suffix  '.impute.hap.legend'.  The  third  file,  with  suffix  '.impute.hap.indv',
              details the individuals included in the haplotype file, although this file  is  not
              needed by IMPUTE.

       --ldhat

       --ldhat-geno
              These options output data in LDhat format. Use of these options also requires the
              --chr option to be used. The --ldhat option outputs phased data only, and therefore
              also  implies  --phased,  leading  to  unphased  individuals  and  genotypes  being
              excluded. Alternatively,  the  --ldhat-geno  option  treats  all  of  the  data  as
              unphased,  and therefore outputs LDhat files in genotype/unphased format. In either
              case, two files are generated with the suffixes '.ldhat.sites'  and  '.ldhat.locs',
              which correspond to the LDhat 'sites' and 'locs' input files respectively.

       --BEAGLE-GL
              This  option  outputs  genotype  likelihood  information  for input into the BEAGLE
              program. This option requires the VCF file to contain the FORMAT GL tag, which  can
              generally be output by SNP callers such as the GATK.  Use of this option requires a
              chromosome to be specified via the --chr option. The resulting  output  file  (with
              the  suffix '.BEAGLE.GL') contains genotype likelihoods for biallelic sites, and is
              suitable for input into BEAGLE via the 'like=' argument.

       --plink
              This option outputs the genotype data in PLINK PED format. Two files are generated,
              with  suffixes  '.ped'  and  '.map'. Note that only bi-allelic loci will be output.
              Further details of these files can be found in the PLINK documentation.

              Note: This option can be very slow on large datasets. Using  the  --chr  option  to
              divide up the dataset is advised.

       --plink-tped
              The  --plink  option  above can be extremely slow on large datasets. An alternative
              that might be considerably quicker is to output in  the  PLINK  transposed  format.
              This  can  be achieved using the --plink-tped option, which produces two files with
              suffixes '.tped' and '.tfam'.

       --recode
              The --recode option is used to generate a VCF file from the input VCF  file  having
              applied  the  options  specified  by  the  user.  The  output  file  has the suffix
              '.recode.vcf'.

              By default, the INFO fields are removed from the output file, as  the  INFO  values
              may  be  invalidated  by  the  recoding  (e.g.  the  total  depth  may  need  to be
              recalculated if  individuals  are  removed).  This  default  functionality  can  be
              overridden  by  using  the  --keep-INFO <string> option, where <string> defines the
              INFO key to keep in the output file. The --keep-INFO  flag  can  be  used  multiple
              times.  Alternatively,  the  option  --keep-INFO-all can be used to retain all INFO
              fields.

   Miscellaneous
       --extract-FORMAT-info <string>
              Extract information from the genotype fields in the VCF file relating to a specified
              FORMAT identifier. For example, using the option '--extract-FORMAT-info GT' would
              extract all of the GT (i.e. Genotype) entries. The resulting output file has
              the suffix '.<FORMAT_ID>.FORMAT'.

       --get-INFO <string>
              This option is used to extract information from the INFO field in the VCF file. The
              <string> argument specifies the INFO tag to be extracted, and  the  option  can  be
              used multiple times in order to extract multiple INFO entries.  The resulting file,
              with suffix '.INFO', contains the required  INFO  information  in  a  tab-separated
              table. For example, to extract the NS and DB flags, one would use the command:

                    vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   VCF File Comparison Options
       The  file  comparison  options  are currently in a state of flux and likely buggy.  If you
       find a bug, please report it. Note that genotype-level filters are not supported in  these
       options.

       --diff <filename>

       --gzdiff <filename>
              Select  a  VCF  file  for  comparison  with the file specified by the --vcf option.
              Outputs two files describing the sites and individuals  common  /  unique  to  each
              file.    These    files    have    the    suffixes    '.diff.sites_in_files'    and
              '.diff.indv_in_files' respectively. The  --gzdiff  version  can  be  used  to  read
              compressed VCF files.

       --diff-site-discordance
              Used  in  conjunction  with the --diff option to calculate discordance on a site by
              site basis. The resulting output file has the suffix '.diff.sites'.

       --diff-indv-discordance
              Used in conjunction with the --diff option  to  calculate  discordance  on  a  per-
              individual basis. The resulting output file has the suffix '.diff.indv'.

       --diff-discordance-matrix
              Used in conjunction with the --diff option to calculate a discordance matrix.  This
              option only works with bi-allelic loci with matching alleles that  are  present  in
              both files. The resulting output file has the suffix '.diff.discordance.matrix'.

       --diff-switch-error
              Used   in   conjunction   with  the  --diff  option  to  calculate  phasing  errors
              (specifically 'switch errors'). This option generates two output  files  describing
              switch  errors  found  between  sites, and the average switch error per individual.
              These  two  files  have  the  suffixes   '.diff.switch'   and   '.diff.indv.switch'
              respectively.

   Options still in development
       The  following options are yet to be finalised, are likely to contain bugs, and are likely
       to change in the future.

       --fst <filename>

       --gzfst <filename>
              Calculate FST for a pair of VCF files, with the second file being specified by this
              option.   FST   is   currently  calculated  using  the  formula  described  in  the
              supplementary material of the Phase I HapMap paper. Currently,  only  pairwise  FST
              calculations  are  supported,  although  this will likely change in the future. The
              --gzfst option can be used to read compressed VCF files.

       --LROH Identify Long Runs of Homozygosity.

       --relatedness
              Output Individual Relatedness Statistics.