Provided by: vcftools_0.1.11+dfsg-1_amd64

NAME

       vcftools - analyse VCF files

SYNOPSIS

       vcftools [OPTIONS]

DESCRIPTION

       The  vcftools  program is run from the command line. The interface is inspired by PLINK, and so should be
       largely familiar to users of that package. Commands take the following form:

         vcftools --vcf file1.vcf --chr 20 --freq

       The above command tells vcftools to read in the file file1.vcf,  extract  sites  on  chromosome  20,  and
       calculate  the allele frequency at each site.  The resulting allele frequency estimates are stored in the
       output file, out.frq. As in the above example, output from vcftools is mainly sent to output files, as
       opposed to being shown on the screen.

       Note  that  some  commands  may only be available in the latest version of vcftools. To obtain the latest
       version, you should use SVN to check out the latest code, as described on the home page.

       Also note that polyploid genotypes are not currently supported.

   Basic Options
       --vcf <filename>
              This option defines the VCF file to be processed. The files need to be decompressed prior  to  use
              with  vcftools.  vcftools  expects files in VCF format v4.0, a specification of which can be found
              here.

       --gzvcf <filename>
              This option can be used in place of the --vcf  option  to  read  compressed  (gzipped)  VCF  files
              directly. Note that this option can be quite slow when used with large files.

       --out <prefix>
              This  option  defines the output filename prefix for all files generated by vcftools. For example,
              if  <prefix>  is  set  to  output_filename,  then  all  output  files  will   be   of   the   form
              output_filename.*** . If this option is omitted, all output files will have the prefix 'out.'.

   Site Filter Options
       --chr <chromosome>
              Only process sites with a chromosome identifier matching <chromosome>.

       --from-bp <integer>

       --to-bp <integer>
              These options define the physical range of sites that will be processed. Sites outside of this range
              will be excluded. These options can only be used in conjunction with --chr.
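
              As a hedged illustration of how these options combine, the sketch below restricts an analysis
              to a region of chromosome 20 and requests allele frequencies. The file name file1.vcf, the
              coordinates and the prefix chr20_region are placeholders chosen for this example only.

```shell
# Placeholder inputs (assumptions for illustration): adjust to your own data.
vcf="file1.vcf"
prefix="chr20_region"

# --from-bp and --to-bp are only valid together with --chr.
cmd="vcftools --vcf $vcf --chr 20 --from-bp 1000000 --to-bp 2000000 --freq --out $prefix"
echo "$cmd"

# Run only if vcftools and the input file are actually present.
# $cmd is left unquoted on purpose so that it splits into separate arguments.
if command -v vcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    $cmd
fi

# With --freq, the result is expected in <prefix>.frq.
expected_output="${prefix}.frq"
echo "expected output: $expected_output"
```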

       --snp <string>
              Include SNP(s) with matching ID. This command can be used multiple times in order to include  more
              than one SNP.

       --snps <filename>
              Include a list of SNPs given in a file. The file should contain a list of SNP IDs, with one ID per
              line.

       --exclude <filename>
              Exclude a list of SNPs given in a file. The file should contain a list of SNP IDs, with one ID per
              line.

       --positions <filename>
              Include  a  set  of  sites on the basis of a list of positions. Each line of the input file should
              contain a (tab-separated) chromosome and position.  The file should have a header line. Sites  not
              included in the list are excluded.
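
              A minimal sketch of such a positions file follows. The header names (chr, pos), the
              coordinates and the file names are assumptions made for this example.

```shell
# Hypothetical positions file: one header line, then tab-separated chromosome and position.
printf 'chr\tpos\n20\t14370\n20\t17330\n' > positions.txt
cat positions.txt

# Run only if vcftools and the (placeholder) input file are present.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --positions positions.txt --freq
fi
```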

       --bed <filename>

       --exclude-bed <filename>
              Include or exclude a set of sites on the basis of a BED file. Only the first three columns (chrom,
              chromStart and chromEnd) are required. The BED file should have a header line.
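
              The following sketch writes a small BED file with the three required columns and a header
              line, then uses it as an include filter. The intervals and file names are placeholders.

```shell
# Hypothetical BED file: header line, then tab-separated chrom, chromStart, chromEnd.
printf 'chrom\tchromStart\tchromEnd\n20\t1000000\t1100000\n20\t1500000\t1600000\n' > regions.bed
cat regions.bed

# Swap --bed for --exclude-bed to drop these intervals instead of keeping them.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --bed regions.bed --freq
fi
```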

       --remove-filtered-all

       --remove-filtered <string>

       --keep-filtered <string>
              These  options  are  used  to  filter  sites  on the basis of their FILTER flag.  The first option
              removes all sites with a FILTER flag. The second option can  be  used  to  exclude  sites  with  a
              specific filter flag. The third option can be used to select sites on the basis of specific filter
              flags.   The  second and third options can be used multiple times to specify multiple FILTERs. The
              --keep-filtered option is applied before the --remove-filtered option.

       --minQ <float>
              Include only sites with Quality above this threshold.

       --min-meanDP <float>

       --max-meanDP <float>
              Include sites with mean Depth within the thresholds defined by these options.

       --maf <float>

       --max-maf <float>
              Include only sites with Minor Allele Frequency within the specified range.

       --non-ref-af <float>

       --max-non-ref-af <float>
              Include only sites with Non-Reference Allele Frequency within the specified range.

       --hwe <float>
              Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler
              and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken  to
              be out of HWE, and therefore excluded.

       --geno <float>
              Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1).

       --min-alleles <int>

       --max-alleles <int>
              Include  only  sites with a number of alleles within the specified range.  For example, to include
              only bi-allelic sites, one could use:

                    vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

       --mask <filename>

       --invert-mask <filename>

       --mask-min <filename>
              Include sites on the basis of a FASTA-like file. The provided file contains a sequence of  integer
              digits (between 0 and 9) for each position on a chromosome that specify if a site at that position
              should be filtered or not.  An example mask file would look like:

                    >1
                    0000011111222...

              In this example, sites in the VCF file located within the first 5 bases of the start of chromosome
              1  would be kept, whereas sites at position 6 onwards would be filtered out. The threshold integer
              that determines if sites are filtered or not is set using the --mask-min option, which defaults to
              0.  The chromosomes contained in the mask file must be sorted in the same order as the  VCF  file.
              The  --mask  option  is used to specify the mask file to be used, whereas the --invert-mask option
              can be used to specify a mask file that will be inverted before being applied.
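
              The threshold rule can be sketched outside vcftools. The loop below applies one reading of the
              description above (a position is kept when its mask digit does not exceed the --mask-min
              value) to the example mask sequence. It illustrates the rule only and is not the vcftools
              implementation.

```shell
# Mask digits for the first 13 positions of chromosome 1, from the example above.
mask_seq="0000011111222"
mask_min=0   # same as the documented default for --mask-min

kept=""
pos=1
while [ "$pos" -le "${#mask_seq}" ]; do
    digit=$(printf '%s' "$mask_seq" | cut -c "$pos")
    # Assumed rule: keep the position when its digit is at most the threshold.
    if [ "$digit" -le "$mask_min" ]; then
        kept="$kept $pos"
    fi
    pos=$((pos + 1))
done
echo "kept positions:$kept"
```

              With mask_min=0 only positions 1 to 5 survive, which matches the example in the text.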

   Individual Filters
       --indv <string>
              Specify an individual to be kept in the analysis. This  option  can  be  used  multiple  times  to
              specify multiple individuals.

       --keep <filename>
              Provide a file containing a list of individuals to include in subsequent analysis. Each
              individual ID (as defined in the VCF header line) should be included on a separate line.

       --remove-indv <string>
              Specify an individual to be removed from the analysis. This option can be used multiple  times  to
              specify  multiple  individuals.  If the --indv option is also specified, then the --indv option is
              executed before the --remove-indv option.

       --remove <filename>
              Provide a file containing a list of individuals to exclude in subsequent analysis. Each individual
              ID (as defined in the VCF header line) should be included on a separate line. If both the --keep
              and the --remove options are used, then the --keep option is executed before the --remove option.
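
              One way to build such a list is to read the individual IDs from the VCF header line. The
              sketch below uses a tiny stand-in header with made-up sample names (NA00001 to NA00003). The
              file names are likewise placeholders.

```shell
# Stand-in for the tab-separated #CHROM header line of a VCF file.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003\n' > header_only.vcf

# Individual IDs are the columns after FORMAT, i.e. column 10 onwards.
grep '^#CHROM' header_only.vcf | cut -f 10- | tr '\t' '\n' > all_individuals.txt

# Keep the first two individuals, one ID per line.
head -n 2 all_individuals.txt > keep.txt
cat keep.txt

# Run only if vcftools and the (placeholder) input file are present.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --keep keep.txt --freq
fi
```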

       --min-indv-meanDP <float>

       --max-indv-meanDP <float>
              Calculate  the  mean coverage on a per-individual basis. Only individuals with coverage within the
              range specified by these options are included in subsequent analyses.

       --mind <float>
              Specify the minimum call rate threshold for each individual.

       --phased
              First excludes all individuals having all genotypes unphased, and subsequently excludes all  sites
              with unphased genotypes. The remaining data therefore consists of phased data only.

   Genotype Filters
       --remove-filtered-geno-all

       --remove-filtered-geno <string>
              The  first  option  removes  all  genotypes  with  a FILTER flag. The second option can be used to
              exclude genotypes with a specific filter flag.

       --minGQ <float>
              Exclude all genotypes with a quality below the threshold specified by this option (GQ).

       --minDP <float>
              Exclude all genotypes with a sequencing depth below that specified by this option (DP).

   Output Statistics
       --freq

       --counts

       --freq2

       --counts2
              Output per-site frequency information. The --freq option outputs the allele frequency in a file
              with the suffix '.frq'. The --counts option outputs a similar file with the suffix '.frq.count',
              which contains the raw allele counts at each site. The --freq2 and --counts2 options are used to
              suppress  allele  information  in  the  output  file.  In this case, the order of the freqs/counts
              depends on the numbering in the VCF file.

       --depth
              Generates a file containing the mean depth per individual. This file has the suffix '.idepth'.

       --site-depth

       --site-mean-depth
              Generates a file containing the depth per site. The --site-depth option outputs the depth for each
              site  summed  across  individuals.  This  file   has   the   suffix   '.ldepth'.   Likewise,   the
              --site-mean-depth  outputs  the  mean  depth  for  each  site,  and the output file has the suffix
              '.ldepth.mean'.

       --geno-depth
              Generates a (possibly very large) file containing the depth for each genotype  in  the  VCF  file.
              Missing entries are given the value -1. The file has the suffix '.gdepth'.

       --site-quality
              Generates a file containing the per-site SNP quality, as found in the QUAL column of the VCF file.
              This file has the suffix '.lqual'.

       --het  Calculates a measure of heterozygosity on a per-individual basis. Specifically, the inbreeding
              coefficient, F, is estimated for each individual using a method of moments. The resulting file has
              the suffix '.het'.
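
              For orientation, the usual method-of-moments estimator is F = (O - E) / (N - E), where O is the
              observed number of homozygous sites, E the number expected under Hardy-Weinberg proportions
              and N the number of sites. This page does not state the formula, so treat the form below as an
              assumption. The counts are invented for the example.

```shell
# Invented counts for one individual: observed homozygotes, expected homozygotes, total sites.
observed=90
expected=80
sites=100

# Assumed estimator: F = (O - E) / (N - E).
f_stat=$(awk -v o="$observed" -v e="$expected" -v n="$sites" \
    'BEGIN { printf "%.2f", (o - e) / (n - e) }')
echo "F = $f_stat"
```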

       --hardy
              Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as defined  by  Wigginton,
              Cutler  and  Abecasis  (2005)). The resulting file (with suffix '.hwe') also contains the Observed
              numbers of Homozygotes and Heterozygotes and the corresponding Expected numbers under HWE.

       --missing
              Generates two files reporting the missingness on a per-individual  and  per-site  basis.  The  two
              files have suffixes '.imiss' and '.lmiss' respectively.

       --hap-r2

       --geno-r2

       --ld-window <int>

       --ld-window-bp <int>

       --min-r2 <float>
              These  options  are  used to report Linkage Disequilibrium (LD) statistics as summarised by the r2
              statistic. The --hap-r2 option informs vcftools to output a file reporting the r2 statistic  using
              phased haplotypes. This is the traditional measure of LD often reported in the population genetics
              literature.  If  phased  haplotypes  are  unavailable then the --geno-r2 option may be used, which
              calculates the squared correlation coefficient  between  genotypes  encoded  as  0,  1  and  2  to
              represent  the  number  of  non-reference  alleles  in each individual. This is the same as the LD
              measure reported by PLINK. The haplotype version outputs a file with the suffix '.hap.ld', whereas
              the genotype version outputs a file with the suffix '.geno.ld'.  The haplotype version implies the
              option --phased.

              The --ld-window option defines the maximum SNP separation for the calculation of LD. Likewise, the
              --ld-window-bp option can be used to define the maximum physical separation of  SNPs  included  in
              the LD calculation. Finally, the --min-r2 sets a minimum value for r2 below which the LD statistic
              is not reported.
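
              The genotype-based statistic can be reproduced by hand for a pair of sites. The sketch below
              computes the squared Pearson correlation between two invented genotype vectors coded as 0, 1
              and 2. It illustrates the definition only, and makes no claim about how vcftools handles
              missing data or multi-allelic sites.

```shell
# Each input line is one individual: genotype at site A, genotype at site B (0, 1 or 2).
r2=$(printf '0 0\n1 1\n2 2\n1 0\n0 1\n' | awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
        cov = n * sxy - sx * sy      # n^2 times the covariance
        vx  = n * sxx - sx * sx      # n^2 times the variance at site A
        vy  = n * syy - sy * sy      # n^2 times the variance at site B
        printf "%.4f", (cov * cov) / (vx * vy)
    }')
echo "genotype r2 = $r2"
```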

       --SNPdensity <int>
              Calculates  the  number  and density of SNPs in bins of size defined by this option. The resulting
              output file has the suffix '.snpden'.

       --TsTv <int>
              Calculates the Transition / Transversion ratio in  bins  of  size  defined  by  this  option.  The
              resulting output file has the suffix '.TsTv'. A summary is also supplied in a file with the suffix
              '.TsTv.summary'.
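
              As a reminder of what is being counted, transitions are A<->G and C<->T substitutions, and all
              other single-base substitutions are transversions. The sketch below classifies a handful of
              invented REF/ALT pairs. It does not reproduce the binning performed by vcftools.

```shell
# Each input line is an invented REF ALT pair for a bi-allelic SNP.
tstv=$(printf 'A G\nC T\nA C\nG T\nT C\n' | awk '
    {
        pair = $1 $2
        if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
        else tv++
    }
    END { printf "%d %d %.2f", ts, tv, ts / tv }')
echo "Ts Tv ratio: $tstv"
```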

       --FILTER-summary
              Generates  a  summary  of  the number of SNPs and Ts/Tv ratio for each FILTER category. The output
              file has the suffix '.FILTER.summary'.

       --filtered-sites
              Creates two files listing sites that have been kept or removed after filtering.  The  first  file,
              with  suffix  '.kept.sites',  lists  sites  kept  by vcftools after filters have been applied. The
              second file, with the suffix '.removed.sites', lists sites removed by the applied filters.

       --singletons
              This option will generate a file detailing the location of singletons,  and  the  individual  they
              occur in. The file reports both true singletons, and private doubletons (i.e. SNPs where the minor
              allele only occurs in a single individual and that individual is homozygous for that allele).
              The output file has the suffix '.singletons'.

       --site-pi

       --window-pi <int>
              These options are used to estimate levels of nucleotide diversity. The first option does this on a
              per-site basis, and the output file has the suffix '.sites.pi'. The second option  calculates  the
              nucleotide  diversity  in windows, with the window size defined in the option argument. Output for
              this option has the suffix '.windowed.pi'. The windowed version requires phased  data,  and  hence
              use of this option implies the --phased option.

   Output in Other Formats
       --012  This option outputs the genotypes as a large matrix. Three files are produced. The first, with
              suffix '.012', contains the genotypes of each individual on a separate line. Genotypes are
              represented as 0, 1 and 2, where the number represents the number of non-reference alleles.
              Missing genotypes are represented by -1. The second file,  with  suffix  '.012.indv'  details  the
              individuals  included  in  the  main file. The third file, with suffix '.012.pos' details the site
              locations included in the main file.

       --IMPUTE
              This option outputs phased haplotypes in IMPUTE reference-panel format. As IMPUTE requires  phased
              data,  using  this option also implies --phased.  Unphased individuals and genotypes are therefore
              excluded. Only bi-allelic sites are included in the output.  Using  this  option  generates  three
              files.  The IMPUTE haplotype file has the suffix '.impute.hap', and the IMPUTE legend file has the
              suffix   '.impute.hap.legend'.  The  third  file,  with  suffix  '.impute.hap.indv',  details  the
              individuals included in the haplotype file, although this file is not needed by IMPUTE.

       --ldhat

       --ldhat-geno
              These options output data in LDhat format. Use of these options also requires the --chr option to
              be used. The --ldhat option outputs phased data only, and therefore also implies --phased, leading
              to  unphased  individuals  and  genotypes  being  excluded. Alternatively, the --ldhat-geno option
              treats all of the data as unphased, and therefore outputs LDhat files in genotype/unphased format.
              In either case, two files are generated with the suffixes '.ldhat.sites' and '.ldhat.locs',  which
              correspond to the LDhat 'sites' and 'locs' input files respectively.

       --BEAGLE-GL
              This option outputs genotype likelihood information for input into the BEAGLE program. This option
              requires  the  VCF file to contain the FORMAT GL tag, which can generally be output by SNP callers
              such as the GATK.  Use of this option requires a chromosome to be specified via the --chr  option.
              The  resulting  output  file  (with  the  suffix  '.BEAGLE.GL')  contains genotype likelihoods for
              biallelic sites, and is suitable for input into BEAGLE via the 'like=' argument.

       --plink
              This option outputs the genotype data in PLINK PED format. Two files are generated, with  suffixes
              '.ped'  and  '.map'. Note that only bi-allelic loci will be output. Further details of these files
              can be found in the PLINK documentation.

              Note: This option can be very slow on large datasets. Using the --chr  option  to  divide  up  the
              dataset is advised.

       --plink-tped
              The  --plink  option  above  can be extremely slow on large datasets. An alternative that might be
              considerably quicker is to output in the PLINK transposed format. This can be achieved  using  the
              --plink-tped option, which produces two files with suffixes '.tped' and '.tfam'.

       --recode
              The  --recode  option  is  used  to generate a VCF file from the input VCF file having applied the
              options specified by the user. The output file has the suffix '.recode.vcf'.

              By default, the INFO fields are  removed  from  the  output  file,  as  the  INFO  values  may  be
              invalidated  by  the recoding (e.g. the total depth may need to be recalculated if individuals are
              removed). This default functionality can be overridden by using the --keep-INFO  <string>  option,
              where  <string>  defines the INFO key to keep in the output file. The --keep-INFO flag can be used
              multiple times. Alternatively, the option --keep-INFO-all can be used to retain all INFO fields.
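
              A hedged sketch of a typical filter-and-recode run follows. The input name, the quality
              threshold, the prefix and the INFO keys DP and AF are placeholders, and whether those keys
              exist depends on the VCF file in question.

```shell
# Placeholder inputs (assumptions for illustration).
vcf="file1.vcf"
prefix="filtered"

# Keep chromosome 20 sites above a quality threshold, and retain two INFO keys in the output.
cmd="vcftools --vcf $vcf --chr 20 --minQ 30 --recode --keep-INFO DP --keep-INFO AF --out $prefix"
echo "$cmd"

# Run only if vcftools and the input file are actually present.
# $cmd is left unquoted on purpose so that it splits into separate arguments.
if command -v vcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    $cmd
fi

# With --recode, the new VCF is expected in <prefix>.recode.vcf.
recoded="${prefix}.recode.vcf"
echo "expected output: $recoded"
```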

   Miscellaneous
       --extract-FORMAT-info <string>
              Extract information from the genotype fields in the VCF file relating to a specified FORMAT
              identifier. For example, using the option '--extract-FORMAT-info GT' would extract all of the
              GT (i.e. Genotype) entries. The resulting output file has the suffix '.<FORMAT_ID>.FORMAT'.

       --get-INFO <string>
              This option is used to extract information from the INFO field  in  the  VCF  file.  The  <string>
              argument  specifies  the  INFO  tag  to be extracted, and the option can be used multiple times in
              order to extract multiple INFO entries.  The resulting file, with  suffix  '.INFO',  contains  the
              required  INFO  information in a tab-separated table. For example, to extract the NS and DB flags,
              one would use the command:

                    vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   VCF File Comparison Options
       The file comparison options are currently in a state of flux and likely buggy.  If you find a bug, please
       report it. Note that genotype-level filters are not supported in these options.

       --diff <filename>

       --gzdiff <filename>
              Select a VCF file for comparison with the file specified by the --vcf option.  Outputs  two  files
              describing  the  sites and individuals common / unique to each file. These files have the suffixes
              '.diff.sites_in_files' and '.diff.indv_in_files' respectively. The --gzdiff version can be used to
              read compressed VCF files.

       --diff-site-discordance
              Used in conjunction with the --diff option to calculate discordance on a site by site  basis.  The
              resulting output file has the suffix '.diff.sites'.

       --diff-indv-discordance
              Used in conjunction with the --diff option to calculate discordance on a per-individual basis. The
              resulting output file has the suffix '.diff.indv'.

       --diff-discordance-matrix
              Used  in  conjunction  with the --diff option to calculate a discordance matrix.  This option only
              works with bi-allelic loci with matching alleles that are present in  both  files.  The  resulting
              output file has the suffix '.diff.discordance.matrix'.

       --diff-switch-error
              Used  in  conjunction  with  the  --diff  option to calculate phasing errors (specifically 'switch
              errors'). This option generates two output files describing switch errors found between sites, and
              the average switch error per individual. These two files  have  the  suffixes  '.diff.switch'  and
              '.diff.indv.switch' respectively.

   Options still in development
       The  following  options  are yet to be finalised, are likely to contain bugs, and are likely to change in
       the future.

       --fst <filename>

       --gzfst <filename>
              Calculate FST for a pair of VCF files, with the second file being specified by this option. FST is
              currently calculated using the formula described in the supplementary  material  of  the  Phase  I
              HapMap  paper.  Currently, only pairwise FST calculations are supported, although this will likely
              change in the future. The --gzfst option can be used to read compressed VCF files.

       --LROH Identify Long Runs of Homozygosity.

       --relatedness
              Output Individual Relatedness Statistics.

vcftools 0.1.5                                      July 2011                                        VCFTOOLS(1)