Provided by: vcftools_0.1.15-1_amd64

NAME

       vcftools  -  Utilities  for  the  variant call format (VCF) and binary variant call format
       (BCF)

SYNOPSIS

       vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT  PREFIX  ]  [  FILTERING
       OPTIONS ]  [ OUTPUT OPTIONS ]

DESCRIPTION

       vcftools  is a suite of functions for use on genetic variation data in the form of VCF and
       BCF files. The tools provided will be used mainly to summarize data, run  calculations  on
       data, filter out data, and convert data into other useful file formats.

EXAMPLES

       Output allele frequency for all sites in the input vcf file from chromosome 1
         vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

       Output a new vcf file from the input vcf file that removes any indel sites
         vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

       Output file comparing the sites in two vcf files
         vcftools   --gzvcf  input_file1.vcf.gz  --gzdiff  input_file2.vcf.gz  --diff-site  --out
         in1_v_in2

       Output a new vcf file to standard out without any sites  that  have  a  filter  tag,  then
       compress it with gzip
         vcftools  --gzvcf  input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c >
         output_PASS_only.vcf.gz

       Output a Hardy-Weinberg p-value for every site in the bcf file  that  does  not  have  any
       missing genotypes
         vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

       Output nucleotide diversity at a list of positions
         zcat  input_file.vcf.gz  |  vcftools  --vcf  -  --site-pi --positions SNP_list.txt --out
         nucleotide_diversity

BASIC OPTIONS

       These options are used to specify the input and output files.

   INPUT FILE OPTIONS
         --vcf <input_filename>
           This option defines the VCF file to be processed. VCFtools expects files in VCF format
           v4.0,  v4.1  or v4.2. The latter two are supported with some small limitations. If the
           user provides a dash character '-' as a file name, the program expects a VCF  file  to
           be piped in through standard in.

         --gzvcf <input_filename>
           This  option can be used in place of the --vcf option to read compressed (gzipped) VCF
           files directly.

         --bcf <input_filename>
           This option can be used in place of the --vcf option to read BCF2 files directly.  You
           do  not  need  to  specify  if this file is compressed with BGZF encoding. If the user
           provides a dash character '-' as a file name, the program expects a BCF2  file  to  be
           piped in through standard in.

   OUTPUT FILE OPTIONS
         --out <output_prefix>
           This option defines the output filename prefix for all files generated by vcftools.
           For example, if <output_prefix> is set to output_filename, then all output files will
           be of the form output_filename.***. If this option is omitted, all output files will
           have the prefix "out." in the current working directory.

         --stdout
         -c
           These options direct the vcftools output to standard out  so  it  can  be  piped  into
           another  program  or  written  directly to a filename of choice. However, a select few
           output functions cannot be written to standard out.

         --temp <temporary_directory>
           This option can be used to redirect any temporary files that vcftools creates  into  a
           specified directory.

SITE FILTERING OPTIONS

       These  options  are  used  to  include  or  exclude  certain sites from any analysis being
       performed by the program.

   POSITION FILTERING
         --chr <chromosome>
         --not-chr <chromosome>
           Includes or excludes sites with identifiers matching <chromosome>. These options may
           be used multiple times to include or exclude more than one chromosome.

         --from-bp <integer>
         --to-bp <integer>
           These  options  specify  a  lower  bound  and  upper  bound for a range of sites to be
           processed. Sites with positions less  than  or  greater  than  these  values  will  be
           excluded.  These options can only be used in conjunction with a single usage of --chr.
           Using one of these does not require use of the other.

         --positions <filename>
         --exclude-positions <filename>
           Include or exclude a set of sites on the basis of a list of positions in a file.  Each
           line of the input file should contain a (tab-separated) chromosome and position. The
           file can have comment lines that start with a "#"; these are ignored.
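
           The file layout described above can be sketched as follows. This is only an
           illustration: the file name SNP_list.txt is borrowed from the EXAMPLES section, while
           the coordinates, the input file name and the output prefix are invented.

```shell
# Write a positions file: one tab-separated chromosome/position pair per line.
# Lines starting with "#" are comments. All coordinates here are invented.
printf '#chrom\tpos\n1\t10583\n1\t13302\n2\t45895\n' > SNP_list.txt

# Count the non-comment entries as a quick sanity check.
grep -vc '^#' SNP_list.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --positions SNP_list.txt --recode --out listed_sites
fi
```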

         --positions-overlap <filename>
         --exclude-positions-overlap <filename>
           Include or exclude a set of sites on the basis of the reference allele overlapping
           with a list of positions in a file. Each line of the input file should contain a
           (tab-separated) chromosome and position. The file can have comment lines that start
           with a "#"; these are ignored.

         --bed <filename>
         --exclude-bed <filename>
           Include  or  exclude  a  set of sites on the basis of a BED file. Only the first three
           columns (chrom, chromStart and chromEnd) are required. The BED  file  is  expected  to
           have  a header line. A site will be kept or excluded if any part of any allele (REF or
           ALT) at a site is within the range of one of the BED entries.
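
           A minimal sketch of a BED file in the expected layout, with a header line followed by
           the three required columns. The regions, the file name regions.bed, the input file
           name and the output prefix are all invented for illustration.

```shell
# BED file with a header line and the three required columns
# (chrom, chromStart, chromEnd). The regions are invented.
printf 'chrom\tchromStart\tchromEnd\n1\t10000\t20000\n2\t500000\t750000\n' > regions.bed

# Show the regions, i.e. everything after the header line.
tail -n +2 regions.bed

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --bed regions.bed --recode --out in_regions
fi
```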

         --thin <integer>
           Thin sites so that no two sites are within the specified distance from one another.

         --mask <filename>
         --invert-mask <filename>
         --mask-min <integer>
           These options are used to specify a FASTA-like mask file to filter with. The mask file
           contains  a  sequence  of  integer  digits  (between  0  and 9) for each position on a
           chromosome that specify if a site at that position should be filtered or not.
           An example mask file would look like:
             >1
             0000011111222...
             >2
             2222211111000...
           In this example, sites in the VCF file located within the first 5 bases of the start
           of chromosome 1 would be kept, whereas sites at position 6 onwards would be filtered
           out. On chromosome 2, sites within the first 10 bases would be filtered out, whereas
           sites from position 11 onwards (where the mask value is 0) would be kept.
           The "--invert-mask" option takes the same format mask file as the "--mask" option,
           but inverts the mask before filtering with it.
           The "--mask-min" option specifies a threshold mask value between 0 and 9 to filter
           positions by. The default threshold is 0, meaning only sites with that value or lower
           will be kept.
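
           The threshold rule can be tried out on a small, complete mask file. The sketch below
           is not vcftools' own code: it simply applies the rule as stated (a position is kept
           when its digit is less than or equal to the threshold) using awk. The file names and
           the truncation of each sequence to 13 positions are assumptions made for the example.

```shell
# A complete toy mask file (the example above is truncated with "...").
printf '>1\n0000011111222\n>2\n2222211111000\n' > mask.fa

# List the positions that pass the threshold: a position is kept when its
# digit is <= the threshold (the default threshold is 0).
threshold=0
awk -v t="$threshold" '
    /^>/ { chrom = substr($0, 2); pos = 0; next }   # start of a new chromosome
    {
        for (i = 1; i <= length($0); i++) {
            pos++
            if (substr($0, i, 1) + 0 <= t) print chrom "\t" pos
        }
    }
' mask.fa > mask_kept_positions.txt
cat mask_kept_positions.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --mask mask.fa --mask-min "$threshold" --recode --out masked
fi
```

           With the threshold at 0 this lists positions 1 to 5 on chromosome 1 and positions 11
           to 13 on chromosome 2.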

   SITE ID FILTERING
         --snp <string>
           Include SNP(s) with matching ID (e.g. a dbSNP rsID). This command can be used multiple
           times in order to include more than one SNP.

         --snps <filename>
         --exclude <filename>
           Include or exclude a list of SNPs given in a file. The file should contain a  list  of
           SNP IDs (e.g. dbSNP rsIDs), with one ID per line. No header line is expected.

   VARIANT TYPE FILTERING
         --keep-only-indels
         --remove-indels
           Include  or  exclude  sites that contain an indel. For these options "indel" means any
           variant that alters the length of the REF allele.

   FILTER FLAG FILTERING
         --remove-filtered-all
           Removes all sites with a FILTER flag other than PASS.

         --keep-filtered <string>
         --remove-filtered <string>
           Includes or excludes all sites marked with a specific FILTER flag. These  options  may
           be used more than once to specify multiple FILTER flags.

   INFO FIELD FILTERING
         --keep-INFO <string>
         --remove-INFO <string>
           Includes or excludes all sites with a specific INFO flag. These options only filter on
           the presence of the flag and not its value. These options can be used  multiple  times
           to specify multiple INFO flags.

   ALLELE FILTERING
         --maf <float>
         --max-maf <float>
           Include  only sites with a Minor Allele Frequency greater than or equal to the "--maf"
           value and less than or equal to the "--max-maf" value. One of  these  options  may  be
           used  without  the other. Allele frequency is defined as the number of times an allele
           appears over all individuals at that site, divided by the total number of  non-missing
           alleles at that site.
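
           The allele frequency definition above can be worked through by hand on a toy site.
           The sketch below is an illustration of the definition only, not of how vcftools
           computes it internally; the five genotypes and the file names are invented.

```shell
# Genotypes of five individuals at one site; "./." is a missing genotype.
printf '0/0\n0/0\n0/1\n1/1\n./.\n' > genotypes.txt

# Count each allele over all individuals and divide by the number of
# non-missing alleles at the site.
awk -F'[/|]' '
    {
        for (i = 1; i <= NF; i++) {
            if ($i == ".") continue      # skip missing alleles
            count[$i]++
            total++
        }
    }
    END {
        for (a in count)
            printf "allele %s: count=%d freq=%.3f\n", a, count[a], count[a] / total
    }
' genotypes.txt | sort > allele_freq.txt
cat allele_freq.txt
```

           Here allele 0 appears 5 times and allele 1 appears 3 times among 8 non-missing
           alleles, so the minor allele frequency of this site is 0.375.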

         --non-ref-af <float>
         --max-non-ref-af <float>
         --non-ref-ac <integer>
         --max-non-ref-ac <integer>

         --non-ref-af-any <float>
         --max-non-ref-af-any <float>
         --non-ref-ac-any <integer>
         --max-non-ref-ac-any <integer>
           Include only sites with all Non-Reference (ALT) Allele Frequencies (af) or Counts (ac)
           within the range specified, and including the specified  value.  The  default  options
           require  all alleles to meet the specified criteria, whereas the options appended with
           "any" require only one allele to meet the criteria. The Allele frequency is defined as
           the  number  of  times an allele appears over all individuals at that site, divided by
           the total number of non-missing alleles at that site.

         --mac <integer>
         --max-mac <integer>
           Include only sites with Minor Allele Count greater than or equal to the "--mac"  value
           and  less  than  or  equal  to the "--max-mac" value. One of these options may be used
           without the other. Allele count is simply the number of times that allele appears over
           all individuals at that site.

         --min-alleles <integer>
         --max-alleles <integer>
           Include  only  sites  with  a  number  of alleles greater than or equal to the "--min-
           alleles" value and less than or equal to  the  "--max-alleles"  value.  One  of  these
           options may be used without the other.
           For example, to include only bi-allelic sites, one could use:
             vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

   GENOTYPE VALUE FILTERING
         --min-meanDP <float>
         --max-meanDP <float>
           Includes  only  sites  with  mean depth values (over all included individuals) greater
           than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP"
           value.  One of these options may be used without the other. These options require that
           the "DP" FORMAT tag is included for each site.

         --hwe <float>
           Assesses sites for Hardy-Weinberg Equilibrium using  an  exact  test,  as  defined  by
           Wigginton,  Cutler  and  Abecasis  (2005).  Sites  with  a p-value below the threshold
           defined by this option are taken to be out of HWE, and therefore excluded.

         --max-missing <float>
           Exclude sites on the basis of the proportion of missing data (defined to be between  0
           and  1,  where  0  allows sites that are completely missing and 1 indicates no missing
           data allowed).
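
           Because a value of 1 means "no missing data allowed", the threshold can be read as a
           minimum fraction of non-missing data per site. The sketch below assumes exactly that
           reading, which matches the two endpoints described above but is an assumption for
           values in between. The per-site counts and file names are invented.

```shell
# Toy per-site summary: site ID, non-missing genotypes, total genotypes.
# All values are invented for illustration.
printf 'site_a\t10\t10\nsite_b\t8\t10\nsite_c\t4\t10\n' > site_missingness.txt

# ASSUMPTION: a site passes when (non-missing / total) >= threshold.
threshold=0.8
awk -F'\t' -v t="$threshold" '$2 / $3 >= t + 0 { print $1 }' \
    site_missingness.txt > sites_passing.txt
cat sites_passing.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --max-missing "$threshold" --recode --out low_missingness
fi
```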

         --max-missing-count <integer>
           Exclude sites with more than this number of missing genotypes over all individuals.

         --phased
           Excludes all sites that contain unphased genotypes.

   MISCELLANEOUS FILTERING
         --minQ <float>
           Includes only sites with Quality value above this threshold.

INDIVIDUAL FILTERING OPTIONS

       These options are used to include or exclude certain individuals from any  analysis  being
       performed by the program.
         --indv <string>
         --remove-indv <string>
           Specify an individual to be kept or removed from the analysis. These options can be
           used multiple times to specify multiple individuals. If both options are specified,
           then the "--indv" option is executed before the "--remove-indv" option.

         --keep <filename>
         --remove <filename>
           Provide files containing a list of individuals to either include or exclude in
           subsequent analysis. Each individual ID (as defined in the VCF header line) should be
           included on a separate line. If both options are used, then the "--keep" option is
           executed before the "--remove" option. When multiple files are provided, the
           individuals kept are the union of individuals from all keep files minus the union of
           individuals from all remove files. No header line is expected.
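
           The set arithmetic for multiple keep and remove files can be previewed with standard
           tools before running vcftools. This is a sketch only; the individual IDs and file
           names are invented.

```shell
# Two keep lists and one remove list, one individual ID per line, no header.
printf 'NA001\nNA002\nNA003\n' > keep_a.txt
printf 'NA003\nNA004\n' > keep_b.txt
printf 'NA002\n' > remove_a.txt

# Union of all keep files minus the union of all remove files.
sort -u keep_a.txt keep_b.txt > keep_union.txt
sort -u remove_a.txt > remove_union.txt
grep -vxFf remove_union.txt keep_union.txt > retained.txt
cat retained.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --keep keep_a.txt --keep keep_b.txt \
        --remove remove_a.txt --recode --out subset
fi
```

           For these lists the retained individuals are NA001, NA003 and NA004.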

         --max-indv <integer>
           Randomly thins individuals so that only the specified number are retained.

GENOTYPE FILTERING OPTIONS

       These options are used to exclude genotypes from  any  analysis  being  performed  by  the
       program. If excluded, these values will be treated as missing.
         --remove-filtered-geno-all
           Excludes all genotypes with a FILTER flag not equal to "." (a missing value) or PASS.

         --remove-filtered-geno <string>
           Excludes genotypes with a specific FILTER flag.

         --minGQ <float>
           Exclude  all  genotypes  with  a  quality  below  the threshold specified. This option
           requires that the "GQ" FORMAT tag is specified for all sites.

         --minDP <float>
         --maxDP <float>
           Includes only genotypes with a depth greater than or equal to the "--minDP" value and
           less than or equal to the "--maxDP" value. These options require that the "DP" FORMAT
           tag is specified for all sites.

OUTPUT OPTIONS

       These options specify which analyses or conversions to perform on  the  data  that  passed
       through all specified filters.

   OUTPUT ALLELE STATISTICS
         --freq
         --freq2
           Outputs  the  allele  frequency  for  each  site in a file with the suffix ".frq". The
           second option is used to suppress output of any information about the alleles.

         --counts
         --counts2
           Outputs the raw allele counts for each site in a file with  the  suffix  ".frq.count".
           The second option is used to suppress output of any information about the alleles.

         --derived
           For  use with the previous four frequency and count options only. Re-orders the output
           file columns so that the ancestral allele appears first. This  option  relies  on  the
           ancestral allele being specified in the VCF file using the AA tag in the INFO field.

   OUTPUT DEPTH STATISTICS
         --depth
           Generates  a  file  containing the mean depth per individual. This file has the suffix
           ".idepth".

         --site-depth
           Generates a file containing the depth per site summed  across  all  individuals.  This
           output file has the suffix ".ldepth".

         --site-mean-depth
           Generates  a  file containing the mean depth per site averaged across all individuals.
           This output file has the suffix ".ldepth.mean".

         --geno-depth
           Generates a (possibly very large) file containing the depth for each genotype  in  the
           VCF file. Missing entries are given the value -1. The file has the suffix ".gdepth".

   OUTPUT LD STATISTICS
         --hap-r2
           Outputs  a  file reporting the r2, D, and D' statistics using phased haplotypes. These
           are the  traditional  measures  of  LD  often  reported  in  the  population  genetics
           literature. The output file has the suffix ".hap.ld". This option assumes that the VCF
           input file has phased haplotypes.

         --geno-r2
           Calculates the squared correlation coefficient between genotypes encoded as 0, 1 and 2
           to  represent the number of non-reference alleles in each individual. This is the same
           as the LD measure reported by PLINK. The D and D' statistics are  only  available  for
           phased genotypes. The output file has the suffix ".geno.ld".

         --geno-chisq
           If  your  data contains sites with more than two alleles, then this option can be used
           to test for genotype independence via the chi-squared statistic. The output  file  has
           the suffix ".geno.chisq".

         --hap-r2-positions <positions list file>
         --geno-r2-positions <positions list file>
           Outputs a file reporting the r2 statistics of the sites contained in the provided file
           versus all other sites. The output files have the suffix ".list.hap.ld" or
           ".list.geno.ld", depending on which option is used.

         --ld-window <integer>
           This  optional  parameter  defines  the  maximum number of SNPs between the SNPs being
           tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

         --ld-window-bp <integer>
           This optional parameter defines the maximum number of physical bases between the  SNPs
           being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

         --ld-window-min <integer>
           This  optional  parameter  defines  the  minimum number of SNPs between the SNPs being
           tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

         --ld-window-bp-min <integer>
           This optional parameter defines the minimum number of physical bases between the  SNPs
           being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

         --min-r2 <float>
           This  optional  parameter sets a minimum value for r2, below which the LD statistic is
           not reported by the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

         --interchrom-hap-r2
         --interchrom-geno-r2
           Outputs a file reporting the r2 statistics for sites  on  different  chromosomes.  The
           output  files have the suffix ".interchrom.hap.ld" or ".interchrom.geno.ld", depending
           on the option used.

   OUTPUT TRANSITION/TRANSVERSION STATISTICS
         --TsTv <integer>
           Calculates the Transition / Transversion ratio in bins of size defined by this option.
           Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv".

         --TsTv-summary
           Calculates  a simple summary of all Transitions and Transversions. The output file has
           the suffix ".TsTv.summary".

         --TsTv-by-count
           Calculates the Transition / Transversion ratio as a  function  of  alternative  allele
           count.   Only  uses  bi-allelic  SNPs.  The  resulting  output  file  has  the  suffix
           ".TsTv.count".

         --TsTv-by-qual
           Calculates the Transition / Transversion ratio as a function of SNP quality threshold.
           Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv.qual".

         --FILTER-summary
           Generates  a  summary  of the number of SNPs and Ts/Tv ratio for each FILTER category.
           The output file has the suffix ".FILTER.summary".

   OUTPUT NUCLEOTIDE DIVERGENCE STATISTICS
         --site-pi
           Measures nucleotide diversity on a per-site basis. The output file has the suffix
           ".sites.pi".

         --window-pi <integer>
         --window-pi-step <integer>
           Measures  the  nucleotide diversity in windows, with the number provided as the window
           size. The output file has  the  suffix  ".windowed.pi".  The  latter  is  an  optional
           argument used to specify the step size in between windows.

   OUTPUT FST STATISTICS
         --weir-fst-pop <filename>
           This option is used to calculate an Fst estimate from Weir and Cockerham's 1984 paper.
           This is the preferred calculation of Fst. The provided file must  contain  a  list  of
           individuals  (one  individual  per  line)  from  the  VCF  file that correspond to one
           population. This option can be used multiple times to calculate Fst for more than  two
           populations.  These  files  will  also  be  included  as "--keep" options. By default,
           calculations are done on a per-site basis. The output file has the suffix ".weir.fst".

         --fst-window-size <integer>
         --fst-window-step <integer>
           These options can be used with "--weir-fst-pop"  to  do  the  Fst  calculations  on  a
           windowed basis instead of a per-site basis. These arguments specify the desired window
           size and the desired step size between windows.
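
           A minimal sketch of a two-population Fst run, first per site and then in windows. The
           individual IDs, file names, window size and step size are invented for illustration.

```shell
# One file per population, one individual ID per line.
printf 'NA001\nNA002\n' > population_1.txt
printf 'NA003\nNA004\n' > population_2.txt

# Report how many individuals are listed in each population file.
wc -l population_1.txt population_2.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    # Per-site Fst estimates.
    vcftools --vcf input_file.vcf \
        --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt \
        --out pop1_vs_pop2
    # Windowed Fst estimates.
    vcftools --vcf input_file.vcf \
        --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt \
        --fst-window-size 100000 --fst-window-step 50000 \
        --out pop1_vs_pop2_windowed
fi
```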

   OUTPUT OTHER STATISTICS
         --het
           Calculates a measure of heterozygosity on a per-individual basis. Specifically, the
           inbreeding coefficient, F, is estimated for each individual using a method of moments.
           The resulting file has the suffix ".het".

         --hardy
           Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as defined  by
           Wigginton,  Cutler  and Abecasis (2005)). The resulting file (with suffix ".hwe") also
           contains the Observed numbers of Homozygotes and Heterozygotes and  the  corresponding
           Expected numbers under HWE.

         --TajimaD <integer>
           Outputs  Tajima's  D  statistic  in bins with size of the specified number. The output
           file has the suffix ".Tajima.D".

         --indv-freq-burden
           This option calculates the number of variants within each  individual  of  a  specific
           frequency. The resulting file has the suffix ".ifreqburden".

         --LROH
           This  option  will  identify and output Long Runs of Homozygosity. The output file has
           the suffix ".LROH". This function is experimental, and will use a  lot  of  memory  if
           applied to large datasets.

         --relatedness
           This option is used to calculate and output a relatedness statistic based on the
           method of Yang et al, Nature Genetics 2010 (doi:10.1038/ng.608). Specifically, it
           calculates the unadjusted Ajk statistic. The expectation of Ajk is zero for
           individuals within a population, and one for an individual with themselves. The
           output file has the suffix ".relatedness".

         --relatedness2
           This  option  is  used  to  calculate  and output a relatedness statistic based on the
           method of Manichaikul et al., BIOINFORMATICS 2010 (doi:10.1093/bioinformatics/btq559).
           The output file has the suffix ".relatedness2".

         --site-quality
           Generates  a  file containing the per-site SNP quality, as found in the QUAL column of
           the VCF file. This file has the suffix ".lqual".

         --missing-indv
           Generates a file reporting the missingness on a per-individual basis. The file has the
           suffix ".imiss".

         --missing-site
           Generates  a  file  reporting  the  missingness  on a per-site basis. The file has the
           suffix ".lmiss".

         --SNPdensity <integer>
           Calculates the number and density of SNPs in bins of size defined by this option.  The
           resulting output file has the suffix ".snpden".

         --kept-sites
           Creates a file listing all sites that have been kept after filtering. The file has the
           suffix ".kept.sites".

         --removed-sites
           Creates a file listing all sites that have been removed after filtering. The file  has
           the suffix ".removed.sites".

         --singletons
           This option will generate a file detailing the location of singletons, and the
           individual they occur in. The file reports both true singletons, and private
           doubletons (i.e. SNPs where the minor allele only occurs in a single individual and
           that individual is homozygous for that allele). The output file has the suffix
           ".singletons".

         --hist-indel-len
           This  option  will  generate  a  histogram file of the length of all indels (including
           SNPs). It shows both the count and the percentage of all indels for indel lengths that
           occur  at  least  once in the input file. SNPs are considered indels with length zero.
           The output file has the suffix ".indel.hist".

         --hapcount <BED file>
           This option will output the number of unique haplotypes within user specified bins, as
           defined by the BED file. The output file has the suffix ".hapcount".

         --mendel <PED file>
           This option is used to report Mendel errors identified in trios. The command requires
           a PLINK-style PED file, with the first four columns specifying a family ID, the child
           ID, the father ID, and the mother ID. The output of this command has the suffix
           ".mendel".

         --extract-FORMAT-info <string>
           Extract information from the genotype fields in the VCF file relating to a specified
           FORMAT identifier. The resulting output file has the suffix ".<FORMAT_ID>.FORMAT". For
           example, the following command would extract all of the GT (i.e. Genotype) entries:
             vcftools --vcf file1.vcf --extract-FORMAT-info GT

         --get-INFO <string>
           This  option  is  used to extract information from the INFO field in the VCF file. The
           <string> argument specifies the INFO tag to be extracted, and the option can  be  used
           multiple  times  in  order  to extract multiple INFO entries. The resulting file, with
           suffix ".INFO", contains the required INFO information in a tab-separated  table.  For
           example, to extract the NS and DB flags, one would use the command:
             vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   OUTPUT VCF FORMAT
         --recode
         --recode-bcf
           These  options are used to generate a new file in either VCF or BCF from the input VCF
           or BCF file after applying the filtering options specified by  the  user.  The  output
           file  has  the  suffix ".recode.vcf" or ".recode.bcf". By default, the INFO fields are
           removed from the output file, as the INFO values may be invalidated  by  the  recoding
           (e.g.  the  total  depth may need to be recalculated if individuals are removed). This
           behavior may be overridden by the following options. By default, BCF files are written
           out as BGZF compressed files.

         --recode-INFO <string>
         --recode-INFO-all
           These  options can be used with the above recode options to define an INFO key name to
           keep in the output file. This option can be used multiple times to keep  more  of  the
           INFO fields. The second option is used to keep all INFO values in the original file.

         --contigs <string>
           This  option can be used in conjunction with the --recode-bcf when the input file does
           not have any contig declarations. This option expects a  file  name  with  one  contig
           header per line. These lines are included in the output file.
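
           A sketch of a contigs file with one contig header per line. The contig IDs and
           lengths are invented, as are the file names and the output prefix; the header syntax
           follows the usual VCF "##contig" meta-information line.

```shell
# One contig header per line. IDs and lengths are invented for illustration.
printf '##contig=<ID=1,length=1000000>\n##contig=<ID=2,length=800000>\n' > contigs.txt
cat contigs.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --recode-bcf --contigs contigs.txt --out with_contigs
fi
```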

   OUTPUT OTHER FORMATS
         --012
           This option outputs the genotypes as a large matrix. Three files are produced. The
           first, with suffix ".012", contains the genotypes of each individual on a separate
           line. Genotypes are represented as 0, 1 and 2, where the number represents the number
           of non-reference alleles. Missing genotypes are represented by -1. The second file,
           with suffix ".012.indv", details the individuals included in the main file. The third
           file, with suffix ".012.pos", details the site locations included in the main file.

         --IMPUTE
           This option outputs phased haplotypes in  IMPUTE  reference-panel  format.  As  IMPUTE
           requires  phased  data,  using this option also implies --phased. Unphased individuals
           and genotypes are therefore excluded.  Only  bi-allelic  sites  are  included  in  the
           output.  Using  this  option  generates three files. The IMPUTE haplotype file has the
           suffix ".impute.hap", and the IMPUTE legend file has the suffix  ".impute.hap.legend".
           The  third  file,  with suffix ".impute.hap.indv", details the individuals included in
           the haplotype file, although this file is not needed by IMPUTE.

         --ldhat
         --ldhelmet
         --ldhat-geno
           These options output data in LDhat/LDhelmet format. They require the "--chr" filter
           option to also be used. The first two options output phased data only, and therefore
           also imply "--phased", so unphased individuals and genotypes are excluded. For
           LDhelmet, only SNPs are considered, so it also implies "--remove-indels". The third
           option treats all of the data as unphased, and therefore outputs LDhat files in
           genotype/unphased format. For LDhat, two output files are generated with the suffixes
           ".ldhat.sites" and ".ldhat.locs", which correspond to the LDhat "sites" and "locs"
           input files respectively; for LDhelmet, the two files generated have the suffixes
           ".ldhelmet.snps" and ".ldhelmet.pos", which correspond to the "SNPs" and "positions"
           files.

         --BEAGLE-GL
         --BEAGLE-PL
           These  options  output  genotype  likelihood  information  for  input  into the BEAGLE
           program. The VCF file is required to contain FORMAT fields with  "GL"  or  "PL"  tags,
           which  can  generally  be  output  by SNP callers such as the GATK. Use of this option
           requires a chromosome to be specified via the "--chr"  option.  The  resulting  output
           file has the suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype likelihoods for
           biallelic sites. This file is suitable for input into BEAGLE via the "like=" argument.

         --plink
         --plink-tped
         --chrom-map
           These options output the genotype data in PLINK PED format. With the first option, two
           files  are  generated, with suffixes ".ped" and ".map". Note that only bi-allelic loci
           will  be  output.  Further  details  of  these  files  can  be  found  in  the   PLINK
           documentation.
           Note:  The  first option can be very slow on large datasets. Using the --chr option to
           divide up the dataset is advised, or alternatively use the --plink-tped  option  which
           outputs the files in the PLINK transposed format with suffixes ".tped" and ".tfam".
           For  usage with variant sites in species other than humans, the --chrom-map option may
           be used to specify a file name that has a tab-delimited mapping of chromosome name  to
           a desired integer value with one line per chromosome. This file must contain a mapping
           for every chromosome value found in the file.
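
           A sketch of a chromosome map for a non-human dataset, with a check that every
           chromosome in the input has a mapping. The chromosome names, integer codes and file
           names are invented for illustration.

```shell
# Tab-delimited mapping of chromosome name to integer, one line per chromosome.
printf 'chrI\t1\nchrII\t2\nchrIII\t3\n' > chrom_map.txt
cat chrom_map.txt

# Run vcftools only if it is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    # List any chromosome found in the VCF that has no entry in the map.
    grep -v '^#' input_file.vcf | cut -f 1 | sort -u > vcf_chroms.txt
    cut -f 1 chrom_map.txt | sort -u > mapped_chroms.txt
    comm -23 vcf_chroms.txt mapped_chroms.txt

    vcftools --vcf input_file.vcf --plink --chrom-map chrom_map.txt --out plink_out
fi
```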

COMPARISON OPTIONS

       These options are used to compare the original variant file to another  variant  file  and
       output  the  results.  All  of  the  diff functions require both files to contain the same
       chromosomes and that the files be sorted in the same order. If one of the  files  contains
       chromosomes that the other file does not, use the --not-chr filter to remove them from the
       analysis.

   DIFF VCF FILE
         --diff <filename>
         --gzdiff <filename>
         --diff-bcf <filename>
           These options compare the original input file to this specified VCF, gzipped  VCF,  or
           BCF  file.  These options must be specified with one additional option described below
           in order to specify what type of comparison is  to  be  performed.  See  the  examples
           section for typical usage.

   DIFF OPTIONS
         --diff-site
           Outputs  the  sites  that  are  common  / unique to each file. The output file has the
           suffix ".diff.sites_in_files".

         --diff-indv
           Outputs the individuals that are common / unique to each file. The output file has the
           suffix ".diff.indv_in_files".

         --diff-site-discordance
           This  option calculates discordance on a site by site basis. The resulting output file
           has the suffix ".diff.sites".

         --diff-indv-discordance
           This option calculates discordance on a per-individual  basis.  The  resulting  output
           file has the suffix ".diff.indv".

         --diff-indv-map <filename>
           This  option allows the user to specify a mapping of individual IDs in the second file
           to those in the first file. The program expects the file to  contain  a  tab-delimited
           line  containing  an  individual's name in file one followed by that same individual's
           name in file two with one mapping per line.
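
           A sketch of an individual mapping file used together with a per-individual
           discordance comparison. The individual names and the mapping file name are invented;
           the input file names and output prefix follow the style of the EXAMPLES section.

```shell
# One mapping per line: name in file one, a tab, then the name in file two.
printf 'sampleA\tsample_A_v2\nsampleB\tsample_B_v2\n' > indv_map.txt
cat indv_map.txt

# Run vcftools only if it is installed and both (hypothetical) inputs exist.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file1.vcf ] && [ -f input_file2.vcf ]; then
    vcftools --vcf input_file1.vcf --diff input_file2.vcf \
        --diff-indv-map indv_map.txt --diff-indv-discordance --out in1_v_in2
fi
```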

         --diff-discordance-matrix
           This option calculates a discordance matrix. This option only  works  with  bi-allelic
           loci  with  matching alleles that are present in both files. The resulting output file
           has the suffix ".diff.discordance.matrix".

         --diff-switch-error
           This option calculates phasing errors  (specifically  "switch  errors").  This  option
           creates  an  output  file  describing  switch  errors found between sites, with suffix
           ".diff.switch".

AUTHORS

       Adam Auton (adam.auton@einstein.yu.edu)
       Anthony Marcketta (anthony.marcketta@einstein.yu.edu)