Ubuntu Manpage: fa2htgs - formatter for high throughput genome sequencing project submissions

Provided by: ncbi-tools-bin_6.1.20120620-7_amd64

NAME

       fa2htgs - formatter for high throughput genome sequencing project submissions

SYNOPSIS

       fa2htgs  [-]  [-6 str]  [-7 str]  [-A filename]  [-C str] [-D] [-L filename] [-M str] [-N]
       [-O filename] [-P str] [-Q filename] [-S str] [-T filename] [-X] [-a str] [-b N]  [-c str]
       [-d str]  [-e filename]  [-f]  -g str [-h str] [-i filename] [-k str] [-l N] [-m] [-n str]
       [-o filename] [-p N] [-q] [-r str] -s str [-t filename] [-u] [-v] [-w] [-x str]

DESCRIPTION

       fa2htgs generates Seq-submits (ASN.1 sequence submission files) for high throughput
       genome sequencing projects.

       fa2htgs reads a FASTA file (or an Ace Contig file with Phrap sequence quality values),
       a Sequin submission template file (to get contact and citation information for the
       submission), and a series of command line arguments (see below). The program then
       combines this information to make a submission suitable for GenBank. Once you have
       generated your submission file, you need to follow the submission protocol (see the README
       present on your FTP account or mailed out to your Center).
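
       Because the inputs are a sequence file, a template, and a set of flags, a script
       can assemble the command line before running the tool. The following is a minimal
       Python sketch, not part of fa2htgs itself. It uses only flags listed under OPTIONS;
       the helper name build_fa2htgs_argv, the center tag, the sequence name, and the file
       names are hypothetical.

```python
from typing import List, Optional


def build_fa2htgs_argv(center: str, seq_name: str, fasta: str,
                       template: str = "template.sub",
                       output: Optional[str] = None,
                       phase: int = 1,
                       length: Optional[int] = None) -> List[str]:
    """Assemble an fa2htgs argument list (hypothetical helper)."""
    if phase not in (1, 2, 3):
        raise ValueError("HTGS phase must be 1, 2, or 3")
    # -g and -s are the two required arguments in the SYNOPSIS.
    argv = ["fa2htgs", "-g", center, "-s", seq_name,
            "-i", fasta, "-t", template, "-p", str(phase)]
    if length is not None:
        argv += ["-l", str(length)]
    if output is not None:
        argv += ["-o", output]
    return argv


# Hypothetical values, for illustration only.
print(" ".join(build_fa2htgs_argv("mycenter", "clone42", "clone42.fa",
                                  output="clone42.sqn", phase=3)))
```

       The resulting list can be passed to subprocess.run() on a system where fa2htgs is
       installed.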

       fa2htgs is intended for scripted automation of bulk submissions of unannotated
       genome sequence. It can easily be extended from its current simple form to allow more
       complicated processing. A submission prepared with fa2htgs can also be read into
       Psequin(1) and then annotated more extensively.

       Questions and concerns about this processing protocol, or about how to use this tool,
       should be forwarded to <htgs@ncbi.nlm.nih.gov>.

OPTIONS

A summary of options is included below.

- Print usage message

-6 str SP6 clone (e.g., Contig1,left)

-7 str T7 clone (e.g., Contig2,right)

-A filename
Filename for accession list input (mutually exclusive with -T and -i). The input
file contains a tab-delimited table with three to five columns, which are accession
number, start position, stop position, and (optionally) length and strand. If
start > stop, the minus strand on the referenced accession is used. A gap is
indicated by the word "gap" instead of an accession, 0 for the start and stop
positions, and a number for the length.
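
The table layout described for -A can be produced by a short script. This is a sketch
in Python under stated assumptions: the accession numbers are hypothetical, the helper
names are invented for illustration, and the optional strand column is omitted because
its accepted values are not documented here.

```python
from typing import Optional


def accession_row(accession: str, start: int, stop: int,
                  length: Optional[int] = None) -> str:
    """One tab-delimited row: accession, start, stop, optional length."""
    fields = [accession, str(start), str(stop)]
    if length is not None:
        fields.append(str(length))
    return "\t".join(fields)


def gap_row(length: int) -> str:
    """A gap: the word "gap", 0 for start and stop, then the gap length."""
    return "\t".join(["gap", "0", "0", str(length)])


rows = [
    accession_row("AC000001", 1, 5000),   # plus strand (start < stop)
    gap_row(100),
    accession_row("AC000002", 3000, 1),   # start > stop: minus strand
]
print("\n".join(rows))
```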

-C str Clone library name (will appear as /clone-lib="str" on the source feature)

-D HTGS_DRAFT sequence

-L filename
Read phrap contig order from filename. This is a tab-delimited file that can be
used to drive the order of contigs (normally specified by -P), as well as
indicating the SP6 and T7 ends. It can also be used when contigs are known to be
in opposite orientation. For example:

Contig2 + 1 SP6 left
Contig3 + 1
Contig1 - T7 right

The first column is the contig name, the second is the orientation, the third is
the fragment_group, the fourth indicates the SP6 or T7 end, and the fifth says
which side of SP6 or T7 end had vector removed.
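
A contig order file in this five-column layout could be generated as follows. This
Python sketch is illustrative only: the helper name is hypothetical, the "+"/"-" and
SP6/T7 values are taken from the example above, and dropping trailing empty columns is
an assumption based on the shorter rows in that example.

```python
def contig_order_line(name: str, orientation: str, fragment_group: str = "",
                      end: str = "", side: str = "") -> str:
    """Build one tab-delimited row for the -L file (hypothetical helper)."""
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    if end not in ("", "SP6", "T7"):
        raise ValueError("end must be 'SP6', 'T7', or empty")
    fields = [name, orientation, str(fragment_group), end, side]
    # Assumption: trailing empty columns may simply be left off.
    while fields and fields[-1] == "":
        fields.pop()
    return "\t".join(fields)


print(contig_order_line("Contig2", "+", "1", "SP6", "left"))
print(contig_order_line("Contig3", "+", "1"))
```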

-M str Map name (will appear as /map="str" on the source feature)

-N Annotate assembly_fragments

-O filename
Read comment from filename (100-character-per-line maximum; ~ is a linebreak and `~
is a literal ~). You can check the format with Psequin(1).
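
The tilde conventions for the comment file can be applied mechanically. The Python
sketch below uses hypothetical helper names and assumes that the 100-character limit
applies to physical lines of the file; how long logical lines should be wrapped is not
specified here, so the sketch only reports lines that exceed the limit.

```python
from typing import List


def escape_tildes(text: str) -> str:
    """Write each literal ~ as `~ so it is not read as a linebreak."""
    return text.replace("~", "`~")


def join_comment_lines(lines: List[str]) -> str:
    """Escape each logical line, then join them with ~ as the linebreak."""
    return "~".join(escape_tildes(line) for line in lines)


def overlong_lines(file_text: str, limit: int = 100) -> List[int]:
    """Return 1-based numbers of physical lines longer than the limit."""
    return [number
            for number, line in enumerate(file_text.splitlines(), 1)
            if len(line) > limit]


comment = join_comment_lines(["Insert size is ~150 kb.", "Draft assembly."])
print(comment)
print(overlong_lines(comment))
```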

-P str Contigs to use, separated by commas. If -P is not given with the -T option,
the fragments are used in the order in which they appear in the ace file (which is
appropriate for a phase 1 record, but not for phase 2 or 3). To set the order of
the segments of the ace file, use the -P flag, like this:
-P "Contig1,Contig4,Contig3,Contig2,Contig5"
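
A script can check a requested contig order against the ace file before building the
-P value. This Python sketch assumes the usual phrap/consed ACE layout, in which each
contig begins with a line of the form "CO <name> ..."; both helper names are
hypothetical.

```python
from typing import List


def ace_contig_names(ace_text: str) -> List[str]:
    """List contig names in file order (assumes 'CO <name> ...' lines)."""
    names = []
    for line in ace_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "CO":
            names.append(parts[1])
    return names


def contigs_argument(order: List[str], available: List[str]) -> str:
    """Join contig names with commas for -P, rejecting unknown names."""
    missing = [name for name in order if name not in available]
    if missing:
        raise ValueError("not in ace file: " + ", ".join(missing))
    return ",".join(order)


# Hypothetical ace fragment with two contig header lines.
ace = "CO Contig1 120 3 1 U\nACGT\n\nCO Contig2 80 2 1 U\nGGCC\n"
print(contigs_argument(["Contig2", "Contig1"], ace_contig_names(ace)))
```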

-Q filename
Read quality scores from filename

-S str Strain name

-T filename
Filename for phrap input (mutually exclusive with -A and -i)

-X The coordinates in the input file are on the resulting segmented sequence. (Bases
1 through n of each accession are used.) Otherwise, the coordinates are on the
individual accessions, which need not start at base 1 of the record.

-a str GenBank accession; use if and only if updating a sequence.

-b N Gap length (default = 100; anything from 0 to 1000000000 is legal)

-c str Clone name (will appear as /clone in the source feature; can be the same as -s)

-d str Title for sequence (will appear in GenBank DEFINITION line)

-e filename
Log errors to filename

-f htgs_fulltop keyword

-g str Genome Center tag (probably the same as your login name on the NCBI FTP server)

-h str Chromosome (will appear as /chromosome in the source feature)

-i filename
Filename for fasta input (default is stdin; mutually exclusive with -A and -T)

-k str Add the supplied string as a keyword.

-l N Length of sequence in bp (default = 0). The length is checked against the actual
number of bases received. For phase 1 and 2 sequence it is also used to estimate gap
lengths. For phase 1 and 2 records, it is important to use a number GREATER than
the amount of provided nucleotide; otherwise false 'gaps' will be generated. It is
assumed here that the putative full length of the BAC or cosmid will be used. There
should be at least 20 to 30 'n' between the segments (you can check for these in
Sequin), as this ensures proper behavior when this sequence is used with BLAST.
Otherwise 'artifactual' unrelated segment neighbors may be brought into proximity
of each other.
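
For phase 1 and 2 records, the requirement that the -l value exceed the amount of
provided nucleotide can be checked before submission. The Python sketch below is an
illustration with hypothetical helper names; it assumes every character on a
non-header FASTA line counts as a provided base.

```python
def count_bases(fasta_text: str) -> int:
    """Count sequence characters, skipping '>' header lines and blanks."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def declared_length_exceeds_bases(declared: int, fasta_text: str) -> bool:
    """True when the -l value is greater than the bases provided."""
    return declared > count_bases(fasta_text)


# Hypothetical two-segment input.
fasta = ">seg1\nACGT\nAC\n>seg2\nGGG\n"
print(count_bases(fasta))
print(declared_length_exceeds_bases(150000, fasta))
```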

-m Take comment from template

-n str Organism name (default = Homo sapiens)

-o filename
Filename for asn.1 output (default = stdout)

-p N HTGS phase:
    1  A collection of unordered contigs with gaps of unknown length. A Phase 1
       record must at the very least have two segments with one gap. (default)
    2  A series of ordered contigs, possibly with known gap lengths. This could be
       a single sequence without gaps, if the sequence has ambiguities to resolve.
    3  A single contiguous sequence. This sequence is finished, but not
       necessarily annotated.

-q htgs_cancelled keyword

-r str Remark for update (brief comment describing the nature of the update, such as "new
sequence", "new citation", or "updated features")

-s str Sequence name. The sequence must have a name that is unique within the genome
center. We use the combination of the genome center name (-g argument) and the
sequence name (-s) to track this sequence and to talk to you about it. The name
can have any form you like but must be unique within your center.

-t filename
Filename for Seq-submit template (default = template.sub)

-u Take biosource from template

-v htgs_activefin keyword

-w Whole Genome Shotgun flag

-x str Secondary accession numbers, separated by commas, e.g., U10000,L11000.

In some cases a large segment will supersede one or more other accession numbers
(records). Records that are no longer wanted in GenBank should be made secondary.
Using the -x argument, you can list the accession numbers you want to make
secondary. This instructs us to remove the accession number(s) from GenBank; they
will no longer be part of the GenBank release. They will nonetheless be available
from Entrez.

GREAT CARE should be taken when using this argument!!! Improper use of accession
numbers here will result in the inappropriate withdrawal of GenBank records from
GenBank, EMBL and DDBJ. We provide this parameter as a convenience to submitting
centers, but this may need to be removed if it is not used carefully.

AUTHOR

       The National Center for Biotechnology Information.
