Provided by: samtools_1.22.1-1_amd64

NAME

       samtools-checksum - produces checksums of SAM / BAM / CRAM content

SYNOPSIS

       samtools checksum [options] in.sam|in.bam|in.cram|in.fastq [ ... ]
       samtools checksum -m [options] in.checksum [ ... ]

DESCRIPTION

       With no options, this produces an order agnostic checksum of the sequence, quality, read-name and barcode
       related aux data in a SAM, BAM, CRAM or FASTQ file.  A CRC32 checksum is computed per record and these are
       combined by multiplication in a prime field of size (1<<31)-1, that is 2^31-1.

       The purpose of this mode is to validate that no data has been lost in data processing through the various
       steps  of  alignment,  sorting  and  processing.   Only primary alignments are recorded and the checksums
       computed are order agnostic so the same checksums are produced in name collated or position sorted output
       files.
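
       The order agnostic combination can be illustrated with a short Python sketch.  It is not the samtools
       implementation: the modulus 2^31-1, the helper names and the handling of values that reduce to zero are
       assumptions made for illustration only.

```python
import zlib

PRIME = (1 << 31) - 1  # assumed modulus: the Mersenne prime 2^31-1


def record_crc(*fields: bytes) -> int:
    """CRC32 over the concatenated fields of one record."""
    crc = 0
    for field in fields:
        crc = zlib.crc32(field, crc)
    return crc


def combine(crcs) -> int:
    """Multiply per-record CRCs together modulo PRIME."""
    product = 1
    for crc in crcs:
        value = crc % PRIME
        if value == 0:
            value = 1  # assumption: keep a zero residue from wiping out the product
        product = (product * value) % PRIME
    return product


# Hypothetical records: (masked flag byte, sequence)
records = [(b"\x41", b"ACGT"), (b"\x81", b"TTGA"), (b"\x41", b"GGCC")]
crcs = [record_crc(flag, seq) for flag, seq in records]
forward = combine(crcs)
shuffled = combine(reversed(crcs))
print(f"{forward:08x}", forward == shuffled)
```

       Reversing the record order leaves the product unchanged, which is the property that lets name collated
       and position sorted files yield the same checksums.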

       One set of checksums is produced per read-group, plus a combined set for the whole file and a set for
       records that have no read-group assigned.  This allows for validation of merging multiple runs and splitting pools by
       their read-group.  The checksums are also reported for QC-pass only and QC-fail only  (indicated  by  the
       QCFAIL BAM flag), so checksums of data identified and removed as contamination can also be tracked.

       All  of  the above are compatible with Biobambam2's bamseqchksum tool, which was the inspiration for this
       samtools command.  The -B option further enhances compatibility by using the same output format, although
       it limits the functionality to the order agnostic checksums and validates fewer types of data.

       The -m or --merge option can be used to merge previously generated checksums.  The  input  filenames  are
       checksum outputs from this tool (via shell redirection or the -o option).  The intended use of this is to
       validate that no data is lost or corrupted when merging read-group specific files, by algorithmically
       computing the expected checksum output.

       Additionally checksum can track other columns including BAM flags, mapping information (MAPQ and  CIGAR),
       pair information (RNEXT, PNEXT and TLEN), as well as a wider list of tags.

       With the -O option the checksums become record order specific.  Combined together with the -a option this
       can  be  used  to  validate  SAM,  BAM and CRAM format conversions.  The CRCs per record are XORed with a
       record counter for the Nth record per read group.  See the detailed description below for  single  -O  vs
       double and the implications on reordering between read-groups.

       When  performing such validation, it is also useful to enable data sanitisation first, as CRAM can fix up
       certain types of inconsistencies including common issues such as MAPQ and  CIGAR  strings  for  unaligned
       data.

OUTPUT

       The output format consists of a machine readable table of checksums and human readable text starting with
       a "#" character.

       For  compatibility  with  bamseqchksum  the data is CRCed in specific orders before combining together to
       form a checksum column.  The last column reported is then the combination of all checksums in  that  row,
       permitting easy comparison by looking at a single value.

       The columns reported are as follows.

           Group     The read group name.  There is always an "all" group which represents all records.  This is
                     followed by one checksum set per read-group found in the file.

           QC        This  is  either "all" or "pass".  "Pass" refers to records that do not have the QCFAIL BAM
                     flag specified.

           count     The number of records checksummed for this row.

           flag+seq  The checksum of SAM FLAG + SEQ fields

           +name     The checksum of SAM QNAME + FLAG + SEQ fields

           +qual     The checksum of SAM FLAG + SEQ + QUAL fields

           +aux      The checksum of SAM FLAG + SEQ + selected auxiliary fields

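           +chr/pos  The checksum of SAM FLAG + SEQ + RNAME (chromosome) + POSition fields; shown with -P

           +cigar    The checksum of SAM FLAG + SEQ + MAPQ + CIGAR fields; shown with -C

           +mate     The checksum of SAM FLAG + SEQ + RNEXT + PNEXT + TLEN (ISIZE) fields; shown with -M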

           combined  The combined checksum of all columns prior to this column.  The first row will be  for  all
                     alignments,  so the combined checksum on the first row may be used as a whole file combined
                     checksum.

       An example output can be seen below.

         # Checksum for file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
         # Aux tags:          BC,FI,QT,RT,TC
         # BAM flags:         PAIRED,READ1,READ2

         # Group    QC        count  flag+seq  +name     +qual     +aux      combined
         all        all    42890086  71169bbb  633fd9f7  2a2e693f  71169bbb  09d03ed4
         SRR010946  all      262249  2957df86  3b6dcbc9  66be71f7  2957df86  58e89c25
         SRR002165  all       97846  47ff17e0  6ff8fc7b  58366bf5  47ff17e0  796eecb0
         [...cut...]
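
       For machine parsing, a report of this shape can be read with a few lines of Python.  This is a minimal
       sketch: parse_report is a hypothetical helper, and it assumes whitespace (or tab) separated columns and
       group names that contain no whitespace.

```python
def parse_report(text: str) -> list[dict[str, str]]:
    """Return one dict per checksum row, keyed by the "# Group" header names."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            if fields[:1] == ["Group"]:
                columns = fields  # the header line names the columns
            continue  # other "#" lines are human readable comments
        if columns is None:
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # skip anything that is not a full row
        rows.append(dict(zip(columns, values)))
    return rows


example = """\
# Checksum for file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
# Aux tags:          BC,FI,QT,RT,TC
# BAM flags:         PAIRED,READ1,READ2

# Group    QC        count  flag+seq  +name     +qual     +aux      combined
all        all    42890086  71169bbb  633fd9f7  2a2e693f  71169bbb  09d03ed4
SRR010946  all      262249  2957df86  3b6dcbc9  66be71f7  2957df86  58e89c25
"""
rows = parse_report(example)
print(rows[0]["combined"])
```

       The "combined" value of the "all" row is then available as a single whole-file checksum.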

OPTIONS

       -@ COUNT  Uses COUNT compute threads in decoding the file.  Typically  this  does  not  gain  much  speed
                 beyond 2 or 3.  The default is to use a single thread.

       -B, --bamseqchksum
                 Produces  a  report compatible with biobambam2's bamseqchksum default output. Note this is only
                 expected to work if no other format options have been enabled.  Specifically the header line is
                 not updated to reflect additional columns if requested.

                 Bamseqchksum has more output modes and many alternative checksums.  We only support the default
                 CRC32 method.

       -F FLAG, --exclude-flags FLAG
                 Specifies which alignment FLAGs to filter out.  This defaults to  secondary  and  supplementary
                 alignments  (0x900) as these can be duplicates of the primary alignment.  This ensures the same
                 number of records are checksummed in unaligned and aligned files.

       -f FLAG, --require-flags FLAG
                 A list of FLAGs that are required.  Defaults to zero.   An  example  use  of  this  may  be  to
                 checksum QCFAIL only.
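
                 The combined effect of -F and -f can be pictured as a simple bit test on each record's FLAG.
                 The following Python sketch assumes the usual samtools meaning of required and excluded
                 flags; the function name keep is hypothetical.

```python
SECONDARY = 0x100
QCFAIL = 0x200
SUPPLEMENTARY = 0x800


def keep(flag: int, require: int = 0, exclude: int = SECONDARY | SUPPLEMENTARY) -> bool:
    """True if every required bit is set and no excluded bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


print(keep(0x41))                  # primary alignment, default filters
print(keep(0x141))                 # secondary alignment, default filters
print(keep(0x241, require=QCFAIL)) # QC-fail record when QCFAIL is required
```

                 With the defaults only secondary and supplementary records are dropped; requiring QCFAIL
                 restricts the checksum to QC-fail records.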

       -b FLAG, --flag-mask FLAG
                 The  BAM FLAG is masked first before checksumming.  The unaligned flags will contain data about
                 the sequencing run - whether it is paired in sequencing and if so  whether  this  is  READ1  or
                 READ2.   These  flags  will  not change post-alignment and so everything except these three are
                 masked out.  FLAG defaults to PAIRED,READ1,READ2 (0xc1).
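
                 As a small illustration of the default mask, the Python sketch below shows an unaligned and
                 an aligned FLAG for the same read reducing to the same value.  The two example FLAG values
                 are hypothetical.

```python
PAIRED, PROPER_PAIR, UNMAP, MUNMAP = 0x1, 0x2, 0x4, 0x8
REVERSE, READ1, READ2 = 0x10, 0x40, 0x80

DEFAULT_MASK = PAIRED | READ1 | READ2  # 0xc1


def masked(flag: int, mask: int = DEFAULT_MASK) -> int:
    """Keep only the bits selected by the mask."""
    return flag & mask


before = PAIRED | UNMAP | MUNMAP | READ1          # as it might look before alignment
after = PAIRED | PROPER_PAIR | REVERSE | READ1    # as it might look after alignment
print(hex(masked(before)), hex(masked(after)))
```

                 Both values reduce to PAIRED,READ1, so the alignment-dependent bits do not reach the
                 checksum.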

       -c, --no-rev-comp
                 By default the sequence and quality strings of reverse strand alignments are reverse
                 complemented (the quality string being reversed) before checksumming, so that aligning the
                 data does not affect the checksums.  This option disables this and checksums the data as-is.
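
                 One reading of the default behaviour is that reads flagged REVERSE are restored to their
                 sequenced orientation before the CRC is taken.  The Python sketch below shows that step
                 under this assumption; the function name is hypothetical and only A, C, G, T and N are
                 complemented.

```python
REVERSE = 0x10
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(flag: int, seq: str, qual: str) -> tuple[str, str]:
    """Undo the reverse complement applied to reverse strand alignments."""
    if flag & REVERSE:
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


print(original_orientation(0x10, "AACG", "IIH#"))
print(original_orientation(0x00, "AACG", "IIH#"))
```

                 Forward strand and unaligned records pass through unchanged.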

       -t STR, --tags STR
                 Specifies  a  comma-separated list of aux tags to checksum.  These are concatenated together in
                 their canonical BAM encoding in the order listed in STR, prior to computing the checksums.

                 If STR begins with "*" then all tags are used.  This can then be followed by a comma  separated
                 list  of  tags  to exclude.  For example "*,MD,NM" is all tags except MD and NM.  In this mode,
                 the tags are combined in alphanumeric order.

                 The default value is "BC,FI,QT,RT,TC".
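
                 The selection rules for STR can be sketched in Python as follows.  This is an illustration
                 rather than the samtools parser: select_tags is a hypothetical name, and plain string
                 sorting is assumed to stand in for "alphanumeric order".

```python
def select_tags(spec: str, present: list[str]) -> list[str]:
    """Return the tags of one record to checksum, in checksumming order."""
    names = [name for name in spec.split(",") if name]
    if names and names[0] == "*":
        excluded = set(names[1:])
        # All tags except the excluded ones, in sorted order
        return sorted(tag for tag in present if tag not in excluded)
    # Explicit list: keep the order given in STR
    return [name for name in names if name in present]


print(select_tags("*,MD,NM", ["NM", "RG", "BC", "MD"]))
print(select_tags("BC,FI,QT,RT,TC", ["RT", "BC", "NM"]))
```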

       -O, --in-order

                 By default the CRCs are combined in a multiplicative field that is order agnostic, as
                 multiplication is commutative and associative.  This option XORs the CRC with a number
                 indicating the Nth record number for this grouping prior to the multiply step, making the final
                 multiplicative checksum dependent on the order of the input data.

                 For the "all" row the count is taken from the Nth record in the read-group associated with this
                 record (or the "-" row for read-group-less data).  This  ensures  that  the  checksums  can  be
                 subsequently  merged together algorithmically using the -m option, but it does mean there is no
                 validation of record swaps between read-groups.  Note, however, that due to the way ties are
                 resolved, running samtools merge out.bam rg1.bam rg2.bam may produce a different ordering than
                 merging the two files in the opposite order.  This can happen when two read-groups have alignments
                 at the same position with the same BAM flags.  Hence if we wish to check that a samtools split
                 followed by samtools merge round trip works, this per read-group counter is a benefit.

                 However, if absolute ordering needs to be validated regardless of read-groups, specifying the
                 -O option twice will compute the "all" row by combining the CRC with the Nth record in the file
                 rather than the Nth record in its read-group.  This output can no longer be merged using
                 checksum -m.
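
                 The difference between the two counters can be seen in a short Python sketch.  It is an
                 illustration under assumptions, not the samtools code: the modulus 2^31-1, the payloads and
                 the function names are hypothetical.

```python
import zlib
from collections import defaultdict

PRIME = (1 << 31) - 1  # assumed modulus


def fold(crc: int) -> int:
    """Reduce a CRC into the field, avoiding a zero factor (assumption)."""
    return crc % PRIME or 1


def all_row(records, order_level: int) -> int:
    """records: (read_group, payload) pairs.
    order_level 0: order agnostic, 1: counter per read-group, 2: counter per file."""
    per_group = defaultdict(int)
    product = 1
    for file_index, (read_group, payload) in enumerate(records, start=1):
        per_group[read_group] += 1
        crc = zlib.crc32(payload)
        if order_level == 1:
            crc ^= per_group[read_group]
        elif order_level == 2:
            crc ^= file_index
        product = product * fold(crc) % PRIME
    return product


interleaved = [("A", b"r1"), ("B", b"r2"), ("A", b"r3"), ("B", b"r4")]
grouped = [interleaved[0], interleaved[2], interleaved[1], interleaved[3]]
for level in (0, 1, 2):
    print(level, all_row(interleaved, level) == all_row(grouped, level))
```

                 Reordering records between read-groups, while keeping the order within each read-group,
                 leaves the per read-group counter result unchanged but is detected by the per file counter.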

       -P, --check-pos
                 Adds  a  column  to  the  output  with  combined  chromosome and position checksums.  This also
                 incorporates the flag/sequence CRC.

       -C, --check-cigar
                 Adds a column to the output with combined mapping  quality  and  CIGAR  checksums.   This  also
                 incorporates the flag/sequence CRC.

       -M, --check-mate
                 Adds  a  column  to  the output with combined mate reference, mate position and template length
                 checksums.  This also incorporates the flag/sequence CRC.

       -z FLAGS, --sanitize FLAGS
                 Perform data sanitization prior to checksumming.  This is off by default.  See samtools view
                 for the FLAGS terms accepted.

       -N COUNT, --count COUNT
                 Limits the checksumming to the first COUNT records from the file.

       -a, --all Checksum  all data.  This is equivalent to -PCMOc -b 0xfff -f0 -F0 -z all,cigarx -t *,cF,MD,NM.
                 It is useful for validating round-trips between file formats, such as BAM to CRAM.

       -T, --tabs
                 Use tabs for separating columns instead of aligned spaces.

       -q, --show-qc
                 Also show QC pass and fail rows per read-group.  These are based on the QCFAIL BAM flag.

       -o FILE, --output FILE
                 Output checksum report to FILE instead of stdout.

       -m FILE, --merge FILE...
                 Merge checksum outputs produced by the -o option.  This can be used to simulate or validate the
                 effect of computing checksum on the output of a samtools merge command.

                 The columns to report are read from the "# Group" line.  The rows to report are still  governed
                 by the -q, -v and -T options so this can also be used for reformatting of a single file.

                 Note the "all" row merging cannot be done when the two levels of order-specific checksums (-OO)
                 have been used.
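
                 For the order agnostic case, merging follows from the arithmetic: the product over a merged
                 set of records equals the product of the per-file products.  The Python sketch below shows
                 this property under the assumption of a 2^31-1 modulus; it does not read or write the
                 checksum report format, and the names are hypothetical.

```python
import zlib

PRIME = (1 << 31) - 1  # assumed modulus


def checksum(payloads) -> tuple[int, int]:
    """Return (record count, product of folded CRC32 values modulo PRIME)."""
    count, product = 0, 1
    for payload in payloads:
        value = zlib.crc32(payload) % PRIME or 1
        product = product * value % PRIME
        count += 1
    return count, product


def merge(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Combine two (count, product) pairs without revisiting the records."""
    return a[0] + b[0], a[1] * b[1] % PRIME


rg1 = [b"read-a", b"read-b"]
rg2 = [b"read-c", b"read-d", b"read-e"]
print(merge(checksum(rg1), checksum(rg2)) == checksum(rg1 + rg2))
```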

       -v, --verbose
                 Increase  verbosity.   At  level  1 or higher this also shows rows that have zero count values,
                 which can aid machine parsing.

EXAMPLES

       o To check that an aligned and position sorted file contains the same data as the pre-alignment FASTQ:

           samtools checksum -q pos-aln.bam
           samtools import -u -1 rg1.fastq.gz -2 rg2.fastq.gz | samtools checksum -q

         The output for this consists of some human readable comments starting with "#" and a series of checksum
         lines per read-group and QC status.

           # Checksum for file: SRR554369.P_aeruginosa.cram
           # Aux tags:          BC,FI,QT,RT,TC
           # BAM flags:         PAIRED,READ1,READ2

           # Group    QC        count  flag+seq  +name     +qual     +aux      combined
           all        all     3315742  4a812bf2  22d15cfe  507f0f57  4a812bf2  035e2f5b
           all        pass    3315742  4a812bf2  22d15cfe  507f0f57  4a812bf2  035e2f5b

         Note as no barcode tags exist, the "+aux" column is the same as the "flag+seq" column it is based upon.

       o To check round-tripping from BAM to CRAM and back again we can convert the BAM to CRAM and then run the
         checksum on the CRAM file.  This does not need explicitly converting back to BAM as htslib will  decode
         the CRAM and convert it back to the same in-memory representation that is utilised in BAM.

           samtools checksum -a 9827_2#49.1m.bam
           [...cut...]
           samtools view -@8 -C -T $HREF 9827_2#49.1m.bam | samtools checksum -a
           # Checksum for file: -
           # Aux tags:          *,cF,MD,NM
           # BAM flags:         PAIRED,PROPER_PAIR,UNMAP,MUNMAP,REVERSE,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,DUP,SUPPLEMENTARY

           # Group    QC        count  flag+seq  +name     +qual     +aux      +chr/pos  +cigar    +mate     combined
           all        all       99890  066a0706  0805371d  5506e19f  6b0eec58  60e2347c  09a2c3ba  347a3214  66c5e2de
           1#49       all       99890  066a0706  0805371d  5506e19f  6b0eec58  60e2347c  09a2c3ba  347a3214  66c5e2de

       o To validate that splitting a file by read-group retains all the data, we can compute checksums on the
         split BAMs and merge the checksum reports together to compare against the original unsplit file.  (Note
         in the example below diff will report the filename changing, which is expected.)

           samtools split -u /tmp/split/noRG.bam -f '/tmp/split/%!.%.' in.cram
           samtools checksum -a in.cram -o in.chksum
           s=$(for i in /tmp/split/*.bam;do echo "<(samtools checksum -a $i)";done)
           eval samtools checksum -m $s -o split.chksum
           diff in.chksum split.chksum
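
         Since the filename comment is expected to differ, the comparison can also be scripted so that only
         the remaining lines are compared.  A minimal Python sketch follows; the helper names and the two
         short reports are hypothetical, and the "# Checksum for file:" prefix is taken from the example
         output above.

```python
def significant_lines(report: str) -> list[str]:
    """Drop blank lines and the filename comment, and normalise column spacing."""
    return [
        " ".join(line.split())
        for line in report.splitlines()
        if line.strip() and not line.startswith("# Checksum for file:")
    ]


def same_checksums(report_a: str, report_b: str) -> bool:
    return significant_lines(report_a) == significant_lines(report_b)


original = """\
# Checksum for file: in.cram
# Group    QC        count  flag+seq  combined
all        all           2  4a812bf2  035e2f5b
"""
merged = original.replace("in.cram", "-")
print(same_checksums(original, merged))
```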

AUTHOR

       Written by James Bonfield from the Sanger Institute.
       Inspired by bamseqchksum, written by David Jackson of Sanger Institute and amended by German Tischler.

SEE ALSO

       samtools(1), samtools-view(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.22.1                                   14 July 2025                              samtools-checksum(1)