Provided by: seqcluster_1.2.9+ds-3_all

NAME

       seqcluster - seqcluster Documentation

       Analysis of small RNA sequencing data. It detects units of transcription over the genome,
       annotates them and creates an interactive HTML report that helps to explore the data
       quickly.

INSTALLATION

   Seqcluster
       With bcbio installed

       If you already have bcbio, seqcluster comes with it. If you want the latest development
       version:

          /bcbio_anaconda_bin_path/seqcluster_install.py --upgrade

       Docker:

          docker pull lpantano/smallsrna

       Bioconda binary

       Install conda if you want an isolated environment:

          wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
          bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/seqcluster/anaconda

       You can install directly from binstar (only for linux):

          ~/install/seqcluster/anaconda/bin/conda install seqcluster seqbuster bedtools samtools pip nose numpy scipy pandas pyvcf -c bioconda

       With that you will have everything you need for the python package. The last step is to
       add seqcluster to your PATH if conda is not already there.

       Go to Tools dependencies below to continue with the installation.

       Note: After installation, it is highly recommended to get the latest version by running:

          seqcluster_install.py --upgrade

       Automated installation

       Using the bcbio installation is strongly recommended if you work with sequencing data. But
       if you want a minimal installation:

          pip install fabric
          seqcluster_install --upgrade
          mkdir -p $PATH_TO_TOOLS/bin
          seqcluster_install --tools $PATH_TO_TOOLS

       After that you will need to add to your path: export PATH=$PATH_TO_TOOLS/bin:$PATH

   Tools dependencies for a full small RNA pipeline
       For seqcluster command:

       • bedtools

       • samtools

       • rnafold (for HTML report)

       For some steps of a typical small RNA-seq pipeline (using bcbio directly is recommended):

       • STAR, bowtie

       • fastqc

       • cutadapt (install with bioconda using the same python env as seqcluster; you will need
         to link the cutadapt binary to your PATH)

   Data
       The easy way to install your small RNA-seq data is with cloudbiolinux. Seqcluster has a
       code snippet to do that for you. It is recommended to use bcbio for the pipeline, since
       it will install everything you need in a single step:

          bcbio_nextgen.py upgrade -u development --tools --genomes hg19 --aligners bowtie

       If you want to run seqcluster step by step, an example for the hg19 human genome follows
       (another well annotated, supported genome is mm10).

       Download genome data:

          seqcluster_install --data $PATH_TO_DATA --genomes hg19 --aligners bowtie2 --datatarget smallrna

       If you want to install STAR indexes, since STAR gets somewhat better results than bowtie2
       (warning: 40GB of RAM needed):

          seqcluster_install --data $PATH_TO_DATA --genomes hg19 --aligners star

   R package
       Install isomiRs package for R using devtools:

          devtools::install_github('lpantano/isomiRs')

       To install all packages used by the Rmd report:

          Rscript -e 'source("https://raw.githubusercontent.com/lpantano/seqcluster/master/scripts/install_libraries.R")'

CITATION

       If you use seqcluster, please make sure to also cite the other tools that are integrated
       here:

       A  non-biased  framework  for the annotation and classification of the non-miRNA small RNA
       transcriptome. Pantano L, Estivill X, Martí E. Bioinformatics. 2011 Nov 15;27(22):3202-3.
       doi: 10.1093/bioinformatics/btr527. Epub 2011 Oct 5. PMID: 21976421

       SeqBuster  is a bioinformatic tool for the processing and analysis of small RNAs datasets,
       reveals ubiquitous miRNA modifications in human embryonic cells. Pantano  L,  Estivill  X,
       Martí E. Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.

       Quinlan  AR  and  Hall  IM,  2010.  BEDTools:  a flexible suite of utilities for comparing
       genomic features. Bioinformatics. 26, 6, pp. 841–842.

       Dale RK,  Pedersen  BS,  and  Quinlan  AR.  Pybedtools:  a  flexible  Python  library  for
       manipulating     genomic     datasets     and    annotations.    Bioinformatics    (2011).
       doi:10.1093/bioinformatics/btr539

       Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G.,  Abecasis  G.,
       Durbin   R.  and  1000  Genome  Project  Data  Processing  Subgroup  (2009)  The  Sequence
       alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

       Li H A statistical framework for SNP calling, mutation discovery, association mapping  and
       population  genetical  parameter estimation from sequencing data. Bioinformatics. 2011 Nov
       1;27(21):2987-93. Epub 2011 Sep 8. [PMID: 21903627]

GETTING STARTED

       Best practices are implemented in a python framework.

   clustering of small RNA sequences
       seqcluster generates a list of clusters of small RNA sequences, their genome location,
       their annotation and the abundance in all the samples of the project.

       REMOVE ADAPTER

       I am currently using cutadapt:

          cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq

       COLLAPSE READS

       To reduce computational time, I recommend collapsing sequences. It also helps to apply
       filters based on abundance, like removing sequences that appear only once.

          seqcluster collapse -f sample1_clean.fastq -o collapse

       Here I am only using sequences that had the adapter, meaning that they are certainly small
       fragments.

       This is compatible with UMI barcodes. If the read name contains UMI_ATCGAT, the tool
       will remove PCR duplicates as well. To confirm this happened, the tool should output this
       sentence during the processing of the file: Find UMI tags in read names, collapsing by
       UMI.

       PREPARE SAMPLES

          seqcluster prepare -c file_w_samples -o res --minl 17 --minc 2 --maxl 45

       The file_w_samples should have the following format:

          lane1_sequence.txt_1_1_phred.fastq      cc1
          lane1_sequence.txt_2_1_phred.fastq      cc2
          lane2_sequence.txt_1_1_phred.fastq      cc3
          lane2_sequence.txt_2_1_phred.fastq      cc4

       This is a two-column file, where the first column is the name of the file with the small
       RNA sequences for each sample, and the second column is the name of the sample.

       The fastq files should be like this:

          @seq_1_x11
          CCCCGTTCCCCCCTCCTCC
          +
          QUALITY_LINE
          @seq_2_x20
          TGCGCAGTGGCAGTATCGTAGCCAATG
          +
          QUALITY_LINE

       Where _x[0-9] indicates the abundance of that sequence, and the middle number is the index
       of the sequence.
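       The read-name convention described above (seq_<index>_x<abundance>) can be unpacked with
       a few lines of Python. This is only a sketch for illustration: parse_read_name is a
       hypothetical helper, not part of seqcluster, and it assumes names follow exactly the
       pattern shown in the example.

```python
import re

# Pattern for the naming convention shown above: seq_<index>_x<abundance>.
# The leading "@" of a FASTQ header is optional.
READ_NAME = re.compile(r"^@?seq_(\d+)_x(\d+)$")


def parse_read_name(name):
    """Return (index, abundance) for a collapsed read name, or None if it does not match."""
    match = READ_NAME.match(name.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for header in ("@seq_1_x11", "@seq_2_x20"):
        print(header, parse_read_name(header))
```

       A helper like this can be handy to check that the abundances in the input files look
       sensible before running seqcluster prepare.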

       This script will generate seqs.fastq and seqs.ma:

       • seqs.fastq: has the unique sequences with unique ids

       • seqs.ma: is the abundance matrix of all unique sequences in all samples

       ALIGNMENT

       You should use an aligner to map seqs.fastq to your genome, for example bowtie or STAR.
       The next step needs a file in BAM format. VERY IMPORTANT: the BAM file should be sorted.

          bowtie -a --best --strata -m 5000 INDEX seqs.fastq -S | samtools view -Sbh /dev/stdin | samtools sort -o /dev/stdout temp > seqs.sort.bam

       or

          STAR --genomeDir $star_index_folder --readFilesIn res/seqs.fastq --alignIntronMax 1  --outFilterMultimapNmax 1000 --outSAMattributes NH HI NM --outSAMtype BAM SortedByCoordinate

       CLUSTERING

          seqcluster cluster -a res/Aligned.sortedByCoord.out.bam  -m res/seqs.ma -g $GTF_FILE  -o res/cluster -ref PATH_TO_GENOME_FASTA --db example

       • -a is the sorted BAM file generated by your aligner, whose input was seqs.fastq

       • -m the seqs.ma file generated by seqcluster prepare

       • -b annotation files in bed format (see below examples) [deprecated]

       • -g annotation files in gtf format (see below examples) [recommended]

       • -i genome fasta file used in the mapping step (only needed if -s active)

       • -o output folder

       • -ref genome fasta file. Needs the fai file in the same place as well (e.g. hg19.fa,
         hg19.fa.fai)

       • -d create debug logging

       • -s construction of putative precursor (NOT YET IMPLEMENTED)

       • --db  (optional)  will  create sqlite3 database with results that will be used to browse
         data with html web page (under development)

       Example of a bed file for annotation  (the  fourth  column  should  be  the  name  of  the
       feature):

          chr1    157783  157886  snRNA   0       -

       The gtf format is strongly recommended; bed annotation is deprecated. See the Data section
       of the installation instructions to know how to download data for hg19 and mm10.

       Example of a gtf file for annotation (the third column should be the name of  the  feature
       and the value after gene name attribute is the specific annotation):

          chr1    source  miRNA      1       11503   .       +       .       gene name 'mir-102' ;

       hint: scripts to generate human and mouse annotation are inside seqcluster/scripts folder.

       OUTPUTS

       • counts.tsv: count matrix that can be input of downstream analyses

       • size_counts.tsv: size distribution of the small RNA by annotation group

       • seqcluster.json: json file containing all information

       • log/run.log: all messages at debug level

       • log/trace.log: to keep trace of algorithm decisions

   Interactive HTML Report
       The following command creates the HTML report, assuming the output of seqcluster cluster
       is at res:

          seqcluster report -j res/seqcluster.json -o report -r $GENOME_FASTA_PATH

       where $GENOME_FASTA_PATH is the path to the genome fasta file used in the alignment.

       Note: you can try our new visualization tool!

       • report/html/index.html: table with all clusters and the annotation with sorting option

       • report/html/[0-9]/maps.html: summary of the cluster with expression profile, annotation,
         and all sequences inside

       • report/html/[0-9]/maps.fa: putative precursor

   Easy start with bcbio-nextgen.py
       Note: If you are already using bcbio, visit bcbio to run the pipeline there.

       To install the small RNA data:

          bcbio_nextgen.py upgrade -u development --tools --datatarget smallrna

       Options to run in a cluster

       It uses ipython-cluster-helper to send jobs to nodes in the cluster.

       • --parallel should be set to ipython

       • --scheduler should be set to sge, lsf or slurm

       • --num-jobs indicates how many jobs to launch. It will run samples independently. If you
         have 4 samples, and set this to 4, 4 jobs will be launched to the cluster

       • --queue the queue to use

       • --resources allows setting any special parameter for the cluster, such as the email in
         an sge system: M=my@email.com

       Read the complete usage here: https://github.com/roryk/ipython-cluster-helper An example
       for a slurm system is:

          --parallel ipython --scheduler slurm --num-jobs 4 --queue general

       Output

       • one folder for each analysis, and inside one per sample

          • adapter: *clean.fastq is the file after adapter removal, *clean_trimmed.fastq is  the
            collapsed clean.fastq, *fragments.fastq is the file without adapter, *short.fastq is
            the file with reads < 16 nt.

          • align: BAM file resulting from aligning trimmed.fastq

          • mirbase: file with miRNA annotation and novel miRNA discovery with mirdeep2

          • tRNA: analysis done with tdrmapper [citation needed]

          • qc: *_fastqc.html is the fastqc result from the uncollapsed fastq file

       • seqcluster: is the result of running  seqcluster.  See  its  documentation  for  further
         information.

       • report/srna-report.Rmd:  template  to  create  a  quick html report with exploration and
         differential expression analysis. See example here

OUTPUTS

   seqcluster
       • counts.tsv: count matrix that can be input of downstream analyses. nloci will be 0
         whenever the meta-cluster has been resolved successfully. For instance, it can happen
         that you have a bunch of sequences mapping to hundreds of different places on the
         genome; then seqcluster doesn’t resolve that, and puts everything under the larger
         region covered by those sequences. So, mainly, all rows with 0 are good rows. The ann
         column is just what the meta-clusters overlap with. One name can appear many times if
         different locations of the meta-cluster map to different copies of that feature, or if
         the annotation file used had multiple lines for it.
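       As a rough illustration of the nloci rule for counts.tsv, the sketch below keeps only the
       rows where nloci is 0. It assumes the file is tab-separated with a header line that
       contains a column named nloci; the other column names (id, ann, sample1, sample2) and the
       two data rows are invented for the example, so check them against your own file.

```python
import csv
import io


def resolved_rows(handle):
    """Yield rows of a tab-separated counts table whose nloci column is 0."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if int(row["nloci"]) == 0:
            yield row


# Tiny made-up table used only to exercise the function; real files will differ.
example = io.StringIO(
    "id\tnloci\tann\tsample1\tsample2\n"
    "1\t0\tmiRNA\t10\t12\n"
    "2\t150\trepeat\t300\t280\n"
)
kept = list(resolved_rows(example))
print([row["id"] for row in kept])
```

       With a real file, replace the io.StringIO object with open("counts.tsv").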

       • read_stats.tsv:  number  of reads for each sample after each step in the analysis. Meant
         to give a hint if we lose a lot of information or not.

       • size_counts.tsv: size distribution of the small  RNA  by  annotation  group.  (position,
         reads, cluster)

       • seqcluster.json: json file containing all information. This file is used as the input of
         the report suite.

       • log/run.log: all messages at debug level

       • log/trace.log: to keep trace of algorithm decisions

   Report
       Besides the static HTML report that you can get using the report subcommand, you can
       download the seqclusterViz HTML viewer (watch the repository to get notifications of new
       releases).

       • Go inside seqclusterViz folder

       • Open reader.html

       • Upload the seqcluster.db file generated by report subcommand.

       • Start browsing your data!

       Meaning of different sections:

       • Top-left table shows list of meta-clusters, user can filter by number ID or keywords.

       • Top-right table shows positions where this meta-cluster has been detected.

       • Expression  profile  along  precursor: Lines are number of reads in that position of the
         precursor. It is sum of the log2 RPM of the expression for each sample.

       • Table: raw counts for each sample and sequence. Only top 100 are shown.

       • secondary structure: The region with more sequences inside the meta-cluster is used to
         plot the secondary structure. Colors refer to abundance in each position. Darker means
         more abundance.

EXAMPLES OF SMALL RNA ANALYSIS

   miRQC data
       About

       miRQC project

       samples overview:

       >> Universal Human miRNA reference RNA (Agilent Technologies, #750700), human brain  total
       RNA  (Life  Technologies, #AM6050), human liver total RNA (Life Technologies, #AM7960) and
       MS2-phage RNA (Roche, #10165948001) were diluted to a platform-specific concentration. RNA
       integrity  and  purity  were  evaluated  using  the Experion automated gel electrophoresis
       system (Bio-Rad) and Nanodrop spectrophotometer. All RNA  samples  were  of  high  quality
       (miRQC  A:  RNA  quality  index (RQI, scale from 0 to 10) = 9.0; miRQC B: RQI = 8.7; human
       liver RNA: RQI = 9.2) and high purity (data  not  shown).  RNA  was  isolated  from  serum
       prepared  from  three healthy donors using the miRNeasy mini kit (Qiagen) according to the
       manufacturer's instructions, and RNA samples were pooled. Informed  consent  was  obtained
       from  all  donors  (Ghent  University  Ethical Committee). Different kits for isolation of
       serum RNA are available; addressing their impact was  outside  the  scope  of  this  work.
       Synthetic  miRNA  templates  for  let-7a-5p,  let-7b-5p,  let-7c,  let-7d-5p, miR-302a-3p,
       miR-302b-3p,  miR-302c-3p,  miR-302d-3p,  miR-133a  and  miR-10a-5p  were  synthesized  by
       Integrated DNA Technologies and 5′ phosphorylated. Synthetic let-7 and miR-302 miRNAs were
       spiked into MS2-phage RNA and total human liver RNA, respectively, at 5 × 10^6 copies/μg
       RNA.  These  samples  do  not  contain  endogenous  miR-302 or let-7 miRNAs, which allowed
       unbiased analysis of cross-reactivity between the  individual  miR-302  and  let-7  miRNAs
       measured  by  the  platform  and  the different miR-302 and let-7 synthetic templates in a
       complex RNA background. Synthetic miRNA templates for miR-10a-5p,  let-7a-5p,  miR-302a-3p
       and miR-133a were spiked in human serum RNA at 6 × 10^3 copies per microliter of serum RNA
       or at 5-times higher, 2-times higher, 2-times  lower  and  5-times  lower  concentrations,
       respectively. All vendors received 10 μl of each serum RNA sample.

       Commands

       Data was downloaded from the GEO web site with this script. The following 2 configs were
       used for the two sets: mirqc samples and non mirqc samples. Samples were analyzed with
       bcbio with the following commands

       report

       A report showing part of the output of the bcbio pipelines, together with some
       validations, is here.

MIRNA ANNOTATION

       miRNA annotation runs inside the bcbio small RNA-seq pipeline together with other tools
       to do a complete small RNA analysis.

       For some comparison with other tools go here.

       You can run samples after processing the reads as shown below. Currently there are two
       versions: JAVA

       Naming

       Always see the up-to-date information in the mirtop open project.

       It is a work in progress, but since 10-21-2015 isomiR naming has changed to:

       • Nucleotide substitution: NUMBER|NUCLEOTIDE_ISOMIR|NUCLEOTIDE_REFERENCE means that at the
         position given by the number, the nucleotide in the sequence has substituted the
         nucleotide in the reference. This, as well, is a post-transcriptional modification.

       • Additions at 3' end: 0/NA means no modification. UPPER CASE LETTER means addition at the
         end. Note that these nucleotides don't match the precursor, so they are
         post-transcriptional modifications.

       • Changes at 5' end: 0/NA means no modification. UPPER CASE LETTER means nucleotide
         insertions (sequence starts before miRBase mature position). LOWER CASE LETTER means
         nucleotide deletions (sequence starts after miRBase mature position).

       • Changes at 3' end: 0/NA means no modification. UPPER CASE LETTER means nucleotide
         insertions (sequence ends after miRBase mature position). LOWER CASE LETTER means
         nucleotide deletions (sequence ends before miRBase mature position).
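       The upper/lower case rule for changes at the 5' and 3' ends can be read programmatically.
       The sketch below is only an illustration of the rules listed above; describe_end_change
       is a hypothetical helper and is not part of seqcluster, miraligner or mirtop.

```python
def describe_end_change(value):
    """Interpret a 5' or 3' end change field following the rules listed above."""
    if value in ("0", "NA"):
        return "no modification"
    if value.isupper():
        # Upper case letters: the sequence extends beyond the miRBase mature position.
        return "insertion of %d nt" % len(value)
    if value.islower():
        # Lower case letters: the sequence is shorter than the miRBase mature sequence.
        return "deletion of %d nt" % len(value)
    # Mixed case or unexpected characters are left for the user to inspect.
    return "unrecognized"


if __name__ == "__main__":
    for field in ("0", "NA", "CC", "tg"):
        print(field, "->", describe_end_change(field))
```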

   Processing of reads
       REMOVE ADAPTER

       I am currently using cutadapt.

          cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq

       COLLAPSE READS

       To reduce computational time, I recommend collapsing sequences. It also helps to apply
       filters based on abundance, like removing sequences that appear only once.

          seqcluster collapse -f sample1_clean.fastq -o collapse

       Here I am only using sequences that had the adapter, meaning that they are certainly small
       fragments. The output will be named sample1_clean_trimmed.fastq

   Prepare databases
       For human or mouse, follow this instruction to easily download miRBase files. In general
       you only need hairpin.fa and miRNA.str from the miRBase site. mirGeneDB is also supported;
       download the needed files here.

       It is highly recommended to filter hairpin.fa to contain only the desired species.

   miRNA/isomiR annotation with JAVA
       MIRALIGNER

       Download the tool from miraligner repository.

       Download the mirbase files (hairpin and miRNA) from the ftp and save them to the DB folder.

       You can map the miRNAs with:

          java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i sample1_clean_trimmed.fastq -db DB  -o output_prefix

       Cite

       SeqBuster is a bioinformatic tool for the processing and analysis of small RNAs  datasets,
       reveals  ubiquitous  miRNA  modifications in human embryonic cells. Pantano L, Estivill X,
       Martí E. Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.

       NOTE: Check comparison of multiple tools for miRNA annotation.

   Convert to GFF3-srna
       Use mirtop to convert to GFF3-srna format. This is the desired format to share the isomiR
       information and can be used to join multiple projects together easily.

       The following command converts all the output into a single file that is easy to share
       with collaborators:

          mirtop gff --format seqbuster --sps hsa --hairpin database/hairpin.fa --gtf database/hsa.gff3 -o test_out out_folder/*/*.mirna

   Post-analysis with R
       Use the outputs to do differential expression, clustering and  descriptive  analysis  with
       this package: isomiRs

       To  load  the data you can use IsomirDataSeqFromFiles function and get the count data with
       isoCounts to move to DESeq2 or similar packages.

   Manual of miraligner(JAVA)
       options

       Add -freq if you have your fasta/fastq file with this format and you want a  third  column
       with the frequency (normally value after x character):

          >seq_1_x4
          CACCGCTGTCGGGGAACCGCGCCAATTT

       Add -pre if you want also sequences that map to the precursor but outside the mature miRNA

       • Parameter -sub: mismatches allowed (0/1)

       • Parameter -trim: nucleotides allowed for trimming (max 3)

       • Parameter -add: nucleotides allowed for addition (max 3)

       • Parameter -s: species (3 letter, human=>hsa)

       • Parameter -i: fasta file

       • Parameter -db: folder where miRBase files are (one copy at miraligner-1.0/DB folder)

       • Parameter -o: prefix for the output files

       • Parameter -freq: add frequency of the sequence to the output (just where input is a fasta
         file with names matching this pattern: >seq_3_x67)

       • Parameter -pre: add sequences mapping to precursors as well

       input

       A fasta/fastq file with reads:

          >seq
          CACCGCTGTCGGGGAACCGCGCCAATTT

       or tabular file with counts information:

          CACCGCTGTCGGGGAACCGCGCCAATTT 45

       output

       Track file *.mirna.opt: information about the process

       Non mapped sequences will be on *.nomap

       Header of the *.mirna.out file:

       • seq: sequence

       • freq/name: depending on the input this column contains counts (tabular  input  file)  or
         name (fasta file)

       • mir: miRNA name

       • start: start of the sequence at the precursor

       • end: end of the sequence at the precursor

       • mism:  nucleotide  substitution  position  |  nucleotide  at  sequence  |  nucleotide at
         precursor

       • addition: nucleotides added at the 3' end:

            precursor         => cctgtggttagctggttgcatatcc
            annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
            sequence add:  TT =>   TGTGGTTAGCTGGTTGCATATTT

       • tr5: nucleotides at the 5' end different from the annotated sequence in miRBase:

            precursor         => cctgtggttagctggttgcatatcc
            annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
            sequence tr5:  CC => CCTGTGGTTAGCTGGTTGCATAT
            sequence tr5:  tg =>     TGGTTAGCTGGTTGCATAT

       • tr3: nucleotides at the 3' end different from the annotated sequence in miRBase:

            precursor         => cctgtggttagctggttgcatatcc
            annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
            sequence tr3: cc  =>   TGTGGTTAGCTGGTTGCATATCC
            sequence tr3: AT  =>   TGTGGTTAGCTGGTTGCAT

       • s5: offset nucleotides at the beginning of the annotated miRNAs:

            precursor         => agcctgtggttagctggttgcatatcc
            annotated miRNA   =>     TGTGGTTAGCTGGTTGCATAT
            s5                => AGCCTGTG

       • s3: offset nucleotides at the end of the annotated miRNAs:

            precursor         =>  cctgtggttagctggttgcatatccgc
            annotated miRNA   =>    TGTGGTTAGCTGGTTGCATAT
            s3                =>                     ATATCCGC

       • type: mapped on precursor or miRNA sequences

       • ambiguity: number of different detected precursors

       Example:

          seq                 miRNA           start   end     mism    tr5     tr3     add     s5      s3      DB amb
          TGGCTCAGTTCAGCAGGACC    hsa-mir-24-2    50      67      0       qCC     0       0       0       0       precursor 1
          ACTGCCCTAAGTGCTCCTTCTG  hsa-miR-18a*    47      68      0       0       0       tG      ATCTACTG        CTGGCA  miRNA 1

COLLAPSE FASTQ(.GZ) FILES

       Definition

       Normally quality values are lost in small RNA-seq pipelines due to collapsing after
       adapter recognition. This option allows collapsing reads after adapter removal with
       cutadapt or any other tool. This way the mapping can use quality values, which makes it
       possible to map with bwa, for instance, or any other alignment tool that doesn't support
       FASTA files.

       Methods

       The new quality values are the average of the quality values of the collapsed sequences.

       Example

          seqcluster collapse -f sample_trimmed.fastq -o collapse

       • -f is the fastq(.gz) file

       • -o is the folder where the output will be created. It will contain a new FASTQ file,
         where the read names follow this pattern:

            @seq_[0-9]_x[0-9]

       The number right after _x is the abundance of this sequence in the sample.
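       To make the idea of collapsing with averaged qualities concrete, here is a minimal sketch.
       It is not the seqcluster implementation: collapse_reads is a hypothetical function, and it
       assumes Phred+33 encoded qualities, a per-position average, and indexes assigned in order
       of first appearance.

```python
from collections import OrderedDict


def collapse_reads(records):
    """Collapse identical sequences and average their Phred+33 qualities per position.

    records: iterable of (sequence, quality_string) tuples.
    Returns a list of (name, sequence, quality_string) tuples.
    """
    groups = OrderedDict()
    for seq, qual in records:
        groups.setdefault(seq, []).append(qual)

    collapsed = []
    for idx, (seq, quals) in enumerate(groups.items(), start=1):
        # Average the Phred scores position by position across all copies.
        averaged = [
            chr(int(round(sum(ord(q[i]) - 33 for q in quals) / len(quals))) + 33)
            for i in range(len(seq))
        ]
        # Name follows the seq_<index>_x<abundance> convention described above.
        name = "seq_%d_x%d" % (idx, len(quals))
        collapsed.append((name, seq, "".join(averaged)))
    return collapsed


if __name__ == "__main__":
    reads = [("ACGT", "IIII"), ("ACGT", "5555"), ("TTGA", "IIII")]
    for name, seq, qual in collapse_reads(reads):
        print("@%s\n%s\n+\n%s" % (name, seq, qual))
```

       In this toy input the sequence ACGT appears twice, so it is written once with the suffix
       _x2 and a quality line that averages the two original quality strings.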

HANDLING MULTI-MAPPED READS

       Definition

       Multi-mapped reads are the sequences that map more than one time on the genome, for
       instance, because there are multiple copies of a gene, as happens with tRNA precursors.

       Consequence

       Many pipelines ignore these sequences by default, which means that you are losing at least
       20-30% of the data. In this case it is difficult to decide where these sequences come
       from, and currently there are three strategies:

       • ignore them

       • count as many times as they appear: for instance, if a sequence maps twice, just count
         it two times, once in each of the two loci. This causes an over-representation of the
         loci abundances, and actually goes against the assumptions of all packages that perform
         differential expression on count data.

       • weight them: divide the total count by the number of places it maps. In the previous
         example, each locus would get 1/2 * count. This produces weird dispersion values for
         packages that fit this value as part of the model.
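       The three strategies listed above can be compared with a small sketch. The function and
       the strategy names ("ignore", "all", "weight") are made up for this illustration and do
       not correspond to options of any particular tool.

```python
def count_loci(read_count, loci, strategy):
    """Distribute the count of one sequence over the loci where it maps."""
    if strategy == "ignore":
        # Multi-mapped sequences are dropped; uniquely mapped ones keep their count.
        share = read_count if len(loci) == 1 else 0
    elif strategy == "all":
        # Every locus receives the full count (over-represents abundance).
        share = read_count
    elif strategy == "weight":
        # The count is divided by the number of places the sequence maps.
        share = read_count / len(loci)
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return {locus: share for locus in loci}


if __name__ == "__main__":
    for strategy in ("ignore", "all", "weight"):
        print(strategy, count_loci(10, ["locusA", "locusB"], strategy))
```

       For a sequence with 10 reads mapping to two loci, the total assigned is 0, 20 and 10
       reads respectively, which shows why the second strategy inflates abundances.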

       Our implementation

       We try to decide the origin of these sequences. The most common scenario is that a group
       of sequences maps to two or three different regions, probably due to multiple copies of
       the precursor on the genome.

       We introduce two options:

       • most-voting strategy: In this case, we count all sequences just once, and we output this
         as one unit of transcription with multiple regions. This is the default option.

       • bayes inference: we give the same prior probability to all locations, and use the number
         of sequences starting at the same position as the one whose location we are trying to
         predict as P(B|A). With this we calculate the posterior, which is used to split the
         counts proportionally among the different locations. We apply the code from the book
         "Think Bayes" (Allen B. Downey). This is still under development. To activate this
         option, the user just needs to add --method bayes

       The main advantage of this is that the output can be the input of any downstream analysis
       that is applied to RNA-seq, like DESeq, edgeR ... As well, there is less noise, because
       there is only one output coming from here, not three.
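       The bayes inference option described above can be sketched as a plain application of
       Bayes' rule. This is not the seqcluster code: the function names are hypothetical, and
       the likelihood P(B|A) is assumed to be proportional to the raw number of sequences that
       start at the same position in each location.

```python
def posterior_proportions(same_start_counts):
    """Return the posterior probability of each location.

    same_start_counts: dict mapping location -> number of sequences starting
    at the same position as the sequence being assigned (used as likelihood).
    """
    prior = 1.0 / len(same_start_counts)  # same prior for every location
    unnormalized = {loc: prior * n for loc, n in same_start_counts.items()}
    total = sum(unnormalized.values())
    if total == 0:
        # No supporting sequences anywhere: fall back to the uniform prior.
        return {loc: prior for loc in same_start_counts}
    return {loc: value / total for loc, value in unnormalized.items()}


def split_counts(count, same_start_counts):
    """Split the count of one sequence among locations using the posterior."""
    posterior = posterior_proportions(same_start_counts)
    return {loc: count * p for loc, p in posterior.items()}


if __name__ == "__main__":
    support = {"locusA": 30, "locusB": 10}
    print(posterior_proportions(support))
    print(split_counts(8, support))
```

       With 30 supporting sequences in one location and 10 in the other, the first location
       receives three quarters of the counts.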

TOOLS FOR DOWNSTREAM ANALYSIS

   Web-servers
       TFmiR: disease-specific miRNA/transcription factor co-regulatory networks v1.2. It uses
       results from UP/DOWN regulated miRNA/Genes and allows focusing on only one disease to
       create different types of relationships between miRNA/TF/Gene. Easy to use. You will
       probably need to filter the output sometimes due to the big networks that can result from
       an analysis.

       Diana-TarBase  v7.0:  Database  for validated miRNA targets. Many filter options. Good for
       small candidate miRNAs set studies.

       StarScan: Database to browse the targets of miRNAs from degradome data.  It  has  a  fancy
       interface, and many species and data from GEO.

       miRtex  gives  targets  from  literature.  Good  for  finding  validated  targets  to help
       discussion in papers or further functional experiment based on new hypothesis.

       piRBase: Database for piRNA annotation and function. Published last year, for now the best
       I can find out there.

       chimira: Web tool to analyze isomiRs. It gives you a quick idea of your samples.

       MicroCosm: MiRNA target database. Updated and download option.

       IsomiR Bank: isomiR database from many species and tissues. For single queries is useful.

   Command-lines
       miRVaS: tool to predict the functional changes due to nt changes in the miRNA sequence.

RELEVANT PAPERS ABOUT ISOMIRS AND OTHER NOVEL SMALL RNAS WITH FUNCTIONAL RELEVANCE

   Validation
       Our approach can be adapted to many polyadenylation-based RT-qPCR technologies already
       existing, providing a convenient way to distinguish long and short 3′-isomiRs.

   IsomiRs
       Naturally existing isoforms of miR-222 have distinct functions: this work demonstrates the
       capacity for 3' isomiRs to mediate differential functions, we contend more attention needs
       to be given to 3' variance given the prevalence of this class of isomiR.

       miR-142-3p isomiR:  "We furthermore demonstrate  that  miRNA  5′-end  variation  leads  to
       differential targeting and can thus broaden the target range of miRNAs."

       A highly expressed miR-101 isomiR is a functional silencing small RNA.

       A challenge for miRNA: multiple isomiRs in miRNAomics.

       miR-183-5p  isomiR  changes  in  breast  cancer.  Validated target regulation of new genes
       different from the reference miRNA.

       A comprehensive survey of 3' animal miRNA modification events and a possible role  for  3'
       adenylation in modulating miRNA targeting effectiveness.

       PAPD5-mediated  3′  adenylation  and  subsequent  degradation  of  miR-21  is disrupted in
       proliferative disease.

       High-resolution analysis of the human retina miRNome reveals isomiR variations  and  novel
       microRNAs.

       Sequence features of Drosha and Dicer cleavage sites affect the complexity of isomiRs.

       Knowledge  about  the  presence  or  absence  of miRNA isoforms (isomiRs) can successfully
       discriminate amongst 32 TCGA cancer types

   General
       A novel piRNA mechanism in regulating gene expression  in  highly  differentiated  somatic
       cells.

       Differential  and  coherent  processing  patterns  from  small  RNAs  to detect changes in
       profiles of processing small RNAs.

       Survey of 800+ datasets from human tissue and  body  fluid  reveals  XenomiRs  are  likely
       artifacts

   Targets
       Identification of factors involved in target RNA-directed microRNA degradation.

   Technology
       miRQC:  work  studying  the  accuracy  and specificity of different technologies to detect
       miRNAs.

       Important features affecting the detection of small RNA biomarkers:  How  the  sample  can
       affect the detection of biomarkers (like RIN value, concentration, ...)

       Comparison  of  alignment and normalization . I will take the message that TMM and DESeq/2
       normalization are the best to avoid strong bias if we consider to have a small  proportion
       of  DE miRNAs. For the alignments, here you have another comparison for miRNAs annotation:
       https://rawgit.com/lpantano/tools-mixer/master/mirna/mirannotation/stats.html

       review of tools for detect miRNA-disease network.

       review of tools  for miRNA de-novo and interaction analysis

       Evaluation   of   microRNA   alignment   techniques   BIG   meeting   on    Dec,3    2015:
       bcbio-srnaseq-BIG-20151203.pdf

DOCUMENTATION

CLASSES

       Visit GitHub code

       I am in the process of documenting all classes and methods

AUTHOR

       Lorena Pantano, Francisco Pantano, Eulalia Marti

       2023, Lorena Pantano