注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

wangyufeng的博客

祝愿BB 健康开心快乐每一天

 
 
 

日志

 
 

List of RNA-Seq bioinformatics tools (I)  

2014-02-25 21:10:06|  分类: 生物信息分析 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
From Wikipedia, the free encyclopedia

RNA-Seq (RNA-Seq.ppt / RNA-Seq Guide / RNA-Seq for everyone ) is a revolutionary[1] technique to perform transcriptome studies based on next-generation sequencing technologies. This technique is largely dependent on bioinformatics tools developed to support the different steps of the process. Here are listed some of the principal tools commonly employed and links to some related web resources.

To follow an integrated guide to the analysis of RNA-seq data, please see - Next Generation Sequencing (NGS)/RNA, Hands-On Tutorial or RNA-Seq Workflow. Also, important links are SEQanswers, RNA-SeqList, RNA-SeqBlog, Biostar and bioscholar.


Quality control and pre-processing data[edit]

Quality control and filtering data[edit]

Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq. Often, is necessary to filter data, removing low quality sequences or bases (trimming), linkers, overrepresented sequences or noise to assure a coherent final result.

  • clean_reads clean_reads.
  • condetri condetri.
  • cutadapt cutadapt removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case.
  • FastQC FastQC is a quality control tool for high-throughput sequence data (Babraham Institute) and is developed in Java. Import of data is possible from FastQ files, BAM or SAM format. This tool provides an overview to inform about problematic areas, summary graphs and tables to rapid assessment of data. Results are presented in HTML permanent reports. FastQC can be run as a stand alone application or it can be integrated into a larger pipeline solution. See also seqanswers/FastQC.
  • FASTX FASTX Toolkit is a set of command line tools to manipulate reads in files FASTA or FASTQ format. These commands make possible preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, information about statistics of quality, removing sequencing adapters, filtering and cutting sequences based on quality or conversion DNA/RNA.
  • Flexbar Flexbar performs removal of adapter sequences, trimming and filtering features.
  • FreClu FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology.
  • HTSeq HTSeq.
  • htSeqTools htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes possible visualize sample correlations, to remove over-amplification artifacts, to assess enrichment efficiency, to correct strand bias and visualize hits.
  • PRINSEQ PRINSEQ generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios.
  • qrqc qrqc is a Bioconductor package to quick read quality control.
  • RNA-SeQC RNA-SeQC is a tool with application in experiment design, process optimization and quality control before computational analysis. Essentially, provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (like mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and is not required installation, however can be run using the GenePatternweb interface. The input could be one or more BAM files. HTML reports are generated as output.
  • RSeQC RSeQC analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA, BED files or Chromosome size file (two-column, plain text file). Visualization can be performed by genome browsers like UCSC, IGB and IGV. However, R scripts can also be used to visualization.
  • Sabre sabre.
  • SAMStat SAMStat identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping.
  • SEECER seecer SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly.
  • Sickle Sickle.
  • ShortRead ShortRead is a package provided in the R (programming language) / BioConductor environments and allows input, manipulation, quality assessment and output of next-generation sequencing data. This tool makes possible manipulation of data, such as filter solutions to remove reads based on predefined criteria. ShortRead could be complemented with several Bioconductor packages to further analysis and visualization solutions (BioStrings, BSgenome, IRanges, and so on). See alsoseqanswers/ShortRead.
  • Trimmomatic Trimmomatic performs trimming for Illumina platforms and works with FASTQ reads (single or pair-ended). Some of the tasks executed are: cut adapters, cut bases in optional positions based on quality thresholds, cut reads to a specific length, converts quality scores to Phred-33/64.

Pre-processing data[edit]

Further tasks performed before alignment.

  • DeconRNASeq DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data.
  • FastQ Screen FastQ Screen screens FASTQ format sequences to a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc).
  • FLASH FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads.

Alignment Tools[edit]

After control assessment, the first step of RNA-Seq analysis involves alignment (RNA-Seq alignment) of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also Tools for mapping high-throughput sequencing data , List of sequence alignment software and HTS Mappers.

Short (Unspliced) aligners[edit]

Short aligners are able to align continuous reads (not containing gaps result of splicing) to a genome of reference. Basically, there are two types: 1) based on the Burrows-Wheeler transform method such as Bowtie and BWA, and 2) based on Seed-extend methods,Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster, however some tools of the second group, despite the time spent tend to be more sensitive, generating more reads correctly aligned. See a comparative study of short aligners - comparative study.

  • BFAST BFAST aligns short reads to reference sequences and presents particular sensitivity towards errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanwers/BFAST.
  • Burrows-Wheeler Aligner (BWA) BWA implements two algorithms, mainly based on Burrows–Wheeler transform. The first algorithm is used with reads with low error rate (<3%). The second algorithm was designed to handle more errors and implements a combined strategy: Burrows–Wheeler transform and Smith-Waterman method. BWA allows mismatches and small gaps (insertions and deletions). The output is presented in SAM format. See also seqanswers/BWA.
  • Short Oligonucleotide Analysis Package (SOAP) SOAP.
  • GNUMAP GNUMAP performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool is able to handle alignment in repetitive regions of a genome without losing information. The output of the program was developed to make possible easy visualization using available software.
  • Maq Maq first aligns reads to reference sequences and after performs a consensus stage. On the first stage performs only ungapped alignment and tolerates up to 3 mismatches. See also seqanswers/Maq.
  • NovoAlign (commercial) NovoAlign is a short aligner to the Illumina platform based on Needleman-Wunsch algorithm. Novoalign tolerates up to 8 mismatches per read, and up to 7bp of indels. It is able to deal with bisulphite data. Output in SAM format. See also seqanswers/NovoAlign.
  • SHRiMP SHRiMP employs two techniques to align short reads. Firstly, the q-gram filtering technique based on multiple seeds identifies candidate regions. Secondly, these regions are investigated in detail using Smith-Waterman algorithm. See alsoseqanswers/SHRiMP.
  • Stampy Stampy combines the sensitivity of hash tables and the speed of BWA. Stampy is prepared to alignment of reads containing sequence variation like insertions and deletions. It is able to deal with reads up to 4500 bases and presents the output in SAM format. See also seqanswers/Stampy.
  • ZOOM (commercial) ZOOM is a short aligner of the Illumina/Solexa 1G platform. ZOOM uses extended spaced seeds methodology building hash tables for the reads, and tolerates mismatches and insertions and deletions. See also seqanswers/ZOOM.

Spliced aligners[edit]

Many reads span exon-exon junctions and can not be aligned directly by Short aligners, thus specific aligners were necessary - Spliced aligners. Some Spliced aligners employ Short aligners to align firstly unspliced/continuous reads (exon-first approach), and after follow a different strategy to align the rest containing spliced regions - normally the reads are split into smaller segments and mapped independently. See also Methods to study splicing from high-throughput RNA Sequencing data and Methods to Study RNA-Seq (workflow).

Aligners based on known splice junctions (annotation-guided aligners)[edit]

In this case the detection of splice junctions is based on data available in databases about known junctions. This type of tools cannot identify new splice junctions. Some of this data comes from other expression methods like expressed sequence tags (EST).

  • Erange Erange is a tool to alignment and data quantification to mammalian transcriptomes. See also seqanswers/Erange.
  • IsoformEx IsoformEx.
  • MapAL MapAL.
  • OSA OSA.
  • RNA-MATE RNA-MATE is a computational pipeline for alignment of data from Applied Biosystems SOLID system. Provides the possibility of quality control and trimming of reads. The genome alignments are performed using mapreads and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting. See also seqanswers/RNA-MATE.
  • RUM RUM performs alignment based on a pipeline, being able to manipulate reads with splice junctions, using Bowtie and Blat. The flowchart starts doing alignment against a genome and a transcriptome database executed by Bowtie. The next step is to perform alignment of unmapped sequences to the genome of reference using BLAT. In the final step all alignments are merged to get the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format.

  评论这张
 
阅读(1001)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017