
wangyufeng's blog

Kallisto: fast quantification software for RNA-seq data (Manual)

2016-09-30 16:18:23 | Category: Default category

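    On April 4, 2016, Nature Biotechnology (NBT) published a paper titled "Near-optimal probabilistic RNA-seq quantification". The quantification software kallisto is the highlight of that paper. Kallisto can complete the alignment and quantification of 30M reads in under 10 minutes (so transcriptome data can now be processed even on a laptop).
Transcriptome analysis has two main steps:
Step 1: alignment, i.e. mapping the sequencing reads to a reference genome (mainly with TopHat2, Bowtie2, HISAT, and similar tools);
Step 2: expression quantification, i.e. counting the reads for each gene (transcript) (Cufflinks, HTSeq-count, and similar tools).
    Traditional alignment splits reads into k-mers, assigns each k-mer to a unique position in a hash table, and then aligns the sequences. This transformation greatly improves alignment efficiency. However, when a k-mer can be aligned to several locations in the genome, quantification accuracy drops. Kallisto handles this problem effectively: it does not need to know the exact position within a transcript that a read came from. Knowing which transcript the read came from is enough for accurate quantification (the focus is on determining which gene a read belongs to, not on where the read lies within the gene).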
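The idea of assigning a read to transcripts without locating it can be illustrated with a toy sketch. This is not kallisto's implementation (which works on a transcriptome de Bruijn graph and also handles reverse complements); it only shows the set-intersection idea, and the transcript names and sequences are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript is consistent with all k-mers
    return compatible or set()


# Hypothetical toy transcripts that share a common prefix.
transcripts = {
    "tx1": "ACGTACGTTTGACCA",
    "tx2": "ACGTACGTTTGAGGG",
}
index = build_kmer_index(transcripts, k=5)
shared = pseudoalign("ACGTACGTT", index, k=5)  # read from the shared prefix
unique = pseudoalign("TTGACCA", index, k=5)    # read covering the tx1-only end
```

Here `shared` contains both transcripts and `unique` contains only `tx1`. In neither case is the position of the read within the transcript ever computed.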
Figure: Overview of kallisto
Figure: Performance of kallisto and other methods.
 Near-optimal probabilistic RNA-seq quantification
We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.
Full text: http://www.nature.com/nbt/journal/v34/n5/full/nbt.3519.html

Manual

Typing kallisto produces a list of usage options, which are:

kallisto 0.43.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index
    quant         Runs the quantification algorithm
    pseudo        Runs the pseudoalignment step
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

The usage commands are:

index

kallisto index builds an index from a FASTA formatted file of target sequences. The arguments for the index command are:

kallisto 0.43.0
Builds a kallisto index

Usage: kallisto index [arguments] FASTA-files

Required argument:
-i, --index=STRING          Filename for the kallisto index to be constructed

Optional argument:
-k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)
    --make-unique           Replace repeated target names with unique names

The FASTA file supplied can be either in plaintext or gzipped format.
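When index building is scripted, it can help to assemble and check the command before running it. Below is a minimal sketch that uses only the -i and -k options listed above. The helper name and the file names are hypothetical, and the function only builds the argument list without executing kallisto.

```python
def kallisto_index_cmd(index_path, fasta_files, kmer_size=31):
    """Build the argument list for `kallisto index` (does not run it)."""
    fasta_files = list(fasta_files)
    # The usage text above states that k must be odd, with a maximum of 31.
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("k-mer size must be odd and at most 31")
    if not fasta_files:
        raise ValueError("at least one FASTA file is required")
    return ["kallisto", "index", "-i", index_path, "-k", str(kmer_size)] + fasta_files


# Hypothetical file names, for illustration only.
index_cmd = kallisto_index_cmd("transcripts.idx", ["transcripts.fa.gz"])
print(" ".join(index_cmd))
```

To actually run the command, the list can be passed to `subprocess.run(index_cmd, check=True)` on a machine where kallisto is installed.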

quant

kallisto quant runs the quantification algorithm. The arguments for the quant command are:

kallisto 0.43.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout

kallisto can process either single-end or paired-end reads. The default running mode is paired-end and requires an even number of FASTQ files represented as pairs, e.g.

kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq pairB_1.fastq pairB_2.fastq

For single-end mode you supply the --single flag, as well as the -l and -s options, and list any number of FASTQ files, e.g.

kallisto quant -i index -o output --single -l 200 -s 20 file1.fastq.gz file2.fastq.gz file3.fastq.gz

FASTQ files can be either plaintext or gzipped.

Important note: only supply one sample at a time to kallisto. The multiple FASTQ (pair) option is for users who have samples that span multiple FASTQ files.

In the case of single-end reads, the -l option must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer. For paired-end reads, the average fragment length can be directly estimated from the reads and the program will do so if -l is not used (this is the preferred run mode).
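The rules above (an even number of files in paired-end mode, -l and -s in single-end mode) can be checked before launching a run. The following is a sketch under those stated rules only. The helper name is hypothetical, and it builds the argument list without executing kallisto.

```python
def kallisto_quant_cmd(index_path, out_dir, fastq_files, single=False,
                       fragment_length=None, sd=None):
    """Build the argument list for `kallisto quant` (does not run it)."""
    fastq_files = list(fastq_files)
    if not fastq_files:
        raise ValueError("at least one FASTQ file is required")
    cmd = ["kallisto", "quant", "-i", index_path, "-o", out_dir]
    if single:
        # Single-end mode: fragment length and its SD must be given.
        if fragment_length is None or sd is None:
            raise ValueError("single-end mode requires both -l and -s")
        cmd += ["--single", "-l", str(fragment_length), "-s", str(sd)]
    elif len(fastq_files) % 2 != 0:
        # Paired-end mode (the default): files must come in pairs.
        raise ValueError("paired-end mode requires an even number of FASTQ files")
    return cmd + fastq_files


paired = kallisto_quant_cmd("index", "output", ["pairA_1.fastq", "pairA_2.fastq"])
single = kallisto_quant_cmd("index", "output", ["file1.fastq.gz"],
                            single=True, fragment_length=200, sd=20)
```

Because only one sample should be supplied per run, a script would call this helper once per sample, each with its own output directory.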

The number of bootstrap samples is specified using -b. Note that because of the large amount of data that may be produced when the number of bootstrap samples is high, kallisto outputs bootstrap results in HDF5 format. The h5dump command can be used afterwards to convert this output to plaintext; however, it is most convenient to analyze bootstrap results with sleuth.
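To give a rough idea of what a bootstrap sample is, the sketch below resamples reads with replacement at the level of equivalence-class counts, using a fixed seed. This illustrates the general technique only and is not kallisto's code: the function name and the toy counts are assumptions, and the step of re-estimating abundances for each resampled data set is omitted.

```python
import random


def bootstrap_ec_counts(ec_counts, n_samples, seed=42):
    """Resample reads with replacement; return one count dict per bootstrap."""
    rng = random.Random(seed)
    classes = list(ec_counts)
    weights = [ec_counts[ec] for ec in classes]
    total_reads = sum(weights)
    samples = []
    for _ in range(n_samples):
        # Draw as many reads as were observed, proportional to the counts.
        draws = rng.choices(classes, weights=weights, k=total_reads)
        counts = {ec: 0 for ec in classes}
        for ec in draws:
            counts[ec] += 1
        samples.append(counts)
    return samples


# Hypothetical equivalence classes (tuples of transcript names) and counts.
observed = {("tx1",): 60, ("tx1", "tx2"): 30, ("tx2",): 10}
boot = bootstrap_ec_counts(observed, n_samples=3)
```

Each resampled data set has the same total number of reads as the original. The variation between the samples is what a downstream tool can use to judge how uncertain each abundance estimate is.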

kallisto quant produces three output files by default:

  • abundance.h5 is an HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information. This file can be read in by sleuth.
  • abundance.tsv is a plaintext file of the abundance estimates. It does not contain bootstrap estimates. Please use the --plaintext mode to output plaintext abundance estimates. Alternatively, kallisto h5dump can be used to convert an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, and effective length.
  • run_info.json is a JSON file containing information about the run.
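The plaintext abundance table can be read with a few lines of Python. The sketch below assumes the column names target_id, length, eff_length, est_counts and tpm (the manual text above only names the quantities, so check the header of your own file). It also recomputes TPM from the estimated counts and effective lengths as a sanity check. The two data rows are invented.

```python
import csv
import io


def read_abundance(handle):
    """Parse a tab-separated abundance table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


def recompute_tpm(rows):
    """TPM_i = 1e6 * (counts_i / eff_length_i) / sum_j (counts_j / eff_length_j)."""
    rates = [r["est_counts"] / r["eff_length"] for r in rows]
    total = sum(rates)
    return [1e6 * rate / total for rate in rates]


# Invented example data standing in for a real abundance file.
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t400\t250000\n"
    "tx2\t2000\t1800\t2700\t750000\n"
)
rows = read_abundance(example)
tpm = recompute_tpm(rows)
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` pointing at the abundance file in the output directory.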
Optional arguments
  • --bias learns parameters for a model of sequence-specific bias and corrects the abundances accordingly.

  • -t, --threads specifies the number of threads to be used both for pseudoalignment and for running the bootstrap. The default value is 1 thread. Specifying more threads than the number of bootstraps or the number of cores on your machine has no additional effect.

  • --fr-stranded runs kallisto in strand-specific mode: only fragments where the first read in the pair pseudoaligns to the forward strand of a transcript are processed. If a fragment pseudoaligns to multiple transcripts, only the transcripts that are consistent with the first read are kept.

  • --rf-stranded is the same as --fr-stranded, but the first read maps to the reverse strand of a transcript.

Pseudobam

--pseudobam outputs all pseudoalignments in SAM format to the standard output. The stream can either be redirected into a file, or converted to bam using samtools.

For example

kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq > out.sam

or by piping directly into samtools

kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq | samtools view -Sb - > out.bam

A detailed description of the SAM output is here.

pseudo

kallisto pseudo runs only the pseudoalignment step and is meant for usage in single cell RNA-seq. The arguments for the pseudo command are:

kallisto 0.43.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto pseudo [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-u  --umi                     First file in pair is a UMI file
-b  --batch=FILE              Process files listed in FILE
    --single                  Quantify single-end reads
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout

The form of the command and the meaning of the parameters are identical to the quant command. However, pseudo does not run the EM-algorithm to quantify abundances. In addition the pseudo command has an option to specify many cells in a batch file, e.g.

kallisto pseudo -i index -o output -b batch.txt

which will read information about each cell in the batch.txt file and process all cells simultaneously.

The format of the batch file is

#id file1 file2
cell1 cell1_1.fastq.gz cell1_2.fastq.gz
cell2 cell2_1.fastq.gz cell2_2.fastq.gz
cell3 cell3_1.fastq.gz cell3_2.fastq.gz
...

where the first column is the id of the cell and the next two fields are the corresponding files containing the paired-end reads. Any lines starting with # are ignored. In the case of single-end reads, specified with --single, only one file should be specified per cell.
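With many cells, the batch file is easier to generate than to write by hand. The sketch below produces the space-separated layout described above. The helper name and the cell and file names are hypothetical, and the single space separator is taken from the example rather than from a formal specification.

```python
def format_batch(cells, single=False):
    """Return batch-file text: one line per cell, id followed by its files."""
    expected = 1 if single else 2
    lines = ["#id file1" if single else "#id file1 file2"]
    for cell_id, files in cells:
        files = list(files)
        if len(files) != expected:
            raise ValueError(f"{cell_id}: expected {expected} file(s), got {len(files)}")
        lines.append(" ".join([cell_id] + files))
    return "\n".join(lines) + "\n"


# Hypothetical cells and FASTQ names.
cells = [
    ("cell1", ["cell1_1.fastq.gz", "cell1_2.fastq.gz"]),
    ("cell2", ["cell2_1.fastq.gz", "cell2_2.fastq.gz"]),
]
batch_text = format_batch(cells)
print(batch_text, end="")
```

The returned text can be written to batch.txt and passed to kallisto pseudo with -b.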

When the --umi option is specified the batch file is of the form

#id umi-file file-1
cell1 cell_1.umi cell_1.fastq.gz
cell2 cell_2.umi cell_2.fastq.gz
cell3 cell_3.umi cell_3.fastq.gz
...

where the umi-file is a text file of the form

TTACACTGAC
CCACTCTATG
CAGGAAATCG
...

listing the Unique Molecular Identifier (UMI) for each read. The order of UMIs and reads in the fastq file must match. Even though the UMI data is single end we do not require or make use of the fragment length.

When run in UMI mode kallisto will use the sequenced reads to pseudoalign and find an equivalence class, but rather than count the number of reads for each equivalence class, kallisto counts the number of distinct UMIs that pseudoalign to each equivalence class.
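The difference between counting reads and counting distinct UMIs can be shown with a small sketch. It assumes each read has already been assigned to an equivalence class, represented here as a tuple of hypothetical transcript names. The data layout is an illustration, not kallisto's internal format.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per equivalence class.

    records: iterable of (umi, equivalence_class) pairs, one per read.
    """
    umis_per_class = defaultdict(set)
    for umi, eq_class in records:
        umis_per_class[eq_class].add(umi)
    return {eq_class: len(umis) for eq_class, umis in umis_per_class.items()}


# Four reads; two of them carry the same UMI in the same class.
records = [
    ("TTACACTGAC", ("tx1",)),
    ("TTACACTGAC", ("tx1",)),  # duplicate UMI, counted once
    ("CCACTCTATG", ("tx1",)),
    ("CAGGAAATCG", ("tx1", "tx2")),
]
umi_counts = count_umis(records)
```

A plain read count would give 3 for the class `("tx1",)`, while the UMI count gives 2, because the repeated UMI is treated as the same molecule.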

h5dump

kallisto h5dump converts HDF5-formatted results to plaintext. The arguments for the h5dump command are:

kallisto 0.43.0
Converts HDF5-formatted results to plaintext

Usage:  kallisto h5dump [arguments] abundance.h5

Required argument:
-o, --output-dir=STRING       Directory to write output to

version

kallisto version displays the current version of the software.

cite

kallisto cite displays the citation for the paper.
