注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

wangyufeng的博客

祝愿BB 健康开心快乐每一天

 
 
 

日志

 
 

eQTL data analysis workflow and softwere  

2015-01-20 14:12:32|  分类: 生物信息分析 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
eQTL data analysis workflow and softwere - 喜欢吃桃子 - wangyufeng的博客
Analysis softwere: Matrix eQTL: Ultra fast eQTL analysis via large matrix operations

Key features

  • Designed for eQTL analysis of large datasets.
  • Performs testing for all or only local transcript-SNPpair.s
  • Ultra-fast, no loss of precision.
  • Equally fast for models with covariates.
  • Supports
    • Linear additive and ANOVA models. Supports testing for the effect of genotype-covariate interaction.
    • Covariates to account for sex, population structure, surrogate variables, etc.
    • Correlated and heteroskedastic errors.
    • Correction for multiple testing using FDR external link.
    • Separate p-value thresholds and FDR control for local and distant eQTLs (more info).
  • Convenient R package at CRAN Repository external link.
Performance comparison:
MethodNo covar.10 covar.
Matrix eQTL, Matlabexternal link11.811.8minutes
Matrix eQTL, Rev Rexternal link14.614.6minutes
Matrix eQTL, R+GOTOexternal link19.419.4minutes
Plink external link9.4583.3days
Merlin external link19.620.0days
R/qtl external link1.04.7days
snpMatrix external link3.25.1days
eMap external link17.8N/Adays
FastMap external link10.3N/Ahours

Info: Details of the testing procedure. 

Fact: Matrix eQTL results match those by other software. 

Comparison conducted analyzing CF dataset with 573,337 SNPs and 22,011 transcripts over 840 samples. Tested on a quad-core PC, using additive linear models with zero and with 10 covariates.

Installing Matrix eQTL

Matrix eQTL is available as an R package called MatrixEQTL and can be conveniently downloaded and installed from CRAN Repository external link with a simple R command:

? install.packages("MatrixEQTL")



Sample code and toy data set

The best way to start using Matrix eQTL is to first run the sample code on provided toy dataset. The sample code and data are part of the package and are also available atthis page. The package manual contains the sample code at the bottom of the help page for Matrix eQTL main function available via ?Matrix_eQTL_main command.

The sample code performs eQTL analysis of a toy data set consisting of three files:genotype, expression, and covariates. For every gene-SNP pair it runs linear regression analysis accounting for the set of covariates.

Let's go over the sample code line by line.

First step is to load the package:

? library("MatrixEQTL");

The toy data set files are stored with the package at the following location.

? base.dir = find.package("MatrixEQTL");

Then we set the parameters such as selected linear model and names of genotype and expression data files.

? useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS 
? SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
? expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

A separate file may be provided with extra covariates. In case of no covariates set the variable covariates_file_name to character().

? covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep=""); 
? output_file_name = tempfile();

The p-value threshold determines which gene-SNP associations are saved in the output file output_file_name. Note that for larger datasets the threshold should be lower. Setting the threshold to a high value for a large dataset may cause excessively large output files.

? pvOutputThreshold = 1e-2;

Finally, define the covariance matrix for the error term. This parameter is rarely used. If the covariance matrix is a multiple of identity, set it to numeric().

? errorCovariance = numeric();

The next section of the sample code contains three very similar parts loading the files with genotype, gene expression, and covariates. In each part one can set the file delimiter (i.e. tabulation "\t", comma ",", or space " "), the string representation for missing values, the number of rows with column labels, and the number of columns with row labels. Finally, one can change the number of the variables in a slice for the file reading procedure (do not change if not sure).

? snps = SlicedData$new();
? snps$fileDelimiter = "\t";      # the TAB character
? snps$fileOmitCharacters = "NA"; # denote missing values;
? snps$fileSkipRows = 1;          # one row of column labels
? snps$fileSkipColumns = 1;       # one column of row labels
? snps$fileSliceSize = 2000;      # read file in pieces of 2,000 rows
? snps$LoadFile( SNP_file_name );

Finally, the main Matrix eQTL function is called:

? me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);

Each significant gene-SNP association is recorded in a separate line in the output file and in the returned object me. In case of cis/trans eQTL analysis described below, two output files are produced, one with cis-eQTLs, another only with trans. Every record contains a SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR.



Cis- and trans- eQTL analysis

Matrix eQTL can distinguish local (cis-) and distant (trans-) eQTLs and perform separate correction for multiple comparisons for those groups. The second sample codeshows how to run such analysis. The main Matrix eQTL function Matrix_eQTL_main requires several extra parameters for cis/trans analysis:

  • pvOutputThreshold.cis – p-value threshold for cis-eQTLs.
  • output_file_name.cis – detected cis-eQTLs are saved in this file.
  • cisDist – maximum distance at which gene-SNP pair is considered local.
  • snpspos – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See sample SNP location file.
  • genepos – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See sample gene location file.

For more information see Matrix eQTL reference manual via command ?Matrix_eQTL_main in R or click Matrix_eQTL_main.



Analyzing your own data

To analyze your own data one simply take one of the code samples and replace references to the toy data set with those to real data. This may involve up to all five files in the toy data set:

The first three data files must have columns corresponding to samples and with one gene/SNP/covariate in each row. The columns of all three files must have matching order. All measurements must be numeric and the values in the genotype data set do not have to be descrete.



Analyzing your own LARGE data (BLAS)

Matrix eQTL is designed to handle large genotype and expression data sets. They are loaded using SlicedData classes which store the data in slices of 1000 rows (default size). The analysis is then performed for each pair of slices of genotype and expression data sets.

Matrix eQTL gains efficiency by expressing the most computationally intensive part of the calculations in terms of large matrix operations, most importantly – matrix multiplications. Thus, using a fast BLAS in R, Revolution R, or Matlab, Matrix eQTL can outperform competitors by several orders of magnitude as presented in the summary table on the main page or the manuscript external link in Bioinformatics.

Unfortunately, the standard installation of R does not include a fast BLAS. If you are not using Revolution R (on Windows) you may be using an inefficient BLAS. You can improve the performance of large matrix multiplication in R up to 20 times by using a non-standard BLAS. The necessary steps depend on the platform you are using:



More Matrix eQTL

To get the most out of Matrix eQTL package please continue to the page about thefeatures of Matrix eQTL.

  评论这张
 
阅读(1135)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017