注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

wangyufeng的博客

祝愿BB 健康开心快乐每一天

 
 
 

日志

 
 

基因组分析perl脚本:get_geneids.pl  

2010-11-13 16:13:28|  分类: Perl & bioperl |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

     I had  previously been using the BioMart website (http://www.ensembl.org/biomart/martview) to download Gene IDs as follows:

  • Select Ensembl Genes 58 from - CHOOSE DATABASE -
  • Select the required organism (e.g. Danio rerio genes (Zv8)) from - CHOOSE DATASET -
  • Click the Attributes link in the left hand navigation column
  • Ensure Features is selected
  • Click the + next to GENE:
  • Under Ensembl ensure only Ensembl Gene ID is selected
  • Click the Count button in the top left hand corner (you should get 28, 509 for this release)
  • Click the Results button in the top left hand corner
  • In the Export all results to row, ensure File is in the first drop down box and CSV is in the second
  • Click the Go button
  • The browser will download an export_mart.txt file, which you should rename to danio_rerio (or whatever you prefer)

This is quite a quick process, but can be even quicker using an automated tool. I therefore developed the following code:

#!/usr/local/bin/perl
## get_geneids.pl – get list of gene ids for an organism
## Coded by Steve Moss
## gawbul@gmail.com
##
## Input file name should be the organism id (e.g. danio_rerio) with no file extension
## and have each gene id on seperate lines

use strict;
use Bio::EnsEMBL::Registry;
use Time::HiRes qw(gettimeofday);

my $start_time = gettimeofday;

# crude check for input arguments
my $num_args = $#ARGV + 1;
if ($num_args == 0) {
die “No organism argument given.\n\n”;
}

# set input file as first argument
my $input_arg = $ARGV[0];
# split up path at /’s and set $org_name as last array element (should make it the filename)
my @org_name = split(/\//, $input_arg);
our $org_name = @org_name[-1];

# check there is no file extension and is in correct lowercase format
if (($org_name =~ m/^.*?\.\S{0,4}$/) or not ($org_name =~ m/^[a-z]+_[a-z]+$/)) {
die “Organism is malformed! Please rename to be only the organism name in the format genus_species, with no extension and all lowercase.\n\n”;
}

# setup module access object
my $registry = ‘Bio::EnsEMBL::Registry’;

# connect with ensembl database (port 3306 is up to version 47, 5306 is up to version 58)
# need to use earlier API for older versions – checkout from CVS
Bio::EnsEMBL::Registry->load_registry_from_db(
-host => ‘ensembldb.ensembl.org’,
-user => ‘anonymous’,
-verbose => “0″);

# setup gene adaptor and connect to database
my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor($org_name, “core”, “gene”);
my @dbas = @{Bio::EnsEMBL::Registry->get_all_DBAdaptors(-species => $org_name)};
my $dbname = $dbas[0]->dbc->dbname();

# let user know we’re starting
print “Processing data from $dbname…\n”;
my @build_lines = “”;
# go through each gene from gene adaptor and fetch all gene ids
foreach my $gene (@{$gene_adaptor->fetch_all()})
{
push(@build_lines, $gene->stable_id() .”\n”);
}
my $loaded = scalar(@build_lines);
print “Loaded ” . ($loaded – 1) . ” gene IDs\n”;

# write output after loop
print “Outputting to file…\n”;
open (OUTFILE, “>>./$org_name”);
foreach my $line (@build_lines) {
print OUTFILE $line;
}
close (OUTFILE);

my $end_time = gettimeofday;
my $elapsed = $end_time – $start_time;
print “Finished in $elapsed seconds!\n”;

This script takes the organism name in its lower case latin form separated by an underscore rather than a space (e.g. danio_rerio). It then downloads all the Gene IDs for the organism for which the $gene->stable_id() is the Gene ID in the code. The number of Gene IDs downloaded matched the manual method. This created the files danio_rerio, gasterosteus_aculeatus, oryzias_latipes, takifugu_rubripes and tetraodon_nigroviridis, which I placed in a Gene_IDs sub-directory and which contained a single Gene ID on each line of the file.

links:http://davelunt.net/


  评论这张
 
阅读(861)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017