注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

wangyufeng的博客

祝愿BB 健康开心快乐每一天

 
 
 

日志

 
 

基因组分析perl脚本:get_exons.pl  

2010-11-13 16:06:08|  分类: Perl & bioperl |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

This a perl script in order that you could retrieve the full exon sequences from EnsEMBL if need be. The code is as follows:

#!/usr/local/bin/perl
## get_exons.pl – a program to retrieve all exon sequences from all genes, for a given organism
## Coded by Steve Moss
## gawbul@gmail.com
##
## Input file name should be the organism id (e.g. danio_rerio) with no file extension
## and have each gene id on seperate lines

use strict;
use Bio::EnsEMBL::Registry;
use Bio::SeqFeature::Gene::Intron;
use Time::HiRes qw(gettimeofday);

my $start_time = gettimeofday;

# crude check for input arguments
my $num_args = $#ARGV + 1;
if ($num_args == 0) {
die “No filename argument given.\n\n”;
}

# set input file as first argument
my $input_file = $ARGV[0];
# split up path at /’s and set $org_name as last array element (should make it the filename)
my @org_name = split(/\//, $input_file);
our $org_name = @org_name[-1];

# check there is no file extension and is in correct lowercase format
if (($org_name =~ m/^.*?\.\S{0,4}$/) or not ($org_name =~ m/^[a-z]+_[a-z]+$/)) {
die “Filename is malformed! Please rename to be only the organism name in the format genus_species, with no file extension and all lowercase.\n\n”;
}

# setup module access object
my $registry = ‘Bio::EnsEMBL::Registry’;

# connect with ensembl database
$registry->load_registry_from_db (
-host => ‘ensembldb.ensembl.org’,
-user => ‘anonymous’
);

# open input file and read values into an array
open(INFILE, $input_file);
our @geneids = <INFILE>;
close(INFILE);

# let us know how many genes ids have been loaded
chomp(@geneids); # get rid of white space
# let user know how many gene ids have been loaded
print “Loaded ” . scalar(@geneids) . ” gene IDs\n”;

# setup adapters to access genome information
my $slice_adaptor = $registry->get_adaptor($org_name, ‘Core’, ‘Slice’);
my $tr_adaptor = $registry->get_adaptor($org_name, ‘Core’, ‘Transcript’);

# let user know we’re starting
print “Processing…\n”;
# declare array for building output file
my @build_lines = “”;
# use this to iterate through gene ids
foreach our $geneid (@geneids) {
# setup slice to retrieve gene information
my $slice = $slice_adaptor->fetch_by_gene_stable_id($geneid);

# Fetch all of the transcripts
my $transcripts = $tr_adaptor->fetch_all_by_Slice($slice);
while ( my $tr = shift @{$transcripts} ) {
# check if strand starts at positive value and list only those
if ($tr->start() > 0) {
# Fetch all of the exons
my $exons = $tr->get_all_Exons();
while (my $exon = shift @{$exons}) {
push (@build_lines, “>” . $geneid . ‘|’ . $tr->stable_id() . ‘|’ . $exon->stable_id() . ‘|’ . $exon->start() . ‘|’ . $exon->end() . ‘|’ . $exon->length() . “\n”;
push (@build_lines, $exon->seq() . “\n”);
}
}
}
}
print “Done\n”;

# write output after loop
print “Outputting to file…\n”;
open (OUTFILE, “>>./$org_name\.exons”);
foreach my $line (@build_lines) {
print OUTFILE $line;
}
close (OUTFILE);

my $end_time = gettimeofday;
my $elapsed = $end_time – $start_time;
print “Finished in $elapsed seconds!\n”;

The script outputs a .exons file (e.g. danio_rerio.exons), which is essentially a .fasta file and can be renamed with the .fasta extension if any analysis is required in that format. The file is in the format:

> GENE_ID|TRANSCRIPT_ID|EXON_ID|EXON_START|EXON_END|EXON_LENGTH
EXON_SEQUENCE

links :http://davelunt.net/

  评论这张
 
阅读(696)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017