
wangyufeng's blog

Perl programs and scripts developed by the Oklahoma University Genome Center  

2012-03-07 17:31:51 | Category: Perl & bioperl


454_base36 - perl script to convert 454/Roche base 36 universal accession number strings to/from numbers, X_Y coordinates, and timestamps. Click here for full documentation. Updated July 22, 2009
454_multi_asm - perl program to build a shell script to perform three 454 Newbler assemblies on 454 data (two assemblies with shortened reads) and copy the assembled contigs into chromat_dir with reduced quality scores for Phrap assembly. Updated August 7, 2009
all2many - perl script to split a fasta file with multiple contigs into separate fasta files.
autofish - a set of tools used to search shotgun reads for those reads that match a set of query contig sequences in order to close gaps in a sequence assembly. Autofish performs the search using hash words of the sequences. It is less sensitive than blastn, but performs alignments and rejects matches with high quality mismatches. Once the matching reads have been identified, they can be extracted from the input files and used to improve the assembled sequence. Click here for more information. New December 4, 2009
blast2many - perl script to split a blast output file with results of multiple queries into separate blast output files. Click here for full documentation.
Blast2table - A program to parse Blast output using BioPerl's Bio::Tools::Blast.pm and to write the data from each HSP in tabular form in a variety of formats. For some formats, the data may be modified to display the hit name as an HTML link to Genbank. The data optionally can be sorted in various ways. Blast2table processes a list of files named on the command line, or uses standard input if no filename is given. Output is written to standard output. Click here for full documentation. NOTE: Requires the module Bio::SearchIO.pm from BioPerl to be installed. See http://www.bioperl.org/ for more information. Updated June 1, 2009
blast_find_keywords - program to read a file of Blast hits (from Blast2table, above), an outline file containing a list of keywords, and an optional file containing EC numbers with matching names to augment the keyword list. (The EC file is the ENZYME nomenclature database, "enzyme.dat", available from "http://www.expasy.org/enzyme/".) The program uses the keyword list to select the "best" keyword for each "contig". The program also finds all other blast hits with better scores than the best hit matching any EC number or keyword. The program produces lists of the top blast hits, sorted by contig and then blast score. Once the outline file has been finalized, the report format can be requested. blast_find_keywords replaces the older programs blast_best_contigs, blast_best_keyword, blast_best_nonkeyword, blast_print_keywords, and blast_sort_keywords. Click here for full documentation.
codon_usage - A program to get codon usage of nucleotide sequences of proteins in Fasta format. For general dna sequences (not just the in-frame dna sequences of proteins) the program computes di- and tri-nucleotide usage frequencies at each base location for the forward, reverse, and/or both strands. Click here for full documentation.
copy_454_rundata - perl script to copy subsets of 454/Roche GS20/FLX run data from a run folder to a project folder. Click here for full documentation. Updated June 1, 2009
EMBL_feature_programs - a collection of Perl scripts for working with EMBL feature files. These programs are used at the University of Oklahoma Advanced Center for Genome Technology in conjunction with Artemis from the Sanger Centre.
exgap - Program for contig ordering, subclone and primer selection for primer walking and graphic display of relationships between clone read pairs in contigs. Click here for full documentation.
extract_454_paired_ends - perl script to separate 454/Roche GS20/FLX mixed paired-end and non-paired-end run data. This program is called by get_454_paired_ends. Click here for full documentation. Updated June 10, 2009
extract_fasta - perl script to extract all or a portion of a contig from a fasta sequence file, and optionally the corresponding portion of a contig from a fasta qual file. The output contig may be reversed and complemented. The output contig name may be shortened automatically. Click here for full documentation.
fastaq2phd - perl program to convert all or part of a fasta sequence file and a matching fasta quality file to a phd file. Also can be used to convert a multi-contig fasta file into a set of phd files in a directory. Click here for full documentation.
find_fasta - a perl program to search a fasta file for patterns, which may be perl-style regular expressions. The input sequences may be either nucleotide or amino acid sequences, but IUB codes to represent inexact searches are not allowed; patterns are used instead. All matches are reported. Click here for full documentation.
fix_454_rundata - perl program to work around read misnaming in SFF files for non-Titanium runs on FLX. Also used to run processing programs before or after file transfers. Click here for full documentation. New program June 1, 2009
ftp_chromats_mac - a Perl program to transfer ABI 377 chromatogram files from a Mac to a Unix host for further processing. The script automates the file transfer process and provides safeguards against overwriting existing data or resending the same data twice. ftp_chromats_mac also logs all transfers locally and on the target ftp host. Requires Mac Perl. ftp_chromats_mac.readme
ftp_chromats_windowsNT - a Perl/TK program to transfer chromatogram files collected on ABI 3700, Molecular Dynamics MegaBACE, MJGeneSys BaseStation, or SpectruMedix sequencers from a Windows 95/98/NT (or Unix) computer to a Unix host for further processing. The script automates the file transfer process and provides safeguards against overwriting existing data or resending the same data twice. ftp_chromats_windowsNT can be configured to run phred (by Brent Ewing and Phil Green at the University of Washington Genome Center) to compute base quality scores for each sample. These quality scores are displayed to give an overview of run quality before the data is transferred. ftp_chromats_windowsNT also logs all transfers locally and on the target ftp host. Requires Perl with the Perl/TK module. Also requires OUTkForms.pm. (See below.) ftp_chromats_windowsNT has been written for easy customization and is designed to integrate with sheet_writer.pl. ftp_chromats_windowsNT.readme
get_454_mids - perl script to separate 454/Roche GS20/FLX run data which uses MID tags. Click here for full documentation. Updated August 7, 2009
get_454_paired_ends - perl script to separate 454/Roche GS20/FLX mixed paired-end and non-paired-end run data and process paired-end data for use by phrap. Click here for full documentation. Updated June 1, 2009
get_454_pools - perl script to separate 454/Roche GS20/FLX tagged run data and run a series of programs on each separate set of data. Click here for full documentation. Updated August 7, 2009
get_amino_stats - a perl program to get amino acid counts of sequences in Fasta format. Click here for full documentation.
get_contig_ends - Get ends of contigs for blast searches to find matching ends for gap closure. May also be used to extract contigs longer than a minimum size, to remove leading and trailing Xs and Ns, to shorten long contig names, to reverse and complement the output contigs, and to reformat output sequence data lines to a specific number of bases per line. Click here for full documentation. Updated July 22, 2009
get_fasta_stats - Get statistics of contigs in Fasta format. A fasta quality file can also be read to give error and quality statistics. Can be used to compute mono-, di-, and tri-nucleotide frequencies. Can produce contig length histograms and 454 read statistics for Newbler assembled contigs. Click here for full documentation. Updated August 10, 2009
get_multi_fasta_stats - perl script to read a list of fasta input files and output stats about each file.
index_contigs_by_tag - perl script to read a fasta file and create index files based on a tag sequence contained at a fixed position in each contig. One index file is created for each unique tag sequence at that fixed position. Alternatively, a tag_file containing a list of desired tags may be read. Click here for full documentation. Updated July 22, 2009
match_contigs - program to match contigs for gap closure based on matching blast hits for contigs containing the same split gene. Used by match_contig_ends. Click here for full documentation.
match_contig_ends - Match ends of contigs for gap closure by performing blast searches on contig ends to look for matching split genes. Also matches contig ends against entire contigs to look for false joins. Uses get_contig_ends and match_contigs. Click here for full documentation.
maxmatch - a tool for comparing two sequences similar to what dotter does. Maxmatch uses a suffix tree data structure to find exact matches, so it is much faster, but not as sensitive as dotter. Click here for more information. New December 4, 2009
OUTkForms.pm - Perl/TK module for fill-in-the-blank menu applications. OUTkForms.pm provides a consistent appearance, using table driven input to define the form and to define simple syntax checking and error messages for the fields. OUTkForms.pm is used by ftp_chromats_windowsNT and Sheet_writer.pl. Requires Perl with the Perl/TK module. Help and customization information are included in the module. Rename OUTkForms_WindowsNT_V1.0.pm to OUTkForms.pm and place in one of the Perl @INC directories before using.
plate_reverse - program to rename samples from a sequencing plate that was loaded backwards. For a 96 well plate, well A01 is named as if it were in H12, A02 is named as if it were in H11, A03 is named as if it were in H10, ..., B01 is named as if it were in G12, ..., H12 is named as if it were in A01. For a 384 well plate, well A01 is named as if it were in P24, A02 is named as if it were in P23, A03 is named as if it were in P22, ..., B01 is named as if it were in O24, ..., P24 is named as if it were in A01. Click here for full documentation.
primer_check - perl program to check a primer request file for proper format and number of primers to be synthesized by the MerMade Oligo-nucleotide Synthesizer. Primer_check also adds control primers from a file and reports numbers and lengths of primers from each source. Click here for full documentation.
PrimOU - the University of Oklahoma version of the UT-SWMC Primo Primer Picking Program. Changes include fixing bugs and screening for uniqueness against existing known sequence in an entire project.
ReArray.exe - ReArray.exe is a self-extracting archive file for Microsoft Windows that contains a set of program and example data files for collecting scattered samples from one or more wells from various plates and putting them into one destination plate using the Beckman Biomek 2000. The Biomek uses the P20 tool to transfer up to 20uL per sample. There are two Tool Control Language programs. The program "R96_96.tcl" rearrays many 96-well microtiter plates into one 96-well microtiter plate. The program "R96_384.tcl" rearrays many 96-well microtiter plates into one 384-well microtiter plate. Click here for more information.
rename_454_reads_to_uaccno - perl script to automatically rename old 454 reads using X-Y coordinate based names to the new universal accession numbers. This function is available from sort_contigs -u. Use of this program is deprecated.
replace_454_data - A Perl program to remove assembled 454 GS20 files prefixed by "454_" from both phd_dir and chromat_dir, to run get_contig_ends for trimming Ns and Xs from the ends of the 454 contigs and for removing short contigs, and to run fastaq2phd to create new phd files and chromat placeholders using fasta sequence and quality files assembled from 454 GS20 runs. Get_contig_ends and fastaq2phd can be run automatically a second time to create duplicate phd files for the contig end sequences to force phrap to treat the 454 contigs as contigs and not allow them to become singlets. Click here for full documentation. Updated June 1, 2009
report_polyphred.pl - This script reformats polyphred output to produce three tab-delimited files that can be opened as Excel spreadsheets for further statistical examination. The first file "excel_polyphred_report.txt", contains the overall report of all SNPs (including their scores) detected for each clone. It also lists the failed clones. The second file "excel_snp_counts.txt", contains the total count for each of the ten possible SNPs at every position. Finally, the third output file "ratios.txt", contains the ratios of those bases that occurred at each SNP position.
report_prettybase.pl - This script reformats prettybase output to produce four tab-delimited files that can be opened as Excel spreadsheets for further statistical examination and a Visual Basic macro file. The first file "snp_report.txt", contains the overall report of all SNPs (including their scores) detected for each clone. It also lists the failed clones. The second file "snp_counts.txt", contains the total count for each of the ten possible SNPs at every position. The third output file "snp_ratios.txt", contains the ratios of those bases that occurred at each SNP position. The fourth file "snp_summary.txt", summarizes SNP statistics by SNP position. Finally the fifth file "snp_macro.bas" provides a Visual Basic macro for coloring cells of the "snp_report.txt" file according to SNP type. (The "snp_report.txt" file must be opened first; then the "snp_macro.bas" file should be imported into the report file through the macro menu in Excel and run from there.)
select_contigs - Select a subset of contigs from a fasta input file into a new fasta output file. A fasta quality file also can be processed. Partial contigs can be extracted. Can also automatically rename old 454 reads using X-Y coordinate based names to the new universal accession numbers. Various filtering options are available. Click here for full documentation. If contigs need to be reordered or joined together, then use sort_contigs instead. New August 7, 2009
SheetWriter - a sample sheet generating program for the Macintosh for PE/ABI 377 sequencers
sheet_writer.pl - Perl/TK program to create run sheets for the ABI 3700 capillary DNA sequencer or the MJGeneSys BaseStation 96-lane gel sequencer. Sheet_writer.pl is easily customizable and is written to provide information to ftp_chromats_windowsNT for easy file transfers without re-entry of run information. Requires Perl with the Perl/TK module. Also requires the OUTkForms.pm module. (See above.)
sort_contigs - Sort a fasta input file, alphabetically by contig name, numerically by contig name, by contig size, or according to a file giving contig order. The output can be a set of ordered contigs, or a single joined contig with separator strings. A fasta quality file also can be processed. Contigs can be reversed and complemented, and partial contigs can be extracted. Can also automatically rename old 454 reads using X-Y coordinate based names to the new universal accession numbers. Various filtering options are available. If neither reordering of contigs nor joining contigs is required, then select_contigs may be a better choice, especially for large input files, because it uses much less memory. Click here for full documentation. Updated August 7, 2009
split_454_pools - perl script to separate 454/Roche GS20/FLX tagged run data and run a series of programs on each separate set of data. This program is called by get_454_pools. Click here for full documentation. Updated August 7, 2009
stream_file_extract - splits an input file into multiple output text files. A file name prefix may be used, along with the names from the input file. Data for files to be created may be embedded within data to be passed to Standard Output. This was written to be able to pass multiple files back from an rsh command. The program stream_file_insert may be used to create the input file containing the embedded files to be extracted. Click here for full documentation.
stream_file_insert - inserts one or more text files into the Standard Output stream for later extraction by stream_file_extract. A file name prefix may be removed before embedding the names into the output. This was written to be able to pass multiple files back from an rsh command. Only text files are supported. Click here for full documentation.
unique_contigs - Read a fasta input file and output the set of unique contigs. The program screens for both duplicate contig names and duplicate sequences. Click here for full documentation.
zap_file.pl - A program to perform a global string substitution on a binary file. The full pathname to the file may be specified, or the program will find a file in the Perl @INC include libraries if the Perl style relative path is known. Click here for full documentation.
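
The well renaming described for plate_reverse amounts to rotating the plate by 180 degrees: the row order and the column order are both reversed. The following is a minimal Python sketch of that mapping, not the original Perl program. The function name `reverse_well` and the `plate_size` parameter are hypothetical names chosen for illustration, and the zero-padded two-digit column format is assumed from the examples in the listing (A01, H12, P24).

```python
import string

# Plate geometries taken from the plate_reverse description: (rows, columns).
PLATE_LAYOUTS = {96: (8, 12), 384: (16, 24)}


def reverse_well(well: str, plate_size: int = 96) -> str:
    """Return the well name a sample gets when the plate was loaded backwards.

    Hypothetical helper: reverses both the row letter and the column number,
    so that A01 maps to H12 on a 96-well plate and to P24 on a 384-well plate.
    """
    rows, cols = PLATE_LAYOUTS[plate_size]
    row_letters = string.ascii_uppercase[:rows]  # "A".."H" or "A".."P"

    row = row_letters.index(well[0].upper())  # raises ValueError for a bad row
    col = int(well[1:])
    if not 1 <= col <= cols:
        raise ValueError(f"column {col} is outside 1..{cols}")

    new_row = row_letters[rows - 1 - row]  # first row <-> last row
    new_col = cols + 1 - col               # column 1 <-> last column
    return f"{new_row}{new_col:02d}"


if __name__ == "__main__":
    for name in ("A01", "A02", "B01", "H12"):
        print(name, "->", reverse_well(name, 96))
```

Applying the function twice returns the original well name, which is a convenient sanity check when renaming a whole plate.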

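Several of the tools listed above (extract_fasta, get_contig_ends, sort_contigs) mention trimming leading and trailing Xs and Ns, reversing and complementing a contig, and reformatting sequence lines to a fixed number of bases per line. The sketch below illustrates those three operations in Python under simple assumptions: only the bases A, C, G, T, N, and X are complemented, and the function names and the default line width of 60 are hypothetical, not taken from the original programs.

```python
# Complement table for plain nucleotide codes; N and X map to themselves.
# Other IUB ambiguity codes are left unchanged in this simplified sketch.
_COMPLEMENT = str.maketrans("ACGTNXacgtnx", "TGCANXtgcanx")


def trim_ends(seq: str) -> str:
    """Remove leading and trailing N and X characters (either case)."""
    return seq.strip("NnXx")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def format_fasta(name: str, seq: str, width: int = 60) -> str:
    """Return one fasta record with at most `width` bases per sequence line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    raw = "NNXACGTTGCAXN"
    cleaned = trim_ends(raw)
    print(format_fasta("contig1.rc", reverse_complement(cleaned), width=5), end="")
```

Interior N and X characters are kept on purpose, since only the contig ends are trimmed in the description above.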

 



A listing of other programs and scripts developed by our informatics group to aid contig assembly, proofreading, and data quality checking.
We are working on making these programs and scripts available by links from this page.
automake - perl script to manually initiate autophrap for a single project
autophrap.pl - master perl script with a make file to automatically assemble projects containing new data at both high and low stringency, to select extension and coverage primers for projects nearing completion, to compare high and low stringency assemblies, to produce update files for Genbank submission, and to update the Web assembly report. autophrap.pl executes several other scripts and programs. This script is automatically run each hour by a cron job. Each project is updated automatically once new sequence data is added
autorep.pl - produce a report about the automatic assembly performed by autophrap.pl
autosubmit.pl - generate asn.1 format Genbank submissions automatically using fa2htgs
autotroll.pl - check for filenames containing spaces (messes up phrap)
biggest_bases.pl - find the largest contigs in a fasta format file
changex2n.in - editor scripts for changing phrap x bases to n
check_consensus.pl - compare the latest assembly consensus with the most recent Genbank submission
chromat_check.pl - check chromat_dirs for stupid file names containing spaces etc.
contigpairs.pl - parse rel.out file into pairs of contigs with spanning clones
contigs2tbl - compute table of contig sizes for Web report
dsgaps.pl - locate DS-Gap regions in the phrap.out file
dsregions.pl - locate DS-Gap regions in the phrap.out file and output Primo format
find_latest.pl - a tool used by web_tabler.pl
forcescreen.pl - force ecoli screen on phrap data bases
initialize.pl - required by autosubmit.pl
lastbase.in - editor scripts for postprimo processing
makereadme - create, edit, or print 00readme.txt file for a project.
make_autoqual.html - a perl script to produce a tabular html listing of the number of reads with phred20 scores for each project
overlap.pl - look up two projects in automaster.list and crossmatch to compute overlap and unique regions
overlaps.pl - look up adjacent entries in automaster.list and crossmatch to compute overlap and unique regions
postprimoc.pl - convert primo output to mermade output for large-insert clones
postprimos.pl - convert primo output to mermade output with sub-clones information
print_unique - perl program to extract key and value fields from each record of a file, compute a single output value from the records with the same key, and output a single record for each key with the key and value.
printreadme.pl - print 00readme.txt files for all projects
printrev_phrap - modified Jeremy Parson's printrev program to use phrap output (via the rel.out file from relationships_phrap), to print the graphic output across multiple pages and to order contigs based on forward/reverse clone linkage information.
qualreport.pl - checks the phred20 bases, pUC vector only reads and bacterial host genomic contamination for each sequencing gel
readme_check.pl - check readme files for various strings
relationships_phrap - perl program to extract read information for all contigs in a phrap .ace file. The file produced is named rel.out.
total_bases.pl - compute the number of bases in a fasta format file
web_tabler.pl - build the status tables with accession numbers and web hot links
weekly_bacterial - perl program to automate weekly Blast searches of bacterial sequencing projects for Web searches of Blast results.
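
Two of the simpler utilities in this second list, total_bases.pl and biggest_bases.pl, are described as counting the bases in a fasta file and finding its largest contigs. The Python sketch below shows one way such calculations could be done. It is an illustration under assumptions, not the original scripts: the function names are hypothetical, the contig name is assumed to be the first whitespace-delimited word of the header line, and every non-header character is counted as a base.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a fasta file."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def total_bases(records: Iterable[Tuple[str, str]]) -> int:
    """Sum the sequence lengths of all records."""
    return sum(len(seq) for _, seq in records)


def biggest_contigs(records: Iterable[Tuple[str, str]], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n longest contigs as (name, length), longest first."""
    sizes = [(name, len(seq)) for name, seq in records]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    example = ">c1 first contig\nACGT\nAC\n>c2\nACGTACGT\n>c3\nA\n".splitlines()
    records = list(read_fasta(example))
    print("total bases:", total_bases(records))
    print("largest contigs:", biggest_contigs(records, n=2))
```

For a real file, the lines of an open file handle can be passed to `read_fasta` directly, which keeps only one contig in memory at a time when computing totals.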

via:http://www.genome.ou.edu/informatics.html
