
wangyufeng's blog

Wishing BB health, happiness, and joy every day

 
 
 


 
 

dChip: SNP array data processing  

2011-10-17 20:30:42 | Category: Bioinformatics analysis


SNP array resources
SNP information file
Genome information file
Read in SNP data
External or Illumina SNP data
Filter SNPs
Combine array types
Probe data view
SNP data view
Export SNP data or regions

 

Besides mRNA expression-level analysis, oligonucleotide arrays have also been applied to single-nucleotide polymorphism (SNP) studies (Chee et al. 96; Wang et al. 98) and loss-of-heterozygosity (LOH) studies (Lindblad-Toh et al. 00). Please cite Lin et al. 04 if you use the dChip SNP analysis functions in your work.

 

SNP array resources

 

Affymetrix resources: Technical note on LOH and copy number analysis, CNAT software, SNP array publications

500K SNP array resources: Product page, Support materials, HapMap genotypes (CEL files, GEO, 48 samples), Copy number data

Array support materials: 10K array, 10K 2.0 array, 100K array, 500K array, 5.0 array, 6.0 array

Other SNP array analysis software from academics: CNAG, PLASQ, SNPscan

 

100K datasets from Affymetrix: 54 individuals (only genotype calls), HapMap trio dataset

Zhao et al. 2005 (Data), Garraway et al. 2005 (GEO Data); These two datasets use Early Access 100K array

Zhao et al. 2004 (Data); this dataset uses Early Access 10K array

If genotype text files are not directly downloadable from the GEO website, at each sample page, click “View full table” on the bottom, and use “File/Save as/Save as type: text file” to save to a text file (example).

 

CDF files: Early Access 10K array (unzip it), Early Access 100K array: CentHindAv2.CDF, CentXbaAv2.CDF

Mapping 10K – 500K SNP arrays: go to the “DNA analysis arrays” section of the Affymetrix library file page, download and unzip the library file for the respective array type.

 

SNP information file

 

[Use version 11/13/05+] Specify this file at “Analysis/Open group/Other information/SNP information file”. The SNP information comes from Affymetrix annotation CSV files. The fragment length information is used for filtering SNPs. The allele frequency of an ethnic group (Asian, African American or Caucasian) can be specified at “Tools/Options/Chromosome/Allele A frequency” and used in linkage analysis or allele sharing analysis. In version 5/18/06+, the allele frequency can be specified as "Compute from data" here so the arrays used in the array list file will be used to compute allele A frequency.


Unzip these files to get SNP information files: Mapping 10K 2.0, Mapping 100K, Mapping500K

 

When an allele frequency is 100% or 0% in the genome information file, it is truncated to 0.999 or 0.001, respectively, before being used in linkage analysis. An unspecified allele frequency is treated as 50% for both the A and B alleles.
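The truncation rule can be written as a small function. This is only an illustrative sketch of the rule as stated here, not dChip's own code; the function name and the use of `None` for an unspecified frequency are assumptions made for the example.

```python
from typing import Optional


def effective_allele_a_frequency(freq: Optional[float]) -> float:
    """Return the allele A frequency as it would be used in linkage analysis.

    `freq` is a fraction in [0, 1], or None when no frequency is specified
    (the None convention is an assumption of this sketch).
    """
    if freq is None:
        return 0.5  # unspecified: 50% for both the A and B alleles
    if not 0.0 <= freq <= 1.0:
        raise ValueError(f"allele frequency must be in [0, 1], got {freq}")
    if freq == 0.0:
        return 0.001  # 0% is pulled up to 0.001
    if freq == 1.0:
        return 0.999  # 100% is pulled down to 0.999
    return freq
```

Values strictly between 0 and 1 pass through unchanged, since the text only mentions truncating exact 0% and 100% frequencies.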

 

Genome information files for SNP arrays

 

To use the “Analysis/Chromosome” function to view SNP data, we need a genome information file and, optionally, RefGene and Cytoband files. The RefGene files provide the gene information. All of these files must use the same genome assembly (hg number) within one analysis session (Correspondence between NCBI build and UCSC assembly numbers).

 

Unzip these files to get genome information files or see their format:

HuSNP, Early Access 10K (ax13339 array; file names contain “10k”) and Mapping 10K (Mapping10K_Xba array, file names contain “11k”) SNP arrays: hg12.zip or hg15.zip

Mapping 10K 2.0 (hg17), Early Access 100K, Mapping 100K (hg16, Xba, Hind and combined), Mapping 100K, hg17

Early Access 500K (hg17), Mapping500K: hg17 (Nsp, Sty and combined), hg18 (use WinRAR to unzip)

If needed, the information files of two sub-arrays can be combined row-wise in a text editor when the sub-arrays are combined for analysis.
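For large files, the row-wise combination can also be scripted instead of done by hand. The sketch below assumes both information files are plain text with exactly one header line and identical columns; that layout, the function name, and the file paths are assumptions for illustration, so check them against the actual files first.

```python
from pathlib import Path


def combine_info_files(first: str, second: str, combined: str) -> int:
    """Append the data rows of `second` to those of `first`, keeping one header.

    Returns the number of data rows written to `combined`.
    """
    lines_a = Path(first).read_text().splitlines()
    lines_b = Path(second).read_text().splitlines()
    if not lines_a or not lines_b:
        raise ValueError("both files need at least a header line")
    if lines_a[0] != lines_b[0]:
        raise ValueError("header lines differ; align the columns first")
    rows = lines_a[1:] + lines_b[1:]
    Path(combined).write_text("\n".join([lines_a[0], *rows]) + "\n")
    return len(rows)
```

The header check is deliberately strict: if the two sub-array files list their columns differently, it is safer to stop than to write a file whose rows do not line up.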

 

2/12/08: SNP 6.0 and SNP 5.0 genome info files include one for the CNV probes. The two files may be combined in a text editor to view SNP and CNV probes together.

 

These files are based on the annotation CSV files downloaded from the Affymetrix product support materials site. The “Tools/Make information file” function does not work for SNP arrays because the annotation contents differ. For arrays with < 65K SNPs, one can manually copy and paste the columns in the CSV file so that it has the same header line and columns as this example genome info file, and save it in text format. We used Access to sort the rows of the genome info file into the same probe set order as the CDF file (the order can be taken from the PSI file in the Affymetrix library files, or from a file exported by "Tools/Export expression data" after “Open group”) or the external data file, so that the genome info file is read quickly. Alternatively, one may use Access to reorder the external data file to match the order of the genome info file.
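The reordering step does not require Access; any tool that can sort rows by a reference order works. Below is a minimal Python sketch of the idea. It assumes the probe set name is in the first column of each row and that the reference order is already available as a list of names (for example, read from a PSI file); both the function name and these layout details are assumptions for illustration.

```python
def reorder_rows(rows, probe_set_order):
    """Sort data rows so their probe set names follow `probe_set_order`.

    `rows` is a list of lists with the probe set name in column 0.
    Rows whose name is not in the reference order are kept at the end,
    in their original relative order (Python's sort is stable).
    """
    rank = {name: i for i, name in enumerate(probe_set_order)}
    return sorted(rows, key=lambda row: rank.get(row[0], len(rank)))
```

For example, `reorder_rows([["SNP_C", "1"], ["SNP_A", "2"]], ["SNP_A", "SNP_C"])` puts the `SNP_A` row first.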

 

[10/10/07] This Python program can be used to make dChip genome information files from recent Affy annotation CSV files of SNP arrays such as "Mapping250K_Nsp.na23.annot.csv". Download and install Python, start Python GUI, use "File/Open" to open "snp500k_info.py" (in the same directory as CSV and PSI files), modify file names in the program if necessary, and select "Run/Run Module".

 

Reading in SNP data

 

Use “Analysis/Open group” to open a group of SNP array CEL files and their paired TXT files. Make sure the Affymetrix analysis result TXT files containing the SNP genotype calls are in the same directory as the CEL files (one TXT file for each CEL file). These TXT files are exported by the GDAS software and have the same file name as the corresponding CEL file except for the TXT extension (Example file). The SNPID column must come before the Call column; there can be other columns in between. For fast processing in dChip, be sure to keep the original ordering of the SNPs (as in the CDF file) when exporting the genotype calls from GDAS. If some recent TXT files cause an error such as "Unit 'AA' not found", open them in a text editor, delete the tab or space in the 2nd row just before “SNP ID” (“      SNP ID      Call”), and then save as a text file. [Obsolete: For HuSNP array, the scan B CEL files (ending as “B.cel”) will be read, but the TXT files do not have the “B” indication (e.g. “my_chipB.cel” has accompanying file “my_chip.txt”).]
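Before running “Open group” on a large directory, it can save time to confirm that every CEL file has its paired TXT file. The helper below is a hypothetical pre-check, not part of dChip; it assumes file names are matched case-insensitively and does not handle the obsolete HuSNP “B.cel” naming.

```python
from pathlib import Path


def cel_files_missing_txt(directory: str) -> list:
    """Return the names of CEL files that have no same-named TXT file."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # Base names of all TXT files, lower-cased for case-insensitive matching.
    txt_stems = {p.stem.lower() for p in files if p.suffix.lower() == ".txt"}
    return sorted(
        p.name
        for p in files
        if p.suffix.lower() == ".cel" and p.stem.lower() not in txt_stems
    )
```

An empty list means every CEL file in the directory has a TXT file with the same base name.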

 

For the 500K SNP array, Natalie Twine observed that when .CHP files are batch exported from GTYPE to generate the .TXT files, the resulting .TXT files list the SNPs in a different order from the CDF file, which leads to a long reading time at "Open group". The solution is to manually open each .CHP file in GTYPE and export it as .TXT by clicking on the export icon.

 

Specify “Other information/CDF file”, and specify “Other information/sample information file” if needed. Then click OK to read data. You may uncheck "Analysis/Open group/Options/Load probe data in memory" to not load probe data for faster computation when array number or size is large.

 

[V5/29/06+] To read SNP CEL files with combined genotype file but no individual matching TXT/CHP genotype files, first prepare a combined genotype call file from Affymetrix genotyping software, in this format (save in text format in Excel). Make sure the column names are the CEL file names without “.CEL” extension. Then after “Open group”, use “Analysis/Get external data” and specify this genotype call file as “Data file” and check “Read SNP call file and save to DCP file”. In future sessions of dChip, “Open group” will be fine without doing “Get external data” again, since genotype calls are already in DCP files.

 

External SNP data or Illumina SNP array data

 

One can also use “Analysis/Get external data” to read in a tab-delimited text file containing SNP calls or SSLP LOH data. Each column contains the SNP call of one sample (example file). Check the “SNP or SSLP data” checkbox. If there are signal data as well (e.g. exported by “Tools/Export expression data” after signal analysis; see below) in the external data file, check the “Has both signal and SNP call” checkbox. This means there are SNP call columns after the signal value column for each sample.

[More discussion]

 

The external file may contain a SNP signal column as above, or two signal columns for two alleles (example file) as described in this paragraph. For Illumina SNP array data, this format may be used. The "Signal" column of external data file contains allele A signal and the "SE" column in external data file contains the allele B signal. The allele A/B signals may be normalized signals comparable across samples for a SNP, or computed allele-specific raw copy numbers. Save the data file in tab-delimited text format. At "Analysis/Get external data", check "Has both signal and SNP call", "Has standard error" and "SNP data", and check "Options/Model/Compute A & B allele signals for SNP array". An associated text format genome information file should be made to match the SNP IDs in the external data file. After reading in data, "Analysis/Chromosome" may be used for further analysis or visualization.

 

By default, dChip expects an external SNP genotype file to contain AA, AB and BB calls. If your data contain actual genotypes (e.g. GG, TT, AC), do the following to read them. Open your data in Excel, change the value in the first row, first column to "Mouse", and save the file in text format. This signals dChip to convert real genotypes to AA/BB/AB calls. (Discussion)

 

We may also read in and visualize SSLP LOH data. See Wang et al. 2005 for applications. Download and unzip the example data file. The data file contains NI (noninformative), HET (heterozygous or retention) and LOH calls, and has “SSLP” in the first row and first column. Check “Analysis/Get external data/SSLP data” to read it in. dChip will automatically split each column into normal and tumor samples. Make a genome information file and save in text format. Make an array list with “Standardize separators” to separate the normal and tumor pairs before doing “Analysis/Chromosome”.

 

Also see the Illumina BeadStudio plugin for outputting dChip-format data.

 

Filtering SNPs

 

[Use version 10/13/05+] We can select the better-quality SNPs for use in downstream analysis. For example, Nannya et al. 2005 used SNP fragment length and GC content to improve the signal-to-noise ratio. Specify a SNP information file at "Analysis/Open group/SNP information file". Also use an array list file to group normal and tumor samples. After "Open group", use "Analysis/Filter SNPs" to filter SNPs by fragment length and by No Call rate across samples. A conflict call between normal and tumor samples (e.g. A in normal and B or AB in tumor) is counted as a No Call. Afterwards, specify the filtered SNP list at "Analysis/Chromosome/SNP list file" to use only these SNPs in analysis.
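The filtering criteria can be illustrated with a short sketch. It is not dChip's implementation: the data layout, the `"NoCall"` label, the per-pair counting of the No Call rate, and the reading of a conflict as "normal is homozygous and the tumor call differs" are all assumptions made for this example.

```python
NO_CALL = "NoCall"  # placeholder label for a missing genotype call


def is_conflict(normal: str, tumor: str) -> bool:
    """A homozygous normal call paired with a different tumor call."""
    return normal in ("AA", "BB") and tumor in ("AA", "AB", "BB") and tumor != normal


def no_call_rate(pairs) -> float:
    """Fraction of (normal, tumor) pairs with a No Call or a conflict call."""
    if not pairs:
        raise ValueError("need at least one normal/tumor pair")
    bad = sum(
        1 for n, t in pairs if n == NO_CALL or t == NO_CALL or is_conflict(n, t)
    )
    return bad / len(pairs)


def filter_snps(snps, min_length, max_length, max_no_call_rate):
    """Keep SNP IDs whose fragment length and No Call rate are acceptable.

    `snps` maps SNP ID -> (fragment_length, list of (normal, tumor) calls).
    """
    return [
        snp_id
        for snp_id, (length, pairs) in snps.items()
        if min_length <= length <= max_length
        and no_call_rate(pairs) <= max_no_call_rate
    ]
```

Note that a heterozygous normal with a homozygous tumor (AB then AA) is not a conflict under this reading; it is the pattern expected for LOH.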

 

Combine sub-arrays or different SNP array types

 

The 100K data come from two sub-arrays, one for each restriction enzyme: XbaI and HindIII. Each array type can be analyzed to obtain signal values using “Open group”. To combine the data of the two arrays, use “Tools/Export expression value” to export both signal values and SNP calls (check “Has both signal and call” but uncheck “Has standard error”), then row-wise combine the data (See combine sub arrays). Checking “Tools/Export expression value/Append to this file” avoids manually combining the data. Finally, open the combined data file with “Analysis/Get external data” (check “Has SNP call” and “SNP data”), and use a combined genome info file. To save time, you can first look at the data of each sub-array separately. If you find that combining the data could increase the resolution for aberration regions, you can then combine the sub-arrays.

 

Combining different generations of SNP arrays is similar to combining expression data from different arrays. First manually make a common probe set file. For the Early Access 10K and Mapping10K131 arrays, the file contains ~8K common SNPs (based on a file “Template EA conversion 10K.xls”). For the Mapping10K131 and Mapping10K142 arrays, use the Affymetrix annotation CSV files (131, 142) to find matching probe sets and make a common probe set file, or just use probe sets with the same IDs. For each array type, do “Normalize” and “Model-based expression”, and then use “Tools/Export expression value” with the common probe set file to export both raw signal and SNP calls (check “Has both signal and call” but uncheck “Has standard error”). Follow the steps to column-wise combine the two exported files and save in text format. The combined file can be read at “Analysis/Get external data” (check “Has SNP call after expression” and “SNP data”) to do analysis without CEL files.

 

Combining 100K and 500K arrays may be more difficult, since the two arrays use different restriction enzymes and share fewer common SNPs. For now, one may open two dChip sessions side by side to visualize and compare the data.

 

Probe data view

 

At the PM/MM data view, the 20 probe pairs are ordered from left to right. The same ordering applies when using “Tools/Export probe set” to export probe level data.

 

On the left, the first 5 probe pairs are for PM and MM A allele, forward strand; the next 5 probe pairs are for PM and MM B allele, forward strand; the next 5 probe pairs are for PM and MM A allele, reverse strand; the last 5 probe pairs are for PM and MM B allele, reverse strand. On the right, these four probe pair sets are upper left 10 probes, lower left 10 probes, upper right 10 probes and lower right 10 probes. The middle probe pair of the 5 probe pairs has shift 0, and the others have shift from –4 to +4. For some probe sets, there are 7 probe pairs instead of 5 in each of the four sets.
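The left-to-right layout described above can be expressed as a small lookup, which may help when parsing files written by “Tools/Export probe set”. This is a sketch only: the function name is hypothetical, and applying the same set order to probe sets with 7 pairs per set is an assumption. The exact shift value of each probe pair is not derived here because the text does not list the individual shifts.

```python
# Order of the four probe pair sets, left to right, as described in the text.
PROBE_SETS = [("A", "forward"), ("B", "forward"), ("A", "reverse"), ("B", "reverse")]


def describe_probe_pair(index: int, pairs_per_set: int = 5):
    """Map a 0-based probe pair position to (allele, strand, position in set)."""
    if not 0 <= index < len(PROBE_SETS) * pairs_per_set:
        raise ValueError(f"probe pair index {index} is out of range")
    allele, strand = PROBE_SETS[index // pairs_per_set]
    return allele, strand, index % pairs_per_set
```

With the default of 5 pairs per set, positions 0 to 4 are allele A on the forward strand and positions 15 to 19 are allele B on the reverse strand.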

 

SNP data view

 

This view is useful since the Affymetrix SNP genotype calls (Di et al. 05, Liu et al. 03) can be checked with their probe level data when in question. After “Open group” finishes, a “SNP” icon will show in the left panel. Click the icon to display the SNP view:

 

 

The squares in the top panel of the SNP view represent the arrays, clustered by Principal Component Analysis (PCA) using probe-level data of a particular probe set (data courtesy of Charles Wang). Red, blue, yellow and black colors are for allele calls AA, BB, AB and “No Call”. The PCA method: for each MiniBlock i = 1 … M, compute Diff_A = max(pmA – mmA, 1) and, analogously, Diff_B = max(pmB – mmB, 1), then Ri = Diff_A / (Diff_A + Diff_B). The data of one SNP in one sample is (R1, R2, … RM). Finally, use PCA to project the S data points (for S samples) into two dimensions for visualization.
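The projection can be reproduced outside dChip to sanity-check a cluster plot. The sketch below uses NumPy and an SVD-based PCA; it is not dChip's code. It assumes the intensities are arranged as S-by-M arrays (samples by mini-blocks), that Diff_B is defined like Diff_A using pmB and mmB, and that there are at least two samples and two mini-blocks.

```python
import numpy as np


def allele_ratios(pm_a, mm_a, pm_b, mm_b):
    """Compute R_i = Diff_A / (Diff_A + Diff_B) for every sample and mini-block.

    Each argument is an array of shape (S, M): S samples, M mini-blocks.
    """
    diff_a = np.maximum(np.asarray(pm_a, float) - np.asarray(mm_a, float), 1.0)
    diff_b = np.maximum(np.asarray(pm_b, float) - np.asarray(mm_b, float), 1.0)
    return diff_a / (diff_a + diff_b)


def pca_2d(ratios):
    """Project the S rows of `ratios` onto their first two principal components."""
    ratios = np.asarray(ratios, float)
    centered = ratios - ratios.mean(axis=0)  # center each mini-block column
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Flooring each difference at 1 keeps the ratio strictly between 0 and 1 even when a mismatch probe is brighter than its perfect-match partner, so samples with AA calls sit near R = 1 and BB calls near R = 0.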

 

The bottom panel shows the probe-level data of this probe set in the currently selected array. In this example there are 4 mini-blocks (only for one strand), and each mini-block has intensity data for mmA (gray), pmA (red), pmB (blue) and mmB (gray). The intensity bars are scaled relative to the maximum intensity currently in view.

 

Use the “Home” and “End” keys to go to another marker, and the “PageUp” and “PageDown” keys to go to another array. Use the arrow keys to zoom the image, and Control+Left and Control+Right to adjust the point size. MAS SNP calls can be exported into a text file using “Tools/Export data/Expression value”. Press the “Enter” key (or use the menu “View/Next model”) multiple times to cycle through the other views of the data for the same array/probe set.

 

Export SNP data or chromosome regions

 

Use “Chromosome/Export SNP data/Data under view” to export LOH, log2, raw or inferred copy numbers (“Chromosome/Next data type, Display inferred” to switch).  The current curve values will also be exported. At the LOH data view, when paired normal and tumor samples exist, the informative and conflict call percentages of sample pairs will also be exported. Open the exported file in a text editor and go to the bottom to see these values.

 

When exporting the inferred LOH calls, the option “Options/Chromosome/Inferred LOH call threshold” can be set to convert the inferred probability of LOH to LOH calls. SNPs with Probability (LOH) > threshold will be exported as LOH, SNPs with Probability (LOH) < 1 – threshold will be exported as Retention, and all other SNPs will be exported as “No LOH Call”. If the threshold is set to –1, the probability of LOH itself will be exported. "L", "R" and "N" in the exported file represent “Loss”, “Retention” and “Noninformative/No call”.
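The thresholding rule can be summarized in a few lines of code. This is an illustrative sketch of the rule as described, with a hypothetical function name; it returns descriptive labels rather than reproducing the exact contents of the exported file.

```python
def export_loh_call(p_loh: float, threshold: float):
    """Convert an inferred LOH probability to a call using the stated rule.

    A threshold of -1 means "export the probability itself".
    """
    if threshold == -1:
        return p_loh
    if p_loh > threshold:
        return "LOH"
    if p_loh < 1 - threshold:
        return "Retention"
    return "No LOH Call"
```

For example, with a threshold of 0.95, a SNP with probability 0.97 is exported as LOH, one with probability 0.02 as Retention, and one with probability 0.5 as “No LOH Call”.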

 

 “Tools/Export expression value” exports raw signal values and SNP genotype calls. This is useful for combining array data.

 

To export interesting chromosome regions whose curve exceeds a specified threshold, first go to a data view (key ‘D’ or ‘I’) such as the inferred LOH or copy number view. Also specify “Standardize separators” in the array list file to divide different tumor samples or pairs before exporting. Then use “Chromosome/Export SNP data/Regions with significant curve value” to export. Checking “Options/Chromosome/Use min and max as threshold” will use “≤ Min or ≥ Max” as the threshold; otherwise “≥ Threshold” is used. For example, to export regions with inferred copy number values beyond [0,7], set Min = 0, Max = Threshold = 7 at “Options/Chromosome”. At the LOH data view, the LOH prevalence score across samples is used for exporting; at the inferred copy number data view, the inferred copy number in individual samples is used for exporting. If cytoband or refgene files are specified at "Analysis/Chromosome", the cytobands or genes contained in the regions will also be exported in addition to the SNP names.
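To make the two threshold modes concrete, here is a minimal sketch that scans curve values ordered along a chromosome and reports runs of consecutive SNPs that pass the test. It is not dChip's algorithm: the inclusive comparisons, the grouping of adjacent SNPs into one region, and all names are assumptions for illustration.

```python
def is_significant(value, use_min_max, min_value, max_value, threshold):
    """Apply either the min/max test or the single-threshold test."""
    if use_min_max:
        return value <= min_value or value >= max_value
    return value >= threshold


def significant_regions(values, use_min_max, min_value=0.0, max_value=0.0, threshold=0.0):
    """Return (start, end) index pairs, inclusive, of consecutive significant SNPs."""
    regions = []
    start = None
    for i, value in enumerate(values):
        if is_significant(value, use_min_max, min_value, max_value, threshold):
            if start is None:
                start = i  # a new region begins here
        elif start is not None:
            regions.append((start, i - 1))  # the region ended at the previous SNP
            start = None
    if start is not None:
        regions.append((start, len(values) - 1))
    return regions
```

With inferred copy numbers `[2, 2, 7, 8, 2, 0, 2]`, Min = 0 and Max = 7, the min/max mode reports the gain at positions 2 to 3 and the loss at position 5, while the single-threshold mode with Threshold = 7 reports only the gain.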

 (Updated 8/11/07)

via:http://biosun1.harvard.edu/complab/dchip/snp.htm