### ### This file contains all the subroutine definitions to calculate PSCN and ASCN for affy's mapping 10K 142 SNP data. ### I have been trying to avoid any unnecessary changes in order not to confuse the users. ### All functions and data sets have the same name as they were in PLASQ100K, except a number 2 put ### in tail to distinguish them. However, a few "necessary" changes are made here to improve the efficiency of the algorithm. ### They are listed here: ### ### 1. I feel that the pscn algorithm is too stringent in finding abnormal chromosome segments from calculated ### raw copy numbers. Some areas have uniformly smaller or larger raw copy numbers than normal two, ### however, they are not picked up by the algorithm. Contamination of tumor DNA with normal DNA is the possible reason, ### as was pointed too in your paper. Therefore, I allow the algorithm to consider this by specifying a contamination ratio p. ### The effect of a nonzero p is equivalent to inflate the raw copy numbers away from normal 2, so that ### the subsequent algorithm is more likely able to pick these segments out. ### ### 2. I avoided a 'FOR' loop in the reading step of CEL files to make the algorithm a little faster. See functions ### celExtNoNorm2, celExtNoNorm2 etc. ### ### 3. I also allow the CEL file locations to be specified by a path, instead of a folder. ### This way, we allow our CEL files to be located in other far away place in the system, ### instead, they have to be in the same close directory. ### (for example, if one prefer to put all his CEL data in some common place.) ### ### 4. The SNP information and probe information data for 10K SNP chip are named ### as SNPinfo2 and SnpProbe2, as their counterpart SNPinfo and SnpProbe in PLASQ100K. ### ### ### All the functions in this package have been tested on my data set. ### They hold together independently from the PALSQ100K package. However, I leave it to your decision on ### how to run them along with your PLASQ100K package. ### ### I tried to supply the explanation of each function as ### much as I feel it necessary, except for functions with no changes of anything, ### in hope to make you a little easier, to make them into the final unified R package. ### ### ### If you have any questions, please let me know. I apology for the delay. ### I have been trying to test it, and then to wrap up the result from applying the algorithm to our data sets ### and then to compare this with other algorithms (e.g., CNAT/DCHIPSNP). ### I believe this algorithm produce more details about the genome alterations than other does. ### ### Best, ### ### Jacob ### ######################################################################### ## Usage: ## PSCN<-pscn2(TumorFolder="path to the Tumor CEL file folder", ## CelNormFolder="path to the Normal CEL file folder", ## computeBetas=TRUE, ## normFile="tmp", ## betasFile="betas.Rdata", ## rawCNfile="rawCN.Rdata", ...) ## save(PSCN, file="PSCN.Rdata") ## ## ASCN<-ascn2(rawCN, pscn) ################################################################################## ## Input: ## TumorFolder and CelNormFolder: folders for tumor and normal CEL files. ## These CEL files should be normalized as whole before inputting here by other program, say, DchipSNP. ## computeBetas, normFile, betasFile, rawCNfile: same as PLASQ 100K. ## ...: other parameters. especially, ## p: for proportion of contaminated normal DNA in tumor DNA. default: 0.15. ## Output: ## A matrix of size N by (2K+1),whose first column is the SNP ID and ## the next 2K columns are major and minor PSCN copy numbers, each of the K samples has two columns. ## N is the number of SNPs considered. ############################################################################## #################### ## Reading CEL file routine. #################################################################################### ## The following changes have been made in functions dealing with reading CEL files: ## 1. Parameters changed in the function call to accommodate the 10K chip data. ## 2. Used a subroutine Read10K2 to read 10K2 SNP data sets. ## This can handle cases of affy's original version 3 CEL format (5 columns) ## or CEL files output from DchipSNP (3 columns in the CEL file). ## Also reserved for case when there are 4 columns in the CEL file. ## 3. Used a more efficient way ## to obtain probe signals for probes ## in SnpProbes by getting rid of a FOR loop. See the following line ## inside each of the functions ## CelNorm<-tmp[SnpProbe2[,3]*658+SnpProbe2[, 2]+1, ] ## 4. A folder with path can be specified to read Normal and Tumor CEL files from (to increase flexibility). ###################################################################################### ####################################################################################### ###Extract probe-level data from 10k .cel files without normalizing ###Description: ### This function takes a directory of 10K normal .cel files, extracts ### the probe-level-data, and ### combines each into a matrix along with probe information. ###Usage: ### celExtNoNorm2(NormalFolder) ###Arguments: ### NormalFolder: Directory containing Normal .cel files (IN ASCII FORMAT!). ###Details: ### The function first reads in each file, then ### places the data into a vector whose order is the same as the rows ### of SnpProbe2. Finally, the probe sets are placed in ### the same order that they appear in SNPinfo2 (i.e. in order along ### the genome). The .cel files MUST be in ASCII ### format, not binary! A conversion tool is available at the ### Affymetrix website http://www.affymetrix.com. ###Value: ### CelNorm: A matrix of all relevant normalized probe intensities and ### probe characteristics. The rows are the ### probes. The first five columns are: SNPID, followed by 0/1 ### indicators for whether the probe is PM, A, centered, and ### forward strand. The remaining columns are the samples, in the ### order in which they appear in the directories. ### ###Author(s): ### Jacob Zhang jacob.zhang@vai.org and ### Thomas LaFramboise tlafram@broad.mit.edu ################################################################################### celExtNoNorm2<-function(NormalFolder) { Read10K2<-function(file){ see <- scan(file, what = "character", nlines = 40, sep = "\t") inds <- grep("]", see) lastBad <- inds[length(inds)]+4 first0<-lastBad+grep("0", see[lastBad:(lastBad+10)])[1] ncols<-which(as.numeric(see[first0:(first0+5)])==1)[1] if(!ncols%in% 3:5) stop("Cehck your CEL file format!") else tmp <- scan(file, what = if(ncols==3) list(NULL, NULL, "numeric") else if(ncols==4) list(NULL, NULL, "numeric",NULL) else if(ncols==5) list(NULL, NULL, "numeric",NULL, NULL), skip = lastBad, nmax = 432964, sep = "\t") tmp<-as.numeric(tmp[[3]]) return(tmp) } Samps <- dir(NormalFolder) tmp<-NULL for(i in Samps) tmp<-cbind(tmp, Read10K2(paste(NormalFolder, i, sep="/") )) data(SnpProbe2) CelNorm<-tmp[SnpProbe2[,3]*658+SnpProbe2[, 2]+1, ] CelNorm <- cbind(apply(SnpProbe2[, -(2:3)], 2, as.integer), CelNorm) rm(tmp) rownames(CelNorm)<-rep("", nrow(CelNorm)) rm(SnpProbe2, inherits=TRUE) data(SNPinfo2) snps <- intersect(SNPinfo2[, 1], CelNorm[, 1]) SNPinfo2 <- SNPinfo2[SNPinfo2[, 1] %in% snps, ] CelNorm <- CelNorm[CelNorm[, 1] %in% snps, ] rm(snps) CelNorm <- CelNorm[as.vector(t(outer(match(SNPinfo2[, 1], CelNorm[, 1]), 0:39, "+"))), ] colnames(CelNorm)[6:(5+length(Samps))] <-Samps rm(SNPinfo2, inherits=TRUE) return(CelNorm) } ###################################################################################### ####celExtract2 package:PLASQ R Documentation ####Extract and normalize probe-level data from 10k .cel files ####Description: #### This function takes a directory of Tumor .cel files, ### extracts the probe-level-data from each, #### normalizes each to the normal samples using quintile #### normalization (Bolstad et al. 2003), and combines all normalized #### probe-level data into a table. ####Usage: #### celExtract2(TumorFolder, normFile) ####Arguments: #### TumorFolder: Directory containing Tumor .cel files (IN ASCII FORMAT!). #### normFile: A file containing an object called norm, of the type created #### by 'celExtNoNorm2', which is a matrix of #### UNNORMALIZED Tumor probe values. ####Details: #### The function first reads in each file, then #### places the data into a vector whose order is the same as the rows #### of SnpProbe2. The .cel files MUST be in ASCII format, #### not binary! Next, the vector is quantile normalized to the #### reference data contained in the object norm. Finally, the probe #### sets are placed in the same order that they appear in SNPinfo2 #### (i.e. in order along the genome). ####Value: #### A matrix of all relevant normalized probe intensities and probe #### characteristics. The rows are the probes. The #### first five columns are: SNPID, followed by 0/1 indicators for #### whether the probe is PM, A, centered, and forward strand. The #### remaining columns are the samples, in the order in which they #### appear in the directories. ####Author(s): #### Jacob Zhang jacob.zhang@vai.org and #### Thomas LaFramboise tlafram@broad.mit.edu ########################################################################################## celExtract2<-function(TumorFolder, normFile) { Read10K2<-function(file){ see <- scan(file, what = "character", nlines = 40, sep = "\t") inds <- grep("]", see) lastBad <- inds[length(inds)]+4 first0<-lastBad+grep("0", see[lastBad:(lastBad+10)])[1] ncols<-which(as.numeric(see[first0:(first0+5)])==1)[1] if(!ncols%in% 3:5) stop("Cehck your CEL file format!") else tmp <- scan(file, what = if(ncols==3) list(NULL, NULL, "numeric") else if(ncols==4) list(NULL, NULL, "numeric",NULL) else if(ncols==5) list(NULL, NULL, "numeric",NULL, NULL), skip = lastBad, nmax = 432964, sep = "\t") tmp<-as.numeric(tmp[[3]]) return(tmp) } Samps <- dir(TumorFolder) tmp<-NULL for(i in Samps) tmp<-cbind(tmp, Read10K2( paste(TumorFolder, i, sep="/") )) data(SnpProbe2) CelTumor<-tmp[SnpProbe2[,3]*658+SnpProbe2[, 2]+1, ] CelTumor <- cbind(apply(SnpProbe2[, -(2:3)], 2, as.integer), CelTumor) rownames(CelTumor)<-rep("", nrow( CelTumor)) colnames(CelTumor)<-c(colnames(SnpProbe2[, -(2:3)]), Samps) rm(tmp) rm(SnpProbe2, inherits=TRUE) data(SNPinfo2) snps <- intersect(SNPinfo2[, 1], CelTumor[, 1]) CelTumor <- CelTumor[CelTumor[, 1] %in% snps, ] dats<-NULL load(normFile) for(i in 1:length(Samps)){ print(paste("Normalizing ...",Samps[i])) dats<-cbind(dats,normalize.quantiles(cbind(CelTumor[, i+5], norm[, -(1:5)])) [, 1]) } dats<-cbind(CelTumor[, 1:5], dats) rm(norm) dats<-dats[as.vector(t(outer(match(SNPinfo2[, 1], dats[, 1]), 0:39, "+"))), ] colnames(dats)<- colnames(CelTumor) rm(CelTumor) rm(SNPinfo2, inherits=TRUE) return(dats) } #################################################################################### ###celExtNorm2 package:PLASQ R Documentation ###Extract and normalize probe-level data from 10k .cel files ###Description: ### This function takes a directory of Normal .cel files ### extracts the probe-level-data, ### normalizes them using quantile normalization (Bolstad et al. 2003). ###Usage: ### celExtNorm2(NormalFolder) ###Arguments: ### NormalFolder: Directory containing Normal .cel files (IN ASCII FORMAT!). ###Details: ### The function first reads in each file, then ### places the data into a vector whose order is the same as the rows ### of SnpProbe. The .cel files MUST be in ASCII format, ### not binary! Next, all vectors ### are normalized and put into a matrix with probe ### information. Finally, the ### probe sets are placed in the same order that they appear in ### SNPinfo2 (i.e. in order along the genome). ###Value: ### A matrix of all relevant normalized probe intensities and probe ### characteristics . The rows are the probes. The ### first five columns are: SNPID, followed by 0/1 indicators for ### whether the probe is PM, A, centered, and forward strand. The ### remaining columns are the samples, in the order in which they ### appear in the directories. ###Author(s): ### Jacob Zhang jacob.zhang@vai.org ### Thomas LaFramboise tlafram@broad.mit.edu ###See Also: ### 'SNPinfo2', 'celExtNoNorm2', 'celExtract2', 'getBetas2', 'findGeno2' ##################################################################################### celExtNorm2<-function(NormalFolder) { Read10K2<-function(file){ see <- scan(file, what = "character", nlines = 40, sep = "\t") inds <- grep("]", see) lastBad <- inds[length(inds)]+4 first0<-lastBad+grep("0", see[lastBad:(lastBad+10)])[1] ncols<-which(as.numeric(see[first0:(first0+5)])==1)[1] if(!ncols%in% 3:5) stop("Cehck your CEL file format!") else tmp <- scan(file, what = if(ncols==3) list(NULL, NULL, "numeric") else if(ncols==4) list(NULL, NULL, "numeric",NULL) else if(ncols==5) list(NULL, NULL, "numeric",NULL, NULL), skip = lastBad, nmax = 432964, sep = "\t") tmp<-as.numeric(tmp[[3]]) return(tmp) } Samps <- dir(NormalFolder) tmp<-NULL for(i in Samps) tmp<-cbind(tmp, Read10K2( paste(NormalFolder,i,sep="/"))) data(SnpProbe2) CelNorm<-tmp[SnpProbe2[,3]*658+SnpProbe2[, 2]+1, ] CelNorm <- cbind(apply(SnpProbe2[, -(2:3)], 2, as.integer), CelNorm) rm(tmp) rownames(CelNorm)<-rep("", nrow(CelNorm)) rm(SnpProbe2, inherits = T) print("Normalizing") CelNorm[,-(1:5)] <- normalize.quantiles(CelNorm[,-(1:5)]) data(SNPinfo2) snps <- intersect(SNPinfo2[, 1], CelNorm[, 1]) SNPinfo2 <- SNPinfo2[SNPinfo2[, 1] %in% snps, ] CelNorm <- CelNorm[CelNorm[, 1] %in% snps, ] rm(snps) CelNorm <- CelNorm[as.vector(t(outer(match(SNPinfo2[, 1], CelNorm[, 1]), 0:39, "+"))), ] colnames(CelNorm)[6:(5+length(Samps))] <-Samps rm(SNPinfo2, inherits = T) return(CelNorm) } ##################################################################################### ## No change is made for this function as with getBetas() in PLASQ100K, ## except the function call EMSNP, which has been changed to EMSNP2, ## Probably, we can just use original EMSNP function to reduce redundancy; ## it is up to your decision. ######################################################################################## getBetas2<-function(Norm) { Norm[, -(1:5)] <- normalize.quantiles(Norm[, -(1:5)]) findOne <- function(mat) { tmp <- mat[, -1] em <- try(EMSNP2(tmp), silent = T) if (inherits(em, "try-error")) { return(rep(NA, 11)) } else { betasF <- em$coeffsF betasR <- em$coeffsR return(c(mat[1, 1], betasF, betasR)) } } rslt <- matrix(nrow = 0, ncol = 11) colnames(rslt) <- c("SNPID", "G_F", "A_0F", "B_0F", "A_1F", "B_1F", "G_R", "A_0R", "B_0R", "A_1R", "B_1R") for (i in 1:(dim(Norm)[1]/40)) { print(paste("Analyzing ", i, "th", " SNP", sep = ""), quote = F) rslt <- rbind(rslt, findOne(Norm[(40 * i - 39):(40 * i), ])) } rm(Norm) rslt <- apply(rslt, 2, function(vec) { vec[vec < 0] <- 0 return(vec) }) rslt <- rslt[!is.na(rslt[, 2]), ] rslt <- t(apply(rslt, 1, function(x) { if (prod(x[2:4], na.rm = T) == 0) { x[2:6] <- NA } if (prod(x[7:9], na.rm = T) == 0) { x[7:11] <- NA } return(x) })) return(rslt) } ######################################################################################## ## No changes have been made for this function. ######################################################################################## getRaw2<-function(Intmat, params = NULL) { gaussian <- function(link = "exponential") { linktemp <- substitute(link) if (!is.character(linktemp)) { linktemp <- deparse(linktemp) if (linktemp == "link") linktemp <- eval(link) } if (linktemp == "exponential") stats <- list(linkfun = function(mu) exp(mu), linkinv = function(eta) log(eta), mu.eta = function(eta) 1/eta, valideta = function(eta) all(eta > 0)) structure(list(family = "gaussian", link = linktemp, linkfun = stats$linkfun, linkinv = stats$linkinv, variance = function(mu) rep.int(1, length(mu)), dev.resids = function(y, mu, wt) wt * ((y - mu)^2), aic = function(y, n, mu, wt, dev) sum(wt) * (log(dev/sum(wt) * 2 * pi) + 1) + 2, mu.eta = stats$mu.eta, initialize = expression({ n <- rep.int(1, nobs) mustart <- y }), validmu = function(mu) TRUE), class = "family") } if (is.null(params)) { data(betas) } else betas <- params SNPs <- intersect(Intmat[, 1], betas[, 1]) nSNP <- length(SNPs) samps <- colnames(Intmat)[-(1:5)] nSamps <- dim(Intmat)[2] - 5 Intmat <- Intmat[Intmat[, 1] %in% SNPs, ] ys <- log(as.matrix(Intmat[, -(1:5)])) types <- Intmat[, 2:5] betasMat <- betas[match(Intmat[, 1], betas[, 1]), -1] pma <- which(types[, 1] & types[, 2]) mmc <- which(!types[, 1] & types[, 3]) pmb <- which(types[, 1] & !types[, 2]) mmoa <- which(!types[, 1] & types[, 2] & !types[, 3]) mmob <- which(!types[, 1] & !types[, 2] & !types[, 3]) Find <- as.logical(types[, 4]) For <- which(as.logical(types[, 4])) Rev <- which(!as.logical(types[, 4])) pmaf <- intersect(pma, For) pmbf <- intersect(pmb, For) mmcf <- intersect(mmc, For) mmoaf <- intersect(mmoa, For) mmobf <- intersect(mmob, For) pmar <- intersect(pma, Rev) pmbr <- intersect(pmb, Rev) mmcr <- intersect(mmc, Rev) mmoar <- intersect(mmoa, Rev) mmobr <- intersect(mmob, Rev) X <- matrix(0, nrow = dim(Intmat)[1], ncol = 2) X[pmaf, 1] <- betasMat[pmaf, 2] X[pmaf, 2] <- betasMat[pmaf, 5] X[pmbf, 1] <- betasMat[pmbf, 4] X[pmbf, 2] <- betasMat[pmbf, 3] X[mmcf, 1] <- betasMat[mmcf, 4] X[mmcf, 2] <- betasMat[mmcf, 5] X[mmoaf, 1] <- betasMat[mmoaf, 4] X[mmobf, 2] <- betasMat[mmobf, 5] X[pmar, 1] <- betasMat[pmar, 7] X[pmar, 2] <- betasMat[pmar, 10] X[pmbr, 1] <- betasMat[pmbr, 9] X[pmbr, 2] <- betasMat[pmbr, 8] X[mmcr, 1] <- betasMat[mmcr, 9] X[mmcr, 2] <- betasMat[mmcr, 10] X[mmoar, 1] <- betasMat[mmoar, 9] X[mmobr, 2] <- betasMat[mmobr, 10] alphaF <- (betasMat[, 1] * Find) alphaR <- (betasMat[, 6] * (!Find)) alpha <- alphaF + alphaR CNTmat <- matrix(nrow = 0, ncol = 2 * nSamps) for (i in 1:nSNP) { print(paste("Analyzing SNP", i), quote = F) inds <- (40 * i - 39):(40 * i) outpt <- as.matrix(ys[inds, ]) cvrtA <- X[inds, 1] cvrtB <- X[inds, 2] nxt <- as.numeric(vector(length = 0)) for (j in 1:nSamps) { if (prod(is.na(cvrtA)) | prod(is.na(cvrtB)) | prod(is.na(alpha[inds])) | prod(is.na(outpt[, j]))) { nxt <- c(nxt, NA, NA) } else { model <- try(glm(outpt[, j] ~ 0 + cvrtA + cvrtB +offset(alpha[inds]), family = gaussian("exponential"), na.action = na.omit), silent=TRUE) if (inherits(model, "try-error")) { nxt <- c(nxt, NA, NA) } else { nxt <- c(nxt, as.vector(coef(model))) } } } CNTmat <- rbind(CNTmat, nxt) } CNTmat <- cbind(SNPs, CNTmat) colnames(CNTmat) <- c("SNPID", as.vector(t(outer(samps, c("A", "B"), paste, sep = ".")))) rownames(CNTmat) <- NULL return(CNTmat) } ################################################################################ ## The places that need change are the first ## line of data(SNPinfo2) and subsequent places where SNPinfo is needed. ########################################################################### getSeg2<-function(raw, qlambda=0.6, bandwidth=1,...) { data(SNPinfo2) SNPs <- intersect(SNPinfo2[, 1], raw[, 1]) dats <- raw[match(SNPs, raw[, 1]), -1] trunc <- function(vec) { vec[vec <= 0] <- 0.001 return(vec) } dats <- if (is.vector(dats)) { trunc(dats) } else apply(dats, 2, trunc) dats <- log(dats/2, 2) inf <- SNPinfo2[match(SNPs, SNPinfo2[, 1]), -1] chr <- inf[, 1] bp <- inf[, 2] cnms <- colnames(dats) nSNP <- length(SNPs) nSamp <- if (is.vector(dats)) { 1 } else dim(dats)[2] out <- matrix(nrow = 0, ncol = 6) for (k in 1:nSamp) { print(paste("SAMPLE", k)) ob <- if (is.vector(dats)) { cbind(chr, 1:nSNP, bp, dats) }else cbind(chr, 1:nSNP, bp, dats[, k]) ob <- as.data.frame(ob) colnames(ob) <- c("Chromosome", "PosOrder", "PosBase", "LogRatio") ob <- as.profileCGH(ob) nas <- ob$profileValuesNA[, c(1, 3)] nas <- cbind(nas, NA, NA) want <- glad(ob, verbose = FALSE, qlambda=qlambda, bandwidth=bandwidth)[[1]][, c(1, 3, 6, 7)] if (dim(nas)[2] == 4) { colnames(nas) <- colnames(want) all <- rbind(want, nas) }else all <- want all <- all[order(all[, 1], all[, 2]), ] for (l in 1:23) { mt <- all[all[, 1] == l, ] inds <- which(!is.na(mt[, 3])) mt[1:(inds[1]), 3] <- mt[inds[1], 3] mt[1:(inds[1]), 4] <- mt[inds[1], 4] mt[inds[length(inds)]:(dim(mt)[1]), 3] <- mt[inds[length(inds)], 3] mt[inds[length(inds)]:(dim(mt)[1]), 4] <- mt[inds[length(inds)], 4] inds2 <- inds[-(length(inds))][inds[-1] != (inds[-length(inds)] + 1)] inds3 <- inds[-1][inds[-1] != (inds[-length(inds)] + 1)] if (length(inds2) > 0) { for (q in 1:(length(inds2))) { mdpt <- mt[inds2[q], 2] + (mt[inds2[q], 2] + mt[inds2[q], 2])/2 m <- mt[(inds2[q]):(inds3[q]), ] m[m[, 2] <= mdpt, 3] <- mt[inds2[q], 3] m[m[, 2] <= mdpt, 4] <- mt[inds2[q], 4] m[m[, 2] > mdpt, 3] <- mt[inds3[q], 3] m[m[, 2] > mdpt, 4] <- mt[inds3[q], 4] mt[(inds2[q]):(inds3[q]), ] <- m } } all[all[, 1] == l, ] <- mt } want <- all rm(all) reg <- unique(want[, 4]) segs <- matrix(nrow = 0, ncol = 6) for (j in reg) { tmp <- want[want[, 4] == j, -4] nMark <- if (is.vector(tmp)) { 1 } else dim(tmp)[1] chrst <- if (is.vector(tmp)) { tmp[1:2] }else tmp[1, 1:2] nd <- if (is.vector(tmp)) { tmp[2] }else tmp[nMark, 2] mn <- if (is.vector(tmp)) { 2 * 2^tmp[3] }else 2 * 2^tmp[1, 3] segs <- rbind(segs, c(k, chrst, nd, nMark, mn)) } nReg <- dim(segs)[1] segs <- t(apply(segs, 1, unlist)) i <- 1 while (i <= nReg) { if (segs[i, 5] < 4 | segs[i, 4] - segs[i, 3] < 5000) { if (i == 1) { segs[2, 6] <- (segs[1, 5] * segs[1, 6] + segs[2, 5] * segs[2, 6])/sum(segs[1:2, 5]) segs[2, 5] <- segs[1, 5] + segs[2, 5] segs[2, 3] <- segs[1, 3] segs <- segs[-1, ] nReg <- nReg - 1 }else { if (i == nReg) { segs[i - 1, 6] <- (segs[i, 5] * segs[i, 6] + segs[i - 1, 5] * segs[i - 1, 6])/sum(segs[(i - 1):i, 5]) segs[i - 1, 5] <- segs[i, 5] + segs[i - 1, 5] segs[i - 1, 4] <- segs[i, 4] segs <- segs[-i, ] i <- i + 1 }else { if (abs(segs[i, 3] - segs[i - 1, 4]) < abs(segs[i, 4] - segs[i + 1, 3])) { segs[i - 1, 6] <- (segs[i, 5] * segs[i, 6] + segs[i - 1, 5] * segs[i - 1, 6])/sum(segs[(i - 1):i, 5]) segs[i - 1, 5] <- segs[i, 5] + segs[i - 1, 5] segs[i - 1, 4] <- segs[i, 4] segs <- segs[-i, ] nReg <- nReg - 1 }else { segs[i + 1, 6] <- (segs[i, 5] * segs[i, 6] + segs[i + 1, 5] * segs[i + 1, 6])/sum(segs[i:(i + 1), 5]) segs[i + 1, 5] <- segs[i, 5] + segs[i + 1, 5] segs[i + 1, 3] <- segs[i, 3] segs <- segs[-i, ] nReg <- nReg - 1 } } } }else i <- i + 1 } out <- rbind(out, segs) } colnames(out) <- c("Sample", "Chromosome", "Start.bp", "End.bp", "Num.SNPs", "Seg.CN") out <- out[order(out[, 1], out[, 2], out[, 3]), ] return(out) } ###################################################################### ## No change has been made for this function, except function call EMSNP into EMSNP2. ###################################################################### findGeno2<-function(snp, intMat, NCthresh = 0.99) { want <- intMat[intMat[, 1] == snp, -1] nSamp <- dim(want)[2] - 4 sampNames <- colnames(want)[-(1:4)] em <- try(EMSNP2(want)) if (inherits(em, "try-error")) { out <- rep(-1, nSamp) } else { zs <- em$z out <- apply(zs, 2, function(vec) { which.max(vec) - 1 }) maxes <- apply(zs, 2, max) out[maxes < NCthresh] <- -1 } names(out) <- sampNames return(out) } ############################################################# ## No change has been made for this function. ############################################################# EMSNP2<-function(mat, nreps = 10, precision = 0.1) { gaussian <- function(link = "exponential") { linktemp <- substitute(link) if (!is.character(linktemp)) { linktemp <- deparse(linktemp) if (linktemp == "link") linktemp <- eval(link) } if (linktemp == "exponential") stats <- list(linkfun = function(mu) exp(mu), linkinv = function(eta) log(eta), mu.eta = function(eta) 1/eta, valideta = function(eta) all(eta > 0)) structure(list(family = "gaussian", link = linktemp, linkfun = stats$linkfun, linkinv = stats$linkinv, variance = function(mu) rep.int(1, length(mu)), dev.resids = function(y, mu, wt) wt * ((y - mu)^2), aic = function(y, n, mu, wt, dev) sum(wt) * (log(dev/sum(wt) * 2 * pi) + 1) + 2, mu.eta = stats$mu.eta, initialize = expression({ n <- rep.int(1, nobs) mustart <- y }), validmu = function(mu) TRUE), class = "family") } n <- dim(mat)[2] - 4 nProbes <- dim(mat)[1] ys <- as.vector(mat[, -(1:4)]) types <- mat[, 1:4] pma <- which(types[, 1] & types[, 2]) mmc <- which(!types[, 1] & types[, 3]) pmb <- which(types[, 1] & !types[, 2]) mmoa <- which(!types[, 1] & types[, 2] & !types[, 3]) mmob <- which(!types[, 1] & !types[, 2] & !types[, 3]) Find <- rep(as.logical(types[, 4]), n) For <- which(as.logical(types[, 4])) Rev <- which(!as.logical(types[, 4])) zs <- matrix(0, nrow = 3, ncol = n) for (j in 1:n) { if (is.na(mat[1, j + 4])) { zs[, j] <- NA } else { as <- log(mat[pma, j + 4]) bs <- log(mat[pmb, j + 4]) Pval <- t.test(as, bs, "greater")$p.value if (Pval > 0.5) { zs[1, j] <- Pval - ((1 - Pval)/2) zs[2, j] <- 1 - Pval zs[3, j] <- (1 - Pval)/2 } if (Pval <= 0.5) { zs[1, j] <- Pval/2 zs[2, j] <- Pval zs[3, j] <- (1 - Pval) - Pval/2 } } } ys <- log(ys) betasF <- c(0, 0, 0, 0, 0) betasR <- c(0, 0, 0, 0, 0) ps <- c(1, 1, 0) r <- 0 diffs <- precision * 2 while (diffs > precision & r < nreps) { r <- r + 1 zs.old <- zs betasF.old <- betasF betasR.old <- betasR ps.old <- ps ps <- apply(zs, 1, sum)/n X <- matrix(0, nrow = nProbes * n, ncol = 4) for (j in 1:n) { if (!is.na(zs[1, j])) { CA <- zs[2, j] + 2 * zs[3, j] CB <- 2 * zs[1, j] + zs[2, j] X[nProbes * (j - 1) + pma, 1] <- CA X[nProbes * (j - 1) + pmb, 2] <- CB X[nProbes * (j - 1) + c(pmb, mmc, mmoa), 3] <- CA X[nProbes * (j - 1) + c(mmc, pma, mmob), 4] <- CB } } if (min(zs[1, ], na.rm = TRUE) > 0.9 * 0.7 | min(zs[3, ], na.rm = TRUE) > 0.9 * 0.7) { X[, 2] <- X[, 1] + X[, 2] X[, 3] <- X[, 3] + X[, 4] X <- X[, -c(1, 4)] rsltF <- glm(ys[Find] ~ X[Find, ], family = gaussian("exponential"), na.action = na.omit) rsltR <- glm(ys[!Find] ~ X[!Find, ], family = gaussian("exponential"), na.action = na.omit) betasF <- coef(rsltF)[c(1, 2, 2, 3, 3)] sigF <- sqrt(sum(residuals(rsltF, "deviance")^2)/(length(ys[Find]) - 3)) betasR <- coef(rsltR)[c(1, 2, 2, 3, 3)] sigR <- sqrt(sum(residuals(rsltR, "deviance")^2)/(length(ys[!Find]) - 3)) }else { rsltF <- glm(ys[Find] ~ X[Find, ], family = gaussian("exponential"), na.action = na.omit) rsltR <- glm(ys[!Find] ~ X[!Find, ], family = gaussian("exponential"), na.action = na.omit) betasF <- coef(rsltF) sigF <- sqrt(sum(residuals(rsltF, "response")^2)/(length(ys[Find]) - 5)) betasR <- coef(rsltR) sigR <- sqrt(sum(residuals(rsltR, "response")^2)/(length(ys[!Find]) - 5)) } for (j in 1:n) { if (!is.na(zs[1, j])) { pmajf <- nProbes * (j - 1) + intersect(pma, For) pmbjf <- nProbes * (j - 1) + intersect(pmb, For) mmcjf <- nProbes * (j - 1) + intersect(mmc, For) mmoajf <- nProbes * (j - 1) + intersect(mmoa, For) mmobjf <- nProbes * (j - 1) + intersect(mmob, For) pmajr <- nProbes * (j - 1) + intersect(pma, Rev) pmbjr <- nProbes * (j - 1) + intersect(pmb, Rev) mmcjr <- nProbes * (j - 1) + intersect(mmc, Rev) mmoajr <- nProbes * (j - 1) + intersect(mmoa, Rev) mmobjr <- nProbes * (j - 1) + intersect(mmob,Rev) m0 <- (if (length(pmbjf) == 0) {1}else prod(sapply(ys[pmbjf], dnorm, mean = log(betasF[1] + 2 * betasF[3]), sd = sigF))) * (if (length(c(pmajf, mmcjf, mmobjf)) == 0) { 1 }else prod(sapply(ys[c(pmajf, mmcjf, mmobjf)], dnorm, mean = log(betasF[1] + 2 * betasF[5]), sd = sigF))) * (if (length(mmoajf) == 0) { 1 } else prod(sapply(ys[mmoajf], dnorm, mean = log(betasF[1]), sd = sigF))) * (if (length(pmbjr) == 0) { 1 } else prod(sapply(ys[pmbjr], dnorm, mean = log(betasR[1] + 2 * betasR[3]), sd = sigR))) * (if (length(c(pmajr, mmcjr, mmobjr)) == 0) { 1 } else prod(sapply(ys[c(pmajr, mmcjr, mmobjr)], dnorm, mean = log(betasR[1] + 2 * betasR[5]), sd = sigR))) * (if (length(mmoajr) == 0) { 1 } else prod(sapply(ys[mmoajr], dnorm, mean = log(betasR[1]), sd = sigR))) m1 <- (if (length(pmajf) == 0) { 1 } else prod(sapply(ys[pmajf], dnorm, mean = log(betasF[1] + betasF[2] + betasF[5]), sd = sigF))) * (if (length(pmbjf) == 0) { 1 } else prod(sapply(ys[pmbjf], dnorm, mean = log(betasF[1] + betasF[3] + betasF[4]), sd = sigF))) * (if (length(mmcjf) == 0) { 1 } else prod(sapply(ys[mmcjf], dnorm, mean = log(betasF[1] + betasF[4] + betasF[5]), sd = sigF))) * (if (length(mmoajf) == 0) { 1 } else prod(sapply(ys[mmoajf], dnorm, mean = log(betasF[1] + betasF[4]), sd = sigF))) * (if (length(mmobjf) == 0) { 1 } else prod(sapply(ys[mmobjf], dnorm, mean = log(betasF[1] + betasF[5]), sd = sigF))) * (if (length(pmajr) == 0) { 1 } else prod(sapply(ys[pmajr], dnorm, mean = log(betasR[1] + betasR[2] + betasR[5]), sd = sigR))) * (if (length(pmbjr) == 0) { 1 } else prod(sapply(ys[pmbjr], dnorm, mean = log(betasR[1] + betasR[3] + betasR[4]), sd = sigR))) * (if (length(mmcjr) == 0) { 1 } else prod(sapply(ys[mmcjr], dnorm, mean = log(betasR[1] + betasR[4] + betasR[5]), sd = sigR))) * (if (length(mmoajr) == 0) { 1 } else prod(sapply(ys[mmoajr], dnorm, mean = log(betasR[1] + betasR[4]), sd = sigR))) * (if (length(mmobjr) == 0) { 1 } else prod(sapply(ys[mmobjr], dnorm, mean = log(betasR[1] + betasR[5]), sd = sigR))) m2 <- (if (length(pmajf) == 0) { 1 } else prod(sapply(ys[pmajf], dnorm, mean = log(betasF[1] + 2 * betasF[2]), sd = sigF))) * (if (length(c(mmcjf, pmbjf, mmoajf)) == 0) { 1 } else prod(sapply(ys[c(mmcjf, pmbjf, mmoajf)], dnorm, mean = log(betasF[1] + 2 * betasF[4]), sd = sigF))) * (if (length(mmobjf) == 0) { 1 }else prod(sapply(ys[mmobjf], dnorm, mean = log(betasF[1]), sd = sigF))) * (if (length(pmajr) == 0) { 1 } else prod(sapply(ys[pmajr], dnorm, mean = log(betasR[1] + 2 * betasR[2]), sd = sigR))) * (if (length(c(mmcjr, pmbjr, mmoajr)) == 0) { 1 } else prod(sapply(ys[c(mmcjr, pmbjr, mmoajr)], dnorm, mean = log(betasR[1] + 2 * betasR[4]), sd = sigR))) * (if (length(mmobjr) == 0) { 1 } else prod(sapply(ys[mmobjr], dnorm, mean = log(betasR[1]), sd = sigR))) denom <- ps %*% c(m0, m1, m2) zs[1, j] <- ps[1] * m0/denom zs[2, j] <- ps[2] * m1/denom zs[3, j] <- ps[3] * m2/denom } } diffs <- max(c(max(abs(zs - zs.old)), max(abs(ps - ps.old)), max(abs(betasF - betasF.old)), max(abs(betasR - betasR.old)))) } names(betasF) <- c("gamma_F", "alpha_F0", "beta_F0" , "alpha_F1", "beta_F1") names(betasR) <- c("gamma_R", "alpha_R0", "beta_R0", "alpha_R1", "beta_R1") rownames(zs) <- c("P[BB]", "P[AB]", "P[AA]") colnames(zs) <- colnames(mat)[-(1:4)] return(list(coeffsF = betasF, coeffsR = betasR, z = zs)) } ############################################################################# ## Need change the SNPinfo data file to SNPinfo2 in the first line ## and in subsequent places that need this data set. ## Introduced a new parameter p to correct the conservativeness of output. ## p is roughly the ratio of normal DNA in the total Tumor DNA when SNP data. ############################################################################# pscnStep2<-function (glad, rawCN, p=0.15) { data(SNPinfo2) SNPs <- intersect(SNPinfo2[, 1], rawCN[, 1]) rawCN <- rawCN[match(SNPs, rawCN[, 1]), ] SNPinfo2 <- SNPinfo2[match(SNPs, SNPinfo2[, 1]), ] nReg <- dim(glad)[1] pscnReg <- function(vec) { vec <- as.numeric(vec) # Sample Chromosome Start.bp End.bp Num.SNPs Seg.CN # 1 3 6.533470e+05 1.240497e+08 452 1.595968 mat <- rawCN[SNPinfo2[, 2] == vec[2] & SNPinfo2[, 3] >= vec[3] & SNPinfo2[, 3] <= vec[4], (2 * vec[1]):(2 * vec[1] + 1)] #all rawCN on chr vec[2] in the segment with specific sample. kx2 matrix mat=(mat-2*p)/(1-p) #change it nSNP <- dim(mat)[1] mn <- apply(mat, 1, min) mx <- apply(mat, 1, max) vec[6]=(vec[6]-2*p)/(1-p) ##Change it if (vec[6] <= 0.5 ) { return(matrix(0, nrow = nSNP, ncol = 2)) } if (vec[6] <= 1.5 & vec[6] > 0.5) { return(cbind(rep(0, nSNP), rep(1, nSNP))) } if (vec[6] > 1.5) { bads <- which(round(mn) <= 0) nas <- which(is.na(mn)) if (length(bads) > 0) { mn1 <- mean(mn[-bads], na.rm = T) mx1 <- mean(mx[-bads], na.rm = T) v1 <- round((mn1/(mn1 + mx1)) * vec[6]) v2 <- round(vec[6]) - v1 } if ((length(bads) - length(nas))/(length(mn) - length(nas)) > 0.95) { v1 <- 0 v2 <- max(c(round(vec[6]), 0)) } if (length(bads) == 0) { v1 <- round((mean(mn, na.rm = T)/mean(mn + mx, na.rm = T)) * vec[6]) v2 <- round(vec[6]) - v1 } return(cbind(rep(v1, nSNP), rep(v2, nSNP))) } } pscnMat <- matrix(NA, ncol = dim(rawCN)[2] - 1, nrow = dim(rawCN)[1]) for (j in 1:nReg) { vc <- as.numeric(glad[j, ]) print(paste("Segment", j, "of", nReg), quote = F) pscnMat[SNPinfo2[, 2] == vc[2] & SNPinfo2[, 3] >= vc[3] & SNPinfo2[, 3] <= vc[4], (2 * vc[1] - 1):(2 * vc[1])] <- pscnReg(vc) if (((vc[4] - vc[3]) < 5000) | (vc[5] < 4)) { pscnMat[SNPinfo2[, 2] == vc[2] & SNPinfo2[, 3] >= vc[3] & SNPinfo2[, 3] <= vc[4], (2 * vc[1] - 1):(2 * vc[1])] <- cbind(rep(1, vc[5]), rep(1, vc[5])) } } pscnMat <- cbind(rawCN[, 1], pscnMat) colnames(pscnMat) <- colnames(rawCN) minor <- function(str) { str <- substring(str, first = 1, last = nchar(str) - 1) return(paste(str, "minor", sep = "")) } major <- function(str) { str <- substring(str, first = 1, last = nchar(str) - 1) return(paste(str, "major", sep = "")) } nSamp <- (dim(rawCN)[2] - 1)/2 colnames(pscnMat)[2 * (1:nSamp)] <- sapply(colnames(pscnMat)[2 * (1:nSamp)], minor) colnames(pscnMat)[2 * (1:nSamp) + 1] <- sapply(colnames(pscnMat)[2 * (1:nSamp) + 1], major) return(pscnMat) } ################################################################################# ### All the functions called within this function need to be changed. ################################################################################# ################################################################################# ###pscn2 package:PLASQ R Documentation ###Infer Parent-Specific Copy Number from Probe-Level Data ###Description: ### This function is adapted from pscn, to infer parent-specific copy number from ### probe-level 10k SNP array data. ###Usage: ### pscn2(TumorFolder, CelNormFolder, betasInfile=NULL, ### computeBetas=F, ### normFile, betasFile=NULL, rawCNfile=NULL) ###Arguments: ###TumorFolder: Directory containing tumor .cel files. ###CelNormFolder: Directory containing normal .cel files. ###betasInfile: If model parameters have already been computed, an R image ### file containing an object betas: a matrix of its values (in ### the form output by the 'getBetas2'). ###computeBetas: Should the parameters be estimated using data in ### CelNormFolder? ###normFile: A file in which the unnormalized should be saved. Mandatory, ### but may be temporary. ###betasFile: If desired, file to which betas should be saved for later ### use. ###rawCNfile: If desired, file to which rawCN should be saved (for later ### allele-specific copy number computation using 'ascn'). ###Details: ### This function first extracts the normal probe-level data using ### 'celExtNoNorm2'. This result is saved to normFile argument. There ### are three ways to provide parameter estimates. 1) If betasInfile ### is NULL (the default) and computeBetas is TRUE, 'getBetas2' is used ### to estimate model parameters, which are saved to the file ### betasFile, if specified. 2) Alternatively, ### if betasInfile is NULL and computeBetas is FALSE (the default), ### the paramter estimates in data(betas2) is used. 3) Finally, ### betasInfile can specify an R image file with the betas object. ### Next, the probe-level intensities are read and normalized using ### 'celExtract2', and then raw allele-specific copy number is ### calculated from the model parameters and the probe-level data ### using 'getRaw2'. Finally, the pairwise sums of the raw copy numbers ### are smoothed with 'getSeg2', and parent-specific copy number is ### inferred from these segments and raw allele-specific copy number ### using 'pscnStep2'. Note that, if allele-specific copy number is ### also desired, one should save the raw allele-specific copy number ### by specifying the file name in the 'rawCNfile' argument so as to ### avoid having to recompute it. ###Value: ### A kx(2n+1) matrix of inferred parent-specific copy number FOR THE ### NON-NORMAL SAMPLES ONLY having one row for each SNP, the first ### column giving the SNP ID, followed by 2 columns for each sample: ### minor chromosome sample 1, major chromosome sample 1, minor ### chromosome sample 2, major chromosome samp le 2, ... . ###Author(s): ### Jacob Zhang jacob.zhang@vai.org ### Thomas LaFramboise tlafram@broad.mit.edu ###See Also: ### 'betas2', 'getBetas2', 'ascn2', 'celExtNoNorm2', 'celExtract2', ### 'getRaw2', 'getSeg2', 'pscnStep2' ############################################################################## pscn2<-function(TumorFolder, CelNormFolder, betasInfile = NULL, computeBetas = F, normFile, betasFile = NULL, rawCNfile = NULL,...) { print("Reading in Normal .cel files") norm <- celExtNoNorm2(CelNormFolder) save(norm, file = normFile) rm(norm) print("Reading in Cancer .cel files") cancer <- celExtract2(TumorFolder, normFile) save(cancer, file="cancer.Rdata") rm(cancer) if (is.null(betasInfile)) { if (computeBetas) { print("Estimating Model Parameters") load(normFile) norm <- norm betas <- getBetas2(norm) } else data(betas2) } if (!is.null(betasInfile)) { load(betasInfile) } data(SNPinfo2) snps <- intersect(SNPinfo2[, 1], betas[, 1]) SNPinfo2 <- SNPinfo2[SNPinfo2[, 1] %in% snps, ] betas <- betas[match(SNPinfo2[, 1], betas[, 1]), ] rm(SNPinfo2, inherits = T) if (!is.null(betasFile)) { save(betas, file = betasFile) } print("Finding Raw Copy Number") load("cancer.Rdata") rawCN <- getRaw2(cancer, betas) rm(betas) if (!is.null(rawCNfile)) { save(rawCN, file = rawCNfile) } print("Smoothing total copy number") nSamp <- (dim(rawCN)[2] - 1)/2 mat <- rawCN[, 1] for (i in 1:nSamp) { mat <- cbind(mat, apply(rawCN[, (2 * i):(2 * i + 1)], 1, sum)) } glad <- getSeg2(mat) print("Computing Parent-Specific Copy Number") out <- pscnStep2(glad, rawCN, p=0.15) return(out) } ######################################################################## ### No changes made here. ######################################################################### ascn2<-function(rawCN, pscn) { snps <- intersect(pscn[, 1], rawCN[, 1]) rawCN <- rawCN[match(snps, rawCN[, 1]), ] pscn <- pscn[match(snps, pscn[, 1]), ] ascnOne <- function(rawVec, pscnVec) { if (is.na(prod(pscnVec)) | is.na(prod(rawVec))) { return(c(NA, NA)) } if (sum(pscnVec) == 0) { return(c(0, 0)) } mxInd <- which.max(rawVec) if (sum(pscnVec) == 1) { out <- c(0, 0) out[mxInd] <- 1 return(out) } mnInd <- which.min(rawVec) mn <- min(rawVec) mx <- max(rawVec) if (sum(pscnVec) > 1) { out <- c(0, 0) out[mxInd] <- sum(pscnVec) mnRw <- (mn/(mn + mx)) * sum(pscnVec) if (abs(mnRw) >= abs(mnRw - pscnVec[1])) { out[mnInd] <- min(pscnVec) out[mxInd] <- max(pscnVec) } return(out) } } ASCN <- matrix(NA, nrow = dim(pscn)[1], ncol = dim(pscn)[2]) colnames(ASCN) <- colnames(rawCN) ASCN[, 1] <- pscn[, 1] for (i in 1:((dim(pscn)[2] - 1)/2)) { print(paste("Sample", i), quote = F) for (j in 1:(dim(pscn)[1])) { ASCN[j, (2 * i):(2 * i + 1)] <- ascnOne(rawCN[j, (2 * i):(2 * i + 1)], pscn[j, (2 * i):(2 * i + 1)]) } } return(ASCN) } ######################################################################### ### PSCNplot2: to plot PSCNs along chromosome. ### For detail, see PSCNplot in PLASQ100K. ### Need change the SNPinfo data by 10K SNPinfo2. ######################################################################### PSCNplot2<-function (data, ...) { data(SNPinfo2) int <- intersect(SNPinfo2[, 1], data[, 1]) inf <- SNPinfo2[SNPinfo2[, 1] %in% int, ] mat <- data[match(inf[, 1], data[, 1]), 2:3] breaks <- rep(0, 22) for (i in 1:22) { breaks[i] <- match(i + 1, inf[, 2]) inf[inf[, 2] == i + 1, 3] <- inf[inf[, 2] == i + 1, 3] + inf[breaks[i] - 1, 3] } linepos <- inf[breaks, 3] a <- c(0, linepos) b <- c(linepos, inf[dim(mat)[1], 3]) labpos <- a + (b - a)/2 delInds <- which(mat[, 1] + mat[, 2] == 0) plot(x = inf[, 3], y = rep(NA, dim(inf)[1]), axes = F, ylab = "estimated copy number", xlab = "chromosome number", ylim = c(0, max(mat[, 1] + mat[, 2], na.rm = TRUE)), cex = 0.5, ...) lower <- mat[, 1] upper <- apply(mat, 1, sum) segments(inf[, 3], 0, inf[, 3], lower, col = "red") segments(inf[, 3], lower, inf[, 3], upper, col = "green") if (length(delInds) > 0) { segments(inf[delInds, 3], 0, inf[delInds, 3], 2, col = "black") } axis(side = 2, las = 1) axis(side = 1, at = c(labpos[1:14], linepos[15], labpos[18], labpos[21], labpos[23]), labels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15 16", "17 18 19", "20 2122", "X"), tick = FALSE) segments(linepos, 0, linepos, max(mat[, 1] + mat[, 2], na.rm = TRUE), lty = 2) } ##################################################################################################### ##################################################################################################### #################################################################################################### ########################################################## End of definitions