Title: | Segmentation of single- and multi-track copy number data by penalized least squares regression (with hg38 and mm10). |
---|---|
Description: | Penalized least squares regression is applied to fit piecewise constant curves to copy number data to locate genomic regions of constant copy number. Procedures are available for individual segmentation of each sample, joint segmentation of several samples and joint segmentation of the two data tracks from SNP-arrays. Several plotting functions are available for visualization of the data and the segmentation results. |
Authors: | Gro Nilsen, Knut Liestoel and Ole Christian Lingjaerde. |
Maintainer: | Gro Nilsen <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.29.0.9000 |
Built: | 2024-11-24 06:25:58 UTC |
Source: | https://github.com/ShixiangWang/copynumber |
Joint segmentation of SNP array data resulting in piecewise constant curves with common break points for copy number data and B-allelle frequency data.
aspcf(logR, BAF, pos.unit = "bp", arms = NULL, kmin = 5, gamma = 40, baf.thres=c(0.1,0.9), skew = 3, assembly= "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names=NULL, verbose = TRUE)
aspcf(logR, BAF, pos.unit = "bp", arms = NULL, kmin = 5, gamma = 40, baf.thres=c(0.1,0.9), skew = 3, assembly= "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names=NULL, verbose = TRUE)
logR |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent columns the numeric copy number measurements for one or more samples. The header of copy number column(s) should give sample ID(s). |
BAF |
either a data frame or the name of a tab-separated file from which B-allelle frequency data can be read. Must be on the same format and size as |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
arms |
optional character vector containing chromosome arms (denoted 'p' and 'q') corresponding to the chromosomes and positions found in |
kmin |
minimum number of probes in each segment, default is 5. |
gamma |
penalty for each discontinuity in the curve, default is 40. |
baf.thres |
a numeric vector of length two giving the thresholds below and above which BAF probes are considered germline homozygous. Must be in the range 0 to 1, default is 0.1 and 0.9 for the lower and upper limit, respectively. |
skew |
a numeric value used to determine whether there is allelic skewness (one or two bands) in BAF. Default is 3. The larger the value the further the BAF measurements must be from 0.5 to imply two bands. |
assembly |
a string specifying which genome assembly version should be applied to determine chromosome arms. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
digits |
the number of decimals to be applied when reporting results. Default is 4. |
return.est |
logical value indicating whether a data frame holding LogR estimates should be returned along with the segments. Default is FALSE, which means that only segments are returned. |
save.res |
logical value indicating whether results should be saved in text files, default is FALSE. |
file.names |
optional character vector of length two giving the name of the files where the logR estimates and segments, respectively, should be saved in case |
verbose |
logical value indicating whether or not to print a progress message each time aspcf analysis is finished for a new chromosome arm. |
Piecewise constant curves are simultaneously fitted to the LogR and BAF data as described in Nilsen and Liestoel et al.(2012). This implies that break points will be the same for the LogR and BAF segmentation curves, while segment values differ. Segmentation is done separately on each chromosome arm in each sample.
If return.est = TRUE
a list with the following components:
logR_estimates |
a data frame where the first two columns give the chromosome numbers and probe positions, respectively, while subsequent column(s) give the LogR estimates for each sample. The estimate for a given probe equals the mean of the segment where the probe is located. |
segments |
a data frame describing each segment found. Each row represents a segment, and columns give the sample IDs, chromosome numbers, arms, local start positions, local end positions, number of probes in the segments, mean LogR values and mean BAF values, respectively. |
If return.est = FALSE
, only the data frame containing the segments is returned.
If save.res = TRUE
the results are also saved in text files with names as specified in file.names
. If file.names=NULL
, a folder named "aspcf_results" is created in the working directory, and the LogR estimates and the segmentation results are saved in this folder as tab-separated files named logR_estimates.txt and segments.txt, respectively.
It will usually be advisable to Winsorize the logR data before running aspcf
, see winsorize
on this. Missing values are not allowed in logR
, see imputeMissing
for imputation of missing copy number values.
Gro Nilsen, Knut Liestoel, Ole Christian Lingjaerde
Nilsen and Liestoel et al., "Copynumber: Efficient algorithms for single- and multi-track copy number segmentation", BMC Genomics 13:591 (2012), doi:10.1186/1471-2164-13-59
#Load LogR and BAF data: data(logR) data(BAF) #First winsorize logR to handle outliers: wins.logR <- winsorize(logR) #Run aspcf: aspcf.segments <- aspcf(wins.logR,BAF)
#Load LogR and BAF data: data(logR) data(BAF) #First winsorize logR to handle outliers: wins.logR <- winsorize(logR) #Run aspcf: aspcf.segments <- aspcf(wins.logR,BAF)
Segments, obtained by pcf
or multipcf
, are classified as "gain", "normal" or "loss" given the specified thresholds.
callAberrations(segments, thres.gain, thres.loss = -thres.gain)
callAberrations(segments, thres.gain, thres.loss = -thres.gain)
segments |
a data frame containing the segmentation results found by either |
thres.gain |
a numeric value giving the threshold to be applied for calling gains. |
thres.loss |
a numeric value giving the threshold to be applied for calling losses. Default is to use the negative value of |
Each region found in segments
is classified as "gain", "normal" or "loss". Regions with gain or loss will be those segments where the segment value is above or below the value given in thres.gain
or thres.loss
, respectively.
A new segment data frame where the segment values have been replaced by the classification "gain", "normal" or "loss".
Gro Nilsen
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Call gains as segments whose value is > 0.2, and losses as segments whose # value < -0.1 ab.seg <- callAberrations(seg,thres.gain=0.2,thres.loss=-0.1)
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Call gains as segments whose value is > 0.2, and losses as segments whose # value < -0.1 ab.seg <- callAberrations(seg,thres.gain=0.2,thres.loss=-0.1)
The segments data frame obtained e.g. by pcf
, multipcf
or aspcf
is converted to the GRanges format.
getGRangesFormat(segments)
getGRangesFormat(segments)
segments |
a data frame containing segmentation results found by e.g. |
GRanges, in the GenomicRanges package, is the standard BioConductor containers for range data. For some applications it may therefore be useful to convert segmentation results to this format.
The segments converted to the GRanges container class.
Gro Nilsen
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Obtain the GRanges format gr <- getGRangesFormat(seg)
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Obtain the GRanges format gr <- getGRangesFormat(seg)
Missing copy number values are imputed by a constant value or pcf-estimates.
imputeMissing(data, method, c = 0, pcf.est = NULL,...)
imputeMissing(data, method, c = 0, pcf.est = NULL,...)
data |
a data frame with numeric or character chromosome numbers in the first column, numeric local probe positions in the second, and numeric copy number data for one or more samples in subsequent columns. |
method |
the imputation method to be used. Must be one of "constant" and "pcf". |
c |
a numerical value to be imputed if method is "constant". Default is 0. |
pcf.est |
a data frame of same size as |
... |
other relevant parameters to be passed on to |
The available imputation methods are:
constant
:all missing values in data
are replaced by the specified value c
.
pcf
:the estimates from pcf-segmentation (see pcf
) are used to impute missing values. If pcf
has already been run, these estimates may be specified in pcf.est
. If pcf.est
is unspecified, pcf
is run on the input data. In pcf
the analysis is done on the observed values, and estimates for missing observations are set to be the estimate of the nearest observed probe.
A data frame of the same size and format as data
with all missing values imputed.
Gro Nilsen
#Load lymphoma data data(lymphoma) chrom <- lymphoma[,1] pos <- lymphoma[,2] #pick out data for the first six samples: cn.data <- lymphoma[,3:8] #Create missing values in cn.data at random positions: n <- nrow(cn.data)*ncol(cn.data) r <- matrix(rbinom(n=n,size=1,prob=0.95),nrow=nrow(cn.data),ncol=ncol(cn.data)) cn.data[r==0] <- NA #matrix with approximately 5% missing values mis.data <- data.frame(chrom,pos,cn.data) #Impute missing values by constant, c=0: imp.data <- imputeMissing(data=mis.data,method="constant") #Impute missing values by obtained pcf-values: pcf.est <- pcf(data=mis.data,return.est=TRUE) imp.data <- imputeMissing(data=mis.data,method="pcf",pcf.est=pcf.est) #Or run pcf within imputeMissing: imp.data <- imputeMissing(data=mis.data,method="pcf")
#Load lymphoma data data(lymphoma) chrom <- lymphoma[,1] pos <- lymphoma[,2] #pick out data for the first six samples: cn.data <- lymphoma[,3:8] #Create missing values in cn.data at random positions: n <- nrow(cn.data)*ncol(cn.data) r <- matrix(rbinom(n=n,size=1,prob=0.95),nrow=nrow(cn.data),ncol=ncol(cn.data)) cn.data[r==0] <- NA #matrix with approximately 5% missing values mis.data <- data.frame(chrom,pos,cn.data) #Impute missing values by constant, c=0: imp.data <- imputeMissing(data=mis.data,method="constant") #Impute missing values by obtained pcf-values: pcf.est <- pcf(data=mis.data,return.est=TRUE) imp.data <- imputeMissing(data=mis.data,method="pcf",pcf.est=pcf.est) #Or run pcf within imputeMissing: imp.data <- imputeMissing(data=mis.data,method="pcf")
Given a segmentation by pcf
, interpolate pcf-estimates for specific positions.
interpolate.pcf(segments,x)
interpolate.pcf(segments,x)
segments |
a data frame containing the segmentation results from |
x |
matrix or data.frame where the first column gives chrosomomes and the second gives positions. |
Pcf-estimates are interpolated for the chromosomes and postions specified in x
.
A data frame where the first two columns give the chromsomes and positions specified in the input x
and the remaining columns give the interpolated pcf-estimate for each sample represented in segments
.
The positions in segments
and x
must be of the same unit (bp, kbp, or mbp).
Gro Nilsen, Ole Christian Lingjaerde.
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Run pcf: seg <- pcf(data=sub.lymphoma,gamma=12) #Make a matrix with two positions and chromosomes for which we want to #interpolate the pcf-estimate: pos <- c(2000000,50000000) chr <- c(1,2) x <- cbind(chr,pos) #Interpolate int.pcf <- interpolate.pcf(seg,x)
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Run pcf: seg <- pcf(data=sub.lymphoma,gamma=12) #Make a matrix with two positions and chromosomes for which we want to #interpolate the pcf-estimate: pos <- c(2000000,50000000) chr <- c(1,2) x <- cbind(chr,pos) #Interpolate int.pcf <- interpolate.pcf(seg,x)
A subset of the aCGH data set taken from the reference below.
data(lymphoma)
data(lymphoma)
Data frame containing 3091 probes with log2-ratio copy numbers for 21 samples. The first column contains the chromosome numbers, the second gives the local probe positions (in base pairs), while the subsequent columns contain the copy number measurements for the individual samples.
Eide et al., "Genomic alterations reveal potential for higher grade transformation in follicular lymphoma and confirm parallel evolution of tumor cell clones", Blood 116:1489-1497, 2010
#Get data data(lymphoma)
#Get data data(lymphoma)
A subset of the 244K MicMa data set containing copy number measurements for six samples on chromosome 17.
data(micma)
data(micma)
Data frame containing 7658 probes with log2-ratio copy numbers for 6 samples on chromosome 17. The first column contains the chromosome numbers, the second gives the local probe positions (in base pairs), while the subsequent columns contain the copy number measurements for the individual samples.
Mathiesen et al., "High resolution analysis of copy number changes in disseminated tumor cells of patients with breast cancer", Int J Cancer 131(4):E405:E415, 2011
#Get data data(micma)
#Get data data(micma)
Joint segmentation resulting in piecewise constant curves with common break points for all samples.
multipcf(data, pos.unit = "bp", arms = NULL, Y = NULL, gamma = 40, normalize=TRUE, w=1, fast = TRUE, assembly = "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
multipcf(data, pos.unit = "bp", arms = NULL, Y = NULL, gamma = 40, normalize=TRUE, w=1, fast = TRUE, assembly = "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
data |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent columns the numeric copy number measurements for two or more samples. The header of copy number columns should give sample IDs. |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
arms |
optional character vector containing chromosome arms (denoted 'p' and 'q') corresponding to the chromosomes and positions found in |
Y |
either a data frame or the name of a tab-separated file containing original copy number data in the case where |
gamma |
penalty for each discontinuity in the curve, default is 40. |
normalize |
a logical value indicating whether each sample's copy number measurements should be scaled by the sample specific residual standard error. Default is TRUE. |
w |
a numeric vector giving an individual weight to be used for each sample. May be of length 1 if the same weight should be applied for each sample, default is 1 (no weighting). |
fast |
a logical value indicating whether a fast (not guaranteed to be exact) version should be run on chromosome arms with > 400 probes. Default is TRUE. |
assembly |
a string specifying which genome assembly version should be applied to determine chromosome arms. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
digits |
the number of decimals to be applied when reporting results. Default is 4. |
return.est |
logical value indicating whether a data frame with copy number estimates (multipcf estimates)should be returned along with the segments. Default is FALSE, which means that only segments are returned. |
save.res |
logical value indicating whether results should be saved in text files, default is FALSE. |
file.names |
optional character vector of length two giving the name of the files where the multipcf estimates and segments, respectively, should be saved in case |
verbose |
logical value indicating whether or not to print a progress message each time multipcf analysis is finished for a new chromosome arm. |
Piecewise constant curves are simultaneously fitted to the copy number data for several samples as described in the multiPCF algorithm in Nilsen and Liestoel et al. (2012). This implies that break points will be the same for all segmentation curves, but the mean segment values will differ among samples. Segmentation is done separately on each chromosome arm.
If return.est = TRUE
a list with the following components:
estimates |
a data frame where the first two columns give the chromosome numbers and probe positions, respectively, while subsequent columns give the copy number estimates for each sample. The estimate for a given probe and sample equals the sample mean of the segment where the probe is located. |
segments |
a data frame describing the segments found in the data. Each row represents a segment, and the first five columns give the chromosome numbers, arms, local start positions, local end positions, and the number of probes in the segments, respectively. Subsequent columns give the mean segment value for each sample, with sample IDs as column headers. |
If return.est = FALSE
only the data frame containing the segments is returned.
If save.res = TRUE
the results are also saved in text files with names as specified in file.names
. If file.names=NULL
, a folder named "multipcf_results" is created in the working directory, and the segments and copy number estimates are saved in this folder as tab-separated files named segments.txt and estimates.txt, respectively.
It is usually advisable to Winsorize data before running pcf, see winsorize
on this.
The input data must be complete, see imputeMissing
for imputation of missing copy number values.
Gro Nilsen, Knut Liestoel
Nilsen and Liestoel et al., "Copynumber: Efficient algorithms for single- and multi-track copy number segmentation", BMC Genomics 13:591 (2012), doi:10.1186/1471-2164-13-59
#Load lymphoma data: data(lymphoma) #Take out a subset of 3 biopsies from the first patient (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Check for missing values in data: any(is.na(sub.lymphoma)) #FALSE #First winsorize data to handle outliers: wins.lymph <- winsorize(sub.lymphoma) #Run multipcf on subset lymphoma data (using a low gamma because of low-density data) multi.segments <- multipcf(data=wins.lymph,gamma=12,Y=sub.lymphoma)
#Load lymphoma data: data(lymphoma) #Take out a subset of 3 biopsies from the first patient (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Check for missing values in data: any(is.na(sub.lymphoma)) #FALSE #First winsorize data to handle outliers: wins.lymph <- winsorize(sub.lymphoma) #Run multipcf on subset lymphoma data (using a low gamma because of low-density data) multi.segments <- multipcf(data=wins.lymph,gamma=12,Y=sub.lymphoma)
Fit a individual piecewise constant segmentation curve to each sample's copy number data.
pcf(data, pos.unit = "bp", arms = NULL, Y = NULL, kmin = 5, gamma = 40, normalize = TRUE, fast = TRUE, assembly = "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
pcf(data, pos.unit = "bp", arms = NULL, Y = NULL, kmin = 5, gamma = 40, normalize = TRUE, fast = TRUE, assembly = "hg19", digits = 4, return.est = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
data |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent column(s) the numeric copy number measurements for one or more samples. The header of copy number columns should give sample IDs. |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
arms |
optional character vector containing chromosome arms (denoted 'p' and 'q') corresponding to the chromosomes and positions found in |
Y |
either a data frame or the name of a tab-separated file containing original copy number data in the case where |
kmin |
minimum number of probes in each segment, default is 5. |
gamma |
penalty for each discontinuity in the curve, default is 40. |
normalize |
logical value indicating whether the copy number measurements should be scaled by the sample residual standard error. Default is TRUE. |
fast |
a logical value indicating whether a fast (not guaranteed to be exact) version should be run on chromosome arms with > 400 probes. |
assembly |
a string specifying which genome assembly version should be applied to determine chromosome arms. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
digits |
the number of decimals to be applied when reporting results. Default is 4. |
return.est |
logical value indicating whether a data frame holding copy number estimates (pcf values) should be returned along with the segments. Default is FALSE, which means that only segments are returned. |
save.res |
logical value indicating whether results should be saved in text files. |
file.names |
optional character vector of length two giving the name of the files where the pcf estimates and segments, respectively, should be saved in case |
verbose |
logical value indicating whether or not to print a progress message each time pcf analysis is finished for a new chromosome arm. |
A piecewise constant segmentation curve is fitted to the copy number observations as described in the PCF algorithm in Nilsen and Liestoel et al. (2012). Segmentation is done separately on each chromosome arm in each sample.
If return.est = TRUE
a list with the following components:
estimates |
a data frame where the first two columns give the chromosome numbers and probe positions respectively, while subsequent column(s) give the copy number estimates for each sample. The estimate for a given probe equals the mean of the segment where the probe is located. |
segments |
a data frame describing each segment found in the data. Each row represents a segment, while columns give the sampleID, chromosome number, arm, local start position, local end position, number of probes in the segment and mean value, respectively. |
If return.est = FALSE
, only the data frame containing the segments is returned.
If save.res = TRUE
the results are also saved in text files with names as specified in file.names
. If file.names=NULL
, a folder named "pcf_results" is created in the working directory, and the pcf estimates and segments are saved in this directory in tab-separated files named estimates.txt and segments.txt, respectively.
It is usually advisable to Winsorize data before running pcf, see winsorize
on this.
Missing copy number values are allowed. These are kept out of the pcf analysis, and copy number estimates for missing observations are later set to be the same as the estimate of the nearest observed probe.
Gro Nilsen, Knut Liestoel, Ole Christian Lingjaerde.
Nilsen and Liestoel et al., "Copynumber: Efficient algorithms for single- and multi-track copy number segmentation", BMC Genomics 13:591 (2012), doi:10.1186/1471-2164-13-59
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #First winsorize data to handle outliers: wins.lymph <- winsorize(sub.lymphoma) #Run pcf (using small gamma because of low-density data): pcf.segments <- pcf(data=wins.lymph,gamma=12,Y=sub.lymphoma)
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #First winsorize data to handle outliers: wins.lymph <- winsorize(sub.lymphoma) #Run pcf (using small gamma because of low-density data): pcf.segments <- pcf(data=wins.lymph,gamma=12,Y=sub.lymphoma)
A basic single-sample pcf segmentation which does not take chromosome borders into account
pcfPlain(pos.data, kmin = 5, gamma = 40, normalize = TRUE, fast = TRUE, digits = 4, return.est = FALSE, verbose = TRUE)
pcfPlain(pos.data, kmin = 5, gamma = 40, normalize = TRUE, fast = TRUE, digits = 4, return.est = FALSE, verbose = TRUE)
pos.data |
a data frame where the rows represent the probes, column 1 holds probe positions, and subsequent column(s) give the numeric copy number measurements for one or more samples. The header of copy number columns should give sample IDs. |
kmin |
minimum number of probes in each segment, default is 5. |
gamma |
penalty for each discontinuity in the curve, default is 40. |
normalize |
logical value indicating whether the copy number measurements should be scaled by the sample residual standard error. Default is TRUE. |
fast |
a logical value indicating whether a fast (not guaranteed to be exact) version should be run if the number of probes are > 400. |
digits |
the number of decimals to be applied when reporting results. Default is 4. |
return.est |
logical value indicating whether a data frame holding copy number estimates (pcf values) should be returned along with the segments. Default is FALSE, which means that only segments are returned. |
verbose |
logical value indicating whether or not to print a progress message each time pcf analysis is finished for a sample. |
A piecewise constant segmentation curve is fitted to the copy number observations as described in the PCF algorithm in Nilsen and Liestoel et al. (2012). Unlike the regular pcf
function, pcfPlain
does not make independent segmentations for each chromosome arm (i.e. breakpoints are not automatically inserted at the beginning and end of chromosome arms). The segmentation can thus be performed independently of assembly.
If return.est = TRUE
a list with the following components:
estimates |
a data frame where the first column gives the probe positions, while subsequent column(s) give the copy number estimates for each sample. The estimate for a given probe equals the mean of the segment where the probe is located. |
segments |
a data frame describing each segment found in the data. Each row represents a segment, while columns give the sampleID, start position, end position, number of probes in the segment and mean value, respectively. |
If return.est = FALSE
, only the data frame containing the segments is returned.
If probe positions are not available, the first column in data
may, e.g., contain the values 1:nrow(data)
.
Gro Nilsen, Knut Liestoel, Ole Christian Lingjaerde.
Nilsen and Liestoel et al., "Copynumber: Efficient algorithms for single- and multi-track copy number segmentation", BMC Genomics 13:591 (2012), doi:10.1186/1471-2164-13-59
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Run pcfPlain (remove first column of chromosome numbers): plain.segments <- pcfPlain(pos.data=sub.lymphoma[,-1],gamma=12)
#Load the lymphoma data set: data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Run pcfPlain (remove first column of chromosome numbers): plain.segments <- pcfPlain(pos.data=sub.lymphoma[,-1],gamma=12)
Create plots reflecting the location of aberrated segments. Results may be visualized over the entire genome or by chromosomes.
plotAberration(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
plotAberration(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
segments |
a data frame containing the segmentation results found by either |
thres.gain |
a numeric vector giving the threshold value(s) to be applied for calling gains. |
thres.loss |
a numeric vector of same length as |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. If unspecified the whole genome is plotted. |
layout |
the vector of length two giving the number of rows and columns in the plot window. Default is |
... |
other optional graphical parameters. These include the plot arguments
|
For each sample, the aberrated regions are shown in the color specified in colors[1]
(default dodgerblue) if the segment value is below thres.loss
and the color specified in colors[2]
(default red) if the segment value is above thres.gain
. Non-aberrated regions are shown in white. Each row in the plot represents a sample, while probe positions are reflected along the x-axis.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Load lymphoma data data(lymphoma) #Run pcf to obtain estimated copy number values seg <- pcf(data=lymphoma,gamma=12) #Plot aberrations for the entire genome plotAberration(segments=seg,thres.gain=0.15) #Plot aberrations for the first 4 chromosomes: plotAberration(segments=seg,thres.gain=0.1,chrom=c(1:4),layout=c(2,2))
#Load lymphoma data data(lymphoma) #Run pcf to obtain estimated copy number values seg <- pcf(data=lymphoma,gamma=12) #Plot aberrations for the entire genome plotAberration(segments=seg,thres.gain=0.15) #Plot aberrations for the first 4 chromosomes: plotAberration(segments=seg,thres.gain=0.1,chrom=c(1:4),layout=c(2,2))
Plot bivariate SNP data and/or aspcf segmentation results for each sample separately with chromosomes in different panels
plotAllele(logR = NULL, BAF = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly="hg19", baf.thres = c(0.1,0.9), winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
plotAllele(logR = NULL, BAF = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly="hg19", baf.thres = c(0.1,0.9), winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
logR |
a data frame with numeric or character chromosome numbers in the first column, numeric local probe positions in the second, and numeric copy number data for one or more samples in subsequent columns. The header of the copy number column(s) should give the sample IDss. |
BAF |
a data frame on the same format and size as |
segments |
a data frame or a list of data frames containing the segmentation results found by |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
sample |
a numeric vector indicating which sample(s) is (are) to be plotted. The number(s) should correspond to the sample's place (in order of appearance) in |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. |
assembly |
a string specifying which genome assembly version should be applied to define the chromosome ideogram. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
baf.thres |
a numeric vector of length 2 giving thresholds below/above which BAF-values will not be plotted (use this to remove germline homozygous BAF probes from the plot). |
winsoutliers |
an optional data frame of the same size as |
xaxis |
either "pos" or "index". The former implies that the xaxis will represent the genomic positions, whereas the latter implies that the xaxis will represent the probe indices. Default is "pos". |
layout |
an integer vector of length two giving the number of rows and columns in the plot. Default is |
plot.ideo |
a logical value indicating whether the chromosome ideogram should be plotted. Only applicable when |
... |
other graphical parameters. These include the common plot arguments |
Several chromosome may be displayed on the same page with the layout
option. If the number of chromosomes exceeds the desired page layout, the user is prompted before advancing to the next page of output.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Load logR and BAF data: data(logR) data(BAF) #Run aspcf:: aspcf.segments <- aspcf(logR,BAF) #Plot plotAllele(logR,BAF,aspcf.segments,layout=c(2,2))
#Load logR and BAF data: data(logR) data(BAF) #Run aspcf:: aspcf.segments <- aspcf(logR,BAF) #Plot plotAllele(logR,BAF,aspcf.segments,layout=c(2,2))
Plot copy number data and/or segmentation results for each chromosome separately with samples in different panels.
plotChrom(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly = "hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
plotChrom(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly = "hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
data |
a data frame with numeric or character chromosome numbers in the first column, numeric local probe positions in the second, and numeric copy number data for one or more samples in subsequent columns. The header of the copy number columns should be the sample IDs. |
segments |
a data frame or a list of data frames containing the segmentation results found by either |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
sample |
a numeric vector indicating which sample(s) is (are) to be plotted. The number(s) should correspond to the sample's place (in order of appearance) in |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. |
assembly |
a string specifying which genome assembly version should be applied to define the chromosome ideogram. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
winsoutliers |
an optional data frame of the same size as |
xaxis |
either "pos" or "index". The former implies that the xaxis will represent the genomic positions, whereas the latter implies that the xaxis will represent the probe index. Default is "pos". |
layout |
an integer vector of length two giving the number of rows and columns in the plot. Default is |
plot.ideo |
a logical value indicating whether the chromosome ideogram should be plotted. Only applicable when |
... |
other graphical parameters. These include the common plot arguments |
Several plots may be produced on the same page with the layout
option. If the number of plots exceeds the desired page layout, the user is prompted before advancing to the next page of output.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.res <- winsorize(data=sub.lymphoma,return.outliers=TRUE) #Use pcf to find segments: uni.segments <- pcf(data=wins.res,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.res,gamma=12) #Plot data and segments for chromosome 1 separately for each sample: plotChrom(data=sub.lymphoma,segments=list(uni.segments,multi.segments),chrom=1, layout=c(3,2)) #Let xaxis be probe index, and do not connect segments by vertical lines: plotChrom(data=sub.lymphoma,segments=list(uni.segments,multi.segments),chrom=1, xaxis="index",layout=c(3,2),legend=FALSE,connect=FALSE) #Data was winsorized earlier. Mark winsorized values by different color #and symbol: plotChrom(data=wins.res,chrom=1,winsoutliers=wins.res,layout=c(3,2)) #Save plots in working directory: plotChrom(data=sub.lymphoma,segments=uni.segments,chrom=c(1,2), layout=c(3,2),dir.print=getwd(),file.name=c("chromosome1","chromosome2"), onefile=FALSE)
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.res <- winsorize(data=sub.lymphoma,return.outliers=TRUE) #Use pcf to find segments: uni.segments <- pcf(data=wins.res,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.res,gamma=12) #Plot data and segments for chromosome 1 separately for each sample: plotChrom(data=sub.lymphoma,segments=list(uni.segments,multi.segments),chrom=1, layout=c(3,2)) #Let xaxis be probe index, and do not connect segments by vertical lines: plotChrom(data=sub.lymphoma,segments=list(uni.segments,multi.segments),chrom=1, xaxis="index",layout=c(3,2),legend=FALSE,connect=FALSE) #Data was winsorized earlier. Mark winsorized values by different color #and symbol: plotChrom(data=wins.res,chrom=1,winsoutliers=wins.res,layout=c(3,2)) #Save plots in working directory: plotChrom(data=sub.lymphoma,segments=uni.segments,chrom=c(1,2), layout=c(3,2),dir.print=getwd(),file.name=c("chromosome1","chromosome2"), onefile=FALSE)
A circular genome is plotted and the percentage of samples that have a gain or a loss at a genomic position is added in the middle of the circle. Gains/losses correspond to copy number values that are above/below a pre-defined threshold. In addition arcs representing some connection between genomic loci may be added.
plotCircle(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", freq.colors = c("red","limegreen"), alpha = 1/7, arcs = NULL, arc.colors = c("goldenrod1","dodgerblue"), d = 0.3, assembly = "hg19")
plotCircle(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", freq.colors = c("red","limegreen"), alpha = 1/7, arcs = NULL, arc.colors = c("goldenrod1","dodgerblue"), d = 0.3, assembly = "hg19")
segments |
a data frame containing the segmentation results found by either |
thres.gain |
a scalar giving the threshold value to be applied for calling gains. |
thres.loss |
a scalar giving the threshold value to be applied for calling losses. Default is to use the negative value of |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
freq.colors |
a vector giving the colors to be used for the amplification and deletion frequencies, respectively. Default is c("red","limegreen"). |
alpha |
a scalar in the range 0 to 1 determining the amount of scaling of the aberration frequencies. For the default value of 1/7 the distance between the genome circle and the zero-line of the frequency-circle corresponds to an aberration percentage of 100 %. See details. |
arcs |
an optional matrix or data frame with 5 columns specifying connections between genomic loci. The first two columns must give the chromosome numbers and local positions for the start points of the arcs, while the two next columns give the chromosome numbers and local positions for the end point of arcs. The last column should contain a vector of numbers 1,2,... indicating that the arcs belong to different classes. Each class of arcs will then be plotted in a different color. |
arc.colors |
a vector giving the colors to be used for plotting the different classes of arcs. Cannot be shorter than the number of unique classes in |
d |
a scalar > 0 representing the distance from the genome circle to the starting points of the arcs. Set d=0 to make arcs start at the genome circle. |
assembly |
a string specifying which genome assembly version should be applied to determine chromosome ideograms. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
To zoom in on the observed aberration frequencies one may increase alpha
. However, the user should be aware that this implies that the distance between the genome circle and the frequency zero-line does not reflect an aberration frequency of 100 %. Since the distance between the two circles is always 1/7, the maximum plotted percentage will be 100/(alpha*7) and any percentages that are higher than this will be truncated to this value.
Gro Nilsen
#load lymphoma data data(lymphoma) #Run pcf pcf.res <- pcf(data=lymphoma,gamma=12) plotCircle(segments=pcf.res,thres.gain=0.1) #Use alpha to view the frequencies in more detail: plotCircle(segments=pcf.res,thres.gain=0.1,alpha=1/5) #An example of how to specify arcs #Using multipcf, we compute the correlation between all segments and then #retrieve those that have absolute inter-chromosomal correlation > 0.7 multiseg <- multipcf(lymphoma) nseg = nrow(multiseg) cormat = cor(t(multiseg[,-c(1:5)])) chr.from <- c() pos.from <- c() chr.to <- c() pos.to <- c() cl <- c() thresh = 0.7 for (i in 1:(nseg-1)) { for (j in (i+1):nseg) { #Check if segment-correlation is larger than threshold and that the two #segments are located on different chromosomes if (abs(cormat[i,j]) > thresh && multiseg$chrom[i] != multiseg$chrom[j]) { chr.from = c(chr.from,multiseg$chrom[i]) chr.to = c(chr.to,multiseg$chrom[j]) pos.from = c(pos.from,(multiseg$start.pos[i] + multiseg$end.pos[i])/2) pos.to = c(pos.to,(multiseg$start.pos[j] + multiseg$end.pos[j])/2) if(cormat[i,j] > thresh){ cl <- c(cl,1) #class 1 for those with positive correlation }else{ cl <- c(cl,2) #class 2 for those with negative correlation } } } } arcs <- cbind(chr.from,pos.from,chr.to,pos.to,cl) #Plot arcs between segment with high correlations; positive correlation in #orange, negative correlation in blue: plotCircle(segments=pcf.res,thres.gain=0.15,arcs=arcs,d=0)
#load lymphoma data data(lymphoma) #Run pcf pcf.res <- pcf(data=lymphoma,gamma=12) plotCircle(segments=pcf.res,thres.gain=0.1) #Use alpha to view the frequencies in more detail: plotCircle(segments=pcf.res,thres.gain=0.1,alpha=1/5) #An example of how to specify arcs #Using multipcf, we compute the correlation between all segments and then #retrieve those that have absolute inter-chromosomal correlation > 0.7 multiseg <- multipcf(lymphoma) nseg = nrow(multiseg) cormat = cor(t(multiseg[,-c(1:5)])) chr.from <- c() pos.from <- c() chr.to <- c() pos.to <- c() cl <- c() thresh = 0.7 for (i in 1:(nseg-1)) { for (j in (i+1):nseg) { #Check if segment-correlation is larger than threshold and that the two #segments are located on different chromosomes if (abs(cormat[i,j]) > thresh && multiseg$chrom[i] != multiseg$chrom[j]) { chr.from = c(chr.from,multiseg$chrom[i]) chr.to = c(chr.to,multiseg$chrom[j]) pos.from = c(pos.from,(multiseg$start.pos[i] + multiseg$end.pos[i])/2) pos.to = c(pos.to,(multiseg$start.pos[j] + multiseg$end.pos[j])/2) if(cormat[i,j] > thresh){ cl <- c(cl,1) #class 1 for those with positive correlation }else{ cl <- c(cl,2) #class 2 for those with negative correlation } } } } arcs <- cbind(chr.from,pos.from,chr.to,pos.to,cl) #Plot arcs between segment with high correlations; positive correlation in #orange, negative correlation in blue: plotCircle(segments=pcf.res,thres.gain=0.15,arcs=arcs,d=0)
Plot the percentage of samples that have an amplification or deletion at a genomic position. Amplifications/deletions correspond to copy number values that are above/below a pre-defined threshold. Frequencies may be plotted over the entire genome or separately for each chromosome.
plotFreq(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
plotFreq(segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
segments |
a data frame containing the segmentation results found by either |
thres.gain |
a numeric vector giving the threshold value(s) to be applied for calling gains. |
thres.loss |
a numeric vector of same length as |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. If unspecified the whole genome is plotted, otherwise each specified chromosome is plotted in a separate panel. |
layout |
an integer vector of length two giving the number of rows and columns in the plot. Default is |
... |
other graphical parameters. These include the common plot arguments
|
The percentage of samples with an aberration is calculated and plotted for all genomic positions. Regions with gain or loss will be those where copy number values are above or below the values given in thres.gain
and thres.loss
, respectively.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Plot over entire genome, gain and loss thresholds are 0.1 and -0.1: plotFreq(segments=seg,thres.gain=0.1) #Plot by chromosomes, two sets of thresholds: plotFreq(segments=seg,thres.gain=c(0.1,0.2), thres.loss=c(-0.05,-0.1), chrom=c(1:23), layout=c(5,5))
#load lymphoma data data(lymphoma) #Run pcf seg <- pcf(data=lymphoma,gamma=12) #Plot over entire genome, gain and loss thresholds are 0.1 and -0.1: plotFreq(segments=seg,thres.gain=0.1) #Plot by chromosomes, two sets of thresholds: plotFreq(segments=seg,thres.gain=c(0.1,0.2), thres.loss=c(-0.05,-0.1), chrom=c(1:23), layout=c(5,5))
Data for one sample on one chromosome is segmented by pcf
for 10 values of gamma, and results are visualized in a multi-grid plot.
plotGamma(data, pos.unit = "bp", gammaRange = c(10,100), dowins = TRUE, sample = 1, chrom = 1, cv = FALSE, K = 5, cex = 2, col = "grey", seg.col="red", ...)
plotGamma(data, pos.unit = "bp", gammaRange = c(10,100), dowins = TRUE, sample = 1, chrom = 1, cv = FALSE, K = 5, cex = 2, col = "grey", seg.col="red", ...)
data |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent column(s) the numeric copy number measurements for one or more samples. The header of copy number columns should give sample IDs. |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
gammaRange |
a vector of length two giving the lowest and highest value of gamma to be applied. 10 (approximately) equally spaced values within this range are applied in the pcf-segmentation. Default range is |
dowins |
logical value indicating whether data should be winsorized before running |
sample |
an integer indicating which sample is to be segmented. The number should correspond to the sample's place (in order of appearance) in |
chrom |
a number or character indicating which chromosome is to be segmented. Default is chromosome 1. |
cv |
logical value indicating whether K-fold cross-validation should be done, see details. |
K |
the number of folds to use in K-fold cross-validation, default is 5. |
cex |
size of data points, default is 2. |
col |
color used to plot data points, default is "grey". |
seg.col |
color used to plot segments, default is "red". |
... |
other optional parameters to be passed to |
Data for one sample and one chromosome is selected, and pcf
is run on this data subset while applying 10 different gamma-values (within the given range). The output is a multi-grid plot with the data shown in the first panel, the segmentation results for the various gammas in the subsequent 10 panels, and the number of segments found for each gamma in the last panel.
If cv = TRUE
a K-fold cross-validation is also performed. For each fold, a random (100/K) per cent of the data are set to be missing, and pcf
is run using the different values of gamma
. The missing probe values are then predicted by the estimated value of their closest non-missing neighbour (see pcf
on this), and the prediction error for this fold is then calculated as the sum of the squared difference between the predicted and the observed values. The process is repeated over the K folds, and the average prediction errors are finally plotted along with the number of segments in the last panel of the plot. The value of gamma for which the minimum prediction error is found is marked by an asterix. Note that such cross-validation tends to favor small values of gamma, and the suitability of the so-called optimal gamma from this procedure should be critically assessed.
If cv = TRUE
a list containing:
gamma |
the gamma values applied. |
pred.error |
the average prediction error for each value of gamma. |
opt.gamma |
the gamma for which the average prediction error is minimized. |
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen, Knut Liestoel, Ole Christian Lingjaerde
#Micma data data(micma) plotGamma(micma,chrom=17)
#Micma data data(micma) plotGamma(micma,chrom=17)
Plot copy number data and/or segmentation results for the whole genome.
plotGenome(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, assembly="hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), ...)
plotGenome(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, assembly="hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), ...)
data |
a data frame with numeric or character chromosome numbers in the first column, numeric local probe positions in the second, and numeric copy number data for one or more samples in subsequent columns. The header of the copy number columns should be the sample IDs. |
segments |
a data frame or a list of data frames containing the segmentation results found by either |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
sample |
a numeric vector indicating which sample(s) is (are) to be plotted. The number(s) should correspond to the sample's place (in order of appearance) in |
assembly |
a string specifying which genome assembly version should be applied to define the chromosome ideogram. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
winsoutliers |
an optional data frame of the same size as |
xaxis |
either "pos" or "index". The former implies that the xaxis will represent the genomic positions, whereas the latter implies that the xaxis will represent the probe index. Default is "pos". |
layout |
an integer vector of length two giving the number of rows and columns in the plot. Default is |
... |
other graphical parameters. These include the common plot arguments |
Several plots may be produced on the same page with the layout
option. If the number of plots exceeds the desired page layout, the user is prompted before advancing to the next page of output.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.data <- winsorize(data=sub.lymphoma,return.outliers=TRUE) #Use pcf to find segments: uni.segments <- pcf(data=wins.data,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.data,gamma=12) #Plot data and pcf-segments over entire genome for all six samples (one page #for each sample): plotGenome(data=sub.lymphoma,segments=uni.segments) #Let each sample define its own range, and adjust range to fit all observations: plotGenome(data=sub.lymphoma,segments=uni.segments,equalRange=FALSE,q=0) #Add results from multipcf on top for four of the samples and let all plots #show on one page: plotGenome(data=sub.lymphoma,segments=list(uni.segments,multi.segments), layout=c(2,2),sample=c(1:4)) #Change segment-colors, line widths, and legend: plotGenome(data=sub.lymphoma,segments=list(uni.segments,multi.segments),layout=c(2,2), seg.col=c("red","blue"),seg.lwd=c(3,2),legend=c("uni","multi") ,sample=c(1:4)) #Aberration calling may be done by defining thresholds that determines the cuf-off #for what should be considered biologically significant aberrations. In this #example segments which are above 0.2 or below -0.2 are considered aberrated #regions: plotGenome(segments=uni.segments,sample=5,connect=FALSE) abline(h=0.2,col="blue",lty=5) abline(h=-0.2,col="blue",lty=5)
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.data <- winsorize(data=sub.lymphoma,return.outliers=TRUE) #Use pcf to find segments: uni.segments <- pcf(data=wins.data,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.data,gamma=12) #Plot data and pcf-segments over entire genome for all six samples (one page #for each sample): plotGenome(data=sub.lymphoma,segments=uni.segments) #Let each sample define its own range, and adjust range to fit all observations: plotGenome(data=sub.lymphoma,segments=uni.segments,equalRange=FALSE,q=0) #Add results from multipcf on top for four of the samples and let all plots #show on one page: plotGenome(data=sub.lymphoma,segments=list(uni.segments,multi.segments), layout=c(2,2),sample=c(1:4)) #Change segment-colors, line widths, and legend: plotGenome(data=sub.lymphoma,segments=list(uni.segments,multi.segments),layout=c(2,2), seg.col=c("red","blue"),seg.lwd=c(3,2),legend=c("uni","multi") ,sample=c(1:4)) #Aberration calling may be done by defining thresholds that determines the cuf-off #for what should be considered biologically significant aberrations. In this #example segments which are above 0.2 or below -0.2 are considered aberrated #regions: plotGenome(segments=uni.segments,sample=5,connect=FALSE) abline(h=0.2,col="blue",lty=5) abline(h=-0.2,col="blue",lty=5)
Heatmap reflecting the magnitude of estimated copy numbers relative to some pre-defined limits. Estimates may be obtained using pcf
or multipcf
, and results may be visualized over the entire genome or by chromosomes.
plotHeatmap(segments, upper.lim, lower.lim = -upper.lim, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
plotHeatmap(segments, upper.lim, lower.lim = -upper.lim, pos.unit = "bp", chrom = NULL, layout = c(1, 1),...)
segments |
a data frame containing the segmentation results found by either |
upper.lim |
a positive numeric vector giving the upper limits(s) to be applied. The colors in the heatmap will reflect the magnitude of the estimated copy numbers relative to this limit, see details. |
lower.lim |
a negative numeric vector of same length as |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. If unspecified the whole genome is plotted. |
layout |
the vector of length two giving the number of rows and columns in the plot window. Default is |
... |
other optional graphical parameters. These include the plot arguments
|
For each sample, the segments are represented by a rectangle plotted in a color corresponding to the difference between the segment copy number value and the limits.
If the value is below lower.lim
, the color of the rectangle will equal the input in colors[1]
(default dodgerblue). If the value is above lower.lim
, but below zero, the color of the rectangle will be a nuance between the input in colors[1]
and colors[2]
(default black). The closer the value is to zero, the closer the nuance will be to colors[2]
. Similary, if the value is above upper.lim
, the color of the rectangle will equal the input in colors[3]
(default red), whereas if the value is below upper.lim
, but above zero, the color will be a nuance between the input in colors[2]
and colors[3]
. Again, the closer the value is to zero, the closer the nuance will be to colors[2]
.
Each row in the heatmap represents a sample, while probe positions are reflected along the x-axis.
This function applies par(fig)
, and is therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Load lymphoma data data(lymphoma) #Run pcf to obtain estimated copy number values seg <- pcf(data=lymphoma,gamma=12) #Heatmap for entire genome, two limit values: plotHeatmap(segments=seg,upper.lim=c(0.1,0.5),layout=c(2,1)) #Heatmap for the first 4 chromosomes: plotHeatmap(segments=seg,upper.lim=0.1,chrom=c(1:4),layout=c(2,2))
#Load lymphoma data data(lymphoma) #Run pcf to obtain estimated copy number values seg <- pcf(data=lymphoma,gamma=12) #Heatmap for entire genome, two limit values: plotHeatmap(segments=seg,upper.lim=c(0.1,0.5),layout=c(2,1)) #Heatmap for the first 4 chromosomes: plotHeatmap(segments=seg,upper.lim=0.1,chrom=c(1:4),layout=c(2,2))
Plot copy number data and/or segmentation results for each sample separately with chromosomes in different panels.
plotSample(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly = "hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
plotSample(data = NULL, segments = NULL, pos.unit = "bp", sample = NULL, chrom = NULL, assembly = "hg19", winsoutliers = NULL, xaxis = "pos", layout = c(1,1), plot.ideo = TRUE, ...)
data |
a data frame with numeric or character chromosome numbers in the first column, numeric local probe positions in the second, and numeric copy number data for one or more samples in subsequent columns. The header of the copy number columns should be the sample IDs. |
segments |
a data frame or a list of data frames containing the segmentation results found by either |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
sample |
a numeric vector indicating which sample(s) is (are) to be plotted. The number(s) should correspond to the sample's place (in order of appearance) in |
chrom |
a numeric or character vector with chromosome number(s) to indicate which chromosome(s) is (are) to be plotted. |
assembly |
a string specifying which genome assembly version should be applied to define the chromosome ideogram. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
winsoutliers |
an optional data frame of the same size as |
xaxis |
either "pos" or "index". The former implies that the xaxis will represent the genomic positions, whereas the latter implies that the xaxis will represent the probe index. Default is "pos". |
layout |
an integer vector of length two giving the number of rows and columns in the plot. Default is |
plot.ideo |
a logical value indicating whether the chromosome ideogram should be plotted. Only applicable when |
... |
other graphical parameters. These include the common plot arguments
|
Several plots may be produced on the same page with the layout
option. If the number of plots exceeds the desired page layout, the user is prompted before advancing to the next page of output.
These functions apply par(fig)
, and are therefore not compatible with other setups for arranging multiple plots in one device such as par(mfrow,mfcol)
.
Gro Nilsen
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.data <- winsorize(data=sub.lymphoma) #Use pcf to find segments: uni.segments <- pcf(data=wins.data,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.data,gamma=12) #Plot data and pcf-segments for one sample separately for each chromosome: plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,layout=c(5,5)) #Add cytoband text to ideogram (one page per chromosome to ensure sufficient #space) plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,layout=c(1,1), cyto.text=TRUE) #Add multipcf-segmentation results, drop legend plotSample(data=sub.lymphoma,segments=list(uni.segments,multi.segments),sample=1, layout=c(5,5),seg.col=c("red","blue"),seg.lwd=c(3,2),legend=FALSE) #Plot by chromosome for two samples, but only chromosome 1-9. One window per #sample: plotSample(data=sub.lymphoma,segments=list(uni.segments,multi.segments),sample= c(2,3),chrom=c(1:9),layout=c(3,3),seg.col=c("red","blue"), seg.lwd=c(3,2),onefile=FALSE) #Zoom in on a particular region by setting xlim: plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,chrom=1,plot.ideo= FALSE,xlim=c(140,170))
#Lymphoma data data(lymphoma) #Take out a smaller subset of 6 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:6) #Winsorize data: wins.data <- winsorize(data=sub.lymphoma) #Use pcf to find segments: uni.segments <- pcf(data=wins.data,gamma=12) #Use multipcf to find segments as well: multi.segments <- multipcf(data=wins.data,gamma=12) #Plot data and pcf-segments for one sample separately for each chromosome: plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,layout=c(5,5)) #Add cytoband text to ideogram (one page per chromosome to ensure sufficient #space) plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,layout=c(1,1), cyto.text=TRUE) #Add multipcf-segmentation results, drop legend plotSample(data=sub.lymphoma,segments=list(uni.segments,multi.segments),sample=1, layout=c(5,5),seg.col=c("red","blue"),seg.lwd=c(3,2),legend=FALSE) #Plot by chromosome for two samples, but only chromosome 1-9. One window per #sample: plotSample(data=sub.lymphoma,segments=list(uni.segments,multi.segments),sample= c(2,3),chrom=c(1:9),layout=c(3,3),seg.col=c("red","blue"), seg.lwd=c(3,2),onefile=FALSE) #Zoom in on a particular region by setting xlim: plotSample(data=sub.lymphoma,segments=uni.segments,sample=1,chrom=1,plot.ideo= FALSE,xlim=c(140,170))
Selects multipcf segments based on a desired characteristic.
selectSegments(segments, what = "variance", thres = NULL, nseg = 10, large = TRUE, p = 0.1)
selectSegments(segments, what = "variance", thres = NULL, nseg = 10, large = TRUE, p = 0.1)
segments |
a data frame containing segments found by |
what |
the desired characteristic to base selection on. Must be one of "variance" (default),"length" and "aberration". See details below. |
thres |
an optional numeric threshold to be applied in the selection. |
nseg |
the desired number of segments to be selected, default is 10. Only used if |
large |
logical value indicating whether segments with large (TRUE) or small (FALSE) variance, length or mean value should be selected when |
p |
a number between 0 and 1 giving the minimum proportion of samples for which an aberration must be detected, default is 0.1. Only applicable if |
The input in what
determines how the segments are selected. Three options are available:
If what="variance"
the variance of the segment values across all samples is calculated for each segment. If thres
is specified, the subset of segments for which the variance is above (if large=TRUE
) or below (if large=FALSE
) the threshold is returned. If thres
is not given by the user, a given number of segments determined by the input in nseg
is selected; if large=TRUE
this will be the nseg
segments with the highest variance, whereas if large=FALSE
the subset will consist of the nseg
segments with the lowest variance.
If what="length"
selection is based on the genomic length of the segments (end position minus start position). If thres
is specified, the subset of segments for which the length is above (if large=TRUE
) or below (if large=FALSE
) this threshold is returned. If thres
is left unspecified, a given number of segments determined by the input in nseg
is selected; if large=TRUE
this will be the nseg
longest segments, whereas if large=FALSE
it will be the nseg
shortest segments.
If what="aberration"
the aberration frequency is used to select the subset of segments. If thres
is specified, the proportion of samples for which the segment value is above (if large=TRUE
) or below (if large=FALSE
) the threshold is calculated for each segment. The subset of segments where this frequency is above or equal to the proportion set by the parameter p
is returned. If thres
is not specified, the nseg
segments with the highest (1-p)-quantile (if large=TRUE
) or the lowest p-quantile (if large=FALSE
) is returned.
A list containing:
sel.seg |
data frame containing the selected segments. |
In addition, depending on the value of what
:
seg.var |
a vector giving the variance for each segment. Only returned if |
seg.length |
a vector giving the length of each segment. Only returned if |
seg.ab.prop |
a vector giving the aberration proportion for each segment. Only returned if |
seg.quantile |
a vector giving the (1-p)- or p-quantile for each segment. Only returned if |
Gro Nilsen
#Lymphoma data data(lymphoma) #Run multipcf segments <- multipcf(lymphoma,gamma=12) #Select the 10 segments with the highest variance: sel.seg1 <- selectSegments(segments) #Select the segments where the variance is below 0.001 sel.seg2 <- selectSegments(segments,thres=0.001,large=FALSE) #Select the 5 longest segments: sel.seg3 <- selectSegments(segments,what="length",nseg=5) #Select the segments where 20 % of the samples have segment value of 0.2 or more: sel.seg4 <- selectSegments(segments,what="aberration",thres=0.2,p=0.2) #Select the 20 segments with the largest median: sel.seg5 <- selectSegments(segments,what="aberration",nseg=20,p=0.5)
#Lymphoma data data(lymphoma) #Run multipcf segments <- multipcf(lymphoma,gamma=12) #Select the 10 segments with the highest variance: sel.seg1 <- selectSegments(segments) #Select the segments where the variance is below 0.001 sel.seg2 <- selectSegments(segments,thres=0.001,large=FALSE) #Select the 5 longest segments: sel.seg3 <- selectSegments(segments,what="length",nseg=5) #Select the segments where 20 % of the samples have segment value of 0.2 or more: sel.seg4 <- selectSegments(segments,what="aberration",thres=0.2,p=0.2) #Select the 20 segments with the largest median: sel.seg5 <- selectSegments(segments,what="aberration",nseg=20,p=0.5)
Artificial SNP array data containing a logR track and a BAF track
data(logR) data(BAF)
data(logR) data(BAF)
Two corresponding data sets containing 10000 probes with logR and BAF measurements, respectively, for 2 samples. The two first columns in both data sets contain chromosome numbers and local probe positions (in base pairs), while the subsequent columns contain logR-values and BAF-values in the two data sets, respectively.
#Get data data(logR) data(BAF)
#Get data data(logR) data(BAF)
This function returns a subset of copy number data according to the input and the specified chromosomes and/or samples.
subsetData(data, chrom = NULL, sample = NULL, sep="\t", ...)
subsetData(data, chrom = NULL, sample = NULL, sep="\t", ...)
data |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent columns the numeric copy number measurements for one or more samples. |
chrom |
a numeric or character vector with chromosome(s) for which data should be selected. If unspecified, all chromosomes in |
sample |
a numeric vector indicating for which sample(s) data should be selected. The number(s) should correspond to the sample's place (in order of appearance) in |
sep |
the separator of the input files if |
... |
optional parameters to be passed to |
A data frame containing the desired subset of data.
Gro Nilsen
#Load lymphoma data data(lymphoma) #Select data only for samples 1 and 6 and chromosomes 1:9: sub.data <- subsetData(data=lymphoma,chrom=c(1:9),sample=c(1,6))
#Load lymphoma data data(lymphoma) #Select data only for samples 1 and 6 and chromosomes 1:9: sub.data <- subsetData(data=lymphoma,chrom=c(1:9),sample=c(1,6))
This function returns a subset of segments according to the input and the specified chromosomes and/or samples.
subsetSegments(segments, chrom = NULL, sample = NULL, sep="\t", ...)
subsetSegments(segments, chrom = NULL, sample = NULL, sep="\t", ...)
segments |
either a data frame or the name of a tab-separated file from which segmentation results can be read. Segmentation results may come from |
chrom |
a numeric or character vector with chromosome(s) for which segments should be selected. If unspecified, all chromosomes in |
sample |
a numeric vector indicating for which sample(s) segments should be selected. The number(s) should correspond to the sample's place (in order of appearance) in |
sep |
the separator of the input files if |
... |
optional parameters to be passed to |
A data frame containing the desired subset of segments.
Gro Nilsen
#Load lymphoma data data(lymphoma) #Select segments only for samples 1 and 6 and chromosomes 1:9: segments <- pcf(lymphoma,gamma=12) sub.segments <- subsetSegments(segments=segments,chrom=c(1:9),sample=c(1,6))
#Load lymphoma data data(lymphoma) #Select segments only for samples 1 and 6 and chromosomes 1:9: segments <- pcf(lymphoma,gamma=12) sub.segments <- subsetSegments(segments=segments,chrom=c(1:9),sample=c(1,6))
Outliers in copy number data are detected and modified using MAD or PCF Winsorization.
winsorize(data, pos.unit = "bp", arms = NULL, method = "mad", tau = 2.5, k = 25, gamma = 40, iter = 1, assembly = "hg19", digits = 4, return.outliers = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
winsorize(data, pos.unit = "bp", arms = NULL, method = "mad", tau = 2.5, k = 25, gamma = 40, iter = 1, assembly = "hg19", digits = 4, return.outliers = FALSE, save.res = FALSE, file.names = NULL, verbose = TRUE)
data |
either a data frame or the name of a tab-separated file from which copy number data can be read. The rows of the data frame or file should represent the probes. Column 1 must hold numeric or character chromosome numbers, column 2 the numeric local probe positions, and subsequent column(s) the numeric copy number measurements for one or more samples. The header of copy number columns should give sample IDs. |
pos.unit |
the unit used to represent the probe positions. Allowed options are "mbp" (mega base pairs), "kbp" (kilo base pairs) or "bp" (base pairs). By default assumed to be "bp". |
arms |
optional character vector containing chromosome arms (denoted 'p' and 'q') corresponding to the chromosomes and positions found in |
method |
the Winsorization method to be applied, must be one of "mad" (default) or "pcf". |
tau |
Winsorization threshold, default is 2.5. |
k |
the half window size to be applied in median filtering, default is 25. |
gamma |
penalty for each discontinuity in the pcf curve, default is 40. Only applicable when |
iter |
number of iterations in PCF Winsorization, default is 1. |
assembly |
a string specifying which genome assembly version should be applied to determine chromosome arms. Allowed options are "hg19", "hg18", "hg17" and "hg16" (corresponding to the four latest human genome annotations in the UCSC genome browser). |
digits |
the number of decimals to be applied when reporting results. Default is 4. |
return.outliers |
logical value indicating whether a data frame identifying outliers should be returned, default is FALSE. |
save.res |
logical value indicating whether results should be saved in text files, default is FALSE. |
file.names |
optional character vector of length two giving the name of the files where the Winsorized data and outlier statuses, respectively, should be saved if |
verbose |
logical value indicating whether or not to print a progress message each time Winsorization is finished for a new chromosome arm. |
The copy number data are either MAD Winsorized or PCF Winsorized as described in Nilsen and Liestoel et al. (2012). Winsorization is done separately on each chromosome arm in each sample.
If return.outliers = TRUE
a list with the following components:
wins.data |
a data frame with chromosome numbers in the first column, probe positions in the second and the Winsorized copy number values for the sample(s) in subsequent column(s). |
wins.outliers |
a data frame with chromosome numbers in the first column, probe positions in the second and outlier statuses for each sample in the subsequent column(s). The values +/- 1 indicate that the observation is an outlier, whereas the value 0 indicates that it is not. |
If return.outliers = FALSE
only the data frame containing the winsorized data is returned.
If save.res=TRUE
the results are saved in text files with names as specified in file.names
. If file.names=NULL
, a folder named "Wins_res" is created in the working directory and Winsorized data and outlier statuses are saved in this directory in tab-separated files named wins.data.txt and wins.outliers.txt, respectively.
Any missing values in data
imply that the Winsorized value and outlier status for this probe will be missing as well. Also, if the number of probes within a chromosome arm is less than 2*k, Winsorization cannot be done and the data values are thus left unchanged.
Gro Nilsen, Knut Liestoel, Ole Christian Lingjaerde
Nilsen and Liestoel et al., "Copynumber: Efficient algorithms for single- and multi-track copy number segmentation", BMC Genomics 13:591 (2012), doi:10.1186/1471-2164-13-59
#Lymphoma data data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Do MAD Winsorization: wins.data <- winsorize(data=sub.lymphoma)
#Lymphoma data data(lymphoma) #Take out a smaller subset of 3 samples (using subsetData): sub.lymphoma <- subsetData(lymphoma,sample=1:3) #Do MAD Winsorization: wins.data <- winsorize(data=sub.lymphoma)