jointseg
package to partition bivariate DNA copy number
signals from SNP array data into segments of constant parent-specific
copy number. We demonstrate the use of the PSSeg
function
of this package for applying two different strategies. Both strategies
consist in first identifying a list of candidate change points through a
fast (greedy) segmentation method, and then to prune this list is using
dynamic programming [1]. The segmentation method presented here is
Recursive Binary Segmentation (RBS, [2]). We refer to [6] for a more
comprehensive performance assessment of this method and other
segmentation methods.
segmentation, change point model, binary segmentation, dynamic programming, DNA copy number, parent-specific copy number.
Please see Appendix for citing jointseg
.
PSSeg
requires normalized copy number signals, in the
form of total copy number estimates and allele B fractions for tumor,
the (germline) genotype of SNP. Loci are assumed to come from a single
chromosome and to be ordered by genomic position.
For illustration, we show of such a data set may be created from real
data. We use data from a public SNP array data set, which is distributed
in the acnr
package (on which the jointseg
package depends).
## 'data.frame': 40000 obs. of 7 variables:
## $ c : num 0.909 0.859 1.304 0.647 0.947 ...
## $ b : num NaN NaN NaN NaN NaN NaN NaN -0.035 NaN NaN ...
## $ genotype : num NA NA NA NA NA NA NA 0 NA NA ...
## $ bT : num NaN NaN NaN NaN NaN NaN NaN 0.161 NaN NaN ...
## $ bN : num NaN NaN NaN NaN NaN NaN NaN 0.196 NaN NaN ...
## $ region : chr "(0,1)" "(0,1)" "(0,1)" "(0,1)" ...
## $ cellularity: num 1 1 1 1 1 1 1 1 1 1 ...
This data set consists of copy number signals from 8 types of genomic regions:
##
## (0,1) (0,2) (0,3) (1,1) (1,2) (1,3) (2,2) (2,3)
## 5000 5000 5000 5000 5000 5000 5000 5000
These regions are coded as (C1, C2), where C1 denotes the minor copy number and C2 denotes the major copy number, i.e. the smallest and the largest of the two parental copy numbers (see e.g. [4] for more detailed definitions). For example, (1, 1) corresponds to a normal state, (0, 1) to an hemizygous deletion, (1, 2) to a single copy gain and (0, 2) to a copy-neutral LOH (loss of heterozygosity).
These real data can then be used to create a realistic DNA copy
number profile of user-defined length, and harboring a user-defined
number of breakpoints. This is done using the
getCopyNumberDataByResampling
function. Breakpoint
positions are drawn uniformly) among all possible loci. Between two
breakpoints, the copy number state corresponds to one of the types of
regions in
data}, and each data point is drawn with replacement from the corresponding true copy number signal from the region. More options are available from the documentation of
getCopyNumberDataByResampling}.
K <- 10
bkp <- c(408,1632,3905, 5890,6709, 10481, 12647,14089,17345,18657)
len <- 2e4
sim <- getCopyNumberDataByResampling(len, bkp=bkp, minLength=500, regData=data)
datS <- sim$profile
str(datS)
## 'data.frame': 20000 obs. of 7 variables:
## $ c : num 1.84 1.7 1.96 2.01 3.13 ...
## $ b : num 1.087 NaN -0.004 0.64 NaN ...
## $ genotype : num 1 NA 0 0.5 NA 0.5 NA NA NA NA ...
## $ bT : num 0.877 NaN 0.125 0.595 NaN 0.687 NaN NaN NaN NaN ...
## $ bN : num 0.79 NaN 0.128 0.438 NaN 0.619 NaN NaN NaN NaN ...
## $ region : chr "(1,2)" "(1,2)" "(1,2)" "(1,2)" ...
## $ cellularity: num 1 1 1 1 1 1 1 1 1 1 ...
The resulting copy-number profile is plotted below.
We advise the following (typical) preprocessing before segmentation: * log -transform total copy numbers in order to stabilize their variance; this step improve segmentation results for all methods.
dropOutliers
option in the
PSSeg
function, which internally calls the
smooth.CNA
function of the DNAcopy
package.
The default value for this option is TRUE
.PSSeg
function.We can now use the PSSeg
function to segment signals.
The method consists in three steps:
Note that this is fast:
## segmentation dpseg
## 0 0
To plot the PSSeg segmentation results together with the true breakpoints, do :
The PSSeg
function returns the original segmentation (by
RBS
), the result of the pruning step, and the best model
(among those selected by dynamic programming) according to the criterion
proposed by [3].
The quality of the best segmentation can be assessed as follows. The
number of true positives (TP) is the number of true change points for
which there exists a candidate change point closer than a given
tolerance tol
. The number of false positives is defined as
the number of true negatives (all those which are not change points) for
which the candidate change points are out of tolerance area and those in
tolerance area where there already exists a candidate change point. %The
true negative rate (TNR) is defined as 1-FPR. % True negative are
defined as the midpoints of intervals between true change points
(augmented by points 0 and n + 1, where n is the number of loci. The true
negative rate (TNR) is the proportion of true negatives for which there
is no candidate change point closer than `tol}. By construction, TP ∈ {0, 1, ⋯, K}
where K is the number of true
change points.
## TP FP
## 10 4
Obviously, this performance measure depends on the chosen tolerance:
perf <- sapply(0:10, FUN=function(tol) {
getTpFp(resRBS$bestBkp, sim$bkp, tol=tol,relax = -1)
})
print(perf)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## TP 5 8 8 9 10 10 10 10 10 10 10
## FP 9 6 6 5 4 4 4 4 4 4 4
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] jointseg_1.0.2 knitr_1.49 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
## [5] maketools_1.3.1 matrixStats_1.4.1 cachem_1.1.0 htmltools_0.5.8.1
## [9] acnr_1.0.0 buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.2 sys_3.4.3
## [17] tools_4.4.2 evaluate_1.0.1 bslib_0.8.0 yaml_2.3.10
## [21] DNAcopy_1.81.0 jsonlite_1.8.9 rlang_1.1.4
jointseg
## To cite package 'jointseg' in publications, please use the following
## references:
##
## Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial (). jointseg:
## Joint segmentation of multivariate (copy number) signals.R package
## version 1.0.2.
##
## Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial. Performance
## evaluation of DNA copy number segmentation methods. Briefings in
## Bioinformatics (2015) 16 (4): 600-615.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
[1] Bellman, Richard. 1961. “On the Approximation of Curves by Line Segments Using Dynamic Programming.” Communications of the ACM 4 (6). ACM: 284.
[2] Gey, Servane, et al. 2008. “Using CART to Detect Multiple Change Points in the Mean for Large Sample.” https://hal.archives-ouvertes.fr/hal-00327146.
[3] Lebarbier, E. 2005. “Detecting Multiple Change-Points in the Mean of Gaussian Process by Model Selection.” Signal Processing 85 (4): 717-36.
[4] Neuvial, Pierre, et al. 2011. “Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies.” In Handbook of Statistical Bioinformatics, 1st ed. Springer Handbooks of Computational Statistics. Springer.
[5] Olshen, A B, et al.. 2004. “Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data.” Biostatistics 5 (4): 557-72.
[6] Pierre-Jean, Morgane, et al. 2015. “Performance Evaluation of DNA Copy Number Segmentation Methods.” Briefings in Bioinformatics, no. 4: 600-615.
[7] Staaf, Johan, et al. 2008. “Segmentation-Based Detection of Allelic Imbalance and Loss-of-Heterozygosity in Cancer Cells Using Whole Genome SNP Arrays.” Genome Biology 9 (9). BioMed Central: R136.