PSSeg: Parent-Specifc copy number segmentation

This vignette describes how to use the jointseg package to partition bivariate DNA copy number signals from SNP array data into segments of constant parent-specific copy number. We demonstrate the use of the PSSeg function of this package for applying two different strategies. Both strategies consist in first identifying a list of candidate change points through a fast (greedy) segmentation method, and then to prune this list is using dynamic programming [1]. The segmentation method presented here is Recursive Binary Segmentation (RBS, [2]). We refer to [6] for a more comprehensive performance assessment of this method and other segmentation methods.

segmentation, change point model, binary segmentation, dynamic programming, DNA copy number, parent-specific copy number.

Please see Appendix for citing jointseg.

Preparing data to be segmented

PSSeg requires normalized copy number signals, in the form of total copy number estimates and allele B fractions for tumor, the (germline) genotype of SNP. Loci are assumed to come from a single chromosome and to be ordered by genomic position.

For illustration, we show of such a data set may be created from real data. We use data from a public SNP array data set, which is distributed in the acnr package (on which the jointseg package depends).

data <- acnr::loadCnRegionData(dataSet="GSE29172", tumorFraction=1)
str(data)
## 'data.frame':    40000 obs. of  7 variables:
##  $ c          : num  0.909 0.859 1.304 0.647 0.947 ...
##  $ b          : num  NaN NaN NaN NaN NaN NaN NaN -0.035 NaN NaN ...
##  $ genotype   : num  NA NA NA NA NA NA NA 0 NA NA ...
##  $ bT         : num  NaN NaN NaN NaN NaN NaN NaN 0.161 NaN NaN ...
##  $ bN         : num  NaN NaN NaN NaN NaN NaN NaN 0.196 NaN NaN ...
##  $ region     : chr  "(0,1)" "(0,1)" "(0,1)" "(0,1)" ...
##  $ cellularity: num  1 1 1 1 1 1 1 1 1 1 ...

This data set consists of copy number signals from 8 types of genomic regions:

table(data[["region"]])
## 
## (0,1) (0,2) (0,3) (1,1) (1,2) (1,3) (2,2) (2,3) 
##  5000  5000  5000  5000  5000  5000  5000  5000

These regions are coded as (C1, C2), where C1 denotes the minor copy number and C2 denotes the major copy number, i.e. the smallest and the largest of the two parental copy numbers (see e.g. [4] for more detailed definitions). For example, (1, 1) corresponds to a normal state, (0, 1) to an hemizygous deletion, (1, 2) to a single copy gain and (0, 2) to a copy-neutral LOH (loss of heterozygosity).

idxs <- sort(sample(1:nrow(data), 2e4))
plotSeg(data[idxs, ])

These real data can then be used to create a realistic DNA copy number profile of user-defined length, and harboring a user-defined number of breakpoints. This is done using the getCopyNumberDataByResampling function. Breakpoint positions are drawn uniformly) among all possible loci. Between two breakpoints, the copy number state corresponds to one of the types of regions in data}, and each data point is drawn with replacement from the corresponding true copy number signal from the region. More options are available from the documentation ofgetCopyNumberDataByResampling}.

K <- 10
bkp <- c(408,1632,3905, 5890,6709, 10481, 12647,14089,17345,18657)
len <- 2e4
sim <- getCopyNumberDataByResampling(len, bkp=bkp, minLength=500, regData=data)
datS <- sim$profile
str(datS)
## 'data.frame':    20000 obs. of  7 variables:
##  $ c          : num  1.84 1.7 1.96 2.01 3.13 ...
##  $ b          : num  1.087 NaN -0.004 0.64 NaN ...
##  $ genotype   : num  1 NA 0 0.5 NA 0.5 NA NA NA NA ...
##  $ bT         : num  0.877 NaN 0.125 0.595 NaN 0.687 NaN NaN NaN NaN ...
##  $ bN         : num  0.79 NaN 0.128 0.438 NaN 0.619 NaN NaN NaN NaN ...
##  $ region     : chr  "(1,2)" "(1,2)" "(1,2)" "(1,2)" ...
##  $ cellularity: num  1 1 1 1 1 1 1 1 1 1 ...

The resulting copy-number profile is plotted below.

plotSeg(datS, sim$bkp)

Preprocessing

We advise the following (typical) preprocessing before segmentation: * log -transform total copy numbers in order to stabilize their variance; this step improve segmentation results for all methods.

datS$c <- log2(datS$c)-1
  • smooth single point outliers as suggested by [5] This step is controlled by the dropOutliers option in the PSSeg function, which internally calls the smooth.CNA function of the DNAcopy package. The default value for this option is TRUE.
  • convert allelic ratios to (unimodal) decrease in heterozygosity (d), as initially suggested by [7]. This step is performed internally in the PSSeg function.

PSSeg segmentation using RBS

We can now use the PSSeg function to segment signals. The method consists in three steps:

  • run a fast (yet approximate) segmentation on these signals in order to obtain a set of (at most hundreds of) candidate change points. This is done using Recursive Binary Segmentation (RBS [2]);
  • prune the obtained set of change points using dynamic programming [1]
  • select the best number of change points using a model selection criterion proposed by [3]

Initial segmentation and pruning

resRBS <- PSSeg(data=datS, K=2*K, method="RBS", stat=c("c", "d"), profile=TRUE)

Note that this is fast:

resRBS$prof[, "time"]
## segmentation        dpseg 
##            0            0

Plot segmented profile

To plot the PSSeg segmentation results together with the true breakpoints, do :

plotSeg(datS, list(true=sim$bkp, est=resRBS$bestBkp))

Results evaluation

The PSSeg function returns the original segmentation (by RBS), the result of the pruning step, and the best model (among those selected by dynamic programming) according to the criterion proposed by [3].

The quality of the best segmentation can be assessed as follows. The number of true positives (TP) is the number of true change points for which there exists a candidate change point closer than a given tolerance tol. The number of false positives is defined as the number of true negatives (all those which are not change points) for which the candidate change points are out of tolerance area and those in tolerance area where there already exists a candidate change point. %The true negative rate (TNR) is defined as 1-FPR. % True negative are defined as the midpoints of intervals between true change points (augmented by points 0 and n + 1, where n is the number of loci. The true negative rate (TNR) is the proportion of true negatives for which there is no candidate change point closer than `tol}. By construction, TP ∈ {0, 1, ⋯, K} where K is the number of true change points.

print(getTpFp(resRBS$bestBkp, sim$bkp, tol=5))
## TP FP 
## 10  4

Obviously, this performance measure depends on the chosen tolerance:

perf <- sapply(0:10, FUN=function(tol) {
    getTpFp(resRBS$bestBkp, sim$bkp, tol=tol,relax = -1)
})
print(perf)
##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## TP    5    8    8    9   10   10   10   10   10    10    10
## FP    9    6    6    5    4    4    4    4    4     4     4

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] jointseg_1.0.2 knitr_1.49     rmarkdown_2.29
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] maketools_1.3.1   matrixStats_1.4.1 cachem_1.1.0      htmltools_0.5.8.1
##  [9] acnr_1.0.0        buildtools_1.0.0  lifecycle_1.0.4   cli_3.6.3        
## [13] sass_0.4.9        jquerylib_0.1.4   compiler_4.4.2    sys_3.4.3        
## [17] tools_4.4.2       evaluate_1.0.1    bslib_0.8.0       yaml_2.3.10      
## [21] DNAcopy_1.81.0    jsonlite_1.8.9    rlang_1.1.4

Citing jointseg

citation("jointseg")
## To cite package 'jointseg' in publications, please use the following
## references:
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial (). jointseg:
##   Joint segmentation of multivariate (copy number) signals.R package
##   version 1.0.2.
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial. Performance
##   evaluation of DNA copy number segmentation methods.  Briefings in
##   Bioinformatics (2015) 16 (4): 600-615.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

References

[1] Bellman, Richard. 1961. “On the Approximation of Curves by Line Segments Using Dynamic Programming.” Communications of the ACM 4 (6). ACM: 284.

[2] Gey, Servane, et al. 2008. “Using CART to Detect Multiple Change Points in the Mean for Large Sample.” https://hal.archives-ouvertes.fr/hal-00327146.

[3] Lebarbier, E. 2005. “Detecting Multiple Change-Points in the Mean of Gaussian Process by Model Selection.” Signal Processing 85 (4): 717-36.

[4] Neuvial, Pierre, et al. 2011. “Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies.” In Handbook of Statistical Bioinformatics, 1st ed. Springer Handbooks of Computational Statistics. Springer.

[5] Olshen, A B, et al.. 2004. “Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data.” Biostatistics 5 (4): 557-72.

[6] Pierre-Jean, Morgane, et al. 2015. “Performance Evaluation of DNA Copy Number Segmentation Methods.” Briefings in Bioinformatics, no. 4: 600-615.

[7] Staaf, Johan, et al. 2008. “Segmentation-Based Detection of Allelic Imbalance and Loss-of-Heterozygosity in Cancer Cells Using Whole Genome SNP Arrays.” Genome Biology 9 (9). BioMed Central: R136.