Generating realistic data with known truth using the jointseg package

This vignette describes how to use the jointseg package to partition bivariate DNA copy number signals from SNP array data into segments of constant parent-specific copy number. We demonstrate the use of the PSSeg function of this package for applying two different strategies. Both strategies first identify a list of candidate change points through a fast (greedy) segmentation method, and then prune this list using dynamic programming [1]. The segmentation method presented here is Recursive Binary Segmentation (RBS, [2]). We refer to [3] for a more comprehensive performance assessment of this method and other segmentation methods.

Keywords: segmentation, change point model, binary segmentation, dynamic programming, DNA copy number, parent-specific copy number.

Please see the section "Citing jointseg" below for how to cite jointseg.


This vignette illustrates how the jointseg package may be used to generate a variety of copy-number profiles from the same biological “truth”. Such profiles have been used to compare the performance of segmentation methods in [3].

Citing `jointseg`

citation("jointseg")

## To cite package 'jointseg' in publications, please use the following
## references:
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial (2019).
##   jointseg: Joint segmentation of multivariate (copy number) signals.R
##   package version 1.0.2.
## 
##   Morgane Pierre-Jean, Guillem Rigaill and Pierre Neuvial. Performance
##   evaluation of DNA copy number segmentation methods.  Briefings in
##   Bioinformatics (2015) 16 (4): 600-615.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

Setup

The parameters are defined as follows:

library("jointseg")                      ## provides getCopyNumberDataByResampling() and plotSeg()

n <- 1e4                                 ## signal length
bkp <- c(2334, 6121)                     ## breakpoint positions
regions <- c("(1,1)", "(1,2)", "(0,2)")  ## copy number regions

ylims <- cbind(c(0, 5), c(-0.1, 1.1))
colG <- rep("#88888855", n)
hetCol <- "#00000088"
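As a quick sanity check on these parameters, the two breakpoints split the signal into three segments, one per entry of regions. The base R sketch below tabulates them. It assumes that a breakpoint at position b marks the last locus of a segment, which is a convention chosen for illustration and not confirmed by this vignette. The name segs is introduced here for illustration only.

```r
n <- 1e4
bkp <- c(2334, 6121)
regions <- c("(1,1)", "(1,2)", "(0,2)")

## Assumption: a breakpoint at position b is the last locus of its segment
starts <- c(1, bkp + 1)
ends <- c(bkp, n)

## 'segs' is an illustrative name, not an object created by jointseg
segs <- data.frame(region = regions, start = starts, end = ends,
                   length = ends - starts + 1)
print(segs)
```

The segment lengths add up to n, and there is always one more region than there are breakpoints.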

For convenience we define a custom plot function for this vignette:

plotFUN <- function(dataSet, tumorFraction) {
    regDat <- acnr::loadCnRegionData(dataSet=dataSet, tumorFraction=tumorFraction)
    sim <- getCopyNumberDataByResampling(n, bkp=bkp,
                                         regions=regions, regData=regDat)
    dat <- sim$profile
    wHet <- which(dat$genotype==1/2)
    colGG <- colG
    colGG[wHet] <- hetCol
    plotSeg(dat, sim$bkp, col=colGG)
}
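To give an idea of what generating a profile by resampling means, here is a minimal base R sketch of the principle: for each segment, observations are drawn with replacement from a pool of data annotated with that segment's region label. This is a toy analogue under stated assumptions, not the code of getCopyNumberDataByResampling. The pools object, its simulated values and the function resampleProfile are hypothetical, and only a single total copy number column is generated.

```r
set.seed(1)

## Hypothetical pools standing in for region-level data:
## one vector of total copy number values per region label
pools <- list(
  "(1,1)" = rnorm(500, mean = 2, sd = 0.2),
  "(1,2)" = rnorm(500, mean = 3, sd = 0.2),
  "(0,2)" = rnorm(500, mean = 2, sd = 0.2)
)

## Toy resampling step (illustration only, not the jointseg implementation)
resampleProfile <- function(n, bkp, regions, pools) {
  stopifnot(length(regions) == length(bkp) + 1)
  starts <- c(1, bkp + 1)   ## assumed convention: bkp is the last locus of a segment
  ends <- c(bkp, n)
  signal <- numeric(n)
  for (k in seq_along(regions)) {
    idx <- starts[k]:ends[k]
    ## draw with replacement from the pool matching this segment's region
    signal[idx] <- sample(pools[[regions[k]]], length(idx), replace = TRUE)
  }
  list(profile = data.frame(c = signal), bkp = bkp)
}

toy <- resampleProfile(n = 1000, bkp = c(233, 612),
                       regions = c("(1,1)", "(1,2)", "(0,2)"), pools = pools)
str(toy)
```

Because each call draws a new sample, repeated calls give different profiles that share the same breakpoints and region labels, which is what makes the truth known.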

Affymetrix data

ds <- "GSE29172"

pct <- 1
plotFUN(ds, pct)

Data set GSE29172: 100% tumor cells

plotFUN(ds, pct)

Data set GSE29172: 100% tumor cells (another resampling)

pct <- 0.7
plotFUN(ds, pct)

Data set GSE29172: 70% tumor cells

pct <- 0.5
plotFUN(ds, pct)

Data set GSE29172: 50% tumor cells

Illumina data

ds <- "GSE11976"

Session information

sessionInfo()

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.51     jointseg_1.0.2 rmarkdown_2.31
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39     R6_2.6.1          fastmap_1.2.0     xfun_0.59        
##  [5] maketools_1.3.2   matrixStats_1.5.0 cachem_1.1.0      htmltools_0.5.9  
##  [9] acnr_1.0.0        buildtools_1.0.0  lifecycle_1.0.5   cli_3.6.6        
## [13] sass_0.4.10       jquerylib_0.1.4   compiler_4.6.1    sys_3.4.3        
## [17] tools_4.6.1       evaluate_1.0.5    bslib_0.11.0      yaml_2.3.12      
## [21] otel_0.2.0        jsonlite_2.0.0    rlang_1.2.0

References

[1] Bellman, Richard. 1961. “On the Approximation of Curves by Line Segments Using Dynamic Programming.” Communications of the ACM 4 (6). ACM: 284.

[2] Gey, Servane, et al. 2008. “Using CART to Detect Multiple Change Points in the Mean for Large Sample.” https://hal.archives-ouvertes.fr/hal-00327146.

[3] Pierre-Jean, Morgane, et al. 2015. “Performance Evaluation of DNA Copy Number Segmentation Methods.” Briefings in Bioinformatics 16 (4): 600-615.