Autocorrelation plots

Introduction

This document aims to test whether DNA copy number profiles built by resampling within the acnrpackage are “realist”, i.e. whether they have a similar spatial autocorrelation structure as the original data.

library("acnr")
library("R.utils")

We focus on one data set of the package, and look at the sample with the highest cellularity in this data set:

dataSets <- listDataSets()
dataSet <- dataSets[2]
print(dataSet)
## [1] "GSE13372_HCC1143"
tfs <- listTumorFractions(dataSet)
tf <- tfs[1]
(tf)
## [1] 1
dat <- loadCnRegionData(dataSet, tumorFraction=tf)
region <- as.factor(dat$region)

We focus here on total copy numbers:

x <- dat$c
rm(dat)

Autocorrelation plots

lim <- c(0, 0.05)
res <- sapply(levels(region), function (rr){
    xTrue <- x[which(region==rr)]
    xResamp <- sample(xTrue)
    mar <- c(2, 2, 1, 0)+0.2
    par(mar=mar)
    acf(xTrue, ylim=lim)
    mtext(sprintf("Original data: %s", rr), side=3)
    acf(xResamp, ylim=lim)
    mtext(sprintf("Resampled data: %s", rr), side=3)
})

From the autocorrelation plots it seems that the original data are slightly more autocorrelated than the resampled data. In order to check this hypothesis more quantatively, we perform the Ljung-Box test of the null hypothesis

  • H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).

against the alternative hypothesis

  • H1: The data are not independently distributed; they exhibit serial correlation

Autocorrelation tests

tst <- "Ljung-Box"
lag <- 20
res <- sapply(levels(region), function (rr){
    xTrue <- x[which(region==rr)]
    len <- length(xTrue)
    xResamp <- sample(xTrue)
    bt <- Box.test(xTrue, lag=lag, type=tst)
    br <- Box.test(xResamp, lag=lag, type=tst)
    c("Original data"=bt$p.value, "Resampled data"=br$p.value, "Region size"=len)
})
cpt <- sprintf("$p$-values of the %s auto-correlation test (lag=%s)", tst, lag)
knitr::kable(t(res), caption=cpt)
p-values of the Ljung-Box auto-correlation test (lag=20)
Original data Resampled data Region size
(0,1) 1.00e-07 0.1421875 8175
(0,2) 6.75e-04 0.8598743 14798
(1,1) 0.00e+00 0.2437166 16856
(1,2) 0.00e+00 0.3073736 15087
plot(sort(res[2, ]), ylab="p-value", xlab="rank", main="sorted p-values", pch=19, col=3, ylim=c(0,1))
points(sort(res[1, ]), pch=19, col=1)
abline(a=0, b=1/ncol(res), lty=2)

The original data is clearly more autocorrelated than the resampled data. Most of the tests reject the null hypothesis for the original data, while they are not rejected on the resampled data. This implies that there exists a spatial correlation in the original data, and perfoming resampling breaks this correlation.

However, we note that this auto-correlation is sufficently weak not to be detected by the Ljung-Box test when applied on “only” 2,000 data points per region:

tst <- "Ljung-Box"
lag <- 20
maxSize <- 2000
res2 <- sapply(levels(region), function (rr){
    xTrue <- head(x[which(region==rr)], maxSize)
    len <- length(xTrue)
    xResamp <- sample(xTrue)
    bt <- Box.test(xTrue, lag=lag, type=tst)
    br <- Box.test(xResamp, lag=lag, type=tst)
    c("Original data"=bt$p.value, "Resampled data"=br$p.value, "Region size"=len)
})
cpt <- sprintf("$p$-values of the %s auto-correlation test (lag=%s) for regions of size <= %s", tst, lag, maxSize)
knitr::kable(t(res2), caption=cpt)
p-values of the Ljung-Box auto-correlation test (lag=20) for regions of size <= 2000
Original data Resampled data Region size
(0,1) 0.2124001 0.2677989 2000
(0,2) 0.4166288 0.9566678 2000
(1,1) 0.8984522 0.5566155 2000
(1,2) 0.8605090 0.0589170 2000
plot(sort(res2[2, ]), ylab="p-value", xlab="rank", main="sorted p-values", pch=19, col=3, ylim=c(0,1))
points(sort(res2[1, ]), pch=19, col=1)
abline(a=0, b=1/ncol(res), lty=2)

Discussion

The original data is autocorrelated. This implies that the uniform resampling performed in the jointseg package yields copy number profiles whose signal distribution within regions of constant copy number level does not completely fit with the distribution of the original data. To address this issue, one possibility could be to perform block resampling. However, this raises the additional issue of chosing a block size.

We note that this autocorrelation is hardly detectable even in regions of 1,000 or 2,000 data points. Given the resolution of the Affymetrix GenomeWide SNP 6.0 chip type (on average, one data point every 1.5kb), this implies that the distribution of total copy numbers in regions smaller than a few megabases does provide a good approximation of the true distribution.

TODO

  • influence of celularity
  • other data sets

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] R.utils_2.12.3    R.oo_1.27.0       R.methodsS3_1.8.2 acnr_0.3.2       
## [5] rmarkdown_2.29   
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] maketools_1.3.1   cachem_1.1.0      knitr_1.49        htmltools_0.5.8.1
##  [9] buildtools_1.0.0  lifecycle_1.0.4   cli_3.6.3         sass_0.4.9       
## [13] jquerylib_0.1.4   compiler_4.4.2    sys_3.4.3         tools_4.4.2      
## [17] evaluate_1.0.1    bslib_0.8.0       yaml_2.3.10       jsonlite_1.8.9   
## [21] rlang_1.1.4