---
title: "Autocorrelation plots"
author: "Morgane Pierre-Jean and Pierre Neuvial"
date: "14/12/2016"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEncoding{UTF-8}
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

This document aims to test whether DNA copy number profiles built by resampling within the `acnr`package are "realist", i.e. whether they have a similar spatial autocorrelation structure as the original data.

```{r, message=FALSE}
library("acnr")
library("R.utils")
```

We focus on one data set of the package, and look at the sample with the highest cellularity in this data set:
```{r}
dataSets <- listDataSets()
dataSet <- dataSets[2]
print(dataSet)

tfs <- listTumorFractions(dataSet)
tf <- tfs[1]
(tf)

dat <- loadCnRegionData(dataSet, tumorFraction=tf)
region <- as.factor(dat$region)
```

We focus here on total copy numbers:

```{r}
x <- dat$c
rm(dat)
```


## Autocorrelation plots

```{r, fig.width=4, fig.height=2, message=FALSE}
lim <- c(0, 0.05)
res <- sapply(levels(region), function (rr){
    xTrue <- x[which(region==rr)]
    xResamp <- sample(xTrue)
    mar <- c(2, 2, 1, 0)+0.2
    par(mar=mar)
    acf(xTrue, ylim=lim)
    mtext(sprintf("Original data: %s", rr), side=3)
    acf(xResamp, ylim=lim)
    mtext(sprintf("Resampled data: %s", rr), side=3)
})
```

From the autocorrelation plots it seems that the original data are slightly more autocorrelated than the resampled data. In order to check this hypothesis more quantatively, we perform the Ljung-Box test of the null hypothesis

* H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).

against the alternative hypothesis

* H1: The data are not independently distributed; they exhibit serial correlation

## Autocorrelation tests

```{r, results='asis'}
tst <- "Ljung-Box"
lag <- 20
res <- sapply(levels(region), function (rr){
    xTrue <- x[which(region==rr)]
    len <- length(xTrue)
    xResamp <- sample(xTrue)
    bt <- Box.test(xTrue, lag=lag, type=tst)
    br <- Box.test(xResamp, lag=lag, type=tst)
    c("Original data"=bt$p.value, "Resampled data"=br$p.value, "Region size"=len)
})
cpt <- sprintf("$p$-values of the %s auto-correlation test (lag=%s)", tst, lag)
knitr::kable(t(res), caption=cpt)
```

```{r}
plot(sort(res[2, ]), ylab="p-value", xlab="rank", main="sorted p-values", pch=19, col=3, ylim=c(0,1))
points(sort(res[1, ]), pch=19, col=1)
abline(a=0, b=1/ncol(res), lty=2)
```

The original data is clearly more autocorrelated than the resampled data. Most of the tests reject the null hypothesis for the original data, while they are not rejected on the resampled data. This implies that there exists a spatial correlation in the original data, and perfoming resampling breaks this correlation.

However, we note that this auto-correlation is sufficently weak not to be detected by the Ljung-Box test when applied on "only" 2,000 data points per region:

```{r, results='asis'}
tst <- "Ljung-Box"
lag <- 20
maxSize <- 2000
res2 <- sapply(levels(region), function (rr){
    xTrue <- head(x[which(region==rr)], maxSize)
    len <- length(xTrue)
    xResamp <- sample(xTrue)
    bt <- Box.test(xTrue, lag=lag, type=tst)
    br <- Box.test(xResamp, lag=lag, type=tst)
    c("Original data"=bt$p.value, "Resampled data"=br$p.value, "Region size"=len)
})
cpt <- sprintf("$p$-values of the %s auto-correlation test (lag=%s) for regions of size <= %s", tst, lag, maxSize)
knitr::kable(t(res2), caption=cpt)
```

```{r}
plot(sort(res2[2, ]), ylab="p-value", xlab="rank", main="sorted p-values", pch=19, col=3, ylim=c(0,1))
points(sort(res2[1, ]), pch=19, col=1)
abline(a=0, b=1/ncol(res), lty=2)
```

## Discussion

The original data is autocorrelated. This implies that the uniform resampling performed in the `jointseg` package yields copy number profiles whose signal distribution within regions of constant copy number level does not completely fit with the distribution of the original data. To address this issue, one possibility could be to perform block resampling. However, this raises the additional issue of chosing a block size.

We note that this autocorrelation is hardly detectable even in regions of 1,000 or 2,000 data points. Given the resolution of the Affymetrix GenomeWide SNP 6.0 chip type (on average, one data point every 1.5kb), this implies that the distribution of total copy numbers in regions smaller than a few megabases does provide a good approximation of the true distribution.

## TODO

* influence of celularity
* other data sets

## Session information

```{r}
sessionInfo()
```