coverage_correlation

The covercorr package implements the coverage correlation coefficient introduced in the paper Coverage correlation: detecting singular dependencies between random variables. The coverage correlation coefficient is a nonparametric measure of statistical association designed to detect dependencies concentrated on low-dimensional structures within the joint distribution of two (or more) random variables or vectors. Based on Monge-Kantorovich ranks and geometric coverage processes, this statistic quantifies the extent to which the joint distribution concentrates on a singular subset with respect to the product of the marginals. The coverage correlation coefficient is distribution-free, admits an analytically tractable asymptotic null distribution, and can be computed efficiently, making it well-suited for uncovering complex, potentially nonlinear associations in large-scale testing.

This vignette covers the original pairwise function coverage_correlation(), and then the functions added in version 1.1.0: the unified front-end covercorr(), deterministic reference grids via coverage_correlation_grid(), and joint dependence among many variables via coverage_correlation_K().

The pairwise coefficient

Example 1

In this example, we demonstrate how to use coverage_correlation() with a simple simulation. We compute the coverage correlation coefficient between two one-dimensional Normal random variables, X and Y, and then vary the strength of their relationship to observe how the statistic changes.

set.seed(1)
n <- 100
X <- rnorm(n)
Y <- rnorm(n)
result <- coverage_correlation(Y, X, visualise = TRUE)

result
#> 
#> Coverage correlation
#> 
#> statistic = 0.03622657, p-value = 0.1151168
#> n = 100, dimension = 2, method = exact

In the example above, X and Y are independent.

The parameter visualise defaults to FALSE, but setting it to TRUE produces a plot that illustrates the intuition behind the coverage correlation coefficient. The coverage correlation coefficient first transforms X and Y into their Monge-Kantorovich ranks, denoted by X_rank and Y_rank, which are uniformly distributed on $[0, 1]$. The plot displays the pairs $(X_{\text{rank}_i}, Y_{\text{rank}_i})$ along with cubes of volume $n^{-1}$.

Inside the function coverage_correlation(), we compute $V_n$, the total uncovered area after taking the union of these cubes. The coverage correlation coefficient is then defined as \[ \kappa_n^{X, Y} := \frac{V_n - e^{-1}}{1 - e^{-1}}. \]

The function returns an object of class "covercorr" (a list) with the following elements:

$stat: the value of the coverage correlation coefficient.
$pval: the p-value of this statistic under the null hypothesis of independence.
$method: the method used to compute the statistic (e.g., "exact" or "approx").
$mc_se: if method "approx" was used, mc_se is the standard error of the Monte Carlo approximation, otherwise it is 0.

Printing the result (as above) uses the print method for the "covercorr" class; use str(result) to see the full list structure.

By default, method = "auto". In this mode, if the total dimension of X and Y (i.e., ncol(X) + ncol(Y), treating vectors as one-dimensional) is at most 6, the method is set to "exact"; otherwise, it uses "approx".

Next we can see how the result changes as we introduce dependence between X and Y.

set.seed(2)
n <- 100
X <- rnorm(n)
Z <- rnorm(n)
rho <- 0.9
Y <- rho * X + sqrt(1 - rho^2) * Z
result <- coverage_correlation(Y, X, visualise = TRUE)

result
#> 
#> Coverage correlation
#> 
#> statistic = 0.3591986, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exact

You may notice parts of some cubes appearing at the corners of the plot. This happens because we treat $[0, 1]^2$ as a torus. If a cube centered at one of the rank points lies partially outside $[0, 1]^2$, we wrap it around so that the plot reflects this topology.

Example 2

The coverage correlation coefficient can handle multidimensional random vectors as well.

set.seed(3)
n <- 100
p <- 2
X <- matrix(rnorm(p * n), ncol = p)
Y <- matrix(0, nrow = n, ncol = p)
Y[, 1] <- X[, 1]^2
Y[, 2] <- X[, 1] * X[, 2]
result <- coverage_correlation(Y, X)
result
#> 
#> Coverage correlation
#> 
#> statistic = 0.273216, p-value = < 2.2204e-16
#> n = 100, dimension = 4, method = exact

In this case we cannot visualise the whole plot as X and Y are not one-dimensional.

Example 3

In the example below, X and Y are independent and 2-dimensional. We set the method parameter equal to "approx".

set.seed(4)
n <- 50
p <- 2
X <- matrix(rnorm(p * n), ncol = p)
Y <- matrix(rnorm(p * n), ncol = p)
result <- coverage_correlation(Y, X, method = "approx")
result
#> 
#> Coverage correlation
#> 
#> statistic = 0.03919494, p-value = 0.0682784
#> n = 50, dimension = 4, method = approx
#> Monte Carlo standard error = 0.02595504

Deterministic reference grids

The Monge-Kantorovich rank transformation requires a set of reference points. By default coverage_correlation() generates these randomly with runif(), which means repeated calls on the same data can give slightly different results. When both inputs are one-dimensional, the natural reference set is the deterministic uniform grid $\{1/n, 2/n, \ldots, 1\}$, which makes the computation reproducible. coverage_correlation_grid() uses this grid by default.

set.seed(5)
n <- 100
X <- rnorm(n)
Y <- sin(3 * X) + rnorm(n) * 0.05

# Deterministic: repeated calls give identical results
r1 <- coverage_correlation_grid(Y, X)
r2 <- coverage_correlation_grid(Y, X)
c(r1$stat, r2$stat)
#> [1] 0.3624634 0.3624634

You can also supply your own reference points through u and v, for example a low-discrepancy sequence, as long as they match the shape of the corresponding input.

The unified interface: `covercorr()`

covercorr() is a single front-end that dispatches to the appropriate routine based on its input. It accepts a pair of variables, a matrix or data frame whose columns are treated as variables, or a list of variables. The reference argument chooses between random reference points ("random", the default) and the deterministic grid ("deterministic").

set.seed(6)
n <- 100
X <- rnorm(n)
Y <- sin(3 * X) + rnorm(n) * 0.05

# Pairwise, random reference points
covercorr(X, Y)
#> 
#> Coverage correlation
#> 
#> statistic = 0.4009571, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exact

# Pairwise, deterministic grid
covercorr(X, Y, reference = "deterministic")
#> 
#> Coverage correlation
#> 
#> statistic = 0.3661019, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exact

When given a matrix or data frame, each column is treated as a separate variable. A two-column input is handled as the pairwise case; more columns are treated jointly as several variables.

set.seed(7)
n <- 100
X <- rnorm(n)
Y <- sin(3 * X) + rnorm(n) * 0.05
covercorr(cbind(X, Y))
#> 
#> Coverage correlation
#> 
#> statistic = 0.37483, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exact

Joint dependence among many variables

coverage_correlation_K() generalises the coefficient from a pair to K variables, supplied as a list. It tests the null hypothesis of mutual independence among all the inputs.

set.seed(8)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
coverage_correlation_K(list(X1, X2, X3))
#> 
#> Coverage correlation
#> 
#> statistic = 0.4638644, p-value = < 2.2204e-16
#> n = 100, dimension = 3, method = exact

The same dependence can be tested through covercorr() by passing a list, or a matrix with more than two columns.

set.seed(9)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
covercorr(cbind(X1, X2, X3))
#> 
#> Coverage correlation
#> 
#> statistic = 0.439989, p-value = < 2.2204e-16
#> n = 100, dimension = 3, method = exact

A note on the K-variable fixed-grid case

A deterministic-grid p-value is available for the pairwise case and for the two-variable fixed-grid case. For three or more variables with a fixed grid, the null distribution has not been derived, so coverage_correlation_K_grid() (and covercorr() with reference = "deterministic" on three or more variables) returns pval = NA and reports pval_available = FALSE, while still computing the statistic itself.

set.seed(10)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
res <- coverage_correlation_K_grid(list(X1, X2, X3))
#> No p-value is available for the K-variable (K > 2) fixed-grid case: the null distribution has not been derived. Returning pval = NA.
res$stat
#> [1] 0.4222658
res$pval
#> [1] NA
res$pval_available
#> [1] FALSE

Visualising the joint density on real data

The examples above use simulated data so that the dependence structure is known. The package also ships a real dataset, CD8T, containing single-cell RNA-seq gene expression measurements for 9369 fetal CD8+ T cells across 1000 genes (Suo et al., 2022). We can use it to illustrate visualise_density(), which gives a picture of where in the unit square a pair of variables departs from independence.

visualise_density() takes the two Monge-Kantorovich rank vectors and estimates their joint density with a toroidally-wrapped two-dimensional kernel density estimate (the same torus topology used in the cube plots above, so that mass near an edge is not lost). It then plots the density on a log scale relative to the uniform density: regions redder than the mid-tone are more concentrated than they would be under independence, bluer regions are less concentrated, and a flat mid-toned image corresponds to independence.

For illustration we take two genes; you can substitute any pair of column indices. We first compute the coverage correlation coefficient for the pair, then place the raw scatter plot and the rank-density plot side by side.

data(CD8T)
n <- nrow(CD8T)

i <- 1
j <- 2
x <- CD8T[, i]
y <- CD8T[, j]

# The coefficient and its p-value for this pair
covercorr(x, y)
#> 
#> Coverage correlation
#> 
#> statistic = 0.0132117, p-value = 1.24082e-05
#> n = 9369, dimension = 2, method = exact

# Monge-Kantorovich ranks, using random reference points
set.seed(1)
x_rank <- MK_rank(as.matrix(x), runif(n))
y_rank <- MK_rank(as.matrix(y), runif(n))

op <- par(mfrow = c(1, 2))
plot(x, y, pch = ".", xlab = colnames(CD8T)[i], ylab = colnames(CD8T)[j])
visualise_density(x_rank, y_rank)

par(op)

Single-cell expression data are sparse, so many cells share the value zero for a given gene; MK_rank() breaks these ties at random, which is why the scatter plot shows banding along the axes. Pairs that the coverage correlation test flags as dependent typically show clear ridges of concentrated (red) mass in the density plot, whereas independent pairs give an essentially flat image.