The covercorr package implements the coverage correlation coefficient introduced in the paper Coverage correlation: detecting singular dependencies between random variables. The coverage correlation coefficient is a nonparametric measure of statistical association designed to detect dependencies concentrated on low-dimensional structures within the joint distribution of two (or more) random variables or vectors. Based on Monge-Kantorovich ranks and geometric coverage processes, this statistic quantifies the extent to which the joint distribution concentrates on a singular subset with respect to the product of the marginals. The coverage correlation coefficient is distribution-free, admits an analytically tractable asymptotic null distribution, and can be computed efficiently, making it well-suited for uncovering complex, potentially nonlinear associations in large-scale testing.
This vignette covers the original pairwise function
coverage_correlation(), and then the functions added in
version 1.1.0: the unified front-end covercorr(),
deterministic reference grids via
coverage_correlation_grid(), and joint dependence among
many variables via coverage_correlation_K().
In this example, we demonstrate how to use
coverage_correlation() with a simple simulation. We compute
the coverage correlation coefficient between two one-dimensional Normal
random variables, X and Y, and then vary the
strength of their relationship to observe how the statistic changes.
set.seed(1)
n <- 100
X <- rnorm(n)
Y <- rnorm(n)
result <- coverage_correlation(Y, X, visualise = TRUE)result
#>
#> Coverage correlation
#>
#> statistic = 0.03622657, p-value = 0.1151168
#> n = 100, dimension = 2, method = exactIn the example above, X and Y are
independent.
The parameter visualise defaults to FALSE,
but setting it to TRUE produces a plot that illustrates the
intuition behind the coverage correlation coefficient. The coverage
correlation coefficient first transforms X and
Y into their Monge-Kantorovich ranks, denoted by
X_rank and Y_rank, which are uniformly
distributed on \([0, 1]\). The plot
displays the pairs \((X_{\text{rank}_i},
Y_{\text{rank}_i})\) along with cubes of volume \(n^{-1}\).
Inside the function coverage_correlation(), we compute
\(V_n\), the total uncovered area after
taking the union of these cubes. The coverage correlation coefficient is
then defined as \[
\kappa_n^{X, Y} := \frac{V_n - e^{-1}}{1 - e^{-1}}.
\]
The function returns an object of class "covercorr" (a
list) with the following elements:
$stat: the value of the coverage correlation
coefficient.$pval: the p-value of this statistic under the null
hypothesis of independence.$method: the method used to compute the statistic
(e.g., "exact" or "approx").$mc_se: if method "approx" was used,
mc_se is the standard error of the Monte Carlo
approximation, otherwise it is 0.Printing the result (as above) uses the print method for
the "covercorr" class; use str(result) to see
the full list structure.
By default, method = "auto". In this mode, if the
total dimension of X and Y
(i.e., ncol(X) + ncol(Y), treating vectors as
one-dimensional) is at most 6, the method is set to
"exact"; otherwise, it uses "approx".
Next we can see how the result changes as we introduce dependence
between X and Y.
set.seed(2)
n <- 100
X <- rnorm(n)
Z <- rnorm(n)
rho <- 0.9
Y <- rho * X + sqrt(1 - rho^2) * Z
result <- coverage_correlation(Y, X, visualise = TRUE)result
#>
#> Coverage correlation
#>
#> statistic = 0.3591986, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exactYou may notice parts of some cubes appearing at the corners of the plot. This happens because we treat \([0, 1]^2\) as a torus. If a cube centered at one of the rank points lies partially outside \([0, 1]^2\), we wrap it around so that the plot reflects this topology.
The coverage correlation coefficient can handle multidimensional random vectors as well.
set.seed(3)
n <- 100
p <- 2
X <- matrix(rnorm(p * n), ncol = p)
Y <- matrix(0, nrow = n, ncol = p)
Y[, 1] <- X[, 1]^2
Y[, 2] <- X[, 1] * X[, 2]
result <- coverage_correlation(Y, X)
result
#>
#> Coverage correlation
#>
#> statistic = 0.273216, p-value = < 2.2204e-16
#> n = 100, dimension = 4, method = exactIn this case we cannot visualise the whole plot as X and
Y are not one-dimensional.
In the example below, X and Y are
independent and 2-dimensional. We set the method parameter
equal to "approx".
set.seed(4)
n <- 50
p <- 2
X <- matrix(rnorm(p * n), ncol = p)
Y <- matrix(rnorm(p * n), ncol = p)
result <- coverage_correlation(Y, X, method = "approx")
result
#>
#> Coverage correlation
#>
#> statistic = 0.03919494, p-value = 0.0682784
#> n = 50, dimension = 4, method = approx
#> Monte Carlo standard error = 0.02595504The Monge-Kantorovich rank transformation requires a set of reference
points. By default coverage_correlation() generates these
randomly with runif(), which means repeated calls on the
same data can give slightly different results. When both inputs are
one-dimensional, the natural reference set is the deterministic uniform
grid \(\{1/n, 2/n, \ldots, 1\}\), which
makes the computation reproducible.
coverage_correlation_grid() uses this grid by default.
set.seed(5)
n <- 100
X <- rnorm(n)
Y <- sin(3 * X) + rnorm(n) * 0.05
# Deterministic: repeated calls give identical results
r1 <- coverage_correlation_grid(Y, X)
r2 <- coverage_correlation_grid(Y, X)
c(r1$stat, r2$stat)
#> [1] 0.3624634 0.3624634You can also supply your own reference points through u
and v, for example a low-discrepancy sequence, as long as
they match the shape of the corresponding input.
covercorr()covercorr() is a single front-end that dispatches to the
appropriate routine based on its input. It accepts a pair of variables,
a matrix or data frame whose columns are treated as variables, or a list
of variables. The reference argument chooses between random
reference points ("random", the default) and the
deterministic grid ("deterministic").
set.seed(6)
n <- 100
X <- rnorm(n)
Y <- sin(3 * X) + rnorm(n) * 0.05
# Pairwise, random reference points
covercorr(X, Y)
#>
#> Coverage correlation
#>
#> statistic = 0.4009571, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exact
# Pairwise, deterministic grid
covercorr(X, Y, reference = "deterministic")
#>
#> Coverage correlation
#>
#> statistic = 0.3661019, p-value = < 2.2204e-16
#> n = 100, dimension = 2, method = exactWhen given a matrix or data frame, each column is treated as a separate variable. A two-column input is handled as the pairwise case; more columns are treated jointly as several variables.
coverage_correlation_K() generalises the coefficient
from a pair to K variables, supplied as a list. It tests
the null hypothesis of mutual independence among all
the inputs.
set.seed(8)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
coverage_correlation_K(list(X1, X2, X3))
#>
#> Coverage correlation
#>
#> statistic = 0.4638644, p-value = < 2.2204e-16
#> n = 100, dimension = 3, method = exactThe same dependence can be tested through covercorr() by
passing a list, or a matrix with more than two columns.
set.seed(9)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
covercorr(cbind(X1, X2, X3))
#>
#> Coverage correlation
#>
#> statistic = 0.439989, p-value = < 2.2204e-16
#> n = 100, dimension = 3, method = exactA deterministic-grid p-value is available for the pairwise case and
for the two-variable fixed-grid case. For three or more variables with a
fixed grid, the null distribution has not been derived, so
coverage_correlation_K_grid() (and covercorr()
with reference = "deterministic" on three or more
variables) returns pval = NA and reports
pval_available = FALSE, while still computing the statistic
itself.
set.seed(10)
n <- 100
X1 <- rnorm(n)
X2 <- sin(3 * X1) + rnorm(n) * 0.05
X3 <- X1 * X2 + rnorm(n) * 0.1
res <- coverage_correlation_K_grid(list(X1, X2, X3))
#> No p-value is available for the K-variable (K > 2) fixed-grid case: the null distribution has not been derived. Returning pval = NA.
res$stat
#> [1] 0.4222658
res$pval
#> [1] NA
res$pval_available
#> [1] FALSEThe examples above use simulated data so that the dependence
structure is known. The package also ships a real dataset,
CD8T, containing single-cell RNA-seq gene expression
measurements for 9369 fetal CD8+ T cells across 1000 genes (Suo et al.,
2022). We can use it to illustrate visualise_density(),
which gives a picture of where in the unit square a pair of
variables departs from independence.
visualise_density() takes the two Monge-Kantorovich rank
vectors and estimates their joint density with a toroidally-wrapped
two-dimensional kernel density estimate (the same torus topology used in
the cube plots above, so that mass near an edge is not lost). It then
plots the density on a log scale relative to the uniform density:
regions redder than the mid-tone are more concentrated
than they would be under independence, bluer regions are
less concentrated, and a flat mid-toned image
corresponds to independence.
For illustration we take two genes; you can substitute any pair of column indices. We first compute the coverage correlation coefficient for the pair, then place the raw scatter plot and the rank-density plot side by side.
data(CD8T)
n <- nrow(CD8T)
i <- 1
j <- 2
x <- CD8T[, i]
y <- CD8T[, j]
# The coefficient and its p-value for this pair
covercorr(x, y)
#>
#> Coverage correlation
#>
#> statistic = 0.0132117, p-value = 1.24082e-05
#> n = 9369, dimension = 2, method = exact
# Monge-Kantorovich ranks, using random reference points
set.seed(1)
x_rank <- MK_rank(as.matrix(x), runif(n))
y_rank <- MK_rank(as.matrix(y), runif(n))
op <- par(mfrow = c(1, 2))
plot(x, y, pch = ".", xlab = colnames(CD8T)[i], ylab = colnames(CD8T)[j])
visualise_density(x_rank, y_rank)Single-cell expression data are sparse, so many cells share the value
zero for a given gene; MK_rank() breaks these ties at
random, which is why the scatter plot shows banding along the axes.
Pairs that the coverage correlation test flags as dependent typically
show clear ridges of concentrated (red) mass in the density plot,
whereas independent pairs give an essentially flat image.