Rediscover is an R package to identify mutually exclusive genomic events. It reimplements a privious R package (Discover) whose main contribution is a statistical analysis based on the Poisson-Binomial distribution that takes into account that some samples are more mutated than others. Rediscover is much faster than the discover implementation.
Rediscover can be installed from CRAN repository:
install.packages("Rediscover")
The package library has two main parts:
Figure 1 is a flowchart that depicts Rediscover pipeline. Given a binary matrix of genes x samples where a 1 will appear if gene i is mutated in sample j and 0 otherwise, getPM
function estimates the probabilities \(p_{ij}\) of the gene i being mutated in sample j. The input matrix could be matrix
or Matrix class
. Usually, mutations stored in Matrix class
are a sparse object and therefore, require less memory to store them.
The second step is the estimation of the p-values using these probabilities and the number of samples where two genes are co-altered. Rediscover offers different functions depending on the desired study:
getMutex
if the user wants to evaluate if genes are mutually exclusive.getMutexAB
if the user wants to evaluate if genes are mutually exclusive with respect to another event (amplifications, deletions, etc…)getMutexGroup
will be used when the user wants to obtain the probability that a certain group of genes being mutually exclusive. Unlike the getMutex
function, in this case the users introduces the set of genes of interest.Rediscover also provides a function to integrate its usage with maftools
and TCGAbiolinks
. Specifically, we added to the function somaticInteractions
from maftools
our analyses based on the Poisson-Binomial distribution resulting in a new function called discoversomaticInteractions
.
Given a binary matrix A with the mutation information, getPM
estimates the probabilities \(p_{ij}\) of the gene i being mutated in sample j. To make the package more memory efficient, we created an S4 class
object called PMatrix
that gives access to the values of the resulting probability matrix.
The required input for this function is:
The following code chunk shows an example of how to perform this function:
data("A_example")
PMA <- getPM(A_example)
PMA[1:4,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 0.4813047 0.4853081 0.4743055 0.4673148
## [2,] 0.5133379 0.5173420 0.5063276 0.4993132
## [3,] 0.4993267 0.5033346 0.4923142 0.4853031
## [4,] 0.5093353 0.5133411 0.5023232 0.4953086
As already mentioned, getPM
allows matrices of class matrix
as depicted on previous example. But this functions also supports matrices of type Matrix
. The following code chunk shows the same example but using a matrix of class Matrix
.
data("A_Matrix")
class(A_Matrix)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
PMA <- getPM(A_Matrix)
PMA[1:4,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 0.4813047 0.4853081 0.4743055 0.4673148
## [2,] 0.5133379 0.5173420 0.5063276 0.4993132
## [3,] 0.4993267 0.5033346 0.4923142 0.4853031
## [4,] 0.5093353 0.5133411 0.5023232 0.4953086
Finally, the next code chunk shows an example of applying this function to a real case. Specifically, information available in The Cancer Genome Atlas Program (TCGA) on colon adenocarcinoma (COAD) has been used.
data("TCGA_COAD")
PM_COAD <- getPM(TCGA_COAD)
From the probability matrix and the number of samples in which two genes are co-mutated, p-values can be estimated using the Poisson-Binomial distribution. That is, the probability of two genes being mutually exclusive is obtained. The p-values obtained are stored in a Matrix
class in order to be memory efficient.
Rediscover offers different functions depending on the desired study: Using a single matrix (getMutex
function), using two matrices (getMutexAB
function), using a specific group of genes (getMutexGroup
function).
As explained below, the p-values can be estimated using the Poisson-Binomial distribution.
Both functions getMutex
and getMutexAB
have four different approaches to estimate the
Poisson-Binomial distribution:
PoissonBinomial
Rpackage based on the work from (Biscarri, Zhao, & Brunner, 2018).We can perform the mutual exclusivity test for all pairs of genes of a single matrix using the getMutex
function.
The inputs of getMutex
are:
getPM
. By default is equal to getPM(A)
.The following code chunk shows an example of how to perform this function:
data("A_example")
PMA <- getPM(A_example)
mymutex <- getMutex(A=A_example,PM=PMA)
As in the previous case, an example is shown when using matrices of class Matrix.
data("A_Matrix")
PMA_Matrix <- getPM(A_Matrix)
mymutex <- getMutex(A=A_Matrix,PM=PMA_Matrix)
Finally, as in the previous case, an example of how to apply this function to a real case has been carried out. Specifically, information available in The Cancer Genome Atlas Program (TCGA) on colon adenocarcinoma (COAD) has been used. We applied to the top 100 most mutated genes.
data("TCGA_COAD")
data("PM_COAD")
COAD_mutex <- getMutex(TCGA_COAD[1:100,], PM_COAD[1:100,])
Moreover, there are some extra inputs that if user want to use the exact formula of the Poison-Binomial distribution:
By default, the mixed parameter is set to TRUE
and the exact p-value is computed for p-values lower that the upper threshold (the parameter th, which default value is 0.05). I.e. if mixed option is selected and th is set to 1, all the p-values are computed with the exact formula. The following code chunk shows performs the previous test but applying the exact method for p-values lower than 0.001.
data("TCGA_COAD")
data("PM_COAD")
COAD_mutex_exact <- getMutex(TCGA_COAD[1:100,], PM_COAD[1:100,],mixed = TRUE,th = 0.001)
If the mixed parameter is set to FALSE
all the p-values are computed with the approximation selected. The following code chunk show an example of how to performe the previous test but applying the Binomial
approximation for all the p-values:
data("TCGA_COAD")
data("PM_COAD")
COAD_mutex_exact <- getMutex(TCGA_COAD[1:100,], PM_COAD[1:100,],mixed = FALSE,method = "Binomial")
The second option is using two matrices. As in the first case, it is also necessary to enter the previously obtained probability matrix in addition to the initial matrix. But, unlike the previous case, an extra matrix B is used which has the same shape as matrix A, but may contain additional information, such as gene amplifications i in samples j. In this way, it will be necessary to enter both probability matrices in addition to the initial matrices. In the case where this function is applied directly, getMutex
allows not to enter the probability matrices, since it would be calculated internally in order to use this function.
Therefore, the inputs required by getMutexAB
are:
getPM
. By default is equal to getPM(A)
.getPM
. By default is equal to getPM(B)
.In addition, in this case there are also some extra possible entries that have been previously defined, but could be modified by the user:
Continuing with the example, in this case the result will not be a matrix with the probability that genes being mutually exclusive, but rather a matrix with the probability that genes being amplified. The following code chunk shows an example of how to perform this function with the default parameters.
data("A_example")
data("B_example")
PMA <- getPM(A_example)
PMB <- getPM(B_example)
mismutex <- getMutexAB(A=A_example, PM=PMA, B=B_example, PMB = PMB)
## as(<dgeMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
As in the previous cases, the following code chunk shows an example when using matrices of class Matrix.
data("A_Matrix")
data("B_Matrix")
PMA <- getPM(A_Matrix)
PMB <- getPM(B_Matrix)
mismutex <- getMutexAB(A=A_Matrix, PM=PMA, B=B_Matrix, PMB = PMB)
Finally, the next code chunk shows an example of how to apply this function to a real case. Specifically, information available in The Cancer Genome Atlas Program (TCGA) on colon adenocarcinoma (COAD) has been used. In this case, the TCGA_COAD_AMP
matrix correspond to the most 1000 mutated genes, and the extra matrix AMP_COAD
provides information about the amplifications over 100 genes. The example run getMutexAB
with the default parameters. The code, also explains how to merge different datasets.
data("TCGA_COAD")
data("PM_COAD")
data("AMP_COAD")
data("PM_AMP_COAD")
common <- intersect(colnames(TCGA_COAD), colnames(AMP_COAD))
keep <- match(common,colnames(TCGA_COAD))
TCGA_COAD_100 <- TCGA_COAD[1:100,keep]
PM_TCGA_COAD_100 <- PM_COAD[1:100,keep]
keep <- match(common,colnames(AMP_COAD))
AMP_COAD_100 <- AMP_COAD[1:100,keep]
PM_AMP_COAD_100 <- PM_AMP_COAD[1:100,keep]
mismutex <- getMutexAB(A=TCGA_COAD_100, PMA=PM_TCGA_COAD_100,
B=AMP_COAD_100, PMB = PM_AMP_COAD_100)
In the previous example, TCGA_COAD_AMP and AMP_COAD share the same number of rown, but this is not needed. The following code chunk show the sample analysis but with the 100 most mutated genes and 150 amplificated genes.
data("TCGA_COAD")
data("PM_COAD")
data("AMP_COAD")
data("PM_AMP_COAD")
common <- intersect(colnames(TCGA_COAD), colnames(AMP_COAD))
keep <- match(common,colnames(TCGA_COAD))
TCGA_COAD_100 <- TCGA_COAD[1:100,keep]
PM_TCGA_COAD_100 <- PM_COAD[1:100,keep]
keep <- match(common,colnames(AMP_COAD))
AMP_COAD_150 <- AMP_COAD[1:150,keep]
PM_AMP_COAD_150 <- PM_AMP_COAD[1:150,keep]
mismutex <- getMutexAB(A=TCGA_COAD_100, PMA=PM_TCGA_COAD_100,
B=AMP_COAD_150, PMB = PM_AMP_COAD_150)
Finally, the last option requires, as in the first case, a single matrix and the obtained probability matrix. As explained before, in this case a reduced version of the original matrix is introduced, i.e., starting from the original matrix, a specific group of genes and samples are selected. On the other hand, the probability matrix required for this study will be taken from the global probability matrix, i.e., the global probability matrix is first calculated by introducing the original matrix with all genes and all samples and then only the probabilities of the specifically selected genes and samples are chosen. Therefore, a matrix will be introduced that will contain a series of genes and samples from the original matrix, with their corresponding probabilities obtained from the global probability matrix.
In addition, unlike the other functions, this one allows to determine:
Therefore, the inputs required by getMutexGroup
are:
getPM
.Furthermore, there is also an extra possible entry that has been previously defined, but could be modified by the user:
The following code chunk shows an example of how to perform this function:
data("A_example")
A2 <- A_example[,1:40]
A2[1,1:10] <- 1
A2[2,1:10] <- 0
A2[3,1:10] <- 0
A2[1,11:20] <- 0
A2[2,11:20] <- 1
A2[3,11:20] <- 0
A2[1,21:30] <- 0
A2[2,21:30] <- 0
A2[3,21:30] <- 1
PM2 <- getPM(A2)
A <- A2[1:3,]
PM <- PM2[1:3,]
The next figure is a graphical representation of A, showing a matrix of 3 genes with mutations in some of samples (black areas).
Following, getMutexGroup
function has been used introducing the generated A matrix and performing three different studies; first one analyses the impurity, second one the coverage and last one the exclusivity.
getMutexGroup(A, PM, "Impurity")
## [1] 0.02459048
getMutexGroup(A, PM, "Coverage")
## [1] 2.245149e-05
getMutexGroup(A, PM, "Exclusivity")
## [1] 5.444509e-07
For the example shown below, the “TCGAbiolinks” library must be installed which is not available in the current version of bioconductor. However, it can be easily downloaded by accessing previous versions of Bioconductor.
Among the possible applications of Rediscover stands up the possibility of representing graphically the results obtained by applying discoverSomaticInteractions
function. In this case, maftools has been used in Colon Adenocarcinoma (COAD) and different plots have been performed to study the results obtained, which has allowed the study of co-ocurring and mutually exclusive genes.
coad.maf <- GDCquery_Maf("COAD", pipelines = "muse") %>% read.maf
Figure 3 shows the mutations contained in genes in each sample, with missesense mutations being the most common, although, nonsense mutations are noteworthy, as 54% of the samples are mutated in APC, which contains a large percentage of nonsense mutations.
oncoplot(coad.maf, top = 35)