Many numerical analyses are invalid for nominal data: the mode is the
only meaningful measure of central tendency, and frequency-based tests,
such as the Chi-square test, are the most common statistical analyses
that make sense. The NIMAA package (Jafari and Chen 2022) provides a
comprehensive pipeline for nominal data mining, which finds
relationships among the labels of each nominal variable based on their
pairwise associations with other nominal data. You can also check for
updates here: NIMAA.
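To make the point about nominal data concrete, here is a minimal base-R sketch that computes the mode of a nominal variable and runs a Chi-square test on a contingency table. The vectors `response` and `group` are made-up toy data, not part of NIMAA or beatAML.

```r
# Toy nominal data (made-up labels, for illustration only)
response <- c("yes", "no", "yes", "yes", "no", "yes", "no", "yes")
group    <- c("A", "A", "B", "B", "A", "B", "A", "B")

# The mode is the most frequent label
freq <- table(response)
mode_label <- names(freq)[which.max(freq)]

# Frequency testing: Chi-square test of independence on the contingency table.
# The counts are tiny, so R warns that the approximation may be poor;
# the warning is suppressed here because this is only a sketch.
contingency <- table(group, response)
test <- suppressWarnings(chisq.test(contingency))

print(mode_label)
print(test$p.value)
```

A mean or median of `response` would be meaningless here, which is why the analysis falls back on counts.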
NIMAA uses bipartite graphs to show how two different types of nominal data are linked, and it stores these links in an incidence matrix for further network analysis. Finding large submatrices without missing values and predicting edges are further applications of NIMAA for exploring local and global similarities among the labels of nominal variables.
Then, using a variety of network projection methods, two unipartite graphs are constructed from the given submatrix. NIMAA provides several options for clustering the projected networks and selecting the best one based on internal measures and external prior knowledge (ground truth). When weighted bipartite networks are considered, the best clustering result serves as the benchmark for edge prediction analysis. This benchmark is used to determine which imputation method best predicts the edge weights of the bipartite network, by comparing how similar the clustering results are before and after imputation. Edge prediction analysis thus helps extract more information from the whole dataset despite its missing values.
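The idea of projecting a bipartite network onto two unipartite graphs can be sketched in base R with matrix products. This is only the simplest co-occurrence projection on a made-up binary incidence matrix. It is an assumption for illustration and not necessarily one of the projection methods NIMAA implements.

```r
# Toy binary incidence matrix (made-up labels): rows = patients, columns = inhibitors
inc <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 1, 1,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("P", 1:4), c("drugA", "drugB", "drugC"))
)

# Row projection: entry (i, j) counts the inhibitors shared by patients i and j
row_proj <- inc %*% t(inc)
# Column projection: entry (i, j) counts the patients shared by inhibitors i and j
col_proj <- t(inc) %*% inc

# Self-links carry no information in a projection, so zero the diagonals
diag(row_proj) <- 0
diag(col_proj) <- 0

print(row_proj)
print(col_proj)
```

Each projected matrix is a weighted adjacency matrix of a unipartite graph, which is the kind of object a clustering step would then operate on.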
library(NIMAA)
In this section, we demonstrate how to do a NIMAA analysis on a
weighted bipartite network using the beatAML dataset. beatAML is one of
four datasets that can be found in the NIMAA package (Jafari and Chen
2022). This dataset has three columns: the first two contain nominal
variables, while the third contains a numerical variable.
inhibitor | patient_id | median |
---|---|---|
Alisertib (MLN8237) | 11-00261 | 81.00097 |
Barasertib (AZD1152-HQPA) | 11-00261 | 60.69244 |
Bortezomib (Velcade) | 11-00261 | 81.00097 |
Canertinib (CI-1033) | 11-00261 | 87.03067 |
Crenolanib | 11-00261 | 68.13586 |
CYT387 | 11-00261 | 69.66083 |
Dasatinib | 11-00261 | 66.13318 |
Doramapimod (BIRB 796) | 11-00261 | 101.52120 |
Dovitinib (CHIR-258) | 11-00261 | 33.48040 |
Erlotinib | 11-00261 | 56.11189 |
Read the data from the package:
# read the data
beatAML_data <- NIMAA::beatAML
The plotIncMatrix() function prints some information about the
incidence matrix derived from the input data, such as its dimensions and
the proportion of missing values, and displays an image of the matrix.
It also returns the incidence matrix object.
NB: To keep the vignette small enough for CRAN rules, we do not output the interactive figure here.
beatAML_incidence_matrix <- plotIncMatrix(
  x = beatAML_data, # original data with 3 columns
  index_nominal = c(2, 1), # the first two columns are nominal data
  index_numeric = 3, # the third column is numeric data
  print_skim = FALSE, # set to TRUE to check the skim output
  plot_weight = TRUE, # plot the weighted incidence matrix
  verbose = FALSE # do NOT save the figures to a local folder
)
Na/missing values Proportion: 0.2603
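For intuition about what an incidence matrix built from three-column data looks like, here is a rough base-R sketch. It does not reproduce the internals of plotIncMatrix(). The data frame `toy` and its values are made up, and `tapply()` stands in for whatever reshaping the package performs.

```r
# Toy three-column data in the same layout as beatAML (values are made up)
toy <- data.frame(
  inhibitor  = c("drugA", "drugB", "drugA", "drugC", "drugB"),
  patient_id = c("P1", "P1", "P2", "P2", "P3"),
  median     = c(81.0, 60.7, 66.1, 33.5, 56.1)
)

# Rows = patient_id, columns = inhibitor; unobserved pairs become NA
inc <- tapply(toy$median, list(toy$patient_id, toy$inhibitor), mean)

# Proportion of missing cells in the incidence matrix
missing_prop <- mean(is.na(inc))

print(inc)
print(dim(inc))
print(missing_prop)
```

With 5 observed pairs in a 3 x 3 matrix, 4 of the 9 cells are missing, which is the same kind of quantity as the proportion reported for beatAML above.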
Given the incidence matrix, we can easily reconstruct the corresponding
bipartite network. The NIMAA package offers two options for visualizing
the bipartite network: static or interactive plots.
The plotBipartite() function customizes the corresponding bipartite
network visualization based on the igraph package (Csardi and Nepusz
2006) and returns the igraph object.
bipartGraph <- plotBipartite(inc_mat = beatAML_incidence_matrix, vertex.label.display = TRUE)
# show the igraph object
bipartGraph
#> IGRAPH bd183bf UNWB 650 47636 --
#> + attr: name (v/c), type (v/l), shape (v/c), color (v/c), weight (e/n)
#> + edges from bd183bf (vertex names):
#> [1] Alisertib (MLN8237) --11-00261 Barasertib (AZD1152-HQPA)--11-00261
#> [3] Bortezomib (Velcade) --11-00261 Canertinib (CI-1033) --11-00261
#> [5] Crenolanib --11-00261 CYT387 --11-00261
#> [7] Dasatinib --11-00261 Doramapimod (BIRB 796) --11-00261
#> [9] Dovitinib (CHIR-258) --11-00261 Erlotinib --11-00261
#> [11] Flavopiridol --11-00261 GDC-0941 --11-00261
#> [13] Gefitinib --11-00261 Go6976 --11-00261
#> [15] GW-2580 --11-00261 Idelalisib --11-00261
#> + ... omitted several edges
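The correspondence between an incidence matrix and the edges of a bipartite network can be sketched in base R as follows. This is an illustrative assumption about the mapping (each non-missing cell becomes one weighted edge, each row and column label becomes one vertex), using a made-up matrix rather than the package's own conversion code.

```r
# Toy weighted incidence matrix (made-up values): rows = patients, columns = inhibitors
inc <- matrix(
  c(81.0, NA, 66.1,
    60.7, 56.1, NA),
  nrow = 3,
  dimnames = list(c("P1", "P2", "P3"), c("drugA", "drugB"))
)

# Every non-missing cell is one weighted edge between a column label and a row label
idx <- which(!is.na(inc), arr.ind = TRUE)
edges <- data.frame(
  from   = colnames(inc)[idx[, "col"]],
  to     = rownames(inc)[idx[, "row"]],
  weight = inc[idx]
)

# Vertices are the row labels plus the column labels
n_vertices <- nrow(inc) + ncol(inc)
n_edges <- nrow(edges)

print(edges)
print(c(vertices = n_vertices, edges = n_edges))
```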
The plotBipartiteInteractive() function generates a customized
interactive bipartite network visualization based on the visNetwork
package (Almende B.V. and Contributors, Thieurmel, and Robert 2021).
NB: To keep the size of the vignette small enough, we do not output the
interactive figure here. Instead, we show a screenshot of part of the
beatAML dataset.
plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)
The NIMAA package contains a function called analyseNetwork() that
provides more details about the network topology and common centrality
measures for vertices and edges.
analysis_reuslt <- analyseNetwork(bipartGraph)

# showing the general measures for network topology
analysis_reuslt$general_stats
#> $vertices_amount
#> [1] 650
#>
#> $edges_amount
#> [1] 47636
#>
#> $edge_density
#> [1] 0.2258433
#>
#> $components_number
#> [1] 1
#>
#> $eigen_centrality_value
#> [1] 15721.82
#>
#> $hub_score_value
#> [1] 247175684
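As a quick arithmetic check, the reported edge density is consistent with the usual density formula for an undirected simple graph, 2E / (V(V - 1)), applied to the vertex and edge counts above. That this is the formula used is our reading of the numbers, not a statement taken from the package documentation.

```r
# Counts copied from the general_stats output above
n_vertices <- 650
n_edges <- 47636

# Density of an undirected simple graph: observed edges over possible vertex pairs
density <- 2 * n_edges / (n_vertices * (n_vertices - 1))

print(density)
```

Note that this denominator counts all vertex pairs, including pairs within the same part of the bipartite network, which can never be connected.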
In the case of a weighted bipartite network, the dataset with the
fewest missing values should be used for the next steps. This avoids
the sensitivity problems of clustering-based methods. The
extractSubMatrix() function extracts submatrices that have no missing
values or that contain a certain percentage of missing values (not
available for the elements-max matrix), depending on the arguments
supplied. The result is also shown as a plotly plot (Sievert 2020), so
you can see screenshots for the beatAML dataset below.
The extraction process is performed and visualized in two ways, which
can be chosen depending on the user's preference: using the original
input matrix (row-wise) and using the transposed matrix (column-wise).
NIMAA extracts the largest submatrices with non-missing values or with
a specified proportion of missing values (using the bar argument) in
four ways predefined by the shape argument.
Here we extract two shapes of submatrix from beatAML_incidence_matrix,
namely square and rectangular with the maximum number of elements:
sub_matrices <- extractSubMatrix(
  x = beatAML_incidence_matrix,
  shape = c("Square", "Rectangular_element_max"), # the selected shapes of submatrices
  row.vars = "patient_id",
  col.vars = "inhibitor",
  plot_weight = TRUE,
  print_skim = FALSE
)
#> binmatnest2.temperature
#> 20.12122
#> Size of Square: 96 rows x 96 columns
#> Size of Rectangular_element_max: 87 rows x 140 columns
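To illustrate what extracting a submatrix without missing values means, here is a deliberately simple greedy heuristic in base R. It is not the algorithm used by extractSubMatrix(), and it does not guarantee the largest possible submatrix. The function name `greedy_complete_submatrix` and the toy matrix are hypothetical.

```r
# Hypothetical helper: repeatedly drop the row or column with the highest
# fraction of missing cells until no NA remains
greedy_complete_submatrix <- function(m) {
  while (anyNA(m)) {
    row_na <- rowSums(is.na(m))
    col_na <- colSums(is.na(m))
    if (max(row_na) / ncol(m) >= max(col_na) / nrow(m)) {
      m <- m[-which.max(row_na), , drop = FALSE]
    } else {
      m <- m[, -which.max(col_na), drop = FALSE]
    }
  }
  m
}

# Toy 3 x 4 matrix with two missing cells (values are made up)
toy <- matrix(
  c(1, 2, NA, 4, 5, 6, 7, NA, 9, 10, 11, 12),
  nrow = 3,
  dimnames = list(paste0("P", 1:3), paste0("drug", 1:4))
)

sub <- greedy_complete_submatrix(toy)
print(sub)
print(dim(sub))
```

The loop always terminates because each pass removes one row or one column, and an empty matrix contains no missing values.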