R package mdendro enables the calculation of agglomerative hierarchical clustering (AHC), extending the standard functionalities in several ways:
Native handling of both similarity and dissimilarity (distances) matrices.
Calculation of pair-group dendrograms and variable-group multidendrograms [1].
Implementation of the most common AHC methods in both weighted and unweighted forms: single linkage, complete linkage, average linkage (UPGMA and WPGMA), centroid (UPGMC and WPGMC), and Ward.
Implementation of two additional parametric families of methods: versatile linkage [2], and beta flexible. Versatile linkage leads naturally to the definition of two additional methods: harmonic linkage, and geometric linkage.
Calculation of the cophenetic (or ultrametric) matrix.
Calculation of five descriptors of the final dendrogram: cophenetic correlation coefficient, space distortion ratio, agglomerative coefficient, chaining coefficient, and tree balance.
Plots of the descriptors for the parametric methods.
All this functionality is obtained with functions linkage, descval and descplot.
Function linkage may be considered as a replacement for functions hclust (in package stats) and agnes (in package cluster).
To enhance usability and interoperability, the linkage class includes several methods for plotting, summarizing information, and class conversion.
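To make two of the descriptors listed above more concrete, here is a minimal sketch that uses only base R (hclust and cophenetic from package stats), not mdendro itself. The helper names coph_cor and space_distortion are hypothetical. The formula used for the space distortion ratio (range of the cophenetic distances divided by range of the original distances) is an assumption about how that descriptor is defined, so the values reported by mdendro remain the reference.

```r
# Base-R sketch; hclust stands in for mdendro's linkage function here.
d  <- UScitiesD                       # distances between 10 US cities
hc <- hclust(d, method = "complete")  # complete linkage (pair-group only)
u  <- cophenetic(hc)                  # cophenetic (ultrametric) distances

# Cophenetic correlation: Pearson correlation between the original
# distances and the cophenetic distances.
coph_cor <- function(d, u) cor(as.vector(d), as.vector(u))

# Assumed definition of the space distortion ratio: spread of the
# cophenetic distances relative to the spread of the original ones.
space_distortion <- function(d, u) (max(u) - min(u)) / (max(d) - min(d))

cc  <- coph_cor(d, u)
sdr <- space_distortion(d, u)
print(c(cor = cc, sdr = sdr))
```

Under this assumed definition, complete linkage always gives a ratio of 1: the first merge happens at the smallest original distance and the last merge at the largest one.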
There exist two main ways to install mdendro:
Installation from CRAN (recommended method):
install.packages("mdendro")
RStudio has a menu entry (Tools → Install Packages) for this job.
Installation from GitHub (you may need to install devtools first):
install.packages("devtools")
library(devtools)
install_github("sergio-gomez/mdendro")
Since mdendro includes C++ code, you may first need to install Rtools on Windows, or Xcode on macOS.
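Before running the examples, it can be handy to check whether the package is already available. The following sketch uses only base R; the helper name needs_install is hypothetical, and the code deliberately prints a hint instead of installing anything.

```r
# Hypothetical helper: TRUE when a package cannot be loaded.
needs_install <- function(pkg) !requireNamespace(pkg, quietly = TRUE)

if (needs_install("mdendro")) {
  message("mdendro is not installed; run install.packages(\"mdendro\")")
}
```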
Let us start by using the linkage function to calculate the complete linkage AHC of the UScitiesD dataset, a matrix of distances between a few US cities:
library(mdendro)
lnk <- linkage(UScitiesD, method = "complete")
Now we can plot the resulting dendrogram:
plot(lnk)
The summary of this dendrogram is:
summary(lnk)
## Call:
## linkage(prox = UScitiesD,
## type.prox = "distance",
## digits = 0,
## method = "complete",
## group = "variable")
##
## Number of objects: 10
##
## Binary dendrogram: TRUE
##
## Descriptive measures:
##       cor       sdr        ac        cc        tb
## 0.8077859 1.0000000 0.7738478 0.3055556 0.9316262
In particular, you can recognize the calculated descriptors:
cor: cophenetic correlation coefficient
sdr: space distortion ratio
ac: agglomerative coefficient
cc: chaining coefficient
tb: tree balance
It is possible to work with similarity data without having to convert them to distances, provided they are in the range [0.0, 1.0]. A typical example would be a matrix of non-negative correlations:
sim <- as.dist(Harman23.cor$cov)
lnk <- linkage(sim, type.prox = "sim")
plot(lnk, main = "Harman23")
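The range requirement on similarities can be checked before clustering. The sketch below uses only base R. The conversion to distances shown at the end (one minus the similarity) is just one common choice, included for comparison with tools that accept distances only; it is not something mdendro requires.

```r
sim <- as.dist(Harman23.cor$cov)  # pairwise correlations as a "dist" object

# Check the precondition: every similarity lies inside [0, 1].
in_range <- all(as.vector(sim) >= 0 & as.vector(sim) <= 1)
print(in_range)

# One common (not the only) way to turn similarities into distances.
dis <- 1 - sim
print(range(as.vector(dis)))
```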
There is also the option to choose between unweighted (default) and weighted methods:
par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 3)
nodePar <- list(cex = 0, lab.cex = 0.4)
lnk1 <- linkage(cars, method = "arithmetic")
plot(lnk1, main = "unweighted", nodePar = nodePar)
lnk2 <- linkage(cars, method = "arithmetic", weighted = TRUE)
plot(lnk2, main = "weighted", nodePar = nodePar)
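The difference between the two variants is easiest to see in the update rule applied after two clusters are merged. The sketch below writes out the textbook UPGMA and WPGMA rules for average linkage as plain R functions. The function names are hypothetical, and it is an assumption that these rules correspond to method = "arithmetic" without and with weighted = TRUE.

```r
# Distance from a third cluster K to the union of clusters I and J,
# given d(I, K), d(J, K) and the cluster sizes n_i and n_j.
upgma_update <- function(d_ik, d_jk, n_i, n_j) {
  (n_i * d_ik + n_j * d_jk) / (n_i + n_j)  # every object counts equally
}
wpgma_update <- function(d_ik, d_jk) {
  (d_ik + d_jk) / 2                         # every subcluster counts equally
}

# Example: I holds 3 objects at distance 2 from K, J holds 1 object at distance 6.
up <- upgma_update(2, 6, n_i = 3, n_j = 1)  # (3*2 + 1*6) / 4 = 3
wp <- wpgma_update(2, 6)                    # (2 + 6) / 2 = 4
print(c(unweighted = up, weighted = wp))
```

The two rules agree whenever the merged clusters have the same size, so the dendrograms differ only when unbalanced merges occur.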
When there are tied minimum distances in the agglomeration process, you may either ignore them and proceed by choosing a random pair (pair-group methods) or agglomerate them all at once (variable-group multidendrograms). With linkage you can use both approaches, with multidendrograms being the default:
par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
nodePar <- list(cex = 0, lab.cex = 0.4)
lnk1 <- linkage(cars, method = "complete")
plot(lnk1, main = "multidendrogram", nodePar = nodePar)
lnk2 <- linkage(cars, method = "complete", group = "pair")
plot(lnk2, main = "pair-group", nodePar = nodePar)
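A tie simply means that more than one pair of clusters shares the current minimum distance. The base-R sketch below builds a tiny example in which this happens at the first step, and then shows that rounding the mtcars distances to one decimal produces repeated values, which is what makes ties likely in the example above. All variable names are hypothetical.

```r
# Three points on a line: b is equally close to a and to c.
x <- c(a = 0, b = 1, c = 2)
d <- dist(x)

# Number of pairs sharing the minimum distance; more than one means a tie.
n_tied <- sum(as.vector(d) == min(d))
print(n_tied)

# Rounding coarsens the distances, so many pairs end up with equal values.
d_cars <- round(dist(scale(mtcars)), digits = 1)
has_repeats <- anyDuplicated(as.vector(d_cars)) > 0
print(has_repeats)
```

In the three-point example a pair-group method has to pick either (a, b) or (b, c) first, whereas a variable-group method would join a, b and c in a single step.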
While multidendrograms are unique, you may obtain structurally different pair-group dendrograms by just reordering the data. As a consequence, the descriptors are invariant to permutations for multidendrograms, but not for pair-group dendrograms:
cars <- round(dist(scale(mtcars)), digits = 1)
lnk1 <- linkage(cars, method = "complete")
lnk2 <- linkage(cars, method = "complete", group = "pair")
# apply random permutation to data
set.seed(666)
ord <- sample(attr(cars, "Size"))
cars_p <- as.dist(as.matrix(cars)[ord, ord])
lnk1p <- linkage(cars_p, method = "complete")
lnk2p <- linkage(cars_p, method = "complete", group = "pair")
# compare original and permuted cophenetic correlation
c(lnk1$cor, lnk1p$cor)
## [1] 0.7782257 0.7782257
c(lnk2$cor, lnk2p$cor)
## [1] 0.7780010 0.7780994
# compare original and permuted tree balance
c(lnk1$tb, lnk1p$tb)
## [1] 0.9564568 0.9564568
c(lnk2$tb, lnk2p$tb)
## [1] 0.9472909 0.9424148
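The dependence on the ordering comes only from ties. As a contrast, the following base-R sketch repeats the permutation test on UScitiesD with hclust (a pair-group method). The summary shown earlier reports a binary dendrogram for these data under complete linkage, so no ties are met, and the cophenetic correlation is expected to be unaffected by the permutation. The helper name coph_cor is hypothetical.

```r
d <- UScitiesD

# Cophenetic correlation of a pair-group dendrogram built with hclust.
coph_cor <- function(d, method = "complete") {
  cor(as.vector(d), as.vector(cophenetic(hclust(d, method = method))))
}

set.seed(1)
ord <- sample(attr(d, "Size"))          # random reordering of the objects
d_p <- as.dist(as.matrix(d)[ord, ord])  # same distances, new order

cor_orig <- coph_cor(d)
cor_perm <- coph_cor(d_p)
print(c(original = cor_orig, permuted = cor_perm))
```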
In multidendrograms, the ranges (rectangles) show the heterogeneity between distances within the group, but they are optional in the plots:
par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
nodePar <- list(cex = 0, lab.cex = 0.4)
lnk <- linkage(cars, method = "complete")
plot(lnk, col.rng = "bisque", main = "with ranges", nodePar = nodePar)
plot(lnk, col.rng = NULL, main = "without ranges", nodePar = nodePar)
Plots including ranges are only available if you directly use the plot.linkage function from mdendro.
Nevertheless, you can still take advantage of other dendrogram plotting packages, such as dendextend and ape:
par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
lnk <- linkage(cars, method = "complete")
lnk.dend <- as.dendrogram(lnk)
plot(dendextend::set(lnk.dend, "branches_k_color", k = 4),
     main = "dendextend package",
     nodePar = list(cex = 0.4, lab.cex = 0.5))
lnk.hcl <- as.hclust(lnk)
pal4 <- c("red", "forestgreen", "purple", "orange")
clu4 <- cutree(lnk.hcl, 4)
plot(ape::as.phylo(lnk.hcl),
     type = "fan",
     main = "ape package",
     tip.color = pal4[clu4],
     cex = 0.5)
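The expression pal4[clu4] in the last call maps each object to a color through its cluster number. The base-R sketch below isolates that step, using hclust on UScitiesD as a stand-in for as.hclust(lnk); the variable names are hypothetical. It assumes that the plotting function expects the colors in the same order as the object labels.

```r
hc   <- hclust(UScitiesD, method = "complete")  # stand-in for as.hclust(lnk)
clu  <- cutree(hc, k = 4)                        # named integers in 1..4
pal  <- c("red", "forestgreen", "purple", "orange")

# Indexing the palette by cluster number yields one color per object.
cols <- pal[clu]
names(cols) <- names(clu)
print(cols)
```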