title: "Introduction to Silly Putty"
author: "Dwayne Tally, Zachary B. Abrams, and Kevin R. Coombes"
date: "2024-02-07"
output:
  rmarkdown::html_document:
    theme: journal
    highlight: kate

Introduction

In many diseases, such as cancer, it is important to have a clear understanding of which potential clinical subgroup an individual patient belongs to. Unsupervised clustering is a useful analytic tool to address this problem. A variety of clustering methods already exist and are differentiated by the kinds of outcome measures they are intended to optimize. For example, K-means is designed to minimize the within-cluster sum of squared errors. Partitioning around medoids (PAM) generalizes this idea from the Euclidean distance metric defined by sums of squares to an arbitrary distance metric. The different linkage rules used in hierarchical clustering methods also change the nature of the value being optimized.
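To make the K-means objective concrete, here is a minimal sketch on toy data (the data and variable names are ours, chosen only for illustration). It recomputes the within-cluster sum of squared errors by hand and compares it with the value reported by the base R `kmeans` function.

```r
## Toy data: two well-separated groups in two dimensions (illustrative only)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 5)
## Squared Euclidean distance from each point to its assigned center
sse <- sum((toy - km$centers[km$cluster, ])^2)
all.equal(sse, km$tot.withinss)
```

The comparison should return `TRUE`, since `tot.withinss` is the quantity that K-means tries to make as small as possible.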

Ever since Kaufman and Rousseeuw introduced the idea of the silhouette width, researchers have used its average value to select the best method when applying different clustering methods to the same data set. To our knowledge, no one has tried to use silhouette width as a quantity to be optimized directly when finding clusters. To test the idea that optimizing the silhouette width could be used to cluster elements, we developed a novel algorithm that we call “SillyPutty”. In brief, after elements have been assigned to clusters, we can calculate the silhouette width (SW) of each element, yielding numbers between -1 and +1. A positive value of SW indicates that an element is likely to be properly clustered, while a negative value of SW indicates that the element is probably not in the correct cluster. The repeated step in the SillyPutty algorithm is to reclassify the element with the most negative silhouette width by placing it into the cluster to which it is closest. This process can (usually) be repeated until there are no negative silhouette widths present in the data. (There is a small chance that this algorithm will fail to converge by entering a small infinite loop where the same elements are rearranged to get back to an earlier configuration.)
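The repeated step described above can be sketched in a few lines using the `silhouette` function from the cluster package. The helper `reassignWorst` below is hypothetical and is not the implementation found in the SillyPutty package; the `maxIter` argument is our own guard against the non-convergent loop mentioned above.

```r
library(cluster)  # provides silhouette()

## Hypothetical helper, NOT the SillyPutty package's implementation:
## repeatedly move the element with the most negative silhouette width
## into its nearest neighboring cluster.
reassignWorst <- function(clusters, dis, maxIter = 100) {
  for (iter in seq_len(maxIter)) {
    if (length(unique(clusters)) < 2) break   # silhouette needs >= 2 clusters
    sil <- silhouette(clusters, dis)
    sw <- sil[, "sil_width"]
    if (min(sw) >= 0) break                   # no negative widths remain
    worst <- which.min(sw)
    clusters[worst] <- as.integer(sil[worst, "neighbor"])
  }
  clusters
}

## Toy example: two tight groups, with one element deliberately mislabeled
set.seed(2)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 6, sd = 0.5), ncol = 2))
start <- rep(1:2, each = 10)
start[1] <- 2L
fixed <- reassignWorst(start, dist(toy))
table(fixed, rep(1:2, each = 10))
```

In this toy case, the only element with a negative silhouette width is the mislabeled one, so a single reassignment should restore the original grouping.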

Setup

We must first load the necessary packages.

library(SillyPutty)
library(Umpire)
suppressMessages( library(Mercator) )
suppressMessages( library(mclust) ) # for adjusted rand index

Generating and Formatting Data

We use the Umpire R package (version 2.0.10) to generate complex and realistic synthetic data. We then compute the Euclidean distances between elements and use the Mercator R package (version 1.1.3) to visualize the data. Finally, we use the mclust R package (version 6.0.0) to compute the adjusted Rand index (ARI), a measure of cluster quality that compares clusters to externally known truth.

Assign Umpire Model Parameters

The next chunk of code creates the objects that we will use to simulate a data set. We set things up to represent a kind of cancer with four subtypes corresponding to recurrent sets of “hits”, where each hit can be thought of as an abstract “mutation” that affects the expression of a pathway of related genes.

set.seed(21315)
trueK <- 4
## Set up survival outcome; baseline is exponential
sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1)
## Build a CancerModel with four subtypes
nBlocks <- 20    
cm <- CancerModel(name="cansim",
                  nPossible=nBlocks,
                  nPattern=trueK,
                  OUT = function(n) rnorm(n, 0, 1), 
                  SURV= function(n) rnorm(n, 0, 1),
                  survivalModel=sm)
## Include 100 blocks/pathways that are not hit by cancer
nTotalBlocks <- nBlocks + 100
## Assign values to hyperparameters
## block size
blockSize <- round(rnorm(nTotalBlocks, 100, 30))
## log normal mean hyperparameters
mu0    <- 6
sigma0 <- 1.5
## log normal sigma hyperparameters
rate   <- 28.11
shape  <- 44.25
## block correlation
p <- 0.6
w <- 5
## Set up the baseline Engine
rho <- rbeta(nTotalBlocks, p*w, (1-p)*w)
base <- lapply(1:nTotalBlocks,
               function(i) {
                 bs <- blockSize[i]
                 co <- matrix(rho[i], nrow=bs, ncol=bs)
                 diag(co) <- 1
                 mu <- rnorm(bs, mu0, sigma0)
                 sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1)
                 covo <- co *(t(sigma) %*% sigma)
                 MVN(mu, covo)
               })
eng <- Engine(base)
## Alter the means if there is a hit
altered <- alterMean(eng, normalOffset, delta=0, sigma=1)
## Build the CancerEngine using character strings
object <- CancerEngine(cm, "eng", "altered")
rm(sm, nBlocks, cm, nTotalBlocks, blockSize, mu0, sigma0, rate, shape, p, w, rho, base, eng, altered)
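Two steps in the chunk above may be easier to follow with a small worked example. First, drawing correlations from a Beta distribution with parameters `p*w` and `(1-p)*w` gives a mean of `p` and a variance of `p*(1-p)/(w+1)`. Second, multiplying a correlation matrix elementwise by the outer product of the standard deviations yields a covariance matrix. The sketch below checks both facts; all names ending in `Demo` are ours and the three standard deviations are arbitrary.

```r
## Beta(p*w, (1-p)*w) has mean p and variance p*(1-p)/(w+1)
pDemo <- 0.6
wDemo <- 5
set.seed(3)
rhoDemo <- rbeta(1e5, pDemo * wDemo, (1 - pDemo) * wDemo)
c(mean = mean(rhoDemo), var = var(rhoDemo))   # close to 0.6 and 0.04

## A 3-feature block with common correlation 0.6 and illustrative SDs
coDemo <- matrix(0.6, nrow = 3, ncol = 3)
diag(coDemo) <- 1
sigmaDemo <- matrix(c(0.5, 1, 2), nrow = 1)
covDemo <- coDemo * (t(sigmaDemo) %*% sigmaDemo)
all.equal(cov2cor(covDemo), coDemo)                   # correlations recovered
all.equal(sqrt(diag(covDemo)), as.vector(sigmaDemo))  # SDs recovered
```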

Simulate Data

Now we can take a random sample of 144 elements from the distribution that we just defined.

trueN <- 144
dset <- rand(object, trueN, keepall = TRUE) # contains two objects
labels <- dset$clinical$CancerSubType # the true clusters/types
d1 <- dset$data # the noise-free simulated data

To make our data set even more realistic, we are going to add noise that mimics what happens in some biological assays.

SpecialNoise <- function(nFeat, nu = 0.1, shape = 1.02, scale = 0.05/shape) {
  NoiseModel(nu = nu,
             tau = rgamma(nFeat, shape = shape, scale = scale),
             phi = 0)
}
nm <- SpecialNoise(nrow(d1), nu = 0)
d1 <- blur(nm, d1)
dim(d1)
## [1] 11808   144
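For intuition, the following base R sketch mimics what we understand this noise model to do when `nu = 0` and `phi = 0`: it adds zero-mean Gaussian noise whose standard deviation differs from feature to feature. This is an assumption about the model written for illustration; it is not the Umpire implementation of `blur`, and the `Demo` names and toy matrix are ours.

```r
## Sketch only: additive noise with a feature-specific standard deviation.
## This reflects our assumption about the model, not Umpire's actual code.
set.seed(4)
nFeatDemo <- 1000
nSampDemo <- 5
clean <- matrix(10, nrow = nFeatDemo, ncol = nSampDemo)
shapeDemo <- 1.02
tauDemo <- rgamma(nFeatDemo, shape = shapeDemo, scale = 0.05 / shapeDemo)
mean(tauDemo)   # the gamma mean is shape * scale = 0.05
## rnorm recycles 'sd' down each column, so row i always uses tauDemo[i]
noisy <- clean + matrix(rnorm(nFeatDemo * nSampDemo, mean = 0, sd = tauDemo),
                        nrow = nFeatDemo)
dim(noisy)
```

With the default `scale = 0.05/shape`, the typical noise standard deviation is about 0.05 regardless of the chosen `shape`.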

Euclidean Distance Matrix

Now we compute the Euclidean distances between pairs of elements in our simulated data set.

tdis <- t(d1)
dimnames(tdis) <- list(paste("Sample", 1:nrow(tdis), sep=''),
                     paste("Feature", 1:ncol(tdis), sep=''))
dis <- dist(tdis)   ## This step is the rate-limiting factor; the only way to speed it up is to use fewer samples
names(labels) <- rownames(tdis)
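The transpose matters because `dist` computes distances between the rows of its input, while our simulated data store samples in columns. A tiny example (toy values, names ours) shows the behavior:

```r
## dist() measures distances between ROWS, so samples must be in rows
featBySample <- cbind(S1 = c(0, 0), S2 = c(3, 4))   # 2 features x 2 samples
dist(t(featBySample))               # one distance, between S1 and S2
as.numeric(dist(t(featBySample)))   # 5, since sqrt(3^2 + 4^2) = 5
```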

Mercator Visualization

As noted above, we will use the Mercator package for visualization. This function will ensure that we generate consistent sets of pictures.

mercViews <- function(object, main, tag = NULL) {
  opar <- par(mfrow = c(2, 2))
  on.exit(par(opar))
  pts <- barplot(object, main = main)
  if (!is.null(tag)) {
    gt <- as.vector(as.matrix(table(getClusters(object))))
    loc <- pts[round((c(0, cumsum(gt))[-(1 + length(gt))] + cumsum(gt))/2)]
    mtext(tag, side =1, line = 0, at = loc, col = object@palette, font = 2)
  }
  plot(object, view = "tsne", main = "t-SNE")
  plot(object, view = "hclust")
  plot(object, view = "mds", main = "MDS")
}
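The expression that computes `loc` finds the middle bar of each cluster so that the tags can be centered under their groups. Here is the same arithmetic on hypothetical cluster sizes, assuming the bars are drawn cluster by cluster in that order:

```r
## Hypothetical cluster sizes, in the order the bars are drawn
gt <- c(4, 6, 2)
ends <- cumsum(gt)                        # last bar of each cluster: 4 10 12
starts <- c(0, ends)[-(1 + length(gt))]   # bars before each cluster: 0 4 10
mid <- round((starts + ends) / 2)
mid   # 2 7 11
```

These indices are then used to look up the corresponding horizontal positions returned by `barplot`.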

Different Clustering Methods

We will apply various clustering methods to the data (represented primarily through its distance matrix). We want to demonstrate with this example that SillyPutty clustering can do a better job than hierarchical clustering or PAM.

Hierarchical Clustering

Figure 1 presents multiple views of the Euclidean distances between our simulated data. Since we know that we started with 4 clusters, we chose that as the number to find using the default method of hierarchical clustering with Ward’s linkage rule. (We will later illustrate how to use SillyPutty to find the number of clusters.)

The silhouette width plot in the upper left panel indicates that each of the clusters contains some poorly classified elements, identified by their negative silhouette widths. Both the multidimensional scaling (MDS) plot in the lower right and the t-distributed stochastic neighbor embedding (t-SNE) plot in the upper right clearly display colored points that appear to be in the wrong regions.
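For readers who want to see the underlying operation without Mercator, a rough base R equivalent of "hierarchical clustering with Ward's linkage, cut into four clusters" is sketched below on toy data. The choice of `method = "ward.D2"` is our assumption about which Ward variant is intended, and the toy data and `Demo` names are ours.

```r
## Sketch with base R only; toy data, and ward.D2 is our assumption
set.seed(5)
truthDemo <- rep(1:4, each = 10)
## Four groups centered at (10,10), (20,20), (30,30), (40,40)
toy <- matrix(rnorm(80, mean = 10 * rep(truthDemo, times = 2)), ncol = 2)
hc <- hclust(dist(toy), method = "ward.D2")
cl <- cutree(hc, k = 4)
table(cl, truthDemo)
```

On data this well separated, each true group should fall entirely within one cluster; the simulated cancer data are much harder.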

set.seed(1987)
vis <- Mercator(dis, "euclid", "hclust", K = trueK)
palette <- vis@palette[c(1:3, 7, 8, 6, 10, 4, 11, 5, 15, 14, 17:18, 9, 12, 16, 19:24)]
vis@palette <- palette
vis <- addVisualization(vis, "mds")
vis <- addVisualization(vis, "tsne")
mercViews(vis, "Hierarchical Clustering, Four Clusters")

Figure 1 : Hierarchical Clustering, with four clusters.

The adjusted Rand index isn’t very good, either.

ari.hier <- adjustedRandIndex(labels, vis@clusters)
ari.hier
## [1] 0.7154718
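To help interpret this number, recall that the ARI equals 1 when two partitions agree exactly (even if the cluster names differ), is near 0 for chance-level agreement, and can be negative. The helper `ariSketch` below is a hypothetical base R version of the standard formula, written only for illustration; it is intended to mirror what `adjustedRandIndex` computes, but we have not checked it against mclust on every input.

```r
## Hypothetical helper illustrating the standard adjusted Rand index formula
ariSketch <- function(x, y) {
  tab <- table(x, y)
  sumCells <- sum(choose(tab, 2))            # pairs together in both partitions
  sumRows  <- sum(choose(rowSums(tab), 2))   # pairs together in x
  sumCols  <- sum(choose(colSums(tab), 2))   # pairs together in y
  expected <- sumRows * sumCols / choose(length(x), 2)
  (sumCells - expected) / ((sumRows + sumCols) / 2 - expected)
}
ariSketch(c(1, 1, 2, 2), c("b", "b", "a", "a"))   # 1: same partition, renamed
ariSketch(c(1, 1, 2, 2), c(1, 2, 1, 2))           # -0.5: worse than chance
```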

Graphing Truth

Since we know the truth, we can reassign the clusters inside the Mercator object to see what everything is supposed to look like (Figure 2). Notice that the silhouette width plot agrees that everything is in the right place, and that the MDS and t-SNE plots are also consistent.

truebin <- remapColors(vis, setClusters(vis, labels))
mercViews(truebin, main = "True Cluster Types", 
          tag = unique(sort(labels)))

Figure 2 : Visualization of true cancer clusters.

PAM Clustering

Here we apply PAM clustering to the same distance matrix (Figure 3). These results are clearly much worse than those from hierarchical clustering.

pc <- pam(dis, k = trueK, diss=TRUE)
pamc <- remapColors(vis, setClusters(vis, pc$clustering))
mercViews(pamc, main = "PAM, K = 4", 
          tag = paste("P", 1:trueK, sep = ""))