Evaluating single-cell clustering with ClustAssess
Source:vignettes/ClustAssess.Rmd
ClustAssess.Rmd
In this vignette we will illustrate several of the tools available in ClustAssess on a small single-cell RNA-seq dataset.
library(Seurat)
#> Loading required package: SeuratObject
#> Loading required package: sp
#>
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#>
#> intersect, t
library(ClustAssess)
library(ggplot2)
#>
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:base':
#>
#> is.element
theme_set(theme_classic())
data("pbmc_small")
# we will use the pbmc_small single-cell RNA-seq dataset available via Seurat
DimPlot(pbmc_small, group.by = "letter.idents")
Proportion of ambiguously clustered pairs
The proportion of ambiguously clustered pairs (PAC) uses consensus clustering to infer the optimal number of clusters for the data, by observing how variably pairs of elements cluster together. The lower the PAC, the more stable the clustering. PAC was originally presented in https://doi.org/10.1038/srep06207.
# retrieve scaled data for PAC calculation
pbmc.data <- GetAssayData(pbmc_small, assay = "RNA", layer = "scale.data")
# perform consensus clustering
cc.res <- consensus_cluster(t(pbmc.data),
k_max = 30,
n_reps = 100,
p_sample = 0.8,
p_feature = 0.8,
verbose = TRUE
)
#> Calculating consensus clustering
# assess the PAC convergence for a few values of k - each curve should
# have converged to some value
k.plot <- c(4, 6, 8, 10)
pac_convergence(cc.res, k.plot)
# visualize the final PAC across k - there seems to be a local maximum at k=7,
# indicating that 7 clusters leads to a less stable clustering of the data than
# nearby values of k
pac_landscape(cc.res)
Element-centric clustering comparison
We compare the similarity of clustering results between methods (in this case, between Louvain community detection and k-means) using Element-Centric Similarity (ECS), which quantifies the clustering similarity per cell. Higher ECS indicates that an observation is clustered similarly across methods. ECS was introduced in https://doi.org/10.1038/s41598-019-44892-y.
# first, cluster with Louvain algorithm
pbmc_small <- FindClusters(pbmc_small, resolution = 0.8, verbose = FALSE)
DimPlot(pbmc_small, group.by = "seurat_clusters")
# also cluster with PCA+k-means
pbmc_pca <- Embeddings(pbmc_small, "pca")
pbmc_small@meta.data$kmeans_clusters <- kmeans(pbmc_pca,
centers = 3,
nstart = 10,
iter.max = 1e3
)$cluster
DimPlot(pbmc_small, group.by = "kmeans_clusters")
# where are the clustering results more similar?
pbmc_small@meta.data$ecs <- element_sim_elscore(
pbmc_small@meta.data$seurat_clusters,
pbmc_small@meta.data$kmeans_clusters
)
suppressMessages(FeaturePlot(pbmc_small, "ecs") + scale_colour_viridis_c())
mean(pbmc_small@meta.data$ecs)
#> [1] 0.6632927
Jaccard similarity of cluster markers
As a common step in computational single-cell RNA-seq analyses, discriminative marker genes are identified for each cluster. These markers are then used to infer the cell type of the respective cluster. Here, we compare the markers obtained for each clustering method to ask: how similarly would each cell be interpreted across clustering methods? We compare the markers per cell using the Jaccard similarity (defined as the size of the intersect divided by the size of the overlap) of the marker gene lists. The higher the JSI, the more similar are the marker genes for that cell.
# first, we calculate the markers on the Louvain clustering
Idents(pbmc_small) <- pbmc_small@meta.data$seurat_clusters
louvain.markers <- FindAllMarkers(pbmc_small,
logfc.threshold = 1,
min.pct = 0.0,
test.use = "roc",
verbose = FALSE
)
# then we get the markers on the k-means clustering
Idents(pbmc_small) <- pbmc_small@meta.data$kmeans_clusters
kmeans.markers <- FindAllMarkers(pbmc_small,
logfc.threshold = 1,
min.pct = 0.0,
test.use = "roc",
verbose = FALSE
)
# next, compare the top 10 markers per cell
pbmc_small@meta.data$marker.gene.jsi <- marker_overlap(louvain.markers,
kmeans.markers,
pbmc_small@meta.data$seurat_clusters,
pbmc_small@meta.data$kmeans_clusters,
n = 10,
rank_by = "power"
)
# which cells have the same markers, regardless of clustering?
suppressMessages(FeaturePlot(pbmc_small, "marker.gene.jsi") + scale_colour_viridis_c())
mean(pbmc_small@meta.data$marker.gene.jsi)
#> [1] 0.5738636
Element-wise consistency
How consistently are cells clustered by k-means? We will rerun k-means clustering 20 times to investigate.
clustering.list <- list()
n.reps <- 20
for (i in 1:n.reps) {
# we set nstart=1 and a fairly high iter.max - this should mean that
# the algorithm converges, and that the variability in final clusterings
# depends mainly on the random initial cluster assignments
km.res <- kmeans(pbmc_pca, centers = 3, nstart = 1, iter.max = 1e3)$cluster
clustering.list[[i]] <- km.res
}
# now, we calculate the element-wise consistency (aka frustration), which
# performs pair-wise comparisons between all 20 clusterings; the
# consistency is the average per-cell ECS across all comparisons. The higher
# the consistency, the more consistently is that cell clustered across
# random seeds.
pbmc_small@meta.data$consistency <- element_consistency(clustering.list)
# which cells are clustered more consistently?
suppressMessages(FeaturePlot(pbmc_small, "consistency") + scale_colour_viridis_c())
mean(pbmc_small@meta.data$consistency)
#> [1] 0.6712922
Session info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Bucharest
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1.9000 ClustAssess_1.0.0 Seurat_5.1.0 SeuratObject_5.0.2
#> [5] sp_2.1-4
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.8 magrittr_2.0.3
#> [4] spatstat.utils_3.1-0 farver_2.1.2 rmarkdown_2.27
#> [7] fs_1.6.4 ragg_1.3.3 vctrs_0.6.5
#> [10] ROCR_1.0-11 spatstat.explore_3.2-7 progress_1.2.3
#> [13] htmltools_0.5.8.1 sass_0.4.9 sctransform_0.4.1
#> [16] parallelly_1.37.1 KernSmooth_2.23-22 bslib_0.7.0
#> [19] htmlwidgets_1.6.4 desc_1.4.3 ica_1.0-3
#> [22] plyr_1.8.9 plotly_4.10.4 zoo_1.8-12
#> [25] cachem_1.1.0 igraph_2.0.3 mime_0.12
#> [28] lifecycle_1.0.4 iterators_1.0.14 pkgconfig_2.0.3
#> [31] Matrix_1.7-0 R6_2.5.1 fastmap_1.2.0
#> [34] fitdistrplus_1.1-11 future_1.33.2 shiny_1.8.1.1
#> [37] digest_0.6.35 colorspace_2.1-1 patchwork_1.2.0
#> [40] tensor_1.5 RSpectra_0.16-2 irlba_2.3.5.1
#> [43] textshaping_0.4.0 labeling_0.4.3 progressr_0.14.0
#> [46] fansi_1.0.6 spatstat.sparse_3.0-3 httr_1.4.7
#> [49] polyclip_1.10-6 abind_1.4-5 compiler_4.4.0
#> [52] withr_3.0.1 fastDummies_1.7.3 highr_0.11
#> [55] MASS_7.3-60 tools_4.4.0 lmtest_0.9-40
#> [58] httpuv_1.6.15 future.apply_1.11.2 goftest_1.2-3
#> [61] glue_1.7.0 nlme_3.1-163 promises_1.3.0
#> [64] grid_4.4.0 Rtsne_0.17 cluster_2.1.6
#> [67] reshape2_1.4.4 generics_0.1.3 gtable_0.3.5
#> [70] spatstat.data_3.0-4 tidyr_1.3.1 hms_1.1.3
#> [73] data.table_1.15.4 utf8_1.2.4 spatstat.geom_3.2-9
#> [76] RcppAnnoy_0.0.22 ggrepel_0.9.5 RANN_2.6.1
#> [79] foreach_1.5.2 pillar_1.9.0 stringr_1.5.1
#> [82] spam_2.10-0 RcppHNSW_0.6.0 later_1.3.2
#> [85] splines_4.4.0 dplyr_1.1.4 lattice_0.22-5
#> [88] survival_3.5-8 deldir_2.0-4 tidyselect_1.2.1
#> [91] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.47
#> [94] gridExtra_2.3 scattermore_1.2 xfun_0.44
#> [97] matrixStats_1.3.0 stringi_1.8.4 lazyeval_0.2.2
#> [100] yaml_2.3.8 evaluate_1.0.0 codetools_0.2-19
#> [103] tibble_3.2.1 cli_3.6.3 uwot_0.2.2
#> [106] xtable_1.8-4 reticulate_1.37.0 systemfonts_1.1.0
#> [109] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.13
#> [112] globals_0.16.3 spatstat.random_3.2-3 png_0.1-8
#> [115] fastcluster_1.2.6 parallel_4.4.0 pkgdown_2.1.1
#> [118] prettyunits_1.2.0 dotCall64_1.1-1 listenv_0.9.1
#> [121] viridisLite_0.4.2 scales_1.3.0 ggridges_0.5.6
#> [124] crayon_1.5.3 leiden_0.4.3.1 purrr_1.0.2
#> [127] rlang_1.1.4 cowplot_1.1.3