Assessment of Stability for Graph Clustering
Source: R/stability-based-parameter-assessment.R
automatic_stability_assessment.Rd
Evaluates the stability of different graph clustering methods in the clustering pipeline. The method iterates over the values of the resolution parameter and, for each one, uses the EC Consistency score to compare the partitions obtained with different seeds.
Usage
automatic_stability_assessment(
  expression_matrix,
  n_repetitions,
  n_neigh_sequence,
  resolution_sequence,
  features_sets,
  steps,
  seed_sequence = NULL,
  graph_reduction_embedding = "PCA",
  include_umap_nn_assessment = FALSE,
  n_top_configs = 3,
  ranking_criterion = "iqr",
  overall_summary = "median",
  ecs_threshold = 1,
  matrix_processing = function(dt_mtx, actual_npcs = 30, ...) {
    actual_npcs <-
      min(actual_npcs, ncol(dt_mtx)%/%2)
    RhpcBLASctl::blas_set_num_threads(foreach::getDoParWorkers())
    embedding <-
      stats::prcomp(x = dt_mtx, rank. = actual_npcs)$x
    RhpcBLASctl::blas_set_num_threads(1)
    rownames(embedding) <- rownames(dt_mtx)
    colnames(embedding) <- paste0("PC_", seq_len(ncol(embedding)))
    return(embedding)
  },
  umap_arguments = list(),
  prune_value = -1,
  algorithm_dim_reduction = 1,
  algorithm_graph_construct = 1,
  algorithms_clustering_assessment = 1:3,
  clustering_arguments = list(),
  verbose = TRUE,
  temp_file = NULL,
  save_temp = TRUE
)
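The default matrix_processing shown in the usage can be replaced by a user-supplied function. The following is a minimal sketch, not the package's implementation: it assumes, as the default suggests, that the function receives a matrix with cells on the rows and features on the columns, and it leaves out the BLAS thread control. The name simple_pca_processing and the toy data are hypothetical.

```r
# Hypothetical replacement for matrix_processing: plain PCA with prcomp,
# without the BLAS thread handling of the default.
simple_pca_processing <- function(dt_mtx, actual_npcs = 30, ...) {
  # never request more components than half the number of columns
  actual_npcs <- min(actual_npcs, ncol(dt_mtx) %/% 2)
  embedding <- stats::prcomp(x = dt_mtx, rank. = actual_npcs)$x
  rownames(embedding) <- rownames(dt_mtx)
  colnames(embedding) <- paste0("PC_", seq_len(ncol(embedding)))
  embedding
}

# toy input: 50 cells (rows) by 10 features (columns), assumed orientation
set.seed(1)
toy_matrix <- matrix(
  rnorm(50 * 10),
  nrow = 50,
  dimnames = list(paste0("cell", 1:50), paste0("gene", 1:10))
)
embedding <- simple_pca_processing(toy_matrix)
dim(embedding)
```

Such a function could then be passed as matrix_processing = simple_pca_processing, provided it keeps the one-matrix-in, embedding-out contract described for that argument.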
Arguments
- expression_matrix
An expression matrix having the features on the rows and the cells on the columns.
- n_repetitions
The number of repetitions of applying the pipeline with different seeds; ignored if seed_sequence is provided by the user. Defaults to 100.
- n_neigh_sequence
A sequence of the number of nearest neighbours.
- resolution_sequence
A sequence of resolution values. The resolution parameter controls the coarseness of the clustering. The higher the resolution, the more clusters will be obtained. The resolution parameter is used in the community detection algorithms.
- features_sets
A list of the feature sets. A feature set is a list of genes from the expression matrix that will be used in the dimensionality reduction.
- steps
A list with the same names as features_sets. Each name is assigned a vector containing the sizes of the subsets; negative values will be interpreted as using all features.
- seed_sequence
A custom seed sequence; if the value is NULL, the sequence will be built starting from 1 with a step of 100.
- graph_reduction_embedding
The type of dimensionality reduction used for the graph construction. The options are "PCA" and "UMAP". Defaults to PCA.
- include_umap_nn_assessment
A boolean value indicating if the UMAP embeddings will be used for the nearest neighbours assessment. Defaults to FALSE.
- n_top_configs
The number of top configurations that will be used for the downstream analysis in the dimensionality reduction step. Defaults to 3.
- ranking_criterion
The criterion used for ranking the configurations from the dimensionality reduction step. The options are "iqr", "median", "max", "top_qt", "top_qt_max", "iqr_median", "iqr_median_coeff" and "mean". Defaults to iqr.
- overall_summary
A function used to summarize the stability of the configurations from the dimensionality reduction step across the different resolution values. The options are "median", "max", "top_qt", "top_qt_max", "iqr", "iqr_median", "iqr_median_coeff" and "mean". Defaults to median.
- ecs_threshold
The ECS threshold used for merging similar clusterings.
- matrix_processing
A function that will be used to process the data matrix by using a dimensionality reduction technique. The function should have one parameter, the data matrix, and should return an embedding describing the reduced space. By default, the function will use the precise PCA method with prcomp.
- umap_arguments
A list containing the arguments that will be passed to the UMAP function. Refer to the uwot::umap function for more details.
- prune_value
Argument indicating whether to prune the SNN graph. If the value is 0, the graph won't be pruned. If the value is between 0 and 1, the edges with weight under the pruning value will be removed. If the value is -1, the highest pruning value will be calculated automatically and used.
- algorithm_dim_reduction
An index indicating the community detection algorithm that will be used in the Dimensionality reduction step.
- algorithm_graph_construct
An index indicating the community detection algorithm that will be used in the Graph construction step.
- algorithms_clustering_assessment
An index, or a vector of indices, indicating which community detection algorithms will be used for the clustering step: Louvain (1), Louvain refined (2), SLM (3) or Leiden (4). More details can be found in Seurat's FindClusters function.
- clustering_arguments
A list containing the arguments that will be passed to the community detection algorithm, such as the number of iterations and the number of starts. Refer to Seurat's FindClusters function for more details.
- verbose
Boolean value used for displaying the progress of the assessment.
- temp_file
The path to the file where the object will be saved.
- save_temp
A boolean value indicating if the object will be saved to a file.
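The shapes of seed_sequence, features_sets and steps described above can be sketched in base R. This is only an illustration: the seed sequence mirrors the documented default (starting from 1 with a step of 100), but how the function builds it internally is an assumption, and the gene names and the set name all_genes are hypothetical.

```r
# Seeds as described for seed_sequence = NULL: start at 1, step of 100
# (assumed construction, one seed per repetition)
n_repetitions <- 3
seed_sequence <- seq(from = 1, by = 100, length.out = n_repetitions)

# One named feature set; steps must reuse exactly the same names
gene_names <- paste0("gene", 1:10)   # hypothetical feature names
features_sets <- list(all_genes = gene_names)

# Subset sizes to evaluate for that set; -1 is read as "use all features"
steps <- list(all_genes = c(5, 7, -1))

# Guard against mismatched names before running the assessment
stopifnot(identical(names(features_sets), names(steps)))
```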
Value
A list having two fields:
all - a list that contains, for each clustering method and each resolution value, the EC consistency between the partitions obtained by changing the seed
filtered - similar to all, but for each configuration, we determine the number of clusters that appears the most and use only the partitions with this size
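The filtering behind the filtered field can be illustrated with a small base-R sketch. It is not the package's code: the partitions and seed names are made up, and resolving ties by taking the smallest cluster count is an assumption.

```r
# Toy partitions of six cells obtained with four different seeds
partitions <- list(
  seed_1   = c(1, 1, 2, 2, 3, 3),
  seed_101 = c(1, 1, 1, 2, 2, 2),
  seed_201 = c(1, 2, 2, 3, 3, 1),
  seed_301 = c(2, 2, 1, 1, 3, 3)
)

# Number of clusters found in each partition
n_clusters <- vapply(partitions, function(p) length(unique(p)), integer(1))

# Most frequent number of clusters (ties go to the smallest value here)
k_counts <- table(n_clusters)
most_frequent_k <- as.integer(names(k_counts)[which.max(k_counts)])

# Keep only the partitions that have this number of clusters
filtered_partitions <- partitions[n_clusters == most_frequent_k]
names(filtered_partitions)
```

In this toy case three of the four partitions have three clusters, so the two-cluster partition is dropped before any consistency score would be computed.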
Examples
if (FALSE) { # \dontrun{
set.seed(2024)
# create an already-transposed artificial expression matrix
expr_matrix <- matrix(
    c(runif(20 * 10), runif(30 * 10, min = 3, max = 4)),
    nrow = 10, byrow = FALSE
)
colnames(expr_matrix) <- as.character(seq_len(ncol(expr_matrix)))
rownames(expr_matrix) <- paste("feature", seq_len(nrow(expr_matrix)))
autom_object <- automatic_stability_assessment(
    expression_matrix = expr_matrix,
    n_repetitions = 3,
    n_neigh_sequence = c(5),
    resolution_sequence = c(0.1, 0.5),
    features_sets = list(
        "set1" = rownames(expr_matrix)
    ),
    steps = list(
        "set1" = c(5, 7)
    ),
    umap_arguments = list(
        # the following parameters have been modified
        # from the default values to ensure that
        # the function will run under 5 seconds
        n_neighbors = 3,
        approx_pow = TRUE,
        n_epochs = 0,
        init = "random",
        min_dist = 0.3
    ),
    n_top_configs = 1,
    algorithms_clustering_assessment = 1,
    save_temp = FALSE,
    verbose = FALSE
)
# the object can be further used to plot the assessment results
plot_feature_overall_stability_boxplot(autom_object$feature_stability)
plot_n_neigh_ecs(autom_object$set1$"5"$nn_stability)
plot_k_n_partitions(autom_object$set1$"5"$clustering_stability)
} # }