Assessment of Stability for Graph Clustering — automatic_stability

Evaluates the stability of different graph clustering methods in the clustering pipeline. The method will iterate through different values of the resolution parameter and compare, using the EC Consistency score, the partitions obtained at different seeds.

Usage

automatic_stability_assessment(
  expression_matrix,
  n_repetitions,
  n_neigh_sequence,
  resolution_sequence,
  features_sets,
  steps,
  seed_sequence = NULL,
  graph_reduction_embedding = "PCA",
  include_umap_nn_assessment = FALSE,
  n_top_configs = 3,
  ranking_criterion = "iqr",
  overall_summary = "median",
  ecs_threshold = 1,
  matrix_processing = function(dt_mtx, actual_npcs = 30, ...) {
     actual_npcs <-
    min(actual_npcs, ncol(dt_mtx)%/%2)
    
    RhpcBLASctl::blas_set_num_threads(foreach::getDoParWorkers())
     embedding <-
stats::prcomp(x = dt_mtx, rank. = actual_npcs)$x
    
    RhpcBLASctl::blas_set_num_threads(1)
rownames(embedding) <- rownames(dt_mtx)
   
colnames(embedding) <- paste0("PC_", seq_len(ncol(embedding)))
    
    return(embedding)
 },
  umap_arguments = list(),
  prune_value = -1,
  algorithm_dim_reduction = 1,
  algorithm_graph_construct = 1,
  algorithms_clustering_assessment = 1:3,
  clustering_arguments = list(),
  verbose = TRUE,
  temp_file = NULL,
  save_temp = TRUE
)

Arguments

expression_matrix: An expression matrix having the features on the rows and the cells on the columns.
n_repetitions: The number of repetitions of applying the pipeline with different seeds; ignored if seed_sequence is provided by the user. Defaults to 100.
n_neigh_sequence: A sequence of the number of nearest neighbours.
resolution_sequence: A sequence of resolution values. The resolution parameter controls the coarseness of the clustering. The higher the resolution, the more clusters will be obtained. The resolution parameter is used in the community detection algorithms.
features_sets: A list of the feature sets. A feature set is a list of genes from the expression matrix that will be used in the dimensionality reduction.
steps: A list with the same names as feature_sets. Each name has assigned a ector containing the sizes of the subsets; negative values will be interpreted as using all features.
seed_sequence: A custom seed sequence; if the value is NULL, the sequence will be built starting from 1 with a step of 100.
graph_reduction_embedding: The type of dimensionality reduction used for the graph construction. The options are "PCA" and "UMAP". Defaults to PCA.
include_umap_nn_assessment: A boolean value indicating if the UMAP embeddings will be used for the nearest neighbours assessment. Defaults to FALSE.
n_top_configs: The number of top configurations that will be used for the downstream analysis in the dimensionality reduction step. Defaults to 3.
ranking_criterion: The criterion used for ranking the configurations from the dimensionality reduction step. The options are "iqr", "median", "max", "top_qt", "top_qt_max", "iqr_median", "iqr_median_coeff" and "mean". Defaults to iqr.
overall_summary: A function used to summarize the stability of the configurations from the dimensionality reduction step across the different resolution values. The options are "median", "max", "top_qt", "top_qt_max", "iqr", "iqr_median", "iqr_median_coeff" and "mean". Defaults to median.
ecs_threshold: The ECS threshold used for merging similar clusterings.
matrix_processing: A function that will be used to process the data matrix by using a dimensionality reduction technique. The function should have one parameter, the data matrix, and should return an embedding describing the reduced space. By default, the function will use the precise PCA method with prcomp.
umap_arguments: A list containing the arguments that will be passed to the UMAP function. Refer to the uwot::umap function for more details.
prune_value: Argument indicating whether to prune the SNN graph. If the value is 0, the graph won't be pruned. If the value is between 0 and 1, the edges with weight under the pruning value will be removed. If the value is -1, the highest pruning value will be calculated automatically and used.
algorithm_dim_reduction: An index indicating the community detection algorithm that will be used in the Dimensionality reduction step.
algorithm_graph_construct: An index indicating the community detection algorithm that will be used in the Graph construction step.
algorithms_clustering_assessment: An index indicating which community detection algorithm will be used for the clustering step: Louvain (1), Louvain refined (2), SLM (3) or Leiden (4). More details can be found in the Seurat's FindClusters function.
clustering_arguments: A list containing the arguments that will be passed to the community detection algorithm, such as the number of iterations and the number of starts. Refer to the Seurat's FindClusters function for more details.
verbose: Boolean value used for displaying the progress of the assessment.
temp_file: The path to the file where the object will be saved.
save_temp: A boolean value indicating if the object will be saved to a file.

Value

A list having two fields:

all - a list that contains, for each clustering method and each resolution value, the EC consistency between the partitions obtained by changing the seed
filtered - similar to all, but for each configuration, we determine the number of clusters that appears the most and use only the partitions with this size

Examples

if (FALSE) { # \dontrun{
set.seed(2024)
# create an already-transposed artificial expression matrix
expr_matrix <- matrix(
    c(runif(20 * 10), runif(30 * 10, min = 3, max = 4)),
    nrow = 10, byrow = FALSE
)
colnames(expr_matrix) <- as.character(seq_len(ncol(expr_matrix)))
rownames(expr_matrix) <- paste("feature", seq_len(nrow(expr_matrix)))

autom_object <- automatic_stability_assessment(
    expression_matrix = expr_matrix,
    n_repetitions = 3,
    n_neigh_sequence = c(5),
    resolution_sequence = c(0.1, 0.5),
    features_sets = list(
        "set1" = rownames(expr_matrix)
    ),
    steps = list(
        "set1" = c(5, 7)
    ),
    umap_arguments = list(
        # the following parameters have been modified
        # from the default values to ensure that
        # the function will run under 5 seconds
        n_neighbors = 3,
        approx_pow = TRUE,
        n_epochs = 0,
        init = "random",
        min_dist = 0.3
    ),
    n_top_configs = 1,
    algorithms_clustering_assessment = 1,
    save_temp = FALSE,
    verbose = FALSE
)

# the object can be further used to plot the assessment results
plot_feature_overall_stability_boxplot(autom_object$feature_stability)
plot_n_neigh_ecs(autom_object$set1$"5"$nn_stability)
plot_k_n_partitions(autom_object$set1$"5"$clustering_stability)
} # }