Chapter 3 Dimensionality reduction

The first tab explores and evaluates the effects of dimensionality reduction on the dataset. The first parameter we assess is the feature set. By default, Seurat relies on the highly variable genes for PCA computation. The app lets you examine various feature sets and assess their stability. To demonstrate the effect of the feature set, we test two different sets of features:

highly variable genes: this set can be obtained in the Seurat pipeline by using either the SCTransform or the FindVariableFeatures method. With default parameters, SCTransform returns 3000 genes and FindVariableFeatures returns 2000.
most abundant genes: this set can be obtained by sorting the genes by their expression level, from highest to lowest.

In addition to these two sets, we also evaluate the effect of the size of the feature set by choosing the top 500, 1000, and 1500 genes from each set.
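As an illustration of how nested feature sets of increasing size can be built, here is a minimal Python sketch. It is not the ClustAssess or Seurat implementation: the function names and the toy data are hypothetical, and "expression level" is assumed to mean the mean expression of a gene across cells. For highly variable genes, the ranked list would come from Seurat instead.

```python
def rank_by_abundance(expression):
    """Rank genes from most to least abundant.

    expression: dict mapping gene name -> list of per-cell expression values.
    Abundance is assumed to be the mean expression across cells.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    means = {gene: sum(values) / len(values) for gene, values in expression.items()}
    return sorted(means, key=lambda gene: (-means[gene], gene))


def nested_feature_sets(ranked_genes, sizes):
    """Take the top-n genes of a ranked list for every requested size."""
    return {size: ranked_genes[:size] for size in sizes}


# Toy data: 4 genes measured in 3 cells (illustrative values only).
toy_expression = {
    "GAPDH": [5.0, 5.0, 5.0],
    "CD3E": [1.0, 0.0, 1.0],
    "ACTB": [9.0, 9.0, 9.0],
    "XIST": [0.0, 0.0, 0.0],
}
ranked = rank_by_abundance(toy_expression)
feature_sets = nested_feature_sets(ranked, sizes=[2, 3])
print(feature_sets)
```

With real data, the sizes would be 500, 1000, and 1500, as in the text above.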

3.1 ECC per individual resolution values

The first plot shows the Element-Centric Consistency (ECC) across all communities obtained throughout the iterations. ECC is calculated for each feature set at each resolution value, and you can toggle between resolutions.
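To make the quantity concrete, the following Python sketch computes a per-cell consistency score from repeated clustering runs. It rests on two assumptions that are not stated in this chapter: that ECC is the average pairwise Element-Centric Similarity over all pairs of runs, and that for flat, disjoint partitions the similarity of a cell reduces to the closed form used below. The function names are hypothetical; for real analyses, use the ClustAssess functions rather than this sketch.

```python
from collections import Counter
from itertools import combinations


def ecs_per_element(part_a, part_b):
    """Per-element similarity between two flat, disjoint partitions.

    For an element in a cluster of size a (first partition) and size b
    (second partition) that share k elements, the assumed score is
    1 - 0.5 * (k * |1/a - 1/b| + (a - k)/a + (b - k)/b).
    """
    if len(part_a) != len(part_b):
        raise ValueError("partitions must label the same elements")
    size_a = Counter(part_a)
    size_b = Counter(part_b)
    overlap = Counter(zip(part_a, part_b))
    scores = []
    for label_a, label_b in zip(part_a, part_b):
        a, b, k = size_a[label_a], size_b[label_b], overlap[(label_a, label_b)]
        distance = k * abs(1 / a - 1 / b) + (a - k) / a + (b - k) / b
        scores.append(1 - 0.5 * distance)
    return scores


def ecc_per_element(partitions):
    """Average the pairwise per-element similarity over all pairs of runs."""
    pairs = list(combinations(partitions, 2))
    if not pairs:
        raise ValueError("need at least two partitions")
    totals = [0.0] * len(partitions[0])
    for part_a, part_b in pairs:
        for i, score in enumerate(ecs_per_element(part_a, part_b)):
            totals[i] += score
    return [total / len(pairs) for total in totals]


# Three toy runs on four cells; the first two differ only in label names.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
ecc = ecc_per_element(runs)
print(ecc)
```

A score of 1 means a cell is always grouped with the same neighbours, regardless of how the clusters are labelled; lower values indicate that its cluster membership changes between runs.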

Figure 3.1: ECC per individual resolution values

You may also click anywhere in the plot, which will display the UMAP dimensionality reduction coloured by the ECC, for a specific feature set at a given resolution.

Figure 3.2: UMAP displayed for the selected area. We also include ECC summary statistics

3.2 Incremental ECS per individual resolution values

Another way to assess stability in the ClustAssess app is to compare consecutive steps (feature set sizes) for each feature set, using Element-Centric Similarity (ECS) on the most frequent partitions. This evaluates the impact of increasing the number of genes on the final partitions and indirectly locates the signal-to-noise transition. In our case study on the Immune dataset, the similarity between consecutive steps increased for both the most abundant and the highly variable genes, indicating that selecting more genes can lead to more robust partitioning. In the app, you can toggle between resolution values and adjust aspects of the plot, such as its width or text size. Together, this helps you judge the stability of the clustering results and tune the feature selection accordingly.
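The comparison of consecutive steps can be sketched in Python as follows. This is an illustration under assumptions, not the app's code: the most frequent partition is found after relabelling clusters by order of first appearance, the similarity uses a closed form assumed to hold for flat, disjoint partitions, and all function names and toy data are hypothetical.

```python
from collections import Counter


def canonical(partition):
    """Relabel clusters by first appearance so equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(label, len(mapping)) for label in partition)


def most_frequent_partition(partitions):
    """Return the partition (in canonical labels) seen most often across runs."""
    counts = Counter(canonical(p) for p in partitions)
    return counts.most_common(1)[0][0]


def ecs_per_element(part_a, part_b):
    """Assumed closed form of per-element similarity for disjoint partitions."""
    size_a, size_b = Counter(part_a), Counter(part_b)
    overlap = Counter(zip(part_a, part_b))
    scores = []
    for la, lb in zip(part_a, part_b):
        a, b, k = size_a[la], size_b[lb], overlap[(la, lb)]
        scores.append(1 - 0.5 * (k * abs(1 / a - 1 / b) + (a - k) / a + (b - k) / b))
    return scores


def incremental_ecs(runs_by_size):
    """Compare the most frequent partition of each step with the next step."""
    sizes = sorted(runs_by_size)
    representative = {s: most_frequent_partition(runs_by_size[s]) for s in sizes}
    return {
        (small, large): ecs_per_element(representative[small], representative[large])
        for small, large in zip(sizes, sizes[1:])
    }


# Toy runs for three feature set sizes on four cells.
runs_by_size = {
    500: [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]],
    1000: [[0, 0, 1, 1], [0, 0, 1, 1]],
    1500: [[0, 0, 0, 0], [0, 0, 0, 0]],
}
incremental = incremental_ecs(runs_by_size)
print(incremental)
```

In the toy data, the step from 500 to 1000 genes leaves the partition unchanged (similarity 1 for every cell), while the step from 1000 to 1500 merges the two clusters and lowers the similarity.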

Figure 3.3: Incremental ECS per individual resolution values

3.3 Overall Stability

This plot shows the overall ECC across the different feature sets. For each feature set, we take the median ECC at each resolution value and plot the distribution of these medians; the overall incremental stability plot is built in the same way from the incremental ECS. In this case study, we observe an increase in similarity between consecutive steps for both the most abundant and the highly variable genes, suggesting that selecting more genes would lead to a more robust partitioning.
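The summarisation step can be illustrated with a short Python sketch: one median per resolution value, collected into a distribution per feature set. The configuration names, resolution values, and scores below are invented for illustration only.

```python
from statistics import median


def overall_stability(ecc_by_resolution):
    """Collapse per-cell ECC values into one median per resolution.

    ecc_by_resolution: dict mapping resolution -> list of per-cell ECC values.
    Returns the medians ordered by resolution; this list is the distribution
    that would be drawn for one feature set.
    """
    return [median(ecc_by_resolution[res]) for res in sorted(ecc_by_resolution)]


# Toy per-cell ECC values for two hypothetical configurations.
toy_ecc = {
    "HVG_500": {0.5: [1.0, 0.9, 0.8], 1.0: [0.7, 0.6, 0.5]},
    "HVG_1500": {0.5: [1.0, 1.0, 0.9], 1.0: [0.9, 0.8, 0.7]},
}
overall = {name: overall_stability(values) for name, values in toy_ecc.items()}
print(overall)
```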

Figure 3.4: Overall stability

We also show the overall incremental stability.

Figure 3.5: Overall Incremental Stability

3.4 Pairwise comparison of gene and metadata distribution

As a final step to aid in feature selection, ClustAssess allows for UMAP visualization of gene expression across multiple genes and metadata features. You can select one or multiple genes and compare their expression at the single cell level to other genes of interest. When selecting more than one gene, you can set a gene expression threshold to highlight cells that express the given set of genes above the threshold. Two additional UMAP plots are provided to enable evaluation of gene expression in relation to any available metadata features.
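The thresholding logic can be sketched in Python as below. It assumes that a cell is highlighted only when every selected gene is expressed strictly above the threshold; the function name, gene names, and values are illustrative and are not taken from the app.

```python
def cells_above_threshold(expression, genes, threshold):
    """Flag cells in which all selected genes exceed the threshold.

    expression: dict mapping gene name -> list of per-cell expression values.
    Returns one boolean per cell, in the same order as the input lists.
    """
    n_cells = len(expression[genes[0]])
    return [
        all(expression[gene][cell] > threshold for gene in genes)
        for cell in range(n_cells)
    ]


# Toy expression values for two genes in four cells.
toy_expression = {
    "CD3E": [2.0, 0.0, 1.5, 3.0],
    "CD8A": [1.2, 2.0, 0.0, 2.5],
}
highlighted = cells_above_threshold(toy_expression, ["CD3E", "CD8A"], threshold=1.0)
print(highlighted)
```

Only the first and last cells express both genes above the threshold, so only those would be highlighted on the UMAP.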

Figure 3.6: UMAP plots showing pairwise gene expression against metadata features

3.5 Choosing a feature set

The final section in this tab prompts you to choose a feature set in order to move on. This choice should be based on the evaluation of all plots. We recommend the feature set with the highest stability, in this case the 5000 most abundant genes.

To suggest the most stable configuration, we first rank all configurations by descending ECS, once by overall ECS and once by overall incremental ECS. We keep the configurations that appear in the top half of both rankings. The remaining configurations are then ranked by the interquartile range of their incremental ECS, and the one with the most compact distribution is chosen.
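One possible reading of this ranking procedure is sketched below in Python. Several details are assumptions rather than facts about the app: configurations are ranked by the median of their distribution, "top half" is rounded up, and the interquartile range is computed with the default method of statistics.quantiles. The function names, configuration names, and scores are hypothetical.

```python
from statistics import median, quantiles


def top_half(scores):
    """Names of the configurations in the top half, ranked by median score."""
    ranked = sorted(scores, key=lambda name: median(scores[name]), reverse=True)
    keep = (len(ranked) + 1) // 2  # round up for an odd number of configurations
    return set(ranked[:keep])


def iqr(values):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def suggest_configuration(overall, incremental):
    """Intersect both top halves, then pick the most compact incremental one."""
    candidates = top_half(overall) & top_half(incremental)
    if not candidates:
        return None
    # sorted() makes the choice deterministic if two IQRs are equal
    return min(sorted(candidates), key=lambda name: iqr(incremental[name]))


# Toy distributions for four hypothetical configurations.
toy_overall = {
    "HVG_500": [0.70, 0.72, 0.71, 0.69],
    "HVG_1500": [0.90, 0.92, 0.91, 0.93],
    "MA_500": [0.60, 0.62, 0.61, 0.59],
    "MA_1500": [0.88, 0.90, 0.89, 0.91],
}
toy_incremental = {
    "HVG_500": [0.60, 0.70, 0.65, 0.62],
    "HVG_1500": [0.80, 0.95, 0.85, 0.99],
    "MA_500": [0.50, 0.55, 0.52, 0.51],
    "MA_1500": [0.90, 0.91, 0.92, 0.90],
}
suggested = suggest_configuration(toy_overall, toy_incremental)
print(suggested)
```

In the toy data, both 1500-gene configurations survive the intersection, and the one with the narrower spread of incremental scores is suggested.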

Figure 3.7: Feature set selection. You will not be allowed to carry on unless you have selected a feature set