Calcualte the distance matrices using the BAM files

This function generates an average correlation/distance coefficient for every exon present in the BAM files. This is done by calculating the point-to-point correlation/distance of the distribution of reads across the transcript of each exon and comparing it across samples. The reason why exons are used instead of full length genes is that long intronic regions artificially increase the correlation since there is consistently no expression there, across samples. The user has the option to use genes instead, by running cast_gtf_to_genes separately, with non default parameters.

calculate_expression_similarity_transcript(
  bams = NULL,
  path.bams = ".",
  genes = NULL,
  path.gtf = list.files(".", pattern = "\\.g[tf]f$"),
  expression.matrix = NULL,
  subsample.genes = FALSE,
  make.index = FALSE,
  unique.only = TRUE,
  mapq.unique = 255,
  slack = 200,
  similarity.measure = "correlation_pearson",
  save.image.every.1000 = FALSE,
  ncores = 1,
  ...
)

Arguments

bams, path.bams	either a path to the directory where the BAM files are or a vector of paths to each individual file; if a path is specified, it extracts all files that end in .bam; looks in the working directory by default
genes	a tibble of the exons extracted from the gtf file; this is meant for speed if the output of `cast_gtf_to_genes` is already generated, or if the user wants to only calculate similarity for a subset of exons
path.gtf	the path to the gtf/gff annotation file (only used if genes is not provided); if unspecified, looks for one in the working directory
expression.matrix	expression matrix; not necessary but is used to filter the gtf to fewer entries and for subsampling if subsample.genes=TRUE; if not provided, raw read counts per exon are extracted from the BAM files
subsample.genes	logical, whether to subsample low abundance genes to decrease computational time; the first minimum of the distribution of abundances is calculated, and genes lower than it are subsampled to match the number of genes higher than it; the expression matrix needs to be provided for this calculation; a plot is generated to show that minimum
make.index	whether a BAM index should be generated; if this is FALSE (the default) and no index exists, the function will exit with an error; the index needs to have the same name as each BAM file, but ending with .bam.bai
unique.only	whether only uniquely mapped reads should contribute to the expression of a gene/exon; default is TRUE
mapq.unique	The values of the mapping quality field in the BAM file that corresponds to uniquely mapped reads; by default, values of 255 are used as these correspond to the most popular aligners, but an adjustment might be needed; the mapq scores should be as follows: 255 for STAR, 60 for hisat2, 255 for bowtie in -k mode, 40 for bowtie2 default, 50 for tophat
slack	slack needs to be >=readLength, adjust for efficiency; the default is 200, as it is higher than most modern sequencing experiments
similarity.measure	one of the similarity metrics to be used, defaults to pearson correlation; currently, only correlation is supported
save.image.every.1000	whether to save a workspace image after every 1000 exons are processed; default is FALSE
ncores	Number of cores for parallel computation; defaults to sequential computation, but parallelisation is highly encouraged; it is set to detectCores() if higher
...	arguments passed on to other methods

Value

A list with three elements: the first element is the expression matrix, as supplied or calculated; the other two are the expression levels matrix and expression levels similarity matrix; they have the same # of columns as the expression matrix, and as many rows as exons processed.

Examples

bams <- rep(system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE), 2)
genes <- data.frame("id" = 1:2,
                    "gene_id" = c("gene1", "gene2"),
                    "seqid" = c("seq1", "seq2"),
                    "start" = 1,
                    "end" = 1600)
expression.summary <- calculate_expression_similarity_transcript(
  bams = bams,
  genes = genes,
  mapq.unique = 99
)
#> Calculating expression similarity for 2 genes...
#>     this process may take a long time...
#>     ncores=1, running sequentially...
#> Finished! Time elapsed: 0.36 secs

Arguments

Value

See also

Examples