This function generates an average correlation/distance coefficient for every exon present in the BAM files. This is done by calculating the point-to-point correlation/distance of the distribution of reads across the transcript of each exon and comparing it across samples. The reason why exons are used instead of full length genes is that long intronic regions artificially increase the correlation since there is consistently no expression there, across samples. The user has the option to use genes instead, by running cast_gtf_to_genes separately, with non default parameters.

calculate_expression_similarity_transcript(
  bams = NULL,
  path.bams = ".",
  genes = NULL,
  path.gtf = list.files(".", pattern = "\\.g[tf]f$"),
  expression.matrix = NULL,
  subsample.genes = FALSE,
  make.index = FALSE,
  unique.only = TRUE,
  mapq.unique = 255,
  slack = 200,
  similarity.measure = "correlation_pearson",
  save.image.every.1000 = FALSE,
  ncores = 1,
  ...
)

Arguments

bams, path.bams

either a path to the directory where the BAM files are or a vector of paths to each individual file; if a path is specified, it extracts all files that end in .bam; looks in the working directory by default

genes

a tibble of the exons extracted from the gtf file; this is meant for speed if the output of cast_gtf_to_genes is already generated, or if the user wants to only calculate similarity for a subset of exons

path.gtf

the path to the gtf/gff annotation file (only used if genes is not provided); if unspecified, looks for one in the working directory

expression.matrix

expression matrix; not necessary but is used to filter the gtf to fewer entries and for subsampling if subsample.genes=TRUE; if not provided, raw read counts per exon are extracted from the BAM files

subsample.genes

logical, whether to subsample low abundance genes to decrease computational time; the first minimum of the distribution of abundances is calculated, and genes lower than it are subsampled to match the number of genes higher than it; the expression matrix needs to be provided for this calculation; a plot is generated to show that minimum

make.index

whether a BAM index should be generated; if this is FALSE (the default) and no index exists, the function will exit with an error; the index needs to have the same name as each BAM file, but ending with .bam.bai

unique.only

whether only uniquely mapped reads should contribute to the expression of a gene/exon; default is TRUE

mapq.unique

The values of the mapping quality field in the BAM file that corresponds to uniquely mapped reads; by default, values of 255 are used as these correspond to the most popular aligners, but an adjustment might be needed; the mapq scores should be as follows: 255 for STAR, 60 for hisat2, 255 for bowtie in -k mode, 40 for bowtie2 default, 50 for tophat

slack

slack needs to be >=readLength, adjust for efficiency; the default is 200, as it is higher than most modern sequencing experiments

similarity.measure

one of the similarity metrics to be used, defaults to pearson correlation; currently, only correlation is supported

save.image.every.1000

whether to save a workspace image after every 1000 exons are processed; default is FALSE

ncores

Number of cores for parallel computation; defaults to sequential computation, but parallelisation is highly encouraged; it is set to detectCores() if higher

...

arguments passed on to other methods

Value

A list with three elements: the first element is the expression matrix, as supplied or calculated; the other two are the expression levels matrix and expression levels similarity matrix; they have the same # of columns as the expression matrix, and as many rows as exons processed.

See also

Examples

bams <- rep(system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE), 2) genes <- data.frame("id" = 1:2, "gene_id" = c("gene1", "gene2"), "seqid" = c("seq1", "seq2"), "start" = 1, "end" = 1600) expression.summary <- calculate_expression_similarity_transcript( bams = bams, genes = genes, mapq.unique = 99 )
#> Calculating expression similarity for 2 genes...
#> this process may take a long time...
#> ncores=1, running sequentially...
#> Finished! Time elapsed: 0.36 secs