R/calculate_expression_similarity_transcript.R
calculate_expression_similarity_transcript.Rd
This function generates an average correlation/distance coefficient
for every exon present in the BAM files. This is done by calculating
the point-to-point correlation/distance of the distribution of reads
across the transcript of each exon and comparing it across samples.
The reason why exons are used instead of full length genes is that long
intronic regions artificially increase the correlation since there is
consistently no expression there, across samples. The user has the
option to use genes instead, by running cast_gtf_to_genes
separately,
with non default parameters.
calculate_expression_similarity_transcript( bams = NULL, path.bams = ".", genes = NULL, path.gtf = list.files(".", pattern = "\\.g[tf]f$"), expression.matrix = NULL, subsample.genes = FALSE, make.index = FALSE, unique.only = TRUE, mapq.unique = 255, slack = 200, similarity.measure = "correlation_pearson", save.image.every.1000 = FALSE, ncores = 1, ... )
bams, path.bams | either a path to the directory where the BAM files are or a vector of paths to each individual file; if a path is specified, it extracts all files that end in .bam; looks in the working directory by default |
---|---|
genes | a tibble of the exons extracted from the gtf file;
this is meant for speed if the output of |
path.gtf | the path to the gtf/gff annotation file (only used if genes is not provided); if unspecified, looks for one in the working directory |
expression.matrix | expression matrix; not necessary but is used to filter the gtf to fewer entries and for subsampling if subsample.genes=TRUE; if not provided, raw read counts per exon are extracted from the BAM files |
subsample.genes | logical, whether to subsample low abundance genes to decrease computational time; the first minimum of the distribution of abundances is calculated, and genes lower than it are subsampled to match the number of genes higher than it; the expression matrix needs to be provided for this calculation; a plot is generated to show that minimum |
make.index | whether a BAM index should be generated; if this is FALSE (the default) and no index exists, the function will exit with an error; the index needs to have the same name as each BAM file, but ending with .bam.bai |
unique.only | whether only uniquely mapped reads should contribute to the expression of a gene/exon; default is TRUE |
mapq.unique | The values of the mapping quality field in the BAM file that corresponds to uniquely mapped reads; by default, values of 255 are used as these correspond to the most popular aligners, but an adjustment might be needed; the mapq scores should be as follows: 255 for STAR, 60 for hisat2, 255 for bowtie in -k mode, 40 for bowtie2 default, 50 for tophat |
slack | slack needs to be >=readLength, adjust for efficiency; the default is 200, as it is higher than most modern sequencing experiments |
similarity.measure | one of the similarity metrics to be used, defaults to pearson correlation; currently, only correlation is supported |
save.image.every.1000 | whether to save a workspace image after every 1000 exons are processed; default is FALSE |
ncores | Number of cores for parallel computation; defaults to sequential computation, but parallelisation is highly encouraged; it is set to detectCores() if higher |
... | arguments passed on to other methods |
A list with three elements: the first element is the expression matrix, as supplied or calculated; the other two are the expression levels matrix and expression levels similarity matrix; they have the same # of columns as the expression matrix, and as many rows as exons processed.
bams <- rep(system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE), 2) genes <- data.frame("id" = 1:2, "gene_id" = c("gene1", "gene2"), "seqid" = c("seq1", "seq2"), "start" = 1, "end" = 1600) expression.summary <- calculate_expression_similarity_transcript( bams = bams, genes = genes, mapq.unique = 99 )#>#>#>#>