PLEASE NOTE: This analysis is run in Python, so Python must be installed and its location specified if it is not on PATH. Both sreformat and PaTMaN must also be installed and their paths specified if they are not on PATH. Python >= 3.6 is required to use the necessary packages. The Python component requires the following libraries: os, Bio, gtfparse, pandas, numpy, math, scipy.stats, matplotlib.pyplot, seaborn (imported as sns), statistics, logging. Please ensure these are installed for the version of Python you supply.
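
Before calling the function, it may help to confirm that the external tools can be found. The following is a minimal sketch using base R's Sys.which(); the executable names are the defaults listed in the usage and may differ on your system.

```r
# Look up each external tool on the PATH.
# Sys.which() returns "" for any executable it cannot find.
tools <- Sys.which(c("python", "sreformat", "patman"))

missing_tools <- names(tools)[tools == ""]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All tools found on PATH.")
}
```

Any tool reported as missing would need its full path passed through pythonversion, sreformatpath or patmanpath.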

preparedataset(
  pythonversion = "python",
  mRNA_3pUTR = "",
  miRNA_full = "",
  interactions = "",
  annotations = "",
  fullchromosomes = "",
  seed = 1,
  nonseed_miRNA = 0,
  flankingmRNA = 0,
  UTR_output = "",
  chr = "",
  o = "feamiR_",
  positiveset = "",
  negativeset = "",
  sreformatpath = "sreformat",
  patmanpath = "patman",
  patmanoutput = "",
  minvalidationentries = 40,
  num_runs = 100,
  check_python = T
)

Arguments

pythonversion

File path for installed Python version (default: python)

mRNA_3pUTR

Fasta file of only 3'UTRs, with gene name as name attribute (e.g. Serpinb8)

miRNA_full

Fasta file of full mature miRNA hairpins, with miRNA ID as name attribute (e.g. hsa-miR-576-3p)

interactions

CSV file containing only validated interactions between miRNA and mRNA (e.g. from miRTarBase). Must have columns miRNA (e.g. hsa-miR-576-3p), Target Gene (e.g. Serpinb8) and optionally Experiments (e.g. qRT-PCR) and/or Support Type (with values Functional MTI, Functional MTI (Weak), Non-Functional MTI, Non-Functional MTI (Weak))
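
As an illustration of the expected layout, the sketch below builds a one-row table with the column names described above and writes it to a temporary CSV. The row is a placeholder assembled from the example values in this description, not a real validated interaction, and the file name is hypothetical.

```r
# Placeholder row using the example values from the argument description.
# check.names = FALSE keeps the spaces in "Target Gene" and "Support Type".
interactions_df <- data.frame(
  miRNA = "hsa-miR-576-3p",
  `Target Gene` = "Serpinb8",
  Experiments = "qRT-PCR",
  `Support Type` = "Functional MTI",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# The two mandatory columns must be present.
required_cols <- c("miRNA", "Target Gene")
stopifnot(all(required_cols %in% names(interactions_df)))

# Write to a temporary file (hypothetical name) and read it back.
csv_path <- file.path(tempdir(), "interactions_example.csv")
write.csv(interactions_df, csv_path, row.names = FALSE)
roundtrip <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
```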

annotations

GTF file (e.g. from Ensembl) with attributes seqname (chromosome), feature (with 3'UTRs labelled exactly 'three_prime_utr'), transcript_id, gene_id and gene_name matching fullchromosomes and interactions

fullchromosomes

Fasta file (e.g. top level file from Ensembl) containing full sequence for each chromosome with name as chromosome (e.g. 1, matching seqname from annotations)

seed

Binary, 1 if full miRNA seed features should be included in statistical analysis. Default: 1.

nonseed_miRNA

Binary, 1 if full miRNA features should be included in statistical analysis. Seed features are always included. Default: 0.

flankingmRNA

Binary, 1 if flanking region mRNA features should be included in statistical analysis. Seed features are always included. Default: 0.

UTR_output

String. File name under which the 3'UTR fasta file should be saved (when annotations and fullchromosomes files are supplied)

chr

Number of chromosomes for species in question.

o

Output prefix for any files created and saved.

positiveset

CSV file containing validated pairs of miRNAs and mRNAs as output by the initial stage of analysis. If positiveset and negativeset are both supplied, analysis begins at the final statistical analysis stage.

negativeset

CSV file containing non-validated pairs of miRNAs and mRNAs as output by the initial stage of analysis. If positiveset and negativeset are both supplied, analysis begins at the final statistical analysis stage.

sreformatpath

File path for installed sreformat (default: sreformat)

patmanpath

File path for installed patman (default: patman)

patmanoutput

TXT file containing patman output (saved by the function as the output prefix o followed by patman_seed.txt). If supplied, analysis begins at the patman output processing stage.

minvalidationentries

Minimum number of entries for a validation category to be considered separately in statistical analysis (default: 40)

num_runs

Number of subsamples to create (default: 100)
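
To make the idea of balanced subsamples concrete, here is a rough sketch in base R. It is not the package's implementation (which runs in Python); the toy data frames, the small num_runs value, and the choice to sample without replacement down to the size of the smaller set are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy sets; real sets would come from the positive/negative CSVs.
positive <- data.frame(id = paste0("pos", 1:6), label = 1)
negative <- data.frame(id = paste0("neg", 1:20), label = 0)

num_runs <- 3                                 # small value for illustration
n <- min(nrow(positive), nrow(negative))      # size of each class per subsample

# Each subsample draws n rows from each set (without replacement, an assumption),
# so positives and negatives are equal in number.
subsamples <- lapply(seq_len(num_runs), function(i) {
  rbind(
    positive[sample(nrow(positive), n), ],
    negative[sample(nrow(negative), n), ]
  )
})
```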

check_python

Whether the Python version should be checked (default: T)

Value

CSV containing the full positive and negative sets. A folder, statistical_analysis, of heatmaps showing the significance of various features under Fisher exact and Chi-squared tests; seed analysis is always run, and full miRNA and flanking analyses are run if the respective parameters are set to 1. A folder, subsamples, containing CSVs for the subsamples (num_runs of them, 100 by default), each with equal numbers of positive and negative samples, for use in classifiers and feature selection.
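
For readers unfamiliar with the two tests named above, the sketch below applies both to a single 2x2 contingency table in base R. The counts are invented for illustration, and the package itself performs its analysis in Python, so this shows only the kind of comparison involved, not the package's code.

```r
# Hypothetical counts for one binary feature:
# rows = feature present/absent, columns = positive/negative set.
counts <- matrix(
  c(30, 10,
    15, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(feature = c("present", "absent"),
                  set = c("positive", "negative"))
)

# Fisher exact test and Chi-squared test on the same table.
fisher_p <- fisher.test(counts)$p.value
chisq_p <- chisq.test(counts)$p.value
```

A small p-value from either test would suggest that the feature is distributed differently between the positive and negative sets.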

Details

The function saves various intermediate files (named using the output prefix o). To start preparation from one of these files, supply it through the corresponding argument (patmanoutput, or positiveset and negativeset) and preparation will skip to that point. This should only be done with files output by the function.
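
The skipping behaviour can be pictured roughly as follows. starting_stage() is a hypothetical helper written for this illustration, not part of feamiR, and the precedence it applies when several files are supplied together is an assumption.

```r
# Hypothetical helper: report where preparation would begin,
# based on which pre-computed files are supplied ("" means not supplied).
starting_stage <- function(positiveset = "", negativeset = "", patmanoutput = "") {
  if (nzchar(positiveset) && nzchar(negativeset)) {
    "statistical analysis"          # both sets supplied: final stage only
  } else if (nzchar(patmanoutput)) {
    "patman output processing"      # patman output supplied
  } else {
    "full preparation"              # nothing supplied: start from the beginning
  }
}

starting_stage(positiveset = "pos.csv", negativeset = "neg.csv")
starting_stage(patmanoutput = "feamiR_patman_seed.txt")
starting_stage()
```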

Examples

preparedataset(
  pythonversion = Sys.which('python'),
  positiveset = system.file('samples', 'test_seed_positive.csv', package = 'feamiR'),
  negativeset = system.file('samples', 'test_seed_negative.csv', package = 'feamiR'),
  o = 'examples_',
  num_runs = 0,
  check_python = 'F'
)
#> Warning: '-W' not found
#> [1] 127