Building custom metabolite and pathway databases for SMEW
SMEW can use user-supplied annotation and pathway resources during
create_smew_app().
This vignette explains how to build the three optional CSV inputs:
- metabolite_table: maps exact masses to metabolite IDs and names
- pathway_table: defines which metabolites belong to each pathway
- pathway_classification: optional categories used for ORA category visualisation
Where these files are used
You can pass these files directly to
create_smew_app():
create_smew_app(
intensity_csv = "path/to/intensity.csv",
metadata_csv = "path/to/metadata.csv",
output_dir = "test_app",
metabolite_table = "path/to/metabolite_table.csv",
pathway_table = "path/to/pathway_table.csv",
pathway_classification = "path/to/pathway_classification.csv",
adducts = c("M-H [1-]"),
ion_mode = "Negative"
)
These files can be generated from custom databases or from public resources. However, users should ensure that their usage complies with the licenses of any public databases used as sources for these files, and that those databases are cited appropriately in any publications using SMEW.
Required structure of the input files
The three tables must be in CSV format and have the following
required columns:
- metabolite_table.csv: MetaboliteID, ExactMass, MetaboliteName
- pathway_table.csv: PathwayID, PathwayName, MetaboliteIDs
- pathway_classification.csv: PathwayName, PathwayID, Category1, Category2
The MetaboliteIDs column in pathway_table.csv should be
a comma-delimited string of metabolite IDs. The metabolite IDs can be
any stable IDs of your choice, but they must match the
MetaboliteID column in metabolite_table.csv
for the corresponding metabolites. The PathwayName column
in pathway_classification.csv should match the
PathwayName column in pathway_table.csv for
the corresponding pathways.
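As a minimal sketch of these matching rules, the toy tables below describe one KEGG pathway and two of its compounds. The rows are illustrative only (masses are rounded), and the final two checks are just one possible way to confirm that the IDs and names line up.

```r
# Toy metabolite table: one row per metabolite
metabolite_table <- data.frame(
  MetaboliteID   = c("C00022", "C00031"),
  ExactMass      = c(88.0160, 180.0634),
  MetaboliteName = c("Pyruvate", "D-Glucose")
)

# Toy pathway table: members are stored as one comma-delimited string
pathway_table <- data.frame(
  PathwayID     = "map00010",
  PathwayName   = "Glycolysis / Gluconeogenesis",
  MetaboliteIDs = "C00022,C00031"
)

# Toy classification table: PathwayName matches pathway_table
pathway_classification <- data.frame(
  PathwayName = "Glycolysis / Gluconeogenesis",
  PathwayID   = "map00010",
  Category1   = "Metabolism",
  Category2   = "Carbohydrate metabolism"
)

# Every pathway member should be a known MetaboliteID
pathway_ids <- unlist(strsplit(pathway_table$MetaboliteIDs, ",", fixed = TRUE))
ids_ok <- all(pathway_ids %in% metabolite_table$MetaboliteID)

# Every classified pathway name should appear in the pathway table
names_ok <- all(pathway_classification$PathwayName %in% pathway_table$PathwayName)
```

Both `ids_ok` and `names_ok` should be TRUE for a consistent set of tables.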
Here we walk through generating the required structure from some example databases.
Workflow roadmap
This vignette is organised as follows:
- A complete KEGG workflow to build all three SMEW input files.
- Alternative sources for metabolite_table.csv.
- Alternative sources for pathway_table.csv.
- Final validation checks before running create_smew_app().
1) Using KEGG to build metabolite_table.csv, pathway_table.csv and pathway_classification.csv
The KEGG database provides comprehensive information on metabolites and pathways, making it a valuable resource for building the required tables for SMEW. Below is an example of how to extract and format data from KEGG to create the necessary CSV files.
As an example, here is what the first few rows of each table might look like based on KEGG data:
# Example metabolite_table
metabolite_table <- read.csv('../databases/metabolite_masses.csv')
print(head(metabolite_table))
##> MetaboliteID ExactMass
##> 1 C16386 162.1157
##> 2 C16387 1105.3762
##> 3 C16388 1123.3867
##> 4 C16389 1121.3711
##> 5 C16394 197.0437
##> 6 C16407 434.1213
##> MetaboliteName
##> 1 (R)-Nicotine
##> 2 (2E,6Z,9Z,12Z,15Z,18Z)-Tetracosahexaenoyl-CoA
##> 3 (3R,6Z,9Z,12Z,15Z,18Z)-3-Hydroxytetracosapentaenoyl-CoA
##> 4 (6Z,9Z,12Z,15Z,18Z)-3-Oxotetracosapentaenoyl-CoA
##> 5 4-Amino-2,6-dinitrotoluene
##> 6 2',4,4',6'-Tetrahydroxychalcone 4'-O-glucoside
# Example pathway_table
pathway_table <- read.csv('../databases/pathways.csv')
print(head(pathway_table))
##> PathwayID PathwayName
##> 1 map00010 Glycolysis / Gluconeogenesis
##> 2 map00020 Citrate cycle (TCA cycle)
##> 3 map00030 Pentose phosphate pathway
##> 4 map00040 Pentose and glucuronate interconversions
##> 5 map00051 Fructose and mannose metabolism
##> 6 map00052 Galactose metabolism
##> MetaboliteIDs
##> 1 C00022,C00024,C00031,C00033,C00036,C00068,C00074,C00084,C00085,C00103,C00111,C00118,C00186,C00197,C00221,C00236,C00267,C00354,C00469,C00631,C00668,C01159,C01172,C01451,C05125,C06186,C06187,C06188,C15972,C15973,C16255
##> 2 C00022,C00024,C00026,C00036,C00042,C00068,C00074,C00091,C00122,C00149,C00158,C00311,C00417,C05125,C05379,C05381,C15972,C15973,C16254,C16255
##> 3 C00022,C00031,C00085,C00117,C00118,C00119,C00121,C00197,C00198,C00199,C00204,C00221,C00231,C00257,C00258,C00279,C00345,C00354,C00577,C00620,C00631,C00668,C00672,C00673,C01151,C01172,C01182,C01218,C01236,C01801,C02076,C03752,C04442,C05382,C06019,C06473,C20589
##> 4 C00022,C00026,C00029,C00103,C00111,C00116,C00167,C00181,C00191,C00199,C00204,C00216,C00231,C00259,C00266,C00309,C00310,C00312,C00333,C00379,C00433,C00470,C00474,C00476,C00502,C00508,C00514,C00532,C00558,C00618,C00714,C00789,C00800,C00817,C00905,C01068,C01101,C01508,C01904,C02266,C02273,C02426,C02753,C03033,C03291,C03826,C04053,C04349,C04575,C05385,C05411,C05412,C06118,C06441,C14899,C15930,C20680,C20902,C20903,C22337,C22712
##> 5 C00085,C00095,C00096,C00111,C00118,C00159,C00186,C00247,C00267,C00275,C00325,C00354,C00392,C00424,C00464,C00507,C00577,C00636,C00644,C00665,C00794,C00861,C00976,C01019,C01094,C01096,C01099,C01131,C01222,C01355,C01487,C01720,C01721,C01768,C01934,C02431,C02492,C02888,C02962,C02977,C02985,C02991,C03117,C03267,C03827,C03979,C05144,C05392,C06192,C11516,C11544,C18028,C18096,C20781,C20836
##> 6 C00029,C00031,C00052,C00085,C00089,C00095,C00103,C00111,C00116,C00118,C00124,C00137,C00159,C00243,C00267,C00446,C00492,C00577,C00668,C00794,C00795,C00880,C00984,C01097,C01113,C01132,C01216,C01235,C01286,C01613,C01697,C02262,C02669,C03383,C03733,C03785,C05396,C05399,C05400,C05401,C05402,C05404,C05796,C06311,C06376,C06377
# Example pathway_classification
pathway_classification <- read.csv('../databases/pathway_classification.csv')
print(head(pathway_classification))
##> PathwayID PathwayName Category1
##> 1 1100 Metabolic pathways Metabolism
##> 2 1110 Biosynthesis of secondary metabolites Metabolism
##> 3 1120 Microbial metabolism in diverse environments Metabolism
##> 4 1200 Carbon metabolism Metabolism
##> 5 1210 2-Oxocarboxylic acid metabolism Metabolism
##> 6 1212 Fatty acid metabolism Metabolism
##> Category2
##> 1 Global and overview maps
##> 2 Global and overview maps
##> 3 Global and overview maps
##> 4 Global and overview maps
##> 5 Global and overview maps
##> 6 Global and overview maps
To build these tables from KEGG, you can use the KEGG REST API to retrieve information on metabolites and pathways. Below are the general steps to do this:
We load some libraries for data manipulation and API access:
library(dplyr)
library(stringr)
library(tidyr)
library(progress)
library(pbapply)
library(httr)
library(readr)
- Build pathway_table.csv:
Use the KEGG API to get a list of pathways and their associated
metabolites, then create a data frame with the required columns
(PathwayID, PathwayName, MetaboliteIDs). The PathwayID can be the KEGG
pathway ID (e.g., map00010 for the glycolysis / gluconeogenesis reference
pathway), and the MetaboliteIDs column should contain a comma-delimited
string of the corresponding metabolite IDs.
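Before touching the KEGG API, the target format can be sketched offline. The example below assumes a small, hand-written long-format table of compound-pathway links (the object names are hypothetical) and shows both directions: collapsing links into comma-delimited strings, and expanding them again for inspection.

```r
library(dplyr)
library(tidyr)

# Hand-written long-format links: one row per compound-pathway pair
links_demo <- tibble(
  compound = c("C00022", "C00031", "C00022", "C00024"),
  pathway  = c("map00010", "map00010", "map00020", "map00020")
)

# Collapse to one row per pathway with a comma-delimited member string
pathway_sets_demo <- links_demo |>
  group_by(PathwayID = pathway) |>
  summarise(
    MetaboliteIDs = paste(unique(compound), collapse = ","),
    .groups = "drop"
  )

# Expand back to one row per pathway member, e.g. to inspect membership
links_roundtrip <- pathway_sets_demo |>
  separate_rows(MetaboliteIDs, sep = ",")
```

Expanding the strings again is a convenient way to confirm that no IDs were lost or merged when the table was collapsed.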
First we retrieve the compound-pathway links from KEGG:
download_kegg_pathway_links <- function() {
url <- "https://rest.kegg.jp/link/pathway/compound"
message("Downloading KEGG compound–pathway links...")
txt <- readLines(url, warn = FALSE)
df <- tibble(raw = txt) |>
separate(raw, into = c("compound", "pathway"), sep = "\t")
return(df)
}
kegg_links <- download_kegg_pathway_links()
##> Downloading KEGG compound–pathway links...
print(head(kegg_links))
##> # A tibble: 6 × 2
##> compound pathway
##> <chr> <chr>
##> 1 cpd:C00022 path:map00010
##> 2 cpd:C00024 path:map00010
##> 3 cpd:C00031 path:map00010
##> 4 cpd:C00033 path:map00010
##> 5 cpd:C00036 path:map00010
##> 6 cpd:C00068 path:map00010
Next we clean up the IDs to remove the prefixes:
kegg_links <- kegg_links |>
mutate(
compound = str_remove(compound, "cpd:"),
pathway = str_remove(pathway, "path:")
)
print(head(kegg_links))
##> # A tibble: 6 × 2
##> compound pathway
##> <chr> <chr>
##> 1 C00022 map00010
##> 2 C00024 map00010
##> 3 C00031 map00010
##> 4 C00033 map00010
##> 5 C00036 map00010
##> 6 C00068 map00010
Now we associate pathway names with the pathway IDs:
download_kegg_pathway_names <- function() {
url <- "https://rest.kegg.jp/list/pathway"
message("Downloading KEGG pathway names...")
txt <- readLines(url, warn = FALSE)
df <- tibble(raw = txt) |>
separate(raw, into = c("pathway", "name"), sep = "\t")
df <- df |>
mutate(pathway = str_remove(pathway, "path:"))
return(df)
}
kegg_pathways <- download_kegg_pathway_names()
##> Downloading KEGG pathway names...
print(head(kegg_pathways))
##> # A tibble: 6 × 2
##> pathway name
##> <chr> <chr>
##> 1 map01100 Metabolic pathways
##> 2 map01110 Biosynthesis of secondary metabolites
##> 3 map01120 Microbial metabolism in diverse environments
##> 4 map01200 Carbon metabolism
##> 5 map01210 2-Oxocarboxylic acid metabolism
##> 6 map01212 Fatty acid metabolism
Finally, we combine the information to build the pathway table with the required columns:
kegg_pathway_db <- kegg_links |>
left_join(kegg_pathways, by = "pathway")
kegg_pathway_sets <- kegg_pathway_db |>
group_by(pathway, name) |>
summarise(
compounds = paste(unique(compound),collapse = ','),
.groups = "drop"
)
pathway_table <- kegg_pathway_sets |>
rename(PathwayID = pathway, PathwayName = name, MetaboliteIDs = compounds)
print(head(pathway_table))
##> # A tibble: 6 × 3
##> PathwayID PathwayName MetaboliteIDs
##> <chr> <chr> <chr>
##> 1 map00010 Glycolysis / Gluconeogenesis C00022,C00024,C00031,C0003…
##> 2 map00020 Citrate cycle (TCA cycle) C00022,C00024,C00026,C0003…
##> 3 map00030 Pentose phosphate pathway C00022,C00031,C00085,C0011…
##> 4 map00040 Pentose and glucuronate interconversions C00022,C00026,C00029,C0010…
##> 5 map00051 Fructose and mannose metabolism C00085,C00095,C00096,C0011…
##> 6 map00052 Galactose metabolism C00029,C00031,C00052,C0008…
To use it within SMEW, the pathway table should be saved as a CSV file:
write.csv(pathway_table, "pathway_table.csv", row.names = FALSE)
- Build metabolite_table.csv:
Extract the exact masses and names of the metabolites from KEGG and
create a data frame with the required columns
(MetaboliteID, ExactMass, MetaboliteName). The MetaboliteID can be the
KEGG compound ID (e.g., C00031 for glucose).
To build the metabolite_table.csv, you need to retrieve
a list of compounds and their properties. This process can be
time-consuming due to the large number of compounds in KEGG, so it’s
recommended to limit the retrieval to compounds that are relevant to
your study (e.g., those that appear in your pathway table). Below is an
example of how to retrieve compound information and build the metabolite
table:
# Function to retrieve compound mass information
get_kegg_compound <- function(id) {
url <- paste0("https://rest.kegg.jp/get/", id)
txt <- readLines(url, warn = FALSE)
exact_mass <- txt[grepl("EXACT_MASS", txt)]
formula <- txt[grepl("FORMULA", txt)]
exact_mass <- str_extract(exact_mass, "[0-9]+\\.[0-9]+")
formula <- str_trim(sub("FORMULA", "", formula))
tibble(
kegg_id = id,
formula = formula,
exact_mass = as.numeric(exact_mass)
)
}
# Function to retrieve compound name information
download_kegg_compounds <- function() {
url <- "https://rest.kegg.jp/list/compound"
res <- GET(url)
stop_for_status(res)
txt <- content(res, "text")
lines <- strsplit(txt, "\n")[[1]]
df <- tibble(raw = lines) |>
filter(raw != "") |>
separate(raw, into = c("kegg_id", "name"), sep = "\t")
return(df)
}
kegg_list <- download_kegg_compounds()
# Here we keep only the compounds that appear in our pathway table. This speeds up
# retrieval and gives the most relevant annotations. For a more comprehensive
# database you can instead use every compound in KEGG (this takes a long time to run):
# ids <- kegg_list$kegg_id
# To retrieve every compound that appears in the pathway table:
# ids <- sort(unique(kegg_pathway_db$compound))
# For demonstration, we retrieve only the first 20 compounds:
ids <- head(sort(unique(kegg_pathway_db$compound)), 20)
db_list <- pblapply(ids, function(id) {
Sys.sleep(0.2) # to avoid hitting KEGG API rate limits
tryCatch(
get_kegg_compound(id),
error = function(e) NULL
)
})
db <- bind_rows(db_list)
db <- left_join(db, kegg_list, by = "kegg_id")
# Optionally, simplify the names by keeping only the first name before any semicolon
# (KEGG often lists several names separated by semicolons)
db$short_name <- stringr::word(db$name, 1, 1, sep = ";")
# Select columns by name rather than position, then rename them for SMEW
db_output <- db[, c("kegg_id", "exact_mass", "short_name")]
colnames(db_output) <- c("MetaboliteID", "ExactMass", "MetaboliteName")
print(head(db_output))
##> # A tibble: 6 × 3
##> MetaboliteID ExactMass MetaboliteName
##> <chr> <dbl> <chr>
##> 1 C00001 18.0 H2O
##> 2 C00002 507. ATP
##> 3 C00003 664. NAD+
##> 4 C00004 665. NADH
##> 5 C00005 745. NADPH
##> 6 C00006 744. NADP+
To use it within SMEW, the metabolite table should be saved as a CSV file:
write.csv(db_output, "metabolite_table.csv", row.names = FALSE)
- Build pathway_classification.csv (optional)
This table is optional and only used for ORA category overlays/plots.
Required columns:
- PathwayName: must match pathway names used in ORA output
- PathwayID: pathway ID (for reference and plotting labels)
- Category1: top-level category
- Category2: more specific subcategory
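One practical point when saving and re-reading this table: numeric-looking pathway IDs such as 01100 lose their leading zeros under read.csv()'s default type guessing. The sketch below uses a single toy row and a temporary file to show the effect, and one way to keep the ID as text. Whether the leading zeros matter depends on how you use PathwayID downstream, so treat this as an optional precaution.

```r
# One toy classification row with a zero-padded pathway ID
classification_demo <- data.frame(
  PathwayName = "Metabolic pathways",
  PathwayID   = "01100",
  Category1   = "Metabolism",
  Category2   = "Global and overview maps",
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".csv")
write.csv(classification_demo, tmp, row.names = FALSE)

# Default parsing guesses an integer column, so the leading zero is dropped
default_read <- read.csv(tmp)

# Forcing the ID column to character keeps the ID exactly as written
text_read <- read.csv(tmp, colClasses = c(PathwayID = "character"))
```

Here `default_read$PathwayID` is the integer 1100, whereas `text_read$PathwayID` is still the string "01100".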
Example structure:
parse_kegg_hierarchy <- function(file){
lines <- readLines(file)
print(head(lines))
lvl1 <- NA
lvl2 <- NA
out <- list()
for(line in lines){
if(str_starts(line,"A")){ lvl1 <- gsub("^A","",line) }
if(str_starts(line,"B")){ lvl2 <- trimws(gsub("^B ","",line)) }
if(str_starts(line,"C ")){
# Drop the leading "C" and trim, so that the pathway name carries no stray whitespace
path <- trimws(gsub("^C","",line))
path_id <- str_extract(path,"[0-9]+")
path_name <- str_remove(path,"^[0-9]+\\s+")
out[[length(out)+1]] <- data.frame(
PathwayID = path_id,
PathwayName = path_name,
Category1 = lvl1,
Category2 = lvl2,
stringsAsFactors = FALSE
)
}
}
bind_rows(out)
}
# download.file("https://rest.kegg.jp/get/br:br08901", destfile = "../databases/kegg_hierarchy.txt")
pathway_classification <- parse_kegg_hierarchy("../databases/kegg_hierarchy.txt")
##> [1] "+C\tMap number"
##> [2] "!"
##> [3] "AMetabolism"
##> [4] "B Global and overview maps"
##> [5] "C 01100 Metabolic pathways"
##> [6] "C 01110 Biosynthesis of secondary metabolites"
print(head(pathway_classification))
##> PathwayID PathwayName Category1
##> 1 01100 Metabolic pathways Metabolism
##> 2 01110 Biosynthesis of secondary metabolites Metabolism
##> 3 01120 Microbial metabolism in diverse environments Metabolism
##> 4 01200 Carbon metabolism Metabolism
##> 5 01210 2-Oxocarboxylic acid metabolism Metabolism
##> 6 01212 Fatty acid metabolism Metabolism
##> Category2
##> 1 Global and overview maps
##> 2 Global and overview maps
##> 3 Global and overview maps
##> 4 Global and overview maps
##> 5 Global and overview maps
##> 6 Global and overview maps
Write to CSV:
write.csv(pathway_classification, "pathway_classification.csv", row.names = FALSE)
Alternative sources for metabolite_table.csv
HMDB example for metabolite_table.csv
Use this when you want HMDB-native identifiers and rich metabolite metadata.
Another valuable resource for metabolite annotation is the Human Metabolome Database (HMDB). To build a mass table, you first need to download the metabolite data in XML format from https://hmdb.ca/downloads (e.g., the “All Metabolites” set). This is a large XML file containing information on all the metabolites in HMDB, including their exact masses and names. Each record can be extracted with either its HMDB ID or its ChEBI ID, so you can use whichever identifier matches your pathway table.
You can then process it into the required table format for SMEW as follows:
library(xml2)
library(data.table)
##>
##> Attaching package: 'data.table'
##> The following objects are masked from 'package:dplyr':
##>
##> between, first, last
library(purrr)
##>
##> Attaching package: 'purrr'
##> The following object is masked from 'package:data.table':
##>
##> transpose
doc <- read_xml("../databases/hmdb_metabolites.xml")
# Strip the default namespace so that elements can be matched by their plain names
doc <- xml_ns_strip(doc)
metabolites <- xml_find_all(doc, "//metabolite")
# Parse one <metabolite> record, supplied as XML text, into a one-row tibble:
parse_one_text <- function(xml_text_string) {
node <- read_xml(xml_text_string)
tibble(
MetaboliteID = xml_text(xml_find_first(node, ".//accession")),
ChEBI = xml_text(xml_find_first(node, ".//chebi_id")),
MetaboliteName = xml_text(xml_find_first(node, ".//name")),
ExactMass = xml_double(xml_find_first(node, ".//monisotopic_molecular_weight"))
)
}
# Here we extract only the first 20 metabolites as an example. You can run this for
# every metabolite in HMDB, but be aware that this takes a long time.
metabolite_texts <- as.character(head(metabolites,20))
annotation <- map_dfr(metabolite_texts, parse_one_text, .progress = TRUE)
# To speed this up, you can parallelise the parsing as follows:
# library(furrr)
# plan(multisession, workers = n_cores) # n_cores: number of worker processes to use
# annotation <- future_map_dfr(metabolite_texts, parse_one_text, .progress = TRUE)
print(head(annotation))
##> # A tibble: 6 × 4
##> MetaboliteID ChEBI MetaboliteName ExactMass
##> <chr> <chr> <chr> <dbl>
##> 1 HMDB0000001 50599 1-Methylhistidine 169.
##> 2 HMDB0000002 15725 1,3-Diaminopropane 74.1
##> 3 HMDB0000005 30831 2-Ketobutyric acid 102.
##> 4 HMDB0000008 50613 2-Hydroxybutyric acid 104.
##> 5 HMDB0000010 1189 2-Methoxyestrone 300.
##> 6 HMDB0000011 17066 3-Hydroxybutyric acid 104.
Depending on the pathway database you want to use, you can use either the HMDB ID column or the ChEBI column as the MetaboliteID column for SMEW analysis.
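As a sketch of that choice, the example below switches the identifier to ChEBI. It assumes an annotation table shaped like the one printed above; the two rows in `annotation_demo` are a hand-typed stand-in with rounded masses, and the object names are hypothetical.

```r
library(dplyr)

# Hand-typed stand-in for the parsed HMDB annotation table
annotation_demo <- tibble(
  MetaboliteID   = c("HMDB0000001", "HMDB0000002"),
  ChEBI          = c("50599", "15725"),
  MetaboliteName = c("1-Methylhistidine", "1,3-Diaminopropane"),
  ExactMass      = c(169.0851, 74.0844)
)

# Use the ChEBI ID as MetaboliteID, dropping records without one
metabolite_table_chebi <- annotation_demo |>
  filter(!is.na(ChEBI), ChEBI != "") |>
  transmute(MetaboliteID = ChEBI, ExactMass, MetaboliteName) |>
  distinct(MetaboliteID, .keep_all = TRUE)
```

Records without a ChEBI ID are dropped here because they could never match a ChEBI-based pathway table; keep the HMDB IDs instead if your pathway table uses them.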
LipidMaps example for metabolite_table.csv
Use this when your panel is lipid-heavy and you want LipidMaps IDs and masses.
LipidMaps is a valuable resource for lipid annotation. You can download the relevant data and process it into the required format for SMEW as follows:
download.file(
"https://www.lipidmaps.org/files/?file=LMSD&ext=sdf.zip",
"../databases/LMSD.zip"
)
unzip("../databases/LMSD.zip",exdir = '../databases/')
You can then read the .sdf file and extract the relevant information
to build the metabolite_table.csv as follows:
lines <- readLines("../databases/structures.sdf")
records <- split(lines, cumsum(lines == "$$$$"))
records <- records[lengths(records) > 1]
parse_record <- function(rec){
text <- paste(rec, collapse="\n")
tibble(
MetaboliteID = stringr::str_match(text, "> <LM_ID>\\s*\\n([^\\n]+)")[,2],
MetaboliteName = stringr::str_match(text, "> <NAME>\\s*\\n([^\\n]+)")[,2],
ExactMass = as.numeric(stringr::str_match(text, "> <EXACT_MASS>\\s*\\n([^\\n]+)")[,2])
)
}
annotation <- map_dfr(records, parse_record)
print(head(annotation))
##> # A tibble: 6 × 3
##> MetaboliteID MetaboliteName ExactMass
##> <chr> <chr> <dbl>
##> 1 LMFA00000001 2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride 626.
##> 2 LMFA00000002 Serratamic acid 275.
##> 3 LMFA00000003 N-(3-(hexadecanoyloxy)-heptadecanoyl)-L-ornithine 639.
##> 4 LMFA00000005 N-(3-(15-methyl-hexadecanoyloxy)-13-methyl-tetradecano… 626.
##> 5 LMFA00000006 Lysine-containing siolipin 685.
##> 6 LMFA00000007 n-decanohydroxamic acid 187.
Alternative sources for pathway_table.csv
WikiPathways example for pathway_table.csv
Use this when you want curated pathway graphs for a specific organism.
The WikiPathways database is another valuable resource for pathway information. You can download the relevant pathway data for your organism from https://data.wikipathways.org/current/gpml/, selecting your organism of interest. You can use steps like the following to download the data:
dir.create("../databases/wikipathways", recursive = TRUE, showWarnings = FALSE)
download.file(
"https://data.wikipathways.org/current/gpml/wikipathways-20260310-gpml-Mus_musculus.zip",
"../databases/wikipathways/mouse_gpml.zip",
mode = "wb"
)
unzip("../databases/wikipathways/mouse_gpml.zip", exdir = "../databases/wikipathways/gpml")
To then create a pathway table for SMEW you can do the following:
library(xml2)
library(dplyr)
library(purrr)
library(progress)
parse_wikipathway <- function(file){
doc <- read_xml(file)
doc <- xml2::xml_ns_strip(doc)
pathway_node <- xml2::xml_find_first(doc, "//Pathway")
pathway_name <- xml2::xml_attr(pathway_node, "Name")
version <- xml2::xml_attr(pathway_node, "Version")
pathway_id <- sub("_.*", "", version)
nodes <- xml_find_all(doc, "//DataNode[@Type='Metabolite']")
if(length(nodes) == 0) return(NULL)
tibble(
PathwayID = pathway_id,
PathwayName = pathway_name,
MetaboliteID = xml_attr(xml_find_first(nodes, ".//Xref"), "ID"),
Database = xml_attr(xml_find_first(nodes, ".//Xref"), "Database")
)
}
files <- list.files("../databases/wikipathways/gpml", full.names = TRUE)
pb <- progress_bar$new(total = length(files))
metabolite_pathways <- map_dfr(files, function(f){
pb$tick()
parse_wikipathway(f)
})
# Here are the types of databases included:
print(unique(metabolite_pathways$Database))
##> [1] "ChEBI" "LIPID MAPS" "" "HMDB"
##> [5] "KEGG Compound" "PubChem-compound" "CAS" "Chemspider"
##> [9] "Wikidata"
# You can then filter to the relevant metabolite IDs you are using
# For example to use KEGG IDs:
metabolite_pathways_filtered <- metabolite_pathways %>%
filter(Database %in% c("KEGG Compound"))
pathway_table <- metabolite_pathways_filtered %>%
filter(!is.na(MetaboliteID)) %>%
group_by(PathwayID, PathwayName) %>%
summarise(
MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
.groups = "drop"
)
print(head(pathway_table))
##> # A tibble: 6 × 3
##> PathwayID PathwayName MetaboliteIDs
##> <chr> <chr> <chr>
##> 1 WP164 Glutathione metabolism C00669,C00097,C01419,C01…
##> 2 WP1770 One-carbon metabolism and related pathways C00021,C00606,C00097,C00…
##> 3 WP1771 Kennedy pathway C00189,C00019,C02737,C00…
##> 4 WP2185 Purine metabolism C05512,C00059,C00360,C04…
##> 5 WP2292 Chemokine signaling pathway C00076,C05981,C01245,C00…
##> 6 WP232 G protein signaling pathways C00165
# Or to use LipidMaps:
metabolite_pathways_filtered <- metabolite_pathways %>%
filter(Database %in% c("LIPID MAPS"))
pathway_table <- metabolite_pathways_filtered %>%
filter(!is.na(MetaboliteID)) %>%
group_by(PathwayID, PathwayName) %>%
summarise(
MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
.groups = "drop"
)
print(head(pathway_table))
##> # A tibble: 6 × 3
##> PathwayID PathwayName MetaboliteIDs
##> <chr> <chr> <chr>
##> 1 WP4335 Eicosanoid lipid synthesis map LMFA01030001…
##> 2 WP4344 Sphingolipid metabolism overview LMSP01020002…
##> 3 WP4345 Glycerolipids and glycerophospholipids LMGP10010000…
##> 4 WP4346 Cholesterol metabolism with Bloch and Kandutsch-Russe… LMFA01030056…
##> 5 WP4347 Eicosanoid metabolism via cyclooxygenases (COX) LMFA03010133…
##> 6 WP4348 Eicosanoid metabolism via lipoxygenases (LOX) LMFA03020037…
Reactome example for pathway_table.csv
Use this when your metabolite IDs are ChEBI and you want organism-specific pathway mapping.
The Reactome database can be used to link ChEBI IDs to pathways. You can download the ChEBI2Reactome.txt table from https://reactome.org/download/current/ChEBI2Reactome.txt and then process it into the required format for SMEW as follows:
dir.create("../databases/reactome", recursive = TRUE, showWarnings = FALSE)
download.file(
"https://reactome.org/download/current/ChEBI2Reactome.txt",
"../databases/reactome/ChEBI2Reactome.txt",
mode = "wb"
)
reactome <- read.delim(
"../databases/reactome/ChEBI2Reactome.txt",
header = FALSE,
sep = "\t",
stringsAsFactors = FALSE
)
# You can then filter to the relevant organism like this
reactome_mouse <- reactome |>
dplyr::filter(V6 == "Mus musculus")
# Then group by pathway and summarise the ChEBI IDs into a comma-delimited string for the MetaboliteIDs column:
reactome_pathways <- reactome_mouse |>
dplyr::transmute(
PathwayID = V2,
PathwayName = V4,
MetaboliteID = V1
) |>
dplyr::group_by(PathwayID, PathwayName) |>
dplyr::summarise(
MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
.groups = "drop"
)
print(head(reactome_pathways))
##> # A tibble: 6 × 3
##> PathwayID PathwayName MetaboliteIDs
##> <chr> <chr> <chr>
##> 1 R-MMU-1059683 "Interleukin-6 signaling" 30616,456216
##> 2 R-MMU-109704 "PI3K Cascade" 28815,30616,456216,57836,…
##> 3 R-MMU-110056 "MAPK3 (ERK1) activation" 30616,456216
##> 4 R-MMU-110312 "Translesion synthesis by REV1" 16516,33019,61481
##> 5 R-MMU-110320 "Translesion Synthesis by POLH" 16516,30616,33019,456216,…
##> 6 R-MMU-110329 "Cleavage of the damaged pyrimidine " 15901,17568,17821,27983,2…
HMDB example for pathway_table.csv
Use this when you want pathway associations directly from HMDB metadata.
You can also extract pathway information directly from HMDB, although it is not as comprehensive as other databases for this purpose:
library(purrr)
library(dplyr)
parse_hmdb_pathways <- function(xml_text_string){
node <- read_xml(xml_text_string)
id <- xml2::xml_text(xml2::xml_find_first(node, ".//accession"))
pathways <- xml2::xml_find_all(node, ".//pathway")
if(length(pathways) == 0) return(NULL)
data.frame(
MetaboliteID = id,
PathwayName = xml2::xml_text(xml2::xml_find_first(pathways, ".//name")),
SMPDB_ID = xml2::xml_text(xml2::xml_find_first(pathways, ".//smpdb_id")),
KEGG_ID = xml2::xml_text(xml2::xml_find_first(pathways, ".//kegg_map_id")),
stringsAsFactors = FALSE
)
}
# For demonstration, we parse only the first 100 metabolites. You can run this for
# every metabolite in HMDB, but be aware that this takes a long time.
metabolite_texts <- as.character(head(metabolites, 100))
metabolite_pathway_table <- map_dfr(metabolite_texts, parse_hmdb_pathways)
pathway_table <- metabolite_pathway_table |>
filter(!is.na(KEGG_ID), KEGG_ID != "") |>
group_by(PathwayID = KEGG_ID, PathwayName) |>
summarise(
MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
.groups = "drop"
)
print( head(pathway_table) )
##> # A tibble: 6 × 3
##> PathwayID PathwayName MetaboliteIDs
##> <chr> <chr> <chr>
##> 1 map00010 Glycolysis / Gluconeogenesis HMDB0000030,HMDB0000122,HMDB0000124…
##> 2 map00020 Citric Acid Cycle HMDB0000030,HMDB0000072,HMDB0000094…
##> 3 map00051 Fructose and mannose metabolism HMDB0000124
##> 4 map00052 Galactose Metabolism HMDB0000048,HMDB0000107,HMDB0000122…
##> 5 map00071 Fatty acid Metabolism HMDB0000045,HMDB0000062
##> 6 map00072 Ketone Body Metabolism HMDB0000011,HMDB0000060
Again, you can use either SMPDB_ID or KEGG_ID as the PathwayID for SMEW analysis, depending on your preference and the pathway database you want to use.
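A minimal sketch of the SMPDB variant is shown below. It assumes a long table with the same columns that parse_hmdb_pathways() returns. The rows in `metabolite_pathway_demo` are made up for illustration, and "SMP_A" and "SMP_B" are placeholders rather than real SMPDB identifiers.

```r
library(dplyr)

# Made-up rows with the same columns as the parsed HMDB pathway table
metabolite_pathway_demo <- data.frame(
  MetaboliteID = c("HMDB0000122", "HMDB0000124", "HMDB0000122"),
  PathwayName  = c("Glycolysis", "Glycolysis", "Galactose Metabolism"),
  SMPDB_ID     = c("SMP_A", "SMP_A", "SMP_B"),  # placeholder IDs
  KEGG_ID      = c("map00010", "map00010", "map00052"),
  stringsAsFactors = FALSE
)

# Same grouping as above, but keyed on SMPDB_ID instead of KEGG_ID
pathway_table_smpdb <- metabolite_pathway_demo |>
  filter(!is.na(SMPDB_ID), SMPDB_ID != "") |>
  group_by(PathwayID = SMPDB_ID, PathwayName) |>
  summarise(
    MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
    .groups = "drop"
  )
```

Whichever identifier you choose, use it consistently so that PathwayID and PathwayName stay aligned with any pathway_classification table you supply.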
Final checks before create_smew_app()
- Confirm all required columns are present with exact names.
- Confirm every ID used in pathway_table.csv exists in metabolite_table.csv.
- Check for duplicate metabolite or pathway rows unless intentional.
- Use a single delimiter style in MetaboliteIDs (comma-delimited).
- Save all three files as CSV and verify they load cleanly with read.csv().
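The first three checks can be scripted. The helper below is a hypothetical sketch, not part of SMEW: `check_smew_tables()` and the toy tables are assumptions made for illustration, and the function only reports problems as text rather than fixing them.

```r
# Hypothetical helper (not part of SMEW): returns a character vector of
# problems, or an empty vector when the basic checks pass.
check_smew_tables <- function(metabolite_table, pathway_table,
                              pathway_classification = NULL) {
  required <- list(
    metabolite_table = c("MetaboliteID", "ExactMass", "MetaboliteName"),
    pathway_table    = c("PathwayID", "PathwayName", "MetaboliteIDs")
  )
  tables <- list(
    metabolite_table = metabolite_table,
    pathway_table    = pathway_table
  )
  if (!is.null(pathway_classification)) {
    required$pathway_classification <-
      c("PathwayName", "PathwayID", "Category1", "Category2")
    tables$pathway_classification <- pathway_classification
  }

  problems <- character(0)

  # 1. Required columns must be present with exact names
  for (nm in names(tables)) {
    missing_cols <- setdiff(required[[nm]], names(tables[[nm]]))
    if (length(missing_cols) > 0) {
      problems <- c(problems, sprintf(
        "%s is missing column(s): %s", nm, paste(missing_cols, collapse = ", ")
      ))
    }
  }
  if (length(problems) > 0) return(problems)

  # 2. Every comma-delimited pathway member must be a known MetaboliteID
  members <- unlist(strsplit(as.character(pathway_table$MetaboliteIDs),
                             ",", fixed = TRUE))
  unknown <- setdiff(trimws(members), as.character(metabolite_table$MetaboliteID))
  if (length(unknown) > 0) {
    problems <- c(problems, sprintf(
      "%d ID(s) in pathway_table not found in metabolite_table, e.g. %s",
      length(unknown), paste(head(unknown, 5), collapse = ", ")
    ))
  }

  # 3. Duplicated IDs are reported so you can decide whether they are intentional
  if (anyDuplicated(metabolite_table$MetaboliteID) > 0) {
    problems <- c(problems, "metabolite_table has duplicated MetaboliteID values")
  }
  if (anyDuplicated(pathway_table$PathwayID) > 0) {
    problems <- c(problems, "pathway_table has duplicated PathwayID values")
  }
  problems
}

# Toy tables: one consistent pair, and one with a placeholder ID (C99999)
# that is deliberately absent from the metabolite table
metabolite_demo <- data.frame(
  MetaboliteID   = c("C00022", "C00031"),
  ExactMass      = c(88.0160, 180.0634),
  MetaboliteName = c("Pyruvate", "D-Glucose")
)
pathways_ok <- data.frame(
  PathwayID     = "map00010",
  PathwayName   = "Glycolysis / Gluconeogenesis",
  MetaboliteIDs = "C00022,C00031"
)
pathways_bad <- transform(pathways_ok, MetaboliteIDs = "C00022,C99999")

problems_ok  <- check_smew_tables(metabolite_demo, pathways_ok)
problems_bad <- check_smew_tables(metabolite_demo, pathways_bad)
```

Here `problems_ok` is empty, while `problems_bad` holds a single message naming C99999. In practice you would pass the data frames returned by read.csv() on your three saved files.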