Building custom metabolite and pathway databases for SMEW

SMEW can use user-supplied annotation and pathway resources during create_smew_app().

This vignette explains how to build the three optional CSV inputs:

metabolite_table: maps exact masses to metabolite IDs and names
pathway_table: defines which metabolites belong to each pathway
pathway_classification: optional categories used for ORA category visualisation

Where these files are used

You can pass these files directly to create_smew_app():

create_smew_app(
  intensity_csv = "path/to/intensity.csv",
  metadata_csv = "path/to/metadata.csv",
  output_dir = "test_app",
  metabolite_table = "path/to/metabolite_table.csv",
  pathway_table = "path/to/pathway_table.csv",
  pathway_classification = "path/to/pathway_classification.csv",
  adducts = c("M-H [1-]"),
  ion_mode = "Negative"
)

These files can be built from custom databases or from public resources. If you use public databases as sources, make sure that your use complies with their licences and cite them appropriately in any publication that uses SMEW.

Required structure of the input files

The three tables must be in CSV format and contain the following required columns:

metabolite_table.csv: MetaboliteID, ExactMass, MetaboliteName
pathway_table.csv: PathwayID, PathwayName, MetaboliteIDs
pathway_classification.csv: PathwayName, PathwayID, Category1, Category2

The MetaboliteIDs column in pathway_table.csv should be a comma-delimited string of metabolite IDs. The metabolite IDs can be any stable IDs of your choice, but they must match the MetaboliteID column in metabolite_table.csv for the corresponding metabolites. The PathwayName column in pathway_classification.csv should match the PathwayName column in pathway_table.csv for the corresponding pathways.
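As a minimal sketch of the structure described above, the three tables can be assembled by hand for a toy database. The two metabolites are copied from the KEGG example output shown later in this vignette; the pathway `PW001`, its name and its categories are hypothetical placeholders, and the files are written to a temporary directory only for illustration.

```r
# Toy metabolite table: one row per metabolite.
metabolite_table <- data.frame(
  MetaboliteID   = c("C16386", "C16394"),
  ExactMass      = c(162.1157, 197.0437),
  MetaboliteName = c("(R)-Nicotine", "4-Amino-2,6-dinitrotoluene"),
  stringsAsFactors = FALSE
)

# Toy pathway table: PW001 / "Example pathway" is a hypothetical placeholder.
# MetaboliteIDs is a single comma-delimited string per pathway.
pathway_table <- data.frame(
  PathwayID     = "PW001",
  PathwayName   = "Example pathway",
  MetaboliteIDs = paste(metabolite_table$MetaboliteID, collapse = ","),
  stringsAsFactors = FALSE
)

# Toy classification table: PathwayName must match pathway_table$PathwayName.
pathway_classification <- data.frame(
  PathwayName = "Example pathway",
  PathwayID   = "PW001",
  Category1   = "Example category",
  Category2   = "Example subcategory",
  stringsAsFactors = FALSE
)

# Write all three tables as CSV files.
out_dir <- tempdir()
write.csv(metabolite_table, file.path(out_dir, "metabolite_table.csv"), row.names = FALSE)
write.csv(pathway_table, file.path(out_dir, "pathway_table.csv"), row.names = FALSE)
write.csv(pathway_classification, file.path(out_dir, "pathway_classification.csv"), row.names = FALSE)
```

The resulting file paths can then be passed to the corresponding arguments of create_smew_app().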

Here we walk through generating the required structure from some example databases.

Workflow roadmap

This vignette is organised as follows:

A complete KEGG workflow to build all three SMEW input files.
Alternative sources for metabolite_table.csv.
Alternative sources for pathway_table.csv.
Final validation checks before running create_smew_app().

1) Using KEGG to build `metabolite_table.csv`, `pathway_table.csv` and `pathway_classification.csv`

The KEGG database provides comprehensive information on metabolites and pathways, making it a valuable resource for building the required tables for SMEW. Below is an example of how to extract and format data from KEGG to create the necessary CSV files.

As an example, here is what the first few rows of each table might look like based on KEGG data:

# Example metabolite_table
metabolite_table <- read.csv('../databases/metabolite_masses.csv')
print(head(metabolite_table))
##>   MetaboliteID ExactMass
##> 1       C16386  162.1157
##> 2       C16387 1105.3762
##> 3       C16388 1123.3867
##> 4       C16389 1121.3711
##> 5       C16394  197.0437
##> 6       C16407  434.1213
##>                                            MetaboliteName
##> 1                                            (R)-Nicotine
##> 2           (2E,6Z,9Z,12Z,15Z,18Z)-Tetracosahexaenoyl-CoA
##> 3 (3R,6Z,9Z,12Z,15Z,18Z)-3-Hydroxytetracosapentaenoyl-CoA
##> 4        (6Z,9Z,12Z,15Z,18Z)-3-Oxotetracosapentaenoyl-CoA
##> 5                              4-Amino-2,6-dinitrotoluene
##> 6          2',4,4',6'-Tetrahydroxychalcone 4'-O-glucoside

# Example pathway_table
pathway_table <- read.csv('../databases/pathways.csv')
print(head(pathway_table))
##>   PathwayID                              PathwayName
##> 1  map00010             Glycolysis / Gluconeogenesis
##> 2  map00020                Citrate cycle (TCA cycle)
##> 3  map00030                Pentose phosphate pathway
##> 4  map00040 Pentose and glucuronate interconversions
##> 5  map00051          Fructose and mannose metabolism
##> 6  map00052                     Galactose metabolism
##>                                                                                                                                                                                                                                                                                                                                                                                                                                MetaboliteIDs
##> 1                                                                                                                                                                                                                   C00022,C00024,C00031,C00033,C00036,C00068,C00074,C00084,C00085,C00103,C00111,C00118,C00186,C00197,C00221,C00236,C00267,C00354,C00469,C00631,C00668,C01159,C01172,C01451,C05125,C06186,C06187,C06188,C15972,C15973,C16255
##> 2                                                                                                                                                                                                                                                                                                C00022,C00024,C00026,C00036,C00042,C00068,C00074,C00091,C00122,C00149,C00158,C00311,C00417,C05125,C05379,C05381,C15972,C15973,C16254,C16255
##> 3                                                                                                                                                                         C00022,C00031,C00085,C00117,C00118,C00119,C00121,C00197,C00198,C00199,C00204,C00221,C00231,C00257,C00258,C00279,C00345,C00354,C00577,C00620,C00631,C00668,C00672,C00673,C01151,C01172,C01182,C01218,C01236,C01801,C02076,C03752,C04442,C05382,C06019,C06473,C20589
##> 4 C00022,C00026,C00029,C00103,C00111,C00116,C00167,C00181,C00191,C00199,C00204,C00216,C00231,C00259,C00266,C00309,C00310,C00312,C00333,C00379,C00433,C00470,C00474,C00476,C00502,C00508,C00514,C00532,C00558,C00618,C00714,C00789,C00800,C00817,C00905,C01068,C01101,C01508,C01904,C02266,C02273,C02426,C02753,C03033,C03291,C03826,C04053,C04349,C04575,C05385,C05411,C05412,C06118,C06441,C14899,C15930,C20680,C20902,C20903,C22337,C22712
##> 5                                           C00085,C00095,C00096,C00111,C00118,C00159,C00186,C00247,C00267,C00275,C00325,C00354,C00392,C00424,C00464,C00507,C00577,C00636,C00644,C00665,C00794,C00861,C00976,C01019,C01094,C01096,C01099,C01131,C01222,C01355,C01487,C01720,C01721,C01768,C01934,C02431,C02492,C02888,C02962,C02977,C02985,C02991,C03117,C03267,C03827,C03979,C05144,C05392,C06192,C11516,C11544,C18028,C18096,C20781,C20836
##> 6                                                                                                          C00029,C00031,C00052,C00085,C00089,C00095,C00103,C00111,C00116,C00118,C00124,C00137,C00159,C00243,C00267,C00446,C00492,C00577,C00668,C00794,C00795,C00880,C00984,C01097,C01113,C01132,C01216,C01235,C01286,C01613,C01697,C02262,C02669,C03383,C03733,C03785,C05396,C05399,C05400,C05401,C05402,C05404,C05796,C06311,C06376,C06377

# Example pathway_classification
pathway_classification <- read.csv('../databases/pathway_classification.csv')
print(head(pathway_classification))
##>   PathwayID                                  PathwayName  Category1
##> 1      1100                           Metabolic pathways Metabolism
##> 2      1110        Biosynthesis of secondary metabolites Metabolism
##> 3      1120 Microbial metabolism in diverse environments Metabolism
##> 4      1200                            Carbon metabolism Metabolism
##> 5      1210              2-Oxocarboxylic acid metabolism Metabolism
##> 6      1212                        Fatty acid metabolism Metabolism
##>                  Category2
##> 1 Global and overview maps
##> 2 Global and overview maps
##> 3 Global and overview maps
##> 4 Global and overview maps
##> 5 Global and overview maps
##> 6 Global and overview maps

To build these tables from KEGG, you can use the KEGG REST API to retrieve information on metabolites and pathways. Below are the general steps to do this:

We load some libraries for data manipulation and API access:

library(dplyr)
library(stringr)
library(tidyr)
library(progress)
library(pbapply)
library(httr)
library(readr)

Build pathway_table.csv:

Use the KEGG API to get a list of pathways and their associated metabolites. Create a data frame with the required columns (PathwayID, PathwayName, MetaboliteIDs). The PathwayID can be the KEGG reference pathway ID (e.g., map00010 for glycolysis, as returned by the queries below), and the MetaboliteIDs column should contain a comma-delimited string of the corresponding metabolite IDs.

First we retrieve the compound-pathway links from KEGG:

download_kegg_pathway_links <- function() {
  url <- "https://rest.kegg.jp/link/pathway/compound"
  message("Downloading KEGG compound–pathway links...")
  txt <- readLines(url, warn = FALSE)
  df <- tibble(raw = txt) |>
    separate(raw, into = c("compound", "pathway"), sep = "\t")
  return(df)
}

kegg_links <- download_kegg_pathway_links()
##> Downloading KEGG compound–pathway links...
print(head(kegg_links))
##> # A tibble: 6 × 2
##>   compound   pathway      
##>   <chr>      <chr>        
##> 1 cpd:C00022 path:map00010
##> 2 cpd:C00024 path:map00010
##> 3 cpd:C00031 path:map00010
##> 4 cpd:C00033 path:map00010
##> 5 cpd:C00036 path:map00010
##> 6 cpd:C00068 path:map00010

Next we clean up the IDs to remove the prefixes:

kegg_links <- kegg_links |>
  mutate(
    compound = str_remove(compound, "cpd:"),
    pathway = str_remove(pathway, "path:")
  )
print(head(kegg_links))
##> # A tibble: 6 × 2
##>   compound pathway 
##>   <chr>    <chr>   
##> 1 C00022   map00010
##> 2 C00024   map00010
##> 3 C00031   map00010
##> 4 C00033   map00010
##> 5 C00036   map00010
##> 6 C00068   map00010

Now we associate pathway names with the pathway IDs:

download_kegg_pathway_names <- function() {
  url <- "https://rest.kegg.jp/list/pathway"
  message("Downloading KEGG pathway names...")
  txt <- readLines(url, warn = FALSE)
  df <- tibble(raw = txt) |>
    separate(raw, into = c("pathway", "name"), sep = "\t")
  df <- df |>
    mutate(pathway = str_remove(pathway, "path:"))
  return(df)
}
kegg_pathways <- download_kegg_pathway_names()
##> Downloading KEGG pathway names...
print(head(kegg_pathways))
##> # A tibble: 6 × 2
##>   pathway  name                                        
##>   <chr>    <chr>                                       
##> 1 map01100 Metabolic pathways                          
##> 2 map01110 Biosynthesis of secondary metabolites       
##> 3 map01120 Microbial metabolism in diverse environments
##> 4 map01200 Carbon metabolism                           
##> 5 map01210 2-Oxocarboxylic acid metabolism             
##> 6 map01212 Fatty acid metabolism

Finally, we combine the links and the pathway names to build the pathway table with the required columns:

kegg_pathway_db <- kegg_links |>
  left_join(kegg_pathways, by = "pathway")

kegg_pathway_sets <- kegg_pathway_db |>
  group_by(pathway, name) |>
  summarise(
    compounds = paste(unique(compound),collapse = ','),
    .groups = "drop"
  )

pathway_table <- kegg_pathway_sets |>
  rename(PathwayID = pathway, PathwayName = name, MetaboliteIDs = compounds)
print(head(pathway_table))
##> # A tibble: 6 × 3
##>   PathwayID PathwayName                              MetaboliteIDs              
##>   <chr>     <chr>                                    <chr>                      
##> 1 map00010  Glycolysis / Gluconeogenesis             C00022,C00024,C00031,C0003…
##> 2 map00020  Citrate cycle (TCA cycle)                C00022,C00024,C00026,C0003…
##> 3 map00030  Pentose phosphate pathway                C00022,C00031,C00085,C0011…
##> 4 map00040  Pentose and glucuronate interconversions C00022,C00026,C00029,C0010…
##> 5 map00051  Fructose and mannose metabolism          C00085,C00095,C00096,C0011…
##> 6 map00052  Galactose metabolism                     C00029,C00031,C00052,C0008…

To use the pathway table within SMEW, save it as a CSV file:

write.csv(pathway_table, "pathway_table.csv", row.names = FALSE)
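Before moving on, it can be useful to expand the comma-delimited MetaboliteIDs column back into one row per metabolite, for example to count pathway sizes. The sketch below uses a small stand-in table (`pathway_table_demo`, with ID lists shortened from the output above) rather than the full KEGG table.

```r
library(dplyr)
library(tidyr)

# Small stand-in for pathway_table; the ID lists are shortened for illustration.
pathway_table_demo <- tibble(
  PathwayID     = c("map00010", "map00020"),
  PathwayName   = c("Glycolysis / Gluconeogenesis", "Citrate cycle (TCA cycle)"),
  MetaboliteIDs = c("C00022,C00024,C00031", "C00022,C00024")
)

# Expand the comma-delimited string into one row per pathway-metabolite pair.
pathway_long <- pathway_table_demo |>
  separate_rows(MetaboliteIDs, sep = ",")

# Count the number of metabolites assigned to each pathway.
pathway_sizes <- pathway_long |>
  count(PathwayID, name = "n_metabolites")
print(pathway_sizes)
```

The same two steps applied to the real `pathway_table` give a quick sanity check that no pathway is empty or unexpectedly small.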

Build metabolite_table.csv:

Extract the exact masses and names of the metabolites from KEGG and create a data frame with the required columns (MetaboliteID, ExactMass, MetaboliteName). The MetaboliteID can be the KEGG compound ID (e.g., C00031 for glucose).

To build the metabolite_table.csv, you need to retrieve a list of compounds and their properties. This process can be time-consuming due to the large number of compounds in KEGG, so it’s recommended to limit the retrieval to compounds that are relevant to your study (e.g., those that appear in your pathway table). Below is an example of how to retrieve compound information and build the metabolite table:

# Function to retrieve the formula and exact mass of one KEGG compound
get_kegg_compound <- function(id) {
  url <- paste0("https://rest.kegg.jp/get/", id)
  txt <- readLines(url, warn = FALSE)
  # Match the field names at the start of a line only, so that the same
  # words appearing inside other fields are not picked up
  exact_mass <- txt[grepl("^EXACT_MASS", txt)]
  formula <- txt[grepl("^FORMULA", txt)]
  exact_mass <- str_extract(exact_mass, "[0-9]+\\.[0-9]+")
  formula <- str_trim(sub("^FORMULA", "", formula))
  tibble(
    kegg_id = id,
    formula = formula,
    exact_mass = as.numeric(exact_mass)
  )
}

# Function to retrieve compound name information
download_kegg_compounds <- function() {
  url <- "https://rest.kegg.jp/list/compound"
  res <- GET(url)
  stop_for_status(res)
  txt <- content(res, "text")
  lines <- strsplit(txt, "\n")[[1]]
  df <- tibble(raw = lines) |>
    filter(raw != "") |>
    separate(raw, into = c("kegg_id", "name"), sep = "\t")
  return(df)
}
kegg_list <- download_kegg_compounds()

# Option 1: every compound in KEGG. This gives the most comprehensive
# database, but be aware that it takes a long time to run.
# ids <- kegg_list$kegg_id

# Option 2: only the compounds that appear in our pathway table. This is
# faster and keeps the most relevant annotations.
# ids <- sort(unique(kegg_pathway_db$compound))

# For demonstration we retrieve only the first 20 of those compounds:
ids <- head(sort(unique(kegg_pathway_db$compound)), 20)
 
db_list <- pblapply(ids, function(id) {
  Sys.sleep(0.2) # to avoid hitting KEGG API rate limits
  tryCatch(
    get_kegg_compound(id),
    error = function(e) NULL
  )
})

db <- bind_rows(db_list)
db <- left_join(db, kegg_list, by = "kegg_id")
# Optionally, simplify the names by keeping only the first one (KEGG often lists several names separated by semicolons)
db$short_name <- stringr::word(db$name, 1, sep = ";")
# Select the columns by name and rename them to the headers SMEW expects
db_output <- db[, c("kegg_id", "exact_mass", "short_name")]
colnames(db_output) <- c("MetaboliteID", "ExactMass", "MetaboliteName")

print(head(db_output))
##> # A tibble: 6 × 3
##>   MetaboliteID ExactMass MetaboliteName
##>   <chr>            <dbl> <chr>         
##> 1 C00001            18.0 H2O           
##> 2 C00002           507.  ATP           
##> 3 C00003           664.  NAD+          
##> 4 C00004           665.  NADH          
##> 5 C00005           745.  NADPH         
##> 6 C00006           744.  NADP+
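Because the retrieval loop returns NULL when a request fails, and bind_rows() silently drops NULL entries, it is worth checking which IDs did not make it into the table. The sketch below uses small stand-ins for `ids` and `db_list` so that it runs without network access; the NULL entry mimics a failed request.

```r
# Stand-ins for `ids` and `db_list` from the retrieval loop.
ids <- c("C00001", "C00002", "C00003")
db_list <- list(
  data.frame(kegg_id = "C00001", stringsAsFactors = FALSE),
  NULL,  # mimics a request that failed inside tryCatch()
  data.frame(kegg_id = "C00003", stringsAsFactors = FALSE)
)

# IDs whose request returned NULL.
failed_ids <- ids[vapply(db_list, is.null, logical(1))]

# IDs with no row at all in the results (failed requests, or records that
# produced an empty result, for example because no exact mass was found).
retrieved_ids <- unlist(lapply(db_list, function(x) x$kegg_id))
missing_ids <- setdiff(ids, retrieved_ids)

print(failed_ids)
print(missing_ids)
```

IDs listed in `failed_ids` can simply be passed through the retrieval loop again.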

To use the metabolite table within SMEW, save it as a CSV file:

write.csv(db_output, "metabolite_table.csv", row.names = FALSE)
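The ExactMass values taken from KEGG are neutral monoisotopic masses, so an observed m/z has to be related to them through the adduct. The sketch below only illustrates that arithmetic for singly charged proton adducts; how SMEW itself performs the matching is not described in this vignette, and the helper `ppm_error()` is a hypothetical name introduced here.

```r
# Mass of a proton in Da (rounded to six decimals).
proton_mass <- 1.007276

# One metabolite from the example metabolite table.
metabolite_demo <- data.frame(
  MetaboliteID = "C16386",
  ExactMass    = 162.1157,
  stringsAsFactors = FALSE
)

# Theoretical m/z for the singly charged [M-H]- and [M+H]+ ions.
metabolite_demo$mz_M_minus_H <- metabolite_demo$ExactMass - proton_mass
metabolite_demo$mz_M_plus_H  <- metabolite_demo$ExactMass + proton_mass

# Hypothetical helper: relative mass error in parts per million.
ppm_error <- function(observed, theoretical) {
  (observed - theoretical) / theoretical * 1e6
}

print(metabolite_demo)
```

A table built with charged or averaged masses instead of neutral monoisotopic masses would shift every theoretical m/z, so it is worth confirming what the source database reports.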

Build pathway_classification.csv (optional)

This table is optional and only used for ORA category overlays/plots.

Required columns:

PathwayName: must match pathway names used in ORA output
PathwayID: pathway ID (for reference and plotting labels)
Category1: top-level category
Category2: more specific subcategory

Example of building this table from the KEGG pathway hierarchy (BRITE br08901):

parse_kegg_hierarchy <- function(file){
  lines <- readLines(file)
  print(head(lines))  # show the raw layout of the hierarchy file
  lvl1 <- NA
  lvl2 <- NA
  
  out <- list()
  for(line in lines){ 
    if(str_starts(line,"A")){ lvl1 <- gsub("^A","",line) }
    if(str_starts(line,"B")){ lvl2 <- trimws(gsub("^B ","",line)) }
    if(str_starts(line,"C ")){
      # Trim the indentation after "C" so that PathwayName carries no leading
      # spaces and can match the names in pathway_table
      path <- trimws(gsub("^C","",line))
      path_id <- str_extract(path,"^[0-9]+")
      path_name <- str_remove(path,"^[0-9]+\\s+")
      out[[length(out)+1]] <- data.frame(
        PathwayID = path_id,
        PathwayName = path_name,
        Category1 = lvl1,
        Category2 = lvl2,
        stringsAsFactors = FALSE
      )
    }
  }
  bind_rows(out)
}
# download.file("https://rest.kegg.jp/get/br:br08901", destfile = "../databases/kegg_hierarchy.txt")
pathway_classification <- parse_kegg_hierarchy("../databases/kegg_hierarchy.txt")
##> [1] "+C\tMap number"                                   
##> [2] "!"                                                
##> [3] "AMetabolism"                                      
##> [4] "B  Global and overview maps"                      
##> [5] "C    01100  Metabolic pathways"                   
##> [6] "C    01110  Biosynthesis of secondary metabolites"
print(head(pathway_classification))
##>   PathwayID                                  PathwayName  Category1
##> 1     01100                           Metabolic pathways Metabolism
##> 2     01110        Biosynthesis of secondary metabolites Metabolism
##> 3     01120 Microbial metabolism in diverse environments Metabolism
##> 4     01200                            Carbon metabolism Metabolism
##> 5     01210              2-Oxocarboxylic acid metabolism Metabolism
##> 6     01212                        Fatty acid metabolism Metabolism
##>                  Category2
##> 1 Global and overview maps
##> 2 Global and overview maps
##> 3 Global and overview maps
##> 4 Global and overview maps
##> 5 Global and overview maps
##> 6 Global and overview maps

Write to CSV:

write.csv(pathway_classification, "pathway_classification.csv", row.names = FALSE)
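Note that the KEGG map numbers have leading zeros, which read.csv() drops by default because it converts the column to integer (compare the PathwayID values in the example table near the start of this section). The sketch below, using a two-row stand-in table and a temporary file, shows how to keep them as text and, optionally, how to add the `map` prefix so that the IDs match those in pathway_table.csv.

```r
# Two-row stand-in for pathway_classification.
pathway_classification_demo <- data.frame(
  PathwayID   = c("01100", "01110"),
  PathwayName = c("Metabolic pathways", "Biosynthesis of secondary metabolites"),
  Category1   = "Metabolism",
  Category2   = "Global and overview maps",
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(pathway_classification_demo, csv_path, row.names = FALSE)

# Default behaviour: PathwayID is converted to integer, losing the leading zero.
as_default <- read.csv(csv_path)

# Keep PathwayID as text by declaring its class explicitly.
as_text <- read.csv(csv_path, colClasses = c(PathwayID = "character"))

# Optional: add the "map" prefix used by the KEGG pathway table.
prefixed_ids <- paste0("map", as_text$PathwayID)

print(as_default$PathwayID)
print(as_text$PathwayID)
print(prefixed_ids)
```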

Alternative sources for `metabolite_table.csv`

HMDB example for `metabolite_table.csv`

Use this when you want HMDB-native identifiers and rich metabolite metadata.

Another valuable resource for metabolite annotation is the Human Metabolome Database (HMDB). To build a mass table, first download the metabolite data in XML format from https://hmdb.ca/downloads (e.g., the “All Metabolites” file). This is a large XML file containing information on all the metabolites in HMDB, including their exact masses and names. Each HMDB entry carries both an HMDB ID and a ChEBI ID, so you can use whichever matches your pathway table.

You can then process it into the required table format for SMEW as follows:

library(xml2)
library(data.table)
##> 
##> Attaching package: 'data.table'
##> The following objects are masked from 'package:dplyr':
##> 
##>     between, first, last
library(purrr)
##> 
##> Attaching package: 'purrr'
##> The following object is masked from 'package:data.table':
##> 
##>     transpose

doc <- read_xml("../databases/hmdb_metabolites.xml")
# Strip the default namespace so that plain element names work in XPath queries
doc <- xml_ns_strip(doc)
metabolites <- xml_find_all(doc, "//metabolite")

# Parse one <metabolite> record, supplied as XML text, into a one-row tibble
parse_one_text <- function(xml_text_string) {
  node <- read_xml(xml_text_string)
  tibble(
    MetaboliteID = xml_text(xml_find_first(node, ".//accession")),
    ChEBI = xml_text(xml_find_first(node, ".//chebi_id")),
    MetaboliteName = xml_text(xml_find_first(node, ".//name")),
    # The element name is spelled "monisotopic" in the HMDB XML
    ExactMass = xml_double(xml_find_first(node, ".//monisotopic_molecular_weight"))
  )
}
# Here we extract only the first 20 metabolites as an example. You can run this for every metabolite in HMDB, but be aware that this takes a long time.
metabolite_texts <- as.character(head(metabolites, 20))
annotation <- map_dfr(metabolite_texts, parse_one_text, .progress = TRUE)

# You can parallelise the parsing as follows to speed it up
# (n_cores is the number of worker processes you want to use):
# library(furrr)
# plan(multisession, workers = n_cores)
# annotation <- future_map_dfr(metabolite_texts, parse_one_text, .progress = TRUE)

print(head(annotation))
##> # A tibble: 6 × 4
##>   MetaboliteID ChEBI MetaboliteName        ExactMass
##>   <chr>        <chr> <chr>                     <dbl>
##> 1 HMDB0000001  50599 1-Methylhistidine         169. 
##> 2 HMDB0000002  15725 1,3-Diaminopropane         74.1
##> 3 HMDB0000005  30831 2-Ketobutyric acid        102. 
##> 4 HMDB0000008  50613 2-Hydroxybutyric acid     104. 
##> 5 HMDB0000010  1189  2-Methoxyestrone          300. 
##> 6 HMDB0000011  17066 3-Hydroxybutyric acid     104.

Depending on the pathway database you want to use, you can set either the HMDB ID column or the ChEBI column to be the MetaboliteID column for SMEW analysis.
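As a sketch of that choice, the code below builds a SMEW-style metabolite table keyed on ChEBI IDs. The first two rows of the stand-in `annotation_demo` follow the output above (with masses given to four decimals for illustration); the third row, `HMDB9999999`, is a hypothetical entry without a ChEBI ID, included to show why such rows need to be dropped.

```r
# Stand-in for the `annotation` table parsed from HMDB.
annotation_demo <- data.frame(
  MetaboliteID   = c("HMDB0000001", "HMDB0000002", "HMDB9999999"),
  ChEBI          = c("50599", "15725", NA),
  MetaboliteName = c("1-Methylhistidine", "1,3-Diaminopropane", "Hypothetical entry"),
  ExactMass      = c(169.0851, 74.0844, 100.0000),
  stringsAsFactors = FALSE
)

# Keep only rows that actually have a ChEBI ID.
has_chebi <- !is.na(annotation_demo$ChEBI) & annotation_demo$ChEBI != ""

# Use the ChEBI column as MetaboliteID, in the column order SMEW expects.
metabolite_table_chebi <- data.frame(
  MetaboliteID   = annotation_demo$ChEBI[has_chebi],
  ExactMass      = annotation_demo$ExactMass[has_chebi],
  MetaboliteName = annotation_demo$MetaboliteName[has_chebi],
  stringsAsFactors = FALSE
)
print(metabolite_table_chebi)
```

Keeping the HMDB accession instead only requires selecting the MetaboliteID, ExactMass and MetaboliteName columns from `annotation` as they are.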

LipidMaps example for `metabolite_table.csv`

Use this when your panel is lipid-heavy and you want LipidMaps IDs and masses.

LipidMaps is a valuable resource for lipid annotation. You can download the relevant data and process it into the required format for SMEW as follows:

download.file(
  "https://www.lipidmaps.org/files/?file=LMSD&ext=sdf.zip",
  "../databases/LMSD.zip"
)
unzip("../databases/LMSD.zip",exdir = '../databases/')

You can then read the .sdf file and extract the relevant information to build the metabolite_table.csv as follows:

lines <- readLines("../databases/structures.sdf")
records <- split(lines, cumsum(lines == "$$$$"))
records <- records[lengths(records) > 1]

parse_record <- function(rec){
  
  text <- paste(rec, collapse="\n")
  
  tibble(
    MetaboliteID = stringr::str_match(text, "> <LM_ID>\\s*\\n([^\\n]+)")[,2],
    MetaboliteName = stringr::str_match(text, "> <NAME>\\s*\\n([^\\n]+)")[,2],
    ExactMass = as.numeric(stringr::str_match(text, "> <EXACT_MASS>\\s*\\n([^\\n]+)")[,2])
  )
  
}

annotation <- map_dfr(records, parse_record)
print(head(annotation))
##> # A tibble: 6 × 3
##>   MetaboliteID MetaboliteName                                          ExactMass
##>   <chr>        <chr>                                                       <dbl>
##> 1 LMFA00000001 2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride           626.
##> 2 LMFA00000002 Serratamic acid                                              275.
##> 3 LMFA00000003 N-(3-(hexadecanoyloxy)-heptadecanoyl)-L-ornithine            639.
##> 4 LMFA00000005 N-(3-(15-methyl-hexadecanoyloxy)-13-methyl-tetradecano…      626.
##> 5 LMFA00000006 Lysine-containing siolipin                                   685.
##> 6 LMFA00000007 n-decanohydroxamic acid                                      187.

Alternative sources for `pathway_table.csv`

WikiPathways example for `pathway_table.csv`

Use this when you want curated pathway graphs for a specific organism.

The WikiPathways database is another valuable resource for pathway information. You can download the pathway data for your organism of interest from https://data.wikipathways.org/current/gpml/. The file names there include the release date, so adjust the URL in the example below to the current release:

dir.create("../databases/wikipathways", recursive = TRUE, showWarnings = FALSE)
download.file(
  "https://data.wikipathways.org/current/gpml/wikipathways-20260310-gpml-Mus_musculus.zip",
  "../databases/wikipathways/mouse_gpml.zip",
  mode = "wb"
)
unzip("../databases/wikipathways/mouse_gpml.zip", exdir = "../databases/wikipathways/gpml")

To then create a pathway table for SMEW you can do the following:

library(xml2)
library(dplyr)
library(purrr)
library(progress)

parse_wikipathway <- function(file){
  
  doc <- read_xml(file)
  doc <- xml2::xml_ns_strip(doc)
  pathway_node <- xml2::xml_find_first(doc, "//Pathway")
  pathway_name <- xml2::xml_attr(pathway_node, "Name")
  version <- xml2::xml_attr(pathway_node, "Version")
  pathway_id <- sub("_.*", "", version)
 
  nodes <- xml_find_all(doc, "//DataNode[@Type='Metabolite']")
  
  if(length(nodes) == 0) return(NULL)
  tibble(
    PathwayID = pathway_id,
    PathwayName = pathway_name,
    MetaboliteID = xml_attr(xml_find_first(nodes, ".//Xref"), "ID"),
    Database = xml_attr(xml_find_first(nodes, ".//Xref"), "Database")
  )
}

files <- list.files("../databases/wikipathways/gpml", full.names = TRUE)
pb <- progress_bar$new(total = length(files))
metabolite_pathways <- map_dfr(files, function(f){
  pb$tick()
  parse_wikipathway(f)
})

# Here are the types of databases included:
print(unique(metabolite_pathways$Database))
##> [1] "ChEBI"            "LIPID MAPS"       ""                 "HMDB"            
##> [5] "KEGG Compound"    "PubChem-compound" "CAS"              "Chemspider"      
##> [9] "Wikidata"

# You can then filter to the relevant metabolite IDs you are using
# For example to use KEGG IDs:
metabolite_pathways_filtered <- metabolite_pathways %>%
  filter(Database %in% c("KEGG Compound"))

pathway_table <- metabolite_pathways_filtered %>%
  filter(!is.na(MetaboliteID)) %>%
  group_by(PathwayID, PathwayName) %>%
  summarise(
    MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
    .groups = "drop"
  )
print(head(pathway_table))
##> # A tibble: 6 × 3
##>   PathwayID PathwayName                                MetaboliteIDs            
##>   <chr>     <chr>                                      <chr>                    
##> 1 WP164     Glutathione metabolism                     C00669,C00097,C01419,C01…
##> 2 WP1770    One-carbon metabolism and related pathways C00021,C00606,C00097,C00…
##> 3 WP1771    Kennedy pathway                            C00189,C00019,C02737,C00…
##> 4 WP2185    Purine metabolism                          C05512,C00059,C00360,C04…
##> 5 WP2292    Chemokine signaling pathway                C00076,C05981,C01245,C00…
##> 6 WP232     G protein signaling pathways               C00165

# Or to use LipidMaps:
metabolite_pathways_filtered <- metabolite_pathways %>%
  filter(Database %in% c("LIPID MAPS"))

pathway_table <- metabolite_pathways_filtered %>%
  filter(!is.na(MetaboliteID)) %>%
  group_by(PathwayID, PathwayName) %>%
  summarise(
    MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
    .groups = "drop"
  )
print(head(pathway_table))
##> # A tibble: 6 × 3
##>   PathwayID PathwayName                                            MetaboliteIDs
##>   <chr>     <chr>                                                  <chr>        
##> 1 WP4335    Eicosanoid lipid synthesis map                         LMFA01030001…
##> 2 WP4344    Sphingolipid metabolism overview                       LMSP01020002…
##> 3 WP4345    Glycerolipids and glycerophospholipids                 LMGP10010000…
##> 4 WP4346    Cholesterol metabolism with Bloch and Kandutsch-Russe… LMFA01030056…
##> 5 WP4347    Eicosanoid metabolism via cyclooxygenases (COX)        LMFA03010133…
##> 6 WP4348    Eicosanoid metabolism via lipoxygenases (LOX)          LMFA03020037…
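If you filter on ChEBI instead, be aware that identifiers from different sources may be written with or without a `CHEBI:` prefix, and that some entries have no ID at all (note the empty Database value in the list above). The helper below, `normalise_chebi()`, is a hypothetical sketch of one way to bring such IDs into a single bare-number form before matching them against metabolite_table.csv; check your own data to see whether it is needed.

```r
# Hypothetical helper: strip whitespace and an optional "CHEBI:" prefix,
# and turn empty strings into NA so they can be filtered out.
normalise_chebi <- function(x) {
  x <- trimws(x)
  x <- sub("^CHEBI:", "", x, ignore.case = TRUE)
  x[x == ""] <- NA_character_
  x
}

# Example input mixing the different styles.
raw_ids <- c("CHEBI:30616", "30616", " chebi:456216", "")
clean_ids <- normalise_chebi(raw_ids)
print(clean_ids)
```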

Reactome example for `pathway_table.csv`

Use this when your metabolite IDs are ChEBI and you want organism-specific pathway mapping.

The Reactome database can be used to link ChEBI IDs to pathways. You can download the https://reactome.org/download/current/ChEBI2Reactome.txt table and then process it into the required format for SMEW as follows:

dir.create("../databases/reactome", recursive = TRUE, showWarnings = FALSE)
download.file(
  "https://reactome.org/download/current/ChEBI2Reactome.txt",
  "../databases/reactome/ChEBI2Reactome.txt",
  mode = "wb"
)

# Read the file from the location it was downloaded to above
reactome <- read.delim(
  "../databases/reactome/ChEBI2Reactome.txt",
  header = FALSE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# You can then filter to the relevant organism like this
reactome_mouse <- reactome |>
  dplyr::filter(V6 == "Mus musculus")

# Then group by pathway and collapse the ChEBI IDs into a comma-delimited string for the MetaboliteIDs column:
reactome_pathways <- reactome_mouse |>
  dplyr::transmute(
    PathwayID = V2,
    PathwayName = V4,
    MetaboliteID = V1
  ) |>
  dplyr::group_by(PathwayID, PathwayName) |>
  dplyr::summarise(
    MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
    .groups = "drop"
  )
print(head(reactome_pathways))
##> # A tibble: 6 × 3
##>   PathwayID     PathwayName                           MetaboliteIDs             
##>   <chr>         <chr>                                 <chr>                     
##> 1 R-MMU-1059683 "Interleukin-6 signaling"             30616,456216              
##> 2 R-MMU-109704  "PI3K Cascade"                        28815,30616,456216,57836,…
##> 3 R-MMU-110056  "MAPK3 (ERK1) activation"             30616,456216              
##> 4 R-MMU-110312  "Translesion synthesis by REV1"       16516,33019,61481         
##> 5 R-MMU-110320  "Translesion Synthesis by POLH"       16516,30616,33019,456216,…
##> 6 R-MMU-110329  "Cleavage of the damaged pyrimidine " 15901,17568,17821,27983,2…
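The quotation marks in the printed output appear because at least one pathway name ends with a space ("Cleavage of the damaged pyrimidine "). Since pathway names are used to match tables against each other, it is safer to trim such whitespace before saving. A minimal sketch, using a two-row stand-in table with shortened ID lists:

```r
# Stand-in for the Reactome pathway table; the ID lists are shortened.
reactome_demo <- data.frame(
  PathwayID     = c("R-MMU-110329", "R-MMU-109704"),
  PathwayName   = c("Cleavage of the damaged pyrimidine ", "PI3K Cascade"),
  MetaboliteIDs = c("15901,17568", "28815,30616"),
  stringsAsFactors = FALSE
)

# Remove leading and trailing whitespace from the pathway names.
reactome_demo$PathwayName <- trimws(reactome_demo$PathwayName)

# Confirm that no name still differs from its trimmed version.
untrimmed <- reactome_demo$PathwayName != trimws(reactome_demo$PathwayName)
print(sum(untrimmed))
```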

HMDB example for `pathway_table.csv`

Use this when you want pathway associations directly from HMDB metadata.

You can also extract pathway information directly from HMDB, although it is not as comprehensive as other databases for this purpose:

library(purrr)
library(dplyr)

parse_hmdb_pathways <- function(xml_text_string){
  node <- read_xml(xml_text_string)
  id <- xml2::xml_text(xml2::xml_find_first(node, ".//accession"))
  pathways <- xml2::xml_find_all(node, ".//pathway")
  
  if(length(pathways) == 0) return(NULL)
  
  data.frame(
    MetaboliteID = id,
    PathwayName = xml2::xml_text(xml2::xml_find_first(pathways, ".//name")),
    SMPDB_ID = xml2::xml_text(xml2::xml_find_first(pathways, ".//smpdb_id")),
    KEGG_ID = xml2::xml_text(xml2::xml_find_first(pathways, ".//kegg_map_id")),
    stringsAsFactors = FALSE
  )
}

# For demonstration we parse only the first 100 metabolites. You can run this for every metabolite in HMDB, but be aware that this takes a long time.
metabolite_texts <- as.character(head(metabolites, 100))
metabolite_pathway_table <- map_dfr(metabolite_texts, parse_hmdb_pathways)

pathway_table <- metabolite_pathway_table |>
  filter(!is.na(KEGG_ID), KEGG_ID != "") |>
  group_by(PathwayID = KEGG_ID, PathwayName) |>
  summarise(
    MetaboliteIDs = paste(unique(MetaboliteID), collapse = ","),
    .groups = "drop"
  )
print( head(pathway_table) )
##> # A tibble: 6 × 3
##>   PathwayID PathwayName                     MetaboliteIDs                       
##>   <chr>     <chr>                           <chr>                               
##> 1 map00010  Glycolysis / Gluconeogenesis    HMDB0000030,HMDB0000122,HMDB0000124…
##> 2 map00020  Citric Acid Cycle               HMDB0000030,HMDB0000072,HMDB0000094…
##> 3 map00051  Fructose and mannose metabolism HMDB0000124                         
##> 4 map00052  Galactose Metabolism            HMDB0000048,HMDB0000107,HMDB0000122…
##> 5 map00071  Fatty acid Metabolism           HMDB0000045,HMDB0000062             
##> 6 map00072  Ketone Body Metabolism          HMDB0000011,HMDB0000060

As before, you can use either SMPDB_ID or KEGG_ID as the PathwayID for SMEW analysis, depending on which pathway database you want to work with.

Final checks before `create_smew_app()`

Confirm all required columns are present with exact names.
Confirm every ID used in pathway_table.csv exists in metabolite_table.csv.
Check for duplicate metabolite or pathway rows unless intentional.
Use a single delimiter style in MetaboliteIDs (comma-delimited).
Save all three files as CSV and verify they load cleanly with read.csv().
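The checks above can be sketched as a small helper. `check_smew_tables()` is a hypothetical function written for this vignette, not part of SMEW; it returns a character vector describing any problems it finds, and an empty vector when everything passes. The demo tables use approximate masses and are for illustration only.

```r
# Hypothetical helper (not part of SMEW): collect problems in the input tables.
check_smew_tables <- function(metabolite_table, pathway_table,
                              pathway_classification = NULL) {
  # 1. Required columns with exact names.
  missing_cols <- c(
    setdiff(c("MetaboliteID", "ExactMass", "MetaboliteName"), names(metabolite_table)),
    setdiff(c("PathwayID", "PathwayName", "MetaboliteIDs"), names(pathway_table))
  )
  if (length(missing_cols) > 0) {
    return(paste("Missing columns:", paste(missing_cols, collapse = ", ")))
  }
  problems <- character(0)

  # 2. Every ID used in pathway_table exists in metabolite_table.
  pathway_ids <- unique(trimws(unlist(
    strsplit(as.character(pathway_table$MetaboliteIDs), ",", fixed = TRUE)
  )))
  unknown <- setdiff(pathway_ids, as.character(metabolite_table$MetaboliteID))
  if (length(unknown) > 0) {
    problems <- c(problems, paste("IDs not in metabolite_table:",
                                  paste(unknown, collapse = ", ")))
  }

  # 3. Duplicate metabolite or pathway rows.
  if (anyDuplicated(metabolite_table$MetaboliteID) > 0) {
    problems <- c(problems, "Duplicate MetaboliteID values in metabolite_table")
  }
  if (anyDuplicated(pathway_table$PathwayID) > 0) {
    problems <- c(problems, "Duplicate PathwayID values in pathway_table")
  }

  # 4. Pathway names without a classification entry (only if supplied).
  if (!is.null(pathway_classification)) {
    unmatched <- setdiff(pathway_table$PathwayName,
                         pathway_classification$PathwayName)
    if (length(unmatched) > 0) {
      problems <- c(problems, paste("Pathways without classification:",
                                    paste(unmatched, collapse = "; ")))
    }
  }
  problems
}

# Demo tables: C00031 is deliberately absent from the metabolite table.
met_demo <- data.frame(
  MetaboliteID   = c("C00022", "C00024"),
  ExactMass      = c(88.0160, 809.1258),  # approximate, for illustration
  MetaboliteName = c("Pyruvate", "Acetyl-CoA"),
  stringsAsFactors = FALSE
)
pw_demo <- data.frame(
  PathwayID     = "map00010",
  PathwayName   = "Glycolysis / Gluconeogenesis",
  MetaboliteIDs = "C00022,C00024,C00031",
  stringsAsFactors = FALSE
)

problems <- check_smew_tables(met_demo, pw_demo)
print(problems)
```

In practice you would read the three saved files with read.csv() and pass the resulting data frames to the helper, which also confirms that the files load cleanly.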
