Please see our interactive walkthrough tutorial for a detailed overview of the package's functionality.

Quickstart guide

Load Reactome pathways

import sspa

reactome_pathways = sspa.process_reactome(organism="Homo sapiens")

Load some example metabolomics data in the form of a pandas DataFrame:

covid_data_processed = sspa.load_example_data(omicstype="metabolomics", processed=True)

Generate pathway scores using the kPCA method:

# iloc[:, :-2] excludes the last two columns of the example data from scoring
kpca_scores = sspa.sspa_KPCA(reactome_pathways, min_entity=3).fit_transform(covid_data_processed.iloc[:, :-2])

Input data

The input data should consist of an \(m×n\) pandas DataFrame of values corresponding to the abundance of all annotated metabolites, either as absolute (concentrations) or relative (peak intensities) quantifications. Rows represent \(m\) samples and columns represent \(n\) annotated compounds.

Sample metadata is necessary for some types of pathway analysis, particularly the conventional methods over-representation analysis (ORA) and gene set enrichment analysis (GSEA). Sample metadata should contain details of the clinical outcome or phenotype used for comparison and should be represented as a column either as part of the metabolite abundance DataFrame, or as a separate column/Series, as long as the sample identifiers can be matched between data structures.
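As a minimal sketch of matching sample identifiers between data structures, the snippet below aligns a separately stored metadata Series to the abundance DataFrame by index. It uses plain pandas rather than any sspa function; the variable names are illustrative, and the values are a few entries from this section's example table.

```python
import pandas as pd

# Abundance matrix: rows are samples, columns are annotated compounds.
abundance = pd.DataFrame(
    {"spermidine": [-0.756979, 0.079818, 0.978372]},
    index=pd.Index([1004596, 1008097, 1008631], name="sample_id"),
)

# Metadata stored separately, deliberately in a different sample order.
groups = pd.Series(
    ["B", "A", "A"],
    index=pd.Index([1008631, 1004596, 1008097], name="sample_id"),
    name="Group",
)

# Fail early if any sample in the abundance matrix has no metadata.
missing = abundance.index.difference(groups.index)
if len(missing) > 0:
    raise ValueError(f"Samples without metadata: {list(missing)}")

# Reorder the metadata so that it lines up row by row with the abundances.
groups_aligned = groups.loc[abundance.index]
print(groups_aligned)
```

Aligning once up front avoids silently comparing the wrong samples when the two objects are later passed to an analysis function together.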

Example

Below is a condensed example of the input data format:

sample_id	spermidine	1-methylnicotinamide	12,13-DiHOME	alpha-ketoglutarate	kynurenate	Group
1004596	-0.756979	0.552163	-0.317382	0.726321	-0.608606	A
1008097	0.079818	-0.839393	0.49128	-1.867786	-0.044496	A
1008631	0.978372	-1.281277	-0.199487	0.355229	0.014784	B
1012545	-0.93754	-0.242391	1.63653	2.080704	-0.31561	B
1022407	-0.652496	-0.110733	0.814461	-0.886903	0.409608	B

The index contains the sample identifiers and the column names are metabolite names/identifiers. Metadata columns can also be added at the end of the DataFrame.
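The example table above can be read into pandas and split into an abundance matrix and a metadata column, as sketched below. This uses only pandas; the names `data`, `abundance` and `groups` are illustrative and not part of the sspa API.

```python
import io

import pandas as pd

# The condensed example table from above, as tab-separated text.
raw = (
    "sample_id\tspermidine\t1-methylnicotinamide\t12,13-DiHOME\talpha-ketoglutarate\tkynurenate\tGroup\n"
    "1004596\t-0.756979\t0.552163\t-0.317382\t0.726321\t-0.608606\tA\n"
    "1008097\t0.079818\t-0.839393\t0.49128\t-1.867786\t-0.044496\tA\n"
    "1008631\t0.978372\t-1.281277\t-0.199487\t0.355229\t0.014784\tB\n"
    "1012545\t-0.93754\t-0.242391\t1.63653\t2.080704\t-0.31561\tB\n"
    "1022407\t-0.652496\t-0.110733\t0.814461\t-0.886903\t0.409608\tB\n"
)

# Sample identifiers become the index of the DataFrame.
data = pd.read_csv(io.StringIO(raw), sep="\t", index_col="sample_id")

# Separate the trailing metadata column from the metabolite abundances.
groups = data["Group"]
abundance = data.drop(columns="Group")

print(abundance.shape)
print(groups.value_counts().to_dict())
```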

Loading pathways

# Pre-loaded pathways
# Reactome v78
reactome_pathways = sspa.process_reactome(organism="Homo sapiens")

# KEGG v98
kegg_human_pathways = sspa.process_kegg(organism="hsa")

Load a custom GMT file (extension .gmt or .csv)

custom_pathways = sspa.process_gmt("wikipathways-20220310-gmt-Homo_sapiens.gmt")
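For orientation, a GMT file stores one pathway per line: an identifier, a description, and then the member identifiers, all tab-separated. The sketch below parses that layout in plain Python to show what such a file contains. `parse_gmt_lines` is a hypothetical helper, not part of sspa, and the pathway and compound identifiers are placeholders; the structure returned by `sspa.process_gmt` may differ.

```python
def parse_gmt_lines(lines):
    """Parse GMT-formatted lines into {pathway_id: (description, members)}."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines (fewer than three fields).
            continue
        pathway_id, description, *members = fields
        # Drop empty fields left behind by trailing tabs.
        pathways[pathway_id] = (description, [m for m in members if m])
    return pathways


# Placeholder content: these identifiers are made up for illustration.
example_lines = [
    "PATHWAY_1\tExample pathway one\tcpd_a\tcpd_b\tcpd_c\n",
    "PATHWAY_2\tExample pathway two\tcpd_b\tcpd_d\t\n",
    "\n",
]

parsed = parse_gmt_lines(example_lines)
for pathway_id, (description, members) in parsed.items():
    print(pathway_id, description, members)
```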

Download latest version of pathways

# download KEGG latest
kegg_mouse_latest = sspa.process_kegg("mmu", download_latest=True, filepath=".")

# download Reactome latest
reactome_mouse_latest = sspa.process_reactome("Mus musculus", download_latest=True, filepath=".")

Note

Downloading the latest version of KEGG pathways can take up to ten minutes.

Identifier harmonization

# download the conversion table
compound_names = processed_data.columns.tolist()
conversion_table = sspa.identifier_conversion(input_type="name", compound_list=compound_names)

# map the identifiers to your dataset
processed_data_mapped = sspa.map_identifiers(conversion_table, output_id_type="ChEBI", matrix=processed_data)

Warning

This step requires an active internet connection and access to the MetaboAnalyst API. If you are experiencing errors, please check that you are able to query the API.

Conventional pathway analysis

ORA

ora = sspa.sspa_ora(processed_data_mapped, covid_data["Group"], reactome_pathways, 0.05, DA_testtype='ttest', custom_background=None)

# perform ORA 
ora_res = ora.over_representation_analysis()

# get t-test results
ora.ttest_res

# obtain list of differential molecules input to ORA
ora.DA_test_res
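For readers new to ORA: it is commonly formulated as a one-sided hypergeometric (Fisher's exact) test on the overlap between the differential molecules and a pathway, relative to a background set. The sketch below implements that textbook formulation with the standard library only. The function names and the molecule identifiers are hypothetical, and this is not necessarily how `sspa_ora` computes its results.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when drawing N items from M, of which n are 'successes'."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def ora_p_value(differential, background, pathway):
    """One-sided over-representation p-value for a single pathway."""
    background = set(background)
    # Only molecules present in the background can contribute to the test.
    differential = set(differential) & background
    pathway = set(pathway) & background
    overlap = len(differential & pathway)
    return hypergeom_upper_tail(
        k=overlap, M=len(background), n=len(pathway), N=len(differential)
    )


# Placeholder identifiers: ten background molecules, four in the pathway.
background = [f"m{i}" for i in range(10)]
pathway = ["m0", "m1", "m2", "m3"]

p_enriched = ora_p_value(["m0", "m1", "m2"], background, pathway)
p_no_overlap = ora_p_value(["m7", "m8", "m9"], background, pathway)
print(p_enriched, p_no_overlap)
```

The choice of background matters: restricting it to the compounds actually measured changes `M` and therefore the p-value, which is why a custom background can be supplied.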

Statistical tests for selecting differential molecules

In over-representation analysis, the list of molecules of interest (the 'differential' genes, metabolites, proteins, etc.) is often determined using a statistical test such as Student's t-test. The sspa_ora function lets users specify which test is used for this purpose: DA_testtype='ttest' for an independent-samples t-test (default), or DA_testtype='mwu' for a Mann-Whitney U test.
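To make the difference between the two tests concrete, the sketch below applies both to a single metabolite using SciPy directly. It assumes SciPy is installed and uses the kynurenate values from the example table in the Input data section. The calls are analogous to, but not necessarily identical to, what `sspa_ora` runs internally, and with so few samples the p-values are purely illustrative.

```python
from scipy import stats

# kynurenate values from the example table, split by the Group column.
group_a = [-0.608606, -0.044496]
group_b = [0.014784, -0.31561, 0.409608]

# Independent-samples t-test: the kind of test selected by DA_testtype='ttest'.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Mann-Whitney U test: the rank-based alternative selected by DA_testtype='mwu'.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p-value: {t_p:.3f}")
print(f"Mann-Whitney U p-value: {u_p:.3f}")
```

The t-test assumes approximately normal abundances within each group, whereas the Mann-Whitney U test compares ranks and is less sensitive to outliers and skewed distributions.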

GSEA

sspa.sspa_gsea(processed_data_mapped, covid_data['Group'], reactome_pathways)

Single sample pathway analysis methods

All ssPA methods now provide fit(), transform() and fit_transform() methods for compatibility with scikit-learn. This allows ssPA transformations to be combined with scikit-learn utilities such as Pipeline and GridSearchCV. For the sspa.sspa_ssClustPA, sspa.sspa_SVD, and sspa.sspa_KPCA methods specifically, the model can be fit on the training data and then used to transform the test data.
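The fit-on-train, transform-on-test pattern can be illustrated without sspa. The class below is a hypothetical toy transformer that scores each pathway as the mean z-score of its member compounds; it stands in for an ssPA method and is not sspa's implementation. Compound names, pathway names and values are placeholders.

```python
import pandas as pd


class ToyPathwayZScore:
    """Hypothetical transformer (not part of sspa): mean z-score per pathway."""

    def __init__(self, pathways, min_entity=2):
        self.pathways = pathways      # {pathway name: list of compound columns}
        self.min_entity = min_entity  # minimum member compounds found in the data

    def fit(self, X, y=None):
        # Learn per-compound mean and standard deviation from the training data only.
        self.mean_ = X.mean()
        self.std_ = X.std(ddof=0)
        return self

    def transform(self, X):
        # Standardise new samples with the statistics learned during fit().
        z = (X - self.mean_) / self.std_
        scores = {}
        for name, members in self.pathways.items():
            present = [m for m in members if m in z.columns]
            if len(present) >= self.min_entity:
                scores[name] = z[present].mean(axis=1)
        return pd.DataFrame(scores, index=X.index)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Placeholder data: three training samples and two held-out test samples.
train = pd.DataFrame(
    {"cpd_a": [1.0, 2.0, 3.0], "cpd_b": [2.0, 4.0, 6.0], "cpd_c": [0.0, 1.0, 2.0]},
    index=["s1", "s2", "s3"],
)
test = pd.DataFrame(
    {"cpd_a": [2.0, 3.0], "cpd_b": [4.0, 6.0], "cpd_c": [1.0, 2.0]},
    index=["t1", "t2"],
)
pathways = {"pathway_1": ["cpd_a", "cpd_b"], "pathway_2": ["cpd_c", "cpd_missing"]}

model = ToyPathwayZScore(pathways, min_entity=2)
train_scores = model.fit_transform(train)  # parameters are learned here
test_scores = model.transform(test)        # and reused here, without refitting
print(test_scores)
```

Because the test samples are scaled with training-set statistics, no information from the test set leaks into the fitted model, which is the property that cross-validation relies on.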

# ssclustPA
ssclustpa_res = sspa.sspa_ssClustPA(reactome_pathways, min_entity=2).fit_transform(processed_data_mapped)

# kPCA
kpca_scores = sspa.sspa_KPCA(reactome_pathways, min_entity=2).fit_transform(processed_data_mapped)

# z-score (Lee et al. 2008)
zscore_res = sspa.sspa_zscore(reactome_pathways, min_entity=2).fit_transform(processed_data_mapped)

# SVD (PLAGE, Tomfohr et al. 2005)
svd_res = sspa.sspa_SVD(reactome_pathways, min_entity=2).fit_transform(processed_data_mapped)

# ssGSEA (Barbie et al. 2009)
ssgsea_res = sspa.sspa_ssGSEA(reactome_pathways, min_entity=2).fit_transform(processed_data_mapped)