We then computed weighted UniFrac9 distances to compare metabolomic profiles.

These chemical relationships are represented as a chemical tree that can be visualized in the context of sample metadata and molecular annotations obtained from spectral matching and in silico annotation tools. We show that such a chemical tree representation enables the application of various tree-based tools, originally developed for analyzing DNA sequencing data, to the exploration of mass spectrometry data. Here, we introduce Qemistree, software that constructs a chemical tree based on molecular fingerprints predicted from MS/MS fragmentation spectra. Molecular fingerprints are vectors in which each position encodes a substructural property of the molecule, and recent methods allow us to predict molecular fingerprints from tandem mass spectra. In Qemistree, we use SIRIUS and CSI:FingerID to obtain predicted molecular fingerprints. Users first perform feature detection to generate a list of observed ions with associated peak areas and MS/MS fragmentation spectra, referred to as chemical features henceforth, to be analyzed by Qemistree. Only chemical features with MS/MS data are included; features with only MS1 data are not considered.

SIRIUS then determines the molecular formula of each feature using the isotope and fragmentation patterns and estimates the best fragmentation tree explaining the fragmentation spectrum. Subsequently, CSI:FingerID operates on the fragmentation trees using kernel support vector machines to predict molecular properties, that is, the molecular fingerprint. We use these molecular fingerprints to calculate pairwise distances between chemical features and hierarchically cluster the fingerprint vectors to generate a tree representing their chemical structural relationships. Although alternative approaches that hierarchically cluster features based on the cosine similarity of fragmentation spectra exist, we use molecular fingerprints predicted by CSI:FingerID for this purpose. Previous work has shown that CSI:FingerID outperforms other tools for automatic in silico structural annotation. Therefore, we also leverage it to search molecular structural databases, providing complementary insights into structures when no match is obtained against spectral libraries. Subsequently, we use ClassyFire to assign a 5-level chemical taxonomy to all molecules annotated via spectral library matching and in silico prediction. Phylogenetic tools such as iTOL can be used to visualize Qemistree trees interactively in the context of sample information and feature annotations for easy data exploration. The outputs of Qemistree can also be plugged into other workflows in QIIME 2 or in R, Python, etc. for system-wide metabolomic data analyses.
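The clustering step described above can be sketched as follows. This is a minimal illustration, not the Qemistree implementation: the fingerprint values and feature names are hypothetical, and the choice of Euclidean distance with average linkage is an assumption made for the example. The helper `to_newick` is likewise written only for this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage, to_tree
from scipy.spatial.distance import pdist


def to_newick(node, labels, parent_height):
    """Recursively serialize a SciPy ClusterNode into a Newick string."""
    branch = parent_height - node.dist  # branch length up to the parent node
    if node.is_leaf():
        return f"{labels[node.id]}:{branch:.3f}"
    left = to_newick(node.get_left(), labels, node.dist)
    right = to_newick(node.get_right(), labels, node.dist)
    return f"({left},{right}):{branch:.3f}"


# Hypothetical predicted fingerprints: one row per chemical feature,
# one column per substructural property (values are made up).
feature_ids = ["sugar_a", "sugar_b", "lipid_a", "lipid_b"]
fingerprints = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.2, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.2, 0.8, 0.9],
])

# Pairwise distances between fingerprints (metric is an assumption).
distances = pdist(fingerprints, metric="euclidean")
# Agglomerative clustering (linkage method is an assumption).
merges = linkage(distances, method="average")

# Convert the linkage matrix into a tree and export it as Newick.
root = to_tree(merges)
newick = to_newick(root, feature_ids, root.dist) + ";"

# Cutting the tree into two clusters should separate sugars from lipids.
groups = fcluster(merges, t=2, criterion="maxclust")
print(newick)
print(dict(zip(feature_ids, groups)))
```

The resulting Newick string can be loaded by standard tree viewers, which is what makes phylogenetic tooling applicable to chemical features.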

In this study, we apply Qemistree to perform chemically informed comparisons of samples in the presence of technical variation, such as chromatographic shifts, that commonly affects mass spectrometry data analysis. Additionally, we exemplify the use of a tree-based representation to visualize and explore chemical diversity using a heterogeneous collection of food products. Qemistree can be used iteratively to incorporate multiple datasets without the need for cumbersome reprocessing, allowing for large-scale dataset comparisons. Qemistree is available to the microbiome community as a QIIME 2 plugin and to the metabolomics community as a workflow on GNPS2. The chemical tree from the GNPS workflow can be explored interactively using the Qemistree GNPS dashboard.

To verify that molecular fingerprint-based trees correctly capture the chemical relationships between molecules, we designed an evaluation dataset using four distinct biological specimens: two human fecal samples, a tomato seedling sample, and a human serum sample. Samples were prepared by combining them in binary, tertiary, and quaternary mixtures in various proportions to generate a set of diverse but related metabolite profiles. Untargeted tandem mass spectrometry was used to analyze the chemical composition of these samples and obtain fragmentation spectra.

The mass spectrometry experiments were performed twice using different chromatographic elution gradients, causing a retention time shift between the two runs. Processing the data of these two experiments with traditional LC-MS-based pipelines leads to the same molecules being detected as different chemical features in downstream analysis. Figure 1 shows the analysis of pure samples to demonstrate this. In Extended Data Figure 4, we highlight how these technical variations make the same samples appear chemically disjointed. Using Qemistree, we mapped each of the spectra in the two chromatographic conditions to a molecular fingerprint and organized these in a tree structure. Because molecular fingerprints are independent of retention time shifts, spectra are clustered based on their chemical similarity. It is noteworthy that the structural information from chemical features with spectral library matches or other forms of annotation could also be used to compare the chemical composition of samples across different mass spectrometry runs. Qemistree improves upon this by enabling the use of all MS/MS spectra with molecular fingerprints for downstream comparative analyses, rather than constraining the analysis to the chemical features with spectral matches. This tree structure can be decorated using sample type descriptions, chromatographic conditions, spectral matches obtained from molecular networking in GNPS, and any other chemical annotations23,28. Figure 1 shows that similar chemical features were detected exclusively in one of the two batches. However, based on the molecular fingerprints, these chemical features were arranged as neighboring tips in the tree regardless of the retention time shifts.
This result shows how Qemistree can reconcile and facilitate the comparison of datasets acquired on different chromatographic gradients.

Having demonstrated Qemistree’s practical utility on biologically inspired synthetic datasets, we now turn to a conceptual example illustrating the general principle. We demonstrate an application of a chemical hierarchy in performing chemically informed comparisons of metabolomics profiles. In standard metabolomic statistical analyses, each molecule is assumed to be unrelated to the other molecules in the dataset. Some of the pitfalls of this assumption are highlighted in Figure 2a. Consider a scenario where we want to compare samples 1–3. An analysis schema that does not account for the chemical relationships among the molecules in these samples will assume that the sugars in samples 2 and 3 are as chemically related to the lipids in sample 1 as they are to each other. This would lead to the naive conclusion that samples 1 and 2 are as distinct from each other as samples 2 and 3 are, yet from a chemical perspective they are not. On the other hand, if we account for the fact that sugar molecules are more chemically related to one another than they are to lipids, we can obtain a chemically informed sample-to-sample comparison. The chemical structural compositional similarity metric29 was developed to compute pairwise sample-to-sample comparisons by considering the cosine similarity of MS/MS spectra from molecular networking. Here, we utilize a tree-based approach to account for chemical relationships, which allows us to adopt phylogeny-based tools for metabolomics analyses. Specifically, we first constructed a tree of chemical similarities by hierarchically clustering molecular fingerprints from CSI:FingerID. This tree is analogous to phylogenetic trees used in ecology, such that the tips of the tree are molecules.
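The three-sample scenario can be made concrete with a toy calculation. The sketch below implements a simple, non-normalized form of weighted UniFrac by hand on a hypothetical tree; the node names, branch lengths, and abundances are all invented for illustration and are not taken from the study.

```python
# Toy chemical tree: each node maps to (parent, branch length).
# The root has no entry. All names and lengths are hypothetical.
tree = {
    "sugar_a": ("sugars", 0.1),
    "sugar_b": ("sugars", 0.1),
    "lipid_a": ("lipids", 0.1),
    "sugars": ("root", 0.9),
    "lipids": ("root", 0.9),
}


def branch_fractions(sample):
    """Fraction of a sample's total abundance found below each branch."""
    total = sum(sample.values())
    fractions = {}
    for tip, abundance in sample.items():
        node = tip
        while node in tree:  # walk from the tip up to the root
            fractions[node] = fractions.get(node, 0.0) + abundance / total
            node = tree[node][0]
    return fractions


def weighted_unifrac(sample_a, sample_b):
    """Sum over branches of length times the difference in abundance fraction."""
    frac_a = branch_fractions(sample_a)
    frac_b = branch_fractions(sample_b)
    return sum(
        length * abs(frac_a.get(node, 0.0) - frac_b.get(node, 0.0))
        for node, (_, length) in tree.items()
    )


# Sample 1 contains a lipid; samples 2 and 3 contain different sugars.
sample_1 = {"lipid_a": 10.0}
sample_2 = {"sugar_a": 10.0}
sample_3 = {"sugar_b": 10.0}

d12 = weighted_unifrac(sample_1, sample_2)
d23 = weighted_unifrac(sample_2, sample_3)
print(f"sample 1 vs 2: {d12:.2f}")
print(f"sample 2 vs 3: {d23:.2f}")
```

Because no molecule is shared between any pair of samples, a metric that ignores the tree would treat both pairs as maximally dissimilar. The tree-aware distance instead places samples 2 and 3 much closer to each other than either is to sample 1.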
In Figure 2a, we show that by using a tree of chemical relationships between molecules in samples 1–3, we can visualize that sample 1 is chemically very distinct from samples 2 and 3. Returning to our evaluation dataset, we can highlight the importance of comparing samples by accounting for their molecular relatedness. Principal coordinates analysis (PCoA) of the evaluation dataset that ignores the tree structure performs far worse than the Qemistree PCoA that uses the tree.
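For readers unfamiliar with the ordination step, principal coordinates analysis can be sketched in a few lines of NumPy. This is a generic classical PCoA written for illustration, not the routine used in the study, and the distance matrix below is hypothetical.

```python
import numpy as np


def pcoa(distance_matrix, n_axes=2):
    """Classical PCoA: double-center the squared distances and eigendecompose."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip tiny negative eigenvalues caused by floating-point error.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


# Hypothetical tree-aware distances among three samples:
# sample 1 is far from samples 2 and 3, which are close to each other.
distances = np.array([
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 0.2],
    [2.0, 0.2, 0.0],
])

coords = pcoa(distances)
print(coords)
```

Each row of `coords` gives the position of one sample in the ordination. For this small example the two axes reproduce the input distances, so samples 2 and 3 plot close together and far from sample 1.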

With the structural context provided by Qemistree, the differences between replicates across batches are comparable to the within-batch differences. The retention time shift in this dataset leads to a strong signal due to chromatography conditions that obscures the biological relationships among the samples. We observed and remediated a similar pattern originating from plate-to-plate variation in a recently published study investigating the metabolome and microbiome of captive cheetahs. In this study, placing the molecules in a tree using Qemistree reduced the observed technical variation and highlighted the dietary effect that was expected. These results show how systematic and spurious molecular differences can be mitigated in an unsupervised manner using chemically informed distance measures based on a tree structure.

As a case study demonstrating the utility of Qemistree on a set of biological specimens, we used the platform to explore chemical diversity in food samples collected in the Global FoodOmics initiative. Understanding the chemical relationships between different foods is challenging because most molecules within foods are unannotated. We selected a diverse range of food ingredients to represent animal, plant, and fungal groupings. We first performed feature-based molecular networking using MZmine to obtain spectral library matches for a subset of the chemical features. Using Qemistree, we collated GNPS spectral library matches and in silico predictions from CSI:FingerID to annotate ~91% of the chemical fingerprints with molecular structures. We also retrieved chemical taxonomy assignments for structures that were classified by ClassyFire; the remaining structures are in the queue to be processed on the ClassyFire server for taxonomy assignment. Labeling annotations allowed us to retrieve subtrees of distinct chemical classes such as flavonoids, alkaloids, phospholipids, acyl-carnitines, and O-glycosyl compounds in food products.
We propagated ClassyFire annotations of chemical features to each internal node of the tree and labeled the nodes with pie charts depicting the distribution of chemical superclasses and classes among their tips. The molecular fingerprint-based hierarchy of chemical features agreed well with the ClassyFire taxonomy assignment, further demonstrating that molecular fingerprints can meaningfully capture structural relationships among molecules in a hierarchical manner. Furthermore, Qemistree coupled the chemical tree to sample metadata, revealing distinct chemical classes expected for each sample type. Branches representing acyl-carnitines were exclusively found in animal products. In contrast, honey, although categorized as an animal product, shared most of its chemical space with plant products, reflecting the plant nectar and pollen-based diet of honey bees. We observed a clade of flavonoids in both plant products and honey, but in no other animal-based foods. While it is expected that a complex food such as blueberry kefir contains molecules from both blueberries and dairy, we can now visualize how individual ingredients and food preparation contribute to the chemical composition of complex foods. We noted that metabolite signatures that stem directly from particular ingredients, such as phosphoethanolamine from eggs, are present in egg scramble but not in the other two foods highlighted. We can also observe ingredients in foods that were not listed in the initial set of ingredients. We detected black pepper in the egg scramble with chorizo and in the orange chicken, whereas this signal is absent from the blueberry kefir.

We show that our tree-based approach coherently captures chemical ontologies and relationships among molecules and samples in various publicly available datasets.
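Propagating tip annotations to internal nodes, as used for the pie-chart labels, amounts to a post-order traversal that counts the classes of the tips below each node. The sketch below is a minimal illustration of that idea; the nested-tuple tree, the feature identifiers, and the assignments are hypothetical, and the function `propagate` is not part of Qemistree.

```python
from collections import Counter

# Hypothetical tree as nested tuples; strings are tips (chemical features).
tree = (("f1", "f2"), ("f3", ("f4", "f5")))

# Hypothetical superclass assignments for the tips.
superclass = {
    "f1": "Lipids and lipid-like molecules",
    "f2": "Lipids and lipid-like molecules",
    "f3": "Phenylpropanoids and polyketides",
    "f4": "Phenylpropanoids and polyketides",
    "f5": "Alkaloids and derivatives",
}


def propagate(node, annotations, out):
    """Post-order traversal recording class counts below every internal node."""
    if isinstance(node, str):  # a tip contributes its own class
        return Counter([annotations.get(node, "unclassified")])
    counts = Counter()
    for child in node:
        counts += propagate(child, annotations, out)
    out[node] = counts  # these counts are the data behind one pie chart
    return counts


node_counts = {}
root_counts = propagate(tree, superclass, node_counts)

for node, counts in node_counts.items():
    print(node, dict(counts))
```

A node whose counts are dominated by a single class corresponds to a chemically coherent subtree, which is the agreement between the fingerprint-based hierarchy and the taxonomy described above.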
Qemistree depends on representing chemical features as molecular fingerprints and therefore shares limitations with the underlying fingerprint prediction tool, CSI:FingerID. For example, fingerprint prediction depends on the quality and coverage of the MS/MS spectral databases available for training the predictive models, and predictions will improve as databases are enriched with more compound classes. Nevertheless, the use of CSI:FingerID-predicted molecular fingerprints is highly advantageous. While annotations from spectral matches may be more accurate, their coverage is too low to adequately summarize the chemical content of complex samples. Qemistree is also applicable in negative ionization mode; however, fewer molecular fingerprints can be confidently predicted because fewer reference spectra are publicly available, resulting in less extensive trees. A key contribution of this work is to introduce the concept of building chemical hierarchies that can be used to leverage phylogeny-based tools for metabolomics data exploration. Hierarchical relationships have provided a powerful framework to understand the relatedness of organisms. These techniques form a cornerstone for the interpretation of genomics data with phylogenetics and phylogenomics, and even taxonomy. The suite of tools and algorithms developed over the past few decades in these fields, which utilize hierarchical structures, potentially has general relevance to the investigation of mass spectrometry data. Using Qemistree, we can begin to explore the applicability of other methods, such as Faith’s Phylogenetic Diversity to understand within-sample complexity, or phylogenetic independent contrasts with a metabolomics-inspired topology, as these representations enter regular use. We showed that a hierarchical representation could be used to infer chemically informed relationships between samples.
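As an illustration of how a within-sample measure such as Faith’s Phylogenetic Diversity carries over to a chemical tree, the sketch below sums the lengths of all branches connecting a sample's observed features to the root. The tree, its branch lengths, and the feature names are hypothetical, and the code is a hand-written toy rather than an existing library routine.

```python
# Toy chemical tree: each node maps to (parent, branch length).
# The root has no entry. All names and lengths are hypothetical.
tree = {
    "sugar_a": ("sugars", 0.1),
    "sugar_b": ("sugars", 0.1),
    "lipid_a": ("lipids", 0.1),
    "sugars": ("root", 0.9),
    "lipids": ("root", 0.9),
}


def faith_pd(observed_tips):
    """Total length of the branches spanned by the observed tips and the root."""
    visited = set()
    for tip in observed_tips:
        node = tip
        # Walk up to the root, stopping early on branches already counted.
        while node in tree and node not in visited:
            visited.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in visited)


# Both samples contain two features, but one spans two chemical classes.
pd_sugars_only = faith_pd(["sugar_a", "sugar_b"])
pd_mixed = faith_pd(["sugar_a", "lipid_a"])
print(f"two sugars: {pd_sugars_only:.2f}")
print(f"sugar + lipid: {pd_mixed:.2f}")
```

Both toy samples have the same feature count, yet the sample spanning sugars and lipids covers more of the tree, which is the kind of within-sample complexity such a measure is meant to capture.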
While we used molecular fingerprints predicted by CSI:FingerID to build chemical hierarchies here, this approach can be extended to incorporate other strategies to compare molecules for building chemical trees. For example, chemical relationships based on assigned chemical classes, spectral motifs, shared biosynthetic origin or other structural comparison methods could also be used as a basis for such a tree. These approaches will result in different tree topologies capturing complementary chemical information for subsequent analyses.