GSoC2021: extracting key metabolic processes

About the project

Twenty years after Markus W. Covert, Bernhard O. Palsson and their colleagues published their review on metabolic modelling of microbial strains, the value of this method is well established. From the beginning, metabolic modelling has been interwoven with constraint-based methods, and randomized sampling has proved its value within this framework over the years.

High-throughput sequencing technologies have made it possible to obtain genetic information (DNA) in a cost-efficient and easy way, especially for microbial species, whose genomes are typically no more than a few thousand genes long. Once the complete genome of a species is available, the metabolic network of the species can be reconstructed in full; such reconstructions are called genome-scale metabolic models (GEMs). Having such models for all the species present in a microbial community allows the study of their metabolic interactions and thus offers insight into the actual microbial interactions.

The aim of this project is to integrate the data and knowledge produced over these twenty (and more) years and to use randomized flux sampling to evaluate the metabolic interactions retrieved. To this end, thousands of publicly available reference microbial genomes will be selected and their metabolic networks will be reconstructed automatically. Based on these models, cross-feeding interaction algorithms will be run on groups of species to extract key metabolic processes. New functions, implementing the recently developed Multiphase Monte Carlo Sampling (MMCS) approach in the framework of the dingo library, will use randomized flux sampling to evaluate the processes retrieved.

Bonding period

Studying the literature

↘ Weeks 1-2

This project poses great challenges from both the computational and the biological point of view. It is therefore essential to study how one could address the challenge of extracting inter-species metabolic processes from the genome-scale metabolic reconstructions of the species present in a microbial community.
It was a fortunate coincidence that, just before this GSoC project started, a great review by Heinken et al. (2021) was published that describes, to a great extent, both the framework of our work and its challenges. Here is some extra related literature.

Setting up our working framework

↘ Week 3

A GitHub repository has already been set up for this project. The proposal can be found there, as well as the source code of this GitHub Pages blog. In addition, a fork of the main dingo GitHub repository was created. Pull Requests (PRs) will be opened from this fork to the main dingo repository.

Here are some key resources closely related to the metabolic modelling of microbial taxa that could be used in a framework such as the one of this GSoC project.

Conclusions after the bonding period

↘ overall

My interaction with microbiologists highlighted the challenges of the project. The scientific literature retrieved shows that there are strong ongoing efforts to address the inference of interspecies metabolic interactions.

After the bonding period I now have a specific plan for this project. I will model a microbial community using the same assumptions as micom, but instead of Flux Balance Analysis (FBA) I will apply the Multiphase Monte Carlo Sampling algorithm of dingo. This way, I will be able to investigate how the flux distribution of a reaction in one species affects the same or another reaction in another species of the community. To start with, we will work with hypothetical communities of 2 species.
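To make the idea of relating the sampled fluxes of two species more concrete, here is a minimal sketch under strong assumptions. It uses a toy two-variable polytope standing in for a 2-species community that competes for one nutrient, and it samples it with plain hit-and-run. It is not the MMCS algorithm and it does not use the dingo API; the constraint values and every function name below are hypothetical.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, rng, thin=10):
    """Draw approximately uniform points from {x : A x <= b}.

    x0 must be a strictly interior point and the polytope must be bounded.
    """
    x = list(x0)
    samples = []
    for step in range(n_samples * thin):
        # Pick a uniformly random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]

        # Find the segment x + t*d that stays inside every half-space.
        t_lo, t_hi = -math.inf, math.inf
        for row, bound in zip(A, b):
            ad = sum(r * c for r, c in zip(row, d))
            if abs(ad) < 1e-12:
                continue  # direction is parallel to this facet
            t = (bound - sum(r * c for r, c in zip(row, x))) / ad
            if ad > 0:
                t_hi = min(t_hi, t)
            else:
                t_lo = max(t_lo, t)

        # Move to a uniformly chosen point on that segment.
        step_len = rng.uniform(t_lo, t_hi)
        x = [xi + step_len * di for xi, di in zip(x, d)]
        if (step + 1) % thin == 0:
            samples.append(tuple(x))
    return samples


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (c - my) for a, c in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((c - my) ** 2 for c in ys)
    return cov / math.sqrt(vx * vy)


# Toy community (hypothetical numbers): v1 and v2 are the uptake fluxes of
# species 1 and 2, which share a nutrient supply of 10 units:
#   v1 >= 0, v2 >= 0, v1 + v2 <= 10
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [0.0, 0.0, 10.0]

rng = random.Random(0)
samples = hit_and_run(A, b, x0=[1.0, 1.0], n_samples=2000, rng=rng)
v1 = [s[0] for s in samples]
v2 = [s[1] for s in samples]
corr = pearson(v1, v2)
print(f"correlation between the two uptake fluxes: {corr:.2f}")
```

In this toy case the two fluxes come out negatively correlated, which reflects the competition for the shared nutrient (for an exactly uniform distribution on this triangle the correlation is -0.5). A single FBA optimum would not expose such a relationship, which is the motivation for sampling the whole flux space instead.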

Coding period

Get the models

↘ Weeks 1-4

Metabolic network reconstruction is far from an easy task. For bacteria, semi-automated approaches make things a bit easier; for example, carveme is a software tool that takes a genome as input and builds the corresponding metabolic model.
The GTDB genomes described in the proposal were thoroughly investigated, but since they include only a small number of genes and not the complete genome, they were not used for this project. Instead, I used the EMBL genome-scale models, after building a script that converts them from .xml to .mat format. A series of 5,587 models is now available and can be used in the darn framework.
In addition, genomes from JGI/IMG were retrieved and metabolic models were built from them using the carveme software. As carveme needs the IBM CPLEX Optimizer as a dependency, I used its academic edition. carveme returns .xml models, so the format conversion script was used again. More genomes from global repositories, such as JGI/IMG, will keep being added to the ongoing darn database. During this period, I also built a Google Colab notebook as a tutorial for the dingo library, which was presented at the Bioinformatics Open Source Conference (BOSC). My presentation there will soon be available on the BOSC YouTube channel too. You may find the notebook under the dingo tutorial notebook_#17 PR.
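The conversion script itself is not reproduced in this post. As a rough sketch of how an .xml to .mat conversion could look, the snippet below assumes that cobrapy (with scipy) is installed and uses its `read_sbml_model` and `save_matlab_model` functions. The helper names and directory names are hypothetical, and whether the resulting .mat files match the exact layout used for the models of this project is an assumption.

```python
from pathlib import Path


def mat_path_for(xml_path, out_dir):
    """Return the .mat path that corresponds to an .xml model file."""
    return Path(out_dir) / (Path(xml_path).stem + ".mat")


def convert_sbml_to_mat(xml_path, out_dir):
    """Read an SBML (.xml) model and write it in MATLAB (.mat) format."""
    # Imported here so the path helper works without cobrapy installed.
    from cobra.io import read_sbml_model, save_matlab_model

    target = mat_path_for(xml_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    model = read_sbml_model(str(xml_path))
    save_matlab_model(model, str(target))
    return target


def convert_directory(xml_dir, out_dir):
    """Convert every .xml file found directly under xml_dir."""
    xml_files = sorted(Path(xml_dir).glob("*.xml"))
    return [convert_sbml_to_mat(path, out_dir) for path in xml_files]


# Hypothetical file and directory names, only to show the name mapping.
print(mat_path_for("models/e_coli_core.xml", "mat_models"))
```

Calling `convert_directory("models", "mat_models")` would then process a whole folder of carveme or EMBL models in one go.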

How to validate our future outcome?

↘ Weeks 5-6

As described in the proposal, finding a way to validate the outcome of the new dingo module is probably the hardest step. The SMETANA software, which was initially meant to be used for this, turned out not to fit the task. Several ongoing approaches aim to address this challenge, such as the recently published Metage2Metabo software tool (publication and GitHub repo). I will use this software in the near future to compare its findings with those returned by the dingo module I am working on.

Sample on the flux space of pairs of bacteria

↘ Weeks 7-9

During these 3 weeks, I developed the new dingo module as described in my proposal. To this end, I built 2 new classes following the scheme of dingo: the first one, called CommunityMetabolicNetwork, builds a community metabolic model from multiple metabolic networks, and the second, called CommunityPolytopeSampler, gets the polytope derived from that model and performs flux sampling on it. The new module can be invoked both from a terminal and from a Python console through the dingo library. For more, you may check the Sampling on the flux space of multiple metabolic networks_#18 PR. On the same PR, you will also find the script and the models built during Weeks 1-4 of this GSoC project.
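As an illustration of what building a community model from multiple networks can involve, here is a minimal sketch of one common construction: the stoichiometric matrices of the species are merged so that intracellular metabolites stay private to each species, while extracellular metabolites share a single row. This is not the code of CommunityMetabolicNetwork; the dictionary layout, the function name, the toy models and the use of the `_e` suffix to mark shared metabolites are all assumptions made for the example.

```python
def build_community_matrix(models):
    """Merge per-species stoichiometric matrices into one community matrix.

    Returns (row_names, col_names, S), where S is a list of rows.
    """
    row_index = {}   # metabolite name -> row number
    col_names = []   # one column per (species, reaction)
    entries = {}     # (row, col) -> stoichiometric coefficient
    for species, model in models.items():
        for j, rxn in enumerate(model["reactions"]):
            col = len(col_names)
            col_names.append(f"{species}__{rxn}")
            for i, met in enumerate(model["metabolites"]):
                # Extracellular metabolites are shared; all others stay private.
                name = met if met.endswith("_e") else f"{species}__{met}"
                row = row_index.setdefault(name, len(row_index))
                coeff = model["S"][i][j]
                if coeff != 0:
                    entries[(row, col)] = coeff
    row_names = list(row_index)
    S = [[entries.get((r, c), 0) for c in range(len(col_names))]
         for r in range(len(row_names))]
    return row_names, col_names, S


# Two hypothetical toy species: sp1 turns glucose into acetate and exports it,
# sp2 imports the acetate.
models = {
    "sp1": {
        "metabolites": ["glc_e", "ac_c", "ac_e"],
        "reactions": ["glc_to_ac", "ac_export"],
        "S": [[-1, 0], [1, -1], [0, 1]],
    },
    "sp2": {
        "metabolites": ["ac_e", "ac_c"],
        "reactions": ["ac_import"],
        "S": [[-1], [1]],
    },
}

row_names, col_names, S = build_community_matrix(models)
for name, row in zip(row_names, S):
    print(f"{name:10s} {row}")
```

The shared `ac_e` row couples the export reaction of sp1 with the import reaction of sp2, which is how a cross-feeding link shows up in the merged matrix. The steady-state condition on this matrix, together with flux bounds, defines the kind of polytope that is then sampled.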

Tests and documentation

↘ Week 10

During the last week, tests were run to check how the new module responds to various example cases. An example directory with a community of 2 models was added under the ext_data directory. Instructions on how to use the new module were added to the Google Colab notebook.

Summary of my GSoC work

dingo database of metabolic models in .mat format init

More than 6,000 models, most of them converted from the EMBL GEMs

dingo and how to contribute tutorials

You may find the notebook at this link. My BOSC presentation will soon be available on the community's YouTube channel. The notebook and a how-to-contribute tutorial can be found under PR_#17.

Sampling on the flux space of multiple metabolic networks

The module can be found under PR_#18.

Future work

This GSoC project has been a great opportunity for me to work on the dingo library, focusing on the inference of metabolic interactions at the microbial community level. The work is ongoing and will be developed further over the coming months as part of my PhD. A publication on dingo and its applications to metabolic interaction inference will also be considered.

Additional Links

Social Links

Github: https://github.com/hariszaf/gsoc2021
Twitter: https://twitter.com/haris_zaf
LinkedIn: https://www.linkedin.com/in/haris-zaf-44555711a/
Website: https://hariszaf.github.io/