This project poses great challenges from both the computational and the biological point of view. It is therefore essential to study how one could extract inter-species metabolic processes from the genome-scale metabolic reconstructions of the species present in a microbial community.
It was a fortunate coincidence that, just before this GSoC project started, Heinken et al. (2021) published an excellent review that describes, to a great extent, both the framework of our work and its challenges. Here is some extra related literature.
A GitHub repository has already been set up for this project. The proposal can be found there, as well as the source code of this GitHub Pages blog. In addition, a fork of the main dingo GitHub repository was created. Pull Requests (PRs) will be opened from this fork against the main dingo repository.
Here are some key resources closely related to the metabolic modelling of microbial taxa that could be used in a framework like the one of this GSoC project.
My interaction with microbiologists highlighted the challenges of the project. The scientific literature I retrieved shows that there are strong ongoing efforts to infer interspecies metabolic interactions.
After the bonding period, I now have a specific plan for this project. I will model a microbial community using the same assumptions as micom but, instead of Flux Balance Analysis (FBA), I will apply the Multiphase Monte Carlo Sampling algorithm of dingo. This way, I will be able to investigate how the flux distribution of a reaction in one species affects the same or another reaction in another species of the community. For starters, we will work with hypothetical communities of 2 species.
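To make the idea of a community model more concrete, here is a minimal sketch of how two single-species stoichiometric matrices could be merged into one community matrix. It assumes that metabolites whose id ends in `_e` live in a shared extracellular compartment, while all other metabolites stay private to their species. The function name `build_community_matrix`, the `species__id` naming scheme and the toy networks are hypothetical and only serve the illustration; this is not the actual code of micom or dingo.

```python
import numpy as np

def build_community_matrix(species):
    """Merge per-species stoichiometric matrices into one community matrix.

    species: dict mapping a species name to (S, metabolite_ids, reaction_ids).
    Assumption: metabolite ids ending in "_e" are shared by all species.
    """
    def row_id(name, met):
        # Shared metabolites keep their id; private ones get a species prefix.
        return met if met.endswith("_e") else f"{name}__{met}"

    row_ids, col_ids = [], []
    for name, (_, mets, rxns) in species.items():
        for met in mets:
            rid = row_id(name, met)
            if rid not in row_ids:
                row_ids.append(rid)
        col_ids.extend(f"{name}__{rxn}" for rxn in rxns)

    row_index = {rid: i for i, rid in enumerate(row_ids)}
    community = np.zeros((len(row_ids), len(col_ids)))

    offset = 0
    for name, (S, mets, rxns) in species.items():
        S = np.asarray(S, dtype=float)
        for i, met in enumerate(mets):
            # Each species fills its own block of columns.
            community[row_index[row_id(name, met)], offset:offset + len(rxns)] = S[i]
        offset += len(rxns)
    return community, row_ids, col_ids

# Toy example: both species take up glucose from the shared medium.
toy = {
    "sp1": ([[-1, 0], [1, -1]], ["glc_e", "pyr_c"], ["uptake", "sink"]),
    "sp2": ([[-1, 0], [1, -1]], ["glc_e", "pyr_c"], ["uptake", "sink"]),
}
community, row_ids, col_ids = build_community_matrix(toy)
print(row_ids)
print(community)
```

The shared `glc_e` row is what couples the two species: any flux vector of the community has to balance that metabolite across both of them.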
Metabolic network reconstruction is far from an easy task. In the case of bacteria, semi-automated approaches make things a bit easier. For example, carveme is a software tool that takes a genome as input and builds the corresponding metabolic model.
The GTDB genomes described in the proposal were thoroughly investigated; however, since they include only a small number of genes and not the complete genome, they were not used for this project. Instead, I used the EMBL genome-scale models, after building a script that converts them from .xml to .mat format. A series of 5,587 models is now available and can be used in the darn framework.
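For readers who want to try a similar conversion, the following is a minimal sketch and not the script used in this project. It assumes that COBRApy (together with SciPy, which COBRApy needs for MATLAB export) is installed, and relies on its `read_sbml_model` and `save_matlab_model` functions. The helper names `mat_path_for`, `convert_sbml_to_mat` and `convert_directory`, as well as the directory names, are hypothetical.

```python
from pathlib import Path

def mat_path_for(xml_path, out_dir):
    """Return the .mat path that mirrors the name of an SBML (.xml) file."""
    return Path(out_dir) / (Path(xml_path).stem + ".mat")

def convert_sbml_to_mat(xml_path, out_dir):
    """Read one SBML model and write it to out_dir in MATLAB format."""
    # Imported here so that the path helper works without COBRApy installed.
    import cobra.io

    model = cobra.io.read_sbml_model(str(xml_path))
    target = mat_path_for(xml_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    cobra.io.save_matlab_model(model, str(target))
    return target

def convert_directory(in_dir, out_dir):
    """Convert every .xml file found directly under in_dir."""
    return [convert_sbml_to_mat(p, out_dir) for p in sorted(Path(in_dir).glob("*.xml"))]

# Example call (directory names are placeholders):
# convert_directory("embl_gems_xml", "embl_gems_mat")
```

Keeping the output file name identical to the input name, apart from the extension, makes it easy to trace each .mat file back to its source model.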
In addition, genomes from JGI/IMG were retrieved and metabolic models were built from them using the carveme software. As carveme requires the IBM CPLEX Optimizer as a dependency, I used its academic edition. carveme returns .xml models, so the model conversion script was used again. More genomes from global repositories, such as JGI/IMG, will keep being added to the ongoing darn database. During this period, I also built a Google Colab notebook as a tutorial for the dingo library, which was presented at the Bioinformatics Open Source Conference (BOSC). My presentation there will soon be available on the BOSC YouTube channel too. You may find the notebook under the dingo tutorial notebook_#17 PR.
As described in the proposal, finding a way to validate the outcome of this new dingo module is probably the hardest step. The SMETANA software, which was initially meant to be used for this task, turned out not to be suitable for it. Several ongoing approaches aim to address this challenge, such as the recently published Metage2Metabo software tool (publication and GitHub repo). I will use this software in the near future to compare its findings with those returned by the dingo module I am working on.
During these 3 weeks, I developed the new dingo module as described in my proposal. To this end, I built 2 new classes following the design of dingo: the first one, called CommunityMetabolicNetwork, builds a community metabolic model from multiple metabolic networks, and the second, called CommunityPolytopeSampler, takes the polytope derived from that model and performs flux sampling on it. The new module can be invoked both from a terminal and from a Python console through the dingo library. For more, you may check the Sampling on the flux space of multiple metabolic networks_#18 PR. On the same PR, you will also find the script and the models built during Weeks 1-4 of this GSoC project.
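To illustrate what flux sampling on a community polytope means, here is a toy sketch that uses a plain hit-and-run random walk. This is a deliberately simple stand-in, not the Multiphase Monte Carlo Sampling algorithm of dingo, and it does not use the CommunityMetabolicNetwork or CommunityPolytopeSampler classes. The two-species network, the bounds, the starting point and the function names are assumptions made only for this example.

```python
import numpy as np

def null_space(S, tol=1e-10):
    """Orthonormal basis of {v : S v = 0}, taken from the SVD of S."""
    _, sing, vt = np.linalg.svd(S)
    rank = int(np.sum(sing > tol))
    return vt[rank:].T

def hit_and_run(S, lb, ub, start, n_samples=500, seed=0):
    """Sample flux vectors with S v = 0 and lb <= v <= ub by hit-and-run.

    `start` must be a feasible point strictly inside the bounds.
    """
    rng = np.random.default_rng(seed)
    basis = null_space(np.asarray(S, dtype=float))
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    x = np.asarray(start, dtype=float).copy()
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        # A random direction inside the null space keeps S @ x = 0.
        d = basis @ rng.standard_normal(basis.shape[1])
        pos, neg = d > 1e-12, d < -1e-12
        # Largest and smallest step t with lb <= x + t * d <= ub.
        t_max = min(np.min((ub - x)[pos] / d[pos], initial=np.inf),
                    np.min((lb - x)[neg] / d[neg], initial=np.inf))
        t_min = max(np.max((lb - x)[pos] / d[pos], initial=-np.inf),
                    np.max((ub - x)[neg] / d[neg], initial=-np.inf))
        x = x + rng.uniform(t_min, t_max) * d
        samples[i] = x
    return samples

# Toy community: a medium supply feeds a shared metabolite M_e that two
# species take up (r1, r3) and then consume internally (r2, r4).
# Rows: M_e, M_1, M_2. Columns: r0 (supply), r1, r2, r3, r4.
S = np.array([[1, -1, 0, -1, 0],
              [0, 1, -1, 0, 0],
              [0, 0, 0, 1, -1]], dtype=float)
lb, ub = np.zeros(5), np.full(5, 10.0)
samples = hit_and_run(S, lb, ub, start=[2, 1, 1, 1, 1])

# Correlation between the consumption fluxes of the two species.
print(np.corrcoef(samples[:, 2], samples[:, 4])[0, 1])
```

Looking at how sampled fluxes of reactions in different species co-vary, for example through a correlation like the one printed here, is one simple way to explore how a reaction in one species relates to a reaction in another.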
During the last week, tests were run to check how the new module responds to various example cases. An example directory with a community of 2 models was added under the ext_data directory. Instructions on how to use the new module were added to the Google Colab notebook.
More than 6,000 models, most of them converted from the EMBL GEMs.
You may find the notebook at this link. My BOSC presentation will soon be available on the community's YouTube channel. The notebook and a how-to-contribute tutorial can be found under PR_#17.
The module can be found under PR_#18.
This GSoC project has been a great opportunity for me to work on the dingo library, focusing on the inference of metabolic interactions at the microbial community level. The project is ongoing and will be further developed in the coming months as part of my PhD. A publication on dingo and its applications to the inference of metabolic interactions will also be considered.