Frequently Asked Questions

Setting the parameters right
I can’t get an annotated network, despite having correct input files and well-tuned parameters
Frozen Sending data to server pop-up
GTDB and microbetagDB versioning
How to read a KEGG map with pathway complementarities?
I am trying to run microbetag locally, but..
I have a really large 16S-oriented network. What can I do?
Reconstructing GEMs locally
Can’t import files on my macOS

Setting the parameters right

Choose input type

It should be clear by now that you always need to load your abundance table. If you want microbetag to infer a co-occurrence network using FlashWeave, set this parameter to abundance_table. If you already have a network and have loaded it to MGG as described in the relevant tutorial, set this parameter to network so that microbetag annotates the provided network.

Which taxonomy scheme to choose?

For microbetag to return the best possible annotations, it is essential to map the sequences in your abundance table to their corresponding GTDB genomes as accurately as possible. Four taxonomy schemes are supported:

GTDB: in case you have used GTDB-tk to taxonomically annotate your bins
Silva: in case you have used DADA2 along with the 7-level version of Silva they support
microbetag_prep: in case you have amplicon data and would like to use our implementation for mapping your OTUs/ASVs directly to GTDB genomes, using the IDTAXA algorithm of the DECIPHER package with the 16S genes of the GTDB genomes as a reference database. For large datasets (>1000 sequences) see also
other: in case you want to use your taxonomies and get the closest NCBI Taxonomy names included in the microbetagDB (using the fuzzywuzzy Python library)
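
The name matching behind the other scheme can be illustrated with a minimal sketch. It uses `difflib` from the Python standard library as a stand-in for fuzzywuzzy, and the list of names, the function `closest_name` and the cutoff value are all assumptions made for illustration; this is not microbetag's actual code.

```python
from difflib import get_close_matches

# Hypothetical subset of NCBI Taxonomy names; the real list lives in microbetagDB.
KNOWN_NAMES = [
    "Escherichia coli",
    "Bacteroides fragilis",
    "Faecalibacterium prausnitzii",
]

def closest_name(query, candidates=KNOWN_NAMES, cutoff=0.8):
    """Return the candidate most similar to `query`, or None if nothing is close enough."""
    # Compare case-insensitively, but return the name with its original casing.
    lowered = {name.lower(): name for name in candidates}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(closest_name("Eschericia coli"))   # misspelled genus still finds a match
print(closest_name("Unknownus taxon"))   # nothing similar enough
```

A stricter cutoff reduces wrong matches but leaves more taxa unmapped, which is one reason the microbetag_prep route tends to give better annotations.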

When to enable the `sensitive` and `heterogeneous` arguments?

Both these parameters are arguments of the FlashWeave software. Therefore, they are only available if you have selected abundance_table as your input type.

The core idea for FlashWeave is to distinguish direct and indirect associations between the taxa of an abundance table. More specifically, for each target variable \(T\) (OTU/ASV or a metavariable), FlashWeave tries to infer its directly associated neighborhood, meaning the set of neighbor variables that renders all remaining variables probabilistically independent of \(T\).

To this end, statistical tests for conditional independence iteratively remove indirect edges. In each iteration, a pair of taxa is tested for direct association: a conditioning set of other taxa is used to check whether the association between the two taxa under study disappears, or weakens significantly, once those other taxa are taken into account.
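
The idea of an association that vanishes under conditioning can be shown with a small sketch. It uses a first-order partial correlation on simulated data in which taxa A and B both follow a third taxon C; the data, the variable names and the choice of partial correlation are illustrative assumptions, not FlashWeave's implementation.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated (not real) data: A and B are both driven by C plus independent noise.
rng = random.Random(0)
taxon_c = [rng.gauss(0, 1) for _ in range(500)]
taxon_a = [c + rng.gauss(0, 0.5) for c in taxon_c]
taxon_b = [c + rng.gauss(0, 0.5) for c in taxon_c]

marginal = pearson(taxon_a, taxon_b)
conditional = partial_corr(taxon_a, taxon_b, taxon_c)
print(f"marginal r(A, B)    = {marginal:.2f}")
print(f"partial  r(A, B | C) = {conditional:.2f}")
```

The marginal correlation between A and B is strong, yet it largely disappears once C is conditioned on, so the A-B edge would be treated as indirect.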

Apparently, the abundance table is all FlashWeave cares about! The figure below comes from the Supplemental Information of the FlashWeave paper.

[Figure: four_cases]

The sensitive modes of FlashWeave use the full abundance information (“continuous”), while the fast modes (i.e., when the sensitive parameter is not selected) work on discretized abundances. As a rule of thumb, the sensitive mode requires more computing resources and time, so it may cause the online version of microbetag to time out. In that case, you can always run the microbetag_prep steps to infer the network locally. In other cases, however, especially when the number of samples is low, not choosing the sensitive mode can also lead to errors: FlashWeave may fail to infer any association at all, so there is no network for microbetag to annotate!

[Figure: sens_hetero]

The heterogeneous module (FlashWeaveHE) makes the assumption that zeroes in large, heterogeneous data sets are mostly structural. Thus, it only considers samples in which both OTUs/ASVs have a non-zero abundance as reliable for association prediction. Zero elements are excluded from association computations. However, this restriction only affects the potential association partners being tested: OTUs in the conditioning set keep their absences.
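
The sample restriction described above can be sketched in a few lines. The abundances are toy values and `co_present_samples` is a hypothetical helper; the snippet only illustrates which samples remain for a pair of taxa, not how FlashWeaveHE is implemented.

```python
def co_present_samples(abund_a, abund_b):
    """Keep only the samples in which both taxa have a non-zero abundance."""
    return [(a, b) for a, b in zip(abund_a, abund_b) if a > 0 and b > 0]

# Toy abundances of two OTUs across six samples.
otu_1 = [12, 0, 7, 0, 3, 9]
otu_2 = [5, 4, 0, 0, 8, 2]

pairs = co_present_samples(otu_1, otu_2)
print(len(pairs), "of", len(otu_1), "samples are used:", pairs)
```

Only half of the samples survive in this toy case, which shows why the heterogeneous mode can leave too little data to infer any edge in a small dataset.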

The FlashWeaveHE approach may lose some sensitivity, but in large datasets this loss has been found to be rather small, while the savings in computing time are significant. Yet, if applied to small datasets, it can also result in no network being inferred, and therefore cause microbetag to fail.

What is the Consider children taxa parameter?

This parameter is only valid if the taxonomy database selected is Other. In this case, microbetag tries to map the taxonomies provided to their closest NCBI Taxonomy names. There is a chance that your OTU/ASV has been assigned to a species for which no genome is present in microbetagDB, while genomes of strains of that species are. By enabling this parameter, microbetag will consider those genomes and use them in the subsequent annotation steps.
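
A minimal sketch of what considering children taxa could look like is given below. All identifiers are hypothetical, and the fallback logic (strain genomes are used only when the species itself has none) is an assumption made for illustration, not a description of microbetag's internals.

```python
# Hypothetical identifiers: species -> child strains, and taxa that have genomes.
CHILDREN = {"species_X": ["strain_X1", "strain_X2"], "species_Y": []}
GENOMES = {"strain_X1": ["genome_001"], "strain_X2": ["genome_002", "genome_003"]}

def genomes_for(taxon, consider_children=False):
    """Return genomes assigned to `taxon`, optionally falling back to its child strains."""
    found = list(GENOMES.get(taxon, []))
    if not found and consider_children:
        # Assumed behaviour: collect the genomes of every strain below the species.
        for child in CHILDREN.get(taxon, []):
            found.extend(GENOMES.get(child, []))
    return found

print(genomes_for("species_X"))                          # no genome at species level
print(genomes_for("species_X", consider_children=True))  # strain genomes are picked up
```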

As a reminder, to get as many and as accurate annotations as possible, it is always good practice to run the microbetag_prep step instead of using the Other taxonomy.

I can’t get an annotated network, despite having correct input files and well-tuned parameters

This may be caused by the time limit our server imposes on a single run. Even when the number of sequence identifiers is below 1000, a combination of other factors can lead to time-consuming runs that end in an error. For example, you may have set taxonomy to other and also chosen sensitive as a parameter for the network inference. Such a scenario often leads to time-out errors. In this case, you should first run the microbetag-prep step and use its output as your input files. You may also have a small number of taxa but a very large number of samples. In this case, you will again get a time-out error if you enable the sensitive parameter of FlashWeave. It is always good practice to run the preparation step locally, so that you also have a better overview of the network you will then ask microbetag to annotate. Remember that the focus of microbetag is on annotating a network, not on building one.

Frozen `Sending data to server` pop-up

When a time-out error has occurred, we have observed that the pop-up box showing the progress of your query sometimes keeps reporting that your data are being processed. If that happens, you need to kill the Cytoscape process. For example, on a Linux system, you can look up the PID of Cytoscape in `htop` and then run `kill <PID>`.
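
The same lookup can be scripted. The sketch below parses the output of `ps -A -o pid,args` and, by default, only reports which processes it would terminate; the keyword, the sample listing and both function names are assumptions made for illustration.

```python
import os
import signal
import subprocess

def find_pids(ps_output, keyword="cytoscape"):
    """Parse `ps -A -o pid,args` output and return PIDs whose command mentions `keyword`."""
    pids = []
    for line in ps_output.splitlines()[1:]:          # skip the header row
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and keyword.lower() in parts[1].lower():
            pids.append(int(parts[0]))
    return pids

def kill_cytoscape(dry_run=True):
    """List (dry_run=True) or terminate (dry_run=False) processes that look like Cytoscape."""
    listing = subprocess.run(["ps", "-A", "-o", "pid,args"],
                             capture_output=True, text=True, check=True).stdout
    targets = [pid for pid in find_pids(listing) if pid != os.getpid()]
    for pid in targets:
        if dry_run:
            print("would terminate PID", pid)
        else:
            os.kill(pid, signal.SIGTERM)             # same signal as a plain `kill <PID>`
    return targets

# Made-up listing used to demonstrate the parsing step without touching real processes.
SAMPLE = "    PID COMMAND\n   4321 java -jar /opt/cytoscape/framework.jar\n   5000 htop\n"
print(find_pids(SAMPLE))
```

Calling `kill_cytoscape()` first and checking the reported PIDs before calling `kill_cytoscape(dry_run=False)` avoids terminating an unrelated process whose command line happens to contain the keyword.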

GTDB and microbetagDB versioning

GTDB releases a new version once a year, usually in April, and each release increases the number of genomes considerably. Even though the number of high-quality genomes does not increase that fast, the number of genome pairs microbetagDB has to store grows much faster, roughly with the square of the number of genomes. We intend to develop a new feature to export new reference genomes and run precalculations over them once a year, following GTDB versioning. Yet, this is still work in progress.
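
To make the growth in stored pairs concrete, here is a small arithmetic sketch. It assumes that every ordered (donor, beneficiary) pair of distinct genomes is counted, and the genome counts are arbitrary round numbers rather than actual GTDB or microbetagDB figures.

```python
def ordered_pairs(n_genomes):
    """Number of (donor, beneficiary) pairs among n genomes, excluding self-pairs."""
    return n_genomes * (n_genomes - 1)

for n in (1_000, 10_000, 100_000):   # arbitrary sizes, not actual GTDB counts
    print(f"{n:>7} genomes -> {ordered_pairs(n):>14,} pairs")
```

A tenfold increase in genomes thus means roughly a hundredfold increase in pairs to precalculate and store.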

How to read a KEGG map with pathway complementarities?

Complementarities should always be considered as potential. The fact that a complementarity could occur based on the genomes of a species pair is no evidence that it is actually happening. To argue for such a case, one would need experimental data.

Regarding pathway complementarities, one needs to keep in mind that each pathway is evaluated without taking into account what happens in the others. Thus, when both \(pathway_A\) and \(pathway_B\) yield the same end-products, and \(species_A\) has a complete alternative for \(pathway_A\) but not for \(pathway_B\), while \(species_B\) could potentially complete \(pathway_B\), microbetag would return this potential complementarity, even though \(species_A\) does not need it since it obtains the end-products on its own through \(pathway_A\).

The number of KOs required for a complementarity to happen (the number of KOs in the Complement column) is also indicative of its likelihood. If only one KO term needs to be provided by the donor species to the beneficiary, the complementarity is more likely to occur than when several KOs are required. In addition, if seed complementarities are also available for a species pair, you can combine information from both types of complementarities. If a seed is close to the pathway reported in your pathway complementarities, this adds some extra confidence that the latter occurs.
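
The KO-count heuristic can be applied by sorting the complementarities of a species pair by the number of KOs the donor would have to provide. In the sketch below, the row layout, the `;` separator and the placeholder identifiers are assumptions; the actual format of the Complement column may differ.

```python
# Hypothetical rows: (pathway identifier, KOs the donor would need to provide).
rows = [
    ("module_A", "KO1;KO2;KO3"),
    ("module_B", "KO4"),
    ("module_C", "KO5;KO6"),
]

def n_kos(complement, sep=";"):
    """Count the KO terms listed in a Complement entry."""
    return len([ko for ko in complement.split(sep) if ko.strip()])

# Fewest required KOs first: these are the more plausible complementarities.
ranked = sorted(rows, key=lambda row: n_kos(row[1]))
for pathway, complement in ranked:
    print(pathway, n_kos(complement))
```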

I am trying to run `microbetag` locally, but..

When running microbetag locally (see the relevant tutorial), you may start from a wide range of input files. You may start with nothing but your bins, i.e., sequence files, one for each bin mentioned in your abundance table. Alternatively, you may have already annotated them with KEGG ORTHOLOGY terms. You may also have already reconstructed GEMs on your own. You can adapt your microbetag run by pointing to these files in the config.yml file you have to provide as input. However, we cannot guarantee that the output of whatever software you used, for example to annotate your bins, will match what microbetag expects. If you have a large number of bins and would prefer not to annotate them again with microbetag, you may check the format of the annotations provided in the example case and see whether you can convert yours to it. In any case, we strongly suggest you reach out to the microbetag community on Matrix with any specific questions.

I have a really large 16S-oriented network. What can I do?

There are three things you could do in this case. First, you can try to build a database with the closest genomes you can find for the strains present in your data. If you do so, you can then run microbetag locally using those genomes as if they were your bins.

Second, you could build a local instance of microbetagDB. This would require ~700 GB of storage. We are currently working on an efficient way to support this.

Third, you can get the annotations per species pair using the microbetag API. However, in this case, you will not have a .cx file as an end product, i.e., you will not have a single file that you can then load in Cytoscape and view through the MGG features.

Reconstructing GEMs locally

When you try to reconstruct genome-scale metabolic models (GEMs) from your own genomes/bins/MAGs with the modelseedpy library, as shown in the relevant tutorial, there is a good chance that you will keep getting messages like:

Recursive run for model_id: /data/my_faa/bin_101

This can lead to very long run times, especially as the number of your genomes increases. The reason is that modelseedpy requires RAST-annotated genomes and thus needs to establish a connection to the RAST server. Unfortunately, we have observed that this connection is not always stable.
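
If you script the reconstruction yourself, one generic way to cope with an unstable connection is to retry with increasing waiting times. The sketch below is not part of microbetag or modelseedpy: `with_retries` and `flaky_annotation` are hypothetical, and the assumption that a failed connection surfaces as a `ConnectionError` may not hold for your setup.

```python
import time

def with_retries(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds, waiting twice as long after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ConnectionError as err:               # assumed failure type
            if attempt == attempts:
                raise                                # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)

# Stand-in for a step that needs the RAST server: fails twice, then succeeds.
calls = {"n": 0}

def flaky_annotation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server unreachable")
    return "annotated"

# The demo passes a no-op sleep so that it finishes instantly.
result = with_retries(flaky_annotation, sleep=lambda seconds: None)
print(result, "after", calls["n"], "attempts")
```

Processing one genome per call in this way also means that a failure late in a long batch does not force the earlier genomes to be redone.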

Can’t import files on my macOS

Make sure you are not using aliases that point to the files you need to import. You may sometimes be using aliases without knowing it: for example, when you drag and drop a file in Finder, you may create a shortcut to the file there, while the file itself stays in its original location. Make sure you use the actual path and not the shortcut when you import a file in MGG.