Run microbetag
by making use of your own bins/MAGs or GENREs
v1.0.2
This tutorial is for advanced users who have some basic experience working on a terminal. Contrary to the previous cases, this scenario is not performed from within the Cytoscape app. The user needs to run `microbetag` first on their computing environment (personal computer, HPC, etc.) and then load the returned annotated network into Cytoscape.
In the Cytoscape app tutorial, our sequences were already taxonomically assigned before running `microbetag`, and their taxonomies were mapped to representative GTDB genomes; `microbetag` then used these genomes for the annotation steps. However, in the case of shotgun metagenomics, one may end up with their own bins, and further refinement of these bins can lead to Metagenome-Assembled Genomes (MAGs). High-quality MAGs, i.e. those with high completeness and low contamination, can be used directly for the annotation steps of `microbetag`. Yet, this requires far more computing resources and time than our web server can support.

Thus, we provide a version of `microbetag` as a stand-alone, containerized tool, so that users can annotate a co-occurrence network using their own bins/MAGs. To do that, you first need to make sure you have either Docker or Singularity/Apptainer on the computing system you will use to run `microbetag`. The latter is common on HPC systems; if you are about to use such a system, ask your admin for more information.
On top of the abundance table and your genomes/bins/MAGs, for this scenario you need:

- Docker / Singularity (containerization technology)
- the `microbetag` image matching the containerization technology you are using (see below for how to get `microbetag` as a Docker or a Singularity image)
- the `config.yml` file, where you set the parameters for how to run `microbetag`
Running `microbetag` using your own genomes/bins/MAGs requires significant computing time and/or resources. In this tutorial, we will use a very small number of bins (7) to showcase the various steps `microbetag` implements. Yet, it still takes more than a couple of hours to go through all the different steps and get all the possible supported annotations. In our experience, memory (RAM) requirements should not be a challenge; memory would be an issue only with really large networks. `microbetag` is more often than not thread-limited, i.e. it needs computing power to go through the annotation steps.
INPUT FILES USED IN THIS TUTORIAL
A complete example of running `microbetag` locally, using the `modelseedpy` library for GEM reconstruction and with all the intermediate files produced, can be found in the `dev_io_microbetag` folder of the `user-bins` branch on the GitHub repo. In the initial run, there are only 3 input files:

- the `config.yml` file, which allows you to set all the relevant parameters for `microbetag` to run,
- an abundance table (following the format of the Cytoscape app tutorial) called `thirty_Samples.tsv`, and
- its corresponding edge list (`edgelist.csv`).

Remember!
The config and the abundance table files are mandatory. Always keep those (and the edge list, if available) in the root of your input/output folder, i.e. in the path you set as your `io_path` in the `config.yml` file.
Input and `config.yml` files

The `config.yml` file is rather important, as it is the one that allows you to set up your `microbetag` run. A number of the parameters there correspond to tools that are invoked, while others have to do with alternative routes that `microbetag` can follow for the annotation of the network. Read the description of each argument carefully before setting a value. Here, we highlight some of them.
- `abundance_table_file`: path to your abundance table. The abundance table needs to follow the instructions for any abundance table to be used with `microbetag`, i.e. sequence identifiers in the first column, sample names in the first row, and a 7-level taxonomy in the last column; of course, you may provide the output of the `microbetag` preprocessing step as an abundance table.
- `input_type_for_seed_complementarities`: this is a key parameter for running `microbetag` locally. Based on whether you have already annotated your genomes (either using other software or from previous runs of `microbetag`), you can use different input files as the starting point for getting the seed complementarities. The `sequence_files_for_reconstructions` parameter is strongly related to this. For example, if you already have GEMs reconstructed from your genomes, you may set `input_type_for_seed_complementarities` to `models` and then provide the folder name with your GEMs in the `sequence_files_for_reconstructions` parameter (e.g. `my_xmls`). Likewise, if you do not have GEMs but you already have RAST annotations, you may set `input_type_for_seed_complementarities` to `proteins_faa` and give the path to those in the `sequence_files_for_reconstructions` parameter.
- `seed_complementarity`: since this is the most time- and resource-consuming step, the user may choose not to go for it. By setting this to `False`, none of the steps for GEM reconstruction or seed complementarity inference will be performed.
- `flashweave_args`: all the arguments under this umbrella term relate to how `FlashWeave` will perform; check the FAQs but also the FlashWeave GitHub repo for more.
Please go through the parameters of the `config.yml` file carefully, and make sure you keep this file in your `io_path`.
Output
In the `config.yml` file, you can set the location of your input files using the `io_path` parameter. Additionally, you can specify the `output_directory`, which is the name of the folder that will be created within the `io_path` to store all the output files generated by `microbetag`. Here we discuss the folders and the files you will find under the `output_directory`. We do not always follow the order in which the files are generated.
The annotated `.cx` network file

The main output file (end product) of `microbetag` can be found in the `output_directory` you set in your `config.yml` file: the `microbetag`-annotated network, called `microbetag_annotated_network.cx`. This is the file you need to load into Cytoscape; then, after enabling the MGG visual style and the MGG results panel, you can investigate your annotated network! This file is in `.cx2` format.
FAPROTAX
A folder called `faprotax` is made, containing a subfolder called `sub_tables` and a file called `functional_otu_table.tsv` with the sum of the abundances of the taxa found with a specific process in each sample. In the `sub_tables` folder, a file for each process is available, mentioning the genomes/bins found to be related with the process under study and their relative abundance per sample.

For example, the `aerobic_nitrite_oxidation.txt` file looks like:
| record | seqId | sample1 | sample2 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 | sample11 | sample12 | sample13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Nitrobacter;s__Nitrobacter sp001896955 | bin_32 | 43 | 10 | 56 | 73 | 9 | 58 | 54 | 46 | 9 | 40 | 42 | 81 |
| d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Nitrobacter;s__Nitrobacter sp001897285 | bin_223 | 47 | 87 | 69 | 64 | 25 | 95 | 40 | 71 | 16 | 78 | 52 | 40 |
phenDB-like
- `train.genotype` file: this is the output of the `phenotrex` program, annotating your genomes/bins with COG families using the latest
- `predictions` folder: within this folder, a file with the predictions for the presence/absence of each trait in each of our genomes, along with a confidence score, is available. For example, the `symbiont.prediction.tsv` file in our test case looks like:
The file starts with a `# Trait: Symbiont` header line, followed by:

| Identifier | Trait present | Confidence |
|---|---|---|
| bin_101.fa | NO | 0.643 |
| bin_151.fa | NO | 0.6395 |
| bin_19.fa | NO | 0.869 |
| bin_38.fa | NO | 0.8678 |
| bin_41.fa | NO | 0.7842 |
| bin_45.fa | YES | 0.7954 |
| bin_48.fa | NO | 0.8545 |
ORFs
`microbetag` invokes `prodigal` to extract Open Reading Frames (ORFs). It creates a folder called `ORFs` in the `output_directory`, and for each genome/bin it returns 3 files:

- `.gbk`: Genbank-like format (for more information check here)
- `.faa`: the reading frames as amino acid sequences
- `.ffn`: the reading frames as nucleic acid sequences
SKIP THE ORFs PREDICTION (`prodigal`) STEP

If you have already calculated the ORFs of your genomes before starting to use `microbetag`, you can create a folder called `ORFs` within your `output_directory` and move all your `.faa` files there. This way, `microbetag` will use those instead of running `prodigal`.

Your `.faa` files should look like this:

```
>c_000000001749_1 # 2 # 913 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.479
DDSKIHQLGWDAFQAGTKVAKEEGLYGAGQDLLSDAFSGNVKGLGPAVAELSFEERPSEP
FLFFMADKTEPGAYNLPFYLSYADPMYNPGLMLSPKMGKGFVFTVMDVENTENDRIIELT
TPEDIYDLACLLRDNGRFVVESIRSAKTGETTAVCSTTRLNKIAGEYVGKDDPVALARVQ
```
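Staging precomputed ORFs boils down to a couple of `mkdir`/`cp` commands. Below is a minimal sketch, where `dev_io_microbetag` stands for your `io_path`, `my_output` for your `output_directory`, and `my_precomputed_orfs/` for wherever your `.faa` files currently live; all three names are placeholders, and a stand-in `.faa` file is fabricated only to keep the sketch self-contained.

```shell
# Placeholders for illustration: io_path, output_directory and source folder.
IO_PATH=./dev_io_microbetag
OUT_DIR="$IO_PATH/my_output"
SRC=./my_precomputed_orfs

# Stand-in .faa file so the sketch is self-contained; you would already have yours.
mkdir -p "$SRC"
printf '>c_000000001749_1\nDDSKIHQLGWDAFQAGTKVAKEEGLYG\n' > "$SRC/bin_101.faa"

# Create the ORFs folder microbetag looks for and stage the precomputed files there.
mkdir -p "$OUT_DIR/ORFs"
cp "$SRC"/*.faa "$OUT_DIR/ORFs/"
```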
KEGG annotations
`microbetag` makes use of the `hmmsearch` tool and the `kofam_database` profiles to check which KOs are present in each of your genomes. In the `output_directory`, `microbetag` creates a folder called `KEGG_annotations`, and there it builds a folder called `hmmout`, where it keeps all the 24,728 `.hmmout` files for each genome.

Once all the `.hmmout` files are there for all the genomes/bins under study, `microbetag` builds a file called `ko_merged.txt`, based on the DiTing implementation, that looks like this:
| bin_id | contig_id | ko_term |
|---|---|---|
| bin_41 | SCN18_26_2_15_R1_F_scaffold_115_57 | K07586 |
| bin_48 | SCN18_26_2_15_R4_B_scaffold_93_80 | K08086 |
| bin_41 | SCN18_26_2_15_R1_F_scaffold_206_63 | K03503 |
This file is the main component for microbetag
to proceed with the pathway complementarity step.
SKIP THE KEGG ANNOTATION (`hmmsearch`) STEP

If you already have hmm profiles, either from analyses before using `microbetag` or from previous `microbetag` runs on your genomes, you can create a folder called `hmmout` within the `KEGG_annotations` folder of your `output_directory` and move all the `.hmmout` profiles of your bins there. An `.hmmout` file would look like:

```
root@8649bd465c24:/data/microbetag_local/KEGG_annotations/hmmout# more K00005.bin_101.hmmout
#                                 --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name accession    E-value  score bias    E-value  score bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- ---------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
c_000000006615_9     -          K00005     -            1.2e-98  327.2   0.0   1.4e-98  327.0   0.0 1.0   1   0   0   1   1   1   1 # 14663 # 15772 # 1 # ID=74_9;partial=00;start_type=TTG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.449
#
# Program:         hmmsearch
# Version:         3.4 (Aug 2023)
# Pipeline mode:   SEARCH
# Query file:      /microbetag/microbetagDB/ref-dbs/kofam_database/profiles/K00005.hmm
# Target file:     /data/microbetag_local/ORFs/bin_101.faa
# Option settings: hmmsearch -o /dev/null --tblout /data/microbetag_local/KEGG_annotations/hmmout/K00005.bin_101.hmmout -T 324.27 --cpu 1 /microbetag/microbetagDB/ref-dbs/kofam_database/profiles/K00005.hmm /data/microbetag_local/ORFs/bin_101.faa
# Current dir:     /microbetag
# Date:            Mon Aug  5 11:06:08 2024
# [ok]
```
If you already have the `ko_merged.txt` file, you can add just a copy of it to the `KEGG_annotations` folder (the `hmmout` files are not necessary in this case), and `microbetag` will use it directly, skipping the `hmmsearch` step.
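If you want to assemble such a `ko_merged.txt` yourself from existing tblout files, a rough sketch is shown below. This is our own approximation, not the exact `microbetag`/DiTing code: it takes the target name (column 1) and the query KO (column 3) of each non-comment tblout line, with the bin id recovered from the `K*.<bin>.hmmout` filename. A tiny tblout-like file is fabricated only to keep the sketch self-contained.

```shell
# Fabricate one tiny tblout-like file so the sketch is self-contained;
# in a real run these come from hmmsearch --tblout.
mkdir -p KEGG_annotations/hmmout
printf 'c_000000006615_9 - K00005 - 1.2e-98 327.2 0.0\n' \
    > KEGG_annotations/hmmout/K00005.bin_101.hmmout

# Merge: bin_id <tab> contig_id <tab> ko_term, one line per hit.
for f in KEGG_annotations/hmmout/*.hmmout; do
    bin=$(basename "$f" .hmmout | cut -d. -f2-)          # K00005.bin_101 -> bin_101
    awk -v bin="$bin" '!/^#/ {print bin"\t"$1"\t"$3}' "$f"
done > KEGG_annotations/ko_merged.txt
```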
COMPUTING TIME, RESOURCES AND STORAGE
Running `microbetag` locally using your own genomes/bins/MAGs can take significant computing time and resources. In this tutorial, we use only a small number of bins, with almost no biological significance; what we want to accomplish here is to make sure that you can run `microbetag` locally. Indicatively, using a Linux machine and allocating 2 CPUs, it took more than 1 hour for the KO annotation of just those 7 bins. Running the complete workflow could take up to 6-7 hours, depending on the approach you choose to reconstruct your GEMs. `carveme` runs more smoothly and faster, since it does not require a RAST connection (see the following paragraph).
GEM reconstruction step
`microbetag` supports 2 ways to reconstruct GEMs based on the user's genomes/bins:

- using the `modelseedpy` Python library
- using the `CarveMe` tool

In the first case, `modelseedpy` requires RAST-annotated genomes. `microbetag` can do that on its own, starting from your genome sequences; alternatively, if you already have RAST annotations (either from previous `microbetag` runs or from other software), you may provide them to be used for the GEM reconstruction directly.
`modelseedpy` needs to establish a connection to the RAST server (`RastClient()`). In some cases, depending on the status of the RAST server, we have observed that timeout errors may occur. In this case, `microbetag` will exit and force a restart of the step on its own! Yet, it is good practice to also check the RAST server status while the `modelseed` reconstruction step is running.
In the following paragraphs, we highlight how to handle different scenarios of GEM reconstruction, using different file types as initial starting points. One needs to combine 2 parameters of the `config.yml` file to specify those scenarios: `input_type_for_seed_complementarities`, where one specifies the file type, and `sequence_files_for_reconstructions`, which points to the directory where the files to be used are located.
using `modelseedpy` and your bins

In this case, you have set:

- `input_type_for_seed_complementarities` as `bins_fasta`,
- `sequence_files_for_reconstructions` blank, and
- `genre_reconstruction_with` as `modelseedpy`.

Then, `microbetag` will use `RASTtk` programs to RAST-annotate the original genomes/bins. In the `output_directory`, a folder called `reconstructions` is built, and in this case 3 files for each genome/bin are now available:
- `.gto` and `.gto_2`: these are genome typed objects, i.e. JSON files that are compatible with KBase. The `.gto_2` is a second genome typed object with all the RAST annotation data.
- `.faa`: includes the same information as the `.gto_2` file, but with the protein translations exported in `.fasta` format.
For our 7 genomes/bins, this step may take about 1 hour, depending on your computing system.
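In `config.yml` terms, the three settings above would look like this (a sketch showing only these keys; note that `sequence_files_for_reconstructions` is deliberately left empty):

```yaml
input_type_for_seed_complementarities: bins_fasta
sequence_files_for_reconstructions:        # left blank on purpose
genre_reconstruction_with: modelseedpy
```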
using `modelseedpy` and your already RAST-annotated genomes

Assuming you already have the `.faa` files coming from the `rast-tk` package, you may use them directly by setting:

- `input_type_for_seed_complementarities` as `proteins_faa`,
- `sequence_files_for_reconstructions` as the path to the folder with your `.faa` files, and
- `genre_reconstruction_with` as `modelseedpy`.

In this case, `microbetag` will have to establish connections with the RAST client as before.
If your annotated genomes include the DNA sequences instead of the protein ones (`.fna` files), you may use them by setting `input_type_for_seed_complementarities` as `coding_regions`.
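For this scenario, the relevant part of the `config.yml` would look like this (a sketch; `my_faas` is a placeholder folder name):

```yaml
input_type_for_seed_complementarities: proteins_faa   # or coding_regions for .fna files
sequence_files_for_reconstructions: my_faas           # folder with your RAST-annotated files
genre_reconstruction_with: modelseedpy
```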
using `carveme`

In this case, you have set:

- `input_type_for_seed_complementarities` as `bins_fasta`,
- `sequence_files_for_reconstructions` blank, and
- `genre_reconstruction_with` as `carveme`.

In this case, under the `reconstructions` folder, we have a `.tsv` file for each genome/bin with the findings of the `diamond` search against the internal database of `carveme` with the BiGG reactions:
```
bin_151.peg.3  iLJ478.TM0057  57.9  309  125  3  6  310  2  309  2.72e-128  369
bin_151.peg.3  iLJ478.TM1063  55.9  311  130  3  7  310  3  313  5.36e-124  358
```
For a thorough description of each column, you may check here.
GEMs already available
In this case, you may use your GEMs directly for the seed complementarity inference by setting:

- `input_type_for_seed_complementarities` as `models`,
- `sequence_files_for_reconstructions` pointing to the directory with the `.xml` files, and
- `genre_reconstruction_with` left blank or set to any value; it will not be considered.
PhyloMInt post process
After `microbetag` performs `PhyloMInt`, it runs a step to post-process the seed and the non-seed sets as initially returned, by:

- removing from seed sets compounds that are related to environmental metabolites that can be produced in several ways within the cell, and
- removing from non-seed sets compounds that cannot be produced in any other way than by entering the cell from the environment.

The numbers of this post-process are recorded in the `log.tsv` file, which can be found under the `seed_complementarity` folder and looks like this:
| model_id | environmental_initial_seeds | non_environmental_initial_seeds | total_initial_seeds | updated_seeds | initial_non_seeds | updated_non_seeds |
|---|---|---|---|---|---|---|
| bin_151 | 173 | 57 | 230 | 230 | 879 | 879 |
| bin_38 | 177 | 54 | 231 | 231 | 1211 | 1211 |
| bin_101 | 152 | 51 | 203 | 203 | 845 | 845 |
This post-process step is necessary since we use a complete medium to gapfill the models. One may come up with alternative gap-filling procedures that minimise the missed potential cross-feedings while, at the same time, not over-predicting such cases.
Using Docker
Once you have installed Docker locally, you may run

```
docker pull hariszaf/microbetag:v1.0.2
```

to get microbetag locally.
Version is essential!
Please make sure you are aware of the version you are using. Newer versions may fix reported bugs or add new features. Always mention the version when you are about to submit any issues.
Then, you need to get a copy of the `kofam` database to allow the annotation of your sequences with KEGG ORTHOLOGY terms. You may get this by running the following chunk of code:

```
mkdir kofam_database &&\
    cd kofam_database &&\
    wget -c ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz &&\
    wget -c ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz &&\
    gzip -d ko_list.gz &&\
    tar zxvf profiles.tar.gz
```
Now, you need to download the `config.yml` file that accompanies `microbetag` and set the values of the required and optional arguments of your choice. In this file, each argument has a `required` field that denotes whether it is mandatory to set it or not. One may provide just an abundance table and the corresponding bins/MAGs sequence files.
FILENAMES
The filenames of your bins/MAGs need to match the names in your abundance table. For example, if the abundance table contains bin101, then the corresponding filename of the bin should be bin101.fa or bin101.fasta, etc. This will soon be changed so that a mapping file can be used instead; until then, though, `microbetag` will fail if that is not the case.
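A quick way to check this in advance is a small shell loop like the one below. This is a sketch: `my_abundance_table.tsv` and `my_bins/` are placeholder names, we assume the sequence identifiers sit in the first column with one header line, and two toy files are fabricated only to keep the sketch self-contained.

```shell
# Minimal self-contained setup for illustration; you would use your real files.
mkdir -p my_bins
printf 'seqId\tsample1\nbin101\t42\nbin102\t7\n' > my_abundance_table.tsv
printf '>c1\nACGT\n' > my_bins/bin101.fa

# For each identifier in the table, report bins with no matching .fa/.fasta file.
tail -n +2 my_abundance_table.tsv | cut -f1 | while read -r id; do
    [ -e "my_bins/$id.fa" ] || [ -e "my_bins/$id.fasta" ] || echo "missing bin for: $id"
done
# -> missing bin for: bin102
```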
In case you do not already have GENREs for your bins/MAGs, `microbetag` supports two ways for the reconstruction of metabolic networks:

- using `modelseedpy`, which requires RAST annotation of your bins and is based on the ModelSEED resource and identifiers,
- using `carveme`, which can run on both DNA and protein sequences, makes use of the BiGG identifiers, and requires a Gurobi license (see section below). This can be a rather time-consuming step, especially using `modelseedpy`.

As you may already have gene predictions for your bins/MAGs, or even protein annotations, you may also provide them to `microbetag` so those steps can be skipped. If you have already built metabolic networks based on either ModelSEED or BiGG identifiers, you may provide them so that seed scores and seed complementarities can be computed directly on them.
Input folder
To conclude, your input folder to be mounted will look like this:

```
u0156635@gbw-l-l0074:microbetag$ ls
config.yml  my_bins/  my_abundance_table.tsv
```

where in the `my_bins` folder you have:

```
u0156635@gbw-l-l0074:microbetag$ ls bins/
bin_101.fa  bin_151.fa  bin_19.fa
```

Once your input folder is ready, you can mount it on your Docker container and run `microbetag`:
```
docker run --rm -it \
    --volume=./tests/dev_io_microbetag/:/data \
    --volume=./microbetagDB/ref-dbs/kofam_database/:/microbetag/microbetagDB/ref-dbs/kofam_database/ \
    --volume=$PWD/gurobi.lic:/opt/gurobi/gurobi.lic:ro \
    --entrypoint /bin/bash \
    hariszaf/microbetag:v1.0.2
```
The `--volume` flag allows you to mount a local directory to a specific path in the container. It is essential that the right-hand parts of the volumes are kept as above! For example, when using `carveme`, a Gurobi license is required; `microbetag` expects the license under the `/opt/gurobi` path, so you need to make sure that all the right-hand parts of the volumes are as above and that the left-hand parts point to your local paths.
Remember! It is strongly suggested that all the files and folders you mount be part of your root path, meaning the directory from which you initiate your Docker container.
For example, if you look at the last chunk of code, you will notice that `kofam_database`, `gurobi.lic`, and the input-output folder called `dev_io_microbetag` are all within my root folder `~/github_repos/KU/microbetag`, from where I run the `docker run` command.
A Web License Service (WLS) Gurobi license is required in case you are about to use `carveme`. You may find the following link useful on how to get one.
Once you have fired up a container, you can run `microbetag` using the following command:

```
root@20510f8400f1:/microbetag# python3 microbetag.py /data/config.yml
```
Using Singularity/Apptainer
These technologies are widely used in High Performance Computing (HPC) systems. In case you are about to use `microbetag` in such a system, you first need to build a Singularity image (`.simg`) based on the Docker one:

```
sudo singularity build microbetag_v102.simg docker://hariszaf/microbetag:v1.0.2
```

You will need sudo rights to run this command. If you do not have sudo rights, you can either ask your admin to do so, or run the build command in a similar environment, e.g. your own Linux-based laptop, and move the image to the HPC with a single `scp` command. Also, you can ask your admin or check your HPC documentation site for how they deal with Docker images and follow their lead.
Once a `.simg` image is available, you may run `microbetag` again by mounting the necessary paths:

```
singularity exec \
    -B tests/dev_io_microbetag/:/data \
    -B microbetagDB/ref-dbs/kofam_database/:/microbetag/microbetagDB/ref-dbs/kofam_database/ \
    -B $PWD/gurobi.lic:/opt/gurobi/gurobi.lic:ro \
    microbetag_v102.simg \
    python3 /microbetag/microbetag.py /data/config.yml
```