Input files
Below is a list of the microbetag input files, along with typical examples of what they should look like:
File | Description | Requirement status |
---|---|---|
abundance table | An abundance table (in .tsv or .csv format) (example) | mandatory |
metadata file | File describing the sequencing data (example) | optional; using FlashWeave |
network file | A 3-column edge list (example) | optional |
Abundance table
If you provide your abundance table as a .tsv or .csv file, please make sure that:
- the first column always holds the sequence identifier
- the first row holds the sample names
- the last column holds a complete 7-level taxonomy
Do not use numeric-only labels for your samples and/or the sequences in your abundance table. For example, `324` as a sample id will cause microbetag to fail.
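The labeling rule above can be checked before submitting a table. The following is only a minimal sketch, not part of microbetag: the function name `numeric_only_labels` is hypothetical, and it assumes a tab-separated table whose first column holds the sequence ids and whose last column holds the taxonomy.

```python
import csv
import io


def numeric_only_labels(table_text, delimiter="\t"):
    """Return the sample names and sequence ids that consist of digits only."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # First header cell labels the id column and the last one the taxonomy,
    # so the sample names are the header cells in between (assumed layout).
    sample_names = header[1:-1]
    sequence_ids = [row[0] for row in body if row]
    return [label for label in sample_names + sequence_ids
            if label.strip().isdigit()]


# Toy table with one numeric-only sample name and one numeric-only sequence id.
example = (
    "seqId\tsample_1\t324\ttaxonomy\n"
    "asv_1\t10\t0\tBacteria;Firmicutes\n"
    "17\t0\t21\tBacteria;Firmicutes\n"
)
print(numeric_only_labels(example))  # ['324', '17']
```

Any label reported here could be renamed, for instance by adding a prefix such as `sample_` or `asv_`, before the table is used.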
microbetag requires a 7-level taxonomy scheme; for example:
Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;Thermoanaerobacteraceae;Caldanaerobius;Caldanaerobius polysaccharolyticus
In case an entry only reaches a higher taxonomic level, microbetag fills the missing levels with NA values; for example:
Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;Thermoanaerobacteraceae
would become
Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;Thermoanaerobacteraceae;NA;NA
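To make the padding idea concrete, here is a small sketch that fills a truncated lineage up to the 7 fields the scheme asks for. It is an illustration only and not microbetag's own code; the function name `pad_taxonomy` and its parameters are hypothetical.

```python
def pad_taxonomy(lineage, levels=7, filler="NA", sep=";"):
    """Pad a semicolon-separated lineage with `filler` up to `levels` fields."""
    fields = [field.strip() for field in lineage.split(sep) if field.strip()]
    if len(fields) > levels:
        # Longer lineages cannot be fixed by padding; they need curation.
        raise ValueError(f"{len(fields)} fields found, at most {levels} expected")
    return sep.join(fields + [filler] * (levels - len(fields)))


# A phylum-level entry gets five filler fields appended.
print(pad_taxonomy("Bacteria;Firmicutes"))
# Bacteria;Firmicutes;NA;NA;NA;NA;NA
```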
You may notice that the `GTDB_tax_assigned_abundance_table.tsv` abundance table returned by `microbetag_prep` has an 8-level taxonomy (including a Root level). That is why you need to make sure you set `microbetag_prep` as the taxonomy database in the parameter settings; otherwise microbetag will fail.
CURATE YOUR TAXONOMIES
If you have a taxonomy scheme that “skips” a level, or one that has more levels, microbetag will either return fewer annotations or fail. Make sure you always have a 7-level scheme for all the entries in your table and that the species/strain level, if available, is in the 7th field. Again, it is always good practice to use the microbetag preparation step to get the most suitable taxonomies for microbetag.
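A quick way to spot entries that need curation is to count the fields of each lineage. The sketch below is a hypothetical helper (`taxonomy_issues` is not a microbetag function) that assumes semicolon-separated lineages. Field counts alone cannot tell a truncated lineage from one that skips a level, so shorter entries are only flagged for manual inspection.

```python
def taxonomy_issues(lineages, levels=7, sep=";"):
    """Map the position of each suspicious lineage to a short description."""
    issues = {}
    for row, lineage in enumerate(lineages):
        fields = [field.strip() for field in lineage.split(sep)]
        if len(fields) > levels:
            issues[row] = f"{len(fields)} fields, expected {levels}"
        elif len(fields) < levels:
            issues[row] = f"only {len(fields)} fields: truncated or skips a level"
        elif "" in fields:
            issues[row] = "empty field inside the lineage"
    return issues


lineages = [
    # Complete 7-level lineage: not reported.
    "Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;"
    "Thermoanaerobacteraceae;Caldanaerobius;Caldanaerobius polysaccharolyticus",
    # Family level skipped: 6 fields.
    "Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;"
    "Caldanaerobius;Caldanaerobius polysaccharolyticus",
    # Extra leading Root level: 8 fields.
    "Root;Bacteria;Firmicutes;Thermoanaerobacteria;Thermoanaerobacterales;"
    "Thermoanaerobacteraceae;Caldanaerobius;Caldanaerobius polysaccharolyticus",
]
for row, message in taxonomy_issues(lineages).items():
    print(row, message)
```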
The phyloseq case
If you start from a `phyloseq` object, you can get a .tsv file using the `tax_table` and `otu_table` functions of the `phyloseq` library.
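# In an R environment, assuming `physeq` is a `phyloseq` object
# whose OTU table has taxa as rows (check with taxa_are_rows(physeq)).
library(phyloseq)

# Bind abundances and taxonomy ranks side by side, one row per taxon.
OTU_TAX <- cbind(
  data.frame(otu_table(physeq)),
  data.frame(tax_table(physeq))
)
write.table(OTU_TAX, "OTU_TAX.txt",
            row.names = TRUE, col.names = TRUE, sep = "\t", quote = FALSE)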
The .biom case
If you start from a .biom file, you can get a .tsv file using the `biom convert` command:
biom convert -i otu_table.biom -o otu_table.tsv --to-tsv --header-key taxonomy
Make sure you have the biom tools installed; if not, you can follow the installation instructions you can find here.
BEST PRACTICE
To get the best annotations in the most robust way, we strongly suggest you first prepare your data using the `microbetag_prep` Docker/Singularity image. This is almost always needed when you have large datasets with more than a few thousand sequences and no network for them. Yet, even if you have a network, we still strongly suggest running the taxonomy assignment step, so microbetag can map the taxa present to their corresponding GTDB genomes more efficiently. Have a look at the “preparation” section for how to do so!
Running microbetag_prep
If you are about to use `microbetag_prep` to taxonomically assign your OTUs/ASVs using GTDB, your abundance table file should be exactly as before, only this time the last column must hold the sequence instead of a 7-level taxonomy.
Here is an example file.
Metadata file
FlashWeave, the software microbetag invokes to build the co-occurrence network, can exploit metadata. If you want to run FlashWeave with a metadata file, remember that FlashWeave treats both the sequence ids (i.e., ASVs/OTUs/bins) and the metavariables (e.g., pH, sex, or any variable in your metadata file) as variables. Thus, you need to provide the metavariables as rows, contrary to what we do in most microbiome analyses.
Here is a toy example of what your files should look like:
abundance_file.txt
seqId | sample_1 | sample_2 | sample_3 |
---|---|---|---|
asv_1 | 10 | 0 | 3 |
asv_2 | 0 | 21 | 43 |
asv_3 | 32 | 31 | 2 |
asv_4 | 0 | 0 | 12 |
metadata_file.tsv
Metadata_1 | 0.2 | 1.7 | 0 |
Metadata_2 | Yes | No | Yes |
As shown, the sample names are omitted from the `metadata_file.tsv`. You need to make sure that their corresponding values are in the exact same order as in the `abundance_file.txt`. If the files are not provided like this, microbetag and/or the Docker image of the microbetag preprocessing step will fail.
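Metadata is usually kept with one row per sample, so it has to be transposed and reordered before it matches the layout described above. The sketch below shows one way to do that with the standard library. The helper names (`sample_order`, `metadata_rows`) are hypothetical, and the tab delimiter is an assumption based on the `.tsv` extension.

```python
import csv
import io


def sample_order(abundance_text, delimiter="\t"):
    """Read the sample names from the header row of the abundance file."""
    header = next(csv.reader(io.StringIO(abundance_text), delimiter=delimiter))
    return header[1:]  # drop the sequence id column label


def metadata_rows(sample_metadata, samples):
    """Turn {sample: {variable: value}} into one row per metavariable,
    with values ordered like `samples` and no sample-name header."""
    missing = [sample for sample in samples if sample not in sample_metadata]
    if missing:
        raise KeyError(f"no metadata for samples: {missing}")
    variables = list(sample_metadata[samples[0]])
    return [[variable] + [str(sample_metadata[s][variable]) for s in samples]
            for variable in variables]


abundance = "seqId\tsample_1\tsample_2\tsample_3\nasv_1\t10\t0\t3\n"
# Per-sample metadata, deliberately listed in a different order.
per_sample = {
    "sample_3": {"Metadata_1": 0, "Metadata_2": "Yes"},
    "sample_1": {"Metadata_1": 0.2, "Metadata_2": "Yes"},
    "sample_2": {"Metadata_1": 1.7, "Metadata_2": "No"},
}

rows = metadata_rows(per_sample, sample_order(abundance))
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

With the toy data, this reproduces the two metadata rows of the example above, in the column order of the abundance file.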
Network file
There is a great range of formats for networks. When you are using microbetag through Cytoscape then, to the best of our knowledge, you can start from any network format of your choice. That is because you first import the network into Cytoscape and only then load it into the MGG app, which transfers it to the microbetag server.
Make sure to rename the column that microbetag should treat as the weight of your edges to `microbetag::weight` (see the relevant tutorial).
However, if you are using microbetag locally and you already have a network to annotate, then you will have to provide it as a 3-column file (see example file):
node_a | node_b | microbetag::weight |
---|---|---|
ASV_963239 | ASV_4372091 | 0.3769868016242981 |
ASV_4480529 | ASV_4472202 | 0.4468387961387634 |
ASV_4472202 | ASV_4374302 | 0.4154910147190094 |
ASV_4480529 | ASV_4439469 | 0.39721810817718506 |
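If your network lives in memory as a list of weighted edges, a 3-column file like the one above can be written with a few lines of code. This is a hedged sketch: the function name `edge_list_tsv` is hypothetical, the tab delimiter and the header row are assumptions based on the table shown, and dropping repeated pairs is a convenience choice motivated by the edges being undirected.

```python
import csv
import io

HEADER = ["node_a", "node_b", "microbetag::weight"]


def edge_list_tsv(edges):
    """Serialize (node_a, node_b, weight) tuples as a 3-column edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    seen = set()
    for node_a, node_b, weight in edges:
        pair = frozenset((node_a, node_b))
        if pair in seen:
            continue  # same undirected edge listed again, e.g. as (b, a)
        seen.add(pair)
        writer.writerow([node_a, node_b, float(weight)])
    return buffer.getvalue()


edges = [
    ("ASV_1", "ASV_2", 0.5),
    ("ASV_2", "ASV_1", 0.5),  # duplicate of the first edge, reversed
    ("ASV_2", "ASV_3", 0.25),
]
print(edge_list_tsv(edges), end="")
```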
Cytoscape asks for a `source` and a `target` column in your network. Since a co-occurrence network does not have directed edges, you can set any node column as `source` or `target`. In our example, `node_a` could be the `source` and `node_b` the `target`, or the other way around.