PEMA investigating metabarcoding

Train the CREST classifier using any Silva version

In this post we present how to train the CREST classifier using any Silva database version of your choice. We will use Silva v.138 as an example.

From Silva’s archive, under the “Exports” tab, you may download the following files:

- SILVA_<version>_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz
- SILVA_<version>_SSURef_NR99_tax_silva_trunc.fasta.gz

and gunzip them.

Let’s break down the names of these files:

- SSU: stands for the 16S/18S rRNA marker genes.
- Ref: refers to the SILVA Reference dataset.
- NR: stands for the non-redundant SILVA Reference dataset.
- trunc: sequences in these files have been truncated, meaning that all nucleotides that were not aligned have been removed from the sequence.
- align: multi-FASTA files of the SSU/LSU databases including the SILVA taxonomy for Bacteria, Archaea and Eukaryotes in the header (including the FULL alignment).

As described in the Training the CREST classifier tab, you need to have:

1. a 3-column file (the .nds taxonomy file)
2. an alignment file of your sequences

The .nds file

Get the taxonomies

grep ">" SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta | awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}' > taxonomies
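To see what the awk step does, you can pipe a single mock header through it (the accession and lineage below are made up for illustration):

```shell
# A fabricated SILVA-style header: accession, a space, then the taxonomy string
echo '>AB000001.1.1234 Bacteria;Proteobacteria;Gammaproteobacteria' \
  | awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}'
# prints: Bacteria;Proteobacteria;Gammaproteobacteria
```

The awk program drops the first whitespace-separated field (the accession) and keeps everything after it, i.e. the taxonomy string.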

Get the accession numbers

It has been noticed that it is better for the .nds file not to contain special characters, so we remove the dots (.) from the accession numbers.

Remove the taxonomy from the headers of the sequences.

sed 's/ .*//g' SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta  > SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy.fasta

and remove the dots (.) from the accession ids.

sed '/^>/s/\.//g' SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy.fasta > SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy_no_points.fasta

Now, you can keep just the ids of the sequences (accession numbers without the dots).

grep ">" SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy_no_points.fasta > ids

Build the .nds file

We use the ids and the taxonomies files we built to get the .nds file:

paste -d "\t" ids taxonomies ids > silva138.nds
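As a quick sanity check of the whole recipe, the same commands can be run on a tiny mock FASTA (the accessions and lineages below are invented); the resulting .nds should have three tab-separated columns:

```shell
# Build a two-record mock FASTA with SILVA-style headers (made-up data)
printf '>AB000001.1.1234 Bacteria;Firmicutes;Bacilli\nACGT\n>AB000002.1.5678 Archaea;Euryarchaeota\nTTGA\n' > mock.fasta

# Same steps as above: extract taxonomies, build dot-free ids, paste into a .nds
grep ">" mock.fasta | awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}' > mock_taxonomies
sed 's/ .*//g' mock.fasta | sed '/^>/s/\.//g' > mock_clean.fasta
grep ">" mock_clean.fasta > mock_ids
paste -d "\t" mock_ids mock_taxonomies mock_ids > mock.nds

cat mock.nds
# >AB00000111234    Bacteria;Firmicutes;Bacilli    >AB00000111234
# >AB00000215678    Archaea;Euryarchaeota          >AB00000215678
```

Each line holds the dot-free accession, the taxonomy string, and the accession again, separated by tabs.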

The alignment sequence file

This is why you also downloaded the SILVA_<version>_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz file.

This is the alignment made from the initial sequences of your Silva version. The *SSURef_NR99_tax_silva_trunc.fasta* file we used in the previous step contains the sequences of this alignment, meaning that all nucleotides that were not aligned have been removed from the sequence.

So, the only thing you need to do before using it is to make sure that you keep as a header only the new id (the accession number without the dots).

sed 's/ .*//g' SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta > SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc_no_taxonomy.fasta
sed '/^>/s/\.//g' SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc_no_taxonomy.fasta > SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc_no_taxonomy_no_points.fasta

Train CREST with your Silva version

Now, you may train the CREST classifier by running

~/CREST/LCAClassifier/bin/nds2CREST -o silva138 -i SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc_no_taxonomy_no_points.fasta silva138.nds

where nds2CREST is available through the CREST GitHub repo, or you may simply run a pema container and do it there.

REMEMBER! If you are using a container, everything you do there is gone once you exit the container. So you have to mount a local directory onto the container and save your work there. This way, the output of anything you run in the container will be available on your machine.
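A mount like the one below does the trick (the image tag and host path here are illustrative placeholders; adjust them to your own setup):

```shell
# Bind-mount a host directory into the container so output survives after exit.
# The image name and both paths are placeholders, not verified values.
docker run -it --rm \
    -v /path/on/your/machine:/mnt/analysis \
    pemabase:v.2.1.1 \
    /bin/bash
# Inside the container, write all output under /mnt/analysis;
# it will remain in /path/on/your/machine after you exit.
```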

You now have a silva138.map and a silva138.tre file.

Last but not least

To actually use CREST, though, you also need a BLAST database with the sequences you used. To this end, you may use the SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy_no_points.fasta. Attention! This is the non-aligned sequence file!

/ncbi-blast-2.8.1+/bin/makeblastdb \
	-in SILVA_138.1_SSURef_NR99_tax_silva_trunc_no_taxonomy_no_points.fasta \
	-title silva138 \
	-dbtype nucl \
	-out <a-directory-of-your-choice>

This will return 3 files: silva138.nsq, silva138.nin and silva138.nhr.

You can now add the new Silva version to your CREST instance: move these 3 files, along with the *.map and the *.tre files, into a new directory under ~/CREST/LCAClassifier/parts/flatdb, and register it in the lcaclassifier.conf file located at ~/CREST/LCAClassifier/parts/etc by running something like:

echo "silva138 = /home/tools/CREST/LCAClassifier/parts/flatdb/silva138" >> /home/tools/CREST/LCAClassifier/parts/etc/lcaclassifier.conf

You are all set!

PEMA v.2.1.4 is now released!

A new version of PEMA has just been released!

This new PEMA release:

  • is based on a new pemabase Docker image (pemabase:v.2.1.1), which is now completely independent from any local resources and which the user can build through the Dockerfile.pemabase file
  • a new marker gene, the 12S rRNA, is now also supported, exploiting the 12S Vertebrate Classifier v2.0.0-ref database
  • for the 18S rRNA marker gene, the PR2 database has been integrated, so the user may now select between Silva and PR2
  • thanks to the ncbi-taxonomist software, PEMA now provides an extended OTU/ASV table whose last column holds the NCBI Taxonomy Id of the lowest taxonomic rank (the one closest to the species rank) for which an Id is available

In addition, in this release:

  • a sanity_check folder with datasets and parameters files for multiple combinations is now available, so tests can be performed to check whether PEMA runs properly after edits
  • bugs during the analysis of 18S data have been fixed
  • the perSample outputs have been fixed
  • conversion of raw data in non-ENA formats from multiple sequencers is supported

You may find pema v.2.1.4 as a Docker image or as a Singularity image.

Many thanks to Christina Pavloudi for all her feedback and insight!

This release addresses some of the challenges set by the ARMS-MBON project of ASSEMBLE Plus.

More features are about to be integrated in the future. For example, what about pseudogenes and NUMTs?

You can always share your thoughts on metabarcoding issues and challenges through PEMA’s Discussions!

PEMA has now a new architecture

PEMA has been completely re-architected!

All previous modules are now available, i.e. all four marker genes (16S/18S rRNA, ITS and COI) are supported, and custom reference databases can be used for both the CREST and the RDPClassifier cases.

The aim of this re-architecting was to make PEMA much friendlier to contribute to!

Metabarcoding is an extremely active method!

That is why PEMA’s intention is to become an open source project asking for the contribution of anyone who would like to join forces!

For a thorough description on how you could do that, you may have a look here.

💯 🖥️ Happy coding! 🖥️ 💯

PEMA@DNAQUANET2021

PEMAv2 was presented in the framework of the first DNAQUANET conference!

Here you may watch the video recording!

Many thanks to @DNAquaNet for bringing us all together!

PEMA@MDAW Online Workshop

The Microbiome Data Analyses Workshop has opened its registration!

A series of great topics and software will be covered!

PEMA will be there for a full hands-on tutorial!

Even online, hope to e-meet you all there!

Follow the PREGO project tweet for more!

What is a container?

This is an example post.

Containers are a form of operating system virtualization. A single container might be used to run anything from a small microservice or software process to a larger application. Inside a container are all the necessary executables, binary code, libraries, and configuration files.
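As an illustrative sketch (assuming Docker is installed and using the public alpine image), you can see this self-containment in action: the container ships its own OS files, independent of the host:

```shell
# The alpine image carries its own libraries and configuration files,
# so /etc/os-release inside the container reports Alpine Linux
# regardless of the host's operating system.
docker run --rm alpine cat /etc/os-release
```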