Generation of the Malaspina Gene Database (M-GeneDB)

Silvia G. Acinas
Pablo Sánchez
Guillem Salazar
Francisco M. Cornejo-Castillo
Marta Sebastián
Ramiro Logares
Marta Royo-Llonch
Lucas Paoli
Shinichi Sunagawa
Pascal Hingamp
Hiroyuki Ogata
Gipsi Lima-Mendez
Simon Roux
José M. González
Jesús M. Arrieta
Intikhab S. Alam
Allan Kamau
Chris Bowler
Jeroen Raes
Stéphane Pesant
Peer Bork
Susana Agustí
Takashi Gojobori
Dolors Vaqué
Matthew B. Sullivan
Carlos Pedrós-Alió
Ramon Massana
Carlos M. Duarte
Josep M. Gasol

All 3,872,410 predicted coding sequences longer than 100 bp from each assembled metagenome were pooled and clustered at 95% sequence similarity and 90% overlap of the shorter sequence with cd-hit-est (ref. 97) v.4.6, using the following options: -c 0.95 -T 0 -M 0 -G 0 -aS 0.9 -g 1 -r 1 -d 0. This yielded 1,115,269 non-redundant gene clusters (hereafter referred to simply as genes). These genes were aligned to UniRef100 (ref. 98; release 2019-10-16) with diamond blastx (ref. 99; v0.9.22; e-value cutoff 0.0001). The least common ancestor taxonomic assignment of the UniRef100 best matches was obtained from NCBI’s taxonomy database (ref. 100; release 2020-01-30).
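The clustering options reported above can be assembled into a command line as in the following minimal sketch. It only builds and prints the command; it does not run cd-hit-est. The input and output file names (pooled_cds.fna, m_genedb_nr95.fna) and the helper name cd_hit_est_command are hypothetical and do not come from the protocol. The flag descriptions in the comments follow the CD-HIT user guide as we understand it.

```python
import shlex

# Options exactly as reported in the protocol text.
CDHIT_OPTIONS = {
    "-c": "0.95",  # sequence identity threshold (95%)
    "-T": "0",     # threads; 0 means use all available CPUs
    "-M": "0",     # memory limit in MB; 0 means unlimited
    "-G": "0",     # 0 = local sequence identity instead of global
    "-aS": "0.9",  # alignment must cover 90% of the shorter sequence
    "-g": "1",     # assign each sequence to its most similar cluster
    "-r": "1",     # compare both strands
    "-d": "0",     # keep sequence names up to the first whitespace in .clstr
}


def cd_hit_est_command(input_fasta, output_prefix):
    """Return the cd-hit-est argument list for the given (hypothetical) files."""
    cmd = ["cd-hit-est", "-i", input_fasta, "-o", output_prefix]
    for flag, value in CDHIT_OPTIONS.items():
        cmd += [flag, value]
    return cmd


# Hypothetical file names, for illustration only.
cmd = cd_hit_est_command("pooled_cds.fna", "m_genedb_nr95.fna")
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run would execute the clustering, assuming cd-hit-est v.4.6 is installed and on the PATH.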

To explore the novelty of the M-GeneDB, we clustered it with the 46,775,154 non-redundant sequences of the Tara Oceans Microbial Reference Gene Catalog version 2 (OM-RGC.v2; ref. 37) using cd-hit-est-2d (ref. 97) v.4.6 with the following options: -c 0.95 -T 48 -M 256000 -G 0 -aS 0.9 -g 1 -r 1 -d 0, obtaining a final catalog of 47,422,971 genes.
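The three counts reported above allow a rough estimate of how many M-GeneDB genes are new relative to OM-RGC.v2. The sketch below rests on an assumption that the text does not state explicitly: that the final catalog consists of every OM-RGC.v2 sequence plus only those M-GeneDB genes that did not cluster with any of them. Under that assumption the arithmetic is:

```python
# Counts as reported in the protocol text.
M_GENEDB_GENES = 1_115_269       # non-redundant M-GeneDB genes
OM_RGC_V2_GENES = 46_775_154     # non-redundant OM-RGC.v2 sequences
COMBINED_CATALOG = 47_422_971    # final catalog after cd-hit-est-2d

# ASSUMPTION: the combined catalog = all of OM-RGC.v2 + unmatched M-GeneDB genes.
novel_genes = COMBINED_CATALOG - OM_RGC_V2_GENES
shared_genes = M_GENEDB_GENES - novel_genes
novel_fraction = novel_genes / M_GENEDB_GENES

print(f"Novel M-GeneDB genes:  {novel_genes:,}")
print(f"Shared with OM-RGC.v2: {shared_genes:,}")
print(f"Novel fraction:        {novel_fraction:.1%}")
```

If the assumption holds, roughly 58% of the M-GeneDB genes (647,817 of 1,115,269) have no counterpart in OM-RGC.v2 at the 95% identity and 90% coverage thresholds.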
