3.4. Database implementation

Ioannis Mouratidis
Fotis A. Baltoumas
Nikol Chantzi
Michail Patsakis
Candace S.Y. Chan
Austin Montgomery
Maxwell A. Konnaris
Eleni Aplakidou
George C. Georgakopoulos
Anshuman Das
Dionysios V. Chartoumpekis
Jasna Kovac
Georgios A. Pavlopoulos
Ilias Georgakopoulos-Soares

Kmers, nullomers, nullpeptides, quasi-primes, and primes are organized in prefix tree (trie) data structures, using the Matching Algorithm with Recursively Implemented StorAge (MARISA) Trie implementation and its Python bindings [59]. This data structure was chosen for its performance. Trie hashes produced by MARISA are alphabet-agnostic and support both retrieval of the full contents of an indexed hash table and searches within that table, by exact match or by prefix. Several kmer-based indexing methods exist in the literature [2], [12], such as ssHash [43], ntHash [26], Fulgor [16], [26], and Pufferfish [6]. These methods were implemented to hash existing DNA sequences and produce dictionaries of k-sized substrings (kmers), which can then be used in other tasks, such as testing whether an input sequence contains kmers present in the dictionary. Although such structures are useful for sequence feature recognition and prediction (e.g. kmer-based taxonomy assignment), they do not serve the purpose of kmerDB, namely storing kmers in a database-like structure and retrieving all kmers present in one or more genomes/proteomes (or, conversely, all nullomers/nullpeptides absent from a genome/proteome). In addition, these structures are geared towards hashing DNA kmers and have a 4-letter alphabet (A, T, G, C) hardcoded into their underlying data structure, whereas a very large portion of kmerDB concerns protein sequences, which require a 20-letter amino acid alphabet.
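The exact-match and prefix queries described above, and the retrieval of nullomers as the complement of the stored kmers, can be illustrated with a minimal sketch. It uses a plain dictionary-based trie written for this example, not MARISA itself; the class `KmerTrie`, the helpers `extract_kmers` and `nullomers`, and the toy sequences are all hypothetical. Because nodes are keyed on arbitrary symbols, the same code handles both DNA and protein alphabets.

```python
from itertools import product


class KmerTrie:
    """Minimal dict-based prefix tree (illustrative stand-in, not MARISA)."""

    _END = object()  # sentinel key marking the end of a stored kmer

    def __init__(self, kmers=()):
        self._root = {}
        for kmer in kmers:
            self.add(kmer)

    def add(self, kmer):
        node = self._root
        for symbol in kmer:  # any character is accepted: no fixed alphabet
            node = node.setdefault(symbol, {})
        node[self._END] = True

    def _find(self, prefix):
        """Return the node reached by following `prefix`, or None."""
        node = self._root
        for symbol in prefix:
            node = node.get(symbol)
            if node is None:
                return None
        return node

    def __contains__(self, kmer):
        # Exact-match query.
        node = self._find(kmer)
        return node is not None and self._END in node

    def keys(self, prefix=""):
        """Prefix query: all stored kmers starting with `prefix`, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, path = stack.pop()
            for symbol, child in node.items():
                if symbol is self._END:
                    found.append(path)
                else:
                    stack.append((child, path + symbol))
        return sorted(found)


def extract_kmers(sequence, k):
    """Set of all length-k substrings of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def nullomers(trie, alphabet, k):
    """All length-k strings over `alphabet` that are absent from `trie`."""
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return [kmer for kmer in candidates if kmer not in trie]


# Toy inputs (assumed for illustration only).
dna = KmerTrie(extract_kmers("ACGTAC", 2))       # 4-letter alphabet
protein = KmerTrie(extract_kmers("MKVLMK", 2))   # amino acid alphabet

dna_prefix_hits = dna.keys("A")                  # kmers starting with "A"
dna_nullomers = nullomers(dna, "ACGT", 2)        # 2-mers never observed
protein_kmers = protein.keys()                   # full contents of the trie
```

The marisa-trie Python package offers comparable operations (`key in trie`, `trie.keys(prefix)`) on a far more compact static structure. Brute-force enumeration as in `nullomers` is only practical for small k, since the candidate space grows as the alphabet size raised to the power k.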

With the MARISA Trie data structure used to store the sequences of each genome/proteome, the stored kmers and nullomers/nullpeptides currently occupy 172 GB and 154 GB, respectively. By contrast, the initial dataset in uncompressed ASCII format amounts to approximately 2.4 TB. This highlights the efficacy of the MARISA Trie structure as a means of hashing and storing kmer datasets.
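The size comparison can be made explicit with a short calculation. This is a rough sketch under two assumptions that the text does not state: that the 2.4 TB figure covers both the kmer and the nullomer/nullpeptide sets, and that decimal units (1 TB = 1000 GB) apply. The variable names are illustrative.

```python
# Sizes reported in the text, in GB.
kmer_trie_gb = 172
nullomer_trie_gb = 154

# Assumption: 2.4 TB of uncompressed ASCII, taken as 2400 GB (decimal units),
# covering both the kmer and nullomer/nullpeptide sets.
raw_ascii_gb = 2400

stored_gb = kmer_trie_gb + nullomer_trie_gb      # total trie storage
reduction_factor = raw_ascii_gb / stored_gb      # how many times smaller
space_saved = 1 - stored_gb / raw_ascii_gb       # fraction of space saved

print(f"Trie storage: {stored_gb} GB")
print(f"Reduction factor: {reduction_factor:.1f}x")
print(f"Space saved: {space_saved:.0%}")
```

Under these assumptions the tries occupy 326 GB in total, roughly a sevenfold reduction relative to the uncompressed ASCII data.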

The front end of kmerDB is implemented in HTML, CSS, and JavaScript. The back end runs on the Apache web server and the Slim Framework v. 4.0, with server-side operations handled by PHP and, when required, Python. Genome and proteome metadata are stored in a MySQL relational database. The website layout was designed with the Bootstrap v. 5 framework, jQuery, and the DataTables library. kmerDB is publicly available at http://www.kmerdb.com.
