Avoiding scaling issues with long string data

Benjamin Murray
Eric Kerfoot
Liyuan Chen
Jie Deng
Mark S. Graham
Carole H. Sudre
Erika Molteni
Liane S. Canas
Michela Antonelli
Kerstin Klaser
Alessia Visconti
Alexander Hammers
Andrew T. Chan
Paul W. Franks
Richard Davies
Jonathan Wolf
Tim D. Spector
Claire J. Steves
Marc Modat
Sebastien Ourselin

ExeTera is explicitly designed to avoid some of the problems that Pandas (and, by extension, Dask) encounters when loading the Covid Symptom Study data. Pandas’ internal representation scales poorly for datasets with many string columns in which one or more columns contain very long entries. In the Covid Symptom Study Patient table, the longest string encountered in the free-text data is approximately 600 characters long. Internally, Pandas stores all string columns together in a 2D array-like structure. It allocates this array in a fixed-width string format, in which every entry has the capacity of the longest entry encountered in any of the string data. For the Patient table dated 23 May 2021, approximately 30 columns are imported as strings, with approximately 5 million elements per column, so a table of roughly 90 GB must be allocated (30 columns × 5 million rows × 600 characters, assuming about one byte per character), even though the serialized CSV representation is only 3.8 GB. Dask uses Pandas DataFrames internally and therefore suffers from the same degenerate memory usage.

ExeTera, by contrast, always stores columns (fields) as distinct structures in memory, and it presents string data to the user through memory-efficient indexed strings, so it does not suffer from degenerate performance on datasets containing natural-language fields. For the Covid Symptom Study dataset, the imported Patient table is 5.9 GB in size. Our approach also scales to textual analysis of natural-language data on the far larger Assessment table, which contains 360 million assessments logged by users of the Covid Symptom Study app.
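The contrast between fixed-width and indexed string storage can be illustrated with a minimal NumPy sketch. This is not ExeTera's actual implementation or API: the helper names `fixed_width_bytes`, `to_indexed`, and `get_string` are hypothetical, the offsets-plus-byte-buffer layout is only one common way to realise indexed strings, and the size estimate assumes one byte per character.

```python
import numpy as np


def fixed_width_bytes(n_columns, n_rows, max_len, bytes_per_char=1):
    """Bytes needed when every entry reserves room for the longest string.

    Assumption: one byte per character unless stated otherwise.
    """
    return n_columns * n_rows * max_len * bytes_per_char


def to_indexed(strings):
    """Pack strings into one byte buffer plus an offsets array of length n + 1.

    Hypothetical layout for illustration: entry i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(e) for e in encoded])
    values = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return offsets, values


def get_string(offsets, values, i):
    """Recover the i-th string from the indexed representation."""
    return values[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")


# Figures quoted in the text: about 30 string columns, about 5 million rows,
# longest entry about 600 characters.
estimate_gb = fixed_width_bytes(30, 5_000_000, 600) / 1e9

# A tiny column with one long entry, stored both ways.
sample = ["", "cough", "x" * 600]
fixed = np.array([s.encode("utf-8") for s in sample], dtype="S600")
offsets, values = to_indexed(sample)
indexed_nbytes = offsets.nbytes + values.nbytes

print(f"fixed-width estimate for the Patient table: {estimate_gb} GB")
print(f"sample column, fixed width: {fixed.nbytes} bytes")
print(f"sample column, indexed: {indexed_nbytes} bytes")
```

In the fixed-width array every entry pays for the longest string, whereas in the indexed form the cost is the total length of the text plus one offset per entry. This is why a single long free-text answer does not inflate the storage of the short entries around it.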
