Earth Mover’s Distance

Allison R. Greenplate
Daniel D. McClanahan
Brian K. Oberholtzer
Deon B. Doxie
Caroline E. Roe
Kirsten E. Diggins
Nalin Leelatian
Megan L. Rasmussen
Mark C. Kelley
Vivian Gama
Peter J. Siska
Jeffrey C. Rathmell
P. Brent Ferrell
Douglas B. Johnson
Jonathan M. Irish

The Earth Mover’s Distance (EMD) was calculated between each pair of populations using the “transport” library for R (28, 33) (https://cran.r-project.org/web/packages/transport/citation.html). The parent population (e.g., live CD45+ events) was gated in Cytobank, and a viSNE map was then created in Cytobank. The viSNE analysis used two output dimensions, equal sampling of 5000 events per file, 1000 iterations, a perplexity of 30, and a theta of 0.5. The events, together with their viSNE coordinates, were then downloaded from Cytobank, and the EMD was calculated between each pair of files. Each file’s set of points on the two viSNE axes was represented as a “wpp” object, with each point assigned unit weight, and the “wasserstein” function was called on each pair of point sets to produce a distance matrix.
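To make the quantity concrete, the following is a minimal Python sketch, not the R “transport” code used in the protocol. It relies on the fact that when two point sets have the same number of points and every point has unit weight, an optimal transport plan can be taken to be a one-to-one matching, so the EMD reduces to a minimum-cost assignment. The function name `emd_unit_weights` is hypothetical, the result is reported per unit of mass (an assumption about normalization), and the brute-force search over permutations is only feasible for a handful of points.

```python
from itertools import permutations
from math import dist


def emd_unit_weights(a, b):
    """Brute-force EMD between two equal-size point sets with unit weights.

    Illustrative only: tries every one-to-one matching, so it is usable
    for a handful of points, not for thousands of events.
    """
    if len(a) != len(b):
        raise ValueError("point sets must contain the same number of points")
    n = len(a)
    if n == 0:
        return 0.0
    # Total Euclidean cost of the cheapest one-to-one matching.
    best_total = min(
        sum(dist(a[i], b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    # Assumption: report the cost per unit of mass (total mass is n).
    return best_total / n


# Toy example: set_b is set_a shifted 3 units along the first axis.
set_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
set_b = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0)]
print(emd_unit_weights(set_a, set_b))  # prints 3.0
```

A pure shift of every point by the same vector costs exactly the length of that vector per unit of mass, which is why the toy example returns 3.0.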

Because calculating a matrix of EMD values between every pair of 5000-event sets from the viSNE analysis is computationally expensive, four optimizations were applied. (1) Each file was further down-sampled to 1000 of the 5000 events used per file in the viSNE analysis. Each event was still assigned unit weight, so every point set still had the same total mass of 1000. (2) The “shortsimplex” method was used for the “wasserstein” function in the “transport” library, which accepted no other parameters besides the pair of weighted point sets (34). (3) Each population was automatically assigned an EMD of zero to itself, and EMD values already computed on one side of the diagonal were copied to the other side, because EMD is a metric and is therefore symmetric. (4) In addition to the above, the “parallel” library was used to compute the rows of the matrix in parallel, using the number of cores reported by its “detectCores” function. EMD values computed by ‘emdist’ were compiled in a CSV file and used to create a heatmap in R for visualization. EMD values were compared between groups in Excel using Student’s t-test. The CSV file and the heatmap are both produced as outputs.
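Optimizations (1), (3), and (4) can be sketched in Python as follows; this is an illustration of the bookkeeping, not the authors’ R code. All names (`downsample`, `pairwise_matrix`, `centroid_distance`) are hypothetical, and `centroid_distance` is a cheap placeholder that is not the EMD: any symmetric distance function, such as an EMD routine, could be passed in its place. A thread pool is used only to show the row-wise structure; for a CPU-bound pure-Python distance, a process pool would be needed to obtain an actual speedup.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor
from math import dist


def downsample(points, k, seed=0):
    """Randomly keep k points (all of them if k or fewer are available)."""
    points = list(points)
    if len(points) <= k:
        return points
    return random.Random(seed).sample(points, k)


def centroid_distance(a, b):
    """Placeholder distance (NOT the EMD): distance between centroids."""
    ca = [sum(c) / len(a) for c in zip(*a)]
    cb = [sum(c) / len(b) for c in zip(*b)]
    return dist(ca, cb)


def pairwise_matrix(point_sets, distance, workers=None):
    """Symmetric distance matrix, computed one row at a time."""
    n = len(point_sets)

    def upper_row(i):
        # Only entries to the right of the diagonal are computed.
        return [distance(point_sets[i], point_sets[j]) for j in range(i + 1, n)]

    # One task per row; worker count defaults to the detected core count.
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        rows = list(pool.map(upper_row, range(n)))

    matrix = [[0.0] * n for _ in range(n)]  # diagonal is zero by definition
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            matrix[i][j] = value
            matrix[j][i] = value  # copy across the diagonal (symmetry)
    return matrix


# Toy data: three "files" of 100 points each, shifted along the first axis.
files = [
    [(x + shift, float(y)) for x in range(10) for y in range(10)]
    for shift in (0.0, 2.0, 5.0)
]
samples = [downsample(f, 20, seed=i) for i, f in enumerate(files)]
matrix = pairwise_matrix(samples, centroid_distance)
```

Computing only the upper triangle roughly halves the number of distance evaluations: for n files, n(n - 1)/2 pairs are evaluated instead of n².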
