All of the tested datasets [27–35] are listed in Table 1. They vary in size from a few hundred to tens of thousands of cells, with varying sparsity rates (proportion of zero entries) and different numbers of inherent cell subpopulations, thus allowing a comprehensive evaluation of the imputation methods. In addition, all of the real datasets comprise certain types of immune cell subsets, such as T cells, B cells, natural killer (NK) cells, monocytes, dendritic cells (DCs) and innate lymphoid cells (ILCs). For example, dataset PBMC is mainly composed of 4 distinct cell types (T cells, B cells, NK cells, and monocytes), while dataset CRC contains 20 highly homogeneous cell subsets (12 CD4 T cell subsets and 8 CD8 T cell subsets), so the two datasets pose different challenges for imputation.
To further evaluate the effectiveness and robustness of the different methods, four simulated datasets with varying proportions of dropouts were synthesized using Splatter [36]. Briefly, a baseline dataset without dropouts was first generated using the default parameters in Splatter. This dataset contains 2000 cells, 600 genes, and 5 clusters. Four datasets with different sparsity rates, ranging from 30 to 90%, were then derived from this baseline dataset.
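To make the notion of sparsity rate concrete, the following Python sketch computes the proportion of zero entries in a count matrix and derives a sparser matrix from a dropout-free baseline. It is an illustration only: the datasets described here were generated with Splatter, whose dropout model is not reproduced, and the uniform random masking used in `mask_to_sparsity` is a stand-in assumption. Both function names and the `seed` parameter are hypothetical.

```python
import numpy as np


def sparsity_rate(counts):
    """Return the proportion of zero entries in a count matrix."""
    counts = np.asarray(counts)
    return float((counts == 0).mean())


def mask_to_sparsity(counts, target, seed=0):
    """Zero out randomly chosen nonzero entries until `target` sparsity is reached.

    Assumption: entries are masked uniformly at random. This is NOT
    Splatter's dropout model; it only illustrates the sparsity definition.
    """
    counts = np.array(counts, copy=True)
    rng = np.random.default_rng(seed)
    nonzero = np.flatnonzero(counts)  # flat indices of nonzero entries
    n_target_zeros = int(round(target * counts.size))
    n_extra = n_target_zeros - (counts.size - nonzero.size)
    if n_extra > 0:
        drop = rng.choice(nonzero, size=n_extra, replace=False)
        counts.flat[drop] = 0
    return counts
```

If the baseline is already at least as sparse as the target, the matrix is returned unchanged.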
Quality control of the real datasets was performed before imputation. First, bulk RNA samples within the datasets were removed. Low-quality single cells were then filtered out if the number of expressed genes or the library size exceeded an upper threshold or fell below a lower threshold. For each of these two metrics, the upper threshold was defined as the 75th percentile across all cells plus three times the interquartile range (IQR), and the lower threshold as the 25th percentile minus three times the IQR. Finally, genes that were expressed in no more than two cells were removed.
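The cell and gene filters described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes a cells × genes count matrix, that a gene counts as "expressed" in a cell when its count is greater than zero, and that gene filtering is applied after cell filtering. The function names `iqr_bounds` and `quality_control` are hypothetical, and the removal of bulk RNA samples is not covered.

```python
import numpy as np


def iqr_bounds(values, k=3.0):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR) for a 1-D metric."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def quality_control(counts):
    """Filter low-quality cells, then rarely expressed genes.

    counts: cells x genes array (assumed layout).
    Returns the filtered matrix plus boolean masks over the original
    cells and genes.
    """
    counts = np.asarray(counts)
    n_expressed = (counts > 0).sum(axis=1)  # expressed genes per cell
    library_size = counts.sum(axis=1)       # total counts per cell

    keep_cells = np.ones(counts.shape[0], dtype=bool)
    for metric in (n_expressed, library_size):
        lower, upper = iqr_bounds(metric)
        keep_cells &= (metric >= lower) & (metric <= upper)

    filtered = counts[keep_cells]
    # Keep genes expressed in more than two of the remaining cells.
    keep_genes = (filtered > 0).sum(axis=0) > 2
    return filtered[:, keep_genes], keep_cells, keep_genes
```

With a multiplier of three, the bounds are wide, so only cells with extreme gene counts or library sizes are discarded.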
In dataset BCC, which contains more than 50,000 cells, only the 1000 genes with the highest expression variance were retained for imputation, to speed up the calculation. DrImpute and scImpute were not applied to this dataset, because the number of cells exceeded the limit of DrImpute and the run time of scImpute exceeded our time limit (5 days).
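Selecting the most variable genes can be sketched in Python as below. The text does not state whether variance was computed on raw or transformed expression values; this sketch assumes the variance is taken directly on the supplied cells × genes matrix. The function name `top_variable_genes` is hypothetical.

```python
import numpy as np


def top_variable_genes(counts, n_top=1000):
    """Keep the n_top genes (columns) with the highest variance across cells.

    Returns the reduced matrix and the retained column indices,
    in their original order.
    """
    counts = np.asarray(counts, dtype=float)
    variances = counts.var(axis=0)
    n_top = min(n_top, counts.shape[1])
    # Stable sort on negated variances: descending, ties keep original order.
    order = np.argsort(-variances, kind="stable")[:n_top]
    idx = np.sort(order)
    return counts[:, idx], idx
```

Keeping the retained indices makes it possible to map imputed values back to the original gene identifiers.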