Virtual Workshop on Missing Data Challenges in Computation, Statistics and Applications
Regularization and spurious correlations in sparse single-cell transcriptomes
Abstract: Recent advances in biotechnology and genomics have generated a dizzying number of large, noisy, and sparse datasets that require the concomitant development of machine learning methods. The analysis of single-cell RNA-seq data has driven the development of data processing methods, such as transcript abundance normalization and imputation, to address the numerous sources of technical variability and missing data. While these regularization methods have been shown to be effective at imputing the expression of individual genes, their suitability for inferring gene-gene interactions and gene networks has not been systematically investigated. We report that the leading published methods all induce widespread and significant inflation of gene expression correlations across the genome, resulting in erroneous inferences of molecular pathways and networks. We propose a model-agnostic correction approach that effectively eliminates these correlation artifacts whilst still accurately inferring gene expression levels.
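
The inflation effect described in the abstract can be illustrated with a toy simulation. The sketch below is not one of the methods evaluated in the talk, which the abstract does not name. It assumes a generic low-rank reconstruction (truncated SVD) as a stand-in for imputation, applied to synthetic counts in which every gene is independent by construction. The function names, matrix sizes, Poisson rate, and rank are all illustrative assumptions.

```python
import numpy as np


def mean_abs_offdiag_corr(x):
    """Mean absolute Pearson correlation over all distinct gene pairs (columns)."""
    corr = np.corrcoef(x, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)
    return float(np.abs(corr[upper]).mean())


def low_rank_smooth(x, rank):
    """Rank-`rank` reconstruction via truncated SVD.

    Assumption: a generic stand-in for smoothing-style imputation,
    not any specific published method.
    """
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


# Synthetic sparse counts: cells x genes, every gene drawn independently,
# so any gene-gene correlation in the raw data is sampling noise.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50
counts = rng.poisson(lam=0.3, size=(n_cells, n_genes)).astype(float)

raw = mean_abs_offdiag_corr(counts)
smoothed = mean_abs_offdiag_corr(low_rank_smooth(counts, rank=5))
print(f"mean |r| raw: {raw:.3f}  after low-rank smoothing: {smoothed:.3f}")
```

Because the reconstructed matrix has only five independent directions shared by all fifty genes, the smoothed genes are forced to co-vary even though the underlying data contain no true gene-gene dependence. This is one simple mechanism by which regularization can produce spurious correlations.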