Reproducibility in Learning
Reproducibility is vital to ensuring scientific conclusions are reliable, but failures of reproducibility have been a major issue in nearly all scientific areas of study in recent decades. A key issue underlying the reproducibility crisis is the explosion of methods for data generation, screening, testing, and analysis, where, crucially, only the combinations producing the most significant results are reported. Such practices (also known as p-hacking, data dredging, and researcher degrees of freedom) can lead to erroneous findings that appear to be significant, but that don’t hold up when other researchers attempt to replicate them.
In this talk, we introduce a new notion of reproducibility for randomized algorithms. This notion ensures that, with high probability, an algorithm returns exactly the same output when run on two independent samples drawn from the same distribution, simplifying the task of replication. We present new algorithms for fundamental learning problems that achieve this notion of reproducibility, and discuss connections to other well-studied algorithmic properties, such as differential privacy.
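The abstract does not spell out a construction, but one standard way to achieve this kind of guarantee is randomized rounding with shared internal randomness: compute an empirical estimate, then snap it onto a coarse grid whose random offset is derived from coins shared across the two runs. The sketch below is illustrative only, not the algorithm from the paper; the names reproducible_mean, grid_width, and shared_seed are hypothetical, and the parameters are not tuned to give any formal guarantee.

```python
import numpy as np

def reproducible_mean(samples, shared_seed, grid_width=0.05):
    """Illustrative sketch: estimate the mean of a [0, 1]-valued distribution,
    then round the estimate onto a randomly offset grid. The offset is derived
    from `shared_seed`, which plays the role of shared internal randomness:
    two runs with the same seed (but independent samples) use the same grid
    and therefore, with high probability, return exactly the same number."""
    rng = np.random.default_rng(shared_seed)   # shared internal coins
    offset = rng.uniform(0.0, grid_width)      # random offset of the grid
    empirical_mean = float(np.mean(samples))
    # Snap the empirical mean to the nearest point of the offset grid.
    cell = round((empirical_mean - offset) / grid_width)
    return offset + cell * grid_width

if __name__ == "__main__":
    data_rng = np.random.default_rng()         # fresh, independent data each run
    run1 = reproducible_mean(data_rng.uniform(0.2, 0.8, size=10_000), shared_seed=42)
    run2 = reproducible_mean(data_rng.uniform(0.2, 0.8, size=10_000), shared_seed=42)
    print(run1, run2, run1 == run2)            # typically identical outputs
```

In this sketch the grid width trades accuracy against reproducibility: wider cells make it less likely that two nearby empirical means straddle a cell boundary, at the cost of a coarser output.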
Joint work with Russell Impagliazzo (UCSD), Rex Lei (UCSD), and Toniann Pitassi (Columbia University).