Event

Data thinning to avoid double dipping

Wednesday, January 31, 2024, 15:30 to 16:30

Lucy Gao, PhD

Assistant Professor of Statistics
University of British Columbia

WHEN: Wednesday, January 31, 2024, from 3:30 to 4:30 p.m.

WHERE: hybrid | 2001 McGill College Avenue, room 1140; Zoom

NOTE: Dr. Gao will be presenting from British Columbia

Abstract

"Double dipping" is the practice of using the same data to fit and validate a model. Problems typically arise when standard statistical procedures are applied in settings involving double dipping. To avoid the challenges surrounding double dipping, a natural approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in some problems, sample splitting is unattractive or impossible. In this talk, we are motivated by unsupervised problems that arise in the analysis of single cell RNA sequencing data, where sample splitting does not allow us to avoid double dipping. We first propose Poisson thinning, which splits a single observation drawn from a Poisson distribution into two independent pseudo-observations. We show that Poisson count splitting allows us to avoid double dipping in unsupervised settings. We next generalize the Poisson thinning framework to a variety of distributions, and refer to this general framework as "data thinning". Data thinning is applicable far beyond the context of single-cell RNA sequencing data, and is particularly useful for problems where sample splitting is unattractive or impossible.

Speaker bio

Website Link: https://www.lucylgao.com/
