This research project will develop methods and practical tools for leveraging information from auxiliary data sources, such as administrative records and databases gathered by private-sector data aggregators, to adjust for nonresponse in surveys. Modern surveys have seen steep declines in response rates. These declines threaten the validity of secondary analyses based on the resulting incomplete data. At the same time, government agencies and survey organizations face increasing budgetary pressures, which leave fewer resources for extensive nonresponse follow-up activities. In this environment, government agencies and survey organizations need new options for handling missing data. This project will provide such options, enhancing the ability of data producers to create high-quality public use datasets that account for missing data. The project will benefit data users, including scholars who use survey data and those interested in methods for evaluating and correcting for biases due to nonresponse. An open-source package will be developed and made widely available via the Comprehensive R Archive Network. This package will enable agencies and other users to take advantage of the methodological advances. The project will train two Ph.D. students from underrepresented groups, one in statistical science and one in political science. The project also will engage two undergraduate students in a data science summer research experience.
The methodological developments to be addressed in this project will focus on the following question: How can survey organizations take advantage of information about the marginal distributions of survey variables that are available in auxiliary data sources when adjusting for nonresponse? The project will develop methods that enable users to posit distinct specifications of missing data mechanisms for different blocks of values. The project also will develop multiple imputation routines based on machine learning techniques to handle imputation with auxiliary information in databases with large numbers of variables. The multiple imputation framework will be leveraged to propagate uncertainty not only from the missing data, but also from population-based auxiliary marginal information with potentially non-trivial uncertainty. The project will fuse features of Bayesian modeling and classical survey-weighted estimation to ensure that imputations account for complex survey designs. The methodology will be illustrated on an application examining voter turnout among subgroups of the population in the Current Population Survey (CPS). The application will use population-based auxiliary data from government election statistics available in the United States Elections Project and voter files available from Catalist, a leading national vendor of voter registration data. The information in the auxiliary margins will be used to adjust the CPS data for nonresponse with a more reasonable set of assumptions than previous analyses of voter turnout based on the CPS. The CPS voter turnout application will inform scholars and policy makers about inequalities in electoral participation and provide insights about possible policy alternatives for improving voter turnout.
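To make the idea of adjusting with an auxiliary margin concrete, the following is a deliberately simplified sketch and not the project's methodology or package. It assumes a single binary survey variable (for example, a turnout indicator), simple random sampling with no survey weights, and an auxiliary population rate whose uncertainty is summarized by a normal distribution. All function names, parameters, and the normal approximation are hypothetical choices made for illustration. Python is used only for convenience. The sketch finds the rate among nonrespondents that would reconcile the respondents' mean with the auxiliary margin, redraws the margin in each imputation so that its uncertainty is propagated, and combines the completed-data estimates with Rubin's rules.

```python
import random
import statistics


def implied_nonrespondent_rate(p_pop, ybar_obs, n_resp, n_total):
    """Rate among nonrespondents that makes the full-sample mean equal p_pop.

    Solves n_total * p_pop = n_resp * ybar_obs + n_mis * rate for rate and
    clips the result to [0, 1]. Assumes at least one nonrespondent.
    """
    n_mis = n_total - n_resp
    rate = (n_total * p_pop - n_resp * ybar_obs) / n_mis
    return min(1.0, max(0.0, rate))


def rubin_combine(estimates, variances):
    """Combine m >= 2 completed-data estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    return q_bar, u_bar + (1.0 + 1.0 / m) * b


def impute_with_margin(y, p_pop, p_pop_se, m=20, seed=0):
    """Multiply impute a binary variable (None = missing) using a margin.

    Hypothetical illustration: the auxiliary rate is redrawn in every
    imputation (normal approximation, an assumption) so that uncertainty
    in the margin feeds into the combined variance.
    """
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n_total, n_resp = len(y), len(observed)
    n_mis = n_total - n_resp
    ybar_obs = statistics.fmean(observed)

    estimates, variances = [], []
    for _ in range(m):
        p_draw = rng.gauss(p_pop, p_pop_se)  # uncertain auxiliary margin
        rate = implied_nonrespondent_rate(p_draw, ybar_obs, n_resp, n_total)
        imputed = [1 if rng.random() < rate else 0 for _ in range(n_mis)]
        est = (sum(observed) + sum(imputed)) / n_total
        estimates.append(est)
        # Simple-random-sampling variance; a real survey design needs more.
        variances.append(est * (1.0 - est) / n_total)
    return rubin_combine(estimates, variances)
```

Calling `impute_with_margin(y, p_pop, p_pop_se)` on a list of 0/1 values with `None` for nonrespondents returns a pooled estimate and its total variance. Unlike this sketch, the methods described above are intended to handle many variables, distinct missingness mechanisms for different blocks of values, and complex survey designs.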