New software speeds processing of large, complicated data sets

From the National Partnership for Advanced Computational Infrastructure

Researchers at the U-M will soon release software that will allow social scientists to analyze complex data at levels not possible with conventional techniques. The researchers are testing the beta version of the software as part of the Michigan Parallel Imputation (MPI) project, through the U-M’s participation in the National Partnership for Advanced Computational Infrastructure. The software imputes missing values in large data collections roughly eight times faster than current imputation regression methods.

The MPI project was started two years ago by Richard Rockwell, former director of the Interuniversity Consortium for Political and Social Research, which is affiliated with the Institute for Social Research (ISR). It currently is headed by Quentin Stout, professor of electrical engineering and computer science and director of the Center for Parallel Computing, and T. E. Raghunathan, associate professor of biostatistics and senior associate research scientist at ISR’s Survey Research Center.

Also working on the project are Theodore Tabe, a computer science graduate student who was primarily responsible for converting the original serial imputation code into the MPI parallel version that is suitable for a wide range of machines, and Peter Solenberger, senior systems analyst at the Survey Research Center.

“Not only will the MPI software speed up imputation regression, but future versions also will parallelize a complementary set of variance estimation programs to help estimate the accuracy of surveys,” Solenberger says. “This is important because many large surveys, such as the U.S. National Health Interview Survey (NHIS), usually have large amounts of missing data. How this missing data is imputed can greatly affect its analysis and perhaps the policy recommendations based on this analysis.”
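The basic idea behind regression imputation can be shown with a minimal sketch. The variable names, the single-predictor linear model, and the single-pass fill are illustrative assumptions, not the MPI project's actual code, which handles many variables and repeats the process to reflect uncertainty:

```python
import numpy as np

def impute_by_regression(y, X):
    """Fill missing values of y by regressing y on predictors X
    using the complete cases, then predicting the missing ones.
    (A single-step sketch; multiple imputation would repeat this
    with random draws to reflect uncertainty in the fill-ins.)"""
    missing = np.isnan(y)
    # Design matrix with an intercept column
    A = np.column_stack([np.ones(len(y)), X])
    # Fit least-squares coefficients on the observed cases only
    beta, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    y_filled = y.copy()
    y_filled[missing] = A[missing] @ beta
    return y_filled

# Hypothetical survey fragment: income rises with age, two values missing
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
income = np.array([30.0, 35.0, np.nan, 45.0, np.nan, 55.0])
print(impute_by_regression(income, age))  # → [30. 35. 40. 45. 50. 55.]
```

Each missing value is replaced by its prediction from the fitted line; with many variables and many repetitions over a survey-sized data set, this per-record work is what the parallel version distributes across processors.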

Imputation regressions for complex surveys can take inordinate amounts of time using traditional computing methods. Solenberger and his colleagues recently ran a typical imputation regression using serial software on a Sun Enterprise 450 that took three weeks. They then ran the same imputation using MPI’s parallel software on 32 IBM SP2 processors in less than three days.
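The runtimes above can be turned into a back-of-the-envelope speedup and parallel-efficiency figure. The day counts are rounded from the article's "three weeks" and "less than three days", so the results are lower bounds, not measured values:

```python
# Speedup implied by the quoted runtimes: three weeks serially
# vs. under three days on 32 IBM SP2 processors.
serial_days = 3 * 7        # three weeks on one processor
parallel_days = 3          # upper bound on the parallel run
speedup = serial_days / parallel_days
efficiency = speedup / 32  # fraction of an ideal 32x speedup
print(f"speedup >= {speedup:.0f}x, parallel efficiency >= {efficiency:.0%}")
```

An efficiency well below 100% is typical: parts of the imputation remain serial and the processors must exchange data, so adding processors yields diminishing, though still substantial, returns.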

The project has used the 112-processor IBM SP2 system at the Center for Parallel Computing. The Center has several parallel computers available to U-M researchers and students. For more information on the Center, visit the Web at www.engin.umich.edu/center/cpc.

“Parallel software like the MPI application will help current social scientists further analyze large surveys that already exist,” Solenberger explains. “This will help improve data quality and, therefore, the quality of analyses. Parallel computing will benefit future social scientists because it demonstrates the advantages of relatively inexpensive supercomputing in the field. Essentially, it is contributing to the modernization of social science computing.”

As the U-M researchers continue to pave the way for parallel computing in the social sciences, Solenberger anticipates that commercial vendors will soon develop parallelized versions of their software, particularly applications used for immense data collections.

In addition to the MPI software, Raghunathan and Stout are working on Web-based access to data sets stored on supercomputers. Although both projects are new concepts in the social science field, Stout is hopeful.

“With a strong and persistent interdisciplinary team, we intend to raise the standard for analysis of complex social science problems,” he says.

This article is revised from one that appeared in Online: News about the NPACI and SDSC Community, © 2000, by Kimberly Mann Bruch. The publication is on the Web at www.npaci.edu/online.
