The Michigan Institute for Data Science, or MIDAS, is leading a multi-university partnership to develop a framework for a national institute that would enable research using sensitive data while preventing its misuse and misinterpretation.
With a $2.5 million grant from the National Science Foundation, U-M researchers are collaborating with colleagues at the University of Washington and New York University to establish a Framework for Integrative Data Equity Systems, or FIDES.
“Data science continues to have a transformative impact on science and engineering, and on society at large, by enabling evidence-based decision making, reducing costs and errors and improving objectivity,” said H.V. Jagadish, the Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science, professor of electrical engineering and computer science, and director of MIDAS, a unit based in the Office of the Vice President for Research.
“However, thoughtless use of data science techniques and technologies will disproportionately harm underrepresented groups across race, gender, physical ability, sexual orientation, education and more. FIDES allows an interdisciplinary community of researchers to converge around data equity systems to address these pervasive issues.”
Jagadish and Margaret Levenstein, director of the Inter-university Consortium for Political and Social Research at U-M, along with researchers at New York University and the University of Washington, have completed initial work on methods to identify data and software issues that undermine equity in critical domains such as mobility, housing, education, economic indicators and government transparency.
Members of the disability community, for example, face chronic data equity issues in the public transit system, causing their travel needs to go unaccounted for in data-driven policy. Four potential equity issues are present in this context:
- Representation equity: The disability community is underrepresented in the data record because its members infrequently use the conventional transportation systems from which data is typically collected.
- Feature equity: Supporting accessible transportation requires comprehensive information, including subtle and sensitive features about a person’s health, income and travel needs, that is not typically available in the data record.
- Access equity: The sensitive nature of the data prevents it from being released publicly, which effectively locks researchers and companies out of studying the problem or providing solutions.
- Outcome equity: Even if this sensitive data can be shared, integrated and used to power new applications that purport to improve equity, any improvement in the quality of life of the disability community would be difficult to measure directly, and unintended consequences are rampant.
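To make the first of these issues concrete, one simple way to look for a representation gap is to compare each group's share of a dataset with its share of the population the data is supposed to describe. The sketch below is illustrative only and is not the FIDES method: the function name, the group labels and every number in it are hypothetical.

```python
from collections import Counter


def representation_ratios(records, population_shares):
    """Divide each group's share of the records by its share of the population.

    A ratio near 1.0 suggests the group appears about as often as expected;
    a ratio well below 1.0 suggests the group is underrepresented in the data.
    """
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(records)
    total = sum(counts.values())
    ratios = {}
    for group, pop_share in population_shares.items():
        data_share = counts.get(group, 0) / total
        ratios[group] = data_share / pop_share
    return ratios


# Hypothetical rider records, each labelled with the rider's group.
records = ["no_disability"] * 95 + ["disability"] * 5

# Hypothetical shares of the population the transit system should serve.
population_shares = {"no_disability": 0.8, "disability": 0.2}

ratios = representation_ratios(records, population_shares)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

In this made-up example, riders with disabilities make up 5 percent of the records but 20 percent of the assumed population, so their ratio falls far below 1.0. A check like this addresses only representation; feature, access and outcome equity would need different evidence.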
The goal is to provide sufficient information about datasets or analytical models so that researchers can decide whether they are fit to use. Do they account for biases? What are their limitations? What underlying assumptions about these data and models should users know?
Researchers are developing tools and techniques to produce “nutritional labels” for data and models, formalizing and standardizing how records are kept about data and models and how to assess their quality, especially with regard to equity issues.
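One way to picture such a label is as a small, machine-readable summary that travels with a dataset. The sketch below is only a guess at what a label might record, assuming a dataset held as a list of rows; the class, the field names and the sample survey rows are all hypothetical and are not drawn from the FIDES tools.

```python
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class DataLabel:
    """Hypothetical 'nutritional label' summarizing a dataset."""
    name: str
    n_rows: int
    missing_rate: dict = field(default_factory=dict)   # column -> fraction missing
    group_counts: dict = field(default_factory=dict)   # group value -> row count
    known_limitations: list = field(default_factory=list)


def build_label(name, rows, group_key, limitations=()):
    """Summarize a list of row dicts; None or an absent key counts as missing."""
    n = len(rows)
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: (sum(1 for row in rows if row.get(col) is None) / n) if n else 0.0
        for col in columns
    }
    groups = Counter(row.get(group_key) or "unknown" for row in rows)
    return DataLabel(
        name=name,
        n_rows=n,
        missing_rate=missing,
        group_counts=dict(groups),
        known_limitations=list(limitations),
    )


# Hypothetical survey rows; income is missing for half of the riders.
rows = [
    {"rider_id": 1, "uses_wheelchair": "yes", "income": None},
    {"rider_id": 2, "uses_wheelchair": "no", "income": 42000},
    {"rider_id": 3, "uses_wheelchair": "no", "income": None},
    {"rider_id": 4, "uses_wheelchair": "no", "income": 38000},
]

label = build_label(
    "transit_survey",
    rows,
    group_key="uses_wheelchair",
    limitations=["Income is self-reported"],
)
print(asdict(label))
```

A reader of this label could see at a glance that half the income values are missing and that only one of four rows comes from a wheelchair user, which is the kind of fitness-for-use information the labels are meant to surface.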
The collaboration has already led to the development of methods to identify bias in datasets used to train machine learning models. Building on this foundation, the researchers are developing an information system that will accept heterogeneous, potentially sensitive and biased data from both public and private sources as input, and then produce integrated, bias-adjusted, analysis-ready data products as output.
Besides supporting methodological innovation in data science, the institute will promote interaction between data scientists and domain experts looking to share expertise in data equity systems, develop and share best practices, and provide consistent support for diversity and equity efforts.
“It is not enough to understand inequity in data systems. Our goal is to address it,” Jagadish said. “As we build upon our work and address the many ways to reduce inequities, we can realize truly inclusive improvement in areas like poverty, precision health care and all levels of education. FIDES is data science, at its most fundamental level, for the public good.”