Promise and challenges of sharing ‘big data’ explored at new computing facility

Yale’s annual symposium on big data emphasized the human values at the heart of vast swaths of information.

Whether it’s medical researchers sifting through mountains of genomic data, historians poring over thousands of old photographs, or astrophysicists discerning the universe via statistical algorithms, Yale scholars use big data in ways that make research more open, interconnected, and insightful.

“Data ties in with Yale fulfilling its mission,” said Steve Girvin, deputy provost for research and the Eugene Higgins Professor of Physics, at the start of the Yale Day of Data presentations on Dec. 2. “We take our mission to do good in the world very seriously. The free exchange of ideas is a core principle and that includes data.”

The Yale Center for Research Computing (YCRC) hosted the 2016 Yale Day of Data at its new facility at 160 St. Ronan Street. Established in 2015, the center provides guidance, training, and digital frameworks for Yale researchers across all disciplines.

Such expertise is now essential in managing the complexity of computational needs on campus, according to Kiran Keshav, executive director of the center. “Underpinning all of these disciplines is a need for a robust cyber infrastructure,” Keshav said. “It’s not just the technology itself, but also access to skill sets on how to use this infrastructure, as well as the training to keep up to speed with the latest and ever-changing technology trends.”

That might include pairing a new faculty member with a YCRC support specialist to plan for the transfer of the faculty member’s extensive research data from another institution; it might mean building platforms for storing and evaluating climate change data, or devising new ways to map the structure of the Zika virus or sequence a genome.

“We’re involved in everything computational on campus,” Keshav said.

The YCRC joins a number of centers and projects on campus that conduct cutting-edge data research and formulate policy for best practices in using large, sensitive databases. Those entities include the Yale Open Data Access (YODA) project, the Institution for Social and Policy Studies’ data archive, the Yale Institute for Network Science, the Digital Humanities Lab, the Data Science Initiative, the StatLab at the Center for Science and Social Science Information, various upcoming initiatives around data science, and a growing number of campus hackathon events.

More broadly, officials said, Yale is committed to taking on the larger issues that accompany big data: widening public access to knowledge, securing the privacy of personal information, improving the reproducibility of scientific findings, and building more public trust in research.

The case for open data

A trio of keynote speakers advocated for the need to keep data open whenever possible.

Psychologist Brian Nosek, co-founder and director of the Center for Open Science at the University of Virginia, outlined a number of open data tools he and his colleagues have developed, such as the SHARE platform and the Open Science Framework. He said these services allow researchers to share large datasets more easily, and to incorporate open data practices into each step of the research process.

Nosek also is a proponent of creating new “preprint” archives for research data as an alternative to publication in peer-reviewed journals. Nosek said separating the act of publishing research from the process of evaluating that work would free researchers from the “tyranny” of the academic publication process.

Erin McKiernan, a professor of physics in the biomedical physics program at the National Autonomous University of Mexico, also pushed for the use of open data protocols. She said opening up research data makes it easier to reproduce experiments and head off the “crisis” of confidence about results produced in some disciplines.

Yale’s Harlan Krumholz, the Harold H. Hines Jr. Professor of Medicine, director of YODA, and YCRC faculty co-director, gave several examples of how the restrictive use of data in medical studies has been detrimental to the profession. A certain degree of interpretation is built into many studies, Krumholz said, and there are cases in which different researchers could start with the same raw data and reach different outcomes.

“Transparency is the most important treatment of that problem,” Krumholz said.

Spanning the disciplines of research

The humanities and social sciences are seeing waves of data transform their research, as well.

Laura Wexler, professor of American studies, women’s, gender and sexuality studies, and film and media studies, gave a presentation on the Photogrammar project, which is providing interactive digital tools for looking at 170,000 images taken by the United States Farm Security Administration (FSA) and the Office of War Information from 1935 to 1945.

Photogrammar, a digital tool developed at Yale, has enabled Wexler and her colleagues to map and analyze the FSA photo collection in new ways for academic research. The project also gives the public greater access to a trove of historic images.

“These are difficult archives to search through, physically,” Wexler said. “Even in a week of looking, it’s too unwieldy, too difficult to search effectively.”

Anthropologists Melanie Martin of Yale and Bret Beheim of the Max Planck Institute for Evolutionary Anthropology noted that big data presents different challenges and opportunities for their field.

Replication of results is especially tough for anthropologists, they said, because fieldwork is often conducted in rapidly changing environments among fluid populations. In addition, anthropologists are highly protective of the communities they study and the personal data they collect. Yet the wealth of data being collected — by GPS devices, for example — has been a boon to the discipline, Martin and Beheim said.

Yale assistant professor of political science Allan Dafoe urged researchers in his discipline to make more of their survey data available, including the text of all surveys conducted for studies. Dafoe suggested that the future of scientific conversation may come in the form of online communities that discuss each other’s work, with access to all pertinent data.

Ethics, challenges, and opportunities

Other Day of Data panelists talked about government, galaxies, and genomes.

Beth Simone Noveck, a Florence Rogatz Visiting Professor at Yale Law School, discussed the important policy decisions that emerge from big data, from rooting out government corruption to better coordinating our response to natural disasters. Large, responsibly curated databases, open to all, can help uncover discrimination, monitor the spread of disease, and bring more attention to social issues, she said.

“As we talk about all of the fabulous things we can do with big data, we must put the policies of privacy, of openness, of ethical and responsible sharing, and of governance, at the center,” she said.

Arif Harmanci, an associate research scientist in molecular biophysics and biochemistry at Yale, echoed the need for sophisticated privacy safeguards when dealing with big data. He noted that without safeguards, “linking attacks” can be used to compare clinical databases and identify people in even a protected or anonymous database. “These are going to be the real challenges we face in the next couple of years,” Harmanci said.

Yet perhaps the most expansive use of big data comes in astrophysics. The field is seeing exponential growth in data, forming a nexus of sorts between big data and supercomputing. There is so much data that statistical algorithms are necessary to sort through it all and understand the universe, said Yale associate professor of physics and astronomy and YCRC faculty co-director Daisuke Nagai.

“I think of it as a cosmic genome project,” Nagai said.

Sponsors for the event were the Office of the Provost, the Yale Center for Research Computing, the Center for Science and Social Science Information, the Institution for Social and Policy Studies, the Digital Humanities Lab, the Yale Institute for Network Science, the Center for Teaching and Learning, and the Sigma Xi Distinguished Visitor Fund.