Big News in Big Data: NIH Launches Largest and Most Diverse Genetics Database Ever Created

Feb 26, 2014, 7:21 PM, Posted by Nancy Barrand


Eighteen years ago this month, Big Data had a cultural coming-out party when IBM's Deep Blue defeated world chess champion Garry Kasparov in a game. Kasparov was a chess genius. But Deep Blue could mine the records of 700,000 grandmaster chess games and evaluate 200 million positions per second. The famously nimble Kasparov ultimately could not match the brute computing force of Deep Blue.

This week we mark another historic milestone in Big Data history. This time, there is more at stake than bragging rights from a chess competition. 

On February 26, the National Institutes of Health (NIH) announced it had added comprehensive genetic data for a cohort of 78,000 people to its online genetics database, known as the database of Genotypes and Phenotypes (dbGaP). The transfer of data is a down payment on what is envisioned to be the largest and most diverse repository of high-quality genetic data in the world.

This data donation is the product of a collaboration between Kaiser Permanente's Research Program on Genes, Environment and Health and the University of California, San Francisco (UCSF) Institute for Human Genetics. Since 2007 and with support from the Robert Wood Johnson Foundation, researchers at Kaiser Permanente have been collecting saliva samples from volunteering Kaiser Permanente members. In 2009, Kaiser Permanente and UCSF collaborated to genotype DNA from the saliva samples for more than 650,000 genetic markers per person. The genetic data was matched with each member's longitudinal electronic medical records as well as extensive survey data on their health habits and backgrounds. It was also linked to one of the world’s most comprehensive environmental databases.

If you ever wanted a big data set to study what makes a culture of health—including genes, social and environmental factors, and behavior—you would want something a lot like the biobank Kaiser and UCSF built. 

In addition to diseases and conditions traditionally associated with aging, such as cardiovascular disease, cancer and osteoarthritis, researchers worldwide can now, because of this deposit, use dbGaP to explore the potential genetic underpinnings of other diseases, including depression, insomnia, diabetes, and certain eye diseases. Researchers will also be able to use the database to retroactively confirm or disprove studies that used data from relatively small numbers of people. The database will also serve as a source of controls that researchers can compare with individuals who have the conditions they are studying.

The genetic information for all 78,000 individual patients translates into over 55 billion bits of genetic data. Like Deep Blue calculating moves with mind-blowing speed, researchers who access the database will be able to look at millions of genetic markers at the same time. With this addition, dbGaP will save researchers time and money. In doing so, it will ultimately save lives.

Thanks to this huge data set, researchers won't have to go through the expensive and painstaking process of collecting, storing and genotyping their own biosamples. Instead, they can extract and study volumes of valuable genetic information from their computers.

And this first transfer is just the beginning. Kaiser Permanente's Research Program on Genes, Environment and Health (RPGEH) has already collected more than 200,000 genetic samples from Kaiser volunteers and is aiming to reach half a million samples, all with the goal of accelerating health research worldwide.

Much of the credit for this breakthrough resource goes to the research team led by Kaiser's Catherine Schaefer, PhD, and UCSF's Neil Risch, PhD. Credit also goes to the 78,000 Kaiser members who volunteered their genes and medical data for the good of human health.

It is also worth remembering the role the federal government must play in supporting research that has the potential to improve health on such a massive scale. The Robert Wood Johnson Foundation may have gotten the ball rolling with an initial $8.6 million grant to start this project, but it was a later $24.9 million grant from NIH to complete the work that made this historic data transfer possible.

Just as we saw in Deep Blue vs. Kasparov, Big Data has changed the game. This time, with the potential to save lives.