Building Big Data, One Swab at a Time

Mar 14, 2013, 2:00 PM, Posted by Nancy Barrand

Watch PBS NewsHour's feature, "Researchers Aim to Unlock Genetic Data Goldmine for Vital Medical Information," on the Kaiser biobank to learn more about how Catherine Schaefer, Neil Risch and 200,000 Kaiser members are accelerating the pace of medical research and bringing the future potential of genomics into the here and now.

When the Robert Wood Johnson Foundation launched the Pioneer Portfolio, my colleagues and I asked ourselves what fields might produce the greatest potential game-changers for health and health care. Genomics was at the top of the list. The human genome had been mapped and fantastic discoveries had begun to blossom, but a true era of personalized medicine still seemed too far off.

So we set out to do what Pioneer does best. We explored and learned. We networked. We asked a lot of questions. And we began to hunt down ideas.

On March 12, PBS NewsHour did a feature story on one of the big ideas that came out of that process: the world’s largest, deepest, and most diverse “biobank.” The story presented a good opportunity to share the backstory.

The Promise of Big Data

I first encountered the idea of biobanks, which are large-scale repositories for biological samples (such as DNA, blood, and tissue) collected for research purposes, in 2007, when the UK began releasing findings from its biobank, one of the first in the world. The UK’s earliest research revealed several new genes linked to depression and diabetes.

Discovering genetic links isn’t just about finding strings of interesting letters and numbers; it has the potential to help scientists and doctors create and prescribe medications more effectively, and even crack the code of diseases like prostate cancer and Alzheimer’s. So I was impressed and excited by the research that the UK’s biobank made possible.

But the biobank’s initial research paper had an important caveat: the sample had been taken only from people of northern European descent, and therefore could not support any conclusions about genetic factors in other populations, such as those of African descent. That caveat struck me profoundly: How we set out to gather research data could ultimately contribute to disparities in health and health care.

As major government-run biobanks flourished in other countries, private biobanks began rapidly proliferating in the United States, making it clear that a government-run biobank would, for better or worse, not happen here. The Pioneer Portfolio received proposals for repositories of all types. There were many ideas for collecting and analyzing data, but they didn’t provide a way to link data sets together to create a statistically large and diverse enough sample to be representative of the U.S. population.

A Big Idea for Really Big Data

Then, in 2007, we heard from representatives at Kaiser Permanente in California, and they had an audacious vision. Their biobank would have 500,000 genetic samples, be linked to Kaiser’s electronic health records system (which went back more than 15 years), be updated every 24 hours, and be linked to comprehensive behavioral and environmental data. The samples would come from the 2 million members in the northern California Kaiser region, one of the most ethnically and racially diverse populations in the U.S.

In short, Kaiser wanted to create the largest, most comprehensive, most diverse biobank in the world. If it succeeded, it could radically transform the speed and scale of genomic research. We would know more than ever before about how who we are (and where we live, learn, work, and play) impacts our health and the health care we may need.

And we saw a way to help make it happen.

In 2009, RWJF made an $8.6 million grant to the Kaiser Research Program on Genes, Environment and Health (RPGEH). The grant focused on gathering the first 200,000 genetic samples and developing the procedures needed to make this data gold mine accessible to researchers outside Kaiser.

Of course, one can’t talk about accessibility to genomic data without also talking about privacy. So it was critical that Kaiser had already spent two years educating its members and developing a fully informed and transparent consent process for how the genetic material would be processed, stored, and used. By protecting privacy so thoroughly, Kaiser could maximize the biobank’s accessibility, which was vital to RWJF’s funding decision.

Once RPGEH collected its first 100,000 samples and proved it had strong policy procedures and the full commitment of Kaiser behind it, major federal investment followed. The National Institutes of Health provided $24.8 million in stimulus monies. The biobank was well on its way to becoming an unprecedented, powerful, and truly national resource.

Big Data’s Biggest Heroes

Ultimately, the size of the repository is not the only thing that makes Kaiser’s biobank the poster child for medicine’s use of big data. It’s also the outsized contributions of the scientists who have built this biobank swab by swab over several years, most notably Catherine Schaefer, Ph.D., an epidemiologist who directs Kaiser’s Research Program on Genes, Environment and Health (RPGEH), and Neil J. Risch, Ph.D., the renowned statistical geneticist and director of the University of California, San Francisco Institute for Human Genetics. And it’s the 200,000-plus Kaiser members who have voluntarily swabbed their cheeks and given blood samples in the hope that their genetic information could save lives and alleviate suffering.

Tell us what you think: Now that we have this national scientific treasure, how should we use it? What health and medical questions would you like to see the Kaiser biobank answer?