Health Services Research and Data Linkages

Issues, Methods, and Directions for the Future

The National Cancer Institute’s SEER program is a model for creating a powerful data set from patient files and registries in different states; SEER overcomes limitations of separate data sets.

Related health data are often scattered among disparate public and private organizations. Linking these complementary data sets is a necessary step in producing meaningful research.

This article guides health professionals through the data linkage process. The authors provide an overview of files that are likely to be related but are kept by separate organizations, and they identify agencies that own and control significant amounts of data. Because of cancer's high incidence and societal burden, the authors use cancer care research to illustrate important issues that arise when linking data sets. The authors also present five basic steps for linking databases and explain the distinction between deterministic and probabilistic matching.
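To make the distinction between the two matching approaches concrete, the following is a minimal Python sketch, not the authors' method. Deterministic matching links two records only when a shared identifier agrees exactly. Probabilistic matching instead sums agreement weights across several fields. The weighting shown follows the commonly used Fellegi-Sunter style of log-ratio weights, which is an assumption here because the summary does not name a specific technique. The field names, the m/u probabilities, and the sample records are all hypothetical and invented for illustration.

```python
from math import log2

# Hypothetical per-field probabilities (illustrative values, not from the article):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_WEIGHTS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "zip_code": (0.90, 0.05),
}


def deterministic_match(a, b, key="ssn"):
    """Link only when the shared identifier is present and identical."""
    return bool(a.get(key)) and a.get(key) == b.get(key)


def probabilistic_score(a, b, weights=FIELD_WEIGHTS):
    """Sum log2 agreement/disagreement weights over the comparison fields."""
    score = 0.0
    for field, (m, u) in weights.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            score += log2(m / u)              # agreement raises the score
        else:
            score += log2((1 - m) / (1 - u))  # disagreement lowers it
    return score


# Made-up records: the registry entry lacks an SSN, and the ZIP codes differ.
registry = {"ssn": None, "last_name": "Rivera",
            "birth_date": "1950-03-14", "zip_code": "60614"}
claim = {"ssn": "123-45-6789", "last_name": "Rivera",
         "birth_date": "1950-03-14", "zip_code": "60657"}

is_exact_link = deterministic_match(registry, claim)
pair_score = probabilistic_score(registry, claim)
```

In this sketch the deterministic rule cannot link the pair because one identifier is missing, while the probabilistic score remains positive because name and birth date agree. In practice a study would compare the score against a threshold chosen for its own data.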

Key Findings:

  • Two data sets can be linked only if they share at least one common identifier (e.g., Social Security numbers or insurance claim numbers); race and ethnicity are poor identifiers because they are inconsistently reported.
  • The government creates linked data sets at a lower cost than private research teams do.
  • The NCI’s Community Cancer Centers Program, a public-private partnership, has the goal of improving information sharing among cancer centers in more than 20 hospitals.
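The first finding above, that linkage requires at least one common identifier, can be sketched in a few lines of Python. This is an illustrative assumption about how such a check might look, not a procedure described in the article. The helper names `shared_identifiers` and `link_on`, the field names, and the sample records are all hypothetical.

```python
def shared_identifiers(records_a, records_b, candidates=("ssn", "claim_number")):
    """Return the candidate identifier fields populated in every record of both files."""
    def populated(records, field):
        return all(r.get(field) for r in records)

    return [f for f in candidates
            if populated(records_a, f) and populated(records_b, f)]


def link_on(records_a, records_b, key):
    """Join the two files on an exactly matching identifier value."""
    index = {r[key]: r for r in records_b}
    return [{**a, **index[a[key]]} for a in records_a if a[key] in index]


# Made-up example files: a registry extract and a claims extract.
registry = [
    {"claim_number": "A1", "stage": "II"},
    {"claim_number": "B2", "stage": "I"},
]
claims = [
    {"claim_number": "A1", "cost": 1200},
    {"claim_number": "C3", "cost": 300},
]

keys = shared_identifiers(registry, claims)
linked = link_on(registry, claims, keys[0]) if keys else []
```

Here only `claim_number` is present in both files, so it is the sole usable linkage key, and only the record that appears in both files ends up in the linked result.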

In the U.S., related health information is often kept by unconnected organizations. Linked data systems can lower research costs and avoid the duplication of data within a given study. This article offers guidance for data linkage and provides examples of successful linked data sets in cancer research.