The National Cancer Institute’s SEER program is a model for creating a powerful data set from patient files and registries in different states; SEER overcomes limitations of separate data sets.
Groups of related health data are often scattered among disparate public and private organizations. Linking these complementary data sets is a necessary step in producing meaningful research.
This article guides health professionals through the data linkage process. The authors provide an overview of files that are likely to be related but are kept by separate organizations, and they identify agencies that own and control significant amounts of data. Because of cancer's high incidence and societal burden, cancer care research is used to illustrate important issues that arise when linking data sets. The authors present five basic steps for linking databases and explain the distinction between deterministic and probabilistic matching.
In the U.S., related health information is often kept by unconnected organizations. Linked data systems can lower research costs and avoid the duplication of data within a given study. This article offers guidance for data linkage and provides examples of successful linked data sets in cancer research.
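The distinction between deterministic and probabilistic matching can be made concrete with a minimal sketch. It is not the authors' procedure: the records, field names, agreement probabilities, and threshold below are all hypothetical, chosen only for illustration. Deterministic matching links two records only when the chosen identifiers agree exactly; probabilistic matching, sketched here in the style of log-likelihood agreement weights, sums evidence across fields so that a single discrepancy (such as a misspelled surname) need not block a link.

```python
import math

# Hypothetical records from two unconnected sources; values are invented.
registry_record = {"ssn": "123-45-6789", "last_name": "Smith",
                   "birth_date": "1950-03-14", "sex": "F"}
claims_record = {"ssn": "123-45-6789", "last_name": "Smyth",
                 "birth_date": "1950-03-14", "sex": "F"}
unrelated_record = {"ssn": "987-65-4321", "last_name": "Jones",
                    "birth_date": "1962-11-02", "sex": "M"}


def deterministic_match(a, b, keys):
    """Link only if every key field is present and agrees exactly."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


# Assumed per-field probabilities (illustrative, not estimated from data):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PROBS = {
    "ssn": (0.95, 0.0001),
    "last_name": (0.90, 0.01),
    "birth_date": (0.95, 0.003),
    "sex": (0.98, 0.5),
}


def probabilistic_score(a, b, probs=FIELD_PROBS):
    """Sum log2 agreement or disagreement weights over all fields."""
    score = 0.0
    for field, (m, u) in probs.items():
        if a.get(field) is not None and a.get(field) == b.get(field):
            score += math.log2(m / u)              # agreement adds evidence
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score


THRESHOLD = 10.0  # hypothetical cutoff; in practice chosen by reviewing pairs

strict = deterministic_match(registry_record, claims_record,
                             ["ssn", "last_name", "birth_date"])
scored = probabilistic_score(registry_record, claims_record)
print("deterministic link:", strict)
print("probabilistic link:", scored >= THRESHOLD)
```

In this sketch the misspelled surname defeats the strict deterministic rule, while the probabilistic score still clears the assumed threshold because the other identifiers agree. The unrelated record scores below zero under the same assumed weights.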