Improving Public Health Materials on the Web Takes Human and Automated Indexing

In 2002–03, the Center for Natural Language Processing at Syracuse University developed a method for extracting key information from public health reports. The method will make it easier to find information on public health interventions.

Online databases such as Medline let clinicians search the medical literature for the best treatments for particular conditions and for other medical research. However, no comparable tool exists for finding reports on public health interventions.

Project staff began by assembling a digital collection of public health documents. Sources included county and state public health agencies and the New York Academy of Medicine's grey literature collection. Staff then convened a team of public health professionals to identify the most important elements of the reports, and wrote rules that would allow a computer system to recognize and extract these elements.
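To illustrate what rule-based extraction of this kind can look like, here is a minimal sketch in Python. It is not the project's actual system: the element names, the regular-expression rules, the function extract_elements, and the sample text are all hypothetical, chosen only to show how hand-written rules might pull elements out of report text.

```python
import re
from typing import Dict, Optional

# Hypothetical rules: each element name maps to a regex with one capture group.
# These patterns are illustrative assumptions, not the project's actual rules.
RULES = {
    "problem": re.compile(
        r"\b(?:to reduce|to prevent|to address)\s+([^.;]+)", re.IGNORECASE
    ),
    "population": re.compile(
        r"\b(?:among|targeting|aimed at)\s+([^.;,]+)", re.IGNORECASE
    ),
    # Case-sensitive on purpose: relies on capitalized place names.
    "location": re.compile(
        r"\bin\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s(?:County|City))"
    ),
}


def extract_elements(text: str) -> Dict[str, Optional[str]]:
    """Apply every rule to the text; return the first match per element, or None."""
    results: Dict[str, Optional[str]] = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        results[name] = match.group(1).strip() if match else None
    return results


# Made-up sample text for demonstration only.
sample = (
    "The program operates in Onondaga County. "
    "It was launched to reduce childhood lead poisoning. "
    "Outreach is aimed at families in older rental housing."
)

for element, value in extract_elements(sample).items():
    print(f"{element}: {value}")
```

A real system would need far more rules and would have to cope with elements, such as the document title, that depend on layout rather than wording.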

Key Results

  • Public health professionals reached a consensus on the most important elements that would have to be extracted to produce a searchable database of public health documents. Six key elements were identified:

    1. The problem addressed by the intervention.
    2. A description of the intervention.
    3. The population targeted for the intervention.
    4. The document type.
    5. The geographical location.
    6. The document title.
  • An evaluation of the elements extracted from 39 documents found that most elements had been identified with an accuracy ranging from 64 to 90 percent.

    Overall, 46 percent of the documents yielded a complete and accurate summary, 41 percent yielded some information but incomplete summaries, and 13 percent did not yield useful summaries. The hardest element to identify was the document title, which is often difficult to pick out of a PDF file.
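As a rough check on these figures, the percentages can be converted into approximate document counts. The sketch below assumes that the overall percentages refer to the same 39 evaluated documents, which the summary does not state explicitly; the variable names are illustrative.

```python
# Assumption: the overall percentages describe the same 39 evaluated documents.
TOTAL_DOCUMENTS = 39

# Reported shares, in percent, by summary outcome.
reported_shares = {
    "complete and accurate": 46,
    "incomplete": 41,
    "not useful": 13,
}

# The three categories should account for every document.
assert sum(reported_shares.values()) == 100

# Approximate number of documents per category, rounded to whole documents.
approx_counts = {
    label: round(TOTAL_DOCUMENTS * pct / 100)
    for label, pct in reported_shares.items()
}

# Back-check: recompute the percentages from the rounded counts.
recomputed = {
    label: round(100 * count / TOTAL_DOCUMENTS)
    for label, count in approx_counts.items()
}

for label, count in approx_counts.items():
    print(f"{label}: about {count} of {TOTAL_DOCUMENTS} documents")
```

Under that assumption the counts come to about 18, 16, and 5 documents, which sum to 39 and round back to the reported 46, 41, and 13 percent.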