When directly obtained sources of race, ethnicity and language data, such as from self-reports, are unavailable or impractical to obtain quickly, health plans and providers can use a variety of indirect methods to estimate their members' likely race, ethnicity and language preferences. The two most commonly used indirect methods are geocoding and surname analysis.1 Either approach can be used alone, but various types of combined approaches are increasingly used to improve accuracy.2
Did You Know? Geocoding and surname analysis are the most commonly used indirect methods of race, ethnicity and language data collection.
Approaches that employ both methods are preferred because of increased accuracy.
The most obvious reason to use indirect methods is that most plans still lack race, ethnicity and language data on most or all of their member population, and obtaining the data through self-report can be lengthy and expensive. For example, even with significant company leadership and a system in place for data capture, Aetna has data on one-quarter of its active enrollees at any point in time. Although a few smaller regional plans that followed Aetna's lead have obtained a similar proportion of self-reported data in less time, collecting data on a larger percentage of members will likely take several more years. Though not a replacement for self-reported data, indirect methods of obtaining race, ethnicity and language data can help plans and providers quickly begin assessing disparities at relatively little cost.
Top Tips: Plans should use indirect methods of data collection to:
- Provide a quick, though less reliable, method of assessing disparities
- Demonstrate the existence of disparities among plan members to health plan leadership and stakeholders
- Supplement direct methods of data collection (e.g., indirect data on total plan membership can supplement direct data on a subset of membership).
The majority of the plans participating in the NHPC began their efforts with indirect estimates of race and ethnicity.3 Indirect data demonstrated to plan leadership and other internal stakeholders that there were disparities in care and illustrated some of the ways race, ethnicity and language data could be used to target resources to member populations with apparent disparities. At the same time, some plans (e.g., HealthPartners and Highmark Inc.) recognized the need and urgency of obtaining self-reported data because of the uncertainty of the precision of indirect estimates for determining the race and ethnicity of a single member (versus a group of members) and directly targeting interventions to the individual member based on this information. Other plans, such as WellPoint, Inc., concluded that continued refinements to indirect methods and improved accuracy made the indirect approach a viable interim strategy for effectively targeting their efforts. They were also reluctant to rapidly scale up collection of self-reported data across their system until health information technology and data coding standards, including race/ethnicity coding, were standardized nationally.
Top Tips: Though indirect estimates of race, ethnicity and language for individual members should be performed at the block group (or census tract) level, it is fine to aggregate the member estimates to a higher geographic level such as ZIP code or county as needed for reporting or mapping patterns of care.
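The aggregation described in this tip can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the member records, area labels ("Z1", "Z2") and probabilities are invented for the example, and the function name `aggregate_by_area` is hypothetical. Each member keeps the probabilities estimated at the block group level, and those probabilities are summed to give expected counts for the larger reporting area.

```python
from collections import defaultdict

# Hypothetical member-level estimates. Each member carries probabilities
# derived at the block group level plus a label for the larger reporting
# area (here a made-up ZIP code label).
members = [
    {"zip": "Z1", "probs": {"black": 0.80, "white": 0.15, "other": 0.05}},
    {"zip": "Z1", "probs": {"black": 0.60, "white": 0.30, "other": 0.10}},
    {"zip": "Z2", "probs": {"black": 0.10, "white": 0.85, "other": 0.05}},
]


def aggregate_by_area(members, area_key="zip"):
    """Sum member-level probabilities to get expected counts per area."""
    totals = defaultdict(lambda: defaultdict(float))
    for member in members:
        for group, prob in member["probs"].items():
            totals[member[area_key]][group] += prob
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {area: dict(groups) for area, groups in totals.items()}


expected_counts = aggregate_by_area(members)
```

Because the estimation itself happens at the block group level, rolling up to a ZIP code or county afterward changes only how the results are reported.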
Indirect data collection methods can supplement missing data and be useful to assess disparities. Health plans should devise methods to validate the indirect methods with direct methods. Health plans can cross-check direct race, ethnicity and language data from a sample of members with indirect method estimates of their entire membership. Indirect methods can also be used to increase accuracy of some types of direct data where misclassification occurs. For example, despite significant improvements in overall accuracy of CMS Medicare race and ethnicity data, a substantial proportion of Hispanics are classified as "white" or "other," preventing more targeted analysis of disparities among Hispanics. The indirect method of surname analysis using Hispanic surname dictionaries can be used to reclassify most of those individuals.4
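One way to carry out the cross-check described above is to compare self-reported and indirectly estimated categories for the sample of members who have both. The sketch below is an assumption about how a plan might do this; the sample pairs are invented, and the function name `validate` is hypothetical. It reports sensitivity (the share of members who self-report a group and are also assigned to it indirectly) and positive predictive value (the share of members assigned to a group indirectly who actually self-report it).

```python
def validate(pairs, group):
    """Compare indirect estimates with self-reports for one group.

    pairs: list of (self_reported_label, indirect_label) tuples.
    Returns sensitivity and positive predictive value (None if undefined).
    """
    true_pos = sum(1 for s, e in pairs if s == group and e == group)
    false_neg = sum(1 for s, e in pairs if s == group and e != group)
    false_pos = sum(1 for s, e in pairs if s != group and e == group)
    sensitivity = (
        true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    )
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    return {"sensitivity": sensitivity, "ppv": ppv}


# Invented sample of members with both self-reported and indirect data.
sample = [
    ("hispanic", "hispanic"),
    ("hispanic", "hispanic"),
    ("hispanic", "white"),
    ("white", "white"),
    ("white", "hispanic"),
    ("asian", "asian"),
]

hispanic_accuracy = validate(sample, "hispanic")
```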
In the sections that follow, you will find brief descriptions of two of the most common approaches health plans have used for indirect estimation of race/ethnicity (geocoding and surname analysis), as well as some newer methodologies that are substantially improving the accuracy and reliability of these approaches.5 You also will find ways indirect approaches can be used to estimate other member characteristics, such as language or socioeconomic status.
Strictly speaking, "geocoding" refers to the process of assigning a geographic identifier to a person or object located in a given area, such as converting an address into a census code designating a census area (e.g., a specific census tract) or geographic coordinates (e.g., latitude and longitude).6 However, for our purposes, geocoding is a method in which information about the social characteristics of the neighborhood or community a person lives in is used to infer information about them. In these respects, geocoded measures are best viewed as reflecting the characteristics of the community or neighborhood individuals live in rather than being a direct proxy for that person's characteristics. For example, knowing that a person lives in a neighborhood where eight of 10 residents are African-American provides useful information for estimating that individual's race. Similarly, knowing that a member lives in a neighborhood where less than 1 percent of the residents live below the poverty level and housing values are high can be useful in determining the member's probable socioeconomic status. The initial step of geocoding involves converting members' addresses to a geographic identifier such as a census tract code. This step is a straightforward process that can be done easily using commercially available software or vendors. Keep in mind that cost and accuracy vary depending on the software or vendor.
Geocoding: A method in which information about the social characteristics of the neighborhood or community a person lives in is used to infer information about them.
Steps in Geocoding:
- Convert members' addresses to a geographic identifier such as a census tract code
  - Use commercially available software or vendors; do some comparison shopping for cost and accuracy
  - Data on members' areas can also be obtained from the U.S. Census Bureau (staff need basic programming skills)
- Determine the geographic level of information for the indirect estimates of race, ethnicity or language
  - Census block groups provide the best level of detail and homogeneity for making inferences (area = small neighborhoods)
  - Census tracts are larger than census block groups but smaller than ZIP codes (area = about 4,000 residents)
  - ZIP codes are least preferable because of the large area they cover; limited ability to make inferences (area = >10,000 residents).
A related step is deciding what geographic level of information the indirect estimates will be based on. For example, a common mistake is to use ZIP code level information (e.g., average income level) as a proxy for an individual's socioeconomic status. ZIP codes generally include relatively large areas containing tens of thousands of people, often with widely varying racial/ethnic characteristics. Geocoding to the census tract level is a much better approach since these areas average only about 4,000 residents and are designed to demarcate populations with relatively homogeneous social characteristics. However, it is not uncommon for a given census tract to include pockets of both poverty and affluence. Therefore, when possible, indirect estimates of race, ethnicity or language should be based on information obtained at the census block group level.
These areas roughly correspond to small neighborhoods with 1,000 residents or fewer.6
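The preference for the smallest available geographic level can be expressed as a simple lookup with a fallback. The sketch below assumes members have already been geocoded to block group and tract identifiers. The identifiers, the composition tables and the function name `area_based_estimate` are all hypothetical, and real composition figures would come from census data.

```python
# Hypothetical census composition tables keyed by area identifier.
# Values are the share of residents in each group (invented numbers).
BLOCK_GROUP_MIX = {
    "BG-001": {"black": 0.80, "white": 0.15, "other": 0.05},
}
TRACT_MIX = {
    "T-01": {"black": 0.55, "white": 0.35, "other": 0.10},
}


def area_based_estimate(block_group_id, tract_id):
    """Return (level, composition), preferring the block group level.

    Falls back to the census tract when no block group data are
    available, and returns (None, None) when neither is found.
    """
    if block_group_id in BLOCK_GROUP_MIX:
        return "block_group", BLOCK_GROUP_MIX[block_group_id]
    if tract_id in TRACT_MIX:
        return "tract", TRACT_MIX[tract_id]
    return None, None


level, composition = area_based_estimate("BG-001", "T-01")
```

Recording which level was used for each member makes it easier to flag estimates that rest on coarser tract data.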
Surname analysis uses a person's last name to estimate the likelihood that they belong to a particular racial or ethnic group. For example, a person whose last name is Lopez has a reasonably high likelihood of being Hispanic, whereas it is a reasonable bet that a person whose last name is Chang is Asian. Based on this logic, researchers have developed a number of surname dictionaries that include names that have a relatively high probability of belonging to a specified racial or ethnic group. The most widely used dictionaries focus on Hispanic or Asian surnames; separate surname lists have been generated for Chinese, Indian, Japanese, Korean, Filipino and Vietnamese Americans. Experimental dictionaries for identifying Arab Americans are under development.7 More recently, the U.S. Census Bureau released a new surname list that includes nearly 90 percent of all surnames in the U.S. Census, including predictive probabilities that individuals with a given surname belong to each of six racial/ethnic categories (White, Black, Asian and Pacific Islander (API), American Indian and Alaska Native (AIAN), two or more races, and Hispanic).8 Although use outside of the U.S. Census Bureau is still limited, the list offers numerous advantages over prior lists in accuracy and flexibility, and it could become the industry standard.
Did You Know? The U.S. Census Bureau released a new surname list that includes nearly 90 percent of all surnames in the U.S. Census; this list could become the industry standard.
Studies assessing the accuracy of surname analysis using older surname lists confirm the approach is reasonably accurate, at least for identifying persons likely to be Hispanic or Asian. Most validation studies, for example, show that surname lists can correctly classify about eight of every 10 Hispanic members and seven of every 10 Asian members. However, accuracy can vary considerably depending on the concentration or prevalence of a given racial/ethnic group in an area or region. For instance, individuals with "Lee" as a surname are much more likely to truly be Asian if they live in San Francisco, where the proportion of Asians is relatively high, than individuals with the same last name living in Atlanta, where there are proportionately fewer Asians and Lee is a surname more commonly used by non-Asians. This sort of variation can be largely overcome by employing Bayesian methods that adjust estimates based on the prevalence of different racial/ethnic groups in the area. Plans and providers considering surname analysis should remember that this approach, by itself, is generally not very useful for identifying African Americans or Whites since these groups tend to have less distinctive surnames than Hispanic or Asian individuals.
Did You Know? Surname analysis is best used for identifying persons likely to be Hispanic or Asian.
Top Tips: Remember that accuracy can vary considerably in surname analysis depending on the concentration or prevalence of residents belonging to a given racial/ethnic group in an area or region.
Bayesian methods (see section below on Bayesian methods) that adjust estimates based on the prevalence of different racial and ethnic groups in the area can improve accuracy.
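A surname dictionary lookup can be sketched as follows. This is an illustration under stated assumptions, not an actual surname list: the probabilities in `SURNAME_PROBS` are invented, the 0.75 threshold is arbitrary, and the function names are hypothetical. A name is assigned to a group only when it appears in the table and its highest probability clears the threshold; otherwise no assignment is made.

```python
# Hypothetical surname table: probability that a person with the given
# surname belongs to each group. Numbers are invented for illustration.
SURNAME_PROBS = {
    "LOPEZ": {"hispanic": 0.92, "white": 0.05, "other": 0.03},
    "CHANG": {"api": 0.90, "white": 0.05, "other": 0.05},
    "LEE": {"api": 0.40, "white": 0.40, "black": 0.17, "other": 0.03},
}


def normalize(surname):
    """Uppercase the name and drop spaces, hyphens and punctuation."""
    return "".join(ch for ch in surname.upper() if ch.isalpha())


def surname_estimate(surname, threshold=0.75):
    """Return the most likely group, or None if the name is too ambiguous."""
    probs = SURNAME_PROBS.get(normalize(surname))
    if probs is None:
        return None  # surname not on the list
    group, prob = max(probs.items(), key=lambda item: item[1])
    return group if prob >= threshold else None


lopez_group = surname_estimate(" Lopez ")
lee_group = surname_estimate("Lee")
```

In this sketch an ambiguous name such as "Lee" is left unassigned, which is the kind of case where local prevalence information becomes useful.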
The advantages and limitations of geocoding and surname analysis complement each other, making combined use an attractive means for inferring race/ethnicity among health plan members. Geocoding is more reliable for inferring race whereas surname analysis is better for inferring Hispanic or Asian ethnicity. Furthermore, geocoding provides estimates of the racial/ethnic composition of the area where surnames are applied. When the two methods are applied to the same geographic area (e.g., census tract, block group, or block), overall accuracy can improve. For example, a combined approach can improve the accuracy of geocoding of non-Hispanic African Americans and Whites. To verify numbers of non-Hispanic, African Americans or Whites:
Incorrect assignment of minorities to the majority White population will have relatively little effect in most instances because of much higher numbers of White, non-Hispanics.9
Top Tips: Use geocoding for inferring race; use surname analysis for inferring Hispanic or Asian ethnicity.
As noted earlier, the accuracy of indirect methods can vary depending on the prevalence of different racial/ethnic groups in a given area (e.g., the actual proportion of the local population that is Hispanic). In general, accuracy of indirect estimates drops when prevalence of a group is low and improves when it is high. This problem can be partly overcome by applying an approach, based on Bayes' theorem, similar to those used in medical decision-making. For instance, though a V-Q scan, a commonly used diagnostic test to detect a blood clot in a patient's lung, is reasonably accurate, it still may misclassify 20 percent or more of cases. Applying Bayes' theorem, doctors have learned that the likelihood that a positive test result is correct depends, in part, on whether the doctor judged the likelihood of a clot to be low or high before the test, based on the patient's clinical symptoms. For instance, if the doctor felt the patient was at high risk, even a weakly positive test may warrant treatment. Conversely, if the doctor felt there was little risk of a clot based on the patient's symptoms, it may be reasonable to discount even a moderately positive test. In a similar way, one can use prior knowledge about the plan member, such as the percentage of Asian Americans living in their neighborhood, to refine the final estimate of the likelihood that the member is Asian based on surname. Hence, we would be more confident that someone with a name on an Asian surname list was truly Asian if they lived in a neighborhood that census data indicated was predominantly Asian than if only about 1 percent of the residents were Asian. Using this approach, RAND researchers have been able to markedly improve estimates obtained with the combined geocoding and surname approach described earlier.10
Did You Know? Accuracy of indirect estimates drops when prevalence of a group is low and improves when it's high.
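The Bayesian reasoning above can be made concrete with a small worked sketch. It is a generic application of Bayes' theorem, not the specific model used by the RAND researchers. The neighborhood prevalence serves as the prior, and the likelihood is how common the surname is within each group. All numbers and the function name `bayes_update` are invented for illustration.

```python
def bayes_update(prior, likelihood):
    """Combine a neighborhood prior with surname evidence.

    prior: share of each group among neighborhood residents.
    likelihood: probability of observing this surname within each group.
    Returns the posterior probability of each group for the member.
    """
    unnormalized = {g: prior[g] * likelihood.get(g, 0.0) for g in prior}
    total = sum(unnormalized.values())
    if total == 0:
        return dict(prior)  # surname carries no information; keep the prior
    return {g: value / total for g, value in unnormalized.items()}


# Hypothetical: this surname is ten times as common among Asian residents
# as among non-Asian residents.
surname_likelihood = {"asian": 0.02, "non_asian": 0.002}

# Same surname, two neighborhoods with different Asian prevalence.
high_prevalence = bayes_update(
    {"asian": 0.30, "non_asian": 0.70}, surname_likelihood
)
low_prevalence = bayes_update(
    {"asian": 0.01, "non_asian": 0.99}, surname_likelihood
)
```

With these invented inputs, the same surname yields a posterior probability of being Asian of roughly 0.81 in the neighborhood where 30 percent of residents are Asian, but only about 0.09 where 1 percent are. This mirrors the point that a name on a surname list is more convincing evidence where the group is more prevalent.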
Health care disparities exist throughout the health care system. Each NHPC health plan has used data on the race, ethnicity or language preference of its members as a critical decision-making tool to target quality improvement programs in the effort to reduce disparities. As highlighted in the case study below, plans have used various methods such as geographic information system (GIS) mapping and decision tools to inform their efforts to reduce disparities.
1. Fiscella K and Fremont AM. "Use of Geocoding and Surname Analysis to Estimate Race and Ethnicity." Health Services Research, 41(4 Pt 1):1482-500, 2006. (http://www.rwjf.org/qualityequality/product.jsp?id=15406)
2. Elliott M, Fremont AM, Lurie N, et al. "A New Method for Estimating Racial/Ethnic Disparities where Administrative Records Lack Self reported Race/Ethnicity." Health Services Research, 2008. (http://www.rand.org/cgi-bin/health/showab.cgi?key=2008_134&year=2008)
3. Lurie N, Fremont AM, et al. "The National Health Plan Collaborative to Reduce Disparities and Improve Quality." The Joint Commission Journal on Quality and Patient Safety, 34(5):256-265, 2008. (http://www.rwjf.org/files/research/lurienhpcjointcommissionarticle.pdf)
4. Morgan RO, Wei II, Virnig BA. "Improving Identification of Hispanic Males in Medicare: Use of Surname Matching." Medical Care, 42(8):810-816, 2004. (http://www.ncbi.nlm.nih.gov/pubmed/15258483)
5. Lurie N, Fremont AM, et al. "The National Health Plan Collaborative to Reduce Disparities and Improve Quality." The Joint Commission Journal on Quality and Patient Safety, 34(5):256-265, 2008. (http://www.rwjf.org/files/research/lurienhpcjointcommissionarticle.pdf)
6. Fiscella K and Fremont AM.
7. Elliott M, Fremont AM, Lurie N, et al.
8. U.S. Census Bureau. Demographic Aspects of Surnames. Available at http://www.census.gov/genealogy/www/surnames.pdf, 2008.
9. Fiscella K and Fremont AM.
10. Elliott M, Fremont AM, Lurie N, et al.