The use of surnames to impute missing ethnicity data in the South African National Cancer Registry database

Research Square (2022)

Abstract
The National Cancer Registry (NCR) of South Africa (SA) calculates cancer incidence rates from full pathology reports submitted by South African private and public health care laboratories, and presents the incidence data by ethnic group. Because collecting ethnicity data is a sensitive matter in post-apartheid South Africa, reporting sources submit a large proportion of cancer cases without population group/ethnicity information. The absence of ethnicity data is a significant challenge to cancer incidence reporting. An imputation method was developed to impute the missing ethnicities by using surnames with known patient-reported ethnicities. The method was evaluated with a hold-out test in which the ethnicities of 50% (n = 332,232) of the NCR records with known ethnicity, from 1986 to 2014, were masked. The masked ethnicities were imputed and then compared to the patient-reported ethnicities. The method correctly classified 94.31% of ethnicities. Sensitivities and specificities were calculated per ethnic group (Asian, Black, Coloured, White). The imputation method performed well for the Asian, Black and White ethnic groups, but poorly for the Coloured ethnic group. The strong relationship between surnames and ethnic groups, as evidenced by the results, mitigates the significant concern of whether surname is itself predictive of ethnicity. Despite the increasing proportion of missing data over the years, the percentage of correctly classified individuals remains high across the test dataset. This study demonstrates the strength of the imputation methodology. However, given the large disparities between the private and public healthcare sectors in SA, all sources should report all cancer cases with complete information, so that cancer incidence can be reported accurately without the need to impute missing data.
Challenges remain around collecting sensitive data such as ethnicity in SA, and these warrant further discussion.
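The hold-out evaluation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact imputation rule, so the sketch assumes the simplest one: each surname is assigned the group most often reported for it among records with known ethnicity. The function names, the placeholder surnames, and the toy records are all hypothetical and are not taken from the NCR data; only the one-vs-rest definitions of sensitivity and specificity are standard.

```python
from collections import Counter, defaultdict


def fit_surname_table(records):
    """Map each surname to its most frequently reported group.

    Assumption: a majority-vote rule; the paper's actual rule may differ.
    """
    counts = defaultdict(Counter)
    for surname, group in records:
        counts[surname.strip().upper()][group] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def impute(table, surname, default=None):
    """Return the imputed group for a surname, or `default` if unseen."""
    return table.get(surname.strip().upper(), default)


def sensitivity_specificity(true_groups, predicted_groups, group):
    """One-vs-rest sensitivity and specificity for a single group."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_groups, predicted_groups):
        if truth == group:
            if pred == group:
                tp += 1
            else:
                fn += 1
        elif pred == group:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy records with placeholder surnames (hypothetical, not NCR data).
known = [
    ("SurnameA", "Black"), ("SurnameA", "Black"), ("SurnameA", "Coloured"),
    ("SurnameB", "White"), ("SurnameC", "Asian"),
]
# Held-out records whose reported groups are masked before imputation.
held_out = [
    ("surnamea", "Black"), ("SurnameA", "Coloured"),
    ("SurnameB", "White"), ("SurnameC", "Asian"),
]

table = fit_surname_table(known)
truth = [group for _, group in held_out]
predicted = [impute(table, surname) for surname, _ in held_out]
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
per_group = {g: sensitivity_specificity(truth, predicted, g)
             for g in ("Asian", "Black", "Coloured", "White")}
```

In this toy example a surname shared by two groups is always assigned to the majority group, so the minority group's sensitivity drops while its specificity stays high. This is one plausible mechanism for weaker performance in a group whose surnames overlap with others, though the abstract does not say why the Coloured group performed poorly.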
Keywords
missing ethnicity data, surnames, cancer