Trustworthy Evaluation of Clinical AI for Analysis of Medical Images in Diverse Populations


Fajtl, J; Welikala, RA; Barman, S; Chambers, R; Bolter, L; Anderson, J; Olvera-Barrios, A; Shakespeare, R; Egan, C; Owen, CG; Tufail, A; Rudnicka, AR (2024) Trustworthy Evaluation of Clinical AI for Analysis of Medical Images in Diverse Populations. NEJM AI, 1 (9). ISSN 2836-9386 https://doi.org/10.1056/aioa2400353
SGUL Authors: Rudnicka, Alicja Regina; Owen, Christopher Grant

	Accepted Version: Microsoft Word (.docx), 1MB
	Supplementary Material: Microsoft Word (.docx), 567kB

Official URL: http://dx.doi.org/10.1056/aioa2400353

Abstract

Background The deployment of algorithms in health care screening programs has been hindered by a lack of agreed-upon methodology to evaluate trustworthiness and equity. We outline transferable methodology for independent evaluation of algorithms using a routine, high-volume, multiethnic national diabetic eye screening program as an exemplar. Automated retinal image analysis systems (ARIAS), including artificial intelligence (AI), for detection of diabetic retinopathy (DR) could substantially increase image-grading capacity. We report technical and operational considerations relevant to implementation and evaluation in large-scale population screening.
Methods Twenty-five vendors with current or pending Conformité Européenne Class IIa ARIAS for DR detection from retinal images were invited. Sample data (6268 images) were provided to confirm that ARIAS outputs could be replicated in a trusted research environment. We curated consecutive routine screening encounters between January 1, 2021, and December 31, 2022, at the North East London Diabetic Eye Screening Programme for evaluation. Sample size calculations focused on precision for detection of severe DR by population subgroups, particularly ethnicity. Vendor algorithms did not have access to human grading data or other metadata during image processing.
Results Eight of 25 eligible vendors participated. In total, 202,886 encounters were evaluated, representing 1.2 million images from 32% white, 17% Black, and 39% South Asian ethnic groups, including approximately 25,000 cases requiring referral to ophthalmology for review and treatment. Image resolutions varied from 150 × 300 to 6000 × 4000 pixels. Time from study invitation to ARIAS installation and algorithm verification ranged from 96 to 460 days; image processing required between 13.5 hours and 105 days.
Conclusions This comparison of ARIAS at scale on a range of images with different characteristics, including a population of different ethnicities, wide age range, levels of deprivation, and spectrum of DR, provides the framework for transparent, equitable, robust, and trustworthy evaluation of clinical AI in screening to inform standards in health care before deployment. (Funded by the NHS Transformation Directorate and The Health Foundation and managed by the National Institute for Health and Social Care Research.)

Item Type:

Article

Additional Information:

From NEJM AI, Trustworthy Evaluation of Clinical AI for Analysis of Medical Images in Diverse Populations, 1(9), Copyright © 2024 Massachusetts Medical Society. Reprinted with permission.

SGUL Research Institute / Research Centre:

Academic Structure > Population Health Research Institute (INPH)

Journal or Publication Title:

NEJM AI

ISSN:

2836-9386

Language:

English

Publisher License:

Publisher's own licence

Projects: