OSNABRÜCK, GERMANY – A comprehensive study led by Dr. Yusuf Brima of Osnabrück University and Dr. Marcellin Atemkeng of Rhodes University in South Africa has revealed that while certain artificial intelligence saliency map methods show promise for medical disease detection, significant limitations persist in their reliability and clinical application. The research team conducted the first empirical investigation using advanced quantitative metrics called Performance Information Curves (PICs) to evaluate these AI explanation tools across brain tumor MRI (Magnetic Resonance Imaging) scans and COVID-19 chest X-ray images.
Saliency maps are visual “heat maps” that artificial intelligence systems generate to show which parts of medical images influenced their diagnostic decisions. Think of them as highlighting tools that reveal where an AI model “looked” when making a diagnosis – similar to how a doctor might point to specific areas on an X-ray when explaining their findings. This study represents a critical advancement because previous medical imaging research primarily relied on visual inspection rather than rigorous statistical analysis to determine whether these AI explanation tools actually work as intended.
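For readers curious about the mechanics, the snippet below is a minimal sketch of the simplest such technique, a vanilla gradient saliency map, written in PyTorch. The pretrained ResNet and the random input are placeholders standing in for the study’s medical models and images; this illustrates the general concept rather than the paper’s pipeline.

```python
import torch
import torchvision.models as models

# Any image classifier works here; a pretrained ResNet stands in for the
# study's medical models.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Placeholder input: one 3-channel 224x224 image. A real pipeline would load
# and normalize an MRI slice or chest X-ray instead.
image = torch.rand(1, 3, 224, 224, requires_grad=True)

# Forward pass, then backpropagate the winning class score to the pixels.
scores = model(image)
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()

# The saliency "heat map" is the input-gradient magnitude: large values mark
# pixels whose small changes would most move the predicted score.
saliency = image.grad.abs().max(dim=1).values.squeeze()  # shape (224, 224)
```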
The research evaluated nine different deep learning model architectures and eight distinct saliency map generation methods across two medical imaging datasets containing nearly 23,000 images in total. The study employed innovative Accuracy Information Curves (AICs) and Softmax Information Curves (SICs), which measure how effectively saliency methods identify regions that contribute to accurate medical diagnoses by analyzing the correlation between highlighted areas and model performance.
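The exact PIC methodology is more involved, but its core recipe can be sketched simply: progressively reveal the pixels a saliency map ranks highest and track the model’s output as more of the image becomes visible. The function below is a simplified illustration of that idea under assumed tensor shapes, not the paper’s implementation; the name `softmax_curve` and its parameters are hypothetical.

```python
import numpy as np
import torch

def softmax_curve(model, image, saliency, target,
                  fractions=np.linspace(0.05, 1.0, 20)):
    """Reveal the most salient pixels first and track model confidence.

    A simplified stand-in for a Softmax Information Curve; the published
    procedure also estimates each image's information content, omitted here.
    `image` is (1, C, H, W), `saliency` is (H, W), `target` is a class index.
    """
    order = saliency.flatten().argsort(descending=True)
    confidences = []
    for f in fractions:
        # Keep only the top f-fraction of pixels by saliency rank.
        mask = torch.zeros(order.numel(), device=image.device)
        mask[order[: int(f * order.numel())]] = 1.0
        revealed = image * mask.view(1, 1, *image.shape[-2:])
        with torch.no_grad():
            prob = torch.softmax(model(revealed), dim=1)[0, target]
        confidences.append(prob.item())
    # The area under this curve summarizes how well the map ranks
    # diagnostically informative pixels.
    return np.trapz(confidences, fractions)
```

Running the same curve with a randomly shuffled `order` yields the random-mask baseline that, as reported below, sophisticated methods sometimes failed to beat.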
Leading Methods Show Clinical Promise
Visual inspection indicates that methods such as ScoreCAM, XRAI, GradCAM, and GradCAM++ consistently produce focused and clinically interpretable attribution maps, demonstrating their ability to elucidate model reasoning on these datasets. ScoreCAM, a gradient-free technique that uses confidence scores rather than mathematical gradients, achieved the highest effectiveness rating for brain tumor detection, with an Area Under the Curve (AUC) value of 0.084, a measure of how well the method retains critical diagnostic information.
“These methods highlighted possible biomarkers, exposed model biases, and offered insights into the links between input features and predictions,” Dr. Brima’s research team concluded. The ScoreCAM method proved particularly effective because it weighs the importance of different brain regions based on the AI model’s confidence levels rather than relying on potentially noisy gradient calculations, making it more stable for medical applications.
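In outline, the method works roughly as sketched below: each convolutional activation map is upsampled to the input size, used to mask the image, and weighted by the model’s confidence on that masked input. This is a simplified sketch; the published method also normalizes scores against a baseline, and the forward-hook code that captures `activations` is omitted.

```python
import torch
import torch.nn.functional as F

def score_cam(model, image, activations, target):
    """Gradient-free ScoreCAM-style heat map.

    `activations` are feature maps from a chosen conv layer, shape (1, K, h, w),
    typically captured with a forward hook (not shown).
    """
    K = activations.shape[1]
    H, W = image.shape[-2:]
    weights = []
    for k in range(K):
        # Upsample one activation channel to input size, normalize to [0, 1].
        act = F.interpolate(activations[:, k:k + 1], size=(H, W),
                            mode="bilinear", align_corners=False)
        act = (act - act.min()) / (act.max() - act.min() + 1e-8)
        # The weight is the model's confidence on the masked image; note
        # that no gradients are computed anywhere in this loop.
        with torch.no_grad():
            weights.append(torch.softmax(model(image * act), dim=1)[0, target])
    weights = torch.stack(weights).view(1, K, 1, 1)
    # Positive-weighted sum of activation maps is the final heat map.
    cam = F.relu((weights * activations).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=(H, W), mode="bilinear", align_corners=False)
```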
For brain tumor classification, the study found that InceptionResNetV2 achieved the highest diagnostic accuracy with an F1 score of 0.95, correctly identifying 69 out of 72 meningioma cases, 133 out of 143 glioma cases, and 91 out of 92 pituitary tumor cases in their test dataset.
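Per-class recall follows directly from those counts, as the short calculation below confirms (the reported F1 score additionally folds in precision, which the counts alone do not determine).

```python
# Recall per tumor class and overall accuracy from the reported test counts.
correct = {"meningioma": 69, "glioma": 133, "pituitary": 91}
total = {"meningioma": 72, "glioma": 143, "pituitary": 92}

for cls in correct:
    print(f"{cls}: recall = {correct[cls] / total[cls]:.3f}")
# meningioma: 0.958, glioma: 0.930, pituitary: 0.989

accuracy = sum(correct.values()) / sum(total.values())
print(f"overall accuracy = {accuracy:.3f}")  # 293 / 307 = 0.954
```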
Significant Reliability Concerns Emerge
However, the Softmax Information Curves revealed substantial variability, with random saliency masks outperforming established methods in some instances, underscoring the need to combine visual and empirical metrics for a comprehensive evaluation. In other words, randomly highlighting areas of a medical image sometimes correlated better with the model’s diagnostic performance than sophisticated explanation methods did, a concerning result that calls the fundamental reliability of current saliency map approaches into question.
The research revealed troubling inconsistencies where methods like Vanilla Gradient and SmoothGrad produced coarse and noisy saliency maps that provided little clinical value. Path-integration methods, which trace mathematical pathways through the AI model’s decision process, proved particularly susceptible to highlighting image edges rather than actual disease-related features, reducing their medical interpretability.
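Integrated Gradients, a representative member of the path-integration family, shows where that edge artifact can originate: attributions are accumulated along a straight-line path from a baseline image to the real input, so the choice of baseline directly shapes the resulting map. The sketch below is a generic textbook-style implementation, not the study’s code.

```python
import torch

def integrated_gradients(model, image, target, baseline=None, steps=50):
    """Average input gradients along a straight path from baseline to input.

    The baseline choice matters: an all-zeros (black) baseline makes the
    interpolation sweep across strong intensity edges, one plausible source
    of the edge-highlighting behavior described above.
    """
    if baseline is None:
        baseline = torch.zeros_like(image)
    grads = torch.zeros_like(image)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolated point on the path, with gradients enabled at the input.
        point = (baseline + alpha * (image - baseline)).requires_grad_(True)
        score = model(point)[0, target]
        grads += torch.autograd.grad(score, point)[0]
    # Riemann approximation of the path integral, scaled by the input delta.
    return (image - baseline) * grads / steps
```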
Clinical Implementation Challenges Identified
Dr. Brima emphasized that the deployment of attribution maps alone is insufficient for establishing comprehensive model explainability in healthcare settings. The research team noted that unlike brain tumor datasets which include expert-verified tumor boundaries, many medical imaging datasets lack ground-truth annotations (expert-verified correct answers), making it difficult to verify whether AI explanations accurately identify disease-relevant features.
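Where expert boundaries do exist, as with the brain tumor MRIs used here, one straightforward quantitative check is the overlap between a thresholded saliency map and the annotation. A minimal sketch, with an illustrative function name and threshold:

```python
import numpy as np

def saliency_iou(saliency, tumor_mask, threshold=0.5):
    """Intersection-over-Union between a saliency map and an expert mask.

    `saliency` is a 2-D float array (any scale); `tumor_mask` is a boolean
    2-D array marking the expert-verified tumor region.
    """
    # Rescale the saliency map to [0, 1] and binarize it.
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    hit = s >= threshold
    intersection = np.logical_and(hit, tumor_mask).sum()
    union = np.logical_or(hit, tumor_mask).sum()
    return intersection / max(union, 1)
```

An IoU near 1 means the explanation concentrates on the verified tumor region; a low score flags a map that may be attending to irrelevant anatomy, which is exactly what cannot be checked when annotations are missing.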
“The results underscore the importance of selecting appropriate saliency methods for specific medical imaging tasks and suggest that combining qualitative and quantitative approaches can enhance the transparency, trustworthiness, and clinical adoption of deep learning models in healthcare,” the researchers stated. This finding has significant implications for hospitals and medical centers considering AI diagnostic tools, as it suggests that visual inspection alone cannot determine the reliability of AI explanations.
Healthcare Industry Implications
For healthcare providers, this research indicates that AI diagnostic tools require more rigorous validation before clinical deployment, particularly in high-stakes medical environments where incorrect explanations could influence patient care decisions.
The study’s findings come at a critical time when healthcare systems worldwide are increasingly incorporating AI diagnostic tools. The research demonstrates that while AI can achieve high diagnostic accuracy – with some models performing comparably to experienced radiologists – the explanation systems designed to help doctors understand AI decisions require significant improvement before widespread clinical adoption.
The study analyzed 3,064 T1-weighted contrast-enhanced MRI brain images from 233 patients and 19,820 chest X-ray images across multiple international COVID-19 testing facilities, representing one of the largest comparative evaluations of saliency map reliability in medical imaging. This multi-center approach provides robust evidence about the current limitations and capabilities of AI explanation systems in medical diagnosis.
Key Takeaways
- ScoreCAM and XRAI methods demonstrate superior performance in medical imaging saliency mapping, achieving higher accuracy retention scores than traditional gradient-based approaches.
- Random saliency highlighting sometimes outperformed sophisticated AI explanation methods, revealing fundamental reliability concerns requiring immediate attention from healthcare technology developers.
- Combined visual and quantitative evaluation approaches are essential for validating AI explanation tools before clinical deployment in high-stakes medical environments.
References
Brima, Y., Atemkeng, M. Saliency-driven explainable deep learning in medical imaging: bridging visual explainability and statistical quantitative analysis. BioData Mining 17, 18 (2024). https://doi.org/10.1186/s13040-024-00370-4.