Radiomics Quality Score - RQS 2.0 (Under Development)

The radiomics quality score (RQS) 2.0 consists of 36 checkpoints that reward or penalize radiomics studies to encourage best scientific practice. Each checkpoint indicates the FUTURE-AI principle it promotes (Fairness, Universality, Traceability, Usability, Robustness, Explainability) and applies to handcrafted radiomics (HCR), deep learning (DL), or both.

This model is currently under development. For suggestions and feedback, please contact any of the following:
philippe dot lambin at maastrichtuniversity dot nl
z dot salahuddin at maastrichtuniversity dot nl
s dot mali at maastrichtuniversity dot nl


 

Please select the type of radiomics study
Handcrafted Radiomics (HCR)
Deep Learning (DL)
Unmet clinical need (UCN) defined
Uni-centre

Multi-centre

International Multi-centre
Classification of the model: diagnostic, theragnostic, predictive, prognostic, follow-up
Defined

Not Clearly Defined
Input from clinicians for interpretable pipeline development; discussion regarding the choice of an appropriate explainability method
Clinical knowledge incorporated in the pipeline or explainability method decided and agreed with the clinician before model development

Clinical knowledge not incorporated in the pipeline or explainability method not decided or agreed with the clinician before model development
Image protocol quality to be documented following the TRIAC level (Transparent Reporting of Medical Image Acquisition for future-proof radiomics). The TRIAC guidelines describe five levels of evidence for reporting imaging protocols.

Level 0 indicates that the protocol has not been formally approved with a reference number.

Level 1 indicates that the protocol has been approved with a reference number in the archive of the department.

Level 2 indicates that the protocol has been approved with formal quality assurance (recommended minimum level for prospective trials).

Level 3 indicates that the protocol is established internationally and has been published in guideline documents and peer-reviewed papers.

Level 4 indicates that the protocol is future-proof, i.e., it follows TRIAC Level 3, adheres to the FAIR principles, and retains raw data.
Protocols are well documented

Public protocol is used
Hardware used described and image reconstruction method specified
Description of the hardware used for image acquisition

Information about image reconstruction method e.g. convolutional kernel
Preprocessing of the images
Accounted for variation in slice thickness/ convolution kernel/ contrast

Not accounted for variation in slice thickness/ convolution kernel/ contrast
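For the preprocessing checkpoint above, a minimal sketch of one common way to account for variation in slice thickness is given below: every volume is resampled to a common voxel spacing before feature extraction. The array sizes, spacings, and target spacing are illustrative assumptions, not part of the RQS.

```python
# Minimal sketch: resample a CT volume to a common voxel spacing so that
# downstream features are not confounded by variation in slice thickness.
# Array sizes and spacings are illustrative assumptions.
import numpy as np
from scipy import ndimage

def resample_volume(volume, spacing_mm, target_spacing_mm=(1.0, 1.0, 1.0)):
    """Resample a 3D image array ordered (z, y, x) to the target voxel spacing."""
    zoom_factors = [s / t for s, t in zip(spacing_mm, target_spacing_mm)]
    # Linear interpolation (order=1); use order=0 for segmentation masks.
    return ndimage.zoom(volume, zoom=zoom_factors, order=1)

ct = np.random.rand(40, 128, 128)                      # placeholder for a loaded CT volume
ct_iso = resample_volume(ct, spacing_mm=(5.0, 0.97, 0.97))
print(ct.shape, "->", ct_iso.shape)                    # (40, 128, 128) -> roughly (200, 124, 124)
```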
Imaging at multiple time points - collect individuals’ images at additional time points. Analyze feature robustness to temporal variabilities (e.g., organ movement, organ expansion/shrinkage)
Yes

No
Inclusion and exclusion criteria or boundaries of the model defined (e.g. a CT with 6 mm slice thickness cannot be analyzed)
Inclusion or exclusion criteria specified

Inclusion or exclusion criteria not specified
Phantom study on all scanners - detect inter-scanner differences and vendor-dependent features. Analyze feature robustness.
Yes

No
The diversity and distribution of patient groups in the training and test datasets should be reported to identify potential biases and apply appropriate corrective measures
Mitigation strategies are applied to counter the biases

Mitigation strategies are not applied to counter the biases
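One way to surface the biases this checkpoint asks about is to report model performance per patient subgroup before deciding on mitigation strategies. The sketch below assumes a pandas DataFrame with hypothetical column names (y_true, y_score, sex); it is an illustration, not a prescribed procedure.

```python
# Minimal sketch (column names are assumptions): report model performance per
# patient subgroup to surface potential biases before choosing mitigations.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    "y_true": [0, 1, 1, 0, 1, 0, 1, 0],
    "y_score": [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.1],
    "sex": ["F", "F", "M", "M", "F", "M", "F", "M"],
})

for group, sub in df.groupby("sex"):
    auc = roc_auc_score(sub["y_true"], sub["y_score"])
    print(f"sex={group}: n={len(sub)}, AUC={auc:.2f}")
```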
Use of post-processing harmonization to reduce multi-centre acquisition variability, e.g. ComBat for HCR and CycleGANs for DL
Yes

No
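The sketch below illustrates the idea behind post-processing harmonization with a simplified per-centre location/scale alignment. It is not the full empirical-Bayes ComBat algorithm (a dedicated implementation such as the neuroCombat package would normally be used for HCR); the feature and centre names are illustrative assumptions.

```python
# Minimal sketch of post-processing harmonization: a simplified per-centre
# location/scale alignment, not the full empirical-Bayes ComBat algorithm.
# Feature and centre names are illustrative assumptions.
import numpy as np
import pandas as pd

def harmonize_location_scale(features: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Align per-centre feature mean and standard deviation to the pooled data."""
    pooled_mean, pooled_std = features.mean(), features.std(ddof=1)
    harmonized = features.copy()
    for centre in batch.unique():
        idx = batch == centre
        centre_mean = features.loc[idx].mean()
        centre_std = features.loc[idx].std(ddof=1).replace(0, 1.0)
        harmonized.loc[idx] = (features.loc[idx] - centre_mean) / centre_std * pooled_std + pooled_mean
    return harmonized

rng = np.random.default_rng(0)
feats = pd.DataFrame({"glcm_contrast": rng.normal(10, 2, 100) + np.repeat([0, 3], 50)})
centres = pd.Series(np.repeat(["centre_A", "centre_B"], 50))
print(harmonize_location_scale(feats, centres).groupby(centres).mean())
```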
Method and statistical plan pre-registered on a public platform (e.g. www.osf.io)
Yes

No
Training dataset coming from one centre, two centres, three centres, or more
One Centre

Two Centres

Three Centres
Multiple segmentations - possible actions are segmentation by different physicians/algorithms/software, perturbing segmentations by (random) noise, segmentation at different breathing cycles. Analyze feature robustness to segmentation variabilities
Yes

No
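Feature robustness to segmentation variability is commonly quantified with an intraclass correlation coefficient. The sketch below computes ICC(2,1) (two-way random effects, single measurement) for one feature measured on several segmentations per patient; the synthetic data and any retention threshold (e.g. ICC > 0.75-0.90) are illustrative assumptions.

```python
# Minimal sketch: quantify feature robustness across multiple segmentations
# with a two-way random-effects, single-measurement ICC (ICC(2,1)).
# The feature matrix (patients x segmentations) is an illustrative assumption.
import numpy as np

def icc_2_1(y: np.ndarray) -> float:
    """ICC(2,1) for a matrix of shape (n_subjects, n_segmentations)."""
    n, k = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1, keepdims=True)
    col_means = y.mean(axis=0, keepdims=True)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)
    mse = ((y - row_means - col_means + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(1)
true_value = rng.normal(0, 1, size=(20, 1))               # per-patient feature value
feature = true_value + rng.normal(0, 0.2, size=(20, 3))   # three segmentations per patient
print(f"ICC(2,1) = {icc_2_1(feature):.2f}")
```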
Feature reduction based on a test-retest dataset, another method, or adjustment for multiple testing - decreases the risk of overfitting. Consider feature robustness when selecting features
Either measure is implemented

Neither measure is implemented
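As one example of adjustment for multiple testing, the sketch below corrects univariate feature-outcome tests with the Benjamini-Hochberg false discovery rate. The synthetic data, the Mann-Whitney test, and the 0.05 threshold are illustrative assumptions; an ICC filter like the one sketched under the multiple-segmentations checkpoint can serve the robustness part of this checkpoint.

```python
# Minimal sketch of adjustment for multiple testing: univariate
# feature-vs-outcome tests corrected with Benjamini-Hochberg FDR.
# Synthetic data and the 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 50))        # 100 patients, 50 radiomics features
y = rng.integers(0, 2, size=100)      # binary outcome

p_values = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
                     for j in range(X.shape[1])])
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} of {len(p_values)} features survive FDR correction")
```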
Multivariable analysis with non-radiomics features (e.g., age, EGFR mutation) - expected to provide a more holistic model. Permits correlation/inference between radiomics and non-radiomics features
Yes

No
Cut-off analyses - determine risk groups using the median, a previously published cut-off or method, or report a continuous risk variable. Reduces the risk of reporting overly optimistic results
Yes

No
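A minimal sketch of a cut-off analysis that avoids overly optimistic reporting: the median cut-off is derived on the training cohort only and applied unchanged to the validation cohort. The score arrays are illustrative assumptions.

```python
# Minimal sketch: derive a median cut-off on the training set only and apply
# it unchanged to the validation set, so risk groups are not tuned to the
# validation data. The score arrays are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
train_scores = rng.normal(0.5, 0.2, size=120)   # radiomics signature, training cohort
valid_scores = rng.normal(0.55, 0.2, size=60)   # radiomics signature, validation cohort

cutoff = np.median(train_scores)                # fixed before looking at validation data
valid_high_risk = valid_scores > cutoff
print(f"cut-off = {cutoff:.3f}, high-risk patients in validation: "
      f"{valid_high_risk.sum()}/{len(valid_scores)}")
```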
Random permutations to assess the risk of overfitting - randomize the input variable so that, ideally, the resulting AUC does not differ from 0.5.
Yes

No
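The sketch below shows a permutation test using scikit-learn's permutation_test_score: labels are permuted repeatedly and the cross-validated AUC should collapse towards 0.5 if the apparent performance is not driven by overfitting or leakage. The synthetic data and the logistic-regression model are illustrative assumptions.

```python
# Minimal sketch of a permutation test: labels are permuted repeatedly and the
# cross-validated score should collapse towards chance (AUC ~ 0.5) if the
# model is not overfitting. Synthetic data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    scoring="roc_auc", cv=5, n_permutations=200, random_state=0)
print(f"AUC = {score:.2f}, permuted AUC = {perm_scores.mean():.2f}, p = {p_value:.3f}")
```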
Investigate both handcrafted radiomics and deep learning, or a combination thereof, in an ensemble. Radiomics features may also help in the interpretability of deep learning
Comparative analysis or ensemble of HCR and DL approaches

No comparative analysis or ensemble of HCR and DL approaches
Quality Management System
Available online with internal audit

ISO certification or equivalent with external audit
Discrimination statistics - report discrimination statistics (e.g., C-statistic, ROC curve, AUC) and their statistical significance (e.g., p-values, confidence intervals). One can also apply a resampling method (for example, bootstrapping, cross-validation).
A discrimination statistic and its statistical significance are reported

A resampling technique is also applied
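A minimal sketch of reporting a discrimination statistic together with a resampling-based uncertainty estimate: the AUC with a non-parametric bootstrap confidence interval. Predictions and labels are illustrative assumptions.

```python
# Minimal sketch: report the AUC together with a bootstrap confidence
# interval. Predictions and labels are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=200), 0, 1)

auc = roc_auc_score(y_true, y_score)
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) < 2:      # skip degenerate resamples
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC = {auc:.2f} (95% bootstrap CI {lo:.2f}-{hi:.2f})")
```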
Calibration statistics - report calibration statistics (e.g., Calibration-in-the-large/slope, calibration plots) and their statistical significance (e.g., p-values, confidence intervals). One can also apply a resampling method (for example, bootstrapping, cross-validation).
A calibration statistic and its statistical significance are reported

A resampling technique is also applied
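A minimal sketch of calibration reporting: the calibration slope and intercept obtained by regressing the observed outcome on the logit of the predicted probability, plus the binned values underlying a calibration plot. The predicted probabilities are illustrative assumptions, and the intercept shown is only an approximation of calibration-in-the-large (which is usually estimated with the slope fixed at 1).

```python
# Minimal sketch: calibration slope/intercept from regressing the outcome on
# the logit of the predicted probability, plus a binned calibration curve.
# Predictions are illustrative assumptions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
p_pred = np.clip(rng.beta(2, 2, size=300), 1e-6, 1 - 1e-6)   # model probabilities
y_true = rng.binomial(1, p_pred ** 1.3)                      # slightly miscalibrated "truth"

logit = np.log(p_pred / (1 - p_pred)).reshape(-1, 1)
recal = LogisticRegression(C=1e6).fit(logit, y_true)         # large C: effectively unpenalized
print(f"calibration slope = {recal.coef_[0, 0]:.2f}, intercept = {recal.intercept_[0]:.2f}")

prob_true, prob_pred = calibration_curve(y_true, p_pred, n_bins=10)
print(np.round(np.c_[prob_pred, prob_true], 2))              # values behind a calibration plot
```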
Comparison with previously published radiomics signatures and models
Yes

No
Validation - the validation is performed without retraining and adaptation of the cut-off value, providing crucial information about credible clinical performance.
Validation is missing

Validation is based on a dataset from the same institute

Validation is based on a dataset from another institute

Validation is based on two datasets from two distinct institutes

The study validates a previously published signature

Validation is based on three or more datasets from distinct institutes
Prospective study registered in a trial database (real-world or in silico), with sample size calculation - provides the highest level of evidence supporting the clinical validity and usefulness of the radiomics biomarker
Prospective validation

The trial is pre-registered

A resampling technique is also applied
Algorithm tested in a clinical environment, e.g. a department of Radiology or Nuclear Medicine
Yes

No
Evaluation should also be carried out with respect to the sources of discriminative bias that have been identified
Yes

No
Detect and discuss biological correlates - demonstration of phenotypic differences (possibly associated with underlying gene-protein expression patterns) deepens understanding of radiomics and biology
Yes

No
Details on the intrinsic or post-hoc interpretability method or uncertainty estimation method utilized (e.g. attribution maps, SHAP analysis). Evaluation of the explanations using in silico trials or clinicians.
Details on interpretability methods or uncertainty estimation are available

Sanity checks and/or evaluation of the explanations are available
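As one example of a post-hoc interpretability method, the sketch below computes SHAP attributions for a tree-based model trained on radiomics-like features, assuming the shap package and its TreeExplainer. The synthetic features and the choice of model are illustrative assumptions.

```python
# Minimal sketch of a post-hoc interpretability method: SHAP attributions for
# a gradient-boosted model. The synthetic features and model choice are
# illustrative assumptions; the shap package is assumed to be installed.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))                                   # radiomics-like features
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)                           # per-patient, per-feature attributions
print(np.abs(shap_values).mean(axis=0))                          # mean |SHAP| per feature
```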
Comparison to 'gold standard' - assess the extent to which the model agrees with or is superior to the current 'gold standard' method (e.g. physician assessment, TNM staging for survival prediction). This comparison shows the added value of radiomics
Yes

No
Potential clinical utility - report on the current and potential application of the model in a clinical setting (e.g., decision curve analysis)
Yes

No
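Potential clinical utility is often reported with a decision curve analysis. The sketch below computes the net benefit of the model across a few threshold probabilities and compares it with treat-all and treat-none strategies; the predictions, labels, and thresholds are illustrative assumptions.

```python
# Minimal sketch of a decision curve analysis: net benefit of the model across
# threshold probabilities, compared with treat-all and treat-none strategies.
# Predictions, labels, and thresholds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(8)
y_true = rng.integers(0, 2, size=300)
y_prob = np.clip(y_true * 0.35 + rng.normal(0.35, 0.15, size=300), 0, 1)

n = len(y_true)
prevalence = y_true.mean()
for pt in (0.1, 0.2, 0.3, 0.4, 0.5):
    pred_pos = y_prob >= pt
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    nb_model = tp / n - fp / n * pt / (1 - pt)
    nb_all = prevalence - (1 - prevalence) * pt / (1 - pt)
    print(f"pt={pt:.1f}: model={nb_model:.3f}, treat-all={nb_all:.3f}, treat-none=0.000")
```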
Cost-effectiveness analysis - report on the cost-effectiveness of the clinical application (e.g., QALYs generated).
Yes

No
Level of automation for the clinical practice.

At level 0 (No Automation), a clinician performs the clinical task without using the radiomics model.

At level 1 (Clinical Assistance), the clinician uses the radiomics model’s prediction for a part of the clinical task.

At level 2 (Partial Automation), the clinician considers the radiomics model’s prediction for the clinical task before making the final recommendation.

At level 3 (Conditional Automation), the radiomics model provides the predictions for the clinical task under supervision and the clinician can intervene at any time.

At level 4 (High Automation), the radiomics model provides the predictions and the clinician’s intervention is required for special (out-of-distribution) cases.

At level 5 (Full Automation), the radiomics model provides predictions for the clinical task without human intervention.

Level 0 (No Automation)

Level 1 (Clinical Assistance)

Level 2 (Partial Automation)

Level 3 (Conditional Automation)

Level 4 (High Automation)

Level 5 (Full Automation)
The algorithm, source code, and coefficients are made publicly available. Add a table detailing the different versions of software & packages used.
Yes

No
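A minimal sketch of assembling the suggested software-version table directly from the running environment; the package names listed are illustrative assumptions.

```python
# Minimal sketch: collect the versions of the software packages used, as a
# basis for the suggested version table. Package names are assumptions.
import platform
from importlib import metadata

packages = ["numpy", "scikit-learn", "SimpleITK", "pyradiomics"]
print(f"Python {platform.python_version()}")
for name in packages:
    try:
        print(f"{name}: {metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```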
Imaging data, segmentations, and clinical data made publicly available (open science and data)
Scans are open source

The ROI/segmentations are open source

Clinical, non-DICOM data, and outcomes are open source
Define strategy to update models (frequency, approach, access to data, etc.)
Yes

No


Link to the original article

https://www.nature.com/articles/nrclinonc.2017.141