Performance evaluation

For the most representative score evaluation, leave-one-out cross-validation must be performed at the patient level, which we call Leave-One-Patient-Out Cross-Validation (LOPOCV). Hence, the algorithm is evaluated for each patient separately, with the images of all other patients used for training. In this fashion, any bias caused by images of the same patient appearing in both the training and the test set is ruled out.
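As an illustration, such a patient-level split can be generated with the LeaveOneGroupOut splitter from scikit-learn, using the patient ID of each image as the group label. The sketch below is minimal and the variable names (X, y, patient_ids) are hypothetical.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    # Hypothetical data: one feature vector and label per endoscopic image,
    # plus the ID of the patient each image belongs to.
    X = np.random.rand(12, 16)                           # 12 images, 16 features each
    y = np.array([1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0])   # 1 = lesion present, 0 = absent
    patient_ids = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])

    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(X, y, groups=patient_ids):
        # All images of exactly one patient form the test set; the images of
        # the remaining patients form the training set.
        held_out = np.unique(patient_ids[test_idx])[0]
        print(f"Patient {held_out}: {len(train_idx)} training images, "
              f"{len(test_idx)} test images")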
 
For the estimation of algorithm-related parameters, only the training data may be used. Hence, when an algorithm estimates parameters from the data, these parameters must be re-estimated for each patient separately, as if the test data were not available. It is not allowed to estimate algorithm parameters on the complete set of images prior to cross-validation.
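Continuing the sketch above, one simple way to respect this constraint is to wrap all data-dependent steps (here assumed to be feature scaling and an SVM classifier, purely as an example) in a scikit-learn Pipeline that is fitted inside each fold, so that its parameters are estimated from the training fold only.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    for train_idx, test_idx in logo.split(X, y, groups=patient_ids):
        # The scaler statistics and the SVM parameters are estimated from the
        # training fold only; the held-out patient plays no role in fitting.
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        model.fit(X[train_idx], y[train_idx])
        image_predictions = model.predict(X[test_idx])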
 
Using the LOPOCV described above, the performance is measured in three ways:
(A) The patient-based detection performance;
(B) the image-based detection performance;
(C) the annotation performance.
 
(A) The patient-based detection performance measures the ability of the system to classify patients based on their endoscopic images. Given one or more images of a patient, the system should classify the patient as either "having cancer" or "not having cancer". Using these classifications and the known pathology for each patient, the patient-based detection performance can be computed. For this, three metrics are employed: sensitivity (true positive rate), specificity (true negative rate) and a modification of the F-score, defined as the harmonic mean of sensitivity and specificity. We choose sensitivity/specificity over precision/recall, since the former are well-accepted metrics in the medical community, whereas the latter are not as widely known among medical experts. Hence, the patient-based detection performance is computed as:
 
  • Patient-based Sensitivity (PATsen) = (# True positive patients) / (# Positive patients)
  • Patient-based Specificity (PATspe) = (# True negative patients) / (# Negative patients)
  • Patient-based F1-score (PATf1s) = 2 * (PATsen x PATspe) / (PATsen + PATspe)
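For illustration, these three values could be computed with a small hypothetical helper like the one below, assuming the per-patient ground-truth labels and system decisions are available as binary arrays.

    import numpy as np

    def detection_metrics(y_true, y_pred):
        """Sensitivity, specificity and their harmonic mean (modified F1-score)."""
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        tp = np.sum(y_true & y_pred)      # true positives
        tn = np.sum(~y_true & ~y_pred)    # true negatives
        sensitivity = tp / np.sum(y_true)
        specificity = tn / np.sum(~y_true)
        f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
        return sensitivity, specificity, f1

    # Hypothetical patient-level ground truth and system decisions
    pat_sen, pat_spe, pat_f1s = detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])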
 
(B) The image-based detection performance measures the ability of the system to detect malignant lesions in individual endoscopic images. For this, we employ the same metrics as in (A), but now applied to the image classifications only. For example, if three images are available for a certain patient, the system can already decide that the patient has cancer when a lesion is detected in only one of the three images. This would yield a good patient-based detection performance, but a poor image-based detection performance. Hence, the image-based detection performance is computed as:
 
  • Image-based Sensitivity (IMGsen) = (# True positive images) / (# Positive images)
  • Image-based Specificity (IMGspe) = (# True negative images) / (# Negative images)
  • Image-based F1-score (IMGf1s) = 2 * (IMGsen x IMGspe) / (IMGsen + IMGspe)
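For example, the image-level decisions can be aggregated into a patient-level decision; the sketch below assumes, purely for illustration, that a patient is classified as "having cancer" as soon as at least one of their images is flagged. The detection_metrics helper above can then be applied to the image-level and the patient-level decisions alike.

    from collections import defaultdict

    # Hypothetical image-level predictions and the patient each image belongs to
    image_preds = {"img1": 1, "img2": 0, "img3": 0}
    image_to_patient = {"img1": 7, "img2": 7, "img3": 7}

    per_patient = defaultdict(list)
    for img, pred in image_preds.items():
        per_patient[image_to_patient[img]].append(pred)

    # Patient 7 is flagged because one of its images is flagged, so the
    # patient-based score can be good even when a lesion was found in
    # only one of the three images (lower image-based sensitivity).
    patient_preds = {pid: int(any(p)) for pid, p in per_patient.items()}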
 
(C) The annotation performance measures the quality of the annotations made by the system. Since the variation in annotations among specialists is significant (one annotates the tissue that he would resect, whereas another annotates only the malignant center of the lesion), it is hard to establish a single ground-truth annotation. Therefore, we use the annotations of five medical specialists to determine the annotation performance. For this, we define two metrics: (1) the fraction of the sweet spot that has been annotated, from now on referred to as the Sweet Spot Coverage (SSC)*, and (2) a generalized form of the Jaccard index, computed from the annotation made by the algorithm and the five expert annotations, from now on referred to as the Jaccard Index for Golden Standard ground truth (JIGS)*. These metrics are computed as follows:
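A plausible formulation of both metrics, assuming the sweet spot S is defined as the region annotated by all N experts (the exact definitions are given in the paper referenced below), is:

    \mathrm{SSC}(A, \mathbf{M}) = \frac{|A \cap S|}{|S|}, \qquad S = \bigcap_{i=1}^{N} M_i

    \mathrm{JIGS}(A, \mathbf{M}) = \frac{\left| A \cap \bigcup_{i=1}^{N} M_i \right|}{\left| A \cup \bigcap_{i=1}^{N} M_i \right|}

Under this reading, JIGS equals 1 whenever the system annotation lies between the expert consensus and the union of all expert annotations, and decreases when the annotation misses part of the consensus or extends beyond every expert annotation.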
 
where A is a binary mask containing a shape and M represents a set of N binary masks, in which each mask Mi contains a shape and i = 1, 2, ..., N. Using these metrics, the annotations of a CAD system can be compared to the set of expert annotations, with A being the system annotation and M the set of expert annotations.
 
Both metrics are computed only for correctly detected lesions, and the mean and standard deviation of both metrics over all images are taken as the performance indication for an algorithm.
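As a small sketch of this aggregation, assume that the per-image SSC and JIGS values for the correctly detected lesions have already been collected in two lists (the numbers below are made up for illustration).

    import numpy as np

    # Hypothetical per-image scores, one entry per correctly detected lesion
    ssc_values = [0.85, 0.60, 0.92, 0.74]
    jigs_values = [0.55, 0.41, 0.78, 0.63]

    ssc_mean, ssc_std = np.mean(ssc_values), np.std(ssc_values)
    jigs_mean, jigs_std = np.mean(jigs_values), np.std(jigs_values)
    # These four values form part (C) of the reported performance.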
 

The metrics described above result in a total of 10 values (A: 3, B: 3, C: 4, i.e., the mean and standard deviation of both SSC and JIGS) that define the performance of an annotation system.

 
* Both metrics will be explained in more detail in the following submitted paper:
F. van der Sommen, S. Zinger, E.J. Schoon, P.H.N. de With, "Sweet-spot training for early esophageal cancer detection", submitted to SPIE Medical Imaging 2016, San Diego, CA, USA.