Indian Journal of Medical Sciences, Vol. 62, No. 6, June 2008, pp. 217-221

ORIGINAL CONTRIBUTION

The reliability and distinguishability of ultrasound diagnosis of ovarian masses

Bagheban Alireza Akbarzadeh, Zayeri Farid, Anaraki Fatemeh Baradaran, Elahipanah Zahra
Department of Biostatistics, Shahid Beheshti University, MC, Tehran

Abstract

Background: For any radiologist, intra-observer agreement in observation and diagnostic decision making is of great importance; this applies equally to reading ultrasound images of ovarian masses and distinguishing among their categories.

Keywords: Distinguishability, ovarian mass, reliability, ultrasound

Introduction

Suppose a radiologist classifies each ultrasound in a sample on an ordinal scale at two different times, such that the first evaluation has no effect on the second. The two sets of ratings can be displayed in a contingency table and used to assess two important issues: the reliability (intra-observer agreement) of the ratings, and the observer's distinguishability in differentiating the ordered categories.
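Only the first of these indices has a closed-form definition that does not depend on a model. For an R x R table of joint proportions p_{ij}, with row and column marginals p_{i+} and p_{+j}, weighted kappa takes the standard form below; the distinguishability expression shown with it is one common model-free definition (often attributed to Darroch and McCloud) and is given only for orientation, since the distinguishability values reported in this study are obtained from the association models described under Materials and Methods.

\kappa_w = \frac{\sum_{i,j} w_{ij}\, p_{ij} - \sum_{i,j} w_{ij}\, p_{i+}\, p_{+j}}{1 - \sum_{i,j} w_{ij}\, p_{i+}\, p_{+j}}, \qquad w_{ij} = 1 - \frac{(i-j)^2}{(R-1)^2},

\Delta_{ij} = 1 - \frac{\pi_{ij}\, \pi_{ji}}{\pi_{ii}\, \pi_{jj}},

where the w_{ij} are quadratic agreement weights (linear weights w_{ij} = 1 - |i-j|/(R-1) are an equally common choice) and \pi_{ij} is the joint probability that a sample is placed in category i at the first reading and category j at the second.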
Materials and Methods

This is a prospective observational study. The data were gathered from the radiology department of Mirza Koochak Khan Hospital in Tehran, Iran, in January 2005. After obtaining consent from 40 women whose ultrasounds had been performed by an expert radiologist using a single apparatus (in order to minimize performer bias), two experienced radiologists and three less experienced radiologists evaluated these ultrasounds separately and independently and scored them 1 through 3 for benign, borderline, and malignant cases respectively. In a single-blind design, each ultrasound was re-evaluated by the same observers a second time after one week. This interval seems reasonable: the observers would not recall the ultrasounds after a week, and there would be no loss of image quality over such a short period. Cross-classification of each observer's ratings at the two times provided five different 3x3 tables, and these tables formed the basis of our analysis.

In this study, intra-observer agreement of the raters was evaluated by weighted kappa (as an index of reliability), and the observers' distinguishability in differentiating categories of ovarian tumor was assessed with two statistical models (the 'square scores association model' and the 'agreement plus square scores association model'). These models are special cases of the 'uniform association model' [10] and the 'agreement plus uniform association model' [9] respectively; their general forms are sketched below, after the first paragraph of the Results.

The observers evaluated and re-evaluated (after 1 week) 40 different ultrasounds of ovarian masses, separately and independently. The required sample size for such validity and reliability studies is usually 15 to 20 cases for quantitative variables and somewhat more for qualitative variables, so 40 cases appeared sufficient to achieve our goal and perform the study appropriately. [11]

Distinguishability in differentiating two adjacent categories reflects an observer's ability to determine the category, or status, of an ovarian mass on ultrasonography. [12],[13] The interpretation of this parameter resembles that of a coefficient in a regression model; its value varies between zero and one, with values closer to one indicating greater distinguishability by the observer, and vice versa.

SPSS version 10 was used for data entry and for producing the two-dimensional tables. SAS version 8 was used to calculate weighted kappa, fit the models, and estimate the model parameters. Distinguishability was calculated, and the figure drawn, with Excel 2003.

Results

In this study we considered three categories of ovarian mass, and each observer classified the ultrasounds at two separate times, giving five 3x3 tables. The 'square scores association model' had the best fit for the experienced radiologists, and the 'agreement plus square scores association model' had the best fit for the less experienced radiologists. The experienced radiologists demonstrated high distinguishability between categories (at least 0.98 for benign versus borderline [1 and 2] and at least 0.99 for borderline versus malignant [2 and 3]), with no significant difference between these two abilities. The overall mean distinguishability for these raters was 0.995, and their mean weighted kappa was 0.81 [Table - 1].
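For completeness, the two reference models cited in the Materials and Methods section can be written in log-linear form. Writing m_{ij} for the expected count in cell (i, j) of a rater's test-retest table and u_1 < u_2 < u_3 for ordered category scores, the uniform association model [10] and the agreement plus uniform association model [9] are, respectively,

\log m_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \beta\, u_i u_j,

\log m_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \beta\, u_i u_j + \delta\, I(i = j),

where I(i = j) equals one on the diagonal and zero elsewhere. The 'square scores' variants fitted in this study presumably replace the equally spaced scores u_i = i with squared category scores u_i = i^2; this reading is an assumption, since the article does not state the score choice explicitly.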
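As a purely illustrative sketch of the computations summarized in Table 1, the following Python fragment computes weighted kappa (with quadratic weights) and an adjacent-category distinguishability index from a single rater's 3x3 test-retest table. The counts are hypothetical, not the study data, and the distinguishability formula is the odds-ratio-based definition given after the Introduction rather than the model-based estimate reported by the authors.

import numpy as np

# Hypothetical 3x3 cross-classification of one rater's first and second
# readings (rows = first reading, columns = second reading);
# categories: 1 = benign, 2 = borderline, 3 = malignant.
# These counts are illustrative only, not the study data.
table = np.array([[18, 2, 0],
                  [1, 10, 1],
                  [0, 1, 7]], dtype=float)

n = table.sum()
p = table / n                         # joint proportions
row = p.sum(axis=1)                   # marginals of the first reading
col = p.sum(axis=0)                   # marginals of the second reading
R = table.shape[0]

# Quadratic agreement weights: w_ij = 1 - (i - j)^2 / (R - 1)^2
i, j = np.indices((R, R))
w = 1.0 - (i - j) ** 2 / (R - 1) ** 2

po = (w * p).sum()                    # weighted observed agreement
pe = (w * np.outer(row, col)).sum()   # weighted chance agreement
kappa_w = (po - pe) / (1 - pe)

# Distinguishability of categories a and b (odds-ratio-based definition):
# delta_ab = 1 - (p_ab * p_ba) / (p_aa * p_bb)
def distinguishability(p, a, b):
    return 1.0 - (p[a, b] * p[b, a]) / (p[a, a] * p[b, b])

d12 = distinguishability(p, 0, 1)     # benign vs. borderline
d23 = distinguishability(p, 1, 2)     # borderline vs. malignant

print(round(kappa_w, 3), round(d12, 3), round(d23, 3))

With these hypothetical counts, the two adjacent-category distinguishability values come out above 0.98, in the same range as those reported for the experienced radiologists.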
The less experienced radiologists demonstrated lower distinguishability between categories (at least 0.95 for benign versus borderline [1 and 2] and at least 0.97 for borderline versus malignant [2 and 3]) [Figure - 1]. Their overall mean distinguishability was 0.967, slightly lower than that of the experienced radiologists, and their mean weighted kappa was 0.65. The mean distinguishability for the benign and borderline categories was 0.990 for the experienced radiologists and 0.955 for the less experienced radiologists; for the borderline and malignant categories, the corresponding means were 0.999 and 0.978.

Discussion

To compare the distinguishability demonstrated by the observers in categorizing the samples and to assess intra-observer agreement for each of them, we first computed weighted kappa. Although intra-observer agreement at the two times was not complete, a mean weighted kappa of 0.71 indicates good overall reliability. [14] The minimum and maximum weighted kappa values in our study were 0.61 and 0.86 respectively. Our findings confirm the results reported by Amer et al., [15] who found a mean intra-observer agreement of 69.4% (kappa = 0.54); one reason for the small difference in the reliability index is that they used kappa rather than weighted kappa.

Although the less experienced radiologists demonstrated lower distinguishability than the experienced radiologists, the difference was not remarkable, because all observers achieved at least 0.90 in distinguishing between adjacent categories. For all observers, however, distinguishability between categories 1 and 2 was lower than that between categories 2 and 3, and the experienced radiologists performed better than the less experienced radiologists.

Generally, kappa and weighted kappa coefficients are used to assess the validity and reliability of diagnosis among different categories of ovarian cysts. [15] These coefficients describe intra-observer agreement only globally; given the deficiencies reported for them in several studies [2],[5],[7],[8] and their inability to reflect distinguishability, we used statistical models to assess the observers' distinguishability in classifying the ordered categories. These results could inform better future training of raters for large epidemiological studies.

References