Indian Journal of Cancer, Vol. 48, No. 4, October-December, 2011, pp. 488-495

Genitourinary - Original Article

Interobserver reproducibility of Gleason grading of prostatic adenocarcinoma among general pathologists

RV Singh1, SR Agashe2, AV Gosavi2, KR Sulhyan2
1 Department of Pathology, Tarini Cancer Hospital, Alwar, Rajasthan, India

Code Number: cn11131
PMID: 22293266

Abstract

Context: Gleason grading is the most widely used grading system for prostatic carcinoma and is recommended by the World Health Organization. Good interobserver reproducibility of this grading system is essential, as it has important implications for patient management.

Keywords: Adenocarcinoma, Gleason score, interobserver reproducibility, prostate

Introduction

Carcinoma of the prostate is the most common form of cancer in men and the second leading cause of cancer-related deaths. [1] The American Cancer Society estimated 218,890 new cases of adenocarcinoma of the prostate in 2007; prostate cancer has well surpassed lung cancer as the most frequently diagnosed carcinoma in men. [2] The subjectivity involved in histological grading is the main factor limiting the utility of various grading systems. Inconsistency in histological grading may invalidate its use in treatment decisions; the reproducibility of histological grading is therefore as important as its prognostic value. Gleason grading is the most widely used, and recommended, grading system for prostatic carcinoma in the world today. Even so, it is not a foolproof method and does not achieve 100% reproducibility. To date, many authors have examined the interobserver and intraobserver reproducibility of Gleason grading on radical prostatectomy or prostate biopsy specimens. [3],[4],[5],[6],[7],[8],[9] The aim of the current study is to assess the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma among general pathologists.
Materials and Methods

Twenty Hematoxylin and Eosin-stained glass slides of prostatic adenocarcinoma were randomly retrieved from the histopathology files on the basis of the original diagnosis. Of these 20 slides, 10 were from needle biopsies, eight from Transurethral Resection of Prostate (TURP) specimens and two from prostatectomy specimens. The slides were selected to roughly represent the spectrum of Gleason scores, and no effort was made to select particularly difficult cases. The slides were of uniform and adequate quality, and reproducibility with respect to variation in slide quality was not studied. All slides were coded so that they could not be identified by the pathologists. Twenty-one general pathologists from a teaching institute participated in the study. All pathologists were randomly assigned code numbers from P1 to P21 to maintain anonymity. Before the start of the study, a seminar was held on Gleason grading based on current practice as identified from recent publications. [10],[11],[12] For instruction and/or review in the use of the Gleason grading system, a written description of the system along with colored photomicrographs of the different Gleason patterns accompanied the slides. A proforma for reporting Gleason grade was circulated, and each pathologist was asked to assign a Gleason score based on the primary and secondary patterns, and on a tertiary pattern of higher grade than the secondary pattern if present. The interobserver agreement for each pathologist was assessed pairwise using an algorithm for the simple κ-coefficient. [13],[14] The simple κ, given by Cohen, is a measure of interobserver agreement. When the observed agreement exceeds the chance agreement, the κ-coefficient is positive, with its magnitude reflecting the strength of the agreement.
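As an illustration, the simple (unweighted) κ-coefficient described above can be computed for one pair of raters as a sketch like the following. The pathologist readings shown are hypothetical examples, not data from the study.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Simple (unweighted) Cohen's kappa for two raters' paired readings."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of slides on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of each rater's
    # marginal frequency for that category.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    # Kappa is the observed agreement corrected for chance agreement.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical Gleason score-group readings by two pathologists on 10 slides
p1 = ["5-6", "7", "8-10", "7", "5-6", "8-10", "7", "7", "8-10", "5-6"]
p2 = ["5-6", "7", "8-10", "5-6", "5-6", "8-10", "7", "8-10", "8-10", "7"]
print(round(cohen_kappa(p1, p2), 3))  # → 0.552
```

A positive κ here reflects agreement beyond chance, as described in the text; κ = 1 would indicate perfect agreement and κ ≤ 0 agreement no better than chance.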
The strength of κ agreement was interpreted as follows:

κ statistic - Strength of agreement
<0 - Poor
0.00-0.20 - Slight
0.21-0.40 - Fair
0.41-0.60 - Moderate
0.61-0.80 - Substantial
0.81-1.00 - Almost perfect

Kappa was calculated for interobserver agreement for each pair of pathologists for the primary grade (1-5), secondary grade (1-5), Gleason score (2-10), and Gleason score groups (2-4, 5-6, 7 and 8-10). The groupings of the Gleason scores were chosen to reflect those employed in patient management and were the same as those used elsewhere. [15] A biomedical statistician assisted with the calculation of the κ-coefficients. To assess percentage interobserver agreement, a mathematical consensus score for each slide was calculated. [4],[5] For each slide, the median of the panel's readings was first calculated separately for the primary and the secondary grade. These two values were then summed to give the mathematical consensus score for the slide. The number of Gleason scores recorded by each pathologist that agreed with the consensus for each slide was expressed as a percentage of the total number of slides read, i.e., exact agreement.

Results

The Gleason scores determined by the 21 pathologists ranged from 4 to 10; no score of 2 or 3 was assigned to any slide. Gleason score 7 was assigned the maximum number of times (137/420; 32.6%) and Gleason score 4 the least (2/420; 0.5%). [Table - 1] shows interobserver agreement for Gleason score groups with the consensus score groups. For the Gleason score groups (2-4, 5-6, 7 and 8-10), the maximum number of readings was in the 8-10 group (229/420; 54.5%) and the least in the 2-4 group (2/420; 0.5%). Using the score groups, the overall percentage agreement of the panel of pathologists with the consensus score groups was 68.0%, ranging from 42.9% to 85.7%. [Table - 2] shows percentage agreement for Gleason scores with consensus scores and agreement within ±1, ±2 and ±3. The distribution of the difference between each reading and the consensus score for each slide was calculated.
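The mathematical consensus score described above (median primary grade plus median secondary grade across the panel) can be sketched as follows; the panel readings are hypothetical, not data from the study.

```python
from statistics import median

def consensus_score(primary_grades, secondary_grades):
    """Mathematical consensus Gleason score for one slide: the median of the
    panel's primary-grade readings plus the median of the secondary-grade
    readings. With an odd-sized panel (21 here) each median is an integer."""
    return median(primary_grades) + median(secondary_grades)

# Hypothetical readings by a panel of five pathologists for one slide
primary = [3, 4, 3, 3, 4]    # primary pattern grades
secondary = [4, 4, 3, 4, 5]  # secondary pattern grades
print(consensus_score(primary, secondary))  # → 7 (median 3 + median 4)
```

Exact agreement, as used in the Results, is then the fraction of slides on which a pathologist's Gleason score equals this consensus score.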
For 43.3% of the readings there was exact agreement with the consensus Gleason scores, for 92.3% there was agreement within ±1 of the consensus score, and for 99.1% agreement within ±2 of the consensus Gleason scores. These results varied between individual consensus scores, the percentage being lower for consensus scores 6 (35.7%) and 8 (34.3%) and higher for consensus scores 7 (65.1%) and 9 (71.4%). No slide had a consensus score of 2, 3, 4 or 5. Undergrading was seen in 23% and overgrading in 33.7% of the readings. [Table - 3] and [Table - 4] show κ agreement for primary grade, secondary grade, Gleason scores and Gleason score groups. Kappa interobserver agreement for the primary grade, when each pathologist was compared with every other, ranged from -0.32 to 0.92, with the majority (60%; 254/420) of the readings in the fair to moderate agreement range [Table - 3]; 11% (46/420) showed poor agreement and 1% (4/420) almost perfect agreement. κ interobserver agreement for the secondary grade ranged from -0.34 to 0.62, with the majority (78%; 328/420) of the readings in the slight to fair agreement range [Table - 3]; 14% (58/420) showed poor agreement and two readings out of 420 (0.5%) showed substantial agreement. κ interobserver agreement for Gleason scores [Table - 4] ranged from -0.13 to 0.55. In 4% (18/420), κ was less than 0 (poor agreement), and in 80% (336/420) slight to fair agreement was seen; no reading showed substantial or perfect agreement. For Gleason score groups, κ ranged from -0.11 to 0.82 [Table - 4]. Kappa for 10 out of 420 (2%) readings was less than 0 (poor agreement), and 68.5% (288/420) showed fair to moderate agreement. Only two (0.5%) readings showed almost perfect agreement. Thus, κ agreement for Gleason score groups was marginally better than that for Gleason scores, and agreement for the primary grade was better than that for the secondary grade. The majority of readings fell in the slight to fair agreement range.
Almost perfect agreement was achieved only for primary grades (1%) and Gleason score groups (0.5%).

Discussion

To be clinically useful, a histopathological grading system must provide significant prognostic information, be reasonably easy to use and be reproducible. Grading prostate cancer is particularly difficult because of the pronounced morphological heterogeneity of this tumor. Inevitably, any grading system is subject to some degree of interobserver and intraobserver variability. The interobserver and intraobserver reproducibility of the Gleason grading system has been studied by several groups. Comparison of these studies is hampered by a number of factors. [6]
One of the problems in evaluating interobserver agreement in surgical pathology is establishing the "true" diagnosis. A number of methodologies can be used, including calculating percent exact agreement between pairs of observers, between observers and an expert diagnosis, and between observers and a consensus diagnosis. [6] We compared the readings of the pathologists against the mathematical consensus score and against each other using an algorithm for the simple κ-coefficient. The κ statistic is important for reproducibility studies because it corrects for chance agreement, which is ignored when only concordance is evaluated. [14] The κ-value generally increases when the number of categories decreases, as does concordance. The κ-value also provides insight into the disparity of nonconcordant cases. [Table - 5] compares previous studies of interobserver reproducibility of Gleason scores (2-10) with the present study. From this table it is evident that the reproducibility of Gleason scores for exact agreement in previous studies ranges from 9.9% to 70.8%. [Table - 6] compares previous studies of interobserver reproducibility of Gleason score groups (2-4, 5-6, 7 and 8-10) with the present study. Comparison of results between studies of observer variation is affected by the number and experience of the participants, the number and selection of slides, variation in the criteria for Gleason grading, whether the criteria were agreed before or during the course of the study, and the methods of analysis, in particular the choice of groupings for the Gleason score. [4] In the present study, a Gleason score of 2-4 was assigned in only 0.5% of the readings. The shift away from reporting Gleason scores 2-4 has been identified in other studies as well and seems to be related to changes in interpretation over time. [4],[16] Gleason score 7 was identified as an area of difficulty both in this study and elsewhere.
[4],[17] Fourteen out of 63 readings of slides (22%) [Table - 2] with a consensus score of 7 were underscored in the present study, more than in the study of Melia et al., [4] in which underscoring of consensus score 7 was seen in 13%. These differences centered on the assessment of small areas of fusion and the distinction between separate and fused small irregular glands arranged in a compact form. In addition, at times it may be difficult to determine whether the loss of acinar spaces is caused by compression artifact or by a real inability to form spaces. This difficulty has the potential to lead to inappropriate investigation and suboptimal patient management. Agreement is needed on a uniformly applicable definition of small irregular areas of gland fusion (not conforming to the Gleason cribriform pattern 3). Clarification is needed on: the morphology of fusion (possibly involving identification of a common wall); the minimum number of glands involved; and the number of small areas of gland fusion required for assigning Gleason pattern 4. With regard to patterns 4 and 5, the limits and proportions of tiny, poorly defined acinar structures versus cords and nests of cells appeared to be a problem in the present study, as elsewhere. [7] In the present study, sheets of cells with many lumen formations were erroneously considered grade 5 instead of grade 4 in many readings. A tertiary grade was reported infrequently, and there was poor agreement on the presence and the individual grades of the tertiary pattern. Although its prognostic value has been reported, [18] it could not be reliably used in practice, as identified in another study. [4] It has been observed that general pathologists more frequently underscore than overscore. [6],[8],[9],[17],[19] However, in the present study underscoring was seen in 23% and overscoring in 33.7% [Table - 2].
This could be because the present study included prostatectomy and TURP specimens besides needle biopsy specimens, and earlier studies have shown that undergrading occurs in needle biopsy specimens. [6],[8],[17] Moreover, a grade of 2-4 is discouraged on needle biopsy, as a higher grade is often found on the subsequent prostatectomy specimen. [11],[12],[20] Allsbrook et al. [6] are of the opinion that low-grade tumors must be more clearly defined, as even experienced urological pathologists do not show good interobserver agreement on them. Our study involved general pathologists, and it has been reported earlier that reproducibility among genitourinary pathologists is better than among general pathologists. [6],[8],[17] In our study, poor agreement (κ<0) for Gleason score was seen in 4% of the readings [Table - 4]. This can have a major impact on treatment, as two pathologists may grade the same case differently without any agreement between them. In our setup we see 4-5 cases of adenocarcinoma of the prostate in a year, and the experience of the pathologists and how they learned Gleason grading play a role in the reproducibility of the Gleason grading system, as identified in previous studies. [6],[9] One of the major problems faced in the present study was evaluating the percentages of the different patterns present in a slide. In some cases two distinct patterns were seen in approximately equal proportions, complicating the choice of a primary Gleason grade. In some cases, differences of opinion on the Gleason grade for the entire slide could be explained by the presence of approximately equal proportions of two patterns, as seen in previous studies. [21] Specific problem areas in the present study were similar to those in previous studies. [6],[22]
It has been documented that significant improvements in Gleason grading are achieved when Gleason grading tutorials, including Web-based tutorials, are made available. [23] A final comment should be made regarding the improvement of the interobserver reproducibility of Gleason grading. Subjectivity will always be present in any grading system. The study by Mikami et al. [9] indicates that good agreement for Gleason grading can be achieved by understanding the definition of each pattern in the scheme, as well as the pitfalls. In addition, although a lecture component could strengthen the understanding of the attendees, and was expected to be the superior educational method, printed material using a case-oriented approach played a comparable role, which seemed superior to self-learning from a standard textbook with a limited number of photomicrographs. Disagreement in grading can be attributed to various factors, including the heterogeneity of a given tumor consisting of various patterns and the existence of morphologically borderline tumors. [24] Carlson et al. [25] demonstrated that a standardized protocol can minimize observer variability, and Egevad et al. [19] showed that a set of reference images may significantly improve the reproducibility of grading. In the present study, a lecture on Gleason grading was given before the commencement of the study, and written material about the reporting of Gleason grading based on current practice, along with photomicrographs of the different Gleason patterns, was distributed to the participating pathologists. In spite of this, the level of agreement achieved was not satisfactory. We therefore conclude that the experience of the pathologists with Gleason grading plays a significant role, as reported elsewhere. [6],[8],[9] The International Society of Urological Pathology (ISUP) convened a conference in 2005.
[11] This conference led to the consensus development of the "2005 ISUP Modified Gleason System," which recommended that the initial grading of prostate carcinoma be performed at low magnification; patterns 1-5 were clearly described, and differences in the interpretation of biopsy and prostatectomy specimens were indicated. Overall, the recommendations follow a trend towards the use of higher grades than before. Further studies are needed to determine whether the "2005 ISUP Modified Gleason System" increases interobserver reproducibility among general pathologists and to assess its impact on patient outcomes over time. De la Taille et al. [3] presented a novel approach to evaluating Gleason grading among pathologists using high-density tissue microarrays (TMA). They concluded that a Gleason score can easily be assigned to each TMA spot of a 0.6 mm-diameter prostate cancer sample, and their data indicated that using TMA spot images may be a good approach to teaching the Gleason grading system because of the small areas of tissue involved. To obtain optimal, although never perfect, results from our educational efforts, these varying opinions must, whenever possible, be reconciled, and a greater consensus must be developed. First, consensus itself will have to be defined. Some of the issues (e.g., how the actual grades should be reported) are more amenable to consensus. Resolution of other issues (e.g., whether poorly defined/incomplete glands represent pattern 4) will ultimately require comparison of large series of cases, possibly including a review of the slides from these series, which is in itself a daunting task. [22] Another approach towards improving the reproducibility of the Gleason grading system is to obtain a second opinion in cases where the grade could significantly influence management. This has been shown to be effective for the grading of prostatic cancer.
[25] All of these possible aids to improved accuracy will have important resource and management implications for patients. [29]

Acknowledgment

We are thankful to Dr. Rakhi Jagdale and Mrs. G. S. Garad for their help.

References
Copyright 2011 - Indian Journal of Cancer