Systematic reviews of evaluations of diagnostic and screening tests
BMJ 2001;323 doi: https://doi.org/10.1136/bmj.323.7305.157 (Published 21 July 2001) Cite this as: BMJ 2001;323:157
Rapid responses
Further to our rapid response to Juni P et al. (Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ 2001;323:42-6), we would also like to focus on how dramatic the effect of intra- and inter-observer variation can be on the sensitivity and specificity of a diagnostic or screening test [1]. We demonstrated, for example, that with observer agreement proportions of 0.33 for abnormal tests and 0.71 for normal tests, an assumed sensitivity of 50% may in fact vary from 0% to 100%, and an assumed false positive rate of 17% may vary from 0% to 33% [1].
Systematic reviews that include large studies and/or many small studies with many observers will tend to be biased towards average values, whereas systematic reviews that include one large study with a single observer will tend to be biased towards the particular results of that study, whether high, average, or low. Shouldn't the largely random effect arising from intra-observer variation, and the random and systematic effects arising from inter-observer variation [2], also be obligatorily discussed in systematic reviews of diagnostic and screening tests?
References
1. Bernardes J, Costa-Pereira A. How should we interpret RCTs based on unreproducible methods? http://www.bmj.com/cgi/content/full/322/7300/1457#EL9
2. Grant A. Principles for clinical evaluation of methods of perinatal monitoring. J Perinat Med 1984;12:227-31.
Competing interests: Bernardes and Costa-Pereira are involved in the development and validation of reproducible computerized diagnostic tests.
EDITORS - Deeks, in the third of four articles on evaluations of
diagnostic and screening tests,1 promoted the odds ratio as often being
constant regardless of the diagnostic threshold. We agree with Deeks'
statement that the choice of threshold varies according to the prevalence
of the disease. However, the statement that the odds ratio is generally
constant regardless of the diagnostic threshold can be misleading. The
value of an odds ratio, like that of other measures of test performance
(e.g. sensitivity, specificity, likelihood ratios), depends on
prevalence.2 For example, a test with a diagnostic odds ratio of 10.00 is
considered to be a very good test by current standards. It is easy to
verify that this is generally true only in high risk populations. A
diagnostic odds ratio of 10.00 in a low risk population may well represent a very weak
association between the experimental test and the gold standard test.
This is so because the observable range of values for an odds ratio
increases as the prevalence of the disease decreases (i.e. moves away from
1/2).
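The dependence on prevalence can be illustrated numerically. The sketch below uses made-up counts, not figures from the letter: two settings with identical sensitivity and specificity, hence an identical DOR of about 10, but very different correlations between test result and disease status.

```python
import math

def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR = (TP*TN)/(FP*FN): the odds of a positive test among the
    diseased divided by the odds among the non-diseased."""
    return (tp * tn) / (fp * fn)

def phi_coefficient(tp, fp, fn, tn):
    """Pearson correlation for a 2x2 table: unlike the DOR, this
    measure of test-disease association is sensitive to prevalence."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (fn + tn) * (tp + fn) * (fp + tn))
    return num / den

# Same sensitivity (0.76) and specificity (0.76), hence the same DOR of
# roughly 10, in two hypothetical populations of 1000 patients each.
high_prev = dict(tp=380.0, fp=120.0, fn=120.0, tn=380.0)  # prevalence 0.50
low_prev = dict(tp=7.6, fp=237.6, fn=2.4, tn=752.4)       # prevalence 0.01

print(round(phi_coefficient(**high_prev), 2))  # 0.52
print(round(phi_coefficient(**low_prev), 2))   # 0.12
```

The DOR is identical in both settings, yet at a prevalence of 1% the same test correlates only weakly with disease status, which is the letter's point.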
Nicole Jill-Marie Blackman
Senior Biostatistician
GlaxoSmithKline,
1250 South Collegeville Road, P.O. Box 5089,
Collegeville PA
USA
email: nicole_blackman-1@gsk.com
Competing interests: None
REFERENCES
1. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157-62.
2. Kraemer HC. The robustness of common measures of 2 x 2 association to misclassification.
There seem to be errors in Figure 2 of BMJ 2001;323:157-62.
In the Sensitivity plot on the left, some of the point estimates
appear to be incorrect. For example, we are told that the second Nasri
study has a sensitivity of 1.00 (6/6), but the point estimate is at about
0.7 on the graph; there is no point estimate for Tavani; and the point
estimate for Varner is at about 0.45 rather than 0.5.
Is the scale reversed at the bottom of the Specificity plot? For
example, the numbers for the Botsis study (14/114) suggest a low
specificity of 0.12, but the point estimate is at about 0.88 (1 - 0.12).
On the other hand, we are told that the Perti study has a specificity of
96/131, or 0.73, but the point estimate is at about 0.27 (1 - 0.73).
Competing interests: No competing interests
An author's error has been made in the labelling of the right-hand graph in
Figure 2, and publisher's errors have been made in the positioning of some
of the points in both panels of this figure.
The numerators given in the right-hand panel of Figure 2 are the numbers of
false positives (not true negatives as labelled). This column is also
incorrectly labelled in the corresponding book chapter (Systematic Reviews
in Health Care: Meta-analysis in Context, page 266).
The sensitivity point estimates for Nasri(b) and Taviani should be
sensitivity=1.0. The 1-specificity point estimate for Goldstein should be
at 1-(16/27)=0.41 and not at 0.6 as depicted. The positioning of all of
these points in the corresponding book chapter is correct (Systematic
Reviews in Health Care: Meta-analysis in Context, page 266).
I thank the keen-eyed readers who have pointed out these errors.
Competing interests: No competing interests
Dear Sir,
We enjoyed Deeks’ excellent presentation of two very important topics in systematic reviews of diagnostic accuracy research.1 We should like to draw attention to three points on which Deeks may perhaps have simplified matters too much.
First, in his ‘Framework for considering study quality and likelihood of bias’ he states that the reference diagnosis should be available for all (our italics) patients. However, when the reference test is dangerous and/or expensive, it may be wise (and more ethical) to restrict the performance of the reference diagnosis to a random sample of the patients who tested negatively on the experimental test. Since this approach does not affect the relative frequency of disease presence among patients with a negative experimental test result, the two approaches will produce identical diagnostic odds ratios (DOR). However, the sampling approach may become statistically infeasible when very small numbers of false negatives are to be expected. Investigators who use the sampling approach should report the sampling fraction to enable their readers and reviewers to recalculate the correct sensitivities, specificities and likelihood ratios (LR), because these, unlike the DOR, will not be the same for the two approaches.
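The recalculation that such a report makes possible is simple arithmetic. A minimal sketch, with illustrative counts and variable names of our own choosing:

```python
def recover_accuracy(tp, fp, fn_sampled, tn_sampled, f):
    """Recover sensitivity, specificity and the DOR when the reference
    test was applied to every experimental-test positive but only to a
    random fraction f of the negatives (a sketch; names are ours)."""
    fn = fn_sampled / f  # scale the verified negatives back up
    tn = tn_sampled / f
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    dor = (tp * tn) / (fp * fn)
    return sens, spec, dor

# Hypothetical example: full verification would have given
# TP=90, FN=10, FP=100, TN=800. With f = 0.1, only 1 false negative
# and 80 true negatives are actually verified.
sens, spec, dor = recover_accuracy(tp=90, fp=100, fn_sampled=1, tn_sampled=80, f=0.1)
print(round(sens, 3), round(spec, 3), round(dor, 1))  # 0.9 0.889 72.0
```

Because the sampling fraction f cancels in (tp*tn)/(fp*fn), the DOR itself needs no correction; only the sensitivities, specificities and LRs do.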
Consider an even more extreme example to illustrate the futility of what might be called the “filling-the-fourfold-table-reflex” in diagnostic accuracy research. Consider a study on an experimental test that claims to give clinicians more certainty in situations where they have only a few indications that disease may be present. However, let’s assume that the indications are not strong enough to justify the performance of truly invasive and unpleasant tests. Without the new experimental test these patients would be sent home. The value of the new test lies in its ability to identify - in a relatively non-invasive and inexpensive fashion - those patients who have the disease and would benefit from treatment. In this scenario, the analysis of only those patients who test positively on the experimental test (two cells filled of the fourfold table) suffices to learn about its clinical usefulness.
Second, Deeks ends his explanation of the application of the likelihood ratio by stating that: “Knowledge of other characteristics of a particular patient that either increase or decrease their prior probability of endometrial cancer can be incorporated into the calculation by adjusting the pretest probability accordingly.” However, this assumes constancy of likelihood ratios (an assumption that seems difficult to eradicate), which should not be taken for granted because it is usually incorrect. In clinical practice, knowledge of other patient characteristics (obtained by performing other ‘tests’) will influence the magnitude of the LRs of subsequent tests. This is so because, when a chain of diagnostic tests (history taking, physical exam, lab tests, imaging) is performed on a patient, certain results from the clinical history make certain lab results more (or less) likely, which in turn influence the chances of finding certain imaging results. In other words, the results of the component tests are not mutually independent. For example, on average, women with a positive test on ultrasound (thickened endometrium) are more likely to test positively on hysteroscopy, in which the endometrial thickness is also assessed, albeit in a different manner. The theoretical solution to this problem is the calculation of LRs that are conditional on the results of the preceding tests in the diagnostic chain. In practice, this is usually not feasible for lack of sufficient data, and most investigators use logistic regression models to account for all these dependencies. These models, however, yield DORs, not LRs. It is partly this complexity that hampers the application of simple diagnostic accuracy studies in clinical practice. We support Deeks where he calls for more clinically relevant diagnostic studies.
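For readers who wish to see the calculation Deeks describes made explicit, here is a minimal sketch with illustrative numbers of our own; note that it deliberately assumes the constant LR that the paragraph above warns against:

```python
def post_test_probability(pretest_p, lr):
    """Bayes' theorem in odds form: posttest odds = pretest odds * LR.
    NB: valid only if the LR can be treated as constant for this
    patient, which, as argued above, often fails for correlated tests."""
    pre_odds = pretest_p / (1 - pretest_p)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical numbers: a pretest probability of 10% and a positive
# result on a test with LR+ = 9 give a posttest probability of 50%.
print(round(post_test_probability(0.10, 9), 3))  # 0.5
```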
Third, we agree with Deeks that DORs and summary receiver operating characteristic curves may be difficult to interpret. Deeks gives the example of a DOR of 29 and explains that it could correspond to a sensitivity of 0.95 with a specificity of 0.60, or vice versa. This ‘vice versa-ness’, according to Deeks, limits the DOR’s clinical application. However, given the ranges of sensitivities and specificities in the material from his case study (0.8-1.0 and 0.27-0.87, respectively, excluding one study reporting a sensitivity of 0.50 based on two patients), the choice between these two combinations should fall on the former. More generally, where a diagnostic test allows the clinician to (technically) adjust the cutoff at a point along the (summary) receiver operating characteristic curve, such a curve may be quite useful in selecting the most (cost-) effective cutoff.
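The ambiguity is easy to verify from the definition of the DOR. A one-line check, in our own notation rather than code from the article:

```python
def dor_from_sens_spec(sens, spec):
    """DOR = [sens/(1-sens)] * [spec/(1-spec)]: symmetric in its two
    arguments, which is exactly the 'vice versa' ambiguity."""
    return (sens / (1 - sens)) * (spec / (1 - spec))

# Both combinations mentioned by Deeks give the same DOR (about 28.5,
# i.e. roughly the 29 quoted in the article):
print(round(dor_from_sens_spec(0.95, 0.60), 1))  # 28.5
print(round(dor_from_sens_spec(0.60, 0.95), 1))  # 28.5
```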
Finally, for those who like to faithfully reproduce Deeks’ graphs and summaries, in figure 2, the numerators in the second column of the right-hand panel represent the number of false positives, not the true negatives. This error may have large numerical consequences when it goes undetected.
Gerben ter Riet, MD, clinical epidemiologist 1,2
Alphons G.H. Kessels, MD, MSc, medical statistician 2
Lucas M. Bachmann, MD, research fellow 3
1 Dept Epidemiology, Maastricht University, Maastricht, The Netherlands
2 Dept Clinical Epidemiology & Medical Technology Assessment, Maastricht University Hospital, Maastricht, The Netherlands.
3 Horten Centre, University of Zurich, Zurich, Switzerland
Competing interests: none
1. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323: 157-62.
Dear Sir - The "ROC" curves in fig 3 are hard to understand. The
units of the ordinates and abscissas in the centre and right panels do
not seem to correspond to likelihood ratios or diagnostic odds. In any case,
the scatter of the supposed false positive values looks much the same in
each panel. One would expect this, as the diagnostic odds are derived from
the LR+/LR- ratios, which in turn come from the sensitivity and specificity
estimates; no extra information has been added. The apparent reduction in
variance must come from the forced choice between sensitivity or
specificity as determinants, as the authors hint. Using one "cut off"
point for diagnosis (here it is 5 mm) means that only one point on the ROC
curve is addressed and estimated. It cannot give any idea of the whole
curve (as shown in fig 1), its discriminatory capacity, or its variance.
It is a useful reminder of the fact that test information is not
infallible and that things like sensitivity and specificity are variables
with errors attached.
GH Hall MD
Competing interests: No competing interests
Corrected Correction
This is hardly a Rapid Response, being 8 years overdue. The
correction, prompted by readers and the author, leaves the impression that
the specificity is given by (false positives)/(true negatives + false
positives), whereas it was correctly stated in the original article as
(true negatives)/(true negatives + false positives). The problem arises
because the points to be plotted have been calculated from the false
positive fraction.
The correction should really be corrected to make it clear that it is
(1-specificity) that is being plotted. What is really required is a new
figure 2.
Competing interests: None declared