Unit for Evidence-Based Practice and Policy, Department of Primary Care and Population Sciences, University College London Medical School/Royal Free Hospital School of Medicine, Whittington Hospital, London N19 5NF. p.greenhalgh@ucl.ac.uk
| Ten men in the dock |
|---|
|
|
|---|
If you are new to the concept of validating diagnostic tests, the following example may help you. Ten men are awaiting trial for murder. Only three of them actually committed a murder; the seven others are innocent of any crime. A jury hears each case and finds six of the men guilty of murder. Two of the convicted are true murderers. Four men are wrongly imprisoned. One murderer walks free.
|
This information can be expressed in what is known as a two by two table (table 1). Note that the "truth" (whether or not the men really committed a murder) is expressed along the horizontal title row, whereas the jury's verdict (which may or may not reflect the truth) is expressed down the vertical column.
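Table 1

| Jury verdict | Truly a murderer | Truly innocent |
|---|---|---|
| Guilty | 2 (rightly convicted) | 4 (wrongly convicted) |
| Not guilty | 1 (wrongly acquitted) | 3 (rightly acquitted) |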
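These figures, if they are typical, reflect several features of this particular jury. It correctly identifies two in every three true murderers. It correctly acquits three out of every seven innocent people. If it finds a person guilty, there is still only a one in three chance that he is actually a murderer. If it finds a person innocent, he has a three in four chance of actually being innocent. And in five cases out of every 10 the jury gets it right.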
These five features constitute, respectively, the sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of this jury's performance. The rest of this article considers these five features applied to diagnostic (or screening) tests when compared with a "true" diagnosis or gold standard. A sixth feature, the likelihood ratio, is introduced at the end of the article.
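For readers who prefer to see the arithmetic, the five features can be worked out from the four cells of the jury's two by two table. The following is a minimal sketch in Python; the function name `two_by_two_features` and its argument names are illustrative choices, not terms from any standard library, and the cell counts are those given in the jury example.

```python
from fractions import Fraction


def two_by_two_features(true_pos, false_pos, false_neg, true_neg):
    """Return five features of a test from the four cells of a two by two table."""
    total = true_pos + false_pos + false_neg + true_neg
    return {
        # Proportion of true murderers whom the jury convicts
        "sensitivity": Fraction(true_pos, true_pos + false_neg),
        # Proportion of innocent men whom the jury acquits
        "specificity": Fraction(true_neg, true_neg + false_pos),
        # Proportion of guilty verdicts that are correct
        "positive predictive value": Fraction(true_pos, true_pos + false_pos),
        # Proportion of not guilty verdicts that are correct
        "negative predictive value": Fraction(true_neg, true_neg + false_neg),
        # Proportion of all verdicts that are correct
        "accuracy": Fraction(true_pos + true_neg, total),
    }


# Cell counts read off the jury example: 2 murderers convicted, 4 innocent men
# convicted, 1 murderer acquitted, 3 innocent men acquitted.
jury = two_by_two_features(true_pos=2, false_pos=4, false_neg=1, true_neg=3)

for name, value in jury.items():
    print(f"{name}: {value} ({float(value):.0%})")
```

Exact fractions are used so that the output can be compared directly with the "two in every three" style of statement in the text.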
| Validating tests against a gold standard |
|---|
|
|
|---|
Our window cleaner told me that he had been feeling thirsty recently and had asked his general practitioner to be tested for diabetes, which runs in his family. The nurse in his surgery had asked him to produce a urine specimen and dipped a stick in it. The stick stayed green, which meant, apparently, that there was no sugar in his urine. This, the nurse had said, meant that he did not have diabetes.
|
Summary points
- New tests should be validated by comparison against an established gold standard in an appropriate spectrum of subjects
- Diagnostic tests are seldom 100% accurate (false positives and false negatives will occur)
- A test is valid if it detects most people with the target disorder (high sensitivity) and excludes most people without the disorder (high specificity), and if a positive test usually indicates that the disorder is present (high positive predictive value)
- The best measure of the usefulness of a test is probably the likelihood ratio: how much more likely a positive test is to be found in someone with, as opposed to without, the disorder
|
I had trouble explaining that the result did not necessarily mean this, any more than a guilty verdict necessarily makes someone a murderer. The definition of diabetes, according to the World Health Organisation, is a blood glucose level above 8 mmol/l in the fasting state, or above 11 mmol/l two hours after a 100 g oral glucose load, on one occasion if the patient has symptoms and on two occasions if he or she does not.1 These stringent criteria can be termed the gold standard for diagnosing diabetes (although purists have challenged this notion2).
The dipstick test, however, has some distinct practical advantages over the full blown glucose tolerance test. To assess objectively just how useful the dipstick test for diabetes is, we would need to select a sample of people (say 100) and do two tests on each of them: the urine test (screening test) and a standard glucose tolerance test (gold standard). We could then see, for each person, whether the result of the screening test matched the gold standard (see table 2). Such an exercise is known as a validation study.
|
The validity of urine testing for glucose in diagnosing diabetes has been looked at by Andersson and colleagues,3 whose data I have adapted for use (expressed as a proportion of 1000 subjects tested) in table 3.
|
From the calculations of important features of the urine dipstick test for diabetes (box), you can see why I did not share the window cleaner's assurance that he did not have diabetes. A positive urine glucose test is only 22% sensitive, which means that the test misses nearly four fifths of people who have diabetes. In the presence of classical symptoms and a family history, the window cleaner's baseline chances (pretest likelihood) of having the condition are pretty high, and they are reduced to only about four fifths of this (the negative likelihood ratio, 0.78; see below) after a single negative urine test. This man clearly needs to undergo a more definitive test.
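To see how little a negative dipstick result shifts the picture, the negative likelihood ratio of 0.78 can be applied to a range of pretest probabilities. This is a rough sketch only: the pretest probabilities below are hypothetical (the article gives no figure for the window cleaner), and the helper name `post_test_probability` is an illustrative choice. Strictly, a likelihood ratio multiplies the odds rather than the probability, so the "four fifths" rule of thumb is closest when the pretest probability is low.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Negative likelihood ratio of the urine dipstick test, as quoted in the text.
NEGATIVE_LIKELIHOOD_RATIO = 0.78

# Hypothetical pretest probabilities, chosen only for illustration.
for pretest in (0.05, 0.20, 0.50, 0.80):
    post = post_test_probability(pretest, NEGATIVE_LIKELIHOOD_RATIO)
    print(f"pretest {pretest:.0%} -> after a negative dipstick {post:.1%}")
```

On these assumptions, someone whose chances of diabetes were high before the test still has high chances after a single negative result, which is the point of the paragraph above.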
| Does the paper validate the test? |
|---|
|
|
|---|
The 10 questions below can be asked about a paper that claims to validate a diagnostic or screening test. In preparing these tips, I have drawn on several sources.4 5 6 7 8
Question 1: Is this test potentially relevant to my practice?
Sackett and colleagues call this the utility of the test.6
Even if this test were 100% valid, accurate, and reliable, would it help me? Would it identify a treatable disorder? If so, would I use it in preference to the test I use now? Could I (or my patients or the taxpayer) afford it? Would my patients consent to it? Would it change the probabilities for competing diagnoses sufficiently for me to alter my treatment plan?
Question 2: Has the test been compared with a true gold standard?
You need to ask, firstly, whether the test has been compared with anything at all. Assuming that a "gold standard" test has been used, you should verify that it merits the description, perhaps by using the questions listed in question 1. For many conditions, there is no gold standard diagnostic test. Unsurprisingly, these tend to be the conditions for which new tests are most actively sought. Hence, the authors of such papers may need to develop and justify a combination of criteria against which the new test is to be assessed. One specific point to check is that the test being validated in the paper is not being used to define the gold standard.
Question 3: Did this validation study include an appropriate spectrum of subjects?
Although few investigators would be naive enough to select only, say, healthy male medical students for their validation study, only 27% of published studies explicitly define the spectrum of subjects tested in terms of age, sex, symptoms or disease severity, and specific eligibility criteria.7 Importantly, the test should be verified on a population which includes mild and severe disease, treated and untreated subjects, and those with different but commonly confused conditions.6
Although the sensitivity and specificity of a test are virtually constant whatever the prevalence of the condition, the positive and negative predictive values depend crucially on prevalence. This is why general practitioners are sceptical of the utility of tests developed exclusively in a secondary care population, and why a good diagnostic test is not necessarily a good screening test.
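The dependence of predictive values on prevalence is easy to demonstrate with a few lines of arithmetic. The sketch below assumes a hypothetical test that is 90% sensitive and 90% specific, applied at three hypothetical prevalences; none of these figures comes from a real study, and the function name `predictive_values` is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive, negative) predictive values for a given prevalence."""
    # Expected proportion of the whole population falling in each cell
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test characteristics, held constant across settings.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Hypothetical prevalences: a screening setting, general practice, a referral clinic.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

The same test gives a positive predictive value that rises steeply with prevalence, which is why figures obtained in a hospital population cannot simply be transferred to primary care or to screening.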
Question 4: Has workup bias been avoided?
This is easy to check. It simply means, "Did everyone who got the new diagnostic test also get the gold standard, and vice versa?" There is clearly a potential bias in studies where the gold standard test is performed only on people who have already tested positive for the test being validated.7
Question 5: Has expectation bias been avoided?
Expectation bias occurs when pathologists and others who interpret diagnostic specimens are subconsciously influenced by the knowledge of the particular features of the case (for example, the presence of chest pain when interpreting an electrocardiogram). In the context of validating diagnostic tests against a gold standard, all such assessments should be "blind."
Question 6: Was the test shown to be reproducible?
If the same observer performs the same test on two occasions on a subject whose characteristics have not changed, they will get different results in a proportion of cases. Similarly, it is important to confirm that reproducibility between different observers is at an acceptable level.9
Question 7: What are the features of the test as derived from this validation study?
All the above standards could have been met, but the test might still be worthless because the sensitivity, specificity, and other crucial features of the test are too low; that is, the test is not valid. What counts as acceptable depends on the condition being screened for. Few of us would quibble about a test for colour blindness that was 95% sensitive and 80% specific, but nobody ever died of colour blindness. The Guthrie heel-prick screening test for congenital hypothyroidism, performed on all babies in Britain soon after birth, is over 99% sensitive but has a positive predictive value of only 6% (it picks up almost all babies with the condition at the expense of a high false positive rate),10 and rightly so. It is more important to pick up every baby with this treatable condition who would otherwise develop severe mental handicap than to save hundreds the minor stress of a repeat blood test.
Question 8: Were confidence intervals given?
A confidence interval, which can be calculated for virtually every numerical aspect of a set of results, expresses the possible range of results within which the true value will probably lie. If the jury in the first example had found just one more murderer not guilty, the sensitivity of its verdict would have gone down from 67% to 33%, and the positive predictive value of the verdict from 33% to 20%. This enormous (and quite unacceptable) sensitivity to a single case decision is, of course, because we validated the jury's performance on only 10 cases. The larger the sample, the narrower the confidence interval, so it is particularly important to look for confidence intervals if the paper you are reading reports a study on a relatively small sample.11
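As an illustration of how sample size drives the width of a confidence interval, the sketch below calculates a 95% interval for the jury's sensitivity (2 of 3 murderers convicted) and for the same proportion in a much larger, entirely hypothetical sample (200 of 300). The Wilson score method is used here as one common choice for small samples; it is my choice for the example and not a method prescribed in the sources cited above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return centre - half_width, centre + half_width


# Sensitivity of the jury: 2 of the 3 true murderers were convicted.
small_low, small_high = wilson_interval(2, 3)

# Hypothetical larger validation sample with the same observed sensitivity.
large_low, large_high = wilson_interval(200, 300)

print(f"n=3:   {small_low:.0%} to {small_high:.0%}")
print(f"n=300: {large_low:.0%} to {large_high:.0%}")
```

With only three murderers in the sample, the interval spans most of the possible range, so the point estimate of 67% tells us very little.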
Question 9: Has a sensible "normal range" been derived?
If the test gives non-dichotomous (continuous) results (that is, if it gives a numerical value rather than a yes/no result), someone will have to say what values count as abnormal. Defining relative and absolute danger zones for a continuous variable (such as blood pressure) is a complex science, which should take into account the actual likelihood of the adverse outcome which the proposed treatment aims to prevent. This process is made considerably more objective by the use of likelihood ratios (see below).
Question 10: Has this test been placed in the context of other potential tests in the diagnostic sequence?
In general, we treat high blood pressure simply on the basis of a series of resting blood pressure readings. Compare this with the sequence we use to diagnose coronary artery stenosis. Firstly, we select patients with a typical history of effort angina. Next, we usually do a resting electrocardiogram, an exercise electrocardiogram, and, in some cases, a radionuclide scan of the heart. Most patients come to a coronary angiogram only after they have produced an abnormal result on these preliminary tests.
If you sent 100 ordinary people for a coronary angiogram, the test might show very different positive and negative predictive values (and even different sensitivity and specificity) than it did in the ill population on which it was originally validated. This means that the various aspects of validity of the coronary angiogram as a diagnostic test are virtually meaningless unless these figures are expressed in terms of what they contribute to the overall diagnostic work up.
| A note on likelihood ratios |
|---|
|
|
|---|
Question 9 above described the problem of defining a normal range for a continuous variable. In such circumstances, it can be preferable to express the test result not as "normal" or "abnormal" but in terms of the actual chances of a patient having the target disorder if the test result reaches a particular level. Take, for example, the use of the prostate specific antigen (PSA) test to screen for prostate cancer. Most men will have some detectable antigen in their blood (say, 0.5 ng/ml), and most of those with advanced prostate cancer will have high concentrations (above about 20 ng/ml). But a concentration of, say, 7.4 ng/ml may be found either in a perfectly normal man or in someone with early cancer. There simply is not a clean cutoff between normal and abnormal.12
We can, however, use the results of a validation study of this test against a gold standard for prostate cancer (say a biopsy of the prostate gland) to draw up a whole series of two by two tables. Each table would use a different definition of an abnormal test result to classify patients as "normal" or "abnormal." From these tables, we could generate different likelihood ratios associated with an antigen concentration above each different cutoff point. When faced with a test result in the "grey zone" we would at least be able to say, "This test has not proved that the patient has prostate cancer, but it has increased [or decreased] the odds of that diagnosis by a factor of x."
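The idea of drawing up one two by two table per cutoff can be sketched in a few lines. The antigen concentrations below are invented purely for illustration and are not data from any real validation study; the function name `positive_likelihood_ratio` and the three cutoffs are likewise hypothetical. For each cutoff, the likelihood ratio is the proportion of men with cancer whose result exceeds the cutoff, divided by the proportion of men without cancer whose result exceeds it.

```python
import math

# Hypothetical antigen concentrations (ng/ml), grouped by gold standard result.
# These values are made up for illustration only.
with_cancer = [2.1, 4.8, 6.5, 7.4, 9.0, 12.3, 18.0, 25.0, 40.0, 60.0]
without_cancer = [0.4, 0.5, 0.8, 1.2, 1.9, 2.5, 3.3, 4.1, 7.4, 11.0]


def positive_likelihood_ratio(diseased, healthy, cutoff):
    """Likelihood ratio for a result above the cutoff."""
    # Sensitivity: proportion of diseased subjects above the cutoff
    sensitivity = sum(value > cutoff for value in diseased) / len(diseased)
    # False positive rate (1 - specificity): proportion of healthy subjects above it
    false_positive_rate = sum(value > cutoff for value in healthy) / len(healthy)
    if false_positive_rate == 0:
        return math.inf  # no healthy subject exceeds this cutoff in the sample
    return sensitivity / false_positive_rate


# Hypothetical cutoffs, each defining its own two by two table.
for cutoff in (4.0, 10.0, 20.0):
    ratio = positive_likelihood_ratio(with_cancer, without_cancer, cutoff)
    print(f"result above {cutoff} ng/ml: likelihood ratio {ratio:.1f}")
```

An infinite ratio at the highest cutoff simply reflects the tiny invented sample; in a real study it would be reported with a confidence interval rather than taken at face value.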
The likelihood ratio thus has enormous practical value, and it is becoming the preferred way of expressing and comparing the usefulness of different tests.6 For example, if a person enters my consulting room with no symptoms at all, I know that they have a 5% chance of having iron deficiency anaemia, since I know that one person in 20 in the population has this condition (in the language of diagnostic tests, the pretest probability of anaemia is 0.05).13
|
Now, if I do a diagnostic test for anaemia, the serum ferritin concentration, the result will usually make the diagnosis of anaemia either more or less likely. Strictly, the likelihood ratio multiplies the pretest odds rather than the pretest probability: a pretest probability of 0.05 corresponds to odds of 0.05/0.95, or about 0.053. A moderately reduced serum ferritin concentration (between 18 and 45 µg/l) has a likelihood ratio of 3, so the post-test odds are about 0.16, which converts back to a probability of about 0.14 (14%). This value is known as the post-test probability. The likelihood ratio of a very low serum ferritin concentration (below 18 µg/l) is 41, giving post-test odds of about 2.2 and a post-test probability of about 68%. On the other hand, a very high concentration (above 100 µg/l; likelihood ratio 0.13) would reduce the chances of the patient being anaemic from 5% to less than 1%.13
Figure 1 shows a nomogram, adapted by Sackett and colleagues from an original paper by Fagan,14 for working out post-test probabilities when the pretest probability (prevalence) and likelihood ratio for the test are known. The lines A, B, and C, drawn from a pretest probability of 25% (the prevalence of smoking among British adults), are the trajectories through likelihood ratios of 15, 100, and 0.015, respectively, for three different tests for detecting whether someone is a smoker.15 Actually, test C detects whether the person is a non-smoker, since a positive result in this test leads to a post-test probability of only 0.5%.
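The nomogram is a graphical shortcut for a short calculation: convert the pretest probability to odds, multiply by the likelihood ratio, and convert back to a probability. The sketch below reproduces lines A, B, and C using the pretest probability and likelihood ratios quoted above; the function names `to_odds` and `to_probability` are illustrative.

```python
def to_odds(probability):
    """Convert a probability (between 0 and 1) to odds."""
    return probability / (1 - probability)


def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


PRETEST_PROBABILITY = 0.25  # prevalence of smoking quoted in the text

# Likelihood ratios for the three tests, as quoted in the text.
likelihood_ratios = {"A": 15, "B": 100, "C": 0.015}

post_test = {
    line: to_probability(to_odds(PRETEST_PROBABILITY) * ratio)
    for line, ratio in likelihood_ratios.items()
}

for line, probability in post_test.items():
    print(f"line {line}: post-test probability {probability:.1%}")
```

Line C comes out at about 0.5%, in agreement with the figure read off the nomogram.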
|
The articles in this series are excerpts from How to read a paper: the basics of evidence based medicine. The book includes chapters on searching the literature and implementing evidence based findings. It can be ordered from the BMJ Publishing Group: tel 0171 383 6185/6245; fax 0171 383 6662. Price £13.95 UK members, £14.95 non-members.
|
| Acknowledgements |
|---|
Thanks to Dr Sarah Walters and Dr Jonathan Elford for advice, and in particular to Dr Walters for the jury example.
| References |
|---|
|
|
|---|