To review how heterogeneity has been examined in systematic reviews of diagnostic test accuracy studies.
The data source was the Centre for Reviews and Dissemination's Database of Abstracts of Reviews of Effects (DARE).
Systematic reviews that evaluated a diagnostic or screening test by including studies that compared a test with a reference test were identified from DARE. Reviews for which structured abstracts had been written up to December 2002 were screened for inclusion. Data extraction was undertaken using standardised data extraction forms.
A total of 189 systematic reviews met the inclusion criteria. The median number of studies included was 18. Meta-analyses included more studies than narrative reviews, with a median of 22 compared with 11. Graphical plots to demonstrate the spread in study results were provided in 56% of meta-analyses; in 79% of these, the plots showed sensitivity and specificity in receiver operating characteristic (ROC) space. Statistical tests to identify heterogeneity were used in 32% of reviews: 41% of meta-analyses and 9% of reviews using narrative syntheses. The most common were the chi-squared test and Fisher's exact test, applied to individual aspects of test performance. In contrast, only 16% of meta-analyses used correlation coefficients to test for a threshold effect. A narrative synthesis was used in 30% of reviews. Of the meta-analyses, 52% carried out statistical pooling alone, 18% conducted only summary receiver operating characteristic (SROC) analyses and 30% used both methods of statistical synthesis. For those undertaking SROC analyses, the main difference between the models used was the weighting chosen for the regression models, although in 42% of cases the use of, or choice of, weights was not reported. The proportion of reviews using statistical pooling alone declined from 67% in 1995 to 42% in 2001, with a corresponding increase in the use of SROC methods, from 33% to 58%. However, two-thirds of those using SROC methods also carried out statistical pooling rather than presenting only SROC models. Reviews using SROC analyses also tended to present their results as some combination of sensitivity and specificity rather than using alternative, perhaps less clinically meaningful, means of data presentation such as diagnostic odds ratios. Three-quarters of meta-analyses attempted to investigate possible sources of variation statistically, using subgroup analysis or regression analysis.
The impact of clinical or socio-demographic variables was investigated in 74% of these reviews and test- or threshold-related variables in 79%. At least one quality-related variable was investigated in 63% of reviews. Within this subset, the most commonly considered variables were the use of blinding, sample size, the reference test used and the avoidance of verification bias.
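As an illustration only, the two kinds of heterogeneity check mentioned in these results (a chi-squared test on one aspect of test performance, and a correlation coefficient to look for a threshold effect) can be sketched as follows. This is a minimal sketch and not the method of any particular review: the 2x2 counts are hypothetical, the function names `chi2_homogeneity`, `ranks` and `spearman` are invented for this example, and Spearman's rank correlation between sensitivity and false-positive rate is assumed as the correlation measure.

```python
from math import sqrt

# Hypothetical 2x2 counts per study: (TP, FN, FP, TN).
# Illustrative values only; they are not data from the reviews described here.
studies = [
    (45, 5, 10, 40),
    (30, 20, 5, 45),
    (40, 10, 8, 42),
    (25, 25, 2, 48),
    (48, 2, 15, 35),
]


def chi2_homogeneity(events, non_events):
    """Pearson chi-squared statistic for equal proportions across k studies.

    Treats the data as a 2 x k table and returns (statistic, degrees of freedom).
    """
    total_e, total_n = sum(events), sum(non_events)
    grand = total_e + total_n
    stat = 0.0
    for e, n in zip(events, non_events):
        size = e + n
        exp_e = size * total_e / grand  # expected events under homogeneity
        exp_n = size * total_n / grand  # expected non-events under homogeneity
        stat += (e - exp_e) ** 2 / exp_e + (n - exp_n) ** 2 / exp_n
    return stat, len(events) - 1


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes neither input is constant; otherwise the denominator is zero.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


tp = [s[0] for s in studies]
fn = [s[1] for s in studies]
sens = [s[0] / (s[0] + s[1]) for s in studies]
fpr = [s[2] / (s[2] + s[3]) for s in studies]  # 1 - specificity

# Heterogeneity in sensitivity: compare the statistic with a chi-squared
# distribution on df degrees of freedom (5% critical value for df = 4 is 9.488).
stat, df = chi2_homogeneity(tp, fn)

# Threshold effect: a strong positive correlation between sensitivity and
# false-positive rate suggests studies differ mainly in the threshold used.
rho = spearman(sens, fpr)

print(f"chi-squared = {stat:.2f} on {df} df; Spearman rho = {rho:.2f}")
```

A large chi-squared statistic on its own says only that sensitivities differ between studies; the correlation helps to indicate whether a varying test threshold might explain part of that difference, in which case pooling sensitivity and specificity separately may be misleading.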
The emphasis on pooling individual aspects of diagnostic test performance and the under-use of statistical tests and graphical approaches to identify heterogeneity perhaps reflect uncertainty about the most appropriate methods to use and greater familiarity with more traditional indices of test accuracy. This indicates the difficulty and complexity of carrying out such reviews. It is therefore strongly suggested that meta-analyses of diagnostic test accuracy are carried out with the involvement of a statistician familiar with the field. Further methodological work on the statistical methods available for combining diagnostic test accuracy studies is needed, as are sufficiently large, prospectively designed primary studies of diagnostic test accuracy comparing two or more tests for the same target disorder. The use of individual patient data meta-analysis in diagnostic test accuracy reviews should be explored to allow heterogeneity to be considered in more detail.