In biostatistics, spectrum bias refers to the phenomenon that the performance of a diagnostic test may vary between clinical settings because each setting has a different mix of patients.[1] Because performance may depend on the mix of patients, performance at one clinic may not predict performance at another.[2] These differences are interpreted as a kind of bias. Mathematically, spectrum bias is a sampling bias rather than a traditional statistical bias; this has led some authors to refer to the phenomenon as spectrum effects,[3] whilst others maintain that it is a bias if the true performance of the test differs from that which is 'expected'.[2] The performance of a diagnostic test is usually measured in terms of its sensitivity and specificity, and it is changes in these that are considered when referring to spectrum bias. However, other performance measures, such as the likelihood ratios, may also be affected by spectrum bias.[2]
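The performance measures named above can be computed from a two-by-two table of test results against true disease status. The following is a minimal sketch in Python; the function name `diagnostic_metrics` and the counts for the two clinics are hypothetical and chosen only for illustration, and the sketch assumes the specificity is strictly between 0 and 1 so that both likelihood ratios are defined.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Compute standard accuracy measures from a 2x2 table.

    tp, fn: diseased patients who test positive / negative
    fp, tn: disease-free patients who test positive / negative
    """
    sensitivity = tp / (tp + fn)  # proportion of diseased who test positive
    specificity = tn / (tn + fp)  # proportion of disease-free who test negative
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Likelihood ratios are functions of sensitivity and specificity,
        # so they shift whenever either of those shifts.
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts for the same test applied at two clinics.
clinic_a = diagnostic_metrics(tp=90, fn=10, fp=20, tn=80)
clinic_b = diagnostic_metrics(tp=60, fn=40, fp=10, tn=90)

for name, metrics in (("clinic A", clinic_a), ("clinic B", clinic_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

With these illustrative counts, the sensitivity, the specificity and both likelihood ratios all differ between the two clinics, which is the kind of between-setting variation that the term spectrum bias describes.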
Spectrum bias is generally considered to have three causes.[2] The first is a change in the case-mix of patients with the target disorder (disease), which affects the sensitivity. The second is a change in the case-mix of those without the target disorder (disease-free), which affects the specificity. The third is a change in the prevalence, which affects both the sensitivity and the specificity.[4] This final cause is not widely appreciated, but there is mounting empirical evidence[5] as well as theoretical arguments[6] suggesting that it does indeed affect a test's performance.
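The first cause can be illustrated by treating the overall sensitivity as the case-mix-weighted average of subgroup sensitivities. The sketch below assumes two hypothetical disease stages with fixed, made-up stage-specific sensitivities; the function name `pooled_sensitivity` and all numbers are illustrative assumptions rather than values from any study. The same weighting applied to subgroups of disease-free patients would illustrate the second cause.

```python
def pooled_sensitivity(case_mix, sensitivity_by_stage):
    """Overall sensitivity as the case-mix-weighted average of
    stage-specific sensitivities.

    case_mix: number of diseased patients in each stage
    sensitivity_by_stage: sensitivity of the test within each stage
    """
    total = sum(case_mix.values())
    return sum(
        (count / total) * sensitivity_by_stage[stage]
        for stage, count in case_mix.items()
    )


# Hypothetical stage-specific sensitivities, identical at both clinics.
sensitivity_by_stage = {"early": 0.50, "advanced": 0.90}

# Hypothetical case mixes among diseased patients.
referral_clinic = {"early": 20, "advanced": 80}
primary_care = {"early": 80, "advanced": 20}

sens_referral = pooled_sensitivity(referral_clinic, sensitivity_by_stage)
sens_primary = pooled_sensitivity(primary_care, sensitivity_by_stage)

print(round(sens_referral, 2), round(sens_primary, 2))
```

Under these assumptions the test behaves identically within each stage, yet the overall sensitivity is about 0.82 at the referral clinic and about 0.58 in primary care, purely because the proportions of early and advanced cases differ.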
Examples where the sensitivity and specificity change between different sub-groups of patients may be found with the carcinoembryonic antigen test[7] and urinary dipstick tests.[8]
The diagnostic test performance reported by a study may be artificially overestimated if the study uses a case-control design in which a healthy population ('fittest of the fit') is compared with a population with advanced disease ('sickest of the sick'); that is, two extreme populations are compared rather than typical healthy and diseased populations.[9]
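The overestimation described above can be sketched with a simple threshold model, assuming the test is positive when a continuous marker exceeds a fixed cut-off and that the marker is normally distributed within each population. The cut-off, the means and the population labels below are hypothetical values chosen only to make the effect visible; they do not describe any real test.

```python
from statistics import NormalDist

THRESHOLD = 1.5  # hypothetical cut-off: marker values above it are test-positive

# Hypothetical marker distributions (standard deviation 1 in every group).
fittest_of_fit = NormalDist(mu=-0.5, sigma=1.0)       # very healthy volunteers
typical_disease_free = NormalDist(mu=0.5, sigma=1.0)  # disease-free patients seen in practice
typical_diseased = NormalDist(mu=2.0, sigma=1.0)      # diseased patients seen in practice
sickest_of_sick = NormalDist(mu=3.5, sigma=1.0)       # advanced disease only


def sensitivity(diseased):
    """Probability that a diseased patient's marker exceeds the cut-off."""
    return 1.0 - diseased.cdf(THRESHOLD)


def specificity(disease_free):
    """Probability that a disease-free patient's marker is at or below the cut-off."""
    return disease_free.cdf(THRESHOLD)


# Extreme-group case-control comparison versus typical clinical populations.
extreme = (sensitivity(sickest_of_sick), specificity(fittest_of_fit))
typical = (sensitivity(typical_diseased), specificity(typical_disease_free))

print("extreme groups:", [round(value, 3) for value in extreme])
print("typical groups:", [round(value, 3) for value in typical])
```

In this model both the sensitivity and the specificity come out higher for the extreme groups than for the typical ones, because the two extreme marker distributions lie further from the cut-off and overlap less.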
When heterogeneity between subgroups is recognized and properly analyzed, it can lead to insights about the test's performance in varying populations.