Censoring (statistics) explained

In statistics, censoring is a condition in which the value of a measurement or observation is only partially known. For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.

Censoring also occurs when a value occurs outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 140 kg. If a 160 kg individual is weighed using the scale, the observer would only know that the individual's weight is at least 140 kg.

The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown.

Censoring should not be confused with the related idea of truncation. With censoring, observations result either in knowing the exact value that applies, or in knowing that the value lies within an interval. With truncation, observations never result in values outside a given range: values in the population outside the range are never seen or never recorded if they are seen. Note that in statistics, truncation is not the same as rounding.
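The difference between censoring and truncation can be made concrete with the bathroom scale example. The following is a minimal sketch, assuming a made-up list of true weights; the helper names `censor` and `truncate` are illustrative and not taken from any library.

```python
# Hypothetical true weights in kg; the 140 kg limit is the scale from the text.
SCALE_LIMIT_KG = 140.0
true_weights = [62.5, 88.0, 160.0, 151.3, 104.2]


def censor(values, limit):
    """Right-censor at `limit`: every subject is kept as (reading, is_exact)."""
    return [(min(v, limit), v < limit) for v in values]


def truncate(values, limit):
    """Truncate at `limit`: subjects at or above the limit are never recorded."""
    return [v for v in values if v < limit]


censored = censor(true_weights, SCALE_LIMIT_KG)
truncated = truncate(true_weights, SCALE_LIMIT_KG)

print(len(censored), censored)
print(len(truncated), truncated)
```

The censored sample still has one record per subject, with the two heaviest marked as "at least 140 kg". The truncated sample has fewer records and carries no sign that the heavier subjects ever existed.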

Types

Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at infinity, respectively.

Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, every data set.[1]

A common misconception with time interval data is to classify intervals whose start time is unknown as left censored. In these cases we have a lower bound on the length of the time interval, so the data are right censored (even though the missing start point lies to the left of the known interval when viewed on a timeline).
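One way to keep the terminology straight is to record every observation as a pair of bounds on the quantity of interest and to name the censoring type from those bounds. The sketch below assumes a non-negative quantity (such as a duration); the function `censoring_type` and the numbers in the example are hypothetical.

```python
import math


def censoring_type(lower, upper):
    """Classify a non-negative quantity known only to lie between lower and upper."""
    if lower < 0 or lower > upper:
        raise ValueError("need 0 <= lower <= upper")
    if lower == upper:
        return "exact"
    if lower == 0 and math.isinf(upper):
        return "missing"   # no information at all about the value
    if lower == 0:
        return "left"      # interval begins at zero
    if math.isinf(upper):
        return "right"     # interval ends at infinity
    return "interval"


# The misconception from the text, with made-up numbers: the start time is
# unknown but precedes the start of observation at t = 4, and the event is
# seen at t = 10. The duration is therefore at least 6, with no upper bound.
observation_began = 4.0
event_time = 10.0
minimum_duration = event_time - observation_began

print(censoring_type(minimum_duration, math.inf))
```

The classification is applied to the bounds on the duration, not to the position of the unknown endpoint on the timeline, which is why the example comes out as right censored.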

Analysis

Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Special software programs (often reliability oriented) can conduct a maximum likelihood estimation for summary statistics, confidence intervals, etc.

Epidemiology

One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of vaccination.[2] An early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al. (1989).[3] However, Lin et al.[4] found this approach to be invalid unless all patients accumulated costs with a common deterministic rate function over time, and proposed an alternative estimation technique known as the Lin estimator.[5]

Operating life testing

Reliability testing often consists of conducting a test on an item (under specified conditions) to determine the time it takes for a failure to occur.

An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the time-of-test-termination for those that did not fail.
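The data from such a test can be coded as one record per item, holding a time and a flag that says whether the time is a failure or a test termination. The following is a minimal sketch; the item ids, the 1000-hour termination time, and the function name `code_test_results` are assumptions made for illustration.

```python
def code_test_results(failure_times, termination_time):
    """Turn raw test outcomes into {item: (time, failed)} records.

    failure_times maps a hypothetical item id to its failure time, or to None
    if the item was still running when the test was terminated.
    """
    records = {}
    for item, t in failure_times.items():
        if t is not None and t <= termination_time:
            records[item] = (t, True)                   # exact time-to-failure
        else:
            records[item] = (termination_time, False)   # right censored
    return records


raw_outcomes = {"A": 412.0, "B": None, "C": 987.5, "D": None}
records = code_test_results(raw_outcomes, termination_time=1000.0)
print(records)
```

Items B and D are not discarded: each contributes the information that its time to failure exceeds the termination time.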

Censored regression

An early model for censored regression, the tobit model, was proposed by James Tobin in 1958.[6]

Likelihood

The likelihood is the probability or probability density of what was observed, viewed as a function of the parameters of an assumed model. A censored data point enters the likelihood through the probability of the censoring event under the model, which is a function of the CDF(s) rather than of the density or probability mass.

The most general censoring case is interval censoring:

$$\Pr(a < x \leqslant b) = F(b) - F(a),$$

where $F(x)$ is the CDF of the probability distribution, and the two special cases are:

$$\Pr(-\infty < x \leqslant b) = F(b) - F(-\infty) = F(b) - 0 = F(b) = \Pr(x \leqslant b)$$

$$\Pr(a < x \leqslant \infty) = F(\infty) - F(a) = 1 - F(a) = 1 - \Pr(x \leqslant a) = \Pr(x > a)$$

For continuous probability distributions:

$$\Pr(a < x \leqslant b) = \Pr(a < x < b)$$
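The three cases above reduce to a single computation once the bounds are allowed to be infinite. The sketch below assumes an exponential distribution purely as an example model; `exponential_cdf` and `censored_probability` are illustrative names, and the rate 0.5 is arbitrary.

```python
import math


def exponential_cdf(x, rate):
    """CDF of an exponential distribution: 0 for x <= 0, 1 at +infinity."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return 1.0 - math.exp(-rate * x)


def censored_probability(a, b, cdf):
    """Likelihood contribution Pr(a < x <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)


def F(x):
    return exponential_cdf(x, rate=0.5)  # assumed rate, for illustration only


interval_case = censored_probability(1.0, 3.0, F)     # F(3) - F(1)
left_case = censored_probability(-math.inf, 3.0, F)   # F(3)
right_case = censored_probability(1.0, math.inf, F)   # 1 - F(1)
print(interval_case, left_case, right_case)
```

Passing the CDF as an argument keeps the censoring logic separate from the choice of model, so the same function would work with any other distribution's CDF.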

Example

Suppose we are interested in survival times, $T_1, T_2, \ldots, T_n$, but we don't observe $T_i$ for all $i$. Instead, we observe $(U_i, \delta_i)$, with $U_i = T_i$ and $\delta_i = 1$ if $T_i$ is actually observed, and $(U_i, \delta_i)$, with $U_i < T_i$ and $\delta_i = 0$ if all we know is that $T_i$ is longer than $U_i$.

When $T_i > U_i$, $U_i$ is called the censoring time.[7]

If the censoring times are all known constants, then the likelihood is

$$L = \prod_{i:\,\delta_i = 1} f(u_i) \prod_{i:\,\delta_i = 0} S(u_i),$$

where $f(u_i)$ is the probability density function evaluated at $u_i$, and $S(u_i)$ is the probability that $T_i$ is greater than $u_i$, called the survival function.

This can be simplified by defining the hazard function, the instantaneous force of mortality, as

$$\lambda(u) = f(u)/S(u),$$

so $f(u) = \lambda(u) S(u)$. Then

$$L = \prod_i \lambda(u_i)^{\delta_i} S(u_i).$$
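The density form and the hazard form of the likelihood can be checked against each other numerically. The sketch below works with the logarithm of the likelihood and assumes a Weibull model with arbitrary shape and scale; the data pairs (u, delta) are made up.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # assumed Weibull parameters, for illustration only


def survival(u):
    """S(u) = Pr(T > u) for the Weibull model."""
    return math.exp(-((u / SCALE) ** SHAPE))


def density(u):
    """f(u), written out directly rather than as hazard times survival."""
    return (SHAPE / SCALE) * (u / SCALE) ** (SHAPE - 1) * math.exp(-((u / SCALE) ** SHAPE))


def hazard(u):
    """lambda(u) = f(u) / S(u) for the Weibull model."""
    return (SHAPE / SCALE) * (u / SCALE) ** (SHAPE - 1)


def log_lik_density_form(data):
    """Sum of log f(u) over observed times plus log S(u) over censored times."""
    return sum(math.log(density(u)) if d == 1 else math.log(survival(u)) for u, d in data)


def log_lik_hazard_form(data):
    """Sum over all i of delta_i * log lambda(u_i) + log S(u_i)."""
    return sum(d * math.log(hazard(u)) + math.log(survival(u)) for u, d in data)


data = [(3.0, 1), (7.5, 0), (12.0, 1), (20.0, 0)]  # hypothetical (u_i, delta_i)
print(log_lik_density_form(data), log_lik_hazard_form(data))
```

The two values agree up to floating-point rounding, since every observation contributes a survival term and only the uncensored ones add a hazard term.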

For the exponential distribution, this becomes even simpler, because the hazard rate, $\lambda$, is constant, and $S(u) = \exp(-\lambda u)$. Then:

$$L(\lambda) = \lambda^k \exp\left(-\lambda \sum u_i\right),$$

where $k = \sum \delta_i$.

From this we easily compute $\hat{\lambda}$, the maximum likelihood estimate (MLE) of $\lambda$, as follows:

$$l(\lambda) = \log(L(\lambda)) = k \log(\lambda) - \lambda \sum u_i.$$

Then

$$dl/d\lambda = k/\lambda - \sum u_i.$$

We set this to 0 and solve for $\lambda$ to get:

$$\hat{\lambda} = k / \sum u_i.$$

Equivalently, the mean time to failure is:

$$1/\hat{\lambda} = \sum u_i / k.$$

This differs from the standard MLE for the exponential distribution in that censored observations are counted only in the numerator: every $u_i$ contributes to the total time $\sum u_i$, but only the observed failures contribute to $k$.
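The closed-form estimate is short enough to compute directly. The following is a minimal sketch with made-up data; the function name `exponential_mle` is illustrative, and the comparison with a naive average is included only to show the effect of ignoring censored observations.

```python
def exponential_mle(observations):
    """Rate estimate k / sum(u_i) from pairs (u_i, delta_i).

    delta_i is 1 for an observed failure and 0 for a censored time.
    """
    k = sum(delta for _, delta in observations)
    total_time = sum(u for u, _ in observations)
    if total_time <= 0:
        raise ValueError("total observed time must be positive")
    return k / total_time


# Hypothetical sample: three observed failures and two censored times.
sample = [(2.0, 1), (3.0, 1), (5.0, 0), (4.0, 1), (6.0, 0)]

rate_hat = exponential_mle(sample)
mean_time_to_failure = 1.0 / rate_hat

# Naive alternative that drops the censored observations entirely.
failures_only = [u for u, delta in sample if delta == 1]
naive_mean = sum(failures_only) / len(failures_only)

print(rate_hat, mean_time_to_failure, naive_mean)
```

With this sample, k = 3 and the total time is 20, so the estimated rate is 0.15 and the estimated mean time to failure is about 6.67. Averaging only the three observed failures gives 3.0, which understates the mean because it discards the time accumulated by the censored items.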

Notes and References

  1. Helsel, D. (2010). "Much Ado About Next to Nothing: Incorporating Nondetects in Science". Annals of Occupational Hygiene. 54 (3): 257–262. doi:10.1093/annhyg/mep092. PMID 20032004.
  2. Bernoulli, D. (1766). "Essai d'une nouvelle analyse de la mortalité causée par la petite vérole". Mem. Math. Phy. Acad. Roy. Sci. Paris. Reprinted in Bradley (1971) 21 and Blower (2004).
  3. Quesenberry, C. P. Jr.; Fireman, B.; Hiatt, R. A.; Selby, J. V. (1989). "A survival analysis of hospitalization among patients with acquired immunodeficiency syndrome". 79 (12): 1643–1647. doi:10.2105/AJPH.79.12.1643. PMC 1349769. PMID 2817192.
  4. Lin, D. Y.; Feuer, E. J.; Etzioni, R.; Wax, Y. (1997). "Estimating medical costs from incomplete follow-up data". 53 (2): 419–434. doi:10.2307/2533947. JSTOR 2533947. PMID 9192444.
  5. Wijeysundera, H. C.; Wang, X.; Tomlinson, G.; Ko, D. T.; Krahn, M. D. (2012). "Techniques for estimating health care costs with censored data: an overview for the health services researcher". 4: 145–155. doi:10.2147/CEOR.S31552. PMC 3377439. PMID 22719214.
  6. Tobin, James (1958). "Estimation of relationships for limited dependent variables". Econometrica. 26 (1): 24–36. doi:10.2307/1907382. JSTOR 1907382.
  7. .