Estimate Mean, Standard Deviation, and Standard Error Nonparametrically Based on Censored Data

Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data from a positive-valued distribution that has been subjected to left- or right-censoring, and optionally construct a confidence interval for the mean.

enparCensored(x, censored, censoring.side = "left", correct.se = TRUE, 
    restricted = FALSE, left.censored.min = "Censoring Level", 
    right.censored.max = "Censoring Level", ci = FALSE, 
    ci.method = "normal.approx", ci.type = "two-sided", conf.level = 0.95, 
    pivot.statistic = "t", ci.sample.size = "Total", n.bootstraps = 1000, 
    seed = NULL, warn = FALSE)

Arguments

x: numeric vector of positive-valued observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
censored: numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to elements of x that are censored, and FALSE values correspond to elements of x that are not censored. If the mode of censored is "numeric", it must contain only 1's and 0's; 1 corresponds to TRUE and 0 corresponds to FALSE. Missing (NA) values are allowed but will be removed.
censoring.side: character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right".
correct.se: logical scalar indicating whether to multiply the estimated standard error by a factor to correct for bias. The default value is correct.se=TRUE. See the DETAILS section below.
restricted: logical scalar indicating whether to compute the restricted mean in the case when the smallest censored value is less than or equal to the smallest uncensored value (left-censored data) or the largest censored value is greater than or equal to the largest uncensored value (right-censored data). The default value is restricted=FALSE. See the DETAILS section for more information.
left.censored.min: Only relevant for the case when censoring.side="left", the smallest censored value is less than or equal to the smallest uncensored value, and
restricted=TRUE. In this case, left.censored.min must be the character string "Censoring Level", or else a numeric scalar between 0 and the smallest censored value. The default value is left.censored.min="Censoring Level". See the DETAILS section for more information.
right.censored.max: Only relevant for the case when censoring.side="right", the largest censored value is greater than or equal to the largest uncensored value, and
restricted=TRUE. In this case, right.censored.max must be the character string "Censoring Level", or else a numeric scalar greater than or equal to the largest censored value. The default value is right.censored.max="Censoring Level". See the DETAILS section for more information.
ci: logical scalar indicating whether to compute a confidence interval for the mean or variance. The default value is ci=FALSE.
ci.method: character string indicating what method to use to construct the confidence interval for the mean. The possible values are "normal.approx" (normal approximation; the default), and "bootstrap" (based on bootstrapping). See the DETAILS section for more information. This argument is ignored if ci=FALSE.
ci.type: character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". This argument is ignored if ci=FALSE.
conf.level: a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. This argument is ignored if ci=FALSE.
pivot.statistic: character string indicating which statistic to use for the confidence interval for the mean when ci.method="normal.approx". Possible values are "t" (confidence interval based on the t-statistic; the default), and "z" (confidence interval based on the z-statistic). When pivot.statistic="t" you may supply the argument ci.sample size (see below). This argument is ignored if ci=FALSE.
ci.sample.size: character string indicating what sample size to assume when computing the confidence interval for the mean when ci.method="normal.approx"
and pivot.statistic="t". Possible values are ci.sample.size="Total" (the total number of observations; the default), and
ci.sample.size="Uncensored" (the number of uncensored observations). This argument is ignored if ci=FALSE, ci.method="bootstrap", or pivot.statistic="z".
n.bootstraps: numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean when ci.type="bootstrap". This argument is ignored if ci=FALSE or ci.method="normal.approx".
seed: integer supplied to the function set.seed and used when ci=TRUE and
ci.method="bootstrap". The default value is seed=NULL, in which case the current value of .Random.seed is used. This argument is ignored if ci=FALSE or ci.method="normal.approx". The seed argument is necessary in order to create reproducible results for the bootstrapped confidence intervals (see the EXAMPLES section).
warn: logical scalar indicating whether to issue a notification in the case when a restricted mean will be estimated, but setting the smallest censored value(s) to an uncensored value (left-censored data) or setting the largest censored value(s) to an uncensored value (right-censored data) results in no censored values in the data. In this case, the function enpar is called.

Details

Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some positive-valued distribution with mean $\mu$ and standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c=N-n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ censoring levels $$T_1, T_2, \ldots, T_k; \; k \ge 1 \;\;\;\;\;\; (1)$$ Let $y_1, y_2, \ldots, y_n$ denote the $n$ ordered uncensored observations, and let $r_1, r_2, \ldots, r_n$ denote the order of these uncensored observations within the context of all the observations (censored and uncensored). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $y_1 = 14, y_2 = 14, y_3 = 20$, and $r_1 = 2, r_2 = 3, r_3 = 5$.

Let $y_1', y_2', \ldots, y_p'$ denote the $p$ ordered distinct uncensored observations, let $m_j$ denote the number of detects at $y_j'$ ($j = 1, 2, \ldots, p$), and let $r_j'$ denote the number of $x_i \le y_j'$, i.e., the number of observations (censored and uncensored) less than or equal to $y_j'$ ($j = 1, 2, \ldots, p$). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $y_1' = 14, y_2' = 20$, $m_1 = 2, m_2 = 1$, and $r_1' = 3, r_2' = 5$.

Estimation
This section explains how the mean $\mu$, standard deviation $\sigma$, and standard error of the mean $\hat{\sigma}_{\hat{\mu}}$ are estimated, as well as the restricted mean.

Estimating the Mean
It can be shown that the mean of a positive-valued distribution is equal to the area under the survival curve (Klein and Moeschberger, 2003, p.33): $$\mu = \int_0^\infty [1 - F(t)] dt = \int_0^\infty S(t) dt \;\;\;\;\;\; (2)$$ where $F(t)$ denotes the cumulative distribution function evaluated at $t$ and $S(t) = 1 - F(t)$ denotes the survival function evaluated at $t$. When the Kaplan-Meier estimator is used to construct the survival function, you can use the area under this curve to estimate the mean of the distribution, and the estimator can be as efficient or more efficient than parametric estimators of the mean (Meier, 2004; Helsel, 2012; Lee and Wang, 2003). Let $\hat{F}(t)$ denote the Kaplan-Meier estimator of the empirical cumulative distribution function (ecdf) evaluated at $t$, and let $\hat{S}(t) = 1 - \hat{F}(t)$ denote the estimated survival function evaluated at $t$. (See the help files for ecdfPlotCensored and qqPlotCensored for an explanation of how the Kaplan-Meier estimator of the ecdf is computed.)

The formula for the estimated mean is given by (Lee and Wang, 2003, p. 74): $$\hat{\mu} = \sum_{i=1}^{n} \hat{S}(y_{i-1}) (y_{i} - y_{i-1}) \;\;\;\;\;\; (3)$$ where $y_{0} = 0$ and $\hat{S}(y_{0}) = 1$ by definition. It can be shown that this formula is eqivalent to: $$\hat{\mu} = \sum_{i=1}^n y_{i} [\hat{F}(y_{i}) - \hat{F}(y_{i-1})] \;\;\;\;\;\; (4)$$ where $\hat{F}(y_{0}) = \hat{F}(0) = 0$ by definition, and this is equivalent to: $$\hat{\mu} = \sum_{i=1}^p y_i' [\hat{F}(y_i') - \hat{F}(y_{i-1}')] \;\;\;\;\;\; (5)$$ (USEPA, 2009, pp. 15–7 to 15–12; Beal, 2010; USEPA, 2022, pp. 128–129).

Estimating the Standard Deviation
The formula for the estimated standard deviation is: $$\hat{\sigma} = \{\sum_{i=1}^n (y_{i} - \hat{\mu})^2 [\hat{F}(y_{i}) - \hat{F}(y_{i-1})]\}^{1/2} \;\;\;\;\; (6)$$ which is equivalent to: $$\hat{\sigma} = \{\sum_{i=1}^p (y_i' - \hat{\mu})^2 [\hat{F}(y_i') - \hat{F}(y_{i-1}')]\}^{1/2} \;\;\;\;\; (7)$$ (USEPA, 2009, p. 15-10; Beal, 2010).

Estimating the Standard Error of the Mean
For left-censored data, the formula for the estimated standard error of the mean is: $$\hat{\sigma}_{\hat{\mu}} = [\sum_{i=j}^{p-1} A_j^2 \frac{m_{j+1}}{r_{j+1}'(r_{j+1}' - m_{j+1})}]^{1/2} \;\;\;\;\;\; (8)$$ where $$A_j = \sum_{i=1}^{j} (y_{i+1}' - y_i') \hat{F}(y_i') \;\;\;\;\;\; (9)$$ (Beal, 2010; USEPA, 2022, pp. 128–129).

For rigth-censored data, the formula for the estimated standard error of the mean is: $$\hat{\sigma}_{\hat{\mu}} = [\sum_{r=1}^{n-1} \frac{A_r^2}{(N-r)(N-r+1)}]^{1/2} \;\;\;\;\;\; (10)$$ where $$A_r = \sum_{i=r}^{n-1} (y_{i+1} - y_{i}) \hat{S}(y_{i}) \;\;\;\;\;\; (11)$$ (Lee and Wang, 2003, p. 74).

Kaplan and Meier suggest using a bias correction of $n/(n-1)$ for the estimated variance of the mean (Lee and Wang, 2003, p.75): $$\hat{\sigma}_{\hat{\mu}, BC} = \sqrt{\frac{n}{n-1}} \;\; \hat{\sigma}_{\hat{\mu}} \;\;\;\;\;\; (12)$$ When correct.se=TRUE (the default), Equation (12) is used. Beal (2010), ProUCL 5.2.0 (USEPA, 2022), and the kmms function in the STAND package (Frome and Frome, 2015) all compute the bias-corrected estimate of the standard error of the mean as well.

Estimating the Restricted Mean
If the smallest value for left-censored data is censored and less than or equal to the smallest uncensored value, then the estimated mean will be biased high, and if the largest value for right-censored data is censored and greater than or equal to the largest uncensored value, then the estimated mean will be biased low. One solution to this problem is to instead estimate what is called the restricted mean (Miller, 1981; Lee and Wang, 2003, p. 74; Meier, 2004; Barker, 2009).

To compute the restricted mean (restricted=TRUE), for left-censored data, the smallest censored observation(s) are treated as observed, and set to the smallest censoring level
(left.censored.min="Censoring Level") or some other value less than the smallest censoring level and greater than 0, and then applying the above formulas. To compute the restricted mean for right-censored data, the largest censored observation(s) are treated as observed and set to the censoring level (right.censored.max="Censoring Level") or some value greater than the largest censoring level.

ProUCL 5.2.0 (USEPA, 2022, pp. 128–129) and Beal (2010) do not compute the restricted mean in cases where it could be applied, whereas USEPA (2009, pp. 15–7 to 15–12) and the kmms function in Version 2.0 of the R package STAND (Frome and Frome, 2015) do compute the restricted mean and set the smallest censored observation(s) equal to the censoring level (i.e., what enparCensored does when restricted=TRUE and left.censored.min="Censoring Level").

To be consistent with ProUCL 5.2.0, by default the function enparCensored does not compute the restricted mean (i.e., restricted=FALSE). Note that there are instances where ProUCL 5.2.0 does not agree with enparCensored, and it is not clear why. It should be noted that when the restricted mean is computed, the number of uncensored observations increases because the smallest (left-censored) or largest (right-censored) censored observation(s) is/are set to a specified value and treated as uncensored. The kmms function in Version 2.0 of the STAND package (Frome and Frome, 2015) is inconsistent in how it treats the number of uncensored observations when computing estimates associated with the restricted mean. Although kmms sets the smallest censored observations to the observed censoring level and treats them as not censored, when it computes the bias correction factor for the standard error of the mean, it assumes those observations are still censored (see the EXAMPLES section below).

In the unusual case when a restricted mean will be estimated and setting the smallest censored value(s) to an uncensored value (left-censored data), or setting the largest censored value(s) to an uncensored value (right-censored data), results in no censored values in the data, the Kaplan-Meier estimate of the mean reduces to the sample mean, so the function enpar is called and, if warn=TRUE, a warning is returned.

Confidence Intervals
This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as: $$[\hat{\mu} - t_{1-\alpha/2, v-1}\hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} + t_{1-\alpha/2, v-1}\hat{\sigma}_{\hat{\mu}}] \;\;\;\; (13)$$ where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $v$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribuiton with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $v$. The possible values are the total number of observations, $N$ (ci.sample.size="Total"), or the number of uncensored observations, $n$ (ci.sample.size="Uncensored"). To be consistent with ProUCL 5.2.0, in enparCensored the default value is the total number of observations. The kmms function in the STAND package, on the other hand, uses the number of uncensored observations.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.

Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")
The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).
Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$ where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and estimate of the standard error of the estimate of the mean based on the bootstrap sample.
Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enparCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.
Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$ or to compute the empirical cumulative distribution function of the statistic $T$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as: $$[\hat{G}^{-1}(\frac{\alpha}{2}), \; \hat{G}^{-1}(1-\frac{\alpha}{2})] \;\;\;\;\;\; (14)$$ where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as: $$[\hat{G}^{-1}(\alpha), \; \infty] \;\;\;\;\;\; (15)$$ and a one-sided upper confidence interval is computed as: $$[-\infty, \; \hat{G}^{-1}(1-\alpha)] \;\;\;\;\;\; (16)$$ The function enparCensored calls the R function quantile to compute the empirical quantiles used in Equations (14)-(16).

The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184–188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as: $$[\hat{G}^{-1}(\alpha_1), \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (17)$$ where $$\alpha_1 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(z_0 + z_{\alpha/2})}] \;\;\;\;\;\; (18)$$ $$\alpha_2 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(z_0 + z_{1-\alpha/2})}] \;\;\;\;\;\; (19)$$ $$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})] \;\;\;\;\;\; (20)$$ $$\hat{a} = \frac{\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6[\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2]^{3/2}} \;\;\;\;\;\; (21)$$ where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and $$\hat{\mu}{(\cdot)} = \frac{1}{N} \sum_{i=1}^N \hat{\mu_{(i)}} \;\;\;\;\;\; (22)$$ A one-sided lower confidence interval is given by: $$[\hat{G}^{-1}(\alpha_1), \; \infty] \;\;\;\;\;\; (23)$$ and a one-sided upper confidence interval is given by: $$[-\infty, \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (24)$$ where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in Equations (18) and (19).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.

For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as: $$[\hat{\mu} - t_{1-\alpha/2}\hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} - t_{\alpha/2}\hat{\sigma}_{\hat{\mu}}] \;\;\;\;\;\; (25)$$ where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T$. Similarly, a one-sided lower confidence interval is computed as: $$[\hat{\mu} - t_{1-\alpha}\hat{\sigma}_{\hat{\mu}}, \; \infty] \;\;\;\;\;\; (26)$$ and a one-sided upper confidence interval is computed as: $$[-\infty, \; \hat{\mu} - t_{\alpha}\hat{\sigma}_{\hat{\mu}}] \;\;\;\;\;\; (27)$$

When ci.method="bootstrap", the function enparCensored computes the percentile method, bias-corrected and accelerated method, and bootstrap-t bootstrap confidence intervals. The percentile method is transformation respecting, but not second-order accurate. The bootstrap-t method is second-order accurate, but not transformation respecting. The bias-corrected and accelerated method is both transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).

Value

a list of class "estimateCensored" containing the estimated parameters and other information. See estimateCensored.object for details.

References

Barker, C. (2009). The Mean, Median, and Confidence Intervals of the Kaplan-Meier Survival Estimate – Computations and Applications. The American Statistician 63(1), 78–80.

Beal, D. (2010). A Macro for Calculating Summary Statistics on Left Censored Environmental Data Using the Kaplan-Meier Method. Paper SDA-09, presented at Southeast SAS Users Group 2010, September 26-28, Savannah, GA. https://analytics.ncsu.edu/sesug/2010/SDA09.Beal.pdf.

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.

Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.

El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.

Frome E.L., and D.P. Frome (2015). STAND: Statistical Analysis of Non-Detects. R package version 2.0, https://CRAN.R-project.org/package=STAND.

Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.

Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.

Irwin, J.O. (1949). The Standard Error of an Estimate of Expectation of Life, with Special Reference to Expectation of Tumourless Life in Experiments with Mice. Journal of Hygiene 47, 188–189.

Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.

Klein, J.P., and M.L. Moeschberger. (2003). Survival Analysis: Techniques for Censored and Truncated Data, Second Edition. Springer, New York, 537pp.

Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.

Meier, P., T. Karrison, R. Chappell, and H. Xie. (2004). The Price of Kaplan-Meier. Journal of the American Statistical Association 99(467), 890–896.

Miller, R.G. (1981). Survival Analysis. John Wiley and Sons, New York.

Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.

Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

Singh, A., N. Armbya, and A. Singh. (2010). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2022). ProUCL Version 5.2.0 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. pp. 128–129, 143. https://www.epa.gov/land-research/proucl-software.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.

Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.

Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.

In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).

Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).

Examples

  # Using the lead concentration data from soil samples shown in 
  # Beal (2010), compute the Kaplan-Meier estimators of the mean, 
  # standard deviation, and standard error of the mean, as well as 
  # a 95% upper confidence limit for the mean.  Compare these 
  # results to those given in Beal (2010), and also to the results 
  # produced by ProUCL 5.2.0.

  # First look at the data:
  #-----------------------

  head(Beal.2010.Pb.df)
#>   Pb.char  Pb Censored
#> 1      <1 1.0     TRUE
#> 2      <1 1.0     TRUE
#> 3       2 2.0    FALSE
#> 4     2.5 2.5    FALSE
#> 5     2.8 2.8    FALSE
#> 6      <3 3.0     TRUE
  #  Pb.char  Pb Censored
  #1      <1 1.0     TRUE
  #2      <1 1.0     TRUE
  #3       2 2.0    FALSE
  #4     2.5 2.5    FALSE
  #5     2.8 2.8    FALSE
  #6      <3 3.0     TRUE

  tail(Beal.2010.Pb.df)
#>    Pb.char   Pb Censored
#> 24     <10   10     TRUE
#> 25      10   10    FALSE
#> 26      15   15    FALSE
#> 27      49   49    FALSE
#> 28     200  200    FALSE
#> 29    9060 9060    FALSE
  #   Pb.char   Pb Censored
  #24     <10   10     TRUE
  #25      10   10    FALSE
  #26      15   15    FALSE
  #27      49   49    FALSE
  #28     200  200    FALSE
  #29    9060 9060    FALSE

  # enparCensored Results:
  #-----------------------
  Beal.unrestricted <- with(Beal.2010.Pb.df, 
    enparCensored(x = Pb, censored = Censored, ci = TRUE, 
      ci.type = "upper"))

  Beal.unrestricted
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 29
#> 
#> $censoring.side
#> [1] "left"
#> 
#> $censoring.levels
#> [1]  1  3  4  6  9 10
#> 
#> $percent.censored
#> [1] 34.48276
#> 
#> $parameters
#>      mean        sd   se.mean 
#>  325.3396 1651.0950  315.0023 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "Kaplan-Meier\n                                 (Bias-corrected se.mean)"
#> 
#> $data.name
#> [1] "Pb"
#> 
#> $censoring.name
#> [1] "Censored"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>      LCL      UCL 
#>   0.0000 861.1996 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "Normal Approximation\n                                 (t Distribution)"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 29
#> 
#> $dof
#> [1] 28
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimateCensored"

  #Results of Distribution Parameter Estimation
  #Based on Type I Censored Data
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):               1  3  4  6  9 10 
  #
  #Estimated Parameter(s):          mean    =  325.3396
  #                                 sd      = 1651.0950
  #                                 se.mean =  315.0023
  #
  #Estimation Method:               Kaplan-Meier
  #                                 (Bias-corrected se.mean)
  #
  #Data:                            Pb
  #
  #Censoring Variable:              Censored
  #
  #Sample Size:                     29
  #
  #Percent Censored:                34.48276%
  #
  #Confidence Interval for:         mean
  #
  #Assumed Sample Size:             29
  #
  #Confidence Interval Method:      Normal Approximation
  #                                 (t Distribution)
  #
  #Confidence Interval Type:        upper
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL =   0.0000
  #                                 UCL = 861.1996

  c(Beal.unrestricted$parameters, Beal.unrestricted$interval$limits)
#>      mean        sd   se.mean       LCL       UCL 
#>  325.3396 1651.0950  315.0023    0.0000  861.1996 
  #     mean        sd   se.mean       LCL       UCL 
  # 325.3396 1651.0950  315.0023    0.0000  861.1996

  # Beal (2010) published results:
  #-------------------------------
  #   Mean   Std. Dev.  SE of Mean
  # 325.34     1651.09      315.00

  # ProUCL 5.2.0 results:
  #----------------------
  #   Mean   Std. Dev.  SE of Mean  95% UCL
  # 325.2      1651         315       861.1

  #----------

  # Now compute the restricted mean and associated quantities, 
  # and compare these results with those produced by the 
  # kmms() function in the STAND package.
  #----------------------------------------------------------- 

  Beal.restricted <- with(Beal.2010.Pb.df, 
    enparCensored(x = Pb, censored = Censored, restricted = TRUE, 
      ci = TRUE, ci.type = "upper"))

  Beal.restricted 
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 29
#> 
#> $censoring.side
#> [1] "left"
#> 
#> $censoring.levels
#> [1]  1  3  4  6  9 10
#> 
#> $percent.censored
#> [1] 34.48276
#> 
#> $parameters
#>      mean        sd   se.mean 
#>  325.2011 1651.1221  314.1774 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "Kaplan-Meier (Restricted Mean)\n                                 Smallest censored value(s)\n                                   set to Censoring Level\n                                 (Bias-corrected se.mean)"
#> 
#> $data.name
#> [1] "Pb"
#> 
#> $censoring.name
#> [1] "Censored"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>     LCL     UCL 
#>   0.000 859.658 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "Normal Approximation\n                                 (t Distribution)"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 29
#> 
#> $dof
#> [1] 28
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimateCensored"

  #Results of Distribution Parameter Estimation
  #Based on Type I Censored Data
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):               1  3  4  6  9 10 
  #
  #Estimated Parameter(s):          mean    =  325.2011
  #                                 sd      = 1651.1221
  #                                 se.mean =  314.1774
  #
  #Estimation Method:               Kaplan-Meier (Restricted Mean)
  #                                 Smallest censored value(s)
  #                                   set to Censoring Level
  #                                 (Bias-corrected se.mean)
  #
  #Data:                            Pb
  #
  #Censoring Variable:              Censored
  #
  #Sample Size:                     29
  #
  #Percent Censored:                34.48276%
  #
  #Confidence Interval for:         mean
  #
  #Assumed Sample Size:             29
  #
  #Confidence Interval Method:      Normal Approximation
  #                                 (t Distribution)
  #
  #Confidence Interval Type:        upper
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL =   0.000
  #                                 UCL = 859.658

  c(Beal.restricted$parameters, Beal.restricted$interval$limits)
#>      mean        sd   se.mean       LCL       UCL 
#>  325.2011 1651.1221  314.1774    0.0000  859.6580 
  #     mean        sd   se.mean       LCL       UCL 
  # 325.2011 1651.1221  314.1774    0.0000  859.6580

  # kmms() results:
  #----------------
  #  KM.mean    KM.LCL    KM.UCL     KM.se     gamma 
  # 325.2011 -221.0419  871.4440  315.0075    0.9500 
  
  # NOTE: as pointed out above, the kmms() function treats the 
  #       smallest censored observations (<1 and <1) as NOT 
  #       censored when computing the mean and uncorrected 
  #       standard error of the mean, but assumes these 
  #       observations ARE censored when computing the corrected 
  #       standard error of the mean.
  #--------------------------------------------------------------

  Beal.restricted$parameters["se.mean"] * sqrt((20/21)) * sqrt((19/18))
#>  se.mean 
#> 315.0075 
  #  se.mean 
  # 315.0075

  #==========

  # Repeat the above example, estimating the unrestricted mean and 
  # computing an upper confidence limit based on the bootstrap 
  # instead of on the normal approximation with a t pivot statistic.
  # Compare results to those from ProUCL 5.2.0.
  # Note:  Setting the seed argument lets you reproduce this example.
  #------------------------------------------------------------------

  Beal.unrestricted.boot <- with(Beal.2010.Pb.df, 
    enparCensored(x = Pb, censored = Censored, ci = TRUE, 
      ci.type = "upper", ci.method = "bootstrap", seed = 923))

  Beal.unrestricted.boot
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 29
#> 
#> $censoring.side
#> [1] "left"
#> 
#> $censoring.levels
#> [1]  1  3  4  6  9 10
#> 
#> $percent.censored
#> [1] 34.48276
#> 
#> $parameters
#>      mean        sd   se.mean 
#>  325.3396 1651.0950  315.0023 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "Kaplan-Meier\n                                 (Bias-corrected se.mean)"
#> 
#> $data.name
#> [1] "Pb"
#> 
#> $censoring.name
#> [1] "Censored"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>    Pct.LCL    Pct.UCL    BCa.LCL    BCa.UCL      t.LCL      t.UCL 
#>     0.0000   948.7342     0.0000   942.6596     0.0000 62121.8909 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "Bootstrap"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 29
#> 
#> $n.bootstraps
#> [1] 1000
#> 
#> $too.few.obs.count
#> [1] 0
#> 
#> $no.cen.obs.count
#> [1] 0
#> 
#> attr(,"class")
#> [1] "intervalEstimateCensored"
#> 
#> attr(,"class")
#> [1] "estimateCensored"

  #Results of Distribution Parameter Estimation
  #Based on Type I Censored Data
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):               1  3  4  6  9 10 
  #
  #Estimated Parameter(s):          mean    =  325.3396
  #                                 sd      = 1651.0950
  #                                 se.mean =  315.0023
  #
  #Estimation Method:               Kaplan-Meier
  #                                 (Bias-corrected se.mean)
  #
  #Data:                            Pb
  #
  #Censoring Variable:              Censored
  #
  #Sample Size:                     29
  #
  #Percent Censored:                34.48276%
  #
  #Confidence Interval for:         mean
  #
  #Assumed Sample Size:             29
  #
  #Confidence Interval Method:      Bootstrap
  #
  #Number of Bootstraps:            1000
  #
  #Number of Bootstrap Samples
  #With No Censored Values:         0
  #
  #Number of Times Bootstrap
  #Repeated Because Too Few
  #Uncensored Observations:         0
  #
  #Confidence Interval Type:        upper
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             Pct.LCL =     0.0000
  #                                 Pct.UCL =   948.7342
  #                                 BCa.LCL =     0.0000
  #                                 BCa.UCL =   942.6596
  #                                 t.LCL   =     0.0000
  #                                 t.UCL   = 62121.8909

  c(Beal.unrestricted.boot$interval$limits)
#>    Pct.LCL    Pct.UCL    BCa.LCL    BCa.UCL      t.LCL      t.UCL 
#>     0.0000   948.7342     0.0000   942.6596     0.0000 62121.8909 
  #   Pct.LCL    Pct.UCL    BCa.LCL    BCa.UCL      t.LCL      t.UCL 
  #    0.0000   948.7342     0.0000   942.6596     0.0000 62121.8909

  # ProUCL 5.2.0 results:
  #----------------------
  #   Pct.LCL    Pct.UCL    BCa.LCL    BCa.UCL      t.LCL      t.UCL 
  #    0.0000   944.3        0.0000   947.8        0.0000 62169

  #==========

  # Clean up
  #---------
  rm(Beal.unrestricted, Beal.restricted, Beal.unrestricted.boot)