Returns a list of “ordered” observations and associated plotting positions based on Type I left-censored or right-censored data. These plotting positions may be used to construct empirical cumulative distribution plots or quantile-quantile plots, or to estimate distribution parameters.

ppointsCensored(x, censored, censoring.side = "left",
    prob.method = "michael-schucany", plot.pos.con = 0.375)

Arguments

x

numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.

censored

numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to elements of x that are censored, and FALSE values correspond to elements of x that are not censored. If the mode of censored is "numeric", it must contain only 1's and 0's; 1 corresponds to TRUE and 0 corresponds to FALSE. Missing (NA) values are allowed but will be removed.

censoring.side

character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right".

prob.method

character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are:
"kaplan-meier" (product-limit method of Kaplan and Meier (1958)),
"modified kaplan-meier" (modification of Kaplan-Meier method),
"nelson" (hazard plotting method of Nelson (1972)),
"michael-schucany" (generalization of the product-limit method due to Michael and Schucany (1986)), and
"hirsch-stedinger" (generalization of the product-limit method due to Hirsch and Stedinger (1987)).

The default value is prob.method="michael-schucany".

The "nelson" method is only available for censoring.side="right", and the "modified kaplan-meier" method is only available for
censoring.side="left". See the DETAILS section for more explanation.

plot.pos.con

numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is used only if prob.method is equal to "michael-schucany" or "hirsch-stedinger".

Details

Methods for computing plotting positions for complete data sets (no censored observations) are discussed in D'Agostino, R.B. (1986a) and Cleveland (1993). For data sets with censored observations, these methods must be modified. The function ppointsCensored allows you to compute plotting positions based on any of the following methods:

  • Product-limit method of Kaplan and Meier (1958) (prob.method="kaplan-meier").

  • Hazard plotting method of Nelson (1972) (prob.method="nelson").

  • Generalization of the product-limit method due to Michael and Schucany (1986)
    (prob.method="michael-schucany") (the default).

  • Generalization of the product-limit method due to Hirsch and Stedinger (1987)
    (prob.method="hirsch-stedinger").

Let \(\underline{x}\) denote a random sample of \(N\) observations from some distribution. Assume \(n\) (\(0 < n < N\)) of these observations are known and \(c\) (\(c=N-n\)) of these observations are all censored below (left-censored) or all censored above (right-censored) at \(k\) fixed censoring levels $$T_1, T_2, \ldots, T_K; \; K \ge 1 \;\;\;\;\;\; (1)$$ For the case when \(K \ge 2\), the data are said to be Type I multiply censored. For the case when \(K=1\), set \(T = T_1\). If the data are left-censored and all \(n\) known observations are greater than or equal to \(T\), or if the data are right-censored and all \(n\) known observations are less than or equal to \(T\), then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let \(c_j\) denote the number of observations censored below or above censoring level \(T_j\) for \(j = 1, 2, \ldots, K\), so that $$\sum_{i=1}^K c_j = c \;\;\;\;\;\; (2)$$ Let \(x_{(1)}, x_{(2)}, \ldots, x_{(N)}\) denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity \(x_{(i)}\) does not necessarily represent the \(i\)'th “largest” observation from the (unknown) complete sample.

Finally, let \(\Omega\) (omega) denote the set of \(n\) subscripts in the “ordered” sample that correspond to uncensored observations.

Product-Limit Method of Kaplan and Meier (prob.method="kaplan-meier")
For complete data sets (no censored observations), the empirical probabilities estimator of the cumulative distribution function evaluated at the \(i\)'th ordered observation is given by (D'Agostino, 1986a, p.8): $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (3)$$ where \(\#[x_j \le x_{(i)}]\) denotes the number of observations less than or equal to \(x_{(i)}\) (see the help file for ecdfPlot). Kaplan and Meier (1958) extended this method of computing the empirical cdf to the case of right-censored data.

Right-Censored Data (censoring.side="right")
Let \(S(t)\) denote the survival function evaluated at \(t\), that is: $$S(t) = 1 - F(t) = Pr(X > t) \;\;\;\;\;\; (4)$$ Kaplan and Meier (1958) show that a nonparametric estimate of the survival function at the \(i\)'th ordered observation that is not censored (i.e., \(i \in \Omega\)), is given by:

\(\hat{S}[x_{(i)}]\)\(=\)\(\widehat{Pr}[X > x_{(i)}]\)
\(=\)\(\widehat{Pr}[X > x_{(1)}]\)
\(\;\; \widehat{Pr}[X > x_{(2)} | X > x_{(1)}] \;\; \cdots\)
\(\;\; \widehat{Pr}[X > x_{(i)} | X > x_{(i-1)}]\)
\(=\)\(\prod_{j \in \Omega, j \le i} \frac{n_j - d_j}{n_j}, \;\; i \in \Omega \;\;\;\;\; (5)\)

where \(n_j\) is the number of observations (uncensored or censored) with values greater than or equal to \(x_{(j)}\), and \(d_j\) denotes the number of uncensored observations exactly equal to \(x_{(j)}\) (if there are no tied uncensored observations then \(d_j\) will equal 1 for all values of \(j\)). (See also Lee and Wang, 2003, pp. 64–69; Michael and Schucany, 1986). By convention, the estimate of the survival function at a censored observation is set equal to the estimated value of the survival function at the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the estimate of the survival function is set to 1 for that censoring level.

Thus the Kaplan-Meier plotting position at the \(i\)'th ordered observation that is not censored (i.e., \(i \in \Omega\)), is given by: $$\hat{p}_i = \hat{F}[x_{(i)}] = 1 - \prod_{j \in \Omega, j \le i} \frac{n_j - d_j}{n_j} \;\;\;\;\;\; (6)$$

The plotting position for a censored observation is set equal to the plotting position associated with the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the plotting position is set to 0 for that censoring level.

As an example, consider the following right-censored data set: $$3, \ge4, \ge4, 5, 5, 6$$ The table below shows how the plotting positions are computed.

\(i\)\(x_{(i)}\)\(n_i\)\(d_i\)\(\frac{n_i-d_i}{n_i}\)Plotting Position
1\(3\)\(6\)\(1\)\(5/6\)\(1 - (5/6) = 0.167\)
2\(\ge4\)
3\(\ge4\)
4\(5\)\(3\)\(2\)\(1/3\)\(1 - (5/6)(1/3) = 0.722\)
5\(5\)\(0.722\)
6\(6\)\(1\)\(1\)\(0/1\)\(1 - (5/6)(1/3)(0/1) = 1\)

Note that for complete data sets, Equation (6) reduces to Equation (3).

Left-Censored Data (censoring.side="left")
Gillespie et al. (2010) give formulas for the Kaplan-Meier estimator for the case of left-cesoring (censoring.side="left"). In this case, the plotting position for the \(i\)'th ordered observation, assuming it is not censored, is computed as: $$\hat{p}_i = \hat{F}[x_{(i)}] = \prod_{j \in \Omega, j > i} \frac{n_j - d_j}{n_j} \;\;\;\;\;\; (7)$$ where \(n_j\) is the number of observations (uncensored or censored) with values less than or equal to \(x_{(j)}\), and \(d_j\) denotes the number of uncensored observations exactly equal to \(x_{(j)}\) (if there are no tied uncensored observations then \(d_j\) will equal 1 for all values of \(j\)). The plotting position is equal to 1 for the largest uncensored order statistic.

As an example, consider the following left-censored data set: $$3, <4, <4, 5, 5, 6$$ The table below shows how the plotting positions are computed.

\(i\)\(x_{(i)}\)\(n_i\)\(d_i\)\(\frac{n_i-d_i}{n_i}\)Plotting Position
1\(3\)\(1\)\(1\)\(0/1\)\(1(5/6)(3/5) = 0.5\)
2\(<4\)
3\(<4\)
4\(5\)\(5\)\(2\)\(3/5\)\(0.833\)
5\(5\)\(1(5/6) = 0.833\)
6\(6\)\(6\)\(1\)\(5/6\)\(1\)

Note that for complete data sets, Equation (7) reduces to Equation (3).

Modified Kaplan-Meier Method (prob.method="modified kaplan-meier")
(Left-Censored Data Only.) For left-censored data, the modified Kaplan-Meier method is the same as the Kaplan-Meier method, except that for the largest uncensored order statistic, the plotting position is not set to 1 but rather is set equal to the Blom plotting position: \((N - 0.375)/(N + 0.25)\). This method is useful, for example, when creating Quantile-Quantile plots.

Hazard Plotting Method of Nelson (prob.method="nelson")
(Right-Censored Data Only.) For right-censored data, Equation (5) can be re-written as: $$\hat{S}[x_{(i)}] = \prod_{j \in \Omega, j \le i} \frac{N-j}{N-j+1}, \;\; i \in \Omega \;\;\;\;\;\; (8)$$ Nelson (1972) proposed the following formula for plotting positions for the uncensored observations in the context of estimating the hazard function (see Michael and Schucany,1986, p.469): $$\hat{p}_i = \hat{F}[x_{(i)}] = 1 - \prod_{j \in \Omega, j \le i} exp(\frac{-1}{N-j+1}) \;\;\;\;\;\; (9)$$ See Lee and Wang (2003) for more information about the hazard function.

As for the Kaplan and Meier (1958) method, the plotting position for a censored observation is set equal to the plotting position associated with the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the plotting position is set to 0 for that censoring level.

Generalization of Product-Limit Method, Michael and Schucany
(prob.method="michael-schucany")
For complete data sets, the disadvantage of using Equation (3) above to define plotting positions is that it implies the largest observed value is the maximum possible value of the distribution (the \(100\)'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.

A more frequently used formula for plotting positions for complete data sets is given by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{N - 2a + 1} \;\;\;\;\;\; (10)$$ where \(0 \le a \le 1\) (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25). The value of \(a\) is usually chosen so that the plotting positions are approximately unbiased (i.e., approximate the mean of their distribution) or else approximate the median value of their distribution (see the help file for ecdfPlot). Michael and Schucany (1986) extended this method for both left- and right-censored data sets.

Right-Censored Data (censoring.side="right")
For right-censored data sets, the plotting positions for the uncensored observations are computed as: $$\hat{p}_i = 1 - \frac{N-a+1}{N-2a+1} \prod_{j \in \Omega, j \le i} \frac{N-j-a+1}{N-j-a+2} \;\; i \in \Omega \;\;\;\;\;\; (11)$$ Note that the plotting positions proposed by Herd (1960) and Johnson (1964) are a special case of Equation (11) with \(a=0\). Equation (11) reduces to Equation (10) in the case of complete data sets. Note that unlike the Kaplan-Meier method, plotting positions associated with tied uncensored observations are not the same (just as in the case for complete data using Equation (10)).

As for the Kaplan and Meier (1958) method, for right-censored data the plotting position for a censored observation is set equal to the plotting position associated with the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the plotting position is set to 0 for that censoring level.

Left-Censored Data (censoring.side="left")
For left-censored data sets the plotting positions are computed as: $$\hat{p}_i = \frac{N-a+1}{N-2a+1} \prod_{j \in \Omega, j \ge i} \frac{j-a}{j-a+1} \;\; i \in \Omega \;\;\;\;\;\; (12)$$ Equation (12) reduces to Equation (10) in the case of complete data sets. Note that unlike the Kaplan-Meier method, plotting positions associated with tied uncensored observations are not the same (just as in the case for complete data using Equation (10)).

For left-censored data, the plotting position for a censored observation is set equal to the plotting position associated with the smallest uncensored observation greater than or equal to that censoring level. If there are no uncensored observations greater than or equal to a particular censoring level, the plotting position is set to 1 for that censoring level.

Generalization of Product-Limit Method, Hirsch and Stedinger
(prob.method="hirsch-stedinger")
Hirsch and Stedinger (1987) use a slightly different approach than Kaplan and Meier (1958) and Michael and Schucany (1986) to derive a nonparametric estimate of the survival function (probability of exceedance) in the context of left-censored data. First they estimate the value of the survival function at each of the censoring levels. The value of the survival function for an uncensored observation between two adjacent censoring levels is then computed by linear interpolation (in the form of a plotting position). See also Helsel and Cohn (1988).

The discussion below presents an extension of the method of Hirsch and Stedinger (1987) to the case of right-censored data, and then presents the original derivation due to Hirsch and Stedinger (1987) for left-censored data.

Right-Censored Data (censoring.side="right")
For right-censored data, the survival function is estimated as follows. For the \(j\)'th censoring level (\(j = 0, 1, \ldots, K\)), write the value of the survival function as:

\(S(T_j)\)\(=\)\(Pr[X > T_j]\)
\(=\)\(Pr[X > T_{j+1}] + Pr[T_j < X \le T_{j+1}]\)
\(=\)\(S(T_{j+1}) + Pr[T_j < X \le T_{j+1} | X > T_j] Pr[X > T_j]\)
\(=\)\(S(T_{j+1}) + Pr[T_j < X \le T_{j+1} | X > T_j] S(T_j) \;\;\;\;\;\; (13)\)

where $$T_0 = -\infty, \;\;\;\;\;\; (14)$$ $$T_{K+1} = \infty \;\;\;\;\;\; (15)$$ Now set

\(A_j\)\(=\)# uncensored observations in \((T_j, T_{j+1}] \;\;\;\;\;\; (16)\)
\(B_j\)\(=\)# observations in \((T_{j+1}, \infty) \;\;\;\;\;\; (17)\)

for \(j = 0, 1, \ldots, K\). Then the method of moments estimator of the conditional probability in Equation (13) is given by: $$\widehat{Pr}[T_j < X \le T_{j+1} | X > T_j] = \frac{A_j}{A_j + B_j} \;\;\;\;\;\; (18)$$ Hence, by equations (13) and (18) we have $$\hat{S}(T_j) = \hat{S}(T_{j+1}) + (\frac{A_j}{A_j + B_j}) \hat{S}(T_{j}) \;\;\;\;\;\; (19)$$ which can be rewritten as: $$\hat{S}(T_{j+1}) = \hat{S}(T_j) [1 - (\frac{A_j}{A_j + B_j})] \;\;\;\;\;\; (20)$$ Equation (20) can be solved interatively for \(j = 1, 2, \ldots, K\). Note that $$\hat{S}(T_0) = \hat{S}(-\infty) = S(-\infty) = 1 \;\;\;\;\;\; (21)$$ $$\hat{S}(T_{K+1}) = \hat{S}(\infty) = S(\infty) = 0 \;\;\;\;\;\; (22)$$

Once the values of the survival function at the censoring levels are computed, the plotting positions for the \(A_j\) uncensored observations in the interval \((T_J, T_{j+1}]\) (\(j = 0, 1, \ldots, K\)) are computed as $$\hat{p}_i = [1 - \hat{S}(T_j)] + [\hat{S}(T_j) - \hat{S}(T_{j+1})] \frac{r-a}{A_j - 2a + 1} \;\;\;\;\;\; (23)$$ where \(a\) denotes the plotting position constant, \(0 \le a \le 1\), and \(r\) denotes the rank of the \(i\)'th observation among the \(A_j\) uncensored observations in the interval \((T_J, T_{j+1}]\). (Tied observations are given distinct ranks.)

For the \(c_j\) observations censored at censoring level \(T_j\) (\(j = 1, 2, \ldots, K\)), the plotting positions are computed as: $$\hat{p}_i = 1 - [\hat{S}(T_j) \frac{r-a}{c_j - 2a + 1}] \;\;\;\;\;\; (24)$$ where \(r\) denotes the rank of the \(i\)'th observation among the \(c_j\) observations censored at censoring level \(T_j\). Note that all the observations censored at the same censoring level are given distinct ranks, even though there is no way to distinguish between them.

Left-Censored Data (censoring.side="left")
For left-censored data, Hirsch and Stedinger (1987) modify the definition of the survival function as follows: $$S^*(t) = Pr[X \ge t] \;\;\;\;\;\; (25)$$ For continuous distributions, the functions in Equations (4) and (25) are identical.

Hirsch and Stedinger (1987) show that for the \(j\)'th censoring level (\(j = 0, 1, \ldots, K\)), the value of the survival function can be written as:

\(S(T_j)\)\(=\)\(Pr[X \ge T_j]\)
\(=\)\(Pr[X \ge T_{j+1}] + Pr[T_j \le X < T_{j+1}]\)
\(=\)\(S^*(T_{j+1}) + Pr[T_j \le X < T_{j+1} | X < T_{j+1}] Pr[X < T_{j+1}]\)
\(=\)\(S^*(T_{j+1}) + Pr[T_j \le X < T_{j+1} | X < T_j] [1 - S^*(T_j)] \;\;\;\;\;\; (26)\)

where \(T_0\) and \(T_{K+1}\) are defined in Equations (14) and (15).

Now set

\(A_j\)\(=\)# uncensored observations in \([T_j, T_{j+1}) \;\;\;\;\;\; (27)\)
\(B_j\)\(=\)# observations in \((-\infty, T_j) \;\;\;\;\;\; (28)\)

for \(j = 0, 1, \ldots, K\). Then the method of moments estimator of the conditional probability in Equation (26) is given by: $$Pr[T_j \le X < T_{j+1} | X < T_{j+1}] = \frac{A_j}{A_j + B_j} \;\;\;\;\;\; (29)$$ Hence, by Equations (26) and (29) we have $$\hat{S}(T_j) = \hat{S}(T_{j+1}) + (\frac{A_j}{A_j + B_j}) \hat{S}(T_{j}) \;\;\;\;\;\; (30)$$ which can be solved interatively for \(j = 1, 2, \ldots, K\). Note that $$\widehat{S^*}(T_{K+1}) = \widehat{S^*}(\infty) = S^*(\infty) = 0 \;\;\;\;\;\; (31)$$ $$\widehat{S^*}(T_0) = \widehat{S^*}(-\infty) = S^*(-\infty) = 1 \;\;\;\;\;\; (32)$$

Once the values of the survival function at the censoring levels are computed, the plotting positions for the \(A_j\) uncensored observations in the interval \([T_J, T_{j+1})\) (\(j = 0, 1, \ldots, K\)) are computed as $$\hat{p}_i = [1 - \widehat{S^*}(T_j)] + [\widehat{S^*}(T_j) - \widehat{S^*}(T_{j+1})] \frac{r-a}{A_j - 2a + 1} \;\;\;\;\;\; (33)$$ where \(a\) denotes the plotting position constant, \(0 \le a \le 0.5\), and \(r\) denotes the rank of the \(i\)'th observation among the \(A_j\) uncensored observations in the interval \([T_J, T_{j+1})\). (Tied observations are given distinct ranks.)

For the \(c_j\) observations censored at censoring level \(T_j\) (\(j = 1, 2, \ldots, K\)), the plotting positions are computed as: $$\hat{p}_i = [1 - \widehat{S^*}(T_j)] \frac{r-a}{c_j - 2a + 1} \;\;\;\;\;\; (34)$$ where \(r\) denotes the rank of the \(i\)'th observation among the \(c_j\) observations censored at censoring level \(T_j\). Note that all the observations censored at the same censoring level are given distinct ranks, even though there is no way to distinguish between them.

Value

ppointsCensored returns a list with the following components:

Order.Statistics

numeric vector of the “ordered” observations.

Cumulative.Probabilities

numeric vector of the associated plotting positions.

Censored

logical vector indicating which of the ordered observations are censored.

Censoring.Side

character string indicating whether the data are left- or right-censored. This is same value as the argument censoring.side.

Prob.Method

character string indicating what method was used to compute the plotting positions. This is the same value as the argument prob.method.

Optional Component (only present when prob.method="michael-schucany" or
prob.method="hirsch-stedinger"):

Plot.Pos.Con

numeric scalar containing the value of the plotting position constant that was used. This is the same as the argument plot.pos.con.

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.

Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.

Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.

Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.

Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.

Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.

Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.

Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

For censored data sets, plotting positions may be used to construct empirical cumulative distribution plots (see ecdfPlotCensored), construct quantile-quantile plots (see qqPlotCensored), or to estimate distribution parameters (see FcnsByCatCensoredData).

The function survfit in the built-in R library survival computes the survival function for right-censored, left-censored, or interval-censored data. Calling survfit with type="kaplan-meier" will produce similar results to calling ppointsCensored with prob.method="kaplan-meier". Also, calling survfit with type="fh2" will produce similar results to calling ppointsCensored with prob.method="nelson".

Helsel and Cohn (1988, p.2001) found very little effect of changing the value of the plotting position constant when using the method of Hirsch and Stedinger (1987) to compute plotting positions for multiply left-censored data. In general, there will be very little difference between plotting positions computed by the different methods except in the case of very small samples and a large amount of censoring.

Examples

  # Generate 20 observations from a normal distribution with mean=20 and sd=5,
  # censor all observations less than 18, then compute plotting positions for
  # this data set.  Compare the plotting positions to the plotting positions
  # for the uncensored data set.  Note that the plotting positions for the
  # censored data set start at the first ordered uncensored observation and
  # that for values of x > 18 the plotting positions for the two data sets are
  # exactly the same. This is because there is only one censoring level and
  # no uncensored observations fall below the censored observations.
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(333)
  x <- rnorm(20, mean=20, sd=5)
  censored <- x < 18
  censored
#>  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#> [13] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
  # [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
  #[13] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

  sum(censored)
#> [1] 7
  #[1] 7

  new.x <- x
  new.x[censored] <- 18
  round(sort(new.x),1)
#>  [1] 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.1 18.7 19.6 20.2 20.3 20.6 21.4 21.8
#> [16] 21.8 23.2 26.2 26.8 29.7
  # [1] 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.1 18.7 19.6 20.2 20.3 20.6 21.4
  #[15] 21.8 21.8 23.2 26.2 26.8 29.7

  p.list <- ppointsCensored(new.x, censored)
  p.list
#> $Order.Statistics
#>  [1] 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.09771
#>  [9] 18.65418 19.58594 20.21931 20.26851 20.55296 21.38869 21.76359 21.82364
#> [17] 23.16804 26.16527 26.84336 29.67340
#> 
#> $Cumulative.Probabilities
#>  [1] 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432
#>  [8] 0.3765432 0.4259259 0.4753086 0.5246914 0.5740741 0.6234568 0.6728395
#> [15] 0.7222222 0.7716049 0.8209877 0.8703704 0.9197531 0.9691358
#> 
#> $Censored
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 
#> $Censoring.Side
#> [1] "left"
#> 
#> $Prob.Method
#> [1] "michael-schucany"
#> 
#> $Plot.Pos.Con
#> [1] 0.375
#> 
  #$Order.Statistics
  # [1] 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.09771
  # [9] 18.65418 19.58594 20.21931 20.26851 20.55296 21.38869 21.76359 21.82364
  #[17] 23.16804 26.16527 26.84336 29.67340
  #
  #$Cumulative.Probabilities
  # [1] 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432
  # [8] 0.3765432 0.4259259 0.4753086 0.5246914 0.5740741 0.6234568 0.6728395
  #[15] 0.7222222 0.7716049 0.8209877 0.8703704 0.9197531 0.9691358
  #
  #$Censored
  # [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
  #[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  #
  #$Censoring.Side
  #[1] "left"
  #
  #$Prob.Method
  #[1] "michael-schucany"
  #
  #$Plot.Pos.Con
  #[1] 0.375

  #----------

  # Round off plotting positions to two decimal places
  # and compare to plotting positions that ignore censoring
  #--------------------------------------------------------

  round(p.list$Cum, 2)
#>  [1] 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.43 0.48 0.52 0.57 0.62 0.67 0.72
#> [16] 0.77 0.82 0.87 0.92 0.97
  # [1] 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.43 0.48 0.52 0.57 0.62 0.67
  #[15] 0.72 0.77 0.82 0.87 0.92 0.97

  round(ppoints(x, a=0.375), 2)
#>  [1] 0.03 0.08 0.13 0.18 0.23 0.28 0.33 0.38 0.43 0.48 0.52 0.57 0.62 0.67 0.72
#> [16] 0.77 0.82 0.87 0.92 0.97
  # [1] 0.03 0.08 0.13 0.18 0.23 0.28 0.33 0.38 0.43 0.48 0.52 0.57 0.62 0.67
  #[15] 0.72 0.77 0.82 0.87 0.92 0.97

  #----------

  # Clean up
  #---------
  rm(x, censored, new.x, p.list)

  #----------------------------------------------------------------------------

  # Reproduce the example in Appendix B of Helsel and Cohn (1988).  The data
  # are stored in Helsel.Cohn.88.appb.df.  This data frame contains 18
  # observations, of which 9 are censored below one of 2 distinct censoring
  # levels.

  Helsel.Cohn.88.app.b.df
#>    Conc.orig Conc Censored
#> 1         <1    1     TRUE
#> 2         <1    1     TRUE
#> 3         <1    1     TRUE
#> 4         <1    1     TRUE
#> 5         <1    1     TRUE
#> 6         <1    1     TRUE
#> 7          3    3    FALSE
#> 8          7    7    FALSE
#> 9          9    9    FALSE
#> 10       <10   10     TRUE
#> 11       <10   10     TRUE
#> 12       <10   10     TRUE
#> 13        12   12    FALSE
#> 14        15   15    FALSE
#> 15        20   20    FALSE
#> 16        27   27    FALSE
#> 17        33   33    FALSE
#> 18        50   50    FALSE
  #   Conc.orig Conc Censored
  #1         <1    1     TRUE
  #2         <1    1     TRUE
  #...
  #17        33   33    FALSE
  #18        50   50    FALSE

  p.list <- with(Helsel.Cohn.88.app.b.df,
    ppointsCensored(Conc, Censored, prob.method="hirsch-stedinger", plot.pos.con=0))
  lapply(p.list[1:2], round, 3)
#> $Order.Statistics
#>  [1]  1  1  1  1  1  1  3  7  9 10 10 10 12 15 20 27 33 50
#> 
#> $Cumulative.Probabilities
#>  [1] 0.063 0.127 0.190 0.254 0.317 0.381 0.500 0.556 0.611 0.167 0.333 0.500
#> [13] 0.714 0.762 0.810 0.857 0.905 0.952
#> 
  #$Order.Statistics
  # [1]  1  1  1  1  1  1  3  7  9 10 10 10 12 15 20 27 33 50
  #
  #$Cumulative.Probabilities
  # [1] 0.063 0.127 0.190 0.254 0.317 0.381 0.500 0.556 0.611 0.167 0.333 0.500
  #[13] 0.714 0.762 0.810 0.857 0.905 0.952

  # Clean up
  #---------
  rm(p.list)

  #----------------------------------------------------------------------------

  # Example 15-1 of USEPA (2009, page 15-10) gives an example of
  # computing plotting positions based on censored manganese
  # concentrations (ppb) in groundwater collected at 5 monitoring
  # wells.  The data for this example are stored in
  # EPA.09.Ex.15.1.manganese.df.

  EPA.09.Ex.15.1.manganese.df
#>    Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
#> 1       1 Well.1                 <5           5.0     TRUE
#> 2       2 Well.1               12.1          12.1    FALSE
#> 3       3 Well.1               16.9          16.9    FALSE
#> 4       4 Well.1               21.6          21.6    FALSE
#> 5       5 Well.1                 <2           2.0     TRUE
#> 6       1 Well.2                 <5           5.0     TRUE
#> 7       2 Well.2                7.7           7.7    FALSE
#> 8       3 Well.2               53.6          53.6    FALSE
#> 9       4 Well.2                9.5           9.5    FALSE
#> 10      5 Well.2               45.9          45.9    FALSE
#> 11      1 Well.3                 <5           5.0     TRUE
#> 12      2 Well.3                5.3           5.3    FALSE
#> 13      3 Well.3               12.6          12.6    FALSE
#> 14      4 Well.3              106.3         106.3    FALSE
#> 15      5 Well.3               34.5          34.5    FALSE
#> 16      1 Well.4                6.3           6.3    FALSE
#> 17      2 Well.4               11.9          11.9    FALSE
#> 18      3 Well.4                 10          10.0    FALSE
#> 19      4 Well.4                 <2           2.0     TRUE
#> 20      5 Well.4               77.2          77.2    FALSE
#> 21      1 Well.5               17.9          17.9    FALSE
#> 22      2 Well.5               22.7          22.7    FALSE
#> 23      3 Well.5                3.3           3.3    FALSE
#> 24      4 Well.5                8.4           8.4    FALSE
#> 25      5 Well.5                 <2           2.0     TRUE
  #   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
  #1       1 Well.1                 <5           5.0     TRUE
  #2       2 Well.1               12.1          12.1    FALSE
  #3       3 Well.1               16.9          16.9    FALSE
  #4       4 Well.1               21.6          21.6    FALSE
  #5       5 Well.1                 <2           2.0     TRUE
  #...
  #21      1 Well.5               17.9          17.9    FALSE
  #22      2 Well.5               22.7          22.7    FALSE
  #23      3 Well.5                3.3           3.3    FALSE
  #24      4 Well.5                8.4           8.4    FALSE
  #25      5 Well.5                 <2           2.0     TRUE

  p.list.EPA <- with(EPA.09.Ex.15.1.manganese.df,
    ppointsCensored(Manganese.ppb, Censored,
      prob.method = "kaplan-meier"))
  data.frame(Mn = p.list.EPA$Order.Statistics, Censored = p.list.EPA$Censored,
    CDF = p.list.EPA$Cumulative.Probabilities)
#>       Mn Censored  CDF
#> 1    2.0     TRUE 0.21
#> 2    2.0     TRUE 0.21
#> 3    2.0     TRUE 0.21
#> 4    3.3    FALSE 0.28
#> 5    5.0     TRUE 0.28
#> 6    5.0     TRUE 0.28
#> 7    5.0     TRUE 0.28
#> 8    5.3    FALSE 0.32
#> 9    6.3    FALSE 0.36
#> 10   7.7    FALSE 0.40
#> 11   8.4    FALSE 0.44
#> 12   9.5    FALSE 0.48
#> 13  10.0    FALSE 0.52
#> 14  11.9    FALSE 0.56
#> 15  12.1    FALSE 0.60
#> 16  12.6    FALSE 0.64
#> 17  16.9    FALSE 0.68
#> 18  17.9    FALSE 0.72
#> 19  21.6    FALSE 0.76
#> 20  22.7    FALSE 0.80
#> 21  34.5    FALSE 0.84
#> 22  45.9    FALSE 0.88
#> 23  53.6    FALSE 0.92
#> 24  77.2    FALSE 0.96
#> 25 106.3    FALSE 1.00
  #      Mn Censored  CDF
  #1    2.0     TRUE 0.21
  #2    2.0     TRUE 0.21
  #3    2.0     TRUE 0.21
  #4    3.3    FALSE 0.28
  #5    5.0     TRUE 0.28
  #6    5.0     TRUE 0.28
  #7    5.0     TRUE 0.28
  #8    5.3    FALSE 0.32
  #9    6.3    FALSE 0.36
  #10   7.7    FALSE 0.40
  #11   8.4    FALSE 0.44
  #12   9.5    FALSE 0.48
  #13  10.0    FALSE 0.52
  #14  11.9    FALSE 0.56
  #15  12.1    FALSE 0.60
  #16  12.6    FALSE 0.64
  #17  16.9    FALSE 0.68
  #18  17.9    FALSE 0.72
  #19  21.6    FALSE 0.76
  #20  22.7    FALSE 0.80
  #21  34.5    FALSE 0.84
  #22  45.9    FALSE 0.88
  #23  53.6    FALSE 0.92
  #24  77.2    FALSE 0.96
  #25 106.3    FALSE 1.00

  #----------

  # Clean up
  #---------
  rm(p.list.EPA)