Nonparametric Simultaneous Prediction Interval for a Continuous Distribution

Construct a nonparametric simultaneous prediction interval for the next $r$ sampling “occasions” based on one of three possible rules: k-of-m, California, or Modified California. The simultaneous prediction interval assumes the observations from from a continuous distribution.

predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", 
    lpl.rank = ifelse(pi.type == "upper", 0, 1), 
    n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), 
    lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)

Arguments

x: a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
n.median: positive odd integer specifying the sample size associated with the future medians. The default value is n.median=1 (i.e., individual observations). Note that all future medians must be based on the same sample size.
k: for the $k$-of-$m$ rule (rule="k.of.m"), a positive integer specifying the minimum number of observations (or medians) out of $m$ observations (or medians) (all obtained on one future sampling “occassion”) the prediction interval should contain. The default value is k=1. This argument is ignored when the argument rule is not equal to "k.of.m".
m: positive integer specifying the maximum number of future observations (or medians) on one future sampling “occasion”. The default value is m=2, except when rule="Modified.CA", in which case this argument is ignored and m is automatically set equal to 4.
r: positive integer specifying the number of future sampling “occasions”. The default value is r=1.
rule: character string specifying which rule to use. The possible values are "k.of.m" ($k$-of-$m$ rule; the default), "CA" (California rule), and "Modified.CA" (modified California rule). See the DETAILS section below for more information.
lpl.rank: positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. When pi.type="lower", the default value is lpl.rank=1 (implying the minimum value of x is used as the lower bound of the prediction interval). When pi.type="upper", the argument lpl.rank is set equal to 0 and the value of lb is used as the lower bound of the tolerance interval.
n.plus.one.minus.upl.rank: positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default) means use the first largest value, and in general a value of
n.plus.one.minus.upl.rank=$i$ means use the $i$'th largest value. When
pi.type="lower", the argument n.plus.one.minus.upl.rank is set equal to 0 and the value of ub is used as the upper bound of the prediction interval.
lb, ub: scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly, if you know the distribution has an upper bound other than Inf, set ub to this value. The argument lb is ignored if pi.type="two-sided" or pi.type="lower". The argument ub is ignored if pi.type="two-sided" or pi.type="upper".
pi.type: character string indicating what kind of prediction interval to compute. The possible values are "upper" (the default) and "lower".
integrate.args.list: a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL which means that the default values of integrate are used.

Details

What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line constructed so that it will contain at least $k$ of $m$ future observations from that population with some specified probability $(1-\alpha)100\%$, where $0 < \alpha < 1$ and $k$ and $m$ are some pre-specified positive integers and $k \le m$. The quantity $(1-\alpha)100\%$ is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNpar computes a standard nonparametric prediction interval.

The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval that will contain a certain number of future observations with probability $(1-\alpha)100\%$ for each of $r$ future sampling “occasions”, where $r$ is some pre-specified positive integer. The quantity $r$ may refer to $r$ distinct future sampling occasions in time, or it may for example refer to sampling at $r$ distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the $r$ distinct locations.

The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval based on one of three possible rules:

For the $k$-of-$m$ rule (rule="k.of.m"), at least $k$ of the next $m$ future observations will fall in the prediction interval with probability $(1-\alpha)100\%$ on each of the $r$ future sampling occasions. If obserations are being taken sequentially, for a particular sampling occasion, up to $m$ observations may be taken, but once $k$ of the observations fall within the prediction interval, sampling can stop. Note: For this rule, when $r=1$, the results of predIntNparSimultaneous are equivalent to the results of predIntNpar.
For the California rule (rule="CA"), with probability $(1-\alpha)100\%$, for each of the $r$ future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next $m-1$ observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, $m-1$ more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability $(1-\alpha)100\%$, for each of the $r$ future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.

Nonparametric simultaneous prediction intervals can be extended to using medians in place of single observations (USEPA, 2009, Chapter 19). That is, you can create a nonparametric simultaneous prediction interval that will contain a specified number of medians (based on which rule you choose) on each of $r$ future sampling occassions, where each each median is based on $b$ individual observations. For the function predIntNparSimultaneous, the argument n.median corresponds to $b$.

The Form of a Nonparametric Prediction Interval
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ independent observations from some continuous distribution, and let $x_{(i)}$ denote the the $i$'th order statistics in $\underline{x}$. A two-sided nonparametric prediction interval is constructed as: $$[x_{(u)}, x_{(v)}] \;\;\;\;\;\; (1)$$ where $u$ and $v$ are positive integers between 1 and $n$, and $u < v$. That is, $u$ denotes the rank of the lower prediction limit, and $v$ denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as: $$[x_{(u)}, x_{(n + 1 - w)}] \;\;\;\;\;\; (2)$$ where $$w = n + 1 - v \;\;\;\;\;\; (3)$$ so that $w$ is a positive integer between 1 and $n-1$, and $u < n+1-w$. In terms of the arguments to the function predIntNparSimultaneous, the argument lpl.rank corresponds to $u$, and the argument n.plus.one.minus.upl.rank corresponds to $w$.

If we allow $u=0$ and $w=0$ and define lower and upper bounds as: $$x_{(0)} = lb \;\;\;\;\;\; (4)$$ $$x_{(n+1)} = ub \;\;\;\;\;\; (5)$$ then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as: $$[x_{(u)}, x_{(n + 1)}] = [x_{(u)}, ub] \;\;\;\;\;\; (6)$$ and a one-sided upper nonparametric prediction interval is constructed as: $$[x_{(0)}, x_{(n + 1 - w)}] = [lb, x_{(n + 1 - w)}] \;\;\;\;\;\; (7)$$ Usually, $lb = -\infty$ or $lb = 0$ and $ub = \infty$.

Note: For nonparametric simultaneous prediction intervals, only lower (pi.type="lower") and upper (pi.type="upper") prediction intervals are available.

Constructing Nonparametric Simultaneous Prediction Intervals for Future Observations
First we will show how to construct a nonparametric simultaneous prediction interval based on future observations (i.e., $b=1$, n.median=1), and then extend the formulas to future medians.

Simultaneous Prediction Intervals for the k-of-m Rule (rule="k.of.m")
For the $k$-of-$m$ rule (rule="k.of.m") with $w=1$ (i.e., n.median=1), at least $k$ of the next $m$ future observations will fall in the prediction interval with probability $(1-\alpha)100\%$ on each of the $r$ future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to $m$ observations may be taken, but once $k$ of the observations fall within the prediction interval, sampling can stop. Note: When $r=1$, this kind of simultaneous prediction interval becomes the same as a standard nonparametric prediction interval (see predIntNpar).

Chou and Owen (1986) developed the theory for nonparametric simultaneous prediction limits for various rules, including the 1-of-$m$ rule. Their theory, however, does not cover the California or Modified California rules, and uses an $r$-fold summation involving a minimum of $2^r$ terms. Davis and McNichols (1994b; 1999) extended the results of Chou and Owen (1986) to include the California and Modified California rule, and developed algorithms that involve summing far fewer terms.

Davis and McNichols (1999) give formulas for the probabilities associated with the one-sided upper simultaneous prediction interval shown in Equation (7). For the $k$-of-$m$ rule, the probability that at least $k$ of the next $m$ future observations will be contained in the interval given in Equation (7) for each of $r$ future sampling occasions is given by:

$1 - \alpha$	$=$	$E[\sum_{i=0}^{m-k} {{k-1+i} \choose {k-1}} Y^k (1-Y)^i]^r$
	$=$	$\int_0^1 [\sum_{i=0}^{m-k} {{k-1+i} \choose {k-1}} y^k (1-y)^i]^r f(y) dy \;\;\;\;\;\; (8)$

where $Y$ denotes a random variable with a beta distribution with parameters $v$ and $n+1-v$, and $f()$ denotes the pdf of this distribution. Note that $v$ denotes the rank of the order statistic used as the upper prediction limit (i.e., n.plus.one.minus.upl.rank=$n+1-v$), and that $v$ is usually equal to $n$.

Also note that the summation term in Equation (8) corresponds to the cumulative distribution function of a Negative Binomial distribution with parameters size=$k$ and prob=$y$ evaluated at q=$m-k$.

When pi.type="lower", $Y$ denotes a random variable with a beta distribution with parameters $n+1-u$ and $u$. Note that $u$ denotes the rank of the order statistic used as the lower prediction limit (i.e., lpl.rank=$u$), and that $u$ is usually equal to $1$.

Simultaneous Prediction Intervals for the California Rule (rule="CA")
For the California rule (rule="CA"), with probability $(1-\alpha)100\%$, for each of the $r$ future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next $m-1$ observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, $m-1$ more observations must be taken.

In this case, the probability is given by:

$1 - \alpha$	$=$	$E[\sum_{i=0}^r {r \choose i} Y^{r - i + (m-1)i} (1-Y)^i]$
	$=$	$\int_0^1 [\sum_{i=0}^r {r \choose i} y^{r - i + (m-1)i} (1-y)^i] f(y) dy \;\;\;\;\;\; (9)$

Simultaneous Prediction Intervals for the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability $(1-\alpha)100\%$, for each of the $r$ future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.

In this case, the probability is given by:

$1 - \alpha$	$=$	$E[Y^r (1 + Q + Q^2 - 2Q^3)^r]$
	$=$	$\int_0^1 [y^r (1 + q + q^2 - 2q^3)^r] f(y) dy \;\;\;\;\;\; (10)$

where $Q = 1 - Y$ and $q = 1 - y$.

Davis and McNichols (1999) provide algorithms for computing the probabilities based on expanding polynomials and the formula for the expected value of a beta random variable. In the discussion section of Davis and McNichols (1999), however, Vangel points out that numerical integration is adequate, and this is how these probabilities are computed in the function predIntNparSimultaneous.

Constructing Nonparametric Simultaneous Prediction Intervals for Future Medians
USEPA (2009, Chapter 19; Cameron, 2011) extends nonparametric simultaneous prediction intervals to testing future medians for the case of the 1-of-1 and 1-of-2 plans for medians of order 3. In general, each of the rules ($k$-of-$m$, California, and Modified California) can be easily extended to the case of using medians as long as the medians are based on an odd (as opposed to even) sample size.

For each of the above rules, if we are interested in using medians instead of single observations (i.e., $b \ge 1$; n.median$\ge 1$), and we force $b$ to be odd, then a median will be less than a prediction limit once $(b+1)/2$ observations are less than the prediction limit. Thus, Equations (8) - (10) are modified by replacing $y$ with the term: $$\sum_{i=0}^{b - b'} {{b' - 1 + i} \choose {b' - 1}} y^{b'} (1 - y)^i \;\;\;\;\;\; (11)$$ where $$b' = \frac{b+1}{2} \;\;\;\;\;\; (12)$$

Value

a list of class "estimate" containing the simultaneous prediction interval and other information. See the help file for estimate.object for details.

References

Cameron, Kirk. (2011). Personal communication, February 16, 2011. MacStat Consulting, Ltd., Colorado Springs, Colorado.

Chew, V. (1968). Simultaneous Prediction Intervals. Technometrics 10(2), 323–331.

Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.

Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.

Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.

Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.

Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.

Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discusson). Technometrics 41(2), 89–112.

Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.

Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.

Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.

Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.

Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.

Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).

Examples

  # Generate 20 observations from a lognormal mixture distribution with 
  # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1.  Use 
  # predIntNparSimultaneous to construct an upper one-sided prediction interval 
  # using the maximum observed value using the 1-of-3 rule.  
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 
  dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, 
    mean2 = 5, cv2 = 1, p.mix = 0.1) 

  predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0) 
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 20
#> 
#> $data.name
#> [1] "dat"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Prediction"
#> 
#> $rule
#> [1] "k.of.m"
#> 
#> $limit.ranks
#> [1] 20
#> 
#> $limits
#>      LPL      UPL 
#> 0.000000 1.817311 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "exact "
#> 
#> $conf.level
#> [1] 0.9994353
#> 
#> $sample.size
#> [1] 20
#> 
#> $k
#> [1] 1
#> 
#> $m
#> [1] 3
#> 
#> $r
#> [1] 1
#> 
#> $n.median
#> [1] 1
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Data:                            dat
  #
  #Sample Size:                     20
  #
  #Prediction Interval Method:      exact 
  #
  #Prediction Interval Type:        upper
  #
  #Confidence Level:                99.94353%
  #
  #Prediction Limit Rank(s):        20 
  #
  #Minimum Number of
  #Future Observations
  #Interval Should Contain:         1
  #
  #Total Number of
  #Future Observations:             3
  #
  #Prediction Interval:             LPL = 0.000000
  #                                 UPL = 1.817311

  #----------

  # Compare the confidence levels for the 1-of-3 rule, California Rule, and 
  # Modified California Rule.

  predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0)$interval$conf.level
#> [1] 0.9994353
  #[1] 0.9994353

  predIntNparSimultaneous(dat, m = 3, rule = "CA", lb = 0)$interval$conf.level
#> [1] 0.9919066
  #[1] 0.9919066

  predIntNparSimultaneous(dat, rule = "Modified.CA", lb = 0)$interval$conf.level
#> [1] 0.9984943
  #[1] 0.9984943

  #=========

  # Repeat the above example, but create the baseline data using just 
  # n=8 observations and set r to 4 future sampling occasions

  set.seed(598) 
  dat <- rlnormMixAlt(n = 8, mean1 = 1, cv1 = 0.5, 
    mean2 = 5, cv2 = 1, p.mix = 0.1) 

  predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0) 
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 8
#> 
#> $data.name
#> [1] "dat"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Prediction"
#> 
#> $rule
#> [1] "k.of.m"
#> 
#> $limit.ranks
#> [1] 8
#> 
#> $limits
#>      LPL      UPL 
#> 0.000000 5.683453 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "exact "
#> 
#> $conf.level
#> [1] 0.977599
#> 
#> $sample.size
#> [1] 8
#> 
#> $k
#> [1] 1
#> 
#> $m
#> [1] 3
#> 
#> $r
#> [1] 4
#> 
#> $n.median
#> [1] 1
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Data:                            dat
  #
  #Sample Size:                     8
  #
  #Prediction Interval Method:      exact 
  #
  #Prediction Interval Type:        upper
  #
  #Confidence Level:                97.7599%
  #
  #Prediction Limit Rank(s):        8 
  #
  #Minimum Number of
  #Future Observations
  #Interval Should Contain
  #(per Sampling Occasion):         1
  #
  #Total Number of
  #Future Observations
  #(per Sampling Occasion):         3
  #
  #Number of Future
  #Sampling Occasions:              4
  #
  #Prediction Interval:             LPL = 0.000000
  #                                 UPL = 5.683453

  #----------

  # Compare the confidence levels for the 1-of-3 rule, California Rule, and 
  # Modified California Rule.

  predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0)$interval$conf.level
#> [1] 0.977599
  #[1] 0.977599

  predIntNparSimultaneous(dat, m = 3, r = 4, rule = "CA", lb = 0)$interval$conf.level
#> [1] 0.8737798
  #[1] 0.8737798

  predIntNparSimultaneous(dat, r = 4, rule = "Modified.CA", lb = 0)$interval$conf.level
#> [1] 0.9510178
  #[1] 0.9510178

  #==========

  # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper 
  # simultaneous prediction limits for various rules based on trace mercury data (ppb) 
  # collected in the past year from a site with four background wells and 10 compliance 
  # wells (data for two of the compliance wells  are shown in the guidance document).  
  # The facility must monitor the 10 compliance wells for five constituents 
  # (including mercury) annually.
  
  # Here we will compute the confidence level associated with two different sampling plans: 
  # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and 
  # 2) the 1-of-4 plan on individual observations using the 3rd highest background value.
  # The data for this example are stored in EPA.09.Ex.19.5.mercury.df.

  # We will pool data from 4 background wells that were sampled on 
  # a number of different occasions, giving us a sample size of 
  # n = 20 to use to construct the prediction limit.

  # There are 10 compliance wells and we will monitor 5 different 
  # constituents at each well annually.  For this example, USEPA (2009) 
  # recommends setting r to the product of the number of compliance wells and 
  # the number of evaluations per year.  

  # To determine the minimum confidence level we require for 
  # the simultaneous prediction interval, USEPA (2009) recommends 
  # setting the maximum allowed individual Type I Error level per constituent to:
 
  # 1 - (1 - SWFPR)^(1 / Number of Constituents)
  
  # which translates to setting the confidence limit to 

  # (1 - SWFPR)^(1 / Number of Constituents)

  # where SWFPR = site-wide false positive rate.  For this example, we 
  # will set SWFPR = 0.1.  Thus, the required individual Type I Error level 
  # and confidence level per constituent are given as follows:

  # n  = 20 based on 4 Background Wells
  # nw = 10 Compliance Wells
  # nc =  5 Constituents
  # ne =  1 Evaluation per year

  n  <- 20
  nw <- 10
  nc <-  5
  ne <-  1

  # Set number of future sampling occasions r to 
  # Number Compliance Wells x Number Evaluations per Year
  r  <-  nw * ne

  conf.level <- (1 - 0.1)^(1 / nc)
  conf.level
#> [1] 0.9791484
  #[1] 0.9791484

  alpha <- 1 - conf.level
  alpha
#> [1] 0.02085164
  #[1] 0.02085164

  #----------

  # Look at the data:

  head(EPA.09.Ex.19.5.mercury.df)
#>   Event Well  Well.type Mercury.ppb.orig Mercury.ppb Censored
#> 1     1 BG-1 Background             0.21        0.21    FALSE
#> 2     2 BG-1 Background              <.2        0.20     TRUE
#> 3     3 BG-1 Background              <.2        0.20     TRUE
#> 4     4 BG-1 Background              <.2        0.20     TRUE
#> 5     5 BG-1 Background              <.2        0.20     TRUE
#> 6     6 BG-1 Background                           NA    FALSE
  #  Event Well  Well.type Mercury.ppb.orig Mercury.ppb Censored
  #1     1 BG-1 Background             0.21        0.21    FALSE
  #2     2 BG-1 Background              <.2        0.20     TRUE
  #3     3 BG-1 Background              <.2        0.20     TRUE
  #4     4 BG-1 Background              <.2        0.20     TRUE
  #5     5 BG-1 Background              <.2        0.20     TRUE
  #6     6 BG-1 Background                           NA    FALSE

  longToWide(EPA.09.Ex.19.5.mercury.df, "Mercury.ppb.orig", 
    "Event", "Well", paste.row.name = TRUE)
#>         BG-1 BG-2 BG-3 BG-4 CW-1 CW-2
#> Event.1 0.21  <.2  <.2  <.2 0.22 0.36
#> Event.2  <.2  <.2 0.23 0.25  0.2 0.41
#> Event.3  <.2  <.2  <.2 0.28  <.2 0.28
#> Event.4  <.2 0.21 0.23  <.2 0.25 0.45
#> Event.5  <.2  <.2 0.24  <.2 0.24 0.43
#> Event.6                      <.2 0.54
  #        BG-1 BG-2 BG-3 BG-4 CW-1 CW-2
  #Event.1 0.21  <.2  <.2  <.2 0.22 0.36
  #Event.2  <.2  <.2 0.23 0.25  0.2 0.41
  #Event.3  <.2  <.2  <.2 0.28  <.2 0.28
  #Event.4  <.2 0.21 0.23  <.2 0.25 0.45
  #Event.5  <.2  <.2 0.24  <.2 0.24 0.43
  #Event.6                      <.2 0.54


  # Construct the upper simultaneous prediction limit using the 1-of-2  
  # retesting plan for a median of order 3 based on the background maximum 

  Hg.Back <- with(EPA.09.Ex.19.5.mercury.df, 
    Mercury.ppb[Well.type == "Background"])

  pred.int.1.of.2.med.3 <- predIntNparSimultaneous(Hg.Back, n.median = 3, 
    k = 1, m = 2, r = r, lb = 0)
#> Warning: There were 4 nonfinite values in x : 4 NA's
#> Warning: 4 observations with NA/NaN/Inf in 'x' removed.

  pred.int.1.of.2.med.3
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 20
#> 
#> $data.name
#> [1] "Hg.Back"
#> 
#> $bad.obs
#> [1] 4
#> 
#> $interval
#> $name
#> [1] "Prediction"
#> 
#> $rule
#> [1] "k.of.m"
#> 
#> $limit.ranks
#> [1] 20
#> 
#> $limits
#>  LPL  UPL 
#> 0.00 0.28 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "exact "
#> 
#> $conf.level
#> [1] 0.9940354
#> 
#> $sample.size
#> [1] 20
#> 
#> $k
#> [1] 1
#> 
#> $m
#> [1] 2
#> 
#> $r
#> [1] 10
#> 
#> $n.median
#> [1] 3
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Data:                            Hg.Back
  #
  #Sample Size:                     20
  #
  #Number NA/NaN/Inf's:             4
  #
  #Prediction Interval Method:      exact 
  #
  #Prediction Interval Type:        upper
  #
  #Confidence Level:                99.40354%
  #
  #Prediction Limit Rank(s):        20 
  #
  #Minimum Number of
  #Future Medians
  #Interval Should Contain
  #(per Sampling Occasion):         1
  #
  #Total Number of
  #Future Medians
  #(per Sampling Occasion):         2
  #
  #Number of Future
  #Sampling Occasions:              10
  #
  #Sample Size for Medians:         3
  #
  #Prediction Interval:             LPL = 0.00
  #                                 UPL = 0.28

  # Note that the achieved confidence level of 99.4% is greater than the 
  # required confidence level of 97.9%.

  # Now determine whether either compliance well indicates evidence of 
  # Mercury contamination.

  # Compliance Well 1
  #------------------
  Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"])

  Hg.CW.1
#> [1] "0.22" "0.2"  "<.2"  "0.25" "0.24" "<.2" 
  #[1] "0.22" "0.2"  "<.2"  "0.25" "0.24" "<.2"

  # The median of the first 3 observations is 0.2, which is less than 
  # the UPL of 0.28, so there is no evidence of contamination.

  # Compliance Well 2
  #------------------
  Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"])

  Hg.CW.2
#> [1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54"
  #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54"

  # The median of the first 3 observations is 0.36, so 3 more observations have to 
  # be looked at.  The median of the second 3 observations is 0.45, which is 
  # larger than the UPL of 0.28, so there is evidence of contamination.  

  #----------

  # Now create the upper simultaneous prediction limit using the 1-of-4 plan 
  # on individual observations using the 3rd highest background value.

  pred.int.1.of.4.3rd <- predIntNparSimultaneous(Hg.Back, k = 1, m = 4, 
    r = r, lb = 0, n.plus.one.minus.upl.rank = 3)
#> Warning: There were 4 nonfinite values in x : 4 NA's
#> Warning: 4 observations with NA/NaN/Inf in 'x' removed.

  pred.int.1.of.4.3rd  
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 20
#> 
#> $data.name
#> [1] "Hg.Back"
#> 
#> $bad.obs
#> [1] 4
#> 
#> $interval
#> $name
#> [1] "Prediction"
#> 
#> $rule
#> [1] "k.of.m"
#> 
#> $limit.ranks
#> [1] 18
#> 
#> $limits
#>  LPL  UPL 
#> 0.00 0.24 
#> 
#> $type
#> [1] "upper"
#> 
#> $method
#> [1] "exact "
#> 
#> $conf.level
#> [1] 0.9864909
#> 
#> $sample.size
#> [1] 20
#> 
#> $k
#> [1] 1
#> 
#> $m
#> [1] 4
#> 
#> $r
#> [1] 10
#> 
#> $n.median
#> [1] 1
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Data:                            Hg.Back
  #
  #Sample Size:                     20
  #
  #Number NA/NaN/Inf's:             4
  #
  #Prediction Interval Method:      exact 
  #
  #Prediction Interval Type:        upper
  #
  #Confidence Level:                98.64909%
  #
  #Prediction Limit Rank(s):        18 
  #
  #Minimum Number of
  #Future Observations
  #Interval Should Contain
  #(per Sampling Occasion):         1
  #
  #Total Number of
  #Future Observations
  #(per Sampling Occasion):         4
  #
  #Number of Future
  #Sampling Occasions:              10
  #
  #Prediction Interval:             LPL = 0.00
  #                                 UPL = 0.24

  # Note that the achieved confidence level of 98.6% is greater than the 
  # required confidence level of 97.9%.


  # Now determine whether either compliance well indicates evidence of 
  # Mercury contamination.

  # Compliance Well 1
  #------------------
  Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"])

  Hg.CW.1
#> [1] "0.22" "0.2"  "<.2"  "0.25" "0.24" "<.2" 
  #[1] "0.22" "0.2"  "<.2"  "0.25" "0.24" "<.2"

  # The first observation is less than the UPL of 0.24, which is less than 
  # the UPL of 0.28, so there is no evidence of contamination.


  # Compliance Well 2
  #------------------
  Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"])

  Hg.CW.2
#> [1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54"
  #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54"

  # All of the first 4 observations are greater than the UPL of 0.24, so there 
  # is evidence of contamination.  

   #==========

  # Cleanup
  #--------
  rm(dat, n, nw, nc, ne, r, conf.level, alpha, Hg.Back, pred.int.1.of.2.med.3, 
    pred.int.1.of.4.3rd, Hg.CW.1, Hg.CW.2)

\(1 - \alpha\)	\(=\)	\(E[\sum_{i=0}^{m-k} {{k-1+i} \choose {k-1}} Y^k (1-Y)^i]^r\)
	\(=\)	\(\int_0^1 [\sum_{i=0}^{m-k} {{k-1+i} \choose {k-1}} y^k (1-y)^i]^r f(y) dy \;\;\;\;\;\; (8)\)

\(1 - \alpha\)	\(=\)	\(E[\sum_{i=0}^r {r \choose i} Y^{r - i + (m-1)i} (1-Y)^i]\)
	\(=\)	\(\int_0^1 [\sum_{i=0}^r {r \choose i} y^{r - i + (m-1)i} (1-y)^i] f(y) dy \;\;\;\;\;\; (9)\)

\(1 - \alpha\)	\(=\)	\(E[Y^r (1 + Q + Q^2 - 2Q^3)^r]\)
	\(=\)	\(\int_0^1 [y^r (1 + q + q^2 - 2q^3)^r] f(y) dy \;\;\;\;\;\; (10)\)