Plots for Sampling Design Based on Confidence Interval for Binomial Proportion or Difference Between Two Proportions

Create plots for a sampling design based on a confidence interval for a binomial proportion or the difference between two proportions.

plotCiBinomDesign(x.var = "n", y.var = "half.width", 
    range.x.var = NULL, n.or.n1 = 25, p.hat.or.p1.hat = 0.5, 
    n2 = n.or.n1, p2.hat = 0.4, ratio = 1, half.width = 0.05, 
    conf.level = 0.95, sample.type = "one.sample", ci.method = "score", 
    correct = TRUE, warn = TRUE, n.or.n1.min = 2, 
    n.or.n1.max = 10000, tol.half.width = 0.005, tol.p.hat = 0.005, 
    maxiter = 10000, plot.it = TRUE, add = FALSE, n.points = 100, 
    plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, 
    digits = .Options$digits, 
    main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)

Arguments

x.var: character string indicating what variable to use for the x-axis. Possible values are "n" (sample size; the default), "half.width" (the half-width of the confidence interval), "p.hat" (the estimated probability of “success”), and "conf.level" (the confidence level).
y.var: character string indicating what variable to use for the y-axis. Possible values are "half.width" (the half-width of the confidence interval; the default), and "n" (sample size).
range.x.var: numeric vector of length 2 indicating the range of the x-variable to use for the plot. The default value depends on the value of x.var.
When x.var="n", the default value is c(10, 50).
When x.var="half.width", the default value is c(0.03, 0.1).
When x.var="p.hat", the default value is c(0.5, 0.9).
When x.var="conf.level", the default value is c(0.8, 0.99).
n.or.n1: numeric scalar indicating the sample size. The default value is n.or.n1=25. When sample.type="one.sample", n.or.n1 denotes the number of observations in the single sample. When sample.type="two.sample", n.or.n1 denotes the number of observations from group 1. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. This argument is ignored if either x.var="n" or y.var="n".
p.hat.or.p1.hat: numeric scalar indicating an estimated proportion.
When sample.type="one.sample", p.hat.or.p1.hat denotes the estimated value of \(p\), the probability of “success”.
When sample.type="two.sample", p.hat.or.p1.hat denotes the estimated value of \(p_1\), the probability of “success” in group 1.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. This argument is ignored if x.var="p.hat".
n2: numeric scalar indicating the sample size for group 2. The default value is the value of n.or.n1. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. This argument is ignored when sample.type="one.sample".
p2.hat: numeric scalar indicating the estimated proportion for group 2. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. This argument is ignored if sample.type="one.sample".
ratio: numeric vector indicating the ratio of sample size in group 2 to sample size in group 1 (\(n_2/n_1\)). The default value is ratio=1. All values of ratio must be greater than or equal to 1. This argument is only used when sample.type="two.sample" and either x.var="n" or y.var="n".
half.width: positive numeric scalar indicating the half-width of the confidence interval. The default value is half.width=0.05. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. This argument is ignored if either x.var="half.width" or y.var="half.width".
conf.level: a numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals. The default value is conf.level=0.95. This argument is ignored when x.var="conf.level".
sample.type: character string indicating whether this is a one-sample or two-sample confidence interval. When sample.type="one.sample", the computations for the plot are based on a confidence interval for a single proportion. When sample.type="two.sample", the computations for the plot are based on a confidence interval for the difference between two proportions. The default value is sample.type="one.sample" unless the arguments n2, p2.hat, and/or ratio are supplied.
ci.method: character string indicating which method to use to construct the confidence interval. Possible values are "score" (the default), "exact", "adjusted Wald", and "Wald" (the "Wald" method is never recommended but is included for historical purposes). The exact method is only available for the one-sample case, i.e., when sample.type="one.sample".
correct: logical scalar indicating whether to use the continuity correction when ci.method="score" or ci.method="Wald". The default value is correct=TRUE. This argument is ignored if ci.method="exact" or ci.method="adjusted Wald".
warn: logical scalar indicating whether to issue a warning when ci.method="Wald" for cases when the normal approximation to the binomial distribution probably is not accurate. The default value is warn=TRUE.
n.or.n1.min: for the case when y.var="n", integer indicating the minimum allowed value for \(n\) (sample.type="one.sample") or \(n_1\) (sample.type="two.sample"). The default value is n.or.n1.min=2.
n.or.n1.max: for the case when y.var="n", integer indicating the maximum allowed value for \(n\) (sample.type="one.sample") or \(n_1\) (sample.type="two.sample"). The default value is n.or.n1.max=10000.
tol.half.width: for the case when y.var="n", numeric scalar indicating the tolerance to use for the half-width in the search algorithm. The sample sizes are computed so that the actual half-width is less than or equal to half.width + tol.half.width. The default value is tol.half.width=0.005.
tol.p.hat: for the case when y.var="n", numeric scalar indicating the tolerance to use for the estimated proportion(s) for the search algorithm. For the one-sample case, the sample sizes are computed so that the absolute value of the difference between the user supplied value of p.hat.or.p1.hat and the actual estimated proportion is less than or equal to tol.p.hat. For the two-sample case, the sample sizes are computed so that the absolute value of the difference between the user supplied value of p.hat.or.p1.hat and the actual estimated proportion for group 1 is less than or equal to tol.p.hat, and the absolute value of the difference between the user supplied value of p2.hat and the actual estimated proportion for group 2 is less than or equal to tol.p.hat. The default value is tol.p.hat=0.005.
maxiter: for the case when y.var="n", integer indicating the maximum number of iterations to use for the search algorithm. The default value is maxiter=10000.
plot.it: a logical scalar indicating whether to create a plot or add to the existing plot (see the description of the argument add below) on the current graphics device. If plot.it=FALSE, no plot is produced, but a list of (x,y) values is returned (see the Value section below). The default value is plot.it=TRUE.
add: a logical scalar indicating whether to add the design plot to the existing plot (add=TRUE), or to create a plot from scratch (add=FALSE). The default value is add=FALSE. This argument is ignored if plot.it=FALSE.
n.points: a numeric scalar specifying how many (x,y) pairs to use to produce the plot. There are n.points x-values evenly spaced between range.x.var[1] and range.x.var[2]. The default value is n.points=100. This argument is ignored when x.var="n", in which case the x-values are all the integers between range.x.var[1] and range.x.var[2].
plot.col: a numeric scalar or character string determining the color of the plotted line or points. The default value is plot.col=1. See the entry for col in the help file for par for more information.
plot.lwd: a numeric scalar determining the width of the plotted line. The default value is 3*par("cex"). See the entry for lwd in the help file for par for more information.
plot.lty: a numeric scalar determining the line type of the plotted line. The default value is plot.lty=1. See the entry for lty in the help file for par for more information.
digits: a scalar indicating how many significant digits to print out on the plot. The default value is the current setting of options("digits").
main, xlab, ylab, type, ...: additional graphical parameters (see par).
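
As a rough illustration of how the x-values described under range.x.var and n.points could be laid out, the base R sketch below builds the grid from the documented defaults. The names default.range and make.x.grid are hypothetical helpers introduced here, not part of EnvStats, and the function's internal code may differ.

```r
# Hypothetical helpers (not part of EnvStats) that mirror the documented
# defaults for range.x.var and the documented spacing of the x-values.
default.range <- list(
  n          = c(10, 50),
  half.width = c(0.03, 0.1),
  p.hat      = c(0.5, 0.9),
  conf.level = c(0.8, 0.99)
)

make.x.grid <- function(x.var, range.x.var = default.range[[x.var]],
                        n.points = 100) {
  if (x.var == "n") {
    # Sample sizes: every integer in the range; n.points is ignored
    seq(range.x.var[1], range.x.var[2], by = 1)
  } else {
    # Other variables: n.points evenly spaced values
    seq(range.x.var[1], range.x.var[2], length.out = n.points)
  }
}

length(make.x.grid("n"))           # 41 integers from 10 to 50
length(make.x.grid("half.width"))  # 100 evenly spaced values
```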

Details

See the help files for ciBinomHalfWidth and ciBinomN for information on how to compute a one-sample confidence interval for a single binomial proportion or a two-sample confidence interval for the difference between two proportions, how the half-width is computed when other quantities are fixed, and how the sample size is computed when other quantities are fixed.
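
To give a feel for the kind of relationship being plotted, the sketch below computes the half-width of the textbook Wald interval, \(z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\), in base R. This is only an illustration under simplifying assumptions: it omits the continuity correction, it is not the default score method, and wald.half.width is a hypothetical helper rather than an EnvStats function.

```r
# Hypothetical helper (not part of EnvStats): textbook Wald half-width for a
# one-sample binomial proportion, WITHOUT continuity correction.
wald.half.width <- function(n, p.hat = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)  # two-sided normal quantile
  z * sqrt(p.hat * (1 - p.hat) / n)
}

n  <- 10:50
hw <- wald.half.width(n)

# The half-width shrinks in proportion to 1/sqrt(n)
round(wald.half.width(25), 4)   # 0.196
```

Because this formula is a smooth function of n, the resulting curve is smooth; the plots produced by plotCiBinomDesign with other methods need not be.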

Value

plotCiBinomDesign invisibly returns a list with components:

x.var: x-coordinates of the points that have been or would have been plotted
y.var: y-coordinates of the points that have been or would have been plotted
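
A minimal sketch of retrieving the coordinates without drawing anything, assuming the EnvStats package is installed (the block does nothing otherwise). The component names x.var and y.var are taken from the list above.

```r
# Assumes EnvStats is installed; the body is skipped if it is not.
if (requireNamespace("EnvStats", quietly = TRUE)) {
  # Compute the design curve with the default settings, but do not plot it
  design <- EnvStats::plotCiBinomDesign(plot.it = FALSE)

  # Inspect the returned list
  str(design)

  # Collect the coordinates in a data frame for further use
  design.df <- data.frame(n = design$x.var, half.width = design$y.var)
  head(design.df)
}
```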

References

Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.

Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.

Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.

Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.

Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.

Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.

Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.

Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.

Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.

Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.

USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

The binomial distribution is used to model processes with binary (Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent of any other trial, and that the probability of “success”, \(p\), is the same on each trial. A binomial discrete random variable \(X\) is the number of “successes” in \(n\) independent trials. A special case of the binomial distribution occurs when \(n=1\), in which case \(X\) is also called a Bernoulli random variable.

In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27 recommends using the Wilcoxon rank sum test (wilcox.test) instead of comparing proportions.)

In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, confidence level, and half-width if one of the objectives of the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth, ciBinomN, and plotCiBinomDesign can be used to investigate these relationships for the case of binomial proportions.
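
For a quick back-of-the-envelope check of such relationships, the textbook Wald formula can be inverted to give an approximate sample size for a target half-width, \(n \approx z_{1-\alpha/2}^2\,\hat{p}(1-\hat{p})/h^2\). The sketch below is an assumption-laden shortcut in base R, not the search algorithm that ciBinomN uses, and wald.n is a hypothetical helper name.

```r
# Hypothetical helper (not part of EnvStats): approximate one-sample size
# needed for a given half-width, from the uncorrected Wald formula.
wald.n <- function(half.width, p.hat = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  # Round up so the achieved half-width does not exceed the target
  ceiling(z^2 * p.hat * (1 - p.hat) / half.width^2)
}

wald.n(0.05)                  # 385
wald.n(c(0.04, 0.05, 0.06))   # 601 385 267
```

Results from ciBinomN can differ from these values because it works with other interval methods and with tolerances on the half-width and estimated proportion.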

Examples

  # Look at the relationship between half-width and sample size 
  # for a one-sample confidence interval for a binomial proportion, 
  # assuming an estimated proportion of 0.5 and a confidence level of 
  # 95%.  The sawtooth appearance of the plot is the result of using the 
  # score method:

  dev.new()
  plotCiBinomDesign()

  #----------

  # Redo the example above, but use the traditional (and inaccurate)
  # Wald method.

  dev.new()
  plotCiBinomDesign(ci.method = "Wald")

  #--------------------------------------------------------------------

  # Plot sample size vs. the estimated proportion for various half-widths, 
  # using a 95% confidence level and the score method with continuity 
  # correction (the default):

  # NOTE:  This example takes several seconds to run so it has been 
  #        commented out.  Simply remove the pound signs (#) from in front 
  #        of the R commands to run it.

  #dev.new()
  #plotCiBinomDesign(x.var = "p.hat", y.var = "n", 
  #    half.width = 0.04, ylim = c(0, 600), main = "",
  #    xlab = expression(hat(p))) 
  #
  #plotCiBinomDesign(x.var = "p.hat", y.var = "n", 
  #    half.width = 0.05, add = TRUE, plot.col = 2) 
  #
  #plotCiBinomDesign(x.var = "p.hat", y.var = "n", 
  #    half.width = 0.06, add = TRUE, plot.col = 3) 
  #
  #legend(0.5, 150, paste("Half-Width =", c(0.04, 0.05, 0.06)), 
  #    lty = rep(1, 3), lwd = rep(2, 3), col=1:3, bty = "n") 
  #
  #mtext(expression(paste("Sample Size vs. ", hat(p), 
  #  " for Confidence Interval for p")), line = 2.5, cex = 1.25)
  #mtext("with Confidence=95%  and Various Values of Half-Width", 
  #  line = 1.5, cex = 1.25)
  #mtext(paste("CI Method = Score Normal Approximation", 
  #  "with Continuity Correction"), line = 0.5)

  #--------------------------------------------------------------------

  # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), 
  # look at the relationship between half-width and sample size 
  # for a 95% confidence interval for the difference between the 
  # proportion of detects at the background and compliance wells. 
  # Use the estimated proportion of detects from the original data. 
  # (The data are stored in EPA.89b.cadmium.df.)  
  # Assume equal sample sizes at each well.

  EPA.89b.cadmium.df
#>    Cadmium.orig Cadmium Censored  Well.type
#> 1           0.1   0.100    FALSE Background
#> 2          0.12   0.120    FALSE Background
#> 3           BDL   0.000     TRUE Background
#> 4          0.26   0.260    FALSE Background
#> 5           BDL   0.000     TRUE Background
#> 6           0.1   0.100    FALSE Background
#> 7           BDL   0.000     TRUE Background
#> 8         0.014   0.014    FALSE Background
#> 9           BDL   0.000     TRUE Background
#> 10          BDL   0.000     TRUE Background
#> 11          BDL   0.000     TRUE Background
#> 12          BDL   0.000     TRUE Background
#> 13          BDL   0.000     TRUE Background
#> 14         0.12   0.120    FALSE Background
#> 15          BDL   0.000     TRUE Background
#> 16         0.21   0.210    FALSE Background
#> 17          BDL   0.000     TRUE Background
#> 18         0.12   0.120    FALSE Background
#> 19          BDL   0.000     TRUE Background
#> 20          BDL   0.000     TRUE Background
#> 21          BDL   0.000     TRUE Background
#> 22          BDL   0.000     TRUE Background
#> 23          BDL   0.000     TRUE Background
#> 24          BDL   0.000     TRUE Background
#> 25         0.12   0.120    FALSE Compliance
#> 26         0.08   0.080    FALSE Compliance
#> 27          BDL   0.000     TRUE Compliance
#> 28          0.2   0.200    FALSE Compliance
#> 29          BDL   0.000     TRUE Compliance
#> 30          0.1   0.100    FALSE Compliance
#> 31          BDL   0.000     TRUE Compliance
#> 32        0.012   0.012    FALSE Compliance
#> 33          BDL   0.000     TRUE Compliance
#> 34          BDL   0.000     TRUE Compliance
#> 35          BDL   0.000     TRUE Compliance
#> 36          BDL   0.000     TRUE Compliance
#> 37          BDL   0.000     TRUE Compliance
#> 38         0.12   0.120    FALSE Compliance
#> 39         0.07   0.070    FALSE Compliance
#> 40          BDL   0.000     TRUE Compliance
#> 41         0.19   0.190    FALSE Compliance
#> 42          BDL   0.000     TRUE Compliance
#> 43          0.1   0.100    FALSE Compliance
#> 44          BDL   0.000     TRUE Compliance
#> 45         0.01   0.010    FALSE Compliance
#> 46          BDL   0.000     TRUE Compliance
#> 47          BDL   0.000     TRUE Compliance
#> 48          BDL   0.000     TRUE Compliance
#> 49          BDL   0.000     TRUE Compliance
#> 50          BDL   0.000     TRUE Compliance
#> 51         0.11   0.110    FALSE Compliance
#> 52         0.06   0.060    FALSE Compliance
#> 53          BDL   0.000     TRUE Compliance
#> 54         0.23   0.230    FALSE Compliance
#> 55          BDL   0.000     TRUE Compliance
#> 56         0.11   0.110    FALSE Compliance
#> 57          BDL   0.000     TRUE Compliance
#> 58        0.031   0.031    FALSE Compliance
#> 59          BDL   0.000     TRUE Compliance
#> 60          BDL   0.000     TRUE Compliance
#> 61          BDL   0.000     TRUE Compliance
#> 62          BDL   0.000     TRUE Compliance
#> 63          BDL   0.000     TRUE Compliance
#> 64         0.12   0.120    FALSE Compliance
#> 65         0.08   0.080    FALSE Compliance
#> 66          BDL   0.000     TRUE Compliance
#> 67         0.26   0.260    FALSE Compliance
#> 68          BDL   0.000     TRUE Compliance
#> 69         0.02   0.020    FALSE Compliance
#> 70          BDL   0.000     TRUE Compliance
#> 71        0.024   0.024    FALSE Compliance
#> 72          BDL   0.000     TRUE Compliance
#> 73          BDL   0.000     TRUE Compliance
#> 74          BDL   0.000     TRUE Compliance
#> 75          BDL   0.000     TRUE Compliance
#> 76          BDL   0.000     TRUE Compliance
#> 77          0.1   0.100    FALSE Compliance
#> 78         0.04   0.040    FALSE Compliance
#> 79          BDL   0.000     TRUE Compliance
#> 80          BDL   0.000     TRUE Compliance
#> 81          0.1   0.100    FALSE Compliance
#> 82          BDL   0.000     TRUE Compliance
#> 83         0.01   0.010    FALSE Compliance
#> 84          BDL   0.000     TRUE Compliance
#> 85          BDL   0.000     TRUE Compliance
#> 86          BDL   0.000     TRUE Compliance
#> 87          BDL   0.000     TRUE Compliance
#> 88          BDL   0.000     TRUE Compliance

  p.hat.back <- with(EPA.89b.cadmium.df, 
    mean(!Censored[Well.type=="Background"]))

  p.hat.back 
#> [1] 0.3333333

  p.hat.comp <- with(EPA.89b.cadmium.df,  
    mean(!Censored[Well.type=="Compliance"]))

  p.hat.comp 
#> [1] 0.375

  dev.new()
  plotCiBinomDesign(p.hat.or.p1.hat = p.hat.back, 
      p2.hat = p.hat.comp, digits=3) 

  #==========

  # Clean up
  #---------
  rm(p.hat.back, p.hat.comp)
  graphics.off()