Table of Confidence Intervals for Mean or Difference Between Two Means

Create a table of confidence intervals for the mean of a normal distribution or the difference between two means following Bacchetti (2010), by varying the estimated standard deviation and the estimated mean or difference between the two estimated means given the sample size(s).

ciTableMean(n1 = 10, n2 = n1, diff.or.mean = 2:0, SD = 1:3, 
    sample.type = "two.sample", ci.type = "two.sided", conf.level = 0.95, 
    digits = 1)

Arguments

n1: positive integer greater than 1 specifying the sample size when sample.type="one.sample" or the sample size for group 1 when sample.type="two.sample". The default value is n1=10.
n2: positive integer greater than 1 specifying the sample size for group 2 when sample.type="two.sample". The default value is n2=n1, i.e., equal sample sizes. This argument is ignored when sample.type="one.sample".
diff.or.mean: numeric vector indicating either the assumed difference between the two sample means when sample.type="two.sample" or the value of the sample mean when sample.type="one.sample". The default value is diff.or.mean=2:0. Missing (NA), undefined (NaN), and infinite (-Inf, Inf) values are not allowed.
SD: numeric vector of positive values specifying the assumed estimated standard deviation. The default value is SD=1:3. Missing (NA), undefined (NaN), and infinite (-Inf, Inf) values are not allowed.
sample.type: character string specifying whether to create confidence intervals for the difference between two means (sample.type="two.sample"; the default) or confidence intervals for a single mean (sample.type="one.sample").
ci.type: character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper".
conf.level: a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95.
digits: positive integer indicating how many decimal places to display in the table. The default value is digits=1.

Details

Following Bacchetti (2010) (see NOTE below), the function ciTableMean allows you to perform sensitivity analyses while planning future studies by producing a table of confidence intervals for the mean or the difference between two means by varying the estimated standard deviation and the estimated mean or difference between the two estimated means given the sample size(s).

One Sample Case (sample.type="one.sample")
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a normal (Gaussian) distribution with parameters mean=$\mu$ and sd=$\sigma$.

The usual confidence interval for $\mu$ is constructed as follows. If ci.type="two-sided", the $(1-\alpha)$100% confidence interval for $\mu$ is given by: $$[\hat{\mu} - t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}, \, \hat{\mu} + t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\;\;\; (1)$$ where $$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (2)$$ $$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (3)$$ and $t(\nu, p)$ is the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the $(1-\alpha)$100% confidence interval for $\mu$ is given by: $$[\hat{\mu} - t(n-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n}}, \, \infty] \;\;\;\; (4)$$ and if ci.type="upper", the confidence interval is given by: $$[-\infty, \, \hat{\mu} + t(n-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\; (5)$$

For the one-sample case, the argument n1 corresponds to $n$ in Equation (1), the argument diff.or.mean corresponds to $\hat{\mu} = \bar{x}$ in Equation (2), and the argument SD corresponds to $\hat{\sigma} = s$ in Equation (3).
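
As a minimal sketch of Equations (1), (4), and (5) in base R, the hypothetical helper one_sample_ci below computes the interval from assumed values of $n$, $\hat{\mu}$, and $\hat{\sigma}$, using qt for the t quantile. The helper name and its argument names are illustrative assumptions; this is not the EnvStats implementation.

```r
# Hypothetical helper for illustration only; not part of EnvStats.
one_sample_ci <- function(n, mean.hat, sd.hat,
                          ci.type = c("two-sided", "lower", "upper"),
                          conf.level = 0.95) {
  ci.type <- match.arg(ci.type)
  alpha <- 1 - conf.level
  se <- sd.hat / sqrt(n)  # estimated standard error of the mean
  switch(ci.type,
    # Equation (1): split alpha between the two tails
    "two-sided" = mean.hat + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se,
    # Equation (4): all of alpha in the lower tail
    "lower" = c(mean.hat - qt(1 - alpha, df = n - 1) * se, Inf),
    # Equation (5): all of alpha in the upper tail
    "upper" = c(-Inf, mean.hat + qt(1 - alpha, df = n - 1) * se)
  )
}

one_sample_ci(n = 15, mean.hat = 5, sd.hat = 1)
```

With n = 15, a sample mean of 5, and a standard deviation of 1, the two-sided limits round to 4.4 and 5.6, which agrees with the first cell of the one-sample table in the Examples section.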

Two Sample Case (sample.type="two.sample")
Let $\underline{x}_1 = (x_{11}, x_{21}, \ldots, x_{n_11})$ be a vector of $n_1$ observations from a normal (Gaussian) distribution with parameters mean=$\mu_1$ and sd=$\sigma$, and let $\underline{x}_2 = (x_{12}, x_{22}, \ldots, x_{n_22})$ be a vector of $n_2$ observations from a normal (Gaussian) distribution with parameters mean=$\mu_2$ and sd=$\sigma$.

The usual confidence interval for the difference between the two population means $\mu_1 - \mu_2$ is constructed as follows. If ci.type="two-sided", the $(1-\alpha)$100% confidence interval for $\mu_1 - \mu_2$ is given by: $$[(\hat{\mu}_1 - \hat{\mu}_2) - t(n_1 + n_2 -2, 1-\alpha/2) \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1 + n_2 -2, 1-\alpha/2) \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}] \;\;\;\;\;\; (6)$$ where $$\hat{\mu}_1 = \bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{i1} \;\;\;\;\;\; (7)$$ $$\hat{\mu}_2 = \bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{i2} \;\;\;\;\;\; (8)$$ $$\hat{\sigma}^2 = s_p^2 = \frac{(n_1-1) s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2} \;\;\;\;\;\; (9)$$ $$s_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (x_{i1} - \bar{x}_1)^2 \;\;\;\;\;\; (10)$$ $$s_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (x_{i2} - \bar{x}_2)^2 \;\;\;\;\;\; (11)$$ and $t(\nu, p)$ is the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the $(1-\alpha)$100% confidence interval for $\mu_1 - \mu_2$ is given by: $$[(\hat{\mu}_1 - \hat{\mu}_2) - t(n_1 + n_2 -2, 1-\alpha) \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \; \infty] \;\;\;\;\;\; (12)$$ and if ci.type="upper", the confidence interval is given by: $$[-\infty, \; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1 + n_2 -2, 1-\alpha) \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}] \;\;\;\;\;\; (13)$$

For the two-sample case, the arguments n1 and n2 correspond to $n_1$ and $n_2$ in Equation (6), the argument diff.or.mean corresponds to $\hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2$ in Equations (7) and (8), and the argument SD corresponds to $\hat{\sigma} = s_p$ in Equation (9).
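
Equation (6) can be checked by hand in base R. The sketch below uses a hypothetical helper, two_sample_ci (its name and arguments are assumptions made for illustration and are not part of EnvStats), and applies it to the sulfate summary values reported in the Examples section.

```r
# Hypothetical helper for illustration only; not part of EnvStats.
two_sample_ci <- function(n1, n2, diff.hat, sd.pooled, conf.level = 0.95) {
  alpha <- 1 - conf.level
  df <- n1 + n2 - 2  # degrees of freedom for the pooled estimate
  half.width <- qt(1 - alpha / 2, df = df) * sd.pooled * sqrt(1 / n1 + 1 / n2)
  c(LCL = diff.hat - half.width, UCL = diff.hat + half.width)
}

# Sulfate values reported in the Examples: 6 downgradient and 8 background
# samples, means 608.3333 and 536.2500, pooled SD 23.57759.
two_sample_ci(n1 = 6, n2 = 8, diff.hat = 608.3333 - 536.2500,
              sd.pooled = 23.57759)
```

Up to rounding of the inputs, the limits should match the equal-variance t.test interval shown in the Examples (about 44.34 to 99.83 ppm).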

Value

A data frame in which the rows vary the standard deviation and the columns vary the estimated mean or the difference between the means. The elements of the data frame are character strings giving the confidence intervals.
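
To make the shape of the returned object concrete, here is a rough base-R sketch that builds a similar table of character strings for the two-sample, two-sided case. The function ci_table_sketch and its argument names are hypothetical, and the cell formatting (for example, padding) is not guaranteed to match what ciTableMean produces.

```r
# Hypothetical sketch for illustration only; not the EnvStats implementation.
ci_table_sketch <- function(n1, n2, diffs, sds, conf.level = 0.95, digits = 1) {
  t.crit <- qt(1 - (1 - conf.level) / 2, df = n1 + n2 - 2)
  half <- t.crit * sds * sqrt(1 / n1 + 1 / n2)  # one half-width per SD

  # Rows index the standard deviations, columns index the differences
  lo <- outer(half, diffs, function(h, d) d - h)
  hi <- outer(half, diffs, function(h, d) d + h)

  cells <- paste0("[", formatC(lo, format = "f", digits = digits), ", ",
                  formatC(hi, format = "f", digits = digits), "]")
  cells <- matrix(cells, nrow = length(sds),
                  dimnames = list(paste0("SD=", sds), paste0("Diff=", diffs)))
  as.data.frame(cells, stringsAsFactors = FALSE)
}

ci_table_sketch(n1 = 10, n2 = 10, diffs = 2:0, sds = 1:3)
```

Because every cell is a character string, the result is meant for display; numeric limits would have to be recomputed rather than parsed back out of the table.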

References

Bacchetti, P. (2010). Current sample size conventions: Flaws, Harms, and Alternatives. BMC Medicine 8, 17–23.

Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.

Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.

Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.

Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.

Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

Bacchetti (2010) presents strong arguments against the current convention in scientific research for computing sample size that is based on formulas that use a fixed Type I error (usually 5%) and a fixed minimal power (often 80%) without regard to costs. He notes that a key input to these formulas is a measure of variability (usually a standard deviation) that is difficult to measure accurately "unless there is so much preliminary data that the study isn't really needed." Also, study designers often avoid defining what a scientifically meaningful difference is by presenting sample size results in terms of the effect size (i.e., the difference of interest divided by the elusive standard deviation). Bacchetti (2010) encourages study designers to use simple tables in a sensitivity analysis to see what results of a study may look like for low, moderate, and high rates of variability and large, intermediate, and no underlying differences in the populations or processes being studied.

Examples

  # Show how potential confidence intervals for the difference between two means 
  # will look assuming standard deviations of 1, 2, or 3, differences between 
  # the two means of 2, 1, or 0, and a sample size of 10 in each group.

  ciTableMean()
#>           Diff=2      Diff=1      Diff=0
#> SD=1 [ 1.1, 2.9] [ 0.1, 1.9] [-0.9, 0.9]
#> SD=2 [ 0.1, 3.9] [-0.9, 2.9] [-1.9, 1.9]
#> SD=3 [-0.8, 4.8] [-1.8, 3.8] [-2.8, 2.8]

  #==========

  # Show how a potential confidence interval for a mean will look assuming 
  # standard deviations of 1, 2, or 5, a sample mean of 5, 3, or 1, and 
  # a sample size of 15.

  ciTableMean(n1 = 15, diff.or.mean = c(5, 3, 1), SD = c(1, 2, 5), sample.type = "one")
#>           Mean=5      Mean=3      Mean=1
#> SD=1 [ 4.4, 5.6] [ 2.4, 3.6] [ 0.4, 1.6]
#> SD=2 [ 3.9, 6.1] [ 1.9, 4.1] [-0.1, 2.1]
#> SD=5 [ 2.2, 7.8] [ 0.2, 5.8] [-1.8, 3.8]

  #==========

  # The data frame EPA.09.Ex.16.1.sulfate.df contains sulfate concentrations
  # (ppm) at one background and one downgradient well. The estimated
  # mean and standard deviation for the background well are 536 and 27 ppm,
  # respectively, based on a sample size of n = 8 quarterly samples taken over 
  # 2 years.  A two-sided 95% confidence interval for this mean is [514, 559], 
  # which has a half-width of about 22 ppm.
  #
  # The estimated mean and standard deviation for the downgradient well are 
  # 608 and 18 ppm, respectively, based on a sample size of n = 6 quarterly 
  # samples.  A two-sided 95% confidence interval for the difference between 
  # this mean and the background mean is [44, 100] ppm.
  #
  # Suppose we want to design a future sampling program and are interested in 
  # the size of the confidence interval for the difference between the two means.  
  # We will use ciTableMean to generate a table of possible confidence intervals 
  # by varying the assumed standard deviation and assumed differences between 
  # the means.


  # Look at the data
  #-----------------

  EPA.09.Ex.16.1.sulfate.df
#>    Month Year    Well.type Sulfate.ppm
#> 1    Jan 1995   Background         560
#> 2    Apr 1995   Background         530
#> 3    Jul 1995   Background         570
#> 4    Oct 1995   Background         490
#> 5    Jan 1996   Background         510
#> 6    Apr 1996   Background         550
#> 7    Jul 1996   Background         550
#> 8    Oct 1996   Background         530
#> 9    Jan 1995 Downgradient          NA
#> 10   Apr 1995 Downgradient          NA
#> 11   Jul 1995 Downgradient         600
#> 12   Oct 1995 Downgradient         590
#> 13   Jan 1996 Downgradient         590
#> 14   Apr 1996 Downgradient         630
#> 15   Jul 1996 Downgradient         610
#> 16   Oct 1996 Downgradient         630


  # Compute the estimated mean and standard deviation for the 
  # background well.
  #-----------------------------------------------------------

  Sulfate.back <- with(EPA.09.Ex.16.1.sulfate.df,
    Sulfate.ppm[Well.type == "Background"])

  enorm(Sulfate.back, ci = TRUE)
#> $distribution
#> [1] "Normal"
#> 
#> $sample.size
#> [1] 8
#> 
#> $parameters
#>     mean       sd 
#> 536.2500  26.6927 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "mvue"
#> 
#> $data.name
#> [1] "Sulfate.back"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>      LCL      UCL 
#> 513.9343 558.5657 
#> 
#> $type
#> [1] "two-sided"
#> 
#> $method
#> [1] "Exact"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 8
#> 
#> $dof
#> [1] 7
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            Normal
  #
  #Estimated Parameter(s):          mean = 536.2500
  #                                 sd   =  26.6927
  #
  #Estimation Method:               mvue
  #
  #Data:                            Sulfate.back
  #
  #Sample Size:                     8
  #
  #Confidence Interval for:         mean
  #
  #Confidence Interval Method:      Exact
  #
  #Confidence Interval Type:        two-sided
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL = 513.9343
  #                                 UCL = 558.5657


  # Compute the estimated mean and standard deviation for the 
  # downgradient well.
  #----------------------------------------------------------

  Sulfate.down <- with(EPA.09.Ex.16.1.sulfate.df,
    Sulfate.ppm[Well.type == "Downgradient"])

  enorm(Sulfate.down, ci = TRUE)
#> Warning: There were 2 nonfinite values in x : 2 NA's
#> Warning: 2 observations with NA/NaN/Inf in 'x' removed.
#> $distribution
#> [1] "Normal"
#> 
#> $sample.size
#> [1] 6
#> 
#> $parameters
#>      mean        sd 
#> 608.33333  18.34848 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "mvue"
#> 
#> $data.name
#> [1] "Sulfate.down"
#> 
#> $bad.obs
#> [1] 2
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>      LCL      UCL 
#> 589.0778 627.5889 
#> 
#> $type
#> [1] "two-sided"
#> 
#> $method
#> [1] "Exact"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 6
#> 
#> $dof
#> [1] 5
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            Normal
  #
  #Estimated Parameter(s):          mean = 608.33333
  #                                 sd   =  18.34848
  #
  #Estimation Method:               mvue
  #
  #Data:                            Sulfate.down
  #
  #Sample Size:                     6
  #
  #Number NA/NaN/Inf's:             2
  #
  #Confidence Interval for:         mean
  #
  #Confidence Interval Method:      Exact
  #
  #Confidence Interval Type:        two-sided
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL = 589.0778
  #                                 UCL = 627.5889


  # Compute the estimated difference between the means and the confidence 
  # interval for the difference:
  #----------------------------------------------------------------------

  print(t.test(Sulfate.down, Sulfate.back, var.equal = TRUE))
#> 
#> Results of Hypothesis Test
#> --------------------------
#> 
#> Null Hypothesis:                 difference in means = 0
#> 
#> Alternative Hypothesis:          True difference in means is not equal to 0
#> 
#> Test Name:                        Two Sample t-test
#> 
#> Estimated Parameter(s):          mean of x = 608.3333
#>                                  mean of y = 536.2500
#> 
#> Data:                            Sulfate.down and Sulfate.back
#> 
#> Test Statistic:                  t = 5.660985
#> 
#> Test Statistic Parameter:        df = 12
#> 
#> P-value:                         0.0001054306
#> 
#> 95% Confidence Interval:         LCL = 44.33974
#>                                  UCL = 99.82693
#> 



  # Use ciTableMean to look at how the confidence interval for the difference
  # between the background and downgradient means in a future study using eight
  # quarterly samples at each well varies with the assumed value of the pooled
  # standard deviation and the observed difference between the sample means.
  #--------------------------------------------------------------------------------

  # Our current estimate of the pooled standard deviation is 24 ppm:

  summary(lm(Sulfate.ppm ~ Well.type, data = EPA.09.Ex.16.1.sulfate.df))$sigma
#> [1] 23.57759

  # The table below shows that if this estimate is overly optimistic and the 
  # pooled standard deviation in our next study is around 50 ppm, then with an 
  # observed difference between the means of 50 ppm the confidence interval 
  # for the difference will include 0 (its lower limit is -4 ppm), so we may 
  # want to increase our sample size.

  ciTableMean(n1 = 8, n2 = 8, diff.or.mean = c(100, 50, 0), SD = c(15, 25, 50), digits = 0)
#>         Diff=100    Diff=50     Diff=0
#> SD=15 [ 84, 116] [ 34,  66] [-16,  16]
#> SD=25 [ 73, 127] [ 23,  77] [-27,  27]
#> SD=50 [ 46, 154] [ -4, 104] [-54,  54]

  
  #==========

  # Clean up
  #---------
  rm(Sulfate.back, Sulfate.down)