Estimate Mean, Standard Deviation, and Standard Error Nonparametrically

Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data, and optionally construct a confidence interval for the mean.

enpar(x, ci = FALSE, ci.method = "bootstrap", ci.type = "two-sided", 
      conf.level = 0.95, pivot.statistic = "z", n.bootstraps = 1000, seed = NULL)

Arguments

x: numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
ci: logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE.
ci.method: character string indicating what method to use to construct the confidence interval for the mean. The possible values are "bootstrap" (based on bootstrapping; the default), and "normal.approx" (normal approximation). See the DETAILS section for more information. This argument is ignored if ci=FALSE.
ci.type: character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". This argument is ignored if ci=FALSE.
conf.level: a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. This argument is ignored if ci=FALSE.
pivot.statistic: character string indicating which statistic to use for the confidence interval for the mean when ci.method="normal.approx". Possible values are "z" (confidence interval based on the z-statistic; the default), and "t" (confidence interval based on the t-statistic). This argument is ignored if ci=FALSE or ci.method="bootstrap".
n.bootstraps: numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean. This argument is ignored if ci=FALSE or ci.method="normal.approx".
seed: integer supplied to the function set.seed and used when ci=TRUE and ci.method="bootstrap". The default value is seed=NULL, in which case the current value of .Random.seed is used. This argument is ignored if ci=FALSE or ci.method="normal.approx". This argument is necessary to create reproducible results for the bootstrapped confidence intervals (see the EXAMPLES section).

Details

Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some distribution with mean $\mu$ and standard deviation $\sigma$.

Estimation
Unbiased and consistent estimators of the mean and variance are: $$\hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i \;\;\;\; (1)$$ $$\hat{\sigma}^2 = s^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2 \;\;\;\; (2)$$ A consistent (but not unbiased) estimate of the standard deviation is given by the square root of the estimated variance above: $$\hat{\sigma} = s \;\;\;\; (3)$$ It can be shown that the variance of the sample mean is given by: $$\sigma^2_{\hat{\mu}} = \sigma^2_{\bar{x}} = \frac{\sigma^2}{N} \;\;\;\; (4)$$ so the standard deviation of the sample mean (usually called the standard error) can be estimated by: $$\hat{\sigma}_{\hat{\mu}} = \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{N}} \;\;\;\; (5)$$
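As an illustration, Equations (1)-(5) can be reproduced with base R functions. This is a minimal sketch on a hypothetical sample, not the internal code of enpar; the object names are chosen here for illustration only.

```r
# Hypothetical sample; NA and Inf are included to show the removal step.
x <- c(2.1, 3.4, NA, 5.0, 4.2, Inf, 3.8)
x <- x[is.finite(x)]          # drops NA, NaN, Inf, and -Inf

N       <- length(x)
mu.hat  <- mean(x)            # Equation (1)
s2      <- var(x)             # Equation (2), divisor N - 1
s       <- sqrt(s2)           # Equation (3), identical to sd(x)
se.mean <- s / sqrt(N)        # Equation (5)

c(mean = mu.hat, sd = s, se.mean = se.mean)
```

The three values correspond to the mean, sd, and se.mean components that enpar reports among its estimated parameters.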

Confidence Intervals
This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$, i.e., the sample mean, is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as: $$[\hat{\mu} - t_{1-\alpha/2, m-1}\hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} + t_{1-\alpha/2, m-1}\hat{\sigma}_{\hat{\mu}}] \;\;\;\; (6)$$ where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $m$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
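The interval in Equation (6) and its z-based variant can be sketched in base R as follows. The sample is hypothetical, and taking $m$ equal to the number of observations is an assumption made for this sketch rather than a statement about how enpar sets it.

```r
x <- c(2.1, 3.4, 5.0, 4.2, 3.8)   # hypothetical sample
conf.level <- 0.95
alpha <- 1 - conf.level

N       <- length(x)              # used as m here (assumption)
mu.hat  <- mean(x)
se.mean <- sd(x) / sqrt(N)

# Two-sided interval based on the t-statistic, Equation (6)
t.crit <- qt(1 - alpha / 2, df = N - 1)
ci.t   <- mu.hat + c(-1, 1) * t.crit * se.mean

# Same construction with the standard normal quantile (z-statistic)
z.crit <- qnorm(1 - alpha / 2)
ci.z   <- mu.hat + c(-1, 1) * z.crit * se.mean

# One-sided lower interval: all of alpha goes into one tail
ci.lower <- c(mu.hat - qt(1 - alpha, df = N - 1) * se.mean, Inf)
```

For small samples the z-based interval is narrower than the t-based one, since the normal quantile is smaller than the corresponding t quantile.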

Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")
The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

1. Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample.
2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$ where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and estimate of the standard error of the estimate of the mean based on the bootstrap sample.
3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enpar, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.
4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$ or to compute the empirical cumulative distribution function of the statistic $T$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.
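The resampling steps above can be sketched in base R roughly as follows. The data, the seed, and the object names (mu.boot, t.boot, G.hat) are hypothetical choices for this sketch; it is not the internal implementation of enpar.

```r
set.seed(476)    # arbitrary seed, only to make the sketch reproducible
x <- c(2.1, 3.4, 5.0, 4.2, 3.8, 6.3, 2.9, 4.7, 3.1, 5.5)  # hypothetical sample
N <- length(x)
B <- 1000        # plays the role of n.bootstraps

mu.hat  <- mean(x)
mu.boot <- numeric(B)   # bootstrap estimates of the mean
t.boot  <- numeric(B)   # bootstrap values of the statistic T

for (b in seq_len(B)) {                       # repeat the two steps B times
  x.b        <- sample(x, size = N, replace = TRUE)   # resample with replacement
  mu.boot[b] <- mean(x.b)                             # re-estimate the mean
  se.b       <- sd(x.b) / sqrt(N)                     # standard error in the resample
  t.boot[b]  <- (mu.boot[b] - mu.hat) / se.b          # statistic T for bootstrap-t
}

G.hat <- ecdf(mu.boot)   # empirical cdf of the bootstrap means
```

Calling G.hat(mu.hat) then gives the proportion of bootstrap means that are less than or equal to the original estimate.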

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as: $$[\hat{G}^{-1}(\frac{\alpha}{2}), \; \hat{G}^{-1}(1-\frac{\alpha}{2})] \;\;\;\;\;\; (7)$$ where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as: $$[\hat{G}^{-1}(\alpha), \; \infty] \;\;\;\;\;\; (8)$$ and a one-sided upper confidence interval is computed as: $$[-\infty, \; \hat{G}^{-1}(1-\alpha)] \;\;\;\;\;\; (9)$$ The function enpar calls the R function quantile to compute the empirical quantiles used in Equations (7)-(9).
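A minimal sketch of the percentile intervals in Equations (7)-(9), using the R function quantile on a vector of bootstrap means. The sample and seed are hypothetical, and the sketch assumes the default quantile type; which type enpar passes to quantile is not stated here.

```r
set.seed(1)      # arbitrary seed
x <- c(2.1, 3.4, 5.0, 4.2, 3.8, 6.3, 2.9, 4.7, 3.1, 5.5)  # hypothetical sample
alpha <- 0.05

# Bootstrap distribution of the sample mean
mu.boot <- replicate(1000, mean(sample(x, replace = TRUE)))

# Equation (7): two-sided percentile interval
ci.two.sided <- quantile(mu.boot, probs = c(alpha / 2, 1 - alpha / 2),
                         names = FALSE)

# Equation (8): one-sided lower interval
ci.lower <- c(quantile(mu.boot, probs = alpha, names = FALSE), Inf)

# Equation (9): one-sided upper interval
ci.upper <- c(-Inf, quantile(mu.boot, probs = 1 - alpha, names = FALSE))
```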

The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184–188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as: $$[\hat{G}^{-1}(\alpha_1), \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (10)$$ where $$\alpha_1 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}] \;\;\;\;\;\; (11)$$ $$\alpha_2 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}] \;\;\;\;\;\; (12)$$ $$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})] \;\;\;\;\;\; (13)$$ $$\hat{a} = \frac{\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6[\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2]^{3/2}} \;\;\;\;\;\; (14)$$ where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and $$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^N \hat{\mu}_{(i)} \;\;\;\;\;\; (15)$$ A one-sided lower confidence interval is given by: $$[\hat{G}^{-1}(\alpha_1), \; \infty] \;\;\;\;\;\; (16)$$ and a one-sided upper confidence interval is given by: $$[-\infty, \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (17)$$ where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in Equations (11) and (12).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
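The bias-corrected and accelerated interval can be sketched in base R as below. This follows Equations (10)-(15) on a hypothetical sample and is an illustration only; the use of ecdf for $\hat{G}$ and the default quantile type are assumptions of the sketch, and the limits need not match what enpar returns.

```r
set.seed(2)      # arbitrary seed
x <- c(2.1, 3.4, 5.0, 4.2, 3.8, 6.3, 2.9, 4.7, 3.1, 5.5)  # hypothetical sample
N <- length(x)
alpha  <- 0.05
mu.hat <- mean(x)

mu.boot <- replicate(1000, mean(sample(x, replace = TRUE)))

# Bias correction, Equation (13)
z0 <- qnorm(ecdf(mu.boot)(mu.hat))

# Acceleration, Equations (14)-(15), from leave-one-out means
mu.jack <- sapply(seq_len(N), function(i) mean(x[-i]))
d       <- mean(mu.jack) - mu.jack
a.hat   <- sum(d^3) / (6 * sum(d^2)^(3 / 2))

# Adjusted probabilities, Equations (11)-(12)
z         <- qnorm(c(alpha / 2, 1 - alpha / 2))
alpha.adj <- pnorm(z0 + (z0 + z) / (1 - a.hat * (z0 + z)))

# Equation (10): quantiles of the bootstrap means at the adjusted probabilities
ci.bca <- quantile(mu.boot, probs = alpha.adj, names = FALSE)
```

When z0 and a.hat are both zero, alpha.adj reduces to c(alpha / 2, 1 - alpha / 2) and the interval coincides with the percentile interval.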

For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as: $$[\hat{\mu} - t_{1-\alpha/2}\hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} - t_{\alpha/2}\hat{\sigma}_{\hat{\mu}}] \;\;\;\;\;\; (18)$$ where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T$. Similarly, a one-sided lower confidence interval is computed as: $$[\hat{\mu} - t_{1-\alpha}\hat{\sigma}_{\hat{\mu}}, \; \infty] \;\;\;\;\;\; (19)$$ and a one-sided upper confidence interval is computed as: $$[-\infty, \; \hat{\mu} - t_{\alpha}\hat{\sigma}_{\hat{\mu}}] \;\;\;\;\;\; (20)$$
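A minimal sketch of the two-sided bootstrap-t interval in Equation (18), again on a hypothetical sample with an arbitrary seed. Note that the upper empirical quantile of $T$ produces the lower limit and the lower quantile produces the upper limit.

```r
set.seed(3)      # arbitrary seed
x <- c(2.1, 3.4, 5.0, 4.2, 3.8, 6.3, 2.9, 4.7, 3.1, 5.5)  # hypothetical sample
N <- length(x)
alpha  <- 0.05
mu.hat <- mean(x)
se.hat <- sd(x) / sqrt(N)

# Bootstrap distribution of T = (mu.B - mu.hat) / se.B
t.boot <- replicate(1000, {
  x.b <- sample(x, replace = TRUE)
  (mean(x.b) - mu.hat) / (sd(x.b) / sqrt(N))
})

# Equation (18): subtract the upper quantile first, then the lower quantile
t.q       <- quantile(t.boot, probs = c(1 - alpha / 2, alpha / 2), names = FALSE)
ci.boot.t <- mu.hat - t.q * se.hat
```

Unlike the normal approximation interval, this interval is generally not symmetric about the sample mean, because the empirical quantiles of $T$ need not be symmetric about zero.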

When ci.method="bootstrap", the function enpar computes three bootstrap confidence intervals: the percentile interval, the bias-corrected and accelerated (BCa) interval, and the bootstrap-t interval. The percentile method is transformation respecting, but not second-order accurate. The bootstrap-t method is second-order accurate, but not transformation respecting. The bias-corrected and accelerated method is both transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).

Value

a list of class "estimate" containing the estimated parameters and other information. See estimate.object for details.

References

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.

Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

The function enpar is related to the companion function enparCensored for censored data. To estimate the median and compute a confidence interval, use eqnpar.

A call to enpar with ci.method="normal.approx" and pivot.statistic="t" produces the same result as a call to enorm with ci.param="mean".

Examples

  # The data frame ACE.13.TCE.df contains observations on
  # Trichloroethylene (TCE) concentrations (mg/L) at
  # 10 groundwater monitoring wells before and after remediation.
  #
  # Compute the mean concentration for each period along with
  # a 95% bootstrap BCa confidence interval for the mean.
  #
  # NOTE: Use of the argument "seed" is necessary to reproduce this example.
  #
  # Before remediation: 21.6 [14.2, 30.1]
  # After remediation:   3.6 [ 1.6,  5.7]

  with(ACE.13.TCE.df,
    enpar(TCE.mg.per.L[Period=="Before"], ci = TRUE, seed = 476))
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 10
#> 
#> $parameters
#>     mean       sd  se.mean 
#> 21.62400 13.51134  4.27266 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "Sample Mean"
#> 
#> $data.name
#> [1] "TCE.mg.per.L[Period == \"Before\"]"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>  Pct.LCL  Pct.UCL  BCa.LCL  BCa.UCL    t.LCL    t.UCL 
#> 13.95560 29.79510 14.16080 30.06848 12.41945 32.47306 
#> 
#> $type
#> [1] "two-sided"
#> 
#> $method
#> [1] "Bootstrap"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 10
#> 
#> $n.bootstraps
#> [1] 1000
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Estimated Parameter(s):          mean    = 21.62400
  #                                 sd      = 13.51134
  #                                 se.mean =  4.27266
  #
  #Estimation Method:               Sample Mean
  #
  #Data:                            TCE.mg.per.L[Period == "Before"]
  #
  #Sample Size:                     10
  #
  #Confidence Interval for:         mean
  #
  #Confidence Interval Method:      Bootstrap
  #
  #Number of Bootstraps:            1000
  #
  #Confidence Interval Type:        two-sided
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             Pct.LCL = 13.95560
  #                                 Pct.UCL = 29.79510
  #                                 BCa.LCL = 14.16080
  #                                 BCa.UCL = 30.06848
  #                                 t.LCL   = 12.41945
  #                                 t.UCL   = 32.47306
  
  #----------

  with(ACE.13.TCE.df, 
    enpar(TCE.mg.per.L[Period=="After"], ci = TRUE, seed = 543))
#> $distribution
#> [1] "None"
#> 
#> $sample.size
#> [1] 10
#> 
#> $parameters
#>     mean       sd  se.mean 
#> 3.632900 3.554419 1.124006 
#> 
#> $n.param.est
#> [1] 2
#> 
#> $method
#> [1] "Sample Mean"
#> 
#> $data.name
#> [1] "TCE.mg.per.L[Period == \"After\"]"
#> 
#> $bad.obs
#> [1] 0
#> 
#> $interval
#> $name
#> [1] "Confidence"
#> 
#> $parameter
#> [1] "mean"
#> 
#> $limits
#>  Pct.LCL  Pct.UCL  BCa.LCL  BCa.UCL    t.LCL    t.UCL 
#> 1.833843 5.830230 1.631655 5.677514 1.683791 8.101829 
#> 
#> $type
#> [1] "two-sided"
#> 
#> $method
#> [1] "Bootstrap"
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $sample.size
#> [1] 10
#> 
#> $n.bootstraps
#> [1] 1000
#> 
#> attr(,"class")
#> [1] "intervalEstimate"
#> 
#> attr(,"class")
#> [1] "estimate"

  #Results of Distribution Parameter Estimation
  #--------------------------------------------
  #
  #Assumed Distribution:            None
  #
  #Estimated Parameter(s):          mean    = 3.632900
  #                                 sd      = 3.554419
  #                                 se.mean = 1.124006
  #
  #Estimation Method:               Sample Mean
  #
  #Data:                            TCE.mg.per.L[Period == "After"]
  #
  #Sample Size:                     10
  #
  #Confidence Interval for:         mean
  #
  #Confidence Interval Method:      Bootstrap
  #
  #Number of Bootstraps:            1000
  #
  #Confidence Interval Type:        two-sided
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             Pct.LCL = 1.833843
  #                                 Pct.UCL = 5.830230
  #                                 BCa.LCL = 1.631655
  #                                 BCa.UCL = 5.677514
  #                                 t.LCL   = 1.683791
  #                                 t.UCL   = 8.101829