The Empirical Distribution Based on a Set of Observations

Density, distribution function, quantile function, and random generation for the empirical distribution based on a set of observations

demp(x, obs, discrete = FALSE, density.arg.list = NULL)
  pemp(q, obs, discrete = FALSE, 
    prob.method = ifelse(discrete, "emp.probs", "plot.pos"), 
    plot.pos.con = 0.375) 
  qemp(p, obs, discrete = FALSE, 
    prob.method = ifelse(discrete, "emp.probs", "plot.pos"), 
    plot.pos.con = 0.375)
  remp(n, obs)

Arguments

x: vector of quantiles.
q: vector of quantiles.
p: vector of probabilities between 0 and 1.
n: sample size. If length(n) is larger than 1, then length(n) random values are returned.
obs: numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
discrete: logical scalar indicating whether the assumed parent distribution of x is discrete (discrete=TRUE) or continuous (discrete=FALSE). The default value is FALSE.
density.arg.list: list with arguments to the R density function. The default value is NULL. (See the help file for density for more information on the arguments to density.) The argument density.arg.list is ignored if discrete=TRUE.
prob.method: character string indicating what method to use to compute the empirical probabilities. Possible values are "emp.probs" (empirical probabilities, default if discrete=TRUE) and "plot.pos" (plotting positions, default if discrete=FALSE). See the DETAILS section for more explanation.
plot.pos.con: numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is ignored if prob.method="emp.probs".

Details

Let $x_1, x_2, \ldots, x_n$ denote a random sample of n observations from some unknown probability distribution (i.e., the elements of the argument obs), and let $x_{(i)}$ denote the $i^{th}$ order statistic, that is, the $i^{th}$ largest observation, for $i = 1, 2, \ldots, n$.

Estimating Density
The function demp computes the empirical probability density function. If the observations are assumed to come from a discrete distribution, the probability density (mass) function is estimated by: $$\hat{f}(x) = \widehat{Pr}(X = x) = \frac{\sum^n_{i=1} I_{[x]}(x_i)}{n}$$ where $I$ is the indicator function:

$I_{[x]}(y) =$	$1$	if $y = x$,
	$0$	if $y \ne x$

That is, the estimated probability of observing the value $x$ is simply the observed proportion of observations equal to $x$.

If the observations are assumed to come from a continuous distribution, the function demp calls the R function density to compute the estimated density based on the values specified in the argument obs, and then uses linear interpolation to estimate the density at the values specified in the argument x. See the R help file for density for more information on how the empirical density is computed in the continuous case.

Estimating Probabilities
The function pemp computes the estimated cumulative distribution function (cdf), also called the empirical cdf (ecdf). If the observations are assumed to come from a discrete distribution, the value of the cdf evaluated at the $i^{th}$ order statistic is usually estimated by: $$\hat{F}[x_{(i)}] = \widehat{Pr}(X \le x_{(i)}) = \hat{p}_i = \frac{\sum^n_{j=1} I_{(-\infty, x_{(i)}]}(x_j)}{n}$$ where:

$I_{(-\infty, x]}(y) =$	$1$	if $y \le x$,
	$0$	if $y > x$

(D'Agostino, 1986a). That is, the estimated value of the cdf at the $i^{th}$ order statistic is simply the observed proportion of observations less than or equal to the $i^{th}$ order statistic. This estimator is sometimes called the “empirical probabilities” estimator and is intuitively appealing. The function pemp uses the above equations to compute the empirical cdf when prob.method="emp.probs".

For any general value of $x$, when the observations are assumed to come from a discrete distribution, the value of the cdf is estimated by:

$\hat{F}(x) =$	$0$	if $x < x_{(1)}$,
	$\hat{p}_i$	if $x_{(i)} \le x < x_{(i+1)}$,
	$1$	if $x \ge x_{(n)}$

The function pemp uses the above equation when discrete=TRUE.

If the observations are assumed to come from a continuous distribution, the value of the cdf evaluated at the $i^{th}$ order statistic is usually estimated by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1}$$ where $a$ denotes the plotting position constant and $0 \le a \le 1$ (Cleveland, 1993, p.18; D'Agostino, 1986a, pp.8,25). The estimators defined by the above equation are called plotting positions and are used to construct probability plots. The function pemp uses the above equation when
prob.method="plot.pos".

For any general value of $x$, the value of the cdf is estimated by linear interpolation:

$\hat{F}(x) =$	$\hat{p}_1$	if $x < x_{(1)}$,
	$(1 - r)\hat{p}_i + r\hat{p}_{i+1}$	if $x_{(i)} \le x < x_{(i+1)}$,
	$\hat{p}_n$	if $x \ge x_{(n)}$

where $$r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}}$$ (Chambers et al., 1983). The function pemp uses the above two equations when discrete=FALSE.

Estimating Quantiles
The function qemp computes the estimated quantiles based on the observed data. If the observations are assumed to come from a discrete distribution, the $p^{th}$ quantile is usually estimated by:

$\hat{x}_p =$	$x_{(1)}$	if $p \le \hat{p}_1$,
	$x_{(i)}$	if $\hat{p}_{i-1} < p \le \hat{p}_i$,
	$x_n$	if $p > \hat{p}_n$

The function qemp uses the above equation when discrete=TRUE.

If the observations are assumed to come from a continuous distribution, the $p^{th}$ quantile is usually estimated by linear interpolation:

$\hat{x}_p =$	$x_{(1)}$	if $p \le \hat{p}_1$,
	$(1 - r)x_{(i-1)} + rx_{(i)}$	if $\hat{p}_{i-1} < p \le \hat{p}_i$,
	$x_n$	if $p > \hat{p}_n$

where $$r = \frac{p - \hat{p}_{i-1}}{\hat{p}_i - \hat{p}_{i-1}}$$ The function qemp uses the above two equations when discrete=FALSE.

Generating Random Numbers From the Empirical Distribution
The function remp simply calls the R function sample to sample the elements of obs with replacement.

Value

density (demp), probability (pemp), quantile (qemp), or random sample (remp) for the empirical distribution based on the data contained in the vector obs.

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11–16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7–62.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley and Sons, New York.

Sheather, S. J. and Jones M. C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statististical Society B, 683–690.

Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Wegman, E.J. (1972). Nonparametric Probability Density Estimation. Technometrics 14, 533-546.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Note

The function demp let's you perform nonparametric density estimation. The function pemp computes the value of the empirical cumulative distribution function (ecdf) for user-specified quantiles. The ecdf is a nonparametric estimate of the true cdf (see ecdfPlot). The function qemp computes nonparametric estimates of quantiles (see the help files for eqnpar and quantile). The function remp let's you sample a set of observations with replacement, which is often done while bootstrapping or performing some other kind of Monte Carlo simulation.

Examples

  # Create a set of 100 observations from a gamma distribution with 
  # parameters shape=4 and scale=5. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(3) 
  obs <- rgamma(100, shape=4, scale=5)

  # Now plot the empirical distribution (with a histogram) and the true distribution:

  dev.new()
  hist(obs, col = "cyan", xlim = c(0, 65), freq = FALSE, 
    ylab = "Relative Frequency") 

  pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) 

  box()

  # Now plot the empirical distribution (based on demp) with the 
  # true distribution:

  x <- qemp(p = seq(0, 1, len = 100), obs = obs) 
  y <- demp(x, obs) 

  dev.new()
  plot(x, y, xlim = c(0, 65), type = "n", 
    xlab = "Value of Random Variable", 
    ylab = "Relative Frequency") 
  lines(x, y, lwd = 2, col = "cyan") 

  pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE)

  # Alternatively, you can create the above plot with the function 
  # epdfPlot:

  dev.new()
  epdfPlot(obs, xlim = c(0, 65), epdf.col = "cyan", 
    xlab = "Value of Random Variable", 
    main = "Empirical and Theoretical PDFs")

  pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE)

  


  # Clean Up
  #---------
  rm(obs, x, y)

\(I_{[x]}(y) =\)	\(1\)	if \(y = x\),
	\(0\)	if \(y \ne x\)

\(I_{(-\infty, x]}(y) =\)	\(1\)	if \(y \le x\),
	\(0\)	if \(y > x\)

\(\hat{F}(x) =\)	\(0\)	if \(x < x_{(1)}\),
	\(\hat{p}_i\)	if \(x_{(i)} \le x < x_{(i+1)}\),
	\(1\)	if \(x \ge x_{(n)}\)

\(\hat{F}(x) =\)	\(\hat{p}_1\)	if \(x < x_{(1)}\),
	\((1 - r)\hat{p}_i + r\hat{p}_{i+1}\)	if \(x_{(i)} \le x < x_{(i+1)}\),
	\(\hat{p}_n\)	if \(x \ge x_{(n)}\)

\(\hat{x}_p =\)	\(x_{(1)}\)	if \(p \le \hat{p}_1\),
	\(x_{(i)}\)	if \(\hat{p}_{i-1} < p \le \hat{p}_i\),
	\(x_n\)	if \(p > \hat{p}_n\)