ecdfPlot.Rd
Produce an empirical cumulative distribution function plot.
ecdfPlot(x, discrete = FALSE,
prob.method = ifelse(discrete, "emp.probs", "plot.pos"),
plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black",
ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE,
curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"),
main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
logical scalar indicating whether the assumed parent distribution of x is discrete (discrete=TRUE) or continuous (discrete=FALSE; the default).
character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are "plot.pos" (plotting positions, the default if discrete=FALSE) and "emp.probs" (empirical probabilities, the default if discrete=TRUE). See the DETAILS section for more explanation.
numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is ignored if prob.method="emp.probs".
logical scalar indicating whether to produce a plot or add to the current plot (see add) on the current graphics device. The default value is plot.it=TRUE.
logical scalar indicating whether to add the empirical cdf to the current plot (add=TRUE) or generate a new plot (add=FALSE; the default). This argument is ignored if plot.it=FALSE.
a numeric scalar or character string determining the color of the empirical cdf line or points. The default value is ecdf.col="black". See the entry for col in the help file for par for more information.
a numeric scalar determining the width of the empirical cdf line. The default value is ecdf.lwd=3*par("cex"). See the entry for lwd in the help file for par for more information.
a numeric scalar determining the line type of the empirical cdf line. The default value is ecdf.lty=1. See the entry for lty in the help file for par for more information.
a logical scalar indicating whether to fill in the area below the empirical cdf curve with the color specified by curve.fill.col. The default value is curve.fill=FALSE.
a numeric scalar or character string indicating what color to use to fill in the area below the empirical cdf curve. The default value is curve.fill.col="cyan". This argument is ignored if curve.fill=FALSE.
additional graphical parameters (see lines and par). In particular, the argument type specifies the type of plot. By default, the function ecdfPlot plots a step function (type="s") when discrete=TRUE, and plots a straight line between points (type="l") when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
The cumulative distribution function (cdf) of a random variable \(X\) is the function \(F\) such that $$F(x) = Pr(X \le x) \;\;\;\;\;\; (1)$$ for all values of \(x\). That is, if \(p = F(x)\), then \(p\) is the proportion of the population that is less than or equal to \(x\), and \(x\) is called the \(p\)'th quantile, or the 100\(p\)'th percentile. A plot of quantiles on the \(x\)-axis (i.e., the possible value for the random variable \(X\)) vs. the fraction of the population less than or equal to that number on the \(y\)-axis is called the cumulative distribution function plot, and the \(y\)-axis is usually labeled as the “cumulative probability” or “cumulative frequency”.
When we have a sample of data from some population, we usually do not know what percentiles our observations correspond to because we do not know the form of the cumulative distribution function \(F\), so we have to use the sample data to estimate the cdf \(F\). An empirical cumulative distribution function (ecdf) plot, also called a quantile plot, is a plot of the observed quantiles (i.e., the ordered observations) on the \(x\)-axis vs. the estimated cumulative probabilities on the \(y\)-axis (Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20; Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24).
(Note: Some authors (e.g., Chambers et al., 1983, pp.11-16; Cleveland, 1993, pp.17-20) reverse the axes on a quantile plot, i.e., the observed order statistics from the random sample are on the \(y\)-axis and the estimated cumulative probabilities are on the \(x\)-axis.)
The empirical cumulative distribution function (ecdf) is an estimate of the cdf based on a random sample of \(n\) observations from the distribution. Let \(x_1, x_2, \ldots, x_n\) denote the \(n\) observations, and let \(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\) denote the ordered observations (i.e., the order statistics). The cdf is usually estimated by either the empirical probabilities estimator or the plotting-position estimator. The empirical probabilities estimator is given by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (2)$$ where \(\#[x_j \le x_{(i)}]\) denotes the number of observations less than or equal to \(x_{(i)}\). The plotting-position estimator is given by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (3)$$ where \(0 \le a \le 1\) (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25).
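Equations (2) and (3) can be sketched in base R as follows. This is only an illustration of the two formulas, not the code that ecdfPlot uses internally; the helper names emp_probs and plot_pos and the small data vector are hypothetical. Note that this sketch gives tied observations distinct plotting positions, which may differ from how ecdfPlot treats ties.

```r
# Hypothetical helpers illustrating equations (2) and (3); not part of any package.
emp_probs <- function(x) {
  x <- sort(x)
  # Equation (2): number of observations <= x_(i), divided by n.
  vapply(x, function(xi) sum(x <= xi), integer(1)) / length(x)
}

plot_pos <- function(x, a = 0.375) {
  n <- length(x)
  # Equation (3): (i - a) / (n - 2a + 1) for i = 1, ..., n.
  (seq_len(n) - a) / (n - 2 * a + 1)
}

x <- c(2.1, 3.4, 3.4, 5.0, 7.2)   # made-up data with one tie
p_emp <- emp_probs(x)
p_pos <- plot_pos(x)
print(cbind(x = sort(x), emp.probs = p_emp, plot.pos = p_pos))
```

With the empirical probabilities estimator the tied values share the same cumulative probability and the largest value receives probability 1, whereas the plotting-position estimator keeps every value strictly between 0 and 1.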
For any value \(x\) such that \(x_{(1)} < x < x_{(n)}\), the ecdf is usually defined as either a step function:
$$\hat{F}(x) = \hat{F}[x_{(i)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (4)$$
(e.g., D'Agostino, 1986a), or linear interpolation between order statistics is used:
$$\hat{F}(x) = (1-r)\hat{F}[x_{(i)}] + r\hat{F}[x_{(i+1)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (5)$$
where
$$r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}} \;\;\;\;\;\; (6)$$
(e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a
value on the \(x\)-axis corresponding to one of the order statistics, then it makes a jump.
For the linear interpolation version, the ecdf plot looks like lines connecting the points.
By default, the function ecdfPlot uses the step function version when discrete=TRUE, and the linear interpolation version when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
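The difference between the step function in equation (4) and the linear interpolation in equations (5) and (6) can be illustrated with the base R function approxfun. This is a sketch under assumed data (four made-up observations and a = 0.375); it does not reproduce the internals of ecdfPlot.

```r
x <- sort(c(1, 2, 4, 7))                                  # made-up order statistics
p <- (seq_along(x) - 0.375) / (length(x) - 2 * 0.375 + 1) # equation (3)

# Step-function version (equation 4): hold the value at the left order statistic.
F_step <- approxfun(x, p, method = "constant", f = 0)

# Linear-interpolation version (equations 5 and 6).
F_lin <- approxfun(x, p, method = "linear")

# x0 lies between x_(2) = 2 and x_(3) = 4, so r = (3 - 2) / (4 - 2) = 0.5.
x0 <- 3
print(c(step = F_step(x0), linear = F_lin(x0)))
```

At x0 the step version returns the estimate attached to x_(2), while the interpolated version returns the average of the estimates at x_(2) and x_(3) because r = 0.5.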
The empirical probabilities estimator is intuitively appealing. This is the estimator used when prob.method="emp.probs". The disadvantage of this estimator is that it implies the largest observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.
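A short base R sketch of this drawback, assuming a sample of size 10 with no ties and, purely for illustration, a standard normal model: the empirical probabilities estimator assigns probability 1 to the largest observation, which corresponds to an infinite quantile under any continuous model with unbounded support.

```r
n <- 10
p_emp <- seq_len(n) / n                              # equation (2), no ties assumed
p_pos <- (seq_len(n) - 0.375) / (n - 2 * 0.375 + 1)  # equation (3) with a = 0.375

# Cumulative probability assigned to the largest observation.
print(c(emp.probs = p_emp[n], plot.pos = p_pos[n]))

# Corresponding standard normal quantiles: infinite for p = 1, finite otherwise.
q_emp <- qnorm(p_emp[n])
q_pos <- qnorm(p_pos[n])
print(c(emp.probs = q_emp, plot.pos = q_pos))
```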
The plotting-position estimator with various values of \(a\) is often used when the goal is to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used to compute the estimated expected values or medians of the order statistics for a probability plot. This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers to the variable \(a\). Based on certain principles from statistical theory, certain values of the constant \(a\) make sense for specific underlying distributions (see the help file for qqPlot for more information).
Because \(x\) is a random sample, the empirical cdf changes from sample to sample, and the variability in these estimates can be dramatic for small sample sizes.
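The sample-to-sample variability can be explored with a small simulation. The sketch below assumes standard normal data and looks only at the estimated cumulative probability at 0, where the true value is 0.5; the function name ecdf_spread_at_zero, the seed, and the number of simulations are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: standard deviation, across simulated samples of size n,
# of the empirical probability that an observation is <= 0.
ecdf_spread_at_zero <- function(n, n_sim = 1000) {
  est <- replicate(n_sim, mean(rnorm(n) <= 0))
  sd(est)
}

spread <- c(n10 = ecdf_spread_at_zero(10), n100 = ecdf_spread_at_zero(100))
print(spread)
```

For this estimator the theoretical standard error at a point with true probability \(p\) is \(\sqrt{p(1-p)/n}\), so the spread for \(n = 10\) should be roughly three times the spread for \(n = 100\).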
ecdfPlot invisibly returns a list with the following components:
numeric vector of the ordered observations.
numeric vector of the associated plotting positions.
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the \(y\)-axis vs. the ecdf on the \(x\)-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots (see cdfPlot and cdfCompare) to graphically assess whether a sample of observations comes from a particular distribution. The Kolmogorov-Smirnov goodness-of-fit test (see gofTest) is the statistical companion of this kind of comparison; it is based on the maximum vertical distance between the empirical cdf plot and the theoretical cdf plot. More often, however, quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess departures from an assumed distribution (see qqPlot).
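The maximum vertical distance mentioned above can be computed by hand and compared with the statistic reported by ks.test in the base stats package, which is used here as a stand-in for gofTest. The sketch assumes simulated standard normal data, a fully specified N(0, 1) reference distribution, and the step-function ecdf based on the empirical probabilities estimator.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- sort(rnorm(30))
n <- length(x)
F0 <- pnorm(x)  # theoretical cdf evaluated at the order statistics

# The step ecdf jumps from (i - 1)/n to i/n at x_(i), so the largest gap
# must be checked just before and just after each jump.
D <- max(pmax(seq_len(n) / n - F0, F0 - (seq_len(n) - 1) / n))

ks <- ks.test(x, "pnorm")
print(c(manual = D, ks.test = unname(ks$statistic)))
```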
# Generate 20 observations from a normal distribution with
# mean=0 and sd=1 and create an ecdf plot.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(250)
x <- rnorm(20)
dev.new()
ecdfPlot(x)
#----------
# Repeat the above example, but fill in the area under the
# empirical cdf curve.
dev.new()
ecdfPlot(x, curve.fill = TRUE)
#----------
# Repeat the above example, but plot only the points.
dev.new()
ecdfPlot(x, type = "p")
#----------
# Repeat the above example, but force a step function.
dev.new()
ecdfPlot(x, type = "s")
#----------
# Clean up
rm(x)
#-------------------------------------------------------------------------------------
# The guidance document USEPA (1994b, pp. 6.22--6.25)
# contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB)
# concentrations (in parts per billion) from soil samples
# at a Reference area and a Cleanup area. These data are stored
# in the data frame EPA.94b.tccb.df.
#
# Create an empirical CDF plot for the reference area data.
dev.new()
with(EPA.94b.tccb.df,
  ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)"))
#==========
# Clean up
#---------
graphics.off()