summaryStats is a generic function used to produce summary statistics, confidence intervals, and results of hypothesis tests. The function invokes particular methods which depend on the class of the first argument.
The summary statistics include: sample size, number of missing values, mean, standard deviation, median, min, and max. Optional additional summary statistics include 1st quartile, 3rd quartile, and standard error.
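The dispatch on the class of the first argument can be illustrated with a small S3 sketch. The generic describe and its methods are hypothetical names invented for this illustration; they are not part of EnvStats and do not reproduce its internals. The sketch only shows how a generic routes a numeric vector to a default method and coerces character or logical vectors to factors before summarizing them.

```r
# Hypothetical generic (NOT an EnvStats function) illustrating S3 dispatch
describe <- function(object, ...) UseMethod("describe")

# Default method: used for numeric vectors
describe.default <- function(object, ...) {
  c(N = sum(!is.na(object)), Mean = mean(object, na.rm = TRUE))
}

# Factor method: counts and percents per level
describe.factor <- function(object, ...) {
  counts <- table(object)
  n <- setNames(as.vector(counts), names(counts))
  cbind(N = n, Pct = 100 * n / sum(n))
}

# Character and logical vectors are coerced to factors, then re-dispatched
describe.character <- function(object, ...) describe(factor(object), ...)
describe.logical <- function(object, ...) describe(factor(object), ...)

num_result <- describe(c(1.2, 3.4, NA, 5.6))  # handled by describe.default
chr_result <- describe(c("a", "b", "a"))      # handled by describe.factor
```

Calling the generic with a different class of object selects a different method without any change to the calling code.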
summaryStats(object, ...)
# S3 method for class 'formula'
summaryStats(object, data = NULL, subset,
na.action = na.pass, ...)
# Default S3 method
summaryStats(object, group = NULL,
drop.unused.levels = TRUE, se = FALSE, quartiles = FALSE,
digits = max(3, getOption("digits") - 3),
digit.type = "round", drop0trailing = TRUE, = TRUE, = FALSE, p.value = FALSE,
p.value.digits = 2, p.value.digit.type = "signif",
test = "parametric", paired = FALSE, test.arg.list = NULL,
combine.groups = p.value, = TRUE,
group.p.value.type = NULL, alternative = "two.sided",
ci = NULL, ci.between = NULL, conf.level = 0.95, = FALSE, = deparse(substitute(object)), ...)
# S3 method for class 'factor'
summaryStats(object, group = NULL,
drop.unused.levels = TRUE,
digits = max(3, getOption("digits") - 3),
digit.type = "round", drop0trailing = TRUE, = TRUE, = FALSE, p.value = FALSE,
p.value.digits = 2, p.value.digit.type = "signif",
test = "chisq", test.arg.list = NULL, combine.levels = TRUE,
combine.groups = FALSE, = TRUE,
ci = p.value & test != "chisq", conf.level = 0.95, = FALSE, ...)
# S3 method for class 'character'
summaryStats(object, ...)
# S3 method for class 'logical'
summaryStats(object, ...)
# S3 method for class 'data.frame'
summaryStats(object, ...)
# S3 method for class 'matrix'
summaryStats(object, ...)
# S3 method for class 'list'
summaryStats(object, ...)
an object for which summary statistics are desired. In the default method, the argument object can be a numeric vector, factor, character vector, logical vector, data frame, matrix, or list.
When object is a character or logical vector, it is coerced to be a factor.
When object is a data frame, all columns must be numeric or all columns must be factors.
When object is a matrix, it must be a numeric or character matrix.
When object is a list, all components must be numeric vectors or all components must be factors.
In the formula method, a symbolic specification of the form y ~ g can be given, indicating the observations in the vector y are to be grouped according to the levels of the factor g (the form y ~ 1 indicates no grouping).
NAs are allowed in the data.
when object
is a formula, data
specifies an optional data frame, list or
environment (or object coercible by
to a data frame) containing the
variables in the model. If not found in data
, the variables are taken from
, typically the environment from which summaryStats
is called.
when object is a formula, subset specifies an optional vector specifying a subset of observations to be used.
when object is a formula, na.action specifies a function which indicates what should happen when the data contain NAs. The default is na.pass.
when object is a numeric vector or factor, group is a factor or character vector indicating which group each observation belongs to. When object is a matrix or data frame, this argument is ignored and the columns define the groups. When object is a formula, this argument is ignored and the right-hand side of the formula specifies the grouping variable.
when drop.unused.levels=TRUE, groups with no observations are dropped.
for numeric data, logical scalar indicating whether to include the standard error of the mean in the summary statistics. The default value is se=FALSE.
for numeric data, logical scalar indicating whether to include the estimated 25th and 75th percentiles in the summary statistics. The default value is quartiles=FALSE.
integer indicating the number of digits to use for the summary statistics. When digit.type="signif", digits indicates the number of significant digits. When digit.type="round", digits indicates the number of decimal places to round to. The default value is max(3, getOption("digits") - 3), that is, the maximum of 3 versus the current setting of the "digits" component of .Options minus 3.
character string indicating whether the digits argument refers to significant digits (digit.type="signif"), or how many decimal places to round to (digit.type="round", the default).
logical scalar indicating whether to drop trailing 0's when printing the summary statistics.
The value of this argument is added as an attribute to the returned list and is used by the
function. The default value is TRUE
logical scalar indicating whether to return the number of missing values.
The default value is
logical scalar indicating whether to display the number of missing values in the case when
there are no missing values. The default value is
logical scalar indicating whether to return the p-value associated with a test of hypothesis.
The default value is p.value=FALSE
Numeric data: if there are no groups the p-value is associated with the t-test to test
whether the mean is different from 0; if there are groups see the explanation for the argument
below. Factors: the p-value is associated with the test
specified by the argument test
(see below).
integer indicating the number of digits to use for the p-value. When p.value.digit.type="signif", p.value.digits indicates the number of significant digits. When p.value.digit.type="round", p.value.digits indicates the number of decimal places to round to. The default value is p.value.digits=2.
character string indicating whether the p.value.digits argument refers to significant digits (p.value.digit.type="signif", the default), or how many decimal places to round to (p.value.digit.type="round").
Numeric data: character string indicating whether to compute p-values and confidence intervals based on parametric (test="parametric"; the default) or nonparametric (test="nonparametric") tests when p.value=TRUE and/or ci=TRUE.
When test="parametric", confidence intervals are based on the t-test (see t.test) and p-values are based on the t-test or F-test (see anova.lm).
When test="nonparametric", confidence intervals are based on the Wilcoxon rank sum test (see wilcox.test) and p-values are based on the Wilcoxon rank sum test or the Kruskal-Wallis rank sum test (see kruskal.test).
Factors: character string indicating which test to perform when p.value=TRUE. Possible values are test="chisq" for the chi-squared test as performed by the function chisq.test (the default), test="prop" for the chi-squared test as performed by the function prop.test, test="fisher" for Fisher's exact test as performed by the function fisher.test, and test="binom" for the one-sample exact binomial test as performed by binom.test.
The chi-squared test as performed by prop.test is only available when the number of levels in object is 2 and either group is not supplied or the number of levels in group is 2.
Fisher's exact test is only available when the number of levels in group is \(\ge 2\).
The exact binomial test is only available when group is not supplied and the number of levels in object is 2.
applicable only to the case when there are two groups: logical scalar indicating whether the observations in the first group are paired with those in the second group. The default is paired=FALSE.
NOTE: If the argument test.arg.list (see below) contains a component named paired, the value of that component is set to the value of the argument paired.
a list with additional arguments to pass to the test used to compute p-values and confidence intervals. For numeric data, when test="parametric", p.value=TRUE, and there are two groups, if this argument is NULL or does not contain a component named var.equal, it will be modified to contain the component var.equal=TRUE.
Note that this overrides the default behavior of t.test when there are two groups.
NOTE: If test.arg.list contains a component named paired, the value of that component is set to the value of the argument paired (see above).
logical scalar indicating whether to show summary statistics for all groups combined.
Numeric data: the default value is TRUE if p.value=TRUE, otherwise FALSE.
Factors: the default value is FALSE.
logical scalar indicating whether to remove missing values from the group
and group
contains missing values then an error is returned.
and group
contains missing values then a warning is issued.
By default
for numeric data, character string indicating which p-value(s) to compute when there is more than one group. When group.p.value.type="between" (the default when combine.groups=TRUE), the p-value is associated with the two-sample t-test (or the Wilcoxon rank sum test) in the case of two groups, and the one-way analysis of variance F-test (or Kruskal-Wallis test) in the case of three or more groups to test whether the group means (locations) are different from each other. When group.p.value.type="within" (the default when combine.groups=FALSE), the computed p-values for each group are associated with the one-sample t-test (or Wilcoxon signed rank test) to test whether the group mean (location) is different from 0.
for numeric data, character string indicating which alternative to assume for p-values and confidence intervals. Possible values are "two.sided" (the default), "less", and "greater". This argument is ignored for p-values in the case of three or more groups when group.p.value.type="between", and is ignored for confidence intervals in the case of three or more groups when ci.between=TRUE.
Numeric data: logical scalar indicating whether to compute a confidence interval
for the mean or each group mean. The default value is FALSE
and there are no groups, or when
and there are groups and group.p.value.type="within"
Factors: logical scalar indicating whether to compute a confidence interval. A confidence interval is computed only if the number of levels in object is 2. When group is not supplied, if ci=TRUE and test="prop" or test="binom", a confidence interval for the percent (not probability) of the first level of object is computed.
When group is supplied and the number of levels in group is 2, if ci=TRUE, a confidence interval for the difference between percents (not proportions) is computed, and if test="fisher", a confidence interval for the odds ratio is computed.
for numeric data, logical scalar indicating whether to compute a confidence interval for the difference between group means when there are two groups. The default value is ci.between=TRUE when p.value=TRUE and group.p.value.type="between"; otherwise this argument is ignored.
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals.
The default value is conf.level=0.95
logical scalar indicating whether to show the summary statistics in the rows or columns of the
output. The default is
character string indicating the name of the data used for the summary statistics.
for factors, a logical scalar indicating whether to compute summary statistics based on combining all levels of a factor.
additional arguments affecting the summary statistics produced.
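The argument descriptions state that, for two groups with test="parametric", var.equal=TRUE is added to test.arg.list, which differs from the default of t.test, and that p-values may come from the t-test or the F-test. The following is a minimal base R sketch of what that choice means, using the built-in sleep data as an assumed stand-in data set; it does not call summaryStats and makes no claim about its internal code.

```r
# Built-in 'sleep' data: extra hours of sleep for two drug groups
pooled <- t.test(extra ~ group, data = sleep, var.equal = TRUE)  # pooled variance
welch  <- t.test(extra ~ group, data = sleep)                    # t.test default (Welch)

# One-way ANOVA F-test on the same two groups
aov_tab <- anova(lm(extra ~ group, data = sleep))
p_F <- aov_tab[["Pr(>F)"]][1]

# With two groups, the two-sided pooled t-test and the F-test give the same p-value
p_pooled <- pooled$p.value
p_welch  <- welch$p.value
```

With equal variances assumed, the degrees of freedom are n1 + n2 - 2, whereas the Welch test uses an approximated, generally non-integer value.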
an object of class "summaryStats" (see summaryStats.object).
Objects of class "summaryStats" are numeric matrices that contain the summary statistics produced by a call to summaryStats or summaryFull.
These objects have a special printing method that by default removes trailing zeros for sample size entries and prints blanks for statistics that are normally displayed as NA (see print.summaryStats).
Summary statistics for numeric data include sample size, mean, standard deviation, median, min, and max. Options include the standard error of the mean (when se=TRUE), the estimated quartiles (when quartiles=TRUE), p-values (when p.value=TRUE), and/or confidence intervals (when ci=TRUE and/or ci.between=TRUE).
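As a rough illustration of the numeric summary described above, the same six statistics can be computed per group in base R. The helper summarise_numeric is a hypothetical name written for this sketch, and the built-in airquality data are an assumed example; the result is a plain matrix, not an object of class "summaryStats".

```r
# Hypothetical helper (not part of EnvStats): per-group N, Mean, SD, Median, Min, Max
summarise_numeric <- function(x, group) {
  stats <- function(v) {
    v_ok <- v[!is.na(v)]  # drop missing values before summarizing
    c(N = length(v_ok), Mean = mean(v_ok), SD = sd(v_ok),
      Median = median(v_ok), Min = min(v_ok), Max = max(v_ok))
  }
  # One row per group, one column per statistic
  t(sapply(split(x, group), stats))
}

# log10 ozone by month; Ozone contains missing values
res <- round(summarise_numeric(log10(airquality$Ozone), airquality$Month), 4)
res
```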
Summary statistics for factors include the sample size for each level of the factor and the percent of the total for that level. Options include a p-value (when p.value=TRUE).
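The factor summary can be sketched in base R in the same spirit: counts per level, percent of the total, and a chi-squared test of equal proportions via chisq.test. The built-in mtcars data are an assumed example here, and the layout only approximates what summaryStats prints.

```r
# Number of cylinders treated as a factor (built-in 'mtcars' data)
cyl <- factor(mtcars$cyl)
counts <- table(cyl)

# Sample size per level and percent of the total, rounded to 1 decimal place
n <- setNames(as.vector(counts), names(counts))
factor_summary <- cbind(N = n, Pct = round(100 * n / sum(n), 1))

# Chi-squared goodness-of-fit test of equal proportions across levels
chisq_p <- chisq.test(counts)$p.value

factor_summary
```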
Note that unlike the R function summary and the EnvStats function summaryFull, by default the digits argument for the EnvStats function summaryStats refers to how many decimal places to round to, not how many significant digits to use (see the explanation of the argument digit.type).
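The difference between the two interpretations of digits can be seen with the base R functions round and signif. The values below are made up for illustration.

```r
# Illustrative values only
x <- c(0.0904, 12.3456, 123.456)

rounded     <- round(x, 3)   # 3 decimal places:      0.090  12.346  123.456
significant <- signif(x, 3)  # 3 significant digits:  0.0904 12.3    123
```

Rounding to a fixed number of decimal places keeps columns of statistics aligned on the same scale, while significant digits preserve relative precision across values of very different magnitude.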
# The guidance document USEPA (1994b, pp. 6.22--6.25)
# contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB)
# concentrations (in parts per billion) from soil samples
# at a Reference area and a Cleanup area. These data are stored
# in the data frame EPA.94b.tccb.df.
# First, create summary statistics by area based on the log-transformed data.
summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df)
#> N Mean SD Median Min Max
#> Cleanup 77 -0.2377 0.5908 -0.3665 -1.0458 2.2270
#> Reference 47 -0.2691 0.2032 -0.2676 -0.6576 0.1239
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Mean SD Median Min Max
#Cleanup 77 -0.2377 0.5908 -0.3665 -1.0458 2.2270
#Reference 47 -0.2691 0.2032 -0.2676 -0.6576 0.1239
# Now create summary statistics by area based on the log-transformed data
# and use the t-test to compare the areas.
summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, p.value = TRUE)
#> N Mean SD Median Min Max Diff p.value.between
#> Cleanup 77 -0.2377 0.5908 -0.3665 -1.0458 2.2270 NA NA
#> Reference 47 -0.2691 0.2032 -0.2676 -0.6576 0.1239 NA NA
#> Combined 124 -0.2496 0.4810 -0.3143 -1.0458 2.2270 -0.0313 0.73
#> 95%.LCL.between 95%.UCL.between
#> Cleanup NA NA
#> Reference NA NA
#> Combined -0.2082 0.1456
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df,
p.value = TRUE, = TRUE)
#> Cleanup Reference Combined
#> N 77.0000 47.0000 124.0000
#> Mean -0.2377 -0.2691 -0.2496
#> SD 0.5908 0.2032 0.4810
#> Median -0.3665 -0.2676 -0.3143
#> Min -1.0458 -0.6576 -1.0458
#> Max 2.2270 0.1239 2.2270
#> Diff NA NA -0.0313
#> p.value.between NA NA 0.7300
#> 95%.LCL.between NA NA -0.2082
#> 95%.UCL.between NA NA 0.1456
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Cleanup Reference Combined
#N 77 47 124
#Mean -0.2377 -0.2691 -0.2496
#SD 0.5908 0.2032 0.481
#Median -0.3665 -0.2676 -0.3143
#Min -1.0458 -0.6576 -1.0458
#Max 2.227 0.1239 2.227
#Diff -0.0313
#p.value.between 0.73
#95%.LCL.between -0.2082
#95%.UCL.between 0.1456
# Page 9-3 of USEPA (2009) lists trichloroethene
# concentrations (TCE; mg/L) collected from groundwater at two wells.
# Here, the seven non-detects have been set to their detection limit.
# First, compute summary statistics for all TCE observations.
summaryStats( ~ 1, data = EPA.09.Table.9.1.TCE.df,
digits = 3, = "TCE")
#> N Mean SD Median Min Max NA's N.Total
#> TCE 27 0.09 0.064 0.1 0.004 0.25 3 30
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Mean SD Median Min Max NA's N.Total
#TCE 27 0.09 0.064 0.1 0.004 0.25 3 30
summaryStats( ~ 1, data = EPA.09.Table.9.1.TCE.df,
se = TRUE, quartiles = TRUE, digits = 3, = "TCE")
#> N Mean SD SE Median Min Max 1st Qu. 3rd Qu. NA's N.Total
#> TCE 27 0.09 0.064 0.012 0.1 0.004 0.25 0.031 0.12 3 30
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Mean SD SE Median Min Max 1st Qu. 3rd Qu. NA's N.Total
#TCE 27 0.09 0.064 0.012 0.1 0.004 0.25 0.031 0.12 3 30
# Now compute summary statistics by well.
summaryStats( ~ Well, data = EPA.09.Table.9.1.TCE.df,
digits = 3)
#> N Mean SD Median Min Max NA's N.Total
#> Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15
#> Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Mean SD Median Min Max NA's N.Total
#Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15
#Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15
summaryStats( ~ Well, data = EPA.09.Table.9.1.TCE.df,
digits = 3, = TRUE)
#> Well.1 Well.2
#> N 14.000 13.000
#> Mean 0.063 0.118
#> SD 0.079 0.020
#> Median 0.031 0.110
#> Min 0.004 0.099
#> Max 0.250 0.170
#> NA's 1.000 2.000
#> N.Total 15.000 15.000
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Well.1 Well.2
#N 14 13
#Mean 0.063 0.118
#SD 0.079 0.02
#Median 0.031 0.11
#Min 0.004 0.099
#Max 0.25 0.17
#NA's 1 2
#N.Total 15 15
# If you want to keep trailing 0's, use the drop0trailing argument:
summaryStats( ~ Well, data = EPA.09.Table.9.1.TCE.df,
digits = 3, = TRUE, drop0trailing = FALSE)
#> Well.1 Well.2
#> N 14.000 13.000
#> Mean 0.063 0.118
#> SD 0.079 0.020
#> Median 0.031 0.110
#> Min 0.004 0.099
#> Max 0.250 0.170
#> NA's 1.000 2.000
#> N.Total 15.000 15.000
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] FALSE
# Well.1 Well.2
#N 14.000 13.000
#Mean 0.063 0.118
#SD 0.079 0.020
#Median 0.031 0.110
#Min 0.004 0.099
#Max 0.250 0.170
#NA's 1.000 2.000
#N.Total 15.000 15.000
# Page 13-3 of USEPA (2009) lists iron concentrations (ppm) in
# groundwater collected from 6 wells.
# First, compute summary statistics for each well.
summaryStats(Iron.ppm ~ Well, data = EPA.09.Ex.13.1.iron.df,
combine.groups = FALSE, digits = 2, = TRUE)
#> Well.1 Well.2 Well.3 Well.4 Well.5 Well.6
#> N 4.00 4.00 4.00 4.00 4.00 4.00
#> Mean 47.01 55.74 90.86 70.43 145.24 156.32
#> SD 12.40 20.34 59.35 25.95 92.16 51.20
#> Median 50.06 57.05 76.73 76.96 137.66 171.93
#> Min 29.96 32.14 39.25 34.12 60.95 83.10
#> Max 57.97 76.71 170.72 93.69 244.69 198.34
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Well.1 Well.2 Well.3 Well.4 Well.5 Well.6
#N 4 4 4 4 4 4
#Mean 47.01 55.73 90.86 70.43 145.24 156.32
#SD 12.4 20.34 59.35 25.95 92.16 51.2
#Median 50.05 57.05 76.73 76.95 137.66 171.93
#Min 29.96 32.14 39.25 34.12 60.95 83.1
#Max 57.97 76.71 170.72 93.69 244.69 198.34
# Note the large differences in standard deviations between wells.
# Compute summary statistics for log(Iron), by Well.
summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df,
combine.groups = FALSE, digits = 2, = TRUE)
#> Well.1 Well.2 Well.3 Well.4 Well.5 Well.6
#> N 4.00 4.00 4.00 4.00 4.00 4.00
#> Mean 3.82 3.97 4.35 4.19 4.80 5.00
#> SD 0.30 0.40 0.66 0.45 0.70 0.40
#> Median 3.91 4.02 4.29 4.34 4.80 5.14
#> Min 3.40 3.47 3.67 3.53 4.11 4.42
#> Max 4.06 4.34 5.14 4.54 5.50 5.29
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Well.1 Well.2 Well.3 Well.4 Well.5 Well.6
#N 4 4 4 4 4 4
#Mean 3.82 3.97 4.35 4.19 4.8 5
#SD 0.3 0.4 0.66 0.45 0.7 0.4
#Median 3.91 4.02 4.29 4.34 4.8 5.14
#Min 3.4 3.47 3.67 3.53 4.11 4.42
#Max 4.06 4.34 5.14 4.54 5.5 5.29
# Include confidence intervals for the mean log(Fe) concentration
# at each well, and also the p-value from the one-way
# analysis of variance to test for a difference in well means.
summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df,
digits = 1, ci = TRUE, p.value = TRUE, = TRUE)
#> Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 Combined
#> N 4.0 4.0 4.0 4.0 4.0 4.0 24.000
#> Mean 3.8 4.0 4.3 4.2 4.8 5.0 4.400
#> SD 0.3 0.4 0.7 0.5 0.7 0.4 0.600
#> Median 3.9 4.0 4.3 4.3 4.8 5.1 4.300
#> Min 3.4 3.5 3.7 3.5 4.1 4.4 3.400
#> Max 4.1 4.3 5.1 4.5 5.5 5.3 5.500
#> 95%.LCL 3.3 3.3 3.3 3.5 3.7 4.4 4.100
#> 95%.UCL 4.3 4.6 5.4 4.9 5.9 5.6 4.600
#> p.value.between NA NA NA NA NA NA 0.025
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 Combined
#N 4 4 4 4 4 4 24
#Mean 3.8 4 4.3 4.2 4.8 5 4.4
#SD 0.3 0.4 0.7 0.5 0.7 0.4 0.6
#Median 3.9 4 4.3 4.3 4.8 5.1 4.3
#Min 3.4 3.5 3.7 3.5 4.1 4.4 3.4
#Max 4.1 4.3 5.1 4.5 5.5 5.3 5.5
#95%.LCL 3.3 3.3 3.3 3.5 3.7 4.4 4.1
#95%.UCL 4.3 4.6 5.4 4.9 5.9 5.6 4.6
#p.value.between 0.025
# Using the built-in dataset HairEyeColor, summarize the frequencies
# of hair color and test whether there is a difference in proportions.
# NOTE: The data that was originally factor data has already been
# collapsed into frequency counts by category in the object
# HairEyeColor. In the examples in this section, we recreate
# the factor objects in order to show how summaryStats works
# for factor objects.
Hair <- apply(HairEyeColor, 1, sum)
Hair
#> Black Brown Red Blond
#> 108 286 71 127
#Black Brown Red Blond
# 108 286 71 127
Hair.color <- names(Hair)
Hair.fac <- factor(rep(Hair.color, times = Hair),
levels = Hair.color)
# Compute summary statistics and perform the chi-square test
# for equal proportions of hair color
summaryStats(Hair.fac, digits = 1, p.value = TRUE)
#> N Pct ChiSq_p
#> Black 108 18.2 NA
#> Brown 286 48.3 NA
#> Red 71 12.0 NA
#> Blond 127 21.5 NA
#> Combined 592 100.0 2.5e-39
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Pct ChiSq_p
#Black 108 18.2
#Brown 286 48.3
#Red 71 12.0
#Blond 127 21.5
#Combined 592 100.0 2.5e-39
# Now test the hypothesis that 10% of the population from which
# this sample was drawn has Red hair, and compute a 95% confidence
# interval for the percent of subjects with red hair.
Red.Hair.fac <- factor(Hair.fac == "Red", levels = c(TRUE, FALSE),
labels = c("Red", "Not Red"))
summaryStats(Red.Hair.fac, digits = 1, p.value = TRUE,
ci = TRUE, test = "binom", test.arg.list = list(p = 0.1))
#> N Pct Exact_p 95%.LCL 95%.UCL
#> Red 71 12 NA 9.5 14.9
#> Not Red 521 88 NA NA NA
#> Combined 592 100 0.11 NA NA
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
# N Pct Exact_p 95%.LCL 95%.UCL
#Red 71 12 9.5 14.9
#Not Red 521 88
#Combined 592 100 0.11
# Now test whether the percent of people with Green eyes is the
# same for people with and without Red hair.
HairEye <- apply(HairEyeColor, 1:2, sum)
Hair.color <- rownames(HairEye)
Eye.color <- colnames(HairEye)
n11 <- HairEye[Hair.color == "Red", Eye.color == "Green"]
n12 <- sum(HairEye[Hair.color == "Red", Eye.color != "Green"])
n21 <- sum(HairEye[Hair.color != "Red", Eye.color == "Green"])
n22 <- sum(HairEye[Hair.color != "Red", Eye.color != "Green"])
Hair.fac <- factor(rep(c("Red", "Not Red"), c(n11+n12, n21+n22)),
levels = c("Red", "Not Red"))
Eye.fac <- factor(c(rep("Green", n11), rep("Not Green", n12),
rep("Green", n21), rep("Not Green", n22)),
levels = c("Green", "Not Green"))
# Here are the results using the chi-square test and computing
# confidence limits for the difference between the two percentages
summaryStats(Eye.fac, group = Hair.fac, digits = 1,
p.value = TRUE, ci = TRUE, test = "prop", = TRUE, test.arg.list = list(correct = FALSE))
#> Green Not Green Combined
#> Red(N) 14.0 57.0 71.00
#> Red(Pct) 19.7 80.3 100.00
#> Not Red(N) 50.0 471.0 521.00
#> Not Red(Pct) 9.6 90.4 100.00
#> ChiSq_p NA NA 0.01
#> 95%.LCL.between NA NA 0.50
#> 95%.UCL.between NA NA 19.70
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Green Not Green Combined
#Red(N) 14 57 71
#Red(Pct) 19.7 80.3 100
#Not Red(N) 50 471 521
#Not Red(Pct) 9.6 90.4 100
#ChiSq_p 0.01
#95%.LCL.between 0.5
#95%.UCL.between 19.7
# Here are the results using Fisher's exact test and computing
# confidence limits for the odds ratio
summaryStats(Eye.fac, group = Hair.fac, digits = 1,
p.value = TRUE, ci = TRUE, test = "fisher", = TRUE)
#> Green Not Green Combined
#> Red(N) 14.0 57.0 71.000
#> Red(Pct) 19.7 80.3 100.000
#> Not Red(N) 50.0 471.0 521.000
#> Not Red(Pct) 9.6 90.4 100.000
#> Fisher_p NA NA 0.015
#> 95%.LCL.OR NA NA 1.100
#> 95%.UCL.OR NA NA 4.600
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Green Not Green Combined
#Red(N) 14 57 71
#Red(Pct) 19.7 80.3 100
#Not Red(N) 50 471 521
#Not Red(Pct) 9.6 90.4 100
#Fisher_p 0.015
#95%.LCL.OR 1.1
#95%.UCL.OR 4.6
rm(Hair, Hair.color, Hair.fac, Red.Hair.fac, HairEye, Eye.color,
n11, n12, n21, n22, Eye.fac)
# The data set EPA.89b.cadmium.df contains information on
# cadmium concentrations in groundwater collected from a
# background and compliance well. Compare detection frequencies
# between the well types and test for a difference using
# Fisher's exact test.
summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df,
digits = 1, p.value = TRUE, test = "fisher")
#> Background(N) Background(Pct) Compliance(N) Compliance(Pct) Fisher_p
#> FALSE 8 33.3 24 37.5 NA
#> TRUE 16 66.7 40 62.5 NA
#> Combined 24 100.0 64 100.0 0.81
#> 95%.LCL.OR 95%.UCL.OR
#> Combined 0.3 2.5
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df,
digits = 1, p.value = TRUE, test = "fisher", = TRUE)
#> FALSE TRUE Combined
#> Background(N) 8.0 16.0 24.00
#> Background(Pct) 33.3 66.7 100.00
#> Compliance(N) 24.0 40.0 64.00
#> Compliance(Pct) 37.5 62.5 100.00
#> Fisher_p NA NA 0.81
#> 95%.LCL.OR NA NA 0.30
#> 95%.UCL.OR NA NA 2.50
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# FALSE TRUE Combined
#Background(N) 8 16 24
#Background(Pct) 33.3 66.7 100
#Compliance(N) 24 40 64
#Compliance(Pct) 37.5 62.5 100
#Fisher_p 0.81
#95%.LCL.OR 0.3
#95%.UCL.OR 2.5
# Paired Observations
# The data frame ACE.13.TCE.df contains paired observations of
# trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells
# before and after remediation.
# Compare TCE concentrations before and after remediation and
# use a paired t-test to test for a difference between periods.
summaryStats( ~ Period, data = ACE.13.TCE.df,
p.value = TRUE, paired = TRUE)
#> N Mean SD Median Min Max Diff paired.p.value.between
#> Before 10 21.6240 13.5113 20.300 5.960 41.5 NA NA
#> After 10 3.6329 3.5544 2.480 0.272 10.7 NA NA
#> Combined 20 12.6284 13.3281 8.475 0.272 41.5 -17.9911 0.0027
#> 95%.LCL.between 95%.UCL.between
#> Before NA NA
#> After NA NA
#> Combined -27.9097 -8.0725
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
summaryStats( ~ Period, data = ACE.13.TCE.df,
p.value = TRUE, paired = TRUE, = TRUE)
#> Before After Combined
#> N 10.0000 10.0000 20.0000
#> Mean 21.6240 3.6329 12.6284
#> SD 13.5113 3.5544 13.3281
#> Median 20.3000 2.4800 8.4750
#> Min 5.9600 0.2720 0.2720
#> Max 41.5000 10.7000 41.5000
#> Diff NA NA -17.9911
#> paired.p.value.between NA NA 0.0027
#> 95%.LCL.between NA NA -27.9097
#> 95%.UCL.between NA NA -8.0725
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Before After Combined
#N 10 10 20
#Mean 21.624 3.6329 12.6284
#SD 13.5113 3.5544 13.3281
#Median 20.3 2.48 8.475
#Min 5.96 0.272 0.272
#Max 41.5 10.7 41.5
#Diff -17.9911
#paired.p.value.between 0.0027
#95%.LCL.between -27.9097
#95%.UCL.between -8.0725
# Repeat the last example, but use a one-sided alternative since
# remediation should decrease TCE concentration.
summaryStats( ~ Period, data = ACE.13.TCE.df,
p.value = TRUE, paired = TRUE, alternative = "less")
#> N Mean SD Median Min Max Diff
#> Before 10 21.6240 13.5113 20.300 5.960 41.5 NA
#> After 10 3.6329 3.5544 2.480 0.272 10.7 NA
#> Combined 20 12.6284 13.3281 8.475 0.272 41.5 -17.9911
#> paired.p.value.between.less 95%.LCL.between 95%.UCL.between
#> Before NA NA NA
#> After NA NA NA
#> Combined 0.0013 -Inf -9.9537
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] FALSE
#> attr(,"drop0trailing")
#> [1] TRUE
summaryStats( ~ Period, data = ACE.13.TCE.df,
p.value = TRUE, paired = TRUE, alternative = "less", = TRUE)
#> Before After Combined
#> N 10.0000 10.0000 20.0000
#> Mean 21.6240 3.6329 12.6284
#> SD 13.5113 3.5544 13.3281
#> Median 20.3000 2.4800 8.4750
#> Min 5.9600 0.2720 0.2720
#> Max 41.5000 10.7000 41.5000
#> Diff NA NA -17.9911
#> paired.p.value.between.less NA NA 0.0013
#> 95%.LCL.between NA NA -Inf
#> 95%.UCL.between NA NA -9.9537
#> attr(,"class")
#> [1] "summaryStats"
#> attr(,"")
#> [1] TRUE
#> attr(,"drop0trailing")
#> [1] TRUE
# Before After Combined
#N 10 10 20
#Mean 21.624 3.6329 12.6284
#SD 13.5113 3.5544 13.3281
#Median 20.3 2.48 8.475
#Min 5.96 0.272 0.272
#Max 41.5 10.7 41.5
#Diff -17.9911
#paired.p.value.between.less 0.0013
#95%.LCL.between -Inf
#95%.UCL.between -9.9537
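The factor examples based on HairEyeColor can be cross-checked with base R alone, since the help text says the tests are performed by binom.test and prop.test. This is a sketch under that assumption; it rebuilds the counts directly rather than calling summaryStats, and multiplies the intervals by 100 because summaryStats reports percents.

```r
# Collapse the built-in HairEyeColor table over Sex
hair_eye <- apply(HairEyeColor, c(1, 2), sum)
n_red   <- sum(hair_eye["Red", ])
n_total <- sum(hair_eye)

# One-sample exact binomial test that 10% of the population has red hair
binom_fit <- binom.test(n_red, n_total, p = 0.1)
pct_ci <- 100 * binom_fit$conf.int  # confidence limits as percents

# Two-sample comparison: percent with green eyes, red hair vs. not red hair
green  <- c(Red = hair_eye["Red", "Green"],
            NotRed = sum(hair_eye[rownames(hair_eye) != "Red", "Green"]))
totals <- c(Red = n_red, NotRed = n_total - n_red)
prop_fit <- prop.test(green, totals, correct = FALSE)
diff_ci <- 100 * prop_fit$conf.int  # limits for the difference in percents
```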