Title: | Self Calibrating Quantile-Quantile Plots for Visual Testing |
---|---|
Description: | Provides the function qqtest which incorporates uncertainty in its qqplot display(s) so that the user might have a better sense of the evidence against the specified distributional hypothesis. qqtest draws a quantile quantile plot for visually assessing whether the data come from a test distribution that has been defined in one of many ways. The vertical axis plots the data quantiles, the horizontal those of a test distribution. The default behaviour generates 1000 samples from the test distribution and overlays the plot with shaded pointwise interval estimates for the ordered quantiles from the test distribution. A small number of independently generated exemplar quantile plots can also be overlaid. Both the interval estimates and the exemplars provide different comparative information to assess the evidence provided by the qqplot for or against the hypothesis that the data come from the test distribution (default is normal or gaussian). Finally, a visual test of significance (a lineup plot) can also be displayed to test the null hypothesis that the data come from the test distribution. |
Authors: | Wayne Oldford [aut, cre] |
Maintainer: | Wayne Oldford <[email protected]> |
License: | GPL-3 |
Version: | 1.2.0 |
Built: | 2025-01-12 03:03:36 UTC |
Source: | https://github.com/rwoldford/qqtest |
The number of bacteria per cubic centimetre was measured daily for river water entering the Torresdale filter of the Philadelphia water supply through the whole of 1913.
bacteria
bacteria
A data frame with 22 rows and 2 variates:
Number of bacteria per cc.
Percent of days (out of 365) having bacteria count less than or equal to the measured count.
Rather than the individual daily results, recorded here are 22 values together with the percentage of days whose value was less than or equal to the recorded value. These quantiles are therefore based on 365 daily measurements.
Values were taken from a logarithmic-normal probability plot dated January 21, 1915 as it appeared in Figure 22 of George C. Whipple's 1916 (Part 2) paper on the “Element of Chance in Sanitation”. This paper introduces logarithimic-normal probability paper (or a log-normal qqplot).
with(bacteria, plot(qnorm(percentTime/100), log(count,10), type="o"))
reproduces Whipple's 1915 plot.
with(bacteria, qqtest(data=count, p=percentTime/100, np=365, dist="log-normal", type="o"))
will effect a qqtest plot for this data. More detail can be had from:
with(bacteria, qqtest(data=log(count, 10), p=percentTime/100, np=365, dist="normal", type="o"))
"The Element of Chance in Sanitation", George C. Whipple, Journal of the Franklin Institue, Volume 182, July and August (1916), pp. 37-59 and 205-227. Data taken directly from Figure 22, page 209.
dkay
The density function of the K distributionThe K density function on df
degrees of freedom and non-centrality parameter ncp
.
A K distribution is the square root of a chi-square divided by its degrees of freedom. That is, if x is chi-squared on m degrees of freedom, then y = sqrt(x/m) is K on m degrees of freedom. Under standard normal theory, K is the distribution of the pivotal quantity s/sigma where s is the sample standard deviation and sigma is the standard deviation parameter of the normal density. K is the natural distribution for tests and confidence intervals about sigma. K densities are more nearly symmetric than are chi-squared and concentrate near 1. As the degrees of freedom increase, they become more symmetric, more concentrated, and more nearly normally distributed.
dkay(x, df, ncp = 0, log.p = FALSE)
dkay(x, df, ncp = 0, log.p = FALSE)
x |
A vector of values at which to calculate the density. |
df |
Degrees of freedom (non-negative, but can be non-integer). |
ncp |
Non-centrality parameter (non-negative). |
log.p |
logical; if |
dkay
gives the density evaluated at the values of x
.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is the maximum of the lengths of the numerical arguments for the other functions.
The numerical arguments are recycled to the length of the result. Only the first elements of the logical arguments are used.
All calls depend on analogous calls to chi-squared functions. See dchisq
for details on non-centrality parameter calculations.
dkay(1, 20) # # See also the vignette on the "K-distribution" #
dkay(1, 20) # # See also the vignette on the "K-distribution" #
hideLocation
Obfuscating the true location of the real data.Hides the true location of the data as a non-obvious calculation string.
hideLocation(trueLoc, nSubjects)
hideLocation(trueLoc, nSubjects)
trueLoc |
A vector of one or more numbers in the set 1, 2, ..., nSubjects whose locations are to be hidden. |
nSubjects |
An integer larger than 1 used to define the set 1, 2, ..., nSubjects. |
Returns a character vector, each element being a string
containing an obscure calculation which, if parsed and evaluated, would return
the value of the corresponding number in trueLoc
.
trueLoc <- hideLocation(3,100) trueLoc revealLocation(trueLoc) n <- 200 trueLoc <- sample(1:n, 3) trueLoc ans <- hideLocation(trueLoc , n) ans revealLocation(ans)
trueLoc <- hideLocation(3,100) trueLoc revealLocation(trueLoc) n <- 200 trueLoc <- sample(1:n, 3) trueLoc ans <- hideLocation(trueLoc , n) ans revealLocation(ans)
Values are arranged in decreasing order of absolute magnitude. Name of the contrast effect is given as the row name of each value. Daniel(1959) uses the data to illustrate the use of half-normal plots. In his words: "We need, of course, some rule of inference that will help us to be objective in judging whether or not the largest effects are real."
penicillin
penicillin
A data frame with 31 rows and 1 variate:
value of contrast for that row
qqtest(penicillin[1],dist="half-normal")
will effect Daniel's plot.
"Use of Half-Normal Plots in Interpreting Factorial Two-Level Experiments", Cuthbert Daniel, Technometrics, Vol. 1, No. 4 (Nov., 1959), pp. 311-341. Daniel cites: Davies, O. L., Editor: Design and Analysis of Industrial Experiments, Second Edition, Oliver and Boyd, London, and Hafner, New York, 1956 as the original.
pkay
The cumulative distribution function K distributionThe cumulative distribution function for the K distribution on df
degrees of freedom having non-centrality parameter ncp
.
A K distribution is the square root of a chi-square divided by its degrees of freedom. That is, if x is chi-squared on m degrees of freedom, then y = sqrt(x/m) is K on m degrees of freedom. Under standard normal theory, K is the distribution of the pivotal quantity s/sigma where s is the sample standard deviation and sigma is the standard deviation parameter of the normal density. K is the natural distribution for tests and confidence intervals about sigma. K densities are more nearly symmetric than are chi-squared and concentrate near 1. As the degrees of freedom increase, they become more symmetric, more concentrated, and more nearly normally distributed.
pkay(q, df, ncp = 0, upper.tail = FALSE, log.p = FALSE)
pkay(q, df, ncp = 0, upper.tail = FALSE, log.p = FALSE)
q |
A vector of quantiles at which to calculate the cumulative distribution. |
df |
Degrees of freedom (non-negative, but can be non-integer). |
ncp |
Non-centrality parameter (non-negative). |
upper.tail |
logical; if |
log.p |
logical; if |
pkay
returns the value of the K cumulative distribution function, F(q), evaluated at q
for the given df
and ncp
. If upper.trail = TRUE
then the upper tail probabilities 1-F(q) = Pr(Q>q) are returned instead of F(q).
Invalid arguments will result in return value NaN, with a warning.
The length of the result is the maximum of the lengths of the numerical arguments.
The numerical arguments are recycled to the length of the result. Only the first elements of the logical arguments are used.
All calls depend on analogous calls to chi-squared functions. See pchisq
for details on non-centrality parameter calculations.
pkay(1, 20) q <- seq(0.01, 1.8, 0.01) # # Plot the cdf for K(5) u <- pkay(q,5) plot(q, u, type="l", xlab="q", ylab="cumulative probability", xlim=range(q), ylim=c(0,1), main="K cdf") # # Add some other K cdfs lines(q, pkay(q,10), lty=2) lines(q, pkay(q,20), lty=3) lines(q, pkay(q,30), lty=4) legend("topleft", legend=c("df = 5", "df = 10", "df = 20", "df = 30"), lty=c(1,2,3,4), title="degrees of freedom", cex=0.75, bty="n") # # See the vignette for more on the "K-distribution" #
pkay(1, 20) q <- seq(0.01, 1.8, 0.01) # # Plot the cdf for K(5) u <- pkay(q,5) plot(q, u, type="l", xlab="q", ylab="cumulative probability", xlim=range(q), ylim=c(0,1), main="K cdf") # # Add some other K cdfs lines(q, pkay(q,10), lty=2) lines(q, pkay(q,20), lty=3) lines(q, pkay(q,30), lty=4) legend("topleft", legend=c("df = 5", "df = 10", "df = 20", "df = 30"), lty=c(1,2,3,4), title="degrees of freedom", cex=0.75, bty="n") # # See the vignette for more on the "K-distribution" #
Contains process control measurements of thickness of primer applied to automotive body parts in an auto factory. Twice daily, a set of 10 consecutive parts were selected and the thickness in mils (thousandths of an inch) were measured. For each set of 10 parts, the average (xbar) and the sample standard deviation (s) were also calculated and recorded. These summaries would be plotted in xbar or s control charts with suitably determined upper and lower control limits. Alternatively, for checking outliers a qqplot (via qqtest) could be used for either xbar or s.
primer
primer
A data frame with 20 rows and 14 variates:
Day on which the parts were taken and measured.
Either the first or second set of 10 consecutive parts taken.
Thickness of primer in mils on the first part sampled in the specified batch of that day.
Thickness of primer in mils on the second part sampled in the specified batch of that day.
Thickness of primer in mils on the third part sampled in the specified batch of that day.
Thickness of primer in mils on the fourth part sampled in the specified batch of that day.
Thickness of primer in mils on the fifth part sampled in the specified batch of that day.
Thickness of primer in mils on the sixth part sampled in the specified batch of that day.
Thickness of primer in mils on the seventh part sampled in the specified batch of that day.
Thickness of primer in mils on the eighth part sampled in the specified batch of that day.
Thickness of primer in mils on the ninth part sampled in the specified batch of that day.
Thickness of primer in mils on the tenth part sampled in the specified batch of that day.
Arithmetic average of the measurements of primer thickness of the 10 parts selected in the specified batch of that day.
Sample standard deviation of the measurements of primer thickness of the 10 parts selected in the specified batch of that day.
with(primer,qqtest(xbar, main="Averages"))
will effect this plot for xbar.
with(primer,qqtest(s,dist="kay", df=9, main ="Standard deviations"))
will effect this plot for s.
"Statistical Process Control - SPC", Automotive Industry Action Group(AIAG), Southfield MI, (1995), page 64.
From measurements made by Francis Galton at the International Health Exhibition in 1884.
pullstrength
pullstrength
A data frame with 7 rows and 4 variates:
Pull strength lower bound in pounds.
Number of cases observed with pull strength between this bound and the next.
Percent of cases observed with pull strength between this bound and the next.
Cumulative percent of cases observed with pull strength up to this bound.
Adjust Galton's cumulative percent to include only half the cases between this bound and the next.
qqtest(pullstrength$strength, p=pullstrength$percentCumulative/100, np=519, dist="uniform", main="Galton's ogive of pull strength for 519 males aged 23-26", xlab="Cumulative Proportions (Adjusted)",yAxisAsProbs=FALSE, ylab="Strength in lbs.", type="o")
will effect Galton's Ogive.
qqtest(pullstrength$strength, p=pullstrength$percentAdjustedCumulative/100, np=519, dist="normal", main="Gaussian qqplot of pull strength for 519 males aged 23-26", xlab="Cumulative Proportions (Adjusted)",yAxisAsProbs=FALSE, ylab="Strength in lbs.", type="o")
will effect a normal qqplot for this data.
"Natural Inheritance", Francis Galton, (1889), Table 1, page 199.
qkay
The K distribution quantile functionQuantile function for the K distribution on df
degrees of freedom having non-centrality parameter ncp
.
A K distribution is the square root of a chi-square divided by its degrees of freedom. That is, if x is chi-squared on m degrees of freedom, then y = sqrt(x/m) is K on m degrees of freedom. Under standard normal theory, K is the distribution of the pivotal quantity s/sigma where s is the sample standard deviation and sigma is the standard deviation parameter of the normal density. K is the natural distribution for tests and confidence intervals about sigma. K densities are more nearly symmetric than are chi-squared and concentrate near 1. As the degrees of freedom increase, they become more symmetric, more concentrated, and more nearly normally distributed.
qkay(p, df, ncp = 0, upper.tail = FALSE, log.p = FALSE)
qkay(p, df, ncp = 0, upper.tail = FALSE, log.p = FALSE)
p |
A vector of probabilities at which to calculate the quantiles. |
df |
Degrees of freedom (non-negative, but can be non-integer). |
ncp |
Non-centrality parameter (non-negative). |
upper.tail |
logical; if |
log.p |
logical; if |
qkay
returns the quantiles at probabilities p
for a K on df
degrees of freedom and non-centrality parameter ncp
.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is the maximum of the lengths of the numerical arguments.
The numerical arguments are recycled to the length of the result. Only the first elements of the logical arguments are used.
All calls depend on analogous calls to chi-squared functions. See qchisq
for details on non-centrality parameter calculations.
p <- ppoints(30) # Get the quantiles for these points q5 <- qkay(p, 5) plot(p, q5, main="Quantile plot of K(20)", ylim=c(0,max(q5))) # Add quantiles from another K points(p, qkay(p, 20), pch=19) # # Do these EXACT quantiles from a K(5) look like they might # have been generated from K(20)? qqtest(q5, dist="kay",df=20) # How about compared to normal? qqnorm(q5) qqtest(q5) # for this many degrees of freedom it looks a lot like # a gaussian (normal) distribution # And should look really good compared to the true distribution qqtest(q5, dist="kay", df=5) # # # But not so much like it came from a K on 1 degree of freedom qqtest(q5, dist="kay",df=1) # # See the vignette for more on the "K-distribution" #
p <- ppoints(30) # Get the quantiles for these points q5 <- qkay(p, 5) plot(p, q5, main="Quantile plot of K(20)", ylim=c(0,max(q5))) # Add quantiles from another K points(p, qkay(p, 20), pch=19) # # Do these EXACT quantiles from a K(5) look like they might # have been generated from K(20)? qqtest(q5, dist="kay",df=20) # How about compared to normal? qqnorm(q5) qqtest(q5) # for this many degrees of freedom it looks a lot like # a gaussian (normal) distribution # And should look really good compared to the true distribution qqtest(q5, dist="kay", df=5) # # # But not so much like it came from a K on 1 degree of freedom qqtest(q5, dist="kay",df=1) # # See the vignette for more on the "K-distribution" #
qqtest
A self-calibrated quantile-quantile plot for assessing distributional shape.Draws a quantile-quantile plot for visually assessing whether the data come from a test distribution that has been defined in one of many ways.
The vertical axis plots the data quantiles, the horizontal those of a test distribution.
Interval estimates and exemplars provide different comparative information to assess the evidence provided by the qqplot against the hypothesis that the data come from the test distribution (default is normal or gaussian). Interval estimates provide test information related to individual quantiles, exemplars provide test information related to the shape of the quantile quantile curve.
Optionally, a visual test of significance (a lineup plot) can be displayed to provide a coarse level of significance for testing the null hypothesis that the data come from the test distribution.
The default behaviour generates 1000 samples from the test distribution and overlays the plot with pointwise interval estimates for the ordered quantiles from the test distribution.
Various option choices are available to effect different visualizations of the uncertainty surrounding the quantile quantile plot. These include overlaying independently generated exemplar test distribution sample quantile traces so as to assess the joint (as opposed to pointwise) distribution of quantiles.
See argument descriptions and examples for more details.
qqtest( data, dist = c("gaussian", "normal", "log-normal", "half-normal", "uniform", "exponential", "chi-squared", "kay", "student", "t"), df = 1, qfunction = NULL, rfunction = NULL, dataTest = NULL, p = NULL, a = NULL, np = NULL, matchMethod = c("hinges", "quartiles", "middlehalf", "bottomhalf", "tophalf"), xAxisAsProbs = FALSE, yAxisAsProbs = FALSE, xAxisProbs = c(0.05, 0.25, 0.5, 0.75, 0.95), yAxisProbs = c(0.05, 0.25, 0.5, 0.75, 0.95), nreps = 1000, centralPercents = c(0.9, 0.95, 0.99), envelope = TRUE, drawPercentiles = FALSE, drawQuartiles = FALSE, legend = NULL, legend.xy = "topleft", legend.cex = 0.8, nexemplars = 0, typex = NULL, plainTrails = FALSE, colTrails = NULL, alphaTrails = 0.25, lwdTrails = 1, lineup = FALSE, nsuspects = 20, col = NULL, h = 260, c = 90, l = 60, alpha = 1, cex = NULL, pch = 19, type = NULL, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, axes = NULL, bty = "o", ... )
qqtest( data, dist = c("gaussian", "normal", "log-normal", "half-normal", "uniform", "exponential", "chi-squared", "kay", "student", "t"), df = 1, qfunction = NULL, rfunction = NULL, dataTest = NULL, p = NULL, a = NULL, np = NULL, matchMethod = c("hinges", "quartiles", "middlehalf", "bottomhalf", "tophalf"), xAxisAsProbs = FALSE, yAxisAsProbs = FALSE, xAxisProbs = c(0.05, 0.25, 0.5, 0.75, 0.95), yAxisProbs = c(0.05, 0.25, 0.5, 0.75, 0.95), nreps = 1000, centralPercents = c(0.9, 0.95, 0.99), envelope = TRUE, drawPercentiles = FALSE, drawQuartiles = FALSE, legend = NULL, legend.xy = "topleft", legend.cex = 0.8, nexemplars = 0, typex = NULL, plainTrails = FALSE, colTrails = NULL, alphaTrails = 0.25, lwdTrails = 1, lineup = FALSE, nsuspects = 20, col = NULL, h = 260, c = 90, l = 60, alpha = 1, cex = NULL, pch = 19, type = NULL, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, axes = NULL, bty = "o", ... )
data |
A univariate dataset to be tested. If data has more than one column, the first is used. |
dist |
The name of the distribution against which the comparison is made, the test distribution for a few built-in distributions.
One of If dist is |
df |
Degrees of freedom of |
qfunction |
If non- |
rfunction |
If non- The value of the |
dataTest |
If non- |
p |
If non- If |
a |
This is the second parameter given to the |
np |
This is required if the vector |
matchMethod |
It is necessary to match locations and scales of the two distributions. The method used to do this matching
is given by If The remaining two methods, Finally, note that the user may supply their own matching method by providing a function having vector arguments |
xAxisAsProbs |
If |
yAxisAsProbs |
If |
xAxisProbs |
A vector of probabilities to be used to label the x axis ticks when |
yAxisProbs |
A vector of probabilities to be used to label the y axis ticks when |
nreps |
The number of replicate samples to be taken from the test distribution to construct the pointwise intervals
for each quantile. Default is 1000.
From these samples, an empirical distribution is generated from the test distribution for the ordered quantiles
corresponding to the values of |
centralPercents |
The vector of proportions determining the central intervals of the empirical distribution of
each ordered quantile from the test distribution.
Default is |
envelope |
If |
drawPercentiles |
If |
drawQuartiles |
If |
legend |
If |
legend.xy |
Either a string being one of |
legend.cex |
Character expansion size for legend (default 0.8). |
nexemplars |
(default is 0) The number of replicate samples to be taken from the test distribution and
plotted as a coloured trail on the qqplot.
Each such trail is a sample of the same size as |
typex |
(default is "o", or match |
plainTrails |
If |
colTrails |
Colours to be used for the trails (default will be multi-colours). |
alphaTrails |
The alpha transparency to be used in plotting all exemplar trails. The default is 0.25.
Because the trails will overplot, a smaller |
lwdTrails |
The graphical line width ( |
lineup |
If Each plot is given a suspect number from 1 to |
nsuspects |
The total number of plots (default is 20) to be viewed in the lineup display when |
col |
If non- |
h |
The hue of the colour of the points. Specified as an angle in degrees from 0 to 360 around a colour wheel. E.g. 0 is red, 120 green, 240 blue, Default is 260 (a bluish). |
c |
The chroma of the colour of the points. Takes values from 0 to an upper bound that is a function of hue, |
l |
The luminance of the colour of the points. Takes values from 0 to 100. For any given combination of hue, |
alpha |
The alpha transparency of the colour of the points. Takes values from 0 to 1. Values near 0 are more transparent, values near 1 (the default) are more opaque. Alpha values sum when points over plot, giving some indication of density. |
cex |
The graphical parameter |
pch |
The graphical parameter |
type |
The graphical parameter |
main |
The graphical parameter |
xlab |
The graphical parameter |
ylab |
The graphical parameter |
xlim |
The graphical parameter |
ylim |
The graphical parameter |
axes |
The graphical parameter |
bty |
The graphical parameter |
... |
Any further graphical parameters to be passed to the |
Displays the qqplot.
Invisibly returns a list with named components x
, y
, and order
giving the horizontal and
vertical locations of the points in sorted order, as well as the order vector from the input data. The result of qqtest
must be assigned to get these.
The values could then, for example, be used to identify or label points in the display.
When lineup
is TRUE
, it returns a string encoding
the true location of the data as a calculation to be evaluated. This provides some simple obfuscation of the true
location so that the visual assessment can be honest. The true location is revealed by calling revealLocation()
on the returned value.
"Self calibrating quantile-quantile plots", R. Wayne Oldford, The American Statistician, 70, (2016) https://doi.org/10.1080/00031305.2015.1090338
"Unbiased Plotting Positions – A Review", C. Cunnane, Journal of Hydrology, Vol. 37 (1978), pp. 205-222.
# # default qqtest plot qqtest(precip, main = "Precipitation (inches/year) in 70 US cities") # # qqtest to compare to qqnorm op <- par(mfrow=c(1,2)) qqnorm(precip, main="qqnorm") qqtest(precip, main="qqtest", xAxisAsProbs=FALSE, yAxisAsProbs=FALSE) par(op) # # Use lines instead of envelope qqtest(precip, envelope=FALSE, drawPercentiles=TRUE, main = "Precipitation (inches/year) in 70 US cities") # # Use quartiles instead of envelope qqtest(precip, envelope=FALSE, drawQuartiles=TRUE, main = "Precipitation (inches/year) in 70 US cities") # # Use coloured exemplars (qqplot of data simulated from the test distribution) # and suppress the envelope. Where the envelope, percentiles, and quartiles are # simulated pointwise bands, exemplars give some sense of what the (joint) shape of the # quantile-quantile plot should look like (for data from the test distribution). # Each simulated sample is a different colour. qqtest(precip, nexemplars=10, typex="o", envelope=FALSE, type="p", main = "Precipitation (inches/year) in 70 US cities") # # Alternatively, the trail of each exemplar could be plain (the identical grey). # Making each trail wide and assigning it some transparency (alpha near 0) # allows the trails to give a sense of the density through the darkness of the grey. # qqtest(precip, nexemplars=20, envelope=FALSE, lwdTrails=3, plainTrails=TRUE, alphaTrail=0.4, typex="o", type="o", main = "Precipitation (inches/year) in 70 US cities") # # Wide coloured exemplars with some transparency provide an indication of # density and allow some trails to be followed by colour. # qqtest(precip, nexemplars=20, envelope=FALSE, lwdTrails=3, alphaTrail=0.4, typex="o", type="o", col="black", main = "Precipitation (inches/year) in 70 US cities") # Envelope and exemplars with coloured trails to be followed. # qqtest(precip, nexemplars=5, lwdTrails=2, alphaTrail=0.6, alpha=0.8, main = "Precipitation (inches/year) in 70 US cities") # # # gaussian - qqplot, but now showing in the line up trueLoc <- qqtest(precip, lineup=TRUE, main="Suspect", legend=FALSE, cex=0.75, col="grey20", ylab="", pch=21) # the location of the real data in the line up can be found by evaluating # the contents of the string trueLoc # # Cut and paste the string contents into the R console, or simply revealLocation(trueLoc) # # # log-normal ... using the bacteria data from Whipple(1916) data(bacteria, package="qqtest") # Note that these are selected percentiles from a sample of 365 days in a year with(bacteria, qqtest(count, dist = "log-normal", p=percentTime/100, np=365, type="o", yAxisAsProbs=FALSE, ylab="bacteria per cc", xAxisProbs = c(0.01, 0.50,0.75, 0.90, 0.95, 0.99, 0.995), xlab="Percentage of days in 1913", main = "Number of bacteria from the Delaware river in 1913") ) ptics <- c(0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99 ) axis(1,at=qnorm(ptics), labels=floor(ptics*100)) yvals <- c(100, 1000, 10000, 100000) axis(2, at=log(yvals,10), labels=c("100", "1,000", "10,000", "100,000")) # # compare this to the log-scaled normal qqplot # # with(bacteria, qqtest(log(count, 10), dist = "normal", p=percentTime/100, np=365, type="o", axes=FALSE, ylab="bacteria per cc", xlab="Proportion of days in 1913", main = "Number of bacteria from the Delaware river in 1913") ) # # # Half normal ... using the penicillin data from Daniel(1959) data(penicillin) qqtest(penicillin, dist = "half-normal") # Or the same again but with significant contrast labelled with (penicillin, {qqtest(value, yAxisProbs=c(0.1, 0.75, 0.90, 0.95), dist="half-normal", ylab="Sample cumulative probability", xlab="Half-normal cumulative probability") ppAdj <- (1+ppoints(31))/2 # to get half-normals from normal x <- qnorm(ppAdj) valOrder <- order(value) # need data and rownames in increasing order y <- value[valOrder] tags <- rownames(penicillin)[valOrder] selPoints <- 28:31 # going to label only the largest effects xoffset <- c(0.01, 0.02, 0.03, 0.075) # text function is a bit off text(x[selPoints]-xoffset, y[selPoints], tags[selPoints], pos=2, cex=0.75) } ) # Alternatively, use the returned results for a lot less work results <- qqtest(penicillin$value, yAxisProbs=c(0.1, 0.75, 0.90, 0.95), dist="half-normal", ylab="Sample cumulative probability", xlab="Half-normal cumulative probability") tags <- row.names(penicillin)[results$order] selPoints <- 28:31 # going to label only the largest effects xoffset <- c(0.01, 0.02, 0.03, 0.075) # text function is a bit off text(results$x[selPoints]-xoffset, results$y[selPoints], tags[selPoints], pos=2, cex=0.75) # or the same points could have been identified interactively vusing # identify(results$x, results$y, labels = row.names(penicillin)[results$order]) # # K on 9 df ... see help(dkay) # Use data on primer paint thickness (standard deviations on n=10) data(primer, package="qqtest") with (primer, qqtest(s, dist="kay", df=9, yAxisAsProbs=FALSE, ylab="Standard deviation of primer thickness (in mils)") ) # # chi-squared on 3 df # Use robust covariance matrix in calculation Mahalanobis distances # for the classical Brownlee stackloss data. data(stacklossDistances, package="qqtest") with(stacklossDistances, qqtest(robust, dist="chi", df=3, ylab="Robust Mahalanobis distances")) # # # user supplied qfunction and rfunction -- compare to beta distribution qqtest(precip, qfunction=function(p){qbeta(p, 2, 2)}, rfunction=function(n){rbeta(n, 2, 2)}, main = "Precipitation (inches/year) in 70 US cities") # # # user supplied qfunction only -- compare to beta distribution qqtest(precip, qfunction=function(p){qbeta(p, 2, 2)}, main = "Precipitation (inches/year) in 70 US cities") # # comparing data samples # # Does the sample of beaver2's temperatures look like they # could have come from a distribution shaped like beaver1's? # qqtest(beaver2[,"temp"], dataTest=beaver1[,"temp"], ylab="Beaver 2", xlab="Beaver 1", main="Beaver body temperatures")
# # default qqtest plot qqtest(precip, main = "Precipitation (inches/year) in 70 US cities") # # qqtest to compare to qqnorm op <- par(mfrow=c(1,2)) qqnorm(precip, main="qqnorm") qqtest(precip, main="qqtest", xAxisAsProbs=FALSE, yAxisAsProbs=FALSE) par(op) # # Use lines instead of envelope qqtest(precip, envelope=FALSE, drawPercentiles=TRUE, main = "Precipitation (inches/year) in 70 US cities") # # Use quartiles instead of envelope qqtest(precip, envelope=FALSE, drawQuartiles=TRUE, main = "Precipitation (inches/year) in 70 US cities") # # Use coloured exemplars (qqplot of data simulated from the test distribution) # and suppress the envelope. Where the envelope, percentiles, and quartiles are # simulated pointwise bands, exemplars give some sense of what the (joint) shape of the # quantile-quantile plot should look like (for data from the test distribution). # Each simulated sample is a different colour. qqtest(precip, nexemplars=10, typex="o", envelope=FALSE, type="p", main = "Precipitation (inches/year) in 70 US cities") # # Alternatively, the trail of each exemplar could be plain (the identical grey). # Making each trail wide and assigning it some transparency (alpha near 0) # allows the trails to give a sense of the density through the darkness of the grey. # qqtest(precip, nexemplars=20, envelope=FALSE, lwdTrails=3, plainTrails=TRUE, alphaTrail=0.4, typex="o", type="o", main = "Precipitation (inches/year) in 70 US cities") # # Wide coloured exemplars with some transparency provide an indication of # density and allow some trails to be followed by colour. # qqtest(precip, nexemplars=20, envelope=FALSE, lwdTrails=3, alphaTrail=0.4, typex="o", type="o", col="black", main = "Precipitation (inches/year) in 70 US cities") # Envelope and exemplars with coloured trails to be followed. # qqtest(precip, nexemplars=5, lwdTrails=2, alphaTrail=0.6, alpha=0.8, main = "Precipitation (inches/year) in 70 US cities") # # # gaussian - qqplot, but now showing in the line up trueLoc <- qqtest(precip, lineup=TRUE, main="Suspect", legend=FALSE, cex=0.75, col="grey20", ylab="", pch=21) # the location of the real data in the line up can be found by evaluating # the contents of the string trueLoc # # Cut and paste the string contents into the R console, or simply revealLocation(trueLoc) # # # log-normal ... using the bacteria data from Whipple(1916) data(bacteria, package="qqtest") # Note that these are selected percentiles from a sample of 365 days in a year with(bacteria, qqtest(count, dist = "log-normal", p=percentTime/100, np=365, type="o", yAxisAsProbs=FALSE, ylab="bacteria per cc", xAxisProbs = c(0.01, 0.50,0.75, 0.90, 0.95, 0.99, 0.995), xlab="Percentage of days in 1913", main = "Number of bacteria from the Delaware river in 1913") ) ptics <- c(0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99 ) axis(1,at=qnorm(ptics), labels=floor(ptics*100)) yvals <- c(100, 1000, 10000, 100000) axis(2, at=log(yvals,10), labels=c("100", "1,000", "10,000", "100,000")) # # compare this to the log-scaled normal qqplot # # with(bacteria, qqtest(log(count, 10), dist = "normal", p=percentTime/100, np=365, type="o", axes=FALSE, ylab="bacteria per cc", xlab="Proportion of days in 1913", main = "Number of bacteria from the Delaware river in 1913") ) # # # Half normal ... using the penicillin data from Daniel(1959) data(penicillin) qqtest(penicillin, dist = "half-normal") # Or the same again but with significant contrast labelled with (penicillin, {qqtest(value, yAxisProbs=c(0.1, 0.75, 0.90, 0.95), dist="half-normal", ylab="Sample cumulative probability", xlab="Half-normal cumulative probability") ppAdj <- (1+ppoints(31))/2 # to get half-normals from normal x <- qnorm(ppAdj) valOrder <- order(value) # need data and rownames in increasing order y <- value[valOrder] tags <- rownames(penicillin)[valOrder] selPoints <- 28:31 # going to label only the largest effects xoffset <- c(0.01, 0.02, 0.03, 0.075) # text function is a bit off text(x[selPoints]-xoffset, y[selPoints], tags[selPoints], pos=2, cex=0.75) } ) # Alternatively, use the returned results for a lot less work results <- qqtest(penicillin$value, yAxisProbs=c(0.1, 0.75, 0.90, 0.95), dist="half-normal", ylab="Sample cumulative probability", xlab="Half-normal cumulative probability") tags <- row.names(penicillin)[results$order] selPoints <- 28:31 # going to label only the largest effects xoffset <- c(0.01, 0.02, 0.03, 0.075) # text function is a bit off text(results$x[selPoints]-xoffset, results$y[selPoints], tags[selPoints], pos=2, cex=0.75) # or the same points could have been identified interactively vusing # identify(results$x, results$y, labels = row.names(penicillin)[results$order]) # # K on 9 df ... see help(dkay) # Use data on primer paint thickness (standard deviations on n=10) data(primer, package="qqtest") with (primer, qqtest(s, dist="kay", df=9, yAxisAsProbs=FALSE, ylab="Standard deviation of primer thickness (in mils)") ) # # chi-squared on 3 df # Use robust covariance matrix in calculation Mahalanobis distances # for the classical Brownlee stackloss data. data(stacklossDistances, package="qqtest") with(stacklossDistances, qqtest(robust, dist="chi", df=3, ylab="Robust Mahalanobis distances")) # # # user supplied qfunction and rfunction -- compare to beta distribution qqtest(precip, qfunction=function(p){qbeta(p, 2, 2)}, rfunction=function(n){rbeta(n, 2, 2)}, main = "Precipitation (inches/year) in 70 US cities") # # # user supplied qfunction only -- compare to beta distribution qqtest(precip, qfunction=function(p){qbeta(p, 2, 2)}, main = "Precipitation (inches/year) in 70 US cities") # # comparing data samples # # Does the sample of beaver2's temperatures look like they # could have come from a distribution shaped like beaver1's? # qqtest(beaver2[,"temp"], dataTest=beaver1[,"temp"], ylab="Beaver 2", xlab="Beaver 1", main="Beaver body temperatures")
revealLocation
Revealing locations encoded as a string calculation.Reveals the location by parsing and evaluating the string.
revealLocation(hiddenLocation)
revealLocation(hiddenLocation)
A character vector of calculation strings. |
Returns the value of the each calculation.
trueLoc <- hideLocation(3,100) trueLoc revealLocation(trueLoc) n <- 200 trueLoc <- sample(1:n, 3) trueLoc ans <- hideLocation(trueLoc , n) ans revealLocation(ans)
trueLoc <- hideLocation(3,100) trueLoc revealLocation(trueLoc) n <- 200 trueLoc <- sample(1:n, 3) trueLoc ans <- hideLocation(trueLoc , n) ans revealLocation(ans)
rkay
The K distribution - generating pseudo-random valuesRandom generation for the K distribution on df
degrees of freedom having non-centrality parameter ncp
.
A K distribution is the square root of a chi-square divided by its degrees of freedom. That is, if x is chi-squared on m degrees of freedom, then y = sqrt(x/m) is K on m degrees of freedom. Under standard normal theory, K is the distribution of the pivotal quantity s/sigma where s is the sample standard deviation and sigma is the standard deviation parameter of the normal density. K is the natural distribution for tests and confidence intervals about sigma. K densities are more nearly symmetric than are chi-squared and concentrate near 1. As the degrees of freedom increase, they become more symmetric, more concentrated, and more nearly normally distributed.
rkay(n, df, ncp = 0)
rkay(n, df, ncp = 0)
n |
Number of observations. If |
df |
Degrees of freedom (non-negative, but can be non-integer). |
ncp |
Non-centrality parameter (non-negative). |
rkay
returns pseudo-randomly generated values.
Invalid arguments will result in return value NaN, with a warning.
Depends on call to analogous chi-squared functions. See rchisq
for details on non-centrality parameter calculations.
x <- rkay(100, 20) hist(x, main="100 observations from a K(20)") # Certainly looks like it comes from a K on 20 qqtest(x, dist="kay",df=20) # for this many degrees of freedom it looks # a lot like a gaussian (normal) distribution qqtest(x, dist="gau",df=1) # But not like it came from a K on 1 degree of freedom qqtest(x, dist="kay",df=1) # # See the vignette for more on the "K-distribution" #
x <- rkay(100, 20) hist(x, main="100 observations from a K(20)") # Certainly looks like it comes from a K on 20 qqtest(x, dist="kay",df=20) # for this many degrees of freedom it looks # a lot like a gaussian (normal) distribution qqtest(x, dist="gau",df=1) # But not like it came from a K on 1 degree of freedom qqtest(x, dist="kay",df=1) # # See the vignette for more on the "K-distribution" #
From measurements made by Francis Galton at London's International Health Exhibition in 1884, published in 1885.
sittingHeights
sittingHeights
A data frame with 9 rows and 8 variates, the first 5 of which are as recorded by Galton:
Sitting height greater than or equal to this lower bound in inches.
Sitting height strictly less than this upper bound in inches.
Number of cases observed with sitting height between the two bounds.
Number of cases observed with sitting height up to but not including the upper bound.
Cumulative number of cases expressed as a percent.
Average of the lower and upper bounds.
Number of cases assigned to the binCentre; half of cases observed with sitting height between the two bounds is assigned to be below the binCentre, half above (avoids producing 100 percent for last entry).
Cumulative proportions using nCasesCentred.
with(sittingHeights, plot(percentCumulative,upperBound, type="o", lwd=2, xlim=c(0,100), ylim=c(20,40),xlab="Percentage of women", ylab="Sitting heights in inches"))
will effect Galton's Ogive.
with(sittingHeights, qqtest(binCentre, dist="normal", p = proportionCumulativeAdjusted, np=775, main="Sitting heights of women in inches"))
will effect a normal qqplot for this data.
"The Application of a Graphic Method to Fallible Measures", Francis Galton, (1885), Journal of the Statistical Society of London, Jubilee Volume (June 22-24, 1885), pp. 262-265.
Mahalanobis distances were calculated using the mahalanobis R function. Under standard normal theory, these are approximately Chi-squared on 3 degrees of freedom.
stacklossDistances
stacklossDistances
A data frame with 21 rows and 2 variates:
Mahalanobis squared distances from the arithemetic mean using the eliptical contours of the sample covariance matrix.
As with distances, these are Mahalobis squared distances butnow based on robust measures of location and covariance matrix (as determined from the default covRob of the robust package).
with(stacklossDistances, qqtest(robust,dist="chi", df=3))
will show "outliers".
with(stacklossDistances, qqtest(ordinary,dist="chi", df=3))
will show "inliers".
"Statistical Theory and Methodology in Science and Engineering", K.A. Brownlee, (1960, 2nd ed. 1965), Wiley, New York pp. 491-500.
First extensive published use of normal qqplots. Hazen uses a=1/2 to make the p values for the plots. Hazen doesn'tplot zeros but has them contribute to the sample size. The context of use is in a study of the relation between the water storage provided in a reservoir on any stream and the quantity of water that can be continuously supplied by it. To quote the paper: ... treat all the remaining variations on the basis of probabilities, using all data from a number of streams; and to study them in comparison with the normal law of error."
WachusettReservoir
WachusettReservoir
A data frame with 15 rows and 6 variates:
Computed storage, in millions of gallons per square mile of land area, given a draft of 100,000 gallons per square mile daily.
Computed storage, in millions of gallons per square mile of land area, given a draft of 200,000 gallons per square mile daily.
Computed storage, in millions of gallons per square mile of land area, given a draft of 400,000 gallons per square mile daily.
Computed storage, in millions of gallons per square mile of land area, given a draft of 600,000 gallons per square mile daily.
Computed storage, in millions of gallons per square mile of land area, given a draft of 800,000 gallons per square mile daily.
Computed storage, in millions of gallons per square mile of land area, given a draft of 1,000,000 gallons per square mile daily.
qqtest(WachusettReservoir$draft800,dist="uniform", a=1/2,type="o")
will effect Hazen's original plot for a draft of 800,000 gallons per square mile daily.
qqtest(WachusettReservoir$draft800,dist="normal", a=1/2, type="o")
will effect Hazen's normal qq plot for a draft of 800,000 gallons per square mile daily.
"Storage to be provided in impounding reservoirs for municipal water supply (with discussion)", Allen Hazen, Transactions of the American Society of Civil Engineers, Vol. 77, (1914), pp. 1539-1669.