--- title: "Data Analysis" author: "R. W. Oldford" date: "August 21, 2018" bibliography: eikosograms.bib output: html_document: number_sections: no toc: yes html_notebook: default html_vignette: number_sections: no toc: yes pdf_document: keep_tex: yes latex_engine: xelatex number_sections: no toc: yes word_document: default vignette: > %\VignetteIndexEntry{Data analysis with eikosograms - different data structure examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{gridExtra} header-includes: - \usepackage{graphicx} - \usepackage{epic} - \usepackage{color} - \usepackage{hyperref} - \usepackage{multimedia} - \PassOptionsToPackage{pdfmark}{hyperref}\RequirePackage{hyperref} - \pgfdeclareimage[height=0.12\textheight]{university-logo}{../../UWlogo.png} - \logo{\pgfuseimage{university-logo}} - \newcommand{\code}[1]{\texttt{#1}} - \newcommand{\ve}[1]{\mathbf{#1}} - \newcommand{\pop}[1]{\mathcal{#1}} - \newcommand{\samp}[1]{\mathcal{#1}} - \newcommand{\subspace}[1]{\mathcal{#1}} - \newcommand{\sv}[1]{\boldsymbol{#1}} - \newcommand{\sm}[1]{\boldsymbol{#1}} - \newcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}} - \newcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert} - \newcommand{\size}[1]{\left\lvert {#1} \right\rvert} - \newcommand{\norm}[1]{\left|\left|{#1}\right|\right|} - \newcommand{\field}[1]{\mathbb{#1}} - \newcommand{\Reals}{\field{R}} - \newcommand{\Integers}{\field{Z}} - \newcommand{\Naturals}{\field{N}} - \newcommand{\Complex}{\field{C}} - \newcommand{\Rationals}{\field{Q}} - \newcommand{\widebar}[1]{\overline{#1}} - \newcommand{\wig}[1]{\tilde{#1}} - \newcommand{\bigwig}[1]{\widetilde{#1}} - \newcommand{\leftgiven}{~\left\lvert~} - \newcommand{\given}{~\vert~} - \newcommand{\indep}{\bot\hspace{-.6em}\bot} - \newcommand{\notindep}{\bot\hspace{-.6em}\bot\hspace{-0.75em}/\hspace{.4em}} - \newcommand{\depend}{\Join} - \newcommand{\notdepend}{\Join\hspace{-0.9 em}/\hspace{.4em}} - \newcommand{\imply}{\Longrightarrow} - \newcommand{\notimply}{\Longrightarrow \hspace{-1.5em}/ \hspace{0.8em}} - \newcommand*{\intersect}{\cap} - \newcommand*{\union}{\cup} - \DeclareMathOperator*{\argmin}{arg\,min} - \DeclareMathOperator*{\argmax}{arg\,max} - \DeclareMathOperator*{\Ave}{Ave\,} - \newcommand{\suchthat}{~:~} - \newcommand{\st}{~:~} - \newcommand{\permpause}{\pause} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(knitr) set.seed(12314159) imageDirectory <- "./img" dataDirectory <- "./data" path_concat <- function(path1, path2, sep="/") {paste(path1, path2, sep = sep)} ``` --- $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ The other vignettes emphasize the use of eikosograms in understanding probability and relations between categorical variates. Here, we just give a few example of the use of eikosograms in data analysis. Because categorical data can present itself in a variety of forms (e.g. `tables`, `factors` in a data frame, etc.), the illustrations centre around different types of data structure # tables Contingency tables (multi-way tables or "data cubes") are possibly the most common arrangement of categorical data. They are used, for example, in the introductory vignette. Here, we present a simple analysis of the University of California (Berkeley) graduate school admissions. Note that eikosograms are `grid` objects (`eikos()` returns a `grob`) and so may be manipulated as any other. For example, several eikosograms could appear in the same display using `grid.arrange()` (from the `gridExtra` package). ```{r libraries} library(eikosograms) library(gridExtra) ``` University administrators were apparently alarmed (as they often are) when they looked at the numbers and saw the relationship between gender and admission rates. ```{r UCBAdmissions 2 way, eval = FALSE, echo = TRUE, fig.width=8, fig.height=4, fig.align="center", out.width="80%"} e1 <- eikos(Admit ~ Gender, data = UCBAdmissions, yaxs = FALSE, xaxs = FALSE, draw = FALSE) e2 <- eikos(Admit ~ Dept, data = UCBAdmissions, yaxs = FALSE, xaxs = FALSE, draw = FALSE) # Using the gridExtra package, draw these in a single plot grid.arrange(e1, e2, nrow = 1) ``` ```{r png UCBAdmissions 2 way, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="80%"} include_graphics("img/DataAnalysis/UCB2way.png") ``` At left we see that male applicants are admitted more frequently than female applicants, suggesting a possible gender bias to university administrators. At right, we see that some departments admit more applicants than others, suggesting some may be more difficult to get into than others. When all three variates are considered together, the true story emerges ```{r UCBAdmissions 3 way, echo = TRUE, eval = FALSE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(Admit ~ Gender + Dept, data = UCBAdmissions, yaxs = FALSE, xaxs = FALSE, lock_aspect = FALSE, xlab_rot = 90, xvals_size = 8, ispace = list(bottom = 15)) ``` ```{r png UCBAdmissions 3 way, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="80%"} include_graphics("img/DataAnalysis/UCB3way.png") ``` In all departments but C and E (which are near equal), a greater proportion of female applicants were accepted than males. This is a famous example of Simpson's paradox. As can be seen perhaps more clearly from the following eikosogram, the reason for the apparent favouritism towards male applicants occured only because a great many more males than females applied to those departments with the highest admission rates. And in those departments (A and B) the admission rates for females was much higher than that for males! ```{r UCBAdmissions, eval = FALSE, echo = TRUE, fig.width=8, fig.height=4, fig.align="center", out.width="80%"} eikos(Gender ~ Dept, data = UCBAdmissions, yprobs = seq(0,1,0.25), xaxs = FALSE) ``` ```{r png UCBAdmissions, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="50%"} include_graphics("img/DataAnalysis/UCBAdmissions.png") ``` Female applicants outnumber male applicants only in departments C and E. # cross tabulation Sometimes the counts appear as a column in a spreadsheet (or `data.frame`). Here we consider the counts for habitat type of adult male **anolis** lizards from Bimini. These are taken frpm @Schoener_1968 and are saved here as a text file. ```{r read lizard data} lizards <- read.table("data/AnolisLizards.txt", header=TRUE) lizards ``` As can be seen, there are three categorical variates, the `species` of the *anolis* lizard, the height (in feet) of where it was observed, and the diameter (in inches) of the branch on which it was perched. The number of such observations is given as the fourth variable, `count`. For example, this can be a very convenient form of the data for selecting subsets, particularly for large data sets. Here we might separate the two species: ```{r separate species} sagrei <- lizards[lizards$species == "sagrei", -1] augusticeps <- lizards[lizards$species == "augusticeps", -1] ``` Following @fienberg80, we might first consider the relation between perch height and perch diameter for the `sagrei` species. To look at the eikosigram, these need to be turned into a table using the cross-tabulation `xtabs()` function: ```{r sagreiTable, eval = TRUE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} sagreiTable <- xtabs(count ~ perch_height_ft + perch_diameter_inches, data = sagrei) ``` and draw it ```{r sagrei perch, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(perch_height_ft ~ perch_diameter_inches, data = sagreiTable, main = "Habitat of adult male anolis sagrei lizards") ``` ```{r png sagrei perch, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="50%"} include_graphics("img/DataAnalysis/sagreiPerch.png") ``` The two heights of the bars are very nearly the same, suggesting that the height and the diameter of the perch might be independent variates for the *anolis sagrei* lizard of Bimini. Indeed, a formal test of independence gives no evidence against this hypothesis. ```{r sagrei independence} chisq.test(sagreiTable) ``` Now, were we to consider the same for *anolis augusticeps*, first the table ```{r augusticepsTable, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} augusticepsTable <- xtabs(count ~ perch_height_ft + perch_diameter_inches, data = augusticeps) ``` and then its probability picture ```{r augusticeps perch, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(perch_height_ft ~ perch_diameter_inches, data = augusticepsTable, main = "Habitat of adult male anolis augusticeps lizards") ``` ```{r png augusticeps perch, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="50%"} include_graphics("img/DataAnalysis/augusticepsPerch.png") ``` Clearly, the bars do not appear to be of equal height. That the eikosogram is not even nearly flat is **suggestive** that perch height and diameter are not independent for the species *anolis augusticeps*. However, there needs to be some **caution** exercised here. As with *anolis sagrei*, one needs to formally test the hypothesis of independence for a sample of counts. There are so few observations (1,2, and 3 in three cells and 21 in the largest) that any formal test will also show **no evidence against the hypothesis of independence** in this case. (This includes a line up test using eikosograms as the display for each generated data set!) All three variates could be examined at once by producing the three way cross-classified table and the corresponding eikosograms. Following @fienberg80, we first consider the relation between perch height and perch diameter for the `sagrei` species of the anolis lizard. The table ```{r lizards three way table, eval = TRUE, echo = TRUE, fig.width=7, fig.height=4, fig.align="center", out.width="80%"} lizardsTable <- xtabs(count ~ species + perch_height_ft + perch_diameter_inches, data = lizards) ``` and its picture ```{r lizards three way, eval = FALSE, echo = TRUE, fig.width=7, fig.height=4, fig.align="center", out.width="80%"} eikos(species ~ . , data = lizardsTable, lock_aspect = FALSE, yprobs = seq(0,1, 0.1)) ``` ```{r png lizards three way, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="50%"} include_graphics("img/DataAnalysis/lizards3way.png") ``` # listings (data frame rows) The final way in which categorical data is commonly presented is in a listing (e.g. `data.frame`) where each row corresponds to a single occurrence of that combination of values for the variates. For example, consider the `mtcars` dataset in `R`, the first few rows of which are ```{r, echo = FALSE} knitr::kable(head(mtcars)) ``` Although all of these variables are numeric, it is clear that many of them could be treated as categorical. For example, `vs` is a simple indicator variable indicating the shape of the engine block -- `0` if it the cylinders are arranged in a "V" pattern and 1 if they are aligned in a straight line. This variable might instead have been represented as a `factor`. The same could be said of the variable `am` which is also a binary variable indicating `0` for an automatic transmission and 1 for a manual. To make the point, we replace these two numeric variables by factors. ```{r mtcars factor creation} mtcars$vs <- factor(mtcars$vs, labels = c("V-shaped", "straight")) mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual")) ``` so that the first few rows of `mtcars` now look like ```{r, echo = FALSE} knitr::kable(head(mtcars)) ``` The eikosogram of these two factors can be compared without contructing a table; `eikos()` does the counting of matching rows to produce the picture. ```{r am vs, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(am ~ vs, data = mtcars) ``` ```{r png r am vs, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="50%"} include_graphics("img/DataAnalysis/amvs.png") ``` This matching **does not depend on the variables being factors**. Any variable in a `data.frame` would do and would be treated as if it were categorical. In the `mtcars` data, there are several variables that are effectively *ordinal categorical* variates, namely `cyl`, `gear`, and `carb`. These too could be explored using eikosograms. Transmission type versus number of cylinders ```{r am and ordinal 1, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(am ~ cyl, data = mtcars) ``` ```{r png am and ordinal 1, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="40%"} include_graphics("img/DataAnalysis/amordinal1.png") ``` and number of forward gears versus number of cylinders ```{r ordinal 2, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos(gear ~ cyl, data = mtcars) ``` ```{r png ordinal 2, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="37%"} include_graphics("img/DataAnalysis/ordinal2.png") ``` # fitted models There are numerous types of models that can be fitted to categorical data. One of the more popular are generalized linear models. For example, we might fit a log-linear model to the `lizards` habitat data. ```{r poisson model, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} fittedModel <- glm(count ~ species + perch_height_ft, family="poisson", data = lizards) ``` This is a "main effects" only model and contains no interaction term. To see what this model asserts about the relationship between `species` and `perch_height`, we need to use the `fitted.values` from the model as the **expected counts**. A new `data.frame` is constructed from the fit as follows: ```{r poisson model fitted data, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} # Can simply append the fitted values to the lizards to get a new data frame lizardsFit <- data.frame(lizards, fit = fittedModel$fitted.values) # and create the table fitTable <- xtabs(fit ~ species + perch_height_ft, data=lizardsFit) ``` The eikosogram corresponding to the model fitted to these two variables is ```{r poisson model asserts, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} eikos("species", "perch_height_ft", data = fitTable) ``` ```{r png poisson model asserts, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="60%"} include_graphics("img/DataAnalysis/poisson1.png") ``` from which we can see that the model is asserting that `species` and `perch_height_ft` are independently distributed. In log-linear modelling, independence and conditional independence are asserted by the interaction terms which appear in these (hierarchical) models. For example, the following model forces independence between `perch_height_ft` and `perch_diameter_inches`, *conditional on* `species`. ```{r poisson model 3way, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} fittedModel3way <- glm(count ~ species + perch_height_ft + perch_diameter_inches + perch_height_ft * species + perch_diameter_inches * species, family="poisson", data = lizards) ``` The absence of the terms `perch_height_ft * perch_diameter_inches` and `perch_height_ft * perch_diameter_inches * species` means that when `species` is held fixed, the terms `perch_height_ft` and `perch_diameter_inches` are sepable in the model. Hence they are conditionally independent. As before, we can see this by viewing the eikosogram for the fitted values. ```{r poisson model fitted data 3way, eval = FALSE, echo = TRUE, fig.width=12, fig.height=4, fig.align="center", out.width="80%"} # Can simply append the fitted values to the lizards to get a new data frame lizardsFit3way <- data.frame(lizards, fit = fittedModel3way$fitted.values) # and create the table fitTable3way <- xtabs(fit ~ species + perch_height_ft + perch_diameter_inches, data=lizardsFit3way) # and show the eikosograms eikos(y = "perch_diameter_inches", x = c("perch_height_ft", "species"), data = fitTable3way, xlab_rot = 30) ``` ```{r png poisson model fitted data 3way, eval = TRUE, echo = FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="60%"} include_graphics("img/DataAnalysis/poisson2.png") ``` and the conditional independence asserted by the model becomes obvious. See also [vignette on independence exploration](../inst/doc/IndependenceExploration.html). # references