---
title: "Introduction to the mda.biber R package"
author: "David Brown"
output: rmarkdown::html_vignette
bibliography: mda_bib.bib
link-citations: yes
nocite: |
  @*
vignette: >
  %\VignetteIndexEntry{introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

## Load the mda.biber package

Load the package, as well as others that we'll use in this vignette.

```{r setup, message = FALSE, error=FALSE, warning=FALSE}
library(mda.biber)
library(corrplot)
library(tidyverse)
library(kableExtra)
```

## Multi-Dimensional Analysis (MDA)

Multi-Dimensional Analysis (MDA) is a [complex statistical procedure](https://www.uni-bamberg.de/fileadmin/eng-ling/fs/Chapter_21/Index.html?Multidimensionalanalysis.html) developed by Douglas Biber. It is largely used to describe language as it varies by [genre, register, and use](https://www.uni-bamberg.de/fileadmin/eng-ling/fs/Chapter_21/Index.html?21Summary.html).

MDA is based on the fundamental linguistic principle that some linguistic variables co-occur (nouns and adjectives, for example) while others inversely co-occur (think nouns and pronouns).

MDA in conducted in the following stages:

1. Identification of relevant variables
1. Extraction of factors from variables
1. Functional interpretation of factors as dimensions
1. Placement of categories on the dimensions

The initial steps are carried out using well-established procures for factor analysis.

## Inspect the data

First, we inspect the data that comes with the package -- counts of features tagged using the [pseudobibeR package](https://cran.r-project.org/package=pseudobibeR).

```{r echo=FALSE}
knitr::kable(head(micusp_biber[,1:6]))
```

[...]

```{r echo=FALSE}
knitr::kable(head(micusp_biber[,63:68]))
```

Note that the first column is the name of a document from the [Michigan Corpus of Upper-Level Student Papers (MICUSP)](https://elicorpora.info/), which includes various kinds of the meta-data. The beginning of the file, for example, has a series of upper-case letters that identify the discipline the paper was written for (BIO = Biology).

Because MDA involves a specific application of factor analysis, the first thing we need to do is to extract that string and covert the column to a factor (or categorical variable).

```{r format_data, message = FALSE, error=FALSE, warning=FALSE}
d <- micusp_biber |>
  mutate(doc_id = str_extract(doc_id, "^[A-Z]+")) |>
  mutate(doc_id = as.factor(doc_id)) |>
  select(where(~ any(. != 0))) # removes any columns containing all zeros as required for corrplot
```

## Determining the number of factors

The package comes with a wrapper for the `nScree()` function in **nFactors**. However, you can easily use any scree plotting function (or alternative method) for determining the number of factors to be calculated.

```{r scree_plot, fig.height=4.5, fig.width=7}
screeplot_mda(d)
```

## Inspecting a correlation matrix

Next, we can create a correlation matrix and plot it using `corrplot()`. Note
that earlier we omitted any columns that contain only zeros; if they existed,
the correlation would be undefined.

```{r, corr_plot, fig.height=5.5, fig.width=7}
# create a correlation matrix
cor_m <- cor(d[,-1], method = "pearson")

# plot the matrix
corrplot(cor_m, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45, diag = F, tl.cex = 0.5)
```

## The `mda_loadings()` function

As the plot makes clear, there are groupings of features that positively and negatively correlate, making our data a good candidate for MDA. To carry out the MDA procedure use the `mda_loadings()` function.

The function requires a data.frame with 1 factor (or categorical variable) and more than 2 numeric variables.

Ideally, need at least 5 times as many texts as variables (i.e. 5 times more rows than columns).

Variables that don't correlate with any other variable are dropped. The default threshold for dropping variables is **0.20**, but that can be changed in the function arguments.

Also, MDA uses a **promax** rotation.

For this demonstration, we'll return 2 factors (`n_factors = 2`):

```{r mda}
m <- mda_loadings(d, n_factors = 2)
```

## MDA data structure

The `mda_loadings()` function returns a data frame containing the dimension score for each document/text. The score is calculated by:

1. Standardizing data by converting to z-scores
1. For each text, summing all of the high-positive variables and subtracting all of the high-negative variables.

The high and high-negative are specified by a threshold value, which is conventionally **0.35**, but can be set in the function arguments.

Also included in the data structure are the means-by-group, which is used for plotting, and the factor loadings. The values are stored as attributes.

### Means

```{r results='hide'}
attributes(m)$group_means
```

```{r echo=FALSE}
knitr::kable(head(attributes(m)$group_means))
```

### Factor loadings

```{r results='hide'}
attributes(m)$loadings %>% arrange(-Factor1)
```

```{r echo=FALSE}
knitr::kable(head(attributes(m)$loadings %>% arrange(-Factor1)))
```

[...]

```{r echo=FALSE}
knitr::kable(tail(attributes(m)$loadings %>% arrange(-Factor1)))
```

## Plotting the results

One conventional way of plotting the results is to [place the means along a cline in a kind of stick plot](https://www.uni-bamberg.de/fileadmin/eng-ling/fs/Chapter_21/Index.html?23DimensionsofEnglish.html).

The package contains a convenience function for making these kinds of plots. As with all plotting functions, it is easy enough to tweak the code and customize your own plots.

```{r stickplot_mda, fig.height=4.5, fig.width=2.5}
stickplot_mda(m, n_factor = 1)
```

Along this particular dimension, Philosophy is positioned at the extreme positive end, while History and various Engineering specialties are positioned at the negative end.

You can also generate a plot that combines the stick plot with a heatmap of the relevant variables and their factor loadings.

```{r heatmap_mda, fig.height=5.5, fig.width=7}
heatmap_mda(m, n_factor = 1)
```

The plot highlights the variables that contribute to the positive end of cline like adverbs, *to be* as a main verb, and first-person pronouns. At the other end, are nouns, longer words, more attributive adjectives, and prepositions.

Alternatively, you can generate a plot of the kind that is common in reporting PCA, which combines scaled vectors of the relevant variable loadings and boxplots of the dimension scores organized by group.

```{r boxplot_mda, fig.height=6.5, fig.width=7}
boxplot_mda(m, n_factor = 1)
```

## Evaluating the dimensions

Finally, the variation explained by each dimension can measured using linear regression.

We will hack together a table that uses ANOVA to evaluate the dimensions and includes the *R*^2^ value.

```{r}

# Carry out regression
f1_lm <- lm(Factor1 ~ group, data = m)
f2_lm <- lm(Factor2 ~ group, data = m)

# Convert ANOVA results into data.frames allows for easier name manipulation
f1_aov <- data.frame(anova(f1_lm), r.squared = c(summary(f1_lm)$r.squared*100, NA))
f2_aov <- data.frame(anova(f2_lm), r.squared = c(summary(f2_lm)$r.squared*100, NA))

# Putting all into one data.frame/table
anova_results <- data.frame(rbind(c("DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^",
                                    "DF", "Sum Sq", "Mean Sq", "F value", "Pr(>F)", "*R*^2^"),
                                   cbind(round(f1_aov, 2), round(f2_aov, 2))))
colnames(anova_results) <- c("", "", "", "", "", "", "", "", "", "", "", "")
row.names(anova_results)[1] <- ""
anova_results[is.na(anova_results)] <- "--"
```

And output the results:

```{r results='asis'}
anova_results %>% knitr::kable("html") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::add_header_above(c("", "Dimension 1" = 6, "Dimension 2" = 6))

```


## Bibliography