# Chapter 11 Mersey V - Statistical analysis

In this final chapter, we will compare the information about catchment characteristics with the water quality data collected at each of the 70 monitoring stations. To begin, load the csv file created at the end of Task 6 (`mersey_watersheds_ea.csv`

), saving to a new variable called `watersheds_df`

:

```
# Reads completed file from csv
watersheds_df <- read.csv(here("output", "practical_2", "mersey_watersheds_ea.csv"))
```

If you have any other variables in your R environment, these can be removed using `rm()`

.

## 11.1 Task 7: Model building

This data frame should contain the following 10 water quality indicators for each watershed:

- pH: acidity/alkalinity;
- SSC: suspended solids concentration;
- Ca: calcium;
- Mg: magnesium;
- NH
_{4}: ammonium; - NO
_{3}: nitrate; - NO
_{2}: nitrite; - TON: total oxidised nitrogen;
- PO
_{4}: phosphate; - Zn: zinc.

It should also contain the continuous derivatives (e.g. average elevation) and categorical derivatives (e.g. land cover percentage) for each watershed.

**Note**: some of your calculated percentages may not add up to 100%. In Task 4, we reclassified only the most important categorical variables. These are known to have the greatest impact of river hydrochemistry (e.g. urban areas, farmland). While other land cover categories are found within each watershed, these typically account for only a small percentage of the total area and have a limited effect on the river environment. These categories have been excluded to simplify the analysis.

### 11.1.1 An introduction to linear models in R

It is now time to examine the relationships between river water quality and catchment metrics. The key model outputs that are ultimately required for the assessment are:

Regression equations for each water quality variable (dependent variable; n = 10) and the key explanatory catchment characteristics (independent variables; n = 16).

Associated model values (R

^{2},*p*value).

Remember, you don’t have to run every code block shown below, but you can do so if it would help your understanding.

The simplest way to run a linear regression in R is to use the `lm()`

function, an example of which is shown below, storing the output in `model`

(you can change this name to reflect the input variables):

We have defined the data frame being used (`data = watersheds_df`

) and the input variables from that data frame. This is achieved by including their column names, shown here:

```
## [1] "Seed_Point_ID" "FID" "EA_ID"
## [4] "Group" "Ph" "SSC"
## [7] "Ca" "Mg" "NH4"
## [10] "NO3" "NO2" "TON"
## [13] "PO4" "Zn" "area"
## [16] "count" "average_elevation" "average_rainfall"
## [19] "average_slope" "average_aspect" "Arable"
## [22] "Heath" "Grassland" "Urban"
## [25] "Wetland" "Permeable" "Impermeable"
## [28] "Gleyed" "Peats" "Sands_and_Muds"
## [31] "Limestone" "Coal" "Arable_percent"
## [34] "Heath_percent" "Grassland_percent" "Urban_percent"
## [37] "Wetland_percent" "Permeable_percent" "Impermeable_percent"
## [40] "Gleyed_percent" "Peats_percent" "Sands_and_Muds_percent"
## [43] "Limestone_percent" "Coal_percent"
```

Input variables in the **formula** are separated by `~`

, where the variable to the left is the dependent variable (`NO2`

) and the variable to the right is an independent variable (`average_elevation`

). We can, however, include **multiple** independent variables to perform multiple linear regression. This is achieved as follows, where additional independent variables are separated by `+`

:

```
# Fits a linear model
model <- lm(formula = NO2 ~ average_elevation + average_rainfall, data = watersheds_df)
```

We can then assess the model output using the `summary`

function:

```
##
## Call:
## lm(formula = NO2 ~ average_elevation + average_rainfall, data = watersheds_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059950 -0.015188 -0.010499 0.002269 0.226625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.525e-02 3.198e-02 2.978 0.00403 **
## average_elevation -2.096e-04 8.951e-05 -2.341 0.02220 *
## average_rainfall -8.358e-06 5.450e-05 -0.153 0.87857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04402 on 67 degrees of freedom
## Multiple R-squared: 0.3109, Adjusted R-squared: 0.2903
## F-statistic: 15.11 on 2 and 67 DF, p-value: 3.828e-06
```

For this set of independent variables, we have an R^{2} of 0.31 (`Multiple R-squared: 0.3109`

) and a model *p* value of < 0.01 (`p-value: 3.828e-06`

).

The model coefficients for the independent variables are described above, where `*`

denotes *p* values < 0.05 (95% probability) and `**`

denotes *p* values < 0.01 (99% probability). As the coefficients are very small, they are presented in scientific notation. These can be converted to numeric (non-scientific) format using the following code:

`## [1] "-0.0002096"`

We can supply multiple values to the `format`

function by creating a vector:

When you’re happy you understanding the formatting of the

`lm`

function, move on to the next section.

### 11.1.2 Training vs. Testing

One limitation of the above approach is that our dataframe (`watersheds_df`

) contains observations from all 70 EA monitoring stations.

When performing statistical analysis, it is common practice to split any dataset into:

- a
**training**subset, which is used to create the model(s). - a
**testing**subset, which is used to evaluate the model(s).

Subsetting our data in this way allows models to be evaluated more rigorously. Many models perform well “in-sample” but poorly “out-of-sample” when evaluated against independent data (i.e. the testing subset). This is commonly referred to as “over-fitting”.

Training and testing subsets are usually defined randomly, with an approximate ratio of 70:30 (although this varies). However, and to ensure reproducibility, this step has been completed for you: the `watersheds_df`

dataframe contains a `group`

variable denoting which monitoring sites belong to the training and testing subsets.

Run the code above to create

`training`

and`testing`

dataframes:

```
# Extracts training dataset, comprising 50 observations (~70%)
training <- subset(watersheds_df, Group == "Training")
# Extracts training dataset, comprising 20 observations (~30%)
testing <- subset(watersheds_df, Group == "Testing")
```

Before you move on to the next section, can you think of any limitations of this approach?

Hints: How important is the training-testing ratio? How are training-testing subsets created?

### 11.1.3 Variable selection strategies

An addition weakness of the above approach is that we have manually defined the independent variables of interest (`average_elevation + average_rainfall`

). For exploratory analysis, however, we may not know which are the most important variables. Perhaps there is a combination of independent variables which produces a better model fit (e.g. R^{2} > 0.31)?

Determining which variables to include/exclude from a model is a very difficult problem, which has resulted in many different variable selection strategies. Common approaches include expert opinion and/or theory, partial least squares (PLS) regression, implemented in `PLS`

, Least Absolute Shrinkage and Selection Operator (LASSO), implemented in `glmnet`

and `LARS`

, as well as elastic net methods and ridge regression, also implemented in `glmnet`

. You may want to explore some of these more complex approaches for your dissertation.

For our analysis, we are going to use a relatively simple method known as **Stepwise Regression**, implemented in the `MASS`

package. This works by including **all** the relevant independent variables in the analysis and then selecting those with the greatest explanatory power.

However, we don’t necessarily want to test *all* model variables. We would probably want to exclude the categorical counts (e.g. Arable, Heath, …) as these factors are already represented by the normalised variables (e.g. Arable_percent, Heath_percent, …), as well as any IDs or geometry variables (area). In general, we are only interested in testing the continuous derivatives (column names starting with `average_`

) and the normalised categorical derivatives (column names ending in ’_percent’).

Rather than typing out the columns of interest manually, we are going to use the `select`

function from the `dplyr`

package to do so:

```
# Creates a vector of column names, including only those which contain "average" or "percent"
factors <- colnames(watersheds_df %>% dplyr::select(contains(c("average", "percent"))))
# Prints to console
factors
```

```
## [1] "average_elevation" "average_rainfall" "average_slope"
## [4] "average_aspect" "Arable_percent" "Heath_percent"
## [7] "Grassland_percent" "Urban_percent" "Wetland_percent"
## [10] "Permeable_percent" "Impermeable_percent" "Gleyed_percent"
## [13] "Peats_percent" "Sands_and_Muds_percent" "Limestone_percent"
## [16] "Coal_percent"
```

Run the above code.

Note, the formatting of`dplyr::select`

may be slightly confusing but it is necessary because there is also a`select`

function in the`MASS`

package. Here, we are telling R to use`select`

from`dplyr`

.

Using this vector of column names, we are going to create a new data frame (called `variables`

) containing only the independent variables of interest. **Crucially**, this is only for the `training`

dataset:

Run the above code and use

`head()`

to inspect the results.

Next, we are going to combine this data frame (`cbind`

) with a dependent variable of interest; we will use NO_{2} as an example. Our new dataframe will be called `model_df`

as it contains all the variables (dependent + independent) required for multiple linear regression. **Note**: by default, `cbind`

will (somewhat unhelpfully) rename input column names e.g. `NO2`

will become `watersheds_df$NO2`

. The code below specifies the new column name as NO2 (`NO2 =`

) for readability:

```
# Column bind the NO2 column with the independent variables from the training dataset
model_df <- cbind(NO2 = training$NO2, variables)
```

When complete, we can then run a new model, making sure to update the data frame used (`data = model_df`

) and updating the formula to `NO2 ~ .`

. This denotes that *all* other data frame columns will be included as independent variables (a useful time saver!):

```
# Fits a linear model, including all other columns (~.) as independent variables
no2_model <- lm(formula = NO2 ~ ., data = model_df)
```

When you’re happy you understand the

`lm`

syntax, combine the two dataframes, run the linear model and inspect the output using`summary()`

. This should resemble the following:

```
##
## Call:
## lm(formula = NO2 ~ ., data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.050772 -0.015450 -0.000014 0.011843 0.078422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.529e-01 1.541e+00 -0.099 0.9215
## average_elevation 2.021e-04 2.073e-04 0.975 0.3367
## average_rainfall 7.315e-05 8.250e-05 0.887 0.3817
## average_slope -1.371e-02 7.519e-03 -1.823 0.0774 .
## average_aspect 9.968e-05 1.795e-04 0.555 0.5824
## Arable_percent -1.306e-05 8.699e-04 -0.015 0.9881
## Heath_percent -7.314e-04 1.155e-03 -0.633 0.5308
## Grassland_percent -6.684e-04 8.739e-04 -0.765 0.4498
## Urban_percent 5.588e-04 9.223e-04 0.606 0.5488
## Wetland_percent -6.225e-04 1.020e-03 -0.610 0.5459
## Permeable_percent 2.592e-03 1.495e-02 0.173 0.8634
## Impermeable_percent 1.826e-03 1.489e-02 0.123 0.9032
## Gleyed_percent 2.285e-03 1.494e-02 0.153 0.8794
## Peats_percent 2.176e-03 1.503e-02 0.145 0.8858
## Sands_and_Muds_percent -5.747e-04 3.790e-03 -0.152 0.8804
## Limestone_percent 2.227e-05 3.986e-03 0.006 0.9956
## Coal_percent -8.088e-04 3.795e-03 -0.213 0.8326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0351 on 33 degrees of freedom
## Multiple R-squared: 0.5483, Adjusted R-squared: 0.3292
## F-statistic: 2.503 on 16 and 33 DF, p-value: 0.01265
```

Our overall model fit (R^{2}) is 0.55 which indicates that the independent variables explain ~55% of variability in the dependent variable. However, the model contains many independent variables which are not statistically significant, here defined as having a *p* value > 0.05.

This number represents the probability that the result has occurred by chance. When values are very small (e.g. *p* < 0.0005), we would typically present these as a discrete value e.g. *p* < 0.05, < 0.01, < 0.001. Generally, we only use models in which we can be 95% confident or higher (i.e. significance level of 0.05 or less).

However, it is important to note that *p* values should be not be considered in isolation and need to be interpreted carefully. For statistical reviews of using and interpreting *p* values, see Goodman (2008) and Andrade (2019). For a broader overview, see the *Nature* commentary by Amrhein *et al.* (2019), as well as a summary article by Vox.

To filter our independent variables to include only the most important, we can use the `step.AIC`

function from the `MASS`

library as follows:

```
# Stepwise regression model
step.model <- stepAIC(no2_model, # Input linear model
direction = "both",
trace = FALSE, # Print out intermediate results?
k = 1)
```

Helpfully, this takes the output of the `lm`

model (`no2_model`

) with no need for any additional data wrangling. The following are important parameters:

`direction = "both"`

:- Determines the method used, either
**forward**or**backward**stepwise regression, or a mixture of**both**. - “Forward” begins with a model with
**no**variables and then starts adding the most significant variables, stopping when there are no more significant variables. - “Backward” begins with a model with
**all**variables and then starts removing the least significant variables, stopping when only significant variables are remaining. - “Both” includes both of the above, allowing for variables to be added/removed at each step.

- Determines the method used, either
`k = 1`

:- The number of degrees of freedom used for the penalty i.e. for determining whether variables are significant or not.

Run the above model (

`direction = "both"`

and`k = 1`

) and print the output using`summary()`

:

```
##
## Call:
## lm(formula = NO2 ~ average_elevation + average_rainfall + average_slope +
## Urban_percent + Permeable_percent + Coal_percent, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055180 -0.015245 -0.002834 0.015796 0.084547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.858e-04 3.930e-02 0.023 0.982122
## average_elevation 1.547e-04 1.487e-04 1.040 0.304078
## average_rainfall 6.188e-05 6.636e-05 0.933 0.356265
## average_slope -1.354e-02 6.091e-03 -2.223 0.031519 *
## Urban_percent 1.121e-03 3.095e-04 3.624 0.000763 ***
## Permeable_percent 3.445e-04 2.074e-04 1.661 0.103993
## Coal_percent -2.604e-04 1.850e-04 -1.408 0.166415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03152 on 43 degrees of freedom
## Multiple R-squared: 0.5253, Adjusted R-squared: 0.459
## F-statistic: 7.93 on 6 and 43 DF, p-value: 8.737e-06
```

As you can see above, using a low threshold for the degrees of freedom (`k = 1`

) means we still have many “non-significant” variables remaining (*p* > 0.05)

Re-run the above model, but increasing the value of

`k`

in intervals of 1 until all the independent variables are significant atp= 0.05 (denoted by`*`

):

```
##
## Call:
## lm(formula = NO2 ~ average_slope + Urban_percent, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.054086 -0.020359 -0.004355 0.012888 0.098635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0435410 0.0126033 3.455 0.00118 **
## average_slope -0.0039045 0.0016317 -2.393 0.02076 *
## Urban_percent 0.0009823 0.0002980 3.297 0.00187 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0317 on 47 degrees of freedom
## Multiple R-squared: 0.4753, Adjusted R-squared: 0.4529
## F-statistic: 21.28 on 2 and 47 DF, p-value: 2.622e-07
```

In general, we prefer models with the minimum number of parameters (independent variables). They require fewer assumptions, less intensive data collection and can be applied more confidently to new data sets/locations. This principle of model parsimony is based upon **Occam’s Razor**: “other things being equal, simpler explanations are generally better than more complex ones”.

Our original model, based upon 16 independent variables had an R^{2} of 0.55. This new model, based upon just 2 independent variables (`average_slope + Urban_percent`

) has an R^{2} of 0.48; a relatively minor reduction in explanatory power given the removal of 14 (arguably unimportant) additional variables.

Our model coefficients are now as follows:

`intercept`

= 0.0435410,*p*= 0.00118 (*p*< 0.01)`average_slope`

= -0.0039045,*p*= 0.02076 (*p*< 0.05)`Urban_percent`

= 0.0009823,*p*= 0.00187 (*p*< 0.01)

Coefficients are important because they are used in **regression equations**, which can then be used to predict values.

The general format for a regression equation is as follows:

\[
y = a + (b_1 \cdot x_1) + (b_2 \cdot x_2) + (b_n \cdot x_n)
\]
where `a`

is the constant (intercept) value, and `b`

is the coefficient of x.

For our NO_{2} model above, we can define our regression equation (presented using sensible data precision) as:

\[
NO_2 = 0.044 + (0.001 \cdot Urban \: percent) + (-0.004 \cdot Average \: slope)
\]
**Well done!** You have now calculated a regression which links the dependent variable (NO_{2}) to a set of independent variables, in the case the average slope of the watershed and the percentage urban land cover.

For your assessment, we would like you to

explainthe regression results, linking to hydrological processes and literature. Think about specific sources of pollution, transport pathways, types of flow…

## 11.2 Task 8: Model evaluation

Having created a statistical model, it is necessary to evaluate its performance. Comparison plots of **measured vs. modelled (or predicted) values** are one common way to assess model quality, alongside other metrics such as root-mean-square error (RMSE), normalised root-mean-square-error (nRMSE), Q-Q plots, or histograms of model residuals. You may want to explore some of these for the assessment.

To calculate modelled values, we can use the `predict()`

function, taking the model variable (`step.model`

) as the input, rather than re-creating the above equation manually in code, and using our `testing`

dataframe for the `newdata`

argument:

```
# Predict NO2 values based upon stepwise model, saving to testing dataframe
testing$predicted_no2 <- predict(step.model, newdata = testing)
```

If`new_data`

is not defined, the `predict`

function uses the fitted values for prediction i.e. the training data used to construct the model (see here).

Run the above code block to predict NO

_{2}concentrations in the testing dataset, based on the regression model produced from the training dataset.

These values could be used to calculate RMSE or other metrics (nRMSE) using your own code or additional packages (e.g. `Metrics`

);

\[ RMSE = \sqrt{mean(measured\:values - modelled\:values)^2} \]

Plots of measured vs. modelled values (as well as Q-Q plots and histograms) can be created in ggplot2. Here is an example:

```
# ggplot of measured vs. modelled (predicted) NO2 values
no2_plot <- ggplot(data = testing, aes(x = NO2, y = predicted_no2)) +
# Adding a linear regression ("lm"), removing standard error bars (se = FALSE)
geom_smooth(method = "lm", se = FALSE, colour="#FF953C") +
# Adds a 1:1 line for comparison
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
# Adds the point data, modifying the shape, size, colour and fill
geom_point(shape = 21, colour = "white", fill = "#5695FF", size = 2.5) +
# Setting the theme and aspect ratio
theme_classic() +
theme(aspect.ratio = 1) +
# Axis limits
scale_x_continuous(limits = c(0,0.3)) +
scale_y_continuous(limits = c(0,0.3)) +
# Add axis labels and a title
labs(x = Measured~NO[2], y = Modelled~NO[2],
title = Plot~of~measured~vs.~modelled~NO[2]~values)
no2_plot
```

Does the regression line match the 1:1 line? Is there any evidence of under- or over-prediction? Are there any outliers? What

typesof errors can you identify?

You could also assess this relationship statistically, using linear regression:

```
# Linear regression of measured vs. modelled NO2 values
prediction_model <- lm(formula = NO2 ~ predicted_no2, data = testing)
# Print summary statistics
summary(prediction_model)
```

```
##
## Call:
## lm(formula = NO2 ~ predicted_no2, data = testing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.084258 -0.027100 0.000118 0.011208 0.152154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01551 0.02322 -0.668 0.51280
## predicted_no2 1.79536 0.47103 3.812 0.00128 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05421 on 18 degrees of freedom
## Multiple R-squared: 0.4466, Adjusted R-squared: 0.4159
## F-statistic: 14.53 on 1 and 18 DF, p-value: 0.001278
```

How well does our NO

_{2}model perform on the testing dataset, based on the above graphs/statistics? Is out-of-sample performance comparable to in-sample performance?

**To finish the practical** and to prepare for the assessment:

Replicating the above approaches, calculate regression equations based on stepwise linear regression for all 10 water quality indicators (NO

_{2}, pH, SSC, Ca, Mg, NH_{4}, NO_{3}, TON, PO_{4}, Zn).

Use the same approach to create new data frames for each indicator, remembering to update the

`k`

parameter in the`step.AIC`

function (beginning at`k = 1`

) to determine the statistically significant variables.

Save the relevant model coefficients and the R

^{2}andpvalues for each equation. These should be stored in a single table for the assessment.

### References

Amrhein, V. *et al.* (2019) ‘Scientists rise up against statistical significance’, *Nature*, 567(7748), pp. 305–307. doi:10.1038/d41586-019-00857-9.

Andrade, C. (2019) ‘The P Value and Statistical Significance: Misunderstandings, Explanations, Challenges, and Alternatives’, *Indian Journal of Psychological Medicine*, 41(3), pp. 210–215. doi:10.4103/IJPSYM.IJPSYM_193_19.

Goodman, S. (2008) ‘A Dirty Dozen: Twelve P-Value Misconceptions’, *Seminars in Hematology*, 45(3), pp. 135–140. doi:10.1053/j.seminhematol.2008.04.003.