# Chapter 11 Mersey V - Statistical analysis

In this final chapter, we will compare the information about catchment characteristics with the water quality data collected at each of the 70 monitoring stations. To begin, load the csv file created at the end of Task 6 (mersey_watersheds_ea.csv), saving to a new variable called watersheds_df:

# Reads completed file from csv
watersheds_df <- read.csv(here("output", "practical_2", "mersey_watersheds_ea.csv"))

If you have any other variables in your R environment, these can be removed using rm().

## 11.1 Task 7: Model building

This data frame should contain the following 10 water quality indicators for each watershed:

• pH: acidity/alkalinity;
• SSC: suspended solids concentration;
• Ca: calcium;
• Mg: magnesium;
• NH4: ammonium;
• NO3: nitrate;
• NO2: nitrite;
• TON: total oxidised nitrogen;
• PO4: phosphate;
• Zn: zinc.

It should also contain the continuous derivatives (e.g. average elevation) and categorical derivatives (e.g. land cover percentage) for each watershed.

Note: some of your calculated percentages may not add up to 100%. In Task 4, we reclassified only the most important categorical variables. These are known to have the greatest impact on river hydrochemistry (e.g. urban areas, farmland). While other land cover categories are found within each watershed, these typically account for only a small percentage of the total area and have a limited effect on the river environment. These categories have been excluded to simplify the analysis.

### 11.1.1 An introduction to linear models in R

It is now time to examine the relationships between river water quality and catchment metrics. The key model outputs that are ultimately required for the assessment are:

1. Regression equations for each water quality variable (dependent variable; n = 10) and the key explanatory catchment characteristics (independent variables; n = 16).

2. Associated model values (R2, p value).

Remember, you don’t have to run every code block shown below, but you can do so if it would help your understanding.

The simplest way to run a linear regression in R is to use the lm() function, an example of which is shown below, storing the output in model (you can change this name to reflect the input variables):

# Fits a linear model
model <- lm(formula = NO2 ~ average_elevation, data = watersheds_df)

We have defined the data frame being used (data = watersheds_df) and the input variables from that data frame. This is achieved by including their column names, shown here:

colnames(watersheds_df)
##  [1] "Seed_Point_ID"          "FID"                    "EA_ID"
##  [4] "Group"                  "Ph"                     "SSC"
##  [7] "Ca"                     "Mg"                     "NH4"
## [10] "NO3"                    "NO2"                    "TON"
## [13] "PO4"                    "Zn"                     "area"
## [16] "count"                  "average_elevation"      "average_rainfall"
## [19] "average_slope"          "average_aspect"         "Arable"
## [22] "Heath"                  "Grassland"              "Urban"
## [25] "Wetland"                "Permeable"              "Impermeable"
## [28] "Gleyed"                 "Peats"                  "Sands_and_Muds"
## [31] "Limestone"              "Coal"                   "Arable_percent"
## [34] "Heath_percent"          "Grassland_percent"      "Urban_percent"
## [37] "Wetland_percent"        "Permeable_percent"      "Impermeable_percent"
## [40] "Gleyed_percent"         "Peats_percent"          "Sands_and_Muds_percent"
## [43] "Limestone_percent"      "Coal_percent"

Input variables in the formula are separated by ~, where the variable to the left is the dependent variable (NO2) and the variable to the right is an independent variable (average_elevation). We can, however, include multiple independent variables to perform multiple linear regression. This is achieved as follows, where additional independent variables are separated by +:

# Fits a linear model
model <- lm(formula = NO2 ~ average_elevation + average_rainfall, data = watersheds_df)

We can then assess the model output using the summary function:

summary(model)
##
## Call:
## lm(formula = NO2 ~ average_elevation + average_rainfall, data = watersheds_df)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.059950 -0.015188 -0.010499  0.002269  0.226625
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        9.525e-02  3.198e-02   2.978  0.00403 **
## average_elevation -2.096e-04  8.951e-05  -2.341  0.02220 *
## average_rainfall  -8.358e-06  5.450e-05  -0.153  0.87857
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04402 on 67 degrees of freedom
## Multiple R-squared:  0.3109, Adjusted R-squared:  0.2903
## F-statistic: 15.11 on 2 and 67 DF,  p-value: 3.828e-06

For this set of independent variables, we have an R2 of 0.31 (Multiple R-squared: 0.3109) and a model p value of < 0.01 (p-value: 3.828e-06).
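
These values do not have to be read off the printed summary by eye. The sketch below shows one way to extract them from the summary object, which may help when collecting results for the assessment. It uses a small synthetic data frame (demo_df, with made-up values) as a stand-in for watersheds_df, so the numbers it produces are illustrative only.

```r
# Synthetic stand-in for watersheds_df (hypothetical values, for illustration only)
set.seed(42)
demo_df <- data.frame(average_elevation = runif(70, 10, 400),
                      average_rainfall  = runif(70, 800, 1600))
demo_df$NO2 <- 0.1 - 0.0002 * demo_df$average_elevation + rnorm(70, sd = 0.02)

# Fits the same form of model as above
demo_model <- lm(formula = NO2 ~ average_elevation + average_rainfall, data = demo_df)
demo_summary <- summary(demo_model)

# Multiple and adjusted R-squared
r2     <- demo_summary$r.squared
adj_r2 <- demo_summary$adj.r.squared

# Overall model p value, recomputed from the F-statistic and its degrees of freedom
f_stat  <- demo_summary$fstatistic
model_p <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

# Coefficient table (estimate, standard error, t value, p value)
coef_table <- coef(demo_summary)
```

The model p value is not stored directly in the summary object, which is why it is recomputed here with pf().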

The model coefficients for the independent variables are listed above, where * denotes p values < 0.05 (significant at the 95% confidence level) and ** denotes p values < 0.01 (99% confidence level). As the coefficients are very small, they are presented in scientific notation. These can be converted to fixed (non-scientific) notation using the following code:

format(-2.096e-04, scientific = FALSE)
## [1] "-0.0002096"

We can supply multiple values to the format function by creating a vector:

format(c(-2.096e-04, -8.358e-06, ...), scientific = FALSE)
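
Rather than copying values by hand, the coefficients can be taken straight from a fitted model. The following is a minimal sketch using synthetic data (demo_df and demo_model are hypothetical names); the same calls should work on any lm object.

```r
# Synthetic example (hypothetical data): any fitted lm object would work here
set.seed(1)
demo_df <- data.frame(x = runif(30, 0, 1000))
demo_df$y <- 0.05 - 0.0002 * demo_df$x + rnorm(30, sd = 0.01)
demo_model <- lm(y ~ x, data = demo_df)

# coef() returns a named numeric vector of the model coefficients
demo_coefs <- coef(demo_model)

# Converts all coefficients to fixed (non-scientific) notation in one call
formatted <- format(demo_coefs, scientific = FALSE)

# signif() can be applied first to limit the number of significant figures
rounded <- format(signif(demo_coefs, 3), scientific = FALSE)
```

Note that format() returns character strings, so the formatted values are for presentation rather than further calculation.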

When you’re happy you understand the formatting of the lm function, move on to the next section.

### 11.1.2 Training vs. Testing

One limitation of the above approach is that our dataframe (watersheds_df) contains observations from all 70 EA monitoring stations.

When performing statistical analysis, it is common practice to split any dataset into:

• a training subset, which is used to create the model(s).
• a testing subset, which is used to evaluate the model(s).

Subsetting our data in this way allows models to be evaluated more rigorously. Many models perform well “in-sample” but poorly “out-of-sample” when evaluated against independent data (i.e. the testing subset). This is commonly referred to as “over-fitting”.

Training and testing subsets are usually defined randomly, with an approximate ratio of 70:30 (although this varies). However, to ensure reproducibility, this step has been completed for you: the watersheds_df dataframe contains a Group variable denoting which monitoring sites belong to the training and testing subsets.

Run the code below to create training and testing dataframes:

# Extracts training dataset, comprising 50 observations (~70%)
training <- subset(watersheds_df, Group == "Training")

# Extracts testing dataset, comprising 20 observations (~30%)
testing <- subset(watersheds_df, Group == "Testing")

Before you move on to the next section, can you think of any limitations of this approach?

Hints: How important is the training-testing ratio? How are training-testing subsets created?
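
To explore these questions, it can help to see how a random split might be produced. The sketch below is not how the Group column was actually created (that is not documented here); it simply illustrates a reproducible random 70:30 split on a synthetic 70-row data frame (demo_df is a hypothetical stand-in).

```r
# Synthetic stand-in for watersheds_df with 70 rows (hypothetical)
demo_df <- data.frame(site = 1:70)

# Fixing the seed makes the "random" split reproducible
set.seed(123)

# Randomly selects ~70% of the row indices for training
train_rows <- sample(nrow(demo_df), size = round(0.7 * nrow(demo_df)))

# Rows not selected for training form the testing subset
demo_training <- demo_df[train_rows, , drop = FALSE]
demo_testing  <- demo_df[-train_rows, , drop = FALSE]
```

Changing the seed changes which sites fall into each subset. With only 70 sites, the fitted model may be sensitive to that choice.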

### 11.1.3 Variable selection strategies

An additional weakness of the above approach is that we have manually defined the independent variables of interest (average_elevation + average_rainfall). For exploratory analysis, however, we may not know which are the most important variables. Perhaps there is a combination of independent variables which produces a better model fit (e.g. R2 > 0.31)?

Determining which variables to include in or exclude from a model is a very difficult problem, which has resulted in many different variable selection strategies. Common approaches include expert opinion and/or theory; partial least squares (PLS) regression, implemented in the pls package; the Least Absolute Shrinkage and Selection Operator (LASSO), implemented in glmnet and lars; and elastic net methods and ridge regression, also implemented in glmnet. You may want to explore some of these more complex approaches for your dissertation.

For our analysis, we are going to use a relatively simple method known as Stepwise Regression, implemented in the MASS package. This works by including all the relevant independent variables in the analysis and then selecting those with the greatest explanatory power.

However, we don’t necessarily want to test all model variables. We would probably want to exclude the categorical counts (e.g. Arable, Heath, …) as these factors are already represented by the normalised variables (e.g. Arable_percent, Heath_percent, …), as well as any IDs or geometry variables (area). In general, we are only interested in testing the continuous derivatives (column names starting with average_) and the normalised categorical derivatives (column names ending in _percent).

Rather than typing out the columns of interest manually, we are going to use the select function from the dplyr package to do so:

# Creates a vector of column names, including only those which contain "average" or "percent"
factors <- colnames(watersheds_df %>% dplyr::select(contains(c("average", "percent"))))

# Prints to console
factors
##  [1] "average_elevation"      "average_rainfall"       "average_slope"
##  [4] "average_aspect"         "Arable_percent"         "Heath_percent"
##  [7] "Grassland_percent"      "Urban_percent"          "Wetland_percent"
## [10] "Permeable_percent"      "Impermeable_percent"    "Gleyed_percent"
## [13] "Peats_percent"          "Sands_and_Muds_percent" "Limestone_percent"
## [16] "Coal_percent"

Run the above code. Note, the formatting of dplyr::select may be slightly confusing but it is necessary because there is also a select function in the MASS package. Here, we are telling R to use select from dplyr.

Using this vector of column names, we are going to create a new data frame (called variables) containing only the independent variables of interest. Crucially, this is only for the training dataset:

variables <- training[factors]

Run the above code and use head() to inspect the results.

Next, we are going to combine this data frame (cbind) with a dependent variable of interest; we will use NO2 as an example. Our new dataframe will be called model_df as it contains all the variables (dependent + independent) required for multiple linear regression. Note: by default, cbind will (somewhat unhelpfully) name the new column after the expression used to create it, e.g. NO2 would become training$NO2. The code below specifies the new column name as NO2 (NO2 =) for readability:

# Column bind the NO2 column with the independent variables from the training dataset
model_df <- cbind(NO2 = training$NO2, variables)

When complete, we can then run a new model, making sure to update the data frame used (data = model_df) and the formula, which becomes NO2 ~ . (with a trailing dot). The dot denotes that all other data frame columns will be included as independent variables (a useful time saver!):

# Fits a linear model, including all other columns (~.) as independent variables
no2_model <- lm(formula = NO2 ~ ., data = model_df)

When you’re happy you understand the lm syntax, combine the two dataframes, run the linear model and inspect the output using summary(). This should resemble the following:

##
## Call:
## lm(formula = NO2 ~ ., data = model_df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.05094 -0.01581  0.00000  0.01189  0.07890
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)            -1.488e-01  1.541e+00  -0.097   0.9237
## average_elevation       2.101e-04  2.093e-04   1.004   0.3227
## average_rainfall        7.235e-05  8.240e-05   0.878   0.3863
## average_slope          -1.432e-02  7.764e-03  -1.844   0.0741 .
## average_aspect          8.238e-05  1.734e-04   0.475   0.6379
## Arable_percent         -3.115e-05  8.696e-04  -0.036   0.9716
## Heath_percent          -6.940e-04  1.157e-03  -0.600   0.5526
## Grassland_percent      -6.541e-04  8.737e-04  -0.749   0.4594
## Urban_percent           5.643e-04  9.223e-04   0.612   0.5449
## Wetland_percent        -5.993e-04  1.020e-03  -0.588   0.5608
## Permeable_percent       2.498e-03  1.496e-02   0.167   0.8684
## Impermeable_percent     1.755e-03  1.489e-02   0.118   0.9069
## Gleyed_percent          2.198e-03  1.495e-02   0.147   0.8840
## Peats_percent           2.075e-03  1.503e-02   0.138   0.8911
## Sands_and_Muds_percent -4.969e-04  3.784e-03  -0.131   0.8963
## Limestone_percent       9.282e-05  3.982e-03   0.023   0.9815
## Coal_percent           -7.355e-04  3.790e-03  -0.194   0.8473
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03511 on 33 degrees of freedom
## Multiple R-squared:  0.5481, Adjusted R-squared:  0.329
## F-statistic: 2.502 on 16 and 33 DF,  p-value: 0.0127

Our overall model fit (R2) is 0.55 which indicates that the independent variables explain ~55% of variability in the dependent variable. However, the model contains many independent variables which are not statistically significant, here defined as having a p value > 0.05.

The p value represents the probability of obtaining a result at least as extreme as the one observed if there were no real relationship between the variables (the null hypothesis). When values are very small (e.g. p = 0.0005), we would typically report them against a threshold, e.g. p < 0.05, < 0.01 or < 0.001. Generally, we only use models in which we can be 95% confident or higher (i.e. a significance level of 0.05 or less).

However, it is important to note that p values should not be considered in isolation and need to be interpreted carefully. For statistical reviews of using and interpreting p values, see Goodman (2008) and Andrade (2019). For a broader overview, see the Nature commentary by Amrhein et al. (2019), as well as a summary article by Vox.

To filter our independent variables to include only the most important, we can use the stepAIC function from the MASS package as follows:

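# Stepwise regression model (stepAIC is provided by the MASS package)
step.model <- stepAIC(no2_model, # Input linear model
                      direction = "both", # Stepwise method: "both", "backward" or "forward"
                      trace = FALSE, # Print out intermediate results?
                      k = 1) # Multiple of the degrees of freedom used for the penalty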

Helpfully, this takes the output of the lm model (no2_model) with no need for any additional data wrangling. The following are important parameters:

• direction = "both":
  • Determines the method used: forward or backward stepwise regression, or a mixture of both.
  • "forward" begins with a model with no variables and then adds the variable which most improves the model at each step, stopping when no further addition improves it.
  • "backward" begins with a model with all variables and then removes the variable which contributes least at each step, stopping when no further removal improves it.
  • "both" combines the above, allowing variables to be added or removed at each step.
• k = 1:
  • The multiple of the number of degrees of freedom used for the penalty, i.e. how heavily each additional variable is penalised. k = 2 gives the standard AIC; larger values of k result in fewer variables being retained.
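
To illustrate the effect of the penalty, the sketch below compares two commonly used values of k on synthetic data: k = 2 (the standard AIC) and k = log(n) (often referred to as BIC). The data frame and variable names (demo_df, x1, x2, x3) are hypothetical; only x1 is related to y by construction.

```r
library(MASS)

# Synthetic data (hypothetical): y depends strongly on x1; x2 and x3 are pure noise
set.seed(10)
n <- 50
demo_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo_df$y <- 2 * demo_df$x1 + rnorm(n)

# Full model containing every candidate variable
full_model <- lm(y ~ ., data = demo_df)

# k = 2 corresponds to the standard AIC penalty
step_aic <- stepAIC(full_model, direction = "both", trace = FALSE, k = 2)

# k = log(n) corresponds to the stricter BIC penalty
step_bic <- stepAIC(full_model, direction = "both", trace = FALSE, k = log(n))

# Compares which variables were retained under each penalty
names(coef(step_aic))
names(coef(step_bic))
```

A stricter penalty would generally be expected to retain the same or fewer variables, although the exact selection depends on the data.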

Run the above model (direction = "both" and k = 1) and print the output using summary():

##
## Call:
## lm(formula = NO2 ~ average_elevation + average_rainfall + average_slope +
##     Urban_percent + Permeable_percent + Coal_percent, data = model_df)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.055272 -0.015115 -0.003007  0.015961  0.084550
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        6.584e-04  3.927e-02   0.017 0.986701
## average_elevation  1.594e-04  1.497e-04   1.065 0.292790
## average_rainfall   6.196e-05  6.626e-05   0.935 0.354980
## average_slope     -1.400e-02  6.255e-03  -2.239 0.030399 *
## Urban_percent      1.122e-03  3.093e-04   3.628 0.000753 ***
## Permeable_percent  3.384e-04  2.054e-04   1.648 0.106710
## Coal_percent      -2.648e-04  1.855e-04  -1.428 0.160617
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0315 on 43 degrees of freedom
## Multiple R-squared:  0.526,  Adjusted R-squared:  0.4598
## F-statistic: 7.952 on 6 and 43 DF,  p-value: 8.487e-06

As you can see above, using a low penalty (k = 1) means we still have many “non-significant” variables remaining (p > 0.05).

Re-run the above model, increasing the value of k in intervals of 1 until all the independent variables are significant at p < 0.05 (denoted by at least one *):

##
## Call:
## lm(formula = NO2 ~ average_slope + Urban_percent, data = model_df)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.054050 -0.020404 -0.004278  0.012876  0.098725
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.0435188  0.0125665   3.463  0.00115 **
## average_slope -0.0040031  0.0016679  -2.400  0.02040 *
## Urban_percent  0.0009819  0.0002977   3.299  0.00186 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03169 on 47 degrees of freedom
## Multiple R-squared:  0.4756, Adjusted R-squared:  0.4533
## F-statistic: 21.31 on 2 and 47 DF,  p-value: 2.582e-07
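
The process of increasing k by hand can also be automated. The following is a minimal sketch on synthetic data (demo_df and its columns are hypothetical), which stops at the first value of k for which every remaining independent variable has p < 0.05. The upper limit of 20 is an arbitrary safeguard, not a recommended value.

```r
library(MASS)

# Synthetic training data (hypothetical): y depends on x1 and x2 only
set.seed(7)
n <- 50
demo_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
demo_df$y <- 1.5 * demo_df$x1 - 1 * demo_df$x2 + rnorm(n)

# Full model containing every candidate variable
full_model <- lm(y ~ ., data = demo_df)

# Increases k in steps of 1 until every remaining independent variable has p < 0.05
for (k_value in 1:20) {
  step_model <- stepAIC(full_model, direction = "both", trace = FALSE, k = k_value)
  # Extracts the p values from the coefficient table, dropping the intercept (row 1)
  p_values <- coef(summary(step_model))[-1, "Pr(>|t|)"]
  if (all(p_values < 0.05)) break
}

# Prints the value of k at which the loop stopped, and the resulting model
k_value
summary(step_model)
```

It is still worth inspecting the intermediate models manually, as the loop gives no insight into why variables were removed.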

In general, we prefer models with the minimum number of parameters (independent variables). They require fewer assumptions, less intensive data collection and can be applied more confidently to new data sets/locations. This principle of model parsimony is based upon Occam’s Razor: “other things being equal, simpler explanations are generally better than more complex ones”.

Our original model, based upon 16 independent variables had an R2 of 0.55. This new model, based upon just 2 independent variables (average_slope + Urban_percent) has an R2 of 0.48; a relatively minor reduction in explanatory power given the removal of 14 (arguably unimportant) additional variables.

Our model coefficients are now as follows:

• intercept = 0.0435188, p = 0.00115 (p < 0.01)
• average_slope = -0.0040031, p = 0.02040 (p < 0.05)
• Urban_percent = 0.0009819, p = 0.00186 (p < 0.01)

Coefficients are important because they are used in regression equations, which can then be used to predict values.

The general format for a regression equation is as follows:

$y = a + (b_1 \cdot x_1) + (b_2 \cdot x_2) + \dots + (b_n \cdot x_n)$

where a is the constant (intercept) value, and each b is the coefficient of the corresponding x.

For our NO2 model above, we can define our regression equation (presented using sensible data precision) as:

$NO_2 = 0.044 + (0.001 \cdot Urban \: percent) + (-0.004 \cdot Average \: slope)$

Well done! You have now calculated a regression which links the dependent variable (NO2) to a set of independent variables, in this case the average slope of the watershed and the percentage urban land cover.
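
To see how the coefficients and the regression equation relate, the sketch below applies the equation by hand and checks the result against R's own fitted values. It uses synthetic data (demo_df, with made-up values) that borrows the variable names from the model above, so the coefficients it produces will differ from those reported here.

```r
# Synthetic data (hypothetical) using the same variable names as the model above
set.seed(3)
demo_df <- data.frame(average_slope = runif(50, 0, 15),
                      Urban_percent = runif(50, 0, 100))
demo_df$NO2 <- 0.04 - 0.004 * demo_df$average_slope +
  0.001 * demo_df$Urban_percent + rnorm(50, sd = 0.01)

demo_model <- lm(NO2 ~ average_slope + Urban_percent, data = demo_df)

# Extracts the intercept (a) and the coefficients (b1, b2)
b <- coef(demo_model)

# Applies y = a + (b1 * x1) + (b2 * x2) by hand
manual <- b["(Intercept)"] +
  b["average_slope"] * demo_df$average_slope +
  b["Urban_percent"] * demo_df$Urban_percent

# The hand calculation should agree with R's fitted values
all.equal(unname(manual), unname(fitted(demo_model)))
```
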

For your assessment, we would like you to explain the regression results, linking to hydrological processes and literature. Think about specific sources of pollution, transport pathways, types of flow…

## 11.2 Task 8: Model evaluation

Having created a statistical model, it is necessary to evaluate its performance. Comparison plots of measured vs. modelled (or predicted) values are one common way to assess model quality, alongside other metrics such as root-mean-square error (RMSE), normalised root-mean-square-error (nRMSE), Q-Q plots, or histograms of model residuals. You may want to explore some of these for the assessment.

To calculate modelled values, we can use the predict() function, taking the model variable (step.model) as the input, rather than re-creating the above equation manually in code, and using our testing dataframe for the newdata argument:

# Predict NO2 values based upon stepwise model, saving to testing dataframe
testing$predicted_no2 <- predict(step.model, newdata = testing)

If newdata is not defined, the predict function returns the fitted values, i.e. predictions for the training data used to construct the model.

Run the above code block to predict NO2 concentrations in the testing dataset, based on the regression model produced from the training dataset.

These values could be used to calculate RMSE or other metrics (nRMSE) using your own code or additional packages (e.g. Metrics):

$RMSE = \sqrt{mean((measured\:values - modelled\:values)^2)}$
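
If you prefer to write your own code, RMSE could be calculated along the following lines. The function names (rmse, nrmse) are hypothetical helpers rather than part of any package, and normalising by the range of the measured values is only one of several conventions in use.

```r
# Root-mean-square error: square the errors, average them, then take the square root
rmse <- function(measured, modelled) {
  sqrt(mean((measured - modelled)^2, na.rm = TRUE))
}

# Normalised RMSE; dividing by the range of measured values is one common
# convention (an assumption here; the mean is sometimes used instead)
nrmse <- function(measured, modelled) {
  rmse(measured, modelled) / diff(range(measured, na.rm = TRUE))
}

# Small worked example with made-up values
measured <- c(1, 2, 3)
modelled <- c(1, 2, 5)
rmse(measured, modelled)   # sqrt(4 / 3), approximately 1.155
nrmse(measured, modelled)  # approximately 0.577
```

For the practical, the equivalent call would be rmse(testing$NO2, testing$predicted_no2).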

Plots of measured vs. modelled values (as well as Q-Q plots and histograms) can be created in ggplot2. Here is an example:

# ggplot of measured vs. modelled (predicted) NO2 values
no2_plot <- ggplot(data = testing, aes(x = predicted_no2, y = NO2)) +
# Adding a linear regression ("lm"), removing standard error bars (se = FALSE)
geom_smooth(method = "lm", se = FALSE, colour="#5695FF") +
# Adds a 1:1 line for comparison
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
# Adds the point data, modifying the shape, size, colour and fill
geom_point(shape = 21, colour = "white", fill = "#5695FF", size = 2) +
# Setting the theme and aspect ratio
theme_classic() +
theme(aspect.ratio = 1) +
# Add axis labels and a title
labs(x = Modelled~NO[2], y = Measured~NO[2],
title = Plot~of~measured~vs.~modelled~NO[2]~values)

no2_plot

Does the regression line match the 1:1 line? Is there any evidence of under- or over-prediction? Are there any outliers? What types of errors can you identify?

You could also assess this relationship statistically, using linear regression:

# Linear regression of measured vs. modelled NO2 values
prediction_model <- lm(formula = NO2 ~ predicted_no2, data = testing)

# Print summary statistics
summary(prediction_model)
##
## Call:
## lm(formula = NO2 ~ predicted_no2, data = testing)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.084244 -0.027091  0.000268  0.011034  0.151937
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.0155     0.0232  -0.668  0.51244
## predicted_no2   1.7953     0.4705   3.816  0.00127 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05418 on 18 degrees of freedom
## Multiple R-squared:  0.4472, Adjusted R-squared:  0.4165
## F-statistic: 14.56 on 1 and 18 DF,  p-value: 0.001265

How well does our NO2 model perform on the testing dataset, based on the above graphs/statistics? Is out-of-sample performance comparable to in-sample performance?

To finish the practical and to prepare for the assessment:

Replicating the above approaches, calculate regression equations based on stepwise linear regression for all 10 water quality indicators (NO2, pH, SSC, Ca, Mg, NH4, NO3, TON, PO4, Zn).

Use the same approach to create new data frames for each indicator, remembering to update the k parameter in the stepAIC function (beginning at k = 1) to determine the statistically significant variables.

Save the relevant model coefficients and the R2 and p values for each equation. These should be stored in a single table for the assessment.
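
One possible way to collect results into a single table is sketched below. It uses synthetic data and only two indicators, and fits plain lm models for brevity; in the practical, the stepwise selection described above would be applied before summarising each model. All object names (demo_df, indicators, predictors, results_table) are hypothetical. Note that indicator names must match the output of colnames() exactly (e.g. pH is stored as Ph).

```r
# Synthetic stand-in for the training data (hypothetical values and relationships)
set.seed(5)
demo_df <- data.frame(average_slope = runif(50, 0, 15),
                      Urban_percent = runif(50, 0, 100))
demo_df$NO2 <- 0.04 - 0.004 * demo_df$average_slope + rnorm(50, sd = 0.01)
demo_df$Ph  <- 7.5 + 0.005 * demo_df$Urban_percent + rnorm(50, sd = 0.2)

indicators <- c("NO2", "Ph")
predictors <- c("average_slope", "Urban_percent")

# Fits one model per indicator and returns its coefficients as rows of a data frame
results_list <- lapply(indicators, function(indicator) {
  model_formula <- reformulate(predictors, response = indicator)
  model_summary <- summary(lm(model_formula, data = demo_df))
  coefs <- coef(model_summary)
  data.frame(indicator = indicator,
             term      = rownames(coefs),
             estimate  = coefs[, "Estimate"],
             p_value   = coefs[, "Pr(>|t|)"],
             r_squared = model_summary$r.squared,
             row.names = NULL)
})

# Stacks the per-indicator results into a single table
results_table <- do.call(rbind, results_list)
results_table
```

The resulting data frame could then be saved with write.csv() for use in the assessment.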

### References

Amrhein, V. et al. (2019) ‘Scientists rise up against statistical significance’, Nature, 567(7748), pp. 305–307. doi:10.1038/d41586-019-00857-9.
Andrade, C. (2019) ‘The P Value and Statistical Significance: Misunderstandings, Explanations, Challenges, and Alternatives’, Indian Journal of Psychological Medicine, 41(3), pp. 210–215. doi:10.4103/IJPSYM.IJPSYM_193_19.
Goodman, S. (2008) ‘A Dirty Dozen: Twelve P-Value Misconceptions’, Seminars in Hematology, 45(3), pp. 135–140. doi:10.1053/j.seminhematol.2008.04.003.