A modeling example

Julien Vollering

2019-05-30

This vignette is meant to introduce users to the MIAmaxent package by providing a worked example of a distribution modeling exercise. It shows how to use all of the top-level functions included in MIAmaxent in the order of a typical analysis. This vignette does NOT describe the theoretical underpinnings of the package. To learn more about the theory behind MIAmaxent, the user is referred to Halvorsen (2013) and Halvorsen et al. (2015), as well as other references listed in the documentation of the package.

Help pages for the package and for individual functions in the package can be accessed in R using the standard help command: ?"MIAmaxent" (after attaching the package using library(MIAmaxent)).

Abbreviations: EV = explanatory variable; DV = derived variable; RV = response variable; FOP = frequency of observed presence; AUC = area under the curve of the receiver operating characteristic plot


Introducing the data set

The data used in this vignette have been used to model the distribution of semi-natural grasslands in Østfold County, in southeastern Norway. The data set consists of 1059 locations where presence of semi-natural grasslands has been recorded, 13 environmental variables covering the extent of the study area, and 122 locations where the presence or absence of semi-natural grasslands has been recorded, independently of the presence-only records. The extent of the study area is about 4000 square kilometers, and the grain of the raster data is 500 meters (0.25 km2).

The data used in this vignette are included in the package as an example data set, so the results shown here can be directly replicated by executing the code as provided.

Before beginning the modeling exercise, it may be useful to see what some of the data look like in their geographical representation. We can use the raster package to plot the 1059 recorded presences on top of one of the environmental variable rasters:

library(raster)
#> Loading required package: sp
EV1 <- raster(list.files(system.file("extdata", "EV_continuous", 
                                     package="MIAmaxent"), full.names=TRUE)[1])
PO <- read.csv(system.file("extdata", "occurrence_PO.csv", package="MIAmaxent"))
plot(EV1, legend=FALSE)
points(PO$POINT_X, PO$POINT_Y, pch = 20, cex = 0.5, col = 'blue')

readData()

The starting point for modeling using MIAmaxent is a simple data object that contains occurrence data for the modeled target, and some number of explanatory variables (EVs). This data object must be formatted as a data frame, with the binary response variable (RV) representing occurrence in the first column, and corresponding EV values in subsequent columns. When the occurrence data consist of presence and absence records, these should be coded as “1” and “0” respectively. When the occurrence data consist of presence records only, presence locations are contrasted against locations with unknown occurrence, and the RV should be coded as “1” or “NA”. EVs may be continuous (numeric class) or categorical (factor class), as denoted in the data object.

The readData() function transforms data in CSV and ASCII raster file formats into a single data frame which serves as the starting point for modeling.

Users of the highly popular maxent.jar program for maximum entropy modeling are accustomed to data in a different format. Specifically, occurrence data is often in CSV file format, with presences records followed by coordinates, and explanatory data in ASCII raster file format. The readData() function makes it easy to read these data into the data object that is used in MIAmaxent. This function extracts values of the EVs at locations specified in the CSV file and properly formats these into the starting point for modeling. If the CSV file contains presence records only, then readData() also selects a random set of uninformed background locations for the data object. Alternatively, the user can specify a custom set of background locations by giving these in the CSV file.

We begin by creating our data object from file. Note that continuous and categorical environmental variables must be placed in separate directories:

In this case, the number of uninformed background locations to be randomly selected (maxbkg=20000) was larger than the total number of raster cells in the study area, so all cells are included in the data object.

Most functions in MIAmaxent return console output. Therefore, it’s handy to assign function output to an object, so that you can manipulate that object further. If you forget, you can use ?.Last.value().

If we look at the resulting data object we see the response variable (with 1059 presence and 16420 background locations) along with 8 continuous and 5 categorical EVs:

IMPORTANT: A number of important issues for distribution modeling – such as accounting for sampling bias and setting study area extent – are not dealt with in MIAmaxent, and should be dealt with beforehand. Good modeling practice requires that these issues be considered carefully! See Guisan et al. (2017) for a full treatment of distribution modeling in R.


Examining patterns in occurrence

By its simplest definition, a distribution model examines the relationship between the modeled target and its environment. In this way, distribution modeling follows the long tradition of gradient analysis in vegetation ecology (Halvorsen, 2012). Therefore, before building an actual model, we should have some idea about what influence the environmental variables have on the occurrence of the target.

plotFOP()

We can use the plotFOP function to create a so-called Frequency of Observed Presence (FOP) plot. An FOP plot shows how commonly the target occurs across the range of the EV, and makes it possible to recognize patterns in frequency of occurrence. In theory, the relationship between a continuous EV and modeled target is expected to be unimodal, if the observed range of the EV is sufficiently large. In practice, the pattern seen in the FOP plot depends not only on the range of the EV — which is affected by the extent of the study area — but also the scaling of the EV.

Here we examine FOP plots for 2 of the continuous EVs:

The points in these FOP plots show the observed proportion of points in a given interval of the EV which contain presences. The red line is a local regression smoother which aims to summarize the pattern in the empirical FOP values. The grey distribution in the background is an approximation of the data density across the range of the EV.

Notice the difference in the scales of the FOP axes. EVs showing a larger interval on the FOP axis typically carry more explanatory power.

We can change the number of the number of intervals used to calculate FOP, or the neighborhood of the smoother, and we can access the plotted data directly:

Based on this FOP plot, the occurrence of semi-natural grasslands seems to be unimodally related to ‘terslopdg’ (terrain slope) with a maximum at around 6.

Now we examine FOP plots for one of the categorical EVs:

We see that geoberg type 4 has the highest rate of observed presence, followed by type 2, and then types 3 and 28. If we look more closely however, we notice also that geoberg type 4 is sampled very rarely (see grey bars), with only 6 locations falling into that category:

If geoberg type 4 had shown a high FOP value and a large number of observations, the uncertainty associated with its FOP value would be lower and its likelihood of being selected in the model would be increased.

It’s useful to examine FOP plots for all candidate explanatory variables (EVs) before building a model.

Looking at FOP plots should help the modeler decide which EVs are likely to have greatest explanatory power, and gives an idea of the strength and shape of the relationships between the EVs and RV.


Transforming explanatory variables (EVs)

To fit the many different kinds of relationships between explanatory and response variables, we need to transform the EVs. This means that we create new “derived” variables (DVs) from the original EVs. Another way of thinking about this is to put it in terms of rescaling; we adjust the scale of the EV in many different ways in order to check which scaling is most ecologically relevant to the occurrence of the modeled target.

deriveVars()

The deriveVars() function produces DVs from EVs by 7 different transformation types: linear, monotonous, deviation, forward hinge, reverse hinge, threshold, and binary (Halvorsen et al., 2015). The first 6 of these are relevant for continuous variables and the binary transformation is relevant only for categorical variables. Different types of transformations can be turned on or off to balance model complexity with model fit.

For the spline-type transformations (forward hinge, reverse hinge, threshold) an endless number of different transformations are possible, so by default the function produces 20 of each, and then chooses those which explain the most variation in the RV. This means that 20 models are built and evaluated for each combination of EV and spline transformation, so running deriveVars() with these transformation types turned on can take a bit of time — depending on the size of the data set.

Here we produce all types of DVs from our EVs:

Turn write on and (optionally) specify a directory to save the transformation functions produced by deriveVars to file.

The output of deriveVars() is a list consisting of 2 parts:

Both list elements also contain the RV vector.

In our grasslands analysis, the contents of the list items look like this:


Note that the names of DVs indicate the type of transformation was used to create them. For example, “terslpdg_D2” is a deviation-type transformation of terslpdg, where the slope of the deviation is controlled by a parameter value of 2. Meanwhile, “terslpdg_HR4” is a reverse hinge transformation, with the knot in the 4th position.

Underscores (’_‘) are used to denote DVs, and colons (’:’) are used to denote interaction terms, so EV names must not contain these characters. EV names should also be unique.

To illustrate, look at how a given DV relates to the original, untransformed EV from which it was derived. Here we examine “terslpdg_D2” and “terslpdg_M”:

“terslpdg_D2” is the squared deviation (hence D2) from the estimated optimum in terslpdg (around 6). “terslpdg_M” is a monotone (hence M) transformation of terslpdg — specifically a zero-skew transformation.


Selecting variables

With DVs ready, we are ready to begin the process of choosing which variables to include in the model. This is arguably the most critical step in the whole modeling process. Following the principle of parsimony, the aim in selecting variables is to explain as much variation in the RV as efficiently as possible. The greater the number of EVs or DVs included in the model, the more variation in the RV we can explain, but at the cost of model complexity. In the MIAmaxent package, the benefit of additional variation explained is weighed against the cost in model complexity using an inference test (Chi-squared or F). Variables are added to the model one by one in a process termed forward selection, and each new model is compared to its predecessor. Another term for this process is “nested model comparison.”

Rather than selecting from the full pool of DVs one by one, MIAmaxent performs variable selection in two parts:

  1. First, a set of DVs is selected separately for each individual EV. This is done using the selectDVforEV() function.
  2. Second, the EVs themselves — each represented by a parsimonious set of DVs — are selected. This is done using the selectEV() function.

Variable selection occurs hierarchically: first DVs for each EV, then EVs for the full model.

selectDVforEV()

The selectDVforEV() function performs forward selection of individual DVs for each EV. In other words, the function takes each EV one by one, and narrows the group of DVs produced from that EV to a set which explains variation in the RVs most efficiently.

The alpha argument to selectDVforEV() is used in the inference test during forward selection, setting the threshold for how much variation a DV must explain to be retained. A lower alpha results in a more conservative test, i.e. DVs must explain more variation to be selected.

Here we use selectDVforEV() on the grassland data set. Note the “$dvdata” following grasslandsDV, which identifies the list of DVs we made using deriveVars() (see ?deriveVars() Value).

The output is a list consisting of 2 parts:

Comparing the list of DVs before and after selection, we can see that selectDVforEV() reduced the number of DVs considerably:

Note also that the number of EVs was reduced from 13 to 10. EVs for which none of the DVs explained a significant amount of variation are dropped.

Here is an example of one of the (13) trails of forward DV selection. Shown is the trail for the “terdem” EV:

The columns in this data frame represent: the round of variable selection (round), the names of the DVs included in the model (variables), the number of DVs in the model (m), the fraction of deviance explained (D2, sensu Guisan & Zimmerman, 2000), the Chi-squared statistic for the nested model comparison (Chisq), the degrees of freedom for the nested model comparison (df), and the p-value for the Chi-squared statistic under the specified degrees of freedom (P).

This table shows, for example, that in the first round of selection, one model was built for each of the 8 DVs, and that all of these explained enough variation to be retained for the second round of selection. Of all the DVs produced from “terdem”, “terdem_D05” explained the most variation in the RV. However, none of the remaining DVs explained enough of the remaining variation to be selected in addition to “terdem_D05” (P > alpha, in round 2).

selectEV()

Part 2 of variable selection using MIAmaxent is performed by the selectEV() function. This function is similar to the selectDVforEV() function, but instead of selecting a parsimonious set of DVs to represent each EV, selectEV() selects a parsimonious set of EVs to comprise the model. This proceeds in the same manner as selectDVforEV(), with nested model comparison using inference tests. Where selectDVforEV() adds a single DV at a time, selectEV() adds a single set of DVs (representing one EV) at a time.

The selectEV() function also differs from selectDVforEV() in another important way; it provides the option of testing interaction terms between selected EVs (interaction = TRUE). Interaction terms are useful when one EV changes the effect of another EV on the modeled target. In MIAmaxent, only first-order interactions are tested, and only between EVs both included in the model.

Here we use selectEV() on the grassland data set. Note the “$dvdata” following grasslandsDVselect, which identifies the list of selected DVs we made using selectDVforEV() (see ?selectDVforEV() Value).

The output is a list consisting of 3 parts:

Compare the list of EVs before and after:

We can see that selectEV() reduced the number of EVs to 6. To check whether any interaction terms have been included in the model, we can look at its formula:

No interaction terms were significant in the model. These would be denoted by two DV names separated by a colon (e.g. ‘pr.bygall_M:tertpi09_HR12’). This can be confirmed by looking at the trail of forward selection of EVs and interaction terms. Here we show only the best model of each round:

#>    round
#> 1      1
#> 11     2
#> 20     3
#> 27     4
#> 33     5
#> 36     6
#> 37     7
#>                                                                           variables
#> 1                                                                          prbygall
#> 11                                                               prbygall + geoberg
#> 20                                                     prbygall + geoberg + lcucor1
#> 27                                          prbygall + geoberg + lcucor1 + tertpi09
#> 33                               prbygall + geoberg + lcucor1 + tertpi09 + geolmja1
#> 36                    prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif
#> 37 prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif + lcucor1:tertpi09
#>     m   Dsq   Chisq df        P
#> 1   1 0.012 219.973  1 9.17e-50
#> 11  6 0.019 112.063  5 1.50e-22
#> 20  8 0.023  69.863  2 6.75e-16
#> 27  9 0.026  62.059  1 3.33e-15
#> 33 12 0.028  25.978  3 9.64e-06
#> 36 13 0.028  14.735  1 1.24e-04
#> 37 15 0.029  10.593  2 5.01e-03

As expected, the model with the interaction term is not significant under the alpha value of 0.001. Instead, the selected model is the model with 6 single-order terms.

The full selection trail can be saved as a CSV-format file by setting write = TRUE in selectEV(), and optionally specifying a dir. A careful examination of the trail of forward selection can be helpful in choosing the final model for a given application.

Note that above we started the forward selection procedure from a null model including none of the EVs. However, if we had an a priori reason to include one or more of the EVs in the model regardless of explanatory power, we could do so using the formula argument in selectEV(). Then forward selection proceeds with the specified model as a starting point.

chooseModel()

We may choose to plot some of the data in the forward selection trail to help us decide which model to use. For example, we may plot fraction of deviance explained (D2) against round number, to see how much better the model fit is for each added term:

In this case, we may decide that a simpler model with only 5 EVs is better than the selected model, since the amount of deviance explained levels off after round 5. The EVs included in this model are: prbygall + geoberg + lcucor1 + tertpi09 + geolmja1. To proceed with this model, instead of the selectedmodel in grasslandEVselect, we can use the chooseModel() function:

This is the model which we explore further below.


Exploring the model

After building a model by selecting which EVs to include, it is useful to explore what the modeled relationships actually look like. A straightforward way to do this is to look at model predictions over a range of explanatory data. This gives the modeler a sense of the relationships that the model has captured, and can help the modeler understand strengths and weaknesses in the model.

plotResp()

We can use the plotResp() function to create a response curve. A response curve plots model output across the range of one particular EV. When other variables are excluded from the model entirely, this is called a “single-effect response curve”.

Here we examine a single-effect response curve for the most important EV included in the model chosen above:

To assess how well the relationship between the EV and RV is captured by the model, it can be useful to examine FOP plots and response plots side-by-side:

The values on the y-axes of the plots are not directly comparable, but we expect the shape of the response plot curve to mirror, more or less, the shape of the FOP plot curve.

Here is the same comparison for one of the categorical variables included in the model:

plotResp2()

The plotResp2() function is very similar to the plotResp() function in that it takes the same arguments and is used to produce response plots. However, plotResp2() plots “marginal-effect” responses rather than “single-effect” responses. A marginal-effect response plot shows model predictions across the range of one EV while holding all other EVs in the model constant at their mean value. The resulting curves are often similar in shape, if not in scale.

Here is the marginal-effect response plot for the same continuous EV shown above:


Applying the model

For many modeling applications — although not all — the motivation for building a model is to obtain model predictions. Model predictions, or model output, can be used in many different ways: to make predictions about parts of the study area for which there exist no occurrence data, to predict relative probability of occurrence outside the study area, or to predict relative probability of occurrence in the past or future. In other words, any form of spatial or temporal interpolation or extrapolation is possible (although not always recommended!). The only requirement is that the values of the EVs are known for the time or space for which model output is desired.

projectModel()

The projectModel() function can be used to obtain model predictions for any kind of modeling application. As input it takes the model to be projected (model), the transformation functions used by the model (transformations), and explanatory data to project across (data). For “maxent”-type models, the projectModel returns model predictions in probability ratio output (PRO) format for each location represented in data. PRO format gives relative probability of presence, and PRO = 1 is a reference value that represents the probability of presence in an “average” location in the training data.

A value of PRO = 1 can be interpreted as the relative probability of presence of a location randomly drawn from the training data. Put another way, values above 1 represent higher-than-average probability of presence, and vice versa.

Here, we obtain model output across the extent of the study area as represented by the training data. When we enter the data argument as a RasterStack or RasterBrick object, projectModel() automatically shows model predictions in geographical space. Note that the names of the raster layers must match names of EVs in the model.

It is often easier to visualize probability-ratio values on a log scale, so we plot the raster object again as log2(PRO + 1):

Alternatively, if the data are supplied in a simple data frame, model predictions are appended to the data in column 1, and returned as list item output. In this case, the predictions can be mapped to geographical space manually, by including spatial coordinates in the data input to the projectModel(), and then using the rasterize() function in the raster package.

Additionally, the projectModel() function automatically checks how the range of the input explanatory data compares to the range of the data used to train the model. This is important because if the range of the input data lies outside the range of the training data (i.e. the model is extrapolated), the predictions are less reliable. The range of continuous variables is reported on the training data scale, from 0 to 1, and the range of categorical variables is reported as “inside” if all categories are represented in the training data:

Since we projected the model across the training data, it makes sense that all the ranges are reported as [0,1] or “inside.”


Evaluating the model

There are many ways of evaluating the quality of a model, including assessing which EVs are selected and gauging the realism of response curves. Arguably the best way to evaluate a model, however, is to test how often its predictions are correct using occurrence data which are independent from the data used to train the model. This is strongly preferable to using training data because it indicates whether the model reflects patterns specific to the training data or generalized patterns for the modeled target (i.e. whether the model is overfitted). Independent, presence-absence test data are not always available, for example when projecting a model into the future, but when they are, they represent a high standard in model validation.

testAUC()

The evaluation metric which is used most commonly for distribution models and is implemented in MIAmaxent is Area Under the Curve (AUC) of the receiver operating characteristic plot. This is a metric which measures the performance of the model as a binary classifier over the full range of discrimination thresholds. When it is calculated using independent test data we refer to it as “testAUC”.

The testAUC() function takes a data frame of presence and absence locations, along with the corresponding values of EVs at those locations, and calculates testAUC. The evaluation data can easily be read into R using the readData() function with PA = TRUE if desired, as shown below.

In our example, 122 test locations in Østfold County, Norway, were visited to record presence or absence of semi-natural grasslands:

grasslandPA <- readData(
  occurrence = system.file("extdata", "occurrence_PA.csv", package="MIAmaxent"), 
  contEV = system.file("extdata", "EV_continuous", package="MIAmaxent"),
  catEV = system.file("extdata", "EV_categorical", package="MIAmaxent"),
  PA = TRUE, XY = TRUE)
head(grasslandPA)
#>   RV      x       y     pca1    prbygall   prtilany teraspif terdem
#> 1  1 262750 6557250 0.000646 0.002871206 0.45159969       93     24
#> 2  1 263250 6558750 0.000635 0.029484030 0.01523342      109     19
#> 3  1 263750 6555250 0.000694 0.000970874 0.92038828       41      0
#> 4  1 265250 6555750 0.000719 0.003206413 0.27815631        7     13
#> 5  1 265250 6559750 0.000652 0.002914238 0.27019149      151     20
#> 6  1 266750 6553750 0.000848 0.012106540 0.21468930       49     10
#>   terslpdg tersolrade  tertpi09 geoberg geolmja1 lcucor1 lcutilt4
#> 1 1.114550    967.471   8.67901      21      100     311        0
#> 2 1.236990    956.743   1.50617      21      100     322        0
#> 3 0.365185    981.625 -10.39510      21      130     523        1
#> 4 0.326633    988.487  -4.71605      21      130     322        1
#> 5 1.382270    941.362   6.06173      21      130     311        1
#> 6 0.769991    985.827   2.41975      21      130     322        1
#>   terslpps15
#> 1          6
#> 2          6
#> 3          1
#> 4          5
#> 5          6
#> 6          6
tail(grasslandPA)
#>     RV      x       y     pca1 prbygall  prtilany teraspif terdem terslpdg
#> 117  0 313250 6575250 0.000788        0 1.0101010      131    132 0.045296
#> 118  0 313750 6579250 0.000650        0 1.0076010       39    149 1.358030
#> 119  0 314250 6572750 0.000545        0 0.3865006       78    139 0.146076
#> 120  0 314250 6576750 0.000694        0 1.0060360       81    141 1.689610
#> 121  0 314750 6571250 0.000640        0 0.0000000      102    174 1.501690
#> 122  0 314750 6574750 0.000745        0 0.9434344      166    130 0.449772
#>     tersolrade tertpi09 geoberg geolmja1 lcucor1 lcutilt4 terslpps15
#> 117   1000.920 -3.39507      29      100     312        1          1
#> 118   1022.570 -1.76543      62      130     312        1          6
#> 119   1001.790 -7.23457      82       12     312        1          1
#> 120   1028.010  3.16049      62       12     312        1          6
#> 121    983.568 12.33330      82      130     312        0          6
#> 122    988.030 -6.40741      62       12     312        1          1

Plotted on the raster of (log-scaled) model predictions, these occurrences look like this: