Introduction
This Task View contains information about using R to analyse ecological and environmental data.
The base version of R ships with a wide range of functions for use within the field of environmetrics.
This functionality is complemented by a plethora of packages available via CRAN, which provide specialist
methods such as ordination & cluster analysis techniques. A brief overview of the available packages is
provided in this Task View, grouped by topic or type of analysis. As a testament to the popularity of R for the
analysis of environmental and ecological data, a
special volume
of
the
Journal of Statistical Software
was produced in 2007.
Those useRs interested in environmetrics should also consult the
Spatial
view.
Complementary information is also available in the
Multivariate
and
Cluster
task
views.
If you have any comments or suggestions for additions or improvements, then please contact the
maintainer
.
A list of available packages and functions is presented below, grouped by analysis type.
Modelling species responses and other data
Analysing species response curves or modelling other data often involves the fitting of standard statistical models
to ecological data and includes simple (multiple) regression, Generalised Linear Models (GLM), extended regression
(e.g. Generalised Least Squares [GLS]), Generalised Additive Models (GAM), and mixed effects models, amongst
others.
-
The base installation of R provides
lm()
and
glm()
for fitting linear and generalised
linear models, respectively.
-
Generalised least squares and linear and non-linear mixed effects models extend the simple regression model
to account for clustering, heterogeneity and correlations within the sample of observations. Package
nlme
provides functions for fitting these models. The package is supported by Pinheiro & Bates (2000)
Mixed-effects Models in S and S-PLUS
, Springer, New York. An updated approach to mixed effects models,
which also fits Generalised Linear Mixed Models (GLMM) and Generalised non-Linear Mixed Models (GNLMM) is provided
by the
lme4
package, though this is currently beta software and does not yet allow correlations within
the error structure.
-
Recommended package
mgcv
fits GAMs and Generalised Additive Mixed Models (GAMM) with
automatic smoothness selection via generalised cross-validation. The author of
mgcv
has
also written a companion monograph, Wood (2006)
Generalized Additive Models: An Introduction with R
, Chapman & Hall/CRC, which has an accompanying package
gamair.
-
Alternatively, package
gam
provides an implementation of the S-PLUS function
gam()
that
includes LOESS smooths.
-
Proportional odds models for ordinal responses can be fitted using
polr()
in the
MASS
package by Bill Venables and Brian Ripley.
-
A negative binomial family for GLMs to model over-dispersion in count data is available in
MASS.
-
Models for overdispersed counts and proportions:
-
Package
pscl
also contains several functions for dealing with over-dispersed count data. Poisson or
negative binomial distributions are provided for both zero-inflated and hurdle models.
-
aod
provides a suite of functions to analyse overdispersed counts or proportions, plus utility
functions to calculate e.g. AIC, AICc, Akaike weights.
-
Detecting change points and structural changes in parametric models is well catered for in the
segmented
package and the
strucchange
package respectively.
segmented
has recently been
the subject of an R News article (
R News, volume 8 issue 1
).
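As a minimal sketch of how some of the functions listed above fit together, the following simulates a unimodal species response along a hypothetical pH gradient and fits a Poisson GLM, a negative binomial GLM and a GAM. The data, variable names and parameter values are invented for illustration; only base R and the recommended packages MASS and mgcv are assumed.

```r
library(MASS)   # glm.nb()
library(mgcv)   # gam() and the nb() family

set.seed(42)

## Simulated data (hypothetical): over-dispersed counts with a Gaussian-shaped
## response along a pH gradient, with the optimum at pH 6.5
n     <- 200
ph    <- runif(n, min = 4, max = 9)
mu    <- exp(3 - 0.8 * (ph - 6.5)^2)
abund <- rnbinom(n, mu = mu, size = 2)
dat   <- data.frame(abund = abund, ph = ph)

## Poisson GLM with a quadratic term in the linear predictor
m_pois <- glm(abund ~ ph + I(ph^2), family = poisson, data = dat)

## Pearson dispersion statistic; values well above 1 suggest over-dispersion
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

## Negative binomial GLM for the over-dispersed counts
m_nb <- glm.nb(abund ~ ph + I(ph^2), data = dat)

## Estimated optimum of the quadratic response: -b1 / (2 * b2)
b       <- coef(m_nb)
optimum <- unname(-b["ph"] / (2 * b["I(ph^2)"]))

## GAM with a smooth of pH; REML smoothness selection is requested explicitly
m_gam <- gam(abund ~ s(ph), family = nb(), data = dat, method = "REML")

## Compare the two parametric fits
AIC(m_pois, m_nb)
```

The quadratic term on the log scale is one common way to represent a symmetric unimodal response; the GAM relaxes that shape assumption at the cost of a less directly interpretable optimum.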
Tree-based models
Tree-based models are being increasingly used in ecology, particularly for their ability to fit flexible models to
complex data sets and the simple, intuitive output of the tree structure. Ensemble methods such as bagging, boosting and
random forests are advocated for improving predictions from tree-based models and to provide information on uncertainty
in regression models or classifiers.
Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book,
are implemented in
-
recommended package
rpart
-
party
provides an implementation of conditional inference trees which embed tree-structured regression
models into a well defined theory of conditional inference procedures
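A short sketch of a CART-style regression tree using rpart and the airquality data that ships with R. The choice of predictors, the pruning rule (smallest cross-validated error) and the new observation are illustrative assumptions rather than recommendations.

```r
library(rpart)

set.seed(1)  # cross-validation inside rpart() uses random folds

## Regression tree for ozone concentration in the built-in airquality data
fit <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality, method = "anova")

## Cost-complexity table: one row per candidate tree size
cp_tab <- fit$cptable

## Prune back to the tree with the smallest cross-validated error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

print(pruned)

## Prediction for new conditions (values are hypothetical)
new_obs <- data.frame(Solar.R = 200, Wind = 8, Temp = 85)
predict(pruned, newdata = new_obs)
```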
Multivariate trees are available in
-
package
mvpart, which provides an adaptation of
rpart
for multivariate responses.
-
package
party
can also handle multivariate responses.
Ensemble techniques for trees:
-
The Random Forest method of Breiman and Cutler is implemented in
randomForest, providing classification
and regression based on a forest of trees using random inputs
-
Package
ipred
provides functions for improved predictive models for classification, regression and
survival problems.
Graphical tools for the visualization of trees are available in packages
maptree
and
pinktoe.
Packages
mda
and
earth
implement Multivariate Adaptive Regression Splines (MARS), a technique
which provides a more flexible, tree-based approach to regression than the piecewise constant functions used in
regression trees.
Ordination
R and add-on packages provide a wide range of ordination methods, many of which are specialised techniques
particularly suited to the analysis of species data. The two main packages are
ade4
and
vegan.
ade4
derives from the traditions of the French school of
Analyse des Donnees
and is based on the use of the duality diagram.
vegan
follows
the approach of Mark Hill, Cajo ter Braak and others, though the implementation owes more to that presented in
Legendre & Legendre (1998)
Numerical Ecology, 2nd English Edition
, Elsevier. Where the
two packages provide duplicate functionality, the user should choose whichever framework best suits their
background.
-
Principal Components (PCA) is available via the
prcomp()
function.
rda()
(in package
vegan),
pca()
(in package
labdsv) and
dudi.pca()
(in
package
ade4), provide more ecologically-orientated implementations. A form of PCA used
in the climate and climate change fields is Empirical Orthogonal Function (EOF) analysis and is implemented
in
EOF()
in package
clim.pact.
-
Redundancy Analysis (RDA) is available via
rda()
in
vegan
and
pcaiv()
in
ade4.
-
Canonical Correspondence Analysis (CCA) is implemented in
cca()
in both
vegan
and
ade4.
-
Detrended Correspondence Analysis (DCA) is implemented in
decorana()
in
vegan.
-
Principal coordinates analysis (PCO) is implemented in
dudi.pco()
in
ade4,
pco()
in
labdsv,
pco()
in
ecodist, and
cmdscale()
in standard package stats.
-
Non-Metric multi-Dimensional Scaling (NMDS) is provided by
isoMDS()
in package
MASS
and
nmds()
in
ecodist.
nmds(), a wrapper function for
isoMDS(),
is also provided by package
labdsv.
vegan
provides helper function
metaMDS()
for
isoMDS(), implementing random starts of the algorithm and standardised scaling of the NMDS results.
The approach adopted by
vegan
with
metaMDS()
is the recommended approach for ecological
data.
-
Coinertia analysis is available via
coinertia()
and
mcoa(), both in
ade4.
-
Co-correspondence analysis to relate two ecological species data matrices is available in
cocorresp.
-
Canonical Correlation Analysis (CCoA - not to be confused with CCA, above) is available in
cancor()
in standard package stats.
-
Procrustes rotation is available in
procrustes()
in
vegan
and
procuste()
in
ade4, with both
vegan
and
ade4
providing functions to test the significance of
the association between ordination configurations (as assessed by Procrustes rotation) using permutation/randomisation
and Monte Carlo methods.
-
Constrained Analysis of Principal Coordinates (CAP), implemented in
capscale()
in
vegan,
fits constrained ordination models similar to RDA and CCA but with any dissimilarity coefficient.
-
Constrained Quadratic Ordination (CQO; formerly known as Canonical Gaussian Ordination (CGO)) is a maximum likelihood
estimation alternative to CCA fit by Quadratic Reduced Rank Vector GLMs. Constrained Additive Ordination (CAO) is a
flexible alternative to CQO which uses Quadratic Reduced Rank Vector GAMs. These methods and more are provided in
Thomas Yee's
VGAM
package.
-
Fuzzy set ordination (FSO), an alternative to CCA/RDA and CAP, is available in package
fso.
fso
complements a recent paper on fuzzy sets in the journal
Ecology
by Dave Roberts (2008, Statistical analysis of
multidimensional fuzzy set ordinations.
Ecology
89(5)
, 1246-1260).
-
See also the
Multivariate
task view for complementary information.
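The unconstrained methods listed above can be tried without any contributed packages. The sketch below runs PCA, PCO and NMDS on a simulated community matrix; the data-generating model and the log transformation are assumptions made for illustration, and Euclidean distance is used only because it is available in dist().

```r
library(MASS)  # isoMDS()

set.seed(123)

## Simulated community matrix (hypothetical): 30 sites x 8 species with
## unimodal responses along a single gradient
n_sites <- 30
grad    <- seq(0, 10, length.out = n_sites)
optima  <- seq(1, 9, length.out = 8)
mu      <- sapply(optima, function(u) 20 * exp(-(grad - u)^2 / (2 * 1.5^2)))
comm    <- matrix(rpois(length(mu), mu), nrow = n_sites)
colnames(comm) <- paste0("sp", 1:8)

## Log-transform to damp the influence of abundant species
y <- log1p(comm)

## PCA on the covariance matrix
pca <- prcomp(y, center = TRUE, scale. = FALSE)

## PCO (classical scaling) of Euclidean distances
d   <- dist(y)
pco <- cmdscale(d, k = 2, eig = TRUE)

## With Euclidean distances, PCO axes match PCA axes up to sign
abs(cor(pca$x[, 1], pco$points[, 1]))

## NMDS from the same distances; isoMDS() starts from the cmdscale() solution
nmds <- isoMDS(d, k = 2, trace = FALSE)
nmds$stress   # final stress, reported in percent
```

For real species data an ecological dissimilarity would normally replace the Euclidean distance used here.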
Dissimilarity coefficients
Much ecological analysis proceeds from a matrix of dissimilarities between samples. A large amount of effort has
been expended formulating a wide range of dissimilarity coefficients suitable for ecological data. A selection of
the more useful coefficients is available in R and various contributed packages.
Standard functions that produce square, symmetric matrices of pair-wise dissimilarities include:
-
dist()
in standard package stats
-
daisy()
in recommended package
cluster
-
vegdist()
in
vegan
-
dsvdis()
in
labdsv
-
Dist()
in
amap
-
distance()
in
ecodist
-
a suite of functions in
ade4
Function
distance()
in package
analogue
can be used to calculate dissimilarity between samples
of one matrix and those of a second matrix. The same function can be used to produce pair-wise dissimilarity matrices,
though the other functions listed above are faster.
distance()
can also be used to generate
matrices based on Gower's coefficient for mixed data (mixtures of binary, ordinal/nominal and continuous variables).
Function
daisy()
in package
cluster
provides a faster implementation of Gower's coefficient for
mixed-mode data than
distance()
if a standard dissimilarity matrix is required.
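To make the distinction concrete, here is a small sketch using dist() from stats for a species matrix and daisy() from cluster for mixed-mode data with Gower's coefficient. The site names, counts and descriptors are invented for the example.

```r
library(cluster)  # daisy()

## Hypothetical species counts for four sites
spp <- matrix(c(10, 0, 3,
                 8, 1, 0,
                 0, 5, 7,
                 1, 4, 9),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("site", 1:4), c("spA", "spB", "spC")))

## dist() returns a compact "dist" object; as.matrix() expands it to the
## square, symmetric form
d_euc <- dist(spp, method = "euclidean")
as.matrix(d_euc)

## Hypothetical mixed-mode site descriptors
env <- data.frame(
  ph        = c(5.1, 6.3, 7.0, 6.8),
  substrate = factor(c("sand", "silt", "sand", "clay")),
  flow      = ordered(c("low", "high", "medium", "low"),
                      levels = c("low", "medium", "high")),
  row.names = rownames(spp)
)

## Gower's coefficient handles numeric, nominal and ordinal columns together
d_gow <- daisy(env, metric = "gower")
summary(d_gow)
```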
Cluster analysis
Cluster analysis aims to identify groups of samples within multivariate data sets. A large range of
approaches to this problem have been suggested, but the main techniques are hierarchical cluster analysis,
partitioning methods, such as
k
-means, and finite mixture models or model-based clustering. In the machine
learning literature, cluster analysis is an unsupervised learning problem.
The
Cluster
task view provides a more detailed discussion of available cluster analysis methods and
appropriate R functions and packages.
Hierarchical cluster analysis:
-
hclust()
in standard package stats
-
Recommended package
cluster
provides functions for cluster analysis following the methods
described in Kaufman and Rousseeuw (1990)
Finding Groups in data: an introduction to cluster analysis
,
Wiley, New York
-
hcluster()
in
amap
-
pvclust
is a package for assessing the uncertainty in hierarchical cluster analysis. It provides
approximately unbiased
p
-values as well as bootstrap
p
-values.
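A minimal sketch of hierarchical clustering with hclust(), using simulated data with two obvious groups. Average linkage and the choice of two clusters are assumptions for the example, not recommendations.

```r
set.seed(10)

## Two well-separated groups of 20 samples each (simulated)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

## Average-linkage (UPGMA) clustering of Euclidean distances
d  <- dist(x)
hc <- hclust(d, method = "average")

## Cut the dendrogram into two groups
groups <- cutree(hc, k = 2)
table(groups)

## Cophenetic correlation: how faithfully the dendrogram preserves the
## original distances
cor(d, cophenetic(hc))
```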
Partitioning methods:
-
kmeans()
in stats provides
k
-means clustering
-
cmeans()
in
e1071
implements a fuzzy version of the
k
-means algorithm
-
Recommended package
cluster
also provides functions for various partitioning methodologies.
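The following sketch compares k-means from stats with partitioning around medoids (pam() in cluster) on simulated data. The group centres, spread and number of clusters are invented for illustration.

```r
library(cluster)  # pam()

set.seed(11)

## Simulated data: three groups of 25 samples in two dimensions
centres <- rbind(c(0, 0), c(4, 0), c(2, 4))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(25, centres[i, 1], 0.4), rnorm(25, centres[i, 2], 0.4))
}))

## k-means with several random starts to avoid poor local optima
km <- kmeans(x, centers = 3, nstart = 20)

## Partitioning around medoids, a more robust relative of k-means
pm <- pam(x, k = 3)

## Cross-tabulate the two partitions; labels are arbitrary, so agreement
## shows up as one non-zero cell per row
table(kmeans = km$cluster, pam = pm$clustering)

## Average silhouette width as a simple measure of cluster quality
pm$silinfo$avg.width
```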
Mixture models and model-based cluster analysis:
-
mclust
and
flexmix
provide implementations of model-based cluster analysis.
-
prabclus
clusters a species presence-absence matrix object by calculating an
MDS
from the distances, and applying maximum likelihood Gaussian
mixtures clustering to the MDS points. The web site of the maintainer, Christian Hennig, lists several publications in
ecological contexts that use
prabclus, especially Hausdorf & Hennig (2007;
Oikos 116, 818-828
).
Ecological theory
There is a growing number of packages and books that focus on the use of R for theoretical ecological models.
-
vegan
provides a wide range of functions related to ecological theory, such as diversity indices
(including the
so-called
Hill's numbers [e.g. Hill's N2] and rarefaction), ranked abundance diagrams,
Fisher's log series, Broken Stick model, Hubbell's abundance model, amongst others.
-
The
vegetarian
package provides the diversity measures suggested by Jost
(
2006, Oikos 113(2), 363-375
;
2007, Ecology 88(10), 2427-2439
).
-
untb
provides a collection of utilities for biodiversity data, including the simulation of ecological drift
under Hubbell's Unified Neutral Theory of Biodiversity, and the calculation of various diagnostics such as Preston
curves.
-
primer
is support software for Stevens
(
2009,
A Primer of Ecology with R
,
Springer
). The package provides a variety of functions for modeling ecological data and basic theoretical ecology,
including functions related to demographic matrix models, metapopulation and source-sink models, host-parasitoid and
disease models, multiple basins of attraction, the storage effect, neutral theory, and diversity partitioning.
-
Package
BiodiversityR
provides a GUI for biodiversity and community ecology analysis.
-
Function
betadiver()
in
vegan
implements all of the beta diversity indices reviewed in
Koleff et al (2003;
Journal of
Animal Ecology 72(3), 367-382
).
betadiver()
also provides a
plot
method to produce the co-occurrence frequency triangle plots
of the type found in Koleff et al (2003).
-
Function
betadisper(), also in
vegan, implements Marti Anderson's distance-based test for
homogeneity of multivariate dispersions (PERMDISP, PERMDISP2), a multivariate analogue of Levene's test (Anderson
2006;
Biometrics 62,
245-253
). Anderson et al (2006;
Ecology Letters 9(6), 683-693
)
demonstrate the use of this approach for measuring beta diversity.
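To illustrate what the Hill numbers mentioned above measure, here is a base R sketch that computes N0, N1 and N2 from a vector of counts. The helper hill_numbers() is hypothetical and written only for this example; it is not the vegan implementation.

```r
## hill_numbers() is a hypothetical helper written for this sketch
hill_numbers <- function(counts) {
  counts  <- counts[counts > 0]          # absent species contribute nothing
  p       <- counts / sum(counts)        # relative abundances
  shannon <- -sum(p * log(p))            # Shannon entropy
  c(N0 = length(p),        # species richness
    N1 = exp(shannon),     # exponential of Shannon entropy
    N2 = 1 / sum(p^2))     # inverse Simpson index
}

## Two invented communities with the same richness but different evenness
even   <- rep(10, 5)
uneven <- c(46, 1, 1, 1, 1)

hill_numbers(even)
hill_numbers(uneven)
```

For a perfectly even community all three numbers equal the species richness; as dominance increases, N1 and N2 fall below N0, with N2 the most sensitive to the dominant species.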
Population dynamics
Estimating vital rates (i.e., growth, survival, and reproduction), especially in monitoring studies of tagged individuals
over time:
-
age-specific survival and reproduction using life tables in
demogR,
-
stage-specific vital rates using transition frequency tables in
popbio,
-
mark and recapture methods in
mra
and
Rcapture
when tagged individuals cannot be consistently
detected or identified.
Modeling population growth rates:
-
Packages
demogR
and
popbio
can be used to construct and analyse age- or stage-specific matrix
population models
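The core calculation behind an age-classified matrix model can be sketched in base R with eigen(). The fecundities and survival probabilities below are hypothetical, and this does not use or reproduce the demogR or popbio interfaces.

```r
## Hypothetical age-classified (Leslie) matrix for three age classes:
## the first row holds fecundities, the sub-diagonal holds survival probabilities
A <- matrix(c(0.0, 1.5, 2.0,
              0.5, 0.0, 0.0,
              0.0, 0.8, 0.0),
            nrow = 3, byrow = TRUE)

ev  <- eigen(A)
dom <- which.max(Mod(ev$values))

## Asymptotic growth rate: the dominant eigenvalue
lambda <- Re(ev$values[dom])

## Stable age distribution: the matching right eigenvector, scaled to sum to 1
w <- Re(ev$vectors[, dom])
w <- w / sum(w)

## Check by projecting an arbitrary starting population forward in time
n <- c(100, 0, 0)
for (step in 1:100) n <- as.vector(A %*% n)
n / sum(n)   # should approach w
```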
Environmental time series
-
Time series objects in R are created using the
ts()
function, though see
tseries,
zoo
and
its
below for alternatives.
-
Classical time series functionality is provided by the
ar() and
arima()
functions in
standard package stats for autoregressive (AR), moving average (MA), autoregressive moving average (ARMA) and
integrated ARMA (ARIMA) models.
-
The
dse
package provides a variety of more advanced estimation methods
and multivariate time series analysis.
-
Packages
tseries
and
zoo
provide general handling and analysis of time series data.
-
Irregular time series can be handled using packages
zoo
and
its, as well as by
irts()
in package
tseries.
-
pastecs
provides functions specifically tailored for the analysis of space-time ecological series.
-
strucchange
allows for testing, dating and monitoring of structural change in linear regression
relationships.
-
Detecting change points in time series data: see
segmented
above
.
-
The
surveillance
package implements statistical methods for the modeling of and change-point detection
in time series of counts, proportions and categorical data. Focus is on outbreak detection in count data time series.
-
Package
dynlm
provides a convenient interface to fitting time series regressions via ordinary least
squares.
-
Package
dyn
provides a different approach to that of
dynlm, which allows time series data to
be used with any regression function written in the style of lm such as
lm(),
glm(),
loess(),
rlm()
and
lqs()
from
MASS,
randomForest()
(package
randomForest),
rq()
(package
quantreg) amongst
others, whilst preserving the time series information.
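A brief sketch of the base R time series tools mentioned at the start of this section: building a ts object and fitting a seasonal ARIMA model with arima(). The simulated temperature series and the model orders are assumptions chosen for illustration; co2 is a monthly data set that ships with R.

```r
## Build a monthly ts object from a plain vector (values simulated)
set.seed(5)
temp <- ts(10 + 5 * sin(2 * pi * (1:48) / 12) + rnorm(48),
           start = c(2001, 1), frequency = 12)

## Extract one calendar year while keeping the time series attributes
temp_2002 <- window(temp, start = c(2002, 1), end = c(2002, 12))

## co2 is a built-in monthly ts object of atmospheric CO2 concentrations
frequency(co2)

## Seasonal ARIMA(0,1,1)(0,1,1) with period 12; orders chosen for illustration
fit <- arima(co2, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

## Forecast the next 12 months, with standard errors
fc <- predict(fit, n.ahead = 12)
fc$pred
fc$se
```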
Spatial data analysis
See the
Spatial
CRAN Task View for an overview of spatial analysis in R.
Extreme values
ismev
provides functions for models for extreme value statistics and is support software for Coles (2001)
An Introduction to Statistical Modelling of Extreme Values
, Springer, New York. Other packages for extreme value
theory include:
-
evir
-
evd
-
evdbayes, which provides a Bayesian approach to extreme value theory
-
extRemes
-
POT, which provides functions for the Generalized Pareto Distribution and the Peaks Over Threshold
method
-
The
SpatialExtremes
package provides several approaches for modelling spatial extreme events.
Phylogenetics and evolution
Packages specifically tailored for the analysis of phylogenetic and evolutionary data include:
UseRs may also be interested in Paradis (2006)
Analysis of Phylogenetics and Evolution with R
, Springer,
New York, a book in the new UseR series from Springer.
Other packages
Several other relevant contributed packages for R are available that do not fit under nice headings.
-
adehabitat
complements
ade4
and provides a collection of tools for the analysis of habitat
selection by animals.
-
Andrew Robinson's
equivalence
package provides some statistical tests and graphics for assessing
tests of equivalence. Such tests have similarity as the alternative hypothesis instead of the null. The package
contains functions to perform two one-sided t-tests (TOST) and paired t-tests of equivalence.
-
Thomas Petzoldt's
simecol
package provides an object oriented framework and tools to simulate
ecological (and other) dynamic systems within R. See the
simecol website
and an
R News
article
on the package for further information.
-
Functions for circular statistics are found in
CircStats
and
circular.
-
Package
eco
fits Bayesian models of ecological inference to 2 x 2 contingency tables.
-
Package
e1071
provides functions for latent class analysis, short time Fourier transform, fuzzy clustering,
support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, and more...
-
Package
seacarb
provides functions for calculating parameters of the seawater carbonate system.
-
Package
NADA
contains methods described by Dennis R. Helsel in his book
Nondetects And Data
Analysis: Statistics for Censored Environmental Data
.
-
Package
pcurve
fits a principal curve to a numeric multivariate dataset in arbitrary dimensions. Produces
diagnostic plots.
-
Package
clim.pact
contains R functions for retrieving data, performing climate analysis, and downscaling monthly
mean and daily mean global climate scenarios.
-
Package
SoPhy
provides soil physics tools in R including water flow by SWMS_2D.
-
Package
pgirmess
provides a suite of miscellaneous functions for data analysis in ecology.
-
mefa
provides functions for handling and reporting on multivariate count data in ecology and
biogeography.
-
analogue
fits Modern Analogue Technique and Weighted Averaging transfer function models for calibration
(aka bioindication), for prediction of environmental data from species data
-
Stephen Sefick's
StreamMetabolism
package contains functions for calculating stream metabolism
characteristics, such as GPP, NDM, and R, from single station diurnal oxygen curves.