We have a weekly ENM/SDM reading group here at UT, and one of our goals with this blog is to start summarizing our interpretation of the papers we read, and hopefully to spark discussion of these papers with the broader ENM/SDM community.  To that end, I'm going to discuss my interpretation of a paper by Meynard and Kaplan, entitled "The effect of a gradual response to the environment on species distribution modeling performance", Ecography 34:001-011.

We found bits of the paper a little confusing, but I'm fairly sure I understand the overall point.  It's an interesting and useful point to make, so I'm going to try to outline it here.  In the event that I've misinterpreted the paper (a distinct possibility), I'd love to be corrected.  Here is the argument being presented as I understand it.

First, they develop a simulation approach to generate data that treats the probability of sampling from a given environment as a logistic function of some environmental variable, Pi = 1/(1 + e^(-Yi)), where Pi is the probability of sampling in grid cell i and Yi is a function of the environmental gradient.  Yi is calculated as (xi - B)/a, where xi is the value of the variable in cell i and B and a are parameters that determine the inflection point and slope of the species' response to the gradient, respectively (these are beta and alpha in the original paper, but if Weebly has those symbols I don't know how to access them).  Note that a larger value of a means a more gradual response.  So the whole shmear looks like this: Pi = 1/(1 + e^(-(xi - B)/a)).
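For anyone who would rather poke at code than at a formula, here is a minimal Python sketch of that response function. The function name occurrence_probability is made up for illustration, and the formula is the one as written in this post, which may not match the paper's parameterization in every detail.

```python
import math


def occurrence_probability(x, beta, alpha):
    """Logistic response: P = 1 / (1 + e^(-Y)), with Y = (x - beta) / alpha.

    beta is the inflection point (where P = 0.5); alpha controls the slope,
    with larger alpha giving a more gradual response.
    """
    y = (x - beta) / alpha
    # Use whichever form keeps the exponent non-positive, so math.exp
    # cannot overflow when alpha is tiny and |y| is huge.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


# At the inflection point the probability is exactly one half.
print(occurrence_probability(1.0, beta=1, alpha=2))  # 0.5
```

The two-branch form is only there for numerical safety; mathematically both branches are the same logistic curve.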
When faced with something like this, I often like to do simple little Excel experiments just to visualize how changes in parameters affect outputs.  I generated a fake environmental gradient and added a bit of randomness to it, and then I simulated a couple of species.  The first had beta = 1 and alpha = 2.  The second had the same beta, but a much more gradual slope in the species' response to the environmental gradient (alpha = 20).
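The original experiment was done in Excel; a rough Python stand-in might look like the following. The gradient (100 cells running from 0 to 10 with Gaussian noise, sd 0.5), the seed, and all function names are assumptions made for illustration, not values taken from the spreadsheet or the paper.

```python
import math
import random


def occurrence_probability(x, beta, alpha):
    """Logistic response with inflection point beta and slope parameter alpha."""
    y = (x - beta) / alpha
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def make_gradient(n_cells=100, low=0.0, high=10.0, noise_sd=0.5, seed=42):
    """Hypothetical environmental gradient: a linear ramp plus Gaussian noise."""
    rng = random.Random(seed)
    step = (high - low) / (n_cells - 1)
    return [low + i * step + rng.gauss(0.0, noise_sd) for i in range(n_cells)]


gradient = make_gradient()

# Two simulated species sharing beta = 1 but differing in alpha.
steep = [occurrence_probability(x, beta=1, alpha=2) for x in gradient]
gradual = [occurrence_probability(x, beta=1, alpha=20) for x in gradient]

print(f"alpha = 2:  P ranges from {min(steep):.2f} to {max(steep):.2f}")
print(f"alpha = 20: P ranges from {min(gradual):.2f} to {max(gradual):.2f}")
```

On the same gradient, the larger alpha squeezes the probabilities into a much narrower band around 0.5, which is the "gradual response" the paper is concerned with.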
Here, as I understand it, is their point.  The maximum possible performance of a model depends on the slope of the species' response to the environment.  If the slope is very steep, a perfect model will do a more or less perfect job of distinguishing presences from absences.  For instance, take a species with beta = 3 and alpha = 0.01, whose probability of occurrence jumps from essentially 0 to essentially 1 right around an environmental value of 3.
In this sort of system, there's little uncertainty in the expectation of presence or absence across the landscape: almost every grid cell has a near-certainty that you will find the species there, or that you won't.  Contrast that with the previous example (beta = 1, alpha = 5), in which there is a decent probability of finding the species in any grid cell (the minimum probability on that landscape is about as likely as getting heads in a single coin flip), but there's nowhere on the landscape where you're more or less guaranteed to find your species (an absence from the highest probability cell is still about as likely as getting heads three times in a row).

So what's the relevance to model evaluation?  Well, it comes down to the expected maximum performance.  In a situation with a threshold-like response to the environment (e.g., the beta = 3, alpha = 0.01 scenario), it is entirely possible for a decent model to do a great job of distinguishing presence from absence (or pseudoabsence) cells.  In a species that responds more gradually to the environmental predictor (e.g., beta = 1, alpha = 5), even the TRUE model does a mediocre job of telling you where you should and shouldn't expect to find your species.  This sets a limit on the maximum model performance we can expect, and when alpha is high that achievable maximum may be well below the theoretical maximum of the evaluation statistic being used (AUC, Kappa, sensitivity and specificity).  What it boils down to is, in the authors' words, "the same model that produces a poor prediction in terms of presences and absences may be recovering perfectly well the true probability of occurrence throughout the environment".
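One way to see that ceiling is to score the TRUE probabilities against presences and absences drawn from them. The sketch below does that with AUC, computed here as the chance that a randomly chosen presence cell has a higher score than a randomly chosen absence cell (ties count half). The gradient (1000 cells from 0 to 10 plus noise), the seed, and the function names are assumptions for illustration; the two parameter sets are the ones discussed above.

```python
import math
import random


def occurrence_probability(x, beta, alpha):
    """Logistic response with inflection point beta and slope parameter alpha."""
    y = (x - beta) / alpha
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def auc(scores, labels):
    """P(score of a random presence > score of a random absence), ties = 0.5."""
    pos = [s for s, present in zip(scores, labels) if present]
    neg = [s for s, present in zip(scores, labels) if not present]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def true_model_auc(beta, alpha, n_cells=1000, seed=1):
    """AUC obtained when the model's predictions ARE the true probabilities."""
    rng = random.Random(seed)
    # Hypothetical gradient: linear ramp from 0 to 10 plus Gaussian noise.
    gradient = [10.0 * i / (n_cells - 1) + rng.gauss(0.0, 0.5)
                for i in range(n_cells)]
    probs = [occurrence_probability(x, beta, alpha) for x in gradient]
    # Each cell is occupied with its own true probability.
    presences = [rng.random() < p for p in probs]
    return auc(probs, presences)


auc_steep = true_model_auc(beta=3, alpha=0.01)
auc_gradual = true_model_auc(beta=1, alpha=5)
print(f"threshold-like response: AUC of the true model = {auc_steep:.3f}")
print(f"gradual response:        AUC of the true model = {auc_gradual:.3f}")
```

Under these assumptions the threshold-like species should come out with an AUC close to 1, while the gradual-response species should land well below that, even though the model being scored is the truth in both cases.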

It has been remarked before that it is generally going to be more difficult to model habitat generalists than specialists, as their relationship to any one environmental variable will usually be more ambiguous than that of a specialist.  This is essentially putting that argument in the context of model evaluation, which I think is useful - rather than saying "we can't build good models of habitat generalist species", it's saying "we may be able to build good models of habitat generalists, but we can't expect them to perform as well on presence/absence data".


There's actually more to the paper than that, as they tie all of this to issues of species and sample prevalence.  I've got to get back to work right now, but maybe we'll revisit that side of it in a later post.

Author

Dan Warren is a postdoctoral researcher at UT Austin.

www.danwarren.net

 

