Learning Statistics as a Postgrad or: How I Learned to Stop Worrying and Love the Stats.

    … Or maybe it’s Stockholm Syndrome.

    Early in 2016 I started my PhD in quantitative ecology. For my Master’s project I had run what was effectively ‘black box’ code for single-species occupancy-detectability models; my PhD project was in the developing field of joint species distribution modelling (JSDMs), where the inner workings of said ‘black box’ are still being developed. I rapidly came to the conclusion that my expectation of this switch being “a bit tough” was about as accurate as John Howard’s bowling. I enjoyed the work I did during my Master’s and figured that if I was going to make the transition into a PhD I may as well jump in the deep end. Go big or go home, right? Well, after a couple of months of meetings with my crack team of supervisors discussing where my project was heading, I was feeling like Ed from Good Burger. Hierarchical modelling? Bayesian statistics? Markov Chain Monte Carlo (MCMC)? Sure, I’d heard those words before (I even knew that the model from my Master’s project was a hierarchical model), but I couldn’t have used them in a sentence.

Unless “I couldn’t use MCMC in a sentence” counts.

    At that point I decided my statistical knowledge needed an upgrade (or two or three). I was going to be running a series of models taking potentially days at a time (in one case, 16 days!), and since I couldn’t begin the next stage of my project without the results of these model runs I was going to have plenty of down time to learn statistics. I also started thinking about how lacking my maths and stats education had been at the university level (both under- and post-graduate). A little research and it turns out I’m not the only one who felt this way. In recent years there have been several discussions (by both students and academics) about the inadequacy of how statistics is taught to ecology students in a field that is increasingly quantitative (Ellison & Dennis 2010; Hobbs & Ogle 2011; Barraquand et al. 2014; Butcher et al. 2014; Touchon & McCoy 2016). These are excellent reads, but the full discussion is beyond the focus of this post, so I’ll keep the summary brief. A survey of early career researchers (primarily PhD and post-doc respondents) found that 75% felt they didn’t adequately understand the mathematical models they were using, and that they wanted more mathematics (90%) and statistics (95%) subjects in courses (Barraquand et al. 2014).

    This described me exactly. During my undergraduate degree, I took virtually every ecologically focussed subject available and zero maths/stats subjects. I followed the suggested subject list for students under the zoology/ecology banner, and this meant I did two first year chemistry subjects that I have not referred back to once in any subsequent class. Although I will admit that I do make heavy use of my lab safety glasses when woodworking, and the lab coat comes out of the closet when I mix high-speed rotary power tools and various cleaning fluids. My zoology/ecology subjects barely touched on statistics, and if the analysis went beyond a simple ANOVA or t-test we were given plug-and-play code scripts. Even my Master’s course required taking zero statistics subjects. At my supervisor’s suggestion, I did take a subject that covered some basic statistics in lectures (and even touched on some higher-level concepts like bootstrapping towards the end), but because the prac classes used plug-and-play scripts I felt like I was shown statistics but not taught statistics. So now that I had decided to learn stats myself, I spoke to some statistically-minded people in my lab about where to start (including some who had been in the same position as me), bought myself a stack of textbooks and began reading.

What I found was that not all the books worked for me. After reading a few I realised that there are two archetypes in statistical textbooks: teaching with examples and teaching by examples. Textbooks that teach with examples cover the theory first and then reinforce it with an example. Those that teach by example explain the theory in terms of an example. If this isn’t clear, think of it in terms of teaching someone how to play tic-tac-toe for the first time:

Teach with example

Rules:

X plays first
Players alternate placing their respective mark in an empty square
First to place three in a row (orthogonal or diagonal) wins

Example game:

[Image: an example game of tic-tac-toe]

Teach by example

  1. X goes first and places a mark in any square
  2. O then places a mark in any unoccupied square
  3. X places a second mark
  4. O places a mark to block X from connecting three in a row (which wins the game)
  5. X places a mark to open two winning subsequent moves
  6. O places a mark to block one winning move
  7. X places a mark to connect three and wins the game.

[Image: the example game of tic-tac-toe described in the steps above]

    Now, this is an overly simplified way of depicting the issue (statistics is not as simple as tic-tac-toe), and I can’t say I have ever seen anyone teach tic-tac-toe the second way, but I feel it serves to highlight the differences between the two approaches. It is the difference between teaching someone how to build a linear model (using y, x, β, etc) and then showing an example for estimating tree height based on measurements of diameter at breast height (dbh), and teaching someone how to build a linear model in terms of building one to estimate tree height. In the literature, the distinction between these two approaches is often not as clear cut as this example. Some books use both approaches at different times (likely due to multiple authors), and occasionally use a hybrid approach (building a linear model where y = tree height, x = dbh, but talk in terms of tree height = β * dbh, then follow up with a complete example).
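    To make the tree height example a little more concrete, here is a minimal sketch in Python of what the “teach with example” version might look like: the general recipe comes first (ordinary least squares for y = β0 + β1 * x), then the tree data are plugged in. The measurements are invented purely for illustration and chosen to lie exactly on a line, the intercept β0 is an addition to the simpler tree height = β * dbh form mentioned above, and the function names are hypothetical.

```python
def fit_linear_model(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical measurements (not real data): x = dbh in cm, y = height in m.
dbh_cm = [10.0, 15.0, 20.0, 25.0, 30.0]
height_m = [8.0, 11.0, 14.0, 17.0, 20.0]

intercept, slope = fit_linear_model(dbh_cm, height_m)


def predict_height(dbh):
    """Predicted tree height (m) for a given dbh (cm) under the fitted model."""
    return intercept + slope * dbh
```

    A “teach by example” book would instead start from the trees and only later reveal that the same few lines work for any x and y. Real field data would of course be noisy, so the fitted line would not pass through every point as it does here.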

    Personally, the teach with example method has worked better for me, but everyone is different, so I have put together a guide for anyone else in a position like mine who is trying to learn statistics as a postgrad. If you feel like you know which method works best for you, then stick to the books in that category below; if you want to find out for yourself, then pick up a book from each and see what works for you. Bear in mind that the distinction is not always clear, and some books have sections written closer to the other archetype.

Teach with example

    The books in this category tend to be books on “learning statistics” and, while generally social science heavy, use examples from a variety of disciplines to back up the theory. These books will cover the theory and then back it up with a worked example. The three books I would recommend for those who prefer to learn this way are the one-two punch of An Introduction to Statistical Learning by James et al. and The Elements of Statistical Learning by Hastie et al. (the former written as an entry point to the latter), followed by Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. Introduction provides a nice entry point which Elements expands upon. Data Analysis does have a lot of overlap with Elements, but is worth the read to really get to grips with some of the methods. Elements and Data Analysis were two of the books most commonly recommended to me. Assuming you follow this 1/2/3 order, you could stop after any book if you felt you had learnt enough for the work you do. If pressed for time, Elements provides the best overview of methods across the “difficulty spectrum”.


Teach by example

    The books in this category have been, in my experience, “Statistics for Ecologists”, and, as such, all the examples are ecological in nature. Whether this is happenstance, related to the fact that many ecologists don’t come from a statistical background and this method happens to work best for them, or just because I didn’t try any books that were “Statistics for Chemists/Psychologists/Magicians/etc” I can’t say. Generally, these books will cover the theory with a step-by-step ecological example (sometimes followed up with another). The books I would recommend here are Ecological Models and Data in R by Bolker, Models for Ecological Data (textbook and lab manual) by Clark, and How to be a Quantitative Ecologist by Matthiopoulos. The book by Clark was one of the books most frequently recommended to me (often by ecologists-cum-statisticians who mentioned it would have been an excellent resource for their postgrad work). Unlike the previous section I haven’t provided a suggested reading order here as there is no clear progression between books. You could read any, or as many, of these as you felt necessary. The book by Matthiopoulos is a good example of a hybrid approach as it straddles the line between the teach with example and teach by example approaches. As mentioned previously, this archetype did not work as well for me as the other. Flicking back through these suggested books while writing this post, though, after having read books from the other archetype, I find they are much clearer to me now.


Appendices

    Regardless of which of the above archetypes you adhere to, Models for Ecological Data by Clark and Quantitative Ecology and Evolutionary Biology by Ovaskainen et al. both have excellent maths and stats appendices that I have referred to on multiple occasions. These cover things like how to do matrix algebra and overviews of the different probability distributions ecologists would commonly use. The Ovaskainen et al. book is not something I have recommended in previous sections because it is not a book about learning statistics. Instead it provides an excellent overview of the different types of models commonly used in the fields of population ecology/community ecology/evolutionary biology. It also has an appendix specifically on building generalised linear mixed models.


Exercises

    While most of these books do end each chapter with a set of exercises, I would suggest that those in The Elements of Statistical Learning and the lab manual for Models for Ecological Data are arguably two of the better sets, although I’m guilty of not doing most of these myself (yet!). The exercises in the lab manual are written such that you don’t need to have the textbook to do them, which makes it valuable to people choosing books from either archetype pool. It does refer readers back to the specific chapters in the textbook if they are looking for more information on the theory, but any other book recommended here will provide the same information.

Where to next?

    So, you’ve now read all the books on learning statistics and have become a master statistician (or, like me, learnt enough to keep your head above water). What is the next step? Well, now it depends on your own work or interests. Want more of the same? I’ve heard good things about A Biologist’s Guide to Mathematical Modelling by Otto and Day, and The Theoretical Biologist’s Toolbox by Mangel (although I’ve yet to read either). Want more detail on the Bayesian methods touched on in some of the books? Then try Bayesian Methods for Ecology by McCarthy. Spatial distributions? I’ve been recommended Statistical Methods for Spatial Data Analysis by Schabenberger and Gotway, and Mapping Species Distributions by Franklin. Occupancy modelling? Then you want the occupancy bible that is Occupancy Estimation and Modelling by MacKenzie et al. As for me, my next port of call will be Applied Hierarchical Modelling in Ecology Volume 1 by Kéry and Royle, and I’ll shortly be picking up Statistics for Terrified Biologists by van Emden.

– David
