A postdoc at the University of Wollongong, working on applied spatio-temporal statistical problems with Noel Cressie and Andrew Zammit-Mangion.
Before that, I was a PhD student at the University of Western Australia.
Before that, I was a software engineer for 10 years.
Assumed programming knowledge
You have programmed before.
You can read basic R or Python.
If you’re at an earlier stage and want to learn more about these, see Danielle Navarro’s great textbook [1] or the R for Data Science book [2].
⚠️ Warning ⚠️
This tutorial is rather informal and unscholarly. It is full of opinions and observations from my own experience.
The hope is to provoke discussion on good coding practices within this community.
Please share your thoughts, opinions, rants, etc. in Slack at any time!
We will stop for questions and discussion a few times.
Motivation
There are several things that can (should?) keep a statistician from falling asleep at night.
One of those is Simpson’s paradox.
Another is having bugs 🐞 in your code.
Story time 📊 🐞
Reinhart, C. M.; Rogoff, K. S. (2010). Growth in a Time of Debt. American Economic Review, 100(2): 573–578. doi:10.1257/aer.100.2.573.
Their conclusions led several governments to implement financial austerity programs.
A student reproducing the work found that they’d made an error in a formula in their Excel spreadsheet, invalidating some of their empirical claims.
Story time
My first research project…
Your stories and a poll
If you’ve had a similar experience, feel free to share it in the Slack chat and we can discuss it a bit later.
🗳️ Now let’s just run a quick poll… 🗳️
Peer review: the crown jewel of science 👑
We peer review each other’s work prior to publication.
Keeps us honest
Encourages rigour
Ensures the work is useful
Catches errors
But it generally only applies to the written part of the work. If the numbers reported look plausible, there’s (usually) an assumption that they are computed correctly.
Code is a blind spot for peer review.
How big is the blind spot?
It’s hard to be objective about this, but take two examples:
One (short) chapter of my PhD thesis:
Lines of LaTeX: ~300
Lines of code: ~1,400 (~4.5x more)
Paper submitted based on a longer chapter:
Lines of LaTeX: ~1,000
Lines of code: ~13,000 (~13x more)
Bugs aren’t just a problem in academic work. Professional software developers invest huge effort into avoiding and fixing bugs.
Some gold standard strategies the best teams use:
Peer review: software is not released until the code has been read by multiple people
You can tweak the guidelines to suit you, but not following some style guide is like writing a paper with inconsistent grammar, or using two different citation styles in the same document (Bertolacci, 2020).
R style example
Non-compliant:
y <- x*2
if(y > 2) {
  z <- y + 2
}
sqrt_z <- sqrt (z)
Compliant:
y <- x * 2
if (y > 2) {
  z <- y + 2
}
sqrt_z <- sqrt(z)
The R package lintr [5] can check your code automatically:
example.R:1:7: style: Put spaces around all infix operators.
y <- x*2
~^~
example.R:2:3: style: Place a space before left parenthesis, except in a function call.
if(y > 2) {
^
example.R:5:11: style: Remove spaces before the left parenthesis in a function call.
sqrt_z <- sqrt (z)
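To run the check yourself, save the snippet as example.R and point lint() at it; output like the above is what lintr prints for the non-compliant version.

# Check a single file; lintr prints one message per style issue it finds
lintr::lint('example.R')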
Python style example
Non-compliant:
y = x*2
if y>2:
   z = y + 2
sqrt_z = sqrt (z)
Compliant:
y = x * 2
if y > 2:
    z = y + 2
sqrt_z = sqrt(z)
The program pycodestyle [6] checks if you are following the rules:
example.py:2:5: E225 missing whitespace around operator
example.py:3:4: E111 indentation is not a multiple of four
example.py:4:14: E211 whitespace before '('
Sometimes you knew what the code does when you wrote it, but a few months later…
MARS_RADIUS <- 3389.5  # in kilometres

hav_deg <- function(x) sin(x * pi / 360) ^ 2
cos_deg <- function(x) cos(x * pi / 180)

mars_haversine_dist <- function(locations, origin) {
  with(locations, {
    2 * MARS_RADIUS * asin(sqrt(
      hav_deg(latitude - origin['latitude'])
      + cos_deg(latitude) * cos_deg(origin['latitude'])
        * hav_deg(longitude - origin['longitude'])
    ))
  })
}

curiosity_location <- c(longitude = 137.5, latitude = -4.7)
perseverance_location <- c(longitude = 77.5, latitude = 18.1)

locations <- read.csv('locations.csv')
curiousity_dist <- mars_haversine_dist(locations, curiosity_location)
perseverance_dist <- mars_haversine_dist(locations, perseverance_location)

locations$nearest_rover <- 'none'
for (i in 1:nrow(locations)) {
  if (curiousity_dist[i] < perseverance_dist[i]) {
    locations$nearest_rover[i] <- 'curiousity'
  } else {
    locations$nearest_rover[i] <- 'perseverance'
  }
}
Comments
A rule of thumb: Comments in code should say what your code does, not explain how.
Unhelpful:
mars_haversine_dist <- function(
  locations, origin
) {
  # Take two times the radius times
  # the arcsine of the square root
  # of ...
  ...
}
Helpful:
# Calculate the distance in
# kilometres between two coordinates
# on Mars
mars_haversine_dist <- function(
  locations, origin
) {
  ...
}
The code should be clean enough that the how is obvious.
(I break this rule around 10% of the time.)
Long functions
A rule of thumb: functions should fit on one or two screenfuls.
The maximum length of a function is inversely proportional to the complexity and indentation level of that function. So, if you have a conceptually simple function that is just one long (but simple) case-statement, where you have to do lots of small things for a lot of different cases, it’s OK to have a longer function. - Linus Torvalds [7]
(I break this rule around 10% of the time.)
Long files
You should also organise your code into multiple well-named files.
A rule of thumb: code files should not be too long, say no longer than 1000-2000 lines.
(I break this rule less often.)
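As a rough sketch of what this can look like (the file names here are made up), reusable functions live in their own well-named files and the analysis script just source()s them:

# analysis.R: stays short because the functions live elsewhere
source('haversine.R')     # mars_haversine_dist() and its helpers
source('covariance.R')    # cov_exponential() and other covariance functions

locations <- read.csv('locations.csv')
distances <- mars_haversine_dist(locations, c(longitude = 137.5, latitude = -4.7))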
More clean code tips
Code for correctness and readability first.
Shorter isn’t necessarily better—short programs don’t run any faster!
Tab completion in your editor means you don’t have to type longer function and variable names every time.
Picking good names for things is hard, but it’s worth the effort.
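A tiny illustration of that last point, with made-up data: the two subsets below are identical, but only one of the names tells you what it holds.

locations <- data.frame(site = c('a', 'b', 'c'), year = c(2010, 2013, 2015))

d2 <- locations[locations$year > 2012, ]                # what is d2?
recent_locations <- locations[locations$year > 2012, ]  # self-explanatory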
Discussion on clean code 🗣️
Questions? Comments? Disagreements?
Feel free to add your own clean code tips to Slack!
n <- length(y)
sigma <- 1
log_likelihood <- (
  - n * log(2 * pi) / 2
  - n * log(sigma)
  - sum(y ^ 2 / sigma ^ 2) / 2
)
bic <- log(n) - 2 * log_likelihood
print(bic)
## [1] NA
No explicit error, but you probably don’t want that value to be NA. What’s the root cause?
dat <- data.frame(x = 1:10)
dat$y <- sin(df$x)
fit <- lm(y ~ x, data = dat)
print(coef(fit))
plot(dat$x, fitted(dat$y))
## Error in df$x: object of type
## 'closure' is not subsettable
Explicit error; annoying, but the offending line is obvious.
Dead programs tell no lies
Defensive programming can be used when you’re aware you’re making an assumption. The idea is to intentionally throw an error when an assumption is found to be false.
Most languages have functions to help you do this.
This way, you don’t have to search for the offending line.
# Given a vector or matrix x, return the covariance matrix based on the
# exponential covariance function
cov_exponential <- function(x, length_scale) {
  stopifnot(length_scale > 0)
  exp(-fields::rdist(x) / length_scale)
}

cov_exponential(y, -1)
## Error in cov_exponential(y, -1): length_scale > 0 is not TRUE
In Python, you can use the assert keyword:
import numpy as np
def cov_exponential(x, length_scale):
    '''Given a vector or matrix x, return the covariance matrix based on the exponential covariance function'''
    assert length_scale > 0, 'length_scale must be positive'
    return np.exp(-np.linalg.norm(x) / length_scale)

cov_exponential(y, -1)
Traceback (most recent call last):
File "example.py", line 7, in <module>
cov_exponential(y, -1)
File "example.py", line 4, in cov_exponential
assert length_scale > 0, 'length_scale must be positive'
AssertionError: length_scale must be positive
Back to the tale of two errors
stopifnot(all(!is.na(y)))

n <- length(y)
sigma <- 1
log_likelihood <- (
  - n * log(2 * pi) / 2
  - n * log(sigma)
  - sum(y ^ 2 / sigma ^ 2) / 2
)
bic <- log(n) - 2 * log_likelihood
print(bic)
## Error: all(!is.na(y)) is not TRUE
More defensive programming tips
You won’t always know immediately what assumptions you’re implicitly making.
But! when your code breaks, add a stopifnot or assert to catch it next time.
Don’t feel the need to check everything, just the most important parts.
Feel free to add your own defensive programming tips to Slack!
Or raise your hand in the Zoom chat. ✋
(But note that we will talk about a related topic, automated testing, very soon.)
Strategy 3: Automated testing
Defensive programming actively tests your assumptions when the code is running.
Automated testing goes one step further: you write code that calls your code, actively probing for problems.
add_one.py:
def add_one(x):
    '''Returns the argument, plus one'''
    return x + 1.1
test_add_one.py:
from add_one import add_one

def test_add_one():
    assert add_one(1) == 2
Running the tests:
======================= test session starts ========================
collected 1 item
test_add_one.py F [100%]
============================= FAILURES =============================
___________________________ test_add_one ___________________________
def test_add_one():
> assert add_one(1) == 2
E assert 2.1 == 2
E + where 2.1 = add_one(1)
test_add_one.py:4: AssertionError
===================== short test summary info ======================
FAILED test_add_one.py::test_add_one - assert 2.1 == 2
=================== 1 failed, 0 passed in 0.15s ====================
What makes for a good set of tests?
Each test tests one thing at a time
Tests are short and simple; don’t test too many things
Boundary conditions are tested
All code lines are explored
def max_one(x):
    '''Return the maximum between the argument and one'''
    if x < 1:
        return 1
    else:
        return x

def test_max_one():
    assert max_one(0) == 1  # First code path
    assert max_one(1) == 1  # Boundary condition
    assert max_one(2) == 2  # Second code path
covariance.py:
import numpy as np

def cov_exponential(x, length_scale):
    '''Given a vector or matrix x, return the covariance matrix based on the exponential covariance function'''
    assert length_scale > 0, 'length_scale must be positive'
    return np.exp(-np.linalg.norm(x) / length_scale)
For R code, a guide to setting up automated tests with the testthat package is in the R packages book [13].
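As a sketch of what the same idea looks like in R, using testthat and assuming the R version of cov_exponential() from the defensive programming example is loaded:

library(testthat)

test_that('cov_exponential rejects non-positive length scales', {
  expect_error(cov_exponential(1:5, length_scale = -1))
})

test_that('the covariance of a point with itself is one', {
  expect_equal(diag(cov_exponential(1:5, length_scale = 2)), rep(1, 5))
})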
Testing and version control
You may have seen badges on GitHub repositories that report a project’s build/test status and code coverage.
Tests are automatically run when changes to the software are published. The code coverage metric counts how many lines of code in the software are tested.
This is beyond what most small research projects need, but it’s nice to have!
How to make the most of automated testing
A few thoughts on automated testing:
Automated testing is helpful when your code is already organised into functions (or packages).
If a function is hard to test, it might be too complicated anyway. Break it up into smaller functions.
If you find a bug, write a test for the bug that fails, THEN fix the bug
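A hypothetical sketch of that last tip in R, using testthat and a made-up helper standardise(): suppose you discover it returns NaN whenever its input is constant. First write a test that reproduces the bug and fails, then fix the function, and keep the test so the bug can’t quietly return.

library(testthat)

# Buggy version: sd(x) is zero for constant input, so this divides by zero
standardise <- function(x) (x - mean(x)) / sd(x)

# This test fails until the bug is fixed (e.g. by returning zeros when sd(x) == 0)
test_that('standardise handles constant input', {
  expect_false(any(is.nan(standardise(rep(1, 5)))))
})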