Defensive Programming: How to (Help) Shield Your Code From Error

Michael Bertolacci

Centre for Environmental Informatics, UOW, Australia

2021-02-17

Who I am

A postdoc at the University of Wollongong, working on applied spatio-temporal statistical problems with Noel Cressie and Andrew Zammit-Mangion.

Before that, I was a PhD student at the University of Western Australia.

Before that, I was a software engineer for 10 years.

Assumed programming knowledge

  • You have programmed before.
  • You can read basic R or Python.

Motivation

A meme

There are several things that can (should?) keep a statistician from falling asleep at night.

One of those is Simpson’s paradox:

Another is having bugs in your code.

Story time

Reinhart, C, M.; Rogoff, K. S. (2010). Growth in a Time of Debt. American Economic Review. 100 (2): 573–78. doi:10.1257/aer.100.2.573.

Their conclusions led several governments to implement financial austerity programs.

A student reproducing the work found that they’d made an error in a formula in their Excel spreadsheet, invalidating some of their empirical claims.

Story time

My first research project…

Peer review - the crown jewel of science

We peer review each other’s work prior to publication.

  • Keeps us honest
  • Encourages rigour
  • Ensures the work is useful
  • Catches errors

But it generally only applies to the written part of the work. If the numbers reported look plausible, there’s (usually) an assumption that they are computed correctly.

Code is a blind spot for peer review.

How big is the blind spot?

It’s hard to be objective about this, but take two examples:

  • One (short) chapter of my PhD thesis:
    • Lines of LaTeX: ~300
    • Lines of code: ~1,400 - 4.5x more
  • Paper submitted based on a longer chapter:
    • Lines of LaTeX: ~1000
    • Lines of code: ~13,000 - 13x more

Bugs aren’t just a problem in academic work. Professional software developers invest huge effort into avoiding and fixing bugs.

Some gold standard strategies the best teams use:


Most academics don’t get to use these strategies…

There are some strategies that an individual programmer can use.

We’ll cover two:

  • Strategy 1: Make your code easier to understand
  • Strategy 2: Defensive programming

The main one we won’t cover is automated testing.

Every bug can be traced back to a wrong assumption.

The person who wrote the code made the assumption.
⇒ You wrote the code.
You caused the bug :(


What you need are ways to:

  • Make your assumptions more obvious so you avoid making wrong ones (Strategy 1).
  • Make it so you know when an assumption is violated so you can find and fix the error (Strategy 2).

Strategy 1: Make your code easier to understand

Code that is hard to read is hard to understand.

You should strive to make your code clean and readable.

This helps:

  • You now, when you debug the code
  • You in the future when you read the code
  • Everyone who reads the code

The DRY principle

Don’t Repeat Yourself.

\[ \textrm{RMS}(x_1, \ldots, x_n) = \sqrt{\frac{1}{n} \sum_{i = 1}^n x_i^2} \]

rms1 = np.sqrt(np.mean(
  np.square(error1)
))
rms2 = np.sqrt(np.mean(
  np.square(error2)
))
def rms(x):
    '''Return the root mean square of
    the vector x'''
    return np.sqrt(np.mean(
      np.square(x)
    ))


rms1 = rms(error1)
rms2 = rms(error2)

Follow a style guide

Style guides list rules to follow when laying out your code.

You should start from a publicly available guide.

For R, you can follow the Tidyverse style guide [1].

For Python, the standard is called PEP 8 [2].

You can tweak the guidelines to suit you, but not following some style guide is like writing a paper with inconsistent grammar, or using two different citation styles in the same document (Bertolacci, 2020).

R style example

Non-compliant:

y <- x*2
if(y > 2) {
  z <- y + 2
}
sqrt_z <- sqrt (z)

Compliant:

y <- x * 2
if (y > 2) {
  z <- y + 2
}
sqrt_z <- sqrt(z)

The R package lintr [3] can check your code automatically:

example.R:1:7: style: Put spaces around all infix operators.
y <- x*2
     ~^~
example.R:2:3: style: Place a space before left parenthesis, except in a function call.
if(y > 2) {
  ^
example.R:5:11: style: Remove spaces before the left parenthesis in a function call.
sqrt_z <- sqrt (z)

Python style example

Non-compliant:

y = x*2
if y> 2:
   z = y + 2
sqrt_z = sqrt (z)

Compliant:

y = x * 2
if y > 2:
    z = y + 2
sqrt_z = sqrt(z)

The program pycodestyle [4] checks if you are following the rules:

example.py:2:5: E225 missing whitespace around operator
example.py:3:4: E111 indentation is not a multiple of four
example.py:4:14: E211 whitespace before '('

Sometimes you knew what the code does when you wrote it, but a few months later…

loc<-c(137.5, -4.7)
loc2<-c(77.5, 18.1)
dat<-read.csv('locations.csv')
d=2*3389.5*asin(sqrt(
sin((dat$lat*pi/180-loc[2]*pi/180)/2)^2
+cos(dat$lat*pi/180)*cos(loc[2]*pi/180)
*sin((dat$lon*pi/180-loc[1]*pi/180)/2)^2))
d2=2*3389.5*asin(sqrt(
sin((dat$lat*pi/180-loc2[2]*pi/180)/2)^2
+cos(dat$lat*pi/180)*cos(loc2[2]*pi/180)
*sin((dat$lon*pi/180-loc2[1]*pi/180)/2)^2))
#dat$nrst=-1
dat$nrst=0
for(i in 1:nrow(dat)){
if(d[i]<d2[i])
dat$nrst[i]<-1
else
dat$nrst[i]<-2
}
MARS_RADIUS <- 3389.5  # in kilometres

hav_deg <- function(x) sin(x * pi / 360) ^ 2
cos_deg <- function(x) cos(x * pi / 180)
mars_haversine_dist <- function(locations, origin) {
  with(locations, {
    2 * MARS_RADIUS * asin(sqrt(
      hav_deg(latitude - origin['latitude'])
      + cos_deg(latitude) * cos_deg(origin['latitude'])
      * hav_deg(longitude - origin['longitude'])
    ))
  })
}

curiosity_location <- c(longitude = 137.5, latitude = -4.7)
perseverance_location <- c(longitude = 77.5, latitude = 18.1)

locations <- read.csv('locations.csv')
curiousity_dist <- mars_haversine_dist(locations, curiosity_location)
perseverance_dist <- mars_haversine_dist(locations, perseverance_location)

locations$nearest_rover <- 'none'
for (i in 1 : nrow(locations)) {
  if (curiousity_dist[i] < perseverance_dist[i]) {
    locations$nearest_rover[i] <- 'curiousity'
  } else {
    locations$nearest_rover[i] <- 'perseverance'
  }
}

More clean code tips

  • Code for correctness and readability first.
  • Comments are great, but it’s better if the code is obvious enough not to need comments.
  • Shorter isn’t necessarily better—short programs don’t run any faster!
  • Tab completion in your editor means you don’t have to type longer function and variable names every time.
  • Picking good names for things is hard, but it’s worth the effort.

Strategy 2: Defensive programming…

…or, dead programs tell no lies [5].

A tale of two errors…

n <- length(y)
sigma <- 1
log_likelihood <- (
  - n * log(2 * pi) / 2
  - n * log(sigma)
  - sum(y ^ 2 / sigma ^ 2) / 2
)
bic <- log(n) - 2 * log_likelihood
print(bic)
## [1] NA

No explicit error, but you probably don’t want that value to be NA. What’s the root cause?

dat <- data.frame(x = 1 : 10)
dat$y <- sin(df$x)
fit <- lm(y ~ x, data = dat)
print(coef(fit))
plot(dat$x, fitted(dat$y))
## Error in df$x: object of type
##   'closure' is not subsettable

Explicit error; annoying, but offending line is obvious.

Dead programs tell no lies

Defensive programming can be used when you’re aware you’re making an assumption. The idea is to intentionally throw an error when an assumption found to be false.

Most languages have functions to help you do this.

This way, you don’t have to search for the offending line.

Example: the exponential covariance function,

\[ C(\mathbf{x}, \mathbf{x}') \equiv \exp\left( -\frac{||\mathbf{x} - \mathbf{x}'||}{l} \right) \]

where \(l > 0\).

In R, you can use the stopifnot function:

# Given a vector or matrix x, return the covariance matrix based on the
# exponential covariance function
cov_exponential <- function(x, length_scale) {
  stopifnot(length_scale > 0)
  exp(-fields::rdist(x) / length_scale)
}

cov_exponential(y, -1)
## Error in cov_exponential(y, -1): length_scale > 0 is not TRUE

In Python, you can use the assert keyword:

import numpy as np

def cov_exponential(x, length_scale):
    '''Given a vector or matrix x, return the covariance matrix based on the
    exponential covariance function'''
    assert length_scale > 0, 'length_scale must be positive'
    return np.exp(-np.linalg.norm(x) / length_scale)

cov_exponential(y, -1)
Traceback (most recent call last):
  File "example.py", line 7, in <module>
    cov_exponential(y, -1)
  File "example.py", line 4, in cov_exponential
    assert length_scale > 0, 'length_scale must be positive'
AssertionError: length_scale must be positive

Back to the tale of two errors

stopifnot(all(!is.na(y)))
n <- length(y)
sigma <- 1
log_likelihood <- (
  - n * log(2 * pi) / 2
  - n * log(sigma)
  - sum(y ^ 2 / sigma ^ 2) / 2
)
bic <- log(n) - 2 * log_likelihood
print(bic)
## Error: all(!is.na(y)) is not TRUE

More defensive programming tips

  • You won’t always know what assumptions you’re making implicitly immediately.
  • But! when your code breaks, add a stopifnot or assert to catch it next time.
  • Don’t feel the need to check everything, just the most important parts.
  • In R, the assertthat package [6] can help.

Other strategies

  • Automated testing. See:
    • Wikipedia on unit testing [7]
    • The testthat package in R [8]
    • The pytest framework for Python [9]
  • Not reinventing the wheel. If a mature package has a function you need, use that rather than write your own.
  • Internal peer review:
    • Can a coauthor or a friend read your code for you?
  • Automated checking for common bugs:
    • The pylint program can do this for Python [10].
    • The lintr package does a little bit of this [3].

Conclusion

Bugs can taint otherwise great research, and peer review (probably) won’t catch them.

Some easy-to-follow strategies can help you avoid some bugs. Practice makes perfect!

Sadly, though, there will always be more bugs. If you find a way to avoid them all, please tell me!

Thanks everyone!

Resources

[1] Wickham, H., (2021). The tidyverse Style Guide. Published online at https://style.tidyverse.org/.
[2] van Rossum, G., Warsaw, B., and Coghlan, N. PEP 8—Style Guide for Python Code. https://www.python.org/dev/peps/pep-0008/, last accessed 2021-02-16.
[3] Hester, J., Angly, F., and Hyde, R. (2020). lintr: A ‘Linter’ for R Code. R package version 2.0.1. https://CRAN.R-project.org/package=lintr.
[4] Rocholl, J. and Lee, I., and other contributors. pycodestyle—Python Style Guide Checker. https://pypi.org/project/pycodestyle/.
[5] Thomas, D., Hunt, A. (2019). The Pragmatic Programmer: Your Journey to Mastery (20th ed.). Addison-Wesley Professional, Boston, MA.
[6] Wickham, H. (2019). assertthat: Easy Pre and Post Assertions. R package version 0.2.1. https://CRAN.R-project.org/package=assertthat.
[7] Wikipedia contributors. (2021, January 23). Unit Testing. In Wikipedia, The Free Encyclopedia. Retrieved 00:49, February 17, 2021, from https://en.wikipedia.org/w/index.php?title=Unit_testing&oldid=1002146431
[8] Wickham, H., Bryan, J. (2021). R Packages (2nd ed.). Published online at https://r-pkgs.org/index.html.
[9] Krekel, H., and other contributors, (2021). pytest. https://docs.pytest.org/en/stable/.
[10] Logilab, (2021). pylint. https://www.pylint.org/.