1 - Efficient Haversine Calculation

I'm a fan of data.table, and I'm a fan of StackOverflow. So when I see questions like How to efficiently calculate distance between pair of coordinates using data.table, my intrigue is naturally piqued.

The then-accepted answer used geosphere::distGeo() inside a data.table 'update-by-reference' function (:=). This seemed llike a good, logical approach (from my years of using data.table).

It was only when writing this answer a few days later that I started to look into the solution more deeply. For example, it seemed a bit unnecessary to have to convert the columns of the data.table into a matrix just to use geosphere::distHaversine() (you'll also see from a comment to my answer that using good ol' c() didn't work either).

So, I opened up distHaversine to have a look:

function (p1, p2, r = 6378137) 
    toRad <- pi/180
    p1 <- .pointsToMatrix(p1) * toRad
    p2 <- .pointsToMatrix(p2) * toRad
    p = cbind(p1[, 1], p1[, 2], p2[, 1], p2[, 2], as.vector(r))
    dLat <- p[, 4] - p[, 2]
    dLon <- p[, 3] - p[, 1]
    a <- sin(dLat/2) * sin(dLat/2) + cos(p[, 2]) * cos(p[, 4]) * 
        sin(dLon/2) * sin(dLon/2)
    dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * p[, 5]

Note that

  • The arguments p1 and p2 "can be a vector of two numbers, a matrix of 2 columns, or a SpatialPoints* object".

  • There are two calls to .pointsToMatrix(), which again convert the points into a matrix (plus a few other checks).

This seemed a tad unnecessary to me. And having used data.table on a daily basis for quite a while, I was used to just passing-in unquoted column names into my functions.

Simplifying the functions

The first thing I did was to write my own version (which is unashamedly a copy of geosphere, but without the matrix conversion).

dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  radians <- pi/180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)

A quick benchmark showed it to be a non-insignificant improvement (but more on the benchmarking to come). And as it turns out, all the geosphere::dist_ calculations use this .pointsToMatrix() function.

So, I re-wrote most of them without it, packaged it up, et voila, spatialdatatable is born.

The need for speed

To try and eek out even more speed when working on multiple rows of data at a time, I re-wrote the functions again in C++.*

And, if you don't believe me about the speed improvements, here's some benchmarks:

## install via github
# devtools::install_github("SymbolixAU/spatialdatatable")

n <- 100000
lats <- -90:90
lons <- -180:180
dt <- data.table::data.table(lat1 = sample(lats, size = n, replace = T),
            lon1 = sample(lons, size = n, replace = T),
            lat2 = sample(lats, size = n, replace = T),
            lon2 = sample(lons, size = n, replace = T))

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

m <- microbenchmark(
    sdt = { dt1[, dtDistance := dtHaversine(lat1, lon1, lat2, lon2)]  },
    geo1 = { dt2[, geoDistance := distHaversine(
       matrix(c(lon1, lat1), ncol = 2), 
       matrix(c(lon2, lat2), ncol = 2))]  },
    geo2 = { dt3[, geoDistance := distHaversine(dt3[, .(lon1, lat1)], dt3[, .(lon2, lat2)])] }

* A word of warning

  • This is still a development package, so use at your own risk
  • I have no idea what's causing that long tail in the benchmark, happy to hear any ideas
  • I'm not a C++ programmer