# spatial.data.table 1 - Efficient Haversine Calculation

I'm a fan of `data.table`, and I'm a fan of StackOverflow. So when I see questions like How to efficiently calculate distance between pair of coordinates using data.table, my intrigue is naturally piqued.

The then-accepted answer used `geosphere::distGeo()` inside a `data.table` 'update-by-reference' function (:=). This seemed llike a good, logical approach (from my years of using `data.table`).

It was only when writing this answer a few days later that I started to look into the solution more deeply. For example, it seemed a bit unnecessary to have to convert the columns of the `data.table` into a matrix just to use `geosphere::distHaversine()` (you'll also see from a comment to my answer that using good ol' `c()` didn't work either).

So, I opened up `distHaversine` to have a look:

``````geosphere::distHaversine
function (p1, p2, r = 6378137)
{
p = cbind(p1[, 1], p1[, 2], p2[, 1], p2[, 2], as.vector(r))
dLat <- p[, 4] - p[, 2]
dLon <- p[, 3] - p[, 1]
a <- sin(dLat/2) * sin(dLat/2) + cos(p[, 2]) * cos(p[, 4]) *
sin(dLon/2) * sin(dLon/2)
dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * p[, 5]
return(as.vector(dist))
}``````

Note that

• The arguments `p1` and `p2` "can be a vector of two numbers, a matrix of 2 columns, or a SpatialPoints* object".

• There are two calls to `.pointsToMatrix()`, which again convert the points into a matrix (plus a few other checks).

This seemed a tad unnecessary to me. And having used `data.table` on a daily basis for quite a while, I was used to just passing-in unquoted column names into my functions.

## Simplifying the functions

The first thing I did was to write my own version (which is unashamedly a copy of `geosphere`, but without the matrix conversion).

``````dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
dLat <- (lat_to - lat_from)
dLon <- (lon_to - lon_from)
a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}``````

A quick benchmark showed it to be a non-insignificant improvement (but more on the benchmarking to come). And as it turns out, all the `geosphere::dist_` calculations use this `.pointsToMatrix()` function.

So, I re-wrote most of them without it, packaged it up, et voila, `spatialdatatable` is born.

## The need for speed

To try and eek out even more speed when working on multiple rows of data at a time, I re-wrote the functions again in C++.*

And, if you don't believe me about the speed improvements, here's some benchmarks:

``````## install via github
# devtools::install_github("SymbolixAU/spatialdatatable")
library(spatialdatatable)
library(microbenchmark)
library(geosphere)

n <- 100000
set.seed(20170511)
lats <- -90:90
lons <- -180:180
dt <- data.table::data.table(lat1 = sample(lats, size = n, replace = T),
lon1 = sample(lons, size = n, replace = T),
lat2 = sample(lats, size = n, replace = T),
lon2 = sample(lons, size = n, replace = T))

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

m <- microbenchmark(
sdt = { dt1[, dtDistance := dtHaversine(lat1, lon1, lat2, lon2)]  },
geo1 = { dt2[, geoDistance := distHaversine(
matrix(c(lon1, lat1), ncol = 2),
matrix(c(lon2, lat2), ncol = 2))]  },
geo2 = { dt3[, geoDistance := distHaversine(dt3[, .(lon1, lat1)], dt3[, .(lon2, lat2)])] }
)``````

* A word of warning

• This is still a development package, so use at your own risk
• I have no idea what's causing that long tail in the benchmark, happy to hear any ideas
• I'm not a C++ programmer