# spatial.data.table 1 - Efficient Haversine Calculation

I'm a fan of `data.table`

, and I'm a fan of StackOverflow. So when I see questions like How to efficiently calculate distance between pair of coordinates using data.table, my intrigue is naturally piqued.

The then-accepted answer used `geosphere::distGeo()`

inside a `data.table`

'update-by-reference' function (:=). This seemed llike a good, logical approach (from my years of using `data.table`

).

It was only when writing this answer a few days later that I started to look into the solution more deeply. For example, it seemed a bit unnecessary to have to convert the columns of the `data.table`

into a matrix just to use `geosphere::distHaversine()`

(you'll also see from a comment to my answer that using good ol' `c()`

didn't work either).

So, I opened up `distHaversine`

to have a look:

```
geosphere::distHaversine
function (p1, p2, r = 6378137)
{
toRad <- pi/180
p1 <- .pointsToMatrix(p1) * toRad
p2 <- .pointsToMatrix(p2) * toRad
p = cbind(p1[, 1], p1[, 2], p2[, 1], p2[, 2], as.vector(r))
dLat <- p[, 4] - p[, 2]
dLon <- p[, 3] - p[, 1]
a <- sin(dLat/2) * sin(dLat/2) + cos(p[, 2]) * cos(p[, 4]) *
sin(dLon/2) * sin(dLon/2)
dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * p[, 5]
return(as.vector(dist))
}
```

Note that

The arguments

`p1`

and`p2`

"can be a vector of two numbers, a matrix of 2 columns, or a SpatialPoints* object".There are two calls to

`.pointsToMatrix()`

, which again convert the points into a matrix (plus a few other checks).

This seemed a tad unnecessary to me. And having used `data.table`

on a daily basis for quite a while, I was used to just passing-in unquoted column names into my functions.

## Simplifying the functions

The first thing I did was to write my own version (which is unashamedly a copy of `geosphere`

, but without the matrix conversion).

```
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
radians <- pi/180
lat_to <- lat_to * radians
lat_from <- lat_from * radians
lon_to <- lon_to * radians
lon_from <- lon_from * radians
dLat <- (lat_to - lat_from)
dLon <- (lon_to - lon_from)
a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}
```

A quick benchmark showed it to be a non-insignificant improvement (but more on the benchmarking to come). And as it turns out, all the `geosphere::dist_`

calculations use this `.pointsToMatrix()`

function.

So, I re-wrote most of them without it, packaged it up, et voila, `spatialdatatable`

is born.

## The need for speed

To try and eek out even more speed when working on multiple rows of data at a time, I re-wrote the functions again in C++.*

And, if you don't believe me about the speed improvements, here's some benchmarks:

```
## install via github
# devtools::install_github("SymbolixAU/spatialdatatable")
library(spatialdatatable)
library(microbenchmark)
library(geosphere)
n <- 100000
set.seed(20170511)
lats <- -90:90
lons <- -180:180
dt <- data.table::data.table(lat1 = sample(lats, size = n, replace = T),
lon1 = sample(lons, size = n, replace = T),
lat2 = sample(lats, size = n, replace = T),
lon2 = sample(lons, size = n, replace = T))
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
m <- microbenchmark(
sdt = { dt1[, dtDistance := dtHaversine(lat1, lon1, lat2, lon2)] },
geo1 = { dt2[, geoDistance := distHaversine(
matrix(c(lon1, lat1), ncol = 2),
matrix(c(lon2, lat2), ncol = 2))] },
geo2 = { dt3[, geoDistance := distHaversine(dt3[, .(lon1, lat1)], dt3[, .(lon2, lat2)])] }
)
```

* A word of **warning**

- This is still a development package, so use at your own risk
- I have no idea what's causing that long tail in the benchmark, happy to hear any ideas
- I'm not a C++ programmer