Code: Simplicity or Speed?

While delving into the details of this StackOverflow question

I am trying to generate a vector containing an increasing, reversed series such as
1,2,1,3,2,1,4,3,2,1,5,4,3,2,1.

various solutions arose. The simplest (in terms of number of keystrokes) was provided by user Henrik:

rev(sequence(5:1))

which is indeed an elegant and simple solution. However, it wasn't the fastest, as we will soon see.

As with many programming problems, there is often a trade-off between code simplicity and speed. One of the first lessons in R (especially if you are moving from other languages) is that it's often better to forgo the apparent simplicity of constructs like for loops in favour of optimised functions like the apply family. On the other hand, there are whole libraries (think dplyr and the tidyverse in general) whose primary aims include improving code readability.

With that in mind, let's get back to the StackOverflow example. In addition to Henrik's solution, the ever-present user akrun (with input from others) suggested:

unlist(lapply(1:5, ":", 1))

which is also a nice solution that requires a few more keystrokes but, in practice, runs faster.

The need for speed...

In trying to provide an alternative answer, I went back to basics, looking for a faster implementation. Coupled with what I've learnt while integrating C++ into my googleway package, I came up with a simple for loop written in Rcpp.

(And hopefully, as the loop is written in C++, all the for-loop-in-R haters will be appeased.)

library(Rcpp)

cppFunction('NumericVector reverseSequence(int maxValue, int vectorLength){
    NumericVector out(vectorLength);
    int counter = 0;

    // for each i, append the decreasing run i, i-1, ..., 1
    for(int i = 1; i <= maxValue; i++){
        for(int j = i; j > 0; j--){
            out[counter] = j;
            counter++;
        }
    }
    return out;
}')

maxValue <- 5
reverseSequence(maxValue, sum(1:maxValue))
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1

Rcpp provides methods that let you easily integrate R and C++, and its speed benefit became clear when I benchmarked it against the two R solutions. The looping C++ implementation is the fastest most of the time (median time of 1038µs, compared with 1901µs for akrun's and 4994µs for Henrik's).

library(microbenchmark)

maxValue <- 1000

microbenchmark(
    henrik = {
        rev(sequence(maxValue:1))
    },
    akrun = {
        unlist(lapply(1:maxValue, ":", 1))
    },
    symbolix = {
        reverseSequence(maxValue, sum(1:maxValue))
    }
)

# Unit: microseconds
#     expr      min       lq     mean   median       uq      max neval
#   henrik 3788.987 4567.422 7085.908 4993.793 5689.287 35355.34   100
#    akrun 1533.615 1723.819 3302.222 1900.983 2688.463 35944.15   100
# symbolix  502.540  663.786 2818.100 1037.945 1545.540 33808.83   100

Righto, so which one?

Back to the title of this blog: Simplicity or Speed? Well, I can't answer that for you; you'll have to decide whether the extra speed is worth the time spent designing a longer piece of code. In this case, with a maximum value of 1000, the difference is only a few milliseconds. But if we scale up towards one million, the impact is much larger.

Because we deal with big data, we often favour speed over code simplicity. I like watching my code tick away for a while, but it wears thin if every test takes an hour (or a day) to complete.

If you are still not sure, you can always refer to the repository of all programming wisdom, xkcd: