I recently pushed version 0.2.0 of my fromo package to CRAN. This package implements (relatively) fast, numerically robust computation of moments via Rcpp.

The big changes in this release are:

  • Support for weighted moment estimation.
  • Computation of running moments over windows defined by time (or some other increasing index), rather than vector index.
  • Some modest improvements in speed for the 'dangerous' use cases (no checking for NA, no weights, etc.)

The time-based running moments are supported via the t_running_* operations, and we support means, standard deviation, skew, kurtosis, centered and standardized moments and cumulants, z-score, Sharpe, and t-stat. The idea is that your observations are associated with some increasing index, which you can think of as the observation time, and you wish to compute moments over a fixed time window. To bloat the API, the times from which you 'look back' can optionally be something other than the time indices of the input, so the input and output size can be different.

Some example uses might be:

  • Compute the volatility of an asset's returns over the previous 6 months, on every trade day.
  • Compute the total monthly sales of a company at month ends.

Because the API also allows you to use weights as implicit time deltas, you can also do weird and unadvisable things like compute the Sharpe of an asset over the last 1 million shares traded.

Speed improvements come from my random walk through c++ design idioms. I also implemented a 'swap' procedure for the running standard deviation which incorporates a Welford's method addition and removal into a single step. I do not believe that Welford's method is the fastest algorithm for a summarizing moment computation: probably a two pass solution to compute the mean first, then the centered moments is faster. However, for the case of running moments computations, Welford's method probably is the fastest.

Here is an example of the speedups seen in the 0.2.0 release. Again, I am cherry picking the running standard deviation computation, but I believe most methods have seen at least some modest improvements in speed. I compute the base sd of the data just a baseline to compare times in two different versions. Here are timings under the 0.2.0 code:

options(width=180)
options(digits=2)
library(fromo)
library(microbenchmark)
set.seed(1234)
x <- rnorm(1e5)
fromo_running_sd <- function(x) {
  running_sd3(x,window=1000,na_rm=FALSE,restart_period=10000L)
}
gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 5.2e+06  279    9.7e+06  516  9.7e+06  516
Vcells 5.9e+07  451    2.8e+08 2146  8.9e+08 6816
# compute sd(x) as a reference 
microbenchmark(sd(x),
               fromo_running_sd(x),
               times=1000L)
Unit: microseconds
                expr  min   lq mean median   uq   max neval cld
               sd(x)  304  308  331    321  339   501  1000  a 
 fromo_running_sd(x) 1409 1448 1637   1502 1688 32200  1000   b

Under the old 0.1.3 code:

Unit: microseconds
                expr  min   lq mean median   uq   max neval cld
               sd(x)  457  473  498    488  508   767  1000  a 
 fromo_running_sd(x) 4183 4523 5168   4730 5623 66529  1000   b

I would call this a win, except you might interpret it as meaning the old code was crap. Also, I worry about the maximum time taken, which suggests some kind of lurking boo-boo.