I recently pushed version 0.2.0 of my fromo
package to
CRAN.
This package implements (relatively) fast, numerically robust
computation of moments via Rcpp
.
The big changes in this release are:
- Support for weighted moment estimation.
- Computation of running moments over windows defined by time (or some other increasing index), rather than vector index.
- Some modest improvements in speed for the 'dangerous'
use cases (no checking for
NA
, no weights, etc.)
The time-based running moments are supported via the t_running_*
operations,
and we support means, standard deviation, skew, kurtosis, centered and
standardized moments and cumulants, z-score, Sharpe, and t-stat. The
idea is that your observations are associated with some increasing
index, which you can think of as the observation time, and you wish
to compute moments over a fixed time window. To bloat the API, the
times from which you 'look back' can optionally be something other
than the time indices of the input, so the input and output size
can be different.
Some example uses might be:
- Compute the volatility of an asset's returns over the previous 6 months, on every trade day.
- Compute the total monthly sales of a company at month ends.
Because the API also allows you to use weights as implicit time deltas, you can also do weird and unadvisable things like compute the Sharpe of an asset over the last 1 million shares traded.
Speed improvements come from my random walk through c++ design idioms. I also implemented a 'swap' procedure for the running standard deviation which incorporates a Welford's method addition and removal into a single step. I do not believe that Welford's method is the fastest algorithm for a summarizing moment computation: probably a two pass solution to compute the mean first, then the centered moments is faster. However, for the case of running moments computations, Welford's method probably is the fastest.
Here is an example of the speedups seen in the 0.2.0 release. Again,
I am cherry picking the running standard deviation computation, but I
believe most methods have seen at least some modest improvements in speed.
I compute the base sd
of the data just a baseline to compare times
in two different versions. Here are timings under the 0.2.0 code:
options(width=180)
options(digits=2)
library(fromo)
library(microbenchmark)
set.seed(1234)
x <- rnorm(1e5)
fromo_running_sd <- function(x) {
running_sd3(x,window=1000,na_rm=FALSE,restart_period=10000L)
}
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 5.2e+06 279 9.7e+06 516 9.7e+06 516
Vcells 5.9e+07 451 2.8e+08 2146 8.9e+08 6816
# compute sd(x) as a reference
microbenchmark(sd(x),
fromo_running_sd(x),
times=1000L)
Unit: microseconds
expr min lq mean median uq max neval cld
sd(x) 304 308 331 321 339 501 1000 a
fromo_running_sd(x) 1409 1448 1637 1502 1688 32200 1000 b
Under the old 0.1.3 code:
Unit: microseconds
expr min lq mean median uq max neval cld
sd(x) 457 473 498 488 508 767 1000 a
fromo_running_sd(x) 4183 4523 5168 4730 5623 66529 1000 b
I would call this a win, except you might interpret it as meaning the old code was crap. Also, I worry about the maximum time taken, which suggests some kind of lurking boo-boo.