I recently released a package to CRAN called madness. The eponymous object supports 'multivariate' automatic differentiation by forward accumulation. By 'multivariate', I mean it allows you to track (and automatically computes) the derivative of a scalar, or vector, or matrix, or multidimensional array with respect to a scalar, vector, matrix or multidimensional array.

The primary use case in mind is the multivariate delta method, where one has an estimate of a population quantity and the variance-covariance of the same, and wants to perform inference on some transform of that population quantity. With the values stored in a madness object, one merely performs the transforms directly on the estimate, and the derivatives are computed automatically. A secondary use case would be for the automatic computation of gradients when optimizing some complex function, e.g. in the computation of the MLE of some quantity.

A madness object contains a value, val, as well as the derivative of val with respect to some \(X\), called dvdx. The derivative is stored as a matrix in 'numerator layout' convention: if val holds \(m\) values, and \(X\) holds \(n\) values, then dvdx is a \(m \times n\) matrix. This unfortunately means that a gradient is stored as a row vector. Numerator layout feels more natural (to me, at least) when propagating derivatives via the chain rule.

For convenience, one can also store the 'tags' of the value and \(X\), in vtag and xtag, respectively. The vtag will be modified when computations are performed, which can be useful for debugging. One can also store the variance-covariance matrix of \(X\) in varx.

Here is an example session showing the use of a madness object. Note that by default if one does not feed in dvdx, the object constructor assumes that the value is equal to \(X\), and so instantiates dvdx as an identity matrix. Here I will instantiate a square matrix of strictly positive elements, and stuff it into a madness object:

X <- matrix(runif(9,min=0.5,max=1.5),ncol=3)
Xmad <- madness(X)
## class: madness 
##         d X
##  calc: ----- 
##         d X
##   val: 0.787578 1.38302 1.02811 ...
##  dvdx: 1 0 0 0 0 0 0 0 0 ...
##  varx:  ...

The display is a bit odd: only a subset of the val and dvdx are shown, to stop screen flyby. (I would love to make this better; if you can improve it, submit a PR.) At the top, you see that we have computed \(\mathrm{d}X/\mathrm{d}X\). The derivative is indeed the identity. You can get the value and derivative from a madness object by the val and dvdx 'getter' methods.

Now take the log of each element, and compute the Frobenius norm of that:

fval <- norm(log(Xmad),'F')
## class: madness 
##         d norm(log(X), 'f')
##  calc: --------------------- 
##                 d X
##   val: 0.921251 ...
##  dvdx: -0.329118 0.213445 -0.113967 0.254506 0.275026 -1.20564 0.0292646 0.258069 0.05178 ...
##  varx:  ...

You see the scalar value, and the derivative with respect to the original matrix X. Note also that the vtag is a log of the methods called on X, again principally for debugging.

I implemented many of the common methods on madness objects: one can sum them, take their Hadamard, matrix or Kronecker product, or perform these operations against constants, compute element-wise trigonometric, or log or exponent, compute the Cholesky factor, the symmetric square root, invert the matrix, compute the trace or derivative, reshape the array, concatenate them, and so on. If there are methods missing, file an issue. In the meantime, there is a numerical approximation method, numderiv, which can approximate the derivative of complex operations via differencing schemes.

Insert coin

With a system like this, it seemed to me that the weak points would be getting data into and out of the objects. For convenience, there are two methods which can get you a madness object with a variance-covariance of \(X\). One, as.madness, takes any object with a coef method (and, ideally, a vcov method), and encapsulates the coefficients, again with identity derivative. So, suppose you computed the linear regression coefficient of a linear system, then wanted to compute some function of the coefficients:

n <- 1000
adf <- data.frame(x=rnorm(n),y=rnorm(n),z=rnorm(n),err=rnorm(n))
adf$res <- adf$x - adf$y + 3 * adf$z + adf$err
beta <- lm(res ~ x + y + z,data=adf)
bmad <- as.madness(beta)
## class: madness 
##         d val
##  calc: ------- 
##         d val
##   val: 0.027397 ...
##  dvdx: 1 0 0 0 ...
##  varx: 0.000929 6.48986e-08 -8.99532e-06 -7.42748e-07 ...
# is the beta vector all zeros? check its norm:
bcross <- crossprod(bmad)
## class: madness 
##         d (t(val) %*% val)
##  calc: -------------------- 
##                d val
##   val: 10.7838 ...
##  dvdx: 0.0547939 1.99158 -2.01165 5.92615 ...
##  varx: 0.000929 6.48986e-08 -8.99532e-06 -7.42748e-07 ...

You now have a madness object with a valid varx. The vcov method now gives an estimate of the variance-covariance of the value in the object via the multivariate delta method. The typical use case is to compute the Wald statistic:

wald <- val(bcross) / sqrt(vcov(bcross))
##         [,1]
## [1,] 50.8158

The other method for input is twomoments, which computes the mean and covariance of a vector of inputs where the rows are given in a matrix on the input. Here I take the monthly returns of the Fama-French three factors and compute the first two moments:

ffrets <- as.data.frame(1e-2 * mff4[,c('Mkt','SMB','HML')])
twom <- twomoments(ffrets)
## class: madness 
##           d mu
##  calc: ---------- 
##         d ffrets
##   val: 0.00921322 ...
##  dvdx: 0 1 0 0 0 0 0 0 0 0 ...
##  varx: 3.95423e-34 1.31845e-22 1.38795e-22 -8.04428e-22 5.27357e-23 1.03228e-23 6.93189e-24 1.79318e-23 7.52093e-24 -1.99062e-23 ...

The Markowitz portfolio is then computed as on regular data, by solveing a system:

markowitz <- solve(twom$Sigma,twom$mu)
## class: madness 
##         d solve(Sigma, mu)
##  calc: -------------------- 
##              d ffrets
##   val: 2.88011 ...
##  dvdx: 0 430.694 -205.903 -124.517 -1179.34 472.625 -432.017 51.0478 421.256 245.034 ...
##  varx: 3.95423e-34 1.31845e-22 1.38795e-22 -8.04428e-22 5.27357e-23 1.03228e-23 6.93189e-24 1.79318e-23 7.52093e-24 -1.99062e-23 ...
# marginal Wald statistics:
wald <- val(markowitz) / sqrt(diag(vcov(markowitz))) 
##          [,1]
## [1,] 4.209399
## [2,] 0.274838
## [3,] 2.249841

You can feed in a different vcov function, but unfortunately this has to be done at the beginning:

twom <- twomoments(ffrets,vcov=sandwich::vcovHAC)
markowitz <- solve(twom$Sigma,twom$mu)
wald <- val(markowitz) / sqrt(diag(vcov(markowitz))) 
##          [,1]
## [1,] 3.749292
## [2,] 0.264687
## [3,] 1.805240

I'll be your mirror

The twomoments method (and the theta method, which it is built upon) respect the symmetry of the covariance matrix being estimated. If you naively put one of these in a madness object, you may _over_estimate the variance of your final value because you are effectively ignoring the fact that elements mirroring the diagonal of a covariance must equal each other and cannot vary independently. There are other forms of symmetry one would often like to bake into a system. These have to be enforced when setting the varx of a madness object.

Note also that some functions in R assume symmetry of the input and act accordingly. I am thinking of chol, which only looks at the upper triangular part of the input. The derivative of this function with respect to the lower triangular part should be zero. However, when considering the mathematical function, the derivative should be nonzero. This is a bit odd, because one has to assume the input matrix is symmetric and could not be otherwise. There is a simple way around this: symmetrize the input prior to a chol:

csigma <- chol(0.5 * (twom$Sigma + t(twom$Sigma)))
## class: madness 
##         d chol((numeric * (Sigma + t(Sigma))))
##  calc: ---------------------------------------- 
##                        d ffrets
##   val: 0.0534007 0.0101888 0.00830648 ...
##  dvdx: 0 -0.173 0 0 9.38869 0 0 0 0 0 ...
##  varx: 3.98675e-34 -3.53089e-22 3.05579e-22 -1.025e-21 1.15502e-22 2.47245e-23 -3.55091e-24 4.84984e-23 8.24594e-24 -7.46717e-24 ...

I will save the 'real' intended use cases for a future blog post.

Not another AD package?!

It is not apparent that this package needed to exist. There are a few other automatic differentiation solutions in R already. These felt too basic for me, however, since they focus on scalar operations (or scalar operations performed elementwise). Sure, one can pose matrix inversion as a bunch of additions and multiplications, divisions, and so on, but that is not how it is implemented in R for performance reasons, rather a C library is called instead. If an AD package cannot pierce into C routines (and it seems unlikely it could), it cannot be applied to these operations. A higher-level approach is needed.

I wrote this package for a few reasons:

  • I needed the functionality for some things I am thinking about.
  • I wanted to learn about the 'S4' system in R for multiple dispatch.

Besides the cute acronym ('Multivariate Automatic Differentiation'), the latter is the reason I called the package madness. (I tried to justify an acronym for masochists, but it was not forthcoming.) In fact, when I announced that I was going to use the S4 system, the reaction was a mixture of alarm and condolences. Apparently S4 is a self-deprecating feature of R, the crazy uncle whose existence means you never invite your friends to Thanksgiving.

But, because I wanted to support non-commutative mixed operations (e.g. matrix multiplication between a constant and a madness object, in either order), I had to use S4 for multiple dispatch. (Actually, I already wrote this package in another language which does not have multiple dispatch, and it is possible, just more unpleasant.)

S4 is indeed frustrating for a newcomer. My main goals, as a package designer, were:

  • Define or overload methods for objects defined in my package and under my control.
  • Ensure that by so doing I was not breaking existing code for users of my package.

It was never clear to me that I was achieving either of these goals. Writing an S4 class, properly documenting it with roxygen2 and making all the CRAN checks pass was a bit hairy for me. Part of the problem is that I have the attention span of a fruitfly and cannot be bothered to read the fine manual. Part of the problem is that the best practices have changed over time, and reading old documentation on the web is actually counterproductive (I think a best practices S4 skeleton package on github, that could be maintained over time, would be an excellent resource, though only for the benefit of a select few). But part of the problem is probably that S4 is an awful system, and the community is justified in warning you not to use it unless you know otherwise.

While I have essentially an elitist view of the CRAN ecosystem--there is enough of a critical mass of packages that we should require some amount of work from package maintainers as 'proof' that their package will work seamlessly for users--I was put in the uncomfortable situation of being the clueless user who could not do anything right. This makes me feel a bit uneasy about my relationship with R, though I suspect that with practice I will come to understand and appreciate this system for what it is--after all, I came to peace with Matlab after ten years of daily use. But it does leave a bad taste in my mouth, like expensive champagne.

If I understand the history correctly, S4 was bolted onto R as an afterthought, to work side by side with the older S3 system. Moreover, it was not a ten year migration process. I suspect that if S4 was truly a pain point for users or developers, the R developers could create an 'S4 1/2' system and release it forthwith: the R language is fairly dynamic. Again, this is probably not a priority for anyone, but it is worth noting that multiple dispatch is one of the main selling points of Julia, where it provides enough hints for the interpreter to produce efficient code. While the efficiency gains would likely not translate into R due to interpreter architecture, it seems to me that multiple dispatch is an essential feature for a language which users can tinker with, overload, and take ownership of.