I recently released a package to CRAN called madness. The eponymous object supports 'multivariate' automatic differentiation by forward accumulation. By 'multivariate', I mean it allows you to track (and automatically computes) the derivative of a scalar, or vector, or matrix, or multidimensional array with respect to a scalar, vector, matrix or multidimensional array.
The primary use case in mind is the
multivariate delta method,
where one has an estimate of a population quantity and the variance-covariance
of the same, and wants to perform inference on some transform of that
population quantity. With the values stored in a madness
object, one merely
performs the transforms directly on the estimate, and the derivatives are
computed automatically. A secondary use case would be for the automatic
computation of gradients when optimizing some complex function, e.g. in the
computation of the MLE of some quantity.
A madness
object contains a value, val
, as well as the derivative of
val
with respect to some \(X\), called dvdx
. The derivative is stored
as a matrix in 'numerator layout' convention: if val
holds
\(m\) values, and \(X\) holds \(n\) values, then dvdx
is a \(m \times n\) matrix.
This unfortunately means that a gradient is stored as a row vector.
Numerator layout feels more natural (to me, at least) when propagating
derivatives via the chain rule.
For convenience, one can also store the 'tags' of the value and \(X\), in
vtag
and xtag
, respectively. The vtag
will be modified when computations
are performed, which can be useful for debugging. One can also store
the variance-covariance matrix of \(X\) in varx
.
Here is an example session showing the use of a madness
object. Note that by
default if one does not feed in dvdx
, the object constructor assumes that
the value is equal to \(X\), and so instantiates dvdx
as an identity matrix.
Here I will instantiate a square matrix of strictly positive elements,
and stuff it into a madness
object:
library(madness)
set.seed(123)
X <- matrix(runif(9,min=0.5,max=1.5),ncol=3)
Xmad <- madness(X)
show(Xmad)
## class: madness
## d X
## calc: -----
## d X
## val: 0.787578 1.38302 1.02811 ...
## dvdx: 1 0 0 0 0 0 0 0 0 ...
## varx: ...
The display is a bit odd: only a subset of the val
and dvdx
are shown,
to stop screen flyby.
(I would love to make this better; if you can improve it, submit a PR.)
At the top, you see that we have computed \(\mathrm{d}X/\mathrm{d}X\). The
derivative is indeed the identity. You can get the value and derivative from a
madness
object by the val
and dvdx
'getter' methods.
Now take the log of each element, and compute the Frobenius norm of that:
fval <- norm(log(Xmad),'F')
show(fval)
## class: madness
## d norm(log(X), 'f')
## calc: ---------------------
## d X
## val: 0.921251 ...
## dvdx: -0.329118 0.213445 -0.113967 0.254506 0.275026 -1.20564 0.0292646 0.258069 0.05178 ...
## varx: ...
You see the scalar value, and the derivative with respect to the original
matrix X
. Note also that the vtag
is a log of the methods called
on X
, again principally for debugging.
I implemented many of the common methods on madness
objects: one can sum
them, take their Hadamard, matrix or Kronecker product, or perform these
operations against constants, compute element-wise trigonometric, or log or
exponent, compute the Cholesky factor, the symmetric square root, invert the
matrix, compute the trace or derivative, reshape the array, concatenate them,
and so on. If there are methods missing, file an
issue. In the meantime, there
is a numerical approximation method, numderiv
, which can approximate the
derivative of complex operations via differencing schemes.
Insert coin
With a system like this, it seemed to me that the weak points would be getting
data into and out of the objects. For convenience, there are two methods which
can get you a madness
object with a variance-covariance of \(X\).
One, as.madness
, takes any object with a coef
method (and, ideally, a
vcov
method), and encapsulates the coefficients, again with identity
derivative. So, suppose you computed the linear regression coefficient of a
linear system, then wanted to compute some function of the
coefficients:
set.seed(345)
n <- 1000
adf <- data.frame(x=rnorm(n),y=rnorm(n),z=rnorm(n),err=rnorm(n))
adf$res <- adf$x - adf$y + 3 * adf$z + adf$err
beta <- lm(res ~ x + y + z,data=adf)
bmad <- as.madness(beta)
show(bmad)
## class: madness
## d val
## calc: -------
## d val
## val: 0.027397 ...
## dvdx: 1 0 0 0 ...
## varx: 0.000929 6.48986e-08 -8.99532e-06 -7.42748e-07 ...
# is the beta vector all zeros? check its norm:
bcross <- crossprod(bmad)
show(bcross)
## class: madness
## d (t(val) %*% val)
## calc: --------------------
## d val
## val: 10.7838 ...
## dvdx: 0.0547939 1.99158 -2.01165 5.92615 ...
## varx: 0.000929 6.48986e-08 -8.99532e-06 -7.42748e-07 ...
You now have a madness
object with a valid varx
. The vcov
method now
gives an estimate of the variance-covariance of the value in the object
via the multivariate delta method. The typical use case is to compute the
Wald statistic:
wald <- val(bcross) / sqrt(vcov(bcross))
print(wald)
## [,1]
## [1,] 50.8158
The other method for input is twomoments
, which computes the mean and
covariance of a vector of inputs where the rows are given in a matrix on the
input. Here I take the monthly returns of the Fama-French three factors and
compute the first two moments:
library(tsrsa)
data(mff4)
ffrets <- as.data.frame(1e-2 * mff4[,c('Mkt','SMB','HML')])
twom <- twomoments(ffrets)
show(twom$mu)
## class: madness
## d mu
## calc: ----------
## d ffrets
## val: 0.00921322 ...
## dvdx: 0 1 0 0 0 0 0 0 0 0 ...
## varx: 3.95423e-34 1.31845e-22 1.38795e-22 -8.04428e-22 5.27357e-23 1.03228e-23 6.93189e-24 1.79318e-23 7.52093e-24 -1.99062e-23 ...
The Markowitz portfolio is then computed as on regular data, by solve
ing a
system:
markowitz <- solve(twom$Sigma,twom$mu)
show(markowitz)
## class: madness
## d solve(Sigma, mu)
## calc: --------------------
## d ffrets
## val: 2.88011 ...
## dvdx: 0 430.694 -205.903 -124.517 -1179.34 472.625 -432.017 51.0478 421.256 245.034 ...
## varx: 3.95423e-34 1.31845e-22 1.38795e-22 -8.04428e-22 5.27357e-23 1.03228e-23 6.93189e-24 1.79318e-23 7.52093e-24 -1.99062e-23 ...
# marginal Wald statistics:
wald <- val(markowitz) / sqrt(diag(vcov(markowitz)))
print(wald)
## [,1]
## [1,] 4.209399
## [2,] 0.274838
## [3,] 2.249841
You can feed in a different vcov
function, but unfortunately this has to be
done at the beginning:
require(sandwich)
twom <- twomoments(ffrets,vcov=sandwich::vcovHAC)
markowitz <- solve(twom$Sigma,twom$mu)
wald <- val(markowitz) / sqrt(diag(vcov(markowitz)))
print(wald)
## [,1]
## [1,] 3.749292
## [2,] 0.264687
## [3,] 1.805240
I'll be your mirror
The twomoments
method (and the theta
method, which it is built upon)
respect the symmetry of the covariance matrix being estimated. If you naively
put one of these in a madness
object, you may _over_estimate the variance
of your final value because you are effectively ignoring the fact that elements
mirroring the diagonal of a covariance must equal each other and cannot vary
independently. There are other forms of symmetry one would often like to bake
into a system. These have to be enforced when setting the varx
of a madness
object.
Note also that some functions in R assume symmetry of the input and act
accordingly. I am thinking of chol
, which only looks at the upper triangular
part of the input. The derivative of this function with respect to the lower
triangular part should be zero. However, when considering the mathematical
function, the derivative should be nonzero. This is a bit odd, because one has
to assume the input matrix is symmetric and could not be otherwise. There is a
simple way around this: symmetrize the input prior to a chol
:
csigma <- chol(0.5 * (twom$Sigma + t(twom$Sigma)))
show(csigma)
## class: madness
## d chol((numeric * (Sigma + t(Sigma))))
## calc: ----------------------------------------
## d ffrets
## val: 0.0534007 0.0101888 0.00830648 ...
## dvdx: 0 -0.173 0 0 9.38869 0 0 0 0 0 ...
## varx: 3.98675e-34 -3.53089e-22 3.05579e-22 -1.025e-21 1.15502e-22 2.47245e-23 -3.55091e-24 4.84984e-23 8.24594e-24 -7.46717e-24 ...
I will save the 'real' intended use cases for a future blog post.
Not another AD package?!
It is not apparent that this package needed to exist. There are a few other automatic differentiation solutions in R already. These felt too basic for me, however, since they focus on scalar operations (or scalar operations performed elementwise). Sure, one can pose matrix inversion as a bunch of additions and multiplications, divisions, and so on, but that is not how it is implemented in R for performance reasons, rather a C library is called instead. If an AD package cannot pierce into C routines (and it seems unlikely it could), it cannot be applied to these operations. A higher-level approach is needed.
I wrote this package for a few reasons:
- I needed the functionality for some things I am thinking about.
- I wanted to learn about the 'S4' system in R for multiple dispatch.
Besides the cute acronym ('Multivariate Automatic Differentiation'), the latter
is the reason I called the package madness
. (I tried to justify an acronym for
masochists
, but it was not forthcoming.) In fact, when I announced that I was
going to use the S4 system, the reaction was a mixture of alarm and condolences.
Apparently S4 is a self-deprecating feature of R, the crazy uncle whose
existence means you never invite your friends to Thanksgiving.
But, because I wanted to support non-commutative mixed operations (e.g.
matrix multiplication between a constant and a madness
object, in either
order), I had to use S4 for multiple dispatch. (Actually, I already
wrote this package in another language which does not have multiple
dispatch, and it is possible, just more unpleasant.)
S4 is indeed frustrating for a newcomer. My main goals, as a package designer, were:
- Define or overload methods for objects defined in my package and under my control.
- Ensure that by so doing I was not breaking existing code for users of my package.
It was never clear to me that I was achieving either of these goals.
Writing an S4 class, properly documenting it with roxygen2
and making all the
CRAN checks pass was a bit hairy for me. Part of the problem is that I have
the attention span of a fruitfly and cannot be bothered to read the fine
manual. Part of the problem is that the best practices have changed over time,
and reading old documentation on the web is actually counterproductive (I think
a best practices S4 skeleton package on github, that could be maintained over
time, would be an excellent resource, though only for the benefit of a select few).
But part of the problem is probably that S4 is an awful system, and the
community is justified in warning you not to use it unless you know otherwise.
While I have essentially an elitist view of the CRAN ecosystem--there is enough of a critical mass of packages that we should require some amount of work from package maintainers as 'proof' that their package will work seamlessly for users--I was put in the uncomfortable situation of being the clueless user who could not do anything right. This makes me feel a bit uneasy about my relationship with R, though I suspect that with practice I will come to understand and appreciate this system for what it is--after all, I came to peace with Matlab after ten years of daily use. But it does leave a bad taste in my mouth, like expensive champagne.
If I understand the history correctly, S4 was bolted onto R as an afterthought, to work side by side with the older S3 system. Moreover, it was not a ten year migration process. I suspect that if S4 was truly a pain point for users or developers, the R developers could create an 'S4 1/2' system and release it forthwith: the R language is fairly dynamic. Again, this is probably not a priority for anyone, but it is worth noting that multiple dispatch is one of the main selling points of Julia, where it provides enough hints for the interpreter to produce efficient code. While the efficiency gains would likely not translate into R due to interpreter architecture, it seems to me that multiple dispatch is an essential feature for a language which users can tinker with, overload, and take ownership of.