Gilgamathhttps://www.gilgamath.com/Sun, 13 Jan 2019 10:23:39 -0800fromo 0.2.0https://www.gilgamath.com/fromo-two.html<p>I recently pushed version 0.2.0 of my <code>fromo</code> package to
<a href="https://cran.r-project.org/web/packages/fromo/index.html">CRAN</a>.
This package implements (relatively) fast, numerically robust
computation of moments via <code>Rcpp</code>.
<!-- PELICAN_END_SUMMARY --></p>
<p>The big changes in this release are:</p>
<ul>
<li>Support for weighted moment estimation.</li>
<li>Computation of running moments over windows defined
by time (or some other increasing index), rather
than vector index.</li>
<li>Some modest improvements in speed for the 'dangerous'
use cases (no checking for <code>NA</code>, no weights, <em>etc.</em>).</li>
</ul>
<p>The time-based running moments are supported via the <code>t_running_*</code> operations,
and we support the mean, standard deviation, skewness, kurtosis, centered and
standardized moments and cumulants, z-score, Sharpe, and t-stat. The
idea is that your observations are associated with some increasing
index, which you can think of as the observation time, and you wish
to compute moments over a fixed time window. To bloat the API, the
times from which you 'look back' can optionally be something other
than the time indices of the input, so the input and output size
can be different.</p>
<p>Some example uses might be:</p>
<ul>
<li>Compute the volatility of an asset's returns over the previous 6 months,
on every trade day.</li>
<li>Compute the total monthly sales of a company at month ends.</li>
</ul>
<p>Because the API also allows you to use weights as implicit time deltas, you can
also do weird and inadvisable things like compute the Sharpe of an asset
over the last 1 million shares traded.</p>
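A sketch of the first example use, trailing volatility on every trade day, might look as follows. The argument names follow my reading of the package manual and should be checked against the <code>fromo</code> documentation; the returns here are simulated for illustration.

```r
# sketch: trailing volatility over roughly the previous 6 months,
# computed at every observation. argument names are illustrative;
# consult the fromo manual before relying on them.
library(fromo)
set.seed(1)
dates <- seq(as.Date('2017-01-02'), by='day', length.out=500)
x <- rnorm(length(dates), mean=0.0005, sd=0.01)   # fake daily returns
# a 6 month lookback is about 182 calendar days
vols <- t_running_sd(x, time=as.numeric(dates), window=182)
```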
<p>Speed improvements come from my random walk through c++ design idioms.
I also implemented a 'swap' procedure for the running standard deviation
which incorporates a Welford's method addition and removal into a single
step. I do not believe that Welford's method is the fastest algorithm
for a summarizing moment computation: probably a two pass solution to
compute the mean first, then the centered moments is faster. However,
for the …</p>Steven E. PavSun, 13 Jan 2019 10:23:39 -0800tag:www.gilgamath.com,2019-01-13:/fromo-two.htmlRpackageTwelve Dimensional Chess is Stupidhttps://www.gilgamath.com/twelve_dimensional_chess.html<p>Chess and the Curse of Dimensionality</p>StevenTue, 16 Oct 2018 22:24:30 -0700tag:www.gilgamath.com,2018-10-16:/twelve_dimensional_chess.htmlanalysischessR in Finance 2018https://www.gilgamath.com/rfin2018.html<p>Review of R in Finance 2018 conference</p>StevenFri, 01 Jun 2018 10:00:32 -0700tag:www.gilgamath.com,2018-06-01:/rfin2018.htmlquant-financereportsAnother Confidence Limit for the Markowitz Signal Noise ratiohttps://www.gilgamath.com/new_mp_ci.html<p>Another confidence limit on the Signal Noise ratio of the Markowitz portfolio.</p>StevenWed, 28 Mar 2018 21:33:59 -0700tag:www.gilgamath.com,2018-03-28:/new_mp_ci.htmlstatisticsquant-financeanalysisRMarkowitz Portfolio Covariance, Elliptical Returnshttps://www.gilgamath.com/markowitz-cov-elliptical.html<p>In a <a href="bad-cis">previous blog post</a>, I looked at asymptotic confidence
intervals for the Signal to Noise ratio of the (sample) Markowitz
portfolio, finding them to be deficient. (Perhaps they are useful if
one has hundreds of thousands of days of data, but are otherwise
awful.) Those confidence intervals came from revision four of my paper
on the <a href="https://arxiv.org/abs/1312.0557">Asymptotic distribution of the Markowitz Portfolio</a>.
In that same update, I also describe, albeit in an obfuscated form,
the asymptotic distribution of the sample Markowitz portfolio for
elliptical returns. Here I check that finding empirically.
<!-- PELICAN_END_SUMMARY --></p>
<p>Suppose you observe a <span class="math">\(p\)</span> vector of returns drawn from an elliptical
distribution with mean <span class="math">\(\mu\)</span>, covariance <span class="math">\(\Sigma\)</span> and 'kurtosis factor',
<span class="math">\(\kappa\)</span>. Under this model the kurtosis of the marginals is three times
the kurtosis factor, which takes the value <span class="math">\(1\)</span> for a multivariate normal.
This model of returns is slightly more realistic than the multivariate normal,
but it does not allow for skewness of asset returns, an unrealistic restriction.</p>
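For concreteness, the multivariate t is one member of this elliptical family: for <span class="math">\(\nu\)</span> degrees of freedom its kurtosis factor is <span class="math">\(\kappa = (\nu-2)/(\nu-4)\)</span>. Here is a sketch of drawing such returns, assuming the <code>mvtnorm</code> package; the parameters are made up for illustration.

```r
# sketch: draw multivariate t returns with a target mean and covariance.
# nu must exceed 4 for the kurtosis factor to be finite.
library(mvtnorm)
set.seed(101)
n <- 1e5; p <- 4; nu <- 8            # kappa = (nu-2)/(nu-4) = 1.5
mu <- rep(0.001, p)
Sigma <- 0.2 * diag(p) + 0.8         # unit variances, 0.8 correlation
# rmvt's scale matrix is not the covariance: cov = sigma * nu/(nu-2),
# so rescale to target Sigma
X <- rmvt(n, sigma=Sigma * (nu - 2) / nu, df=nu) +
  matrix(mu, nrow=n, ncol=p, byrow=TRUE)
# marginal kurtosis should be near 3 * kappa = 4.5
kurts <- apply(X, 2, function(x) mean((x - mean(x))^4) / var(x)^2)
```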
<p>Nonetheless, let <span class="math">\(\hat{\nu}\)</span> be the Markowitz portfolio built on a sample
of <span class="math">\(n\)</span> days of independent returns:
</p>
<div class="math">$$
\hat{\nu} = \hat{\Sigma}^{-1} \hat{\mu},
$$</div>
<p>
where <span class="math">\(\hat{\mu}, \hat{\Sigma}\)</span> are the regular 'vanilla' estimates
of mean and covariance. The vector <span class="math">\(\hat{\nu}\)</span> is, in a sense, over-corrected,
and we need to cancel out a square root of <span class="math">\(\Sigma\)</span> (the population value). So
we will consider the distribution of <span class="math">\(Q \Sigma^{\top/2} \hat{\nu}\)</span>, where
<span class="math">\(\Sigma^{\top/2}\)</span> is the upper triangular Cholesky factor of <span class="math">\(\Sigma\)</span>,
and where <span class="math">\(Q\)</span> is an orthogonal matrix (<span class="math">\(Q Q^{\top} = I\)</span>), and where
<span class="math">\(Q\)</span> rotates <span class="math">\(\Sigma^{-1/2}\mu\)</span> onto <span class="math">\(e_1\)</span>, the first basis vector:
</p>
<div class="math">$$
Q \Sigma^{-1/2}\mu = \zeta e_1,
$$</div>
<p>
where <span class="math">\(\zeta\)</span> is the Signal to Noise ratio of the population Markowitz
portfolio: <span class="math">\(\zeta = \sqrt{\mu^{\top}\Sigma^{-1}\mu} = \left\Vert …</span></p>Steven E. PavMon, 12 Mar 2018 22:28:31 -0700tag:www.gilgamath.com,2018-03-12:/markowitz-cov-elliptical.htmlquant-financeanalysisstatisticsMarkowitzportfolioRA Lack of Confidence Intervalhttps://www.gilgamath.com/bad-cis.html<p>For some years now I have been playing around with a certain problem
in portfolio statistics: suppose you observe <span class="math">\(n\)</span> independent observations
of a <span class="math">\(p\)</span> vector of returns, then form the Markowitz portfolio based on
those returns. What then is the distribution of what I call the 'signal to
noise ratio' of that Markowitz portfolio, defined as the true expected
return divided by the true volatility? That is, if <span class="math">\(\nu\)</span> is the Markowitz
portfolio, built on a sample, its 'SNR' is <span class="math">\(\nu^{\top}\mu /
\sqrt{\nu^{\top}\Sigma \nu}\)</span>, where <span class="math">\(\mu\)</span> is the population mean vector, and
<span class="math">\(\Sigma\)</span> is the population covariance matrix.
<!-- PELICAN_END_SUMMARY --></p>
<p>This is an odd problem, somewhat unlike classical statistical inference, because the
unknown quantity, the SNR, depends not only on the population parameters but also on the
sample: it is both random and unknown. What you learn in your basic statistics class is
inference on fixed unknowns. (Actually, I never really took a basic statistics
class, but I think that's right.)</p>
<p>Paulsen and Sohl made some progress on this problem in their 2016 paper on what
they call the
<a href="https://arxiv.org/abs/1602.06186">Sharpe Ratio Information Criterion.</a>
They find a sample statistic which is unbiased for the portfolio SNR when
returns are (multivariate) Gaussian. In my mad scribblings on the backs of
envelopes and scrap paper, I have been trying to find the <em>distribution</em> of the SNR.
I have been looking for this love, as they say, in all the wrong places,
usually hoping for some clever transformation that will lead to a slick proof.
(I was taught from a young age to look for slick proofs.) </p>
<p>Having failed that mission, I pivoted to looking for confidence intervals for
the SNR (and maybe even <em>prediction intervals</em> on the out-of-sample Sharpe ratio
of the in-sample Markowitz portfolio). I realized that some of the work I had
done …</p>Steven E. PavThu, 15 Feb 2018 21:58:58 -0800tag:www.gilgamath.com,2018-02-15:/bad-cis.htmlquant-financeanalysisstatisticsgeom cloud.https://www.gilgamath.com/geom-cloud.html<p>I wanted a drop-in replacement for <code>geom_errorbar</code> in <code>ggplot2</code> that would
plot a density cloud of uncertainty.
<!-- PELICAN_END_SUMMARY -->
The idea is that typically (well, where I work),
the <code>ymin</code> and <code>ymax</code> of an errorbar are plotted at plus and minus
one standard deviation. A 'cloud' where the alpha is proportional to a normal
density with the same standard deviations could show the same information
on a plot with a little less clutter. I found out how to do this with
a very ugly function, but wanted to do it the 'right' way by spawning my
own geom. Hence <code>geom_cloud</code>.</p>
<p>After looking at a bunch of other <code>ggplot2</code> extensions, plus some amount of
tinkering and hair-pulling, we have the following code. The first part
just computes standard deviations which are equally spaced in normal density.
This is then used to create a list of <code>geom_ribbon</code> with equal alpha, but
the right size. A little trickery is used to get the scales right. There
are three parameters: the <code>steps</code>, which controls how many ribbons are drawn.
The default value is a little conservative. A larger value, like 15, gives
very smooth clouds. The <code>se_mult</code> is the number of standard deviations that
the <code>ymax</code> and <code>ymin</code> are plotted at, defaulting to 1 here. If you plot
your errorbars at 2 standard errors, change this to 2. The <code>max_alpha</code> is the
alpha at the maximal density, <em>i.e.</em> around <code>y</code>.</p>
<div class="highlight"><pre><span></span><span class="c1"># get points equally spaced in density </span>
equal_ses <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>steps<span class="p">)</span> <span class="p">{</span>
xend <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">4</span><span class="p">)</span>
endpnts <span class="o"><-</span> dnorm<span class="p">(</span>xend<span class="p">)</span>
<span class="c1"># perhaps use ppoints instead?</span>
deql <span class="o"><-</span> <span class="kp">seq</span><span class="p">(</span>from<span class="o">=</span>endpnts<span class="p">[</span><span class="m">1</span><span class="p">],</span>to<span class="o">=</span>endpnts<span class="p">[</span><span class="m">2</span><span class="p">],</span>length.out<span class="o">=</span>steps<span class="m">+1</span><span class="p">)</span>
davg <span class="o"><-</span> <span class="p">(</span>deql<span class="p">[</span><span class="m">-1</span><span class="p">]</span> <span class="o">+</span> deql<span class="p">[</span><span class="o">-</span><span class="kp">length</span><span class="p">(</span>deql<span class="p">)])</span><span class="o">/</span><span class="m">2</span>
<span class="c1"># invert</span>
xeql <span class="o"><-</span> <span class="kp">unlist</span><span class="p">(</span><span class="kp">lapply</span><span class="p">(</span>davg<span class="p">,</span><span class="kr">function</span><span class="p">(</span>d<span class="p">)</span> <span class="p">{</span>
uniroot<span class="p">(</span>f<span class="o">=</span><span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span> dnorm<span class="p">(</span>x<span class="p">)</span> <span class="o">-</span> d <span class="p">},</span>interval<span class="o">=</span>xend<span class="p">)</span><span class="o">$</span>root
<span class="p">}))</span>
xeql
<span class="p">}</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>grid<span class="p">)</span>
geom_cloud <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>mapping …</pre></div>Steven E. PavThu, 21 Sep 2017 21:51:28 -0700tag:www.gilgamath.com,2017-09-21:/geom-cloud.htmlRggplotSpy vs Spy vs Wald Wolfowitz.https://www.gilgamath.com/spy-vs-wald-wolfowitz.html<p>I turned my kids on to the great Spy vs Spy cartoon from Mad Magazine.
This strip is pure gold for two young boys: Rube Goldberg plus
explosions with not much dialog (one child is still too young to read).
I became curious whether the one Spy had the upper hand, whether
Prohias worked to keep the score 'even', and so on.
<!-- PELICAN_END_SUMMARY --></p>
<p>Not finding any data out there, I collected the data to the best
of my ability from the Spy vs Spy Omnibus, which collects all
248 strips that appeared in Mad Magazine (plus two special issues).
I think there are more strips by Prohias out there that appeared
only in collected books, but I have not collected them yet.
I entered the data into a google spreadsheet, then converted into
CSV, then into <a href="http://www.github.com/shabbychef/SPYvsSPY">an R data package</a>.
Now you can play along at home.</p>
<p>On to the simplest form of my question: did Prohias alternate between
Black and White Spy victories? Or did he choose at random?
Up until 1968 it was common for two strips to appear in one issue
of Mad, with one victory per Spy. In some cases <em>three</em> strips
appeared per issue, with the Grey Spy appearing in the third;
the Black and White Spies always receive a comeuppance when she
appears, and so the balance of power was maintained.
After 1972, it seems that only a single strip appeared per issue,
and we can examine the time series of victories. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>SPYvsSPY<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>knitr<span class="p">)</span> <span class="c1"># for kable()</span>
data<span class="p">(</span>svs<span class="p">)</span>
<span class="c1"># show that multiple strips appeared per issue</span>
svs <span class="o">%>%</span>
group_by<span class="p">(</span>Mad_no<span class="p">,</span>yrmo<span class="p">)</span> <span class="o">%>%</span>
summarize<span class="p">(</span>nstrips<span class="o">=</span>n<span class="p">(),</span>
net_victories<span class="o">=</span><span class="kp">sum</span><span class="p">(</span><span class="kp">as.numeric</span><span class="p">(</span>white_comeuppance<span class="p">)</span> <span class="o">-</span> <span class="kp">as.numeric</span><span class="p">(</span>black_comeuppance<span class="p">)))</span> <span class="o">%>%</span>
ungroup<span class="p">()</span> <span class="o">%>%</span>
select<span class="p">(</span>yrmo<span class="p">,</span>nstrips<span class="p">,</span>net_victories<span class="p">)</span> <span class="o">%>%</span>
<span class="kp">head</span><span class="p">(</span>n<span class="o">=</span><span class="m">20</span><span class="p">)</span> <span class="o">%>%</span>
kable<span class="p">()</span>
</pre></div>
<table>
<thead>
<tr>
<th align="left">yrmo</th>
<th align="right">nstrips</th>
<th align="right">net_victories</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1961-01</td>
<td align="right">3</td>
<td align="right">-1</td>
</tr>
<tr>
<td align="left">1961-03</td>
<td align="right">2</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">1961-04</td>
<td align="right">2</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">1961-06</td>
<td align="right">2</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">1961-07</td>
<td align="right">2 …</td></tr></tbody></table>Steven E. PavTue, 05 Sep 2017 21:34:15 -0700tag:www.gilgamath.com,2017-09-05:/spy-vs-wald-wolfowitz.htmlanalysisdataR in Finance 2017https://www.gilgamath.com/rfin2017.html<p>Review of R in Finance 2017 conference</p>StevenFri, 19 May 2017 09:30:24 -0700tag:www.gilgamath.com,2017-05-19:/rfin2017.htmlquant-financereportsCalendar plots in ggplot2.https://www.gilgamath.com/calendar-plots-ggplot2.html<p>I like the calendar 'heatmap' plots of commits you can see on
<a href="https://github.com/shabbychef">github user pages</a>, and wanted to play around with some.
Of course, if I just wanted to make some plots, I could have just googled around, and then
followed <a href="http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Calendar%20Heat%20Map">this recipe</a>,
or maybe used the <a href="https://github.com/ramnathv/rChartsCalmap">rChartsCalmap package</a>.
Instead I set out, as an exercise, to make my own using ggplot2. </p>
<!-- PELICAN_END_SUMMARY -->
<p>For data, I am using the daily GHCND observations data for station <code>USC00047880</code>, which is
located in the San Rafael, CA, Civic Center. I downloaded this data as part of a project
to join weather data to campground data (yes, it's been done before), directly from
the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily">NOAA FTP site</a>, then read the fixed width
file. I then processed the data, subselected to 2016 and beyond, and converted the units.
I am left with a dataframe of dates, the element name, and the value, which is a temperature
in Celsius. The first ten values I show here:</p>
<table>
<thead>
<tr>
<th align="left">date</th>
<th align="left">element</th>
<th align="right">value</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">2016-01-01</td>
<td align="left">TMAX</td>
<td align="right">9.4</td>
</tr>
<tr>
<td align="left">2016-01-01</td>
<td align="left">TMIN</td>
<td align="right">0.0</td>
</tr>
<tr>
<td align="left">2016-01-02</td>
<td align="left">TMAX</td>
<td align="right">10.0</td>
</tr>
<tr>
<td align="left">2016-01-02</td>
<td align="left">TMIN</td>
<td align="right">3.9</td>
</tr>
<tr>
<td align="left">2016-01-03</td>
<td align="left">TMAX</td>
<td align="right">11.7</td>
</tr>
<tr>
<td align="left">2016-01-03</td>
<td align="left">TMIN</td>
<td align="right">6.7</td>
</tr>
<tr>
<td align="left">2016-01-04</td>
<td align="left">TMAX</td>
<td align="right">12.8</td>
</tr>
<tr>
<td align="left">2016-01-04</td>
<td align="left">TMIN</td>
<td align="right">6.7</td>
</tr>
<tr>
<td align="left">2016-01-05</td>
<td align="left">TMAX</td>
<td align="right">12.8</td>
</tr>
<tr>
<td align="left">2016-01-05</td>
<td align="left">TMIN</td>
<td align="right">8.3</td>
</tr>
</tbody>
</table>
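The download-and-tidy step might be sketched as follows. The fixed-width column layout is taken from the GHCND-daily readme, and TMAX/TMIN values are reported in tenths of a degree Celsius; treat this as an illustrative reconstruction, not the post's actual code.

```r
# sketch: read a GHCND .dly fixed-width file and tidy to date/element/value.
# column positions follow the GHCND-daily readme; illustrative only.
library(dplyr)
library(tidyr)
widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))
cnames <- c('id', 'year', 'month', 'element',
            paste0(rep(c('value', 'mflag', 'qflag', 'sflag'), 31),
                   rep(1:31, each=4)))
raw <- read.fwf('USC00047880.dly', widths=widths, col.names=cnames,
                stringsAsFactors=FALSE)
temps <- raw %>%
  filter(element %in% c('TMAX', 'TMIN')) %>%
  select(year, month, element, starts_with('value')) %>%
  gather(key='day', value='value', starts_with('value')) %>%
  mutate(day   = as.integer(gsub('value', '', day)),
         date  = as.Date(sprintf('%04d-%02d-%02d', year, month, day)),
         value = ifelse(value == -9999, NA, value / 10)) %>%  # tenths of deg C
  filter(!is.na(date), !is.na(value), date >= as.Date('2016-01-01')) %>%
  select(date, element, value)
```

Invalid calendar dates (say, February 30) come back as <code>NA</code> from <code>as.Date</code> and are filtered out along with the <code>-9999</code> missing-value sentinel.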
<p>Here is the code to produce the heatmap itself. I first use the <code>date</code> field
to compute the x axis labels and locations: the dates are converted essentially
to 'Julian' days since January 4, 1970 (a Sunday), then divided by seven to
get a 'Julian' week number. The week number containing the tenth of the month is
then set as the location of the month name in the x axis labels. I add years to
the January labels.</p>
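The bookkeeping described above might be sketched like this; <code>temps</code> is assumed to be the dataframe of dates and values, and the helper names are illustrative.

```r
# sketch of the week-number arithmetic: R's Date epoch is 1970-01-01,
# a Thursday, so subtracting 3 gives days since Sunday 1970-01-04.
library(dplyr)
weekly <- temps %>%
  mutate(jday  = as.numeric(date) - 3,
         jweek = jday %/% 7,          # 'Julian' week number: the x position
         dow   = jday %% 7)           # day of the week: the y position
# x axis labels: the week containing the 10th of each month gets the
# month name, with the year added to the January labels
mlabs <- weekly %>%
  filter(as.integer(format(date, '%d')) == 10) %>%
  mutate(lab = ifelse(format(date, '%m') == '01',
                      format(date, '%b %Y'), format(date, '%b'))) %>%
  distinct(jweek, lab)
```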
<p>I then compute the Julian week number and day number of the week. I create a variable
which alternates between …</p>Steven E. PavThu, 18 May 2017 17:21:54 -0700tag:www.gilgamath.com,2017-05-18:/calendar-plots-ggplot2.htmlanalysisRggplot