ohenery!

Wed 25 September 2019 by Steven

I just pushed the first version of my ohenery package to CRAN. The package supports estimation of softmax regression for ordinal outcomes under the Harville and Henery models. Unlike the usual multinomial representation for ordinal outcomes, softmax regression is useful for 'ragged' cases. Contrast:

observed independent variables on participants in multiple races, with the outcomes recorded, and different participants in each race, perhaps different numbers of participants in each race.
observed independent variables on independent trials where for each trial there is a single outcome taking values from some ordered set.

Multinomial ordinal regression is for the latter case, while softmax regression is for the former. Softmax regression generalizes logistic regression, which is the special case of two participants per race. I had first stumbled on the idea when working in the film industry, but called it a 'Bradley-Terry model' out of ignorance.

The basic setup is as follows: suppose you observe independent variables $x_i$ for participant $i$ in a race. Let $\eta_i = x_i^{\top}\beta$ for some coefficients $\beta$. Then let

$$ \pi_i = \frac{\exp{\eta_i}}{\sum_j \exp{\eta_j}}, $$

where we sum over all $j$ in the same race. Under the softmax regression model, the probability that participant $i$ takes first place is $\pi_i$.
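As a minimal sketch of the formula above, here is the per-race softmax in base R on made-up data. The vectors `race` and `eta` and the helper `softmax_by_race` are illustrative names only; they are not part of the package.

```r
# Hypothetical toy data: two races, with three and two participants.
race <- c(1, 1, 1, 2, 2)
eta  <- c(0.5, 0.0, -0.5, 1.0, 0.2)   # made-up values of eta_i = x_i' beta

# Softmax within each race: exp(eta_i) divided by the sum of exp(eta_j)
# over all participants j in the same race.
softmax_by_race <- function(eta, race) {
  w <- exp(eta)
  w / ave(w, race, FUN = sum)
}

prob_first <- softmax_by_race(eta, race)

# The probabilities of taking first place sum to one within each race.
tapply(prob_first, race, sum)
```

Note that the races have different numbers of participants, which is exactly the 'ragged' case.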

This formulation is sufficient when you only observe the winner of a multi-participant race, like say the Best Picture winner of the Oscars. However, in some cases you observe the rank of several or all participants. For example, in Olympic events, one observes Gold, Silver and Bronze finishes.

Note that it is generally recommended that you not discard continuous information by discretizing your variables in this way. However, in some cases one only observes the ordinal outcomes, and there softmax regression can be used.

In the case where ranked outcomes are observed beyond the winner, we wish to 'recycle' softmax probabilities. Under the Harville model, the probabilities are recycled proportionally. An example will illustrate: condition on the outcome that participant 11 took first place. Then for $i \ne 11$, compute

$$ \pi_i = \frac{\exp{\eta_i}}{\sum_{j\ne 11} \exp{\eta_j}}. $$

Under the Harville model, the probability that the $i$th participant took second place is $\pi_i$, conditional on the event that 11 took first.

The Henery model slightly generalizes the Harville model. Here we imagine some $\gamma_2, \gamma_3, \gamma_4$ and so on such that the above computation becomes

$$ \pi_i = \frac{\exp{\gamma_2 \eta_i}}{\sum_{j\ne 11} \exp{\gamma_2 \eta_j}}. $$

Then conditional on 11 taking first, and participant 5 taking second, compute

$$ \pi_i = \frac{\exp{\gamma_3 \eta_i}}{\sum_{j\ne 11, j\ne 5} \exp{\gamma_3 \eta_j}} $$

as the probability that participant $i$ takes third place, and so on. Obviously the Harville model is a Henery model with all $\gamma_k=1$.
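The recycling can be sketched in a few lines of base R. This is only an illustration of the formulas above, not the package's Rcpp implementation; the function `henery_order_prob` and its arguments are hypothetical names.

```r
# Probability of an observed finishing order in one race under a Henery model.
#   eta    : eta_i for each participant in the race
#   finish : participant indices in finishing order (first, second, ...)
#   gamma  : c(gamma_2, gamma_3, ...); the default of all ones is Harville
henery_order_prob <- function(eta, finish, gamma = rep(1, length(finish) - 1)) {
  stopifnot(length(gamma) >= length(finish) - 1)
  g <- c(1, gamma)                 # first place uses the plain softmax
  remaining <- seq_along(eta)      # participants who have not yet placed
  prob <- 1
  for (k in seq_along(finish)) {
    w <- exp(g[k] * eta[remaining])
    # conditional probability that finish[k] takes place k
    prob <- prob * w[remaining == finish[k]] / sum(w)
    remaining <- setdiff(remaining, finish[k])
  }
  prob
}

eta <- c(0.5, 0.0, -0.5)                                 # made-up values
henery_order_prob(eta, c(1, 2, 3))                       # Harville
henery_order_prob(eta, c(1, 2, 3), gamma = c(0.8, 1.2))  # Henery
```

Summing this probability over all possible finishing orders gives one for any choice of the gammas, since each stage is itself a softmax over the remaining participants.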

I wasn't sure how to deal with ties in the code. On the one hand, ties are legitimate possible outcomes in some cases. On the other, they are convenient to introduce as some unobserved 'runner up' status. For example, create an 'Aluminum Medal' outcome for Olympians who take none of Gold, Silver or Bronze; in this case many participants tie for the fourth place medal. However, we should not expect the regression to try to fit some order on those participants. The solution was to introduce weights to the estimation. Set the weights to zero for outcomes which are fake ties, and set them to one otherwise.
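One way to picture how such weights can enter the likelihood is the following sketch. The notation is introduced here for illustration, and the package's exact convention may differ. For a single race, write $(k)$ for the participant who finished in place $k$, $w_{(k)}$ for that participant's weight, and $R_k$ for the set of participants who did not finish ahead of place $k$. A weighted log likelihood then looks like

$$ \log L = \sum_{k} w_{(k)} \log \frac{\exp{\gamma_k \eta_{(k)}}}{\sum_{j \in R_k} \exp{\gamma_k \eta_j}}, \qquad \gamma_1 = 1. $$

With zero weight on the fake fourth place ties, only the Gold, Silver and Bronze terms contribute to the sum, but the unmedaled participants still appear in the denominators of those terms.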

The package uses Rcpp to compute a likelihood (and gradient), then maxLik does the estimation and inference. The rest of the work was me tearing my hair out trying to decipher model.frame and its friends.

Olympic Diving

The package is bundled with a dataset of 100 years of Olympic Men's Platform Diving Records, sourced from Randi Griffin's excellent dataset on Kaggle.

Here we convert the medal records into finishing places of 1, 2, 3 and 4 (no medal), add weights for the fitting, make a factor variable for age, and factor the NOC (country) of the athlete, lumping all but the five most common countries into 'Other'. Because Platform Diving is a subjective competition, based on scores from judges, we investigate whether there is a 'home field advantage' by creating a Boolean variable indicating whether the athlete is representing the host nation.

We then fit a Henery model to the data. Note that the gamma terms come out very close to one (each is well within one standard error of one), indicating the Harville model would be sufficient. The home field advantage is not statistically significant in this analysis (p ≈ 0.16). (Note: in the first draft of this blog post, using the first version of the package, the home field effect appeared significant due to a coding error.)

# this should be ohenery 0.1.1
library(ohenery)
library(dplyr)
library(forcats)

data(diving)
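fitdat <- diving %>%
  # finishing place: 1 to 3 for medalists, 4 for everyone else
  mutate(Finish=case_when(grepl('Gold',Medal)   ~ 1,  # make outcomes
                          grepl('Silver',Medal) ~ 2,
                          grepl('Bronze',Medal) ~ 3,
                          TRUE ~ 4)) %>%
  # zero weight for the fake ties in fourth place
  mutate(weight=ifelse(Finish <= 3,1,0)) %>%
  # bucket age, treating a missing age as 22
  mutate(cut_age=cut(coalesce(Age,22.0),c(12,19.5,21.5,22.5,25.5,99),include.lowest=TRUE)) %>%
  # keep the five most common countries, with 'Other' as the reference level
  mutate(country=forcats::fct_relevel(forcats::fct_lump(factor(NOC),n=5),'Other')) %>%
  # TRUE when the athlete represents the host nation
  mutate(home_advantage=NOC==HOST_NOC)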

hensm(Finish ~ cut_age + country + home_advantage,data=fitdat,weights=weight,group=EventId,ngamma=3)

--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 43 iterations
Return code 0: successful convergence 
Log-Likelihood: -214.01 
12  free parameters
Estimates:
                   Estimate Std. error t value Pr(> t)    
cut_age(19.5,21.5]   0.0303     0.4185    0.07 0.94227    
cut_age(21.5,22.5]  -0.7276     0.5249   -1.39 0.16565    
cut_age(22.5,25.5]   0.0950     0.3790    0.25 0.80199    
cut_age(25.5,99]    -0.1838     0.4111   -0.45 0.65474    
countryGBR          -0.6729     0.8039   -0.84 0.40258    
countryGER           1.0776     0.4960    2.17 0.02981 *  
countryMEX           0.7159     0.4744    1.51 0.13126    
countrySWE           0.6207     0.5530    1.12 0.26172    
countryUSA           2.3201     0.4579    5.07 4.1e-07 ***
home_advantageTRUE   0.5791     0.4112    1.41 0.15904    
gamma2               1.0054     0.2853    3.52 0.00042 ***
gamma3               0.9674     0.2963    3.26 0.00109 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
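One way to read these coefficients, sketched in R with numbers copied by hand from the table above: since $\pi_i$ is proportional to $\exp{\eta_i}$, exponentiating a coefficient gives the multiple it applies to a participant's softmax weight for first place, relative to the reference level. The vector `est` is just an illustrative name.

```r
# Point estimates copied from the fit above.
est <- c(countryUSA = 2.3201, home_advantageTRUE = 0.5791)

# Multiplicative effect on the softmax weight exp(eta_i).
round(exp(est), 2)
```

So, all else equal, a diver from the USA carries roughly ten times the softmax weight of a diver from the 'Other' group, while the point estimate for diving at home is a multiple of about 1.8, with a standard error large enough that it is not distinguishable from no effect.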

fromo 0.2.0

Sun 13 January 2019 by Steven E. Pav

I recently pushed version 0.2.0 of my fromo package to CRAN. This package implements (relatively) fast, numerically robust computation of moments via Rcpp.

The big changes in this release are:

Support for weighted moment estimation.
Computation of running moments over windows defined by time (or some other increasing index), rather than vector index.
Some modest improvements in speed for the 'dangerous' use cases (no checking for NA, no weights, etc.)

The time-based running moments are supported via the t_running_* operations, and we support means, standard deviation, skew, kurtosis, centered and standardized moments and cumulants, z-score, Sharpe, and t-stat. The idea is that your observations are associated with some increasing index, which you can think of as the observation time, and you wish to compute moments over a fixed time window. To bloat the API, the times from which you 'look back' can optionally be something other than the time indices of the input, so the input and output size can be different.

Some example uses might be:

Compute the volatility of an asset's returns over the previous 6 months, on every trade day.
Compute the total monthly sales of a company at month ends.

Because the API also allows you to use weights as implicit time deltas, you can also do weird and unadvisable things like compute the Sharpe of an asset over the last 1 million shares traded.
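To make the idea of a time-based window concrete, here is a naive reference version in base R. It is a sketch for illustration only: `t_window_sd` is a hypothetical helper, not a fromo function, it makes no attempt at speed or numerical robustness, and the choice of a half-open window is an assumption of the sketch.

```r
# Naive time-windowed standard deviation.
#   x        : observations
#   time     : increasing time index of the observations
#   window   : width of the time window
#   lookback : times at which to compute the statistic (defaults to `time`)
t_window_sd <- function(x, time, window, lookback = time) {
  sapply(lookback, function(t0) {
    # assumed convention: use observations with t0 - window < time <= t0
    in_win <- time > (t0 - window) & time <= t0
    if (sum(in_win) < 2) NA_real_ else sd(x[in_win])
  })
}

x    <- c(1, 2, 4, 7)
time <- c(1, 2, 3, 10)               # irregularly spaced observation times
t_window_sd(x, time, window = 2)

# Looking back from times that are not observation times changes the output size.
t_window_sd(x, time, window = 2, lookback = c(3, 5))
```

This does work proportional to the number of observations for every look-back time, which is the cost a running computation avoids.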

Speed improvements come from my random walk through C++ design idioms. I also implemented a 'swap' procedure for the running standard deviation which combines a Welford's method addition and removal into a single step. I do not believe that Welford's method is the fastest algorithm for a summarizing moment computation: a two-pass solution that computes the mean first, then the centered moments, is probably faster. However, for the …
