I just pushed the first version of my ohenery package to CRAN. The package supports estimation of softmax regression for ordinal outcomes under the Harville and Henery models. Unlike the usual multinomial representation for ordinal outcomes, softmax regression is useful for 'ragged' cases. Contrast:

  • observed independent variables on participants in multiple races, with the outcomes recorded, and different participants in each race, perhaps different numbers of participants in each race.
  • observed independent variables on independent trials where for each trial there is a single outcome taking values from some ordered set.

Multinomial ordinal regression is for the latter case, while softmax is for the former. It generalizes logistic regression. I had first stumbled on the idea when working in the film industry, but called it a 'Bradley-Terry model' out of ignorance.

The basic setup is as follows: suppose you observe independent variables \(x_i\) for a participant in a race. Let \(\eta_i = x_i^{\top}\beta\) for some coefficients \(\beta\). Then let

$$ \pi_i = \frac{\exp{\eta_i}}{\sum_j \exp{\eta_j}}, $$

where we sum over all \(j\) in the same race. Under the softmax regression model, the probability that participant \(i\) takes first place is \(\pi_i\).

This formulation is sufficient when you only observe the winner of a multi-participant race, like say the Best Picture winner of the Oscars. However, in some cases you observe the rank of several or all participants. For example, in Olympic events, one observes Gold, Silver and Bronze finishes.

Note that it is generally recommended that you not discard continuous information to dichotomize your variables in this way. However, in some cases one only observes the ordinal outcomes. In this case softmax regression can be used.

In the case where ranked outcomes are observed beyond the winner, we wish to 'recycle' softmax probabilities. Under the Harville model, the probabilities are recycled proportionally. An example will illustrate: condition on the outcome that participant 11 took first place. Then for \(i \ne 11\), compute

$$ \pi_i = \frac{\exp{\eta_i}}{\sum_{j\ne 11} \exp{\eta_j}}. $$

Under the Harville model, the probability that the \(i\)th participant took second place is \(\pi_i\), conditional on the event that 11 took first.

The Henery model slightly generalizes the Harville model. Here we imagine some \(\gamma_2, \gamma_3, \gamma_4\) and so on such that the above computation becomes

$$ \pi_i = \frac{\exp{\gamma_2 \eta_i}}{\sum_{j\ne 11} \exp{\gamma_2 \eta_j}}. $$

Then conditional on 11 taking first, and participant 5 taking second, compute

$$ \pi_i = \frac{\exp{\gamma_3 \eta_i}}{\sum_{j\ne 11, j\ne 5} \exp{\gamma_3 \eta_j}} $$

as the probability that participant \(i\) takes third place, and so on. Obviously the Harville model is a Henery model with all \(\gamma_i=1\).

I wasn't sure how to deal with ties in the code. On the one hand, ties are legitimate possible outcomes in some cases. On the other, they are convenient to introduce as some unobserved 'runner up' status. For example, create an 'Aluminum Medal' outcome for Olympians who take neither Gold, Silver or Bronze; in this case many participants tie for the fourth place medal. However, we should not expect the regression to try to fit some order on those participants. The solution was to introduce weights to the estimation. Set the weights to zero for outcomes which are fake ties, and set them to one otherwise.

The package uses Rcpp to compute a likelihood (and gradient), then maxLik does the estimation and inference. The rest of the work was me tearing my hair out trying to decipher model.frame and its friends.

Olympic Diving

The package is bundled with a dataset of 100 years of Olympic Men's Platform Diving Records, sourced from Randi Griffin's excellent dataset on kaggle.

Here we convert the medal records into finishing places of 1, 2, 3 and 4 (no medal), add weights for the fitting, make a factor variable for age, factor the NOC (country) of the athlete. Because Platform Diving is a subjective competition, based on scores from judges, we investigate whether there is a 'home field advantage' by creating a Boolean variable indicating whether the athlete is representing the host nation.

We then fit a Henery model to the data. Note that the gamma terms come out very close to one, indicating the Harville model would be sufficient. The home field advantage does not appear real in this analysis. (Note: in the first draft of this blog post, using the first version of the package, the home field effect appeared significant due to coding error.)

# this should be ohenery 0.1.1
library(ohenery)
library(dplyr)
library(forcats)

data(diving)
fitdat <- diving %>%
  mutate(Finish=case_when(grepl('Gold',Medal)   ~ 1,  # make outcomes
                          grepl('Silver',Medal) ~ 2,
                          grepl('Bronze',Medal) ~ 3,
                          TRUE ~ 4)) %>%
  mutate(weight=ifelse(Finish <= 3,1,0)) %>%
  mutate(cut_age=cut(coalesce(Age,22.0),c(12,19.5,21.5,22.5,25.5,99),include.lowest=TRUE)) %>%
  mutate(country=forcats::fct_relevel(forcats::fct_lump(factor(NOC),n=5),'Other')) %>%
  mutate(home_advantage=NOC==HOST_NOC)

hensm(Finish ~ cut_age + country + home_advantage,data=fitdat,weights=weight,group=EventId,ngamma=3)
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 43 iterations
Return code 0: successful convergence 
Log-Likelihood: -214.01 
12  free parameters
Estimates:
                   Estimate Std. error t value Pr(> t)    
cut_age(19.5,21.5]   0.0303     0.4185    0.07 0.94227    
cut_age(21.5,22.5]  -0.7276     0.5249   -1.39 0.16565    
cut_age(22.5,25.5]   0.0950     0.3790    0.25 0.80199    
cut_age(25.5,99]    -0.1838     0.4111   -0.45 0.65474    
countryGBR          -0.6729     0.8039   -0.84 0.40258    
countryGER           1.0776     0.4960    2.17 0.02981 *  
countryMEX           0.7159     0.4744    1.51 0.13126    
countrySWE           0.6207     0.5530    1.12 0.26172    
countryUSA           2.3201     0.4579    5.07 4.1e-07 ***
home_advantageTRUE   0.5791     0.4112    1.41 0.15904    
gamma2               1.0054     0.2853    3.52 0.00042 ***
gamma3               0.9674     0.2963    3.26 0.00109 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------