The nerdosphere is in a minor tizzy over a putative bias in IMDb ratings for the new (2016) Ghostbusters film. It seems a bit odd to me, since IMDb ratings have always been horribly 'biased': if the question you are trying to answer is "If I am forced to watch this randomly selected movie, will I like it?", then IMDb ratings, like most aggregated movie ratings, are difficult to interpret and very likely 'biased'. The typical mechanism by which a rating ends up on IMDb is that a person somehow becomes aware of the film (this has been the major problem for studios since the end of the studio-theatre model seventy years ago), enough so to view it; they are then more likely to rate the movie if they liked it, liked it more than they expected to, or really hated it. Those who had low to middling opinions of the film are less likely to rate it, and so you have a missing data problem, without the simplifying assumption of "missing at random."
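To make the missing-data point concrete, here is a small simulation sketch. Everything in it is made up for illustration: the opinion distribution and the rating probabilities in `p_rate` are assumptions of mine, not estimates from IMDb. It only shows that when the chance of rating depends on the opinion itself, the average of the submitted ratings can drift away from the average opinion of all viewers.

```r
set.seed(1)
# hypothetical population of viewers with opinions on a 1-10 scale
n <- 100000
opinion <- pmin(pmax(round(rnorm(n, mean = 5.5, sd = 2)), 1), 10)
# assumed response model: enthusiasts and haters rate more often
# than viewers with middling opinions (probabilities are invented)
p_rate <- ifelse(opinion >= 8, 0.30,
          ifelse(opinion <= 2, 0.20, 0.05))
rated <- runif(n) < p_rate
# average opinion of everyone who watched
true_mean <- mean(opinion)
# average of the ratings that were actually submitted
observed_mean <- mean(opinion[rated])
print(c(true = true_mean, observed = observed_mean))
```

Under these invented probabilities the observed mean comes out above the true mean, because enthusiasts are over-represented among raters. A different response model could push the bias the other way.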
The Ghostbusters argy bargy (or one of them) is that reviews are suspected to be coming from people who have not seen the movie. This is possibly a problem for all reviews on IMDb, though less so for reviews appearing on streaming services, which know when you have seen a film. (The other argy bargy is that sexist and racist jerks have been harassing stars of the new film.) The analysis on FiveThirtyEight is informative, but uses information (e.g. age and sex of the reviewers) that is not widely available, and which is volunteered by the reviewers. Given the IMDb mirror at my disposal, I can look for systematic biases in ratings based on the sex of a film's cast, and will do so here.
I recently looked at IMDb ratings by actor age to see if the 'De Niro effect' was idiosyncratic, or whether reviewers systematically disliked films with older actors. The analysis there was a bit wonky: I attempted to fit a fixed effect for every actor, but ratings are quoted for films, not for a particular actor's part in a film, and multiple actors might participate in a given film, so I randomly sampled a single actor per film. Here I take a different approach, computing the (weighted) average age and sex of the actors in a film. The weighting is based on the nr_order of actors within a film, the ordering given in IMDb which tells you roughly which actor or actress has top billing. I exponentially down-weight based on this order, with the first listed actor/actress twice as important as the actor/actress in the fourth slot, who has twice the weight of number 7, and so on.
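As a quick check of the down-weighting scheme, the sketch below computes the weights for the first seven billing slots. It mirrors the `weight` expression used in the data-gathering code, with `ORDER_DOWNWEIGHTING` set to 3; the seven-slot cast is just an illustration.

```r
# halving distance in billing slots, as in the analysis code
ORDER_DOWNWEIGHTING <- 3
# billing positions for a hypothetical seven-person cast
nr_order <- 1:7
weight <- 2^(-nr_order / ORDER_DOWNWEIGHTING)
# slots three apart differ by a factor of two
ratio_1_to_4 <- weight[1] / weight[4]
ratio_4_to_7 <- weight[4] / weight[7]
# share of the total weight carried by each slot
share <- weight / sum(weight)
print(round(share, 3))
```

Only the ratios between weights matter, since the weights are normalized by their sum when the averages are computed.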
As before, I remove films tagged with the Documentary genre, and keep those which have a production year between 1965 and 2015 and list English as a language. (You will find that Indian and Turkish movies have many fans on IMDb, often with incomparable ratings.) You should be able to follow along with this analysis at home if you have the mirror. If you want to skip the data gathering part, you can get my cut of the data.
library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
  inner_join(tbl(dbcon,'info_type') %>%
      filter(info %regexp% 'genres') %>%
      select(info_type_id),
    by='info_type_id')
# get documentary movies;
doccos <- movie_genres %>%
  filter(info %regexp% 'Documentary') %>%
  select(movie_id)
# language information
movie_languages <- tbl(dbcon,'movie_info') %>%
  inner_join(tbl(dbcon,'info_type') %>%
      filter(info %regexp% 'languages') %>%
      select(info_type_id),
    by='info_type_id')
# get movies with English
unnerstandit <- movie_languages %>%
  filter(info %regexp% 'English') %>%
  select(movie_id)
# movies which are not documentaries, have some English, filtered by production year
movies <- tbl(dbcon,'title') %>%
  select(-imdb_index,-ttid,-md5sum) %>%
  anti_join(doccos %>% distinct(movie_id),by='movie_id') %>%
  inner_join(unnerstandit %>% distinct(movie_id),by='movie_id') %>%
  filter(production_year >= 1965,production_year <= 2015)
# votes for all movies, filtered by having enough votes
vote_info <- tbl(dbcon,'movie_votes') %>%
  select(movie_id,votes,vote_mean,vote_sd,vote_se) %>%
  filter(votes >= 25)
# join the two together
# nb. dplyr is having problems with collect, so collect early...
mvotes <- inner_join(movies,vote_info,by='movie_id') %>%
  collect(n=Inf)
# change this to change downweighting.
# 3 = person #1 is twice as important as person #4
# 10 = person #1 is twice as important as person #11
ORDER_DOWNWEIGHTING <- 3
# acts in relation
# inner join with subselected movies
# nb. dplyr is having problems with collect, so collect early...
acts_in <- tbl(dbcon,'cast_info') %>%
  inner_join(tbl(dbcon,'role_type') %>%
      filter(role %regexp% 'actor|actress'),
    by='role_id') %>%
  select(person_id,movie_id,nr_order) %>%
  filter(!is.na(nr_order)) %>%
  mutate(weight=2^(-nr_order/ORDER_DOWNWEIGHTING)) %>%
  inner_join(movies %>% select(movie_id),by='movie_id') %>%
  collect(n=Inf)
# get actors with a known date of birth and a recorded gender
good_actors <- tbl(dbcon,'name') %>%
  select(person_id,name,gender,dob) %>%
  filter(!is.na(dob)) %>%
  # keep only rows with gender 'm' or 'f' (the original mutate() did not filter anything)
  filter((gender=='m') | (gender=='f')) %>%
  mutate(yob=year(dob)) %>%
  mutate(ismale=(gender=='m')) %>%
  filter(yob >= 1875) %>%
  collect(n=Inf)
# join the good actors with acts-in
# with mvotes.
bigdata <- good_actors %>%
  inner_join(acts_in %>% inner_join(mvotes,by='movie_id'),by='person_id') %>%
  mutate(actor_age=production_year - yob) %>%
  filter(actor_age >= 5,actor_age <= 100)
# get the mean age and sex
mean_stuff <- bigdata %>%
  group_by(movie_id) %>%
  summarize(sum_wgt = sum(weight),
    sum_age = sum(weight*actor_age),
    sum_ism = sum(weight*ismale)) %>%
  ungroup() %>%
  mutate(mean_age = sum_age / sum_wgt,
    mean_ism = sum_ism / sum_wgt)
# join together with votes
joined <- mean_stuff %>%
  inner_join(mvotes,by='movie_id')
# write it so you all can have it.
#library(readr)
#readr::write_csv(joined,path='../data/movie_rate_by_sex.csv')
We have a weighted average age of actors, and a weighted average sex, where 0 means "all female cast" and 1 means "all male cast". Here are some top films based on mean rating, with mean age and sex of the cast. (ism stands for "is male".)
joined %>%
filter(votes > 5e4) %>%
arrange(desc(vote_mean)) %>%
select(movie_id,title,mean_ism,mean_age,production_year,votes,vote_mean,vote_sd) %>%
head(10) %>%
kable()
movie_id | title | mean_ism | mean_age | production_year | votes | vote_mean | vote_sd |
---|---|---|---|---|---|---|---|
756550 | The Godfather | 0.952 | 42.4 | 1972 | 1131245 | 7.96 | 2.70 |
799705 | The Shawshank Redemption | 1.000 | 44.4 | 1994 | 1652593 | 7.96 | 2.70 |
5868 | 3 Idiots | 0.850 | 38.7 | 2009 | 199139 | 7.79 | 2.74 |
706413 | Swades: We, the People | 0.624 | 36.9 | 2004 | 56090 | 7.79 | 2.74 |
741988 | The Dark Knight | 0.869 | 42.1 | 2008 | 1638089 | 7.79 | 2.74 |
756561 | The Godfather: Part II | 0.791 | 38.3 | 1974 | 772323 | 7.79 | 2.74 |
773650 | The Lord of the Rings: The Return of the King | 0.665 | 30.2 | 2003 | 1189275 | 7.79 | 2.74 |
266250 | Forrest Gump | 0.746 | 42.1 | 1994 | 1218328 | 7.73 | 2.58 |
420754 | La vita è bella | 0.675 | 42.5 | 1997 | 408755 | 7.73 | 2.58 |
777754 | The Matrix | 0.739 | 38.0 | 1999 | 1190603 | 7.73 | 2.58 |
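To see how the mean_ism and mean_age columns come about, here is a toy calculation for a single film. The four-person cast below is hypothetical; only the weighting formula is taken from the analysis code.

```r
# hypothetical four-person cast, listed in billing order
cast <- data.frame(nr_order = 1:4,
                   ismale = c(FALSE, TRUE, TRUE, FALSE),
                   actor_age = c(34, 51, 29, 62))
# same exponential down-weighting as in the analysis (halving every 3 slots)
cast$weight <- 2^(-cast$nr_order / 3)
# weighted average sex: 0 = all female, 1 = all male
mean_ism <- sum(cast$weight * cast$ismale) / sum(cast$weight)
# weighted average age, via base R's weighted.mean
mean_age <- weighted.mean(cast$actor_age, cast$weight)
print(c(mean_ism = mean_ism, mean_age = mean_age))
```

The cast is half male by head count, but the weighted mean_ism lands slightly below one half because the top-billed slot, which carries the most weight, is held by an actress.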
First, plots of the IMDb rating of a film versus average cast maleness, and then versus average cast age. The first plot is colored by age class, the second by sex class. Then a boxplot grouped by age and sex class.
library(ggplot2)
plot_dat <- joined %>%
  filter(votes >= 100) %>%
  mutate(cast_maleness = 200 * (mean_ism - 0.5)) %>%
  mutate(ageish = cut(mean_age, breaks=c(0,20,30,40,65,100),
    labels=c("teenage","twentysomething","thirtysomething","middle age","senior citizen"),right=FALSE)) %>%
  mutate(maleish = cut(cast_maleness, breaks=c(-101,-33,33,100),
    labels=c("mostly female","balanced","mostly male"),right=TRUE))
# rating versus cast maleness, colored by age class
ph <- ggplot(plot_dat,aes(x=cast_maleness,y=vote_mean,color=ageish)) +
  geom_jitter() +
  geom_smooth() +
  labs(x="cast maleness: -100=all female; 100=all male;",
    y="IMDb rating")
ph
# rating versus cast average age, colored by sex class
ph <- ggplot(plot_dat,aes(x=mean_age,y=vote_mean,colour=maleish)) +
  geom_jitter() +
  geom_smooth() +
  labs(x="cast average age",
    y="IMDb rating")
ph
# boxplots of rating by age class and sex class
ph <- ggplot(plot_dat,aes(x=ageish,y=vote_mean)) +
  geom_boxplot(aes(fill=maleish),varwidth=FALSE) +
  geom_jitter(alpha=0.05,aes(color=maleish)) +
  labs(x="cast average age",
    y="IMDb rating")
ph
I do not see a huge effect here. The slight apparent increase in ratings for films with very young or very old cast members is likely caused by the small sample sizes. A regression might tell us something about the effect sizes:
mod0 <- lm(vote_mean ~ maleish * ageish,plot_dat)
print(summary(mod0))
##
## Call:
## lm(formula = vote_mean ~ maleish * ageish, data = plot_dat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.119 -0.439  0.152  0.650  3.281
##
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                6.2248     0.1186   52.49  < 2e-16 ***
## maleishbalanced                            0.0814     0.2219    0.37    0.714
## maleishmostly male                         0.0706     0.1504    0.47    0.639
## ageishtwentysomething                     -0.5861     0.1240   -4.73  2.3e-06 ***
## ageishthirtysomething                     -0.5247     0.1224   -4.29  1.8e-05 ***
## ageishmiddle age                          -0.1957     0.1253   -1.56    0.118
## ageishsenior citizen                       0.0013     0.2161    0.01    0.995
## maleishbalanced:ageishtwentysomething      0.2088     0.2265    0.92    0.357
## maleishmostly male:ageishtwentysomething   0.2893     0.1567    1.85    0.065 .
## maleishbalanced:ageishthirtysomething      0.1075     0.2244    0.48    0.632
## maleishmostly male:ageishthirtysomething   0.1272     0.1539    0.83    0.409
## maleishbalanced:ageishmiddle age          -0.1829     0.2260   -0.81    0.419
## maleishmostly male:ageishmiddle age       -0.2402     0.1561   -1.54    0.124
## maleishbalanced:ageishsenior citizen       0.1511     0.3402    0.44    0.657
## maleishmostly male:ageishsenior citizen   -0.1638     0.2487   -0.66    0.510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.956 on 27802 degrees of freedom
## Multiple R-squared: 0.00674, Adjusted R-squared: 0.00624
## F-statistic: 13.5 on 14 and 27802 DF, p-value: <2e-16
Note that the Intercept term here refers to the lowest class levels: a mostly female, teenage cast. (And yes, I have removed porn films from the mirror.) We see a 'significant' decrease in rating for the twentysomething and thirtysomething age classes, but no significant effects otherwise. The effect sizes for age are on the order of half a rating point, somewhat larger than the average effects seen previously, but not terribly larger, while the main effects for sex are small, less than a tenth of a rating point.
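Because the model includes interactions, the fitted mean for any cell is the sum of the intercept, the two main effects, and the matching interaction term. The sketch below does that arithmetic by hand for two cells, using coefficients copied from the regression output; the variable names are mine.

```r
# coefficients copied from the summary(mod0) output
b <- c(intercept = 6.2248,
       mostly_male = 0.0706,
       twentysomething = -0.5861,
       mostly_male_x_twentysomething = 0.2893)
# reference cell: mostly female, teenage cast
fit_female_teen <- b[["intercept"]]
# mostly female, twentysomething cast
fit_female_twenty <- b[["intercept"]] + b[["twentysomething"]]
# mostly male, twentysomething cast: all four terms apply
fit_male_twenty <- sum(b)
# estimated sex gap within the twentysomething class:
# the sex main effect plus the interaction term
gap_twenty <- fit_male_twenty - fit_female_twenty
print(round(c(fit_female_teen, fit_female_twenty, fit_male_twenty, gap_twenty), 4))
```

Note that the sex main effect by itself describes the gap only within the reference (teenage) age class. In other age classes the interaction term is added to it, though none of the interaction estimates clears the usual 0.05 threshold in the output.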
In all, we do not see here a significant 'sexist bias' in film ratings, where films with mostly female casts are consistently rated lower. This does not mean that no films are subject to sexist campaigns, nor does it mean that reviewer sex is independent of review. It merely suggests that among the many biases in IMDb reviews, rampant sexism is not a leading cause of error.