Calendar plots in ggplot2.
Thu 18 May 2017
by
Steven E. Pav
I like the calendar 'heatmap' plots of commits you can see on
GitHub user pages, and wanted to play around with some.
Of course, if I just wanted to make some plots, I could have just googled around, and then
followed this recipe,
or maybe used the rChartsCalmap package.
Instead I set out, as an exercise, to make my own using ggplot2.
For data, I am using the daily GHCND observations data for station USC00047880, which is
located in the San Rafael, CA, Civic Center. I downloaded this data as part of a project
to join weather data to campground data (yes, it's been done before), directly from
the NOAA FTP site, then read the fixed width
file. I then processed the data, subselected to 2016 and beyond, and converted the units.
I am left with a dataframe of dates, the element name, and the value, which is a temperature
in Celsius. The first ten values I show here:
date       | element | value
2016-01-01 | TMAX    | 9.4
2016-01-01 | TMIN    | 0.0
2016-01-02 | TMAX    | 10.0
2016-01-02 | TMIN    | 3.9
2016-01-03 | TMAX    | 11.7
2016-01-03 | TMIN    | 6.7
2016-01-04 | TMAX    | 12.8
2016-01-04 | TMIN    | 6.7
2016-01-05 | TMAX    | 12.8
2016-01-05 | TMIN    | 8.3
Here is the code to produce the heatmap itself. I first use the date field
to compute the x axis labels and locations: the dates are converted essentially
to 'Julian' days since January 4, 1970 (a Sunday), then divided by seven to
get a 'Julian' week number. The week number containing the tenth of the month is
then set as the location of the month name in the x axis labels. I add years to
the January labels.
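The week and label arithmetic described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the date range and the variable names (`week_num`, `day_of_week`, `label_at`, `label_text`) are made up for the example and are not taken from the original code.

```r
# Sketch only: the date range and variable names are illustrative assumptions.
dates <- seq(as.Date("2016-01-01"), as.Date("2016-03-31"), by = "day")
origin <- as.Date("1970-01-04")   # a Sunday

# days since the Sunday origin, then week number and day within the week
days_since <- as.integer(dates) - as.integer(origin)
week_num <- days_since %/% 7
day_of_week <- days_since %% 7    # 0 = Sunday, ..., 6 = Saturday

# x axis labels: one per month, placed at the week containing the tenth
tenths <- seq(as.Date("2016-01-10"), as.Date("2016-03-10"), by = "month")
label_at <- (as.integer(tenths) - as.integer(origin)) %/% 7
month_idx <- as.integer(format(tenths, "%m"))
label_text <- month.abb[month_idx]

# add the year to the January labels only
is_jan <- month_idx == 1
label_text[is_jan] <- paste(label_text[is_jan], format(tenths[is_jan], "%Y"))

print(data.frame(label_at, label_text))
```

Anchoring the origin on a Sunday means integer division by seven puts each Sunday-to-Saturday run in a single week, which is what lets the week number serve as the x coordinate and the day of the week as the y coordinate.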
I then compute the Julian week number and day number of the week. I create a variable
which alternates between …
read more
Do You Want Ballot Stuffing With Your Turkey?
Wed 07 December 2016
by
Steven E. Pav
Rather than fade into ignominy as a lesser Ralph Nader, Jill Stein has managed
to twist the knife in the still-bleeding Left, fleecing a couple of million
from some disoriented voters for a recount of the election in Wisconsin,
Michigan, and Pennsylvania. While a recount seems like a less likely path
to victory for Clinton than, say, a revolt of the Electoral College, or
the Donald pulling an Andy Kaufman, perhaps it should be undertaken if
there is any evidence of fraud. Recall that prior to the election (and since!)
we were warned of the possibility of 'massive voter fraud'. I am not familiar
with the legal argument for a recount, but was curious if there is a
statistical argument for one. I pursue a simple analysis here.
The arguments that I have heard for a recount (other than the danger to our
republic from giving power to a mentally unstable blowhard, but I will try to keep my
political bias out of here) sounded pretty weak, as they could easily be
explained away by an omitted variable. For example, arguments of the form
"Trump outperformed Clinton in counties with electronic voting machines," even
if couched in a 'proper' statistical test, are likely to be assuming
independence of those events, when they need not be independent for numerous
reasons.
Instead, I will fall back here to a weaker analysis, based on
Benford's Law. Benford's Law,
which is more of a stylized fact, states that the leading digit of naturally
occurring collections of numbers should follow a certain distribution.
Apparently this method was used to detect suspicious patterns in the 2009
Iranian elections, so you would expect only an amateur ballot-stuffer would
expose themselves to this kind of diagnostic.
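For concreteness, Benford's Law says that leading digit d should appear with probability log10(1 + 1/d). A minimal sketch of one common way to check a set of counts against it, a chi-squared goodness-of-fit test, might look like the following. The data here are simulated, the helper `leading_digit` is a hypothetical name, and this is not necessarily the test used in the full post.

```r
# Benford probabilities for leading digits 1 through 9
benford_p <- log10(1 + 1 / (1:9))

# Hypothetical helper: leading digit of each positive whole-number count
leading_digit <- function(counts) {
  counts <- as.integer(counts[counts > 0])
  as.integer(substr(sprintf("%d", counts), 1, 1))
}

# Simulated counts spread over several orders of magnitude (NOT real vote data)
set.seed(1)
fake_counts <- round(10^runif(500, min = 0, max = 4))

# Observed leading-digit frequencies, compared to Benford via chi-squared
observed <- tabulate(leading_digit(fake_counts), nbins = 9)
result <- chisq.test(observed, p = benford_p)

print(round(benford_p, 3))
print(result$p.value)
```

A small p-value would indicate that the leading digits depart from the Benford frequencies. Departure alone is not proof of tampering, since counts confined to a narrow range of magnitudes need not follow the law.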
First I grab the ward-by-ward Wisconsin voter data.
This set is …
read more
IMDb Rating by Sex
Thu 21 July 2016
by
Steven E. Pav
The nerdosphere is in a
minor tizzy over a putative bias
in IMDb ratings for the new (2016) Ghostbusters film.
It seems a bit odd to me, since IMDb ratings have always been horribly
'biased': If the question you are trying to answer is "If I am forced to watch
this randomly selected movie, will I like it?", then IMDb ratings, and most
aggregated movie ratings, are difficult to interpret and very likely 'biased'.
The typical mechanism by which a rating ends up on IMDb is that a person
somehow gains an awareness of the film (this has been the major problem
for studios since the end of the studio-theatre model seventy years ago),
enough so to view the film; they are then more likely to rate the movie if
they liked it, or liked it more than expected, or really hated it. Those
who had low to middling opinions of the film are less likely to rate it,
and so you have the problem of missing data, without the simplifying assumption
of "missing at random."
The Ghostbusters argy bargy (or one of them) is that reviews are suspected to be
coming from people who have not seen the movie. This is possibly a
problem for all reviews on IMDb, though less so for reviews
appearing in streaming services, which know when you have seen a film.
(The other argy bargy is that sexist and racist jerks have been
harassing stars of the new film.)
The analysis on five thirty eight
is informative, but uses information (e.g. age and sex of the reviewers) that is not
widely available, and which is volunteered by the reviewers. Given the
IMDb mirror at my disposal, I can
look for systematic biases for films based on sex, and will do so here.
I …
read more
IMDb Rating by Actor Age
Wed 13 July 2016
by
Steven E. Pav
I recently looked at IMDb ratings for Robert De Niro movies,
finding slight evidence for a dip in ratings in his third act. I noted then
that the data were subject to all kinds of selection biases, and that even in a
perfect world would only reflect the ratings of movies that De Niro was in,
not of his individual performance. I speculated that older actors might no
longer be offered parts in good movies. This is something that can be explored
via the IMDb mirror at my disposal, but
only very weakly: if actors 'stopped caring' after a certain age, or declined
in abilities, or even if IMDb raters simply liked movies with more young
people, one might see the same patterns in the data. Despite these caveats,
let us press on.
That struts and frets his hour upon the stage
First, I collect all movies which are not marked as Documentary in the data,
and which have a production year between 1965 and 2015, and have at least 250
votes on IMDb. This does present a selection bias towards better movies in the
earlier period, which we will have to correct for. I then collect actors and actresses
with a known date of birth who have featured in at least 30 of these films.
I bring them into R via dplyr, and then subselect to observations where
the actor was between 18 and 90 in the production year of the film. This should
look like a lot of blah blah blah, but you can follow along at home if you
have the mirror, which you can install yourself.
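If you do not have the mirror, the selection rules just described can be tried out on a toy data frame. Everything in this sketch is an assumption made for illustration: the column names, the toy values, and the function `select_roles` are hypothetical. It also approximates age as production year minus birth year, and it leaves out the Documentary exclusion.

```r
library(dplyr)

# Toy stand-in for the actor/film table; all names and values are hypothetical.
toy <- data.frame(
  actor = c("A", "A", "A", "B", "B", "C"),
  birth_year = c(1950, 1950, 1950, 1990, 1990, 1940),
  production_year = c(1965, 1980, 2015, 2005, 2012, 2000),
  votes = c(300, 5000, 1200, 800, 100, 400),
  stringsAsFactors = FALSE
)

# Apply the filters in the order described in the text:
# films first, then actors with enough films, then the age window.
select_roles <- function(df, min_votes = 250, min_films = 30,
                         min_age = 18, max_age = 90) {
  df %>%
    filter(votes >= min_votes,
           production_year >= 1965, production_year <= 2015) %>%
    group_by(actor) %>%
    filter(n() >= min_films) %>%
    ungroup() %>%
    mutate(age = production_year - birth_year) %>%
    filter(age >= min_age, age <= max_age)
}

# The toy data are tiny, so lower the 30-film threshold to 2 for the demo.
kept <- select_roles(toy, min_films = 2)
print(kept)
```

Counting films per actor before applying the age window follows the order given in the text, so an actor qualifies on the basis of all eligible films, not only those made between the ages of 18 and 90.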
library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port …
read more