Polling data for 2015 Polish parliamentary election

I’m back to analysing political data after finding a nicely formatted data set on one of my favourite blogs. The blog post that inspired me discussed the possibility of predicting election results using polls and popularity data found online. In brief, the answer is: not yet. However, with the increasing number of people using digital media and opinion polls, these channels will have more impact on future political campaigns.
I haven’t used the actual election results in this analysis; I only used the variables that came with the compiled data set. The variables in question are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).

After loading the data, I used the missmap function from the Amelia package to examine the missing values. There are quite a few gaps in the data: most in the polls, then social media, then Google Trends.
[Figure: missmap_PL_elections_2015]
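The missingness map gives a visual impression; the same ranking can be checked numerically. Here is a minimal sketch in base R, using a made-up toy data frame (`toy`) that only borrows the column names from the data set. The numbers are invented for illustration and are not the real data.

```r
# Toy data frame standing in for `dane`; all values are made up.
toy <- data.frame(
  objekt = c("A", "A", "A", "B", "B"),
  sm     = c(10, NA, 12, NA, 15),
  google = c(40, 35, NA, 50, 45),
  sondaz = c(NA, 30, NA, NA, 28)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of missing values in each column, most incomplete first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

print(na_counts)
print(na_share)
```

Running the same two lines on `dane` instead of `toy` should give the per-column counts behind the map.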
To get an overview I used tableplot from the tabplot package.
[Figure: tableplot_PL_elections_2015]
The next step was plotting time series of the individual variables.


The plots above show that overall Social Media and Google Trends activity (dark blue line) increased closer to election day. The average poll rating across all parties (dark blue line) seemed fairly stable. This is probably not the most interesting finding, so splitting the values by party/candidate would be a sensible next step.
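One way to split the smoothed trend by party/candidate is to map the colour to the grouping column for the whole plot and add facets. This is only a sketch on simulated data: the toy data frame, the party names, and the random-walk values are assumptions of mine, and only the column names (`data`, `objekt`, `sondaz`) follow the data set.

```r
library(ggplot2)

# Simulated stand-in for `dane`: two hypothetical parties, 30 days each.
set.seed(1)
toy <- data.frame(
  data   = rep(seq(as.Date("2015-09-01"), by = "day", length.out = 30), times = 2),
  objekt = rep(c("Party A", "Party B"), each = 30),
  sondaz = c(30 + cumsum(rnorm(30)), 20 + cumsum(rnorm(30)))
)

# Mapping colour in the top-level aes() makes geom_smooth fit one
# loess curve per party instead of a single curve over all rows.
p <- ggplot(toy, aes(x = data, y = sondaz, color = objekt)) +
  geom_line(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  facet_wrap(~ objekt) +
  scale_x_date("Date") +
  scale_y_continuous("Opinion Poll")

print(p)
```

The same pattern should carry over to `sm` and `google` by changing the `y` aesthetic and the axis label.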

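Autocorrelation was computed on the cleaned data frame (rows with NAs were removed) to show how each variable correlates with lagged copies of itself. Because acf receives several columns at once, it also reports the cross-correlations between pairs of variables.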
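If the rows for different parties are stacked in one data frame, as the `objekt` column suggests, a single acf call treats them as one long series. A possible refinement, sketched here on simulated data (the toy values and party names are assumptions, not the real data), is to compute the autocorrelation separately for each party.

```r
# Simulated stand-in for `dane`: two hypothetical parties, 40 rows each.
set.seed(42)
toy <- data.frame(
  objekt = rep(c("Party A", "Party B"), each = 40),
  sondaz = c(30 + cumsum(rnorm(40)), 20 + cumsum(rnorm(40)))
)
toy$sondaz[c(5, 50)] <- NA  # a couple of gaps, as in the real data

# One acf object per party. na.omit drops the missing rows, as in the
# script further down; this closes the gaps, so lags are counted in
# observations rather than in calendar days.
acf_by_party <- lapply(split(toy, toy$objekt), function(d) {
  acf(na.omit(d$sondaz), lag.max = 10, plot = FALSE)
})

# Lag-1 autocorrelation for each party (element 1 is lag 0)
lag1 <- sapply(acf_by_party, function(a) a$acf[2])
print(lag1)
```

Setting `plot = TRUE` (the default) would draw one correlogram per party instead of returning the values silently.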

And here’s the code:

# analyse data about 2015 Polish parliamentary elections
# more info: http://smarterpoland.pl/index.php/2015/12/czy-internet-pozwala-przewidziec-wyniki-wyborow/
library(Amelia)   # missmap()
library(ggplot2)
library(tabplot)  # tableplot()

# load the data (tab-separated, comma as the decimal mark)
dane <- read.table("https://raw.githubusercontent.com/pbiecek/SmarterPoland_blog/master/dane/Wybory2015/2r.txt",
                   header = TRUE,
                   sep = "\t",
                   dec = ",")

# map the missing values
missmap(dane)

# overview plot
tableplot(dane,
          select = c(objekt, sm, google, sondaz),
          sortCol = objekt)

# convert the date column from text to Date
dane$data <- as.Date(dane$data, format = "%Y-%m-%d")

# plot the time series: one line per party/candidate,
# plus a single loess curve over all rows (the overall trend)
ggplot(data = dane, aes(x = data, y = sondaz)) +
  geom_line(aes(color = objekt), size = 1) +
  scale_x_date("Date") +
  scale_y_continuous("Opinion Poll") +
  geom_smooth(method = "loess", size = 1)

ggplot(data = dane, aes(x = data, y = sm)) +
  geom_line(aes(color = objekt), size = 1) +
  scale_x_date("Date") +
  scale_y_continuous("Social Media") +
  geom_smooth(method = "loess", size = 1)

ggplot(data = dane, aes(x = data, y = google)) +
  geom_line(aes(color = objekt), size = 1) +
  scale_x_date("Date") +
  scale_y_continuous("Google Trends") +
  geom_smooth(method = "loess", size = 1)

# autocorrelation
keeps <- c("google", "sm", "sondaz")
# keep only the numeric columns and drop rows with NAs to run acf
daneWOdate <- na.omit(dane[keeps])
acf(daneWOdate)