Research Councils UK Diversity Data

I recently came across a data set published by Research Councils UK (RCUK) about the diversity of grant applicants. The document didn’t include any plots, so I decided to fix that. Unfortunately, the report was only available as a PDF, which is not an ideal format for extracting data.
Luckily, I found Tabula, an open-source project that extracts data from tables in PDF files. With the help of Tabula I extracted data from the document to create plots. The formatting of the original document required too much data wrangling, so I only used data from the first table.
These data show an estimate of the academic populations applying to the Research Councils.

Higher Education Salaries in the UK

Times Higher Education published a 2015 pay survey in April, but I have only now found time to create interactive visualisations of their data.

I had to edit the table to make it usable in Tableau. I removed the ‘All’ column from the original data source. This column most likely represented weighted means. I couldn’t find any information about how it was calculated, so I only used ‘Female’ and ‘Male’ values. Combined (‘All’) values used in my workbooks are means of these two fields.
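
To make the difference concrete: a simple mean of the two fields matches a headcount-weighted mean only when both groups are the same size. The short Python sketch below uses made-up salary and headcount figures (not taken from the survey), and the function names are just for illustration.

```python
def simple_mean(female, male):
    """Unweighted mean of the two fields, as used for the combined values here."""
    return (female + male) / 2


def weighted_mean(female, male, n_female, n_male):
    """Headcount-weighted mean, which the original 'All' column may have been."""
    return (female * n_female + male * n_male) / (n_female + n_male)


# Illustrative numbers only: they are NOT from the Times Higher Education survey.
female_pay, male_pay = 40000, 50000
n_female, n_male = 100, 300

print(simple_mean(female_pay, male_pay))                      # 45000.0
print(weighted_mean(female_pay, male_pay, n_female, n_male))  # 47500.0
```

The more unequal the group sizes, the further the simple mean drifts from the weighted one, which is worth keeping in mind when reading the combined values.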

I used Edubase data to find additional information about universities, e.g. region or location (available only for HE institutions in England).

Speech SENCOs in England

The Department for Education regularly publishes a list of educational establishments in England and Wales. I was interested in seeing which areas have the largest number of schools with Speech Special Educational Needs Coordinators (SENCOs), and what the characteristics of those schools are. To find out, I created a visualisation in Tableau.

The first tab (‘Map’) shows a map of educational establishments with or without Speech SENCOs, along with the type of establishment, number of establishments, and total number of pupils.
The second tab (‘Pupils by Region and FSM’) shows a fine-grained description of pupils in different types of educational establishments.

Butterworth Filter Demo in Shiny

I use EEGLab to process my electroencephalographic data (i.e. recordings of the brain’s electrical activity), but I wanted an interactive visualisation showing how different filter settings change my data. I prefer R to Matlab, so I decided to create a Shiny app that would do just that.

I tried to filter brainstem activity recorded during several speech conditions using a Butterworth band-pass filter to get rid of artefacts.

I wrote a butterHz function, which is based on butter_filtfilt.m from the EEGLab Matlab package and uses the butter function from the signal R package.
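
Functions like butter in the signal package expect band edges as fractions of the Nyquist frequency (half the sampling rate) rather than in Hz, so a wrapper that accepts Hz has to do that conversion somewhere. The Python sketch below shows only this one step. normalise_band is a hypothetical helper, not the butterHz code, and the sampling rate and cutoffs are illustrative values.

```python
def normalise_band(low_hz, high_hz, sampling_rate_hz):
    """Convert band-pass edges in Hz to fractions of the Nyquist frequency."""
    nyquist = sampling_rate_hz / 2
    if not 0 < low_hz < high_hz < nyquist:
        raise ValueError("need 0 < low < high < Nyquist frequency")
    return low_hz / nyquist, high_hz / nyquist


# Illustrative values: a 100-1500 Hz band-pass for data sampled at 20 kHz.
print(normalise_band(100, 1500, 20000))  # (0.01, 0.15)
```

The check at the top catches the most common mistake, which is asking for a cutoff at or above the Nyquist frequency.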

Here I used a time-domain waveform of speech-evoked Auditory Brainstem Responses to demonstrate the use of the Butterworth filter.

The code is available on GitHub.

Foreign crime victims in Poland

Recently I read an article (PL) about the Polish police massaging statistics. It made me wonder what kind of data is available on their website and whether any interesting patterns could be observed.

The website offers some data, but it is badly formatted, not very recent, and can only be downloaded as a PDF :O

I didn’t feel like scraping the page, so I manually copied and pasted the data from the website and did some initial preprocessing in Excel, extracting the numbers following the backslash.


I decided to focus on the dataset ‘Foreign – Crime’. Surprisingly enough, both crime perpetrators and victims are lumped together in one table, separated by a backslash. As if that weren’t bad enough, someone decided to split the table in two, each part with a different number of rows and some missing values (marked as ‘bd’). Victims/suspects from countries not specified in the table were aggregated in the total values (Pl: ‘RAZEM’). I intentionally omitted these values from my analyses.
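
I did this step by hand in Excel, but the same cleanup is easy to script. The Python sketch below splits a cell on the backslash and treats ‘bd’ as missing. The cell values are made up, parse_cell is a hypothetical helper, and it deliberately returns both numbers without assuming which side holds victims and which holds suspects.

```python
def parse_cell(cell):
    """Split a 'before\\after' cell into two integers; 'bd' becomes None."""
    def to_number(text):
        text = text.strip()
        return None if text.lower() == "bd" or text == "" else int(text)

    before, _, after = cell.partition("\\")
    return to_number(before), to_number(after)


# Made-up cells in the same shape as the police table.
for cell in ["12\\3", "bd\\7", "5\\bd"]:
    print(cell, "->", parse_cell(cell))
```

Keeping missing values as None rather than 0 matters here, because a zero would show up in a plot as a real count.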

The original(-ish) data was in wide format, but I needed to turn it into long format. I used tidyr for that:
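
The tidyr code itself is not reproduced here. As a rough illustration of what a wide-to-long reshape does, here is a minimal Python sketch using only the standard library. It assumes a layout of one row per country and one column per year; that layout, the column names, and the numbers are all placeholders rather than the real table.

```python
def wide_to_long(rows, id_column, name_to="year", value_to="count"):
    """Turn one row per country with a column per year into one row per pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row instead
            long_rows.append({id_column: row[id_column], name_to: column, value_to: value})
    return long_rows


# Made-up wide table: the column names and numbers are placeholders.
wide = [
    {"country": "A", "2012": 10, "2013": 12},
    {"country": "B", "2012": 4, "2013": None},
]

for record in wide_to_long(wide, "country"):
    print(record)
```

The long format has one observation per row, which is the shape that plotting libraries such as ggplot2 expect for a heatmap.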

Then I created a heatmap using ggplot2 and RColorBrewer.

The result was this heatmap:


Now it’s pretty obvious which country’s citizens were the most common crime victims in Poland, if you focus on the raw numbers registered by the police. This dataset doesn’t include any information about the number of visitors from other countries, so it’s hard to say how likely a foreigner is to become a crime victim in Poland.

I wanted to have some interactivity and I didn’t have much time, so I made a dashboard in Tableau:

Tableau is a much faster way to create static or interactive plots, but the results are harder to reproduce than plots made in R.

Popularity of UK political parties on Wikipedia

The 2015 General Election is coming, so I decided to compare the popularity of the main UK political parties on Wikipedia.

I gathered the data about Wikipedia page views using the wikipediatrend R package. Pretty plots were made with dygraphs and ggplot2. Wikipedia article traffic was collected from 1st March 2015 to 2nd May 2015, for the English-language version of the site only.

Wikipedia interest in Labour, the Conservatives, and UKIP is closely aligned. Another tier consists of the Liberal Democrats and the SNP, which seem to get a similar number of page views. The Greens have the lowest total volume of the main parties, with a marked peak on 1st April.

All parties experienced a boost in the traffic volume around the time of major TV debates (2nd, 16th, and 30th April 2015).

Page views of articles about political parties decline at weekends. The pattern is fairly consistent and can be observed for all parties.
Here is how it looks for Labour:

Page views of Labour Party (UK) on English-language Wikipedia
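
The weekend dip can also be checked numerically by comparing average views on weekdays with average views at weekends. The Python sketch below does that for a list of (date, views) pairs. The counts are made up for illustration and are not the real Wikipedia figures, and weekday_weekend_means is a hypothetical helper.

```python
from datetime import date
from statistics import mean


def weekday_weekend_means(daily_views):
    """Return (mean weekday views, mean weekend views) from (date, views) pairs."""
    weekday, weekend = [], []
    for day, views in daily_views:
        # date.weekday() is 0 for Monday through 6 for Sunday.
        (weekend if day.weekday() >= 5 else weekday).append(views)
    return mean(weekday), mean(weekend)


# Made-up counts for one week in April 2015, NOT real page-view data.
sample = [
    (date(2015, 4, 6 + offset), views)
    for offset, views in enumerate([900, 950, 1000, 1100, 1050, 600, 700])
]

weekday_mean, weekend_mean = weekday_weekend_means(sample)
print(weekday_mean, weekend_mean)
```

With real data it would be worth excluding the debate days first, since those spikes can mask the weekly rhythm.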