Pupils by first language in London boroughs (2015)

Some time ago I discovered London Datastore, a governmental data repository publishing a wide variety of interesting data sets. One of the data sets that drew my attention was describing the composition of the school population in England by first language. Being a non-native English speaker myself, I decided to see whether I could see any interesting patterns and to create a set of choropleth maps.

These maps show that the higher percentage of primary and secondary schools pupils, whose first language is English, tend to occur in the outer London boroughs, e.g. Havering, Bexley, and Bromley. On the other hand, larger percentage of pupils, whose first language is not English, can be found in boroughs in East London (with Tower Hamlets and Newham having especially large percentage).

My first Kaggle competition

Last month I took part in my first Kaggle competition using BNP Paribas Cardif’s data. The aim was to accelerate claims management process but my personal goal was to apply machine learning techniques.
That officially makes me a Kaggler 😛
I used xgboost R package to implement gradient boosting. The results are out so I know there’s a long way for me to improve my ML skills. I guess that I will need to work more on feature engineering and ensembling my models in future.

Using ESRI shapefiles to create maps in R

R has a number of libraries that can be used for plotting. They can be combined with open GIS data to create custom maps.
In this post I’ll demonstrate how to create several maps.

First step is getting shapefiles that will be used to create maps. One of the sources could be this site, but any source with open .shp files will do.

Here I’ll focus on country level (administrative) data for Poland.
If you follow the link to diva-gis you should see the following screen:

I’ll plot powiats and voivodeships which are first- and second-level administrative subdivisions in Poland.

After downloading and unzipping POL_adm.zip into your working directory in R you will be able to use the scripts underneath to recreate the maps.

The simplest map is using only the shapefiles without any extra background.
Clearly, it’s not the most attractive map, but it’s still informative.
It was generated with the following code:

Nicer maps can be generated with ggmap package. This package allows adding a shapefile overlay onto Google Maps or OSM. In this example I used get_googlemap function, but if you want other background then you should use get_map with appropriate arguments.
Code used to generate the map above:

And last, but not least is my favourite interactive map created with leaflet.


> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rgdal_1.1-7 ggmap_2.6.1 ggplot2_2.1.0 leaflet_1.0.1 maptools_0.8-39
[6] sp_1.2-2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 magrittr_1.5 maps_3.1.0 munsell_0.4.3
[5] colorspace_1.2-6 geosphere_1.5-1 lattice_0.20-33 rjson_0.2.15
[9] jpeg_0.1-8 stringr_1.0.0 plyr_1.8.3 tools_3.2.4
[13] grid_3.2.4 gtable_0.2.0 png_0.1-7 htmltools_0.3.5
[17] yaml_2.1.13 digest_0.6.9 RJSONIO_1.3-0 reshape2_1.4.1
[21] mapproj_1.2-4 htmlwidgets_0.6 labeling_0.3 stringi_1.0-1
[25] RgoogleMaps_1.2.0.7 scales_0.4.0 jsonlite_0.9.19 foreign_0.8-66
[29] proto_0.3-10

Center for World University Rankings – Kaggle dataset

Kaggle publishes many interesting datasets and one of them was including various world university rankings.
I decided to run a quick analysis of the CWUR data and create a map in R using rworldmap package.

The initial results are here:
USA and China outnumber other countries by the number of universities in the CWUR data.

The map shows that USA by far outnumbers other countries in the top 100 universities according to CWUR.

Here’s the gist:

My latest script for this analysis can be found on Kaggle.

Read from txt file in Praat – “File not recognized” error

Praat is a great tool for analysing speech data but lately I came across a frustrating problem. While trying to open a txt file (vector of numbers) in Praat I would get the following error message:
File not recognized. File not finished.
After consulting my fellow PhD students I discovered that what I was missing was a header enabling Praat to read txt files.
The simplest way to fix this error is to add the following header to a text file using your favourite text editor:

However, if you want to automate the process then scripting can save you a lot of time. That’s why I created a function (txt2praat.R) appending this header to the original text file and saving the output to a new text file.

You can use the function in the following way:
txtfile <- file.choose()
txt2praat(txtfile, testfile-modified)

These commands should create a txt file (testfile - modified) appended with the short header. New file can be then opened in Praat without the error message.

Analysing US Higher Education – Fees, admission rates, and SAT

I finally found some time to crunch numbers from a Kaggle swag competition. Available dataset was rather large, but I wanted to focus on the latest data (from 2013) so I only analysed MERGED2013_PP.csv. I started filtering numbers in R but then I decided to move back to Tableau for interactive visualizations. The result can be seen underneath and I hope it’s self-explanatory.

Polling data for 2015 Polish parliamentary election

I’m back to analysing political data after finding nicely formatted data set on one of my favourite blogs. The blog post that inspired me to do it discussed the possibility of predicting election results using polls and popularity data found online. In brief, he response is: not yet. However, with the increasing number of people using digital media and opinion polls, these channel will have more impact on the future political campaigns.
I haven’t used the actual results in this analysis but I only used the variables that came with the compiled data set. The variables in questions are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).

After loading the data, I used the missmap function to examine the missing values. It seems like there are quite a few gaps in the data about the polls, social media, and Google Trends (in the decreasing order).
To get an overview I used tableplot from the tabplot package.
The next step was plotting time series of the individual variables.


The plots above show that the overall Social Media and Google Trends activity (dark blue line) increased closer to the election day. The averaged rating (dark blue line) of all parties in the polls seemed fairly stable. This is probably not the most interesting finding so splitting the values by party/candidate would be recommended.

Autocorrelation was conducted on the cleaned data frame (NAs were removed) to show how the variables correlate with themselves.

And here’s the code: