The html report generated with RMarkdown and the latest version uploaded to Kaggle (kernel) is here.
Some time ago I discovered London Datastore, a governmental data repository publishing a wide variety of interesting data sets. One of the data sets that drew my attention was describing the composition of the school population in England by first language. Being a non-native English speaker myself, I decided to see whether I could see any interesting patterns and to create a set of choropleth maps.
These maps show that the higher percentage of primary and secondary schools pupils, whose first language is English, tend to occur in the outer London boroughs, e.g. Havering, Bexley, and Bromley. On the other hand, larger percentage of pupils, whose first language is not English, can be found in boroughs in East London (with Tower Hamlets and Newham having especially large percentage).
Last month I took part in my first Kaggle competition using BNP Paribas Cardif’s data. The aim was to accelerate claims management process but my personal goal was to apply machine learning techniques.
That officially makes me a Kaggler 😛
I used xgboost R package to implement gradient boosting. The results are out so I know there’s a long way for me to improve my ML skills. I guess that I will need to work more on feature engineering and ensembling my models in future.
One of my work projects which gained a lot of publicity was analysing residential property sales in England and Wales. Underlying data was collected by Land Registry and is publicly available.
Land Registry also makes their House Price Index data publicly available. I used it to create the following visualization:
R has a number of libraries that can be used for plotting. They can be combined with open GIS data to create custom maps.
In this post I’ll demonstrate how to create several maps.
First step is getting shapefiles that will be used to create maps. One of the sources could be this site, but any source with open .shp files will do.
Here I’ll focus on country level (administrative) data for Poland.
If you follow the link to diva-gis you should see the following screen:
After downloading and unzipping POL_adm.zip into your working directory in R you will be able to use the scripts underneath to recreate the maps.
Nicer maps can be generated with ggmap package. This package allows adding a shapefile overlay onto Google Maps or OSM. In this example I used
get_googlemap function, but if you want other background then you should use
get_map with appropriate arguments.
Code used to generate the map above:
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
 LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
 LC_TIME=English_United Kingdom.1252
attached base packages:
 stats graphics grDevices utils datasets methods base
other attached packages:
 rgdal_1.1-7 ggmap_2.6.1 ggplot2_2.1.0 leaflet_1.0.1 maptools_0.8-39
loaded via a namespace (and not attached):
 Rcpp_0.12.4 magrittr_1.5 maps_3.1.0 munsell_0.4.3
 colorspace_1.2-6 geosphere_1.5-1 lattice_0.20-33 rjson_0.2.15
 jpeg_0.1-8 stringr_1.0.0 plyr_1.8.3 tools_3.2.4
 grid_3.2.4 gtable_0.2.0 png_0.1-7 htmltools_0.3.5
 yaml_2.1.13 digest_0.6.9 RJSONIO_1.3-0 reshape2_1.4.1
 mapproj_1.2-4 htmlwidgets_0.6 labeling_0.3 stringi_1.0-1
 RgoogleMaps_18.104.22.168 scales_0.4.0 jsonlite_0.9.19 foreign_0.8-66
Kaggle publishes many interesting datasets and one of them was including various world university rankings.
I decided to run a quick analysis of the CWUR data and create a map in R using rworldmap package.
Here’s the gist:
My latest script for this analysis can be found on Kaggle.
Praat is a great tool for analysing speech data but lately I came across a frustrating problem. While trying to open a txt file (vector of numbers) in Praat I would get the following error message:
File not recognized. File not finished.
After consulting my fellow PhD students I discovered that what I was missing was a header enabling Praat to read txt files.
The simplest way to fix this error is to add the following header to a text file using your favourite text editor:
However, if you want to automate the process then scripting can save you a lot of time. That’s why I created a function (txt2praat.R) appending this header to the original text file and saving the output to a new text file.
You can use the function in the following way:
txtfile <- file.choose()
These commands should create a txt file (testfile - modified) appended with the short header. New file can be then opened in Praat without the error message.
I finally found some time to crunch numbers from a Kaggle swag competition. Available dataset was rather large, but I wanted to focus on the latest data (from 2013) so I only analysed MERGED2013_PP.csv. I started filtering numbers in R but then I decided to move back to Tableau for interactive visualizations. The result can be seen underneath and I hope it’s self-explanatory.
I’m back to analysing political data after finding nicely formatted data set on one of my favourite blogs. The blog post that inspired me to do it discussed the possibility of predicting election results using polls and popularity data found online. In brief, he response is: not yet. However, with the increasing number of people using digital media and opinion polls, these channel will have more impact on the future political campaigns.
I haven’t used the actual results in this analysis but I only used the variables that came with the compiled data set. The variables in questions are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).
After loading the data, I used the missmap function to examine the missing values. It seems like there are quite a few gaps in the data about the polls, social media, and Google Trends (in the decreasing order).
To get an overview I used tableplot from the tabplot package.
The next step was plotting time series of the individual variables.
The plots above show that the overall Social Media and Google Trends activity (dark blue line) increased closer to the election day. The averaged rating (dark blue line) of all parties in the polls seemed fairly stable. This is probably not the most interesting finding so splitting the values by party/candidate would be recommended.
Autocorrelation was conducted on the cleaned data frame (NAs were removed) to show how the variables correlate with themselves.
And here’s the code: