My first Kaggle competition

Last month I took part in my first Kaggle competition using BNP Paribas Cardif’s data. The aim was to accelerate claims management process but my personal goal was to apply machine learning techniques.
That officially makes me a Kaggler 😛
I used xgboost R package to implement gradient boosting. The results are out so I know there’s a long way for me to improve my ML skills. I guess that I will need to work more on feature engineering and ensembling my models in future.

Using ESRI shapefiles to create maps in R

R has a number of libraries that can be used for plotting. They can be combined with open GIS data to create custom maps.
In this post I’ll demonstrate how to create several maps.

First step is getting shapefiles that will be used to create maps. One of the sources could be this site, but any source with open .shp files will do.

Here I’ll focus on country level (administrative) data for Poland.
If you follow the link to diva-gis you should see the following screen:

I’ll plot powiats and voivodeships which are first- and second-level administrative subdivisions in Poland.

After downloading and unzipping into your working directory in R you will be able to use the scripts underneath to recreate the maps.

The simplest map is using only the shapefiles without any extra background.
Clearly, it’s not the most attractive map, but it’s still informative.
It was generated with the following code:

Nicer maps can be generated with ggmap package. This package allows adding a shapefile overlay onto Google Maps or OSM. In this example I used get_googlemap function, but if you want other background then you should use get_map with appropriate arguments.
Code used to generate the map above:

And last, but not least is my favourite interactive map created with leaflet.


> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rgdal_1.1-7 ggmap_2.6.1 ggplot2_2.1.0 leaflet_1.0.1 maptools_0.8-39
[6] sp_1.2-2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 magrittr_1.5 maps_3.1.0 munsell_0.4.3
[5] colorspace_1.2-6 geosphere_1.5-1 lattice_0.20-33 rjson_0.2.15
[9] jpeg_0.1-8 stringr_1.0.0 plyr_1.8.3 tools_3.2.4
[13] grid_3.2.4 gtable_0.2.0 png_0.1-7 htmltools_0.3.5
[17] yaml_2.1.13 digest_0.6.9 RJSONIO_1.3-0 reshape2_1.4.1
[21] mapproj_1.2-4 htmlwidgets_0.6 labeling_0.3 stringi_1.0-1
[25] RgoogleMaps_1.2.0.7 scales_0.4.0 jsonlite_0.9.19 foreign_0.8-66
[29] proto_0.3-10

Center for World University Rankings – Kaggle dataset

Kaggle publishes many interesting datasets and one of them was including various world university rankings.
I decided to run a quick analysis of the CWUR data and create a map in R using rworldmap package.

The initial results are here:
USA and China outnumber other countries by the number of universities in the CWUR data.

The map shows that USA by far outnumbers other countries in the top 100 universities according to CWUR.

Here’s the gist:

My latest script for this analysis can be found on Kaggle.

Read from txt file in Praat – “File not recognized” error

Praat is a great tool for analysing speech data but lately I came across a frustrating problem. While trying to open a txt file (vector of numbers) in Praat I would get the following error message:
File not recognized. File not finished.
After consulting my fellow PhD students I discovered that what I was missing was a header enabling Praat to read txt files.
The simplest way to fix this error is to add the following header to a text file using your favourite text editor:

However, if you want to automate the process then scripting can save you a lot of time. That’s why I created a function (txt2praat.R) appending this header to the original text file and saving the output to a new text file.

You can use the function in the following way:
txtfile <- file.choose()
txt2praat(txtfile, testfile-modified)

These commands should create a txt file (testfile - modified) appended with the short header. New file can be then opened in Praat without the error message.

Analysing US Higher Education – Fees, admission rates, and SAT

I finally found some time to crunch numbers from a Kaggle swag competition. Available dataset was rather large, but I wanted to focus on the latest data (from 2013) so I only analysed MERGED2013_PP.csv. I started filtering numbers in R but then I decided to move back to Tableau for interactive visualizations. The result can be seen underneath and I hope it’s self-explanatory.

Polling data for 2015 Polish parliamentary election

I’m back to analysing political data after finding nicely formatted data set on one of my favourite blogs. The blog post that inspired me to do it discussed the possibility of predicting election results using polls and popularity data found online. In brief, he response is: not yet. However, with the increasing number of people using digital media and opinion polls, these channel will have more impact on the future political campaigns.
I haven’t used the actual results in this analysis but I only used the variables that came with the compiled data set. The variables in questions are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).

After loading the data, I used the missmap function to examine the missing values. It seems like there are quite a few gaps in the data about the polls, social media, and Google Trends (in the decreasing order).
To get an overview I used tableplot from the tabplot package.
The next step was plotting time series of the individual variables.


The plots above show that the overall Social Media and Google Trends activity (dark blue line) increased closer to the election day. The averaged rating (dark blue line) of all parties in the polls seemed fairly stable. This is probably not the most interesting finding so splitting the values by party/candidate would be recommended.

Autocorrelation was conducted on the cleaned data frame (NAs were removed) to show how the variables correlate with themselves.

And here’s the code:

Using R to query Google Trends and Ngrams

Recently I’ve been playing with the idea of comparing popularity of various people and ideas. I’ve previously queried Wikipedia pageviews using R but I wondered whether the same can be done with Google Trends or Google Ngrams. Both of these Google services provide interesting insights into relative popularity of various queries. Luckily for me there were other people who created fantastic connections between R and Google Trends and Google Ngrams.

One of the topics that interested me as an experimental psychologist was the changing popularity of two psychoanalysts – Sigmund Freud and Carl Jung. Knowing that psychology is becoming more empirical I expected that these two gentlemen will start losing their stardom as time goes by.
Trends extracted from Google Ngrams show the peak popularity for both psychoanalysts around 1995. The relative frequency of occurence of their names seems to decline since that time.

Google Ngram - Freud vs. Jung
Google Ngram – Freud vs. Jung

However, the last year recorded in the Ngram data was 2008, so things could have changed since that time. To answer this question I queried Google Trends, which shows the relative frequency of Google search terms. I didn’t set the locale in the function so I assume that the results are for global searches (but I used English spelling of the names).
The results from Google Trends support the Ngram results. Decline in popularity of both Freud and Jung can observed by using this measure.
Google Trends - Freud vs. Jung
Google Trends – Freud vs. Jung

It was just a brief write-up of my analysis so feel free to modify my code:

The Migrants’ Files – Counting the dead

Recently I came across an interesting data journalism project called The Migrant’s Files which collects and analyses information related to migrations. Data about the dead and missing would-be migrants was publicly available so I created a dashboard in Tableau using a Google Spreadsheet Web Connector (described in my previous post).
Here’s the result: