Kaggle released another interesting data set. This time it’s a loan book of a P2P lender – Lending Club.
I had a stab at analysing it; here are some teaser charts, and more can be found here.
Last month I took part in my first Kaggle competition using BNP Paribas Cardif’s data. The aim was to accelerate the claims management process, but my personal goal was to apply machine learning techniques.
That officially makes me a Kaggler 😛
I used the xgboost R package to implement gradient boosting. The results are out, so I know I have a long way to go in improving my ML skills. I guess that I will need to work more on feature engineering and on ensembling my models in future.
One of my work projects that gained a lot of publicity was an analysis of residential property sales in England and Wales. The underlying data was collected by the Land Registry and is publicly available.
The Land Registry also makes its House Price Index data publicly available. I used it to create the following visualization:
R has a number of libraries that can be used for plotting. They can be combined with open GIS data to create custom maps.
In this post I’ll demonstrate how to create several maps.
The first step is getting the shapefiles that will be used to create the maps. One possible source is this site, but any source with open .shp files will do.
Here I’ll focus on country level (administrative) data for Poland.
If you follow the link to diva-gis you should see the following screen:
After downloading and unzipping POL_adm.zip into your working directory in R you will be able to use the scripts underneath to recreate the maps.
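The original scripts are not reproduced here, so the following is only a minimal sketch of how a basic map might be drawn from the downloaded files. It assumes the sf and ggplot2 packages rather than the rgdal and maptools packages listed in the session info further down, and it assumes that the unzipped archive contains a file called POL_adm1.shp in the working directory; adjust the path to whatever your download actually contains.

```r
# Assumed name of the first-level administrative boundaries file.
# This file name is an assumption; check the contents of the unzipped archive.
shp_path <- "POL_adm1.shp"

# Only draw the map when the file and both packages are available,
# so the script reports the problem instead of stopping with an error.
can_plot <- file.exists(shp_path) &&
  requireNamespace("sf", quietly = TRUE) &&
  requireNamespace("ggplot2", quietly = TRUE)

if (can_plot) {
  # Read the shapefile as a simple features data frame.
  poland <- sf::st_read(shp_path, quiet = TRUE)

  # Draw the boundaries with a light fill and a plain theme.
  map <- ggplot2::ggplot(poland) +
    ggplot2::geom_sf(fill = "grey90", colour = "grey40") +
    ggplot2::theme_minimal()
  print(map)
} else {
  message("Shapefile or packages not found; nothing was plotted.")
}
```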
Nicer maps can be generated with the ggmap package. This package allows adding a shapefile overlay onto Google Maps or OSM. In this example I used the get_googlemap function, but if you want a different background you should use get_map with the appropriate arguments.
Code used to generate the map above:
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
 LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
 LC_TIME=English_United Kingdom.1252
attached base packages:
 stats graphics grDevices utils datasets methods base
other attached packages:
 rgdal_1.1-7 ggmap_2.6.1 ggplot2_2.1.0 leaflet_1.0.1 maptools_0.8-39
loaded via a namespace (and not attached):
 Rcpp_0.12.4 magrittr_1.5 maps_3.1.0 munsell_0.4.3
 colorspace_1.2-6 geosphere_1.5-1 lattice_0.20-33 rjson_0.2.15
 jpeg_0.1-8 stringr_1.0.0 plyr_1.8.3 tools_3.2.4
 grid_3.2.4 gtable_0.2.0 png_0.1-7 htmltools_0.3.5
 yaml_2.1.13 digest_0.6.9 RJSONIO_1.3-0 reshape2_1.4.1
 mapproj_1.2-4 htmlwidgets_0.6 labeling_0.3 stringi_1.0-1
 RgoogleMaps_18.104.22.168 scales_0.4.0 jsonlite_0.9.19 foreign_0.8-66
Kaggle publishes many interesting datasets, and one of them includes various world university rankings.
I decided to run a quick analysis of the CWUR data and create a map in R using the rworldmap package.
Here’s the gist:
My latest script for this analysis can be found on Kaggle.
Praat is a great tool for analysing speech data but lately I came across a frustrating problem. While trying to open a txt file (vector of numbers) in Praat I would get the following error message:
File not recognized. File not finished.
After consulting my fellow PhD students I discovered that what I was missing was a header enabling Praat to read txt files.
The simplest way to fix this error is to add the following header to a text file using your favourite text editor:
However, if you want to automate the process, scripting can save you a lot of time. That’s why I created a function (txt2praat.R) that adds this header to the original text file and saves the output to a new text file.
You can use the function in the following way:
txtfile <- file.choose()
These commands should create a txt file (testfile - modified) that includes the short header. The new file can then be opened in Praat without the error message.
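For readers who want to see the general idea, here is a minimal sketch of a header-adding helper in base R. It is not the txt2praat.R function itself: the name add_header and its arguments are hypothetical, and the header text is passed in by the caller because the actual Praat header is not assumed here.

```r
# Write `header` followed by the contents of `infile` to `outfile`.
# `header` is supplied by the caller; no real Praat header is assumed here.
add_header <- function(infile, outfile, header) {
  body <- readLines(infile, warn = FALSE)
  writeLines(c(header, body), outfile)
  invisible(outfile)
}

# Usage with a throwaway file and a placeholder header
# (the placeholder is NOT a valid Praat header).
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("0.1", "0.2", "0.3"), infile)
add_header(infile, outfile, header = "PLACEHOLDER HEADER LINE")
result <- readLines(outfile)
```

Writing to a new file instead of overwriting the input keeps the original data untouched, which matches the approach described above.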
I finally found some time to crunch numbers from a Kaggle swag competition. The available dataset was rather large, but I wanted to focus on the latest data (from 2013) so I only analysed MERGED2013_PP.csv. I started filtering numbers in R but then decided to move back to Tableau for interactive visualizations. The result can be seen underneath and I hope it’s self-explanatory.
I’m back to analysing political data after finding a nicely formatted data set on one of my favourite blogs. The blog post that inspired me to do it discussed the possibility of predicting election results using polls and popularity data found online. In brief, the response is: not yet. However, with the increasing number of people using digital media and opinion polls, these channels will have more impact on future political campaigns.
I haven’t used the actual election results in this analysis; I only used the variables that came with the compiled data set. The variables in question are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).
After loading the data, I used the missmap function to examine the missing values. It seems like there are quite a few gaps in the data about the polls, social media, and Google Trends (in decreasing order).
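The missmap function comes from the Amelia package and draws the gaps as a picture. As a rough numeric complement, the missing values can also be counted per column in base R. The sketch below uses a toy data frame with made-up values and hypothetical column names, not the election data.

```r
# Toy data frame with made-up values; column names are hypothetical.
toy <- data.frame(
  polls         = c(NA, NA, 30, NA),
  social_media  = c(120, NA, 150, NA),
  google_trends = c(40, 55, NA, 60)
)

# Number of missing values in each column, largest first.
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_counts)
```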
To get an overview I used tableplot from the tabplot package.
The next step was plotting time series of the individual variables.
The plots above show that the overall Social Media and Google Trends activity (dark blue line) increased closer to the election day. The averaged rating (dark blue line) of all parties in the polls seemed fairly stable. This is probably not the most interesting finding so splitting the values by party/candidate would be recommended.
An autocorrelation analysis was run on the cleaned data frame (NAs were removed) to show how each variable correlates with its own earlier values.
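As an illustration of this step, here is a minimal sketch using acf from the stats package that ships with R. The series is made up for the example and is not the election data.

```r
# Made-up series with a couple of gaps; not the election data.
x <- c(5, 7, NA, 6, 9, 11, NA, 10, 13, 15)

# Drop the missing values, then compute (without plotting) the autocorrelation.
clean <- as.numeric(na.omit(x))
result <- acf(clean, lag.max = 3, plot = FALSE)

# result$acf holds one value per lag, starting at lag 0.
print(round(as.vector(result$acf), 3))
```

Note that dropping the NAs closes the gaps in the series, so the lags refer to positions in the cleaned vector rather than to the original time points.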
And here’s the code:
Recently I’ve been playing with the idea of comparing popularity of various people and ideas. I’ve previously queried Wikipedia pageviews using R but I wondered whether the same can be done with Google Trends or Google Ngrams. Both of these Google services provide interesting insights into relative popularity of various queries. Luckily for me there were other people who created fantastic connections between R and Google Trends and Google Ngrams.
One of the topics that interested me as an experimental psychologist was the changing popularity of two psychoanalysts – Sigmund Freud and Carl Jung. Knowing that psychology is becoming more empirical, I expected that these two gentlemen would start losing their stardom as time goes by.
Trends extracted from Google Ngrams show the peak popularity for both psychoanalysts around 1995. The relative frequency of occurrence of their names seems to have declined since then.
However, the last year recorded in the Ngram data was 2008, so things could have changed since that time. To answer this question I queried Google Trends, which shows the relative frequency of Google search terms. I didn’t set the locale in the function so I assume that the results are for global searches (but I used English spelling of the names).
The results from Google Trends support the Ngram results. A decline in the popularity of both Freud and Jung can be observed using this measure.
It was just a brief write-up of my analysis so feel free to modify my code:
Recently I came across an interesting data journalism project called The Migrants’ Files, which collects and analyses information related to migration. Data about the dead and missing would-be migrants was publicly available, so I created a dashboard in Tableau using a Google Spreadsheet Web Connector (described in my previous post).
Here’s the result: