Using ESRI shapefiles to create maps in R

R has a number of libraries that can be used for plotting. They can be combined with open GIS data to create custom maps.
In this post I’ll demonstrate how to create several maps.

The first step is getting the shapefiles that will be used to create the maps. One of the sources could be DIVA-GIS, but any source with open .shp files will do.

Here I’ll focus on country level (administrative) data for Poland.
If you follow the link to diva-gis you should see the following screen:
diva-gis_poland

I’ll plot voivodeships and powiats, which are the first- and second-level administrative subdivisions in Poland.

After downloading and unzipping POL_adm.zip into your R working directory, you will be able to use the scripts below to recreate the maps.

The simplest map uses only the shapefiles, without any background.
shapefile_map_poland_1
Clearly, it’s not the most attractive map, but it’s still informative.
It was generated with the following code:
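A minimal sketch, assuming the layer names match the files in the unzipped POL_adm.zip (adm1 = voivodeships, adm2 = powiats):

library(rgdal)   # also loads sp

# read the administrative boundaries from the working directory
voivodeships <- readOGR(dsn = ".", layer = "POL_adm1")
powiats      <- readOGR(dsn = ".", layer = "POL_adm2")

# plot powiats in grey and overlay voivodeship borders in black
plot(powiats, border = "grey")
plot(voivodeships, border = "black", lwd = 2, add = TRUE)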

Nicer maps can be generated with the ggmap package. This package allows adding a shapefile overlay onto Google Maps or OSM tiles. In this example I used the get_googlemap function, but if you want a different background you should use get_map with the appropriate arguments.
shapefile_map_poland_2_google_maps
Code used to generate the map above:
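A sketch of that step, reusing the voivodeships object from the previous snippet (the map centre and zoom are assumptions; recent ggmap versions also require a Google API key via register_google()):

library(ggmap)
library(ggplot2)

# fortify() turns the SpatialPolygonsDataFrame into a data frame ggplot2 can plot
voiv_df <- fortify(voivodeships)

# background map roughly centred on Poland
poland_bg <- get_googlemap(center = c(lon = 19.4, lat = 52.1), zoom = 6)

ggmap(poland_bg) +
  geom_polygon(data = voiv_df, aes(x = long, y = lat, group = group),
               colour = "darkred", fill = NA)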

And last, but not least, is my favourite: an interactive map created with leaflet.

Snippet:
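A minimal sketch of the leaflet version, again reusing the voivodeships object:

library(leaflet)

leaflet(voivodeships) %>%
  addTiles() %>%                # OSM background tiles
  addPolygons(weight = 2, color = "#444444", fillOpacity = 0.1)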


> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rgdal_1.1-7 ggmap_2.6.1 ggplot2_2.1.0 leaflet_1.0.1 maptools_0.8-39
[6] sp_1.2-2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 magrittr_1.5 maps_3.1.0 munsell_0.4.3
[5] colorspace_1.2-6 geosphere_1.5-1 lattice_0.20-33 rjson_0.2.15
[9] jpeg_0.1-8 stringr_1.0.0 plyr_1.8.3 tools_3.2.4
[13] grid_3.2.4 gtable_0.2.0 png_0.1-7 htmltools_0.3.5
[17] yaml_2.1.13 digest_0.6.9 RJSONIO_1.3-0 reshape2_1.4.1
[21] mapproj_1.2-4 htmlwidgets_0.6 labeling_0.3 stringi_1.0-1
[25] RgoogleMaps_1.2.0.7 scales_0.4.0 jsonlite_0.9.19 foreign_0.8-66
[29] proto_0.3-10

Center for World University Rankings – Kaggle dataset

Kaggle publishes many interesting datasets, and one of them includes various world university rankings.
I decided to run a quick analysis of the CWUR data and create a map in R using rworldmap package.

The initial results are here:
cwur_counties_by_universities_in_the_ranking
The USA and China have more universities in the CWUR data than any other country.

map_cwur_top_100
The map shows that the USA has by far the most universities in the CWUR top 100.

Here’s the gist:
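A rough sketch of the rworldmap part (the file and column names are assumptions about the Kaggle CWUR csv layout):

library(rworldmap)

cwur   <- read.csv("cwurData.csv")                       # assumed file name
top100 <- subset(cwur, world_rank <= 100 & year == 2015)

# count universities per country and join the counts to the world map
counts <- as.data.frame(table(top100$country))
names(counts) <- c("country", "universities")

map_data <- joinCountryData2Map(counts, joinCode = "NAME",
                                nameJoinColumn = "country")
mapCountryData(map_data, nameColumnToPlot = "universities",
               mapTitle = "Universities in the CWUR top 100")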

My latest script for this analysis can be found on Kaggle.

Read from txt file in Praat – “File not recognized” error

Praat is a great tool for analysing speech data, but lately I came across a frustrating problem. While trying to open a txt file (a vector of numbers) in Praat, I would get the following error message:
File not recognized. File not finished.
Praat-read-from-file-error
After consulting my fellow PhD students I discovered that what I was missing was a header enabling Praat to read txt files.
The simplest way to fix this error is to add the following header to a text file using your favourite text editor:

However, if you want to automate the process then scripting can save you a lot of time. That’s why I created a function (txt2praat.R) that appends this header to the original text file and saves the output to a new text file.
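A minimal sketch of what such a function might look like. The exact header lines depend on the Praat object type you want to create, so the header vector below is a placeholder, not the header from the original script:

txt2praat <- function(infile, outname) {
  # placeholder header; replace with the actual Praat object header you need
  header  <- c('File type = "ooTextFile"', 'Object class = "Matrix"')
  values  <- readLines(infile)
  outfile <- paste0(outname, ".txt")
  writeLines(c(header, values), outfile)
  invisible(outfile)
}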

You can use the function in the following way:
txtfile <- file.choose()
txt2praat(txtfile, "testfile-modified")

These commands should create a txt file (testfile-modified) with the short header prepended. The new file can then be opened in Praat without the error message.
Praat-testfile-modified

Polling data for 2015 Polish parliamentary election

I’m back to analysing political data after finding a nicely formatted data set on one of my favourite blogs. The blog post that inspired me discussed the possibility of predicting election results using polls and popularity data found online. In brief, the answer is: not yet. However, with the increasing number of people using digital media and responding to opinion polls, these channels will have a greater impact on future political campaigns.
I haven’t used the actual election results in this analysis; I only used the variables that came with the compiled data set. The variables in question are: Google Trends popularity, Social Media popularity, and Opinion Polls. More details about the data can be found here (text in Polish).

After loading the data, I used the missmap function to examine the missing values. It seems that there are quite a few gaps in the data about the polls, social media, and Google Trends (in decreasing order).
missmap_PL_elections_2015
To get an overview I used tableplot from the tabplot package.
tableplot_PL_elections_2015
The next step was plotting time series of the individual variables.
social_media_PL_elections_2015

poll_PL_elections_2015

google_trends_PL_elections_2015
The plots above show that the overall Social Media and Google Trends activity (dark blue lines) increased closer to election day. The averaged rating of all parties in the polls (dark blue line) seemed fairly stable. This is probably not the most interesting finding, so splitting the values by party/candidate would be a good next step.

Autocorrelation was computed on the cleaned data frame (rows with NAs removed) to show how each variable correlates with itself over time.
autocorrelation_PL_elections_2015

And here’s the code:
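A rough sketch of the steps described above (the file name and column names are assumptions about the compiled data set):

library(Amelia)    # missmap()
library(tabplot)   # tableplot()

elections <- read.csv("PL_elections_2015.csv")   # hypothetical file name

missmap(elections)      # overview of missing values
tableplot(elections)    # column-wise overview of the data set

# autocorrelation on the cleaned data (rows with NAs removed)
clean <- na.omit(elections)
acf(clean$poll, main = "Autocorrelation of poll ratings")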

Using R to query Google Trends and Ngrams

Recently I’ve been playing with the idea of comparing the popularity of various people and ideas. I’ve previously queried Wikipedia page views using R, but I wondered whether the same could be done with Google Trends or Google Ngrams. Both of these Google services provide interesting insights into the relative popularity of various queries. Luckily for me, other people have already created excellent connections between R and both Google Trends and Google Ngrams.

One of the topics that interested me as an experimental psychologist was the changing popularity of two psychoanalysts, Sigmund Freud and Carl Jung. Knowing that psychology is becoming more empirical, I expected these two gentlemen to start losing their stardom as time goes by.
Trends extracted from Google Ngrams show the peak popularity for both psychoanalysts around 1995. The relative frequency of occurrence of their names appears to have declined since then.

Google Ngram – Freud vs. Jung

However, the last year recorded in the Ngram data was 2008, so things could have changed since then. To answer this question I queried Google Trends, which shows the relative frequency of Google search terms. I didn’t set the locale in the function, so I assume the results are for global searches (but I used the English spelling of the names).
The results from Google Trends support the Ngram results: a decline in the popularity of both Freud and Jung can be observed with this measure too.
Google Trends – Freud vs. Jung

This was just a brief write-up of my analysis, so feel free to modify my code:
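A minimal sketch using the ngramr and gtrendsR packages (these are my assumptions about the tools used; the gtrendsR interface in particular has changed over time, and older versions required a Google login):

library(ngramr)
library(gtrendsR)
library(ggplot2)

# Google Ngrams: relative frequency of the two names in the English corpus
ng <- ngram(c("Sigmund Freud", "Carl Jung"), year_start = 1900)
ggplot(ng, aes(x = Year, y = Frequency, colour = Phrase)) + geom_line()

# Google Trends: relative search interest (worldwide by default)
trends <- gtrends(c("Sigmund Freud", "Carl Jung"))
plot(trends)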

Analysing Quandl FX data in R – plotting, decomposing time series, and detecting anomalies

Quandl is a great provider of various types of data that can be easily integrated with R. I wanted to play with the available data and see what kind of insights can be gathered using R.

I wrote a short script that collects British Pound Sterling (GBP) to Polish Złoty (PLN) historic exchange rates from 1996 till today (4th September 2015).

The first thing I did was load the libraries that will be used:
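Something along these lines (a sketch of the library list, matching the packages used later in the post):

library(Quandl)
library(lubridate)
library(ggplot2)
library(dygraphs)
library(xts)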

Then I downloaded the exchange rate data from Quandl. Data can be downloaded in several popular formats (e.g. ts, xts, or zoo), each appropriate for different packages.
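A sketch of the download step; the CURRFX/GBPPLN dataset code is my assumption about the source used here:

# xts for dygraphs, raw data frame for ggplot2
gbp_pln_xts <- Quandl("CURRFX/GBPPLN", type = "xts", start_date = "1996-01-01")
gbp_pln_df  <- Quandl("CURRFX/GBPPLN", type = "raw", start_date = "1996-01-01")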

Once I had collected my data, I started cleaning the data frame downloaded from Quandl.
The column names designating the High and Low rates had whitespace in them, which could cause all sorts of problems in R. I removed ‘High_est’ and ‘Low_est’ from the xts data to make plotting with dygraphs easier. Thanks to the lubridate package I easily extracted years, months, and days from the Date field. I also added a ‘Volatility’ column showing the difference between the High and Low rates.
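A rough sketch of this cleaning step (the CURRFX column layout and the object names carry over from the previous snippet and are assumptions):

# rename the columns that contain spaces
names(gbp_pln_df) <- c("Date", "Rate", "High_est", "Low_est")

# extract date parts with lubridate and compute daily volatility
gbp_pln_df$Year       <- year(gbp_pln_df$Date)
gbp_pln_df$Month      <- month(gbp_pln_df$Date, label = TRUE)
gbp_pln_df$Day        <- day(gbp_pln_df$Date)
gbp_pln_df$Volatility <- gbp_pln_df$High_est - gbp_pln_df$Low_est

# keep only the Rate column in the xts object to simplify the dygraphs plot
gbp_pln_xts <- gbp_pln_xts[, "Rate"]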

Now I had all my data cleaned, so I could start plotting. I started by shamelessly copying the code for the Reuters-like plot included in Quandl’s cheat sheet.

GBP/PLN Historic Exchange Rates

All plots were created using the following code:
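A sketch of the ggplot2 code behind the density plots, histograms, and box plots (using the cleaned data frame from the previous snippet):

# density of exchange rates by year
ggplot(gbp_pln_df, aes(x = Rate, fill = factor(Year))) +
  geom_density(alpha = 0.4) + labs(fill = "Year")

# histograms faceted by year
ggplot(gbp_pln_df, aes(x = Rate)) +
  geom_histogram(binwidth = 0.05) + facet_wrap(~ Year)

# box plots by year
ggplot(gbp_pln_df, aes(x = factor(Year), y = Rate)) +
  geom_boxplot() + labs(x = "Year")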

The results can be seen underneath.

Density plots:
gbp.pln.fx.rate.density.by.year

gbp.pln.fx.rate.density.by.month

Histograms:
gbp.pln.fx.rate.histogram.by.year

gbp.pln.fx.rate.histogram.by.month

The volatility histogram clearly shows that there are many fields with missing values.

GBP/PLN Volatility

Comparing the blank fields with those containing non-zero values showed that their counts are almost the same. This means that Volatility should not be used, as it has a high proportion of missing values.

missing_volatility_values

The box plots show how the GBP/PLN exchange rates fluctuate across different years and months. They also include the mean values and the variation in the exchange rates.
gbp.pln.fx.rate.boxplots.by.year

gbp.pln.fx.rate.boxplots.by.month

Last, but not least, is the interactive plot created with the dygraphs package. That’s definitely my favourite, as it allows fine-grained exploration of the underlying information (including date range selection).


I added event lines for dates that seemed relevant to the observed maximum and minimum values of the GBP/PLN exchange rate. Around the time when Poland joined the EU, the GBP/PLN exchange rate peaked at 7.3 PLN. Four years later, just before Lehman Brothers went bankrupt, Pound Sterling was valued as low as 4.07 PLN.

The interactive plot was created using the following code:
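A sketch with dygraphs, including a range selector and event lines for the two dates mentioned above (the exact dates are approximate):

dygraph(gbp_pln_xts, main = "GBP/PLN exchange rate") %>%
  dyRangeSelector() %>%
  dyEvent("2004-05-01", "Poland joins the EU", labelLoc = "bottom") %>%
  dyEvent("2008-09-15", "Lehman Brothers bankruptcy", labelLoc = "bottom")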

I thought I had had enough simple plots and that it was time for more complicated analyses. I started with Seasonal Decomposition of Time Series by Loess, using the stl function.
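A sketch of that step: stl() needs a series with a seasonal frequency, so the daily rates are first aggregated to monthly means (an assumption about how the decomposition was set up):

# monthly mean rates as a ts object with yearly seasonality
monthly_rate <- apply.monthly(gbp_pln_xts, mean)
rate_ts <- ts(as.numeric(monthly_rate), start = c(1996, 1), frequency = 12)

plot(stl(rate_ts, s.window = "periodic"))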

GBP/PLN Seasonal Decomposition of the Historic Exchange Rates

Seasonal, trend, and irregular components were extracted from the underlying data. The main trend shows GBP appreciating against PLN, but the seasonal component might (tentatively) be indicating an approaching correction of that trend.

I wanted to finish the analysis by applying the AnomalyDetection package, released by Twitter’s engineering team, to the data collected from Quandl. My plan was to see whether there were any anomalous changes in the exchange rate during the recorded period.
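A sketch of that step using AnomalyDetectionVec(), which works on a plain numeric vector with a user-supplied period (the period value here is an assumption):

library(AnomalyDetection)

res <- AnomalyDetectionVec(gbp_pln_df$Rate, max_anoms = 0.01,
                           period = 30, direction = "both", plot = TRUE)
res$plot
res$anoms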

Running this code returned one anomalous value, 7.328 PLN, which was the maximum value observed in the collected time series.
gbp_pln_fx_anomaly_detection

I hope you enjoyed this walkthrough. In future posts I want to describe more ways to analyse time series.

Automatic pitch extraction from speech recordings

I needed to extract mean pitch values from audio recordings of human speech, but I wanted to automate the process and be able to recreate my analyses easily, so I wrote a couple of scripts that do it much faster.

Here is a recipe for extracting pitch from voice recordings.

 

  • Cleaning audio files

My audio files were stereo recordings of a participant saying /a/ while hearing (near) real-time pitch shifts in their own productions. The left channel contains the shifted pitch (heard by participants) and the right channel contains the original speech productions.

The first step is to examine the audio recordings for any non-speech sounds. I used Audacity for that. Any grunts or sighs can mess up the outcome of the scripts used in the analysis. Irrelevant parts of the audio track can be silenced (CTRL+L in Audacity). Once the audio track is cleaned, I split the channels and save them as separate wav files.

audio

Acoustic signal used in the analysis. The highlighted part shows noise that should be removed.

  • Splitting continuous recordings using SFS

My pitch-extraction script expects each utterance to be saved in a separate wav file, so I need to split the continuous recordings. This could be done manually, but for longer recordings it’s cumbersome. Speech Filing System (SFS) has an option for splitting continuous files on silence.

Manual:

1. Load a sound file

load

2. Create multiple annotations

Tools > Speech > Annotate > Find multiple endpoints

annotations

Specify the values of npoint. More information can be found here. You don’t need to know the exact number of utterances, but a close approximation should work.

 

Visualise the results of automatic annotation:

display

Check if the annotations are correct. If not, then tweak the npoint settings to get the effect you need.

 

3. Chop the files on annotations

Tools > Speech > Export > Chop signal into annotated regions

This will save the files in the sfs format, but PraatR can’t work with these files, so they need to be converted to wav.

 

4. Convert sfs into wav files

Load the files you want to convert, highlight them, and go to:

File > Export > Speech

 

Automatic:

If you don’t want to spend hours doing what I’ve just described, a simpler solution is to use a program that runs all the commands for you.

Use the batch script that follows the steps described above (plus some extras).

 

  • Extracting mean pitch using PraatR

Pitch could be extracted manually in Praat by going to

View & Edit > Pitch > Get pitch

but doing this for many files would take a lot of time and would be error-prone.

praat_get_pitch

Luckily, there is a connection between Praat and R (PraatR) which can speed up this task.

I extracted the mean pitch and duration of each file. The latter can be used to reject any non-speech files. Here’s the script:
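A sketch of what the PraatR loop might look like. PraatR needs absolute file paths, and the folder and output file names below are assumptions:

library(PraatR)

wav_files <- list.files("C:/data/pitch", pattern = "\\.wav$", full.names = TRUE)
results <- data.frame(file = basename(wav_files), duration = NA, mean_pitch = NA)

for (i in seq_along(wav_files)) {
  # duration of the sound (used later to reject non-speech files)
  results$duration[i] <- as.numeric(
    praat("Get total duration", input = wav_files[i], simplify = TRUE))

  # create a Pitch object, then query its mean in Hertz
  pitch_file <- paste0(wav_files[i], ".Pitch")
  praat("To Pitch...", arguments = list(0, 75, 600),
        input = wav_files[i], output = pitch_file, overwrite = TRUE)
  results$mean_pitch[i] <- as.numeric(
    praat("Get mean...", arguments = list(0, 0, "Hertz"),
          input = pitch_file, simplify = TRUE))
}

write.csv(results, "mean_pitch.csv", row.names = FALSE)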

Now you should get a nicely formatted csv file.

I hope this will save you a lot of time.

 

Butterworth Filter Demo in Shiny

I am using EEGLab to process my electroencephalographic data (i.e. the brain’s electrical activity), but I wanted an interactive visualisation showing how different filter settings change my data. I prefer R to Matlab, so I decided to create a Shiny app that does just that.

I tried to filter the brainstem’s activity during several speech conditions using a Butterworth band-pass filter to get rid of artefacts.

I wrote a butterHz function, which is based on butter_filtfilt.m from the EEGLab Matlab package and uses the butter function from the signal R package.
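A sketch of what such a function might look like: a zero-phase band-pass Butterworth filter built from signal::butter() and applied with filtfilt(), mirroring the logic of butter_filtfilt.m (the argument names are assumptions):

library(signal)

butterHz <- function(x, low, high, fs, order = 4) {
  nyquist <- fs / 2
  bf <- butter(order, c(low, high) / nyquist, type = "pass")
  filtfilt(bf, x)   # forward-backward filtering, so no phase shift
}

# e.g. keep 80-1000 Hz of a signal sampled at 16 kHz
# filtered <- butterHz(waveform, low = 80, high = 1000, fs = 16000)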

Here I used a time-domain waveform of speech-evoked Auditory Brainstem Responses to demonstrate the use of the Butterworth filter.


The code is available on GitHub.

Foreign crime victims in Poland

Recently I read an article (PL) about Polish police massaging statistics. It made me wonder what kind of data is available on their website and whether any interesting patterns could be observed.

The website offers some data, but it is badly formatted, not very recent, and can only be downloaded as a PDF :O

I didn’t feel like scraping the page so I manually copied and pasted the data from the website and initially preprocessed it in Excel by extracting the numbers following the backslash.

table_foreign_crime_victims_in_poland_2004-2012

I decided to focus on the ‘Foreign – Crime’ dataset. Surprisingly enough, both crime perpetrators and victims are lumped together in one table, separated by a backslash. As if that wasn’t bad enough, someone decided to split the table in two, each part with a different number of rows and some missing values (marked as ‘bd’). Victims/suspects from countries not listed in the table were aggregated in the total values (PL: ‘RAZEM’). I intentionally omitted these totals from my analyses.

The original(-ish) data was in wide format, but I needed it in long format. I used tidyr for that:
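A minimal sketch of the wide-to-long step (the object and column names are assumptions about the manually pasted table):

library(tidyr)

# one row per country and year, with the victim count in its own column
crime_long <- gather(crime_wide, key = "Year", value = "Victims", `2004`:`2012`)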

Then I created a heatmap using ggplot2 and RColorBrewer:
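Roughly like this, continuing from the long data frame above (column names again assumed):

library(ggplot2)
library(RColorBrewer)

ggplot(crime_long, aes(x = Year, y = Country, fill = Victims)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = brewer.pal(9, "Reds"), na.value = "grey90")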

The result was this heatmap:

heatmap_foreign_crime_victims_in_poland

Now it’s pretty obvious which country’s citizens were the most common crime victims in Poland, at least if you focus on the raw numbers registered by the police. This dataset doesn’t include any information about the number of visitors from other countries, so it’s hard to answer the question of how likely a foreigner is to become a crime victim in Poland.

I wanted some interactivity and I didn’t have much time, so I made a dashboard in Tableau:


It’s a much faster way to create static or interactive plots, but they are harder to reproduce than plots made in R.

Popularity of UK political parties on Wikipedia

The 2015 General Election is coming, so I decided to compare the popularity of the main UK political parties on Wikipedia.

I gathered the data about Wikipedia page views using the wikipediatrend R package. The plots were made with dygraphs and ggplot2. Wikipedia article traffic was collected from 1st March 2015 to 2nd May 2015, for the English-language version of the site only.
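A sketch of the data-gathering step; the wp_trend() interface has changed across wikipediatrend versions, so the call below (with a lang argument and date/count columns) reflects the older interface available around the time of this post and is an assumption:

library(wikipediatrend)
library(ggplot2)

labour <- wp_trend("Labour_Party_(UK)", from = "2015-03-01",
                   to = "2015-05-02", lang = "en")

ggplot(labour, aes(x = date, y = count)) + geom_line()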


The Wikipedia interest in Labour, the Conservatives, and UKIP is closely aligned. Another tier consists of the Liberal Democrats and the SNP, which seem to be getting a similar number of page views. The Greens have the lowest total volume of the main parties, with a marked peak on 1st April.

All parties experienced a boost in the traffic volume around the time of major TV debates (2nd, 16th, and 30th April 2015).

Page views of articles about political parties decline at weekends. The pattern is fairly consistent and can be observed for all parties.
Here is how it looks for Labour:

Page views of Labour Party (UK) on English-language Wikipedia