Configuring Tableau Web Data Connectors on Windows 10 – using Google Sheets

Tableau Desktop 9.1 is out, and Web Data Connectors are available as a new data source in this version. Luckily, Tableau has released several working connectors to popular web data sources, and Google Sheets is one of them. Today I tried to connect to Google Sheets, but I couldn’t find a step-by-step description of setting it up, and the official thread lacked details about configuring a WDC. It took me a while to figure out how to start using Web Data Connectors, so I decided to write this tutorial, which will hopefully help others get started. This tutorial describes how to use Tableau Web Data Connectors hosted locally.

Before you start using Web Data Connectors you need to enable Internet Information Services (IIS) (more info is available in this Stackoverflow thread).
This can be done on Windows 10 by pressing the Windows key and typing Windows Features.
Then go to Turn Windows Features On or Off and tick the box next to Internet Information Services.
IIS
Now you should be able to see IIS in Control Panel > Administrative Tools.
ISS_port_80
Open IIS Manager and check the values in the Binding and Path columns. Binding should be set to *:8888 (http), and Path points to the folder where the Tableau connector files should be stored.
If Binding is set to a value other than 8888 (mine was set to 80 by default), then go to Actions and click on Bindings.
bindings
Then change the Port number to 8888 and click OK.
edit_site_bindings
Now your IIS Manager configuration should look like this
ISS_configuration
Check the value of the Path column. It’s set to %SystemDrive%\inetpub\wwwroot by default, which in most cases means C:\inetpub\wwwroot.

In order to start using Google Sheets as a data source you need to start with downloading the Web Data Connector SDK.
Unzip the zip file and go to the Examples folder.
Copy all files from the Examples folder to the Path specified in IIS. In my case it was C:\inetpub\wwwroot

Now you should be ready to start using Web Data Connectors. Go to Connect > To a server > Web Data Connector.
A pop-up window like the one below should appear:
tableau-web-data-connector

If you copied the example files provided by Tableau to your IIS path then type the following into the address bar of the WDC pop-up window:
http://localhost:8888/GoogleSheetsConnector.html
You should see the following window
googlesheet_connector
All you need to do now is provide a link to the Google Sheet you want to use. You might also need to log into your Google Account if the Sheet you provide is private.
Now you can start using Google Sheets as a data source. Hooray!

If you need to use other connectors, just change the path in the address bar of the pop-up window.

Analysing Quandl FX data in R – plotting, decomposing time series, and detecting anomalies

Quandl is a great provider of various types of data that can be easily integrated with R. I wanted to play with the available data and see what kind of insights can be gathered using R.

I wrote a short script that collects British Pound Sterling (GBP) to Polish Złoty (PLN) historic exchange rates from 1996 till today (4th September 2015).

The first thing I did was load the libraries that will be used:
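(The original snippet is no longer embedded in this version of the post; based on the packages used throughout the analysis, it was roughly the following.)

# Packages used throughout this analysis (approximate list)
library(Quandl)            # downloading the FX data
library(xts)               # time-series objects
library(lubridate)         # extracting years/months/days from dates
library(ggplot2)           # density plots, histograms, box plots
library(dygraphs)          # interactive time-series plot
library(AnomalyDetection)  # Twitter's anomaly detection package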

Then I downloaded the exchange rate data from Quandl. Data can be downloaded in several popular formats (e.g. ts, xts, or zoo), each appropriate for different packages.
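(The download snippet itself is missing here; a sketch of it is below. The dataset code is a guess on my part.)

# Daily GBP/PLN rates; an API key may be needed for heavier use (see the Quandl package docs)
gbp_pln_xts <- Quandl("CURRFX/GBPPLN", type = "xts",
                      start_date = "1996-01-01", end_date = "2015-09-04")
gbp_pln_df  <- Quandl("CURRFX/GBPPLN", type = "raw",
                      start_date = "1996-01-01", end_date = "2015-09-04")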

Once I had collected my data, I started cleaning the data frame downloaded from Quandl.
The column names designating the High and Low rates had whitespace in them, which could cause all sorts of problems in R. I removed ‘High_est’ and ‘Low_est’ from the xts data to make plotting with dygraphs easier. Thanks to the lubridate package I easily extracted years, months, and days from the Date field. I also added a ‘Volatility’ column showing the difference between the High and Low rates.
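(The cleaning code is also missing from this version; it would have looked something like this, with the column names assumed from the Quandl output.)

# Rename the awkward "High (est)" / "Low (est)" columns
names(gbp_pln_df) <- c("Date", "Rate", "High_est", "Low_est")

# Keep only the Rate column in the xts object, to make plotting with dygraphs easier
gbp_pln_xts <- gbp_pln_xts[, "Rate"]

# Extract date components with lubridate and add a Volatility column
gbp_pln_df$Year       <- year(gbp_pln_df$Date)
gbp_pln_df$Month      <- month(gbp_pln_df$Date, label = TRUE)
gbp_pln_df$Day        <- day(gbp_pln_df$Date)
gbp_pln_df$Volatility <- gbp_pln_df$High_est - gbp_pln_df$Low_est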

Now I had all my data cleaned, so I could start plotting. I started by shamelessly copying the code for the Reuters-like plot included in Quandl’s cheat sheet.

GBP/PLN Historic Exchange Rates

All plots were created using the following code:
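(The embedded snippet is gone from this version of the post, but the by-year plots below can be reproduced with ggplot2 roughly as follows; the by-month versions just swap Year for Month.)

# Density of the daily rate, one curve per year
ggplot(gbp_pln_df, aes(x = Rate, colour = factor(Year))) +
  geom_density() +
  labs(title = "GBP/PLN exchange rate density by year", colour = "Year")

# Histogram of the daily rate, facetted by year
ggplot(gbp_pln_df, aes(x = Rate)) +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(~ Year) +
  labs(title = "GBP/PLN exchange rate histogram by year")

# Box plots of the daily rate by year
ggplot(gbp_pln_df, aes(x = factor(Year), y = Rate)) +
  geom_boxplot() +
  labs(title = "GBP/PLN exchange rate by year", x = "Year")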

The results can be seen underneath.

Density plots:
gbp.pln.fx.rate.density.by.year

gbp.pln.fx.rate.density.by.month

Histograms:
gbp.pln.fx.rate.histogram.by.year

gbp.pln.fx.rate.histogram.by.month

The Volatility histogram clearly shows that there are many fields with missing values.

GBP/PLN Volatility

Comparing the blank fields with those containing non-zero values showed that the numbers of blank and non-blank entries are almost the same. This means that Volatility should not be used, as it has too many missing values.

missing_volatility_values

The box plots show how the GBP/PLN exchange rates fluctuate across different years and months. They also include the mean values and the variation in the exchange rates.
gbp.pln.fx.rate.boxplots.by.year

gbp.pln.fx.rate.boxplots.by.month

Last, but not least, was the interactive plot created with the dygraphs package. That’s definitely my favourite, as it allows fine-grained analysis of the underlying information (including date range selection).


I added event lines that seemed relevant to the observed maximum and minimum values of the GBP/PLN exchange rate. Around the time when Poland joined the EU, the GBP/PLN exchange rate peaked at 7.3 PLN. Four years later, just before Lehman Brothers went bankrupt, Pound Sterling was valued as low as 4.07 PLN.

The interactive plot was created using the following code:
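(A reconstruction – the exact snippet is no longer embedded here; the event dates correspond to the two events mentioned above.)

dygraph(gbp_pln_xts, main = "GBP/PLN Historic Exchange Rates") %>%
  dyRangeSelector() %>%
  dyEvent("2004-05-01", "Poland joins the EU", labelLoc = "bottom") %>%
  dyEvent("2008-09-15", "Lehman Brothers bankruptcy", labelLoc = "bottom")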

I decided I’d had enough of simple plots and that it was time for more complicated analyses. I started with Seasonal Decomposition of Time Series by Loess, using the stl function.
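(The decomposition snippet is not embedded any more either; stl() needs a regular, seasonal ts object, so one way to get there is to pull a monthly series straight from Quandl.)

# Monthly GBP/PLN series (dataset code assumed, as above)
gbp_pln_monthly <- Quandl("CURRFX/GBPPLN", type = "ts", collapse = "monthly",
                          start_date = "1996-01-01", end_date = "2015-09-04")

# Seasonal Decomposition of Time Series by Loess
plot(stl(gbp_pln_monthly[, "Rate"], s.window = "periodic"))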

GBP/PLN Seasonal Decomposition of the Historic Exchange Rates

Seasonal, trend, and irregular components were extracted from the underlying data. The main trend shows the appreciation of GBP against PLN, but the seasonal component might (?) be indicating an approaching correction of that trend.

I wanted to finish the analysis by applying the AnomalyDetection package, released by Twitter’s engineering team, to the data collected from Quandl. My plan was to see whether there were any anomalous changes in the exchange rate during the recorded period.
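(The original snippet is no longer embedded; a sketch of it follows – the max_anoms, direction, and longterm settings are my guesses, not necessarily the original ones.)

# AnomalyDetectionTs() expects a two-column data frame: POSIXct timestamp + value
fx_for_anoms <- data.frame(timestamp = as.POSIXct(gbp_pln_df$Date),
                           rate      = gbp_pln_df$Rate)

fx_anoms <- AnomalyDetectionTs(fx_for_anoms, max_anoms = 0.01,
                               direction = "both", longterm = TRUE, plot = TRUE)
fx_anoms$anoms   # the flagged observations
fx_anoms$plot    # plot with anomalies highlighted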

Running this code flagged a single anomalous value, 7.328 PLN, which was the maximum value observed in the collected time series.
gbp_pln_fx_anomaly_detection

I hope that you enjoyed this walkthrough. In future posts I want to describe more ways to analyse time series.

Article Processing Charges paid by UCL (2013-2014)

I found an interesting data set on FigShare showing how much University College London paid in Article Processing Charges (APCs). According to the description of this data set, it should include items between 1st April 2013 and 31st July 2014. However, there are some entries dated earlier and later than that, so I guess that only the period mentioned above is complete. I included all the data anyway, and then added a linear regression line on top of the plot showing total cost by date. Judging by that, it seems like the costs keep increasing. However, it is only a short period of time, and I don’t know whether this trend held over a longer period.
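(The plots themselves are embedded interactively; as a rough illustration of the total-cost-by-date trend line, here is a sketch in R with hypothetical file and column names.)

apc <- read.csv("ucl_apcs_2013_2014.csv")   # hypothetical file name
apc$Date <- as.Date(apc$Date)

# Total APC cost per date, with a linear regression line on top
daily_cost <- aggregate(Cost ~ Date, data = apc, FUN = sum)
library(ggplot2)
ggplot(daily_cost, aes(x = Date, y = Cost)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(title = "Total APC cost paid by UCL, by date", y = "Total cost (GBP)")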
Large journals (e.g. Nature Communications, PLOS One, and The Lancet) stand out as receiving more in APCs than others. The vast majority of articles in this data set have been published under a CC BY licence.
Feel free to further explore the plots. They are all interactive.


Research Councils UK Diversity Data

Lately I came across a data set published by Research Councils UK (RCUK) about the diversity of grant applicants. The document didn’t include any plots, so I decided to fix that. Unfortunately, the report was only available in PDF format, which is not ideal for extracting data.
Luckily, I came across Tabula, an open-source project for extracting data from tables in PDF files. With the help of Tabula I extracted the data I needed to create the plots. The formatting of the original document required too much data wrangling, so I only used data from the first table.
This data shows an estimate of the academic populations applying to the Research Councils.
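(Tabula exports the selected table as a CSV; the plotting itself is then routine. A sketch, with entirely hypothetical file and column names:)

rcuk <- read.csv("rcuk_table1_tabula.csv")   # hypothetical Tabula export

library(ggplot2)
ggplot(rcuk, aes(x = ResearchCouncil, y = Applicants, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Estimated academic population applying to the Research Councils")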

Higher Education Salaries in the UK

Times Higher Education published its 2015 pay survey in April, but only now have I found the time to create interactive visualisations of the data.

I had to edit the table to make it usable in Tableau. I removed the ‘All’ column from the original data source; this column most likely contained weighted means, but I couldn’t find any information about how it was calculated, so I only used the ‘Female’ and ‘Male’ values. The combined (‘All’) values used in my workbooks are simple means of these two fields.
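(The same calculation expressed in R, in case it’s useful – the column names follow the description above, the file name is hypothetical.)

salaries <- read.csv("the_2015_pay_survey.csv")
salaries$All <- NULL                                         # drop the original weighted 'All' column
salaries$Combined <- (salaries$Female + salaries$Male) / 2   # simple, unweighted mean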

I used Edubase data to find additional information about universities, e.g. region or location (available only for HE institutions in England).

Speech SENCOs in England

The Department for Education regularly publishes a list of educational establishments in England and Wales. I was interested in seeing which areas have the largest number of schools with Speech Special Educational Needs Coordinators (SENCOs), and what the characteristics of those schools are. To find out, I created a visualisation in Tableau.

The first tab (‘Map’) shows a map of educational establishments with or without Speech SENCOs, the type of establishment, the number of establishments, and the total number of pupils.
The second tab (‘Pupils by Region and FSM’) shows a fine-grained description of pupils in different types of educational establishments.



Automatic pitch extraction from speech recordings

I needed to extract mean pitch values from audio recordings of human speech, but I wanted to automate the process and easily recreate my analyses, so I wrote a couple of scripts that do it much faster.

Here is a recipe for extracting pitch from voice recordings.

 

  • Cleaning audio files

My audio files were stereo recordings of a participant saying /a/ while hearing (near) real-time pitch shifts of their own productions. The left channel contains the shifted pitch (heard by the participant) and the right channel contains the original speech productions.

The first step is to examine the audio recordings for any non-speech sounds; I used Audacity for that. Any grunts or sighs can mess up the outcome of the scripts used in the analysis. Irrelevant parts of the audio track can be silenced (CTRL+L in Audacity). Once the audio track is cleaned, I split the channels and save them as separate wav files.

audio

The acoustic signal used in the analysis. The highlighted part shows noise that should be removed.

  • Splitting continuous recordings using SFS

My pitch-extraction script expects each utterance to be saved in a separate wav file, so I need to split the continuous recordings. This could be done manually, but for longer recordings it’s cumbersome. Speech Filing System (SFS) has an option that splits continuous files on silence.

Manual:

1. Load a sound file

load

2. Create multiple annotations

Tools > Speech > Annotate > Find multiple endpoints

annotations

Specify the values of npoint. More information can be found here. You don’t need to know the exact number of utterances, but a close approximation should work.

 

Visualise the results of automatic annotation:

display

Check if the annotations are correct. If not, then tweak the npoint settings to get the effect you need.

 

3. Chop the files on annotations

Tools > Speech > Export > Chop signal into annotated regions

This will save the files in the sfs format, but PraatR can’t work with these files; they need to be converted to wav.

 

4. Convert sfs into wav files

Load the files you want to convert, highlight them, and go to:

File > Export > Speech

 

Automatic:

If you don’t want to spend hours doing what I’ve just described, a simpler solution is to use a program that runs all of the commands described above for you.

Use the batch script that follows the steps described above (plus some extras).

 

  • Extracting mean pitch using PraatR

Pitch could be extracted manually in Praat by going to

View & Edit > Pitch > Get pitch

but doing this for many files would take a lot of time and would be error-prone.

praat_get_pitch

Luckily, there is a connection between Praat and R (PraatR) which can speed up this task.

I extracted the mean pitch and duration of each file; the latter can be used to reject any non-speech files. Here’s the script:
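(The embedded script is gone from this version of the post; a minimal PraatR sketch along the same lines is below. The folder, pitch floor/ceiling, and output file name are my assumptions, and the argument lists may need adjusting for your Praat version.)

library(PraatR)

wav_dir   <- "C:/data/utterances"   # hypothetical folder with the chopped wav files
wav_files <- list.files(wav_dir, pattern = "\\.wav$", full.names = TRUE)

results <- data.frame(file = basename(wav_files), duration = NA, mean_pitch = NA)

for (i in seq_along(wav_files)) {
  wav   <- wav_files[i]
  pitch <- sub("\\.wav$", ".Pitch", wav)

  # Total duration of the sound, in seconds (useful for rejecting non-speech files)
  results$duration[i] <- as.numeric(praat("Get total duration", input = wav, simplify = TRUE))

  # Create a Pitch object (time step = auto, floor = 75 Hz, ceiling = 600 Hz -- assumed values)
  praat("To Pitch...", arguments = list(0.0, 75, 600),
        input = wav, output = pitch, overwrite = TRUE)

  # Mean pitch over the whole file, in Hertz
  results$mean_pitch[i] <- as.numeric(praat("Get mean...", arguments = list(0, 0, "Hertz"),
                                            input = pitch, simplify = TRUE))
}

write.csv(results, "mean_pitch.csv", row.names = FALSE)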

Now you should get a nicely formatted csv file.

I hope this will save you a lot of time.

 

Butterworth Filter Demo in Shiny

I am using EEGLab to process my electroencephalographic data (i.e. the brain’s electrical activity), but I wanted an interactive visualisation showing how different filter settings change my data. I prefer R to Matlab, so I decided to create a Shiny app that does just that.

I tried to filter the brainstem’s activity during several speech conditions using a Butterworth band-pass filter to get rid of artefacts.

I wrote a butterHz function which is based on butter_filtfilt.m from the EEGLab Matlab package and uses the butter function from the signal R package.
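(The actual butterHz lives in the GitHub repo linked below; this is just a sketch of the idea – cut-offs specified in Hz are normalised to the Nyquist frequency before calling butter, then applied with zero-phase filtfilt, as in butter_filtfilt.m. The default order and sampling rate here are placeholders.)

library(signal)

butterHz <- function(x, low, high, order = 4, srate = 1000) {
  nyquist <- srate / 2
  bf <- butter(order, c(low, high) / nyquist, type = "pass")  # band-pass, normalised cut-offs
  filtfilt(bf, x)                                             # forward-backward (zero-phase) filtering
}

# e.g. keep roughly 80-1000 Hz of a response sampled at 16 kHz (illustrative values only)
# filtered <- butterHz(waveform, low = 80, high = 1000, srate = 16000)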

Here I used a time-domain waveform of speech-evoked Auditory Brainstem Responses to demonstrate the use of the Butterworth filter.


The code is available on GitHub.

Foreign crime victims in Poland

Recently I read an article (PL) about the Polish police massaging their statistics. It made me wonder what kind of data is available on their website and whether any interesting patterns could be observed.

The website offers some data, but it is badly formatted, not very recent, and can only be downloaded as a PDF :O

I didn’t feel like scraping the page, so I manually copied and pasted the data from the website and did some initial preprocessing in Excel, extracting the numbers following the backslash.

table_foreign_crime_victims_in_poland_2004-2012

I decided to focus on the ‘Foreign – Crime’ dataset. Surprisingly enough, both crime perpetrators and victims are lumped together in one table, separated by a backslash. As if that wasn’t enough bad formatting, someone decided to split the table in two, each part with a different number of rows and some missing values (marked as ‘bd’). Victims/suspects from countries not listed in the table were aggregated into the total values (Polish: ‘RAZEM’, i.e. total). I intentionally omitted these totals from my analyses.

The original(-ish) data was in wide format, but I needed to turn it into long format. I used tidyr for that:
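(The embedded snippet is missing from this version of the post; the reshaping would have looked roughly like this, assuming one row per country and one column per year.)

library(tidyr)

# Wide (Country, 2004, 2005, ...) -> long (Country, Year, Victims)
victims_long <- gather(victims_wide, Year, Victims, -Country)
victims_long$Year <- as.integer(sub("^X", "", victims_long$Year))  # strip the "X" R prepends to numeric column names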

Then I created a heatmap using ggplot2 and RColorBrewer:
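(Again, roughly – the exact snippet is not embedded any more.)

library(ggplot2)
library(RColorBrewer)

ggplot(victims_long, aes(x = factor(Year), y = Country, fill = Victims)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = brewer.pal(9, "Reds"), na.value = "grey90") +
  labs(x = "Year", y = NULL, fill = "Victims")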

The result was this heatmap:

heatmap_foreign_crime_victims_in_poland

Now it’s pretty obvious which country’s citizens were the most common crime victims in Poland, if you focus on the raw numbers registered by the police. This dataset doesn’t include any information about the number of visitors from each country, so it’s hard to say how likely a foreigner is to become a crime victim in Poland.

I wanted some interactivity and didn’t have much time, so I made a dashboard in Tableau:


Tableau is a much faster way to create static or interactive plots, but they are harder to reproduce than plots made in R.