Preparing for and passing BCS Foundation Certificate in Business Analysis (v4)

Last week I passed the BCS Business Analysis Foundation exam. My preparation took around a week, with cramming over a weekend. However, I had previously covered a lot of the material in this exam during another project-management course.

My approach was to:

  • make sure that I understood the topics highlighted in the syllabus,
  • read the recommended textbook to fill in the gaps, and
  • practise questions from mock exams.

The book on its own might be a bit dry to read, but it’s useful for filling in the syllabus and practising potential questions. If you don’t have much time to prepare, you can download my exam notes covering the syllabus (38 pages, including links to mock exams).

BCS Business Analysis Foundation - Syllabus - Learning Objectives

It still makes sense to read the textbook if you have time. Here’s the edition I used:

Good luck!

PostcodesioR 0.1.1 is on CRAN

Introduction

The latest stable version of my UK geocoder package has finally made it to CRAN. PostcodesioR is a wrapper for postcodes.io and it provides multiple functions to work with UK geospatial data.

This package is based exclusively on open data provided by Ordnance Survey and Office for National Statistics and turned into an API by postcodes.io.

PostcodesioR can be used by data scientists or social scientists working with geocoded UK data. A common task when working with such data is aggregating at different administrative levels, e.g. turning postcode-level data into county- or region-level data. This package can help to achieve this goal, and with many other tasks involving geospatial data.

Installation

The package can be installed from CRAN with

install.packages("PostcodesioR")

or from GitHub

devtools::install_github("erzk/PostcodesioR")

Once the package is installed, load it with library(PostcodesioR).

Examples

The workhorse of the package is the postcode_lookup() function which takes a postcode and returns a data frame with the following fields:

  • postcode Postcode. All current (‘live’) postcodes within the United Kingdom, the Channel Islands and the Isle of Man, received monthly from Royal Mail. 2, 3 or 4-character outward code, single space and 3-character inward code.
  • quality Positional Quality. Shows the status of the assigned grid reference.
  • eastings Eastings. The Ordnance Survey postcode grid reference Easting to 1 metre resolution; blank for postcodes in the Channel Islands and the Isle of Man. Grid references for postcodes in Northern Ireland relate to the Irish Grid system.
  • northings Northings. The Ordnance Survey postcode grid reference Northing to 1 metre resolution; blank for postcodes in the Channel Islands and the Isle of Man. Grid references for postcodes in Northern Ireland relate to the Irish Grid system.
  • country Country. The country (i.e. one of the four constituent countries of the United Kingdom or the Channel Islands or the Isle of Man) to which each postcode is assigned.
  • nhs_ha Strategic Health Authority. The health area code for the postcode.
  • longitude Longitude. The WGS84 longitude given the Postcode’s national grid reference.
  • latitude Latitude. The WGS84 latitude given the Postcode’s national grid reference.
  • european_electoral_region European Electoral Region (EER). The European Electoral Region code for each postcode.
  • primary_care_trust Primary Care Trust (PCT). The code for the Primary Care areas in England, LHBs in Wales, CHPs in Scotland, LCG in Northern Ireland and PHD in the Isle of Man; there are no equivalent areas in the Channel Islands. Care Trust/ Care Trust Plus (CT) / local health board (LHB) / community health partnership (CHP) / local commissioning group (LCG) / primary healthcare directorate (PHD).
  • region Region (formerly GOR). The Region code for each postcode. The nine GORs were abolished on 1 April 2011 and are now known as ‘Regions’. They were the primary statistical subdivisions of England and also the areas in which the Government Offices for the Regions fulfilled their role. Each GOR covered a number of local authorities.
  • lsoa 2011 Census lower layer super output area (LSOA). The 2011 Census lower layer SOA code for England and Wales, SOA code for Northern Ireland and data zone code for Scotland.
  • msoa 2011 Census middle layer super output area (MSOA). The 2011 Census middle layer SOA (MSOA) code for England and Wales and intermediate zone for Scotland.
  • incode Incode. The 3-character inward code that follows the space in the full postcode.
  • outcode Outcode. The 2, 3 or 4-character outward code; the part of the postcode before the space.
  • parliamentary_constituency Westminster Parliamentary Constituency. The Westminster Parliamentary Constituency code for each postcode.
  • admin_district District. The current district/unitary authority to which the postcode has been assigned.
  • parish Parish (England)/ community (Wales). The smallest type of administrative area in England is the parish (also known as ‘civil parish’); the equivalent units in Wales are communities.
  • admin_county County. The current county to which the postcode has been assigned.
  • admin_ward Ward. The current administrative/electoral area to which the postcode has been assigned.
  • ccg Clinical Commissioning Group. Clinical commissioning groups (CCGs) are NHS organisations set up by the Health and Social Care Act 2012 to organise the delivery of NHS services in England.
  • nuts Nomenclature of Units for Territorial Statistics (NUTS) / Local Administrative Units (LAU) areas. The LAU2 code for each postcode. NUTS is a hierarchical classification of spatial units that provides a breakdown of the European Union’s territory for producing regional statistics which are comparable across the Union. The NUTS area classification in the United Kingdom comprises current national administrative and electoral areas, except in Scotland where some NUTS areas comprise whole and/or part Local Enterprise Regions. NUTS levels 1-3 are frozen for a minimum of three years and NUTS levels 4 and 5 are now Local Administrative Units (LAU) levels 1 and 2 respectively.
  • _code Returns an ID or Code associated with the postcode. Typically these are a 9 character code known as an ONS Code or GSS Code. This is currently only available for districts, parishes, counties, CCGs, NUTS and wards.

One postcode can be geocoded in the following way

rss <- postcode_lookup("EC1Y8LX")

More than one postcode can be geocoded using purrr

postcodes <- c("EC1Y8LX", "SW1X 7XL")
postcodes_df <- purrr::map_df(postcodes, postcode_lookup)
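Because postcode_lookup() also returns administrative fields such as region, rolling postcode-level data up to a higher level is then straightforward. Here is a minimal sketch of the aggregation use case mentioned earlier; the sales data frame and its columns are made up for illustration:

library(dplyr)

# hypothetical postcode-level data
sales <- data.frame(postcode = c("EC1Y8LX", "SW1X 7XL"),
                    value    = c(10, 20))

sales %>%
  mutate(region = purrr::map_chr(postcode,
                                 ~ postcode_lookup(.x)$region)) %>%
  group_by(region) %>%  # aggregate postcode-level values by region
  summarise(total = sum(value))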

The remaining functions are demonstrated in the vignette.

Documentation and participation

To read the full documentation of the PostcodesioR package, you can follow this link to the pkgdown site.

If you want to help develop the package, report bugs, or propose pull requests, you will find the GitHub page here.

Extracting pitch tracks from audio files into a data frame

My task was to extract pitch values from a long list of audio files. Previously I used Praat and R for this task, but looping in R was rather slow, so I wanted to find another solution. The following analysis was developed on Linux (Ubuntu).

Firstly, aubio (a command-line tool with Python bindings) was used to extract pitch from wav files. aubio has fewer arguments than Praat and returned awkward values with the default settings, so I didn’t explore it further. On the plus side, it is easy to use and Python-friendly. To extract pitch with aubio use:

sudo apt install aubio-tools
aubiopitch -i P17_trim_short_10.000-11.150.wav

Eventually I decided to stick to Praat, which is the workhorse of phonetics and can be used from the command line.

Praat saves a history of all executed commands, which can be a great starting point for creating a script. More information about scripting in Praat is here. My solution is here:

This script will extract .pitch files from all .wav files in the working directory and will save them to a subfolder. Praat scripts can be called from the command line:

praat --run extract_pitch_script.praat

This will extract pitch tracks from all .wav files in the directory, using Praat’s default settings. The output is one .pitch file for each .wav file. These files contain all pitch candidates and are not in a tidy format, so they have to be transformed. This step could probably be done in Praat scripting, but I did not have the patience to achieve it there, so I moved to R, which could easily produce the desired output.

R can be called from the command line using littler. A shebang on the first line means that the script can be run directly from the command line. The script below transforms the .pitch files into clean .csv files.
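A minimal sketch of that script, assuming Praat’s default long text .pitch format (where each pitch candidate is stored as a frequency = line and a strength = line) and an illustrative strength threshold of 0.5:

#!/usr/bin/env r
# littler exposes command-line arguments as argv
file  <- argv[1]
lines <- readLines(file)

# pull the numeric values out of the "frequency = ..." and "strength = ..." lines
freq <- as.numeric(sub(".*frequency = ", "", grep("frequency = ", lines, value = TRUE)))
strn <- as.numeric(sub(".*strength = ", "", grep("strength = ", lines, value = TRUE)))

pitch <- data.frame(frequency = freq, strength = strn)
# keep only the candidates above a confidence (strength) threshold
pitch <- pitch[pitch$strength > 0.5, ]

write.csv(pitch, sub("\\.pitch$", ".csv", file), row.names = FALSE)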

To invoke the R script, run in the command line:

r praat_pitch_analysis_CLI.R untitled_script.pitch

This creates a .csv file with the best pitch candidate above a certain confidence threshold. The pitch extraction algorithm used by Praat was developed by Boersma (1993).

Automatic splitting of audio files on silence in Python

In my previous post I described how to split audio files into chunks using R. This time I wanted to use Python to prepare long audio files (.mp3) for further analysis. The use case is splitting a long audio file that contains many words/utterances/syllables that need to be analysed separately, e.g. a recorded list of words.

The analysis described here was conducted on Linux (Ubuntu 16.04) and should be fairly similar on macOS, but Windows would require quite a few amendments.

The first step was to turn the original .m4a files into .mp3 and to extract the segment I was interested in. I used ffmpeg for these tasks. This can be skipped if your files are already clean.

ffmpeg -i P17.m4a P17.mp3
ffmpeg -i P17.mp3 -ss 00:17:50 -to 00:23:30 -c copy P17_trim.mp3

The second command created a copy of the original .mp3 file and extracted the segment between 17 min 50 sec and 23 min 30 sec. That’s where speech was recorded in my file.

ffmpeg output

The continuous audio file that I used contained repeated utterances of the same syllable. Use the code below to split this file into segments. Silence detection is conducted using a support-vector machine (SVM):

Install pyAudioAnalysis and run on the command line:

python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i P17_trim_short.mp3 --smoothing 1.0 --weight 0.3
pyAudioAnalysis detecting silence in an audio file.
The top row shows the waveform of the audio signal (y-axis: amplitude, x-axis: time). The bottom row shows the probability of non-silence; the vertical lines are the markers that will be used to split the file.

The result is a list of sliced wav files. The file names contain the timings of the boundaries.

pyAudioAnalysis silenceRemoval output example.

All files in a given directory can be split using the following script:

Make sure to point the script to the directory where audioAnalysis.py lives. Modifying the smoothing and weight parameters will lead to different results, so they should be adjusted depending on the type of audio recording. By default the script shows a pop-up window with the suggested split, which is very useful for monitoring data quality. The Python script can be used from the command line with:

python split_continuous_audio.py

Book Review – Sound Analysis and Synthesis with R

R might not be the most obvious tool when it comes to analysing audio data. However, an increasing number of packages allow analysing and synthesising sounds. One such package is seewave. Jerome Sueur, one of the authors of seewave, has now written a book about working with audio data in R. The book is entitled Sound Analysis and Synthesis with R and was published by Springer in 2018. I highly recommend it to anyone working with audio data.

The book starts with a general explanation of sound. Then it introduces R to readers who have no experience using it. Over its 17 chapters the author describes the basic audio analyses that can be conducted with R. The underlying concepts are explained using both mathematical equations and R code. There is also some material on sound synthesis, but it is minor compared to the space devoted to analysis. Additional materials include the sound samples used throughout the book.

As mentioned before, the main topic of the book is the analysis of sound, predominantly in scientific settings. Researchers (or data scientists) typically want to load, visualise, play, and quantify a particular sound they work on. These basic steps are described in this book with code examples that are simple to follow and richly illustrated with R-generated plots. Check the book preview here.

If you ever need to paste, delete, repeat or reverse audio files with R, then recipes for these tasks can be found in this book. The book contains twenty DIY Boxes which show alternative ways to use already-coded functions and demonstrate new tasks. These boxes cover topics ranging from loading audio files and plotting to frequency and amplitude analysis.

Even though the author created his own package, the book shows how to use a wide range of audio-specific R packages like tuneR or warbleR.

I can only wish that this book had been released earlier. It would have saved me a lot of pain conducting audio analyses.

Final verdict: 5/5

Spectrograms in R – a gallery

Creating a spectrogram is a basic step in every analysis of audio signals. Spectrograms visualise how frequencies change over time. Luckily, a number of R packages can help with this task; below I present the ones I like to use. This post is not an introduction to spectrograms. If you want to learn more about them, try other resources (e.g. lecture notes from UCL).

The examples shown below came mostly from the official documentation and were kept as simple as possible. The majority of functions allow further customisation of the plots.

phonTools
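A minimal example, assuming phonTools’ built-in vowelsynth() synthesiser as input (any sound object loaded with loadsound() should work the same way):

library(phonTools)
# synthesise a short vowel and plot its spectrogram
sound <- vowelsynth()
spectrogram(sound)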

seewave
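A minimal example using the tico recording bundled with seewave:

library(seewave)
data(tico)               # example bird song recording (f = 22050 Hz)
spectro(tico, f = 22050)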

seewave and ggplot2
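seewave’s ggspectro() returns a bare ggplot object, so a geom has to be added to it; a minimal sketch:

library(seewave)
library(ggplot2)
data(tico)
v <- ggspectro(tico, ovlp = 50)   # ggplot object with time/frequency/amplitude
v + geom_tile(aes(fill = amplitude))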

signal
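signal provides specgram(); a minimal example on a synthetic chirp, so no audio file is needed:

library(signal)
# 0-1 kHz linear chirp, two seconds long, sampled at 8 kHz
x <- chirp(seq(0, 2, by = 1/8000), f0 = 0, t1 = 2, f1 = 1000)
specgram(x, Fs = 8000)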

soundgen
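A minimal example, assuming soundgen’s own synthesiser for input (its default output is sampled at 16 kHz):

library(soundgen)
s <- soundgen()                      # synthetic vocalisation
spectrogram(s, samplingRate = 16000)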

warbleR

hht

Creating a spectrogram from scratch is not so difficult, as shown by Hansen Johnson in this blog post. Another solution was provided by Aaron Albin.

Praat is a workhorse of audio analysis. It is a standalone program, but there is also an R controller called PraatR that allows calling Praat functions from R. It is not the easiest tool to use, so I will just mention it here for reference.

I am pretty sure that there are more packages that allow creating spectrograms but I had to stop somewhere. Feel free to leave comments about other examples.

Removing triggers from Hitachi ETG-4000 fNIRS recordings

Homer2 needs a particular format of .nirs file that cannot have consecutive triggers (also called Marks in Hitachi files). The hitachi2nirs Matlab script also removes the markers, but I wanted to recreate the whole process and be sure that I’m doing it correctly. Answering Yes to the question Do you want to remove the marker at the end of each stimulus? y/n will run the following code:

To remove the triggers/markings in R, follow the steps below.

Start by loading the packages and files.
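A minimal sketch of this step; the skip value is an assumption, as the ETG-4000 .csv export starts with a block of acquisition metadata of varying length, so adjust it until reading starts at the channel-data header row:

library(ggplot2)

etg_file <- "Hitachi_ETG4000_recording.csv"   # hypothetical file name
etg <- read.csv(etg_file, skip = 40)          # skip the metadata header

str(etg)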

This will produce a table showing the structure of the data:

## 'data.frame': 2500 obs. of 50 variables:
## $ Probe1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CH1.703.6. : num 0.1865 0.0182 -0.4738 -0.1521 -0.3078 ...
## $ CH1.829.0. : num 0.412 0.547 0.534 0.314 0.106 ...
## $ CH2.703.9. : num 0.739 0.764 0.746 0.751 0.762 ...
## $ CH2.829.3. : num 1.01 1.01 1.03 1.03 1.03 ...
## $ CH3.703.9. : num 1.57 1.58 1.59 1.59 1.6 ...
## $ CH3.829.3. : num 1.64 1.65 1.65 1.66 1.67 ...
## $ CH4.703.9. : num 1.48 1.45 1.55 1.51 1.47 ...
## $ CH4.828.8. : num 1.63 1.64 1.66 1.66 1.68 ...
## $ CH5.703.6. : num -1.226 -1.743 -0.546 -0.556 -0.75 ...
## $ CH5.829.0. : num 0.00397 -0.23102 -1.11099 -0.64056 -1.01425 ...
## $ CH6.703.1. : num -0.247 -0.335 -0.371 -0.667 -1.064 ...
## $ CH6.828.8. : num 0.987 0.892 0.892 0.933 0.796 ...
## $ CH7.703.9. : num 1.03 1.3 1.11 1.02 1.44 ...
## $ CH7.829.3. : num 1.2 1.22 1.21 1.23 1.23 ...
## $ CH8.702.9. : num 2 2.01 2.03 2.04 2.04 ...
## $ CH8.829.0. : num 1.79 1.81 1.81 1.83 1.85 ...
## $ CH9.703.9. : num 2.07 2.02 2.12 2.12 2.01 ...
## $ CH9.828.8. : num 1.82 1.82 1.82 1.84 1.85 ...
## $ CH10.703.1. : num -0.492 -0.135 -0.598 -0.598 -0.328 ...
## $ CH10.828.8. : num 0.672 0.61 0.823 0.724 0.724 ...
## $ CH11.703.1. : num -1.042 -0.255 -1.773 -1.419 -0.449 ...
## $ CH11.828.8. : num 1.071 1.052 0.804 1.107 1.047 ...
## $ CH12.702.9. : num 0.684 0.771 0.704 0.512 0.905 ...
## $ CH12.829.0. : num 1.02 1.01 1.03 1.08 1.07 ...
## $ CH13.702.9. : num 2.03 2.03 2.05 2.05 2.05 ...
## $ CH13.829.0. : num 1.76 1.78 1.79 1.79 1.81 ...
## $ CH14.703.6. : num -1.719 -1.196 -0.359 -0.883 -1.99 ...
## $ CH14.829.0. : num -0.0832 0.0209 -0.1123 -0.2014 -0.2011 ...
## $ CH15.703.1. : num 1.97 1.89 1.82 1.98 2.09 ...
## $ CH15.828.8. : num 1.81 1.78 1.8 1.81 1.84 ...
## $ CH16.703.4. : num 0.0209 -0.4283 -0.0848 -0.278 0.4996 ...
## $ CH16.829.0. : num 1.36 1.26 1.38 1.23 1.27 ...
## $ CH17.702.9. : num 2.35 2.35 2.36 2.37 2.38 ...
## $ CH17.829.0. : num 2.08 2.09 2.11 2.12 2.13 ...
## $ CH18.703.6. : num 2.1 2.1 2.09 2.1 2.1 ...
## $ CH18.828.5. : num 2.14 2.14 2.14 2.15 2.15 ...
## $ CH19.703.6. : num -1.104 -1.134 -0.658 -0.886 -0.336 ...
## $ CH19.829.0. : num -0.1239 0.09369 0.05463 0.01617 -0.00427 ...
## $ CH20.703.4. : num 1.65 1.55 1.28 1.35 1.56 ...
## $ CH20.829.0. : num 1.77 1.75 1.8 1.81 1.8 ...
## $ CH21.703.4. : num 1.41 1.43 1.31 1.42 1.46 ...
## $ CH21.829.0. : num 1.76 1.77 1.76 1.77 1.79 ...
## $ CH22.703.6. : num 2.11 2.29 2.18 2.24 2.21 ...
## $ CH22.828.5. : num 2.17 2.17 2.18 2.17 2.2 ...
## $ Mark : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Time : num 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ...
## $ BodyMovement: int 0 0 0 0 0 0 0 0 0 0 ...
## $ RemovalMark : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PreScan : int 1 1 1 1 1 1 1 1 1 1 ...

a table of triggers:
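It can be produced with a simple call (using the Mark column from the structure above):

table(etg$Mark)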

## .
## 1 2 9 10
## 73 10 4 2

and a plot showing raw data in one channel (quite noisy) with all the triggers.

This shows several triggers (all plotted in red). I will only keep trigger ‘2’, which marks the beginning of a block. The first step is cleaning the data by removing all but trigger ‘2’.
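A sketch of this cleaning step, reusing the Mark column from above:

# zero out every trigger that is not '2'
etg$Mark[etg$Mark != 2] <- 0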

This results in fewer events:

It turned out that there were two ‘2’ triggers next to each other. That’s because the ETG-4000 does not allow odd triggers next to each other, e.g. 212 is invalid, but 22111122 is valid. I wrote a function (soon to be incorporated into the fnirsr package) that deals with this problem.
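A minimal sketch of that function; the real implementation lives in fnirsr, this one simply zeroes out the second occurrence of each trigger pair:

drop_end_triggers <- function(marks, trigger = 2) {
  idx <- which(marks == trigger)
  # keep the odd occurrences (block onsets), zero out the even ones (block ends)
  marks[idx[seq_along(idx) %% 2 == 0]] <- 0
  marks
}

etg$Mark <- drop_end_triggers(etg$Mark)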

The result, keeping only the first event of each block, is shown here:

Geofacet of Poland – plots laid out as voivodeships

I recently came across an interesting package, geofacet, which lets you arrange plots according to their position on a map. Its main function, facet_geo(), replaces facet_wrap() from ggplot2. A Polish map is not yet available in the standard geofacet package, but I hope it will be there soon, as I have added it on GitHub.

I created a grid with the coordinates of the individual voivodeships. Plots made with geofacet can look like this:

geofacet_polska_poland_wojewodztwa
The placement of the voivodeships is not perfect, but geofacet allows you to use custom settings.

The data come from the Local Data Bank (Bank Danych Lokalnych) (XLS – pivot table).

The code to create the plots:
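A minimal sketch of such a plot; pl_voivodeship_grid stands in for the custom grid I created (a data frame with row, col, code and name columns), and bdl is a hypothetical data frame of Local Data Bank values with columns wojewodztwo, year and value:

library(ggplot2)
library(geofacet)

ggplot(bdl, aes(x = year, y = value)) +
  geom_line() +
  facet_geo(~ wojewodztwo, grid = pl_voivodeship_grid) +  # custom Polish grid
  theme_minimal()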

Downloading UK property prices from Zoopla in R

Zoopla allows limited access to its API, providing the latest property prices and area indices. I created an R package that allows querying this database. See the GitHub documentation or zooplaR’s page for the latest info.

You can easily get prices from the last couple of months or years for a particular postcode, outcode or area:

Given the limited number of queries, it might be worth double-checking the results with the property widget offered by Zoopla (redirects to zoopla.co.uk).

It doesn’t have as many options as the API and obviously is not automated, but it’s worth using as a sanity check.

Enabling MATLAB in Jupyter notebooks on Linux

Introduction

In my previous post I showed how to enable MATLAB in Jupyter notebooks on Windows. Now it’s time for GNU/Linux (Ubuntu).

My main issue with enabling the new kernel was having initially installed two Anacondas and two Python versions (2.7 and 3.5). After a lot of frustration, I decided to remove both Anacondas and do a clean install of the latest Anaconda with Python 2.7 and 3.5. In this tutorial I assume that Jupyter and MATLAB are already installed on your system.

Using the right environment

Although the official MATLAB website states that the Python-MATLAB engine works with Python 2.7, 3.4, 3.5 and 3.6, I struggled to install it using Python 3.5. If you try to install it with Python 3.5, you will see the following error:

OSError: MATLAB Engine for Python supports Python version 2.7, 3.3 and 3.4, but your version of Python is 3.5

The error makes it obvious that you need an older version of Python. I decided to use 2.7. To do that, I created another environment with Python 2.7:

conda create -n py27 python=2.7 anaconda

The guidelines for managing Python environments are here.

The next step was checking what environments were available:

conda info --envs

And activating Python 2.7 (py27):

source activate py27

Install Python-MATLAB engine

To install the engine connecting both languages, go to your MATLAB folder, find the Python engine folder, and install setup.py. This can be done in the following way:

Change your working directory to where your MATLAB lives:
cd "MATLABROOT/extern/engines/python"

If you don’t know where your MATLAB is installed, use:
locate matlab

Then install the engine (it will only work with MATLAB >=2014b):

sudo python setup.py install

Then install the latest versions of the remaining dependencies:

sudo pip install -U metakernel
sudo pip install -U matlab_kernel
sudo pip install -U pymatbridge

That should do the job. Now open a new Jupyter notebook:

jupyter notebook

Check whether you can find MATLAB among the available kernels (top right corner):

Now check whether you can actually run the notebook. Initially, when I tried using Python 3.5, I could see MATLAB among the options but the kernels would die each time I tried running the MATLAB code. Moving to Python 2.7, as described in this tutorial, solved the problem.

If everything works fine, the following notebook should render correctly:

Even though I’m getting the MetaKernelApp error, the notebook continues to work correctly:
[MetaKernelApp] ERROR | No such comm target registered: jupyter.widget.version

To leave the environment used to run the notebook, simply type:

source deactivate

Notes

Initially, I struggled a bit with making it all work so in the meantime I also tried installing Octave (a free equivalent of MATLAB). I’m not sure whether that installation helped me with running MATLAB within Jupyter.

While trying to install the engine I came across several errors. I guess that most of them were related to my OS configuration and all of them were solved by searching for the error message. One of the errors was:

Error:
[I 00:58:19.847 NotebookApp] KernelRestarter: restarting kernel (3/5)
/home/eub/anaconda3/bin/python: No module named matlab_kernel

This was due to installing the Python engine in the wrong environment (i.e. my default Python 3.5). It was solved by activating Python 2.7 and using it to install the Python-MATLAB engine.

I think that an alternative way to activate MATLAB in Jupyter, without Anaconda, would be to explicitly point the installer to a Python version that supports the Py-MATLAB engine.

In my case:
sudo ~/anaconda/pkgs/python-2.7.13-0/bin/python2.7 setup.py install

You might also want to install the engine in a non-default location. In that case, MATLAB has a solution to that problem and suggests installing the Python engine in the home directory.

There is another Jupyter kernel (imatlab) that supposedly works with Python 3.5 and MATLAB R2016b+, but I haven’t tested it myself. As long as my current configuration works, I’m not planning to go through the hell of installing dependencies again.