Recently I read an article (PL) about the Polish police massaging their statistics. It made me wonder what kind of data is available on their website and whether any interesting patterns could be observed.
The website offers some data, but it is badly formatted, not very recent, and can only be downloaded as a PDF :O
I didn’t feel like scraping the page so I manually copied and pasted the data from the website and initially preprocessed it in Excel by extracting the numbers following the backslash.
I decided to focus on the ‘Foreign - Crime’ dataset. Surprisingly enough, crime perpetrators and victims are lumped together in one table, separated by a backslash. As if that weren’t enough bad formatting, someone decided to split the table in two, each part with a different number of rows and some missing values (marked as ‘bd’). Victims/suspects from countries not specified in the table were aggregated in the total values (PL: ‘RAZEM’). I intentionally omitted these totals from my analyses.
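I did this step by hand in Excel, but the same cleanup could be sketched in base R. The helper name `split_cell`, the output column names, and the example cells are all made up for illustration. I also leave open which side of the backslash holds suspects and which holds victims, so check that against the source table before relying on it.

```r
# Hypothetical helper: split cells such as "12\5" into two integer columns.
# 'bd' (the missing-value marker in the source tables) becomes NA.
split_cell <- function(x) {
  # "\\" is a single literal backslash; fixed = TRUE disables regex matching
  parts <- strsplit(trimws(as.character(x)), "\\", fixed = TRUE)
  to_int <- function(v) {
    v[v %in% "bd"] <- NA
    as.integer(v)
  }
  data.frame(
    before_slash = to_int(vapply(parts, function(p) p[1], character(1))),
    after_slash  = to_int(vapply(parts, function(p) p[2], character(1)))
  )
}

# Illustrative cells only, not real figures from the police data
cells <- c("12\\5", "bd", "0\\3")
counts <- split_cell(cells)
```

A cell that is entirely ‘bd’ ends up as NA in both columns, which keeps missing values distinct from true zeros.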
The original(-ish) data was in wide format, but I needed to turn it into long format. I used tidyr for that:
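A minimal sketch of that reshaping step, assuming one row per country and one column per year. The data frame, its column names, and the counts are invented for illustration, and `pivot_longer()` is my assumption about how to do it with current tidyr rather than the exact call used for this post.

```r
library(tidyr)

# Toy wide table: one row per country, one column per year.
# Names and values are made up, not taken from the police data.
victims_wide <- data.frame(
  country = c("A", "B"),
  `2012` = c(10L, NA),
  `2013` = c(7L, 3L),
  check.names = FALSE  # keep the year column names as they are
)

# Long format: one row per country-year pair
victims_long <- pivot_longer(
  victims_wide,
  cols = -country,
  names_to = "year",
  values_to = "n_victims"
)
```

The long table has one observation per row, which is the shape ggplot2 expects for mapping columns to axes and fill.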
Then I created a heatmap using ggplot2 and RColorBrewer.
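A heatmap of this kind can be sketched with `geom_tile()` and a ColorBrewer palette. The toy data, the choice of the "YlOrRd" palette, and the axis mapping (year on x, country on y) are assumptions for illustration, not the exact settings behind the plot in this post.

```r
library(ggplot2)
library(RColorBrewer)

# Toy long-format data; values are made up for illustration
victims_long <- data.frame(
  country = rep(c("A", "B"), each = 2),
  year = rep(c("2012", "2013"), times = 2),
  n_victims = c(10, 7, NA, 3)
)

heatmap_plot <- ggplot(victims_long,
                       aes(x = year, y = country, fill = n_victims)) +
  geom_tile(colour = "white") +                        # one tile per country-year
  scale_fill_gradientn(colours = brewer.pal(9, "YlOrRd"),
                       na.value = "grey90") +          # grey tiles for missing values
  labs(x = "Year", y = "Country", fill = "Victims") +
  theme_minimal()

# print(heatmap_plot)  # uncomment to draw the plot
```

Giving missing values their own neutral colour keeps the ‘bd’ cells visible instead of leaving holes in the grid.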
The result was this heatmap:
Now it’s pretty obvious which country’s citizens were the most common crime victims in Poland, at least if you focus on the raw numbers registered by the police. The dataset doesn’t include any information about the number of visitors from other countries, so it’s hard to say how likely a foreigner is to become a crime victim in Poland.
I wanted to have some interactivity and I didn’t have much time so I made a dashboard in Tableau:
Tableau is a much faster way to create static or interactive plots, but the results are more difficult to reproduce than plots made in R.