Introduction to Open Data Science - Course Project

About the project

Hello world! This is my first entry to the Introduction to Open Data Science 2020 course project. This course teaches data analysis using R and writing stuff down with R Markdown. Interesting and useful stuff. Previously I’ve worked with Python and GitHub-flavored Markdown syntax.

I hope to learn more about the “statistics” side of data science. Topics such as…

By the way, here is the link to my GitHub repository for this project. Currently it’s rather empty as the “real” work starts next week…



Regression and model validation

This section is the report for the second week of IODS. To start with, here are the instructions given for writing this report:

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.

With that out of the way, let’s start the analysis part of the current exercise set!

Setting up

First, I need to include a couple of libraries to properly build (or rather, “knit”) this report.

# knitr is used to build this report, so it is included by default

# Remove the '##' from each line of R output blocks
knitr::opts_chunk$set(comment = '')

# Libraries for convenience
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)

# NOTE: Messages are omitted

The analysis data

The analysis data was produced in the “Data wrangling” part of the exercises. The original data is the “JYTOPKYS2” study data by Kimmo Vehkalahti. The original data is available here and explained here.

The analysis data is a mix of original variables (gender, age, attitude, and points) and derived variables (deep, stra, and surf). The original and derived variables are explained in the table below, based on Vehkalahti’s metadata (a small sketch of how such a derived average is computed follows the table). Note that the derived variables were omitted from the original data linked above – so we had to calculate them ourselves – but they are included in the metadata.

| Variable | Meaning | Type | Notes |
|----------|---------|------|-------|
| gender | Gender of the student | Categorical | “M” or “F” |
| age | Age (in years) | Number | Derived from the date of birth |
| attitude | Global attitude toward statistics | Number | Sum of 10 answers |
| points | Total points from coursework | Number | Student’s best exam score + extra points from answering the study questionnaire |
| deep | Average score of answers related to Deep approach | Number | Avg. of 12 answers |
| stra | Average score of answers related to Strategic approach | Number | Avg. of 12 answers |
| surf | Average score of answers related to Surface approach | Number | Avg. of 8 answers |
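For context, each derived variable is simply the row-wise average of the questionnaire items belonging to that learning approach. Here is a hedged sketch of the idea – the item names and answer values below are made up for illustration and are not the actual JYTOPKYS2 columns:

# Hypothetical illustration of deriving an approach score (made-up item names and values)
deep_items <- c("item1", "item2", "item3")     # placeholders for the real question columns
answers    <- data.frame(item1 = c(4, 2), item2 = c(5, 3), item3 = c(3, 3))
deep       <- rowMeans(answers[, deep_items])  # average score per student
deep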

You can see the creation script here (most recent version) or the original version from commit bb6de84. The script stores the data in learning2014.csv, so let’s start by reading that file and printing out the dataset structure, the first few rows, and some summary statistics.

dset <- read.csv("data/learning2014.csv")
str(dset)
'data.frame':   166 obs. of  7 variables:
 $ gender  : chr  "F" "M" "F" "M" ...
 $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
 $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
 $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
 $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
 $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
 $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
knitr::kable(head(dset, 5))  # Prettier table, see also footnote 
gender age attitude deep stra surf points
F 53 37 3.583333 3.375 2.583333 25
M 55 31 2.916667 2.750 3.166667 12
F 49 25 3.500000 3.625 2.250000 24
M 53 35 3.500000 3.125 2.250000 10
M 49 37 3.666667 3.625 2.833333 22
summary(dset)
    gender               age           attitude          deep      
 Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
 Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
 Mode  :character   Median :22.00   Median :32.00   Median :3.667  
                    Mean   :25.51   Mean   :31.43   Mean   :3.680  
                    3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
                    Max.   :55.00   Max.   :50.00   Max.   :4.917  
      stra            surf           points     
 Min.   :1.250   Min.   :1.583   Min.   : 7.00  
 1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
 Median :3.188   Median :2.833   Median :23.00  
 Mean   :3.121   Mean   :2.787   Mean   :22.72  
 3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
 Max.   :5.000   Max.   :4.333   Max.   :33.00  
count(dset, gender)
  gender   n
1      F 110
2      M  56

This data contains 166 data points (rows). The original data has 187 rows; however, students with 0 points were omitted during the data wrangling. Of the 166 remaining students, 110 are female and 56 are male. The students’ ages range from 17 to 55 years, with the median at 22 years.

Next, here is a plot that shows the distributions of the variables and the cross-correlations between the variables in the data. First, most of the distributions are similar to (slightly skewed) Gaussian distributions, though there are exceptions. Notably, age resembles an inverse-gamma distribution: a sharp peak at low values and a long trailing tail. Even when ignoring the high-end “outliers”, the distribution is still positively skewed – as seen in the boxplot. Second, there is no clear difference between males and females for any of the variables. Third, the correlations seem to be rather weak (\(|\rho| \lesssim 0.2\)), except for deep and surf (\(\rho_{deep,surf} \approx -0.32\)) and for attitude and points (\(\rho_{attitude,points} \approx 0.44\)), for which the correlation is a bit stronger. Finally, as the correlation values are not properly displayed in the figure due to the figure size, I’ve included the Pearson correlation coefficient table below the figure.

ggpairs(dset, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo=wrap("facethist", bins=20))
       )

cor(dset[-1])  # Exclude 'gender' column
                 age    attitude        deep        stra       surf      points
age       1.00000000  0.02220071  0.02507804  0.10244409 -0.1414052 -0.09319032
attitude  0.02220071  1.00000000  0.11024302  0.06174177 -0.1755422  0.43652453
deep      0.02507804  0.11024302  1.00000000  0.09650255 -0.3238020 -0.01014849
stra      0.10244409  0.06174177  0.09650255  1.00000000 -0.1609729  0.14612247
surf     -0.14140516 -0.17554218 -0.32380198 -0.16097287  1.0000000 -0.14435642
points   -0.09319032  0.43652453 -0.01014849  0.14612247 -0.1443564  1.00000000

Fitting the linear model

Next up is the task of fitting a linear model to the data, where points is the target variable and one or more of the other variables are the explanatory variables. The first instruction was to select three explanatory variables. I chose attitude, stra, and surf, as their correlation coefficients with points are the largest in absolute value. And the results are…

# Target: points
# Explanatory variables (3): attitude, stra, surf
#   based on absolute value of correlation coefficient (Pearson)
model <- lm(points ~ attitude + stra + surf, dset)
summary(model)

Call:
lm(formula = points ~ attitude + stra + surf, data = dset)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1550  -3.4346   0.5156   3.6401  10.8952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.01711    3.68375   2.991  0.00322 ** 
attitude     0.33952    0.05741   5.913 1.93e-08 ***
stra         0.85313    0.54159   1.575  0.11716    
surf        -0.58607    0.80138  -0.731  0.46563    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared:  0.2074,    Adjusted R-squared:  0.1927 
F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Yeah, not so good. First, I’ll have to quickly interpret the output. At the top is the distribution of the residuals. A (minimum) residual of “-17” seems rather large (its absolute value, that is) when the points range from 7 to 33 – there might be some strong outliers. At least the median is close to zero, which is reassuring.

Next are the coefficients of the linear regression. The Estimate column shows the slope for each explanatory variable (except for the intercept, where it shows the value of the target when all explanatory variables are zero). So with zero attitude and zero learning-approach scores, you’d expect a student to get about 11 points. For the slopes, the value means “the increase or decrease in points when the attitude/strategic learning/surface learning score increases by one unit”. So larger attitude and stra scores would indicate better performance, whereas a larger surf score would decrease performance. The next column, Std. Error, shows the standard error of the estimate. The third column, t value, shows the t-statistic, which measures the statistical significance of the estimate. That value is not very informative on its own, so we’ll rely on the fourth column to interpret it: it shows the probability of observing an estimate at least this far from zero purely by chance if the true coefficient were zero – in other words, how confident we can be in rejecting the null hypothesis. One more time for the non-statisticians like me: the smaller the p-value is, the more we can trust that the target and the explanatory variable have a relationship, i.e. that the slope is not zero.

Other interesting numbers are the R-squared values: they estimate “the fraction of variance explained by the model”. Zero means that our model cannot explain anything about the target variable, while one would mean that our model explains everything, including any random errors.
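As a side note, the same numbers can be pulled out of the fitted model object programmatically instead of reading them off the printout; here is a minimal sketch using base R accessors on the model defined above:

# Pull the key numbers out of the fitted model object
coef(summary(model))           # estimates, standard errors, t values and p-values as a matrix
confint(model)                 # 95% confidence intervals for the coefficients
summary(model)$r.squared       # multiple R-squared
summary(model)$adj.r.squared   # adjusted R-squared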

So, looking at the significance of the estimates, it looks like only attitude has any explanatory power here – but stra is close. The R² values indicate that we’re not capturing the majority of the variance with this model, or that our data is really noisy and has plenty of unknown random errors. Let’s try to improve the model by dropping surf, as it is not meaningful at all; maybe it masks stra a little.

model <- lm(points ~ attitude + stra, dset)
summary(model)

Call:
lm(formula = points ~ attitude + stra, data = dset)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6436  -3.3113   0.5575   3.7928  10.9295 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.97290    2.39591   3.745  0.00025 ***
attitude     0.34658    0.05652   6.132 6.31e-09 ***
stra         0.91365    0.53447   1.709  0.08927 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.289 on 163 degrees of freedom
Multiple R-squared:  0.2048,    Adjusted R-squared:  0.1951 
F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Well, there is a slight improvement for stra. Also, the intercept got more significant! Cool. The adjusted R² slightly improved, too. However, it is still a little suspect whether stra truly has any explanatory power. Let’s inspect it a little further by plotting the scatterplots for points ~ attitude and points ~ stra:

# Do scatterplots with ggplot2: prettier figures and include a linear regression fit easily!
p_att <- ggplot(dset, mapping=aes(attitude, points)) +         # Set up the figure
           geom_point() +                                      # Add scatter points
           geom_smooth(method="lm") +                          # Add linear regression with confidence interval
           labs(title="points vs. attitude (with regression)") # Add title for the plot
        
p_stra <- ggplot(dset, mapping=aes(stra, points)) +
            geom_point() +
            geom_smooth(method="lm") +
            labs(title="points vs. stra (with regression)")

# Use gridExtra package to present the figures side-by-side. Saves some screen space :)
grid.arrange(p_att, p_stra, nrow=1, ncol=2)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'

Okay, so, the left and right figures have attitude and stra, respectively, as the explanatory variable. As expected, attitude shows at least some linear pattern in the scatter plot, though there are some points rather far away from the regression line (I’ll explore this in a bit)… By contrast, the stra plot doesn’t seem to have any discernible pattern, and the regression line’s slope is quite close to zero. So I’m not convinced that stra would be a good explanatory variable here. Let’s see what happens when I drop stra too…

model <- lm(points ~ attitude, dset)
summary(model)

Call:
lm(formula = points ~ attitude, data = dset)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.9763  -3.2119   0.4339   4.1534  10.6645 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
attitude     0.35255    0.05674   6.214 4.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.32 on 164 degrees of freedom
Multiple R-squared:  0.1906,    Adjusted R-squared:  0.1856 
F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Yes, there is some change. The most drastic change is in the intercept’s significance: the t-value nearly doubled and the corresponding probability became about five orders of magnitude smaller. Also, the residuals’ minimum, maximum, and median all got a little closer to zero, although both the multiple and the adjusted R² took a minor hit (they weren’t very high to begin with). Speaking of the R² values being low, it looks like our simple linear model explains only about 19% of the variance. That might be acceptable for this study, but I’d be cautious about using this model for forecasting. Regardless, it shows that there is a relationship between the two variables.

Interpreting the model and the model diagnostics

So, based on the above analysis, the student’s attitude towards statistics can indicate their course performance: a more enthusiastic student is likely to perform better than a student who is not very interested in statistics. However, having a good attitude towards statistics won’t guarantee success! There must be something else, as the scatter plot for points ~ attitude hints: the points are rather spread out around the regression line.

To conclude the analysis, I’ll study the model’s diagnostics a bit. Good data science practice calls for critical evaluation of the model – you shouldn’t blindly trust a model’s output without knowing whether the model is reliable and trustworthy. In linear regression, we implicitly assume a couple of things about our random errors (a quick numerical check follows the list below):

  1. The errors are normally distributed
  2. The error distribution has a zero mean and a constant variance
  3. The errors are not correlated with each other
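These assumptions can also be probed numerically before looking at any plots. Here is a small sketch of my own (not part of the exercise) using base R: the residual mean should be essentially zero by construction, and the Shapiro–Wilk test gives a formal check of normality.

# Quick numerical checks on the residuals of the final model
res <- resid(model)
mean(res)          # essentially zero for a least-squares fit
shapiro.test(res)  # Shapiro-Wilk test: a small p-value suggests non-normal residuals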

For checking the validity of these assumptions, the figure below shows two diagnostic plots: Residuals vs. Fitted and the Normal Q-Q plot. The first one should show no visible pattern if the assumptions hold. Well, that doesn’t seem to be the case here: the residuals seem to converge towards zero as the fitted value increases. That indicates that the variance of the errors is not constant, violating the second point listed above. The Q-Q plot shows how well the residual distribution matches an idealised (theoretical) normal distribution. The points in the figure should lie close to a straight line – for two identical distributions the points would all be on the dotted line in the figure. If the points deviate from the line, the normality assumption doesn’t hold. Here, the points show a slight curve, and the outermost quantiles clearly deviate from the ideal line. So, the normality assumption is violated at the edges of the distribution – at the center the distribution is almost “normal”.

Then there is the question of outliers, which are data points that deviate from the main body of the data. Does our data have outliers that would distort the fitting process by having strong “leverage”? The third plot, “Residuals vs Leverage”, answers this question. As the leverages are quite small (<0.05), there don’t seem to be any data points that would affect our fit disproportionately. It turns out the data points with large residuals were close to the center of the linear fit, so they didn’t have that much leverage.

In conclusion, the model can indicate something about the relationship between the students’ performance and their attitude towards statistics, but it is not very good for prediction purposes. A plain linear regression model is not well suited to this data – an approach with fewer assumptions about the random errors would be needed.

par(mfrow = c(2,2))
plot(model, which=c(1,2,5))



Logistic regression

A new week, a new report! First, I need to set up the libraries and knitr options.

# knitr options: No "##" before output
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(tidyr)
library(reshape2)

Data

Today’s data is from the UCI Machine Learning Repository, kindly provided by Paulo Cortez. The data is the Student Performance Data Set (Cortez and Silva, 2008). The authors describe the dataset as follows:

“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”

However, I’ve modified the data as follows (see the R script here):

  • Join the two datasets using 13 identifying variables: "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"
  • For each column that was not used in joining and appeared in both datasets, combine them into one by taking the average value (numerical values) or the value in the first column (categorical values).
  • Calculate average alcohol use (column alc_use) by calculating the mean of weekday and weekend alcohol use values.
  • Create a logical column high_use where TRUE denotes the students who have average alcohol use value greater than 2.

Here are the variables and their explanations (adapted from the data source website):

| Variable | Meaning | Datatype | Type | Notes |
|----------|---------|----------|------|-------|
| school | student’s school | Categorical | Identifier | ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira |
| sex | student’s sex | Categorical | Identifier | ‘F’ - female or ‘M’ - male |
| age | student’s age | Numeric | Identifier | from 15 to 22 |
| address | student’s home address type | Categorical | Identifier | ‘U’ - urban or ‘R’ - rural |
| famsize | family size | Categorical | Identifier | ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3 |
| Pstatus | parent’s cohabitation status | Categorical | Identifier | ‘T’ - living together or ‘A’ - apart |
| Medu | mother’s education | Categorical | Identifier | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
| Fedu | father’s education | Categorical | Identifier | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
| Mjob | mother’s job | Categorical | Identifier | ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’ |
| Fjob | father’s job | Categorical | Identifier | ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’ |
| reason | reason to choose this school | Categorical | Identifier | close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’ |
| guardian | student’s guardian | Categorical | Original | ‘mother’, ‘father’ or ‘other’ |
| traveltime | home to school travel time | Categorical | Original | 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour |
| studytime | weekly study time | Categorical | Original | 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours |
| failures | number of past class failures | Categorical | Original | n if 1<=n<3, else 4 |
| schoolsup | extra educational support | Categorical | Original | yes or no |
| famsup | family educational support | Categorical | Original | yes or no |
| paid | extra paid classes within the course subject (Math or Portuguese) | Categorical | Original | yes or no |
| activities | extra-curricular activities | Categorical | Original | yes or no |
| nursery | attended nursery school | Categorical | Identifier | yes or no |
| higher | wants to take higher education | Categorical | Original | yes or no |
| internet | Internet access at home | Categorical | Identifier | yes or no |
| romantic | with a romantic relationship | Categorical | Original | yes or no |
| famrel | quality of family relationships | Categorical | Original | from 1 - very bad to 5 - excellent |
| freetime | free time after school | Categorical | Original | from 1 - very low to 5 - very high |
| goout | going out with friends | Categorical | Original | from 1 - very low to 5 - very high |
| Dalc | workday alcohol consumption | Categorical | Original | from 1 - very low to 5 - very high |
| Walc | weekend alcohol consumption | Categorical | Original | from 1 - very low to 5 - very high |
| health | current health status | Categorical | Original | from 1 - very bad to 5 - very good |
| absences | number of school absences | Numeric | Original | from 0 to 93 |
| G1 | first period grade | Numeric | Original | from 0 to 20 |
| G2 | second period grade | Numeric | Original | from 0 to 20 |
| G3 | final grade | Numeric | Original | from 0 to 20 |
| alc_use | Average alcohol usage | Numeric | Calculated | Avg. of Dalc and Walc |
| high_use | Has high alcohol consumption | Boolean | Calculated | alc_use > 2 |

So, with that out of the way, let’s see the actual data.

dataset <- read.csv("data/alc.csv", header=TRUE)
str(dataset)
'data.frame':   382 obs. of  35 variables:
 $ school    : chr  "GP" "GP" "GP" "GP" ...
 $ sex       : chr  "F" "F" "F" "F" ...
 $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
 $ address   : chr  "R" "R" "R" "R" ...
 $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
 $ Pstatus   : chr  "T" "T" "T" "T" ...
 $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
 $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
 $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
 $ Fjob      : chr  "other" "other" "other" "health" ...
 $ reason    : chr  "home" "reputation" "reputation" "course" ...
 $ nursery   : chr  "yes" "no" "yes" "yes" ...
 $ internet  : chr  "yes" "yes" "no" "yes" ...
 $ guardian  : chr  "mother" "mother" "mother" "mother" ...
 $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
 $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
 $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
 $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
 $ famsup    : chr  "yes" "yes" "yes" "yes" ...
 $ paid      : chr  "yes" "no" "yes" "yes" ...
 $ activities: chr  "yes" "no" "yes" "yes" ...
 $ higher    : chr  "yes" "yes" "yes" "yes" ...
 $ romantic  : chr  "no" "yes" "no" "no" ...
 $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
 $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
 $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
 $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
 $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
 $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
 $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
 $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
 $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
 $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
 $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
 $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...

This time I’ve omitted showing head, as the table is pretty wide. However, the str output above shows the first values of each column, so that’ll do. Most of the variables are categorical, although the authors have used integers to encode some of the classes.
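If one wanted R to treat those integer-coded variables explicitly as categorical, they could be converted to factors. Below is a minimal sketch for a few columns; the column selection is my own choice for illustration, and across() requires dplyr 1.0 or newer.

# Hedged sketch: convert a few integer-coded columns to factors (illustration only)
factor_cols <- c("Medu", "Fedu", "traveltime", "studytime")
dataset_factored <- mutate(dataset, across(all_of(factor_cols), as.factor))
str(dataset_factored[factor_cols])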

Research question(s)

Next, I’ll explore four variables (absences, health, freetime, famrel) and their relationship with high or low alcohol consumption. My initial hypothesis is that a high value of absences corresponds to high alcohol usage (high_use=TRUE), whereas for health and famrel higher values indicate low alcohol consumption. I suspect that the freetime variable might not have any discernible relationship with alcohol usage, as it measures the amount of free time, not how the student spends it. A lot of free time only increases the number of opportunities to drink alcohol, which may or may not lead to actual consumption.

# Attach a ID number for each row and select only the columns-of-interest
data <- mutate(dataset, ID = row_number()) %>%
  select(any_of(c("ID", "high_use", "absences", "health", "freetime", "famrel")))
# Let's see...
summary(data[-1])  # Ignore ID
  high_use          absences        health         freetime        famrel     
 Mode :logical   Min.   : 0.0   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 FALSE:268       1st Qu.: 1.0   1st Qu.:3.000   1st Qu.:3.00   1st Qu.:4.000  
 TRUE :114       Median : 3.0   Median :4.000   Median :3.00   Median :4.000  
                 Mean   : 4.5   Mean   :3.573   Mean   :3.22   Mean   :3.937  
                 3rd Qu.: 6.0   3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:5.000  
                 Max.   :45.0   Max.   :5.000   Max.   :5.00   Max.   :5.000  
# Plot with ggplot2, line chart for absences and bar charts for others
p1 <- ggplot(data, aes(absences)) + stat_count(geom="line", aes(colour=high_use) )
p2 <- ggplot(data, aes(health)) + geom_bar(aes(fill=high_use), position="dodge")
p3 <- ggplot(data, aes(freetime)) + geom_bar(aes(fill=high_use), position="dodge")
p4 <- ggplot(data, aes(famrel)) + geom_bar(aes(fill=high_use), position="dodge")

# Use gridExtra package to present the figures side-by-side. Saves some screen space :)
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)

Well, guess I was wrong. freetime seems to have the clearest signal, as the peaks are at 3 for high_use=FALSE and at 4 for high_use=TRUE. For other plots, it’s kinda hard to see any signal. Probably this kind of visual inspection isn’t the best tool here. Let’s print out some numbers.

table(high_use = data$high_use, absences = cut(data$absences, include.lowest=T, breaks=c(0,1,5,10,15,20,30)))%>% proportions() %>% round(digits=3) %>% addmargins
        absences
high_use [0,1] (1,5] (5,10] (10,15] (15,20] (20,30]   Sum
   FALSE 0.237 0.303  0.132   0.021   0.008   0.003 0.704
   TRUE  0.068 0.111  0.058   0.039   0.011   0.011 0.298
   Sum   0.305 0.414  0.190   0.060   0.019   0.014 1.002
table(high_use = data$high_use, health = data$health)%>% proportions() %>% round(digits=3) %>% addmargins
        health
high_use     1     2     3     4     5   Sum
   FALSE 0.092 0.073 0.162 0.128 0.246 0.701
   TRUE  0.029 0.042 0.050 0.047 0.131 0.299
   Sum   0.121 0.115 0.212 0.175 0.377 1.000
table(high_use = data$high_use, freetime = data$freetime)%>% proportions() %>% round(digits=3) %>% addmargins
        freetime
high_use     1     2     3     4     5   Sum
   FALSE 0.039 0.128 0.296 0.183 0.055 0.701
   TRUE  0.005 0.042 0.105 0.107 0.039 0.298
   Sum   0.044 0.170 0.401 0.290 0.094 0.999
table(high_use = data$high_use, famrel = data$famrel) %>% proportions() %>% round(digits=3) %>% addmargins
        famrel
high_use     1     2     3     4     5   Sum
   FALSE 0.016 0.026 0.102 0.353 0.204 0.701
   TRUE  0.005 0.024 0.065 0.141 0.063 0.298
   Sum   0.021 0.050 0.167 0.494 0.267 0.999

So, for absences it looks like the FALSE dominates the ratio at low levels while TRUE gets a larger piece of the pie at high values. So I think my hypothesis was correct for that one. For freetime the proportion of TRUE increases from lowest to highest class, though it never dominates. I don’t see any clear signal in health and famrel tables. So, for my hypotheses:

  • absences: CORRECT
  • health: inconclusive
  • freetime: WRONG
  • famrel: inconclusive

Well, that’s that. The more you know!

Regression

Moving on to the logistic regression, we’ll see if The Machine has more insight than I have. Probably, as it is much better at number-crunching than I am.

# Fit a model!
model <- glm(high_use ~ absences + health + freetime + famrel, data=data, family="binomial")
summary(model)

Call:
glm(formula = high_use ~ absences + health + freetime + famrel, 
    family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0084  -0.8216  -0.6462   1.1891   2.0837  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.74200    0.66202  -2.631 0.008505 ** 
absences     0.08848    0.02235   3.960 7.51e-05 ***
health       0.11100    0.08657   1.282 0.199762    
freetime     0.41336    0.12472   3.314 0.000919 ***
famrel      -0.33352    0.12977  -2.570 0.010167 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.68  on 381  degrees of freedom
Residual deviance: 430.06  on 377  degrees of freedom
AIC: 440.06

Number of Fisher Scoring iterations: 4

So, absences, health and freetime have positive coefficients and famrel a negative one. So maybe I got that last one right after all? Anyway, health is not a significant variable here, so I’ll dump it and play with the remaining three.

# Re-fit without health
model <- glm(high_use ~ absences + freetime + famrel, data=data, family="binomial")
summary(model)

Call:
glm(formula = high_use ~ absences + freetime + famrel, family = "binomial", 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9318  -0.8233  -0.6580   1.1905   1.9597  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.43379    0.61550  -2.329 0.019835 *  
absences     0.08813    0.02236   3.942 8.09e-05 ***
freetime     0.41844    0.12419   3.369 0.000754 ***
famrel      -0.31322    0.12824  -2.443 0.014586 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.68  on 381  degrees of freedom
Residual deviance: 431.73  on 378  degrees of freedom
AIC: 439.73

Number of Fisher Scoring iterations: 4
# Odds ratio and confidence intervals
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
Waiting for profiling to be done...
cbind(OR,CI) # SHOW IT TO ME!
                   OR      2.5 %    97.5 %
(Intercept) 0.2384039 0.06958343 0.7831989
absences    1.0921339 1.04749241 1.1438386
freetime    1.5195946 1.19585392 1.9480047
famrel      0.7310919 0.56733732 0.9397040

So, the odds are defined as \(p / (1 - p)\), where \(p = P(Y=1)\) is the probability that our target is TRUE. The odds ratio (OR) reported above for an explanatory variable is the multiplicative change in these odds when that variable increases by one unit. So when the OR is greater than 1, increasing the explanatory variable makes the target more likely to be TRUE, and when it is below 1, less likely. Based on the numbers, a higher number of absences and more available free time correspond to higher odds of high alcohol consumption, while better family relations decrease those odds (a quick prediction check follows the list below). So, let’s revise my hypotheses:

  • absences: CORRECT
  • health: inconclusive (statistically not significant)
  • freetime: WRONG
  • famrel: CORRECT
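As a final sanity check on reading the odds ratios, here is what the fitted model predicts for a single hypothetical student; the input values are made up purely for illustration.

# Predicted probability of high_use for one made-up student
new_student <- data.frame(absences = 10, freetime = 4, famrel = 3)
predict(model, newdata = new_student, type = "response")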

References

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. Web Link



Clustering and classification

A new week, a new section! This week is about clustering the data. As per usual, here are the libraries I’m using.

# Used knitr options: No "##" before output
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(reshape2)
library(tidyr)
library(MASS)  # Contains the used dataset
library(corrplot)
set.seed(2020)

Data description

Unlike in previous weeks, this week we are using a dataset that we have not wrangled ourselves beforehand! The data is the Boston dataset, available in the R package MASS. Below is a table that lists the columns in the dataset. Full disclosure: I lifted the descriptions directly from the R documentation (accessible by typing ?Boston in the R console).

| Column | Description |
|--------|-------------|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |

The data is titled Housing Values in Suburbs of Boston. The dataset contains 14 variables and 506 rows. All columns are numerical, although the variable chas is actually a boolean indicator represented as ones and zeros. Also as per usual, here are the outputs of the str, head and summary functions for the Boston dataset:

data("Boston")  # Creates variable named Boston
str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
knitr::kable(head(Boston, 5))
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

Looking at the summary, a couple of notable things are visible. First, the crim (per capita crime rate by town) variable is under 4 for most of the rows, but the maximum is 88.98, almost three orders of magnitude larger than the median value (0.26)! The high discrepancy could be explained by the town in question having a small population, but the data doesn’t include that value. The variable zn (proportion of residential land zoned for lots over 25,000 sq.ft.) exhibits similar behavior: the minimum and median values are both 0 and the maximum value is 100. This tells us that most of the towns have no residential lots over 25,000 sq.ft. in size. The same goes for the variable chas, which simply denotes which areas bound the Charles River and which do not: the summary shows that at least 75% of the rows are not next to the river. For drawing more insight from the rest of the data, let’s plot a couple of pictures.

ggpairs(Boston)

Okay, so the ggpairs plot out-of-the-box isn’t the clearest presentation. Let’s try some alternatives… First, let’s take a look at the density plots for each variable. (I finally figured out how to do this, thanks StackOverflow! Yay, go me.) The only variable resembling a normal distribution is rm, which is the average number of rooms per dwelling. Okay, that makes sense as it is an average value. Variables indus, rad and tax show two peaks, meaning that there are more towns with “low” and “high” values than “middle” values. The housing stock seems to be old, as the density increases (almost monotonically) towards larger values of age – recall that age is the proportion of owner-occupied units built before 1940. Variables nox, dis, lstat and medv are positively skewed and ptratio is negatively skewed.

longtable <- melt(Boston)
ggplot(longtable, aes(value)) + geom_density() + facet_wrap(~variable, scales="free")

Interestingly, the variable black looks like a mirrored version of crim. It’s a derived value related to the proportion of black people. I don’t know how the original dataset authors arrived at the function (\(1000\left(Bk - 0.63\right)^2\)), but the density graph shows that in most of the towns the proportion of black people deviates strongly from 0.63. Plotting the function shows that when its value is over 375.38 (i.e. the 1st quartile), the proportion of black people is under 0.02.

x <- 0:100 / 100
y <- 1000*(x-0.63)^2
plot(x,y, xlab="Proportion of black people", ylab="Function value")

# Show the values of x when y > 375.38 (i.e. the value of 1st quantile)
x[y > 375.38]
[1] 0.00 0.01

Finally, let’s plot the correlations between all variables using the fancy corrplot function from the corrplot library. From the picture, it looks like rad and tax are strongly (positively) correlated. That implies that the towns with better access to radial highways also have higher property-tax rates. Other notable correlations are indus, nox and age each correlating negatively with dis, and lstat correlating negatively with medv. Notably, the further away a town is from the employment centres, the less industrial activity and the less pollution (indicated by the NOx concentration) it has.

corrplot(cor(Boston), method="circle")
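To back up the visual impression with numbers, the largest pairwise correlations can also be listed directly; a small sketch using the reshape2 and dplyr functions already loaded:

# List the strongest pairwise correlations as numbers
cc <- cor(Boston)
cc[upper.tri(cc, diag = TRUE)] <- NA           # keep each variable pair only once
cor_long <- na.omit(melt(cc))                  # long format: Var1, Var2, value
head(arrange(cor_long, desc(abs(value))), 5)   # five largest absolute correlations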

Data wrangling

Whew, that was a lot of text for simply exploring the data! Now I have to actually do something with it, so let’s get to it! First, the data needs to be standardized, which is done by subtracting the column mean from the values and dividing by the column standard deviation. After that, each column has a mean of zero, and the values express “how many standard deviations away from the mean” an observation is, instead of the original units. However, the shapes of the density curves do not change, which is also what we want.

scaled <- scale(Boston)

# scale() returns a matrix object, change it to a data.frame
scaled <- as.data.frame(scaled)
summary(scaled)
      crim                 zn               indus              chas        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
 Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
      nox                rm               age               dis         
 Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
 1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
 Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
 Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
      rad               tax             ptratio            black        
 Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
 1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
 Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
 Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
     lstat              medv        
 Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 3.5453   Max.   : 2.9865  
longtable <- melt(scaled)
No id variables; using all as measure variables
ggplot(longtable, aes(value)) + geom_density() + facet_wrap(~variable, scales="free")
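As a quick sanity check that scale() really does the subtraction and division described above, one column can be standardised by hand and compared to the scaled data frame; a minimal sketch:

# The manually standardised column should match the scale() result
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, scaled$crim)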

In this exercise, we’re ultimately trying to predict the crime rate using Linear Discriminant Analysis (LDA). To do this, we first need a categorical target variable for the crime rate. Here we create one by cutting crim into four classes at its quartiles.

# create a quantile vector of crim and print it
bins <- quantile(scaled$crim)
bins
          0%          25%          50%          75%         100% 
-0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610 
# create a categorical variable 'crime' and take a look using table()
crime <- cut(scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
table(crime)
crime
     low  med_low med_high     high 
     127      126      126      127 
# remove original `crim` and add the new `crime` variable to the dataset
# Here I need to explicitly call dplyr::select, as MASS library also has a select function
scaled <- dplyr::select(scaled, -crim)
scaled <- data.frame(scaled, crime)

Next up, we’ll have to train and test the LDA model with our data. So, let’s split the data into training and testing sets using an 80:20 split. The test data shouldn’t contain the true labels when we predict with the model, but we need those labels later to verify the predictions, so let’s store them in a separate object.

# number of rows in the Boston dataset 
n <- nrow(scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train and test sets
train <- scaled[ind,]
test <- scaled[-ind,]  # note the minus!

# save the correct classes from test data
correct_classes_for_test <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Linear Discriminant Analysis

Now that our data is in a suitable format, let’s get to business. Luckily, fitting the LDA is a one-liner. Here our target is crime and explanatory variables are everything else (denoted with .). Also, I shamelessly copied the biplot plotting code from Datacamp exercises (because programming it would be too difficult at this point). Let’s see what happens.

lda.fit <- lda(crime ~ ., data = train)
lda.fit  # Just checking the output
Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2549505 0.2400990 0.2599010 0.2450495 

Group means:
                 zn      indus        chas        nox         rm        age
low       0.9753697 -0.9434514 -0.08120770 -0.8837233  0.4325390 -0.9381520
med_low  -0.1157120 -0.2855907  0.01179157 -0.5543991 -0.1296850 -0.3048100
med_high -0.3631007  0.1961002  0.25261763  0.3994375  0.1537163  0.3898780
high     -0.4872402  1.0149946 -0.07348562  1.0385550 -0.3794012  0.8013618
                dis        rad        tax     ptratio       black       lstat
low       0.9342664 -0.6785867 -0.7383022 -0.44674754  0.37958778 -0.77846727
med_low   0.3208755 -0.5591880 -0.4598702 -0.04993701  0.31380803 -0.15663318
med_high -0.4035941 -0.4065439 -0.3123179 -0.43168810  0.09591388 -0.05170852
high     -0.8422717  1.6596029  1.5294129  0.80577843 -0.68079981  0.90402871
                medv
low       0.53480421
med_low  -0.02744276
med_high  0.24205894
high     -0.73842724

Coefficients of linear discriminants:
                LD1          LD2         LD3
zn       0.07214555  0.610195519 -0.85295795
indus    0.04785751 -0.409068774  0.13049624
chas    -0.08241604 -0.009705042  0.09797199
nox      0.22690421 -0.714147079 -1.42365374
rm      -0.09277179 -0.062028307 -0.11140387
age      0.26316698 -0.346412935  0.02425944
dis     -0.08409444 -0.147792348 -0.02421958
rad      3.41887302  0.670047867 -0.31331749
tax      0.01133046  0.319706540  0.70765000
ptratio  0.08939538  0.134648187 -0.18848405
black   -0.12840758 -0.012087855  0.07913283
lstat    0.22188192 -0.191386778  0.28940556
medv     0.19374414 -0.370964344 -0.30837420

Proportion of trace:
   LD1    LD2    LD3 
0.9470 0.0412 0.0119 
# Function lifted directly from DataCamp exercises
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Okay, that’s a colorful mess. Looking at the arrows, rad stands out quite nicely. Printing the fit also shows that it has by far the largest coefficient for the first linear discriminant LD1, and one of the largest for LD2. Other notable variables would be zn and nox, with high-ish coefficients for LD2 and LD3. Also, the crime==high points are all nicely together on the right side of the picture, though some med_high points have ended up there, too.
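To double-check that impression numerically, the discriminant coefficients can be sorted by absolute size; a small sketch using the scaling component of the fitted lda object:

# Variables ordered by the magnitude of their LD1 coefficient
sort(abs(lda.fit$scaling[, "LD1"]), decreasing = TRUE)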

So, what about our model’s predictions? Predict with test data and cross tabulate the predictions with the actual labels, got it!

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes_for_test, predicted = lda.pred$class) %>% addmargins
          predicted
correct    low med_low med_high high Sum
  low       12       7        5    0  24
  med_low    8      17        4    0  29
  med_high   1      11        8    1  21
  high       0       0        1   27  28
  Sum       21      35       18   28 102

Well, it looks like the model is successful in predicting the crime rate classes – up to a point. It is a good sign that, for each predicted class, the diagonal element is the largest value in its column, as the diagonal shows the correct predictions. Looking at the numbers in more detail reveals some interesting things, though. The high class is clearly the easiest to predict: the model got 27 of the 28 high rows right, with only one false positive. For the other classes the model didn’t perform as well. It did worst on med_high, predicting only 8 of the 21 rows correctly with 10 false positives (more than half of the med_high predictions). med_low collected the most false positives in absolute terms: 17 of the 29 rows were correct, but 18 of the 35 med_low predictions were wrong – about half of the predictions in that class. Finally, 12 of the 24 low rows were predicted correctly, with 9 false positives. The numbers are summarised in the table below, followed by an overall accuracy figure.

| Class | Correct | Misses | False positives |
|-------|---------|--------|-----------------|
| low | 12/24 (50%) | 12/24 | 9/21 (43%) |
| med_low | 17/29 (59%) | 12/29 | 18/35 (51%) |
| med_high | 8/21 (38%) | 13/21 | 10/18 (56%) |
| high | 27/28 (96%) | 1/28 | 1/28 (4%) |
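For a single summary number, the overall accuracy can also be computed directly from the prediction objects defined above; from the table it is (12 + 17 + 8 + 27) / 102, roughly 63%. A one-line check:

# Overall share of correctly classified test rows
mean(lda.pred$class == correct_classes_for_test)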

k-means Clustering

Finally, we’ll practice some clustering with the “clustering hello world algorithm”, also known as k-means clustering. Basically this algorithm creates k clusters by finding k points (the “k-points”, i.e. cluster centres) that minimize the within-cluster sum of squared distances between the k-points and the data points assigned to them (though it’s not foolproof: the outcome depends on where the initial k-points are). Each data point is assigned to the nearest k-point, which forms the cluster.
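To make the idea concrete, here is a toy sketch of a single k-means iteration (assignment step followed by update step) on made-up two-dimensional points. This is only an illustration of the principle; it is not the exact algorithm that kmeans() runs by default.

# Toy illustration of one k-means iteration (not part of the exercise)
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)       # ten made-up points in two dimensions
centers <- pts[sample(nrow(pts), 2), ]   # pick two data points as initial k-points

# Assignment step: each point goes to the nearest k-point (squared Euclidean distance)
cluster <- apply(pts, 1, function(p) which.min(colSums((t(centers) - p)^2)))

# Update step: each k-point moves to the mean of the points assigned to it
centers <- apply(pts, 2, function(column) tapply(column, cluster, mean))
centers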

Moving on, now that short (and probably a bit unnecessary) introduction is out of the way, I’ll reload the original Boston dataset and scale it once again for clustering.

# Reload original Boston dataset
data('Boston')

# Scale the dataset
boston_scaled <- scale(Boston) %>% as.data.frame()
# I won't repeat the summary here, it's the same as shown previously...

# Calculate euclidean distance and show it
dist_eu <- dist(boston_scaled)
summary(dist_eu)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1343  3.4625  4.8241  4.9111  6.1863 14.3970 
# Let's calculate also manhattan distance for good measure
dist_man <- dist(boston_scaled, method="manhattan")
summary(dist_man)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2662  8.4832 12.6090 13.5488 17.7568 48.8618 

Okay, next up is the actual clustering. Let’s start with three k-points and see the result.

km <-kmeans(boston_scaled, centers = 3)
# Plot a subset of boston_scaled data with clusters colored, just to see how it looks like
pairs(boston_scaled[5:9], col = km$cluster)

I don’t know whose idea it was to have black, red and green as the first three default colors in R… I just hope that these shades of red and green are distinct enough that (red-green) colorblind people can tell the difference. And wow, the clusters swap colors when I replot the figure! Yeah, I really have to learn the ggplot2 library by heart if I’m ever going to use R outside this course. But I digress. Moving on…

Multiple methods for figuring out the optimal value of k exist, but here we use a metric called the total within-cluster sum of squares (WCSS). A good candidate for k is the point where the WCSS drops radically! The kmeans function returns an object that handily provides the WCSS value in its tot.withinss component. The DataCamp exercises graciously include a one-liner to calculate the WCSS for k = 1…k_max, so I’m totally going to use that one! Let’s set k_max to 10 to limit the iterations a bit, and then we’ll look for the optimal value visually – i.e. by plotting WCSS(k).

# Maximum number of clusters considered here
k_max <- 10

# This one is also copied from DataCamp...
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualise!
qplot(x = 1:k_max, y = twcss, geom = c('line', 'point'))

The most radical decrease is from k=1 to k=2, so the value 2 seems to be a great candidate. The WCSS keeps dropping up to about k=9, though, so something like 5 or 7 might also be a potential candidate. But k=2 means there will be only two colors in the upcoming plots, making them a bit easier to interpret. So I’ll use k=2.
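Instead of only eyeballing the curve, the successive drops can also be printed from the twcss vector computed above (a small sketch; the exact numbers vary between runs because kmeans starts from random centres):

# Change in total WCSS when increasing the number of clusters by one
round(diff(twcss))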

# Re-cluster the data
km <-kmeans(boston_scaled, centers = 2)
# 
# plot the Boston dataset with clusters
# This time the full dataset with ggpairs -- it's still going to be a mess, but it'll be a pretty mess.
ggpairs(boston_scaled, aes(col=as.factor(km$cluster)))

# This chunk emits a bunch of "Warning in cor(x, y): the standard deviation is zero" warning messages.

Yep, that’s a pretty looking mess. Also that chunk emits a bunch of Warning in cor(x, y): the standard deviation is zero warning messages (I’ve omitted those as they’re wasting space). Looks like the correlations for zn in the red class are NA, so that’s probably where the standard deviation is zero. Anyways, let’s first plot the density plots like before, and then check only a handful of variables with larger pictures.

with_colors <- boston_scaled
with_colors$cluster <- as.factor(km$cluster)
longtable <- melt(with_colors)
ggplot(longtable, aes(value, col=cluster)) + geom_density() + facet_wrap(~variable, scales="free")

Well, that’s interesting! Some of the variables now show two differing density curves – essentially meaning that there could be a way to separate the data into two different groups. If you recall my pondering from the data exploration, I noted that the indus, rad and tax variables show two peaks – well, the two peaks seem to belong to different clusters! However, what I couldn’t see before was that age, dis, ptratio, and arguably lstat and medv, also look like combinations of two different (and separable) curves. nox being clearly separable is no surprise in hindsight, when you already know that industrial zones typically cause more pollution than residential zones.

Let’s pick some variables that look interesting.

#with_colors <- boston_scaled
#with_colors$cluster <- as.factor(km$cluster)
subset <- dplyr::select(with_colors, c("indus", "age", "dis", "rad", "tax", "medv", "cluster"))
#subset$cluster <- as.factor(subset$cluster)
ggpairs(subset, aes(col=cluster))

Well, they do look interesting. For example, tax looks to be separated quite nicely, same with dis and age. Towns with older housing stock lie close to the employment centres, have more industry, and have higher property-tax rates? That would be one plausible reading of this data.

Bonus task

Surprise! There are two bonus tasks this week! There were some last week too, but I didn’t do those. It’s getting a bit late, so this’ll have to be short. The first task is to run k-means with some number of clusters (>2) and then fit an LDA with the clusters as the target. I chose 7 clusters. For that fit, the most influential variables seem to be chas and rad. As for the second bonus task, well, I can’t manage everything.

# Reload original Boston dataset
data('Boston')

# Scale the dataset
bonus_scaled <- scale(Boston) %>% as.data.frame
summary(bonus_scaled)
      crim                 zn               indus              chas        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
 Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
      nox                rm               age               dis         
 Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
 1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
 Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
 Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
      rad               tax             ptratio            black        
 Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
 1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
 Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
 Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
     lstat              medv        
 Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 3.5453   Max.   : 2.9865  
# k-means clustering with 7 clusters
km <-kmeans(bonus_scaled, centers = 7)
summary(km)
             Length Class  Mode   
cluster      506    -none- numeric
centers       98    -none- numeric
totss          1    -none- numeric
withinss       7    -none- numeric
tot.withinss   1    -none- numeric
betweenss      1    -none- numeric
size           7    -none- numeric
iter           1    -none- numeric
ifault         1    -none- numeric
# Add clusters to dataset
bonus_scaled$cluster <- km$cluster

# LDA fit targetting clusters
lda.fit <- lda(cluster ~ ., data = bonus_scaled)
lda.fit  # Just checking the output
Call:
lda(cluster ~ ., data = bonus_scaled)

Prior probabilities of groups:
         1          2          3          4          5          6          7 
0.09683794 0.09288538 0.06916996 0.15415020 0.17588933 0.34387352 0.06719368 

Group means:
        crim         zn      indus       chas        nox         rm         age
1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554  1.5399833  0.07933756
2 -0.2834643 -0.4872402  1.5965043 -0.2723291  1.0453673 -0.6245590  0.94375129
3  1.4802645 -0.4872402  1.0149946 -0.2723291  0.9676887 -0.2969389  0.76560159
4 -0.4126954  1.9952361 -1.1032525 -0.2218534 -1.1552982  0.6080657 -1.40363754
5  0.9722616 -0.4872402  1.0149946 -0.2723291  1.0010692 -0.4790990  0.74608713
6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
7 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.37213224
         dis          rad         tax    ptratio       black      lstat
1 -0.2868574 -0.520140581 -0.82627237 -1.0314738  0.35523433 -0.9636997
2 -0.8963217 -0.622669809  0.03772813 -0.2084479 -0.08717824  0.7185475
3 -0.8580043  1.659602895  1.52941294  0.8057784 -3.29705640  1.1699052
4  1.5772490 -0.627024335 -0.57588304 -0.7161406  0.35335146 -0.9028481
5 -0.8009959  1.659602895  1.52941294  0.8057784  0.16590165  0.8119640
6  0.2769540 -0.587168164 -0.59021294  0.1702603  0.30993538 -0.1365045
7 -0.4033382  0.001081444 -0.09756330 -0.3924585  0.17154271 -0.1643525
        medv
1  1.6147330
2 -0.5152915
3 -1.0473739
4  0.6621944
5 -0.6627877
6 -0.1747229
7  0.5733409

Coefficients of linear discriminants:
                 LD1         LD2          LD3         LD4         LD5
crim    -0.032848874 -0.11853833 -0.121992191 -0.04587015  0.02028959
zn      -0.236386535 -0.11164525 -0.892168260  0.07752273 -1.47340439
indus    0.122244436 -0.61676862  1.373115707 -0.29017671 -1.77461828
chas     5.826738406  0.17279307 -0.348402443 -0.10993258 -0.06074649
nox      0.007038657  0.41797220  0.349383719  0.25040677 -0.36306581
rm      -0.082150352 -0.04052241 -0.161363218 -0.03368422  0.13175307
age      0.072083760 -0.01514390  0.355986612 -0.06494734  0.22785773
dis      0.073348439  0.30301112 -0.212017094  0.04270701 -0.55076413
rad      0.245600390 -3.13810821 -1.153371145  1.57770009  1.03068941
tax     -0.002101628  0.25323107 -0.371868363  0.07772706 -0.49231336
ptratio -0.002263332  0.04753426  0.008400806 -0.14134197 -0.03644768
black    0.051112837  0.90346968  0.535825097  2.21912811 -0.18944422
lstat   -0.126706573 -0.15549426 -0.018983527 -0.04794742 -0.02923982
medv    -0.185674589  0.31410850 -0.008748079 -0.13496368  0.21260332
                LD6
crim    -0.12706552
zn      -0.35055644
indus   -0.40144745
chas     0.27933856
nox     -0.44219930
rm      -0.56819492
age     -0.24907323
dis      0.14176355
rad     -0.05116781
tax      0.20960416
ptratio  0.32614059
black    0.01111638
lstat   -0.25238312
medv    -0.88020068

Proportion of trace:
   LD1    LD2    LD3    LD4    LD5    LD6 
0.5738 0.2535 0.0673 0.0566 0.0353 0.0134 
# target classes as numeric
classes <- as.numeric(bonus_scaled$cluster)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
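
To back up that earlier claim about chas and rad being the most influential variables, the linear discriminant coefficients can be ranked by magnitude. A small sketch using the lda.fit object from above (output omitted):

# The scaling component holds the coefficients of the linear discriminants;
# large absolute values mean a strong influence on that discriminant
ld_coef <- lda.fit$scaling
head(sort(abs(ld_coef[, "LD1"]), decreasing = TRUE), 3)
head(sort(abs(ld_coef[, "LD2"]), decreasing = TRUE), 3)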

Back to top.


Dimensionality reduction

Another week, another go. This is gonna be a short one, got sidetracked with actual work this week…

Starting with the setup for this week. New libraries: stringr, FactoMineR.

# Used knitr options: No "##" before output
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(reshape2)
library(tidyr)
library(corrplot)
library(stringr)
library(FactoMineR)  # Contains the "tea" dataset

Data overview

This time we're working with a wrangled dataset! See the creation script here. The original data is from the Human Development Reports by the United Nations Development Programme, see http://hdr.undp.org/en/content/human-development-index-hdi . The datasets used in the data wrangling were provided by the IODS lecturers.

Starting with the data overview, first is the classic table of columns and their meaning:

| Variable | Meaning | Type | Notes |
|----------|---------|------|-------|
| edu2.ratio | Females-to-males ratio of people with secondary or higher education | num | edu2.F / edu2.M |
| lab.ratio | Females-to-males ratio of people in the work force | num | lab.F / lab.M |
| exp.edu.years | Expected years in education | num | |
| exp.life.years | Life expectancy at birth | num | |
| GNI | Gross National Income per capita | num | |
| mat.mort.rate | Maternal mortality rate | int | |
| adol.birth.rate | Adolescent birth rate | num | |
| parl.F.ratio | Ratio of females in the parliament | num | |
dataset <- read.csv("data/human.csv", header=TRUE, row.names=1)
str(dataset)
'data.frame':   155 obs. of  8 variables:
 $ edu2.ratio     : num  1.007 0.997 0.983 0.989 0.969 ...
 $ lab.ratio      : num  0.891 0.819 0.825 0.884 0.829 ...
 $ exp.edu.years  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
 $ exp.life.years : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
 $ GNI            : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
 $ mat.mort.rate  : int  4 6 6 5 6 7 9 28 11 8 ...
 $ adol.birth.rate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
 $ parl.F.ratio   : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(dataset)
   edu2.ratio       lab.ratio      exp.edu.years   exp.life.years 
 Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
 1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
 Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
 Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
 3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
 Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
      GNI         mat.mort.rate    adol.birth.rate   parl.F.ratio  
 Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
 1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
 Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
 Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
 3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
 Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50  
knitr::kable(head(dataset, 5))
edu2.ratio lab.ratio exp.edu.years exp.life.years GNI mat.mort.rate adol.birth.rate parl.F.ratio
Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8 39.6
Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1 30.5
Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9 28.5
Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1 38.0
Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2 36.9
ggpairs(dataset)

cor(dataset) %>% corrplot

Looking at the medians of the education and labour ratios, both are below one, meaning that in most countries more men than women are educated and involved in the labour force. The education ratio's median is close to one, though even its 3rd quartile stays just below one. Maternal mortality rate and adolescent birth rate correlate positively with each other and negatively with life expectancy, education, and GNI.
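
To put rough numbers behind those statements, the relevant entries of the correlation matrix can be pulled out directly. A quick sketch (values omitted here, since they are visible in the corrplot above):

# Correlations of the mortality/birth-rate variables with each other and
# with life expectancy, education and GNI
round(cor(dataset)[c("mat.mort.rate", "adol.birth.rate"),
                   c("adol.birth.rate", "exp.life.years", "exp.edu.years", "GNI")], 2)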

Principal component analysis

First, I’ll perform principal component analysis (PCA) using the singular value decomposition, but without scaling the dataset.

pca_unscaled <- prcomp(dataset)
pca_unscaled
Standard deviations (1, .., p=8):
[1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
[6] 1.565912e+00 1.912052e-01 1.591112e-01

Rotation (n x k) = (8 x 8):
                          PC1           PC2           PC3           PC4
edu2.ratio      -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
lab.ratio        2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
exp.edu.years   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
exp.life.years  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
GNI             -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
mat.mort.rate    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
adol.birth.rate  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
parl.F.ratio    -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
                          PC5           PC6           PC7           PC8
edu2.ratio      -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
lab.ratio        0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
exp.edu.years    0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
exp.life.years   0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
GNI             -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
mat.mort.rate    0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
adol.birth.rate  0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
parl.F.ratio    -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
# Drawing the biplot for the first 2 PCA dimensions
biplot(pca_unscaled, choices = 1:2, cex=c(0.8,1), col=c("#66CC66", "#663399")) # purple color is "rebecca"

Yup, that's a mess. PCA assumes the variables are on comparable scales, so the GNI column – some three to four orders of magnitude larger than the other variables – completely dominates the first principal component.
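
A quick check of the column scales confirms the problem: GNI's standard deviation dwarfs everything else. A small sketch (numbers omitted):

# Standard deviation of each column in the unscaled data;
# GNI is orders of magnitude larger than the ratio variables
sapply(dataset, sd) %>% round(1)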

scaled <- scale(dataset)
pca_scaled <- prcomp(scaled)
pca_scaled
Standard deviations (1, .., p=8):
[1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
[8] 0.3222406

Rotation (n x k) = (8 x 8):
                        PC1         PC2         PC3         PC4        PC5
edu2.ratio      -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
lab.ratio        0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
exp.edu.years   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
exp.life.years  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
GNI             -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
mat.mort.rate    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
adol.birth.rate  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
parl.F.ratio    -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
                        PC6         PC7         PC8
edu2.ratio       0.17713316  0.05773644  0.16459453
lab.ratio       -0.03500707 -0.22729927 -0.07304568
exp.edu.years   -0.38606919  0.77962966 -0.05415984
exp.life.years  -0.42242796 -0.43406432  0.62737008
GNI              0.11120305 -0.13711838 -0.16961173
mat.mort.rate    0.17370039  0.35380306  0.72193946
adol.birth.rate -0.76056557 -0.06897064 -0.14335186
parl.F.ratio     0.13749772  0.00568387 -0.02306476
# Drawing the biplot for the first 2 PCA dimensions
biplot(pca_scaled, choices = 1:2, cex=c(0.8,1), col=c("#66CC66", "#663399")) # purple color is "rebecca"

Now this looks a bit better. Based on this analysis, a country's healthcare (life expectancy, maternal mortality, adolescent pregnancies), the education of women, and income levels are the most significant indicators in this particular dataset. The first dimension is a women's education and healthcare axis: richer and more educated countries have better healthcare than poor countries – or, put another way, educated women have fewer pregnancies. The second dimension is about how much women are involved in society: are they working or staying at home (for whatever reason)? Are women in positions of power?
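
To see how much of the total variation the first two components actually capture, the variance proportions can be read straight from the PCA object; from the standard deviations reported above, the first two components cover roughly (2.07^2 + 1.14^2) / 8 ≈ 70% of the total variance of the eight scaled variables. A small sketch (output omitted):

# Proportion of variance (and cumulative proportion) explained by each PC
summary(pca_scaled)$importance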

Multiple Correspondence Analysis

Wrapping up this week is multiple correspondence analysis (MCA). For this analysis, we're using the tea dataset from the FactoMineR library. It is a dataset of questionnaire answers regarding tea. According to the documentation, the questionnaire asked 300 individuals about their tea-drinking habits (18 questions), their perception of the product (12 questions) and personal details (4 questions). So the data has 300 rows and 36 columns, with one individual per row and one question per column.

# using `tea` dataset from FactoMineR library
data(tea)
str(tea)
'data.frame':   300 obs. of  36 variables:
 $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
 $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
 $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
 $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
 $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
 $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
 $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
 $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
 $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
 $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
 $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
 $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
 $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
 $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
 $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
 $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
 $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
 $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
 $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
 $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
 $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
 $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
 $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
 $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
 $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
 $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
 $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
 $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
 $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
 $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
 $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
         breakfast           tea.time          evening          lunch    
 breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
 Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
        dinner           always          home           work    
 dinner    : 21   always    :103   home    :291   Not.work:213  
 Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
                                                                
                                                                
                                                                
                                                                
                                                                
        tearoom           friends          resto          pub     
 Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
 tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
                                                                  
                                                                  
                                                                  
                                                                  
                                                                  
        Tea         How           sugar                     how     
 black    : 74   alone:195   No.sugar:155   tea bag           :170  
 Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
 green    : 33   milk : 63                  unpackaged        : 36  
                 other:  9                                          
                                                                    
                                                                    
                                                                    
                  where                 price          age        sex    
 chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
 chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
 tea shop            : 30   p_private label: 21   Median :32.00          
                            p_unknown      : 12   Mean   :37.05          
                            p_upscale      : 53   3rd Qu.:48.00          
                            p_variable     :112   Max.   :90.00          
                                                                         
           SPC               Sport       age_Q          frequency  
 employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
 middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
 non-worker  :64                       35-44:40   +2/day     :127  
 other worker:20                       45-59:61   3 to 6/week: 34  
 senior      :35                       +60  :38                    
 student     :70                                                   
 workman     :12                                                   
             escape.exoticism           spirituality        healthy   
 escape-exoticism    :142     Not.spirituality:206   healthy    :210  
 Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
                                                                      
                                                                      
                                                                      
                                                                      
                                                                      
         diuretic             friendliness            iron.absorption
 diuretic    :174   friendliness    :242   iron absorption    : 31   
 Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
                                                                     
                                                                     
                                                                     
                                                                     
                                                                     
         feminine             sophisticated        slimming          exciting  
 feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
 Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
                                                                               
                                                                               
                                                                               
                                                                               
                                                                               
        relaxing              effect.on.health
 No.relaxing:113   effect on health   : 66    
 relaxing   :187   No.effect on health:234    
                                              
                                              
                                              
                                              
                                              
tea_no_age <- dplyr::select(tea, -age)

tea_no_age %>% dplyr::select(1:9) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

tea_no_age %>% dplyr::select(10:18) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

tea_no_age %>% dplyr::select(19:27) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

tea_no_age %>% dplyr::select(28:35) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

tea %>% ggplot(aes(age)) + geom_density()

Looks to be an interesting dataset, though it's a bit difficult to visualise properly. Let's pick a few columns and perform MCA on them.

keep_cols <- c("age_Q", "sex", "frequency", "Tea", "price", "sugar", "How")
data <- dplyr::select(tea, one_of(keep_cols))
mca <- MCA(data, graph=FALSE)
mca$eig
       eigenvalue percentage of variance cumulative percentage of variance
dim 1  0.25377239               9.349509                          9.349509
dim 2  0.21962466               8.091435                         17.440944
dim 3  0.19242745               7.089432                         24.530377
dim 4  0.18126192               6.678071                         31.208447
dim 5  0.17107329               6.302700                         37.511147
dim 6  0.16281205               5.998339                         43.509486
dim 7  0.15351935               5.655976                         49.165462
dim 8  0.14892416               5.486680                         54.652142
dim 9  0.13900476               5.121228                         59.773370
dim 10 0.13855325               5.104593                         64.877963
dim 11 0.12861410               4.738414                         69.616377
dim 12 0.12434570               4.581157                         74.197535
dim 13 0.12112984               4.462678                         78.660213
dim 14 0.12094719               4.455949                         83.116162
dim 15 0.11055570               4.073105                         87.189267
dim 16 0.10387947               3.827138                         91.016405
dim 17 0.08964207               3.302603                         94.319008
dim 18 0.08046729               2.964584                         97.283592
dim 19 0.07373107               2.716408                        100.000000
mca$var$contrib
                       Dim 1       Dim 2        Dim 3      Dim 4       Dim 5
15-24           7.913377e+00 10.97033963  1.708701684  1.9968789  2.07232894
25-34           4.056933e+00  7.85507712  0.071367598 12.0683981  1.93779594
35-44           1.446520e+00  0.27107440  0.355201941  4.6803402 18.16844062
45-59           4.541041e+00  0.15278159 14.827082242  0.4401140  0.03107965
+60             9.968695e+00  1.79028331  6.724952355  0.3337406 14.33152539
F               6.581611e-01 14.09573560  0.110315981  0.1234269  0.49893018
M               9.602678e-01 20.56590932  0.160952824  0.1800818  0.72794731
1/day           1.824514e+00  3.78037554 11.165695773  0.1886927  0.89022035
1 to 2/week     2.479474e+00  1.25863987  5.960546982  0.2467699  0.38723509
+2/day          3.942764e+00  5.11421794  0.104603818  0.1316144  0.45963324
3 to 6/week     4.474078e-02  0.02420935  4.765988678  0.3481716  4.75035719
black           1.515701e+01  3.18701783  1.741831467  5.1900990  0.23013408
Earl Grey       7.428548e+00  2.99324399  0.098507993  0.1065889  0.06568091
green           5.796958e-01  2.28218164  7.482241840 17.6488807  1.79066476
p_branded       7.263746e-01  1.51913249  6.766248448  0.2991279  6.29340604
p_cheap         2.045066e-01  0.35175914  0.896865300  5.3912908 12.86899210
p_private label 1.432982e+00  0.42811026  0.006741122  1.3641797  1.13413645
p_unknown       1.516061e+00  3.10008365  3.879587807 17.0230069  1.91080210
p_upscale       7.190898e+00  6.11545398  0.646959738  2.4477659  0.18756054
p_variable      6.321193e-02  1.63170789  4.905228919  0.7340771 12.34728571
No.sugar        1.012510e+01  1.95243120  3.827438813  0.5567729  1.51225785
sugar           1.082338e+01  2.08708163  4.091400111  0.5951710  1.61655149
alone           1.149679e-01  2.86953762  0.063490448  5.8662485  2.90514235
lemon           3.009857e-01  1.93018383  6.575250633  0.8332322 12.64607902
milk            9.033844e-04  3.64282402  5.882342043 18.8400640  0.22131975
other           6.498879e+00  0.03060716  7.180455442  2.3652654  0.01449295
plot(mca, invisible=c("ind"), habillage="quali")

So I decided to go with the person's age and gender, how often they drink tea, what kind of tea they drink, how expensive their tea is, and whether they add something (sugar, milk) to it. Putting those in and taking the model out, let's see… Okay, so my set of variables wasn't very good: the first two dimensions explain only 17.4% of the (cumulative) variance! No wonder the biplot looks like a scatterplot with dots all over the place.
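
One way to dig a little deeper into what drives the first two dimensions, despite the low explained variance, is FactoMineR's dimdesc function, which lists the variables and categories most strongly associated with each dimension. A small sketch (output omitted):

# Describe MCA dimensions 1 and 2 by their most strongly associated variables/categories
dimdesc(mca, axes = 1:2)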

This week’s diary has been really short compared to other weeks… Well, such is life, there’s so much to do and so little time.

Back to top.


Analysis of longitudinal data

Hoo boy. It's the last week of the IODS course, and this time we're analyzing data that includes non-independent measurements (e.g. repeated measurements of a test subject over time). Let's start with the libraries and data exploration!

# Setup block
# Remove the '##' from each line of R output blocks
knitr::opts_chunk$set(comment = '')

# Libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
library(stringr)
library(lme4)
library(reshape2)

This week's assignment has a twist! We are going to repeat the examples from Vehkalahti and Everitt's (2019) book Multivariate Analysis for The Behavioral Sciences, chapters 8 and 9, except with the data sets swapped between the chapters. I've wrangled the data to a "long table" format in the corresponding wrangling exercise. You can find the code here.

So first we have the rats data introduced in Ch. 9. The authors describe the data as follows:

[W]e shall use some data from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.

rats <- read.csv("data/rats.csv")
rats$ID <- as.factor(rats$ID)
rats$Group <- as.factor(rats$Group)
str(rats)
'data.frame':   176 obs. of  5 variables:
 $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
 $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
 $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
 $ time  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(rats)
       ID      Group       WD                weight           time      
 1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
 2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
 3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
 4      : 11                             Mean   :384.5   Mean   :33.55  
 5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
 6      : 11                             Max.   :628.0   Max.   :64.00  
 (Other):110                                                            

There are 16 rats in total: 8 in group 1, 4 in group 2 and 4 in group 3. Each rat has a unique ID. The time column is the measurement day.
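
Those counts are easy to double-check from the long data with dplyr – a small sketch (output omitted):

# One row per rat: count the unique rats in each diet group
rats %>% distinct(ID, Group) %>% count(Group)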

Next, we have the bprs data introduced in Ch. 8. The authors' description follows:

Table 8.1 taken from Davis (2002). Here 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

bprs <- read.csv("data/bprs.csv")
# Create ID column before factor conversion
# treatment "1" and subject "1" -> 101
# treatment "2" and subject "20" -> 220
bprs$ID <- as.factor(bprs$treatment * 100 + bprs$subject)
# Categorical integers to categories...
bprs$treatment <- as.factor(bprs$treatment)
bprs$subject <- as.factor(bprs$subject)
str(bprs)
'data.frame':   360 obs. of  6 variables:
 $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ weeks    : chr  "week0" "week0" "week0" "week0" ...
 $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
 $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ID       : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(bprs)
 treatment    subject       weeks                bprs            week  
 1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
 2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
           3      : 18   Mode  :character   Median :35.00   Median :4  
           4      : 18                      Mean   :37.66   Mean   :4  
           5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
           6      : 18                      Max.   :95.00   Max.   :8  
           (Other):252                                                 
       ID     
 101    :  9  
 102    :  9  
 103    :  9  
 104    :  9  
 105    :  9  
 106    :  9  
 (Other):306  
# Check that there are 40 unique ID values
unique(bprs$ID)
 [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[20] 120 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218
[39] 219 220
40 Levels: 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 ... 220

So, that's 40 subjects divided evenly into two treatment groups. This time, however, the subject IDs are not unique within the data – they are only unique within a treatment group. That would cause me some issues later on (because I suck at R), so I've created a unique ID column here. In hindsight, this should've been done at the data wrangling stage, though I don't recall it being in the instructions.
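
A small sketch making that non-uniqueness concrete (the counts match the structure output above):

# 20 distinct subject labels, but 40 actual individuals (treatment x subject)
n_distinct(bprs$subject)
bprs %>% distinct(treatment, subject) %>% nrow()
n_distinct(bprs$ID)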

Working with RATS data

Plot of the rats’ weights (Fig. 8.1).

So here is a graphical overview of the time series of rats’ weight development. Group 1 seems to consist of smaller rats than the other two groups. Also there is one huge rat in group 2, which will skew the mean value for that group.

ggplot(rats, aes(x = time, y = weight, linetype= ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:8, 2)) +
  facet_grid(. ~ Group, labeller=label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(rats$weight), max(rats$weight)))

Standardized weights (Fig 8.2)

Next up is “tracking”, i.e. plotting the time series of standardized weights. That’ll show how a rat gains weight compared to the others. This feels like anomaly detection in atmospheric dynamics, or de-trending. Though those things are a bit different – or at least the questions they try to answer are different. I think. Anyways, now the lines are not always increasing, so the relative development is easier to see.

rats_tracked <- rats %>% 
  group_by(time) %>%
  mutate(std.weight = (weight - mean(weight))/sd(weight)) %>%
  ungroup()

ggplot(rats_tracked, aes(x = time, y = std.weight, linetype= ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:8, 2)) +
  facet_grid(. ~ Group, labeller=label_both) +
  theme(legend.position = "none") +
  scale_y_continuous("Standardized weight")

Mean response profiles (Fig. 8.3)

This is a way of visualising the overall development of each group and estimating the typical range of values for a given group and time. I also noticed that on the course's discussion page there was some talk about the standard error calculation: the denominator sqrt(n) was presented incorrectly in the Datacamp tutorials (there, n was the number of unique time instants). So I changed n to the number of members (rows) in a given group at a given time.

ratss <- rats %>%
  group_by(Group, time) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(n()), .groups="keep") %>%
  ungroup()

# Glimpse the data
glimpse(ratss)
Rows: 33
Columns: 4
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
$ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
$ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.8…
# Plot the mean profiles
ggplot(ratss, aes(x = time, y = mean, linetype = Group, shape="1", color=Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3, show.legend = F) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.5)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

Boxplots for rats data (Fig. 8.4)

Next up is a boxplot. It shows roughly the same thing as the previous figure, but the emphasis is on the quartiles (inter-quartile range etc.) instead of the standard error. Of course, if the distributions are skewed, the mean and median differ: in Group 2 the outlier rat pulls the mean up quite a bit.

ggplot(rats, aes(x=as.factor(time), y=weight, color=Group)) +
  geom_boxplot()

Boxplots of mean summary measures for the three rat groups (Fig. 8.5)

This plot should give me a quick overview of the differences between the groups – and show whether there are any outliers.

# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline day 1).
rats_sm <- rats %>%
  filter(time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight), .groups="keep" ) %>%
  ungroup()

# Glimpse the data
glimpse(rats_sm)
Rows: 16
Columns: 3
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
$ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
$ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(rats_sm, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), time > 1")

Same as above but with outliers (if any) removed (Fig. 8.6)

Okay so, group 2 has an outlier. Remove it. Though I'm removing 1 rat from a total of 4 rats, meaning that the statistics start to get quite dubious… n=3 doesn't sound like a large enough sample.

# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
rats_sm1 <-  rats_sm %>% filter(mean < 575)
glimpse(rats_sm1)
Rows: 15
Columns: 3
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
$ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
$ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(rats_sm1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), time > 1")

t-test for Mean Summary Measure (table 8.3)

The two-sample t-test compares exactly two groups at a time. Luckily, three groups make only three pairs, so I can run the test for each pair with minimal effort (copy-paste). Based on the results it is very clear that Group 1 differs from Groups 2 and 3. Between Groups 2 and 3 the difference is not as dramatic, but it's there.

rats_sm1_noG3 <- rats_sm1 %>% filter(Group != 3)
t.test(mean ~ Group, data = rats_sm1_noG3, var.equal = TRUE)

    Two Sample t-test

data:  mean by Group
t = -25.399, df = 9, p-value = 1.094e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -204.0633 -170.6867
sample estimates:
mean in group 1 mean in group 2 
        265.025         452.400 
rats_sm1_noG2 <- rats_sm1 %>% filter(Group != 2)
t.test(mean ~ Group, data = rats_sm1_noG2, var.equal = TRUE)

    Two Sample t-test

data:  mean by Group
t = -27.824, df = 10, p-value = 8.345e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -283.4943 -241.4557
sample estimates:
mean in group 1 mean in group 3 
        265.025         527.500 
rats_sm1_noG1 <- rats_sm1 %>% filter(Group != 1)
t.test(mean ~ Group, data = rats_sm1_noG1, var.equal = TRUE)

    Two Sample t-test

data:  mean by Group
t = -5.632, df = 5, p-value = 0.002446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -109.37769  -40.82231
sample estimates:
mean in group 2 mean in group 3 
          452.4           527.5 
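
As a side note, the copy-paste approach could be condensed into a single call with base R's pairwise.t.test, which runs all pairwise comparisons at once and adjusts the p-values for multiple testing (Holm by default). Its defaults (pooled standard deviation) differ slightly from the separate t.test calls above, so this is just a sketch of the idea:

# All three pairwise comparisons in one go, with Holm-adjusted p-values
pairwise.t.test(rats_sm1$mean, rats_sm1$Group)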

Analysis of Covariance (ANCOVA) (Table 8.4)

For the ANCOVA I need a baseline value for each row – and that is easy to pick up from the original data. This works because rats_sm contains exactly one row per rat – it is a summary over all measurements – and the original wide data also has one row per rat, in the same order. Were that not the case, I'd have to join by the ID columns; that would be clearer than this, but (probably) slower…

original_rats <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
                          header=TRUE, sep="\t")
rats_sm2 <- rats_sm %>%
  mutate(baseline = original_rats$WD1)

fit <- lm(mean ~ baseline + Group, data = rats_sm2)
anova(fit)
Analysis of Variance Table

Response: mean
          Df Sum Sq Mean Sq   F value   Pr(>F)    
baseline   1 253625  253625 1859.8201 1.57e-14 ***
Group      2    879     439    3.2219  0.07586 .  
Residuals 12   1636     136                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, the interpretation for this fit: the baseline (starting weight) is a significant contributor to the later mean weight. Belonging to a certain group shows only weak evidence of an effect (p ≈ 0.08), probably driven by whether the rat belongs to Group 1 or not.
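
For the record, the join-based alternative mentioned above could look something like this – a sketch assuming the wide data keeps its ID column, as the wrangling script suggests:

# Pick the day-1 weights and join them by rat ID instead of relying on row order
baseline_df <- original_rats %>%
  mutate(ID = as.factor(ID)) %>%
  dplyr::select(ID, baseline = WD1)
rats_sm2_alt <- dplyr::left_join(rats_sm, baseline_df, by = "ID")
# rats_sm2_alt should match rats_sm2 above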

Lastly, I'm omitting Table 8.6. It doesn't use the BPRS data in the original, so strictly speaking I cannot swap that data for the RATS dataset. It also demonstrates how to deal with missing data, so performing the analysis on a complete dataset would not be very instructive (of course I could randomly remove some of the data and work that way, but I'm being lazy here).

Working with BPRS data

Moving on to the BPRS data and the Chapter 9 examples. I might take some liberties here and there in visualising the data instead of replicating the figures 100% faithfully – mainly because the data is different, so I might need to adjust the visuals a bit for readability.

Week vs BPRS with group numbers (Fig. 9.1)

Here the two groups overlap more than the rats, and we have more individuals here as well!

ggplot(bprs, aes(x=week, y=bprs, color=treatment)) +
  geom_text(aes(label=treatment), show.legend=FALSE)

Results from fitting linear reg.model ignoring the repeated-measures structure of the data (Table 9.3)

These results should be total nonsense. Spoilers: they are. The results "show" that the time dimension has a really strong effect on the measurements. Well yeah, about that independence… Linear regression assumes independent observations, i.e. no autocorrelation or dependence between measurements. A shame that assumption is completely violated when the same subjects are measured repeatedly over time…

# create a regression model RATS_reg
bprs_reg <- lm(bprs ~ week + treatment, data = bprs)
# print out a summary of the model
summary(bprs_reg)

Call:
lm(formula = bprs ~ week + treatment, data = bprs)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.454  -8.965  -3.196   7.002  50.244 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  46.4539     1.3670  33.982   <2e-16 ***
week         -2.2704     0.2524  -8.995   <2e-16 ***
treatment2    0.5722     1.3034   0.439    0.661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.37 on 357 degrees of freedom
Multiple R-squared:  0.1851,    Adjusted R-squared:  0.1806 
F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Don't tell anyone, but I'm going to cheat a bit here and plot an extra figure or two, because I want to see what this faulty analysis looks like!

# Fitted values of the (faulty) linear model; not actually needed below,
# since geom_smooth(method = "lm") fits its own line within each facet
fit <- fitted(bprs_reg)

ggplot(bprs, aes(x=week, y=bprs)) +
  geom_line(aes(group=subject), color="gray75") +
  geom_point() +
  geom_smooth(method="lm") +
  labs(title="BPRS by week (with regression) for both treatment groups") +
  facet_wrap(vars(treatment))

Since treatment is a categorical variable, I can plot the linear regression this way – it helps that treatment has only two categories, though. I included the time series as lines, too, to highlight why this sort of regression modeling doesn't work here. Okay, sure, it shows the downward trend in both groups, but for analysing trends you'd want to consider the "repeated measures" nature of the data anyways4.

Individual BPRS profiles (Fig. 9.2)

Here's where the custom ID column comes in handy. I need to be able to differentiate between subjects that share the same subject value but have a different treatment value. Otherwise this plot would get confused, poor thing.

# Check that the ID column is included
str(bprs)
'data.frame':   360 obs. of  6 variables:
 $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ weeks    : chr  "week0" "week0" "week0" "week0" ...
 $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
 $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ID       : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
ggplot(bprs, aes(x = week, y = bprs, group=ID)) +
  geom_line(aes(linetype=treatment)) +
  scale_y_continuous(name = "BPRS") +  # the response is the BPRS score, not weight
  theme(legend.position = "top")

Scatterplot matrix of repeated measures (Fig. 9.3)

This plot requires the data in wide format. Maybe I could pivot the long data back… Yep, that works. The scatters form rather nice lines, especially between two consecutive weeks.

bprs_w <- tidyr::pivot_wider(bprs, names_from=weeks, values_from=bprs, 
                             id_cols=c("treatment", "subject", "ID")) %>% 
  as.data.frame
str(bprs_w)
'data.frame':   40 obs. of  12 variables:
 $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ ID       : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
 $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
 $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
 $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
 $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
 $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
 $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
 $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
 $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
plot(bprs_w[-1:-3])

Results from fitting random intercept model (Table 9.4)

But now we will loosen the requirement of totally independent measurements. First, the random intercept model allows the fitted intercept to differ between individuals.

# Create a random intercept model
bprs_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = bprs, REML = FALSE)

# Print the summary of the model
summary(bprs_ref)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week + treatment + (1 | subject)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2748.7   2768.1  -1369.4   2738.7      355 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0481 -0.6749 -0.1361  0.4813  3.4855 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept)  47.41    6.885  
 Residual             104.21   10.208  
Number of obs: 360, groups:  subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  46.4539     1.9090  24.334
week         -2.2704     0.2084 -10.896
treatment2    0.5722     1.0761   0.532

Correlation of Fixed Effects:
           (Intr) week  
week       -0.437       
treatment2 -0.282  0.000

Okay, so the fixed-effect estimates are exactly the same as in the naive linear model, but the standard errors change once the between-subject variation gets its own variance component: the week effect is still clearly there, while treatment remains negligible.
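
To see that side by side, the fixed-effect standard errors of the two models can be compared directly – a small sketch (numbers omitted, they are in the summaries above):

# Same point estimates, different standard errors once subjects get their own intercepts
cbind(lm   = coef(summary(bprs_reg))[, "Std. Error"],
      lmer = coef(summary(bprs_ref))[, "Std. Error"])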

Results from fitting random intercept and slope model (Table 9.5)

Next, let’s introduce the random slope to the model. And the winning numbers are… Huh? Not much of a change.

# create a random intercept and random slope model
bprs_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = bprs, REML = FALSE)

# print a summary of the model
summary(bprs_ref1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week + treatment + (week | subject)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2745.4   2772.6  -1365.7   2731.4      353 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8919 -0.6194 -0.0691  0.5531  3.7976 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 subject  (Intercept) 64.8222  8.0512        
          week         0.9609  0.9802   -0.51
 Residual             97.4305  9.8707        
Number of obs: 360, groups:  subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  46.4539     2.1052  22.066
week         -2.2704     0.2977  -7.626
treatment2    0.5722     1.0405   0.550

Correlation of Fixed Effects:
           (Intr) week  
week       -0.582       
treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(bprs_ref1, bprs_ref)
Data: bprs
Models:
bprs_ref: bprs ~ week + treatment + (1 | subject)
bprs_ref1: bprs ~ week + treatment + (week | subject)
          npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
bprs_ref     5 2748.7 2768.1 -1369.4   2738.7                       
bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, knowing there are duplicate values in the subject column, I'm not so sure that fit did what we wanted here. Let's do it again using ID instead of subject.

# Create a random intercept model
bprs_ref_id <- lmer(bprs ~ week + treatment + (1 | ID), data = bprs, REML = FALSE)

# Print the summary of the model
summary(bprs_ref_id)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week + treatment + (1 | ID)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2582.9   2602.3  -1286.5   2572.9      355 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.27506 -0.59909 -0.06104  0.44226  3.15835 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 97.39    9.869   
 Residual             54.23    7.364   
Number of obs: 360, groups:  ID, 40

Fixed effects:
            Estimate Std. Error t value
(Intercept)  46.4539     2.3521  19.750
week         -2.2704     0.1503 -15.104
treatment2    0.5722     3.2159   0.178

Correlation of Fixed Effects:
           (Intr) week  
week       -0.256       
treatment2 -0.684  0.000
# create a random intercept and random slope model
bprs_ref_id1 <- lmer(bprs ~ week + treatment + (week | ID), data = bprs, REML = FALSE)

# print a summary of the model
summary(bprs_ref_id1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week + treatment + (week | ID)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2523.2   2550.4  -1254.6   2509.2      353 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4655 -0.5150 -0.0920  0.4347  3.7353 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 ID       (Intercept) 167.827  12.955        
          week          2.331   1.527   -0.67
 Residual              36.747   6.062        
Number of obs: 360, groups:  ID, 40

Fixed effects:
            Estimate Std. Error t value
(Intercept)  45.9830     2.6470  17.372
week         -2.2704     0.2713  -8.370
treatment2    1.5139     3.1392   0.482

Correlation of Fixed Effects:
           (Intr) week  
week       -0.545       
treatment2 -0.593  0.000
# perform an ANOVA test on the two models
anova(bprs_ref_id1, bprs_ref_id)
Data: bprs
Models:
bprs_ref_id: bprs ~ week + treatment + (1 | ID)
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
             npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
bprs_ref_id     5 2582.9 2602.3 -1286.5   2572.9                         
bprs_ref_id1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well look at that! Using ID instead of subject made quite a difference! While the previous ANOVA result had “only” one star of significance for the random intercept and slope model, this one here has three stars!

… I have no idea what's going on here. Did the baseline become really important? Or is it that I changed the per-individual sample size by telling the model "hey, there are 40 dudes instead of 20"? I mean, the number of observations behind each individual fit, that is. Let me ask a question…

# perform an ANOVA test on the two models
anova(bprs_ref_id1, bprs_ref1)
Data: bprs
Models:
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
bprs_ref1: bprs ~ week + treatment + (week | subject)
             npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id1    7 2523.2 2550.4 -1254.6   2509.2                    
bprs_ref1       7 2745.4 2772.6 -1365.7   2731.4     0  0           

Yeah, I'm not sure what I expected. The two models have the same number of parameters, so the likelihood-ratio test has zero degrees of freedom and can't really compare them – although the AIC values (2523 vs. 2745) suggest the ID version fits considerably better.
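
Since these two models are not nested, a likelihood-ratio test isn't really the right tool; comparing information criteria is a more sensible way to line them up. A small sketch (the values are the same AIC/BIC numbers as in the summaries above):

# Compare the subject-based and ID-based random slope models by AIC/BIC
AIC(bprs_ref1, bprs_ref_id1)
BIC(bprs_ref1, bprs_ref_id1)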

Fitting the random intercept and slope model with Subject×Week interaction (Table 9.6)

Then there is a third way of building a model: we allow an interaction between the two variables. Notice the * (star) instead of + (plus) symbol in the model definition!

# create a random intercept and random slope model with the interaction
bprs_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = bprs, REML = FALSE)

# print a summary of the model
summary(bprs_ref2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week * treatment + (week | subject)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2744.3   2775.4  -1364.1   2728.3      352 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0512 -0.6271 -0.0768  0.5288  3.9260 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 subject  (Intercept) 64.9964  8.0620        
          week         0.9687  0.9842   -0.51
 Residual             96.4707  9.8220        
Number of obs: 360, groups:  subject, 20

Fixed effects:
                Estimate Std. Error t value
(Intercept)      47.8856     2.2521  21.262
week             -2.6283     0.3589  -7.323
treatment2       -2.2911     1.9090  -1.200
week:treatment2   0.7158     0.4010   1.785

Correlation of Fixed Effects:
            (Intr) week   trtmn2
week        -0.650              
treatment2  -0.424  0.469       
wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(bprs_ref2, bprs_ref1)
Data: bprs
Models:
bprs_ref1: bprs ~ week + treatment + (week | subject)
bprs_ref2: bprs ~ week * treatment + (week | subject)
          npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
bprs_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
bprs_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So it looks like that change might bring a little more predictive power to the model – or at least the difference between the models is borderline significant. Let's do the same with ID this time.

# create a random intercept and random slope model with the interaction
bprs_ref_id2 <- lmer(bprs ~ week * treatment + (week | ID), data = bprs, REML = FALSE)

# print a summary of the model
summary(bprs_ref_id2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: bprs ~ week * treatment + (week | ID)
   Data: bprs

     AIC      BIC   logLik deviance df.resid 
  2523.5   2554.5  -1253.7   2507.5      352 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4747 -0.5256 -0.0866  0.4435  3.7884 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 ID       (Intercept) 164.204  12.814        
          week          2.203   1.484   -0.66
 Residual              36.748   6.062        
Number of obs: 360, groups:  ID, 40

Fixed effects:
                Estimate Std. Error t value
(Intercept)      47.8856     2.9840  16.047
week             -2.6283     0.3752  -7.006
treatment2       -2.2911     4.2200  -0.543
week:treatment2   0.7158     0.5306   1.349

Correlation of Fixed Effects:
            (Intr) week   trtmn2
week        -0.668              
treatment2  -0.707  0.473       
wek:trtmnt2  0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(bprs_ref_id2, bprs_ref_id1)
Data: bprs
Models:
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
bprs_ref_id2: bprs ~ week * treatment + (week | ID)
             npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id1    7 2523.2 2550.4 -1254.6   2509.2                    
bprs_ref_id2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

Well. Maybe the model overfits here, so there isn't enough variation left to distinguish the two models? Let's see how the subject versus ID question works out.

anova(bprs_ref_id2, bprs_ref2)
Data: bprs
Models:
bprs_ref_id2: bprs ~ week * treatment + (week | ID)
bprs_ref2: bprs ~ week * treatment + (week | subject)
             npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id2    8 2523.5 2554.6 -1253.7   2507.5                    
bprs_ref2       8 2744.3 2775.4 -1364.1   2728.3     0  0           

Zero degrees of freedom. Yeah, that's a dead end – the two models have the same number of parameters and aren't nested, so the likelihood-ratio test has nothing to work with (the AIC again favours the ID version, though).

Fitted and observed growth profiles (Figure 9.4)

Finally, there was a side-by-side plot showing the observed and fitted profiles. Well, here you go, dear reader! As the authors of MABS said, the fitted lines agree quite nicely with the original data. Granted, the BPRS measurements have more variance than the rats' weights did, but overall the figures are in agreement. Notably, the rats' weights were more or less monotonically increasing, while BPRS goes up and down (though the main trend is downwards for most individuals).

# Create a vector of the fitted values
# Making a copy so I won't accidentally use this data somewhere else :)
bprs2 <- bprs
bprs2$Fitted <- fitted(bprs_ref2, bprs)
# Rename original "bprs" column for visualisation purposes
colnames(bprs2)[4] <- "Original"

# Melting data to "melted" (which is probably grammatically incorrect form of 'melt')
melted <- melt(bprs2, 
               id.vars=c("treatment","subject","ID", "week"), 
               measure.vars=c("Original", "Fitted"),
               variable.name = "key",
               value.name = "BPRS")

ggplot(melted, aes(x=week, y=BPRS, group=ID)) +
  geom_line(aes(linetype=treatment)) +
  facet_wrap(vars(key)) +
  theme(legend.position="top") +
  xlab("Week")

Back to top.


Final word

So, this was it! That was the last chapter of this course, and with it this course diary comes to an end. In proper book-ends style, I'm going to write a bit about my thoughts on the course.

First of all, I had fun! While the course had a lot of work and strict deadlines, I still enjoyed writing this diary. Partially because I decided to write in my own style – with the casual remarks and all. I wanted to make the writing fun for me and reading enjoyable to the readers. Based on the peer reviews I got along the way, I’d say that I was successful!

Second, the R language was a new experience for me. The tasks weren't too difficult, especially since the Datacamp exercises pretty much solve everything on the coding side. The hardest part was overcoming my initial way of solving things – R is not Python, and things work differently here! Once I realised this, the coding became smoother.

Third, I learned a lot about data analysis! I hope I can use this new knowledge as a basis when I start delving deeper into my research (which will inevitably require a better understanding of statistics).

Finally, thanks to Kimmo Vehkalahti, Reijo Sund, and the teaching assistants for organising the course, and to you, dear reader, for reading my ramblings.

Petteri signing out.

Back to top.


  1. I’m using the knitr::kable function here to output prettier tables. It doesn’t affect the data values in any way, it is a purely aesthetic helper function.↩︎

  2. For factor variables the interpretation is “how the probability changes compared to a baseline class when our estimator moves from baseline to this class”.↩︎

  3. I also got k=3 as optimal value when I accidentally ran the code without setting seed beforehand. Whoops. Well, like I said, the results of k-means depend on the initial (random) points.↩︎

  4. I’m having trouble imagining what sort of measures would be time series measures and simultaneously not repeated measures, as “time series” already implies repetition. Perhaps sequences of truly random noise, such as atmospheric noise, could be considered as such measure?↩︎