Hello world! This is my first entry to the Introduction to Open Data Science 2020 course project. This course teaches data analysis using R and writing stuff down with R Markdown. Interesting and useful stuff. Previously I’ve worked with Python and GitHub-flavored Markdown syntax.
I hope to learn more about the “statistics” side of data science.
By the way, here is the link to my GitHub repository for this project. Currently it’s rather empty as the “real” work starts next week…
This section is the report for the second week of IODS. To start with, here are the instructions given for writing this report:
Describe the work you have done this week and summarize your learning.
- Describe your work and results clearly.
- Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
- Assume the reader has no previous knowledge of your data or the more advanced methods you are using.
With that out of the way, let’s start the analysis part of the current exercise set!
First, I need to include a couple of libraries to properly build (or rather, “knit”) this report.
# knitr is used to build this report, so it is included by default
# Remove the '##' from each line of R output blocks
knitr::opts_chunk$set(comment = '')
# Libraries for convenience
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
# NOTE: Messages are omitted
The analysis data was produced in the “Data wrangling” part of the exercises. The original data is the “JYTOPKYS2” study data by Kimmo Vehkalahti. The original data is available from here and the data is explained here.
The analysis data is a mix of original variables (gender, age, attitude, and points) and derived variables (deep, stra, and surf). The original and derived variables are explained as follows, based on Vehkalahti’s metadata. Note that the derived variables were omitted from the original data linked above – so that we’d have to calculate them ourselves – but they were included in the metadata.
Variable | Meaning | Type | Notes |
---|---|---|---|
gender | Gender of the student | Categorical | “M” or “F” |
age | Age (in years) | Number | Derived from the date of birth |
attitude | Global attitude toward statistics | Number | Sum of 10 answers |
points | Total points from coursework | Number | Student’s best exam score + extra points from answering the study questionnaire |
deep | Average score of answers related to Deep approach | Number | Avg. of 12 answers |
stra | Average score of answers related to Strategic approach | Number | Avg. of 12 answers |
surf | Average score of answers related to Surface approach | Number | Avg. of 8 answers |
You can see the creation script here (most recent version) or the original from commit bb6de84. The script stores the data to learning2014.csv, so let’s start by reading that file and printing out the dataset structure, first few rows, and some summary statistics.
dset <- read.csv("data/learning2014.csv")
str(dset)
'data.frame': 166 obs. of 7 variables:
$ gender : chr "F" "M" "F" "M" ...
$ age : int 53 55 49 53 49 38 50 37 37 42 ...
$ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
$ deep : num 3.58 2.92 3.5 3.5 3.67 ...
$ stra : num 3.38 2.75 3.62 3.12 3.62 ...
$ surf : num 2.58 3.17 2.25 2.25 2.83 ...
$ points : int 25 12 24 10 22 21 21 31 24 26 ...
knitr::kable(head(dset, 5)) # Prettier table than plain head() output
gender | age | attitude | deep | stra | surf | points |
---|---|---|---|---|---|---|
F | 53 | 37 | 3.583333 | 3.375 | 2.583333 | 25 |
M | 55 | 31 | 2.916667 | 2.750 | 3.166667 | 12 |
F | 49 | 25 | 3.500000 | 3.625 | 2.250000 | 24 |
M | 53 | 35 | 3.500000 | 3.125 | 2.250000 | 10 |
M | 49 | 37 | 3.666667 | 3.625 | 2.833333 | 22 |
summary(dset)
gender age attitude deep
Length:166 Min. :17.00 Min. :14.00 Min. :1.583
Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
Mode :character Median :22.00 Median :32.00 Median :3.667
Mean :25.51 Mean :31.43 Mean :3.680
3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
Max. :55.00 Max. :50.00 Max. :4.917
stra surf points
Min. :1.250 Min. :1.583 Min. : 7.00
1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
Median :3.188 Median :2.833 Median :23.00
Mean :3.121 Mean :2.787 Mean :22.72
3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
Max. :5.000 Max. :4.333 Max. :33.00
count(dset, gender)
gender n
1 F 110
2 M 56
This data contains 166 data points (rows). The original data has 187 rows; however, the students with 0 points were omitted during the data wrangling. Out of the 166 remaining students, 110 were female and 56 were male. The students’ ages range from 17 to 55 years, with the median at 22 years.
Next, here is a plot that shows the distributions of the variables and the cross-correlation between each pair of variables in the data. First, most of the distributions are similar to (slightly skewed) Gaussian distributions; however, there are exceptions. Notably, age is similar to an inverse-gamma distribution: a sharp peak at low values and a long trailing tail. Even when ignoring the high-end “outliers”, the distribution is still positively skewed – as seen in the boxplot. Second, there is no clear difference between males and females for any of the variables. Third, the correlations seem to be rather weak (\(|\rho| \lesssim 0.2\)), except for stra and surf (\(\rho_{stra,surf} \approx -0.32\)) and for attitude and points (\(\rho_{attitude,points} \approx 0.44\)), for which the correlation is a bit stronger. Finally, as the correlation values are not properly displayed in the figure due to the figure size, I’ve included the Pearson correlation coefficient table below the figure.
ggpairs(dset,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo=wrap("facethist", bins=20))
)
cor(dset[-1]) # Exclude 'gender' column
age attitude deep stra surf points
age 1.00000000 0.02220071 0.02507804 0.10244409 -0.1414052 -0.09319032
attitude 0.02220071 1.00000000 0.11024302 0.06174177 -0.1755422 0.43652453
deep 0.02507804 0.11024302 1.00000000 0.09650255 -0.3238020 -0.01014849
stra 0.10244409 0.06174177 0.09650255 1.00000000 -0.1609729 0.14612247
surf -0.14140516 -0.17554218 -0.32380198 -0.16097287 1.0000000 -0.14435642
points -0.09319032 0.43652453 -0.01014849 0.14612247 -0.1443564 1.00000000
Next up is a task to fit a linear model to the data, where points is the target variable and one or more of the other variables are the explanatory variables. The first instruction was to select three explanatory variables. I chose attitude, stra, and surf, as their correlation coefficients with points were the largest in absolute value. And the results are…
# Target: points
# Explanatory variables (3): attitude, stra, surf
# based on absolute value of correlation coefficient (Pearson)
model <- lm(points ~ attitude + stra + surf, dset)
summary(model)
Call:
lm(formula = points ~ attitude + stra + surf, data = dset)
Residuals:
Min 1Q Median 3Q Max
-17.1550 -3.4346 0.5156 3.6401 10.8952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.01711 3.68375 2.991 0.00322 **
attitude 0.33952 0.05741 5.913 1.93e-08 ***
stra 0.85313 0.54159 1.575 0.11716
surf -0.58607 0.80138 -0.731 0.46563
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Yeah, not so good. First, I’ll have to quickly interpret the output. First is the distribution of residuals. A (minimum) residual of “-17” seems rather large in absolute value, when the points range from 7 to 33 – there might be some strong outliers. At least the median is close to zero, which is reassuring. Next are the coefficients for the linear regression. The Estimate column shows the slope of the linear regression (except for the intercept, where it shows the value of the target when all explanatory variables are zero). So with zero attitude and learning-strategy scores, you’d expect a student to get 11 points. For the slopes, the value means “the increase or decrease in points when the attitude/strategic learning/surface learning value increases by one unit”. So, larger attitude and stra scores would indicate better performance, whereas a larger surf score would decrease performance.
The next column, Std. Error, shows the standard error of the estimate. The third column, t value, shows the t-statistic – a measure of the estimate’s statistical significance. However, that value is not very informative by itself, so we’ll rely on the fourth column to interpret the t-statistic. This column shows the probability of getting an estimate at least this far from zero by random chance if the true coefficient were zero, or in other words, how confident we can be in rejecting the null hypothesis. One more time for the non-statisticians like me: the smaller the p-value is, the more we can trust that our target and explanatory variable have a relationship – i.e. the slope is not zero. Other interesting numbers are the R-squared values: they estimate “the fraction of variance explained by the model”. Zero means that our model cannot explain anything about the target variable, while one would mean that our model explains everything, including any random errors.
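As a small aside (my own addition, not part of the exercise), the same uncertainty information can also be viewed as confidence intervals for the coefficients:
# Hedged aside: 95% confidence intervals for the coefficients complement
# the t-values and p-values discussed above
confint(model)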
So, looking at the significance of the estimates, it looks like only attitude has any explanatory power here. But stra is close. The R2 values indicate that we’re not capturing the majority of the variance with this model – or that our data is really noisy and has plenty of unknown random errors. Let’s try to improve the model by dropping surf, as it is not meaningful at all; maybe it masks stra a little.
model <- lm(points ~ attitude + stra, dset)
summary(model)
Call:
lm(formula = points ~ attitude + stra, data = dset)
Residuals:
Min 1Q Median 3Q Max
-17.6436 -3.3113 0.5575 3.7928 10.9295
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.97290 2.39591 3.745 0.00025 ***
attitude 0.34658 0.05652 6.132 6.31e-09 ***
stra 0.91365 0.53447 1.709 0.08927 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.289 on 163 degrees of freedom
Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Well, there is a slight improvement for stra. Also, the intercept got more significant! Cool. Also, the adjusted R2 slightly improved. However, it is still questionable whether stra truly has any explanatory power. Let’s inspect it a little further by plotting the scatterplots for points ~ attitude and points ~ stra:
# Do scatterplots with ggplot2: prettier figures and include a linear regression fit easily!
p_att <- ggplot(dset, mapping=aes(attitude, points)) + # Set up the figure
geom_point() + # Add scatter points
geom_smooth(method="lm") + # Add linear regression with confidence interval
labs(title="points vs. attitude (with regression)") # Add title for the plot
p_stra <- ggplot(dset, mapping=aes(stra, points)) +
geom_point() +
geom_smooth(method="lm") +
labs(title="points vs. stra (with regression)")
# Use gridExtra package to present the figures side-by-side. Saves some screen space :)
grid.arrange(p_att, p_stra, nrow=1, ncol=2)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
Okay, so, the left and right figures have attitude and stra, respectively, as the explanatory variable. As expected, attitude shows at least some linear pattern in the scatter plot, though there are some points rather far away from the regression line (I’ll explore this in a bit)… On the contrary, the stra plot doesn’t seem to have any discernible pattern, and the regression line’s slope is quite close to zero. So I’m not convinced that stra would be a good explanatory variable here. Let’s see what happens when I drop stra as well…
model <- lm(points ~ attitude, dset)
summary(model)
Call:
lm(formula = points ~ attitude, data = dset)
Residuals:
Min 1Q Median 3Q Max
-16.9763 -3.2119 0.4339 4.1534 10.6645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
attitude 0.35255 0.05674 6.214 4.12e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.32 on 164 degrees of freedom
Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Yes, there is some change. The most drastic change is in the intercept’s significance: the t-value nearly doubled and the corresponding probability value became five orders of magnitude smaller. Also, the residuals’ minimum, maximum, and median all got a little closer to zero, although both the multiple and adjusted R2 values took a minor hit (they weren’t very high to begin with, though). Speaking of R2 values being low: it looks like our simple linear model explains only about 19% of the variance. That might be acceptable for this study, but I’d be cautious about using this model for forecasting. Regardless, it shows that there is a relationship between the two variables.
So, based on the above analysis, a student’s attitude towards statistics can indicate their course performance: a more enthusiastic student is likely to perform better than a student who is not very interested in statistics. However, having a good attitude towards statistics won’t guarantee them success! There must be something else, as the scatter plot for points ~ attitude hints: the points are rather spread out from the regression line.
To conclude the analysis, I’ll study the model’s diagnostics a bit. Good data science practice calls for critical evaluation of the model – you shouldn’t blindly trust a model’s output without knowing whether the model is reliable and trustworthy. In linear regression, we implicitly assume a couple of things about our random errors:
- The errors are normally distributed (with zero mean).
- The errors have a constant variance (i.e. the variance does not depend on the explanatory variables).
- The errors are not correlated with each other.
For checking the validity of these assumptions, the figure below shows two diagnostic plots: Residuals vs. Fitted and a Normal Q-Q plot. The first one should show no visible pattern if the assumptions hold. Well, that doesn’t seem to be the case here. The residuals seem to converge towards zero as the fitted value increases. That indicates that the variance of the errors is not constant, violating the second point listed above. The Q-Q plot shows how well the residual distribution matches an idealised (theoretical) normal distribution. The points in the figure should be close to a straight line – for two identical distributions the points would all be on the dotted line in the figure. If the points deviate from the line, then the normal distribution assumption doesn’t hold. In the picture, there is a slight curve in the points, and the outermost quantiles clearly deviate from the ideal line. So, the normality assumption is violated at the edges of the distribution – at the center the distribution is almost “normal”.
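As a hedged complement to the visual inspection (my own addition, not part of the exercise), a formal check such as the Shapiro-Wilk test could be applied to the residuals:
# Shapiro-Wilk normality test of the residuals; a small p-value would
# indicate deviation from normality, in line with the Q-Q plot's tails
shapiro.test(residuals(model))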
Then there is the question of outliers, which are data points that deviate from the main body of data points. Does our data have outliers that would distort the fitting process by having strong “leverage”? The third plot, “Residuals vs Leverage”, answers this question. As the leverages are quite small (<0.05), there don’t seem to be any data points that affect our fit disproportionately. It turns out the data points with large residuals were close to the center of the linear fit, so they didn’t have that much leverage.
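The leverages can also be inspected numerically (again my own addition): the hat values are what the diagnostic plot draws on its x-axis.
# Leverage (hat) values of each observation; compare to the
# Residuals vs Leverage panel below
summary(hatvalues(model))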
In conclusion, the model can indicate something about the relationship between the students’ performance and their attitude towards statistics, but the model is not very good for prediction purposes. A linear regression model is not well suited for this data – an approach with fewer assumptions about the random errors is required.
par(mfrow = c(2,2))
plot(model, which=c(1,2,5))
A new week, a new report! First, I need to set up the libraries and knitr options.
# knitr options: No "##" before output
knitr::opts_chunk$set(comment = '')
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(tidyr)
library(reshape2)
Today’s data is from the UCI Machine Learning Repository, kindly provided by Paulo Cortez. The data is the Student Performance Data Set (Cortez and Silva, 2008). The authors describe the dataset as follows:
“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”
However, I’ve modified the data as follows (see the R script here):
- Combined the mathematics and Portuguese datasets, keeping only the students present in both, identified by the columns "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet".
- Created a new variable (alc_use) by calculating the mean of the weekday and weekend alcohol use values.
- Created a logical variable high_use where TRUE denotes the students whose average alcohol use value is greater than 2.
Here are the variables and their explanations (adapted from the data source website):
Variable | Meaning | Datatype | Type | Notes |
---|---|---|---|---|
school | student’s school | Categorical | Identifier | ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira |
sex | student’s sex | Categorical | Identifier | ‘F’ - female or ‘M’ - male |
age | student’s age | Numeric | Identifier | from 15 to 22 |
address | student’s home address type | Categorical | Identifier | ‘U’ - urban or ‘R’ - rural |
famsize | family size | Categorical | Identifier | ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3 |
Pstatus | parent’s cohabitation status | Categorical | Identifier | ‘T’ - living together or ‘A’ - apart |
Medu | mother’s education | Categorical | Identifier | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
Fedu | father’s education | Categorical | Identifier | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
Mjob | mother’s job | Categorical | Identifier | ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’ |
Fjob | father’s job | Categorical | Identifier | ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’ |
reason | reason to choose this school | Categorical | Identifier | close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’ |
guardian | student’s guardian | Categorical | Original | ‘mother’, ‘father’ or ‘other’ |
traveltime | home to school travel time | Categorical | Original | 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour |
studytime | weekly study time | Categorical | Original | 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours |
failures | number of past class failures | Categorical | Original | n if 1<=n<3, else 4 |
schoolsup | extra educational support | Categorical | Original | yes or no |
famsup | family educational support | Categorical | Original | yes or no |
paid | extra paid classes within the course subject (Math or Portuguese) | Categorical | Original | yes or no |
activities | extra-curricular activities | Categorical | Original | yes or no |
nursery | attended nursery school | Categorical | Identifier | yes or no |
higher | wants to take higher education | Categorical | Original | yes or no |
internet | Internet access at home | Categorical | Identifier | yes or no |
romantic | with a romantic relationship | Categorical | Original | yes or no |
famrel | quality of family relationships | Categorical | Original | from 1 - very bad to 5 - excellent |
freetime | free time after school | Categorical | Original | from 1 - very low to 5 - very high |
goout | going out with friends | Categorical | Original | from 1 - very low to 5 - very high |
Dalc | workday alcohol consumption | Categorical | Original | from 1 - very low to 5 - very high |
Walc | weekend alcohol consumption | Categorical | Original | from 1 - very low to 5 - very high |
health | current health status | Categorical | Original | from 1 - very bad to 5 - very good |
absences | number of school absences | Numeric | Original | from 0 to 93 |
G1 | first period grade | Numeric | Original | from 0 to 20 |
G2 | second period grade | Numeric | Original | from 0 to 20 |
G3 | final grade | Numeric | Original | from 0 to 20 |
alc_use | Average alcohol usage | Numeric | Calculated | Avg. of Dalc and Walc |
high_use | Has high alcohol consumption | Boolean | Calculated | alc_use > 2 |
So, with that out of the way, let’s see the actual data.
dataset <- read.csv("data/alc.csv", header=TRUE)
str(dataset)
'data.frame': 382 obs. of 35 variables:
$ school : chr "GP" "GP" "GP" "GP" ...
$ sex : chr "F" "F" "F" "F" ...
$ age : int 15 15 15 15 15 15 15 15 15 15 ...
$ address : chr "R" "R" "R" "R" ...
$ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
$ Pstatus : chr "T" "T" "T" "T" ...
$ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
$ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
$ Mjob : chr "at_home" "other" "at_home" "services" ...
$ Fjob : chr "other" "other" "other" "health" ...
$ reason : chr "home" "reputation" "reputation" "course" ...
$ nursery : chr "yes" "no" "yes" "yes" ...
$ internet : chr "yes" "yes" "no" "yes" ...
$ guardian : chr "mother" "mother" "mother" "mother" ...
$ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
$ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
$ failures : int 0 1 0 0 1 0 1 0 0 0 ...
$ schoolsup : chr "yes" "yes" "yes" "yes" ...
$ famsup : chr "yes" "yes" "yes" "yes" ...
$ paid : chr "yes" "no" "yes" "yes" ...
$ activities: chr "yes" "no" "yes" "yes" ...
$ higher : chr "yes" "yes" "yes" "yes" ...
$ romantic : chr "no" "yes" "no" "no" ...
$ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
$ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
$ goout : int 2 4 1 2 1 2 2 3 2 3 ...
$ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
$ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
$ health : int 1 5 2 5 3 5 5 4 3 4 ...
$ absences : int 3 2 8 2 5 2 0 1 9 10 ...
$ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
$ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
$ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
$ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
$ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
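Before going further, here’s a quick sanity check (my own addition) that the derived columns match the definitions given in the table above:
# Both checks should return TRUE if the wrangling did what was described
all(dataset$alc_use == (dataset$Dalc + dataset$Walc) / 2)
all(dataset$high_use == (dataset$alc_use > 2))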
This time I’ve omitted showing head, as the table is pretty wide. However, the dataframe structure above shows the first values from each column, so that’ll do. Most of the variables are categorical data – although the authors have used integers to denote some of the classes.
Next, I’ll explore four variables (absences, health, freetime, famrel) and their relationship with high or low alcohol consumption. My initial hypothesis is that a high value of absences corresponds to high alcohol usage (high_use=TRUE), whereas for health and famrel the higher values indicate low alcohol consumption. I suspect that the freetime variable might not have any discernible relationship with alcohol usage, as it measures the amount of free time, not how the student spends that time. A lot of free time only increases the number of opportunities to drink alcohol, which may or may not lead to actual consumption.
# Attach an ID number to each row and select only the columns of interest
data <- mutate(dataset, ID = row_number()) %>%
select(any_of(c("ID", "high_use", "absences", "health", "freetime", "famrel")))
# Let's see...
summary(data[-1]) # Ignore ID
high_use absences health freetime famrel
Mode :logical Min. : 0.0 Min. :1.000 Min. :1.00 Min. :1.000
FALSE:268 1st Qu.: 1.0 1st Qu.:3.000 1st Qu.:3.00 1st Qu.:4.000
TRUE :114 Median : 3.0 Median :4.000 Median :3.00 Median :4.000
Mean : 4.5 Mean :3.573 Mean :3.22 Mean :3.937
3rd Qu.: 6.0 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:5.000
Max. :45.0 Max. :5.000 Max. :5.00 Max. :5.000
# Plot with ggplot2, line chart for absences and bar charts for others
p1 <- ggplot(data, aes(absences)) + stat_count(geom="line", aes(colour=high_use) )
p2 <- ggplot(data, aes(health)) + geom_bar(aes(fill=high_use), position="dodge")
p3 <- ggplot(data, aes(freetime)) + geom_bar(aes(fill=high_use), position="dodge")
p4 <- ggplot(data, aes(famrel)) + geom_bar(aes(fill=high_use), position="dodge")
# Use gridExtra package to present the figures side-by-side. Saves some screen space :)
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
Well, guess I was wrong. freetime seems to have the clearest signal, as the peaks are at 3 for high_use=FALSE and at 4 for high_use=TRUE. In the other plots it’s kinda hard to see any signal. Probably this kind of visual inspection isn’t the best tool here. Let’s print out some numbers.
table(high_use = data$high_use, absences = cut(data$absences, include.lowest=TRUE, breaks=c(0,1,5,10,15,20,30))) %>% proportions() %>% round(digits=3) %>% addmargins
absences
high_use [0,1] (1,5] (5,10] (10,15] (15,20] (20,30] Sum
FALSE 0.237 0.303 0.132 0.021 0.008 0.003 0.704
TRUE 0.068 0.111 0.058 0.039 0.011 0.011 0.298
Sum 0.305 0.414 0.190 0.060 0.019 0.014 1.002
table(high_use = data$high_use, health = data$health) %>% proportions() %>% round(digits=3) %>% addmargins
health
high_use 1 2 3 4 5 Sum
FALSE 0.092 0.073 0.162 0.128 0.246 0.701
TRUE 0.029 0.042 0.050 0.047 0.131 0.299
Sum 0.121 0.115 0.212 0.175 0.377 1.000
table(high_use = data$high_use, freetime = data$freetime) %>% proportions() %>% round(digits=3) %>% addmargins
freetime
high_use 1 2 3 4 5 Sum
FALSE 0.039 0.128 0.296 0.183 0.055 0.701
TRUE 0.005 0.042 0.105 0.107 0.039 0.298
Sum 0.044 0.170 0.401 0.290 0.094 0.999
table(high_use = data$high_use, famrel = data$famrel) %>% proportions() %>% round(digits=3) %>% addmargins
famrel
high_use 1 2 3 4 5 Sum
FALSE 0.016 0.026 0.102 0.353 0.204 0.701
TRUE 0.005 0.024 0.065 0.141 0.063 0.298
Sum 0.021 0.050 0.167 0.494 0.267 0.999
So, for absences it looks like FALSE dominates the ratio at low levels while TRUE gets a larger piece of the pie at high values, so I think my hypothesis was correct for that one. For freetime the proportion of TRUE increases from the lowest to the highest class, though it never dominates. I don’t see any clear signal in the health and famrel tables. So, for my hypotheses:
- absences: CORRECT
- health: inconclusive
- freetime: WRONG
- famrel: inconclusive
Well, that’s that. The more you know!
Moving on to the logistic regression, we’ll see if The Machine has more insight than I have. Probably, as it is much better at number-crunching than I am.
# Fit a model!
model <- glm(high_use ~ absences + health + freetime + famrel, data=data, family="binomial")
summary(model)
Call:
glm(formula = high_use ~ absences + health + freetime + famrel,
family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0084 -0.8216 -0.6462 1.1891 2.0837
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74200 0.66202 -2.631 0.008505 **
absences 0.08848 0.02235 3.960 7.51e-05 ***
health 0.11100 0.08657 1.282 0.199762
freetime 0.41336 0.12472 3.314 0.000919 ***
famrel -0.33352 0.12977 -2.570 0.010167 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 465.68 on 381 degrees of freedom
Residual deviance: 430.06 on 377 degrees of freedom
AIC: 440.06
Number of Fisher Scoring iterations: 4
So, absences, health and freetime have a positive coefficient and famrel a negative one. So maybe I got that last one right after all? Anyway, health is not a significant variable here, so I’ll dump it and play with the remaining three.
# Re-fit without health
model <- glm(high_use ~ absences + freetime + famrel, data=data, family="binomial")
summary(model)
Call:
glm(formula = high_use ~ absences + freetime + famrel, family = "binomial",
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9318 -0.8233 -0.6580 1.1905 1.9597
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.43379 0.61550 -2.329 0.019835 *
absences 0.08813 0.02236 3.942 8.09e-05 ***
freetime 0.41844 0.12419 3.369 0.000754 ***
famrel -0.31322 0.12824 -2.443 0.014586 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 465.68 on 381 degrees of freedom
Residual deviance: 431.73 on 378 degrees of freedom
AIC: 439.73
Number of Fisher Scoring iterations: 4
# Odds ratio and confidence intervals
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
Waiting for profiling to be done...
cbind(OR,CI) # SHOW IT TO ME!
OR 2.5 % 97.5 %
(Intercept) 0.2384039 0.06958343 0.7831989
absences 1.0921339 1.04749241 1.1438386
freetime 1.5195946 1.19585392 1.9480047
famrel 0.7310919 0.56733732 0.9397040
So, the odds of the target being TRUE are defined as \(p / (1 - p)\), where \(p = P(Y=1)\). The exponentiated coefficient of an explanatory variable is an odds ratio (OR): the factor by which those odds are multiplied when the variable increases by one unit. So when the OR is greater than 1, our target becomes more likely to be TRUE as the explanatory variable increases in value. Based on the numbers, an increasing number of absences and more available free time correspond to high alcohol consumption, while improving family relations decreases the alcohol consumption.
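As a hedged illustration (my own addition; the values below are made up), the fitted model can also turn these odds into a concrete probability for a hypothetical student:
# Predicted probability of high alcohol use for a made-up student;
# type = "response" returns the probability rather than the log-odds
example_student <- data.frame(absences = 10, freetime = 4, famrel = 3)
predict(model, newdata = example_student, type = "response")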
So, let’s revise my hypotheses:
- absences: CORRECT
- health: inconclusive (statistically not significant)
- freetime: WRONG
- famrel: CORRECT
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (Eds.), Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7. Web Link
A new week, a new section! This week is about clustering the data. As per usual, here are the libraries I’m using.
# knitr options: No "##" before output
knitr::opts_chunk$set(comment = '')
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(reshape2)
library(tidyr)
library(MASS) # Contains the used dataset
library(corrplot)
set.seed(2020)
Unlike previous weeks, this week we are using a dataset that we have not wrangled before! The data is the Boston dataset that is available in the R package MASS. Below is a table that lists the columns in the dataset. Full disclosure: I lifted the descriptions directly from the R documentation (accessible by typing ?Boston in the R console).
Column | Description |
---|---|
crim | per capita crime rate by town. |
zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
indus | proportion of non-retail business acres per town. |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
nox | nitrogen oxides concentration (parts per 10 million). |
rm | average number of rooms per dwelling. |
age | proportion of owner-occupied units built prior to 1940. |
dis | weighted mean of distances to five Boston employment centres. |
rad | index of accessibility to radial highways. |
tax | full-value property-tax rate per $10,000. |
ptratio | pupil-teacher ratio by town. |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
lstat | lower status of the population (percent). |
medv | median value of owner-occupied homes in $1000s. |
The data is titled Housing Values in Suburbs of Boston. The dataset contains 14 variables and 506 rows. All columns are numerical, although the variable chas is actually a boolean indicator represented as ones and zeros. Also as per usual, here are the outputs of the str, head and summary functions for the Boston dataset:
data("Boston") # Creates variable named Boston
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
knitr::kable(head(Boston, 5))
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
summary(Boston)
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
Looking at the summary, a couple of notable elements are visible. First, the crim (per capita crime rate by town) variable is under 4 for most of the rows, but the maximum is 88.98, almost three orders of magnitude larger than the median value (0.26)! The high discrepancy could be explained by the town in question having a low number of people; however, the data doesn’t include this value. The variable zn (proportion of residential land zoned for lots over 25,000 sq.ft.) exhibits similar behavior: the minimum and median values are both 0 and the maximum value is 100. This tells us that most of the towns have no residential lots over 25,000 sq.ft. in size. Similarly for the variable chas, which simply denotes which areas bound the Charles River and which do not: at least 75% of the towns are not connected to the river. For drawing more insight from the rest of the data, let’s plot a couple of pictures.
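A quick check of the exact share (my own addition): since chas is a 0/1 indicator, its mean is the proportion of river-bounding towns.
# Share of towns bounding the Charles River
mean(Boston$chas)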
ggpairs(Boston)
Okay, so the ggpairs plot out-of-the-box isn’t the clearest presentation. Let’s try some alternatives… First, let’s take a look at the density plots for each variable. (I finally figured out how to do this, thanks StackOverflow! Yay, go me.) The only variable resembling a normal distribution is rm, which is the average number of rooms per dwelling. Okay, makes sense as it is an average value. The variables indus, rad and tax show two peaks, meaning that there are more towns with “low” and “high” values than “middle” values. The populace seems to be elderly, as the density increases (almost monotonically) as age gets larger. The variables nox, dis, lstat and medv are positively skewed and ptratio is negatively skewed.
longtable <- melt(Boston)
ggplot(longtable, aes(value)) + geom_density() + facet_wrap(~variable, scales="free")
Interestingly, the variable black looks like a mirrored version of crim. It’s a derived value related to the proportion of black people. I don’t know how the original dataset authors derived the function (\(1000\left(Bk - 0.63\right)^2\)), but the density graph shows that in most of the towns the proportion of black people deviates from the value 0.63. Plotting the function shows that when its value is over 375.38 (i.e. the first quartile), the proportion of black people is under 0.02.
x <- 0:100 / 100
y <- 1000*(x-0.63)^2
plot(x,y, xlab="Proportion of black people", ylab="Function value")
# Show the values of x when y > 375.38 (i.e. the value of the 1st quartile)
x[y > 375.38]
[1] 0.00 0.01
Finally, let’s plot the correlation between all variables using the fancy corrplot function from the corrplot library. From the picture, it looks like rad and tax are strongly (positively) correlated. That implies that the towns closer to radial highways also tax their people more. Other notable correlations are indus, nox and age each correlating negatively with dis, and lstat correlating negatively with medv. Notably, the further away from the employment centres a town is, the less industrial activity and pollution (indicated by the amount of \(NO_x\)) there is.
corrplot(cor(Boston), method="circle")
Whew, that was a lot of text for simply exploring the data! Now I have to actually do something with it, so let’s get to it! First, the data needs to be standardized, which is done by subtracting the column mean from the values and dividing by the column standard deviation. After this the data has a mean of zero for each column! Also, the values now express “how many standard deviations away from the mean” an observation is, instead of the original units. However, the shapes of the density functions did not change, which is also what we want.
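To illustrate what scale() (used below) actually computes, here is the same standardization by hand for a single column (my own addition):
# Manual standardization of one column; should match scale()'s output
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_manual, as.vector(scale(Boston$crim)))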
scaled <- scale(Boston)
# scale() returns a matrix object, change it to a data.frame
scaled <- as.data.frame(scaled)
summary(scaled)
crim zn indus chas
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
nox rm age dis
Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
rad tax ptratio black
Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
lstat medv
Min. :-1.5296 Min. :-1.9063
1st Qu.:-0.7986 1st Qu.:-0.5989
Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 3.5453 Max. : 2.9865
longtable <- melt(scaled)
No id variables; using all as measure variables
ggplot(longtable, aes(value)) + geom_density() + facet_wrap(~variable, scales="free")
In this exercise, we’re ultimately trying to predict the crime rate using Linear Discriminant Analysis (LDA). To do this, we first create a categorical target variable for the crime rate by dividing the data. Here we do this by cutting the data at its quartiles, giving four roughly equal-sized classes.
# create a quantile vector of crim and print it
bins <- quantile(scaled$crim)
bins
0% 25% 50% 75% 100%
-0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime' and take a look using table()
crime <- cut(scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
crime
low med_low med_high high
127 126 126 127
# remove original `crim` and add the new `crime` variable to the dataset
# Here I need to explicitly call dplyr::select, as MASS library also has a select function
scaled <- dplyr::select(scaled, -crim)
scaled <- data.frame(scaled, crime)
Next up, we’ll have to train and test the LDA model with our data. So, let’s split the data into training and testing datasets using an 80:20 split. Also, the test data shouldn’t contain the true labels, so that we can’t accidentally leak them to our model, but we need that information for verifying the model. So let’s store the labels in another object.
# number of rows in the Boston dataset
n <- nrow(scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train and test sets
train <- scaled[ind,]
test <- scaled[-ind,] # note the minus!
# save the correct classes from test data
correct_classes_for_test <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Now that our data is in a suitable format, let’s get to business. Luckily, fitting the LDA is a one-liner. Here our target is crime and the explanatory variables are everything else (denoted with .). Also, I shamelessly copied the biplot plotting code from the DataCamp exercises (because programming it would be too difficult at this point). Let’s see what happens.
lda.fit <- lda(crime ~ ., data = train)
lda.fit # Just checking the output
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2549505 0.2400990 0.2599010 0.2450495
Group means:
zn indus chas nox rm age
low 0.9753697 -0.9434514 -0.08120770 -0.8837233 0.4325390 -0.9381520
med_low -0.1157120 -0.2855907 0.01179157 -0.5543991 -0.1296850 -0.3048100
med_high -0.3631007 0.1961002 0.25261763 0.3994375 0.1537163 0.3898780
high -0.4872402 1.0149946 -0.07348562 1.0385550 -0.3794012 0.8013618
dis rad tax ptratio black lstat
low 0.9342664 -0.6785867 -0.7383022 -0.44674754 0.37958778 -0.77846727
med_low 0.3208755 -0.5591880 -0.4598702 -0.04993701 0.31380803 -0.15663318
med_high -0.4035941 -0.4065439 -0.3123179 -0.43168810 0.09591388 -0.05170852
high -0.8422717 1.6596029 1.5294129 0.80577843 -0.68079981 0.90402871
medv
low 0.53480421
med_low -0.02744276
med_high 0.24205894
high -0.73842724
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.07214555 0.610195519 -0.85295795
indus 0.04785751 -0.409068774 0.13049624
chas -0.08241604 -0.009705042 0.09797199
nox 0.22690421 -0.714147079 -1.42365374
rm -0.09277179 -0.062028307 -0.11140387
age 0.26316698 -0.346412935 0.02425944
dis -0.08409444 -0.147792348 -0.02421958
rad 3.41887302 0.670047867 -0.31331749
tax 0.01133046 0.319706540 0.70765000
ptratio 0.08939538 0.134648187 -0.18848405
black -0.12840758 -0.012087855 0.07913283
lstat 0.22188192 -0.191386778 0.28940556
medv 0.19374414 -0.370964344 -0.30837420
Proportion of trace:
LD1 LD2 LD3
0.9470 0.0412 0.0119
# Function lifted directly from DataCamp exercises
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Okay, that’s a colorful mess. Looking at the arrows, rad stands out quite nicely. Printing the fit also shows that it has the largest coefficients for the linear discriminants LD1 and LD2. Other notable variables would be zn and nox with high-ish coefficients for LD2 and LD3. Also, the crime==high points are all nicely together on the right side of the picture, though some med_high points have ended up there, too.
So, what about our model’s predictions? Predict with test data and cross tabulate the predictions with the actual labels, got it!
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes_for_test, predicted = lda.pred$class) %>% addmargins
predicted
correct low med_low med_high high Sum
low 12 7 5 0 24
med_low 8 17 4 0 29
med_high 1 11 8 1 21
high 0 0 1 27 28
Sum 21 35 18 28 102
Well, it looks like the model is successful in predicting the crime rate classes – up to a point. It is a good sign that the diagonal elements have the largest values, as these elements show the correct predictions! However, looking at the numbers in more detail shows interesting things. The high class looks quite easy to predict: the model correctly predicted 27/28 rows with only one false positive. For the other classes the model didn’t perform as well. It performed worst for med_high, predicting correctly only 8/21 rows with 10 false positives. med_low had even more false positives even though the model got more rows right (17/29 correct, 18 false positives – that’s about half of the predictions in this class)! Finally, 12/24 rows of the low class were predicted correctly, with 9 false positives.
Class | Correct | Misses | False positives |
---|---|---|---|
low | 12/24 (50%) | 12/24 | 9/21 (43%) |
med_low | 17/29 (59%) | 12/29 | 18/35 (51%) |
med_high | 8/21 (38%) | 13/21 | 10/18 (56%) |
high | 27/28 (96%) | 1/28 | 1/28 (4%) |
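For a single summary number (my own addition), the overall accuracy is simply the share of correct predictions:
# Overall accuracy: the fraction of test rows on the table's diagonal
mean(lda.pred$class == correct_classes_for_test)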
Finally, we’ll practice some clustering with the “hello world” of clustering algorithms, also known as k-means. Basically, this algorithm creates k clusters by finding k center points that minimize the within-cluster distances between the “k-points” and the data points, and each data point is assigned to the nearest k-point, which forms the cluster. It’s not foolproof, though: the outcome depends on where the initial k-points are.
Moving on, now that the short (and probably a bit unnecessary) introduction is out of the way, I’ll reload the original Boston dataset and scale it once again for clustering.
# Reload original Boston dataset
data('Boston')
# Scale the dataset
boston_scaled <- scale(Boston) %>% as.data.frame()
# I won't repeat the summary here, it's the same as shown previously...
# Calculate euclidean distance and show it
dist_eu <- dist(boston_scaled)
summary(dist_eu)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# Let's calculate also manhattan distance for good measure
dist_man <- dist(boston_scaled, method="manhattan")
summary(dist_man)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Okay, next up is the actual clustering. Let’s start with three k-points and see the result.
km <- kmeans(boston_scaled, centers = 3)
# Plot a subset of boston_scaled data with clusters colored, just to see how it looks like
pairs(boston_scaled[5:9], col = km$cluster)
I don’t know whose idea it was to have black, red and green as the first three default colors in R… I just hope that these shades of red and green are distinct enough that (red-green) colorblind people are able to tell the difference. And wow, the clusters swap colors when I replot the figure! Yeah, I really have to learn the ggplot2 library by heart if I’m ever going to use R outside this course. But I digress. Moving on…
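(A hedged aside, my own addition: two ways to tame that replot randomness are fixing the RNG state right before the call and letting kmeans try several random initializations via its nstart argument, keeping the best result.)
# Reproducible clustering: fix the seed and use multiple random starts;
# kmeans keeps the run with the smallest total within-cluster sum of squares
set.seed(2020)
km_multi <- kmeans(boston_scaled, centers = 3, nstart = 25)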
Multiple methods exist for figuring out the optimal value of k, but here we use a metric called the total within-cluster sum of squares (WCSS). The optimal number for k is where the WCSS has a radical drop! The kmeans function returns an object that handily provides the WCSS value in its tot.withinss component. The DataCamp exercises graciously include a one-liner to calculate the WCSS for k=1...k_max, so I’m totally going to use that one! Let’s set k_max to 10 to limit the iterations a bit, and then we’ll look for the optimal value visually – i.e. by plotting WCSS(k).
# Maximum number of clusters considered here
k_max <- 10
# This one is also copied from DataCamp...
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Visualise!
qplot(x = 1:k_max, y = twcss, geom = c('line', 'point'))
The most radical decrease happens when going from k=1 to k=2, so the value 2 seems to be a great candidate. Though the WCSS values keep dropping up to k=9, so something like 5 or 7 might also be a potential candidate. But 2 means there will be only two colors in the upcoming plots, making them a bit easier to interpret. So I’ll use k=2.
# Re-cluster the data
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
# This time the full dataset with ggpairs -- it's still going to be a mess, but it'll be a pretty mess.
ggpairs(boston_scaled, aes(col=as.factor(km$cluster)))
# This chunk emits a bunch of "Warning in cor(x, y): the standard deviation is zero" warning messages.
Yep, that’s a pretty-looking mess. Also, that chunk emits a bunch of Warning in cor(x, y): the standard deviation is zero warning messages (I’ve omitted those as they’re wasting space). It looks like the correlations for zn in the red cluster are NA, so that’s probably where the standard deviation is zero. Anyway, let’s first plot the density plots like before, and then check only a handful of variables with larger pictures.
with_colors <- boston_scaled
with_colors$cluster <- as.factor(km$cluster)
longtable <- melt(with_colors)
ggplot(longtable, aes(value, col=cluster)) + geom_density() + facet_wrap(~variable, scales="free")
Well, that’s interesting! Some of the variables now show two differing density curves – essentially meaning that there could be a way to separate the data into two different groups. If you recall my pondering from the data exploration, I noted that the indus, rad and tax variables show two peaks – well, the two peaks seem to belong to different clusters! However, what I couldn’t see before was that age, dis, ptratio, and arguably lstat and medv, also look like combinations of two different (and separable) curves. nox being very separable is not a surprise (in hindsight) when you already know that industrial zones typically cause more pollution than residential zones.
Let’s pick some variables that look interesting.
# Pick a few interesting variables along with the cluster labels
subset <- dplyr::select(with_colors, c("indus", "age", "dis", "rad", "tax", "medv", "cluster"))
ggpairs(subset, aes(col=cluster))
Well, they look interesting. For example, tax looks to be separated quite nicely, same with dis and age. Old people live close to the industrial employment centers and pay more in taxes? That would be one plausible conclusion from this data.
Surprise! There are two bonus tasks for this week! There were some last week too, but I didn’t do those. It’s getting a bit late, so this’ll have to be short. The first task is to fit an LDA for some number of clusters (>2). I chose 7. For that fit, the most influential variables seem to be chas and rad. As for the second bonus task – well, I can’t manage everything.
# Reload original Boston dataset
data('Boston')
# Scale the dataset
bonus_scaled <- scale(Boston) %>% as.data.frame
summary(bonus_scaled)
crim zn indus chas
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
nox rm age dis
Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
rad tax ptratio black
Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
lstat medv
Min. :-1.5296 Min. :-1.9063
1st Qu.:-0.7986 1st Qu.:-0.5989
Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 3.5453 Max. : 2.9865
# k-means clustering with 7 clusters
km <- kmeans(bonus_scaled, centers = 7)
summary(km)
Length Class Mode
cluster 506 -none- numeric
centers 98 -none- numeric
totss 1 -none- numeric
withinss 7 -none- numeric
tot.withinss 1 -none- numeric
betweenss 1 -none- numeric
size 7 -none- numeric
iter 1 -none- numeric
ifault 1 -none- numeric
# Add clusters to dataset
bonus_scaled$cluster <- km$cluster
# LDA fit targeting the clusters
lda.fit <- lda(cluster ~ ., data = bonus_scaled)
lda.fit # Just checking the output
Call:
lda(cluster ~ ., data = bonus_scaled)
Prior probabilities of groups:
1 2 3 4 5 6 7
0.09683794 0.09288538 0.06916996 0.15415020 0.17588933 0.34387352 0.06719368
Group means:
crim zn indus chas nox rm age
1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554 1.5399833 0.07933756
2 -0.2834643 -0.4872402 1.5965043 -0.2723291 1.0453673 -0.6245590 0.94375129
3 1.4802645 -0.4872402 1.0149946 -0.2723291 0.9676887 -0.2969389 0.76560159
4 -0.4126954 1.9952361 -1.1032525 -0.2218534 -1.1552982 0.6080657 -1.40363754
5 0.9722616 -0.4872402 1.0149946 -0.2723291 1.0010692 -0.4790990 0.74608713
6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
7 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.37213224
dis rad tax ptratio black lstat
1 -0.2868574 -0.520140581 -0.82627237 -1.0314738 0.35523433 -0.9636997
2 -0.8963217 -0.622669809 0.03772813 -0.2084479 -0.08717824 0.7185475
3 -0.8580043 1.659602895 1.52941294 0.8057784 -3.29705640 1.1699052
4 1.5772490 -0.627024335 -0.57588304 -0.7161406 0.35335146 -0.9028481
5 -0.8009959 1.659602895 1.52941294 0.8057784 0.16590165 0.8119640
6 0.2769540 -0.587168164 -0.59021294 0.1702603 0.30993538 -0.1365045
7 -0.4033382 0.001081444 -0.09756330 -0.3924585 0.17154271 -0.1643525
medv
1 1.6147330
2 -0.5152915
3 -1.0473739
4 0.6621944
5 -0.6627877
6 -0.1747229
7 0.5733409
Coefficients of linear discriminants:
LD1 LD2 LD3 LD4 LD5
crim -0.032848874 -0.11853833 -0.121992191 -0.04587015 0.02028959
zn -0.236386535 -0.11164525 -0.892168260 0.07752273 -1.47340439
indus 0.122244436 -0.61676862 1.373115707 -0.29017671 -1.77461828
chas 5.826738406 0.17279307 -0.348402443 -0.10993258 -0.06074649
nox 0.007038657 0.41797220 0.349383719 0.25040677 -0.36306581
rm -0.082150352 -0.04052241 -0.161363218 -0.03368422 0.13175307
age 0.072083760 -0.01514390 0.355986612 -0.06494734 0.22785773
dis 0.073348439 0.30301112 -0.212017094 0.04270701 -0.55076413
rad 0.245600390 -3.13810821 -1.153371145 1.57770009 1.03068941
tax -0.002101628 0.25323107 -0.371868363 0.07772706 -0.49231336
ptratio -0.002263332 0.04753426 0.008400806 -0.14134197 -0.03644768
black 0.051112837 0.90346968 0.535825097 2.21912811 -0.18944422
lstat -0.126706573 -0.15549426 -0.018983527 -0.04794742 -0.02923982
medv -0.185674589 0.31410850 -0.008748079 -0.13496368 0.21260332
LD6
crim -0.12706552
zn -0.35055644
indus -0.40144745
chas 0.27933856
nox -0.44219930
rm -0.56819492
age -0.24907323
dis 0.14176355
rad -0.05116781
tax 0.20960416
ptratio 0.32614059
black 0.01111638
lstat -0.25238312
medv -0.88020068
Proportion of trace:
LD1 LD2 LD3 LD4 LD5 LD6
0.5738 0.2535 0.0673 0.0566 0.0353 0.0134
# target classes as numeric
classes <- as.numeric(bonus_scaled$cluster)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Another week, another go. This is gonna be a short one, got sidetracked with actual work this week…
Starting with the setup for this week. New libraries: stringr and FactoMineR.
# knitr options: No "##" before output
knitr::opts_chunk$set(comment = '')
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(reshape2)
library(tidyr)
library(corrplot)
library(stringr)
library(FactoMineR) # Contains the "tea" dataset
This time we have a pre-wrangled dataset! See the creation script here. The original data is from the Human Development Reports by the United Nations Development Programme, see http://hdr.undp.org/en/content/human-development-index-hdi. The datasets used in the data wrangling were provided by the IODS lecturers.
Starting with the data overview, first is the classic table of columns and their meaning:
Variable | Meaning | Type | Notes |
---|---|---|---|
`edu2.ratio` | Females-to-males ratio of people with secondary or higher education | num | edu2.F / edu2.M |
`lab.ratio` | Females-to-males ratio of people in the work force | num | lab.F / lab.M |
`exp.edu.years` | Expected years in education | num | |
`exp.life.years` | Life expectancy at birth | num | |
`GNI` | Gross National Income per capita | num | |
`mat.mort.rate` | Maternal mortality rate | int | |
`adol.birth.rate` | Adolescent birth rate | num | |
`parl.F.ratio` | Ratio of females in the parliament | num | |
dataset <- read.csv("data/human.csv", header=TRUE, row.names=1)
str(dataset)
'data.frame': 155 obs. of 8 variables:
$ edu2.ratio : num 1.007 0.997 0.983 0.989 0.969 ...
$ lab.ratio : num 0.891 0.819 0.825 0.884 0.829 ...
$ exp.edu.years : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
$ exp.life.years : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
$ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
$ mat.mort.rate : int 4 6 6 5 6 7 9 28 11 8 ...
$ adol.birth.rate: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
$ parl.F.ratio : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(dataset)
edu2.ratio lab.ratio exp.edu.years exp.life.years
Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
GNI mat.mort.rate adol.birth.rate parl.F.ratio
Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
knitr::kable(head(dataset, 5))
edu2.ratio | lab.ratio | exp.edu.years | exp.life.years | GNI | mat.mort.rate | adol.birth.rate | parl.F.ratio | |
---|---|---|---|---|---|---|---|---|
Norway | 1.0072389 | 0.8908297 | 17.5 | 81.6 | 64992 | 4 | 7.8 | 39.6 |
Australia | 0.9968288 | 0.8189415 | 20.2 | 82.4 | 42261 | 6 | 12.1 | 30.5 |
Switzerland | 0.9834369 | 0.8251001 | 15.8 | 83.0 | 56431 | 6 | 1.9 | 28.5 |
Denmark | 0.9886128 | 0.8840361 | 18.7 | 80.2 | 44025 | 5 | 5.1 | 38.0 |
Netherlands | 0.9690608 | 0.8286119 | 17.9 | 81.6 | 45435 | 6 | 6.2 | 36.9 |
ggpairs(dataset)
cor(dataset) %>% corrplot()
Looking at the medians of the education and labor ratios, both are under one, meaning that in most countries more men than women are educated and in the labor force. The education ratio median is close to one, though even its 3rd quartile is below one. Maternal mortality rate and adolescent birth rate correlate positively with each other, and negatively with life expectancy, education, and GNI.
First, I’ll perform principal component analysis (PCA) using the singular value decomposition, but without scaling the dataset.
pca_unscaled <- prcomp(dataset)
pca_unscaled
Standard deviations (1, .., p=8):
[1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
[6] 1.565912e+00 1.912052e-01 1.591112e-01
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4
edu2.ratio -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
lab.ratio 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
exp.edu.years -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
exp.life.years -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
mat.mort.rate 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
adol.birth.rate 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
parl.F.ratio -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
PC5 PC6 PC7 PC8
edu2.ratio -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
lab.ratio 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
exp.edu.years 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
exp.life.years 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
mat.mort.rate 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
adol.birth.rate 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
parl.F.ratio -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
# Drawing the biplot for the first 2 PCA dimensions
biplot(pca_unscaled, choices = 1:2, cex=c(0.8,1), col=c("#66CC66", "#663399")) # purple color is "rebecca"
Yup, that’s a mess. PCA expects the variables to be on similar scales, so having a `GNI` column that is about 3-4 orders of magnitude larger than the other variables causes this.
scaled <- scale(dataset)
pca_scaled <- prcomp(scaled)
pca_scaled
Standard deviations (1, .., p=8):
[1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
[8] 0.3222406
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5
edu2.ratio -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
lab.ratio 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
exp.edu.years -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
exp.life.years -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
mat.mort.rate 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
adol.birth.rate 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
parl.F.ratio -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
PC6 PC7 PC8
edu2.ratio 0.17713316 0.05773644 0.16459453
lab.ratio -0.03500707 -0.22729927 -0.07304568
exp.edu.years -0.38606919 0.77962966 -0.05415984
exp.life.years -0.42242796 -0.43406432 0.62737008
GNI 0.11120305 -0.13711838 -0.16961173
mat.mort.rate 0.17370039 0.35380306 0.72193946
adol.birth.rate -0.76056557 -0.06897064 -0.14335186
parl.F.ratio 0.13749772 0.00568387 -0.02306476
# Drawing the biplot for the first 2 PCA dimensions
biplot(pca_scaled, choices = 1:2, cex=c(0.8,1), col=c("#66CC66", "#663399")) # purple color is "rebecca"
Now this looks a bit better. Based on this analysis, a country’s healthcare (life expectancy, maternal mortality, adolescent pregnancies), women’s education, and income level are the strongest indicators in this particular dataset. The first dimension is a women’s-education-and-healthcare axis: richer and more educated countries have better healthcare than poor ones. Or, put another way, educated women have fewer pregnancies. The second dimension is about how much women are involved in society: are they working or staying at home (for whatever reason)? Are women in positions of power?
Wrapping up this week is multiple correspondence analysis (MCA). For this analysis, we’re using the `tea` dataset from the FactoMineR library. It is a dataset of questionnaire answers regarding tea. According to the documentation, the questionnaire asked 300 individuals about their tea-drinking habits (18 questions), their perception of the product (12 questions), and personal details (4 questions). So, the data has 300 rows and 36 columns, with each individual on one row and each question in its own column.
# using `tea` dataset from FactoMineR library
data(tea)
str(tea)
'data.frame': 300 obs. of 36 variables:
$ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
$ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
$ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
$ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
$ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
$ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
$ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
$ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
$ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
$ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
$ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
$ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
$ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
$ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
$ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
$ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
$ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
$ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
$ age : int 39 45 47 23 48 21 37 36 40 37 ...
$ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
$ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
$ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
$ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
$ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
$ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
$ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
$ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
$ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
$ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
$ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
$ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
$ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
$ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
$ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
$ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
$ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
breakfast tea.time evening lunch
breakfast :144 Not.tea time:131 evening :103 lunch : 44
Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
dinner always home work
dinner : 21 always :103 home :291 Not.work:213
Not.dinner:279 Not.always:197 Not.home: 9 work : 87
tearoom friends resto pub
Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
tearoom : 58 Not.friends:104 resto : 79 pub : 63
Tea How sugar how
black : 74 alone:195 No.sugar:155 tea bag :170
Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
green : 33 milk : 63 unpackaged : 36
other: 9
where price age sex
chain store :192 p_branded : 95 Min. :15.00 F:178
chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
tea shop : 30 p_private label: 21 Median :32.00
p_unknown : 12 Mean :37.05
p_upscale : 53 3rd Qu.:48.00
p_variable :112 Max. :90.00
SPC Sport age_Q frequency
employee :59 Not.sportsman:121 15-24:92 1/day : 95
middle :40 sportsman :179 25-34:69 1 to 2/week: 44
non-worker :64 35-44:40 +2/day :127
other worker:20 45-59:61 3 to 6/week: 34
senior :35 +60 :38
student :70
workman :12
escape.exoticism spirituality healthy
escape-exoticism :142 Not.spirituality:206 healthy :210
Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
diuretic friendliness iron.absorption
diuretic :174 friendliness :242 iron absorption : 31
Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
feminine sophisticated slimming exciting
feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
relaxing effect.on.health
No.relaxing:113 effect on health : 66
relaxing :187 No.effect on health:234
tea_no_age <- dplyr::select(tea, -age)
tea_no_age %>% dplyr::select(1:9) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
tea_no_age %>% dplyr::select(10:18) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
tea_no_age %>% dplyr::select(19:27) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
tea_no_age %>% dplyr::select(28:35) %>% pivot_longer(cols=everything(), names_to="key", values_to="value") %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
tea %>% ggplot(aes(age)) + geom_density()
Looks to be an interesting dataset, though it’s a bit difficult to visualise properly. Let’s pick a few columns and perform MCA on them.
keep_cols <- c("age_Q", "sex", "frequency", "Tea", "price", "sugar", "How")
data <- dplyr::select(tea, one_of(keep_cols))
mca <- MCA(data, graph=FALSE)
mca$eig
eigenvalue percentage of variance cumulative percentage of variance
dim 1 0.25377239 9.349509 9.349509
dim 2 0.21962466 8.091435 17.440944
dim 3 0.19242745 7.089432 24.530377
dim 4 0.18126192 6.678071 31.208447
dim 5 0.17107329 6.302700 37.511147
dim 6 0.16281205 5.998339 43.509486
dim 7 0.15351935 5.655976 49.165462
dim 8 0.14892416 5.486680 54.652142
dim 9 0.13900476 5.121228 59.773370
dim 10 0.13855325 5.104593 64.877963
dim 11 0.12861410 4.738414 69.616377
dim 12 0.12434570 4.581157 74.197535
dim 13 0.12112984 4.462678 78.660213
dim 14 0.12094719 4.455949 83.116162
dim 15 0.11055570 4.073105 87.189267
dim 16 0.10387947 3.827138 91.016405
dim 17 0.08964207 3.302603 94.319008
dim 18 0.08046729 2.964584 97.283592
dim 19 0.07373107 2.716408 100.000000
mca$var$contrib
Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
15-24 7.913377e+00 10.97033963 1.708701684 1.9968789 2.07232894
25-34 4.056933e+00 7.85507712 0.071367598 12.0683981 1.93779594
35-44 1.446520e+00 0.27107440 0.355201941 4.6803402 18.16844062
45-59 4.541041e+00 0.15278159 14.827082242 0.4401140 0.03107965
+60 9.968695e+00 1.79028331 6.724952355 0.3337406 14.33152539
F 6.581611e-01 14.09573560 0.110315981 0.1234269 0.49893018
M 9.602678e-01 20.56590932 0.160952824 0.1800818 0.72794731
1/day 1.824514e+00 3.78037554 11.165695773 0.1886927 0.89022035
1 to 2/week 2.479474e+00 1.25863987 5.960546982 0.2467699 0.38723509
+2/day 3.942764e+00 5.11421794 0.104603818 0.1316144 0.45963324
3 to 6/week 4.474078e-02 0.02420935 4.765988678 0.3481716 4.75035719
black 1.515701e+01 3.18701783 1.741831467 5.1900990 0.23013408
Earl Grey 7.428548e+00 2.99324399 0.098507993 0.1065889 0.06568091
green 5.796958e-01 2.28218164 7.482241840 17.6488807 1.79066476
p_branded 7.263746e-01 1.51913249 6.766248448 0.2991279 6.29340604
p_cheap 2.045066e-01 0.35175914 0.896865300 5.3912908 12.86899210
p_private label 1.432982e+00 0.42811026 0.006741122 1.3641797 1.13413645
p_unknown 1.516061e+00 3.10008365 3.879587807 17.0230069 1.91080210
p_upscale 7.190898e+00 6.11545398 0.646959738 2.4477659 0.18756054
p_variable 6.321193e-02 1.63170789 4.905228919 0.7340771 12.34728571
No.sugar 1.012510e+01 1.95243120 3.827438813 0.5567729 1.51225785
sugar 1.082338e+01 2.08708163 4.091400111 0.5951710 1.61655149
alone 1.149679e-01 2.86953762 0.063490448 5.8662485 2.90514235
lemon 3.009857e-01 1.93018383 6.575250633 0.8332322 12.64607902
milk 9.033844e-04 3.64282402 5.882342043 18.8400640 0.22131975
other 6.498879e+00 0.03060716 7.180455442 2.3652654 0.01449295
plot(mca, invisible=c("ind"), habillage="quali")
So I decided to go with the person’s age and gender, how often they drink tea, what kind of tea they drink, how expensive their tea is, and whether they add something (sugar, milk) to it. So, putting those in and taking the model out, let’s see… Okay, so my set of variables wasn’t very good. The first two dimensions explain only 17.4% of the (cumulative) variance! It is probably no wonder the biplot looks like a scatterplot where the dots are all over the place.
This week’s diary has been really short compared to other weeks… Well, such is life, there’s so much to do and so little time.
Hoo boy. It’s the last week of the IODS course, and this time we’re analyzing data that includes non-independent measurements (e.g. repeated measurements of a test subject over time). Let’s start with the libraries and data exploration!
# Setup block
# Remove the '##' from each line of R output blocks
knitr::opts_chunk$set(comment = '')
# Libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
library(stringr)
library(lme4)
library(reshape2)
This week’s assignment has a twist! We are going to repeat the examples from the Vehkalahti and Everitt (2019) book Multivariate Analysis for The Behavioral Sciences, chapters 8 and 9, except with the data sets swapped between the chapters. I’ve wrangled the data into a “long table” format in the corresponding wrangling exercise. You can find the code here.
So first we have the `rats` data introduced in Ch. 9. The authors describe the data as follows:
[W]e shall use some data from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
rats <- read.csv("data/rats.csv")
rats$ID <- as.factor(rats$ID)
rats$Group <- as.factor(rats$Group)
str(rats)
'data.frame': 176 obs. of 5 variables:
$ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
$ WD : chr "WD1" "WD1" "WD1" "WD1" ...
$ weight: int 240 225 245 260 255 260 275 245 410 405 ...
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(rats)
ID Group WD weight time
1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
3 : 11 3:44 Mode :character Median :344.5 Median :36.00
4 : 11 Mean :384.5 Mean :33.55
5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
6 : 11 Max. :628.0 Max. :64.00
(Other):110
There are 16 rats in total: 8 in group 1, 4 in group 2, and 4 in group 3. Each rat has a unique ID. The `time` column is the measurement day.
Next, we have the `bprs` data introduced in Ch. 8. The authors’ description follows:
Table 8.1 taken from Davis (2002). Here 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
bprs <- read.csv("data/bprs.csv")
# Create ID column before factor conversion
# treatment "1" and subject "1" -> 101
# treatment "2" and subject "20" -> 220
bprs$ID <- as.factor(bprs$treatment * 100 + bprs$subject)
# Categorical integers to categories...
bprs$treatment <- as.factor(bprs$treatment)
bprs$subject <- as.factor(bprs$subject)
str(bprs)
'data.frame': 360 obs. of 6 variables:
$ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ weeks : chr "week0" "week0" "week0" "week0" ...
$ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
$ week : int 0 0 0 0 0 0 0 0 0 0 ...
$ ID : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(bprs)
treatment subject weeks bprs week
1:180 1 : 18 Length:360 Min. :18.00 Min. :0
2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
3 : 18 Mode :character Median :35.00 Median :4
4 : 18 Mean :37.66 Mean :4
5 : 18 3rd Qu.:43.00 3rd Qu.:6
6 : 18 Max. :95.00 Max. :8
(Other):252
ID
101 : 9
102 : 9
103 : 9
104 : 9
105 : 9
106 : 9
(Other):306
# Check that there are 40 unique ID values
unique(bprs$ID)
[1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[20] 120 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218
[39] 219 220
40 Levels: 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 ... 220
So, that’s 40 subjects divided evenly into two groups. This time, however, the subject IDs are not unique within the data; they are only unique within a treatment group. That will cause me some issues later on (partly because I suck at R), so I’ve created a unique `ID` column here. In hindsight, this should’ve been done at the data wrangling stage, though I don’t recall it being in the instructions.
So here is a graphical overview of the time series of rats’ weight development. Group 1 seems to consist of smaller rats than the other two groups. Also there is one huge rat in group 2, which will skew the mean value for that group.
ggplot(rats, aes(x = time, y = weight, linetype= ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, 2)) +
facet_grid(. ~ Group, labeller=label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(rats$weight), max(rats$weight)))
Next up is “tracking”, i.e. plotting the time series of standardized weights. That’ll show how a rat gains weight compared to the others. This feels like anomaly detection in atmospheric dynamics, or de-trending. Though those things are a bit different – or at least the questions they try to answer are different. I think. Anyways, now the lines are not always increasing, so the relative development is easier to see.
rats_tracked <- rats %>%
group_by(time) %>%
mutate(std.weight = (weight - mean(weight))/sd(weight)) %>%
ungroup()
ggplot(rats_tracked, aes(x = time, y = std.weight, linetype= ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, 2)) +
facet_grid(. ~ Group, labeller=label_both) +
theme(legend.position = "none") +
scale_y_continuous("Standardized weight")
This is a way of visualising the overall development of each group and estimating the typical range of values for a given group and time. I also noticed that on the course’s discussion page there was some talk about the standard error calculation: the denominator `sqrt(n)` was presented incorrectly in the Datacamp tutorials (there, `n` was the number of unique time instants). So I opted to use the number of members (rows) in a given group at a given time instead.
ratss <- rats %>%
group_by(Group, time) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(n()), .groups="keep") %>%
ungroup()
# Glimpse the data
glimpse(ratss)
Rows: 33
Columns: 4
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
$ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
$ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.8…
# Plot the mean profiles
ggplot(ratss, aes(x = time, y = mean, linetype = Group, shape="1", color=Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3, show.legend = F) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
Next up is a boxplot. I think it shows roughly the same thing as the previous figure, but the emphasis is on the quartiles (interquartile range etc.) instead of the standard error. Of course, if the distributions are skewed, the mean and median values differ: in Group 2 the outlier rat pulls the mean up quite a bit.
ggplot(rats, aes(x=as.factor(time), y=weight, color=Group)) +
geom_boxplot()
This plot should give me a quick overview of the differences between the groups, and show whether there are any outliers.
# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline day 1).
rats_sm <- rats %>%
filter(time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(weight), .groups="keep" ) %>%
ungroup()
# Glimpse the data
glimpse(rats_sm)
Rows: 16
Columns: 3
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
$ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(rats_sm, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), time > 1")
Okay so, group 2 has an outlier. Remove it. Though I’m removing 1 rat out of a total of 4, meaning that the statistics start to get quite dubious… `n=3` doesn’t sound like a large enough sample.
# Create a new dataset by filtering out the outlier, then adjust the ggplot code to draw the plot again with the new data
rats_sm1 <- rats_sm %>% filter(mean < 575)
glimpse(rats_sm1)
Rows: 15
Columns: 3
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
$ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(rats_sm1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), time > 1")
A t-test compares exactly two groups at a time. Luckily, three groups equate to three pairs, so I can run the test for each pair with minimal effort (copy-paste). Based on the results it is very clear that Group 1 differs from Groups 2 and 3. Between Groups 2 and 3 the difference is not as clear, but it’s there.
rats_sm1_noG3 <- rats_sm1 %>% filter(Group != 3)
t.test(mean ~ Group, data = rats_sm1_noG3, var.equal = TRUE)
Two Sample t-test
data: mean by Group
t = -25.399, df = 9, p-value = 1.094e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-204.0633 -170.6867
sample estimates:
mean in group 1 mean in group 2
265.025 452.400
rats_sm1_noG2 <- rats_sm1 %>% filter(Group != 2)
t.test(mean ~ Group, data = rats_sm1_noG2, var.equal = TRUE)
Two Sample t-test
data: mean by Group
t = -27.824, df = 10, p-value = 8.345e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-283.4943 -241.4557
sample estimates:
mean in group 1 mean in group 3
265.025 527.500
rats_sm1_noG1 <- rats_sm1 %>% filter(Group != 1)
t.test(mean ~ Group, data = rats_sm1_noG1, var.equal = TRUE)
Two Sample t-test
data: mean by Group
t = -5.632, df = 5, p-value = 0.002446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-109.37769 -40.82231
sample estimates:
mean in group 2 mean in group 3
452.4 527.5
For the ANOVA I need a baseline for each row, and that is easy to pick up from the original data. This works because `rats_sm` contains only one row for each rat (it is a summary of all measurements) and the original wide data has one row for each rat, too, in the same order. Otherwise I’d have to join by the ID columns; that would arguably be clearer than this, but (probably) slower…
original_rats <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
header=TRUE, sep="\t")
rats_sm2 <- rats_sm %>%
mutate(baseline = original_rats$WD1)
fit <- lm(mean ~ baseline + Group, data = rats_sm2)
anova(fit)
Analysis of Variance Table
Response: mean
Df Sum Sq Mean Sq F value Pr(>F)
baseline 1 253625 253625 1859.8201 1.57e-14 ***
Group 2 879 439 3.2219 0.07586 .
Residuals 12 1636 136
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, the interpretation for this set: the baseline (= starting weight) is a significant contributor to the later weight. Belonging to a certain group has some (borderline) significance, probably driven by whether the rat belongs to Group 1 or not.
Lastly, I’m omitting Table 8.6. It doesn’t use the BPRS data in the original, so strictly speaking there is no data to swap with the RATS dataset. It also demonstrates how to deal with missing data, so performing the analysis on a complete dataset wouldn’t be very instructive (of course I could randomly remove some of the data and work that way, but I’m being lazy here).
Moving on to the BPRS data and the Chapter 9 examples. I might take some liberties here and there in visualising the data instead of replicating the figures 100% faithfully, mainly because the data is different and I might need to adjust the visuals a bit for readability.
Here the two treatment groups overlap much more than the rat groups did, and we have more individuals as well!
ggplot(bprs, aes(x=week, y=bprs, color=treatment)) +
geom_text(aes(label=treatment), show.legend=FALSE)
These results should be total nonsense. Spoilers: they are. The results “show” that the time dimension has a really strong effect on the measures. Well yeah, about that independence… Linear regression assumes independent observations, i.e. no autocorrelation or codependency. Shame that assumption is totally untrue when one measures the same subject(s) over time…
# create a regression model bprs_reg
bprs_reg <- lm(bprs ~ week + treatment, data = bprs)
# print out a summary of the model
summary(bprs_reg)
Call:
lm(formula = bprs ~ week + treatment, data = bprs)
Residuals:
Min 1Q Median 3Q Max
-22.454 -8.965 -3.196 7.002 50.244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.4539 1.3670 33.982 <2e-16 ***
week -2.2704 0.2524 -8.995 <2e-16 ***
treatment2 0.5722 1.3034 0.439 0.661
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.37 on 357 degrees of freedom
Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Don’t tell anyone, but I’m going to cheat a bit here and plot an extra figure or two. Because I want to see what this faulty analysis looks like!
fit <- fitted(bprs_reg) # fitted values (not actually used below; geom_smooth refits the model)
ggplot(bprs, aes(x=week, y=bprs)) +
geom_line(aes(group=subject), color="gray75") +
geom_point() +
geom_smooth(method="lm") +
labs(title="BPRS by week (with regression) for both treatment groups") +
facet_wrap(vars(treatment))
Since `treatment` is a categorical variable, I can plot the linear regression this way. It helps that there are only two categories in `treatment`, though. I included the time series as lines, too, to highlight why this sort of regression modeling doesn’t work here. Okay, sure, it shows the downward trend in both groups, but for analysing trends you’d want to account for the “repeated measures” nature of the data anyways4.
Here’s where the custom `ID` column comes in handy. I need to be able to differentiate between the subjects with the same `subject` value but a different `treatment` value. Otherwise this plot would become confused, poor thing.
# Check that the ID column is included
str(bprs)
'data.frame': 360 obs. of 6 variables:
$ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ weeks : chr "week0" "week0" "week0" "week0" ...
$ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
$ week : int 0 0 0 0 0 0 0 0 0 0 ...
$ ID : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
ggplot(bprs, aes(x = week, y = bprs, group=ID)) +
geom_line(aes(linetype=treatment)) +
scale_y_continuous(name = "BPRS score") +
theme(legend.position = "top")
This next plot, the scatterplot matrix, requires the data in the original wide format. Maybe I could pivot the data back… Yep, that works. And yes, the scatters form rather nice lines, especially between two consecutive weeks.
bprs_w <- tidyr::pivot_wider(bprs, names_from=weeks, values_from=bprs,
id_cols=c("treatment", "subject", "ID")) %>%
as.data.frame
str(bprs_w)
'data.frame': 40 obs. of 12 variables:
$ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : Factor w/ 40 levels "101","102","103",..: 1 2 3 4 5 6 7 8 9 10 ...
$ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
$ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
$ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
$ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
$ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
$ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
$ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
$ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
$ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
plot(bprs_w[-1:-3])
But now we will loosen the requirement of totally independent measures. First, the random intercept allows the fitted intercept to differ between individuals.
# Create a random intercept model
bprs_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = bprs, REML = FALSE)
# Print the summary of the model
summary(bprs_ref)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week + treatment + (1 | subject)
Data: bprs
AIC BIC logLik deviance df.resid
2748.7 2768.1 -1369.4 2738.7 355
Scaled residuals:
Min 1Q Median 3Q Max
-3.0481 -0.6749 -0.1361 0.4813 3.4855
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 47.41 6.885
Residual 104.21 10.208
Number of obs: 360, groups: subject, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 46.4539 1.9090 24.334
week -2.2704 0.2084 -10.896
treatment2 0.5722 1.0761 0.532
Correlation of Fixed Effects:
(Intr) week
week -0.437
treatment2 -0.282 0.000
Okay, so the fixed-effect estimates stay the same, but their standard errors change now that the subject-level variation is accounted for. The `week` effect is still strong, and `treatment` still isn’t significant.
Next, let’s introduce the random slope to the model. And the winning numbers are… Huh? Not much of a change.
# create a random intercept and random slope model
bprs_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week + treatment + (week | subject)
Data: bprs
AIC BIC logLik deviance df.resid
2745.4 2772.6 -1365.7 2731.4 353
Scaled residuals:
Min 1Q Median 3Q Max
-2.8919 -0.6194 -0.0691 0.5531 3.7976
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 64.8222 8.0512
week 0.9609 0.9802 -0.51
Residual 97.4305 9.8707
Number of obs: 360, groups: subject, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 46.4539 2.1052 22.066
week -2.2704 0.2977 -7.626
treatment2 0.5722 1.0405 0.550
Correlation of Fixed Effects:
(Intr) week
week -0.582
treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(bprs_ref1, bprs_ref)
Data: bprs
Models:
bprs_ref: bprs ~ week + treatment + (1 | subject)
bprs_ref1: bprs ~ week + treatment + (week | subject)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref 5 2748.7 2768.1 -1369.4 2738.7
bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, knowing there are duplicate values in the `subject` column, I’m not so sure that fit did what we wanted here. Let’s do this again using `ID` instead of `subject`.
# Create a random intercept model
bprs_ref_id <- lmer(bprs ~ week + treatment + (1 | ID), data = bprs, REML = FALSE)
# Print the summary of the model
summary(bprs_ref_id)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week + treatment + (1 | ID)
Data: bprs
AIC BIC logLik deviance df.resid
2582.9 2602.3 -1286.5 2572.9 355
Scaled residuals:
Min 1Q Median 3Q Max
-2.27506 -0.59909 -0.06104 0.44226 3.15835
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 97.39 9.869
Residual 54.23 7.364
Number of obs: 360, groups: ID, 40
Fixed effects:
Estimate Std. Error t value
(Intercept) 46.4539 2.3521 19.750
week -2.2704 0.1503 -15.104
treatment2 0.5722 3.2159 0.178
Correlation of Fixed Effects:
(Intr) week
week -0.256
treatment2 -0.684 0.000
# create a random intercept and random slope model
bprs_ref_id1 <- lmer(bprs ~ week + treatment + (week | ID), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref_id1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week + treatment + (week | ID)
Data: bprs
AIC BIC logLik deviance df.resid
2523.2 2550.4 -1254.6 2509.2 353
Scaled residuals:
Min 1Q Median 3Q Max
-2.4655 -0.5150 -0.0920 0.4347 3.7353
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 167.827 12.955
week 2.331 1.527 -0.67
Residual 36.747 6.062
Number of obs: 360, groups: ID, 40
Fixed effects:
Estimate Std. Error t value
(Intercept) 45.9830 2.6470 17.372
week -2.2704 0.2713 -8.370
treatment2 1.5139 3.1392 0.482
Correlation of Fixed Effects:
(Intr) week
week -0.545
treatment2 -0.593 0.000
# perform an ANOVA test on the two models
anova(bprs_ref_id1, bprs_ref_id)
Data: bprs
Models:
bprs_ref_id: bprs ~ week + treatment + (1 | ID)
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id 5 2582.9 2602.3 -1286.5 2572.9
bprs_ref_id1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Well look at that! Using `ID` instead of `subject` made quite a difference! While the previous ANOVA result had “only” one star of significance for the random intercept and slope model, this one here has three stars!
… I have no idea what’s going on here. Did the baseline get really important? Or is it that I told the model “hey, there are 40 dudes instead of 20”, so each random effect is now fitted to one person’s 9 measurements instead of the pooled 18 measurements of two unrelated people? (With `subject`, the two different people who share a subject number, one from each treatment group, get lumped into a single random effect.) Let me ask a question…
# perform an ANOVA test on the two models
anova(bprs_ref_id1, bprs_ref1)
Data: bprs
Models:
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
bprs_ref1: bprs ~ week + treatment + (week | subject)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id1 7 2523.2 2550.4 -1254.6 2509.2
bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4 0 0
Yeah, I’m not sure what I expected. The two models aren’t nested and have the same number of parameters, hence the zero degrees of freedom: the likelihood-ratio test simply can’t compare them. The AIC column still favors the `ID` version by a wide margin, though.
Then there is a third way of building a model: we allow an interaction between the two variables. Notice the `*` (star) instead of the `+` (plus) symbol in the model definition!
# create a random intercept and random slope model with the interaction
bprs_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week * treatment + (week | subject)
Data: bprs
AIC BIC logLik deviance df.resid
2744.3 2775.4 -1364.1 2728.3 352
Scaled residuals:
Min 1Q Median 3Q Max
-3.0512 -0.6271 -0.0768 0.5288 3.9260
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 64.9964 8.0620
week 0.9687 0.9842 -0.51
Residual 96.4707 9.8220
Number of obs: 360, groups: subject, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 47.8856 2.2521 21.262
week -2.6283 0.3589 -7.323
treatment2 -2.2911 1.9090 -1.200
week:treatment2 0.7158 0.4010 1.785
Correlation of Fixed Effects:
(Intr) week trtmn2
week -0.650
treatment2 -0.424 0.469
wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(bprs_ref2, bprs_ref1)
Data: bprs
Models:
bprs_ref1: bprs ~ week + treatment + (week | subject)
bprs_ref2: bprs ~ week * treatment + (week | subject)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4
bprs_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So it looks like that change might bring a little more predictive power to the model. Or at least the difference between the models is borderline significant. Let’s do the same with `ID` this time.
# create a random intercept and random slope model with the interaction
bprs_ref_id2 <- lmer(bprs ~ week * treatment + (week | ID), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref_id2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: bprs ~ week * treatment + (week | ID)
Data: bprs
AIC BIC logLik deviance df.resid
2523.5 2554.5 -1253.7 2507.5 352
Scaled residuals:
Min 1Q Median 3Q Max
-2.4747 -0.5256 -0.0866 0.4435 3.7884
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 164.204 12.814
week 2.203 1.484 -0.66
Residual 36.748 6.062
Number of obs: 360, groups: ID, 40
Fixed effects:
Estimate Std. Error t value
(Intercept) 47.8856 2.9840 16.047
week -2.6283 0.3752 -7.006
treatment2 -2.2911 4.2200 -0.543
week:treatment2 0.7158 0.5306 1.349
Correlation of Fixed Effects:
(Intr) week trtmn2
week -0.668
treatment2 -0.707 0.473
wek:trtmnt2 0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(bprs_ref_id2, bprs_ref_id1)
Data: bprs
Models:
bprs_ref_id1: bprs ~ week + treatment + (week | ID)
bprs_ref_id2: bprs ~ week * treatment + (week | ID)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id1 7 2523.2 2550.4 -1254.6 2509.2
bprs_ref_id2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
Well. Maybe the model already fits so well that there isn’t enough leftover variance for the interaction term to explain? Let’s see how the `subject` versus `ID` question works out.
anova(bprs_ref_id2, bprs_ref2)
Data: bprs
Models:
bprs_ref_id2: bprs ~ week * treatment + (week | ID)
bprs_ref2: bprs ~ week * treatment + (week | subject)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
bprs_ref_id2 8 2523.5 2554.6 -1253.7 2507.5
bprs_ref2 8 2744.3 2775.4 -1364.1 2728.3 0 0
Zero degrees of freedom. Yeah, that’s a dead end.
Finally, there was a side-by-side plot showing the original and fitted lines. Well, here you go, dear reader! As the authors of MABS said, the regression lines fit the original data quite nicely. Granted, the BPRS measures have more variance than the rats’ weights, but overall the figures are in agreement. Notably, the rats’ weights were more or less monotonically increasing, while BPRS goes up and down (though the main trend is downward for most individuals).
# Create a vector of the fitted values
# Making a copy so I won't accidentally use this data somewhere else :)
bprs2 <- bprs
bprs2$Fitted <- fitted(bprs_ref2, bprs)
# Rename original "bprs" column for visualisation purposes
colnames(bprs2)[4] <- "Original"
# Melting data to "melted" (which is probably grammatically incorrect form of 'melt')
melted <- melt(bprs2,
id.vars=c("treatment","subject","ID", "week"),
measure.vars=c("Original", "Fitted"),
variable.name = "key",
value.name = "BPRS")
ggplot(melted, aes(x=week, y=BPRS, group=ID)) +
geom_line(aes(linetype=treatment)) +
facet_wrap(vars(key)) +
theme(legend.position="top") +
xlab("Week")
So, this was it! That was the last chapter of this course and with it this course diary comes to an end. In a proper book-ends-style, I’m going to write a bit about my thoughts about the course.
First of all, I had fun! While the course had a lot of work and strict deadlines, I still enjoyed writing this diary. Partially because I decided to write in my own style – with the casual remarks and all. I wanted to make the writing fun for me and reading enjoyable to the readers. Based on the peer reviews I got along the way, I’d say that I was successful!
Second, the R language was a new experience for me. The tasks weren’t too difficult especially when the Datacamp exercises pretty much solve nearly everything on the coding side. The hardest part was overcoming my initial way of solving things – R is not Python, and things work differently here! Once I realised this the coding became smoother.
Third, I learned a lot about data analysis! I hope I can use this new knowledge as a basis when I start delving deeper into my research (which will inevitably require a better understanding of statistics).
Finally, thanks to Kimmo Vehkalahti, Reijo Sund, and the teaching assistants for organising the course, and to you, dear reader, for reading my ramblings.
Petteri signing out.
I’m using the `knitr::kable` function here to output prettier tables. It doesn’t affect the data values in any way; it is a purely aesthetic helper function.↩︎
For factor variables the interpretation is “how the probability changes compared to a baseline class when our estimator moves from baseline to this class”.↩︎
I also got `k=3` as the optimal value when I accidentally ran the code without setting the seed beforehand. Whoops. Well, like I said, the results of k-means depend on the initial (random) points.↩︎
I’m having trouble imagining what sort of measures would be time series and simultaneously not repeated measures, as “time series” already implies repetition. Perhaps sequences of truly random noise, such as atmospheric noise, could be considered such a measure?↩︎