This report uses a real dataset from the UCI Machine Learning Repository that contains survey responses from 396 students in a secondary-school math class. Using stepwise selection procedures, data visualization, and classification methods, we select the multiple regression model that best predicts final grades.
install.packages("rpart")
install.packages("rpart.plot")
install.packages("plyr")
install.packages("ggplot2")
install.packages("MASS")
library(rpart.plot)
library(rpart)
library(plyr)
library(ggplot2)
library(MASS)
d1 = read.csv("~/Documents/4th Year/STA 141A/student-mat.csv")
# Drop G1 and G2 since we are only observing G3 (final grade)
d1 = d1[,-c(31:32)]
fit = lm(G3~.,d1)
step = stepAIC(fit, direction="both")
# step$anova
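The stepwise search above can be illustrated on a small simulated data set. This is a minimal sketch rather than the report's analysis: the data frame `toy` and its columns `x1`, `x2`, `noise`, and `y` are hypothetical, and it uses `step()` from base R's stats package, which runs the same kind of AIC-based search on a linear model as `stepAIC()`.

```r
# Hypothetical data: y depends on x1 and x2, but not on noise
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 1.0 * toy$x2 + rnorm(n)

# Start from the full model and let the search add or drop terms by AIC
full <- lm(y ~ ., data = toy)
selected <- step(full, direction = "both", trace = 0)

formula(selected)   # terms kept by the search
selected$anova      # record of the steps taken and the AIC at each step
```

Setting `trace = 0` only silences the step-by-step printout; the record of the search remains available in the `anova` component.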
This model includes the significant variables identified by the stepwise procedure performed earlier, with the addition of two alcohol variables: weekday alcohol consumption (Dalc) and weekend alcohol consumption (Walc). We include these two variables because we are also interested in studying the effect of substance use on final grades.
mod = lm(G3~sex + age + famsize + Medu + Mjob + studytime + failures +
schoolsup + famsup + romantic + freetime + goout + absences + Dalc + Walc, data=d1)
summary(mod)
Create a data frame to compare parents' jobs and education against student grades.
student = data.frame(MotherJob=d1$Mjob,
FatherJob=d1$Fjob,
MotherEducation = d1$Medu,
FatherEducation = d1$Fedu, G3=d1$G3)
set.seed(1)
Create training data (around 75% of rows).
idx.train = sample(1:nrow(student), 0.75 * nrow(student))
student.tree = rpart(G3~MotherJob+FatherJob+MotherEducation+FatherEducation, data=student[idx.train, ], method='class')
crossval = plotcp(student.tree)
plot(student.tree)
text(student.tree, pretty=0)
The graph below analyzes the determining factors for students who enter college.
tree = prp(student.tree)
student.pred = predict(student.tree, student[-idx.train, ], type = 'class')
To evaluate the classification method, tabulate the predicted grades against the actual grades in the test set.
tree.con = table(student.pred, student[-idx.train, ]$G3)
tree.con
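A confusion table like the one above is often summarized by a single accuracy figure. The sketch below shows one way to compute it; the vectors `pred` and `actual` are hypothetical stand-ins for `student.pred` and the held-out `G3` values. Because the tree treats each grade as its own class, the table need not be square, so the sketch compares labels directly instead of summing a diagonal.

```r
# Hypothetical actual and predicted grades for six held-out students
actual <- c(10, 11, 10, 15, 0, 10)
pred   <- factor(c(10, 10, 10, 15, 10, 11), levels = c(10, 11, 15))

# Rows are predicted classes, columns are the grades actually observed
conf <- table(predicted = pred, actual = actual)
conf

# Share of students whose predicted grade equals the actual grade
accuracy <- mean(as.character(pred) == as.character(actual))
accuracy
```

With as many classes as there are distinct grades, exact-match accuracy is a strict measure; grouping grades into bands before fitting would be a different design choice.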
Subset the data using the variables above; 'd2' is our final dataset.
d2 = d1[,c(2,5,7,14,15,16,17,23,25,26,27,28,30,31)]
grade_count = count(d2$G3) #shows tally of grades
grade_count
grade_frequencies = ggplot(d2, aes(G3)) + geom_bar(color ="black", fill = "green", alpha = 0.7) +
labs(title = "Histogram of Grades") + ylim(c(0, 60)) + labs(x = "Final Grades", y = "Count")
grade_frequencies
The histogram above shows the frequency of final grades among students in the math class. The distribution is approximately normal, with a slight skew to the right. There is an outlying group at a final grade (G3) of 0. The highest frequency occurs at a final grade of 10.
walc_g3 = ggplot(d2, aes(G3, colour=factor(Walc))) + geom_density(adjust=2, alpha = 0.5, size = 1.5) +
ggtitle("PDF of Final Grade by Weekend Alcohol Consumption") + labs(x = "Final Grade")
walc_g3
The PDF of final grades above is split by the weekend alcohol consumption variable (Walc). The graph shows no clear relationship. This challenges the conventional thought that student alcohol consumption is negatively related to students' grades.
ggplot(d2, aes(Dalc, G3)) + geom_bar(aes(fill = factor(Walc)), position = "dodge", stat="identity") + ggtitle("Weekend Drinking vs. Weekday Drinking")
The bar chart above shows the relationship between weekend alcohol consumption and weekday alcohol consumption with respect to final grades. The highest frequency is among students who drink once or zero times during the week. We can also see that students who drink five or more times during the week also drink five or more times during the weekend, and should expect to receive lower final grades.
cor(d2$goout, d2$Walc)
cor(d2$goout, d2$Dalc)
out_dalc = ggplot(d2, aes(Dalc, colour=factor(goout))) + geom_density(adjust=2, alpha = 0.5, size = 1) +
ggtitle("PDF of Weekday Alcohol Consumption by Going Out") + labs(x = "Weekday Alcohol Consumption")
out_dalc
The plot above shows the PDF of weekday alcohol consumption split by the number of times a student goes out (goout). It shows that students are less likely to report higher rates of drinking during the week. Students who report going out more are still more likely to report drinking more during the week than those who report going out less.
A similar PDF is produced when analyzing weekend alcohol consumption split by the number of times a student goes out. From these two PDFs, we can expect students' alcohol consumption to be mirrored by how frequently they go out. We should also treat this as a potential partial indicator of the effect of alcohol consumption on final grade.
sex_study = ggplot(d2, aes(studytime, colour=factor(sex))) + geom_density(adjust=2, alpha = 0.5, size = 1) +
ggtitle("PDF of Gender and Study Time") + labs(x = "Study Time")
sex_study
When analyzing the amount of time spent studying between genders, we see that female students spend more time studying on average than male students do.
Gender = as.character(d2$sex)
Romantic = as.character(d2$romantic)
ggplot(d2, aes(x=Romantic, y=G3, fill=Gender)) + geom_boxplot() + labs(title="Boxplot between Gender and Romantic Relationship")
The boxplot above separates students by romantic relationship status and gender and plots the distribution of final grades. It seems that, on average, male and female students who are not in a romantic relationship do the same in terms of final grades. Among single students, the spread in the top 25% is larger for males than for females. Note that for students in relationships, the average of male students is the same, with slight changes in the spread. It is interesting to see that female students who are in a relationship have lower final grades on average than female students who are not in a relationship.
ggplot(d2, aes(failures)) + ggtitle("Grade by Failures") + geom_histogram(aes(fill=factor(G3)),binwidth=1, position="fill")
The above histogram shows the frequency of failures for students by their final grades. It depicts a relatively even distribution of the probability of receiving grades across the whole spectrum. If a student has failed three classes, he or she has an extremely low probability of receiving a final grade of 20 in this math class. This histogram supports the widely accepted notion that failing a class decreases the likelihood of receiving a high grade.
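The filled bars in a `position="fill"` plot are conditional proportions: within each number of failures, the share of students at each grade. A minimal sketch of the same calculation as a table follows; the data frame `toy` and its two grade bands are hypothetical and much coarser than the real `G3` values.

```r
# Hypothetical data: number of past failures and a coarse grade band
toy <- data.frame(
  failures = c(0, 0, 0, 0, 1, 1, 3, 3),
  grade    = c("high", "high", "low", "high", "low", "high", "low", "low")
)

counts <- table(failures = toy$failures, grade = toy$grade)

# Row-wise proportions: each row sums to 1, like one filled bar in the plot
props <- prop.table(counts, margin = 1)
props
```

Reading across a row gives the estimated probability of each grade band for students with that number of failures.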
The frequency of absences among students is shown below.
abs_hist = ggplot(d2, aes(absences)) + geom_histogram(fill = "green", alpha = 0.7, color = "black") +
labs(title = "Histogram of Absences", x = "Absences")
abs_hist
abs_bar <- ggplot(d2,aes(x= Gender,y = absences,group = Gender)) + geom_boxplot(aes(fill = Gender)) +
facet_wrap(~cut(G3,4))
abs_bar
# stat_binhex() requires the hexbin package to be installed
abs_hex <- ggplot(d2, aes(absences, G3)) + stat_binhex(colour="grey") + labs(title = "Hexbin Absences vs. Final Grade") +
labs(y = "Final Grade")
abs_hex
The hexbin plot above shows how absences cluster in relation to the final grade. We see a much higher concentration of students with fewer absences, which supports the earlier histogram. Additionally, the hexbin plot shows that the count of students by final grade is not significant across the number of absences (as shown by the shade of the blue data points).
ggplot(d2, aes(G3, fill = factor(famsup), colour = factor(famsup))) + geom_density(alpha = 0.1) + ggtitle("PDF Grade by Family Support")
ggplot(d2, aes(G3, fill=factor(schoolsup), colour = factor(schoolsup))) + geom_density(alpha = 0.1) + ggtitle("PDF Grade by School Support")
The two PDFs above, showing final grades by family support and by school support, lead to similar conclusions. They imply that students who receive support from school or their families have a higher probability of receiving lower final grades. This is most likely due to selection bias, in that students who require support are most likely struggling in the class and thus in need of this additional support. We cannot interpret support to mean a higher likelihood of receiving a higher final grade.
ggplot(d2, aes(G3)) + ggtitle("Grade by Family Size") + geom_density(aes(fill=factor(famsize)), position="fill")
The low-score side is due to the gap for high scores and the higher likelihood of being a single child. The LE3 category (three or fewer family members) does not necessarily imply being an only child, since this could be a family with a single parent and two children. Thus, this provides minimal evidence of the significance of family size in determining final grade in a math class.
In examining all variables selected from the stepwise procedure as well as adding in the alcohol consumption variables, the final selected model is as follows,
\begin{equation} \mathrm{G3} = \beta_0 + \beta_1 \mathrm{Sex} + \beta_2 \mathrm{Age} + \beta_3 \mathrm{FamSize} + \beta_4 \mathrm{MomEdu} + \beta_5 \mathrm{MomJob} + \beta_6 \mathrm{StudyTime} + \beta_7 \mathrm{Failures} + \beta_8 \mathrm{SchoolSup} + \beta_9 \mathrm{FamSup} + \beta_{10} \mathrm{Romantic} + \beta_{11} \mathrm{FreeTime} + \beta_{12} \mathrm{GoOut} + \beta_{13} \mathrm{Absences} + \beta_{14} \mathrm{DayAlc} + \beta_{15} \mathrm{WeekendAlc} \end{equation}
with the understanding that G3 denotes final grade and $\beta_0$ = 13.67, $\beta_1$ = 0.96, $\beta_2$ = -0.28. All parameter estimates are shown in the earlier summary of our model (mod). The more significant variables in this model are the number of class failures with a p-value of 1.96e-09, the number of days going out with a p-value of 0.005, and the amount of time spent studying with a p-value of 0.0320. This model has the lowest AIC value of 2249.525 as stated earlier.
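The p-values and AIC quoted above are the kind of quantities that can be read off a fitted `lm` object. The following is a minimal sketch on simulated data, not a reproduction of `mod`: the data frame `toy`, its columns, and the coefficients used to generate it are all hypothetical.

```r
# Hypothetical data loosely shaped like a grade model
set.seed(1)
n <- 100
toy <- data.frame(study = rnorm(n), failures = rpois(n, 0.5))
toy$grade <- 10 + 0.6 * toy$study - 1.8 * toy$failures + rnorm(n, sd = 3)

toy_fit <- lm(grade ~ study + failures, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(toy_fit))
p_values <- coef_table[, "Pr(>|t|)"]
sort(p_values)      # smallest p-values first

AIC(toy_fit)        # AIC computed from the full log-likelihood
```

Note that `AIC()` and the AIC printed during a `stepAIC()` search differ by an additive constant for linear models, so values from the two should only be compared within the same function.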
As expected, this model’s negative coefficients suggest a negative relationship between the number of times going out and final grades. For each unit increase in the going out value, there will be a decrease of 0.545 in final grade, ceteris paribus. Similarly, for each class failure, there will be a decrease of 1.86 in final grade, all else held constant. With a positive coefficient, every unit increase in study time should result in an increase of 0.571 in final grade. Note that the alcohol consumption variables, weekend and weekday, were not selected by the stepwise procedure. Considering their relationship to going out, it seems that the going out variable alone is sufficient to influence final grade, without adding an alcohol consumption variable.
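The ceteris paribus reading of these coefficients can be worked through numerically. The sketch below uses the three estimates quoted above (-0.545 for going out, -1.86 for failures, 0.571 for study time); the one-unit changes in `delta` are a hypothetical scenario chosen for illustration.

```r
# Estimates quoted in the text above
est <- c(goout = -0.545, failures = -1.86, studytime = 0.571)

# Hypothetical scenario: each predictor rises by one unit,
# with every other variable in the model held constant
delta <- c(goout = 1, failures = 1, studytime = 1)

# In a linear model the predicted change is the sum of coefficient * change
change_in_G3 <- sum(est * delta[names(est)])
change_in_G3
```

Setting an entry of `delta` to 0 isolates the effect of the remaining predictors, which is what holding a variable constant means in this model.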
Improvements on this final model would include looking at two-factor and three-factor interactions among the significant variables. Given the broad instructions for this report and the emphasis on diagnostics through homework two and the package “ggplot2”, we opted to focus on analyzing relationships among variables rather than on finding the best model. It would be interesting to visualize the other variables not selected by the originally implemented stepwise function. Additionally, given the high number of provided variables, it would also be interesting to create a model predicting alcohol consumption using logistic regression or other classification methods as taught in class.
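As a rough illustration of the first suggested improvement, a two-factor interaction can be added with the `*` operator in a model formula and compared against the additive model. This is a sketch under stated assumptions: the data frame `toy`, the predictors `a` and `b`, and the built-in interaction effect are all hypothetical.

```r
# Hypothetical data in which the effect of a depends on b
set.seed(7)
n <- 300
toy <- data.frame(a = rnorm(n), b = rnorm(n))
toy$y <- 1 + toy$a + toy$b + 0.8 * toy$a * toy$b + rnorm(n)

additive <- lm(y ~ a + b, data = toy)
with_int <- lm(y ~ a * b, data = toy)   # expands to a + b + a:b

anova(additive, with_int)   # F-test for the added interaction term
AIC(additive, with_int)     # lower AIC indicates the preferred model
```

The same pattern extends to three-factor terms such as `a * b * c`, at the cost of many more parameters when the predictors are factors with several levels.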