Intro
For my final project in my Statistical Consulting class (STAT 692, TAMU), I carried out a study and wrote a report for a hypothetical client about predicting student performance in online courses. I want to share a bit about the process and results of the study, and reflect on some interesting feedback I received while presenting the work.
A random forest model trained on data from the first third of courses achieved 78% test accuracy in predicting whether students would pass or not pass their courses. Variable importance measures and partial dependence plots were used to identify the most influential variables and visualize their effects. Based on the model, the number of days on which a student visits the homepage of a course is very important in predicting their outcome. Different probability thresholds were also considered, using sensitivity and specificity to quantify how well passing and not-passing students are classified under different cutoffs.
The project, background, and dataset
The goal of the study was to predict whether students would pass or not pass online courses using early-course data. The outcomes from the study were models that the client can use for making predictions; interpretations of the importance and effects of the variables; and a description of how model metrics can inform course-level strategies for communicating with students.
The dataset is from The Open University[^1] in the UK and contains anonymized information about 32,593 student registrations in 7 courses during 2013-2014. It includes student demographics, course activity structure, grades, and over 10 million rows of daily activity clicks across all student registrations. The data were collected in an observational manner (i.e. during normal semesters, with no experimental protocol), so we are limited to making correlational conclusions rather than causal ones. The response variable to be predicted is Not Pass or Pass, derived from the “Final Results” variable in the dataset.
One model was fit using only information that would be known about students at the time of their registration, to see if there were any traits that predisposed students to passing or not passing. These include age, gender, a measure of poverty where they live, and whether or not they declared a disability, as well as things like the number of credits they are currently studying and any “banked” score they might carry into the course from a previous attempt.
Three other models were fit using daily activity click data from different time periods of the courses. The plot below was used to identify three time periods for aggregating the daily click data: pre-course through day 27; days 28-83; and all remaining days after day 83. The courses are up to 270 days long, so the early predictions are made about one-tenth and one-third of the way into a course. The average and standard deviation of clicks for each activity type in each time period were calculated for each student registration, as well as the number of days clicked on each activity type, to distinguish between students with different clicking patterns.
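As a rough sketch of that aggregation step (the `daily_clicks` table and its column names are assumptions based on the dataset's structure, not my exact code):

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one row per student, activity type, and course day,
# with a `sum_click` count (the daily activity table joined to the course
# structure table to get `activity_type`). Pre-course days are negative.
period_features <- daily_clicks %>%
  filter(date <= 83) %>%                               # pre-course through day 83
  mutate(period = if_else(date <= 27, "p1", "p2")) %>% # split at day 27
  group_by(id_student, period, activity_type) %>%
  summarise(
    mean_clicks  = mean(sum_click),   # average clicks on active days
    sd_clicks    = sd(sum_click),     # variability of clicking
    days_clicked = n_distinct(date),  # distinct days the activity was visited
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from  = c(period, activity_type),
    values_from = c(mean_clicks, sd_clicks, days_clicked),
    values_fill = 0                   # students who never clicked an activity type
  )
```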
Data processing was handled using Tidyverse and Tidymodels packages: dplyr for data manipulation and recipes for pre-processing. Students who withdrew before the end of a given period were removed, and a helper function filtered the relevant variables and built the recipe for each model.
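A minimal sketch of what such a recipe might look like (the training data frame `train_df` and the outcome column `final_result` are assumed names, and random forests need little pre-processing beyond this):

```r
library(recipes)

rf_recipe <- recipe(final_result ~ ., data = train_df) %>%
  update_role(id_student, new_role = "id") %>%     # keep the ID out of the model
  step_impute_median(all_numeric_predictors()) %>% # fill any missing numerics
  step_zv(all_predictors())                        # drop zero-variance columns
```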
I chose the random forest algorithm because: it can handle a mixture of categorical and numeric variables without much pre-processing; it can model complex non-linear relationships; and it can naturally model interactions between variables. Hyperparameter tuning and final model fitting were done with the Tidymodels framework.
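A hedged sketch of that tuning and fitting workflow, reusing the `rf_recipe` and data-frame names assumed above (the grid size, fold count, and seed are illustrative, not the report's exact settings):

```r
library(tidymodels)

set.seed(692)
split    <- initial_split(model_df, prop = 0.8, strata = final_result)
train_df <- training(split)                  # feeds the recipe sketch above
folds    <- vfold_cv(train_df, v = 5, strata = final_result)

# Random forest via the ranger engine, tuning mtry and min_n
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)

rf_tuned <- tune_grid(rf_wf, resamples = folds, grid = 20,
                      metrics = metric_set(accuracy, roc_auc))

# Refit on the full training set with the best hyperparameters,
# then evaluate exactly once on the held-out test set
final_fit <- rf_wf %>%
  finalize_workflow(select_best(rf_tuned, metric = "roc_auc")) %>%
  last_fit(split)
```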
This was my first time using Tidymodels for a full modeling task. I really appreciated how it provides a simple and consistent way to handle common model development tasks like pre-processing, train/test splitting, model tuning/training, and cross-validation/test evaluation, all while ensuring test information does not leak into the training process (at least, as much as it can!). I know that caret can do similar things and I am interested in trying that out as a comparison. I’ve also been working in Python more recently, and I appreciate the similar workflow that scikit-learn has.
Results
Here is a table with the results of predictions on the test dataset for the four models. I chose to focus on the “day-83” model, which achieved 78% test accuracy, for further interpretation. The “rem” model uses daily click data from the entire course, which is not practically useful for early prediction, but it serves as mild validation that the general approach is capable of making good predictions given complete data.
| Model | Test Accuracy | Test ROC AUC |
|---|---|---|
| demo | 0.621 | 0.665 |
| day-27 | 0.724 | 0.808 |
| day-83 | 0.782 | 0.869 |
| rem | 0.907 | 0.961 |
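With `last_fit()`, the metrics and predictions behind a table like this come out of single calls; a minimal sketch, assuming the `final_fit` object from the earlier tuning sketch:

```r
collect_metrics(final_fit)                    # test accuracy and ROC AUC
test_preds <- collect_predictions(final_fit)  # held-out class probabilities
```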
Variable importance
Although random forests do not provide clear estimates of variable effect sizes the way, say, linear regression does, we can get a sense of each variable's relative contribution using “variable importance” measures. For example, as seen in the plot below, variables related to visiting course homepages during the day 28-83 period were by far the most important in the “day-83” model, followed by visiting forum pages and the particular course (module) the student was in. Seeing that one variable is so much more important than all of the others makes it a useful guideline, and suggests that, if instructors can only do one thing, they should at least be making sure that students are checking the course regularly.
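For the curious, a plot like this can be produced by reading the impurity-based importances off the fitted ranger model with the vip package; a sketch, again assuming the `final_fit` object from earlier:

```r
library(vip)

final_fit %>%
  extract_fit_parsnip() %>%  # pull the fitted ranger model out of the workflow
  vip(num_features = 15)     # bar chart of the top variables
```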
Visualizing variable effects with partial dependence plots
Partial dependence plots can be used to visualize the effects of chosen variables by averaging the model's predictions over the training samples at a range of values for those variables. In the plot below, the homepage days-clicked and declared-disability variables are varied in the training samples, and the average predicted probabilities of Not Passing are plotted. First, notice the generally non-linear effect, where the probability of not passing decreases more slowly for both groups past about 19 days. Second, the probabilities start out similar for students with and without declared disabilities, but a gap of about 0.02 opens between the groups at higher numbers of days clicked. This is an example of a (weak) interaction between the variables.
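Partial dependence can also be computed by hand, which makes the definition concrete: fix the variables of interest at grid values, predict on copies of the training data, and average. A sketch under assumed names (the `days_clicked_homepage` and `disability` columns, the `NotPass` outcome level, and `final_fit` are all assumptions):

```r
library(dplyr)
library(tidyr)
library(purrr)

fitted_wf <- extract_workflow(final_fit)   # fitted workflow from last_fit()

# Grid to sweep: 0-56 possible clicking days in the day 28-83 window,
# crossed with the two disability levels
pdp_grid <- expand_grid(
  days_clicked_homepage = 0:56,
  disability = c("N", "Y")
)

pd <- pdp_grid %>%
  mutate(avg_prob_not_pass = map2_dbl(
    days_clicked_homepage, disability,
    function(d, dis) {
      # Overwrite the two variables for every training row; leave the rest as-is
      modified <- train_df %>%
        mutate(days_clicked_homepage = d, disability = dis)
      # Average predicted probability of Not Pass across all rows
      mean(predict(fitted_wf, modified, type = "prob")$.pred_NotPass)
    }
  ))
```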
Probability thresholds
The client could make strategic decisions using the test performance metrics - for example, they could choose a different probability threshold for classifying a student as Pass or Not Pass, and use sensitivity and specificity to understand how accurately they will reach each type of student. In the table below, the probability is the chance of a student Not Passing. If they lower the probability threshold from 50% to 40%, they will identify more students who will Not Pass (the sensitivity increases from 0.706 to 0.811), while the specificity (how well they identify Passing students) decreases from 0.865 to 0.733. Further decreasing the threshold to 30% increases the sensitivity to 0.910, but drops the specificity considerably, to 0.513. The latter option may be acceptable for a low-stakes way of communicating with struggling students, like a simple email nudge. If they are planning more time-intensive interventions, though, then such a low specificity could mean time spent working with many students who would ultimately have Passed anyway.
| Not Pass cutoff | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 30% | 0.721 | 0.910 | 0.513 |
| 40% | 0.774 | 0.811 | 0.733 |
| 50% (default) | 0.782 | 0.706 | 0.865 |
| 60% | 0.769 | 0.615 | 0.939 |
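A table like this can be built by sweeping cutoffs over the held-out probabilities; a sketch assuming the `test_preds` object from earlier, with a `.pred_NotPass` column and a `final_result` truth column whose first level is `NotPass`:

```r
library(dplyr)
library(purrr)
library(yardstick)

threshold_table <- map_dfr(c(0.3, 0.4, 0.5, 0.6), function(cut) {
  # Classify as Not Pass whenever its predicted probability exceeds the cutoff
  est <- factor(
    if_else(test_preds$.pred_NotPass >= cut, "NotPass", "Pass"),
    levels = levels(test_preds$final_result)
  )
  tibble(
    cutoff   = cut,
    accuracy = accuracy_vec(test_preds$final_result, est),
    sens     = sens_vec(test_preds$final_result, est),  # Not Pass correctly flagged
    spec     = spec_vec(test_preds$final_result, est)   # Pass correctly left alone
  )
})
```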
Thoughts on communication
I shared this work with a group of colleagues in the Education department at AMNH as an example of applying statistical/machine learning methods in an educational context. One colleague, Dr. Karen Hammerness, made a comment that really stuck with me - she pointed out that we should be careful with how we interpret the effect of the disability variable. Even if the effect size is small, an instructor might draw the wrong conclusion (say, stereotyping students with disabilities) if that is not clearly communicated. Also, any given student can learn new behaviors, and trends in groups can change from year to year, so it is important to emphasize that any claims are just average effects during a snapshot in time.
Conclusion
Beyond the results of the study itself, I learned a lot about the technical and time-management aspects of writing a report like this. For example, I originally wanted to try learning and implementing a new technique that would take advantage of the time-series nature of the data, but as the semester went on I realized this was too ambitious for the time I had available. Indeed, the professor made a point of saying that the purpose of this project was not to show off our statistical prowess! In the future, I will also be sure to run small model trainings earlier in the process, since I needed to re-run one of my models after realizing part of the data processing was incorrect. Finally, I look forward to learning more about the Tidymodels framework, since it makes so many common modeling processes manageable and repeatable.
References/footnotes
[^1]: Kuzilek, J., Hlosta, M., & Zdrahal, Z. Open University Learning Analytics dataset. Sci. Data 4:170171 (2017). doi:10.1038/sdata.2017.171



