A new machine learning algorithm could predict students’ academic performance based on the length and quality of their social media posts, with potential to uncover previously overlooked factors contributing to learning outcomes. It might even find applications in social research far beyond the domain of education.
The algorithm was developed by Ivan Smirnov, leading research fellow at the Laboratory of Computational Social Sciences at HSE University in Russia, and detailed in an article published in EPJ Data Science. He trained the algorithm on data from a panel study of a cohort of high school students who took the 2012 Programme for International Student Assessment (PISA) test, together with their public posts on the Russian social media platform VK in 2018. The training dataset included 4,400 students’ PISA reading test scores, originally gathered for the ongoing Trajectories in Education and Careers (TrEC) study, which follows the educational and professional pathways of Russia’s youth. These scores were then compared to the social media posts of 2,468 consenting participants, amounting to 130,575 posts in total.
Each of the 2.5 million unique words (out of 1.9 billion in total) in the VK corpus was represented as a vector and assigned a numerical rating corresponding to a PISA test score. The posts were then represented as 300-dimensional vectors by averaging over their constituent words, and Smirnov used these post vectors to train the algorithm to predict the test scores of the posts’ authors. In the training data alone, word length and vocabulary size were found to positively correlate with academic aptitude, while capitalised words, emojis and exclamations had the opposite effect.
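The core recipe of averaging word vectors into post vectors and fitting a model on them can be sketched in a few lines of Python. Everything below is illustrative: the word vectors are random stand-ins rather than trained embeddings, the posts and scores are invented, and ridge regression is just one plausible choice of model, not necessarily the one used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 300  # dimensionality of the post vectors described in the article

# Toy vocabulary: in the study each unique word has its own embedding;
# here we simply draw random vectors as placeholders.
vocab = {w: rng.normal(size=DIM) for w in ["book", "read", "quantum", "lol", "army"]}

def post_vector(words):
    """Represent a post as the average of its constituent words' vectors."""
    return np.mean([vocab[w] for w in words], axis=0)

# Hypothetical training data: posts (as word lists) and their authors' scores.
posts = [["book", "read"], ["quantum", "read"], ["lol", "army"], ["army", "lol", "lol"]]
scores = np.array([560.0, 580.0, 430.0, 410.0])

X = np.stack([post_vector(p) for p in posts])

# Ridge regression (closed form) mapping 300-d post vectors to predicted scores.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(DIM), X.T @ scores)

pred = X @ w  # predicted test scores for the training posts
```

With real embeddings and thousands of labelled posts, the same shape of pipeline would produce the per-author predictions discussed below.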
To perform external validation of the algorithm, Smirnov said he “created a list of high schools in Saint Petersburg, Samara, and Tomsk and then accessed the IDs of users who indicated on VK that they graduated from one of these schools.” More than a million VK posts from nearly 40,000 high school students were used for the analysis, followed by an even larger sample of 6.5 million posts from over 115,000 students from the 100 largest universities in Russia. Having fed these datasets to the algorithm, Smirnov compared the predictions to the average performance of each school’s graduates in the Unified State Examination (USE), a mandatory state examination for all school leavers in Russia, akin to the A levels and Highers in the United Kingdom.
Although the algorithm had been trained on PISA test scores, which don’t necessarily go hand in hand with state examination results, the match between its predictions and the real USE scores was remarkable, with the university dataset producing a correlation of 0.83, indicating a strong linear relationship. Smirnov then substituted the VK posts with tweets from 2,836 Twitter accounts linked to the dataset and found little decline in the algorithm’s predictive performance compared to a run using the VK data of those same users. Correlations for the high school datasets were somewhat weaker, between 0.49 and 0.60, partly due to the smaller sample sizes and representational issues with the data.
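A correlation of this kind is straightforward to compute once predicted and actual school-level averages are in hand. The numbers below are invented purely for illustration, not the study’s data:

```python
import numpy as np

# Hypothetical school-level averages: algorithm predictions vs. real USE scores.
predicted = np.array([62.0, 55.0, 70.0, 48.0, 66.0])
actual    = np.array([64.0, 53.0, 72.0, 50.0, 63.0])

# Pearson correlation coefficient between the two series.
r = np.corrcoef(predicted, actual)[0, 1]
```

A value of `r` near 1 indicates a strong linear relationship, as with the 0.83 reported for the university dataset.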
Alongside the individual predictions, Smirnov also tested the algorithm’s ability to distinguish between high- and low-performing participants. The PISA test scale is divided into six proficiency levels with Level 2 defined by the OECD as “a baseline proficiency that is required to participate fully in modern society.” Those falling below this level are considered low-performing while Levels 5 and 6 correspond to high performance. The algorithm succeeded in distinguishing which category a student falls into based on their social media posts with an astonishing accuracy of 94%.
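Turning score predictions into a high/low classification amounts to bucketing scores by the proficiency cut points and counting agreements. In the sketch below, the cut scores are approximate (the OECD’s technical reports give the exact values) and the student scores are invented for illustration:

```python
import numpy as np

# Approximate PISA reading cut scores: below the Level 2 threshold counts as
# low-performing; at or above the Level 5 threshold counts as high-performing.
LOW_CUT, HIGH_CUT = 407.0, 626.0

def category(score):
    """Bucket a PISA-style score into low / middle / high performance."""
    if score < LOW_CUT:
        return "low"
    if score >= HIGH_CUT:
        return "high"
    return "middle"

# Hypothetical true vs. predicted scores for a handful of students.
true_scores = np.array([380.0, 650.0, 640.0, 395.0])
pred_scores = np.array([402.0, 635.0, 660.0, 390.0])

true_cat = [category(s) for s in true_scores]
pred_cat = [category(s) for s in pred_scores]

# Accuracy over students who actually fall into the low or high category.
matches = [t == p for t, p in zip(true_cat, pred_cat) if t != "middle"]
accuracy = sum(matches) / len(matches)
```

Evaluated this way over the real data, the reported accuracy was 94%.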
Smirnov then looked into how specific vocabulary affected the predictions. The 400 highest- and lowest-ranking words in the original training corpus were each assigned to one of several thematic clusters, which were then identified as either high- or low-performing. Among the high-performing clusters were English words (the majority of the posts were in Russian) along with words related to literature (Orwell, Shakespeare), reading (read, book) and physics (quantum, Einstein). The low-performing ones, in contrast, included common spelling errors and typos as well as words related to computer games, military service (army, military oath), horoscopes (Aries, Sagittarius), and cars and road accidents (traffic collision, wheels).
Hoping the algorithm might yield insights in settings such as literature and politics, Smirnov has made the original dataset of 2.5 million words available online for further exploration. The finding also presents an alarming example of the power of machine learning: “digital traces in the form of short texts are ubiquitous, and our results suggest that they allow, if not to identify a person, then at least to predict potentially sensitive private attributes,” Smirnov writes. That’s certainly something to bear in mind before going on your next Twitter tirade.
Written by Mika Kontiainen and edited by Ishbel Dyke