Research:Student production of academic content
In this study we examine the production of academic content by students participating in the Wiki Education Foundation's classroom program. We operationalize academic content using a machine learning classifier trained on a labeled dataset of 1950 articles. We then apply this classifier to a sample of articles to determine the portion of academic content added by our students. We find that students contribute between 1.7 and 3.2 percent of all academic content generated over the entire semester, and between 2.9 and 5.0 percent when we restrict our view to the active period of the semester. Narrowing further to early academic content, we find that students generate between 3.9 and 6.6 percent over the entire semester, and between 5.9 and 10.1 percent over the busy periods.
- What portion of academic content is created by Wiki Ed students?
- What portion of early academic content is being created by Wiki Ed students?
To answer these questions we first developed a method to classify academic content. We then applied this classifier to a sample of Wikipedia articles to identify the amount of academic content being developed.
Defining academic content
Unfortunately, there is no single strong signal for academic content: across both categories and WikiProjects we observe both academic and non-academic articles. We therefore took a machine learning approach. We used Wikipedia:Labels to gather a set of 1950 labeled revisions. Four categories of features were extracted from the revision text: reference ratio, first n words, infoboxes, and academic words.
Reference ratio
This feature captures the proportion of an article's reference templates that are academic. Academic templates are identified using a separate classifier trained on a list of 455 labeled reference templates. Each reference template is processed by joining the template parameter names with the individual words of their arguments. The resulting processed template is then treated as a bag of words, vectorized, and fed into the classifier.
Using this classifier we label each of the reference templates within the article as either academic or non-academic. The ratio of academic to total references is then calculated and returned as a feature of the page.
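The template-to-feature step described above can be sketched roughly as follows. The template dict shape, the `param=word` joining scheme, and the pluggable `classify` callable are all assumptions for illustration; the actual pipeline vectorizes the bag of words and feeds it to a trained classifier.

```python
def template_to_bag_of_words(template):
    """Join each template parameter name with the individual words of
    its argument, producing a bag of words for one reference template.
    `template` is assumed to be {"name": ..., "params": {name: argument}}."""
    tokens = [template["name"]]
    for param, arg in template["params"].items():
        for word in arg.split():
            tokens.append(f"{param}={word}")
    return tokens

def academic_reference_ratio(templates, classify):
    """Fraction of an article's reference templates labeled academic.
    `classify` stands in for the trained template classifier and maps
    a bag of words to True (academic) or False (non-academic)."""
    if not templates:
        return 0.0
    labels = [classify(template_to_bag_of_words(t)) for t in templates]
    return sum(labels) / len(labels)
```

Articles with no reference templates are assigned a ratio of zero here; how the real pipeline handles that case is not stated in the source.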
First n words
These are the first n words (n=20) that appear in the article. This excludes template text. These serve as strong indicators about the subject matter and context of the article. They are treated as a bag of words and vectorized to form a set of features.
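A minimal sketch of this extraction, assuming wikitext input. The regex here only strips non-nested templates, which is a simplification; real wikitext can nest templates arbitrarily.

```python
import re

def first_n_words(wikitext, n=20):
    """Return the first n words of the article, excluding template text.
    Stripping only non-nested {{...}} spans is a simplifying assumption."""
    stripped = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    return stripped.split()[:n]
```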
Infoboxes
These are the names of any infoboxes that may appear within the article. They are treated as a bag of words and vectorized.
Academic words
This feature looks at the ratio of academic words to total words, using ten lists of academic words published by the School of Linguistics and Applied Language Studies at Victoria University of Wellington. Each of the ten lists is treated as a separate feature: we count the number of times words from a given list appear in the article, then divide by the total number of words. After doing this for each list we are left with a set of ten features per article.
The labeled revisions were divided into training and test sets and the classifier was evaluated.
              precision    recall  f1-score   support

       False       0.95      0.79      0.86       468
        True       0.52      0.84      0.65       131
 avg / total       0.85      0.80      0.81       599

AUC: 0.899
Upon reviewing these results and the false positives and negatives that the classifier produced we determined it to be accurate enough to move forward with evaluation.
General academic activity
Due to time constraints we sampled 1% of all pages in the main namespace. We did this using the wmflabs replica databases, selecting all pages whose page id mod 100 equaled 0. The text of these articles was then retrieved through the API, processed, and classified.
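The sampling rule amounts to the following. The SQL is a hypothetical sketch of the replica query (table and column names follow the standard MediaWiki schema); the predicate shows why id mod 100 yields roughly a 1% sample.

```python
# Hypothetical replica query: every main-namespace page whose id is
# divisible by 100, i.e. roughly 1% of pages.
SAMPLE_QUERY = """
SELECT page_id, page_title
FROM page
WHERE page_namespace = 0
  AND page_id % 100 = 0
"""

def in_sample(page_id, modulus=100, remainder=0):
    """True for the ~1% of pages where page_id mod 100 equals 0."""
    return page_id % modulus == remainder
```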
Since students contributed to only a small fraction of the pages on Wikipedia, we were able to sample all pages in the main namespace that any Wiki Ed student ever contributed to. These were then processed in the same manner as the general sample.
We chose to quantify productivity using the positive bytes added metric with a 10-revision window for reverts. Reverts are detected by comparing the sha1 value of each revision: revisions with matching sha1 values almost certainly indicate a reversion, so all revisions between the two matching revisions, along with the later of the two, are removed from the revision history.
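The revert-removal rule can be sketched as below, assuming revisions arrive oldest-first as dicts with a `sha1` key. When a sha1 matches an earlier revision within the 10-revision window, the earlier (matched) revision is kept and everything after it, including the reverting revision, is dropped.

```python
def remove_reverted(revisions, window=10):
    """Remove reverted revisions from an oldest-first revision history.

    A revision whose sha1 matches an earlier kept revision within
    `window` revisions is treated as a revert: all revisions between
    the matching pair, plus the reverting revision itself, are removed."""
    kept = []
    for rev in revisions:
        recent = kept[-window:]
        match = next(
            (i for i, r in enumerate(recent) if r["sha1"] == rev["sha1"]),
            None,
        )
        if match is not None:
            # Truncate back to the earlier matching revision and skip
            # the reverting revision itself.
            del kept[len(kept) - len(recent) + match + 1:]
        else:
            kept.append(rev)
    return kept
```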
We downloaded the full-history metadata dumps of English Wikipedia from Wikimedia and extracted the revision history of the pages we had previously selected and classified as academic. We then took the revert-cleaned revision history and calculated the byte length difference between consecutive revisions. Negative differences were set to zero.
For our students we select revision differences for revisions made by members of our student cohort and sum them by day. This results in a time series of student contributions.
For general editors we sum all differences by day and multiply the result by 100, since our sample covered 1% of pages, to approximate the general level of activity for the day.
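The two steps above, positive byte differences and daily aggregation with the 100x scale-up for the 1% sample, can be sketched as follows. The dict shape of a revision (`bytes` for page length, `day` for the revision date) is an assumption for illustration.

```python
from collections import defaultdict

def positive_byte_diffs(revisions):
    """Per-revision positive byte deltas, oldest first; negative
    deltas are clipped to zero. The first revision counts its full
    length as bytes added."""
    diffs = []
    prev_len = 0
    for rev in revisions:
        diffs.append((rev["day"], max(rev["bytes"] - prev_len, 0)))
        prev_len = rev["bytes"]
    return diffs

def daily_series(diffs, scale=1):
    """Sum positive bytes added by day; scale=100 approximates the
    full population from the 1% sample of general pages."""
    series = defaultdict(int)
    for day, delta in diffs:
        series[day] += delta * scale
    return dict(series)
```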
Early academic activity
In addition to general academic activity, we also wanted to know what portion of early academic content is being created by students. Here we define early academic content as an article that either did not exist, or had less than 2500 bytes prior to the start of the term.
Since we are looking at a much smaller set of articles than we were with general academic content we are able to look at the whole population of early academic articles.
We first identified four terms of interest: fall 2014, spring 2015, fall 2015, and spring 2016. We define the start of the term to be September 1 for fall and January 1 for spring. Using the full-history metadata dumps we then iterate over all articles, finding each article's first revision within a term. If the article was under 2500 bytes at that point, or had no previous revision, it is included in our sample. We identify the latest revisions of these selected pages and apply our academic classifier to them. We then filter out all articles that were not labeled academic.
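The early-article test can be sketched as below. Revision dicts with `timestamp` (ISO string, which compares correctly lexicographically) and `bytes` keys are assumptions for illustration.

```python
# Term start dates as defined above: September 1 for fall terms,
# January 1 for spring terms.
TERM_STARTS = {
    "fall 2014": "2014-09-01",
    "spring 2015": "2015-01-01",
    "fall 2015": "2015-09-01",
    "spring 2016": "2016-01-01",
}

def is_early(revisions, term_start, threshold=2500):
    """An article counts as early if, at the start of the term, it
    either did not yet exist or was under `threshold` bytes.
    `revisions` is the oldest-first revision history."""
    size_before = None
    for rev in revisions:
        if rev["timestamp"] >= term_start:
            break
        size_before = rev["bytes"]
    # No revision before the term start means the article did not exist.
    return size_before is None or size_before < threshold
```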
To quantify productivity for early academic content we use positive bytes added without accounting for reverts. This choice was made to gain computational efficiency and we believe it does not substantially impact the results since the articles we are considering generally have lower traffic and would be less subject to vandalism. When vandalism did occur the resulting noise would be less substantial than in larger articles.
In order to perform this calculation we loaded our selected pages into a user table on wmflabs. We joined these pages with the enwiki revision table to reduce its size, then calculated differences. We first selected revisions made by students, summing positive bytes added by date. We then did the same for all users.
General academic content
Due to the seasonal nature of academic work there are several windows to consider. Looking over the entire term we see contribution rates between 1.7% and 3.2%.
To get a better idea of the impact students are having on academic content while they are active, we also looked at 30-day spans.
Early-stage academic content