Research:Plagiarism on the English Wikipedia

Contact

Sage Ross

Wikimedia Foundation

This page documents a completed research project.

It's been a long time coming, but I'm happy to report that we have some results from a study of plagiarism we've been conducting. The short version is that I think it's safe to say that education program assignments from the United States Education Program and Canada Education Program are not making the English Wikipedia's plagiarism problem worse than it already is.

This write-up and discussion of the results is a work in progress, but feel free to ask questions or edit it.--Sage Ross (WMF) (talk) 18:28, 24 September 2013 (UTC)

Update

This write-up is now complete (although we're happy to address any additional questions that come up related to it). See also:

--Sage Ross (WMF) (talk) 15:23, 8 November 2013 (UTC)

Summary

We measured the rates of blatant plagiarism and close paraphrasing (compared to the database of Grammarly) in articles contributed to by nine different cohorts of users:

2006: users who registered in 2006 who created new articles
9.8% blatant plagiarism, 2.3% close paraphrasing (81 plagiarism, 19 paraphrasing, 725 fine)
2009: users who registered in 2009 who created new articles
11.1% blatant plagiarism, 1.3% close paraphrasing (92 plagiarism, 11 paraphrasing, 725 fine)
2012: users who registered in 2012 who created new articles
9.3% blatant plagiarism, 0.7% close paraphrasing (77 plagiarism, 6 paraphrasing, 745 fine)
match: matched users with statistically similar contribution histories (including time since registration, edit count, and category of article) to the student editors represented in the "S new" cohort
11.0% blatant plagiarism, 2.4% close paraphrasing (69 plagiarism, 15 paraphrasing, 544 fine)
S new: student editors in education program classes from 2010 through 2012 who created new articles
3.2% blatant plagiarism, 1.7% close paraphrasing (27 plagiarism, 14 paraphrasing, 790 fine)
S expand: student editors in education program classes from 2010 through 2012 who did major expansions of existing articles
6.7% blatant plagiarism, 1.9% close paraphrasing (52 plagiarism, 15 paraphrasing, 712 fine)
S 2013: student editors in education program classes from 2013 who created new articles
(Note: this is a small dataset created from partial results of the first term of 2013.)
2.9% blatant plagiarism, 0% close paraphrasing (1 plagiarism, 0 paraphrasing, 33 fine)
active: non-admin users from the top of the list of Wikipedians by number of edits who created or expanded articles
3.1% blatant plagiarism, 0.5% close paraphrasing (27 plagiarism, 4 paraphrasing, 844 fine)
admins: admins who created or expanded articles
3.5% blatant plagiarism, 0.3% close paraphrasing (27 plagiarism, 2 paraphrasing, 753 fine)

Although none of these comparisons is perfect, the results suggest that United States and Canada student editors in the Wikipedia Education Program have plagiarized at a lower rate than what is typical for other newcomers who start or expand articles.
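The percentages in the list above follow directly from the reported counts. As a quick check, here is a minimal Python sketch that recomputes them; the counts are copied from the summary, while the function and variable names are illustrative and not taken from the project's own R script.

```python
# Counts per cohort as reported in the summary: (blatant, close paraphrasing, fine).
COUNTS = {
    "2006": (81, 19, 725),
    "2009": (92, 11, 725),
    "2012": (77, 6, 745),
    "match": (69, 15, 544),
    "S new": (27, 14, 790),
    "S expand": (52, 15, 712),
    "S 2013": (1, 0, 33),
    "active": (27, 4, 844),
    "admins": (27, 2, 753),
}


def rates(blatant, paraphrasing, fine):
    """Return (blatant %, close-paraphrasing %) of all revisions, rounded to 0.1."""
    total = blatant + paraphrasing + fine
    return (round(100 * blatant / total, 1), round(100 * paraphrasing / total, 1))


for cohort, counts in COUNTS.items():
    blatant_pct, para_pct = rates(*counts)
    print(f"{cohort:9s} n={sum(counts):3d}  blatant {blatant_pct}%  paraphrasing {para_pct}%")
```

Each rate uses the full cohort size (plagiarism + paraphrasing + fine) as its denominator, which is how the published figures appear to have been derived.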

Background

Given the significance of plagiarism on Wikipedia, and the particular problems we've had reported from student editors in the education program, we wanted to get a quantitative view of how widespread it is among our student editors participating in the US and Canada programs. As there is no good baseline data on the general prevalence of plagiarism on Wikipedia, we also needed to compare the results with those of other editors. The only realistic option for getting enough plagiarism data in a standardized way was to contract this out; after investigating several options, we selected a company called TaskUs to do the primary plagiarism checking. We then began working with data analyst Evan Rosen to generate sets of article revisions representing the work of student editors and several control groups. Each dataset consists of about 800 article revisions (or fewer when that many could not be generated) that were determined to be primarily the work of a user in the respective cohort, with only one revision per user.

Evan developed scripts to generate sets of article revisions where each revision has the majority of the text contributed by a user in the corresponding cohort and (for most of the sets) the article was started by that user. Articles had to be at least 1500 bytes in length. The main dataset of student-editor-created articles included approximately 800 revisions (the latest revision contributed by the student editor or their Wikipedia Ambassador, with only one article per student editor). Evan then attempted to create datasets of the same size for the control groups, although in some cases the script was not able to generate the full number.

These datasets were sent to TaskUs to be checked for plagiarism via the Yahoo!-powered Grammarly tool. The resulting plagiarism data was then screened for common Wikipedia mirrors, with all the hits matching Wikipedia mirrors removed. All the remaining instances of apparent plagiarism were then checked manually by the WMF education program team to remove false positives (either quotations, or cases where original Wikipedia content was copied elsewhere).

Caveats

There are several things to consider when looking at the data and comparing the plagiarism results across cohorts:

The line between close paraphrasing and blatant plagiarism is subjective, and several people (Sage, LiAnna, Sophie, and Jami) contributed to the final classifications; the combined total (close paraphrasing + blatant plagiarism) is probably a better basis for comparison than the separate rates.
Due to shortcomings in our historical lists of Wikipedia Education Program student editors and/or imperfections in the scripts used to generate the revision lists, datasets "2012" and "match" contain some revisions by student editors. Three cases of plagiarism in each set were noticed to be by student editors (inflating those plagiarism rates by about 0.4%), and there may be other unnoticed cases. There may also be revisions without plagiarism in these datasets that represent student editor work. The 2006, 2009, 2012, match, active, and admin cohorts may also contain educational assignment articles written by students in classes not affiliated with the Wikipedia Education Program in the United States and Canada; some instructors choose to operate classroom assignments on the English Wikipedia outside of our program or in other countries.
In many cases, the confirmed plagiarism found in an article was not actually contributed by the user in the cohort. For the admin and very-active-non-admin cohorts, in particular, many cases of plagiarism were introduced by anonymous editors (and subsequently built upon or simply not removed in later edits).
The types of articles, and the types of sources, may differ systematically across datasets. The Grammarly plagiarism checking tool may be better at finding matches in certain kinds of sources than in others.
The identification of both close paraphrasing and blatant plagiarism starts from simple text matching. Thus, close paraphrasing that follows the structure of the source but does not include much exact matching would not be detected.
For older revisions in particular (such as the 2006 and 2009 cohorts as well as many of the revisions in the admin and very-active-non-admin cohorts), many of the sources have disappeared from the web. In some cases, Grammarly found matches to URLs that were no longer available at the time of manual checking; when no confirmation was available via archive.org or other detective work, the default was to score these revisions as fine, even where plagiarism seemed likely but could not be proved. In other cases, plagiarism may not have been detected in the first place if the source had already disappeared.
We attempted to confirm only instances of plagiarism, rather than non-plagiarism copyright violations. Especially in older revisions, many matches were long quotes (including implicit quotes, where in context it would be clear to a reader that someone is being quoted, even if there are no quotation marks or other indications). These cases were marked as fine.
Plagiarism rates between datasets that include only new articles created by users in that cohort (2006, 2009, 2012, match, S new, and S 2013) are not apples-to-apples when compared with datasets that include or consist solely of articles expanded by users in the cohort (S expand, active, and admins). Many attempts to create new articles are not successful, and either get deleted or never enter article space, and these may consist of a higher proportion of plagiarism than the ones that enter and remain in article space. Only articles in mainspace (as of the creation of the datasets) are represented here. No such filter of bad work exists for the expansion of existing articles.
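Two of the caveats above are simple arithmetic and can be illustrated with a short Python sketch: the combined rate (close paraphrasing plus blatant plagiarism) suggested as the better basis for comparison, and the size of the inflation caused by the three student-editor cases noticed in each of the "2012" and "match" sets. The counts come from the summary; the function names are illustrative only.

```python
def combined_rate(blatant, paraphrasing, fine):
    """Percentage of revisions with any confirmed plagiarism (blatant or close paraphrasing)."""
    total = blatant + paraphrasing + fine
    return 100 * (blatant + paraphrasing) / total


def inflation_points(student_cases, total):
    """Percentage points of a cohort's rate that come from cases attributed to student editors."""
    return 100 * student_cases / total


# Counts from the summary: (blatant, close paraphrasing, fine).
s_new = (27, 14, 790)
y2012 = (77, 6, 745)
match = (69, 15, 544)

print(round(combined_rate(*s_new), 1))   # 4.9
print(round(combined_rate(*y2012), 1))   # 10.0
print(round(combined_rate(*match), 1))   # 13.4

# Three noticed student-editor cases in each of the "2012" and "match" sets.
print(round(inflation_points(3, sum(y2012)), 2))  # 0.36
print(round(inflation_points(3, sum(match)), 2))  # 0.48
```

The inflation works out to roughly 0.36 percentage points for "2012" and 0.48 for "match", in line with the "about 0.4%" figure given above.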

Methodology, code and data

Creating the cohort datasets

Evan Rosen's scripts used to generate the datasets are available on Github: https://github.com/embr/wep-plagiarism

For the student datasets (S new and S expand), we began from lists of student editors in all courses in United States and Canada Education Program classes from 2010 (beginning with the Fall term) through 2012, which had been compiled by staff after each term. For S new, Evan's script searched for new articles created by these users (of at least 1500 bytes in length, with at least 70% of the raw content added by that user) and returned the last revision by that user. For S expand, the script searched for pre-existing articles edited by these users, again returning the last revision by that user (of minimum length 1500 bytes, with 70% or more contributed by that user).
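The selection rule described above (last revision by the user, at least 1500 bytes, at least 70% of the raw content added by that user) can be sketched as follows. This is not Evan's script; the `Revision` record, its field names, and `select_revision` are hypothetical, and the sketch assumes the per-user byte attribution has already been computed.

```python
from dataclasses import dataclass
from typing import Optional

MIN_BYTES = 1500        # minimum article length stated in the write-up
MIN_USER_SHARE = 0.70   # minimum share of raw content added by the user


@dataclass
class Revision:
    # Hypothetical record; field names are illustrative, not taken from the scripts.
    rev_id: int
    user: str
    timestamp: str       # ISO 8601, so string order matches time order
    article_bytes: int   # size of the article at this revision
    bytes_by_user: int   # bytes of the article attributed to `user` at this revision


def select_revision(history: list[Revision], user: str) -> Optional[Revision]:
    """Return the user's last revision if it meets the size and share thresholds."""
    own = [r for r in history if r.user == user]
    if not own:
        return None
    last = max(own, key=lambda r: r.timestamp)
    if last.article_bytes < MIN_BYTES:
        return None
    if last.bytes_by_user / last.article_bytes < MIN_USER_SHARE:
        return None
    return last
```

Under these assumptions the same function would serve both S new and S expand; the difference between the two sets is whether the article was created by the student editor or already existed.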

For the 2006, 2009 and 2012 datasets, the script searched for new articles meeting the same criteria that were started by users who registered in that year. The script started at the beginning of the year, so each of these sets is essentially made up of users who started editing in early January of the respective year.

For the admins dataset, the script searched for new or expanded articles (as limiting it to new articles did not generate enough revisions) contributed by current administrators, matching the same criteria as with the others.

For the active dataset, the script went through the list of the 5000 Wikipedians with the most edits, excluding admins, and searched for new or expanded articles matching the same criteria.

The match dataset is the most complex. For it, Evan's script searched for matches corresponding to the revisions in the S new dataset, looking for new articles by users who were statistically similar to the corresponding student editors along several dimensions: when the user began editing, their edit count, their total bytes added to the article they started, the categories the article is in, and the date the article was created. For the details of how each of these factored into creating the match set, see Evan's script.
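To give a rough idea of what matching along those dimensions could look like, here is a deliberately simple nearest-neighbour sketch in Python. It is an assumption-laden illustration, not a description of Evan's script: the `Candidate` record, the unweighted sum of per-dimension differences, the log scaling, and the Jaccard distance on categories are all choices made here for the example.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record covering the five dimensions named in the text.
    user: str
    first_edit: date
    edit_count: int
    bytes_added: int
    categories: frozenset
    article_created: date


def distance(a: Candidate, b: Candidate) -> float:
    """Unweighted sum of per-dimension differences (an illustrative choice)."""
    began = abs((a.first_edit - b.first_edit).days) / 365
    created = abs((a.article_created - b.article_created).days) / 365
    edits = abs(math.log1p(a.edit_count) - math.log1p(b.edit_count))
    size = abs(math.log1p(a.bytes_added) - math.log1p(b.bytes_added))
    union = a.categories | b.categories
    # Jaccard distance between category sets; 0.0 when both sets are empty.
    cats = 1 - len(a.categories & b.categories) / len(union) if union else 0.0
    return began + created + edits + size + cats


def best_match(student: Candidate, pool: list[Candidate]) -> Candidate:
    """Return the non-student candidate closest to the given student editor."""
    return min(pool, key=lambda c: distance(student, c))
```

How the real script weighted or combined these dimensions is documented only in the repository linked above.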

Primary plagiarism identification

The primary plagiarism identification was performed for each dataset in turn by a team of people working for TaskUs, an online outsourcing company that we contracted for this. The TaskUs team put each of the article revisions we provided through the commercial grammar and plagiarism checker Grammarly. As we understand it, Grammarly uses the Yahoo! BOSS search API as its method for checking plagiarism (which is used for the same purpose by Blackboard); we don't know the extent to which pay-walled and other restricted content (such as academic journal databases) is included, although many of the plagiarism instances that were found successfully involved pay-walled sources.

The TaskUs team returned spreadsheets for each of the datasets rating each article revision on a scale of 1 to 3: 1 for no plagiarism found, 2 for close paraphrasing, and 3 for blatant plagiarism. For revisions rated as 2 or 3, the spreadsheet included one or more URLs and the matching text for each URL.

The primary plagiarism data received from TaskUs included a high proportion of false positives: instances where Wikipedia was the original source of the matching text.

Removal of mirrors

The first step in processing the plagiarism data was to remove common Wikipedia mirrors from the results. Sage scanned through each of the cohorts and created a list of mirrors (included in the data file linked below) that showed up repeatedly in the data. These URLs were then removed from the data using an R script that Sage created (included in the data file linked below). The updated datasets were then uploaded to Google Drive for manual verification of the remaining instances of apparent plagiarism.
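The mirror-removal step was done in R; the Python sketch below only illustrates the idea of filtering reported hits against a list of mirror domains. The domains shown are placeholders (the real list is in the data file and is not reproduced here), and the function names and the `(url, matching_text)` hit format are assumptions made for the example.

```python
from urllib.parse import urlparse

# Placeholder mirror list; the real list ships with the project's data file.
MIRROR_DOMAINS = {"mirror.example", "wikicopy.example"}


def domain(url):
    """Return the host of a URL without a leading 'www.' ('' if it cannot be parsed)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_mirror(url, mirrors=MIRROR_DOMAINS):
    """True if the URL's host is a listed mirror or a subdomain of one."""
    host = domain(url)
    return any(host == m or host.endswith("." + m) for m in mirrors)


def strip_mirrors(hits, mirrors=MIRROR_DOMAINS):
    """Drop (url, matching_text) hits whose URL points at a known mirror."""
    return [(url, text) for url, text in hits if not is_mirror(url, mirrors)]


hits = [
    ("http://www.mirror.example/Some_article", "copied sentence"),
    ("https://news.example.org/story", "matching sentence"),
]
print(strip_mirrors(hits))  # only the news.example.org hit remains
```

Matching on the host rather than on the full URL means that one list entry covers every page on a mirror, which fits the description of mirrors that "showed up repeatedly in the data".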

Manual confirmation of plagiarism

Sage Ross, LiAnna Davis, Sophie Osterberg and Jami Mathewson went through the remaining instances of apparent plagiarism to assign a final rating manually. We restricted these final ratings to instances of plagiarism that we could confirm beyond a reasonable doubt, excluding:

cases where the reported matching text could not be found (as for many of the matches to myspace.com, which was recently overhauled to remove user-generated content, most of which was copied from Wikipedia, though some may have been copied to Wikipedia);
cases where we could not establish which text came first (which we were able to do in many cases using the Internet Archive's Wayback Machine);
cases of copyright violation that do not constitute plagiarism, such as lengthy quotation and (in some cases) implicit quotations where it is clear to readers that an official source is being excerpted verbatim;
cases where the user adding the text appears to be the author of the original source; and
cases where matching text is public domain or freely licensed and the inclusion of that content is noted in an edit summary or within the page text.

Regardless of whether the initial rating was a 2 (close paraphrasing) or 3 (blatant plagiarism), we independently assessed which category each case of confirmed plagiarism fell into. Notes were added for cases of confirmed plagiarism to describe the nature of the plagiarism, and also for tricky or ambiguous cases that were manually assigned a rating of 1 (no plagiarism) because we could not confirm plagiarism.

Final processing and plotting

The spreadsheets with the final, manually checked ratings were then downloaded from Google Drive as CSV files, and were processed by Sage's R script (included in the data file linked below) to produce the plot above.
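The final tallying was done with Sage's R script. As a rough Python equivalent, the sketch below counts final ratings from a CSV export and reports each category as a share of the cohort. The column name `final_rating` and the sample rows are hypothetical; the actual spreadsheet layout is in the data file linked below.

```python
import csv
import io
from collections import Counter

# Rating scale described in the write-up.
LABELS = {"1": "fine", "2": "close paraphrasing", "3": "blatant plagiarism"}


def tally(csv_file, rating_column="final_rating"):
    """Count revisions per final rating; `rating_column` is an assumed column name."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABELS[row[rating_column].strip()]] += 1
    return counts


# Made-up sample rows standing in for one cohort's exported spreadsheet.
sample = io.StringIO("rev_id,final_rating\n101,1\n102,3\n103,1\n104,2\n")
counts = tally(sample)
total = sum(counts.values())
for label in LABELS.values():
    print(f"{label}: {counts[label]} ({100 * counts[label] / total:.1f}%)")
```

Running `tally` once per cohort file would give the per-cohort counts that the summary percentages are based on.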

Raw data

Initial raw data from TaskUs, final checked data, and R script used to process and graph results: https://wikimediafoundation.org/wiki/File:English_Wikipedia_plagiarism_research_2013_data.zip