Research:Effects of collaboration patterns on article quality
Wikipedia leverages large-scale online collaboration by editors to provide an open source online encyclopaedia. The collaborative content creation process has played a key role in the creation of the encyclopaedia, but has also led to some variance in the quality of articles on the platform, causing issues such as sockpuppeting, vandalism, factual inaccuracies and biased article perspectives. To alert moderators, other editors and readers of such issues, the platform allows editors to add a variety of markup templates to articles.
Existing work has studied the effect of collaboration patterns on article quality by regressing on quality labels, which are aggregate scores that take into account how well-written, factually accurate, verifiable, comprehensive, neutral, and stable the article is. We are instead interested in behaviour that relates to specific content issues. To this end, we aim to use article templates which flag content policy violations to detect instances of unfavourable editing. Such a model would be useful to alert moderators to potentially problematic behaviour, allowing interventions before the low-quality version of the article appears.
The overall aims of the internship are (i) to collect the dataset to be used for this task and (ii) to perform initial experiments to establish a performance baseline and evaluation protocol. More advanced modelling and write-ups will not form part of the scope of the internship.
To achieve these objectives, the following steps were taken:
- Data collection:
  - Identifying templates relevant to Biographies of Living People (BLP) articles,
  - Exporting events relating to those articles and templates, and
  - Sampling negative events.
- Data analysis,
- Extracting features,
- Baseline model, and
- Graph-based modelling.
These steps are briefly described below.
This section describes the procedure followed to collect a set of approximately 17 000 examples of unfavourable editing in Biographies of Living People, paired with contrasting examples. The data can be accessed here. A diagram explaining how the data was obtained using the code in our repository can be viewed here.
Using the Advanced Search functionality, we determined which templates were the most prevalent in BLP articles. The number of templates per article can be viewed here. Based on this analysis, we concluded that the predominant problem was articles with a promotional tone. Therefore, the templates we chose to work with are autobiography, fanpov, advert, peacock and weasel.
We used Quarry to compile a list of page IDs of articles in the BLP category, as well as the Talk pages of these articles. We then exported events relating to these articles from the MediaWiki History and MediaWiki WikiText data lakes using PySpark. We identified articles that contain one or more of the five templates, using regular expressions to match occurrences in the edit history of the article text.
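The regular-expression matching step can be illustrated with a minimal sketch. The pattern, the function names and the rule that a tag addition is a template appearing in a revision when it was absent from the previous one are all assumptions made for illustration; the project's actual expressions may differ, and any alias or redirect names for the templates would need to be added to the list.

```python
import re

# The five templates chosen in this study. The matching rules below are an
# assumption for illustration, not the project's actual expressions.
TEMPLATES = ("autobiography", "fanpov", "advert", "peacock", "weasel")

# Match "{{name}}" or "{{name|params}}", ignoring case and padding spaces.
TEMPLATE_RE = re.compile(
    r"\{\{\s*(" + "|".join(TEMPLATES) + r")\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def find_templates(wikitext: str) -> set[str]:
    """Return the lower-cased names of the target templates found in wikitext."""
    return {m.group(1).lower() for m in TEMPLATE_RE.finditer(wikitext)}


def tag_additions(revisions):
    """Yield (revision_id, template) whenever a template appears in a revision
    but was absent from the previous one.

    `revisions` is an iterable of (revision_id, wikitext) in chronological order.
    """
    previous = set()
    for rev_id, text in revisions:
        current = find_templates(text)
        for name in sorted(current - previous):
            yield rev_id, name
        previous = current
```

Comparing consecutive revisions rather than matching each revision in isolation is one way to separate the moment a tag is added from the revisions in which it merely persists.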
We are interested in three aspects of an article's history: (i) the revision activity on the page itself, (ii) the history of the editors who revise the page, and (iii) activity on the related Talk page. To find the page history, we export all events from the MediaWiki History (using a subset of fields based on the features of interest). In a similar way, we find all events created by every editor who contributed to these pages in MediaWiki History; that is, their full edit histories, including edits to other articles. Finally, we find the full histories of the relevant Talk pages in MediaWiki WikiText. In a post-processing step, the most recent Talk page snapshot is selected for every tag addition event.
To create a contrastive class of samples for comparison with the tag addition events, we sample negatives from the revision histories of the same pages. This is done by selecting, for each page in the positive class, all revision events which were not reverted and where the tag in question was not present. We further filter by date, choosing only events that occur between 7 and 90 days after the tag was added, to ensure the page is still in the same stage of development. From these candidates, we then randomly sample a revision and assign it the negative label. An example of this is shown in the figure below, for the article on Adam Khoo. Following a fairly active period of editing and Talk page discussion over approximately 100 days, a tag is added on day 1067 of the article's existence. After more edits over 29 days, the tag is removed on day 1096. The negative revision sample is taken on day 1138, where there is still active editing occurring, compared to the drop-off in activity we see at day 1250.
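The sampling rule described above can be sketched as follows. The `Revision` record and its field names are hypothetical, and treating the 7-day and 90-day boundaries as inclusive is an assumption; only the filtering criteria themselves come from the description.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Revision:
    # Hypothetical record; field names are illustrative only.
    rev_id: int
    timestamp: datetime
    is_reverted: bool
    has_tag: bool  # whether the tag in question is present in this revision


def sample_negative(revisions, tag_added_at, rng, min_days=7, max_days=90):
    """Pick one random revision that was not reverted, does not carry the tag,
    and falls between min_days and max_days after the tag was added.

    Returns None when no revision qualifies.
    """
    start = tag_added_at + timedelta(days=min_days)
    end = tag_added_at + timedelta(days=max_days)
    candidates = [
        r for r in revisions
        if not r.is_reverted and not r.has_tag and start <= r.timestamp <= end
    ]
    return rng.choice(candidates) if candidates else None
```

Passing in a seeded `random.Random` instance keeps the sampled negatives reproducible across runs.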
There are a number of alternative methods that could be considered for negative sampling; for instance, one might choose to sample from different pages overall, or to choose the cut-off points according to the number of revisions as opposed to a number of days. We leave this experimentation for future work.
Data analysis and cleaning
The number of samples per template is shown in the table below.
|Template||Positive samples||Negative samples|
We were interested in the usage of these tags over time: whether there is a rise or decline in their uptake, or whether they are only used in a certain period. The rolling average of the number of tags used over time is shown in the figure below. We can see that the tags were all introduced between 2006 and 2008 and have been used fairly consistently since, with "weasel" and "fanpov" being less prevalent than the other three. Interestingly, we observe a big spike in the number of tag additions in 2015 in the "autobiography" category. Upon closer inspection, these are all attributed to the same editor, who added 203 tags in January 2015. We view these annotations as noisy and exclude them from the dataset.
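A spike of this kind could also be flagged automatically. The sketch below is an illustration only: the exclusion described above was made after manual inspection, and the threshold of 100 tag additions per editor per month is a hypothetical value, not one used in the project.

```python
from collections import Counter


def bulk_taggers(tag_events, threshold=100):
    """Return the (editor, month) pairs with more than `threshold` tag additions.

    `tag_events` is an iterable of (editor, "YYYY-MM") pairs, one per tag
    addition. The threshold is a hypothetical cut-off chosen for illustration.
    """
    counts = Counter(tag_events)
    return {key for key, n in counts.items() if n > threshold}


def drop_bulk_events(tag_events, threshold=100):
    """Remove every tag addition made by an editor in a month in which they
    exceeded the threshold."""
    events = list(tag_events)
    noisy = bulk_taggers(events, threshold)
    return [event for event in events if event not in noisy]
```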
The tags do not seem to be concentrated on certain pages. The most tagged page is that of Rob Powell, with 19 tags over the course of two years.
During the data extraction step, it became apparent that the editors who contributed to the pages with templates, together with their edit histories, form a substantial set. To reduce the burden of processing these histories, we looked at two options: eliminating bots, and exploiting the overlap of editors between tags. We identify bots based on the "is_bot" field in MediaWiki History. Considering only the "autobiography" tag, our analysis indicates that there are 53,808 users who contributed to BLP articles with this tag. The median number of edits per user is 177. Of these users, 502 (about 1%) are bots; however, they contributed 21% of the edits by this group, a total of 106,672,366 revisions. We choose to exclude these revisions.
The log-rank plot for the number of edits per editor (excluding bots) is shown below, for every tag. There are fewer editors in the "fanpov" and "autobiography" sets. The ends of the curves of the remaining three tags overlap on the right. These curves indicate that there are a number of editors in each set who contribute a large number of revisions.
Our intuition was that these "super-users" would be present in several of the tag sets, which adds unnecessary duplication to the already large user history dataset. Indeed, upon investigation we found that although the overlap of editors between the sets was not large (between 15% and 54%), these editors contributed a large share of the revisions, making up between 74% and 97% of edits. We therefore chose to exploit this overlap by extracting the histories of the users involved in all five tags together, which reduces the data storage needs by close to 80%.
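The two overlap figures can be computed with simple set operations. This is a minimal sketch under the assumption that overlap for a tag means editors who also appear in at least one other tag set; the exact definition used in the analysis is not stated, and the input layout is hypothetical.

```python
def overlap_stats(edits_by_tag, tag):
    """Return (editor_share, revision_share) for one tag.

    `edits_by_tag` maps tag -> {editor: number_of_revisions} (hypothetical
    layout). An editor counts as overlapping if they also appear in at least
    one other tag set, which is an assumed definition.
    """
    own = edits_by_tag[tag]
    others = set()
    for other_tag, editors in edits_by_tag.items():
        if other_tag != tag:
            others.update(editors)
    shared = set(own) & others
    editor_share = len(shared) / len(own)
    revision_share = sum(own[e] for e in shared) / sum(own.values())
    return editor_share, revision_share
```

A small share of editors paired with a large share of revisions is the pattern that makes storing each shared history once worthwhile.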
Given the exported user, article and Talk page histories, we now calculate features that are expected to be of use in classifying problematic patterns of collaboration. These features are based on the work of Keegan et al. (2012) and Rad et al. (2012). For every tag addition event (positive and negative), we calculate features using only the history up to that point in time. This is a time-consuming process, particularly since the large editor history graph has to be recomputed for every tag instance.
To model collaboration patterns and how they correspond to these templates being added to BLP articles, we calculated five types of features (shown schematically to the right):
- Article features [15 features], for example, the article age in years, the fraction of its edits that took place in the week before the tag event, the mean time between edits, et cetera.
- User - article features [10 features], for instance, the fraction of the page's edits contributed by an editor, the number of their revisions to the page that were reverted, and the mean size of their edits,
- Talk page features [32 features], describing volume-based metrics like the ratio of Talk page to article edits, as well as language-based features describing politeness, sentiment and reply-tree depth,
- Editor features [10 features], for instance, number of past and current blocks, time since registration, the spread of their edits between articles (calculated as the entropy of the fraction per page), and
- Collaboration features (editor-editor) [4 features]:
  - Per user who edited the page, with every other user who edited the page:
    - Ratio of co-edited articles to all edited articles
    - Ratio of revisions in co-edited articles to total revisions
  - Per pair who edited the page, in the last year:
    - Number of co-edited articles
    - Number of interactions (including on Talk pages).
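Two of the features listed above lend themselves to a short worked example: the spread of an editor's edits across articles, expressed as an entropy, and the per-pair co-editing ratios. The sketch below assumes Shannon entropy in bits and a hypothetical input layout; the project's own implementation may normalise or count differently.

```python
import math
from collections import Counter


def contribution_entropy(page_ids):
    """Shannon entropy (in bits) of how an editor's revisions are spread
    across pages. `page_ids` holds one page ID per revision.

    An editor who only ever edits one page scores 0; the value grows as the
    revisions are spread more evenly over more pages.
    """
    counts = Counter(page_ids)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def coedit_ratios(pages_a, pages_b):
    """Collaboration ratios for editor A with respect to editor B.

    `pages_a` and `pages_b` map page_id -> number of revisions by that editor
    (hypothetical layout; `pages_a` must be non-empty). Returns
    (co-edited articles / all articles edited by A,
     A's revisions in co-edited articles / A's total revisions).
    """
    shared = pages_a.keys() & pages_b.keys()
    article_ratio = len(shared) / len(pages_a)
    revision_ratio = sum(pages_a[p] for p in shared) / sum(pages_a.values())
    return article_ratio, revision_ratio
```

Both ratios are asymmetric: the value for editor A with respect to B generally differs from the value for B with respect to A, which is why they are computed per user rather than per pair.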
As a baseline model, we use logistic regression to classify the samples as tag addition events or not. Given that our feature set contains one-to-many and many-to-many mappings, this requires aggregation. To retain as much information as possible from these features, we compute their mean, standard deviation and maximum. For instance, for the user-article feature describing the fraction of the page's edits contributed by an editor, we aggregate the values for all the different editors in these three ways.
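The aggregation step can be sketched in a few lines. Using the population standard deviation and returning zeros for an empty list are both assumptions, since the text does not specify either choice.

```python
from statistics import mean, pstdev


def aggregate(values):
    """Collapse a one-to-many feature into its mean, standard deviation and
    maximum.

    The population standard deviation and the all-zero result for an empty
    input are assumptions made for this sketch.
    """
    values = list(values)
    if not values:
        return {"mean": 0.0, "std": 0.0, "max": 0.0}
    return {"mean": mean(values), "std": pstdev(values), "max": max(values)}
```

For example, a page whose three editors contributed fractions 0.5, 0.3 and 0.2 of its edits would be summarised by calling `aggregate([0.5, 0.3, 0.2])`, giving three fixed-length inputs for the classifier regardless of how many editors the page has.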
We use logistic regression in scikit-learn for this task, with balanced class weightings, L2 regularisation (C=1000) and a maximum of 10000 iterations with the L-BFGS solver. Our results for all templates combined are shown below, split according to feature type. We use the area under the precision-recall curve as the metric for this task.
|User - article||0.6536|
|Talk features, language||0.5096|
|Talk features, volume||0.5676|
These results are promising, given that a simple model is used and a lot of information is lost in the aggregation step. However, it is disappointing that none of the features relating to collaboration patterns are individually as useful for the classification task as the page's own history. It would be interesting to see if more sophisticated models that can take the graph structure into account would be more effective for this task.
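For reference, the classifier configuration described above could be set up roughly as follows. The data here is synthetic and purely illustrative, and `average_precision_score` is only one common estimate of the area under the precision-recall curve; the text does not say which estimator was used.

```python
import random

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = random.Random(0)


def make_split(n):
    """Generate a toy two-feature dataset (illustrative only, not project data)."""
    X, y = [], []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        centre = 1.0 if label else -1.0
        # The first feature is informative, the second is pure noise.
        X.append([rng.gauss(centre, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


X_train, y_train = make_split(200)
X_test, y_test = make_split(100)

# L2 is scikit-learn's default penalty, so it is not passed explicitly.
clf = LogisticRegression(
    C=1000,
    class_weight="balanced",
    solver="lbfgs",
    max_iter=10000,
)
clf.fit(X_train, y_train)

# Score the positive class and summarise the precision-recall curve.
scores = clf.predict_proba(X_test)[:, 1]
auc_pr = average_precision_score(y_test, scores)
print(f"Area under the precision-recall curve: {auc_pr:.4f}")
```

Note that in scikit-learn `C` is the inverse of the regularisation strength, so a value of 1000 corresponds to very weak regularisation.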
The five most informative features are shown below. We can see that they represent a good spread across different feature types. The most informative feature is the fraction of revisions that occurred in the week before the tag event; if a large portion of an article's edits are recent, it is more likely to be tagged. If a large fraction of contributors' past revisions have been on the same pages (feature 2), a tag is also more likely to be added. A larger mean fraction of page edits per editor is further associated with the positive class (feature 3), indicating that there are not many editors each contributing a small fraction. A larger concentration ratio (the number of edits divided by the number of editors) is also associated with the positive class, which represents the same intuition. Finally, the contribution fraction (feature 5) refers to the fraction of a user's edits dedicated to a specific article. The entropy of this value represents how imbalanced this distribution is, that is, how focused or spread out an editor's contributions are. A low maximum entropy value (i.e. not very focused editing) is associated with the negative class (tag not added), which is an unexpected result, as we would expect editors who cause promotional-tone tags to be added to be focused only on specific pages. However, it may be that the aggregation here creates spurious correlations.
|1. Fraction revisions in last week||Article features||0.35|
|2. Fraction of revisions in co-edited articles to total revisions [mean]||Collaboration||0.2554|
|3. Fraction of page edits made per editor [mean]||User-article||0.2377|
|4. Concentration ratio||Article features||0.1457|
|5. Contribution fraction entropy [max]||Editor features||-0.1417|
The results per tag type are shown below, using all features. We can see that the performance differs substantially across the different tags. These differences loosely correlate with the number of samples per tag type: there are 721 samples in the Fanpov set, compared to 4224 in the Autobiography set.
Graph neural networks
Since the collaboration relationships are clearly structured as a graph, we are further interested to see if we could improve on these scores by modelling the data as a graph instead of aggregating. Graph neural networks are currently a popular topic of research, and seem to be appropriate for this task. In particular, the task of graph classification is applicable, where instead of classifying a node or edge in a graph, the entire graph is embedded and then classified (see, for instance, Ying et al. (2018) and Cangea et al. (2018)). In our case, multiple types of nodes and edges are present, which requires a heterogeneous graph representation and model. The Relational Graph Convolutional Neural Network (Schlichtkrull et al., 2018) is effective for graph classification on this type of data, and is often used in the context of knowledge graphs, where there are multiple types of entities and relationships.
Since the primary goal of the internship was to create the dataset and establish a baseline for performance, this work is still in progress. However, we have implemented a simplified homogeneous graph representation, using only the editor and collaboration features and a three-layer graph neural network with 128 hidden units per layer. With this model, we achieved a ROC-AUC of 69%, compared to 62% for the aggregated linear models with the same information. We therefore view this as a promising area of exploration for future work.
During a 12-week internship conducted from January to March 2021, we investigated the effects of collaboration patterns on article quality, looking specifically at biographies of living people and using the addition of tags relating to a promotional tone as an indicator of unfavourable editing practices. We have created a dataset for studying these effects, implemented a baseline model, and proposed directions for future work.
- Keegan, Brian; Gergle, Darren; Contractor, Noshir (2012). "Do editors or articles drive collaboration? Multilevel statistical network analysis of Wikipedia coauthorship." In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW '12). Association for Computing Machinery.
- Rad, Hoda Sepehri; Makazhanov, Aibek; Rafiei, Davood; Barbosa, Denilson (2012). "Leveraging editor collaboration patterns in Wikipedia." In Proceedings of the 23rd ACM conference on Hypertext and social media (HT '12). Association for Computing Machinery.
- Ying, Rex; You, Jiaxuan; Morris, Christopher; Ren, Xiang; Hamilton, William L.; Leskovec, Jure (2018). "Hierarchical graph representation learning with differentiable pooling." In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18).
- Cangea, Catalina; Veličković, Petar; Jovanović, Nikola; Kipf, Thomas; Lio, Pietro (2018). "Towards Sparse Hierarchical Graph Classifiers." arXiv preprint arXiv:1811.01287.
- Schlichtkrull, Michael; Kipf, Thomas; Bloem, Peter; Berg, Rianne; Titov, Ivan; Welling, Max (2018). "Modeling Relational Data with Graph Convolutional Networks." European Semantic Web Conference.