Research:Classifying Actors on Talk Pages
Good tooling exists to detect the quality of Wikipedia articles and individual edits, but no tooling exists to indicate whether a user is a good or bad actor. This project aims to evaluate how effectively sentiment and temporal analysis of talk page edits can classify users. In parallel, we will document the most effective strategies for working with Wikipedia data and the different approaches available. We hope this will promote Natural Language Processing and other kinds of data analysis on Wikipedia data.
We published the results of this project at OpenSym 2021 with a research paper titled "Extracting and Visualising User Engagement on Wikipedia Talk Pages".
Due to Wikipedia's high popularity, diverse user community, and highly collaborative editorial base, it is at the centre of high-quality debate on the internet. Research into the patterns of large-scale community disputes can give us tools to reach consensus more quickly and in more satisfying ways. However, this research has a steep learning curve, which diminishes its appeal. Guides and documentation on how to access this data therefore need to be created and publicised.
|Jan 8 - Jan 28||Background Reading|
|Jan 29 - Feb 25||Getting the data, importing data into a SQL database|
|Feb 26 - Mar 10||Start asking research questions on the dataset, looking for correlations between a goodness score and:
A - Users whose additions are deleted
B - Sentiment analysis on what’s added
C - Sentiment analysis on what’s deleted
D - Users who are eventually blocked
E - People who are accused of being paid
Does this change over time? Do users turn bad?
Can we detect complaining or paid editing?|
|Mar 11 - Apr 7||Compare with the manual process:
Are unblocked users with low goodness scores likely to be blocked?
What is the goodness score of controversially banned users?|
|Apr 8 - Apr 30||Finish writing the research paper and all documentation; open-source the data and source code|
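The correlation questions in the timeline above (for example, goodness score against item D, users who are eventually blocked) could be explored with something like the following Python sketch. It computes a plain Pearson correlation between a per-user score and a 0/1 blocked flag. The `pearson` helper and the numbers in `goodness` and `blocked` are placeholders chosen for illustration only; they are not project data or results, and the project may have used a different measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values, one entry per hypothetical user (NOT real data).
goodness = [0.9, 0.8, 0.2, 0.1]   # assumed goodness score per user
blocked = [0, 0, 1, 1]            # 1 if the user was eventually blocked

r = pearson(goodness, blocked)
print(f"correlation between goodness and being blocked: {r:.3f}")
```

With a binary second variable this is the point-biserial correlation; a strongly negative value would suggest that low-scoring users are blocked more often, which is the kind of signal the comparison with the manual process is meant to check.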
The output of this project will be a research paper that aims to answer:
- What is the strongest indicator of “goodness”?
- Do users turn bad?
- Can we detect complaining or paid editing?
and documentation around:
- The best way to extract all of Wikipedia, including how much storage and bandwidth you need
- How to create a database of all user contributions to talk pages, and how to do the same for any namespace or wiki
- How to get access to Toolforge and what it allows a researcher to do
- How to separate namespace (article, talk, etc.) from revision data
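As a rough illustration of the database and namespace items in the list above, the following Python sketch streams a MediaWiki XML export and loads only talk-namespace revisions into SQLite. It assumes the usual export layout (`<page>` elements carrying `<ns>`, `<revision>`, and `<contributor>` children). The table layout, the `import_revisions` helper, and the inline `SAMPLE_DUMP` are hypothetical and are not taken from the project's own code.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Namespace numbers follow MediaWiki's convention: 0 = article, 1 = Talk.
TALK_NAMESPACES = frozenset({1})

# Tiny made-up dump, used only to exercise the sketch.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title><ns>0</ns><id>1</id>
    <revision><id>10</id><timestamp>2020-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <text>Article text.</text></revision>
  </page>
  <page>
    <title>Talk:Example</title><ns>1</ns><id>2</id>
    <revision><id>11</id><timestamp>2020-01-02T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <text>First comment.</text></revision>
    <revision><id>12</id><timestamp>2020-01-03T00:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>Anonymous reply.</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Text of the first direct child called `name`, or None if absent."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def import_revisions(xml_stream, conn, namespaces=TALK_NAMESPACES):
    """Store pages and revisions from the selected namespaces in SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS page "
                 "(page_id INTEGER PRIMARY KEY, ns INTEGER, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS revision "
                 "(rev_id INTEGER PRIMARY KEY, page_id INTEGER, "
                 "username TEXT, timestamp TEXT, text TEXT)")
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue  # wait until a whole <page> has been read
        ns = int(child_text(elem, "ns"))
        if ns in namespaces:
            page_id = int(child_text(elem, "id"))
            conn.execute("INSERT INTO page VALUES (?, ?, ?)",
                         (page_id, ns, child_text(elem, "title")))
            for rev in elem:
                if local(rev.tag) != "revision":
                    continue
                contributor = next(
                    (c for c in rev if local(c.tag) == "contributor"), None)
                username = None
                if contributor is not None:
                    # Logged-out edits carry <ip> instead of <username>.
                    username = (child_text(contributor, "username")
                                or child_text(contributor, "ip"))
                conn.execute(
                    "INSERT INTO revision VALUES (?, ?, ?, ?, ?)",
                    (int(child_text(rev, "id")), page_id, username,
                     child_text(rev, "timestamp"),
                     child_text(rev, "text") or ""))
        elem.clear()  # drop the processed page's children to limit memory use
    conn.commit()


conn = sqlite3.connect(":memory:")
import_revisions(io.BytesIO(SAMPLE_DUMP), conn)
talk_edits = conn.execute(
    "SELECT username, COUNT(*) FROM revision "
    "GROUP BY username ORDER BY username").fetchall()
print(talk_edits)
```

Keeping `ns` on the `page` table and the edit data on the `revision` table is one way to separate namespace from revision data: passing a different `namespaces` set (for example `{3}` for user talk pages) reuses the same import for another namespace without changing the schema.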