Grants:IdeaLab/Searching for out-of-date information in wikipedias
Project idea
What is the problem you're trying to solve?
Wikipedia has helped me and others a lot in research and interdisciplinary projects. But due to problems in the Wikipedias' revision processes and a lack of communication among local reviewers, the Chinese Wikipedia has been developing more slowly than other language editions in recent years. Many entries are not updated even once a year. I would like to make the Chinese Wikipedia more useful and encourage contributors, so that it better serves the many Wikipedia users interested in Chinese-language articles.
According to the Harvard College Writing Program's Guide to Using Sources,
- "when you're doing academic research, you should be extremely cautious about using Wikipedia. As its own disclaimer states, information on Wikipedia is contributed by anyone who wants to post material, and the expertise of the posters is not taken into consideration. Users may be reading information that is outdated or that has been posted by someone who is not an expert in the field or by someone who wishes to provide misinformation. (Case in point: Four years ago, an Expos student who was writing a paper about the limitations of Wikipedia posted a fictional entry for himself, stating that he was the mayor of a small town in China. Four years later, if you type in his name, or if you do a subject search on Wikipedia for mayors of towns in China, you will still find this fictional entry.) Some information on Wikipedia may well be accurate, but because experts do not review the site's entries, there is a considerable risk in relying on this source for your essays."
While community solutions such as en:Wikipedia:WikiProject Update Watch have been inactive since 2008, there are some automated software tools that can help editors update outdated articles.
What is your solution?
There are plans to create automated evaluation systems for reviewers that could benefit from modular software tools to automatically identify out-of-date facts and figures. I would like to measure the usefulness of natural language pattern matching for identifying and locating out-of-date facts and figures and other potentially inaccurate information. Specific techniques may include one or more of: text searching, regular expressions, grammar-based parsing, Long Short-Term Memory (LSTM) and associated neural network language modeling, and linear dynamic text modeling. (Please see Timeline Weeks 3-4 below for references.)
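As an illustration of the simplest of these techniques, here is a minimal sketch of regular-expression pattern matching applied to raw wikitext; the phrase patterns, the age threshold, and the function name are illustrative assumptions, not project deliverables.

```python
# A minimal sketch: flag passages whose "as of ..." style statements cite
# a year old enough that the surrounding claim may need review.
import datetime
import re

# Phrases that often introduce time-sensitive statements in articles
# (illustrative only; the real pattern list would be developed in Weeks 3-4).
STALE_PATTERNS = [
    re.compile(r"\b[Aa]s of (\d{4})\b"),
    re.compile(r"\b[Cc]urrently\b[^.\n]*?\b(\d{4})\b"),
]

def find_stale_candidates(wikitext, max_age_years=3):
    """Return (year, snippet) pairs for statements citing a year older
    than max_age_years, as candidates for human review."""
    this_year = datetime.date.today().year
    candidates = []
    for pattern in STALE_PATTERNS:
        for match in pattern.finditer(wikitext):
            year = int(match.group(1))
            if this_year - year > max_age_years:
                snippet = wikitext[max(0, match.start() - 40):match.end() + 40]
                candidates.append((year, snippet))
    return candidates
```

Passages flagged this way would only be candidates for editor review; the more expensive techniques (grammar-based parsing, LSTM language modeling) would be measured against the same kind of output.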
Project goals
We will produce modular software tools and implement a bot that uses them to identify out-of-date information in Wikipedia entries. We will evaluate several established and experimental pattern-matching methods against existing Wikipedia entries and newly created unit test cases to test the tools and the bot.
Get involved
Participants
- I would be happy to continue mentoring this project in conjunction with User:Prnkmp28's Google Summer of Code application. Jsalsman (talk) 03:33, 30 March 2016 (UTC)
Endorsements
- Linxuan has the skills necessary to get good results with this project. EllenCT (talk) 03:27, 30 March 2016 (UTC)
- I think that outdated content is clearly a problem and it is realistic to tackle it with the outlined approach (probably some more time should be allotted to some steps). I'm definitely endorsing it. faflo (talk) 12:44, 21 April 2016 (UTC)
Project plan
Timeline of activities
Week 1
- Study out-of-date information, other inaccuracies, and problematic entries in Wikipedia
- Collect examples of out-of-date information in multiple language Wikipedias
- Become familiar with Python for MediaWiki Utilities development and the Python 3 development environment
Week 2
- Implement a subroutine to perform text age and attribution analysis with WikiWho
- Write a subroutine to analyze articles via database dumps or the MediaWiki API to perform simple pattern matching based on text string search
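Below is a minimal sketch of the second Week 2 subroutine, assuming the live MediaWiki API rather than a database dump (the dump-based path and the WikiWho age analysis are not shown); the requests library, the search strings, and the function names are assumptions for illustration.

```python
# Fetch an article's current wikitext via the MediaWiki API and run a
# simple string search over it.
import requests

API_URL = "https://zh.wikipedia.org/w/api.php"  # any language edition works

def fetch_wikitext(title):
    """Fetch the current wikitext of one article via the MediaWiki API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

def simple_string_matches(wikitext, needles=("as of 2010", "as of 2011")):
    """Return the search strings that occur in the article text."""
    return [needle for needle in needles if needle in wikitext]
```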
Weeks 3-4
- Implement and experiment with simple tests of the effectiveness of the following techniques (a scoring sketch follows this list):
- simple pattern matching based on text string search,
- pattern matching based on regular expression search,
- grammar-based parsing,
- Long Short-Term Memory (LSTM) and associated neural network language modeling, and
- linear dynamic text modeling.
- Write unit tests and documentation
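A minimal sketch of how these effectiveness experiments could be scored, assuming each technique is wrapped as a callable that flags passages and that a small hand-labeled sample of out-of-date passages is available; the data formats and function names are assumptions.

```python
# Score each candidate technique against a hand-labeled sample so the
# methods can be compared on precision and recall.
def precision_recall(flagged, labeled_outdated):
    """Compare flagged passage identifiers with the hand-labeled set of
    passages known to be out of date."""
    flagged, labeled_outdated = set(flagged), set(labeled_outdated)
    true_positives = len(flagged & labeled_outdated)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(labeled_outdated) if labeled_outdated else 0.0
    return precision, recall

def compare_matchers(matchers, passages, labeled_outdated):
    """Run each technique over the sample passages and report its
    precision and recall."""
    results = {}
    for name, matcher in matchers.items():
        flagged = [pid for pid, text in passages.items() if matcher(text)]
        results[name] = precision_recall(flagged, labeled_outdated)
    return results
```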
Week 4
- Create a Python bot to perform batch processing of articles for analysis (see the sketch after this list)
- Measure the bot's ability to locate and identify out-of-date information with quantitative statistical tests
- Write unit tests and documentation
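A minimal sketch of the batch-processing bot, assuming Pywikibot for page access and a detector such as the find_stale_candidates function from the earlier sketch; the choice of Pywikibot and the report structure are assumptions rather than fixed design decisions.

```python
# Scan a list of article titles, collect flagged passages, and return
# simple counts that can feed the quantitative evaluation.
import pywikibot

def run_batch(titles, detector, lang="zh"):
    """Run the detector over each article and report what was flagged."""
    site = pywikibot.Site(lang, "wikipedia")
    report = {}
    for title in titles:
        page = pywikibot.Page(site, title)
        flagged = detector(page.text)
        if flagged:
            report[title] = flagged
    summary = {
        "articles_scanned": len(titles),
        "articles_flagged": len(report),
    }
    return report, summary
```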
Week 5
- Integrate with the Accuracy Review evaluation system
- Write unit tests
- Complete documentation
- Begin writing the final report
Week 6
- Complete the final report
- Publish a preprint of the final report and a summary of the results to wikitech-l and at least the English and Chinese Wikipedias' Village Pumps
- Study feedback from editors and other developers
- Revise documentation based on feedback from wikitech-l and the Wikipedias' Village Pumps
- Submit the final report to an independent peer reviewed journal
Budget
5,000 USD for working and living expenses.
Community engagement
We will publish a preprint of the final report and a summary of the results to the Wikitech-l mailing list and at least the English and Chinese Wikipedias' Village Pumps. The feedback will be studied and used to improve the project documentation and report.
Sustainability
The modular software tools developed can be used by Wikireview or any other project for accuracy review. Both Wikipedia editors and reviewers should benefit from them.
Measures of success
We will measure the extent to which the passages automatically flagged as out-of-date or inaccurate by each of the pattern-matching algorithms are in fact out-of-date or inaccurate. We will also use unit test cases to compare the bot's results with our expectations.
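As a minimal illustration of how such unit test cases could encode expectations, the sketch below assumes a detector named find_stale_candidates living in a hypothetical stale_detector module, as in the earlier sketch; the fixture sentences and expected outcomes are invented for illustration.

```python
# Unit test cases comparing detector output against expected results.
import unittest

from stale_detector import find_stale_candidates  # hypothetical module name

class OutdatedDetectionTest(unittest.TestCase):
    def test_flags_old_as_of_statement(self):
        text = "As of 2009, the city had a population of 100,000."
        self.assertTrue(find_stale_candidates(text))

    def test_ignores_statement_without_a_year(self):
        text = "As of this year, the station serves two lines."
        self.assertFalse(find_stale_candidates(text))

if __name__ == "__main__":
    unittest.main()
```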
Project team
Li Linxuan - I have studied mathematical modeling, am especially interested in forecasting, and am familiar with algorithms used for prediction. I have been working with three teammates on big data analysis to predict website users' behavior when they are faced with advertisements. I have also participated in an AI competition, in which our team designed a bot in a month to compete with others in a game similar to Snake.[1] During the winter of 2015-16 I participated in an interdisciplinary contest in modeling.[2] Our team used several mathematical software tools to analyze and predict water conditions in California. I have some experience with graphic and website design. Most of my experience is with C and C++, and I have assembly language experience with the MIPS and x86 instruction sets, but recently I have become increasingly interested in Python. I use both Windows and Ubuntu.
James Salsman - Google Summer of Code mentor 2010-present; mentoring in the Wikimedia Foundation organization for GSoC 2016.
Community notification
- https://lists.wikimedia.org/pipermail/wikitech-l/2016-March/085133.html
- w:Wikipedia:Village pump (technical)#Creating a bot to identify out-of-date information in Wikipedia entries
- w:Wikipedia talk:Good articles#Creating a bot to identify out-of-date information in Wikipedia entries
- w:Wikipedia talk:Featured articles#Creating a bot to identify out-of-date information in Wikipedia entries
- https://lists.wikimedia.org/pipermail/wiki-research-l/2016-April/005125.html
- https://lists.wikimedia.org/pipermail/analytics/2016-April/005080.html