Research:Reliable sources and public policy issues on Wikipedia

13:00, 1 August 2023 (UTC)
Dr Amanda Lawrence, RMIT University
Mr Angel Felipe Magnossao de Paula, RMIT University
Duration: July 2023 – May 2024

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

This research project seeks to understand the extent to which policy research reports and papers from organisations are cited on Wikipedia, what kinds of sources are cited, and how editors and readers can be supported in evaluating their credibility.

A key part of Wikimedia’s defence system against mis/disinformation is its content and citation policies. However, Wikipedia’s reliable sources policies are still grounded in traditional notions of the research publishing economy as consisting primarily of commercial and scholarly publishers and mainstream news media. This is problematic for public policy and public interest topics, which tend to have a more diverse media economy of sources, including organisations based in the government, civil society, education and commercial sectors, and genres such as reports, policy briefs, fact sheets and datasets.

Public policy is a complex, dynamic and multicentric environment, and this is reflected in the diverse publishing ecosystem producing policy-related research, including international NGOs, national government agencies, think tanks and research centres. Publications produced by organizations (grey literature) are often more timely and accessible and provide perspectives from community and Indigenous organizations; however, some are also partisan and funded by commercial or vested interests, which makes evaluation of sources challenging.

The project will analyse and extend existing research on English Wikipedia (including Avieson 2022; Ford et al. 2013; Lewoniewski 2022; Luyt 2021; Singh et al. 2021; Wong et al. 2021) and the Missing Link Project, funded by a WMF Alliance grant in 2022. The research will involve mapping organisations and genres across key topics on English Wikipedia, including analysis by location, topic area, sector and genre, and will provide recommendations for improving guidelines so that they better reflect the complexity of the research publishing ecosystem. Wikidata will also be used to collect and analyse data, classify policy sources and genres, and visualise key policy networks.

The project will provide new insights not only for Wikimedia but also for the wider evidence and policy research community. It will also help to strengthen Wikipedia’s verifiability processes and Wikimedia’s role as a leader in digital and media literacy and education – helping to deliver the 2030 Movement Strategy as essential infrastructure of the free knowledge ecosystem.

Methods

To answer the two main research questions listed above, the project will take a sociotechnical approach to research methods and analytical tools, including content analysis, citation and network analysis, data linking, visualizations and case studies.

The analysis will focus on around 1,000 public policy related articles and their citations on English Wikipedia, combined with data from entities on Wikidata, including concepts, organizations and publishers, locations and other data. Various administrative pages on English WP will also be analysed for guidelines and policies, and a number of case studies will be developed on key topics and organizations.

To define the public policy domain, which crosses both science and social sciences, we will start with a number of key articles and use the internal link structure of Wikipedia, combined with categories, Wikidata concepts, etc., to develop a list of key topics across the domain. For example, based on the "What links here" link count, the Public policy article on Wikipedia has 2,147 direct links from other articles, while the Science policy article has 353 and the Environmental policy article has 651. Many of these policy topics have lists, portals, country-specific subpages, etc., which will also be analysed to provide a corpus of around 1,000 public policy related articles. Consultation on the list of articles for analysis will also occur with WikiProjects such as the science policy project and some of the environmental, public health and medicine projects, as well as other special interest groups.
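The corpus-building step described above can be sketched as a breadth-first expansion from a few seed articles. This is only an illustrative sketch, not the project's actual tooling: `build_corpus`, `get_links`, `max_depth` and the toy link graph are hypothetical names and data introduced here. In practice, `get_links` would wrap whatever link source the project adopts (for example the MediaWiki API or a dump), and category and Wikidata filters would be applied on top.

```python
from collections import deque
from typing import Callable, Iterable


def build_corpus(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_articles: int = 1000,
    max_depth: int = 2,
) -> list[str]:
    """Expand breadth-first from seed articles, capped by size and link depth."""
    seen: set[str] = set()
    corpus: list[str] = []
    queue: deque[tuple[str, int]] = deque()
    for title in seeds:
        if title not in seen:
            seen.add(title)
            queue.append((title, 0))
    while queue and len(corpus) < max_articles:
        title, depth = queue.popleft()
        corpus.append(title)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in get_links(title):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return corpus


# Toy link graph standing in for Wikipedia's internal links (hypothetical data).
TOY_LINKS = {
    "Public policy": ["Science policy", "Environmental policy"],
    "Science policy": ["Research funding", "Public policy"],
    "Environmental policy": ["Climate change policy"],
    "Research funding": [],
    "Climate change policy": ["Carbon tax"],
}


def toy_get_links(title: str) -> list[str]:
    return TOY_LINKS.get(title, [])


corpus = build_corpus(["Public policy"], toy_get_links, max_articles=5)
print(corpus)
```

Breadth-first order keeps articles closest to the seeds first, so a size cap of around 1,000 trims the most distant topics rather than arbitrary ones.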

Following the selection of content, references will be extracted and classified, then mapped to Wikidata entries, topics and locations. The citations from the policy arena can then be compared to the full citation data for English Wikipedia. As discussed earlier, access to WP citations is not easy; however, other researchers have developed various methods and tools, and existing datasets of citations are also available. Arroyo-Machado et al. (2022) provide a summary table of Wikipedia data sources by format, update frequency, data quantity, type and challenges, which includes: Wikimedia Dumps, MediaWiki and Wikimedia APIs, Wiki Replicas, Event Streams, Analytics dumps, WikiStats, DBpedia, XTools, Repositories and Altmetric aggregators. It is expected that, for this research, Wikipedia data dumps, web scraping from the target pages, and citation datasets from previous research will be the main data sources for citations. These will be linked and enhanced with data from other databases such as Wikidata, CrossRef, ISNI, OpenAlex, Dimensions and the Internet Archive.
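As a rough illustration of the extraction step, the sketch below pulls URLs out of `<ref>` tags in raw wikitext and tallies them by host name, which is one simple key for later linking to organizations. It is an assumption-laden sketch rather than the project's pipeline: the regular expressions, the function names `extract_cited_urls` and `host_of`, and the sample wikitext are all hypothetical, and regex matching will miss edge cases (for example ref names containing a slash, or citations that carry no URL) that a proper wikitext parser would handle.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing <ref ... />.
REF_PATTERN = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Stops at whitespace and at wikitext delimiters such as | ] } <.
URL_PATTERN = re.compile(r"https?://[^\s|\]}<]+")


def extract_cited_urls(wikitext: str) -> list[str]:
    """Return every URL found inside a <ref> body, in document order."""
    urls: list[str] = []
    for body in REF_PATTERN.findall(wikitext):
        urls.extend(URL_PATTERN.findall(body))
    return urls


def host_of(url: str) -> str:
    """Lower-cased host name with a leading 'www.' removed."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Hypothetical wikitext fragment using reserved example domains.
SAMPLE = (
    "Policy text.<ref>{{cite web |url=https://www.example.org/report.pdf "
    "|title=Example report |publisher=Example Institute}}</ref> "
    'More text.<ref name="b">[https://data.example.com/stats Statistics]</ref> '
    'Reused.<ref name="b" />'
)

cited_urls = extract_cited_urls(SAMPLE)
host_counts = Counter(host_of(u) for u in cited_urls)
print(host_counts)
```

Host names are only a first-pass key: many organizations publish under several domains, so the mapping to Wikidata entities would still need curation.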

The final dataset will then be analysed for frequency of citation and type of organization, and visualized using tools such as network graphs, timelines and geospatial mapping. A rating of the reputation of sources will be made based on the information available on organizations via WP and WD, the reputable sources lists and other sources. Where poor sources have been cited, these may be flagged on the relevant pages. The data extraction, linking and analysis process will be assisted by a data scientist working on the project for 2 months.
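The frequency analysis can be pictured with a small tally of citations by publisher and by sector. The sketch below is illustrative only: the `tally` function, the sector labels, the publisher names and the citation pairs are all hypothetical placeholders, and in the project the sector classification would come from Wikidata and manual review rather than a hard-coded lookup.

```python
from collections import Counter

# Hypothetical, hand-labelled lookup from publisher to sector.
SECTOR_BY_PUBLISHER = {
    "Example Institute": "think tank",
    "Example Ministry": "government",
    "Example University Press": "scholarly publisher",
}

# Hypothetical (article, publisher) pairs as they might come out of extraction.
CITATIONS = [
    ("Public policy", "Example Institute"),
    ("Public policy", "Example Ministry"),
    ("Science policy", "Example Institute"),
    ("Environmental policy", "Unlisted Org"),
]


def tally(citations, sector_by_publisher):
    """Count citations per publisher and per sector; unknown publishers are 'unclassified'."""
    by_publisher = Counter(publisher for _, publisher in citations)
    by_sector = Counter(
        sector_by_publisher.get(publisher, "unclassified") for _, publisher in citations
    )
    return by_publisher, by_sector


by_publisher, by_sector = tally(CITATIONS, SECTOR_BY_PUBLISHER)
print(by_publisher.most_common())
print(by_sector.most_common())
```

Keeping an explicit "unclassified" bucket makes visible how much of the corpus still lacks organization data, which is itself a useful finding for grey literature.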

Consultation with, and feedback from, the Wikimedia community will occur at Wikimania in Singapore in August and the Wikidata conference in Taiwan in September 2023, and online with various projects including WikiCite and the Shared citations project. Funding for attending the Wikidata conference in Taiwan is included in the budget.

Following an analysis of guidelines available on WP and consultation with various projects, such as science and medicine, and other interest groups, a set of draft guidelines for grey literature will be developed and circulated. The data, a project report and a journal article will be published open access.

Timeline

July
• Project initiation, literature review, analysis of existing data
• Initial page review and analysis
• Engage data analyst for the project for August to October

August – September
• Consultations with community at Wikimania, Singapore in August
• Review policy pages in WP and organizations in Wikidata and make selection of corpus
• Extract citation data from corpus into structured format
• Data cleaning and linking

October
• Data cleaning and linking
• Consultation with community at Wikidata Taiwan on data analysis

November – December
• Data analysis and visualisation underway

January – March 2024
• Case study analysis, data synthesis and initial results write-up
• Guidelines developed and circulated for review and feedback, including giving presentations to key groups across the Wikimedia community

April
• Publication of report, data and methods in open access report on Zenodo, OSF or GitHub
• Journal article submission
• Project completion and reporting
• Guideline development ongoing with community and proposal for changes

Policy, Ethics and Human Subjects Research

This project will have minimal interference with the work of Wikipedians.

Results

Forthcoming 2024

Resources

Programs/Wikimedia Research Fund/Reliable sources and public policy issues: understanding multisector organisations as sources on Wikipedia and Wikidata
Project website forthcoming

References