Grants:Programs/Wikimedia Research Fund/Improving the Persistence of External References on Wikipedia

status: not funded
Improving the Persistence of External References on Wikipedia
start and end dates: July 2023 - July 2024
budget (USD): 50,000 USD
fiscal year: 2022-23
applicant(s): Harsha Madhyastha

Overview

Applicant(s)

Harsha Madhyastha

Affiliation or grant type

University of Michigan

Author(s)

Harsha Madhyastha

Wikimedia username(s)

Harsha Madhyastha: User:HarshaMadhyastha

Project title

Improving the Persistence of External References on Wikipedia

Research proposal

Description

Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

A key challenge in preserving Wikipedia for future generations is that, even a few years after an article has been compiled, some of its external references cease to work [1], robbing visitors of the context that the article's editors meant to provide them.

To address this problem, the current best practice on Wikipedia is to augment any broken external reference with a link to an archived copy of the dysfunctional URL; the InternetArchiveBot implements this approach at scale. However, this best practice is neither complete nor sufficient.

1. Current systems identify a link as broken only if an error is encountered when crawling it. However, many links return a non-erroneous response but redirect to an unrelated page. Reference 2 in https://en.wikipedia.org/wiki/Brian_Dubie is an example of such a "soft-404". In other cases, there can be content drift, i.e., the content at the link may have been modified, so that the link no longer serves the purpose for which it was created.

2. Even if a link is easy to identify as broken, an archived copy is not always an appropriate substitute for the original page. For example, modern web pages often include JavaScript and rely on back-end services to provide rich app-like functionality. In these cases, an archived page snapshot offers poor fidelity [2], e.g., see the last link in the "External links" section of https://en.wikipedia.org/wiki/Mars_Express. To patch such a broken link, one should instead attempt to determine whether the page previously available at that link still exists at an alternate URL (in our example, the page is at https://trek.nasa.gov/mars/); the original link may have stopped working only because the site hosting it was reorganized and the URL for that page changed.
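To make the first limitation concrete, the following is a minimal sketch of how a crawler's result might be checked for a suspected soft-404 using only the redirect target. The function name `classify_link`, its labels, and both rules are hypothetical illustrations, not the algorithms this project proposes or the logic used by InternetArchiveBot. The sketch cannot detect content drift, which requires comparing page content over time.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Return the lowercased hostname with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify_link(original_url: str, status_code: int, final_url: str) -> str:
    """Label a crawled link as 'broken', 'suspected_soft_404', or 'ok'.

    Hypothetical heuristic for illustration only.
    original_url: the URL cited in the article.
    status_code:  the HTTP status of the final response.
    final_url:    the URL reached after following redirects.
    """
    # Hard failure: the server itself reported an error.
    if status_code >= 400:
        return "broken"

    orig_path = urlsplit(original_url).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/")

    # Assumed rule 1: a deep link that lands on a site's root page has
    # probably lost the content it once pointed to.
    if orig_path and not final_path:
        return "suspected_soft_404"

    # Assumed rule 2: a redirect to a different host and a different path
    # is treated as suspicious.
    if _host(original_url) != _host(final_url) and orig_path != final_path:
        return "suspected_soft_404"

    return "ok"
```

A link flagged as `suspected_soft_404` would still need manual review, since legitimate site migrations can trigger both rules.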

In this project, we aim to study these limitations in two ways. First, we will quantify the prevalence of the above-mentioned problems. Similar to prior studies [3], we will manually examine a random sample of external links on Wikipedia. Second, we will aim to devise algorithms that can automate the identification of broken links which are either missed by current systems or where an archived copy proves insufficient. These algorithms could help inform future revisions to systems such as InternetArchiveBot.
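As an illustration of the first step, the sketch below draws a reproducible random sample of links and turns a manual count into a prevalence estimate with an uncertainty range. The helper names `sample_links` and `wilson_interval`, the fixed seed, and the choice of the Wilson score interval are assumptions made for this example; the proposal does not specify a sampling or estimation procedure.

```python
import math
import random


def sample_links(links, k, seed=0):
    """Draw a reproducible simple random sample of k links without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(list(links), k)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, if manual inspection of a sample of `n` links finds `hits` problematic ones, `wilson_interval(hits, n)` gives a range for the share of such links in the population from which the sample was drawn.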

[1] https://blog.archive.org/2018/10/01/more-than-9-million-broken-links-on-wikipedia-are-now-rescued/

[2] https://www.usenix.org/conference/osdi22/presentation/goel

[3] https://dash.harvard.edu/handle/1/37367405

Personnel

N/A

Budget

Approximate amount requested in USD.

50,000 USD

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

The funds will be used to support the following:

- A graduate student researcher's 25% appointment (tuition, stipend, and benefits) for 12 months

- Half a month of summer salary plus associated benefits for the PI

- 15% overhead on direct costs

Impact

Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

One of the main thrusts in Wikimedia’s 2030 Strategic Direction is to improve the integrity of knowledge available on Wikipedia. A significant long-term threat is that, though millions of contributors and community editors put in the effort to include appropriate citations, many of these external references decay over time.

Our work aims to preserve the fruits of the collective effort put into ensuring Wikipedia's verifiability. By both quantifying the shortcomings of current best practices to cope with this issue and studying how these shortcomings could be overcome, our work will inform future improvements to the systems used to patch external references on Wikipedia.

Dissemination

Plans for dissemination.

We will aim to publish a research paper that describes our findings.

To specifically communicate our findings to the Wikimedia community, we will 1) post on forums such as Wikipedia's Village pump, where potential improvements to wikibots are discussed by the community, and 2) give a talk at Wikimedia Research's monthly showcase.

We have been communicating all of our previous findings on this topic to the Internet Archive, and we will continue to do the same in this project.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

Our research paper [1] on characterizing broken links on Wikipedia for which no archived copies exist helped inform revisions to WaybackMedic, another bot that Internet Archive runs on Wikipedia to fix dead links.

We have developed a system called FABLE; given a broken URL, it determines if the page previously available at that URL still exists on the web, and if so, at what new URL. Encouraged by FABLE's high accuracy in finding URL replacements for permanently dead links [2], we plan to start developing a wikibot based on FABLE next year.

[1] https://dl.acm.org/doi/10.1145/3517745.3561451

[2] https://en.wikipedia.org/wiki/User:FABLEBot/New_URLs_for_permanently_dead_external_links


I agree to license the information I entered in this form excluding the pronouns, countries of residence, and email addresses under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself along with all the information entered by me in this form excluding the pronouns, countries of residence, and email addresses of the personnel will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.

Yes