Wikimedia Research Newsletter

Vol: 14 • Issue: 05 • May 2024

ChatGPT did not kill Wikipedia, but might have reduced its growth

By: Tilman Bayer

Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it

A preprint[1] by three researchers from King's College London tries to identify the impact of the November 2022 launch of ChatGPT on "Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia." The analysis concludes that

"any impact has been limited and while [ChatGPT] may have led to lower growth in engagement [i.e. Wikipedia pageviews] within the territories where it is available, there has been no significant drop in usage or editing behaviours"

The authors note that there are good a priori reasons to hypothesize that ChatGPT may have replaced Wikipedia for some usages:

"At this time, there is limited published research which demonstrates how and why users have been engaging with ChatGPT, but early indications would suggest users are turning to it in place of other information gathering tools such as search engines [...]. Indeed, question answering, search and recommendation are key functionalities of large language models identified in within the literature [...]"

However, like many other current concerns about AI, these have been speculative and anecdotal. Hence the value of a quantitative analysis that tries to identify the causal impact of ChatGPT on Wikipedia in a statistically rigorous manner. Without conducting experiments though, i.e. based on observational data alone, it is not easy to establish that a particular change or external event caused persistent increases or decreases in Wikipedia usage overall (as opposed to one-time spikes from particular events, or recurring seasonal changes).

The paper's literature review section cites only one previous publication that achieved this for Wikipedia pageviews: a 2019 paper by three authors from the Wikimedia Foundation (see our earlier coverage: "An awareness campaign in India did not affect Wikipedia pageviews, but a new software feature did"). They used a fairly sophisticated statistical approach (Bayesian structural time series) to first create a counterfactual forecast of Wikipedia traffic in a world where the event in question did not happen, and then interpreted the difference between that forecast and the actual traffic as the event's impact. Their method successfully estimated the impact of a software change (consistent with the results of a previous randomized experiment conducted by this reviewer), as highlighted by the authors of the present paper: "Technological changes can [...] have significant and pervasive changes in user behaviour as demonstrated by the significant and persistent drop in pageviews observed in 2014 [sic, actually 2018] when Wikipedia introduced a page preview feature allowing desktop users to explore Wikipedia content without following links." The WMF authors concluded their 2019 paper by expressing the hope that "it lays the groundwork for exploring more standardized methods of predicting trends such as page views on Wikipedia with the goal of understanding the effect of external events."

In contrast, the present paper starts out with a fairly crude statistical method.

First,

"We gathered data for twelve languages from the Wikipedia API covering a period of twenty two months between the 1st of January 2021 and the 1st of January 2024. This includes a period of approximately one year following the date on which ChatGPT was initially released on the 30th of November 2022."

(The paper does not state which 22 months of the 36 months in that timespan were included.)

The 12 Wikipedia languages were

"selected to ensure geographic diversity covering both the global north and south. When selecting languages, we looked at three key factors:

  1. The common crawl size of the GPT-3 main training data as a proxy for the effectiveness of ChatGPT in that language.
  2. The number of Wikipedia articles in that language.
  3. The number of global first and second language speakers of that language.

We aimed to contrast languages with differing numbers of global speakers and languages with differing numbers of Wikipedia articles [...ending up with English, Urdu, Swahili, Arabic, Italian and Swedish].

As a comparison, we also analysed six languages selected from countries where ChatGPT is banned, restricted or otherwise unavailable [Amharic, Farsi, Russian, Tigrinya, Uzbek and Vietnamese]."

Then, "[a]s a first step to assess any impact from the release of ChatGPT, we performed paired statistical tests comparing aggregated statistics for each language for a period before and after release" (the paper leaves it unclear how long these periods were). E.g.

"For page views, we first performed a two-sided Wilcoxon Rank Sum test to identify whether there was a difference between the two periods (regardless of directionality). We found a statistically significant different for five of the six languages where ChatGPT was available and two of the six languages where it was not. However, when repeating this test with a one-sided test to identify if views in the period after release were lower than views in the period before release, we identified a statistically significant result in Swedish, but not for the remaining 11 languages."
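The gap between the two-sided and the one-sided result is easier to see with a toy example. The following is a minimal sketch, not the paper's code (none was published): it implements the Wilcoxon rank-sum test (equivalent to Mann-Whitney U) with a plain normal approximation, on synthetic numbers in which views go up after the launch date. The function names and the data are made up for illustration, and the approximation omits tie and continuity corrections.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(before, after):
    """Return (z, two-sided p, one-sided p for 'after is lower')."""
    n1, n2 = len(before), len(after)
    ranks = average_ranks(list(before) + list(after))
    u_after = sum(ranks[n1:]) - n2 * (n2 + 1) / 2   # Mann-Whitney U of "after"
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u_after - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    p_after_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return z, p_two_sided, p_after_lower

# Synthetic daily views: the "after" period is HIGHER than the "before" period.
before = [100 + i for i in range(30)]
after = [120 + i for i in range(30)]
z, p_two_sided, p_after_lower = rank_sum_test(before, after)
print(f"z={z:.2f}  two-sided p={p_two_sided:.2g}  one-sided p={p_after_lower:.2g}")
```

With these numbers the two-sided p-value is tiny while the one-sided p-value for "after is lower" is close to 1. In other words, a pattern of "two-sided significant, one-sided not" is what one would expect when views rose rather than fell.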

For the other three metrics (unique devices, active editors, and edits) the results were similarly ambiguous, motivating the authors to resort to a somewhat more elaborate approach:

"While the Wilcoxon Signed-Rank test provided weak evidence for changes among the languages before and after the release of ChatGPT, we note ambiguities in the findings and limited accounting for seasonality. To address this and better evaluate any impact, we performed a panel regression using data for each of the four metrics. Additionally, to account for longer-term trends, we expanded our sample period to cover a period of three years with data from the 1st of January in 2021 to the 1st of January 2024."

While this second method accounts for weekly and yearly seasonality, it too does not attempt to disentangle the impact of ChatGPT from ongoing longer term trends. (While the given regression formula includes a language-specific fixed effect, it doesn't have one for the availability of ChatGPT in that language, and also no slope term.) The usage of Wikipedia might well have been decreasing or increasing steadily during those three years for other reasons (say the basic fact that every year, the number of Internet users worldwide increases by hundreds of millions). Indeed, a naive application of the method would yield the counter-intuitive conclusion that ChatGPT increased Wikipedia traffic in those languages where it was available:
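To illustrate why the missing slope term matters, here is a small sketch with synthetic data (it is not the paper's regression, which also includes day-of-week, week-of-year and language effects). The series grows steadily and has no launch effect at all; the launch index and growth rate are arbitrary assumptions. A model with only a post-launch dummy still reports a sizeable "effect", which vanishes once a linear trend is added (computed here via the Frisch-Waugh-Lovell residualization trick to stay within the standard library).

```python
from statistics import mean

def ols_slope(x, y):
    """Slope of a simple linear regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def residualize(x, y):
    """Residuals of y after regressing it on x (with intercept)."""
    b = ols_slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

days = list(range(1095))      # about three years of daily observations
launch_day = 698              # hypothetical index of the launch date

# Synthetic log pageviews: steady growth, and NO launch effect whatsoever.
log_views = [10.0 + 0.0002 * d for d in days]
post = [1.0 if d >= launch_day else 0.0 for d in days]

# Dummy-only model: the coefficient is just mean(post) - mean(pre).
naive_effect = ols_slope(post, log_views)

# Dummy plus linear trend: partial out time from both variables first.
adjusted_effect = ols_slope(residualize(days, post),
                            residualize(days, log_views))

print(f"dummy only:       {naive_effect:+.4f} (log points)")
print(f"dummy with trend: {adjusted_effect:+.4f} (log points)")
```

The dummy-only coefficient is clearly positive even though nothing happened at the launch date, which mirrors the counter-intuitive "rise" discussed above.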

"For all six languages, [using panel regression] we found a statistically significant difference in page views associated with whether ChatGPT had launched when controlling for day of the week and week of the year. In five of the six languages, this was a positive effect with Arabic featuring the most significant rise (18.3%) and Swedish featuring the least (10.2%). The only language where a fall was observed was Swahili, where page views fell by 8.5% according to our model. However, Swahili page viewing habits were much more sporadic and prone to outliers perhaps due to the low number of visits involved."

To avoid this fallacy (and partially address the aforementioned lack of trend analysis), the authors apply the same method to their (so to speak) control group, i.e. "the six language versions of Wikipedia where ChatGPT was unavailable":

"Once again, results showed a statistically significant rise across five of the six languages. However, in contrast with the six languages where ChatGPT was available, these rises were generally much more significant. For Farsi, for example, our model showed a 30.3% rise, while for Uzbek and Vietnamese we found a 20.0% and 20.7% rise respectively. In fact, four of the languages showed higher rises than all of the languages where ChatGPT was available except Arabic, while one was higher than all languages except Arabic and Italian."

The authors stop short of using this difference (generally larger pageview increases in the languages without ChatGPT, generally smaller increases in those where it was available) to quantify the overall effect of ChatGPT directly, perhaps because such an estimate would become rather statistically involved and require additional assumptions. Instead, the paper's conclusion frames this finding in vague, qualitative terms, stating that ChatGPT "may have led to lower growth in engagement [pageviews] within the territories where it is available".
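For readers wondering what such a quantification could look like in its very simplest form, the sketch below does difference-in-differences style arithmetic on two of the percentage rises quoted above. This is purely illustrative: the pairing of one "available" language with one "unavailable" language is arbitrary, the function name is made up, and the result would only be interpretable as an effect of ChatGPT under a strong parallel-trends assumption that neither the paper nor this review establishes.

```python
def relative_gap(treated_rise, control_rise):
    """Ratio-based gap between two fractional rises (e.g. 0.102 for 10.2%).

    Assumes, hypothetically, that the treated language would otherwise have
    grown at the same rate as the control language (parallel trends).
    """
    return (1 + treated_rise) / (1 + control_rise) - 1

swedish_rise = 0.102      # reported rise, ChatGPT available
vietnamese_rise = 0.207   # reported rise, ChatGPT unavailable

gap = relative_gap(swedish_rise, vietnamese_rise)
print(f"Swedish relative to Vietnamese: {gap:+.1%}")
```

A real estimate would need all twelve languages, uncertainty intervals, and a defensible argument that the two groups are comparable, which is presumably part of why the authors did not attempt one.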

For the other three metrics studied (unique devices, active editors, and edits), the results appear to have been even less conclusive. E.g. for edits, "[p]anel regression results for the six languages were generally not statistically significant. Among the languages where a significant result was found, our model suggested a 23.7% rise in edits in Arabic, while for Urdu the model suggested a 21.8% fall."

In the "Conclusion" section, the authors summarize this as follows:

"Our findings suggest an increase in page visits and visitor numbers [i.e. page views and unique devices] that occurred across languages regardless of whether ChatGPT was available or not, although the observed increase was generally smaller in languages from countries where it was available. Conversely, we found little evidence of any impact for edits and editor numbers. We conclude any impact has been limited and while it may have led to lower growth in engagement within the territories where it is available, there has been no significant drop in usage or editing behaviours."

Unfortunately, this preprint does not adhere to research best practices for sharing replication data or code (let alone a preregistration). This makes it impossible to check, for example, whether the analysis of pageviews included automated traffic by spiders etc. (the default setting in the Wikimedia Foundation's Pageviews API), which would considerably affect the interpretation of the results. The paper itself notes that such an attempt was made for edits ("we tried to limit the impact of bots by requesting only contributions from users") but doesn't address the analogous question for pageviews.
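To make the distinction concrete, the sketch below builds request URLs for aggregate pageviews with and without automated traffic. It assumes the path layout of the public Wikimedia REST endpoint for aggregate pageviews (project, access, agent, granularity, start, end) and its agent values as this reviewer understands them; the helper name is made up, no request is sent, and the details should be checked against the API documentation before use. Which setting the paper's authors actually used is unknown.

```python
from urllib.parse import quote

# Assumed base path of the aggregate pageviews endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
AGENTS = {"all-agents", "user", "spider", "automated"}

def aggregate_pageviews_url(project, agent, start, end,
                            access="all-access", granularity="daily"):
    """Build (but do not fetch) an aggregate pageviews URL.

    start and end are timestamps in YYYYMMDDHH form.
    """
    if agent not in AGENTS:
        raise ValueError(f"unknown agent type: {agent!r}")
    parts = [BASE, quote(project, safe=""), access, agent,
             granularity, start, end]
    return "/".join(parts)

# Human traffic only versus everything, spiders included.
human_url = aggregate_pageviews_url("sv.wikipedia.org", "user",
                                    "2021010100", "2024010100")
all_url = aggregate_pageviews_url("sv.wikipedia.org", "all-agents",
                                  "2021010100", "2024010100")
print(human_url)
print(all_url)
```

For small wikis such as Swahili or Tigrinya, the difference between the two series can plausibly be large, which is why the choice matters for the paper's conclusions.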

An earlier version of the paper as uploaded to arXiv had the title "'The Death of Wikipedia?' – Exploring the Impact of ChatGPT on Wikipedia Engagement", which was later shortened by removing the attention-grabbing "Death of Wikipedia". As explained in the paper itself, that term refers to "an anonymous Wikipedia editor's fears that generative AI tools may lead to the death of Wikipedia" – specifically, the essay w:User:Barkeep49/Death of Wikipedia, via its mention in a New York Times article, see w:Wikipedia:Wikipedia Signpost/2023-08-01/In the media. While the paper's analysis conclusively disproves that Wikipedia has died as of May 2024, it is worth noting that Barkeep49 did not necessarily predict the kind of immediate, lasting drop that the paper's methodology was designed to measure. In fact, the aforementioned NYT article quoted him as saying (in July 2023) "It wouldn't surprise me if things are fine for the next three years [for Wikipedia] and then, all of a sudden, in Year 4 or 5, things drop off a cliff." Nevertheless, the paper's findings give reason to doubt that this will be the first of the many predictions of the end of Wikipedia to come true.

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Do We Trust ChatGPT as much as Google Search and Wikipedia?"

From the abstract:[2]

"A focus group and interview study (N=14) revealed that thankfully not all users trust ChatGPT-generated information as much as Google Search and Wikipedia. It also shed light on the primary psychological considerations when trusting an online information source, namely perceived gatekeeping, and perceived information completeness. In addition, technological affordances such as interactivity and crowdsourcing were also found to be important for trust formation."

From the paper:

"Among all three information sources, Google was the most trusted platform, favored by 57% of our participants, followed by Wikipedia, which was liked by 29% of our participants [...]. Four participants expressed that ChatGPT is less credible than Google because it does not disclose the original source of the information."

It should be noted that the authors' relieved conclusion ("thankfully") is somewhat in contrast with the result of a larger scale blind experiment published last year in preprint form (see our coverage: "In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible").


WikiChat, "the first few-shot LLM-based chatbot that almost never hallucinates"

"All WikiChat components, and a sample conversation about an upcoming movie [Oppenheimer], edited for brevity. The steps taken to generate a response include (1) generating a query to retrieve from Wikipedia, (2) summarizing and filtering the retrieved passages, (3) generating a response from an LLM, (4) extracting claims from the LLM response (5) fact-checking the claims in the LLM response using retrieved evidence, (6) drafting a response, and (7) refining the response." (Figure 1 from the paper)

From the abstract of this paper (by three graduate students at Stanford University's computer science department and Monica S. Lam as fourth author):[3]

"This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter LLaMA model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment. [...] we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments."

An online demo is available at https://wikichat.genie.stanford.edu/ . The code underlying the paper has been released under an open source license, and two distilled models (for running the chatbot locally without relying e.g. on OpenAI's API) have been published on Hugging Face.

See also our review of a previous (preprint) version of this paper: "Wikipedia-based LLM chatbot 'outperforms all baselines' regarding factual accuracy"

"A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth"

From the abstract:[4]

"We illustrate a simple model of knowledge scaffolding, based on the process of building a corpus of knowledge, each item of which is linked to “previous” ones. [...]. Our model can be used as a rough approximation to the asymptotic growth of Wikipedia, and indeed, actual data show a certain resemblance with our model. Assuming that the user base is growing, at beginning, in an exponential way, one can also recover the early phases of Wikipedia growth."

 
"The fundamental knowledge scaffolding model. (left) Knowledge bits are represented as nodes of a network, where different colors represent different levels and nodes at a certain level only depend on a certain number of nodes at lower levels. Green (basic) nodes represent axioms. (right) Observing the filling of the network (here with fixed width W and with fixed number of dependencies K), one can detect holes [e.g. content gaps on Wikipedia] that are filled after the appearance of nodes at higher levels." (from the paper)


"males outperform females" when navigating Wikipedia under time pressure

From the abstract:[5]

"we conducted an online experiment where participants played a navigation game on Wikipedia and completed personal information questionnaires. Our analysis shows that age negatively affects knowledge space navigation performance, while multilingualism enhances it. Under time pressure, participants’ performance improves across trials and males outperform females, an effect not observed in games without time pressure. In our experiment, successful route-finding is usually not related to abilities of innovative exploration of routes."

From the paper:

"In a popular online navigation game on Wikipedia, implemented in several versions such as the Wikispeedia (https://dlab.epfl.ch/wikispeedia/play/) and the Wikigame (https://www.thewikigame.com/), players try to go from one Wikipedia article (source) to another (target) through the hyperlinks of other articles within the Wikipedia website. Several navigation patterns on the Wikipedia knowledge network have been discovered: players typically first navigate to more general and popular articles and then narrow down to articles that are semantically closer to the target[...]; players’ search is not Markovian, meaning that a navigation step depends on the previous steps taken by the players [...] To gain a better understanding of how navigation on the knowledge network is affected by individual characteristics, we conducted an online experiment where we hired 445 participants from the US to play nine rounds of Wikipedia navigation games [...]"
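As a point of reference for what "successful route-finding" means in such a game, the sketch below computes the shortest click path between two articles with a breadth-first search. The tiny link graph and its article names are invented for illustration and have nothing to do with the study's data; real players, as the quote notes, do not follow such optimal paths.

```python
from collections import deque

def shortest_click_path(links, source, target):
    """Breadth-first search over a dict mapping article -> linked articles."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            path = []
            while page is not None:      # walk back to the source
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in links.get(page, []):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None                          # target not reachable

# Hypothetical miniature link graph.
links = {
    "Banana": ["Fruit", "Plant"],
    "Fruit": ["Food", "Plant"],
    "Plant": ["Biology"],
    "Food": ["Agriculture"],
    "Biology": ["Science"],
    "Agriculture": ["Science", "Economy"],
    "Science": ["Physics"],
}

path = shortest_click_path(links, "Banana", "Physics")
print(" -> ".join(path), f"({len(path) - 1} clicks)")
```

A player's performance in a round can then be compared against this optimum, for example by counting extra clicks.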

"WorldKG": Interlinking Wikidata and OpenStreetmap

From the abstract:[6]

"[...] the coverage of geographic entities in popular general-purpose knowledge graphs, such as Wikidata and DBpedia, is limited. An essential source of the openly available information regarding geographic entities is OpenStreetMap (OSM). In contrast to knowledge graphs, OSM lacks a clear semantic representation of the rich geographic information it contains. [...] This chapter discusses recent knowledge graph completion methods for geographic data, comprising entity linking and schema inference for geographic entities, to provide semantic geographic information in knowledge graphs. Furthermore, we present the WorldKG knowledge graph, lifting OSM entities into a semantic representation."

From the paper:

"As of September 2022, WORLDKG contains over 800 million triples describing approximately a 100 million entities that belong to over 1,000 distinct classes. The number of unique properties (wgks:WKGProperty) in WORLDKG is over 1,800. [...] WORLDKG provides links to 40 Wikidata and 21 DBpedia classes."

From https://www.worldkg.org/data :

"In total, WorldKG covers 113,444,975 geographic entities, clearly more than Wikidata (8,621,058) and DBpedia (8,621,058)."


Dissertation: "Multilinguality in knowledge graphs" such as Wikidata

From the abstract:[7]

"In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages. We explore the current state of language distribution in knowledge graphs by developing a framework – based on existing standards, frameworks, and guidelines – to measure label and language distribution in knowledge graphs. We apply this framework to a dataset representing the web of data, and to Wikidata. [...] Due to its multilingual editors, Wikidata has a better distribution of languages in labels. [...] A way of overcoming the lack of multilingual information in knowledge graphs is to transliterate and translate knowledge graph labels and aliases. We propose the automatic classification of labels into transliteration or translation in order to train a model for each task. [...] A use case of multilingual labels is the generation of article placeholders for Wikipedia using neural text generation in lower-resourced languages. On the basis of surveys and semi-structured interviews, we show that Wikipedia community members find the placeholder pages, and especially the generated summaries, helpful, and are highly likely to accept and reuse the generated text."

See also mw:Extension:ArticlePlaceholder and our coverage of a subsequent paper: "Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective"


"Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs" such as Wikidata

From the abstract:[8]

"Recent work in Natural Language Processing and Computer Vision has been using textual information – e.g., entity names and descriptions – available in knowledge graphs [such as Wikidata] to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we [...] i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families."

"LIS Journals’ Lack of Participation in Wikidata Item Creation"

From the abstract:[9]

"... This article presents findings from a survey investigating practices of library and information studies (LIS) journals in Wikidata item creation. Believing that a significant number of LIS journal editors would be aware of Wikidata and some would be creating Wikidata items for their publications, the authors sent a survey asking 138 English-language LIS journal editors if they created Wikidata items for materials published in their journal and follow-up questions. With a response rate of 41 percent, respondents overwhelmingly indicated that they did not create Wikidata items for materials published in their journal and were completely unaware of or only somewhat familiar with Wikidata. Respondents indicated that more familiarity with Wikidata and its benefits for scholarly journals as well as institutional support for the creation of Wikidata items could lead to greater participation; however, a campaign of education about Wikidata, documentation of benefits, and support for creation would be a necessary first step."

Survey on entity linking: Wikidata's potential is still underused

From the paper:[10]

"Entity Linking (EL) is the task of connecting already marked mentions in an utterance to their corresponding entities in a knowledge graph (KG) [...]. In the past, this task was tackled by using popular knowledge bases such as DBpedia [67], Freebase [11] or Wikipedia. While the popularity of those is still imminent, another alternative, named Wikidata [120], appeared."

From the abstract:

"Our survey reveals that current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia. Thus, the potential for multilingual and time-dependent datasets, naturally suited for Wikidata, is not lifted. Furthermore, we show that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics such as the hyper-relational structure. [...] Many approaches also include information from Wikipedia, which is easily combinable with Wikidata and provides valuable textual information, which Wikidata lacks."


References

  1. Reeves, Neal; Yin, Wenjie; Simperl, Elena (2024-05-22). "Exploring the Impact of ChatGPT on Wikipedia Engagement". arXiv:2405.10205 [cs.HC]. 
  2. Jung, Yongnam; Chen, Cheng; Jang, Eunchae; Sundar, S. Shyam (2024-05-11). "Do We Trust ChatGPT as much as Google Search and Wikipedia?". Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems. CHI EA '24. New York, NY, USA: Association for Computing Machinery. pp. 1–9. ISBN 9798400703317. doi:10.1145/3613905.3650862. 
  3. Semnani, Sina; Yao, Violet; Zhang, Heidi; Lam, Monica (December 2023). "WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia". Findings of the Association for Computational Linguistics: EMNLP 2023. EMNLP 2023. Singapore: Association for Computational Linguistics. pp. 2387–2413. doi:10.18653/v1/2023.findings-emnlp.157.
  4. Bagnoli, Franco; de Bonfioli Cavalcabo’, Guido (February 2023). "A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth". Future Internet 15 (2): 67. ISSN 1999-5903. doi:10.3390/fi15020067. 
  5. Zhu, Manran; Yasseri, Taha; Kertész, János (2024-04-09). "Individual differences in knowledge network navigation". Scientific Reports 14 (1): 8331. Bibcode:2024NatSR..14.8331Z. ISSN 2045-2322. arXiv:2303.10036. doi:10.1038/s41598-024-58305-2. 
  6. Dsouza, Alishiba; Tempelmeier, Nicolas; Gottschalk, Simon; Yu, Ran; Demidova, Elena (2024). "WorldKG: World-Scale Completion of Geographic Information". In Dirk Burghardt; Elena Demidova; Daniel A. Keim. Volunteered Geographic Information: Interpretation, Visualization and Social Context. Cham: Springer Nature Switzerland. pp. 3–19. ISBN 978-3-031-35374-1. doi:10.1007/978-3-031-35374-1_1.  Dataset: doi:10.5281/zenodo.4953986
  7. Kaffee, Lucie-Aimée (October 2021). Multilinguality in knowledge graphs (Thesis). University of Southampton. 
  8. Conia, Simone; Li, Min; Lee, Daniel; Minhas, Umar; Ilyas, Ihab; Li, Yunyao (December 2023). "Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs". In Houda Bouamor, Juan Pino, Kalika Bali. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. EMNLP 2023. Singapore: Association for Computational Linguistics. pp. 1612–1634. doi:10.18653/v1/2023.emnlp-main.100. 
  9. Willey, Eric; Radovsky, Susan (2024-01-02). "LIS Journals' Lack of Participation in Wikidata Item Creation". KULA: Knowledge Creation, Dissemination, and Preservation Studies 7 (1): 1–12. ISSN 2398-4112. doi:10.18357/kula.247. 
  10. Möller, Cedric; Lehmann, Jens; Usbeck, Ricardo (2021). "Survey on English Entity Linking on Wikidata". Semantic Web Journal, Special issue: Latest Advancements in Linguistic Linked Data. Also as: Möller, Cedric; Lehmann, Jens; Usbeck, Ricardo (2021-12-03). "Survey on English Entity Linking on Wikidata". arXiv:2112.01989 [cs.CL].

