Research:Understanding Search Engine To Wikipedia

Duration: January 2023 – ??
Status: Blocked

This page documents a stalled research project.


We know that a large fraction of Wikipedia's readers reach Wikipedia through a search engine, most notably Google[1]. In this project, we aim to study what proportion of Google search queries lead to a user visiting Wikipedia. To this end, we will combine data from Google Trends (what users on Google are searching for) and the Wikipedia Clickstream (how readers reach Wikipedia) to estimate these proportions.

Background

Google and Wikipedia are major parts of the web's infrastructure, and studying their relationship and interdependence is crucial. They share a symbiotic relationship, with Google driving web traffic to Wikipedia[2] and Wikipedia improving the quality of search results[3][4]. However, the nature of this relationship is often in flux: in 2015, with the introduction of the Knowledge Panel in Google Search results, Wikipedia experienced a substantial drop in page views. It has since been a priority of the Wikimedia Foundation to better study the reuse of Wikimedia content on the internet and the consequences this has for Wikipedia.

For more information on the background of the relationship between search engines and Wikipedia, including a review of the relevant literature, see the subpage Research:Understanding Search Engine To Wikipedia/Literature Review.

A key aspect of the relationship between Google and Wikipedia that is currently unexplored in the literature is what proportion of Google search queries are either answered by Wikipedia or lead to a search user visiting Wikipedia. In other words: what fraction of search interest expressed on Google is met by Wikipedia? Here, we say that interest is met by Wikipedia if the user clicks on a link to Wikipedia.

Our key methodological innovation is the use of public data sources to estimate the proportion of Google searches that lead to Wikipedia visits. While it is straightforward to estimate the fraction of Wikipedia traffic that comes from Google using Clickstream data or webrequest logs, it is harder to estimate what fraction of Google searches end up on Wikipedia. We therefore propose a model that combines public Google Trends and Clickstream data to estimate this proportion. We detail our approach, including the data, the model, and its assumptions, in the methodology section below.
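
To make the estimation idea concrete, the following is one way to write down the model; the notation is our own sketch (the page fixes no symbols), and it assumes that Google Trends reflects true search volume up to an unknown normalization. For article p in month t, let c_{p,t} be the Clickstream count of visits to p referred by an external search engine, and let g_{p,t} be the Google Trends interest for the matching query. If the true search volume is V_{p,t} = \kappa \, g_{p,t} for an unknown scaling constant \kappa, the Wikipedia rate is

\[
  \beta_{p,t} \;=\; \frac{c_{p,t}}{V_{p,t}} \;=\; \frac{c_{p,t}}{\kappa \, g_{p,t}},
\]

so the observable ratio c_{p,t}/g_{p,t} identifies \beta_{p,t} only up to \kappa. Raw Trends series are normalized per query, so \kappa is in principle query-specific; G-TAB's calibration (see Data/Methods below) puts all queries on one common scale, leaving a single global constant. Relative comparisons across pages, topics, regions, and time (RQ1, RQ2) are then meaningful, while absolute rates additionally require an external estimate of that constant.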

Research Questions

We propose a primary research question to study the proportion of Wikipedia visits resulting from Google searches, along with follow-up research questions (contingent on the availability of the right data). We describe the data and the model (with which we calculate our ‘beta’ values, i.e., the Wikipedia rates) in the sections following the RQs.

Our primary objective is to understand the relationships between page, concept, geography, time and Wikipedia rates.

RQ1: What fraction of searches on Google leads to a user visiting Wikipedia (the beta values)?

  • RQ1.1 How do beta values differ by page?
  • RQ1.2 How do beta values differ by topic/concept?
  • RQ1.3 How do beta values differ geographically?
  • RQ1.4 How do beta values differ by platform?


RQ2: What are the temporal properties of proportions of Wikipedia visits?

  • RQ2.1 How do Wikipedia rates change over time?
  • RQ2.2 How do current events (political events, sporting events) affect proportions of Wikipedia visits?

Data/Methods

We can perform these estimations using public data alone, but access to geo-logs would allow us to obtain higher-quality estimates and would not force us to focus on language editions with a 1:1 coupling of geography and language. Geodata in the form of webrequest logs would also let us determine whether a web search came from Google or another search engine, a distinction the public clickstream data does not make.

We also note that an advantage of the Clickstream data is that it has been available for the past 5 years; this would allow us to perform historical analyses of long-term trends, while webrequest logs are kept for only 3 months and then purged. Webrequest logs would nevertheless serve as a useful way to validate our analysis based on public data.
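
As an illustration of the referer-based distinction that webrequest logs would enable, below is a minimal sketch in Python. The host patterns and the function itself are illustrative assumptions, not the production webrequest schema.

  from urllib.parse import urlparse

  # Minimal sketch: bucket a referer URL by search engine. The pattern list
  # is an illustrative assumption, not the production webrequest schema.
  SEARCH_ENGINE_HOSTS = {
      "google": "google.",          # google.com, google.it, google.co.jp, ...
      "bing": "bing.com",
      "duckduckgo": "duckduckgo.com",
      "yandex": "yandex.",
  }

  def classify_referer(referer_url):
      """Return the search engine behind a referer URL, or None."""
      host = urlparse(referer_url).netloc.lower()
      for engine, pattern in SEARCH_ENGINE_HOSTS.items():
          if host.startswith(pattern) or "." + pattern in host:
              return engine
      return None

  assert classify_referer("https://www.google.it/search?q=roma") == "google"
  assert classify_referer("https://en.wikipedia.org/wiki/Rome") is None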

Clickstream

  • https://dumps.wikimedia.org/other/clickstream/
  • Since 01-2018, monthly data has been available for these 10 languages: de, en, es, fr, it, ja, pl, pt, ru, zh; since 03-2018, fa has additionally been available
  • Out of these, we will work with the 4 languages that have a tight (close to 1:1) coupling with a specific country: it, ja, pl, fa. We note once again that access to geography-based webrequest logs would alleviate this restriction and allow us to conduct our analysis across languages and regions.
  • For each language, conduct a separate analysis.
  • Selection of articles to study (a code sketch follows this list):
    • Consider the top 100k articles in terms of the frequency (totalled over the full time span since 01-2018) with which they were accessed from a search engine in the Clickstream data.
    • From each decile of 10k articles, randomly sample 100.
    • Take the union of this sample with the top 1k articles of each month (in order to also include trending articles), for a total of a few thousand articles.
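
A minimal sketch of this sampling procedure, assuming the monthly dumps have been downloaded from the URL above (file layout clickstream-{lang}wiki-{YYYY-MM}.tsv.gz; rows are tab-separated prev, curr, type, n, and prev == "other-search" marks visits referred by an external search engine, all engines pooled):

  import gzip
  import random
  from collections import Counter

  LANG = "it"
  MONTHS = [f"2018-{m:02d}" for m in range(1, 13)]  # extend over the full span

  def search_counts(path):
      """Per-article counts of search-engine-referred visits in one dump."""
      counts = Counter()
      with gzip.open(path, "rt", encoding="utf-8") as f:
          for line in f:
              prev, curr, _type, n = line.rstrip("\n").split("\t")
              if prev == "other-search":
                  counts[curr] += int(n)
      return counts

  monthly = {m: search_counts(f"clickstream-{LANG}wiki-{m}.tsv.gz")
             for m in MONTHS}
  total = Counter()
  for counts in monthly.values():
      total.update(counts)

  # Top 100k articles by total search-referred traffic, as 10 deciles of 10k;
  # sample 100 articles uniformly at random from each decile.
  top100k = [a for a, _ in total.most_common(100_000)]
  sample = set()
  for d in range(10):
      sample.update(random.sample(top100k[d * 10_000:(d + 1) * 10_000], 100))

  # Union with each month's top 1k to also capture trending articles.
  for counts in monthly.values():
      sample.update(a for a, _ in counts.most_common(1_000))

  print(f"{len(sample)} articles selected")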

Google Trends (GT)

  • For the above article sample, get Google Trends data via G-TAB.
    • Using date=all gives us data since 2004 at monthly granularity, thus matching the granularity of the Clickstream data.
    • Some queries (e.g., “World Cup”) will be problematic, as they spike within a very short time frame, such that rounding distorts even a single time series (so it may be better to use a shorter time frame than date=all).
    • Regions: Italy, Japan, Poland, Iran. We note once again that, with access to geography-based webrequest logs, we would not need to restrict ourselves to these regions.
  • Clickstream data is indexed by Wikipedia article names, whereas GT is queried with Knowledge Graph MIDs (compatible with Freebase MIDs); a sketch of the mapping, chained with a G-TAB query, follows this list.
    • To map from an article name to an MID, we can go through Wikidata.
    • Caveat: Freebase has not been updated since 2014, so with this approach we will not be able to handle articles created after that point.
    • Open question: is there a way to map from an article name or a Wikidata QID to a Google Knowledge Graph MID (e.g., via the KG API)?
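
The sketch below chains the two steps discussed in this list: mapping a sampled article to its MID through the Wikidata Query Service (Wikidata records Freebase IDs under property P646, which is exactly where the post-2014 caveat comes from), then querying calibrated interest with G-TAB. The gtab calls (set_options, create_anchorbank, new_query) follow the package README; the example title, timeframe, and User-Agent string are placeholders.

  import gtab
  import requests

  WDQS = "https://query.wikidata.org/sparql"

  def title_to_mid(title, lang="it"):
      """Map a Wikipedia article title to a Freebase/Knowledge Graph MID via
      Wikidata (P646 = Freebase ID). Returns None when no MID exists
      (e.g., for topics created after Freebase stopped being updated)."""
      article = f"https://{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
      query = f"""
          SELECT ?mid WHERE {{
            <{article}> schema:about ?item .
            ?item wdt:P646 ?mid .
          }}"""
      r = requests.get(WDQS, params={"query": query, "format": "json"},
                       headers={"User-Agent": "search-to-wikipedia-study/0.1"})
      r.raise_for_status()
      rows = r.json()["results"]["bindings"]
      return rows[0]["mid"]["value"] if rows else None

  # Calibrated Google Trends via G-TAB. A timeframe shorter than date=all is
  # used, per the caveat about spiky queries; geo is one of the study countries.
  t = gtab.GTAB()
  t.set_options(pytrends_config={"geo": "IT",
                                 "timeframe": "2018-01-01 2023-01-01"})
  t.create_anchorbank()  # one-off per (geo, timeframe); reusable afterwards

  mid = title_to_mid("Roma", lang="it")  # hypothetical example article
  if mid is not None:
      interest = t.new_query(mid)  # calibrated interest-over-time DataFrame
      print(interest.head())

Joining the resulting calibrated series with the same article's monthly other-search counts from the Clickstream then gives the scaled ratio from the model sketch in the introduction.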

Timeline

  • Feb - Gather data and perform early data analysis.
  • Mar - Further data analysis and characterisation of the causal model.
  • Apr - Update the meta page with early results and next steps to take.

Policy, Ethics and Human Subjects Research

t.b.a

Results

t.b.a

Resources

t.b.a

Subpages of this page

For easier navigation, this list contains all subpages of this project page.

Pages with the prefix 'Understanding Search Engine To Wikipedia/' in the 'Research' and 'Research talk' namespaces:

  • Research:Understanding Search Engine To Wikipedia/Literature Review

References

  1. Piccardi, Tiziano; Gerlach, Martin; Arora, Akhil; West, Robert (2023-01-13). "A Large-Scale Characterization of How Readers Browse Wikipedia". ACM Transactions on the Web. ISSN 1559-1131. doi:10.1145/3580318. 
  2. "Searching for Wikipedia". 
  3. McMahon, Connor; Johnson, Isaac; Hecht, Brent (2017-05-03). "The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies". Proceedings of the International AAAI Conference on Web and Social Media 11 (1): 142–151. ISSN 2334-0770. doi:10.1609/icwsm.v11i1.14883. 
  4. Vincent, Nicholas; Johnson, Isaac; Sheehan, Patrick; Hecht, Brent (2019-07-06). "Measuring the Importance of User-Generated Content to Search Engines". Proceedings of the International AAAI Conference on Web and Social Media 13: 505–516. ISSN 2334-0770. doi:10.1609/icwsm.v13i01.3248.