Research:Characterizing Readers Navigation/Characterizing the structure of navigation pathways

Despite the importance and pervasiveness of Wikipedia as one of the largest platforms for open knowledge, surprisingly little is known about how people navigate its content when seeking information. The aim of this study is to systematically characterize how readers browse Wikipedia. Using the requests recorded in the server logs, we measure how readers reach articles, how they transition between pages, and how these patterns combine into more complex navigation paths.


We wrote up the findings of our project in these papers:

  • A Large-Scale Characterization of How Readers Browse Wikipedia [1] (pdf)
  • Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions [2] (pdf)

Data


To study how readers navigate Wikipedia, we analyze the server logs collected for four weeks between the 1st and 28th of March 2021 for English Wikipedia. We perform the following steps for filtering and processing the data:

  • We limit our analysis to requests targeting articles in the main namespace.
  • We only keep page loads of anonymous users (not logged in) who did not edit any article and are not identified as bots.
  • We drop the Main Page because it does not represent any specific content.
  • We group all page loads by user identifier (following previous work, a salted hash of IP and user-agent) and sort them by timestamp to obtain the sequence of pages visited by a single user.
  • To remove automated traffic not identified as bots, we drop all users with more than 2800 page loads over the four weeks, i.e., more than 100 per day on average.

This leaves us with a final dataset of 6.52B page loads associated with 1.47B unique user identifiers.
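The filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `PageLoad` record, its field names, the `editor_ids` set, and the `"Main_Page"` title check are all assumptions made for the example, and the order in which the filters are applied is a guess.

```python
from collections import defaultdict
from dataclasses import dataclass

# 28 days x 100 page loads per day, the threshold stated in the text.
MAX_LOADS_PER_USER = 2800


@dataclass(frozen=True)
class PageLoad:
    """Hypothetical log record; the field names are illustrative only."""
    user_id: str          # e.g. a salted hash of IP and user-agent
    timestamp: float      # seconds since some reference time
    title: str
    namespace: int        # 0 is the main (article) namespace
    is_bot: bool = False
    is_logged_in: bool = False


def filter_and_group(loads, editor_ids=frozenset(), max_loads=MAX_LOADS_PER_USER):
    """Apply the filters and return {user_id: page loads sorted by time}."""
    per_user = defaultdict(list)
    for load in loads:
        if load.namespace != 0:
            continue  # keep only articles in the main namespace
        if load.is_logged_in or load.is_bot or load.user_id in editor_ids:
            continue  # keep only anonymous, non-editing, non-bot readers
        if load.title == "Main_Page":
            continue  # the Main Page carries no specific content
        per_user[load.user_id].append(load)

    result = {}
    for user_id, items in per_user.items():
        if len(items) > max_loads:
            continue  # likely automated traffic not flagged as a bot
        result[user_id] = sorted(items, key=lambda item: item.timestamp)
    return result
```

Here the volume threshold is applied after the other filters, which is one possible reading of the description.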

How do readers reach a page


There are many ways in which readers can reach an article. We use the referrer to identify the origin of each request, classify page loads according to their source, and quantify the occurrence of each type.

  • Search: The most common way to reach the content of Wikipedia is through a search engine. On desktop, 45.9% of all page loads come from search; on mobile, 48.7%. Previous work describes the special relationship between Wikipedia and search engines, showing that Wikipedia articles appear on the first page of results for 67%-84% of common and trendy queries [3]. This highlights the significant value Wikipedia offers in fulfilling the information needs of search engine users, which is reflected in the large volume of incoming traffic from these platforms.
  • Internal: 35.7% of all page loads come from clicks on internal links used to navigate from article to article. Curiously, 6.58% are transitions that are not possible through the link network; such transitions can happen through the search function or through links generated by templates. These results are in line with previous observations based on the clickstream [4].
  • Unknown: For around 13% of page loads, it is impossible to determine how the user reached Wikipedia. There are many possible reasons: common causes are direct access through the browser history, apps, a bookmark, or a search toolbar, or a referrer intentionally removed by the source website through rel="noreferrer" in the a-tag.
  • External websites: Around 1% of page loads come from external websites. The most common sources are Facebook, Reddit, YouTube, Twitter, and CloutHub. Compare a previous pilot to understand traffic from social media: Research:Social_media_traffic_report_pilot.

Other sources are Wikipedia pages in other languages (so-called language switching), special pages (e.g., Wikipedia search), and apps.

Origin              Desktop   Mobile    Both
Search Engine       45.97%    48.77%    45.72%
Internal            35.64%    35.75%    35.72%
Unspecified         12.64%    13.03%    12.88%
External websites    1.36%     0.70%     0.95%
Home page            1.65%     0.70%     1.06%
Categories           0.59%     0.25%     0.39%
Wikipedia search     0.38%     0.22%     0.29%
Portals              0.03%     0.01%     0.02%
Wikipedia Others     0.07%     0.01%     0.03%
Other                0%        0.01%     0.01%
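A referrer-based classification along these lines could look roughly like the sketch below. The rules are assumptions for illustration, not the rules used in the study: the list of search engine names is a small hypothetical sample, the category labels are made up for the example, and treating every non-English Wikipedia host as "wikipedia others" is a simplification.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately short list of search engine names.
SEARCH_ENGINES = {"google", "bing", "duckduckgo", "yahoo", "yandex", "baidu"}


def classify_referrer(referrer, language="en"):
    """Map a referrer URL to a coarse origin category (illustrative rules)."""
    if not referrer:
        return "unknown"  # no referrer: bookmark, history, noreferrer, ...
    parsed = urlparse(referrer)
    host = (parsed.hostname or "").lower()
    if not host:
        return "unknown"

    if host.endswith(".wikipedia.org"):
        # en.wikipedia.org and en.m.wikipedia.org both start with "en".
        if host.split(".")[0] != language:
            return "wikipedia others"  # e.g. language switching
        if "search=" in parsed.query:
            return "wikipedia search"
        title = parsed.path[len("/wiki/"):] if parsed.path.startswith("/wiki/") else ""
        if title == "Main_Page":
            return "home page"
        if title.startswith("Special:Search"):
            return "wikipedia search"
        if title.startswith("Category:"):
            return "categories"
        if title.startswith("Portal:"):
            return "portals"
        return "internal"

    if SEARCH_ENGINES.intersection(host.split(".")):
        return "search"
    return "external"
```

For example, `classify_referrer("https://www.reddit.com/r/wikipedia")` returns `"external"`, while an empty referrer falls into `"unknown"`.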

How do readers transition between pages


The hyperlinks in a Wikipedia article enable the reader to easily navigate from one article to the next.

  • Surprisingly, we find that many readers use external links to navigate between two consecutively read articles. That means that they visit at least one external page in-between reading two Wikipedia articles. In fact, in roughly 1 out of 3 cases, readers use an external search engine to transition between pages.
  • This is reflected in the time between two consecutive page loads. For internal clicks, the time between page loads is much shorter (the peak is below 1 minute) than for external clicks.
  • Interestingly, in many of these cases the second article would also have been reachable through an actual internal link in the first article.
 
Distribution of interevent times between two consecutive page loads. The main peak is at very short times (less than 1 minute), but 22% of interevent times are larger than 1 hour.
 
Fraction of transition events between two page loads in which the second page load has an external referrer. For interevent times larger than ~4 minutes, transitions through an external website are more common than internal transitions with existing links.
 
Fraction of external transitions (external search) in which the link exists in the source page.
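The quantities in the figures above can be computed from a reader's sorted page loads roughly as follows. This is a minimal sketch with made-up helper names; it assumes timestamps in seconds and a boolean flag per transition that marks whether the second page load had an external referrer.

```python
def interevent_times(timestamps):
    """Time gaps between consecutive page loads of one reader (sorted input)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def fraction_longer_than(gaps, threshold_seconds):
    """Share of gaps strictly longer than the threshold."""
    if not gaps:
        return 0.0
    return sum(gap > threshold_seconds for gap in gaps) / len(gaps)


def external_share(transitions, min_gap_seconds):
    """Share of external transitions among those with a gap above min_gap_seconds.

    `transitions` is a list of (gap_seconds, is_external) pairs.
    """
    selected = [is_external for gap, is_external in transitions if gap > min_gap_seconds]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)
```

With timestamps `[0, 30, 90, 4000, 9000]`, the gaps are `[30, 60, 3910, 5000]`, so half of them exceed one hour (3600 seconds).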

As shown above, there is considerable complexity and diversity in how readers reach pages and transition between them.

 
Diagram showing the different ways to aggregate the user sessions by using the Wikipedia server logs.

We try to capture the navigation sequences in two slightly different ways:

  • Navigation sessions: This approach describes the navigation of readers by considering the pages loaded consecutively without leaving Wikipedia, i.e., how readers move by following the links available on the visited page. Because it only considers transitions through internal links, it is the more conservative way to construct sessions.
  • Reading sessions: This approach considers the sequence of all page loads by the same reader ordered by time, irrespective of how the reader reached those pages (we saw previously that a substantial fraction of page transitions occur through external links). We split a sequence if two consecutive page loads are separated by more than 1 hour.
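The two definitions can be contrasted with a small sketch. The `Load` record and its `referrer_title` field are hypothetical, and the navigation-session rule used here (a page load continues the session only if its internal referrer is the previously loaded page) is a simplification assumed for the example; the study may construct navigation sessions differently.

```python
from typing import NamedTuple, Optional

ONE_HOUR = 3600  # seconds; the split threshold for reading sessions


class Load(NamedTuple):
    """Hypothetical page load of a single reader."""
    timestamp: float
    title: str
    referrer_title: Optional[str]  # title of the internal referrer, else None


def reading_sessions(loads, max_gap=ONE_HOUR):
    """Split time-ordered page loads wherever the gap exceeds max_gap."""
    sessions = []
    for load in sorted(loads, key=lambda item: item.timestamp):
        if sessions and load.timestamp - sessions[-1][-1].timestamp <= max_gap:
            sessions[-1].append(load)
        else:
            sessions.append([load])
    return sessions


def navigation_sessions(loads):
    """Chain page loads only through internal clicks from the previous page."""
    sessions = []
    for load in sorted(loads, key=lambda item: item.timestamp):
        follows_link = (
            bool(sessions)
            and load.referrer_title is not None
            and load.referrer_title == sessions[-1][-1].title
        )
        if follows_link:
            sessions[-1].append(load)
        else:
            sessions.append([load])
    return sessions
```

A reader who opens A, clicks through to B, arrives at C from a search engine a minute later, and clicks from C to D more than an hour afterwards yields reading sessions [A, B, C] and [D], but navigation sessions [A, B] and [C, D].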

For both approaches, we find that:

  • Most sessions are very short (in fact, around 70-80% of sessions consist only of a single page view)
  • However, in absolute terms there are still millions of sessions with 10 or more page views by the same reader
  • There is a strong variation in the length of the session depending on the device (desktop readers view more pages on average) or the topic of the article (readers interested in “Films” view more pages).
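Given sessions built by either approach, the length statistics quoted above can be derived as in this short sketch. The function names are illustrative, and sessions are assumed to be plain lists of page loads.

```python
from collections import Counter


def length_distribution(sessions):
    """Return {session length: share of sessions}, ordered by length."""
    counts = Counter(len(session) for session in sessions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: count / total for length, count in sorted(counts.items())}


def share_at_least(distribution, min_length):
    """Share of sessions with at least min_length page loads."""
    return sum(share for length, share in distribution.items() if length >= min_length)
```

The share of single-page sessions is then `length_distribution(sessions).get(1, 0.0)`, and the long tail is `share_at_least(distribution, 10)`.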
 
Distribution of the number of page loads per session (linear scale)
 
Distribution of the number of page loads per session (log scale)
 
References

  1. Piccardi, Tiziano; Gerlach, Martin; Arora, Akhil; West, Robert (2023-01-13). "A Large-Scale Characterization of How Readers Browse Wikipedia". ACM Transactions on the Web. ISSN 1559-1131. doi:10.1145/3580318. 
  2. Piccardi, Tiziano; Gerlach, Martin; West, Robert (2022-08-16). "Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions". Companion Proceedings of the Web Conference 2022. WWW '22 (New York, NY, USA: Association for Computing Machinery): 1324–1330. ISBN 978-1-4503-9130-6. doi:10.1145/3487553.3524930. 
  3. Vincent, N.; Hecht, B. (2021). "A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results". Proc. ACM Hum.-Comput. Interact. 5 (CSCW1): 1–15. doi:10.1145/3449078.
  4. Mitrevski, B.; Piccardi, T.; West, R. (2020). "WikiHist.html: English Wikipedia’s Full Revision History in HTML Format". Proceedings of the International AAAI Conference on Web and Social Media. 14: 878–884. https://ojs.aaai.org/index.php/ICWSM/article/view/7353