Research:New user reading patterns
The goal of this work is to understand the factors involved in user-account creation on Wikipedia.
The main findings of this analysis are:
- Using webrequest logs, we build a method that identifies and characterizes the reading patterns of approx. 80% of the users who register a new account, across wikis in different languages.
- New users visit [en:Special:RecentChanges] much more often than 'normal' users who do not register. The fraction of new users who do so varies from 1% (enwiki) to 8% (kowiki, desktop access).
- New users preferentially register via desktop-access: ~70% (in comparison, non-registration reading sessions have only ~30% desktop-access).
- Relative to their share of overall traffic, users from Asia and Africa register for an account at a much higher rate.
- New users start their reading session (before the registration) much more often in a namespace other than the article namespace (0), in particular 4 (Wikipedia) and 12 (Help).
- Reading sessions of new users tend to be longer (even when considering the part before registration) than for non-registration users.
While it is possible to edit Wikipedia without an account, many editors choose to use a dedicated account for their contributions. Therefore, understanding the motivations and experiences of newly registered users provides important insights into the processes underlying the transformation from reader to contributor in Wikipedia. This in turn could help identify ways to improve the experience of new users in Wikipedia projects.
Much work has been done in the "Understanding first day" project (EditorJourney) to characterize the workflows of new editors:
Most new users who create accounts do not ever make edits -- but those who do make edits usually make them on their first day of having an account. We have little knowledge of what new account holders do on that first day after they create their accounts -- whether they read help content, attempt edits that they do not publish, or something else.
Similarly, large efforts have been devoted to Characterizing reader behaviour in order to understand the information needs of readers.
Here, we try to characterize the context in which readers choose to become (potential) contributors.
For the data-query from webrequest logs we set the following parameters:
- 1 week (2019-09-09 -- 2019-09-16)
- English Wikipedia ('normalized_host.project_family'=='wikipedia', 'normalized_host.project'== 'en')
- mobile and desktop version ('access_method'=='mobile web'/'desktop')
- Sessions are cut into subsessions with threshold of 60 minutes (see previous work)
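The subsession step can be illustrated with a minimal sketch. It assumes that a session is available as a plain list of request timestamps and that a new subsession starts whenever the gap to the previous request exceeds the 60-minute threshold; the function name `split_into_subsessions` and the example timestamps are hypothetical and not taken from the actual pipeline.

```python
from datetime import datetime, timedelta


def split_into_subsessions(timestamps, threshold=timedelta(minutes=60)):
    """Split request timestamps into subsessions.

    A new subsession starts whenever the gap to the previous request
    exceeds `threshold` (assumption: a gap of exactly `threshold`
    still belongs to the same subsession).
    """
    subsessions = []
    for ts in sorted(timestamps):
        if subsessions and ts - subsessions[-1][-1] <= threshold:
            subsessions[-1].append(ts)
        else:
            subsessions.append([ts])
    return subsessions


# Illustrative timestamps only (not real data).
requests = [
    datetime(2019, 9, 9, 10, 0),
    datetime(2019, 9, 9, 10, 20),
    datetime(2019, 9, 9, 12, 0),   # 100 minutes after the previous request
    datetime(2019, 9, 9, 12, 30),
]
subsessions = split_into_subsessions(requests)
print([len(s) for s in subsessions])  # prints [2, 2]
```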
Identifying new users
In order to identify new users, we check the following criteria:
- visit to the Create Account page (via 'uri_query')
- in order to assess whether an account was created we check whether the logged_in-status changed from 0 to 1 after the visit to the account creation (via x-analytics)
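The two criteria above can be sketched as follows. This is an illustration under stated assumptions, not the query used in the analysis: each request is represented as a dict with the hypothetical keys `uri_query` and `logged_in` (the latter standing in for the logged-in flag read from x-analytics), the session is time-ordered, and the Create Account page is recognized by the string `Special:CreateAccount` in the query.

```python
def find_account_creation(session):
    """Return the index of the first Create Account request that is made
    while logged out and is followed by a logged-in request, else None.

    `session` is a time-ordered list of dicts with the hypothetical
    keys 'uri_query' and 'logged_in' (0 or 1).
    """
    for i, request in enumerate(session):
        if "Special:CreateAccount" not in request["uri_query"]:
            continue
        if request["logged_in"] != 0:
            continue
        # The logged-in status must switch to 1 after the visit.
        if any(later["logged_in"] == 1 for later in session[i + 1:]):
            return i
    return None


# Illustrative sessions only (not real data).
registered = [
    {"uri_query": "", "logged_in": 0},
    {"uri_query": "?title=Special:CreateAccount", "logged_in": 0},
    {"uri_query": "", "logged_in": 1},
]
visited_only = registered[:2]  # visits the page but never becomes logged in

print(find_account_creation(registered))    # prints 1
print(find_account_creation(visited_only))  # prints None
```

Returning the index rather than a boolean makes it possible to split the session into the parts before and after registration.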
Identifying non-new users
In order to put numbers obtained from the analysis of newly registered users in context, we would like to compare with reading sessions that do not lead to the creation of a new account. Naturally, this number is much larger. Therefore, we obtain a smaller subsample with an approximately similar number of observations from the same time period. Specifically, we select the subsample of sessions in which the user was not logged-in during the entire session.
Due to the size of the data, we query a given number of samples per day from the webrequest log in order to avoid the following error in the join:
Py4JJavaError: An error occurred while calling o146.collectToPython.
- java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
As for new users, we then use activity thresholds to cut sessions into subsessions.
The methodology described above identifies 24,909 sessions during which an account was created.
How reliable is our approach to identify account-creation events from webrequest logs?
We compare the number of new users with data from Mediawiki_user_history (created_by_self==True) based on hourly windows over the 1-week period.
We systematically underestimate the number of new users: we capture roughly 80% of the registrations recorded in Mediawiki_user_history. Nevertheless, this ratio is surprisingly stable (the variations in the number of registrations across time are much larger), indicating that we can identify at least a fraction of new users via webrequests. Overall, this can be seen as indirect evidence for the robustness of our approach.
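The hourly comparison can be sketched as a per-hour ratio between the two sources. The function name `capture_ratios`, the hour labels, and all counts below are made up for illustration; they only mimic the qualitative picture of a stable ratio despite strongly varying absolute counts.

```python
def capture_ratios(webrequest_counts, reference_counts):
    """Per-hour ratio of registrations found via webrequest logs to
    registrations in the reference table (hours with no reference
    registrations are skipped)."""
    return {
        hour: webrequest_counts.get(hour, 0) / reference_counts[hour]
        for hour in reference_counts
        if reference_counts[hour] > 0
    }


# Made-up hourly counts, for illustration only.
webrequest_counts = {"hour_0": 80, "hour_1": 120, "hour_2": 160}
reference_counts = {"hour_0": 100, "hour_1": 150, "hour_2": 200}

ratios = capture_ratios(webrequest_counts, reference_counts)

# Compare how much the absolute counts vary with how much the ratio varies.
count_spread = max(reference_counts.values()) / min(reference_counts.values())
ratio_spread = max(ratios.values()) / min(ratios.values())
print(ratios, count_spread, ratio_spread)
```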
Entry via 'Special:Recent_changes'
One hypothesis emerging from discussions with, e.g., the Growth team is that users are motivated to register after visiting Recent changes. We therefore assess the fraction of users that visited this page (for new users, only before the registration).
We find that approximately 1% of new users visited the Recent-changes page. This number is very similar for desktop and mobile (no errorbars shown here). While this number is not very large in absolute terms, it is at least an order of magnitude higher than for users that do not register (in fact, we count exactly 1 such occurrence in the latter situation).
Desktop vs mobile
In view of how to improve the user experience, it is interesting to see whether new users use the desktop or mobile version.
We find that the number of new users accessing via desktop is about 50% larger than the number accessing via mobile. Surprisingly, for non-registered users this is essentially reversed, i.e. the fraction of users accessing from mobile is much larger.
In order to assess possible gaps in accessibility, it is interesting to quantify geographic location (on the level of continent) of new users.
North America and Asia have the largest numbers of new users. While their absolute numbers of new users are similar, the interpretation changes when comparing with the absolute numbers of all users: the fraction of 'normal' traffic is much higher from North America than from Asia. This means that the fraction of users who create a new account (relative to overall traffic) is much larger for users from Asia and much smaller for users from North America. Africa resembles Asia in this respect, and Europe resembles North America.
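The comparison between registrations and overall traffic can be made explicit as a ratio of shares: a region's share of new users divided by its share of traffic. The sketch below assumes non-empty inputs; the function name, the placeholder region names, and the counts are hypothetical and do not correspond to measured values.

```python
def relative_registration_rate(new_user_counts, traffic_counts):
    """Share of new users divided by share of overall traffic, per region.

    A value above 1 means the region registers more often than its
    traffic alone would suggest; a value below 1 means less often.
    """
    total_new = sum(new_user_counts.values())
    total_traffic = sum(traffic_counts.values())
    return {
        region: (new_user_counts.get(region, 0) / total_new)
        / (traffic_counts[region] / total_traffic)
        for region in traffic_counts
        if traffic_counts[region] > 0
    }


# Placeholder regions and made-up counts: equal numbers of new users,
# but very different amounts of overall traffic.
new_user_counts = {"region_a": 50, "region_b": 50}
traffic_counts = {"region_a": 25, "region_b": 75}

rates = relative_registration_rate(new_user_counts, traffic_counts)
print(rates)  # region_a is above 1, region_b is below 1
```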
Length of session
The length of reading sessions differs between users who create an account and users who do not. Specifically, the session length, defined as the number of different pages viewed (is_pageview==1), has a much longer tail for new users, i.e. their sessions tend to be longer. Interestingly, when only considering the part of the session before the account was created, the difference becomes much smaller.
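A minimal sketch of this length measure, computed for the whole session and for the part before registration. It assumes each request is a dict with the hypothetical key `page_title` alongside `is_pageview`, and that the position of the registration within the session is already known; the example session is illustrative only.

```python
def session_length(session, until=None):
    """Number of different pages among the page views in `session`.

    If `until` is given, only requests before that index are counted,
    e.g. the part of the session before the account was created.
    """
    requests = session if until is None else session[:until]
    return len({r["page_title"] for r in requests if r["is_pageview"] == 1})


# Illustrative session (not real data); registration assumed at index 2.
session = [
    {"page_title": "Page_A", "is_pageview": 1},
    {"page_title": "Page_B", "is_pageview": 1},
    {"page_title": "Page_A", "is_pageview": 1},  # repeat visit, not counted twice
    {"page_title": "Page_C", "is_pageview": 1},
    {"page_title": "Page_D", "is_pageview": 0},  # not a page view
]
registration_index = 2

print(session_length(session))                            # prints 3
print(session_length(session, until=registration_index))  # prints 2
```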
In order to understand the information need, we considered the namespace of the first page-view in the reading session (note that we assign namespace=-2 if we cannot identify a namespace, for example if the reading session started with the account-creation).
While namespace 0 (main/article) covers the majority of the cases, its share is slightly smaller for users that create an account. On a log scale, the differences in the other namespaces, which contribute much less, become visible. In particular, new users start much more often in the following namespaces: -1 (Special), 4 (Wikipedia), 12 (Help), and 14 (Category). Namespaces 1 (article talk), 2 (user), and 3 (user talk) are also worth mentioning. Note that these differences are not due to different usage of mobile vs desktop access (not shown here).
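The starting-namespace distribution can be sketched as below. It assumes each request is a dict with the hypothetical key `namespace_id` alongside `is_pageview`, and that -2 is assigned both when a session contains no page view and when the first page view has no identifiable namespace; the exact fallback rule of the original analysis may differ.

```python
from collections import Counter

UNKNOWN_NAMESPACE = -2  # label used when no namespace can be identified


def first_namespace(session):
    """Namespace of the first page view in the session, or -2 if unknown."""
    for request in session:
        if request["is_pageview"] == 1:
            namespace = request.get("namespace_id")
            return UNKNOWN_NAMESPACE if namespace is None else namespace
    return UNKNOWN_NAMESPACE


def namespace_shares(sessions):
    """Fraction of sessions starting in each namespace."""
    counts = Counter(first_namespace(s) for s in sessions)
    total = sum(counts.values())
    return {ns: n / total for ns, n in counts.items()}


# Illustrative sessions only (not real data).
sessions = [
    [{"is_pageview": 1, "namespace_id": 0}],
    [{"is_pageview": 1, "namespace_id": 12}],
    [{"is_pageview": 0, "namespace_id": None}],  # no page view at all
    [{"is_pageview": 1, "namespace_id": 0}],
]
shares = namespace_shares(sessions)
print(shares)  # prints {0: 0.5, 12: 0.25, -2: 0.25}
```

Computing the shares separately for registering and non-registering sessions, and comparing them on a log scale, reveals the small namespaces that are hidden behind the dominant namespace 0.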
In order to understand the information need of users who create an account, we would like to get an idea of the topics of the visited pages. Assigning topics to articles in Wikipedia is a non-trivial problem. Here, we use a topic model whose advantage is that it can be applied to articles in any language. More specifically, it is motivated by the categories defined in the ORES Draft topic model but, instead of the text of the English-version article, it uses the statements in the corresponding Wikidata item (see work-log), taking advantage of the mapping between Wikipedia articles in different languages and Wikidata items.
Here, we look at the topic of the first visited page in reading sessions of new users. Comparing users that register with users that do not, we find the following differences:
- Culture.Language_and_literature is much more common for users that register
- Person is much higher for users that do not register
- STEM.Technology is slightly higher for users that register
We focus on namespace 0 (Main/Article); for the other namespaces, i) we cannot assign topics with high confidence for a large fraction of pages (we label this case as '-2'), and ii) the topic distribution is highly skewed towards 'Culture.Language_and_literature', which seems to be an artifact of the algorithm. See an example for the namespace 12 (Help).
With few exceptions we find very similar patterns when looking at wikis with different languages (and thus different sizes). For this we aim to run the analysis for the following languages:
- German (Create Account page - de)
- French (Create Account page - fr)
- Arabic (Create Account page - ar)
- Czech (Create Account page - cs)
- Korean (Create Account page - ko)
Since the number of events for most of these wikis is much smaller, we consider data from a longer time-window (2019-09-01 -- 2019-09-30).
While the absolute number of new user registrations varies, we consistently capture approx. 80% of the new user registrations when comparing webrequest logs with the number of registration events in mediawiki-history. The only exception is kowiki, for which this number drops to approx. 60%.
Interestingly, enwiki is an exception here: i) the fraction of users who visit the 'Recent-changes' page is lowest (around 1%) and ii) there is little difference between desktop and mobile. In other language wikis, the fraction of recent-changes visits is substantially larger (in kowiki up to 8x larger). In addition, desktop and mobile access differ strongly, with a much larger fraction for desktop access.
Mobile vs Desktop
The access patterns of new users and of users who do not register are very similar across all language wikis. For new users, the majority (~70%) accesses via desktop. In contrast, for users that do not register, the majority comes from mobile (desktop drops to ~30%). Interestingly, for arwiki the share of desktop access is lower in all cases compared to the other wikis.
Observations regarding the access from different continents:
- enwiki: number of new registrations much larger for Asia and Africa when compared to traffic from non-registered users (see above for more details)
- dewiki: almost all access from Europe
- frwiki: some access from Africa and North America. Similarly to enwiki, the rate of new user registration from Africa is higher than what one would expect from the amount of traffic of non-registered users.
- arwiki: most access from Asia and Africa
- cswiki: most access from Europe
- kowiki: most access from Asia
Length of sessions
The lengths of reading sessions are very similar across languages.
Across wikis, new users visit the article-namespace (0) slightly less often as a starting namespace (even though it still constitutes the majority of cases). Instead, new users have a higher chance to start the session in a non-article namespace. The most common cases are:
- 4 (Wikipedia)
- 12 (Help)