Research:Characterizing Readers Navigation/Temporal Rhythms
Work in progress...
Here we investigate regularities in daily consumption patterns through a large-scale analysis of billions of timezone-corrected page requests mined from English Wikipedia's server logs. The goal is to understand how context and time affect the kind of information consumed on the platform. First, we show that even after removing the average pattern generated by the day-night alternation, the consumption of individual articles maintains strong diurnal regularities. Then we characterize the prototypical shapes of consumption patterns, finding a strong distinction between articles read predominantly during the evening or night and articles read predominantly during working hours.
You can find more details in the paper "Curious Rhythms: Temporal Regularities of Wikipedia Consumption", which we published on arXiv: https://arxiv.org/pdf/2305.09497
Data
Our study relies on the access logs collected over four weeks, from March 1 to March 28, 2021, on the servers of the English edition of Wikipedia. These server logs, stored for a limited time, describe the requests the server receives when readers access the website. They offer metadata such as which article was loaded and the approximate geolocation. To prepare the data for this study, we preprocessed it as follows. First, we select only the requests for articles (namespace 0) that originate from external websites.
Then, we anonymize the logs by removing sensitive information (IP, user-agent, geo-coordinates) and refine them by removing sequential loads of the same page from the same client, as described in previous work. Finally, since we are interested in describing temporal patterns, we align all the requests by converting the timestamps into local time using the timezone information.
After these steps, we retain 3.45B pageloads associated with 6.3M articles. We denote by $n_a(h)$ the number of pageloads of article $a$ during hour $h$ of the day (averaged over the 28 days), i.e., each article is represented as a time series with $h = 0, \dots, 23$.
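The alignment and hourly aggregation described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `hourly_profile` and the input layout (pairs of a timezone-aware UTC timestamp and the reader's timezone) are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Iterable


def hourly_profile(
    requests: Iterable[tuple[datetime, tzinfo]], n_days: int
) -> list[float]:
    """Average pageloads per local hour of the day for one article.

    `requests` yields (timezone-aware UTC timestamp, reader timezone) pairs.
    This input layout is a hypothetical one chosen for illustration.
    """
    counts: Counter = Counter()
    for ts_utc, tz in requests:
        local = ts_utc.astimezone(tz)  # convert to the reader's local time
        counts[local.hour] += 1        # bin by local hour of the day
    # Average over the observation window (28 days in the study).
    return [counts[h] / n_days for h in range(24)]
```

Any `tzinfo` object works as the second element of a pair. A named zone such as `zoneinfo.ZoneInfo("Europe/Rome")` would also account for daylight saving changes that fall inside the observation window.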
Methodology
To understand how Wikipedia is used during the day and how it fulfills different information needs, we investigate how individual articles are accessed over time. We define the normalized and timezone-corrected consumption pattern of all of Wikipedia as the baseline rhythm; it describes the expected variation in access volume over time. We aim to obtain a temporal signature for each article by removing the baseline rhythm and focusing on how the article's consumption pattern diverges from this global average.
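We calculate the divergence as the per-hour ratio between each article's time series, additionally normalized so that it sums to 1, and the baseline rhythm. More formally, writing $\hat{n}_a(h)$ for the normalized time series of article $a$ and $\hat{n}(h)$ for the baseline rhythm, the divergence of article $a$ at hour $h$ is defined as:
$$D_a(h) = \frac{\hat{n}_a(h)}{\hat{n}(h)}$$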
The figure summarizes these steps with two articles associated with different topics (STEM and Media). It shows the shape of the curves after the normalization (left), and their divergence from the baseline rhythm for the same articles (right).
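The normalization and the per-hour ratio can be illustrated with a short sketch. The function names are hypothetical, and computing the baseline by summing the pageloads of all articles per hour and then normalizing is an assumption about how "the consumption pattern of all of Wikipedia" is aggregated. The sketch also assumes that the baseline is strictly positive at every hour.

```python
def normalize(series: list[float]) -> list[float]:
    """Rescale a 24-value hourly series so that it sums to 1."""
    total = sum(series)
    if total == 0:
        raise ValueError("series has no pageloads")
    return [v / total for v in series]


def baseline_rhythm(articles: dict[str, list[float]]) -> list[float]:
    """Normalized hourly pageloads aggregated over all articles.

    Assumption: the baseline is the normalized sum of all hourly series.
    """
    totals = [sum(s[h] for s in articles.values()) for h in range(24)]
    return normalize(totals)


def divergence(series: list[float], baseline: list[float]) -> list[float]:
    """Per-hour ratio between a normalized article series and the baseline.

    Values above 1 mean the article is read more than expected at that hour.
    """
    return [a / b for a, b in zip(normalize(series), baseline)]
```

With this definition, an article whose hourly shape matches the baseline exactly has a divergence of 1 at every hour.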
Results
To investigate the prototypical daily consumption patterns, we focus on two research questions: [RQ1] What are the typical shapes of Wikipedia consumption rhythms? [RQ2] Which factors influence the rhythms of Wikipedia consumption?
RQ1: What are the typical shapes of Wikipedia consumption rhythms?
To address the first question (RQ1), we extract the principal components of the matrix describing the normalized 24-dimensional time series. We mean-center each column of the matrix and factorize it with singular value decomposition (SVD). The resulting matrices are the latent representation of the articles (U), the scaling factors (Σ), and the latent representation in the time domain (V). Inspecting V, we observe that the first principal component of the time pattern accounts for 39.1% of the variance and models the day vs. night consumption behavior.
Next, we investigate the shapes of the consumption patterns to identify articles with similar temporal access patterns. To capture the main behavior of each article, we represent its access pattern with its values on four principal components of the matrix U obtained by the factorization; the number of components is selected with the elbow method. We cluster these representations with k-means and search for the best number of clusters using the silhouette score and the sum of squared errors (SSE). Both metrics indicate that the articles cannot easily be separated into distinct, discrete groups. Manual inspection of the clusters shows that the consumption behavior of the articles is distributed on a continuum, grouped by topic.
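The factorization step can be sketched with NumPy as below. The function name `principal_components` and the random input used to exercise it are illustrative assumptions; only the mean-centering, the SVD, and the variance share follow the description above.

```python
import numpy as np


def principal_components(X: np.ndarray, k: int):
    """SVD of a column-centered (articles x 24 hours) matrix.

    Returns the article coordinates on the first k components (columns of U),
    the first k temporal components (rows of V^T), and the fraction of
    variance explained by every component.
    """
    Xc = X - X.mean(axis=0)                      # mean-center each hourly column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # variance share per component
    return U[:, :k], Vt[:k], explained
```

Plotting `explained` against the component index is one way to apply the elbow method. The rows of the first returned array could then be passed to a k-means implementation, for instance scikit-learn's `KMeans` together with `silhouette_score`, to repeat the search for the number of clusters.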
RQ2: Which factors influence the rhythms of Wikipedia consumption?
To address the second research question (RQ2), we prepare the data as follows. First, we decompose each article's time series by country (focusing on the top 20) and access method. Then, following the steps described previously, we normalize each temporal pattern and remove the baseline rhythm. Finally, we complete the data representation by exploding each time series into 24 independent samples, with an explicit feature indicating the hour of the day. Each article is thus represented by 960 data points describing the deviation from the baseline rhythm for each combination of 20 countries, two access methods, and 24 hours. Each of these samples is then complemented with the corresponding binary vector representing the article's topics as predicted by ORES. The dependent variable is log-transformed, which makes the model multiplicative and lets us interpret the coefficients as relative increases. The obtained model fits the data with an $R^2$ of 0.181.
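The data preparation above can be sketched as follows. The function name, the dictionary layout, and the field names are hypothetical. Skipping hours with a non-positive divergence, where the logarithm is undefined, is also an assumption, since the text does not say how such hours are handled.

```python
import math


def explode_samples(
    divergences: dict[tuple[str, str], list[float]], topics: list[int]
) -> list[dict]:
    """Turn per-(country, access method) divergence series into hourly samples.

    divergences: {(country, access_method): [24 divergence values]}
    topics: binary topic vector of the article (one entry per topic)
    """
    rows = []
    for (country, method), series in divergences.items():
        for hour, value in enumerate(series):
            if value <= 0:
                continue  # assumption: drop hours where log is undefined
            rows.append({
                "country": country,
                "access_method": method,
                "hour": hour,                          # explicit hour-of-day feature
                "topics": topics,                      # shared by all samples of the article
                "log_divergence": math.log(value),     # log-transformed dependent variable
            })
    return rows
```

For an article with strictly positive divergence in all 20 countries and both access methods, this yields the 20 x 2 x 24 = 960 samples mentioned above.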
Since we explicitly represented in the model the interaction terms between time and the different factors, we can inspect how they contribute to the estimation. Sorting the 24 coefficients representing the interactions allows us to characterize the temporal shape of each factor. An important property of the resulting coefficients is that, since we fit the model with all features included, they describe each factor's contribution to the prediction while controlling for all the other factors.
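Because the dependent variable is log-transformed, a coefficient b corresponds to a multiplicative factor exp(b) on the divergence, that is, a relative change of exp(b) - 1. The sketch below illustrates this reading of the 24 interaction coefficients of a single factor; the function names are hypothetical and no coefficient values from the study are reproduced.

```python
import math


def relative_change(coefficients: list[float]) -> list[float]:
    """Convert log-scale coefficients into relative changes (0.10 means +10%)."""
    return [math.exp(b) - 1.0 for b in coefficients]


def peak_hour(coefficients: list[float]) -> int:
    """Hour of the day at which the factor's coefficient is largest."""
    return max(range(len(coefficients)), key=lambda h: coefficients[h])
```

For example, a coefficient of `math.log(1.5)` at a given hour corresponds to consumption 50% above the baseline expectation at that hour, holding the other factors fixed.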
The figure shows the temporal shapes of the topics organized in the five top-level groups. The plot highlights how articles about STEM topics, such as Chemistry, Physics, and Mathematics, tend to receive more attention than average during the daytime and visibly less outside typical working hours. On the other hand, articles about Films, Television, and Biographies have an inverted shape, with less consumption than average during the day and a substantial increase during the evening. Interestingly, the shapes of the temporal patterns suggest that content about Video Games, Comics & Anime, Internet Culture, Military, and Society is consumed by night-owl readers, which drives the relative consumption to peak during the night.
In contrast, the consumption of articles about Radio, Libraries & Information, and Philosophy is skewed toward the early hours of the day. Some of the shapes, especially the ones associated with STEM articles, show a reduction of attention around noon, suggesting that they might be affected by the lunch break, when people's attention moves to other content types. This is corroborated by the fact that attention to articles about Food increases during common meal times.
Next, the figure shows the interaction coefficients of the countries with time. Some countries share similar daily patterns. For example, readers from the United States, Germany, the Netherlands, and Nigeria tend to consume Wikipedia more than average in the early morning. This behavior is inverted for readers from India, Ireland, Italy, and Spain, who consume less content than average during the same hours. Meanwhile, other countries, such as Malaysia, Singapore, Brazil, Russia, and Pakistan, show higher consumption during the night. Furthermore, some countries, like the Philippines, Italy, France, and Spain, reveal shared habits, such as a reduction of information consumption around noon, possibly associated with lunchtime.
Finally, investigating the coefficient describing the access method, we observe that access from desktop devices is above the global average in the central hours of the day: between 9 AM and 5 PM.