Research:Understanding search behavior of users
The objective of this research initiative is to provide a better understanding of how and why WMF's internal search is being used by the different types of users of the platform. The research is conducted per the request of the Search Platform team at WMF.
Research questions
The research primarily targets the following questions:
- Understand differences in search behavior on the web vs mobile user interface (including browsers).
- Understand country/regional differences, especially in emerging countries and specific languages.
These were selected from a prioritized list of research questions that are of interest to the Search Platform team. Other elements relevant to the study were taken from a related Phabricator task, namely:
- Evaluation of common or relevant query patterns.
- Top returned documents (articles) and top clicked through documents.
Data and Methods
Data sources
The data sources used in the analysis are listed in the table below, together with links to their description pages (where available) and the data retention policy that applies to each:
Name | Max. data retention |
---|---|
Web request logs (wmf.webrequest table) | 90 days |
Search logs (event.mediawiki_cirrussearch_request table; abbreviated emcr) | 90 days |
discovery.query_clicks_daily table (abbreviated dqcd) | 90 days |
wmf.mediawiki_wikitext_current table | - |
event.searchsatisfaction table | - |
Pageview hourly (wmf.pageview_hourly table) | - |
Additional information about the data sources can be found in the report available on this page.
Evaluated metrics
The metrics computed as part of the project for the defined time ranges are listed below, grouped by type of analysis. Section references in what follows correspond to the sections of the project's report, available on this page.
- Both for general search behavior (Section 3) and search behavior based on client type (Section 4):
  - Total number of sessions
  - Distribution of number of clicks per session
  - Dwell time (based on sessionized clicks and checkin time)
  - Percentage of clicks that have an associated dwell time
  - Average page length per dwell time (histogram) bin
  - Average ranking position clicked on
  - Number of words per query
  - Top k queries
- Search behavior based on language and country:
  - User hits per country for a given language
  - Usage metric (number of hits normalized by size)
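As a rough illustration of how a few of the session- and query-level metrics above could be computed, the following minimal sketch works on a small in-memory list of records. The record layout (session id, query text, clicked position) and all values are hypothetical; they do not reflect the schema of the tables listed under Data sources, and the project's actual computations were run with PySpark rather than plain Python.

```python
from collections import Counter
from statistics import mean

# Hypothetical search records: (session_id, query, clicked_position or None).
# Field names and values are illustrative only, not the real table schema.
records = [
    ("s1", "albert einstein", 1),
    ("s1", "albert einstein relativity", 3),
    ("s2", "paris", None),
    ("s3", "paris", 2),
    ("s3", "albert einstein", 1),
]


def clicks_per_session(rows):
    """Map each session to its click count; sessions without clicks count as 0."""
    counts = Counter()
    for session_id, _query, position in rows:
        counts[session_id] += 0 if position is None else 1
    return dict(counts)


def click_distribution(rows):
    """Histogram: number of clicks -> number of sessions with that many clicks."""
    return dict(Counter(clicks_per_session(rows).values()))


def words_per_query(rows):
    """Number of whitespace-separated words in each query."""
    return [len(query.split()) for _session, query, _position in rows]


def top_k_queries(rows, k):
    """The k most frequent query strings with their counts."""
    return Counter(query for _session, query, _position in rows).most_common(k)


def average_clicked_position(rows):
    """Mean ranking position over records that have a click, or None if there are none."""
    positions = [p for _session, _query, p in rows if p is not None]
    return mean(positions) if positions else None


print("sessions:", len(clicks_per_session(records)))
print("click distribution:", click_distribution(records))
print("words per query:", words_per_query(records))
print("top 2 queries:", top_k_queries(records, 2))
print("average clicked position:", average_clicked_position(records))
```

Splitting on whitespace is only one possible definition of a word; queries in languages that do not separate words with spaces would need a different tokenization.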
Time ranges
The quantities analyzed for general search behavior and for behavior based on client type were computed for two one-week time ranges: 2-8/May/2022 and 4-10/Jul/2022. Each range spans the first complete week (Monday to Sunday) of its month; two ranges were used as an initial validation check of the results. For general search behavior, when the results for the two ranges are similar, only one set of results is shown, since both ranges are covered in the analysis based on client type.
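The "first complete week (Monday to Sunday) of a month" rule can be checked with a short sketch. The helper name `first_complete_week` is introduced here for illustration only and is not part of the project's code.

```python
from datetime import date, timedelta


def first_complete_week(year, month):
    """Return (monday, sunday) of the first Monday-to-Sunday week that lies fully inside the month."""
    first_day = date(year, month, 1)
    # date.weekday(): Monday is 0, Sunday is 6.
    days_until_monday = (7 - first_day.weekday()) % 7
    monday = first_day + timedelta(days=days_until_monday)
    return monday, monday + timedelta(days=6)


for year, month in [(2022, 5), (2022, 7)]:
    start, end = first_complete_week(year, month)
    print(start.isoformat(), "to", end.isoformat())
```

For May and July 2022 this yields 2022-05-02 to 2022-05-08 and 2022-07-04 to 2022-07-10, matching the two ranges used in the analysis.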
When analyzing user hits per country for a given language (Section 5.1), two time ranges are used (again for validation purposes) at the monthly level (Feb/2021 and Jul/2022). The proposed usage metric (Section 5.2) is computed for 4-10/Jul/2022, since it depends on the number of words per project and the user hits, quantified and validated in Section 5.1.
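The exact definition of the usage metric is given in the report; the sketch below only assumes the simplest form consistent with the description above, namely user hits divided by project size measured in words. The function name, the project names, and all numbers are made up for illustration.

```python
def usage_metric(hits, project_words):
    """Hits divided by project size in words (an assumed form of the normalization)."""
    if project_words <= 0:
        raise ValueError("project_words must be positive")
    return hits / project_words


# Made-up example values; not results from the study.
projects = {
    "wiki_a": {"hits": 5000, "words": 2_000_000},
    "wiki_b": {"hits": 300, "words": 60_000},
}

usage = {name: usage_metric(p["hits"], p["words"]) for name, p in projects.items()}
ranked = sorted(usage, key=usage.get, reverse=True)
print(usage)
print(ranked)
```

Under this assumed form, a small project with few hits can still rank above a large one, which is the point of normalizing by size.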
Infrastructure and software
All code used to generate the results for this research initiative was run on the Wikimedia Foundation's production cluster. In particular, the databases were queried using PySpark (the Python API for Apache Spark), version 2.4.4.
Results
Further details about the work performed as part of the project, as well as the results obtained, are included in the project's report (also attached at the top of this page).