Research:Measuring User Search Satisfaction

Created: 17:41, 19 June 2015 (UTC)
Duration: 2015-06 – 2018-01


This page documents a completed research project.


Background

Wikimedia Search

 
A screenshot of the prefix search dropdown menu
A screenshot of the Wikimedia search interface for a full-text search

Search on the Wikimedia sites appears in a variety of forms on a variety of different platforms. Generally speaking, however, there are two user-facing ways of searching, both triggered by the search box in the UI.

If characters are typed into the search box, a "prefix" search is triggered, searching for pages whose titles start with the characters you've typed and displaying them in a dropdown where they can be clicked. If no pages are found, or if you click the "containing..." link, a "full text" search is triggered, landing you on the search results page (see the second screenshot), which lists pages whose text contains what you searched for.
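The search box has its own code paths, but the same two modes are exposed through the MediaWiki Action API (the prefixsearch and search query modules); the sketch below is only a way to see the two behaviours side by side, not how the UI itself issues requests:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # any Wikimedia wiki works

def prefix_search(prefix, limit=10):
    """Pages whose titles start with `prefix` (what the dropdown shows)."""
    r = requests.get(API, params={
        "action": "query", "list": "prefixsearch",
        "pssearch": prefix, "pslimit": limit, "format": "json",
    })
    return [p["title"] for p in r.json()["query"]["prefixsearch"]]

def fulltext_search(query, limit=10):
    """Pages whose text matches `query` (what the results page shows)."""
    r = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": query, "srlimit": limit, "format": "json",
    })
    return [p["title"] for p in r.json()["query"]["search"]]
```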

There are differences between platforms (the Apps, for example, give you a lot more information in the dropdown), but these are the two core systems, and they're used constantly: on desktop alone, 64 million searches are made a day, excluding the API. This makes measuring how well the system is doing a paramount concern, and we don't currently have a good way of doing that.

Existing heuristics

At the moment our way of identifying whether search users are satisfied is based on a series of EventLogging schemas that track search sessions and identify:[1]

  1. Whether a full-text search results page was provided;
  2. Whether a user clicked on one of the pages that were offered to them.

This is used to calculate a simple clickthrough rate, which is our only way of measuring user satisfaction; if a user was offered a results page and clicked on a provided link, they succeeded. If they did not, they failed.
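In code, the heuristic amounts to something like the following sketch (the event field names are illustrative placeholders, not the schemas' actual properties):

```python
# Minimal sketch of the clickthrough heuristic over a stream of logged events.
# The field names (session_id, action) are illustrative placeholders.
def clickthrough_rate(events):
    sessions = {}
    for e in events:
        s = sessions.setdefault(e["session_id"], {"results": False, "click": False})
        if e["action"] == "results-page":
            s["results"] = True
        elif e["action"] == "click":
            s["click"] = True
    offered = [s for s in sessions.values() if s["results"]]
    if not offered:
        return None
    return sum(s["click"] for s in offered) / len(offered)
```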

As a heuristic this has a lot of problems. Most prominently, it produces both false positives and false negatives: a user who clicks through, doesn't find what they want and goes back is counted as a success. A user who doesn't click through because there was a typo in their query, and who instead corrects the typo and goes directly to the specific page they wanted, is counted as a failure. A user who clicks through, doesn't find what they want and goes on to a second page is counted as a failure. The heuristic's inability to handle anything but the most basic use case means it simply doesn't work for our purposes.

Session reconstruction and browsing patterns

 
A histogram of a random sample of inter-edit times for registered editors of English Wikipedia, with a three-component Gaussian mixture model fit overlaid. The one-hour threshold used by Geiger and Halfaker[2] is marked.

A more elegant way of identifying user satisfaction would be to look at their entire search path. Instead of simply logging that they got a result set at [timestamp] and clicked on it at [timestamp], we log each click the user makes starting from the search session, associating it with a unique ID (that expires after a short timeout) and the timestamp of each action. This would allow us to control for situations where a user clicked back, or indeed forward, from the page they landed on after the search page.

The problem with this in practice is that it doesn't allow us to distinguish a user clicking through because they did not gain value from a page from a user clicking through because they did gain value but wish to continue the search task on other, interesting pages (or have a new task, or have completed their task and are browsing out of idle curiosity). A way of solving this could be MediaWiki session reconstruction techniques, which Aaron Halfaker and Oliver Keyes pioneered in 2014 for reader data.[3] That work looked at the inter-activity times of user read requests and found a pair of distinct distributions with a clear gap between them, allowing us to set a user time-out threshold: actions separated by less than the threshold indicate intentional browsing within a session, while actions separated by more indicate a subsequent session.

If we can identify a threshold between immediate clicks away and clicks that follow actual reading, we can extend this approach to search and perform more complex analysis. Specifically, we can distinguish users who click through to a subsequent page after spending time reading (and presumably gaining value) from users who click away very quickly - and we can do this automatically, at scale, and in a way that strengthens the basic clickthrough heuristic for user satisfaction.
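As a sketch of the mechanics (the threshold value below is a placeholder; identifying the right one is exactly what RQ1 asks):

```python
from datetime import timedelta

def split_sessions(timestamps, threshold=timedelta(minutes=15)):
    """Split one user's sorted action timestamps into sessions.

    `threshold` is a placeholder inter-activity cutoff; the real value has to
    be identified empirically from the data.
    """
    sessions, current = [], []
    for t in timestamps:
        if current and (t - current[-1]) > threshold:
            sessions.append(current)   # gap too large: start a new session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions
```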

Research Questions

RQ1: Can we identify an appropriate threshold?

Search Satisfaction Schemas

Schema 1.0.0

The top user agents from a control sample, compared to their proportion in the initial data for search satisfaction.

The first step towards validating this hypothesis - that we can use inter-activity times to distinguish different types of user browsing behaviour around search - is to gather the timestamps associated with those actions and see if we can find any signal in the noise. For this we can use an EventLogging schema that tracks:

  1. A unique session ID;
  2. The timestamp of each action;
  3. Whether the action was landing on a search page or landing on any other page;
  4. When the user left the page.

This was implemented as the TestSearchSatisfaction schema, which began running in mid-July 2015, gathering data from 0.1% of search sessions. An analysis of this data at the end of July demonstrated clear bias in the browsers being recorded, as seen in the graph to the right. More problematically, with all of the restrictions on usage the actual number of events was tiny - only 16,000 events were recorded in a week, with 90% of sessions ending after a single event, indicating either a bug in the search schemas or a depressingly high search failure rate.

Schema 2.0.0

The top user agents from a control sample, compared to their proportion in the second dataset for search satisfaction

With that in mind we implemented a second, far simpler schema - Schema:TestSearchSatisfaction2. Unlike the first schema, it has a smaller and simpler JavaScript payload and does not attempt to compute things like page traces in-browser: instead it simply identifies pageviews and search events, and has the browser check in every N seconds. This means we can compute traditional session reconstruction metrics (using the timestamps from pageviews) and perform survival analysis (using the check-ins) without as much complexity in the code.

 
A density curve of times between pageviews for users who made a search, log10

We logged nearly 300,000 events from 10,211 sessions across approximately 5,494 users. A bug (introduced on 10 September 2015) in the EventLogging (EL) system prevents us from accurately linking multiple sessions to the same user, so the user count is a rough approximation. The result was a schema whose user agent data tracked the control sample much more closely, meaning we can be somewhat confident that the recorded events are representative. And when we perform the standard inter-activity time analysis, we see a two-peak distribution with the same initial breakpoint found in other datasets.[3] This suggests that users' behavioural patterns after searches do not actually vary much from their default.
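The inter-activity analysis is straightforward to reproduce; a minimal sketch, assuming intertimes is a one-dimensional array of seconds between consecutive pageviews that share a session ID:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: fit a two-component Gaussian mixture to log10 inter-pageview times and
# read off an approximate breakpoint between the "within-session" and
# "new-session" peaks. `intertimes` is assumed to already exist.
log_times = np.log10(intertimes[intertimes > 0]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_times)

# Crude breakpoint: where the predicted component flips along a grid of values.
grid = np.linspace(log_times.min(), log_times.max(), 1000).reshape(-1, 1)
labels = gmm.predict(grid)
breakpoint_log10 = grid[np.argmax(labels != labels[0])][0]
print("approximate breakpoint:", 10 ** breakpoint_log10, "seconds")
```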

Survival analysis approach
 
Figure 1. Kaplan-Meier curve of time spent on page.
Figure 2. Fraction of users remaining as a function of time spent on page, stratified by OS.
Figure 3. Fraction of users remaining as a function of time spent on page, stratified by OS after lumping Windows versions.
Figure 4. Fraction of users remaining as a function of time spent on page, stratified by browser.
Figure 5. Fraction of users remaining as a function of time spent on page, stratified by language.
Figure 6. Fraction of users remaining as a function of time spent on page, stratified by project.
Figure 7. Fraction of users remaining as a function of time spent on page, stratified by wiki.

When a user closes the page and the last check-in we have from them is at 40s, we know only that they were on the page for between 40 and 50 seconds. This is where survival analysis comes in. In statistics, right censoring occurs when a data point is known to be above a certain value, but it is unknown by how much; in our dataset, the time of the last check-in is that value.

Survival analysis is concerned with time to event. In epidemiology that event is usually death, but in this context the event is closing the page. The Kaplan-Meier (K-M) estimator is one of the best tools for measuring the fraction of subjects (users) who have not yet experienced the event (closing the page) after a given amount of time. K-M estimates let us see the percentage of users we lose with each additional second. In Fig. 1, the first quarter of users closed the page within the first 25 seconds, and after a minute we had lost 50% of our users.
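A minimal sketch of the estimator itself, assuming we have each user's time on page (for right-censored users, the time of their last check-in) and a flag for whether we actually observed the page being closed:

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: time on page in seconds; for censored users, the last check-in time.
    observed:  True if we saw the page being closed, False if right-censored.
    Returns (times, survival), where survival[i] is the estimated fraction of
    users still on the page just after times[i].
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])            # distinct event times
    survival, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)              # still on the page just before t
        events = np.sum((durations == t) & observed)  # pages closed at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

# Hypothetical usage: three observed closes and one user censored at their 40s check-in.
t, s = kaplan_meier([25, 40, 60, 90], [True, False, True, True])
```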

Next, we wanted to see whether there were differences in the time-spent-on-page Kaplan-Meier curves when stratifying by various user agent fields, such as operating system (OS) and browser, and by project and language. In general, the users in our test dataset behaved very similarly. Some findings:

  • A greater fraction of Linux users kept the pages open for at least 7 minutes than users on any other OS. (Fig. 2.)
  • When we combined the fragmented Windows sub-populations into a single Windows group, we saw that Linux and Ubuntu, specifically, retained a greater fraction of users past 400s. Unsurprisingly, Android and iOS were the two OSes (with the exception of the catch-all “Other” category) where we lost users the fastest. (Fig. 3.)
  • In general, users across the various browsers behaved similarly. The big exception is Safari users (the pink curve), who we lose the fastest. (Fig. 4.)
  • Users remained on German and Russian wiki pages longer than on wikis in other languages. (Fig. 5.)
  • When we stratified by project, we started seeing really stark differences in page visit times. (Fig. 6.)
    • We lost users the fastest on Commons (red), which makes sense because those pages are not articles that we would expect users to spend several minutes viewing. By 40s, we had already lost half of those users.
    • Users viewing Wikiquote (blue) pages, however, stayed on those pages longer than users on other projects; it was only by 120s that we had lost half of those users.
  • The trends noticed above in Figs. 5 and 6 are also evident in Fig. 7. Users of Russian and German Wikipedias stayed on those pages longer, while Spanish and English Wikipedias (along with other wikis) had very similar Kaplan-Meier curves.

We now have a valid (user agent unbiased) schema for tracking sessions and valid statistical methodology (survival analysis) for dealing with the data it generates. Already we can use it to see how user behavior differs between the different wikis, and somewhat between the different languages. Having said that, we think the schema can be improved to include additional information that would make future analyses more robust and would enable more questions to be answered.

Proposed addition for v2.1.0: a Scrolled event

Theoretically, we could use sessions to label pages as “abandoned,” which is to say that if the user has a page open for at least 7 minutes but they have started a new session in another tab, then it doesn’t make sense for us to hold on to that data point because the user has, essentially, abandoned the page.

We briefly discussed the possibility of adding an “on scroll” event trigger and using it together with the check-ins. A scroll trigger doesn’t appear to be a restrictive requirement, so we can devise an intelligent way of logging scrolls: rather than pinging the server every time the user scrolls, we can ping it with a message along the lines of “the user has scrolled in the last 30 (or 60) seconds.” Then, if the user has been on the page for 6 or 7 minutes but hasn’t scrolled in the past 5 minutes, we can either correct the page visit time or disregard the data point entirely.
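The correction itself could be as simple as capping the recorded visit time at the last check-in that reported recent scrolling; a sketch, with a hypothetical event shape since this is only a proposal:

```python
# Sketch of the proposed correction. `checkins` is a hypothetical list of
# (seconds_on_page, scrolled_recently) pairs, one per check-in ping.
def corrected_visit_time(checkins, max_idle=300):
    """Cap the visit time at the last 'active' check-in if the page then sat idle."""
    last_active = 0
    for seconds, scrolled in checkins:
        if scrolled:
            last_active = seconds
    total = checkins[-1][0] if checkins else 0
    # If the page was open for more than `max_idle` seconds past the last
    # scroll, treat it as abandoned at that point rather than counting the
    # whole open time.
    return last_active if (total - last_active) > max_idle else total
```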

Proposed addition for v2.1.0: a Parameter field

We suggest adding a param field that takes on context-dependent (unsigned) integer values:

  1. If action == searchEngineResultPage then param stores the number of results returned by the search (with an upper bound of, say, 100).
  2. If action == visitPage then param stores the index of the clicked result – e.g. 1 if the page opened was the 1st one in the results list, 4 if the page was the 4th result listed, etc.
  3. If action == checkin then param stores the check-in time (10, 20, 30, etc.).

Possible questions this may help us answer:

  • Did the users who triggered a searchEngineResultPage but no visitPage events even get any pages to go to? If action == searchEngineResultPage and param > 0 but there’s no accompanying visitPage, then we probably didn’t return anything they were looking for. This could be very useful for estimating satisfaction (see the sketch after this list).
  • Do users open pages beyond the first 10 that get returned?
  • Is the 1st result the one that users spend the most time on? How does the page’s ranking in the results list correlate with how long people spend on the page?
  • Do users follow a pattern of opening 1st result, being unsatisfied, opening 2nd result, being unsatisfied, opening 3rd…?

This will also help prepare the schema for a future revision where we add a “Did we get you results you’re satisfied with?” box – action: survey, param: -2 (“heck no”) or -1 (“not really”) or 0 (“unsure”) or 1 (“yes”) or 2 (“very much!”)

RQ2: Does this threshold and the heuristic it permits align with explicit measures?

There is existing research on using these kinds of measures, such as dwell time, mostly around improving search recommendations. The literature refers to these as implicit measures of user satisfaction or search quality, to distinguish them from explicit measures such as direct user feedback.[4]

One approach taken to testing implicit measures is to compare them to the results of explicit ones,[4] and that's what we want to do here. A random sample of search sessions will feature implicit tracking and present the user with, at varying points in the workflow, a quick survey form to explicitly rate their satisfaction with the results they've been given. What this survey looks like and the population it hits will be established after we

Results

References

  1. MobileWikiAppSearch for the Wikimedia Apps, Search for desktop, and MobileWebSearch for the Mobile Web interface
  2. Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia" (PDF). Proceedings of the 2013 ACM Conference on Computer Supported Cooperative Work (ACM). 
  3. a b Aaron Halfaker, Oliver Keyes, Daniel Kluver, Jacob Thebault-Spieker, Tien T. Nguyen, Kenneth Shores, Anuradha Uduwage, Morten Warncke-Wang: User Session Identification Based on Strong Regularities in Inter-activity Time. WWW 2015: 410-418
  4. a b Agichtein, Eugene; Brill, Eric; Dumais, Susan (2006). "Improving web search ranking by incorporating user behavior information". Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

Further Reading

  1. Goel, M. K., Khanna, P., & Kishore, J. (2010). Understanding survival analysis: Kaplan-Meier estimate. International Journal of Ayurveda Research, 1(4), 274–278. http://doi.org/10.4103/0974-7788.76794

Appendix