Research:Finding Search Engine Terms Used to Retrieve Wikibooks

Contact

Independent

Duration: 2018-January – 2018-March

This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.

The idea here is to process recent search term data extracted from the http_referer headers of HTTPS requests for various chapters of Wikibooks. For Google, this would be the part after ?q=; for Baidu, it might be ?wd=; and so on. The extracted search terms could help explain general interest in the books and provide clues about spikes and surges in usage of a book and its chapters. With more clues explaining those spikes and surges, the book content can be improved.
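As a rough illustration of the extraction step, the sketch below pulls the search terms out of a referrer URL in Python. The mapping of host fragments to query parameters is an assumption for illustration (only ?q= for Google and possibly ?wd= for Baidu are mentioned above), and the function name extract_search_term is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed mapping from a fragment of the search engine's host name to the
# query-string parameter that carries the search terms. Only the Google and
# Baidu entries come from the proposal; the others are illustrative guesses.
SEARCH_PARAMS = {
    "google.": "q",
    "bing.": "q",
    "duckduckgo.": "q",
    "baidu.": "wd",
    "yahoo.": "p",
}


def extract_search_term(referer):
    """Return the search terms from a referrer URL, or None if there are none."""
    parts = urlsplit(referer)
    host = (parts.hostname or "").lower()
    for fragment, param in SEARCH_PARAMS.items():
        if fragment in host:
            # parse_qs decodes percent-escapes and turns '+' into spaces.
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0].strip().lower() or None
            return None
    return None


print(extract_search_term("https://www.google.com/search?q=Python+for+loops&hl=en"))
# prints "python for loops"
```

Only the value of the search parameter is returned; the rest of the query string and the rest of the header are discarded.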

Methods

The primary method will be to use HQL database queries to retrieve only the search terms (not the whole search query) from the http_referer headers in the database of stored HTTP requests. The results would be grouped by date and limited to specific Wikibook destination chapters. The following data is sought from the database:

date of access (perhaps rounded to week or month)
search term (not the whole query and definitely not the whole header)
Wikibook destination chapter
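The grouping described above can be sketched in Python on already-extracted rows. This is only a minimal illustration, not the planned HQL query: the sample rows are made up, the helper names week_start and count_terms are hypothetical, and rounding to the Monday of each week is one assumed way of rounding the date.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day):
    """Round a date down to the Monday of its week."""
    return day - timedelta(days=day.weekday())


def count_terms(records):
    """Count hits per (week, search term, destination chapter)."""
    counts = Counter()
    for day, term, chapter in records:
        counts[(week_start(day), term, chapter)] += 1
    return counts


# Made-up sample rows: (date of access, search term, destination chapter)
sample = [
    (date(2018, 1, 2), "python for loop", "Python_Programming/Loops"),
    (date(2018, 1, 4), "python for loop", "Python_Programming/Loops"),
    (date(2018, 1, 9), "python while loop", "Python_Programming/Loops"),
]

counts = count_terms(sample)
for (week, term, chapter), hits in counts.most_common():
    print(week.isoformat(), term, chapter, hits)
```

Keeping only these three fields plus a count means no other part of the request needs to leave the database.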

Timeline

Work would begin in January 2018 and conclude in March 2018. The first phase would be developing a generic database query to extract and rank search term data from http_referer headers for a specific Wikibook. The second phase would be to generalize that query and apply it to five other books. The final phase would be to write up and summarize the findings and methodology.

Funding

No funding is needed on this end. Hardware, workspace, beer, etc. are covered. However, Wikimedia staff, presumably from the Discovery team, would need to provide read-only access to the relevant database at the project's start and turn off said access at its conclusion.

Policy, Ethics and Human Subjects Research

Data will be accessed, sanitized, and analyzed in accordance with Wikimedia's guidelines for data access and data retention. The resulting analysis will be stored in accordance with Wikimedia's data storage guidelines. The data being extracted and analyzed (the search terms, not the whole search query) is not sensitive. According to off-ticket input from the Legal team, this data should be fine: https://phabricator.wikimedia.org/T144714#2620800

Expected Results

The expected outcome is a more or less generic HQL query that can be re-used for arbitrary Wikibook destination chapters to identify search keywords and their frequency of use.
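To give a sense of what such a reusable query might look like, here is a hedged sketch of a Python helper that fills in a query template for one chapter and one search engine. The table name webrequest_sample and the columns referer, uri_path, year, and month are assumptions, not confirmed names from the actual database; parse_url is a Hive built-in function, on the assumption that HQL here means Hive's query language. The helper build_query is hypothetical.

```python
import re

# Hypothetical table name; the real table and column names may differ.
DEFAULT_TABLE = "webrequest_sample"

QUERY_TEMPLATE = """\
SELECT
  year,
  month,
  lower(parse_url(referer, 'QUERY', '{param}')) AS search_term,
  uri_path AS chapter,
  count(*) AS hits
FROM {table}
WHERE uri_path = '{chapter_path}'
  AND parse_url(referer, 'HOST') LIKE '%{engine}%'
  AND parse_url(referer, 'QUERY', '{param}') IS NOT NULL
GROUP BY
  year,
  month,
  lower(parse_url(referer, 'QUERY', '{param}')),
  uri_path
ORDER BY hits DESC
"""

# Conservative whitelist so that quotes, backslashes, and semicolons can
# never reach the query text. Titles with other characters would need
# proper escaping instead.
_SAFE = re.compile(r"^[A-Za-z0-9_./:()%-]+$")


def build_query(chapter_path, engine, param, table=DEFAULT_TABLE):
    """Fill in the template for one chapter and one search engine."""
    for value in (chapter_path, engine, param, table):
        if not _SAFE.match(value):
            raise ValueError("unsafe value for query: %r" % (value,))
    return QUERY_TEMPLATE.format(
        chapter_path=chapter_path, engine=engine, param=param, table=table
    )


print(build_query("/wiki/Python_Programming/Loops", "google.", "q"))
```

Applying the query to another book or another search engine would then only require different arguments, for example build_query(path, "baidu.", "wd").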