Discovery/Handling question marks in search queries

A problem with searching

The Wikimedia search team recently completed a statistical analysis of search query features and the number of results that the queries return. Two features make a query more likely to get zero results: containing a string in quotes (e.g., "xyz") and ending with a question mark (e.g., xyz?). Here, we explore the queries that have ? as the final character.

Currently, the question mark is available to be used as a wildcard, and it will match any one letter. So for example, searching for wiki?edia will give results for both wikipedia and wikimedia. However, many users don’t know this and they use question marks for a more common purpose: to ask a question.

As another example, when a user asks how old is Tom Cruise? on English Wikipedia, the last term Cruise? can match cruiser, cruises, cruised and Cruise’s but would not match Cruise. This type of query can give unexpected and generally poor results.
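The wildcard behavior described above can be illustrated with a small sketch. It assumes that ? stands for exactly one word character and emulates that with a regular expression; the function names are hypothetical, and this toy model ignores the search engine's text analysis, so it does not reproduce every match mentioned above (such as Cruise's).

```python
import re

def wildcard_to_regex(term: str) -> re.Pattern:
    """Build a regex in which each ? matches exactly one word character.

    Assumption: this only emulates the single-character wildcard;
    it is not how the search backend implements it.
    """
    parts = [r"\w" if ch == "?" else re.escape(ch) for ch in term]
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

# wiki?edia matches both project names.
print(matches("wiki?edia", "wikipedia"))  # True
print(matches("wiki?edia", "wikimedia"))  # True

# Cruise? requires one extra character, so the bare name is missed.
print(matches("Cruise?", "cruiser"))  # True
print(matches("Cruise?", "Cruise"))   # False
```

Because the wildcard must consume a character, the word the user actually typed is the one form that can never match.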

How we found this out

We analyzed queries that end with a question mark in ten Wikipedias: English, German, Spanish, Russian, French, Portuguese, Japanese, Italian, Polish, and Chinese. We re-ran the queries with and without the final question mark. When we removed the final question marks, the number of queries that showed zero results decreased, as did the overall number of queries that got fewer than three results.

A manual inspection of the sampled ?-final queries in six of those ten languages (English, German, Spanish, French, Portuguese and Italian) showed that the vast majority of the queries are indeed questions. This leads us to believe that users who type question marks are generally not intentionally trying to use wildcards.

Also of note

  • If an article title ends with a question mark (e.g., Who's Afraid of Virginia Woolf?), a query that includes that title will still return the article in the search results.
  • There are some queries that are made up entirely of question marks and other punctuation (e.g., ??? ???-?? or ?...?.,??)
  • Some Spanish Wikipedia queries used a leading inverted question mark (¿), which generally doesn't pose a problem.
  • Other Spanish Wikipedia queries used a leading question mark instead of an inverted question mark (e.g., ?cuantos años tiene Tom Cruise?), which does cause problems in getting good search results.
  • Some queries have multiple question marks (e.g., how old is tom cruise??). Treating ? as a wildcard, this will look for two extra letters in a matching word. For example: cruise?? would match cruisers, but not Cruise.
  • In some queries the final question marks are separated by a space (e.g., how old is tom cruise ??).
  • Many queries with multiple question marks are multiple questions (e.g., how? why?).
  • Sometimes the multiple questions in a single query don't have spaces between them.
  • A small number of queries with question marks are potential wildcard queries, but most of those include an initial question mark and as such do not return results.

A detailed analysis is available for further reading.

Possible solutions that can help

There are currently four options in development for dealing with question marks in queries:

  • no: do nothing and leave the queries as they are.
  • final: remove all question marks and spaces from the end of the query and use that as the search.
  • break: remove all question marks followed by a word boundary (in particular a Unicode non-letter character).
  • all: replace all question marks with spaces (treating them as word boundaries).

If it is decided to change the way search handles question marks, these options would be configurable for each wiki. The recommendation is that the second option (final) be the default.
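As a rough illustration, the four options could be sketched as follows. This is a minimal sketch and not the actual implementation: the function name and the mode strings are hypothetical, "letter" is approximated with Python's Unicode-aware word class minus digits and underscore, and collapsing whitespace in the all mode is an added assumption.

```python
import re

# Assumption: a "letter" is any Unicode word character except digits and "_".
LETTER = r"[^\W\d_]"

def strip_question_marks(query: str, mode: str = "final") -> str:
    """Apply one of the four proposed options to a query (hypothetical helper)."""
    if mode == "no":
        # Leave the query untouched.
        return query
    if mode == "final":
        # Remove question marks and whitespace from the end of the query.
        return re.sub(r"[?\s]+\Z", "", query)
    if mode == "break":
        # Remove each ? that is NOT followed by a letter,
        # i.e. one followed by a non-letter or the end of the query.
        return re.sub(r"\?(?!" + LETTER + ")", "", query)
    if mode == "all":
        # Replace every ? with a space; collapsing the resulting
        # whitespace is an assumption made for readability.
        return " ".join(query.replace("?", " ").split())
    raise ValueError(f"unknown mode: {mode}")

for mode in ("no", "final", "break", "all"):
    print(mode, "->", repr(strip_question_marks("how? why wiki?edia?", mode)))
```

In this sketch only the all mode breaks the wildcard in wiki?edia, while final and break leave it usable. That difference is one way to see why a conservative option such as final is attractive as a default.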

Additional features and notes

  • Because insource queries use regular expressions, queries that include insource: would not be modified.
  • Queries made up entirely of punctuation (i.e., .,:;?¿!*-) and spaces would not be modified.
  • Question marks escaped with a backslash (i.e., \?) would not be removed, but they would be unescaped so that they could function as regular wildcards.
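The exemptions listed above could be combined with the final option roughly as shown here. This is a sketch under stated assumptions: the function name is hypothetical, the insource check is a plain substring test, and escaped question marks are protected with a placeholder character that is assumed not to occur in real queries.

```python
import re

# Queries made up only of this punctuation and whitespace are left alone.
PUNCT_ONLY = re.compile(r"[.,:;?¿!*\-\s]+")

# Assumption: the NUL character never appears in a real query.
PLACEHOLDER = "\x00"

def strip_final_with_exemptions(query: str) -> str:
    """Strip final question marks unless one of the exemptions applies."""
    # insource: queries may contain regular expressions, so skip them.
    if "insource:" in query:
        return query
    # Punctuation-only queries are not modified.
    if PUNCT_ONLY.fullmatch(query):
        return query
    # Hide escaped question marks so the stripping step cannot remove them.
    protected = query.replace("\\?", PLACEHOLDER)
    stripped = re.sub(r"[?\s]+\Z", "", protected)
    # Restore them unescaped so they can act as regular wildcards.
    return stripped.replace(PLACEHOLDER, "?")

print(strip_final_with_exemptions("how old is Tom Cruise?"))  # how old is Tom Cruise
print(strip_final_with_exemptions("wiki\\?edia?"))            # wiki?edia
print(strip_final_with_exemptions("insource:/colou?r/"))      # insource:/colou?r/
```

Handling the exemptions before any stripping keeps the rule simple: a query is either passed through unchanged or cleaned in a single step.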

More things to consider

  • The solutions proposed above do not cover the Spanish case of an initial ? used instead of ¿, which actually causes worse problems than using a final ?.
    • An option to strip out the initial ? could be added as a bundled or separate function.
  • When queries are modified, we can provide a link with the appropriately escaped query to search using all question marks as wildcards.
    • This would be similar to the way queries that are spell-corrected are handled.
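The suggested option for an initial ? might look like the sketch below, shown together with final stripping. Both function names are hypothetical, and the exemptions for insource: and punctuation-only queries are not modeled here.

```python
import re

def strip_initial_question_marks(query: str) -> str:
    """Drop question marks, and any spaces after them, from the start of a query."""
    return re.sub(r"\A\?+\s*", "", query)

def strip_final_question_marks(query: str) -> str:
    """Drop question marks and spaces from the end of a query."""
    return re.sub(r"[?\s]+\Z", "", query)

query = "?cuantos años tiene Tom Cruise?"
cleaned = strip_final_question_marks(strip_initial_question_marks(query))
print(cleaned)  # cuantos años tiene Tom Cruise
```

A leading ¿ is left in place in this sketch, since it generally does not pose a problem.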

See also