Community Wishlist Survey 2021/Search/Maintaining a list of the most common search terms which do not correspond to an article name


Maintaining a list of the most common search terms which do not correspond to an article name

Problem: Not knowing which search terms are failing in directing users to an article.
Who would benefit: All users and editors who like to make redirects and new articles.
Proposed solution: Maintain a list which shows the most common search terms that do not successfully direct or redirect to an article.
More comments: This list would allow editors to easily see which terms most need a redirect or a new article. A filter for common or offensive words should be applied, and the list should remove terms that have since had articles created, either by refreshing or by redlinking.
Phabricator tickets:
Proposer: HappyMihilist (talk) 16:04, 18 November 2020 (UTC)
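
One way to picture the proposed list, as a minimal sketch rather than an existing MediaWiki feature: count the failed queries, skip anything on a word filter, and drop terms once a page with that title exists. The function name `top_missing_terms`, the in-memory inputs, and the case-insensitive matching are all assumptions made for illustration; the proposal does not specify any of them.

```python
from collections import Counter


def top_missing_terms(failed_queries, existing_titles, blocklist, limit=10):
    """Rank failed search terms, skipping filtered words and existing titles.

    All inputs are hypothetical in-memory collections; a real wiki would
    have to read them from search logs and the page table.
    """
    existing = {title.casefold() for title in existing_titles}
    blocked = {word.casefold() for word in blocklist}
    counts = Counter()
    for query in failed_queries:
        # Collapse repeated whitespace and ignore case (an assumption).
        term = " ".join(query.split()).casefold()
        if not term or term in blocked or term in existing:
            continue
        counts[term] += 1
    return counts.most_common(limit)


queries = ["Foo bar", "foo  bar", "the", "Baz", "Existing page", "baz", "foo bar"]
result = top_missing_terms(
    queries, existing_titles=["Existing page"], blocklist=["the"]
)
print(result)
```

Because existing titles are checked every time the list is built, a term disappears on the next refresh after its article or redirect is created, which is the "refreshing" option mentioned in the proposal.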

Discussion

I like this idea a lot. I could see the functionality expanding over time; there will be some search terms that we'll want to keep as redlinks for whatever reason, and we'd need to start filtering those out or they'd soon start to clog the top of the list. {{u|Sdkb}} ^talk 03:16, 19 November 2020 (UTC)
Awesome idea for redirects and/or new articles! Strainu (talk) 09:48, 19 November 2020 (UTC)
enwp has a Most-wanted articles list but it's out of date and obviously of no use for other wikis. Certes (talk) 14:45, 19 November 2020 (UTC)

Also, this is about searching, not redlinks, so that page didn't list them, and I don't think there has been a system like this. Dreamy Jazz ^{talk to me | enwiki} 00:13, 20 November 2020 (UTC)

Really good idea. Although it would require storing search histories, this data can be anonymous. A way to filter out intentionally redlinked search results would be good because, unlike Wanted categories or redlink counts, an entry cannot be removed from the list by editing. I think a way to filter entries out is needed, especially if an LTA decides to spam. I suggest this data should be time limited (so that an entry drops off the list without needing to be filtered) and that entries should be removed if the page is created. Dreamy Jazz ^{talk to me | enwiki} 00:13, 20 November 2020 (UTC)

Yes, I think some kind of management of the list is definitely necessary to keep offensive or overly common words off the list. I'm sure lists of such words already exist. I do think, however, that the list could redlink the items by default, as this would allow for quickly removing unnecessary entries. Or it just gets updated daily, in which case those entries automatically disappear. HappyMihilist (talk) 06:17, 20 November 2020 (UTC)

I caution on “offensive” however. If a word pops up enough to be included on such a list, it’s obviously in use enough to be of interest. Also keep in mind that what offends some doesn’t offend others. The British use of C—- is even heard on the floor of parliament at times. Use on the floor of the US House or Senate would be national news. Using god in any exclamation is extremely problematic in many southern areas of the US, yet it is commonly used elsewhere. Etc.

Please make this happen, it would be so great. Abductive (talk) 18:28, 20 November 2020 (UTC)
The data required for such a summary appeared in 2012. I think it rapidly disappeared for privacy reasons. The technology is (or was) available; perhaps a summary could be released again. I promise not to have my PC search continually for The Certes Garage Band before claiming it to be our most wanted missing article. Certes (talk) 22:52, 22 November 2020 (UTC)
Love this idea. Should be easy enough from a data perspective and would be high impact, as it would give editors a list of the most important articles or redirects that are yet unwritten. One extension could be some way of handling misspellings of the same query and bundling those into one "search term". —Shrinkydinks (talk) 22:35, 24 November 2020 (UTC)
From a Wiktionarian perspective, it sounds very useful too. People may look for neologisms and we may track them with this list! Noé (talk) 12:09, 29 November 2020 (UTC)
phab:T8373#1856037 and [1] have some more information about why this hasn't been done before. The core of the issue is that it's difficult from a privacy perspective and it's also not very useful data, which made overcoming the privacy concerns not seem worth it. --Deskana (talk) 23:31, 30 November 2020 (UTC)
I use Wikipedia a lot with maths and physics students aged 18 or above. Many of them soon opt out, mainly for two reasons: a) the first lines of an article are unintelligible because they directly address experts, and b) searching for a specific topic often leads to confusing and non-specific search results. Collecting dead-end search words to create redirects has the potential to increase usability for well-educated but non-expert users. Rhetos (talk) 08:07, 15 December 2020 (UTC)--Rhetos (talk) 07:13, 15 December 2020 (UTC)
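
Several refinements raised in the discussion (time-limited data, a minimum number of searchers, and bundling misspellings into one "search term") can be sketched together. This is only an illustration under stated assumptions: the `(timestamp, session_id, term)` event format, the function name `recent_bundled_terms`, the 30-day window, the three-session threshold, and the 0.85 similarity cutoff are all hypothetical, and the bundling uses Python's standard `difflib` rather than anything the search backend is known to provide. A session threshold like this does not by itself settle the privacy questions raised in the thread.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from difflib import get_close_matches


def recent_bundled_terms(events, now, window_days=30, min_sessions=3, cutoff=0.85):
    """events: iterable of (timestamp, session_id, term) tuples (hypothetical schema)."""
    oldest = now - timedelta(days=window_days)
    sessions_by_term = defaultdict(set)
    for timestamp, session_id, term in events:
        if timestamp < oldest:
            continue  # entry has aged out of the time window
        key = term.strip().casefold()
        # Bundle with an already-seen term if the spelling is close enough.
        match = get_close_matches(key, list(sessions_by_term), n=1, cutoff=cutoff)
        if match:
            key = match[0]
        sessions_by_term[key].add(session_id)
    # Report only terms searched from enough distinct sessions.
    ranked = [
        (term, len(sessions))
        for term, sessions in sessions_by_term.items()
        if len(sessions) >= min_sessions
    ]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


now = datetime(2020, 12, 1)
events = [
    (datetime(2020, 11, 20), "a", "neologism"),
    (datetime(2020, 11, 21), "b", "neologizm"),
    (datetime(2020, 11, 22), "c", "Neologism"),
    (datetime(2020, 11, 23), "d", "rare term"),
    (datetime(2020, 9, 1), "e", "old query"),
]
bundled = recent_bundled_terms(events, now)
print(bundled)
```

Counting distinct sessions instead of raw searches means a single client repeating one query does not push a term up the list, though it would not stop a coordinated effort across many sessions.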

Voting

Support ValeJappo 【〒】 18:29, 8 December 2020 (UTC)
Support Noé (talk) 19:32, 8 December 2020 (UTC)
Support This is a great idea that can be really helpful for creating redirects and new pages ThadeusOfNazereth (talk) 19:33, 8 December 2020 (UTC)
Support T. Le Berre (talk) 19:42, 8 December 2020 (UTC)
Support As a suggestion, it should be located in statistics. MarioSuperstar77 (talk) 20:16, 8 December 2020 (UTC)
Support YFdyh000 (talk) 23:04, 8 December 2020 (UTC)
Support — Jules ^Talk 23:27, 8 December 2020 (UTC)
Support PianistHere (talk) 01:57, 9 December 2020 (UTC)
Support Alkari (talk) 03:21, 9 December 2020 (UTC)
Support Yeenosaurus (talk) 03:45, 9 December 2020 (UTC)
Support Ottawajin (talk) 05:08, 9 December 2020 (UTC)
Support While most would probably either be mere misspellings or be non-notable, this does sound like a good idea. Opalzukor (talk) 08:12, 9 December 2020 (UTC)
Support Abductive (talk) 09:06, 9 December 2020 (UTC)
Support This could help us better serve readers by addressing content gaps. {{u|Sdkb}} ^talk 10:28, 9 December 2020 (UTC)
Support Xavi Dengra (MESSAGES) 12:55, 9 December 2020 (UTC)
Support Hb2007 (talk) 13:23, 9 December 2020 (UTC)
Support TheImaCow (talk) 17:14, 9 December 2020 (UTC)
Oppose due to the privacy concerns mentioned in the discussion section. --Петър Петров (talk) 17:39, 9 December 2020 (UTC)
Oppose It seems like this would be really big and difficult to maintain; there'll be complaints about difficulty managing it next. —The preceding unsigned comment was added by Tyrekecorrea (talk • contribs) 19:03, 9 December 2020 (UTC)
Oppose Requires a lot of code writing and more complex skills. If not required, then the whole idea would take years and years to complete. I don't think the TechCom would reach parity with Google in one year. Ah, I see that filtering is considered. However, Google's firing of an AI ethicist who had challenged algorithm bias makes me doubt that the filtering would be effective in eliminating algorithm bias, even if it filters out offensive words. Also, I can imagine many more VP discussions and Phab tickets, like this one. Furthermore, I don't think implementing this idea as an option for registered users would erase the idea's potential problems. George Ho (talk) 21:54, 9 December 2020 (UTC)
Support dwf² (talk) 23:07, 9 December 2020 (UTC)
Support Anaxial (talk) 18:57, 11 December 2020 (UTC)
Support Wanax01 (talk) 16:02, 12 December 2020 (UTC)
Support SkSlick (talk) 19:41, 12 December 2020 (UTC)
Support Kew Gardens 613 (talk) 02:48, 13 December 2020 (UTC)
Strong oppose for privacy reasons and attack vector reasons, but this is somewhat academic in that I expect this proposal would be rejected as infeasible/too high risk even if it were #1 on the wishlist. To quote Deskana from phab:T8373#1856037, which they bring up above: "Search data, and in particular the search queries that users enter, is assumed to contain personally identifying information unless proven otherwise." I actually had this concern before reading their words: what happens when I go onto an obscure wiki and paste something in the search bar, thinking it's the text I just tried to copy, but the copy failed and the last thing on my clipboard is my name and contact info? It is not unreasonable to think this could end up on the list for any reasonably obscure wiki and reasonably long list. Or if we set a threshold for how many times something has to be searched for, we're still storing doxxing info from malicious actors who immediately realise that this feature is a good attack vector. Or even if the information is not personally identifying, as soon as undeclared paid editing companies learn about this they can use bots and other behaviour to make terms appear frequently enough that we'll do their job for them and write about something that otherwise wouldn't get an article (maybe even one that's notable, but it's the fact that someone is duplicitously bypassing our normal process for financial gain that's an issue). — Bilorv (talk) 00:09, 14 December 2020 (UTC)
Support This will make Wikipedia more useful over time. The use of the term "common search terms" means that this will NOT be a threat to privacy!!! And picking those up and guiding them to good pages will be great for making us more accessible. As Kevin Kelly wrote in Out of Control, "Honor your errors". If people keep making the same mistake, they will keep making it, so, make it not a mistake or... Bodysurfinyon (talk) 02:43, 14 December 2020 (UTC)
Support great way to improve wiki content/relevant search terms within articles/titles Philiptdotcom (talk) 13:42, 14 December 2020 (UTC)
Support in theory, only if the data can be sanitized of any privacy concerns. — SMcCandlish ☺ ☏ ¢ >^ʌⱷ҅_ᴥⱷ^ʌ< 08:08, 15 December 2020 (UTC)
Oppose per Bilorv. — Épico (talk)/(contribs) 00:00, 17 December 2020 (UTC)
Support Rhymes (talk) 18:22, 17 December 2020 (UTC)
Support Mmitchell10 (talk) 20:10, 18 December 2020 (UTC)