Community Wishlist Survey 2019/Search/Linksearch overhaul

Linksearch overhaul

  • Problem:
    • Protocol-specific: Currently, I have to run two separate link searches to cover both secure and non-secure links, e.g. Special:Linksearch/*.example.com and Special:Linksearch/https://*.example.com.
    • I can't filter by namespace unless I use the API, and even then it is clunky because the filtering is done in PHP rather than in MariaDB.
    • I can't perform more complicated searches (e.g. blogspot.*).
    • Result set size limitations.
  • Who would benefit: Anyone who uses this special page.
  • Proposed solution:
    • Eliminate the technical debt in the externallinks table to make queries faster.
    • Separate the domain and protocol out as distinct database columns. Make these queryable.
    • Add a proper namespace filter.
    • Return results for all protocols when no protocol is specified.
    • Make all of these improvements available through the API.
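
To make the proposed behaviour concrete, here is a minimal sketch of what storing protocol and domain separately would allow. It is an illustration under assumptions, not MediaWiki's schema: `LinkRow`, `make_row` and `search` are hypothetical names, and an in-memory list stands in for the externallinks table.

```python
from typing import List, NamedTuple, Optional
from urllib.parse import urlsplit


class LinkRow(NamedTuple):
    """Hypothetical row with protocol and domain stored separately."""
    page: str
    namespace: int
    protocol: str
    domain: str
    path: str


def make_row(page: str, namespace: int, url: str) -> LinkRow:
    """Split a URL into the separate columns the proposal asks for."""
    parts = urlsplit(url)
    return LinkRow(
        page,
        namespace,
        parts.scheme.lower(),
        (parts.hostname or "").lower(),
        parts.path or "/",
    )


def search(
    rows: List[LinkRow],
    domain: str,
    protocol: Optional[str] = None,
    namespace: Optional[int] = None,
) -> List[LinkRow]:
    """Match the domain and its subdomains; None means 'any' for the filters."""
    hits = []
    for row in rows:
        domain_ok = row.domain == domain or row.domain.endswith("." + domain)
        protocol_ok = protocol is None or row.protocol == protocol
        namespace_ok = namespace is None or row.namespace == namespace
        if domain_ok and protocol_ok and namespace_ok:
            hits.append(row)
    return hits


rows = [
    make_row("Foo", 0, "http://www.example.com/a"),
    make_row("Talk:Foo", 1, "https://example.com/b"),
    make_row("Bar", 0, "https://example.org/"),
]
```

With these rows, `search(rows, "example.com")` covers both http and https in one call, while `search(rows, "example.com", namespace=0)` narrows the result to main-namespace pages.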

Discussion

Some notes:

  • Filtering by namespace would require creating and populating a column in the database and adding appropriate indexes. gerrit:163470 might be relevant there.
  • Separating the protocol from the rest of the URL would similarly require database changes. It still might not be possible to request "http OR https", just "http", "https", or "any protocol".
  • More complicated searches on the domain/path are rather unlikely, as efficient SQL search of text columns is generally limited to prefixes (or depends on methods that are heavily database-dependent; note MediaWiki actively supports three database engines and sort-of supports two more).
  • There's also the fact that searching for links to internationalized domain names (IDNs) means you have to try both the encoded and IDN version. That'll be fixed by gerrit:322729, if it eventually gets merged.
  • There's also the fact that searching for links using IPs doesn't work very well. That too will be fixed by gerrit:322729.
  • If by "eliminate the technical debt [...] to make queries faster" you're referring to the fact that it gets slower as you page through the results and there's therefore a limit on the special page, that should be fixed by gerrit:322729 too. Although that patch doesn't actually remove the special page's limit.

Anomie (talk) 14:59, 30 October 2018 (UTC)
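
The point about prefix-only searching can be illustrated with a small sketch. It assumes, for illustration only, that host names are indexed with their labels reversed, so that a leading wildcard becomes a plain prefix match; `index_key` and `prefix_for_wildcard` are hypothetical helpers, not MediaWiki functions.

```python
def index_key(host: str) -> str:
    """Reverse the labels: 'www.example.com' -> 'com.example.www.'"""
    return ".".join(reversed(host.lower().split("."))) + "."


def prefix_for_wildcard(pattern: str) -> str:
    """Turn '*.example.com' into the searchable prefix 'com.example.'.

    Only a single leading '*.' can be expressed as a prefix of the
    reversed host; any other wildcard position raises ValueError.
    """
    if not pattern.startswith("*."):
        raise ValueError("only a leading '*.' wildcard maps to a prefix")
    rest = pattern[2:]
    if "*" in rest:
        raise ValueError("wildcard is not at the start of the host")
    return index_key(rest)


prefix = prefix_for_wildcard("*.example.com")
matches = [
    host
    for host in ["www.example.com", "example.com", "notexample.com"]
    if index_key(host).startswith(prefix)
]
```

Here `matches` contains `www.example.com` and `example.com` but not `notexample.com`. A pattern such as `blogspot.*` raises `ValueError`, because the unknown part would sit at the front of the reversed key and no fixed prefix exists to look up.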

  • @MER-C: Just a ping to let you know about the comments above. Possibly they could help you clarify your proposal before the voting begins. (With 200+ ideas, clear proposals make everyone's life easier!) Quiddity (WMF) (talk) 01:30, 8 November 2018 (UTC)
  • There is a really powerful workaround for linksearch and that's just to use Special:Search insource. It handles arbitrary URL schemes, can be filtered by namespace, has wildcards if you know a little regex, and has no result size limit. I might even advocate we eliminate the technical debt by removing the Special:LinkSearch page and advocate Special:Search instead. --Izno (talk) 23:57, 8 November 2018 (UTC)
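
As a rough sketch of the insource workaround, the snippet below builds a Special:Search URL with a regex insource query restricted to chosen namespaces. The base URL is only an example wiki, `insource_query` and `search_url` are hypothetical helpers, and the escaping is Python's, which is assumed here to be close enough to the search engine's regex dialect for simple domain names.

```python
import re
from typing import Iterable
from urllib.parse import urlencode


def insource_query(domain: str) -> str:
    """Build an insource regex query that matches the literal domain."""
    # Escape regex metacharacters, plus "/" which delimits the regex.
    escaped = re.escape(domain).replace("/", r"\/")
    return f"insource:/{escaped}/"


def search_url(
    domain: str,
    namespaces: Iterable[int] = (0,),
    base: str = "https://en.wikipedia.org/w/index.php",  # example wiki only
) -> str:
    """Return a Special:Search URL limited to the given namespaces."""
    params = [("title", "Special:Search"), ("search", insource_query(domain))]
    # Each selected namespace is passed as nsN=1.
    params += [(f"ns{ns}", "1") for ns in namespaces]
    return base + "?" + urlencode(params)


url = search_url("example.com", namespaces=(0, 10))
```

Because the query matches wikitext rather than a protocol-specific index, the same search finds http and https links alike.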

As an alternative to Anomie's suggestions, this can also be done in the Elasticsearch cluster. We already index the external_links for every page on every wiki; they are just not analyzed in a way that is useful for this type of search. It is certainly possible to run analysis on the external links to create sub-fields like domain name, URL pieces, etc., and search against those. Additionally, we could probably expose regex on external_links if it was a common enough request. EBernhardson (WMF) (talk) 15:33, 9 November 2018 (UTC)
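
A sketch of the kind of sub-fields such analysis might produce is shown below. This is plain Python rather than an Elasticsearch analyzer configuration, and the field names (`scheme`, `domains`, `path_segments`) are assumptions made for illustration.

```python
from typing import Dict, List, Union
from urllib.parse import urlsplit


def link_subfields(url: str) -> Dict[str, Union[str, List[str]]]:
    """Derive searchable pieces from one external link."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Emit the host and every parent domain, so a search for
    # "example.com" also finds links to "www.example.com".
    domains = [".".join(labels[i:]) for i in range(len(labels))]
    # Non-empty path components, in order.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "scheme": parts.scheme.lower(),
        "domains": domains,
        "path_segments": segments,
    }


fields = link_subfields("https://www.example.com/a/b")
```

Indexing every parent domain as its own token is one way to make a subdomain search an exact term lookup instead of a wildcard match.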

I note that Special:LinkSearch is a core MediaWiki feature, while use of ElasticSearch is optional and requires an extension. We should keep in mind usability by non-Wikimedia wikis. Anomie (talk) 19:04, 9 November 2018 (UTC)
  • Adding a namespace filter to linksearch has been on my top five list ever since it was briefly implemented then withdrawn. It is massively useful. You can do it from AWB; I have on occasion resorted to using AWB to process a list and then bringing the list back to enWP to fix. PLEASE do this! JzG (talk) 00:31, 16 November 2018 (UTC)

Voting