The Wikimedia Foundation deploys Lucene-search and MWSearch as primary components for its search function. A Mediawiki site may also add the OAIRepository extension to index pages' raw wikitext using seven terms from the Dublin Core vocabulary to capture a page's title, revision author & date, persistent URL, and mime-type. The Lucene-search extension builds upon the Apache Lucene Core API to perform key functions in a manner specific to Mediawiki architecture (e.g., rank pages based on backlinks, term proximity, relatedness & anchor text; distribute searching and indexing; and incremental updates). For API-based searches, API:Search & API:OpenSearch allow one to specify strings to locate within either titles or wikitext objects.
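As a rough illustration of an API-based search, the sketch below builds an API:Search request URL that targets either titles or wikitext. The host `wiki.example.org` and the helper name `build_search_url` are assumptions made for this example; the `srsearch`, `srwhat` and `srlimit` parameters are those documented for API:Search, though the values a given wiki accepts may depend on its search backend.

```python
from urllib.parse import urlencode


def build_search_url(api_base, text, where="text", limit=10):
    """Return a MediaWiki API:Search URL for the given search string.

    `where` selects the search target: "title" or "text" (page wikitext).
    `api_base` is the wiki's api.php endpoint (hypothetical in this sketch).
    """
    if where not in ("title", "text"):
        raise ValueError("where must be 'title' or 'text'")
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,   # the string to locate
        "srwhat": where,    # search titles or wikitext
        "srlimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("https://wiki.example.org/w/api.php",
                           "faceted search", where="title"))
```

The function only constructs the URL; fetching it and decoding the JSON response is left to the caller.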
These extensions & APIs were first developed in 2008; important libraries have since been published:
- Apache-Solr is a standalone server that extends Lucene servers with a REST-like API to deliver results encoded in XML, JSON, CSV or binary. This library provides hit highlighting, faceted search, caching, replication, and a web admin interface. A sample host-based faceted search is here.
- Solarium PHP library - interface to SOLR servers for PHP clients, capable of all read & write operations for indexed documents. Solarium allows for three modes of usage: (a) API calls (b) hook extensions (c) configuration directives. These modes can be mixed, and queries can be inherited as baseline for other query specifications.
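To make the REST-like API described above more concrete, here is a minimal sketch of a faceted Solr query expressed as a plain URL. The core URL and the field names `dc_creator` and `dc_format` are assumptions for illustration; `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters. It is written in Python rather than PHP purely to keep the example short, and it is not Solarium code.

```python
from urllib.parse import urlencode


def build_facet_query(solr_core_url, query, facet_fields, rows=10):
    """Return a Solr /select URL that asks for JSON results plus facet counts.

    `solr_core_url` is the base URL of one Solr core (hypothetical here).
    `facet_fields` lists the indexed fields to facet on.
    """
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),      # response encoding
        ("facet", "true"),   # switch faceting on
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return solr_core_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_facet_query("http://localhost:8983/solr/wiki",
                            "title:lucene", ["dc_creator", "dc_format"]))
```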
The SOLRSearch extension provides an opportunity to convert Mediawiki installations into universal SOLR clients, able to simultaneously perform faceted searches across multiple Solr repositories. This is particularly useful for geo-spatial search, ZooKeeper cluster management, or the use of Mediawiki-maintained data for auto-suggestion.
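One way a client could combine faceted results from several repositories is to sum the facet counts each server returns. The sketch below assumes each response is an already-decoded Solr JSON response using the default flat facet layout (`[value, count, value, count, ...]`); the function names and the sample field `dc_language` are hypothetical.

```python
from collections import Counter


def parse_flat_facets(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def merge_facets(responses, field):
    """Sum the facet counts for `field` across several Solr JSON responses."""
    total = Counter()
    for response in responses:
        flat = (response.get("facet_counts", {})
                        .get("facet_fields", {})
                        .get(field, []))
        total.update(parse_flat_facets(flat))  # adds counts per facet value
    return total


if __name__ == "__main__":
    repo_a = {"facet_counts": {"facet_fields": {"dc_language": ["en", 3, "de", 1]}}}
    repo_b = {"facet_counts": {"facet_fields": {"dc_language": ["en", 2, "fr", 4]}}}
    print(merge_facets([repo_a, repo_b], "dc_language"))
```

Summing counts is only meaningful when the repositories share a vocabulary for the faceted field; repositories with differing vocabularies would need a mapping step first.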
The proposed extension includes a "SOLRProxy" by which a wiki can index its pages in one or more SOLR repositories located on the same or different host as the Mediawiki daemon. SOLRProxy modernizes mw:Lucene-search as SOLRProxy is based specifically on the Solarium PHP library.
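The push side of such a proxy might look roughly like the following, which prepares one HTTP update request per repository without sending it. This is a sketch under stated assumptions, not the SOLRProxy design: the repository URLs are hypothetical, and it posts a JSON array of documents to each core's `/update` handler with `commit=true`, as current Solr versions accept. The real extension would go through Solarium in PHP.

```python
import json
from urllib.request import Request


def build_update_requests(repo_urls, docs):
    """Prepare one Solr update request per repository for the same documents.

    `repo_urls` are base URLs of Solr cores (hypothetical in this sketch).
    `docs` is a list of dicts, one per wiki page to index.
    The requests are returned unsent so the caller decides when to push.
    """
    body = json.dumps(docs).encode("utf-8")
    return [
        Request(
            url.rstrip("/") + "/update?commit=true",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for url in repo_urls
    ]


if __name__ == "__main__":
    requests = build_update_requests(
        ["http://localhost:8983/solr/wiki", "http://solr.example.org/solr/wiki"],
        [{"id": "Main_Page", "text": "Welcome"}],
    )
    for request in requests:
        print(request.get_method(), request.full_url)
```

Each prepared request can be sent with `urllib.request.urlopen(request)`; error handling and retries per repository are left out of the sketch.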
Indexes applicable to a wikipage are consistent with the vocabulary of each SOLR server attached to a wiki. en:Dublin Core terms, such as those used by mw:Extension:OAIRepository, can be installed in any SOLR repository. Further, a SOLR server's vocabulary (and the indexes associated with a page) may be synchronized with either Wikidata or Semantic Mediawiki properties recorded for the page.
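To illustrate how a page could be indexed against a Dublin Core vocabulary, the sketch below maps a page record onto a flat document. The Solr field names (`dc_title`, `dc_creator`, and so on) and the keys of the input record are assumptions for this example; only the underlying Dublin Core elements (title, creator, date, identifier, format) come from the standard vocabulary, and the exact terms a given repository installs may differ.

```python
def page_to_solr_doc(page):
    """Map a wiki page record onto a flat, Dublin Core style Solr document.

    `page` is a dict with hypothetical keys: title, url, revision_author,
    revision_timestamp, wikitext and, optionally, mime_type.
    """
    return {
        "id": page["url"],                        # unique key for the index
        "dc_title": page["title"],
        "dc_creator": page["revision_author"],
        "dc_date": page["revision_timestamp"],
        "dc_identifier": page["url"],             # persistent URL
        "dc_format": page.get("mime_type", "text/x-wiki"),
        "text": page["wikitext"],                 # raw wikitext for full-text search
    }


if __name__ == "__main__":
    doc = page_to_solr_doc({
        "title": "Main Page",
        "url": "https://wiki.example.org/wiki/Main_Page",
        "revision_author": "ExampleUser",
        "revision_timestamp": "2013-02-14T00:00:00Z",
        "wikitext": "Welcome to the wiki.",
    })
    print(doc["dc_title"], doc["dc_format"])
```

A synchronization with Wikidata or Semantic Mediawiki properties could extend the same mapping with one field per property, but that is beyond this sketch.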
Because Solr uses the Lucene library for full-text search and provides faceted navigation, hit highlighting, etc., it is possible to offload these functions from MediaWiki core code. Because the queries may be structured as well as textual, SMW and Wikidata should be more searchable. JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats over HTTP are supported, so there would be no need for custom code to integrate MediaWiki with any of these in the future.
The Solr HTML administration interface can be relatively easily integrated and managed alongside MediaWiki.
Replication, distributed search through Sharding, search results clustering (Carrot2), and plugins, plus the ability to embed Solr in any Java application (it exists as a TYPO3 extension), make MediaWiki easier to deploy within a Java shop generally, as it can be just as robust a persistent store as any other database.
The initial goal of this project is to rewrite the SOLRSearch extension to be operable with current MW versions and architectures, using the Apache/SOLR, Solarium and Ajax/SOLR libraries. Deliverables include:
- SOLR Admin
- a special page to inspect & manage the Mediawiki/SOLR interface
- SOLR API
- a component providing API visibility to the SOLR server(s) indexing a wiki's pages
- SOLR Client
- a "faceted browser" of the contents of SOLR repositories
- SOLR Help
- extension design, installation & usage instructions
- SOLR Proxy
- a component to push selected indexed wiki-pages to any number of SOLR repositories
- SOLR Table
- a database table to identify SOLR repositories available to the SOLR Client
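The "SOLR Table" deliverable could be as simple as a registry of repository names and URLs. The sketch below uses an in-memory SQLite database only to keep the example self-contained; the table name `solr_repository` and its columns are hypothetical, and a MediaWiki extension would define the table through the wiki's own database layer instead.

```python
import sqlite3

# Hypothetical schema: one row per SOLR repository known to the client.
SCHEMA = """
CREATE TABLE solr_repository (
    repo_id      INTEGER PRIMARY KEY,
    repo_name    TEXT NOT NULL UNIQUE,
    repo_url     TEXT NOT NULL,
    repo_enabled INTEGER NOT NULL DEFAULT 1
)
"""


def open_registry():
    """Create an in-memory registry database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def add_repository(conn, name, url, enabled=True):
    """Register a repository; `enabled` controls whether clients may use it."""
    conn.execute(
        "INSERT INTO solr_repository (repo_name, repo_url, repo_enabled) "
        "VALUES (?, ?, ?)",
        (name, url, int(enabled)),
    )


def enabled_repositories(conn):
    """Return (name, url) pairs for repositories available to the client."""
    rows = conn.execute(
        "SELECT repo_name, repo_url FROM solr_repository "
        "WHERE repo_enabled = 1 ORDER BY repo_name"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_registry()
    add_repository(conn, "local", "http://localhost:8983/solr/wiki")
    add_repository(conn, "archive", "http://solr.example.org/solr/old", enabled=False)
    print(enabled_repositories(conn))
```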
Part 2: The Project Plan
Scope and activities
- Identify & estimate timing of key project milestones
- Replace Prototype.js with the jQuery library
- Replace internal SOLR interface with Solarium calls
- Implement Special:SOLRAdmin
- Modify client code to be standalone from SOLRProxy code
- Modify client & proxy code for multiple SOLR servers
- Install selected standard vocabularies
- Build database of publicly accessible SOLR servers
- Draft project, installation and user-level documentation
Tools, technologies, and techniques
- Feedback on Requirements & Design Document (deliverable)
- Code reviews
Total amount requested
- $24K: based on a four-month development estimate, at $6K/month.
- developer compensation
- all wikipedia users & visitors
- independent custom search engine developers
Fit with strategy
- to qualitatively improve mediawiki's current search function
- to brand wikis as universal faceted search browsers, better than Google
- to encourage wiki page metadata/indexing using common vocabularies
- to enable clients to search & compare the vocabularies of multiple SOLR repositories
- to provide a reasonable interface with Wikidata & SMW Property names
Measures of success
- number of bugs should be minimal
- John McClure has been an independent programmer for 35 years and has a solid understanding of the technologies involved in this project. As a former "Ontoprise Ambassador" he recently organized a community of technicians & others concerned with the preservation & reuse of open-source code created by the now-defunct company Ontoprise.
Part 3: Community Discussion
Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.
- Many appealingly practical elements to massively organize Wiki data and greatly enhance search abilities of many sites. Vid (talk) 00:00, 14 February 2013 (UTC)