Grants:IEG/SOLRSearch

status: idea

Individual Engagement Grants

project:

SOLRSearch


project contact:

jmcclure@hypergrove.com

participants:


John McClure



summary:

  • (client) Faceted search of SOLR Repositories
  • (server) SOLRProxy feeds to SOLR Repositories
  • Special:SOLRAdmin page





2013 round 1

Background

The Wikimedia Foundation deploys Lucene-search and MWSearch as primary components for its search function. A Mediawiki site may also add the OAIRepository extension to index pages' raw wikitext using seven terms from the Dublin Core vocabulary to capture a page's title, revision author & date, persistent URL, and mime-type. The Lucene-search extension builds upon the Apache Lucene Core API to perform key functions in a manner specific to Mediawiki architecture (e.g., rank pages based on backlinks, term proximity, relatedness & anchor text; distribute searching and indexing; and incremental updates). For API-based searches, API:Search & API:OpenSearch allow one to specify strings to locate within either titles or wikitext objects.

These extensions & APIs were first developed in 2008; important libraries have since been published:

  • Apache-Solr is a standalone server that extends Lucene servers with a REST-like API to deliver results encoded in XML, JSON, CSV or binary. This library provides hit highlighting, faceted search, caching, replication, and a web admin interface. A sample host-based faceted search is here.
  • Solarium PHP library - interface to SOLR servers for PHP clients, capable of all read & write operations for indexed documents. Solarium allows for three modes of usage: (a) API calls (b) hook extensions (c) configuration directives. These modes can be mixed, and queries can be inherited as baseline for other query specifications.
  • Ajax/Solr library - interface to SOLR servers from Javascript clients. This library relieves hosts (such as Mediawiki) of most search-related processing by off-loading requests & results formatting to the javascript client. Cross-domain Solr host requests are also possible with this library. A sample browser-based faceted search is here.
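As a rough illustration of the REST-like API mentioned above, the sketch below builds a Solr select URL that asks for JSON results, facet counts and hit highlighting. The parameters `q`, `wt`, `rows`, `facet`, `facet.field` and `hl` are standard Solr query parameters; the host, the core name `wiki` and the field names `dc_creator` and `dc_format` are assumptions made only for this example.

```python
from urllib.parse import urlencode


def build_facet_query_url(base_url, text, facet_fields, rows=10):
    """Return a Solr /select URL requesting JSON results, facets and highlighting."""
    params = [
        ("q", text),
        ("wt", "json"),      # response format
        ("rows", str(rows)),
        ("facet", "true"),   # turn faceting on
        ("hl", "true"),      # turn hit highlighting on
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core named "wiki"
url = build_facet_query_url(
    "http://localhost:8983/solr/wiki",
    "faceted search",
    ["dc_creator", "dc_format"],
)
print(url)
```

The function only builds the URL; a PHP client such as Solarium or a JavaScript client such as Ajax/Solr would issue an equivalent request on the wiki's behalf.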

Project idea

The SOLRSearch extension provides an opportunity to convert Mediawiki installations into universal SOLR clients, able to simultaneously perform faceted searches across multiple Solr repositories. This is particularly useful for geo-spatial search, ZooKeeper cluster management, or the use of Mediawiki-maintained data for auto-suggestion.
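One way to picture a simultaneous faceted search across several repositories is to merge the facet counts each repository returns. The sketch below assumes Solr's default JSON response layout, in which each facet field is a flat list of alternating values and counts; the two sample responses and the field name `dc_format` are invented for illustration and do not describe the extension's actual design.

```python
from collections import Counter


def facet_pairs(flat):
    """Convert a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def merge_facets(responses, field):
    """Sum the counts of one facet field across several Solr JSON responses."""
    totals = Counter()
    for response in responses:
        flat = response.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
        for value, count in facet_pairs(flat):
            totals[value] += count
    # Highest combined count first
    return totals.most_common()


# Hypothetical responses from two repositories
repo_a = {"facet_counts": {"facet_fields": {"dc_format": ["text/x-wiki", 12, "image/png", 3]}}}
repo_b = {"facet_counts": {"facet_fields": {"dc_format": ["text/x-wiki", 5, "application/pdf", 2]}}}

merged = merge_facets([repo_a, repo_b], "dc_format")
print(merged)
```

Summing counts is only meaningful when the repositories share a vocabulary for the field being faceted, which is why the proposal ties each wiki's indexes to the vocabulary of its SOLR servers.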

The proposed extension includes a "SOLRProxy" by which a wiki can index its pages in one or more SOLR repositories located on the same or different host as the Mediawiki daemon. SOLRProxy modernizes mw:Lucene-search as SOLRProxy is based specifically on the Solarium PHP library.

Indexes applicable to a wikipage are consistent with the vocabulary of each SOLR server attached to a wiki. en:Dublin Core terms, such as those used by mw:Extension:OAIRepository, can be installed in any SOLR repository. Further, a SOLR server's vocabulary (and the indexes associated with a page) may be synchronized with either Wikidata or Semantic Mediawiki properties recorded for the page.
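The sketch below shows, under stated assumptions, how a proxy might describe a page with Dublin Core style fields and prepare the same document for several repositories. Posting a JSON array of documents to a core's `/update` handler with `commit=true` is standard Solr behaviour; the field names (`dc_title`, `dc_creator` and so on), the keys of the page record and the repository URLs are hypothetical, and the code only builds the requests without sending them.

```python
import json


def page_to_solr_doc(page):
    """Map a wiki page record onto Dublin Core style Solr fields (names are assumed)."""
    return {
        "id": page["url"],
        "dc_title": page["title"],
        "dc_creator": page["revision_author"],
        "dc_date": page["revision_date"],
        "dc_identifier": page["url"],
        "dc_format": page.get("mime_type", "text/x-wiki"),
    }


def build_update_requests(page, repositories):
    """Return one (url, json_body) pair per repository; nothing is sent over the network."""
    body = json.dumps([page_to_solr_doc(page)])
    return [(f"{repo.rstrip('/')}/update?commit=true", body) for repo in repositories]


# Hypothetical page record and repository URLs
page = {
    "url": "https://example.org/wiki/Main_Page",
    "title": "Main Page",
    "revision_author": "ExampleUser",
    "revision_date": "2013-02-14T00:00:00Z",
}
requests_to_send = build_update_requests(
    page,
    ["http://localhost:8983/solr/wiki", "http://solr.example.org/solr/shared/"],
)
for url, body in requests_to_send:
    print(url, body)
```

In the proposed extension this role would fall to SOLRProxy using the Solarium PHP library; Python is used here only to keep the example short.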

Benefits

Because Solr builds on the Lucene library for full-text search and adds faceted navigation, hit highlighting and similar features, these functions can be offloaded from Mediawiki core code. Because queries may be structured as well as textual, SMW and Wikidata should become more searchable. JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats are supported over HTTP, so no custom code would be needed to integrate Mediawiki with any of these in future.

The Solr HTML administration interface can be relatively easily integrated and managed alongside MediaWiki.

Replication, distributed search through Sharding, search results clustering (Carrot2), and plugins, plus the ability to embed Solr in any Java application (it exists as a TYPO3 extension), make MediaWiki easier to deploy within a Java shop generally, as it can be just as robust a persistent store as any other database.

Project goals

The initial goal of this project is to rewrite the SOLRSearch extension so that it works with current MW versions and architectures, using the Apache/SOLR, Solarium and Ajax/SOLR libraries. Deliverables include:

  • SOLR Admin: a special page to inspect & manage the Mediawiki/SOLR interface
  • SOLR API: a component providing API visibility to the SOLR server(s) indexing a wiki's pages
  • SOLR Client: a "faceted browser" of the contents of SOLR repositories
  • SOLR Help: extension design, installation & usage instructions
  • SOLR Proxy: a component to push selected indexed wiki-pages to any number of SOLR repositories
  • SOLR Table: a database table to identify SOLR repositories available to the SOLR Client

Part 2: The Project Plan

Project plan

Scope:

Scope and activities

  1. Identify & estimate timing of key project milestones
  2. Replace Prototype.js with JQuery library
  3. Replace internal SOLR interface with Solarium calls
  4. Implement Special:SOLRAdmin
  5. Modify client code to be standalone from SOLRProxy code
  6. Modify client & proxy code for multiple SOLR servers
  7. Install selected standard vocabularies
  8. Build database of publicly accessible SOLR servers
  9. Draft project, installation and user-level documentation

Tools, technologies, and techniques

  1. Feedback on Requirements & Design Document (deliverable)
  2. Code reviews

Budget:

Total amount requested

  • $24K: based on a four month development estimate, at $6K/month.

Budget breakdown

  • developer compensation

Intended impact:

Target audience

  • all Wikipedia users & visitors
  • independent custom search engine developers

Fit with strategy

  • to qualitatively improve mediawiki's current search function
  • to brand wikis as universal faceted search browsers, better than Google
  • to encourage wiki page metadata/indexing using common vocabularies

Sustainability

  • closely align client with complete SOLR/Javascript functionality
  • enable clients to search & compare the vocabularies of multiple SOLR repositories
  • provide a reasonable interface with Wikidata & SMW Property names

Measures of success

  • number of bugs should be minimal

Participant(s)

  • John McClure is an independent programmer with 35 years of experience and a solid understanding of the technologies involved in this project. As a former "Ontoprise Ambassador" he recently organized a community of technicians & others concerned with the preservation & reuse of open-source code created by the now-defunct company Ontoprise.

Part 3: Community Discussion

Discussion

Community Notification:

  • wikidata-l@lists.wikimedia.org
  • semediawiki-user@lists.sourceforge.net
  • mediawiki-enterprise@lists.wikimedia.org

Endorsements:

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

  • Community member: add your name and rationale here.
  • Many appealingly practical elements to massively organize Wiki data and greatly enhance search abilities of many sites. Vid (talk) 00:00, 14 February 2013 (UTC)