Toolhub/Progress reports/2021-02-19

Report on activities in the Toolhub project for the week ending 2021-02-19.

API for history and diff of toolinfo records

Tracked in Phabricator:
Task T271371

We have merged a patch adding API endpoints for revert and undo. The API now exposes these new endpoints:

  • POST /api/tools/{tool_name}/revisions/{id}/revert/ - Restore this revision
  • POST /api/tools/{tool_name}/revisions/{id}/undo/{other_id}/ - Undo changes made between {id} and {other_id}

The undo endpoint may return an HTTP 409 Conflict status code if the computed diff cannot be applied to the current lead revision. A common cause of this would be attempting to remove an array item that has already been removed by an intermediate edit.
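As a rough client-side sketch of the endpoints above, the following builds the two URLs and interprets the status code returned by an undo request. The base URL, the helper names, and the injected `post` callable are assumptions made for illustration; only the URL paths and the 409 Conflict behaviour come from this report. Authentication and the actual HTTP transport are left to the caller.

```python
from typing import Callable, Tuple
from urllib.parse import quote

# Hypothetical base URL; substitute the address of a real Toolhub instance.
BASE_URL = "https://toolhub.example.org"


def revert_url(tool_name: str, rev_id: int) -> str:
    """Build the URL that restores revision `rev_id` of a tool."""
    name = quote(tool_name, safe="")
    return f"{BASE_URL}/api/tools/{name}/revisions/{rev_id}/revert/"


def undo_url(tool_name: str, rev_id: int, other_id: int) -> str:
    """Build the URL that undoes changes made between two revisions."""
    name = quote(tool_name, safe="")
    return f"{BASE_URL}/api/tools/{name}/revisions/{rev_id}/undo/{other_id}/"


def undo(
    post: Callable[[str], int], tool_name: str, rev_id: int, other_id: int
) -> Tuple[bool, str]:
    """Send an undo request and summarize the outcome.

    `post` is any callable that performs an HTTP POST to the given URL
    and returns the numeric status code (an assumption of this sketch).
    """
    status = post(undo_url(tool_name, rev_id, other_id))
    if status == 409:
        # The computed diff could not be applied to the current revision.
        return False, "conflict: diff no longer applies"
    if 200 <= status < 300:
        return True, "undo applied"
    return False, f"unexpected status {status}"
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and makes the 409 branch easy to exercise without a live server.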

Faceted search

The third and final major set of functionality in the planned work for Toolhub in the January-March quarter is faceted search. For us, faceted search means building a search system that reports classification information about various shared dimensions (facets) of the documents matched by a search. Facets help the user understand the "shape" of the data by showing how values are distributed across each dimension. As an example, imagine searching for the keyword "patrolling" and getting in return not only the paginated list of toolinfo records containing that string, but also counts of how many of the found tools are meant to work on enwiki, frwiki, commons, etc. This additional data can simply inform the user, but it can also be used to "refine" the search by making it simple to run the search again with an added constraint to only match tools that work on hiwiki or "any wiki".
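The idea can be illustrated with a small in-memory sketch. The records, field names, and functions below are made up for this example and are not Toolhub's data model or implementation; they only show what "hits plus per-wiki counts" and "refining by a facet value" mean.

```python
from collections import Counter
from typing import Dict, List, Optional

# Made-up records for illustration only; field names are assumptions.
TOOLS = [
    {"name": "patrol-helper", "description": "Assists with patrolling new pages",
     "wikis": ["enwiki", "frwiki"]},
    {"name": "commons-patrol", "description": "Patrolling queue for uploads",
     "wikis": ["commons"]},
    {"name": "hi-patrol", "description": "Patrolling dashboard",
     "wikis": ["hiwiki", "enwiki"]},
    {"name": "citation-bot", "description": "Fixes citations",
     "wikis": ["enwiki"]},
]


def search(keyword: str, wiki: Optional[str] = None) -> List[dict]:
    """Return tools whose description contains `keyword`.

    When `wiki` is given, keep only tools that list that wiki.
    """
    hits = [t for t in TOOLS if keyword.lower() in t["description"].lower()]
    if wiki is not None:
        hits = [t for t in hits if wiki in t["wikis"]]
    return hits


def facet_counts(hits: List[dict]) -> Dict[str, int]:
    """Count how many of the matched tools target each wiki."""
    counts: Counter = Counter()
    for tool in hits:
        counts.update(tool["wikis"])
    return dict(counts)
```

Calling `facet_counts(search("patrolling"))` gives the distribution across wikis for the matched tools, and `search("patrolling", wiki="hiwiki")` reruns the same search with the facet value applied as a constraint.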

Bryan has started work on setting up the backend application to index toolinfo records in a full text search engine (specifically Elasticsearch) and create APIs to actually search the indexed data.
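For readers unfamiliar with Elasticsearch, a faceted query is typically expressed as a full text query combined with a `terms` aggregation. The sketch below builds such a request body as a plain dictionary. The field names (`title`, `description`, `for_wikis`) and the function name are assumptions for illustration, and the aggregated field is assumed to be mapped as a keyword; this is not the index layout or API that Toolhub has settled on.

```python
from typing import Any, Dict, Optional


def build_search_body(
    keyword: str, wiki: Optional[str] = None, facet_size: int = 10
) -> Dict[str, Any]:
    """Build an Elasticsearch request body with one facet.

    All field names here are hypothetical placeholders.
    """
    # Full text match against the assumed text fields.
    must = [{"multi_match": {"query": keyword,
                             "fields": ["title", "description"]}}]
    # Optional refinement: restrict hits to a single wiki.
    filters = [{"term": {"for_wikis": wiki}}] if wiki else []
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        # The terms aggregation returns per-value document counts,
        # which is the data a facet display needs.
        "aggs": {"wikis": {"terms": {"field": "for_wikis",
                                     "size": facet_size}}},
    }
```

Keeping the body construction separate from the client call makes it straightforward to inspect or test the query without a running cluster.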

Some readers may be aware that the Elasticsearch project recently announced new licensing terms which include a move away from the Apache License 2.0 Open Source license to the proprietary Server Side Public License. This development is under investigation by the Wikimedia Foundation's engineering and legal teams to determine the future of Elasticsearch for Wikimedia projects. Moving forward with a new use of Elasticsearch while that investigation is underway carries some risk that the outcome will be a recommendation to replace Elasticsearch with another technology. At the same time, not moving forward on this much-desired feature for Toolhub carries a risk that we will not be able to implement any advanced search in time for our planned initial release in June. At this time it seems to be in the best interest of both Toolhub and the Wikimedia community to move forward with Toolhub's use of Elasticsearch while also staying informed of the ongoing investigation. This is predicated on the assumption that, if re-platforming is necessary in the future, that work will be smaller than the work needed to implement the core functionality.

Some investigation was done to look for a well-maintained abstraction layer for Django that could isolate Toolhub from the details of Elasticsearch as a backend for full-featured search. The Haystack library was identified as just such a project. Unfortunately, deeper study revealed that Haystack's support for Elasticsearch as a backend stagnated several years ago, and the project currently has no lead maintainer for Elasticsearch issues.

Small changes

  • phab:T273943 Convert crawler ValidationError to PermissionDenied
  • phab:T274019 JSONSchemaValidator: update message and add friendly code
  • phab:T274020 Replace Tool id with name in auditlog api response
  • gerrit:663916 Add type hint for oauth2's AuthorizationViewSet

Wrap up

Starting the faceted search work this week has involved a lot of relearning for Bryan. He has worked on several Elasticsearch-powered projects in the past, but the most recent of those was several years ago. Luckily, the choice of Django for the backend tech stack continues to bring good value. Upstream libraries have been found to handle some of the most complicated work: mirroring Django models into Elasticsearch and exposing search via API actions. With this help we expect the backend API to reach a usable state rather quickly, which will give us more time to refine the API as the user interface is built to use it.