Research:Link recommendation model for add-a-link structured task/Training Procedure

This page describes the procedure for training the link recommendation model for one or more new Wikipedias.

Step 1: Identify the Phabricator ticket

Identify the Phabricator ticket that describes the deployment task (example: T290011: Deploy Add a link to a third set of wikis).

Step 2: Train the model

Train the model by running the pipeline in the mwaddlink repo.
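
The exact invocation is documented in the repo's README. As a rough sketch of driving the pipeline for several wikis from Python (the entry-point script name and its arguments below are placeholders, not the repo's actual interface):

  import subprocess

  # Wiki IDs of the new Wikipedias the model should be trained for.
  WIKIS = ["dewiki", "viwiki"]

  for wiki_id in WIKIS:
      # Placeholder entry point; substitute the command documented
      # in the mwaddlink README.
      subprocess.run(["./run-pipeline.sh", wiki_id], check=True)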

Step 3: Validate the output

Validate that the pipeline executes successfully. The README.md describes in detail which files should be created at each step of the pipeline and, most importantly, which files should be available at the end.
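
One way to automate this check is to compare the output directory against the file list from the README. A minimal sketch (the expected file names are placeholders; take the authoritative list from the README):

  from pathlib import Path

  def missing_outputs(wiki_id: str, expected: list[str]) -> list[str]:
      """Return the expected pipeline artifacts that are missing for wiki_id."""
      base = Path("data") / wiki_id
      return [name for name in expected if not (base / name).exists()]

  # The backtest evaluation file is one artifact named in Step 4; add the
  # remaining files from the README here.
  missing = missing_outputs("dewiki", ["testing/dewiki.backtest.eval.csv"])
  if missing:
      raise SystemExit(f"Pipeline output incomplete, missing: {missing}")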

Step 4: Validate the backtesting protocol

The pipeline automatically performs a backtesting evaluation of the model. The results can be inspected in "./data/<WIKI_ID>/testing/<WIKI_ID>.backtest.eval.csv".

  • This provides numbers for precision (how many of the suggestions are correct?) and recall (how many of the possible links are captured?); the higher, the better. It reports these numbers for different values of a threshold parameter. This parameter can be chosen freely and allows for tuning the model: increasing it usually makes the model more conservative (higher precision but fewer suggestions). At the moment, the model is used with the default value of 0.5.
  • Create a new sheet in the spreadsheet to capture numbers from different runs (link to spreadsheet)
  • Report a summary of the backtesting results in the Phabricator ticket (see the example in T290011#7376009). Report the range of precision and recall for the threshold in use (e.g. the default 0.5). The aim is to flag models that show very poor performance, indicating that something is wrong.
  • What counts as poor performance? The runs from the previous models are captured in the spreadsheet and offer a good comparison. Most of the models show a precision of 75% (or better) with a recall of 40% (or better), which has been judged satisfactory so far. In some cases (arwiki, bnwiki, kowiki), we observe slightly lower precision of 70-75% and considerably lower recall of 25-30%. The main problem here is not so much the precision (which is almost at the same level as the other wikis) but the low recall, which could mean that we can generate only very few recommendations for the articles in the corresponding wikis. A sketch for automating this comparison follows the list.
  • Overall, if the numbers are much lower than that, this could hint at a problem with the model in that language. Before continuing with the deployment, the model for that language should be investigated further for potential bugs with parsing, etc. For example, for bnwiki the low backtesting numbers revealed that the Bengali script uses the symbol "।" (daṛi) to separate sentences, causing standard sentence tokenizers to fail. In other cases the bug might be less obvious to find.
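
A minimal sketch of this check, assuming the backtest CSV has columns named threshold, precision, and recall (inspect the header of an actual run and adjust accordingly):

  import csv

  WIKI_ID = "dewiki"
  PATH = f"./data/{WIKI_ID}/testing/{WIKI_ID}.backtest.eval.csv"

  with open(PATH, newline="") as f:
      rows = [{k: float(v) for k, v in row.items()} for row in csv.DictReader(f)]

  # Precision/recall at (or closest to) the default threshold of 0.5.
  at_default = min(rows, key=lambda r: abs(r["threshold"] - 0.5))
  print(f"{WIKI_ID}: precision={at_default['precision']:.2f}, "
        f"recall={at_default['recall']:.2f}")

  # Flag models clearly below the levels seen for previous models
  # (roughly 0.75 precision / 0.40 recall, with 0.70 / 0.25 as the
  # lowest values judged acceptable so far).
  if at_default["precision"] < 0.70 or at_default["recall"] < 0.25:
      print("WARNING: performance is much lower than previous models; "
            "investigate parsing/tokenization before deploying.")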

Step 5: Get confirmation to publish datasets

  • Ask for confirmation to publish datasets (see the example in T290011#7376009)

Step 6: Publish the datasets

Publish the model and datasets obtained from the training pipeline.

  • Publish the datasets by running the publish-datasets script in the mwaddlink repo (a sketch of batching this over several wikis follows the list).
  • Report in the Phabricator ticket when the datasets are published.
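
As with training, publication can be driven for several wikis at once. A sketch, assuming the publish-datasets script takes a wiki ID (the script name and arguments below are placeholders; check the mwaddlink README for the actual interface):

  import subprocess

  for wiki_id in ["dewiki", "viwiki"]:
      # Placeholder invocation; use the publish-datasets command as
      # documented in the mwaddlink README.
      subprocess.run(["./publish-datasets.sh", wiki_id], check=True)
      print(f"{wiki_id}: datasets published; note this on the Phabricator ticket.")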