Research:Identification of Unsourced Statements/WikiLabels: How-to

Wiki Labels is a tool for collecting annotations for Wikimedia-specific machine learning projects. If you need to crowdsource labels on Wikimedia data, such as Wikipedia statements, edits, or articles, Wiki Labels might be a good place to start. If you want to run your crowdsourcing job through Wiki Labels, here is a short how-to.

Basic Concepts

A crowdsourcing task on Wiki Labels is called a campaign. A user interested in annotating data from a campaign can request a workset. A workset is a set of tasks, where each task corresponds to one piece of data to be annotated (e.g. one article or one edit). To build a campaign, a few things need to be defined:

  1. A form, i.e. the interface through which users will annotate the data
  2. A view rendering input data (e.g., an article) according to your needs
  3. Some hyperparameters, including:
    1. <wiki>, the language editions of Wikipedia included in the campaign, for example fawiki, dewiki, etc.
    2. <labels-per-task>, the number of times a task can be assigned to different labelers
    3. <tasks-per-assignment>, the number of tasks assigned per workset

Setup

Follow these steps to get WikiLabels working locally.

  1. Clone the repository: https://github.com/wiki-ai/wikilabels
  2. Follow the installation instructions: https://github.com/wiki-ai/wikilabels#installation
  3. Create a campaign: ./utility new_campaign -h
  4. Add tasks: cat tasks.json | ./utility task_inserts <campaign_id>
  5. Run dev server: ./utility dev_server
  6. See your work in action: http://localhost:8080/

Preparing Your Data

Data for a labeling task must be JSON-formatted so that it can be imported into the database using the utility script. Here is a sample record:

{"lang": "en", "id": "48594", "revision": "830688606", "tid": "1c832bf8-2b1b-11e8-84a7-96238bf73b54", "title": "Samuel_Johnson", "section_index": 9, "section": "Character sketch", "paragraph_index": 3, "sentence_index": 3, "statement": "Although Boswell was present with Johnson during the 1770s and describes four major pamphlets written by Johnson, he neglects to discuss them because he is more interested in their travels to Scotland"}

You may want to create a file that contains one valid JSON object like the one above on each line, and import it into your campaign as follows: cat tasks.json | ./utility task_inserts <campaign_id>
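A one-object-per-line file of this kind can be produced and sanity-checked with a short script before piping it into the utility. The following is a minimal Python sketch, not part of Wiki Labels itself: the helper names write_tasks and read_tasks are hypothetical, and the fields of each task are whatever your view expects.

```python
import json


def write_tasks(tasks, path):
    """Write each task dict as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            # json.dumps without indent never emits a raw newline,
            # so every task stays on exactly one line.
            f.write(json.dumps(task) + "\n")


def read_tasks(path):
    """Parse the file back, reporting the line number of any malformed entry."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"line {line_number}: {err}") from err
    return tasks


if __name__ == "__main__":
    # Hypothetical example record; use the fields your own view needs.
    sample = [{"lang": "en", "title": "Samuel_Johnson", "sentence_index": 3}]
    write_tasks(sample, "tasks.json")
    print(len(read_tasks("tasks.json")), "task(s) written and re-read")
```

Reading the file back before the import catches malformed lines early, which is easier to debug than a failure partway through task_inserts.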

Designing Your Campaign

To run a campaign, you will need to specify 1) the details of a form, including instructions, tooltips, and input fields, and 2) a view to render each piece of data.

Designing the Form

The Wiki Labels form specifies how to collect annotations on a given piece of data. To design a form, proceed as follows:

  1. Go to the form builder.
  2. Edit the code to specify two aspects of the interaction:
    1. Input fields through which users can express their judgment, for example text boxes or radio buttons. Form fields are configured in a YAML format that mirrors OOjs UI. Fields are generally Widget subclasses; for example, a text box is specified through the TextInputWidget subclass. To get familiar with the available widgets, see the OOjs UI documentation. Key input fields can be marked as required: true. For difficult or subjective tasks, a good practice is to add an 'unsure' checkbox.
    2. Instructions and tooltips the user should follow to perform the task correctly. In the i18n section, you can specify the instructions, headers, labels, and tooltip sentences shown to the user. By declaring the language as lan: on the first line of an instruction block, you can provide instructions in as many languages as desired.
  3. Submit the form for review. See this for sample forms.

For more best practices on how to design crowdsourcing annotation interfaces, see JMo's guide to Mechanical Turk (currently only available to WMF staff).

Designing and Coding the View

The View element renders the input data according to the task's goal. For example, if the task is to judge a word in a sentence, the input data will be the sentence text, and the view will render it so that the word to be examined is highlighted.

Views are added to https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/wsgi/static/js/wikiLabels/views.js. Each view should extend View and have a customized present method. The taskInfo passed to the present method is the data that you added via task_inserts (see #Setup). The name of the view is used when creating a campaign.

Choosing the Hyperparameters: Tips

Tips for choosing the campaign parameters

  1. Languages: the number of languages in each campaign heavily depends on data and translation availability. A way to translate form sentences to many languages is to use TranslateWiki.
  2. Number of Labels per Task: the number of annotations per task depends on how certain you want to be about the ground truth of a label. If you assign more than one annotation per task, you will need to decide how to handle disagreements (e.g. one annotator said an edit was vandalism but the other did not). Common strategies for resolving disagreement include majority voting (pick the decision made by more annotators), having a second round of annotators (or subject matter experts) review conflicting labels and resolve them definitively, or treating tasks with conflicting labels as less trustworthy (i.e. potentially excluding them from your final training or analysis dataset).
  3. Number of Tasks per Workset: this depends on the time needed to complete a task. A general best practice is to allow a maximum of 5 minutes for the completion of a workset.
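As an illustration of the majority-voting strategy mentioned in the second tip, here is a minimal Python sketch. It is not part of Wiki Labels: the function name resolve_labels is hypothetical, and it assumes you have already exported the labels for each task as a plain list. Ties return None so that those tasks can be sent to a second review round or excluded.

```python
from collections import Counter


def resolve_labels(labels):
    """Return the majority label, or None when there is no clear winner.

    `labels` is the list of annotations collected for a single task.
    """
    ranked = Counter(labels).most_common(2)
    if not ranked:
        return None  # no annotations collected yet
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels: needs review
    return ranked[0][0]


if __name__ == "__main__":
    print(resolve_labels(["vandalism", "vandalism", "good"]))  # vandalism
    print(resolve_labels(["vandalism", "good"]))               # None
```

With an odd number of labels per task and a binary question, a tie cannot occur, which is one reason to prefer three labels per task over two.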

Deployment

Install nginx

  1. sudo apt-get install nginx
  2. Create a config file at /etc/nginx/sites-available/default with the following content:
 server {
     listen 80 default_server;
     listen [::]:80 default_server;
 
     server_name research-wikilabels.wmflabs.org;
 
     location / {
         proxy_pass http://localhost:8080;
         proxy_http_version 1.1;
         proxy_set_header Upgrade $http_upgrade;
         proxy_set_header Connection 'upgrade';
         proxy_set_header Host $host;
         proxy_cache_bypass $http_upgrade;
     }
 }
  3. Enable the config file: sudo ln -s /etc/nginx/sites-available/default /etc/nginx/sites-enabled/

Run local dev server

  1. Currently, the unsourced statements staging environment is available at http://research-wikilabels.wmflabs.org. I followed the instructions above, with some additions.
  2. To get login working, I followed these instructions to obtain an OAuth key and secret. Then I replaced the existing OAuth key and secret with the ones I obtained.
  3. I opened a new screen window and ran the dev server:
 $ screen
 $ ./utility dev_server
 Ctrl-a d  (to detach from the screen session)