Web2Cit/Docs/Archive/User guide

Web2Cit is a tool to improve Wikipedia's automatic citation generator, by collaboratively defining procedures to extract citation metadata from web sources. In the context of Web2Cit we usually refer to metadata extraction as "translation".

How Web2Cit works

Short video explaining how Web2Cit works
Short video description of the Web2Cit ecosystem

Web2Cit configuration

In Web2Cit, translation configuration is defined on a per-domain basis: each domain has its own translation configuration.

Within each domain we can define one or more translation templates, that is, webpages that serve as templates for translation.

Within each template we define template fields, each corresponding to a specific citation field, such as item type, title, publication date, etc. Each template field has a validation rule, indicating what translation outputs are valid for that field.

Template fields can be marked as required or optional. If a required field produces an invalid translation output, the translation template is considered non-applicable for the target webpage. If an optional field produces an invalid output, that output is simply ignored.
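The relationship between validation rules and required fields can be illustrated with a minimal Python sketch. It is not Web2Cit's actual implementation: the `TemplateField` class, the use of regular expressions as validation rules, and the helper names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateField:
    """Hypothetical stand-in for a Web2Cit template field."""
    name: str       # citation field, e.g. "title" or "date"
    pattern: str    # assumed validation rule, written as a regular expression
    required: bool  # whether an invalid output disqualifies the template

    def is_valid(self, output: str) -> bool:
        # The whole output must match the validation rule.
        return re.fullmatch(self.pattern, output) is not None


def template_applies(fields: List[TemplateField], outputs: Dict[str, str]) -> bool:
    """A template applies only if every required field has a valid output."""
    for field in fields:
        value = outputs.get(field.name, "")
        if field.required and not field.is_valid(value):
            return False
    return True


def keep_valid(fields: List[TemplateField], outputs: Dict[str, str]) -> Dict[str, str]:
    """Keep valid outputs only; invalid outputs are ignored."""
    return {
        field.name: outputs[field.name]
        for field in fields
        if field.name in outputs and field.is_valid(outputs[field.name])
    }
```

With a required `title` field and an optional `date` field, an invalid date does not prevent the template from applying; the date is simply dropped from the result. An empty title, on the other hand, makes the template non-applicable.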

Template fields include one or more translation procedures (typically one), each including a series of selection and transformation steps. Selection steps select elements on the target webpage that contain the relevant metadata, and transformation steps transform them into a format that is valid for the field that we are working with.
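As a rough illustration of how a translation procedure chains its steps, the Python sketch below runs selection steps first and then feeds their combined output through the transformation steps in order. The step functions (`select_key`, `strip_each`, `split_on`) and the dictionary used to represent a webpage are hypothetical and do not correspond to the selection and transformation types that Web2Cit actually supports.

```python
from typing import Callable, Dict, List

Page = Dict[str, str]
Selection = Callable[[Page], List[str]]
Transformation = Callable[[List[str]], List[str]]


def run_procedure(
    page: Page,
    selections: List[Selection],
    transformations: List[Transformation],
) -> List[str]:
    """Run all selection steps, then apply the transformation steps in order."""
    values: List[str] = []
    for select in selections:
        # Assumption: outputs of all selection steps are concatenated.
        values.extend(select(page))
    for transform in transformations:
        # Each transformation consumes the output of the previous one.
        values = transform(values)
    return values


def select_key(key: str) -> Selection:
    """Hypothetical selection step: pick one named element from the page."""
    return lambda page: [page[key]] if key in page else []


def strip_each(values: List[str]) -> List[str]:
    """Hypothetical transformation step: trim surrounding whitespace."""
    return [value.strip() for value in values]


def split_on(separator: str) -> Transformation:
    """Hypothetical transformation step: split every value on a separator."""
    def step(values: List[str]) -> List[str]:
        return [part for value in values for part in value.split(separator)]
    return step
```

For example, a byline such as `" Jane Doe; John Roe "` could be selected, trimmed, and split on `"; "` to produce two separate author names.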

In addition, we can create translation subgroups within each domain by defining URL path patterns. A special catch-all pattern matches all templates that do not match any of the defined URL path patterns.
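The grouping by URL path pattern might be sketched as follows. This example uses Python's `fnmatch` module as a stand-in for pattern matching; the pattern syntax that Web2Cit really uses may differ, and the function name and the use of `None` to represent the catch-all group are assumptions.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def match_pattern(path: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern that matches the path.

    None is used here to represent the catch-all group, which collects
    every path that matches none of the defined patterns.
    """
    for pattern in patterns:
        if fnmatchcase(path, pattern):
            return pattern
    return None
```

Given the patterns `/news/*` and `/blog/*`, the path `/news/2020/story` would fall in the first subgroup, while `/about` would fall in the catch-all group.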

Web2Cit translation

Given a target webpage, Web2Cit first checks its domain to decide which domain configuration it should use.

Then it checks the target webpage's path, to see if it matches any of the URL path patterns defined.

Then it tries all translation templates that match the same URL path pattern, one by one, until it finds one that applies. We say that a translation template is applicable for a given target webpage if all fields that have been marked as required return a valid output.

If no applicable template is found, Web2Cit uses a fallback template, which simply returns Wikipedia's automatic citation generator (Citoid) response for all fields. This fallback template always applies.

Finally, Web2Cit returns a citation for the target webpage using the first applicable template found.
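The template selection logic described above can be summarized in a short Python sketch. It assumes that the templates passed in have already been narrowed down to those in the matching URL path pattern group, and it models each template as a function that returns a citation when it applies and `None` when it does not. These types and names are illustrative assumptions, not Web2Cit's API.

```python
from typing import Callable, Dict, List, Optional

Citation = Dict[str, str]
# A template returns a citation if it applies to the URL, or None otherwise.
Template = Callable[[str], Optional[Citation]]


def translate(
    url: str,
    templates: List[Template],
    fallback: Callable[[str], Citation],
) -> Citation:
    """Return the output of the first applicable template, else the fallback."""
    for template in templates:
        citation = template(url)
        if citation is not None:
            return citation
    # The fallback always applies; in Web2Cit it returns the Citoid response.
    return fallback(url)
```

Because templates are tried in order and the first applicable one wins, the order of the templates within a group affects the result.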

How to use Web2Cit

Short video explaining how to use Web2Cit
Web2Cit workshop at one of the LD4 Wikidata Affinity Group calls
Web2Cit workshop in Spanish, part of the Wikiherramientas series
  1. Identify a URL for which Wikipedia's automatic citation generator is returning incomplete or incorrect metadata.
  2. Try the same URL using Web2Cit. To do this, simply add Web2Cit's translation server address (https://web2cit.toolforge.org/) at the beginning of the URL in Wikipedia's citation generator.
  3. Go to the URL that you used in the previous step to open Web2Cit's translation summary for the target webpage.
  4. Add a translation test, including the expected output for all template fields. To do this, click on "edit" below "Expected output".
  5. To change the translation output, edit the templates configuration by clicking on "edit" beneath "Translation output". This should open the translation configuration editor.
  6. Add a translation template and enter the path of the webpage that you are using as a template for translation.
    1. Add a template field. You can check what fields we support and their output validation rules here.
    2. Add a translation procedure.
    3. Add one or more selection steps. You can check what selection steps we support and how to configure them here.
    4. Add zero or more transformation steps. You can check what transformation steps we support and how to configure them here. Hover over the info icon next to the "Item-wise" dropdown if you don't know what to choose there.
    5. Repeat substeps 1-4 above until you have added all the template fields needed.
  7. Save the configuration to your sandbox to experiment freely without affecting all Web2Cit users. At the bottom of the form, add User:YourUserName/ (replace YourUserName with your user name) at the beginning of the configuration file path, and click on "Review changes and save". In the tab that opens, review the changes and publish them.
  8. To try your configuration, you have to tell Web2Cit that you want it to use your sandbox configuration. Back on Web2Cit's translation summary (opened in step 3), enter your user name next to "Switch to sandbox configuration" and click on "Switch".
  9. Continue modifying the configuration until you are happy with the results. If the translation output is not what you expected, you may "enable debugging" to check what templates were tried, whether they turned out to be applicable or not, the output of each selection and transformation step, etc. Remember that you may also create translation subgroups by changing the URL path patterns configuration.
  10. Once you are happy with the translation output, click on "edit" beneath "Translation output" to open the translation configuration editor again. This time, save the configuration to the main namespace by removing the User:YourUserName/ that you added in step 7.
  11. In Web2Cit's translation summary, switch back to using the main configuration to confirm that your changes have been saved for all Web2Cit users.
  12. Finally, repeat step 2 to insert a citation with the correct metadata in Wikipedia.
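The two string manipulations involved in steps 2 and 7 above can be written out as a small Python sketch. The server address and the `User:YourUserName/` prefix come from this guide; the function names and the configuration file path used in the example are placeholders chosen for illustration.

```python
WEB2CIT_SERVER = "https://web2cit.toolforge.org/"


def web2cit_url(target_url: str) -> str:
    """Step 2: prepend the translation server address to the target URL."""
    return WEB2CIT_SERVER + target_url


def sandbox_path(user_name: str, config_path: str) -> str:
    """Step 7: prefix the configuration file path with the user's sandbox."""
    return f"User:{user_name}/{config_path}"


# Hypothetical target URL and configuration path, for illustration only.
print(web2cit_url("https://example.com/article"))
print(sandbox_path("YourUserName", "path/to/config.json"))
```

The first result is what would be pasted into Wikipedia's citation generator, and opening it in a browser leads to the translation summary mentioned in step 3.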

Integrating Web2Cit in Wikipedia

To better integrate Web2Cit in Wikipedia, you can install Web2Cit's user script.