Web2Cit/Archive/Early adopters

Web2Cit translation summary

(A Web2Cit translation overview video is also available on YouTube)

Translation rules are defined on a per-domain basis. A series of translation templates, URL path patterns and translation tests are collaboratively defined per domain.

URL path patterns define translation subgroups within a domain.

Translation templates specify one or more translation procedures for each template field (each mapped to predefined citation metadata fields). Translation templates are not abstract translation rules, but rather they are defined for specific template webpages.

A translation procedure is a series of selection and transformation steps. Selection steps select values from the translation target, and transformation steps are applied sequentially to transform them.

A procedure output can be valid or not for the template field to which it belongs. If the field has been marked as required, the procedure output must be valid for the template field to be applicable for the translation target.

If all template fields for a template are applicable, then the template is applicable for a given translation target. Applicable templates emit a citation.

Given a translation target, we first match its path to every URL path pattern in order until we find one that matches. If none matches, we use a catchall pattern **.

Then, we try to translate the target with the translation templates that belong to the same URL path pattern group. We try them one by one, in order, until we find one that applies. If none applies, we use a fallback template.
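
The lookup order described above can be sketched roughly as follows. This is only an illustration, not the Web2Cit implementation: the function names and the idea of modelling each template as a callable that returns a citation (or None when it does not apply) are assumptions, and Python's fnmatch is used as a rough stand-in for the glob matcher, so its matching rules will differ in places.

```python
from fnmatch import fnmatchcase

CATCH_ALL = "**"  # catch-all pattern used when no other pattern matches


def pick_pattern(path, patterns):
    """Return the first pattern (in order) that matches the path."""
    for pattern in patterns:
        # fnmatchcase is only an approximation of the real glob matcher
        if fnmatchcase(path, pattern):
            return pattern
    return CATCH_ALL


def translate(path, patterns, templates_by_pattern, fallback):
    """Try the templates of the matching pattern group, in order.

    Each template is modelled (hypothetically) as a callable that returns
    a citation dict, or None when the template is not applicable.
    """
    group = pick_pattern(path, patterns)
    for template in templates_by_pattern.get(group, []):
        citation = template(path)
        if citation is not None:
            return group, citation
    # No template in the group applied: use the fallback template
    return group, fallback(path)


# Hypothetical configuration for illustration
patterns = ["/news/*", "/blog/*"]
templates_by_pattern = {
    "/news/*": [lambda p: None, lambda p: {"title": "News"}],
}
fallback = lambda p: {"title": "fallback"}
```

With this toy configuration, /news/a falls in the /news/* group and is translated by the second template, whereas /about falls in the catch-all group and gets the fallback output.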

In addition, translation tests define expected translation outputs for specific target webpages. These expected outputs can be compared against the actual Web2Cit translation output.

Workflow proposal

A possible test-driven workflow proposal is:

  1. Find a target URL that Citoid is having trouble with
  2. Confirm that Web2Cit is having trouble with it too
  3. Write a translation test for the target URL with the expected results
  4. Write a template for the target URL until all domain tests pass. If needed, define URL path patterns as well[note 1]
  5. Repeat

An example hands-on demonstration video following this workflow is available on YouTube.

Domain configuration files

An introduction video to this section is available on YouTube.

Location

You will be manually editing domain configuration files (note that by doing this you will be changing translation rules for everyone using Web2Cit). These live in Meta, at Web2Cit/data/, one sub-directory deeper per hostname label, from the top-level domain all the way through the last subdomain.

You can find a full list of configuration files here.

For example, for hostname meta.wikimedia.org, configuration files would be at Web2Cit/data/org/wikimedia/meta/.

URL scheme (e.g., http, https, etc), port and path are not part of the hostname. For example, for URL https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Domain_configuration_files, only meta.wikimedia.org is the hostname.

There are three configuration files per domain: templates.json (for translation templates), patterns.json (for URL path patterns), and tests.json (for translation tests). So, for example, the translation templates configuration file for meta.wikimedia.org is at Web2Cit/data/org/wikimedia/meta/templates.json.
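
As a quick way to work out where the configuration files for a URL live, the mapping above can be sketched in a few lines of Python. The function name config_page is hypothetical; only the Web2Cit/data/ layout comes from the description above.

```python
from urllib.parse import urlsplit


def config_page(url, filename="templates.json"):
    """Return the Meta page title of a domain configuration file.

    Scheme, port, path, query and fragment are discarded; only the
    hostname labels are used, from the top-level domain down.
    """
    hostname = urlsplit(url).hostname
    if not hostname:
        raise ValueError(f"no hostname found in {url!r}")
    labels = hostname.split(".")
    return "/".join(["Web2Cit", "data", *reversed(labels), filename])


page = config_page(
    "https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Domain_configuration_files"
)
print(page)  # Web2Cit/data/org/wikimedia/meta/templates.json
```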

Format

All domain configuration files are written in JSON format (see below for alternative formats).

Generally speaking, our JSON files may have a combination of the following value types:

  • Text strings. For example "xpath".[note 2]
  • Booleans: true or false.
  • Arrays: or lists, with zero or more values, separated by commas. For example: [ "one", "two", "three" ].
  • Objects: with zero or more "key":value pairs separated by commas. For example:
{
  "key1": value1,
  "key2": value2
}

The MediaWiki editor is not specialized for editing JSON files. You may find it useful to make your edits using a separate editor and then pasting the result.

For each configuration file below there is an example file and a JSON-schema file available. The JSON-schema can be used to validate your JSON files, using stand-alone validators or text editor integrations.[1]

We recommend using json-editor,[2] which lets you edit JSON files via a simple form generated from our JSON-schema files (direct links available at each configuration file section below):

  1. If you are editing a pre-existing JSON file, paste it into the json-editor's "JSON Output" field to the right, and click on "Update Form".
  2. Fill in the form.
  3. Copy the JSON output from the field to the right, and paste it into Meta.

templates.json

The templates.json file contains an array of Template objects at its root.

Template objects

Each Template object represents a translation template and has a series of three key:value pairs:

  1. path key, with a string as value, representing the path of the webpage used as translation template, in the current domain (note that multiple Template objects with the same path value will be ignored). Do not include the hostname; just the path beginning with /. You may also include query (?) and fragment (#) components. For example, for template webpage https://example.com/news/article?id=3#subsection, use /news/article?id=3#subsection.
  2. label key, with a string as value, representing the (optional) fancy name for this translation template.
  3. fields key, with an array of TemplateField objects as value. Note that multiple TemplateField objects with the same fieldname value (see below) will be ignored:
{
  "path": string,
  "label": string,
  "fields": TemplateField[]
}

The fallback template: In the current approach, given a translation target, translation templates (in the same URL path pattern group) are tried in order, and the output of the first applicable template is returned. If no applicable templates are found, a fallback template is used, which returns the corresponding unmodified Citoid response for all template fields currently supported (see Template field names).[note 3] You can use the array of template fields of this fallback template as a basis for your custom translation templates.[note 4]

TemplateField objects

In turn, each TemplateField object represents a template field in the translation template and has a series of three key:value pairs:

  1. fieldname key, with a string as value, representing the name of the template field. See Template fields for currently supported values.
  2. required key, with a boolean (true or false) as value, representing whether the template field is required or not.[note 5]
  3. procedures key, with an array of Procedure objects as a value.
{
  "fieldname": string,
  "required": boolean,
  "procedures": Procedure[]
}

Mandatory fields: Some fields are mandatory, meaning that they must be included in the Template object's fields array (see above). Templates not including all mandatory fields will be ignored; note this may change in the future.[note 3] Also, mandatory fields are always required, regardless of the value of their required key. Mandatory fields are listed in the Translation field types section.

Procedure objects

In turn, each Procedure object represents a translation procedure and has a series of two key:value pairs:

  1. selections key, with an array of Selection objects as value.
  2. transformations key, with an array of Transformation objects as value.
{
  "selections": Selection[],
  "transformations": Transformation[]
}

Selection objects

Each Selection object represents a selection step and has a series of two key:value pairs:

  1. type key, with a string as value, representing the specific type of selection step. See Selection types and configs for currently supported values.
  2. config key, with a string as value, representing the specific configuration for the selection step. See Selection types and configs for currently supported values.
{
  "type": string,
  "config": string
}

Transformation objects

Finally, each Transformation object represents a transformation step and has a series of three key:value pairs:

  1. type key, with a string as value, representing the specific type of transformation step. See Transformation types and configs for currently supported values.
  2. config key, with a string as value, representing the specific configuration for the transformation step. See Transformation types and configs for currently supported values.
  3. itemwise key, with a boolean (true or false) as value, representing whether the transformation should be applied to each item of the input independently (true), or to the entire input as a whole (false). It has been proposed to make this optional and use a default value per transformation type; see task T303331.
{
  "type": string,
  "config": string,
  "itemwise": boolean
}
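
To see how the nested objects fit together, here is a minimal sketch that assembles a templates.json file from small Python helpers and serializes it to JSON. The helper names, the /news/article path and the chosen steps are made up for illustration; only the key names and the step types follow the descriptions in this page.

```python
import json


def selection(type_, config):
    """Build a Selection object."""
    return {"type": type_, "config": config}


def transformation(type_, config, itemwise):
    """Build a Transformation object."""
    return {"type": type_, "config": config, "itemwise": itemwise}


def procedure(selections, transformations=()):
    """Build a Procedure object: both keys live in the SAME object."""
    return {"selections": list(selections), "transformations": list(transformations)}


def field(fieldname, required, procedures):
    """Build a TemplateField object."""
    return {"fieldname": fieldname, "required": required, "procedures": procedures}


templates = [
    {
        "path": "/news/article",  # hypothetical template webpage
        "label": "news article",
        "fields": [
            # itemType and title are the mandatory fields
            field("itemType", True, [procedure([selection("fixed", "webpage")])]),
            field("title", True, [procedure([selection("citoid", "title")])]),
            field("date", False, [procedure(
                [selection("citoid", "date")],
                [transformation("date", "en", True)],
            )]),
        ],
    }
]

print(json.dumps(templates, indent=2))
```

The printed JSON can then be validated against the JSON-schema before pasting it into Meta.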

patterns.json

The patterns.json file contains an array of Pattern objects at its root.

Pattern objects

Each Pattern object represents a URL path pattern and has a series of two key:value pairs:

  1. pattern key, with a string as value, representing a glob path pattern that defines a URL matching group[note 6]
  2. label key, with a string as value, representing the (optional) fancy name for this URL path pattern:
{
  "pattern": string,
  "label": string
}

tests.json

The tests.json file contains an array of Test objects at its root.

Test objects

Each Test object represents a translation test and has a series of two key:value pairs:

  1. path key, with a string as value, representing the path of the webpage used as translation test, in the current domain (note that multiple Test objects with the same path value will be ignored). Just like with the path property of Template objects, do not include the hostname and make sure the path begins with /. You may also include query (?) and fragment (#) components.
  2. fields key, with an array of TestField objects as value. Note that multiple TestField objects with the same fieldname value (see below) will be ignored.
{
  "path": string,
  "fields": TestField[]
}

TestField objects

Each TestField object represents a test field in the translation test and has a series of two key:value pairs:

  1. fieldname key: any of the translation field names supported.
  2. goal key: an array of strings representing the expected translation output or translation goal for a given translation field. Each string value must comply with the translation field's validation rule. Provide an empty array to explicitly express that no output is expected.
{
  "fieldname": string,
  "goal": string[]
}
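
As an illustration, a small tests.json file could be assembled and serialized like this. The path and goal values are invented for the example; the empty goal array for authorLast expresses that no author output is expected.

```python
import json

tests = [
    {
        "path": "/news/article",  # hypothetical test webpage
        "fields": [
            {"fieldname": "title", "goal": ["An example headline"]},
            {"fieldname": "date", "goal": ["2022-03-01"]},
            # an empty array means "no output expected" for this field
            {"fieldname": "authorLast", "goal": []},
        ],
    }
]

tests_json = json.dumps(tests, indent=2)
print(tests_json)
```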

Alternative formats

Reading and writing JSON files can be challenging. This is what the Web2Cit visual editor will help you with!

In the meantime, you may consider using alternative, more human-readable formats, such as JSON5 or YAML. We do not currently support any of them, although we may in the future, as tracked in task T302694.

For now, you may use online converters to:[note 7]

  1. Convert a JSON configuration file to either JSON5 or YAML
  2. Edit the configuration file in JSON5 or YAML
  3. Convert back to JSON and validate with JSON-schema (see above)
  4. Save configuration file in JSON

JSON5

JSON5[3] closely resembles JSON but is more flexible, thus tolerating some common JSON mistakes. In our case, the following features may be of interest:

  • keys may be unquoted: { unquoted: "value" }
  • strings may be single-quoted, allowing double quotes inside them: 'single "quoted" string'
  • trailing commas in objects and arrays are OK: { key1: value1, key2: value2, } [ a, b, c, ]

YAML

YAML is indentation-based (like the Python programming language) and is much shorter and (usually)[4] easier to write and read.

This is a comparison between the JSON version (shown first) and the YAML version (shown second) of an example template configuration file excerpt:

[
  {
    "path": "/",
    "label": "fancy name",
    "fields": [
      {
        "fieldname": "title",
        "required": true,
        "procedures": [
          {
            "selections": [
              {
                "type": "citoid",
                "config": "title"
              }, {
                ...
              }
            ],
            "transformations": [
              {
                "type": "range",
                "config": "0",
                "itemwise": false
              },
              {
                ...
              }
            ]
          }
        ]
      },
      {
        "fieldname": "itemType",
        ...
      }
    ]
  },
  {
    ...
  }
]

- path: /
  label: fancy name
  fields:
    - fieldname: title
      required: true
      procedures:
        - selections:
          - type: citoid
            config: title
          - ...
          transformations:
          - type: range
            config: '0'
            itemwise: false
          - ...
    - fieldname: itemType
      ...
- ...

Remember that the procedures key of a TemplateField object takes an array of Procedure objects as values, each with selections and transformations keys. So the following code is wrong, because it specifies two separate Procedure objects, one with a selections key, and another one with a transformations key:

...
        "procedures": [
          {
            "selections": [
              {
                "type": "citoid",
                "config": "title"
              }
            ]
          },
          {
            "transformations": [
              {
                "type": "range",
                "config": "0",
                "itemwise": false
              }
            ]
          }
...

...
      procedures:
        - selections:
          - type: citoid
            config: title
        - transformations:
          - type: range
            config: '0'
            itemwise: false
...

Valid values

Selection types and configs

Selection steps take the translation target webpage as input, and return a list of zero or more selected values (strings) from that webpage.

There are different selection step types, and their behavior is customized using a config parameter.

Some selection types are pending implementation, including: URL, JSON-LD, CSS, Header, and Text fragment selections.

Citoid selection

The Citoid selection step selects a field from the Citoid response for the translation target.

  • type: citoid
  • config: any valid Citoid/Zotero base field name.[note 8] You can check what Citoid returns for a given URL using the citation endpoint of Wikimedia REST API;[5] make sure you use the mediawiki-basefields format. Note that all creator fields are split into creatorFirst and creatorLast fields. For example, the author field is split into authorFirst and authorLast fields.
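
To inspect the Citoid response for a target, the request URL can be built as sketched below. The path layout (/api/rest_v1/data/citation/{format}/{query}) and the en.wikipedia.org host are our understanding of the Wikimedia REST API and should be checked against its documentation; the function name is hypothetical. The sketch only builds the URL and does not send any request.

```python
from urllib.parse import quote


def citoid_url(target, wiki_host="en.wikipedia.org"):
    """Build a citation request URL using the mediawiki-basefields format.

    The target URL is percent-encoded so that it can be used as a
    single path segment.
    """
    encoded = quote(target, safe="")
    return f"https://{wiki_host}/api/rest_v1/data/citation/mediawiki-basefields/{encoded}"


print(citoid_url("https://example.com/article"))
```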

XPath selection

The XPath selection step selects a node from the translation target's HTML using XPath.

  • type: xpath
  • config: any valid XPath v1.0 expression. You can use your web browser's inspector (shown with F12 in some browsers) to get an XPath expression for an HTML node. Note that more than one XPath may be used for the same node, and some may be more robust than others. In addition, you may use some browser extensions to test your XPath expressions by highlighting all matching nodes.[note 9]

Note that in some webpages content is added on the fly using JavaScript. Web2Cit, like Citoid, does not run JavaScript on webpages. Hence, in these cases, what you see may not be what Web2Cit sees. Consider temporarily disabling JavaScript, for example by using a browser extension such as uBlock Origin.

Fixed selection

The Fixed selection step always returns the same predefined value.

  • type: fixed
  • config: the predefined value to be returned.

Transformation types and configs

Transformation steps take a list of zero or more values as input; i.e., the output (1) of one or more selection steps, or (2) of another transformation step. They return a transformed list of zero or more values (strings).

There are different transformation step types, and their behavior is customized using (1) a config parameter, and (2) an itemwise parameter, indicating whether the transformation should be applied on a per-input-item basis, or on the entire input as a whole.

Some transformation types are pending implementation, including: Replace and Custom JS transformations.

Join transformation

The Join transformation step joins two or more items in a list into one, using the separator specified.

  • type: join
  • config: the separator to use
  • itemwise (default = false): if set to true, the transformation is applied to each string in the input list independently, taking each string as a list of characters.

Split transformation

The Split transformation step splits a string at the separator specified into two or more substrings.

  • type: split
  • config: the separator to use
  • itemwise (default = true): if set to false, the input list of strings is first joined into a single string to which the split is applied.
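
The effect of the itemwise parameter on Join and Split can be illustrated with this small sketch. It is not the Web2Cit code: in particular, joining the input with an empty string before a non-itemwise split is an assumption, since the text above does not say which separator is used for that step.

```python
def join(items, separator, itemwise=False):
    """Join the whole list into one item, or the characters of each item."""
    if itemwise:
        # each string is treated as a list of characters
        return [separator.join(item) for item in items]
    return [separator.join(items)]


def split(items, separator, itemwise=True):
    """Split each item at the separator, or the joined input as a whole."""
    if not itemwise:
        # assumption: items are concatenated without any separator first
        items = ["".join(items)]
    return [part for item in items for part in item.split(separator)]


print(join(["Jane", "Doe"], " "))                 # ['Jane Doe']
print(split(["Doe, Jane", "Roe, Richard"], ", "))  # ['Doe', 'Jane', 'Roe', 'Richard']
```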

Date transformation

The Date transformation step uses the Sugar library[6] to parse natural language dates into a standard YYYY-MM-DD format. If not possible, it returns the original value.

  • type: date
  • config: one of the currently supported locales: ca, da, de, en, es, fi, fr, it, ja, ko, nl, no, pl, pt, ru, sv, zh-CN, zh-TW.
  • itemwise (default = true)

Range transformation

The Range transformation selects one or more items or ranges of items and returns them in the order specified. Numbering is one-based (i.e., the first item is 1).

  • type: range
  • config: one or more ranges, separated by commas. A range can be in one of the following forms:
    • start:end: selects elements from start through end, end included.
    • start:: selects elements from start through the last item in the list.
    • :end: selects elements from the beginning of the list through end, end included.
    • start: selects single element at start index.
  • itemwise (default = false): if set to true, the transformation is applied to each string in the input list independently, taking each string as a list of characters.
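
The range syntax above can be sketched as follows, using one-based, end-inclusive numbering as described. The function name is hypothetical, and silently dropping indices beyond the end of the list is an assumption.

```python
def select_range(items, config):
    """Select items according to a comma-separated list of ranges."""
    selected = []
    for part in config.split(","):
        part = part.strip()
        if ":" in part:
            start_text, end_text = part.split(":", 1)
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else len(items)
        else:
            start = end = int(part)
        if start < 1:
            raise ValueError("numbering is one-based")
        # one-based, end included: items[start - 1] .. items[end - 1]
        selected.extend(items[start - 1:end])
    return selected


items = ["a", "b", "c", "d"]
print(select_range(items, "4,1:2"))  # ['d', 'a', 'b']
```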


Match transformation

The Match transformation returns one or more substrings matching a target.

  • type: match
  • config: the matching target, expressed as either plain string or regular expression. To use regular expressions, wrap your pattern between /, followed by any optional flags.[7] For example, /(sub)?string/i matches either string or substring, case insensitively.[note 10] If you need to match a string that may be interpreted as a regular expression (i.e., a string matching the pattern \/.*\/[a-z]*), express it as a regular expression instead. For example, to match string /.*/ literally and prevent it from being interpreted as regular expression .*, express it as regular expression //\\.\\*// instead (note double-escaped special characters:[8] \\. and \\*).
  • itemwise (default = true)

Each match is returned as a separate output item. For example, matching the input "a substring inside a string" against the target string returns a two-item array output: ["string", "string"], one for each match. If using capturing groups in regular expressions,[9] only group matches are returned (i.e., not the full match).

If no matches are found for a given input item, the input item is ignored (i.e., not included in the transformation output). For example, matching the two-item array input ["a string with a substring", "a string without it"] against target substring returns a one-item array output: ["substring"].
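
The behaviour described above can be approximated with Python's re module, as in the sketch below. Keep in mind that this is only an approximation: Web2Cit uses JavaScript regular expressions, whose syntax and flags differ from Python's in places, and the function names and the flag mapping here are assumptions.

```python
import re

# A config wrapped between slashes, with optional flags, is a regular expression
REGEX_CONFIG = re.compile(r"^/(.*)/([a-z]*)$", re.DOTALL)
FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def compile_target(config):
    """Turn a match config into a compiled pattern."""
    parsed = REGEX_CONFIG.match(config)
    if parsed is None:
        # plain string target: match it literally
        return re.compile(re.escape(config))
    pattern, flag_letters = parsed.groups()
    flags = 0
    for letter in flag_letters:
        flags |= FLAGS.get(letter, 0)  # unknown flags (e.g. g) are ignored
    return re.compile(pattern, flags)


def match(items, config):
    """Return every match (or group match) as a separate output item."""
    target = compile_target(config)
    output = []
    for item in items:
        for found in target.finditer(item):
            if target.groups:
                # with capturing groups, only group matches are returned
                output.extend(g for g in found.groups() if g is not None)
            else:
                output.append(found.group(0))
    return output


print(match(["a substring inside a string"], "string"))  # ['string', 'string']
```

Items without any match simply contribute nothing to the output, which reproduces the one-item result of the second example above.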

Replace transformation (coming soon!)

The Replace transformation is planned to be similar to the Match transformation described above, but with an additional replace parameter to replace the target matches with.

More information will be provided when it has been implemented.

Translation fields

Translation Template and Test objects each include one or more translation fields. There are different translation field types, each described by:

  1. a fieldname,
  2. a validation rule, defining which translation procedure outputs or goals are valid for that field type, and
  3. a mapping, indicating which Citoid/Zotero citation field a (valid) output is mapped to.

Translation field types

We currently support the following field types:

  • Item type: the type of the cited resource. This is a mandatory field (see TemplateField objects)
    • fieldname: itemType
    • validation: a single value matching any of the supported Citoid/Zotero item types[10]. See your Wikipedia's Citoid-template-type-map.json configuration file to find out what citation template will be used for each item type.
    • mapping: mapped to Citoid/Zotero itemType
  • Title: the title of the cited resource. This is a mandatory field (see TemplateField objects)
    • fieldname: title
    • validation: a single non-empty string
    • mapping: mapped to Citoid/Zotero title
  • Author first names: the first names of the cited resource's authors (for author names where splitting between first and last names makes sense)
    • fieldname: authorFirst
    • validation: a list of one or more strings (empty strings are OK)
    • mapping: mapped to Citoid/Zotero author's first names
  • Author last or full names: the last names of the cited resource's authors; or the full names, for author names where splitting between first and last name does not make sense
    • fieldname: authorLast
    • validation: a list of one or more non-empty strings
    • mapping: mapped to Citoid/Zotero author's last names
  • Publication date: the date when the cited resource was published
    • fieldname: date
    • validation: a single value matching one of YYYY-MM-DD, YYYY-MM, or YYYY.
    • mapping: mapped to Citoid/Zotero date
  • Published in: the name of the work containing the cited resource; for example the newspaper name for a newspaper article
    • fieldname: publishedIn
    • validation: a single non-empty string
    • mapping: mapped to Citoid/Zotero publicationTitle, reporter and code (all mapped to CSL "container-title")[10]
  • Published by: the publisher of the cited resource
    • fieldname: publishedBy
    • validation: a single non-empty string
    • mapping: mapped to Citoid/Zotero publisher
  • Language: the locale identifier for the language of the cited resource
    • fieldname: language
    • validation: a single non-empty string, consisting of one or more parts separated by -, with the first part two characters long, and the remaining parts (if any) two or more characters long. For example, es-ar.
    • mapping: mapped to Citoid/Zotero language
  • Control: a force-required field that can be used to control whether a template should apply or not based on webpage content
    • fieldname: control
    • validation: a single non-empty string
    • mapping: not mapped to any Citoid/Zotero fields
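
To check a value against the date and language rules before saving it as a test goal, something along these lines could be used. The regular expressions are our reading of the validation rules above, not the ones used by Web2Cit; in particular, restricting language parts to ASCII letters and digits is an assumption, and the date pattern does not check that months and days are in range.

```python
import re

# YYYY, YYYY-MM or YYYY-MM-DD
DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
# first part two characters long, remaining parts two or more characters long
LANGUAGE = re.compile(r"[A-Za-z]{2}(-[A-Za-z0-9]{2,})*")


def is_valid_date(values):
    """A date field expects a single value in one of the supported forms."""
    return len(values) == 1 and DATE.fullmatch(values[0]) is not None


def is_valid_language(values):
    """A language field expects a single locale-like identifier."""
    return len(values) == 1 and LANGUAGE.fullmatch(values[0]) is not None


print(is_valid_date(["2022-03"]))    # True
print(is_valid_language(["es-ar"]))  # True
```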

Web2Cit Server

More up-to-date and complete information is available at Web2Cit/Server.

Standard endpoint

Finally, you can use the Web2Cit translation server to translate a target webpage. It will fetch the collaboratively defined configuration files described above from Meta, and use them to provide a translation.

Just prepend https://web2cit.toolforge.org/ at the beginning of the URL that you would like to translate. For example, to translate https://example.com/article use https://web2cit.toolforge.org/https://example.com/article.

Because the simple webpage returned includes the translation results as embedded metadata, you can use this URL in Wikipedia's automatic citation generator, or with Zotero browser connectors.[11]

Please report any bugs you may find in Phabricator, using project tag #w2c-server.

Debug endpoint

In some cases, the translation output may not match our expectations. This may be because of errors in the domain configuration files, or because of a bug in the Web2Cit server or core library.

Either way, knowing exactly which translation templates were tried for a given target URL, and which were the outputs of each selection and transformation step, would greatly help understand what is going on. This will be available in the Web2Cit visual editor.

In the meantime, you can use the debug endpoint of the translation server at https://web2cit.toolforge.org/debug/, just as you would with the standard endpoint. You will get the translation results plus additional information, including:

  • The revision ID of the patterns.json and templates.json files used from Meta, or whether they are missing or corrupt (in which case the catch-all pattern and fallback template will be used by default; see Translation summary).
  • The URL path pattern group to which the target webpage belongs (with ** being the catch-all pattern), and which in turn determines which translation templates will be tried.
  • A list of translation templates tried, until the first one applicable for the target webpage.
  • For each template:
    • the path to the template webpage on which it is based (or undefined for the fallback template);
    • whether the template is applicable for the target webpage or not (remember that non-applicable templates do not return a citation; see Translation summary);
    • a list of template fields (see below).
  • For each template field:
    • the field's name (itemType, title, etc; see Translation fields);
    • whether it is a required field or not (remember that the output of all required fields must be valid for the template to be applicable);
    • a list of translation procedures (see below);
    • the field output, which is a combination of all procedure outputs;
    • whether the field expects a single value (isArray = false) or a list of values (isArray = true);
    • the pattern that all output values must follow for the field output to be valid;
    • whether the field output is valid or not (again, remember that the output of all required fields must be valid for the template to be applicable);
    • whether the template field is applicable for the target webpage (i.e., either not required, or required AND valid).
  • For each translation procedure:
    • A selection object, including:
      • a list of selection steps (see below);
      • the overall selection output, which is a combination of all selection step outputs.
    • A transformation object, including:
      • a list of transformation steps (see below);
      • the overall transformation output, which is the output of the last transformation step (or the selection output, if there are no transformation steps).
  • For each selection step:
    • the selection step type (fixed, citoid, etc; see Selection types and configs);
    • the selection step configuration;
    • the selection step output.
  • For each transformation step:
    • the transformation step type (range, join, etc; see Transformation types and configs);
    • the transformation step configuration;
    • whether the transformation step was applied to each item of the input independently (itemwise = true), or to the entire input as a whole (itemwise = false);
    • the transformation step output.

Note that, when parsing configuration files, Web2Cit ignores invalid definitions. These ignored elements will not show up in the debugging information. For example:

  • multiple translation templates for the same path (see Template objects), or templates missing one or more mandatory fields (see TemplateField objects);
  • template fields with unsupported field names (see Translation fields);
  • procedures without either selection or transformation arrays (empty arrays are OK);
  • selection or transformation steps of an invalid type or with an invalid configuration (see Valid values).

This debug endpoint may also show translation test results, to support the proposed Web2Cit workflow. This is being tracked in task T302724.

Sandbox endpoint

 
[Figure: Visualization of an example templates.json file in the "User" namespace.]

Specifying translation templates is not straightforward. It may take some trial and error until we find a configuration that works as expected. This is even more so when manually editing configuration files!

Because each time we change domain configuration files in Meta we are changing them for all Web2Cit users, this trial and error may temporarily break translation for everyone. The visual editor will allow experimenting with configuration files without actually saving them to Meta.

In the meantime, you can use custom domain configurations using the sandbox endpoint of the translation server at https://web2cit.toolforge.org/sandbox/ (or https://web2cit.toolforge.org/debug/sandbox/ if you need additional debugging information). For example, if you want to experiment with configuration for example.com:

  • Edit configuration files under your username's page in the "User" namespace (do not use <syntaxhighlight> tags here): https://meta.wikimedia.org/wiki/User:YourUserName/Web2Cit/data/com/example/templates.json (or patterns.json, or tests.json).
  • Use the sandbox endpoint of the translation server, indicating your username: https://web2cit.toolforge.org/sandbox/YourUserName/https://example.com/article.

Once you are happy with your configuration, just paste it over to the "Main" namespace, as usual: https://meta.wikimedia.org/wiki/Web2Cit/data/com/example/templates.json for the example above.
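
The endpoint URLs mentioned in this section can be put together with a small helper like the one below. The function name is hypothetical, and placing the username right after debug/sandbox/ is an assumption based on how the plain sandbox endpoint is used above.

```python
SERVER = "https://web2cit.toolforge.org/"


def endpoint(target, debug=False, sandbox_user=None):
    """Build a translation server URL for a target webpage."""
    prefix = SERVER
    if debug:
        prefix += "debug/"
    if sandbox_user is not None:
        # use the configuration files under this user's pages in Meta
        prefix += f"sandbox/{sandbox_user}/"
    return prefix + target


target = "https://example.com/article"
print(endpoint(target))
print(endpoint(target, debug=True))
print(endpoint(target, sandbox_user="YourUserName"))
```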

In addition to enabling freer experimentation, editing configuration files in the "User" namespace provides additional benefits:

  • It uses the MediaWiki code editor, which is much better suited to editing JSON files than the wikitext editor.
  • It provides a beautiful visualization, making it much easier to review JSON files.

Translation tests

This section is still being worked on

Translation tests define a series of translation goals (i.e., expected output values), one per translation field, for a specific test webpage. Translation goals can be used to provide a score between 0 and 1 for the translation output resulting from the corresponding test webpage.

For any given translation field, if a test goal has not been defined, the score is undefined too. Conversely, if the translation output is undefined (no translation procedures defined) or empty, the score is 0.

If both translation output and translation goal are empty, the score is 1.

Given a pair of translation output and translation goal arrays, items from the first array are compared against items from the second array. There are three item-to-item comparison functions, depending on the translation field:

  • Levenshtein: 1 - dist / maxLength, where dist is the Levenshtein distance between items, and maxLength is the length of the longest item.
  • Boolean: 1 for identical items, 0 for non-identical items
  • Date: YYYY(-MM(-DD)) values are split on "-" and individual components are boolean-compared. An average is returned. Examples:
    • 2012-12 vs 2012: (1 + 0) / 2 = .5
    • 2010-12 vs 2012: (0 + 0) / 2 = 0
    • 2012-12 vs 2012-12: (1 + 1) / 2 = 1
    • 2012-12-01 vs 2012-12: (1 + 1 + 0) / 3 ≈ .67
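
The three item-to-item comparison functions can be sketched as follows. This is an illustration written from the descriptions above rather than the Web2Cit code; the function names are hypothetical, and returning 1 when both items are empty strings is an assumption.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_score(a, b):
    """1 - dist / maxLength."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1 - levenshtein(a, b) / longest


def boolean_score(a, b):
    """1 for identical items, 0 otherwise."""
    return 1.0 if a == b else 0.0


def date_score(a, b):
    """Average of component-wise boolean comparisons; missing components score 0."""
    parts_a, parts_b = a.split("-"), b.split("-")
    length = max(len(parts_a), len(parts_b))
    hits = sum(1 for x, y in zip(parts_a, parts_b) if x == y)
    return hits / length


print(date_score("2012-12", "2012"))  # 0.5
```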

First, items are compared in the order they are given. That is, first item in first array vs first item in second array, etc. This is the "ordered" score. This ordered score can be disabled in fields for which the order does not matter.

Second, items are compared irrespective of their order. To do that, the first array is compared against all possible permutations of the second array, and the highest score is kept. This is the "unordered" score. This unordered score can be disabled for fields where strict order is important.

Finally, the ordered and unordered scores are averaged and returned as the final score.
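
A rough sketch of the ordered and unordered scores follows. Since this section is still being worked on, several details are assumptions: arrays of different lengths are scored over the length of the longer one (unpaired items count as 0), both scores are always enabled, and an exact comparison is used by default. Trying every permutation grows factorially with the number of items, so this is only practical for short lists.

```python
from itertools import permutations


def exact(a, b):
    """Default item comparison: 1 for identical items, 0 otherwise."""
    return 1.0 if a == b else 0.0


def ordered_score(output, goal, compare=exact):
    """Compare items position by position."""
    if not output and not goal:
        return 1.0
    length = max(len(output), len(goal))
    return sum(compare(a, b) for a, b in zip(output, goal)) / length


def unordered_score(output, goal, compare=exact):
    """Best score over all permutations of the goal array."""
    if not output and not goal:
        return 1.0
    length = max(len(output), len(goal))
    # pad the goal so that every output item can be left unpaired
    padded = list(goal) + [None] * (length - len(goal))
    best = 0.0
    for candidate in permutations(padded):
        total = sum(compare(a, b) for a, b in zip(output, candidate) if b is not None)
        best = max(best, total / length)
    return best


def field_score(output, goal, compare=exact):
    """Average of the ordered and unordered scores."""
    return (ordered_score(output, goal, compare) + unordered_score(output, goal, compare)) / 2


print(field_score(["Doe", "Roe"], ["Roe", "Doe"]))  # 0.5
```

In the example, the ordered score is 0 (no item is in the expected position) and the unordered score is 1 (the same items are present), which averages to 0.5.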

Notes

  1. Checking test results is not supported yet. We are tracking this in task T302724.
  2. Text strings start and end with double quotes ". Therefore, avoid unescaped double quotes inside them. For example, in "some "quoted" text" it is not clear where the string starts or ends. If possible, replace double quotes with single quotes ': "some 'quoted' text". Alternatively, escape the inner double quotes with \: "some \"quoted\" text".
  3. The fallback template and mandatory fields approach may change in the future, as described in T302019.
  4. Currently used fallback template definition available from the source code repository here
  5. A required field is one such that its output should be valid for the template to which it belongs to be applicable
  6. There are tools to test your glob patterns. See for example DigitalOcean's Glob Tool. However, note that there may be differences between specific implementations. The Web2Cit core library currently uses the glob matcher for JavaScript minimatch.
  7. For example, toolkit.site's Data Format Converter
  8. Citoid/Zotero fields can be base or derived fields. Some item types have derived fields for some fields, which map back to base fields. In Web2Cit we only use base fields. You can find a list of derived and base regular and creator fields here.
  9. You can give "Try XPath" for Firefox or "CSS and XPath checker" for Chrome a try.
  10. You may use tools to help you test your regular expressions, for example regex101.

References