Community Wishlist Survey 2019/Bots and gadgets/Machine readable diffs

Machine readable diffs

  • Problem: Diffs cannot be read without screen scraping. Even the API output requires HTML parsing to get at the wikitext changes.
  • Who would benefit: Any semi-automated or fully automated consumer of diffs (e.g. bot operators, data scientists, tools, researchers).
  • Proposed solution: Exactly what it says on the tin. Add a different diff format to the API that JSON/XML parsers can understand.
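To make the problem concrete, here is a minimal Python sketch of the scraping a consumer has to do today. The HTML fragment is hand-written to resemble the rows of a two-column diff table; it is an assumption for illustration, not captured API output, and the `DiffScraper` and `scrape` names are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written fragment resembling rows of an HTML diff table
# (an assumption for illustration, not real API output).
SAMPLE = (
    '<tr>'
    '<td class="diff-marker">-</td>'
    '<td class="diff-deletedline"><div>The <del class="diffchange">quick</del> fox</div></td>'
    '<td class="diff-marker">+</td>'
    '<td class="diff-addedline"><div>The <ins class="diffchange">slow</ins> fox</div></td>'
    '</tr>'
)

# CSS classes on table cells mapped to the kind of line they hold.
CELL_KINDS = {
    "diff-deletedline": "removed",
    "diff-addedline": "added",
    "diff-context": "context",
}


class DiffScraper(HTMLParser):
    """Collects (kind, text) pairs from the diff table cells."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._kind = None   # kind of the cell currently being read, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name in CELL_KINDS:
                    self._kind = CELL_KINDS[name]
                    self._buf = []

    def handle_data(self, data):
        # Only keep text that sits inside a recognised cell.
        if self._kind is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._kind is not None:
            self.lines.append((self._kind, "".join(self._buf)))
            self._kind = None


def scrape(html):
    parser = DiffScraper()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Calling `scrape(SAMPLE)` yields one removed and one added line. Everything here depends on markup details such as class names, which is the fragility the proposal aims to remove.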

Discussion

I like this idea. Gryllida 22:17, 30 October 2018 (UTC)

  • Could you provide an example of how this API's output might look? MaxSem (WMF) (talk) 23:56, 30 October 2018 (UTC)
  • And examples of use cases where the lack of this format proved prohibitively expensive or blocking? --TheDJ (talk • contribs) 08:00, 31 October 2018 (UTC)
  • I note there are likely four parts to this request:
    1. Determine the "machine readable" format that will be used. Is there an existing standard that could be used, or do we have to invent something?
    2. Create a DiffFormatter subclass that generates that format.
    3. Update the wikidiff2 extension to be able to output structured diff data, either in a format that can be handled by DiffFormatter or in the "machine readable" format directly.
    4. Adjust existing code to expose the structured output via the API.
    I note that third-party sites using $wgExternalDiffEngine will likely have to gracefully not support this feature. Anomie (talk) 15:54, 31 October 2018 (UTC)
    The problem I want to solve is that I want to perform analysis on the content added or removed. As for the output, I can work with something like this, although the text of the initial revision should also be returned for completeness. The refactoring required for this task will also knock out the technical debt preventing phab:T104072, phab:T117279 and phab:T38902 (moving MobileDiff into core, and making it available through the API), so there are more use cases than the one given here. MER-C (talk) 19:39, 31 October 2018 (UTC)


  • On benefit: Yeah, various applications would need such an interface to inspect and analyse edits.
  • On output format:
    • JSON or XML, or ideally both, since the code is being reworked anyway.
    • The contents will be dictated by the current wikidiff structure. Other structures would be possible only if the collected information were not sent directly to the output stream.
      • Simply an array of difference groups, each containing the same information that is visible in the two-column output today.
      • Each group consists of two objects, before and after.
      • Each object has a line range, the last detected headline (as recently suggested), and an array of paragraphs.
      • Each paragraph has a +/- state and its single-line content, if any.
      • Each line's content is an array of tuples, each tuple holding a changed/unchanged flag and a string (escaped according to the output format).
    • In short, the same as the HTML today, just expressed in JSON or XML syntax.
    • The diff itself needs to be wrapped by some informative data:
      • Both revIDs involved.
      • Method/structure: currently always wikidiff2, but this may change over the decades.
      • The wikiID (can be derived from the request URL, but included for the sake of completeness).
      • Other information if it is available anyway, such as pageID, user name, timestamp or page name, though these can be derived from the revIDs later.
  • On special features:
    • An API request might provide control information, such as the number of context paragraphs before and after a change. These are constant values for the HTML special pages, but are already available as the parameter numContextLines.
    • A research application might drop the surrounding paragraphs, which only help human readers to identify the context.
  • On implementation efforts:
    • Unfortunately, the 15-year-old procedure does not build a complete diff object first and only then start output formatting.
      • Formatted output is collected immediately as each difference is found.
      • Otherwise the entire diff object could simply be passed to a serializer for JSON or XML.
    • The two column output is formatted by TableDiff.cpp today.
      • Two copies of this would need to be made, JsonDiff.cpp and XmlDiff.cpp.
      • Each would then generate the appropriate syntax element by element, as is done for HTML, with proper escaping of characters such as " and <.
    • The stream needs to be wrapped in an output header and terminator, plus the usual administrative business.

Greetings --PerfektesChaos (talk) 13:03, 4 November 2018 (UTC)
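The structure outlined above could look roughly like the following Python sketch. All class and field names (`DiffEnvelope`, `Group`, `Side`, `Paragraph`, `from_rev`, `wiki` and so on) and all sample values are hypothetical and chosen only to mirror the bullets; nothing here is an agreed format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class Paragraph:
    state: str  # "+" added, "-" removed, " " unchanged context
    # Each tuple is (changed?, text fragment).
    content: List[Tuple[bool, str]] = field(default_factory=list)


@dataclass
class Side:
    first_line: int
    last_line: int
    headline: Optional[str]  # last detected headline, if any
    paragraphs: List[Paragraph]


@dataclass
class Group:
    before: Side
    after: Side


@dataclass
class DiffEnvelope:
    from_rev: int
    to_rev: int
    method: str  # e.g. "wikidiff2"
    wiki: str
    groups: List[Group]


# Made-up sample data, for illustration only.
example = DiffEnvelope(
    from_rev=1001,
    to_rev=1002,
    method="wikidiff2",
    wiki="examplewiki",
    groups=[
        Group(
            before=Side(12, 12, "History", [
                Paragraph("-", [(False, "The "), (True, "quick"), (False, " fox")]),
            ]),
            after=Side(12, 12, "History", [
                Paragraph("+", [(False, "The "), (True, "slow"), (False, " fox")]),
            ]),
        )
    ],
)


def changed_text(side):
    """Return, per paragraph, only the fragments flagged as changed."""
    return ["".join(text for changed, text in p.content if changed)
            for p in side.paragraphs]


# A generic serializer can emit the whole object; tuples become JSON arrays.
as_json = json.dumps(asdict(example), indent=2)
```

With a structure like this, a research tool can pull out the changed fragments with `changed_text(example.groups[0].after)` and skip the context paragraphs entirely, without touching any markup.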

Note that the diff engine should not generate JSON or XML directly. It should generate a data structure built out of PHP associative arrays and other PHP native types, which the API will then turn into JSON, XML, or serialized PHP based on the 'format' parameter as it does for every other API request. Anomie (talk) 15:54, 5 November 2018 (UTC)
Or HTML for the front end in either desktop or mobile format. MER-C (talk) 18:58, 6 November 2018 (UTC)
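The separation described here can be illustrated with a small Python analogy (the real code would be PHP inside MediaWiki). The diff step returns only native types, and a separate layer picks the serialization from a format parameter. The function names `compute_diff`, `to_xml` and `render`, and the field names, are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def compute_diff():
    """Stand-in for the diff engine: returns native types only, no markup."""
    return {
        "fromrev": 1001,
        "torev": 1002,
        "changes": [
            {"op": "remove", "line": 12, "text": "The quick fox"},
            {"op": "add", "line": 12, "text": 'The "slow" fox <ref/>'},
        ],
    }


def to_xml(result):
    """Build an XML document; the library takes care of escaping."""
    root = ET.Element("diff",
                      fromrev=str(result["fromrev"]),
                      torev=str(result["torev"]))
    for change in result["changes"]:
        el = ET.SubElement(root, "change",
                           op=change["op"], line=str(change["line"]))
        el.text = change["text"]
    return ET.tostring(root, encoding="unicode")


def render(result, fmt):
    """Serialize the same structure according to the requested format."""
    if fmt == "json":
        return json.dumps(result)
    if fmt == "xml":
        return to_xml(result)
    raise ValueError(f"unsupported format: {fmt}")
```

Because escaping of characters such as " and < happens in the serializers, the diff step never has to know which output syntax was requested, and an HTML renderer for desktop or mobile could consume the same structure.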

Voting