Community Wishlist Survey 2021/Admins and patrollers/Create an extension for fixing parent IDs

Create an extension for fixing parent IDs

  • Problem: As of MediaWiki 1.31, restored revisions keep their old parent IDs. However, revisions whose parent IDs were changed by undeletions in earlier MediaWiki versions have no way to regain their original parent IDs.

Other problems with rev_parent_id can also occur:

  • Undeleting revisions that were deleted prior to MediaWiki 1.5 could cause all of them to have rev_parent_id 0 (see the history of Template talk:Db-g1/Archive 1 on Wikipedia, for example).
  • Undeletions of revisions deleted prior to MediaWiki 1.5 and imports could cause the restored or imported revisions to all unexpectedly have the latest revision ID at the time of the undeletion or import as the parent ID (see the histories of Joshua Claybourn, Talk:Netherlands, and California on Wikipedia, for example).
  • Page deletions in MediaWiki 1.18 and earlier did not fill in the ar_parent_id column, so undeleting revisions that were deleted in MediaWiki 1.18 or earlier could cause unexpected results (see the histories of Eshay and Sembcorp on Wikipedia, for example).
  • Importing could cause the imported revisions to have a parent ID that does not correspond to the one on the original source wiki (see the history of MediaWiki:Gadget-formWizard-core.js on Wikipedia, for example).
  • Who would benefit: Users patrolling page histories for size differences
  • Proposed solution: Create an extension for fixing parent IDs and install it on Wikimedia wikis. The following are the (collapsed) descriptions from the Phabricator tasks:

T223343:
When revisions are imported, we should attempt to preserve the parent revision from the source wiki. This means that if rev_id m has rev_parent_id n or 0 on the source wiki, then rev_id m' would have rev_parent_id n' or 0 on the target wiki, where the primes mean the corresponding imported revision IDs on the target wiki. If the parent ID on the source wiki is a deleted revision ID or has a different rev_page, then we would either have to fall back to using the preceding revision ID as rev_parent_id or insert dummy "ancestor" rows into the archive table.
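The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `map_parent_ids`, the `id_map` dictionary (source rev_id to target rev_id), and the input format are hypothetical and not part of MediaWiki. Only the simpler of the two fallbacks (using the preceding imported revision) is shown; inserting dummy "ancestor" rows is not covered.

```python
from typing import Dict, List, Tuple


def map_parent_ids(
    source_revs: List[Tuple[int, int]],
    id_map: Dict[int, int],
) -> Dict[int, int]:
    """Return {target_rev_id: target_parent_id} for imported revisions.

    source_revs: (rev_id, rev_parent_id) pairs from the source wiki, in
        chronological order (hypothetical input format).
    id_map: source rev_id -> rev_id assigned on the target wiki (hypothetical).
    """
    result: Dict[int, int] = {}
    previous_target_id = 0  # 0 means "no parent", i.e. a page creation
    for source_id, source_parent in source_revs:
        target_id = id_map[source_id]
        if source_parent == 0:
            # Page creations stay page creations on the target wiki.
            target_parent = 0
        elif source_parent in id_map:
            # The parent was imported too: use its primed (target) ID.
            target_parent = id_map[source_parent]
        else:
            # The parent was not imported (deleted, or on another page):
            # fall back to the preceding imported revision.
            target_parent = previous_target_id
        result[target_id] = target_parent
        previous_target_id = target_id
    return result


# Made-up IDs: source revision 13 has parent 12, which was not imported.
example = map_parent_ids(
    [(10, 0), (11, 10), (13, 12), (14, 11)],
    {10: 100, 11: 101, 13: 102, 14: 103},
)
```

In this made-up example, target revision 103 keeps 101 as its parent even though 102 directly precedes it, which is the kind of parallel history the task wants to preserve.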

T223342:
Before creating the extension, we should make the "populateParentId" script in MediaWiki also populate missing ar_parent_id fields, at least for those archive rows that have a non-null ar_page_id field, where it is assumed that the equivalent of "has the same rev_page" for the archive table is "has the same ar_namespace, ar_title, and ar_page_id combination". Dealing with the null ar_page_id case is a bit trickier, because we need to know when each revision was deleted, so it is best to leave ar_parent_id null for such deleted revisions for now.
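As a rough sketch of the grouping rule above, the following Python function fills in missing ar_parent_id values for archive rows that have a non-null ar_page_id. It assumes that the parent of a row is the chronologically preceding archive row in the same (ar_namespace, ar_title, ar_page_id) group; that assumption, the dictionary-based row format, and the function name `populate_ar_parent_ids` are illustrative only and do not describe how the real "populateParentId" script works. Live revisions of the same page are ignored here for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def populate_ar_parent_ids(rows: List[dict]) -> None:
    """Fill missing ar_parent_id values in place (hypothetical row dicts)."""
    groups: Dict[Tuple[int, str, int], List[dict]] = defaultdict(list)
    for row in rows:
        # Rows with a null ar_page_id are left alone, as suggested above.
        if row["ar_page_id"] is not None:
            key = (row["ar_namespace"], row["ar_title"], row["ar_page_id"])
            groups[key].append(row)
    for group in groups.values():
        # 14-digit MediaWiki timestamps sort correctly as strings.
        group.sort(key=lambda r: (r["ar_timestamp"], r["ar_rev_id"]))
        previous_id = 0
        for row in group:
            if row["ar_parent_id"] is None:
                row["ar_parent_id"] = previous_id
            previous_id = row["ar_rev_id"]


# Made-up rows for illustration.
archive_rows = [
    {"ar_rev_id": 5, "ar_namespace": 0, "ar_title": "A", "ar_page_id": 7,
     "ar_timestamp": "20110102000000", "ar_parent_id": None},
    {"ar_rev_id": 3, "ar_namespace": 0, "ar_title": "A", "ar_page_id": 7,
     "ar_timestamp": "20110101000000", "ar_parent_id": None},
    {"ar_rev_id": 9, "ar_namespace": 0, "ar_title": "A", "ar_page_id": None,
     "ar_timestamp": "20110103000000", "ar_parent_id": None},
]
populate_ar_parent_ids(archive_rows)
parents = {row["ar_rev_id"]: row["ar_parent_id"] for row in archive_rows}
```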

We now keep the old parent ID when restoring revisions, but this previously wasn't the case. We should start fixing rev_parent_id for all old restored revisions.

First, the extension needs 2 globals named "$wg(ExtensionName)119date" and "$wg(ExtensionName)131date". These should respectively be the date the wiki started using MW 1.19 (when deleted revisions started to have parent IDs saved in the archive table) or later and the date the wiki started using MW 1.31 wmf.15 (when undeletions started to keep the old parent ID) or later. We also need 2 tables named "backlog_temp_page" and "backlog_temp_revision". The former will have columns named "btp_id", "btp_namespace", "btp_title", and "btp_timestamp". The latter will have columns named "btr_id", "btr_rev_id", "btr_btp_id", "btr_old_parent_id", "btr_new_parent_id", and "btr_table".
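The two tables could look roughly like the following, sketched here with Python's built-in sqlite3 module so that it can be run without a MediaWiki installation. The table and column names come from the description above; the column types, constraints, and the helper name `create_backlog_tables` are assumptions made for illustration.

```python
import sqlite3

# Column names follow the task description; types and constraints are guesses.
SCHEMA = """
CREATE TABLE backlog_temp_page (
    btp_id INTEGER PRIMARY KEY,
    btp_namespace INTEGER NOT NULL,
    btp_title TEXT NOT NULL,
    btp_timestamp TEXT NOT NULL
);
CREATE TABLE backlog_temp_revision (
    btr_id INTEGER PRIMARY KEY,
    btr_rev_id INTEGER NOT NULL,
    btr_btp_id INTEGER NOT NULL REFERENCES backlog_temp_page (btp_id),
    btr_old_parent_id INTEGER,
    btr_new_parent_id INTEGER,
    btr_table TEXT NOT NULL CHECK (btr_table IN ('revision', 'archive'))
);
"""


def create_backlog_tables(conn: sqlite3.Connection) -> None:
    """Create both backlog tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_backlog_tables(conn)
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(backlog_temp_revision)")]
```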

Next, we need to create a script that will dump some pages and revisions (including deleted ones) into the 2 tables. All pages that have a "restored page" log entry with a timestamp on or before the "$wg(ExtensionName)131date" global will be dumped into the "backlog_temp_page" table, with "btp_timestamp" being the log entry's timestamp (the earliest one if there is more than one such entry). The same will also be done for pages with later "restored page" log entries if they also have at least one "deleted page" log entry with a timestamp on or before the "$wg(ExtensionName)119date" global, as well as pages with import log entries. Again, if a page has more than one "restored page" or import log entry, or both, then it will only be added once to the table, and "btp_timestamp" will be the timestamp of the earliest such entry. Targets of merge and move log entries from titles already in the table will also be added to the table, with "btp_timestamp" being the timestamp of the earliest such entry, and this will be done recursively. Merge and move log entries with timestamps earlier than the "btp_timestamp" for the source page will be ignored.

Once the "backlog_temp_page" table is completely filled, it is then time to fill in the "backlog_temp_revision" table. All live and deleted revisions for pages in the "backlog_temp_page" table will be dumped into the "backlog_temp_revision" table, with "btr_old_parent_id" and "btr_new_parent_id" both initially being the current value of rev_parent_id or ar_parent_id, and "btr_table" being either "revision" or "archive".
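The page selection rules above might be sketched as follows. The `LogEntry` type, the action names, and the function `select_backlog_pages` are hypothetical simplifications (real log rows carry a namespace, parameters, and more), and the cutoff values in the example are placeholders rather than the dates of any actual wiki.

```python
from typing import Dict, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """Simplified, hypothetical stand-in for a row of the logging table."""
    action: str                   # "restore", "delete", "import", "move" or "merge"
    title: str                    # page the entry is about (source of a move or merge)
    timestamp: str                # 14-digit MediaWiki timestamp, sortable as text
    target: Optional[str] = None  # destination title for "move" and "merge"


def select_backlog_pages(log: List[LogEntry], date_119: str, date_131: str) -> Dict[str, str]:
    """Return {title: btp_timestamp} following the rules described above."""
    deleted_early = {e.title for e in log
                     if e.action == "delete" and e.timestamp <= date_119}
    pages: Dict[str, str] = {}

    def add(title: str, timestamp: str) -> bool:
        """Keep the earliest qualifying timestamp; report whether anything changed."""
        if title not in pages or timestamp < pages[title]:
            pages[title] = timestamp
            return True
        return False

    for e in log:
        if e.action == "import":
            add(e.title, e.timestamp)
        elif e.action == "restore" and (e.timestamp <= date_131 or e.title in deleted_early):
            add(e.title, e.timestamp)

    # Follow merge and move targets until no new page or earlier timestamp appears.
    changed = True
    while changed:
        changed = False
        for e in log:
            if (e.action in ("move", "merge") and e.target is not None
                    and e.title in pages and e.timestamp >= pages[e.title]):
                changed = add(e.target, e.timestamp) or changed
    return pages


# Made-up log entries and placeholder cutoff dates.
log = [
    LogEntry("delete", "A", "20110101000000"),
    LogEntry("restore", "A", "20190101000000"),
    LogEntry("restore", "B", "20170101000000"),
    LogEntry("restore", "C", "20190601000000"),
    LogEntry("move", "B", "20180101000000", "D"),
    LogEntry("move", "B", "20160101000000", "E"),
    LogEntry("import", "F", "20200101000000"),
]
selected = select_backlog_pages(log, "20120229235959", "20180131235959")
```

Here "C" is skipped because its only restore entry is after the second cutoff and it has no early deletion, and "E" is skipped because the move predates the "btp_timestamp" of its source "B".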

If there is a page that one thinks also needs repair, but is not automatically added to the "backlog_temp_page" table, then one can just visit a special page named "Special:FixParentIDsRequest" to request that the page be manually added to the table, and another administrator will then approve or decline the request. After approving the request, all of the page's live and deleted revisions will then be dumped into the "backlog_temp_revision" table, following the same rules as the above script. If one thinks that a declined request should have been approved, then one can just make another request for the same page, and the new request should then be approved by a different administrator from the one who declined the original request.

While the backlog is being processed, the extension needs some hooks. When a page listed in the "backlog_temp_page" table is deleted, the "btr_table" field must be changed from "revision" to "archive" for all rows corresponding to the revisions that had just been deleted. When such a page is undeleted, the "btr_table" field must be changed from "archive" to "revision" for all rows corresponding to the revisions that had just been restored. When such a page is moved, the new title must replace the old one in the "backlog_temp_page" table, and the old title will be re-added to the table if it still has some deleted revisions. In the latter case, the "btr_btp_id" field will be replaced with the ID of the newly inserted "backlog_temp_page" row for all rows in the "backlog_temp_revision" table corresponding to the deleted revisions for the old title. Finally, when Special:MergeHistory is used with a source page that is already in the "backlog_temp_page" table, the target page must be added to the table if it is not already there, and the "btr_btp_id" field must be updated for all rows corresponding to the revisions that had just been merged. When all revisions are merged, the source page will be removed from the "backlog_temp_page" table if it does not have any deleted revisions.
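The deletion and undeletion cases above amount to flipping "btr_table" for the affected rows. A minimal sketch, assuming the "backlog_temp_revision" rows are held in a dictionary keyed by revision ID; the handler names are hypothetical and are not actual MediaWiki hook names, and the move and merge cases are left out.

```python
from typing import Dict, Iterable


def on_revisions_deleted(btr_rows: Dict[int, dict], rev_ids: Iterable[int]) -> None:
    """Mark tracked revisions as living in the archive table after a deletion."""
    for rev_id in rev_ids:
        row = btr_rows.get(rev_id)  # untracked revisions are simply skipped
        if row is not None and row["btr_table"] == "revision":
            row["btr_table"] = "archive"


def on_revisions_restored(btr_rows: Dict[int, dict], rev_ids: Iterable[int]) -> None:
    """Mark tracked revisions as living in the revision table after an undeletion."""
    for rev_id in rev_ids:
        row = btr_rows.get(rev_id)
        if row is not None and row["btr_table"] == "archive":
            row["btr_table"] = "revision"


# Made-up rows: delete the whole page, then restore only revision 2.
tracked = {1: {"btr_table": "revision"}, 2: {"btr_table": "revision"}}
on_revisions_deleted(tracked, [1, 2])
on_revisions_restored(tracked, [2])
```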

Finally, we need a special page named "Special:FixParentIDs". The special page will require a page from the "backlog_temp_page" table to be fixed. Then, all of the page's revisions from the "backlog_temp_revision" table will be listed on the special page, with the timestamp, author, edit summary, and "minor edit" status shown. For each revision, there will be 2 radio buttons below it. One of them will say to keep the current parent ID, and the other one will say to change the parent ID to whatever the user thinks it should be. There will also be 2 buttons named "Save settings" and "Fix page". The former will only update the "btr_new_parent_id" fields, while the latter will also immediately fix rev_parent_id or ar_parent_id for all of the page's live and deleted revisions and remove the page and its revisions from the "backlog_temp_page" and "backlog_temp_revision" tables.
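The difference between the two buttons can be illustrated with a small sketch. It assumes in-memory dictionaries in place of the real database tables; the names `save_settings`, `fix_page`, `backlog`, and `storage` are hypothetical.

```python
from typing import Dict


def save_settings(backlog: Dict[int, dict], choices: Dict[int, int]) -> None:
    """'Save settings': only record the chosen parent IDs in the backlog rows."""
    for rev_id, new_parent in choices.items():
        backlog[rev_id]["btr_new_parent_id"] = new_parent


def fix_page(backlog: Dict[int, dict], choices: Dict[int, int],
             storage: Dict[str, Dict[int, int]]) -> None:
    """'Fix page': save, write the parent IDs to the live data, clear the backlog.

    storage maps "revision" / "archive" to {rev_id: parent_id}, standing in
    for rev_parent_id and ar_parent_id respectively.
    """
    save_settings(backlog, choices)
    for rev_id in list(backlog):
        row = backlog.pop(rev_id)
        storage[row["btr_table"]][rev_id] = row["btr_new_parent_id"]


# Made-up data: revision 12 is live, revision 11 is deleted.
backlog = {
    11: {"btr_old_parent_id": 10, "btr_new_parent_id": 10, "btr_table": "archive"},
    12: {"btr_old_parent_id": 11, "btr_new_parent_id": 11, "btr_table": "revision"},
}
storage = {"revision": {12: 11}, "archive": {11: 10}}

save_settings(backlog, {12: 10})
after_save = storage["revision"][12]  # still the old value; nothing written yet
fix_page(backlog, {12: 10}, storage)
```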

After completing a request to fix parent IDs for revisions from a particular page, messages will be left on the user talk pages of the affected editors to let them know that the parent ID has successfully been fixed for one or more of their edits.

The message will look something like the following in English (as usual, the four tildes will automatically be replaced with a signature and a timestamp):

== Check out the following page: Affected page ==

Hi, {{ROOTPAGENAME}}. I would like to let you know that I have fixed parent IDs for one or more of your edits to the page [[Affected page]]. The affected revision ID(s) is/are the following: (List of affected revision IDs).

Please check the history of the page to confirm that the size diff numbers have successfully been fixed. ~~~~

{{ROOTPAGENAME}} is used here so that it remains displayed correctly if the user had been renamed or if the message had been archived. Also, if the user talk page is a redirect to another page, then the message will be posted at the target page instead, and the possessive pronoun "your" will be replaced with the possessive form of the original editor's username (which might, for example, be a bot username). For usernames that do not end with "s", "'s" will be added automatically; for those that do, one must decide whether or not "'s" should be added. For imported edits with usernames having an interwiki prefix, no message will be posted.
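The wording rules above could be implemented along these lines. This is a sketch under assumptions: the function names are hypothetical, a bare apostrophe is assumed when "'s" is not added after a trailing "s", and imported usernames with an interwiki prefix are assumed to contain a ">" separator.

```python
from typing import List, Optional


def possessive(username: str, add_s_after_s: bool = False) -> str:
    """Possessive form of a username; the trailing-'s' case is a manual choice."""
    if not username.endswith("s"):
        return username + "'s"
    return username + ("'s" if add_s_after_s else "'")


def should_notify(username: str) -> bool:
    """Skip imported usernames with an interwiki prefix (assumed 'prefix>Name')."""
    return ">" not in username


def build_message(page: str, rev_ids: List[int],
                  redirected_from: Optional[str] = None,
                  add_s_after_s: bool = False) -> str:
    """Build the English wikitext; redirected_from is the original editor's name
    when the message is posted at the target of a talk page redirect."""
    whose = "your" if redirected_from is None else possessive(redirected_from, add_s_after_s)
    ids = ", ".join(str(rev_id) for rev_id in rev_ids)
    return (
        f"== Check out the following page: {page} ==\n\n"
        "Hi, {{ROOTPAGENAME}}. I would like to let you know that I have fixed "
        f"parent IDs for one or more of {whose} edits to the page [[{page}]]. "
        f"The affected revision ID(s) is/are the following: {ids}.\n\n"
        "Please check the history of the page to confirm that the size diff "
        "numbers have successfully been fixed. ~~~~"
    )


# Made-up page title and revision ID; the username comes from the example below.
message = build_message("Affected page", [123], redirected_from="Xqbot")
```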

Other languages will of course need a translated version of the message.

With the example "User:Calliopejen1/Bronces de Benín" below, Millars would receive messages on both the English Wikipedia and the Spanish Wikipedia that say that the parent ID had been fixed for four of his edits on the respective wikis. Also, since User talk:Xqbot (enwiki) and Usuario discusión:Xqbot (eswiki) redirect to User talk:Xqt (enwiki) and Usuario discusión:Xqt (eswiki) respectively, Xqt would receive messages that say that the parent ID had been fixed for one of Xqbot's edits on the respective wikis. The size diff number would then change from a "heavy" red negative number (-12,550) relative to the old parent ID to a "light" green positive number (+24) relative to the new parent ID.

Summary:
First, all rows in the archive table that have an associated page ID (ar_page_id) but no parent ID (ar_parent_id) will have their parent IDs populated. After that, all page titles that were imported, deleted in February 2012 or earlier and also have at least one page undeletion log entry, or undeleted in January 2018 or earlier will automatically be added to a new table. Targets of merge and move log entries with sources in the table will also be (recursively) added to the table. One can still request for a page that was not automatically added to the table to be added manually if needed. Such requests are needed, for example, when one has a page with suppressed edits that were migrated from the Oversight extension.

If one finds a page from the table that has at least one revision that needs to have its parent ID fixed, then one should go to Special:FixParentIDs/Page title ("Page title" should be replaced with the page's actual title) and start implementing the required parent ID changes. After the changes have been saved, the authors of the affected revisions will then be notified of the changes. A log entry notifying of the change will also be created.

Discussion

  • Sample usage (from my comment on T223342):
The following example shows that for imported revisions, rev_parent_id might require fixing on both the source and the target wikis.
The subpage User:Calliopejen1/Bronces de Benín on the English Wikipedia was imported from the page Bronces de Benín on the Spanish Wikipedia. There is also a move comment that says "fusión de historiales", which, of course, is the Spanish translation of "merging histories". This together with the negative size diff for the edit at 23:23, 21 July 2010 by Xqbot suggests that we should fix rev_parent_id for some of the edits between the 2 moves by Millars. The strategy is to separate the revisions containing <nowiki>'ed categories from the ones that contain live categories until the Xqbot edit.
The following fixes should therefore be done:
  • Revision ID 526466627 on enwiki and revision ID 38879757 on eswiki should both be fixed to have rev_parent_id 0 to make them show as page creations on Millars' contributions on the respective wikis.
  • Revision ID 526466630 on enwiki and revision ID 38881146 on eswiki should be fixed to have rev_parent_id 526466626 and 38879182 respectively.
  • Revision ID 526466631 on enwiki and revision ID 38884289 on eswiki should be fixed to have rev_parent_id 526466629 and 38881051 respectively.
  • Revision ID 526466643 on enwiki and revision ID 38956953 on eswiki should be fixed to have rev_parent_id 526466630 and 38881146 respectively.
  • Revision ID 526466644 on enwiki and revision ID 38975104 on eswiki should be fixed to have rev_parent_id 526466641 and 38922039 respectively.
  • Finally, the rest of the revisions' parent IDs should be left unchanged on both wikis.

— The preceding unsigned comment was added by GeoffreyT2000 (talk)

As just a side note, for anyone not technical, this is going to be incredibly difficult to understand and probably should be rewritten for that audience if it is going to get any votes. --Rschen7754 19:23, 16 November 2020 (UTC)
I honestly think this is going to get basically 0 votes. It's an exceedingly technical change that impacts <1% of all editors. --Izno (talk) 21:48, 16 November 2020 (UTC)
This is very difficult to understand even for someone who does understand technical stuff. As far as I can tell, there's two things being proposed here. One of them is "fix T38976", which is a reasonable proposal, and the other is "support pages containing parallel histories", which doesn't seem to solve a real problem (just don't do that, and you can fix any pre-existing pages with parallel histories using selective undeletion without any new code being written). Also note that w:User:Calliopejen1/Bronces de Benín no longer exists. * Pppery * it has begun 00:14, 17 November 2020 (UTC)
@Pppery parallel histories would be incredibly useful for a whole host of things. See T113004 for some. But it's a massive undertaking that would probably be outside the wishlist's scope. Tgr (talk) 03:37, 13 December 2020 (UTC)

Voting