Preliminary assessment, Dec 15, 2016
We need to collect examples of all the problems and edge cases.
When a paragraph moves -- how do we show both the move and that some of the text has been changed?
May I draw your attention to mw:User:PerfektesChaos/WikidiffLX?
- It solves at least some of the problems mentioned here.
- C++ code implementing the suggestions has been available since summer 2011.
- It was submitted to Bugzilla, but WMF did not assign anyone to look at such difficult matters.
- The code worked fine on a set of easy test scenarios, but it still needs more testing against real wiki edits.
Dev Summit, Jan 4, 2016
James Hare suggested splitting the combined character count into separate "characters added" and "characters deleted" counts: +100 -90 instead of +10 (as in Gerrit?).
The copyvio tool may also be an inspiration for how to highlight text that matches different parts of the page.
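A rough sketch of how the split count could be computed, for discussion only. It uses Python's standard `difflib` rather than the production diff engine, and the function names `added_and_deleted` and `format_size_change` are made up for this example. The numbers depend on how the matcher aligns the two texts, so a different engine could report different (but equally valid) counts.

```python
from difflib import SequenceMatcher


def added_and_deleted(old: str, new: str) -> tuple[int, int]:
    """Return (characters added, characters deleted) between two revisions."""
    added = deleted = 0
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can skew results on long texts.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1  # span removed from the old text
        if tag in ("replace", "insert"):
            added += j2 - j1  # span introduced in the new text
    return added, deleted


def format_size_change(old: str, new: str) -> str:
    """Format the change as '+added -deleted' instead of a single net number."""
    added, deleted = added_and_deleted(old, new)
    return f"+{added} -{deleted}"
```

Whatever the alignment, added minus deleted always equals the net size change shown today, so the split figure only adds information.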
Process idea: Don't just look at confusing diffs and try to solve them. Look at clear diffs, and figure out what's useful about them. Change a link, change a reference, copy editing -- what does a helpful diff do?
Aaron has utilities for processing diffs. Look at his Snuggle model -- you can go through edits and see colors -- characters removed are in red with strike-through, characters added are green.
Also look at Localwiki.org, WordPress, Git.
- I just learned about this page/project when I made a very similar suggestion to the one above by James Hare.
- See Wikipedia:Village pump (technical)#Question about DIFF pages. Koala Tea Of Mercy (talk) 04:25, 21 February 2016 (UTC)
Meeting with TCB, Jan 13, 2016
TCB has also been looking at this (TCB notes in German), specifically the problem of moving paragraphs and changing some words.
They're also looking at PerfektesChaos' suggestions on mw.org: User:PerfektesChaos/WikidiffLX.
Conversation with MaxSem, June 20, 2016
"Investigate increasing limit for word-level diffs" (T128697) is still blocked... we should see if we can help. This would fix the original problem noted in the proposal.
Another concrete thing we could try to take on is "fuzzy changes" -- moving a paragraph to another place and making changes within that paragraph. Our current diff handling only shows that a paragraph has been deleted and another has been added; it doesn't flag that the new text closely matches the deleted text. Max says that this is possible to work on, but handling it may tank performance -- it adds another quadratic on top of the original quadratic. He says that C++ is too rigid to iterate on, and makes it too easy to break the site or introduce security problems. We should consider porting to a memory-safe language like Rust: it's safer to iterate on, and less of a pain to work with.
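To make the "fuzzy changes" idea concrete, here is a minimal sketch, not a proposal for the real engine. It assumes the diff has already produced a list of deleted paragraphs and a list of added paragraphs, and pairs them up by word-level similarity using Python's `difflib`. The function name `find_moved_paragraphs` and the 0.6 threshold are assumptions picked for illustration. The nested loop is the extra quadratic Max mentions: every deleted paragraph is compared against every added one, and each comparison is itself expensive.

```python
from difflib import SequenceMatcher


def find_moved_paragraphs(deleted, added, threshold=0.6):
    """Pair each deleted paragraph with its most similar added paragraph.

    Returns a list of (deleted_index, added_index, similarity) tuples.
    The threshold is an arbitrary cut-off chosen for this example.
    """
    matches = []
    used = set()  # added paragraphs already claimed by an earlier match
    for d_idx, d_text in enumerate(deleted):
        best_idx, best_ratio = None, threshold
        for a_idx, a_text in enumerate(added):
            if a_idx in used:
                continue
            # Compare word lists rather than characters to keep it cheaper.
            ratio = SequenceMatcher(
                None, d_text.split(), a_text.split(), autojunk=False
            ).ratio()
            if ratio >= best_ratio:
                best_idx, best_ratio = a_idx, ratio
        if best_idx is not None:
            used.add(best_idx)
            matches.append((d_idx, best_idx, best_ratio))
    return matches
```

A pair returned here could be displayed as "moved, with changes" and then diffed word by word, instead of appearing as an unrelated deletion and addition.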
Phabricator, July 1, 2016
On T138922, Jan Dittrich says: "There is a Wiki Extension (Extension:wikEdDiff) which can recognize moved Text by User:Cacycle based on an algorithm described in Paul Heckel: A technique for isolating differences between files Communications of the ACM 21(4):264 (1978). (Demo)"
Bryan says: "The problem with alternate diff engines is that they are typically too slow to be used on Wikimedia wikis. In production we use wikidiff2 which is written in C and there are still some nasty diffs that it takes a very long time to process... That extension even points out the perf issues -- 'For typical comparisons, the MediaWiki default engine 'wikidiff3' is typically faster by a factor of up to 3 to 4-fold.' -- and wikidiff2 is much faster than wikidiff3. Spending >400% more time computing diffs won't make the servers too happy.
"I'm sure the diff engine could be improved, but it's not going to be a quick fix project. The steps would be basically to experiment with things like the client side wikiEdDiff gadget to get to the hoped for output and then work hard to make the algorithm for that efficient and finally implement it in C/C++ code. MaxSem is interested in this problem space and would be a good person to talk to about it."
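For anyone experimenting along the lines Bryan describes, the core idea of the Heckel technique cited by Jan can be sketched in a few lines. This is a simplified illustration and not the wikEdDiff or wikidiff2 code: lines (or paragraphs) that occur exactly once in both versions are treated as anchors, and matches are then extended to identical neighbours. The function name `heckel_match` is made up for this example, and deciding which matched blocks to present as "moved" is left out.

```python
from collections import Counter


def heckel_match(old, new):
    """Map indices in `old` to indices in `new` (simplified Heckel-style passes)."""
    old_counts, new_counts = Counter(old), Counter(new)
    # Position of each line in `new`; only looked up for lines unique in `new`.
    new_pos = {line: j for j, line in enumerate(new)}

    # Anchor pass: lines occurring exactly once in both versions must match.
    mapping = {}
    for i, line in enumerate(old):
        if old_counts[line] == 1 and new_counts[line] == 1:
            mapping[i] = new_pos[line]
    matched_new = set(mapping.values())

    # Forward pass: extend each match to the following line if identical.
    for i in range(len(old) - 1):
        if i in mapping and (i + 1) not in mapping:
            j = mapping[i] + 1
            if j < len(new) and j not in matched_new and old[i + 1] == new[j]:
                mapping[i + 1] = j
                matched_new.add(j)

    # Backward pass: extend each match to the preceding line if identical.
    for i in range(len(old) - 1, 0, -1):
        if i in mapping and (i - 1) not in mapping:
            j = mapping[i] - 1
            if j >= 0 and j not in matched_new and old[i - 1] == new[j]:
                mapping[i - 1] = j
                matched_new.add(j)

    return mapping
```

Each pass is a single linear scan, which is what makes this approach attractive; the performance concerns quoted above come from the extra work a full engine does on top of it. Old indices missing from the result are deletions, and any pair of matched lines whose order is reversed between the two versions indicates a move.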