Community Tech/Improve the plagiarism detection bot/Notes

Preliminary assessment meeting, Dec 17

Existing bot by Eran: en:User:EranBot/Copyright/rc (aka Plagiabot)

Two components to this project: improve the API that Eran wrote, and build a better interface for interacting with the data.

The API -- there's good discussion in phab:T120435.

Sage wants time range support, and Unicode support for titles. JSON output would be great; right now it just dumps a Python dictionary. Sage also suggests writing the status back to the database.

JSON would give us the report data in a more structured form, so you don't have to go through and parse wikitext to get the numbers out of it.
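
The difference between the two output styles can be sketched as follows. This is a minimal illustration, assuming the API currently returns the text form of a Python dictionary; the field names (`page_title`, `diff`, `status`) are hypothetical and not taken from Eran's code.

```python
import ast
import json

# Hypothetical record; the field names are illustrative, not the real API's.
record = {"page_title": "Łódź", "diff": 123456, "status": None}

# Roughly what a client sees when the server dumps the dictionary as text.
python_dump = str(record)

# What a client would see if the server serialized the same record as JSON.
# ensure_ascii=False keeps non-ASCII titles readable instead of \u escapes.
json_dump = json.dumps(record, ensure_ascii=False, sort_keys=True)


def parse_python_dump(text):
    """Client-side workaround: read a dumped dict without executing code."""
    # ast.literal_eval accepts only Python literals (dicts, strings, numbers).
    return ast.literal_eval(text)
```

With JSON, any client in any language can read the record with a standard parser; the `parse_python_dump` workaround only helps clients that are themselves written in Python.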

The URL that Sage posted for the API -- that output doesn't have any HTML in it. When we looked at the API before, wasn't HTML part of what it was returning? Has it changed since then?

Another problem: the API has no documentation, and the bot doesn't have much. Documenting both is part of the work.

We can see Eran's code, it's published. We can continue talking on that phab ticket, get some consensus.

There are login credentials for Turnitin and iThenticate -- we don't know who owns those credentials.

User:Ocaasi has those details. Doc James (talk · contribs · email) 05:47, 8 March 2016 (UTC)

Is it possible to turn this into an extension?

Concern: adding the third-party dependency on a closed-source service.

Doc James really wants this integrated into WP; other people will disagree. It might make more sense as a Tool Labs interface, because right now you have to install custom JS in your user JS to make it work. But if it's not on wiki -- fewer people will use it.

Building a new special pages feed -- we could theoretically do spec:New pages, but that's a huge project -- 3 devs working full time for over a year.

On wiki, it has to work in every language. We have to make sure it's localizable. This would be good to do anyway, but on Tool Labs it's optional rather than required.

What about making it a VE extension? You'd have a button in VE -- assess copyright status? Run it through and give a report, and mark it as copyvio? Not a bad idea. Better integrated into the wiki. Highlight text? Or analyze the whole page -- run through it, works by revision. Doesn't it take a lot of time to do the analyses? That has to happen beforehand? Not sure it's feasible to happen in real time. Not sure how long it takes iThenticate to do that.

The bot only works when done shortly after the edit was made. If done later the number of false positives grows huge. Doc James (talk · contribs · email) 05:51, 8 March 2016 (UTC)
Hi @Doc James:, could you explain why this might be happening? And how short a time span are we talking about here? -- NKohli (WMF) (talk) 13:54, 28 March 2016 (UTC)
We want to run it within a day or two of the edit being made. Wikipedia is mirrored so much around the internet that the number of false positives grows fairly rapidly due to all these mirrors being picked up. Doc James (talk · contribs · email) 15:12, 28 March 2016 (UTC)
Ah, that makes sense. Thanks. -- NKohli (WMF) (talk) 16:52, 28 March 2016 (UTC)
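
The timing constraint in this thread could be expressed as a simple age check before a revision is sent for comparison. This is only a sketch: the 48-hour cutoff is one reading of "a day or two", and the function and constant names are made up for illustration rather than taken from the bot.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "a day or two" is read here as a 48-hour cutoff.
MAX_EDIT_AGE = timedelta(days=2)


def is_worth_checking(edit_time, now=None, max_age=MAX_EDIT_AGE):
    """Return True if the edit is recent enough to check for copied text."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Older edits are more likely to match Wikipedia mirrors, not real sources.
    return now - edit_time <= max_age


now = datetime(2016, 3, 28, 12, 0, tzinfo=timezone.utc)
recent = datetime(2016, 3, 27, 18, 0, tzinfo=timezone.utc)
stale = datetime(2016, 3, 1, 9, 0, tzinfo=timezone.utc)
print(is_worth_checking(recent, now))
print(is_worth_checking(stale, now))
```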

API to do a query for a particular page -- link.

Plagiabot on GitHub.

Maybe that would be feasible -- see if the API returns anything for that article, link to view the comparison -- mark it as copyvio or false positive. This is possible. It's a left-field approach, so we'd have to check in with people. Also: design the interface.

Doc James wants an interface that could be used by WikiProjects.

One nice thing is that once we have the page assessment tool that we're building, it'll be easier. We could have a query string parameter for the API that only returns matches for a certain WikiProject, by joining against that table. Not sure how they're matching this against WikiProjects right now. We could definitely make that easier.
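
A rough sketch of the join idea, using an in-memory SQLite database so it can be run anywhere. The table and column names (`copyvio_matches`, `page_assessments`, `project`) are invented for this example; the real schemas of the bot's database and of the page assessment tool may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one for flagged diffs, one mapping pages to WikiProjects.
conn.executescript("""
CREATE TABLE copyvio_matches (diff_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE page_assessments (page_title TEXT, project TEXT);
""")
conn.executemany(
    "INSERT INTO copyvio_matches VALUES (?, ?)",
    [(1, "Aspirin"), (2, "Barracks"), (3, "Ibuprofen")],
)
conn.executemany(
    "INSERT INTO page_assessments VALUES (?, ?)",
    [("Aspirin", "Medicine"), ("Ibuprofen", "Medicine"),
     ("Barracks", "Military history")],
)


def matches_for_project(connection, project):
    """Return flagged diffs whose page belongs to the given WikiProject."""
    rows = connection.execute(
        "SELECT m.diff_id, m.page_title "
        "FROM copyvio_matches AS m "
        "JOIN page_assessments AS a ON a.page_title = m.page_title "
        "WHERE a.project = ? "
        "ORDER BY m.diff_id",
        (project,),
    )
    return rows.fetchall()


print(matches_for_project(conn, "Medicine"))
```

The `project` value would come from the query string; passing it as a bound parameter rather than formatting it into the SQL keeps the query safe from injection.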

A lot of different ways to approach this. We need to write an analysis of all the different interpretations.

Compare: Doing it on Tool Labs, a special page (mini new pages feed that's on WikiProjects), integrated with VE, integrated with wikitext editor.

If we want this integrated with wikitext, we'd have to wait for the new editor to launch, which will be a long time.

Talk to team of volunteers/staff who have been working on copyvio detection: DocJames, Eran, Lucas, Jake, Sage, JamesH and LeadSong.

Dev Summit, Jan 5

Notes from Frances.

User:Eranbot running on EN. It runs per revision, and checks text added. Eran, Doc James and Jake have been working on it. Reports are posted on the User:Eranbot page. People tag with TP for true positive and FP for false positive.

Plagiabot is another name for Eranbot.

The proposal mentions "reliability" -- what does that mean? It doesn't run every time? It has too many false positives? We need to talk to the people who voted for this, and get more info on what they're looking for.

From Aaron's Unconference session: Look into impact on new/anonymous editors.

Improved reliability means not missing "true positives" and not flagging "false positives". Doc James (talk · contribs · email) 05:57, 8 March 2016 (UTC)

Community Tech meeting, Jan 5

We have to look carefully at the scope of this. Making it work on multiple languages is probably the strongest need that we should attend to.

Phab, Feb 3

Eran says on Phab:T120435 that he'll be working on this at the Hackathon. We should talk before then...

Conversation, Feb 22

Zhou suggests that we should be careful about the messaging, if this becomes integrated into our core projects (rather than an extension on Labs). Currently, the EranBot page says that items should be marked "TP - for copyright violation (True Positive)" and "FP - for no copyright violation (False Positive)." Stating definitively that something is or is not a copyright violation could suggest that the WMF as an organization is monitoring all potential copyright violations all the time. We can talk to Zhou when we know more about what we're doing with this project.

We could soften the terms if that is important. "Likely TP" "Likely FP". These are all human judgements. Doc James (talk · contribs · email) 05:36, 8 March 2016 (UTC)

Hackathon, Apr 1

Talking with Eran about how to improve. He says that a Special page isn't a good idea -- difficulty of security reviews, relies on a third-party service which is tricky to integrate into WP.

Several possible pieces to work on:

  • Make the compare interface easier to use -- side by side comparisons of relevant sections.
    • Either by scroll bar (as in iThenticate full view) or collapse non-relevant sections.
    • If possible, have the comparison on the interface page, collapsed like Crosswatch and diffs.
  • Improve the bot by connecting it with ORES, using revision score to influence whether the match is judged likely true vs likely false.
  • Move the interface to Tool Labs -- make it a little faster, potentially easier to attract users because it's not under a User page.
    • OAuth sign-in.
    • Filter for WikiProjects.
    • Make the colors match -- True Positive is red on both icon and button, but False Positive has a blue button and a green icon.
    • Make it easier to browse titles. (Either contents box or collapse all the boxes.)
  • Gamify -- make it easier, more fun for people to check copyvio comparisons.
  • Make a template/gadget that could populate WikiProject pages, to surface possible copyvio to people who are motivated to check them.
  • (Blue sky: Notification when a page on your watchlist has been flagged by EranBot as a possible copyvio.)
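
One way the ORES idea in the list above might work is to blend the text-match signal with a revision score. The sketch below is purely illustrative: it does not call ORES, the 0.7 weight and 0.5 threshold are arbitrary placeholders, and `damaging_probability` simply stands in for whatever score the bot would actually receive.

```python
def likely_true_positive(match_share, damaging_probability,
                         match_weight=0.7, threshold=0.5):
    """Blend two signals into a rough "likely true positive" judgement.

    match_share: fraction of the added text that matched an external
        source, between 0 and 1.
    damaging_probability: hypothetical revision score between 0 and 1.
    """
    for value in (match_share, damaging_probability, match_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be between 0 and 1")
    # Weighted average of the two signals; the weights are placeholders.
    combined = (match_weight * match_share
                + (1.0 - match_weight) * damaging_probability)
    return combined >= threshold


print(likely_true_positive(0.9, 0.8))
print(likely_true_positive(0.3, 0.1))
```

Any real weighting would need to be tuned against entries that people have already marked TP or FP.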

Hackathon, April 2

Niharika + Danny using the Eranbot interface, looking for things to improve.

  • The WikiProject search really helps to find items that you're interested/motivated to check. But it's hidden behind a show/hide, and it's too long to browse easily.
Agree making this easier would be good.
  • When you mark an item as TP or FP, it fades out and disappears, and you have to reload the page to see the completed item.
  • The completed item should have the name of the editor who checked it, to give people a sense of accomplishment.
  • The page is archived once a month, whether the items have been checked or not. Should the WikiProject search show all unchecked items, even if it's archived? Lots of pages don't get edited that often, so an edit that was flagged six months ago may still be current.
We are archiving manually, simply to keep the pages from getting too big and slow. Agree that it would be good to show all unchecked issues. Doc James (talk · contribs · email) 12:23, 9 April 2016 (UTC)
  • "True" is bad, "False" is good. This can be confusing (at least for Danny, who gets confused by that kind of thing). Are there other words we could use?
Open to suggestions? Doc James (talk · contribs · email) 12:23, 9 April 2016 (UTC)
How about "Action Needed" and "No Action Needed"? I think my problem with "True Positive" is that it's a description that focuses on the bot's performance, rather than on what the user is doing. A red button that says "Action Needed" would make the next step easier to understand. -- DannyH (WMF) (talk) 19:38, 9 April 2016 (UTC)
Copyright Concern (CC) and Attribution Acceptable (AA); Concern or Acceptable. There are a lot of gray areas with copyright and the kinder word "concern" might make patrollers seem less authoritarian and it is less shaming than, say, "violation." "Acceptable" doesn't imply a perfect citation, but that what the user provided is at least adequate.--Lucas559 (talk) 18:36, 9 April 2016 (UTC)
Yes I like "copyright concern" and "acceptable" or "no concern" Doc James (talk · contribs · email) 20:09, 9 April 2016 (UTC)
  • The iThenticate interface is way easier to use than Earwig's.
  • False Positives can come from somebody quoting from an article, with quotation marks and a reference. That person is actually doing the opposite of copyvio :) -- so we shouldn't suggest that they're doing something wrong. Can we detect that the suspected material is within quotes and followed by a ref tag? (Example: Criticism of Holocaust denial, March 31st -- diff and original source of the quote)
If people are adding large quotes that should be paraphrased this is still a problem. To take an extreme example if someone puts quotes around a Disney movie this does not make it okay. Doc James (talk · contribs · email) 12:23, 9 April 2016 (UTC)
That's a good point. Still, short quotes are okay, and if we can cut out some False Positives, it makes life easier. Maybe put a limit of X characters? -- DannyH (WMF) (talk) 19:43, 9 April 2016 (UTC)
I had a pharma company tell me I was not allowed to quote more than 7 words. Of course they might have been bs ing. Doc James (talk · contribs · email) 20:07, 9 April 2016 (UTC)
  • Looking in the history of the Eranbot page -- you can't easily see the name of the article the person worked on. Example diff. We could create a separate logging table that records each entry with the person who triaged it.
Perfect. Doc James (talk · contribs · email) 12:23, 9 April 2016 (UTC)
  • It can be helpful to open up both the diff and the page where the possible copyvio came from, but the links are far apart. Also, the link to the diff doesn't say diff -- it says [1], which is not as easy to spot.
  • Show edit count for the editor posting possibly copyvio? This edit was done by a contributor with more than 600 edits -- that makes me more careful than I would be with someone with 10 edits.
Another excellent idea :-) Doc James (talk · contribs · email) 12:23, 9 April 2016 (UTC)
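
The quoted-and-referenced check discussed in the list above could start from something like the sketch below. It is a rough heuristic under stated assumptions: it only handles straight double quotes, it treats a quote as attributed when a `<ref` tag follows it directly (optionally after one punctuation mark), and the 300-character limit is a placeholder for the undecided "X characters".

```python
import re

# Assumption: the "X characters" limit is undecided; 300 is a placeholder.
MAX_QUOTE_CHARS = 300

# A double-quoted passage, optional punctuation, then an opening <ref> tag.
QUOTED_WITH_REF = re.compile(r'"([^"]+)"[.,;:]?\s*<ref[\s>/]')


def is_short_attributed_quote(added_wikitext, matched_text,
                              max_chars=MAX_QUOTE_CHARS):
    """Return True if the matched text is a short quote followed by a ref."""
    if len(matched_text) > max_chars:
        # Long quotes should still be reviewed, even when attributed.
        return False
    for quote in QUOTED_WITH_REF.finditer(added_wikitext):
        if matched_text in quote.group(1):
            return True
    return False


quoted = ('Smith wrote that "denial is a form of hatred".'
          '<ref name="smith">Smith 2001</ref>')
unquoted = 'Denial is a form of hatred.<ref>Smith 2001</ref>'
print(is_short_attributed_quote(quoted, "denial is a form of hatred"))
print(is_short_attributed_quote(unquoted, "Denial is a form of hatred"))
```

Entries that pass this check would probably still be worth listing, perhaps with a lower priority, since large quotes remain a problem even when they are attributed.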

Team meeting, April 7

We looked at the feature some more today, and have a few more ideas to think about:

  • Add OAuth support to the Tool Labs version, so people will be logged in.
  • If the copyvio edit is obvious vandalism that happens to be the same as text on another site, it gets flagged as copyvio even if the vandalism is reverted quickly. For example: This edit on the Barracks article was obvious vandalism, and reverted in less than a minute. When there's a flagged edit, could we "hold" it for (let's say) ten minutes, and then check the history again? If the edit has been reverted within ten minutes, then the copyvio report could be discarded.
Agree would reduce false positives if the copyvio bot is able to stay consistently up. Doc James (talk · contribs · email) 12:26, 9 April 2016 (UTC)
  • The connection between hitting TP and actually fixing the page is unclear. Possible idea: A third state, "Fixed". When you hit True Positive, another button/link is displayed, asking whether the page has been fixed. This would help to educate users that hitting TP won't revert the edit automatically, and it would flag entries where the user has identified that it's copyvio, but isn't sure how to fix it, and needs a more expert user to help.
I agree that there needs to be a third option. But, I am okay with clicking TP = the problem has been dealt with. The default is that patrollers took the time to remedy the problem. However, some potential copyright violations are challenging and an expert is needed. The third button could be for expert help, the exception rather than the norm. If we had a "fixed" button we would then need a fourth button to request expert help (second opinion).--Lucas559 (talk) 18:56, 9 April 2016 (UTC)
I was thinking that clicking TP but not clicking Fixed would be the signal that expert help is needed. The violation has been identified, but not cleaned up yet. -- DannyH (WMF) (talk) 19:50, 9 April 2016 (UTC)
We could have "TP (fixed)", "FP", and "request 2nd opinion". Doc James (talk · contribs · email) 20:06, 9 April 2016 (UTC)
  • Clicking on links in the interface should automatically open in a new tab; there's no use case where it would be helpful for the whole page to reload in the middle of checking the copyvio entry.
  • Can we determine the optimal/most common set of steps people take in the process of checking? We could present the links in the order that we expect people will use them.
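
The "hold for ten minutes" idea from the list above reduces to a small decision once the bot knows when (or whether) the flagged edit was reverted. The sketch below assumes the caller has already looked up the revert time from the page history; how that lookup is done is left out, and the names are illustrative.

```python
from datetime import datetime, timedelta

# The ten-minute hold is the example figure from the notes, not a decided value.
HOLD_PERIOD = timedelta(minutes=10)


def should_discard_report(edit_time, revert_time, hold=HOLD_PERIOD):
    """Return True if the flagged edit was reverted within the hold period."""
    if revert_time is None:
        # Still live after the hold: keep the report for human review.
        return False
    elapsed = revert_time - edit_time
    return timedelta(0) <= elapsed <= hold


flagged = datetime(2016, 4, 7, 14, 0, 0)
print(should_discard_report(flagged, datetime(2016, 4, 7, 14, 0, 45)))
print(should_discard_report(flagged, datetime(2016, 4, 7, 15, 30, 0)))
print(should_discard_report(flagged, None))
```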

(Conversation about Tool Labs vs on-wiki moved to Talk:Community Tech/Improve the plagiarism detection bot)

Conversation with Diannaa, Apr 14

Talked with User:Diannaa, the primary user of the current tool. Excerpts from the conversation:

"I can get a feel right away for whether or not there's going to be a violation by looking at the username and the article title and the subject matter. The IP edits are almost all copy vio, corporate edits are almost all copy vio, musicians, people posting material from their own websites, etc. People don't realise we don't accept copyright content - they think they can post whatever they like, as they do on linked-in or Facebook."

Walk me through the process?

"The way I am doing it is to do the initial look-see, like I described immediately above. Then I might hover over the user name to see how many edits they have. The lower it is, the more likely it's copy vio. Red-linked username and talkpage both, and it's almost always a copy vio. People from India, it's almost always a copy vio. Then I will open the links: the diff, the article history, and the Earwig link. The Ithenticate link I don't always look at. It takes a while to load, and I don't always need it.

"So then I check the article history and see if the same editor (or other editors) have been adding further copy vio either before or after the diff I am examining, so that I don't bollux up the ease of assessment by revision-deleting diffs before I have looked at all the recent history. Lots of them there's just the one problematic diff, which I remove and revision delete and then I warn the editor with a template using Twinkle. The odd time there will be an established editor adding copy vio; for them I use a hand-written note and watch-list their talk page for a response.

"Then if there's other major edits on other articles, I will visit them and look for copyright violations there as well. Folks who look like chronic violators I will add to my virtual tickler file: I use Google calendar to create an email reminder to check their edits daily (or some other frequency if it is starting to look like the violations have ceased). Some cases call for a visit to the Commons as well, where a nest of copy vio images need to be nominated for deletion. After all this is done, then I tag the item as "TP" or "FP" and move on to the next item."

Common characteristics of false positives:

  • Quotes in quotation marks -- but for an excessive quote, drop a line to the editor explaining that's not good Wikipedia practice
  • Adding tables
  • Adding timelines (for ex: an article about the band, with a timeline of the dates various people were in the band)
  • List-type articles (like "List of minor planets" and "list of gliders")

Also useful:

  • Whitelist trusted users who get a lot of false positives
  • Flagging people who have posted copyvio in the past
Agree flagging people who have had issues with copyvios in the past would be excellent.
We discussed whitelisting trusted users. I think if we only do this for people who have a lot of false positives that will be okay. We have had people who have made 40K edits before issues being found so edit count definitely cannot be used. Doc James (talk · contribs · email) 15:15, 16 April 2016 (UTC)
I had in mind a few specific trusted users who regularly get false positives, for various reasons: Charles Matthews, Rjensen, Gamaliel, Peter coxhead. Diannaa (talk) 01:18, 17 April 2016 (UTC)
Okay sounds reasonable. As long as we keep the bar for inclusion very high. Doc James (talk · contribs · email) 18:30, 18 April 2016 (UTC)

Ryan/Danny, May 19

Can Copy Patrol be internationalized?

Once it's working well for enwiki, it would be great to expand it for other languages. The steps we need to take are: make all the messages translatable on translatewiki, create a language selector, have everything translated on the client side. Many languages don't have WikiProjects -- is there an equivalent that we could use for filtering?

But the big question is: will the Google API allow us to send more requests? If we use Google, there's a hard limit of 10,000 queries a day. If we're getting close to (or exceeding) that limit on enwiki, then where do the international queries go? Another Google account? Does Turnitin have decent coverage on other languages?

Supposedly they work in 30 languages[1]
I think User:Eran ran it on Hebrew Wikipedia. The issue, I think, was that most of the copyright issues are translated from English rather than copied and pasted directly from Hebrew. Doc James (talk · contribs · email) 21:47, 19 May 2016 (UTC)
Hmm, I don't know if there's a way that we could automatically detect matches when they're translated to another language. That's beyond the scope of this project, at least. We could investigate some other languages -- I'd be interested to know if that's true for German, French, Russian and Spanish. -- DannyH (WMF) (talk) 21:53, 19 May 2016 (UTC)
Agree would be good. Doc James (talk · contribs · email) 22:26, 19 May 2016 (UTC)

Update from Eran, T125459: Eranbot only uses Turnitin, and does about 1,400 queries a day. Turnitin donated many credits, so hitting a limit isn't a problem.

Team meeting, May 19

We talked about whether we should import all the open historical cases from Eranbot, or if this interface should just start when it goes live. If we did import old cases, then searching for WikiProject Biography could bring up still-open cases from early 2015.

We decided not to do that for a few reasons, the most important being that the triage state is recorded in wikitext, not a database -- so we wouldn't be able to easily pull just the open cases. Also, thanks to Diannaa :) -- it looks like everything from December 2015 on has been closed, so we would just be pulling in the early cases. So our preference would be to make a fresh start, when the tool goes live. Once it's live, though, we'll be able to search through all the data, so cases that remain open for a while will still be easily accessible.

I am happy with that. Doc James (talk · contribs · email) 00:35, 20 May 2016 (UTC)

Email, June 17

Something to think about -- Doc James says that there's decent iThenticate coverage in French, German and Spanish. When we've finished the English tool, we'll look into including more languages.